Getting started

Assuming that you have just installed BETL, this guide gives you an example of how to use it. The goal of this getting started guide is to generate the ETL and DDL for building the first two layers in a data warehouse, which I call staging and raw data warehouse.

  • Note that this setup is not mandatory for using BETL. You can choose your own layer setup. For example, a common setup is to have no intermediate layer between the raw data warehouse and the datamarts.

Ingesting the object tree #

First we need to extract the meta data that tells us which tables exist and on which server and in which database and schema they live. This is called the object tree. Ingesting is a synonym for importing.

Let’s have a look at the current meta data repository. The following command shows you the BETL meta data repository.

exec dbo.info
-- or if you prefer to type less:
info
It will return:
  • the objects (of different object types, e.g. servers, databases, schemas, tables, views and users). On a fresh install it will give you only 3 objects. The first is the LOCALHOST server object; the second is an object of type user, which stores user-bound properties (e.g. log_level). The third is the default unknown record that is present in many tables (key = -1). Instead of a null foreign key, you can use -1 to reference unknown objects.
  • column meta data relating to objects (column name, data type, etc.). This is empty on a clean install.
  • properties that can be bound to objects. For example, the include_staging property can be used to mark tables to be included in our staging layer. Properties can be set on any object in the object tree (e.g. at server, database, schema or table level) and are inherited. For example, if you set the include property to true for a schema, then it applies to all tables and views in that schema. You can still exclude a single table by setting its include property to false while the schema's include property remains true (see the sketch below).
(screenshot: meta data)
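To make property inheritance concrete, here is a minimal sketch that includes a whole schema and then excludes one table. The procedure name dbo.set_property and its parameters are hypothetical placeholders; look up the actual property-setting procedure in the BETL repository.

-- hypothetical helper; BETL's actual procedure and parameters may differ
exec dbo.set_property 'include_staging', 'SalesLT', 'true'          -- schema level: inherited by all tables and views
exec dbo.set_property 'include_staging', 'SalesLT.Address', 'false' -- table level: overrides the schema setting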

Let’s get back to our first step: Ingesting the object tree. This can be done with the following code:

declare @obj_tree_param ObjTreeTableParam  -- table-valued parameter that will hold the object tree

-- read the object tree (server, database, schemas, tables and views) of the current database
insert into @obj_tree_param 
select  *
from dbo.obj_tree_ms_sql_server

-- store the object tree in the BETL meta data repository
exec [dbo].[ingest_obj_tree] @obj_tree_param

-- see what happened
exec dbo.info
We only ingested the current database. Next step: ingest the source database object tree.

Ingest object tree of source database #

(diagram: object tree)

Because Azure SQL Database does not support cross-database queries, the solution for Azure is a little different from the on-premises one. The main difference is that for Azure we have to use an ETL tool like Azure Data Factory, whereas for the on-premises solution we can build everything using only T-SQL and, in some cases, a linked server.
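To illustrate the difference, this is the kind of cross-database meta data query that works on an on-premises SQL Server but fails on Azure SQL Database (the source database name is just an example):

-- on premise: a three-part name lets you read the source catalog directly
select s.name as schema_name, t.name as table_name
from AdventureWorksLT.sys.tables t
join AdventureWorksLT.sys.schemas s on s.schema_id = t.schema_id
-- on Azure SQL Database this three-part reference raises an error,
-- which is why the object tree of the source is ingested via an ADF pipeline instead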

Write ADF pipeline to ingest object tree of source database #

The staging pipeline is called pl_staging_aw. All pipelines are stored here: https://github.com/basvdberg/BETL/tree/main/adf_betl_getting_started

This is the pipeline that imports the tables into our staging layer. There is not much code generation going on here; we simply use the meta data of BETL and the generic copy task of ADF.

  1. Ingest object tree. As discussed above, but in this case for the AdventureWorks database. Have a look at the meta data by executing the stored procedure info or by running select * from obj_ext. Different database vendors require adjustments to this query (e.g. Oracle, PostgreSQL, MySQL, Db2).
  2. Set exclude property. We manage the configuration of what needs to be done through properties, which are linked to the object tree. The entire database is included and two tables are excluded, using just three simple statements.
  3. Lookup staging tables returns a list of tables that is picked up by the ForEach loop in the next step.
  4. Inside the ForEachTable loop there are two steps (a sketch of the first follows below):
    1. First the target table is dropped if it exists.
    2. Then the generic copy task creates the table and copies the data.
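As a minimal sketch of the drop-if-exists step inside the loop, the pre-copy script can be as simple as the statement below; the staging table name is just an example that follows the naming convention described later.

-- drop the staging target if it already exists, so the copy task can recreate it
if object_id('staging_aw.saleslt_customer') is not null
    drop table staging_aw.saleslt_customer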

This pipeline shows the generic power of ADF combined with good meta data.

Raw data warehouse layer #

The next layer I call the raw data warehouse. It is modelled exactly like the source system, but we add a few things:

  1. History. If something changes in the source, we maintain the current and the previous values, including the time period during which a record is valid.
  2. Deletes. If something is deleted in the source, we mark it as deleted instead of deleting it.
  3. Meta data. Every record links to the dbo.transfer table in BETL. This table holds meta data and statistics for the transfer. A transfer is part of a batch and logging is done per batch. This meta data can also be used for lineage.
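To make these three additions concrete, a history table in the RDW layer could look roughly like the sketch below. The extra column names (eff_dt, is_deleted, transfer_id) are assumptions for illustration only; BETL generates the actual DDL and may use different names.

-- illustrative only: the extra columns for history, deletes and meta data are assumptions
create table rdw.aw_sales_lt_customer_his (
    customer_his_id int identity(1,1) primary key,     -- surrogate key for the history row
    customer_id     int           not null,            -- business key from the source
    first_name      nvarchar(50)  null,                -- ... source columns ...
    eff_dt          datetime2     not null,            -- history: when this version became valid
    is_deleted      bit           not null default 0,  -- deletes: soft-delete marker
    transfer_id     int           not null default -1  -- meta data: reference to dbo.transfer
)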

Let’s have a look at the overall orchestration pipeline for AdventureWorks, called pl_aw. This is sometimes called the master pipeline.

First we extract the required tables into our staging layer as we have seen above. The processing of the rdw layer can be divided into 2 steps:
1. ddl
2. etl

ddl tables #

The ddl step makes sure that RDW tables are created, including extra columns for recording history and other meta data. Primary keys and identity columns are also created here. This ddl step can be skipped for Test, Acceptance and Production environments, because ddl changes are normally done via the release process. (However, it is possible to implement runtime ddl changes on a production environment as well, for example for adding columns in the rdw layer.) In that case you have to be careful not to implement ddl changes that will result in data loss.

ddl latest views #

I like the anchor modelling method of recording history. In short, you only add an effective date time stamp and you only insert records; you never delete or update anything. Most of the time you need the most recent version of a record, so for your convenience we create a latest view that always returns the current version of a record (a sketch follows below). History tables are suffixed, for example with _his. I’ve created the following naming convention (please feel free to create your own). Objectives: group related tables (coming from the same source system) together by name, group layers by name, use latest views by default.
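As a minimal sketch of such a latest view (reusing the illustrative column names from the history table sketch above, which remain assumptions), it could be defined like this:

-- illustrative only: return the most recent, non-deleted version per business key
create view rdw.aw_sales_lt_customer as
select x.customer_his_id, x.customer_id, x.first_name, x.eff_dt
from (
    select h.*,
           row_number() over (partition by h.customer_id order by h.eff_dt desc) as rn
    from rdw.aw_sales_lt_customer_his h
) x
where x.rn = 1
  and x.is_deleted = 0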

naming convention #

  • staging table
    naming convention: staging_<source_system_abbreviation>.<source_system_schema_if_not_dbo>_<object_name>
    example: staging_aw.saleslt_customer
  • rdw history table
    naming convention: rdw.<source_system_abbreviation>_<source_system_schema_if_not_dbo>_<object_name>_his
    example: rdw.aw_sales_lt_customer_his
  • rdw latest view
    naming convention: rdw.<source_system_abbreviation>_<source_system_schema_if_not_dbo>_<object_name>
    example: rdw.aw_sales_lt_customer

Every name follows the pattern <schema_name>.<obj_name> and is always lower case; db_name = sqldb_rdw.