Introduction
In today’s world, cloud computing is replacing legacy systems and providing cutting-edge solutions to its users. Considering the outstanding impact that cloud solutions provide, the [Client] collaborated with Digifloat to build a data warehouse on the Snowflake cloud. The [Client] provides solutions in the manufacturing and energy sectors. It had an on-premise, customized ERP data warehouse that we migrated to the Snowflake cloud, enabling detailed data analysis and reliable reporting to stakeholders via Microsoft Power BI.
Migrating from an existing solution to a completely new platform is always a challenge. With the well-grounded expertise of Digifloat’s engineers, this ELT solution has become a reality.
The Client
The [Client] is a leader in Australian manufacturing and supply of advanced pipe systems and solutions, connecting Australian people with water and energy.
The [Client] provides a broad range of pipeline systems and solutions for building, infrastructure, irrigation and rural, mining and industrial, gas and energy applications.
The [Client] has a history of 61 years in Australia with proven long-term performance and reliability. The [Client] is passionate about creating sustainable, innovative solutions for its customers and communities. The [Client]’s commitment to safety, health and environmental sustainability is integral to the way it does business.
Cloud Technologies
- Snowflake Cloud – The cloud data warehouse (DWH) solutions provider
- Microsoft Azure Data Factory (ADF) – The orchestrator used to migrate source data and run dbt Cloud jobs through API calls
- Microsoft Azure DevOps – To maintain code repositories and provide CI/CD across the 3-tier architecture
- Microsoft Azure DevOps Boards – To track bugs, hold discussions, manage tasks and introduce new features in the solution
- Microsoft Power BI – For data visualization and report development
- dbt Core – The transformation engine of the whole ELT process
- dbt Cloud – For job scheduling and log generation
- GitHub – To manage the backend code base
Client ERP
MXP (a highly customized, Progress-database-based system)
Problem
DWH Framework: Highly fragmented data architecture.
A single, small, custom-coded data mart in SQL Server 2014 loosely followed a star schema architecture, with many deviations from that framework. The SQL Server SSIS framework was used to populate and refresh this data model every 24 hours. It covered only invoiced sales as a core subject, with some stock information, and probably addressed around 15% of business functions and departments. ComOps was used on top of this SQL database; ComOps is an old OLAP-based reporting tool with limited graphical capability, so all reporting was largely list-based and crude.
In addition to this single data mart, many data extracts came from both MXP and multiple other systems to serve individual reporting requirements not covered by the SQL/ComOps solution. These were extracted in different ways, including coded exports and direct connections to operational systems.
The absence of a single solution for reporting requirements meant a disorganized data environment with very limited ability to reconcile business information back to source systems. There was no data governance or general management in place beyond ensuring that the extracts had run.
Solution
The diagram below illustrates the system architecture:
Architecture
The new solution is a quantum leap forward, employing modern thinking in data warehouse development. Although it uses a fairly classic three-tier architecture from source to ODS (the data layer in the Snowflake cloud) to data warehouse, it also includes an Azure data lake between the source and the ODS to provide longer-term big data capability and to act as a highly compressed, long-term data storage mechanism for all operational applications. This new approach mirrors data untransformed from source to ODS, so the ODS becomes the single source of all data flowing into the centralized data warehouse. ADF is used as the tool to lift and shift data from all applications into the data lake and the ODS. Beyond that, dbt is used as the transformation tool to populate the architected, centralized data warehouse; it transforms the data into enterprise-scale, integrated, conformed data following the best-practice rules of multi-dimensional modelling.
This centralized enterprise data warehouse, housed on the Snowflake cloud platform, presents data from all targeted source applications seamlessly. Power BI sits on top of this DWH platform as the enterprise tool for all corporate reporting. This combination of tool and DWH gives business users the opportunity to explore very wide data subject areas covering all core processes, including quotes, sales orders, purchase orders, packing, stock management, manufacturing processes and invoicing. It is envisaged that a much more self-service model of business reporting will be possible as a result of introducing this enterprise approach and toolset.
Microsoft Azure Data Factory (ADF)
The solution starts with the migration of ERP data from the source into Azure Data Lake storage and the Snowflake cloud via Azure Data Factory pipelines. These data pipelines copy data from the source according to the source processes. For example, the invoice fact table depends mainly on the “Invoice” and “Invoice-Line” tables, along with dimensional data for items, customers, entities, vendors, purchase orders (POs), orders, calendars and many more. These tables were therefore copied from the ERP to the Snowflake cloud.
Along with the migration of the core tables, the ADF pipelines also run the dbt Cloud jobs through calls to the dbt Cloud API; this is explained in more detail later. The ADF pipelines are scheduled and use Copy and Web activities to run this process. During this process, the data rows are compressed and loaded as strings to optimize the data refresh routines. At the end of each run, the Snowflake cloud hosts a snapshot of the source ERP tables.
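Conceptually, each scheduled copy lands a source table as a like-for-like mirror in the Snowflake ODS. The sketch below shows what an equivalent load looks like in Snowflake SQL; the stage, file format and table names are illustrative assumptions rather than the project’s actual objects.

```sql
-- Illustrative only: in the real pipeline the load is driven by ADF Copy activities.
-- The stage, path and table names below are assumed for this sketch.
COPY INTO ODS.MXP_INVOICE                        -- untransformed mirror of the ERP "Invoice" table
FROM @ODS.DATALAKE_STAGE/mxp/invoice/            -- external stage pointing at the Azure data lake
FILE_FORMAT = (TYPE = 'PARQUET')                 -- compressed files written by the ADF copy step
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;         -- map columns by name rather than position
```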
dbt Core (The backend)
Once the data is written to the Snowflake database, the backend, dbt, takes over. By way of introduction, dbt (data build tool) is an open-source command-line tool that helps organizations build, test, and maintain their data infrastructure. Transformations are written as SQL queries, typically organized as CTEs, to form the transformation layer. With its adapter-based architecture, dbt connects to almost all major data platforms.
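For illustration, a dbt model is simply a SELECT statement saved as a .sql file, usually organized as CTEs. The minimal sketch below assumes a hypothetical source definition named mxp and illustrative column names.

```sql
-- models/staging/stg_mxp_invoice.sql
-- A minimal dbt staging model built from CTEs; the source and columns are assumed.
with source as (

    select * from {{ source('mxp', 'invoice') }}   -- raw ODS mirror of the ERP table

),

renamed as (

    select
        invoice_no,
        customer_code,
        invoice_date,
        total_amount
    from source

)

select * from renamed
```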
dbt’s transformation queries include the logic behind the dimension and fact tables. The data warehouse consists of sales and stock stars, which include dimensions for items, customers, vendors, invoices, orders, POs, bills of lorries and many more. These dimension tables were based on the concept of Slowly Changing Dimensions (SCD Type 2) to maintain transactional history over time, and dbt snapshots served that purpose exceptionally well. Our solution included the use of special GROUP_IDs to group rows based on the natural keys of the records for aggregation purposes. To overcome data-level inconsistencies during the transformation, we used Jinja to combine logical programming with SQL. For example, the data contained system-generated compressed values separated by a semicolon delimiter, e.g. a;b;c;d, and the requirement was to split these values into separate columns at run time. Jinja provided the extended functionality to include this feature in the code base.
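As a hedged example of that Jinja pattern, the sketch below splits a semicolon-delimited field into separate columns when the model compiles; the model, column and alias names are assumptions made for illustration.

```sql
-- models/staging/stg_mxp_item_attributes.sql
-- Splits a packed value like 'a;b;c;d' into one column per position.
{% set attribute_names = ['attr_1', 'attr_2', 'attr_3', 'attr_4'] %}

select
    item_code,
    {%- for attr in attribute_names %}
    split_part(packed_attributes, ';', {{ loop.index }}) as {{ attr }}
    {%- if not loop.last %},{%- endif %}
    {%- endfor %}
from {{ ref('stg_mxp_item') }}
```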
During the development of the fact tables from the dimension tables, the key challenge was ensuring that 100% of the data was accurately included. Snowflake string comparisons are case-sensitive in joins, while the source ERP contained rows that were not perfect due to human error and included a mixture of string cases. For example, a customer recorded as “ABC” in one place appeared as “aBc” elsewhere in the same data. To resolve this issue, we normalized the case of the string fields involved in the joins, which led to 100% data inclusion.
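A minimal sketch of the fix, assuming illustrative model and column names, is to normalize the case on both sides of the join so that values such as “ABC” and “aBc” match:

```sql
-- Case-insensitive join between an invoice staging model and the customer dimension.
-- Model and column names are illustrative.
select
    inv.invoice_no,
    cust.customer_key
from {{ ref('stg_mxp_invoice') }} as inv
left join {{ ref('dim_customer') }} as cust
    -- Snowflake string comparison is case-sensitive, so normalize both sides
    on upper(trim(inv.customer_code)) = upper(trim(cust.customer_code))
```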
After the transformations were completed in the dbt scripts, the data needed post-processing to mark the new and old records across transactional changes. dbt has a systematic constraint: all of a model’s logic must be expressed as a SQL SELECT statement. So, to mark the latest and older records and to update the GROUP_IDs with respect to each record’s natural key, we introduced dbt post-hooks, a powerful feature that allows SQL statements other than SELECT to run after a model is built. Here we created dynamic macros to implement these use cases. The result is final data that can be aggregated at a very deep level of granularity, which helped the team of data analysts build strong, meaningful and accurate reports.
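A hedged sketch of this pattern is shown below: a macro issues an UPDATE after the model is built, flagging the latest record per natural key. The macro, column and model names (is_latest, valid_from, customer_code) are assumptions for illustration, not the project’s actual objects.

```sql
-- macros/mark_latest_records.sql
-- Flags the most recent row per natural key after the model has been built.
-- Assumes the model itself selects an is_latest placeholder column and a valid_from timestamp.
{% macro mark_latest_records(target_relation, natural_key) %}
    update {{ target_relation }}
    set is_latest = (valid_from = m.max_valid_from)
    from (
        select {{ natural_key }} as nk, max(valid_from) as max_valid_from
        from {{ target_relation }}
        group by {{ natural_key }}
    ) as m
    where {{ target_relation }}.{{ natural_key }} = m.nk
{% endmacro %}

-- models/marts/dim_customer.sql (config block only)
{{ config(
    materialized = 'table',
    post_hook = "{{ mark_latest_records(this, 'customer_code') }}"
) }}
```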
These dimension tables were then joined together to build the fact tables for the sales and stock stars, with the Snowflake cloud as the target destination. The challenge of maintaining referential integrity is handled via system-generated hash keys and surrogate keys. At the fact level, there was a requirement to maintain transactional history in some of the fact tables, because those tables were based on source tables with multiple changes per row per day. For example, the order table had considerable data-level transactional changes per day, so fact_order needed its own history despite being a fact-only table. We therefore utilized dbt snapshots to build historical data inside the facts where it was needed.
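The sketch below shows a minimal dbt snapshot of the kind described, assuming an illustrative order staging model, unique key and checked columns. (The hash and surrogate keys mentioned above are typically generated with a hashing function or a macro such as dbt_utils.generate_surrogate_key.)

```sql
-- snapshots/order_snapshot.sql
-- SCD Type 2 history for a fast-changing order table; names and columns are assumed.
{% snapshot order_snapshot %}

{{ config(
    target_schema = 'snapshots',
    unique_key = 'order_no',
    strategy = 'check',
    check_cols = ['order_status', 'order_qty', 'order_amount']
) }}

select * from {{ ref('stg_mxp_order') }}

{% endsnapshot %}
```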
Apart from the fact tables that needed history in them, other tables needed only the deltas of the source tables and no history at all. Those fact tables were built as transient tables via dbt models.
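A hedged sketch of such a delta-only model, assuming an illustrative staging model and an extract_date column, looks like this:

```sql
-- models/marts/fct_stock_delta.sql
-- Delta-only fact with no history; model and column names are assumed.
-- Note: transient is the dbt-snowflake default for tables; it is shown explicitly here.
{{ config(
    materialized = 'table',
    transient = true
) }}

select *
from {{ ref('stg_mxp_stock_movement') }}
-- keep only the latest extract rather than accumulating history
where extract_date = (
    select max(extract_date) from {{ ref('stg_mxp_stock_movement') }}
)
```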
KPIs / Measures on the fact tables
Once the fact tables were in place, it was time for reporting and measuring the key performance indicators. All of these KPIs were made possible by the flexibility dbt provides: we could use a mixture of environment-specific and dbt-provided methods to transform the data.
The KPIs that depend on mathematical calculations were computed within the fact code, while the KPIs that require aggregation were taken to the front end, the Power BI dashboards. This combination of backend and front-end calculations produced excellent results for the stakeholders.
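As an illustration of that split, row-level arithmetic such as gross margin can live in the fact model, while ratios and time-period aggregations are left to Power BI measures on top of it; the model and column names below are assumptions.

```sql
-- models/marts/fct_invoice_line.sql (excerpt)
-- Row-level KPI inputs computed in the fact; aggregation happens in Power BI.
select
    invoice_no,
    invoice_line_no,
    quantity,
    unit_price,
    unit_cost,
    quantity * unit_price               as gross_sales_amount,
    quantity * unit_cost                as cost_of_sales_amount,
    quantity * (unit_price - unit_cost) as gross_margin_amount
from {{ ref('stg_mxp_invoice_line') }}
```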
dbt Cloud
The dbt Core jobs are scheduled and run through dbt Cloud jobs, which are triggered by API requests made from ADF. dbt Cloud provides an interactive GUI to perform all of the ELT tasks plus job scheduling, which cannot be done with dbt Core alone. Data lineage graphs can be obtained from both dbt Core and dbt Cloud. In our scenario, there were dependencies requiring the dimensions to run first, followed by the fact tables and then the mini tables built on top of the fact tables.
To set the sequence of execution, the dbt Core jobs were divided into two parts. CLI commands are set in dbt Cloud to build the dependencies (mini dimension tables) for the main dimension tables. Once all of the dimensions run successfully, the job calls the next facts job to build the fact tables. After the successful run of the facts jobs, the remaining tables, built on top of the facts and dimensions combined, are created.
Snowflake Cloud
The Snowflake cloud is the main DWH solutions provider. After the ERP data is loaded into Snowflake storage, it is transformed into dimension and fact tables across a 3-tiered architecture on Snowflake. These tiers are the DEV, UAT and PRODUCTION environments. Every development is made on the development tier and, after passing a series of test cases, is promoted to UAT before finally landing in the PRODUCTION environment for the client.
A data governance model is employed to ensure data integrity. Every layer has its own users, with access restricted to the information they need, which protects the important data from accidental or intentional misuse.
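The sketch below shows the kind of role-based grants this implies in Snowflake; the role, database, schema and user names are assumed for illustration.

```sql
-- Illustrative role-based access for the reporting layer; all names are assumed.
create role if not exists reporting_reader;

grant usage on database prod_dwh to role reporting_reader;
grant usage on schema prod_dwh.marts to role reporting_reader;
grant select on all tables in schema prod_dwh.marts to role reporting_reader;
grant select on future tables in schema prod_dwh.marts to role reporting_reader;

-- Reporting users can read the production marts but have no access to DEV or UAT objects
grant role reporting_reader to user powerbi_service_account;
```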
Microsoft Power BI
Power BI is a huge leap forward for the [Client], providing extended graphical reports to the stakeholders. The Power BI reports are mature and self-explanatory, enabling the teams to make decisions. The KPIs are enhanced further with the business intelligence capabilities Power BI provides. Users now have access to always up-to-date reports, with scheduled refresh routines showcasing fresh statistics on the dashboards.
Microsoft Azure DevOps
Azure DevOps has huge capability when it comes to management, be it feature development or bug reporting. It provides agility and brings pace to development. From day one, the team used DevOps to harness that efficiency. Once the code is committed to the Azure DevOps repository via GitHub, it is picked up by the dbt Cloud jobs pointed at that repository, enabling continuous integration of the latest code, including new features and bug fixes.
Conclusion
This solution has paved the way for the client to massively improve its business processes. Their teams have been trained to familiarize themselves with the new environment. The automation of the process, from the ERP all the way to the Power BI dashboards, delivers excellent efficiency and allows the business to grow with confidence in its data.
Haseeb Ahmad
Sr. Consultant/Data Analytics