Before the availability of Azure Synapse Analytics:
Before the availability of Azure Synapse Analytics, data warehousing on Azure typically involved multiple services working together.
Ingest:
Azure Data Factory is a strong tool for data integration. It provides connectors to more than 90 sources, ranging from on-premises systems to multi-cloud services and software-as-a-service (SaaS) applications.
Store:
Azure Data Lake Storage Gen2 is the natural choice here, thanks to its hierarchical namespace, low cost, and stronger security.
Prep & Train:
- ADF Data Flows (ADF data flows give us code-free development capability, but they lack the ability to process complex data structures and unstructured data.)
- Azure Databricks (Databricks, with its Spark notebooks, is a great choice for processing complex data structures and unstructured data.)
Model & Serve:
Azure SQL Data Warehouse is a massively parallel processing (MPP) engine that can handle large volumes of data, which makes it suitable for large-scale analytic workloads.
Visualize:
Use Power BI to visualize the data to gain insights.
Issues with this Architecture:
While effective, this architecture has several drawbacks:
- Multiple Interfaces: Developers must navigate through multiple tabs and workspaces to monitor and manage these services.
- Integration Challenges: Getting these services to talk to each other, and setting up security between them, takes a lot of management effort.
- Provisioned Resources: All compute resources are provisioned, meaning they take time to become available and incur charges while running.
- Lack of Serverless Query Engine: This solution lacks a serverless query engine. The three compute engines (ADF Data Flows, Azure Databricks, and Azure SQL Data Warehouse) do not share metadata, resulting in isolated operations. For example, Spark tables created in Databricks are not directly accessible by Azure SQL Data Warehouse.
Microsoft’s solution to this problem is Azure Synapse Analytics
Azure Synapse Analytics addresses these challenges by integrating the above components into a unified platform. Here’s how it transforms the data warehousing landscape:
Data Integration:
Within Synapse, Azure Data Factory is replaced by Synapse Pipelines. The two are very similar.
Compute:
- Synapse Data Flows:
- ADF mapping data flows are called Synapse Data Flows in Synapse. Both provide the same functionality.
- Spark Pool:
- Instead of Databricks, there is a Spark Pool within Synapse to perform big data analytics.
- This is a new service, not Databricks under a different name. Microsoft took vanilla Apache Spark and Delta Lake and built the service from the ground up. So instead of the optimized versions of Spark and Delta Lake that come with Databricks, you get vanilla Spark.
- Dedicated SQL Pool:
- Dedicated SQL Pool is a provisioned resource that stores data in relational tables with columnar storage. It uses a massively parallel processing (MPP) architecture that can leverage up to 60 nodes to run your queries in parallel. (Azure SQL Data Warehouse is now called Dedicated SQL Pool in Synapse.)
- Serverless SQL Pool:
- This service is available on demand. We don't need to provision any resources, and it lets us query data in the Data Lake using familiar T-SQL syntax. Azure allocates the resources, runs the queries as required, and returns the results.
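As a rough illustration of querying the Data Lake with T-SQL, the sketch below assembles the kind of OPENROWSET statement a Serverless SQL Pool accepts for reading Parquet files. The helper name build_openrowset_query and the storage account, container, and folder in the URL are hypothetical placeholders. The code only builds the query text; it does not connect to Azure.

```python
def build_openrowset_query(storage_url: str, top: int = 10) -> str:
    """Return a T-SQL query that reads Parquet files directly from the lake."""
    if top <= 0:
        raise ValueError("top must be a positive integer")
    # Double any single quotes so the URL stays a valid T-SQL string literal.
    safe_url = storage_url.replace("'", "''")
    return (
        f"SELECT TOP {top} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{safe_url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS [result];"
    )


# Hypothetical ADLS Gen2 path; replace with a real account, container, and folder.
query = build_openrowset_query(
    "https://examplelake.dfs.core.windows.net/raw/sales/*.parquet", top=5
)
print(query)
```

The resulting text could be run from a SQL script in Synapse Studio against the serverless endpoint. No tables or compute need to be created beforehand.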
Storage:
- Shared metastore across the compute engines:
- Thanks to the shared metastore, tables created using the Spark Pool can be queried from the Serverless SQL Pool in T-SQL.
- Synapse Link:
- Replicates transactional data from Cosmos DB and Microsoft Dataverse into an analytical store, which can then be queried directly from Synapse without impacting the transactional systems. This allows near-real-time operational reporting without any ETL to bring the data from Cosmos DB or Dataverse into Synapse.
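To make the shared metastore point above more concrete, here is a minimal sketch of how a table created from a Spark Pool might be addressed from the Serverless SQL Pool. It assumes the synchronized table surfaces under the dbo schema of a database with the same name as the Spark database. The function name and the salesdb and orders identifiers are hypothetical, and the code only builds the T-SQL text.

```python
def serverless_query_for_spark_table(database: str, table: str, top: int = 10) -> str:
    """Build a T-SQL SELECT for a table that was created from a Spark Pool."""
    if top <= 0:
        raise ValueError("top must be a positive integer")
    # Assumption: synchronized Spark tables appear under the dbo schema.
    parts = (database, "dbo", table)
    # Bracket-quote each identifier; a literal ']' is escaped by doubling it.
    quoted = ["[" + part.replace("]", "]]") + "]" for part in parts]
    return f"SELECT TOP {top} * FROM {'.'.join(quoted)};"


# Hypothetical Spark database and table names.
spark_table_query = serverless_query_for_spark_table("salesdb", "orders")
print(spark_table_query)
```

The point of the sketch is that no copy or export step sits between the two engines: the Spark side creates the table, and the SQL side refers to it by name.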
Data Visualization:
- Integration with Power BI within Azure Synapse Analytics.
Development / Monitoring / Management & Security:
- There is one development studio, or workspace, for all our services rather than a separate workspace for each service. Development, monitoring, and management can all be done from a single studio, which is great for a developer.
Mian Ali Shah
Associate Consultant