A data warehouse is a centralized and integrated repository that stores large volumes of structured, semi-structured, and unstructured data collected from various sources within an organization. A data warehouse is a federated repository for all the data that an enterprise’s various business systems collect. The repository may be physical or logical. Data warehousing is an electronic method of organizing information. Data warehousing is the process of constructing and using a data warehouse. Data warehousing involves data cleaning, data integration, and data consolidations.
Data Warehouse Features
IT provides information around a subject rather than the organization’s ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc.
Warehouse integration combines data from several sources into a single, unified warehouse. The data warehouse can be accessed by any department within an organization, and the data can be easily structured into spreadsheets or tables for research and analysis purposes.
Data warehouse is entirely contained within a time period. Another way of stating that, is that the DW is consistent within a period, meaning that the data warehouse is loaded daily, hourly, or on some other periodic basis, and does not change within that period
As the name defines the data resided in data warehouse is permanent. It also means that data is not erased or deleted when new data is inserted Data warehouse is not database
Databases and data warehouses are related but not the same.
A database is a way to record and access information from a single source. A database is often handling real-time data to support day-to-day business processes like transaction processing.
A data warehouse is a way to store historical information from multiple sources to allow you to analyse and report on related data (e.g., your sales transaction data, mobile app data, and CRM data). Unlike a database, the information isn’t updated in real-time and is better for data analysis of broader trends.
Data warehouse is not data lake
A data lake is for storing all raw data that may or may not yet have an intended use case. A data warehouse, on the other hand, holds data that has already been processed and filtered, so it’s ready to be used and analysed.
A data lake, hosted on big data platforms like IBM or Hadoop, is ideal for data scientists and analysts to store raw data until they know what they want to do with it, or as a repository to store large amounts of unstructured data.
A data warehouse is perfect for giving access to structured and semi-structured data to multiple business users so they can run queries against it and make decisions quickly.
Properties of Data Warehouse Architectures
- Separation: Analytical and transactional processing should be keep apart as much as possible.
- Scalability: Hardware and software architectures should be simple to upgrade the data volume, which has to be managed and processed, and the number of user’s requirements, which have to be met, progressively increase.
- Extensibility: The architecture should be able to perform new operations and technologies without redesigning the whole system.
- Security: Monitoring accesses are necessary because of the strategic data stored in the data warehouses.
- Administrability: Data Warehouse management should not be complicated.
When to use a data warehouse
You should consider a data warehouse if you want to:
- Store all your historical data in a central repository.
- Analyse your web, mobile, CRM, and other applications together in a single place.
- Get deeper business insights than traditional analytics tools by querying data directly with SQL.
- Provide multiple people access to the same data set simultaneously.
Data Warehouse Architecture
Data Warehouse Architecture is complex as it’s an information system that contains historical and commutative data from multiple sources. There are 3 approaches for constructing Data Warehouse layers: Single tier, two tier and three tier. This 3-tier architecture of Data Warehouse is explained as below.
The objective of a single layer is to minimize the amount of data stored. This goal is to remove data redundancy. This architecture is not frequently used in practice.
Two-layer architecture is one of the Data Warehouse layers which separates physically available sources and data warehouse. This architecture is not expandable and also not supporting a large number of end-users. It also has connectivity problems because of network limitations.
Three-Tier Data Warehouse Architecture
This is the most widely used Architecture of Data Warehouse.
It consists of the Top, Middle and Bottom Tier.
Bottom Tier: The database of the Datawarehouse servers as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
Middle Tier: The middle tier in Data warehouse is an OLAP server which is implemented using either ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database.
Top-Tier: The top tier is a front-end client layer. Top tier is the tools and API that you connect and get data out from the data warehouse. It could be Query tools, reporting tools, managed query tools, Analysis tools and Data mining tools.
Data Warehousing Schemas
Star schema is the fundamental schema among the data mart schema and it is simplest. This schema is widely used to develop or build a data warehouse and dimensional data marts. It includes one or more fact tables indexing any number of dimensional tables. The star schema is a necessary cause of the snowflake schema.
It is also efficient for handling basic queries.
It is said to be star as its physical model resembles to the star shape having a fact table at its canter and the dimension tables at its peripheral representing the star’s points
Characteristics of Star Schema:
Every dimension in a star schema is represented with the only one-dimension table.
- The dimension table should contain the set of attributes.
- The dimension table is joined to the fact table using a foreign key
- The dimension table are not joined to each other
- Fact table would contain key and measure
- The Star schema is easy to understand and provides optimal disk usage.
- The schema is widely supported by BI Tools
Snowflake Schema in data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The dimension tables are normalized which splits data into additional tables.
Characteristics of Snowflake Schema:
- The main benefit of the snowflake schema it uses smaller disk space.
- Easier to implement a dimension is added to the Schema
- Due to multiple tables query performance is reduced
- The primary challenge that you will face while using the snowflake Schema is that you need to perform more maintenance efforts because of the more lookup tables.
A Galaxy Schema contains two fact table that share dimension tables between them. It is also called Fact Constellation Schema. The schema is viewed as a collection of stars hence the name Galaxy Schema.
Characteristics of Galaxy Schema:
- The dimensions in this schema are separated into separate dimensions based on the various levels of hierarchy.
- For example, if geography has four levels of hierarchy like region, country, state, and city then Galaxy schema should have four dimensions.
- Moreover, it is possible to build this type of schema by splitting the one-star schema into more Star schemes.
- The dimensions are large in this schema which is needed to build based on the levels of hierarchy.
- This schema is helpful for aggregating fact tables for better understanding.
Data Warehousing Security
The objective of a data warehouse is to make large amounts of data easily accessible to the users, hence allowing the users to extract information about the business as a whole. But we know that there could be some security restrictions applied on the data that can be an obstacle for accessing the information. If the analyst has a restricted view of data, then it is impossible to capture a complete picture of the trends within the business.
The data from each analyst can be summarized and passed on to management where the different summaries can be aggregated. As the aggregations of summaries cannot be the same as that of the aggregation, it is possible to miss some information trends in the data unless someone is analysing the data as a whole.
Data Warehousing Future Aspects
As we have seen that the size of the open database has grown approximately double its magnitude in the last few years, it shows the significant value that it contains.
As the size of the databases grow, the estimates of what constitutes a very large database continues to grow.
The hardware and software that are available today do not allow to keep a large amount of data online. For example, a Telco call record requires 10TB of data to be kept online, which is just a size of one month’s record. If it requires to keep records of sales, marketing customer, employees, etc., then the size will be more than 100 TB.
The record contains textual information and some multimedia data. Multimedia data cannot be easily manipulated as text data. Searching the multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.
Apart from size planning, it is complex to build and run data warehouse systems that are ever increasing in size. As the number of users increases, the size of the data warehouse also increases. These users will also require accessing the system.
With the growth of the Internet, there is a requirement of users to access data online.
Hence the future shape of data warehouse will be very different from what is being created today.
Junior Consultant/ Data Analytics