If your organization is serious about using data reporting as a key strategic business tool, you will need to build a data warehouse at some point, but designing a modern data warehouse is not an easy or trivial undertaking. More than 50% of data warehouse projects have low or zero adoption rates. So, how do you start to design and build a data warehouse? What are the pitfalls and how can you optimize it? And most importantly, where to start?
This article provides tips on how to set up a data warehouse and avoid some common pitfalls.
What Is A Data Warehouse?
In a modern enterprise, data is usually stored in different locations. There may be a reason for this:
Application databases – this is probably the most important use case for start-ups. For other companies, this may be a product sales application.
Web applications – these may be applications that are needed to grow or maintain the business. Examples include email marketing applications such as Mailchimp, web analytics applications such as Google Analytics or Mixpanel, or accounting applications such as Xero and Quickbooks.
Spreadsheets: can be in the form of spreadsheets (Excel, CSV) or spreadsheets such as Google Sheets. You can update this data manually.
The data warehouse synchronizes data from different sources in one place, for all your information needs.
It provides reliable data and manages the questionable workload of everyone in the organization.
Data Warehouse Design
The architecture of a typical data warehouse is as follows:
The data warehouse is designed and built to meet your information needs. Once you have identified the data you need, you design it so that you can transfer it to the data warehouse.
Create A Schema For Each Data Source
Create a database schema for each data source that you want to synchronize with the data warehouse. This
- This will allow you to quickly identify the data sources for each table, which is useful when the number of data sources grows. It also allows future data analysts and employees in the organization to quickly access individual data sources.
- You can assign specific access rights to a data source. For example, a data technician may not want to allow a junior analyst to read a particular schema but not write to it, which is particularly useful when the number of data sources grows over time. Think about how many sources your data might have. For example, you could create a flowchart for the email marketing, financial data and advertising data you want to import from these schemes into your store.
When you import a contact table from Mailchimp into your database, you can give it the following name:
Choose name, Mailchimp email address. Contact Mailchimp.
How To Create A Schema
Creating a layout is very simple. Just insert one line to create a new layout. In Postgres, it’s really just three words.
Creating a layout
Example:
Creating a layout for Mailchimp
Note a: New analysts may get confused between a database schema. Schemas can be defined in two ways. Schemas can be used to describe
- why tables and fields in the database are related, or
- A folder is an arrangement of the tables in the database and the records in the folders.
Note b: MySQL databases do not support schemas, so you need to use naming conventions for imported tables, e.g. mailchimp_contacts etc.
Transferring Raw Data To The Data Warehouse
The next step is to synchronize the source data with the data warehouse. Your engineers should be aware that this is an ETL scenario.
Please consider the following points when designing your import plan:
- Remove explicitly unnecessary columns. It is not difficult to add them later when you realize you need them.
- Rename the columns to make them more descriptive or more appropriate to the database (e.g. lower or upper case). Of course, delete unnecessary entries. These can be entries of internal users testing the production system.
- Correlate the log values to make them easier to read. This way analysts do not have to change their names when writing reports. For example, some records may contain numeric keys or abbreviated names that the rest of the organization does not know. In these cases it may be useful to assign these keys to their actual values during import.
- At the end of the import, use the database index of the target table. Troubleshooting: in case of failed tasks and service interruptions, create emails/messages with detailed error logs and send them to the relevant agencies. To reduce the number of false positives in this case, you can set up automatic re-tests at set times.
One question that often arises is how to convert data before you need it. The archive generally recommends not to do this. At least not from the start. In particular If this is your first data warehouse project. There are several reasons for this.
A. You May Not Have Clear Data Conversion Requirements
Even if you have clear requirements, these are likely to change or be exceeded during the project.
You don’t want to waste time redesigning the ETL scenario according to what different stakeholders want at different times.
Providing reference data helps to separate the dependencies of the ETL scenario from the business requirements.
B. Reference Data May Also Be Needed In Other Cases
Think of the source data as an interactive database that can be transformed into a series of derived tables by grouping along different dimensions or combining tables from other sources.
When the data is transformed, details of the original data are lost that may be needed in subsequent usage reports.
For example, if you aggregate sales by period, you lose the details of each transaction item that another user may need for comparison with other reports. If you enter data that is not translated, you can link it to other data sources.
The need for raw data becomes even more important when you start to create data models that can be reused to answer different questions.
C. Reducing The Load On Source Systems
Performing a data transformation on a source system can be resource intensive, especially when it is a database that serves customers from all over the world.
You don’t want to burden them with time-consuming data conversion tasks.
In some cases it makes sense to convert data before transfer, but this is usually for companies that have already built a robust data warehouse and want to improve it.
Transforming Data To Solve A Specific Problem
The decision to transform data can be complex. If you don’t take action, you risk wasting a lot of time optimizing data that has no value for business planning.
A good rule of thumb is to set a goal first. Transformed data should only be created to solve a specific case or problem.
These practical use cases can be identified through reports and dashboards generated with the imported data.
When users start reporting performance problems, the data can be transformed. This is possible because the reports contain
- embedded queries containing tables or custom expressions.
- Or multiple headings in multiple tables together.
This is where the flexibility of SQL-based reporting can help identify problems that can be solved by data transformation. Any analyst can quickly identify the cause of reports with long queries and optimize their performance.
This can be achieved primarily through automated pre-combination of data. This can be achieved through materialized visualization, where data conversion tasks can be created that simultaneously
- Merging large transaction tables to speed up query execution.
- Create derived tables with columns from different data sources.
- Avoiding/hiding sensitive data for selected groups of users.
Another suggestion is to create a new database schema in the data warehouse to store transformed (or post-processed) tables.
Similar to the previous approach, where all data sources are mapped to a schema, creating a specific schema can help to define the list of created/transformed data tables. This will be useful later when additional data import and transformation sets are created as the data matures.
Creating Internal Data Documents
This is important, especially if you don’t want your data warehouse to be a black box that only a few engineers can work with. If users don’t understand, they won’t dare ask questions.
You can start by creating a shared document that describes a common understanding of:
- Tables and columns in source data and how to interpret them
- Include a Data Diagram if it exists
- How to read your columns in your reports and any assumptions underlying each of them
Each time a report is created, update this document to reflect any new level of business understanding of your data.
Wrap-up
Uncertain or changing business requirements make it difficult to choose the right storage technology. Whatever tool is chosen, it must be scalable and flexible.
Generally speaking, designing a data warehouse is an excellent solution to easily collect and analyze business data. It increases access to data, speeds up analysis activities, improves the quality of information for reporting and ensures secure data management.
Did you find this article strategically relevant? Want to learn more about how to build a modern data warehouse? Then check out our page on Data Warehouse: Cloud, Hybrid & On-Premise Solutions or visit our page on Data Warehouse Consulting.