Tuesday, July 19, 2016

Guiding Principles of Data Warehousing

I know Kymball and Inmon have written vast texts on data warehousing. So much in fact that you can find literature to support or criticize almost any design. On this page I want to give my 6 Guiding Principles that I use when designing a bespoke data warehouse. A corollary to my principles is that no particular design feature is bad, it can only be deemed appropriate/inappropriate in a particular implementation.

1. No Loss of Data
All information taken from source systems should be held without losing information. This might sound simple, but there can be subtle losses of data by omitting columns, trimming columns, correcting data, denormalising, transformation, cleansing, etc. The test is any question that the source system can answer (within the tables extracted) should also be answerable from the DW (and with the exact same answer).

2. Keep it Simple
Minimise physical layers of transformation that make it difficult to determine data at source. A common question will be “why does this record have that value in the DW?” The simpler, and less physical, the transformation, the easier it is to reconcile DW data with source systems. By making reconciliation easier, we will be able to build trust that the DW is valid and can be relied on to give a “correct” answer. If there is a bug in the ETL, it will be easier to uncover in a simple DW. Once a view of the data is materialised in the DW, it becomes more difficult to ascertain where the data came from and which insert and update statements have acted on the data.

3. Ease of Enhancement
This is closely related to the second principle in keeping it simple. The simpler the architecture and minimal physical transformation, the easier it will be for future developers to enhance the DW to add/extend data sources or perform new analysis. DWs are invariably never finished as changes in business activity, regulatory requirements, enterprise systems, competitive action, supplier action, etc. will drive new DW information and analytical requirements. The best indication for ease of enhancement is how much development effort did it take to create? Enhancements will be a proportion of the initial development effort. A DW that took 10 developer years to create will take 10 times the effort to enhance than a DW that took 1 developer year to create. Even if both DWs contain the same information.

4. Historical Integrity
If the DW users require it, historical integrity should be maintained so the DW can report on what the values "were" at the time of the transactions (facts). While it is important to maintain this information, it can be expensive, so it should only be maintained on tables that require it and only on columns of these tables that are worthy of it.

5. High Performance and Room for Scale
The DW should be able to perform well with the anticipated volumes. There is no excuse for slow queries that were known in advance along with anticipated table sizes. Of course, as the DW grows in unanticipated ways and new queries/analyses are written, further optimisation will be required. The initial design should take into consideration the size of the tables and how they will be joined with related tables. This will lead to some large tables often joined having similar clustered indexes, and appropriate normalisation, partitioning etc. These considerations are often very easy to implement at the beginning, but become more time consuming once implemented and built upon.

6. Appropriate Design
Just like building architecture, there is no one design that suits all. The design should be appropriate for the purpose. We won’t follow others’ design guidelines blindly.

No comments: