Monday, July 31, 2017

SQL DW Best pratices

Data Warehouse best practices
This is general guidance for DW and some of concept are based on customer’s scenarios, this is a only guidance and customers technical team can get an idea from here, applicability of this is depend on customers scenarios, this is information only document.

·        Identify the Dim and Fact
o   Dim : provide the descriptive context – attributes with the who, what, when, why, or how. They should always include friendly names & descriptions.
o   Dimension tables should *not* contain aggregatable numeric values (measures)
o   Fact: Fact tables contain the numeric, quantitative data (aka measures). Typically one fact table per distinct business process
·        Follow good naming convention to represent entity and attributes

·        Benefits of a Star Schema
o   Optimal for known reporting scenarios
o   Denormalized structure, structured around business logic, is good for performance & consistency
o   Usability for:
§  Stable, predictable environment
§  Less joins, easier navigation
§  Friendly, recognizable names
§  History retention
§  Integrate multiple systems Decoupled from source system

·        Performance of DW
o   Handling Larger Fact Tables
o   Clustered Columnstore Index
§  Reducing data storage due to compression of redundant values
§   Improving query times for large datasets
§  Improving query times due to reduced I/O (ex: column elimination)
§  suitable for:Tables over 1 million rows
§  Data structured in a denormalized star schema format (DW not OLTP) ü Support for analytical query workload which scans a large number of rows, and retrieves few columns
§  Data which is not frequently updated (‘cold’ data not ‘hot’)
§  Can selectively be used on insert-oriented workloads (ex: IoT)
o   Table Partitioning
§  Improving data load times due to partition switching
§  Flexibility for maintenance on larger tables
§  Improving query performance (possibly) due parallelism & partition elimination behavior
§  Speeding up ETL processes
§  Large datasets (50GB+) ü Small maintenance windows
§  Use of a sliding window ü Storage of partitions on separate drives (filegroups)
§  Older (cold) data on cheaper storage
§  Historical data on read-only filegroup ü
§  Speeding up queries (possibly) ,Partition elimination , Parallelism


·        Track History in a DW
o   Most common options for tracking history:
§  1. Slowly changing dimension
§  2. Fact snapshot tables
§  3. Timestamp tracking fact New option in SQL Server 2016:
§  4. Temporal data tables à Not a full replacement for slowly changing dimensions, but definitely useful for auditing

·        Handling of Nulls
o   Dimensions Rule of thumb is to avoid nulls in attribute columns.
o   Best practice is to avoid nulls in foreign keys. (However, nulls are ok for a measure.) By using an ‘unknown member’ relationship to the dimension, you can:
§   Safely do inner joins
§  Allow the fact record to be inserted & meet referential integrity
§  Allow the fact record to be inserted which avoids understating measurement amounts

·        Manually Maintained Data Maintain a DML script in a Lookup (LKP) table instead of hard-coding in the ETL
·        User SQL Data tool ( SSDT ) to compare schema and data

·        Extensibility in a DW Design

o   change in mind. Ex: Create a lookup table with code/descriptions, or implement in a view, rather than hard-coding in ETL.
o   Plan for a hybrid environment with multiple architectures. Introduce conformed dimensions first whenever possible.
o   Try to avoid isolated “stovepipe” implementations unless the isolation is absolutely intended. Conduct active prototyping sessions with business users to flush out requirements. A data modeling tool like Power BI works well for this.
o   Consider using an OLAP cube or in-memory model (like Analysis Services) for:
§   Summary data (as opposed to summary tables in your DW) • Year-to-Date type of calculations • Year-over-Year type of calculations • Aggregate level calculations (as opposed to row-by-row calculations)




Summary

·        DW Design Principles
o   Staging as a “kitchen” area
o   Integrate data from multiple systems to increase its value
o   Denormalize the data into a star schema
o   A column exists in one and only one place in the star schema
o   Avoid snowflake design most of the time
o   Use surrogate keys which are independent from source systems
o   Use conformed dimensions
o   Know the grain of every table
o   Have a strategy for handling changes, and for storage of history
o   Store the lowest level of detail that you can
o   Use an ‘unknown member’ to avoid understating facts
o   Transform the data, but don’t “fix” it in the DW
o   Structure your dimensional model around business processes
o   Design facts around a single business event
o   Always use friendly names & descriptions
o   Use an explicit date dimension in a “role-playing” way
o   Utilize bridge tables to handle many-to-many scenarios
o   Plan for complexities such as:
§  Header/line data
§  Semi-additive facts
§  Multiple currencies
§  Multiple units of measure
§  Alternate hierarchies and calculations per business units
§  Allocation of measures in a snowflake design
§  Reporting of what didn’t occur (factless facts)
§  Dimensional only analysis







·        Reference architectures