ASPIRE Data Model Principles

urn:js:virtue:aspire:principle:36.1

TL;DR

These are the Principles that each of the ASPIRE data models must adhere to. The principles are based on the best practices of data modelling production and provide guide rails for the ASPIRE Data Model Standards.

Rational

Delivering a large and complex programme such as ASPIRE involves basing the deliveries on sound design principles.

There are a number of Data Models used throughout the ASPIRE eco-system. Some are owned and maintained by the Information Architect and some are owned and maintained by the Data Modelling team.

A summary of the ASPIRE Data Models is as follows:- Figure 1: ASPIRE Data Model Layers

These Principals cover the models owned by Data Architecture:

The L4 model is used to understand and document business understanding, scope and requirements.
The L5 model is used as part of the detailed design and development delivery.

Implications

Failure to adhere to the Principles and the associated Standards will result in a high risk that the Aspire project fails to deliver the desired outcome. To support the ambition to have a single, democratised dimensional layer it is essential that consistant design Principles and Standards are defined, regularly reviewed and adherence is audited.

Appendix

L4 Logical Model Principles

Purpose of the model

The L3 Conceptual Model gives an Enterprise Level perspective of the Sainsbury’s data landscape. The L4 Logical Model will build upon the Conceptual Model and increase the detail to drive understanding into the development activities.
The Logical Model will be used to document the Entity, Attribute and Relationship requirements in enough detail to enable a Data Modeller to produce a Solution Data Model.
The Logical Model is not physicalized but it will be used:-

During project delivery to test that each of the changes made to this model represent a business requirement and ensures each change has been reflected in the structure and mappings into the Solution Data Model.
Is a structured and unambiguous notation to use when speaking to business customers about the data we hold within ASPIRE.

Figure 2: Use of the L4 Logical Model

For brevity, from now on the L3 Conceptual Data Model will be referred to as the L3 Model, the L4 Logical Data Model will be referred to as the L4 Model and the L5 Solution Model will be referred to as the L5 Model.

Model Principles

This sub-section details the Principles behind why ASPIRE needs a L4 Logical Data Model:-

The L4 Model is a technology independent representation of the Sainsbury’s data activities and relationships and will be built up in house to document for both technical and non-technical users what data the ASPIRE data models hold. For example, if someone asks “What data do we hold about Sales Transactions in ASPIRE?” the L4 Model will answer this in a lower level of detail than what is held in the Information Architecture L3 Model
The L3 Model is the master and roadmap for the content of the L4 Model
Major changes to the L3 Model must be reflected in the L4 Model and visa-versa
The Entities within the L4 Model will always inherit the meaning/description from the L3 Model Entity
The L4 Model will cover functionality implemented by the L5 Model. As described below, the L5 Model covers the flow of data through both Integration and Presentation Layers, but the L4 Model has no concept of these different layers
Changes cannot be made to the L4 Model unless they are also represented in the L3 Model. As the L3 Model matures there will be projects that deliver changes to the solution models, i.e. the L4 Model and L5 Model that don’t require changes to the L3 Model
Meta-data will be built into the L4 Model to report on lineage between the L3 Model Entity and L4 Model Entities
One L3 Model Entity may be represented as one or more L4 Model Entities if the requirement necessitates further elaboration. This may especially be true in the areas of sub-typing and resolving many to many relationships
The L4 Model will introduce attributes to the Entities
Each attribute will contain a business definition as defined by the Business Analysts
The L4 Model will introduce Subject Areas to break the model down into more focused areas that are easier to manage
All Relationships in the L4 Model will show the cardinality of the links between the dependent Entities
All Relationships in the L4 Model will have a business meaning for the link
Entities are keyed on Natural Keys. The choice about the use of Surrogate Keys, Hash Keys or Natural Key strategies is within the remit of the L5 Model
The L4 Model may contain changes that are not yet modelled in the L5 Model
The model will be maintained in ERWIN
Within the L4 Model no overall single Entity Relationship Diagram will contain a summary of all of the Entities and Relationships because this is mastered in the L3 Model.

L5 Solution Model Principles

Purpose of the model

The L5 Solution Model will be used to define and generate the physical objects that make up the schemas that deliver the data throughout the ASPIRE eco-system. The Solution Model will contain, for example, definitions for tables, views, columns and the DDL generated will be used by the development teams to build the solution.

The L5 Solution Model will be split into specific Data Models that serve the following purpose:-

Data Lake Data Model
ADW Integration Layer Data Model
ADW Presentation Layer Data Model.

The L5 Solution Model implements the requirements as defined in the L4 Logical Model.
The relationship between the three Solution Models introduced above is documented in the ASPIRE Target Architecture document owned by the Solution Architecture team.
At a high level the models are used at the following points within the ASPIRE eco-system:- Figure 3: ASPIRE eco-system location of the different models

Data Lake Model Principles

This sub-section details the Principles behind why ASPIRE needs a L5 Data Lake Data Model:-

There must be understanding of what data objects are contained within the Data Lake. A list of all objects will be documented for the ASPIRE programme
Data objects will be segmented not by Subject Area but the parent S3 Bucket
Segmentation will also identify which data objects are within the Raw Layer, Curated Layer and which are within the Information Layer
Data copied from the ADW Presentation Layer into the Data Lake Information Layer will mirror the ADW Presentation Layer object structure
Data Quality at a row by row level will be performed on the files as they move from the Raw Layer to the Curated Layer
Data objects can be modified as they move between the different layers of the Data Lake, for example hash keys are generated and technical meta-data columns are added to enrich the data objects when landed in the Curated Layer
Data will be stored in a variety of formats at the discretion of the ASPIRE development team
All Data Lake data objects will have their meta-data uploaded to Alation.
Additional meta-data will be added once in Alation to align the data objects with the relevant L3 Logical Data Model entity.

The Data Lake principles implement the following DACE Data Management principles:-

Data Ingestion: Source Once, Use Many Times
Data Ingestion: One Way In
Data Ingestion: No Data Without Metadata
Data Ingestion: Retain As-Received Data
Data Processing: Process All Data
Data Processing: Store Record-Level Data

ADW Integration Layer Model Principles

The primary goal of the Integration Layer is to bring data into the ADW into a source system independent structure via internal common structures, concepts and language
The storage of the data will be modelled so as to avoid data silos
Data is loaded as presented from the curated Data Lake input sources
The granularity of the data in the Integration Layer will be at the lowest level of detail as supplied from the source systems
Very large sources of data, for example ClickStream web activity or detailed Sales Basket transactions will need to be saved to disk sparingly. All data will be written to the Integration Layer, the choice about whether to move this data to the Presentation Layer (e.g. a table) or virtualise the data (e.g. a view or internal Snowflake native format type) is a Presentation Layer decision
The model must enable agile development, so will be efficient in that reusable patterns can minimize modelling work and can be an enabler to reduce ETL development via the implementation of common framework components
The Integration Layer will be structured to add flexibility to meet both current and future requirements. This will be achieved by creating a data model that splits the decisions about where data is initially stored and how it is subsequently used
The structure must enable high performance parallel data loading of targets
Surrogate sequences will not be used as they introduce bottlenecks
Hash Keys will not be used as after analysis with Snowflake Consultants it has been advised they are not performant for joining very large datasets
The drivers for the use of the Integration Layer model are:-
- Data Integrity – a primary goal of this model; define referential integrity and eliminate or strictly limit data redundancy
- Scalability – allow for increases in volume of data or the addition of new sources of data for existing business areas
- Flexibility – the design is not tied to one or more source systems, allowing for the flexibility to change the structure/sources of data being fed into this layer
- Consistency – a pattern based approach to data modelling will ensure all similar attributes will be held in the same places irrespective of the source feed being loaded
The model will be maintained in ERWIN
The model will be built using the Data Vault modelling pattern
As Data Vault is based on 3NF/normalised modelling, if no data is present then no default record will be created
All DDL for the Integration Layer will be generated out of ERWIN
The schemas created by the development team must match the generated DDL

The ADW Integration Layer principles implement the following DACE Data Management principles:-

Data Processing: Process All Data
Data Processing: Store Record-Level Data
Data Processing: Maintain Change History
Data Processing: Create Consistent Keys

ADW Presentation Layer Model Principles

This sub-section details the Principles behind why ASPIRE needs a L5 Presentation Layer Data Model:-
1. The primary goal of this layer is to enable a clear understanding of the data and ensure access of the data is performant depending on the tools used by the consumers
2. The Presentation Layer can create separate target objects based on a single Integration Layer object, for example all customers are loaded into the same object in the Integration Layer but split into separate Argos and Nectar customers in the Presentation Layer. This is to reduce uncertainty and reduce filters/where clauses in the Presentation Layer
3. The contents of the Presentation Layer are sourced only from the Integration Layer
4. The Natural Key(s) from the Integration Layer will be carried over into the Presentation Layer
5. The model will be built using the Dimensional modelling pattern – therefore denormalisation will be permitted in the Presentation Layer on a case by case basis. The Dimensional targets will be built upon a combination of the Raw Data Vault and Business Data Vault areas within the Integration Layer
6. All major dimensions in the Presentation Layer will be historised with clear from/to date ranges available
7. Historised Dimensions in the Presentation Layer must support intra-day refresh to give the customer’s the option of always seeing the latest data
8. Minor dimensions in the Presentation Layer will not be historised
9. All event level facts will have a date (and optionally time) available
10. Snapshot (denormalised single row records) may be used where needed
11. Data derived from business rules must only be calculated in one place and the result of the calculation made available in one place in the Presentation Layer for the use of different consumers. The Business Data Vault (part of the Integration Layer) will assist in achieving this
12. The more complex business Entities, for example Customer or Item will be denormalised and sourced from potentially many Integration Layer objects into one or a smaller number of Presentation Layer objects. This will simplify the access of the data for the Presentation Layer tools and consumers but will involve data manipulation via a view or ETL development
13. Snowflake (not the technology but the modelling technique) structures are to be minimised but will be considered on a case by case basis
14. Redundant keys will be included on facts and aggregates to improve run-time performance when accessing the data. However late arriving data may necessitate the re-aggregation of affected objects
15. Hierarchies will be collapsed into the relevant parent dimension
16. All foreign keys are mandatory, the use of -1 for Unknown and -2 for Not Applicable must be followed
17. The design and development team have the option of fully physicalizing the objects in the Presentation Layer. However in some cases, for example simple reference data or large transactional data sets the use of views can be chosen to provide a virtual object in the Presentation Layer
18. The model should support some attributes within a historised dimension being non-historised; this can help reduce the number of type 2 Dimension rows created for a data element that changes often and is not need to be historised, for example OUTSTADING_DEBT on an Account dimension
19. The model will be maintained in ERWIN
20. Only a physical data model will be required due to the presence of the L4 Logical Data Model and the single target for ADW which is Snowflake
21. All DDL for the Presentation Layer will be generated out of ERWIN
22. The schemas created by the development team must match the generated DDL

The ADW Presentation Layer principles implement the following DACE Data Management principles:-

Data Processing: Store Record-Level Data Data Processing: Maintain Change History Data Provision: One Way Out Data Provision: Share Data Data Provision: Common Controlled Vocabulary

ASPIRE Data Model Principles

urn:js:virtue:aspire:principle:36.1

TL;DR

Rational

Implications

Appendix

L4 Logical Model Principles

Purpose of the model

Model Principles

L5 Solution Model Principles

Purpose of the model

Data Lake Model Principles

ADW Integration Layer Model Principles

ADW Presentation Layer Model Principles