ADW Integration Layer Model Standards
urn:js:virtue:aspire:standard:18.1
TL;DR
The standards to be followed by Data Architects when creating the ADW Integration Layer Model.
Definition
The standards for the ADW Integration Layer Data Model are detailed below. The naming convention for each standard follows this approach:
Example, IGE-1 means
- I denotes the Integration Layer Solution/Physical Data Model
- GE denotes the General section
- 1 is the number of the standard
The shortcut used is clearly indicated in the section names below.
General (GE)
| Standard | Overview | Detail |
|---|---|---|
| IGE-1 | The Integration Layer will be built using the Data Vault 2.0 modelling methodology. | The Integration Layer will consist of these areas:- Raw Data Vault – Source-independent data loaded from multiple sources and stored in a common format. Business Data Vault – an extension of a Raw Data Vault that applies selected business rules, denormalisations, calculations, and other query assistance functions in order to facilitate user access and reporting. Business Data Vault tables should be refreshed once their dependent Raw Data Vault tables are refreshed. Additionally, the integration layer also has these minor sub-layers (schemas):
|
| IGE-2 | A combination of the Entity types from Data Vault 2.0 will be utilised. | In the Raw Data Vault, the following Entity Types will be implemented:-
|
| IGE-3 | Unique keys will be referred to as Natural Keys. | The term Natural Key will be used, and Business Key (although equivalent) will not be used in any models. |
| IGE-4 | To enable lineage to the LDM extra meta-data will be added for each object created. | The following meta-data will be created and populated for every object created:-
|
| IGE-5 | Time portion of Date Time’s is optional if not applicable. | A time portion of a Date Time is to be set to 00:00:00 if no time is specified. |
| IGE-6 | A BDV object can reference a PL object | It is acceptable to access PL objects from BDV when creating other objects. The aim should be to make such references as simple and transparent as possible. When creating a new BDV objects, the PL object select statements should not involve complex logic on the attributes selected. Eg: Dimension look up is allowed If however, complex logic does need to be performed on the selected attributes from the PL object then it is suggested that a new BDV object is created that contains transformed data from the PL object. This BDV table (plus any other sources of data) can then be used to create the new PL/BDV object. Eg: Using a PL metric to derive another metric in BDV is not allowed. |
| IGE-7 | Publication of date and/or time columns | In the RDV:
In the BDV, follow the same standards as for the PL (see PGE-10). |
Hubs (HU)
| Standard | Overview | Detail |
|---|---|---|
| IHU-1 | Hubs are used to record the major Entities that underpin the business, the ‘things’ the business owns and stores additional data about. | Every row in a Hub is a unique Natural Key for the Entity. This Natural Key(s) may come from more than one source system. If the same Natural Key is loaded from +1 system then duplicated are not created but the row is re-used. Reference data about the Hub is stored in linked Satellites. |
| IHU-2 | Hubs don’t have to have Satellites. This is known as a Stub-Hub. | A Stub-Hub is simply a list of unique Natural Keys. It is a place-holder for future expansion when reference data about the Entity is available to be modelled. Once this reference data is available then Satellites will be added and the Stub-Hub will become a Hub. Stub-Hub is not a sub-type of a Hub, it is an application at a point in time of the Hub. |
| IHU-3 | Hubs will be identified by a Natural Key. | Every row in the Hub will contain a Natural Key used to identify the entity in the source systems. |
| IHU-4 | Multi-part Natural Keys will be supported (i.e. Compound Keys). | It is possible for a Natural Key from a source system to be composed of a Compound Key over more than one column. |
| IHU-5 | The source system of the data in the Hub will be stored. | The source system for the Natural Key will be stored in the Hub. Where more than one source system supply the same natural key, the source system on the Hub will be the first system to load data into the Hub. |
| IHU-6 | Naming of Hubs. | The naming convention of a Hub is <Entity>_HUB For example STORE_HUB PROMOTION_HUB |
| IHU-7 | Length of Hub name. | |
| IHU-8 | Abbreviations | |
| IHU-9 | Structure of the Hub. | Detailed below are the standard columns and naming convention that must be present for each Hub. No other attributes are to be held on the Hub Entities. Abbreviations In the physical model, the Natural Key columns of the Hub will not be abbreviated. The name of the Natural Key columns must align with the name of the Hub, for example on NECTAR_ACCOUNT_HUB the Natural Keys will be called NECTAR_ACCOUNT_NK1, NECTAR_ACCOUNT_NK2 and NECTAR_ACCOUNT_NK3. |
| IHU-10 | Sub-types of Major LDM Entities will be physicalized as specific Hubs. | This will allow the mapping from raw input feeds to be modelled and build more closely. |
Sample Hub Definition
STORE_HUB
| Attribute | Data Type | Mand? | Description |
|---|---|---|---|
| STORE_NK1 | VARCHAR | Y | Natural Key from the source system. First part will be mandatory. The other two will be optional (because they are part of the Primary Key they will be loaded with a default value). Part of the Primary Key. |
| STORE_NK2 | VARCHAR | Y | Second Natural Key value. Part of the Primary Key. |
| STORE_NK3 | VARCHAR | Y | Third Natural Key value. Part of the Primary Key. |
| BATCH_ID | VARCHAR | N | Unique process value to identify the job instance (e.g. Ab Initio ETL, Snowpipe, Python process) |
| JOB_NAME | VARCHAR | N | The name of the [sub] process that inserted this record. |
| LOAD_TS | DATETIME | Y | System date when the record was inserted. Not part of any data feed values. |
| SOURCE_SYSTEM_CD | VARCHAR | Y | 6 character string to uniquely identify the source system that supplied the row of data in the Hub. |
Note, this is the column order, no additional columns can be added to a Hub.
Satellites (SA)
| Standard | Overview | Detail |
|---|---|---|
| ISA-1 | Satellites hold data about Hubs that may change over time. | A delta driven record of descriptive information that can be accessed either for the current time or a time period in the past. No history is stored in the Hub, all of the History is stored in the Satellites. |
| ISA-2 | Satellites will be split by source system. | Because of the nature of the Integration Layer, each distinct source system will have a separate Satellite. This allows rapid load of data into the Integration Layer. At a later date if the business rules change then an easier choice can be made about an alternative data source than loading the data again. |
| ISA-3 | Every row in a Satellite will contain a date and time of when the data was loaded. | System Load Dates will be mandatory. For example on 1-Feb 13:03 a record was loaded for a Satellite, this is the date and time that the data change was first notified. |
| ISA-4 | History will be maintain via an insert policy, i.e. only a System Load DateTime will be maintained as part of the primary key. | This ensures when a new piece of data we want to track for history change is available the model does not need to have two changes made, one to close off the existing entry and one to create the new entry. |
| ISA-5 | Only changes will be written to Satellites. | On a day by day basis, if a row is presented that contains no changes from the latest row in the Satellite, then no additional rows are created. The System Load DateTime in a Satellite indicates when a change arrived from source. |
| ISA-6 | The Primary Key of a Satellite will be a two part key. | Satellite Primary Keys consist of:-
|
| ISA-7 | Naming of Satellites. | The naming convention of a Satellite is <Entity>_<Context>_<Source System>_SAT <Context> is optional.
Assuming SCVCUS as source system code.Multiple satellite tables as following:CUSTOMER_SCVCUS_SAT, CUSTOMER_ADDRESS_SCVCUS_SAT,CUSTOMER_PHONE_SCVCUS_SAT STORE_R09ITM_SAT PROMOTION_DPPSTR_SAT CUSTOMER_ADDRESS_SCVCUS_SAT The SOURCE_SYSTEM_CD will exist on each row in a Satellite, and the same value will exist in the Satellite name <Source System> above. |
| ISA-8 | Length of Satellite name. | |
| ISA-9 | Abbreviations | |
| ISA-10 | Structure of the Satellite. | Detailed below are the standard columns and naming convention that must be present for each Satellite. Abbreviations The Natural Keys of a Satellite are inherited from the Hub. The Natural Keys cannot be changed or abbreviated. Attribute columns of the Satellites will be abbreviated in line with the abbreviations mastered in the L5 Logical model. |
| ISA-11 | Business effective dates are attributes within the Satellite. | The Key for the Satellite is the System Load Date. If an attribute within the Satellite is business date sensitive, for example the date that a customer status changed, then this will have to be supplied by the source system and stored as an attribute on the Satellite. |
| ISA-12 | Only by exception will Satellites with non-standard Primary Keys be created. | In the Data Vault 2.0 standards such targets are known as Multi-Active Satellite “A Multi-Active Satellite is a structure built to house multiple active child records, where these child records have no stand-alone business key of their own. A prime example would be an employee with multiple active addresses at the same time.“ |
Sample Satellite Definition
STORE_RMS9_SAT
| Attribute | Data Type | Mand? | Description |
|---|---|---|---|
| STORE_NK1 | VARCHAR | Y | Natural Key from the source system. First part will be mandatory. The other two will be optional (because they are part of the Primary Key they will be loaded with a default value). Part of the Primary Key. |
| STORE_NK2 | VARCHAR | Y | Second Natural Key value. Part of the Primary Key. |
| STORE_NK3 | VARCHAR | Y | Third Natural Key value. Part of the Primary Key. |
| LOAD_TS | DATETIME | Y | System Load Date and Time for this row, i.e. the time it was discovered and loaded into the Integration Layer. Final part of the Primary Key. |
| BATCH_ID | VARCHAR | N | Unique process value to identify the job instance (e.g. Ab Initio ETL, Snowpipe, Python process) |
| JOB_NAME | VARCHAR | N | The name of the [sub] process that inserted this record. |
| RECORD_DELETED_FLAG | CHAR | N | Used to denote whether the Natural Key is no longer being supplied by the source system. |
| HASH_DIFF | NUMBER(38,0) | N | A hash key based on the non key and non-technical attributes, i.e. the columns from the source systems. It may be used by the code frameworks for determining whether a change has occurred in the source record compared to the current record, i.e. used for CDC processing. |
| RECORD_ID | VARCHAR | N | A UUID/GUID to uniquely identify each row of data. |
| SOURCE_SYSTEM_CD | VARCHAR | Y | 6 character string to uniquely identify the source system that supplied the row of data in the Hub. |
| DQ_AUDIT | VARCHAR | N | JSON message of any error/warnings on this row of data from the DQ audit rules within the Curated Data Lake layer. |
| First column from source system | Various | N | Always null cardinality. |
| Second column from source system | Various | N | Always null cardinality. |
| Etc… | Various | N | Always null cardinality. |
Links (LK)
| Standard | Overview | Detail |
|---|---|---|
| ILK-1 | Links hold relationships between two or more Natural Keys at a point in time. | Links are typically transactions within the business between two or more Entities, for example Online Purchase, Product Promotion or New Manager for a Store. |
| ILK-2 | Links always have a time or datetime element. | Links never have effective from and to date ranges, they record the instance of an interaction. Links are never Hubs. Links are inherently associative in nature. |
| ILK-3 | Links can store relationships between Hubs of the same type. | Two or more Natural Keys belonging to the same Hub can have a relationship. Common examples are Parent/Child relationships, or ‘same-as’ which links two different natural keys within the same Hub together. These will always be named as <HUB>_RELATED_LINK. |
| ILK-4 | The Hub Natural Key will be denormalised into the Link. | The Natural Key from the Hubs will cannot be renamed or abbreviated. |
| ILK-5 | Every Link needs to have a Reason for the relationship. | The reason will describe why more than one natural keys (Hubs) are being joined. For example
REASON_CD – Has two modes of usage USE CASE 1 – The REASON_CD can be the same in Link tables created for data from the same business process or event to uniquely identify the reason why hubs from the same business process or event are being joined at this time USE CASE 2 – The REASON_CD can be different for Link tables created for data from the same business process or event when required to provide context and filter records. |
| ILK-6 | The Primary Key of a Link will be a Compound Key. | Link Primary Keys consist of:-
|
| ILK-7 | Naming of Links. | The naming convention of a Link is <Hub#1>_<Hub#2>_<Hub#x>_LINK For example STORE_EMPLOYEE_LINK STORE_RELATED_LINK PROMOTION_PRODUCT_LINK Alternatively, in the case of a transactional Link, or where the number of Hubs make the name resulting from <Hub#1>_<Hub#2>_<Hub#x>_LINK unwieldy, a Link can be named according to the subject matter. For example: PURCHASE_ORDER_LINK. |
| ILK-8 | Length of Satellite name. | |
| ILK-9 | Abbreviations | |
| ILK-10 | Structure of the Link. | Detailed below are the standard columns and naming convention that must be present for each Link. Abbreviations The Natural Keys of a Link are inherited from the Hub. The Natural Keys cannot be changed or abbreviated. Measure columns of the Links will be abbreviated in line with the abbreviations mastered in the L5 Logical model. |
| ILK-11 | Links don’t have a Hash Key. | Due to the cost of calculating a Hash Key for high volume transactional data, a unique Hash Key will not be generated for a Link. In the Data Vault 2.0 standards it is a rule to create Hash Keys for Links, this is to allow a Satellite to be connected to a Link. However our standard (below) will add data that would have gone onto a Satellite directly onto the Link. |
| ILK-12 | A Link can contain additional data about the event. | Additional fields from the IGE-8 source feed can be added as attributes directly to the Link. This will avoid creating a second Satellite for each Link and will reduce high volume joins at run-time. |
| ILK-13 | Duplicates will be loaded | Another use of the Hash Key on Link tables is to ensure duplicates are not loaded into the Integration Layer. We will make use of the physical process/batch load meta-data to ensure duplicate data runs are not processed. |
| ILK-14 | Links can be denormalised if the volumetrics suit this pattern of model. | On a case by case basis large transactional events, for example place an order, web traffic history, sales basket content can be written in a denormalised manner for performance optimisation. These are known as Transactional-Links and one difference is that they contain a Transaction Date and/or Time as part of the Primary Key. This means the Load Datetime can be made a Measure of the Link rather than part of the Primary Key. |
Sample Link Definition
STORE_EMPLOYEE_LINK
| Attribute | Data Type | Mand? | Description |
|---|---|---|---|
| STORE_NK1 | VARCHAR | Y | This will be primary key of the Hub and is a Foreign Key in this Link. First part of the Link Primary Key. |
| STORE_NK2 | VARCHAR | Y | This will be primary key of the Hub and is a Foreign Key in this Link. First part of the Link Primary Key. |
| STORE_NK3 | VARCHAR | Y | This will be primary key of the Hub and is a Foreign Key in this Link. First part of the Link Primary Key. |
| EMPLOYEE_NK1 | VARCHAR | Y | This will be primary key of the Hub and is a Foreign Key in this Link. Second part of the Link Primary Key. |
| EMPLOYEE_NK2 | VARCHAR | Y | This will be primary key of the Hub and is a Foreign Key in this Link. Second part of the Link Primary Key. |
| EMPLOYEE_NK3 | VARCHAR | Y | This will be primary key of the Hub and is a Foreign Key in this Link. Second part of the Link Primary Key. |
| LOAD_TS | DATETIME | Y | System Load DateTime for this row, i.e. the date it was discovered and loaded into the Integration Layer. Third part of the Link Primary Key. |
| REASON_CD | VARCHAR | Y | 6 character string to uniquely identify reason why these Hubs are being joined at this point in time. Fourth part of the Link Primary Key. |
| SOURCE_SYSTEM_CD | VARCHAR | Y | 6 character string to uniquely identify the source system that supplied the row of data in the Hub. Fifth part of the Link Primary Key. |
| BATCH_ID | VARCHAR | N | Unique process value to identify the job instance (e.g. Ab Initio ETL, Snowpipe, Python process) |
| JOB_NAME | VARCHAR | N | The name of the [sub] process that inserted this record. |
| RECORD_ID | VARCHAR | N | A UUID/GUID to uniquely identify each row of data. |
| DQ_AUDIT | VARCHAR | N | JSON message of any error/warnings on this row of data from the DQ audit rules within the Curated Data Lake layer. |
| First column from source system | Various | N | Always null cardinality. We would expect the business date [and time] of this transaction to be made available here. |
| Second column from source system | Various | N | Always null cardinality. |
| Etc... | Various | N | Always null cardinality. |
Reference (RE)
| Standard | Overview | Detail |
|---|---|---|
| IRE-1 | Naming of References. | The naming convention of a Reference object is <Entity>_REF For example SOURCE_SYSTEM_REF REASON_REF |
| IRE-2 | Structure of the Reference. | Detailed below are the standard columns and naming convention that must be present for each Reference object.
|
| IRE-3 | References can either be point in time or historised. | The Load Datetime can be used as required. |
| IRE-4 | Reference objects are keyed on Codes and do not require a hash key. | Because of the small amount of data a hash key is not required for each row in a Reference object. |
Sample Reference Definition
SOURCE_SYSTEM_REF
| Attribute | Data Type | Mand? | Description |
|---|---|---|---|
| SOURCE_SYSTEM_CD | VARCHAR | Y | This will be primary key of the Reference and will be used as a Foreign Key in other objects. First part of the Primary Key. |
| LOAD_TS | DATETIME | Y | System Load DateTime for this row, i.e. the date it was discovered and loaded into the Integration Layer. |
| BATCH_ID | VARCHAR | N | Unique process value to identify the job instance (e.g. Ab Initio ETL, Snowpipe, Python process) |
| JOB_NAME | VARCHAR | N | The name of the [sub] process that inserted this record. |
| RECORD_ID | VARCHAR | N | A UUID/GUID to uniquely identify each row of data. |
| SOURCE_SYSTEM_DESC | VARCHAR | Y | Description of the Reference data Code. |
| DQ_AUDIT | VARCHAR | N | JSON message of any error/warnings on this row of data from the DQ audit rules within the Curated Data Lake layer. |
| Other attributes… | Various | N | Always null cardinality. |
| Etc… | Various | N | Always null cardinality. |
Point in Time (PT)
| Standard | Overview | Detail |
|---|---|---|
| IPT-1 | Point in Time tables are used to simplify and speed up Presentation Layer generation. | A Point In Time target is a System Driven Satellite loaded with a Hubs’ Natural Key(s), and the surrounding Satellites’ Primary Key values. These PK’s are a full copy of the entire PK housed in the Satellite. |
| IPT-2 | Point in Time tables are physicalized. | The tables will be created as physical tables in the Business Data Vault. |
| IPT-3 | Point in Time only contain reference data linked to a Hub. | The scope of each Point in Time target is a single Hub and all its associated satellites. |
| IPT-4 | Point in Time must contain the Primary Key of each Satellite. | This is achieved by each satellite being a column and the latest LOAD_TS of the satellite at the time that the Point in Time table is created is populated. |
| IPT-5 | A Point in Time target will be created every time at least one of the Satellites has a new record inserted. | The DateTime part of the Primary Key of a Point in Time target will contain the Date and Time that one of the Satellites last had a row inserted. PIT tables can support intra-day changes due to the Time element. |
| IPT-6 | Optionally a Point in Time table may be created to represent data changing based on Business Dates stored in the Satellites. | The DateTime part of the Primary Key of a Point in Time target can be based on a Business datetime, this will allow a view of the Hub and its Satellite data to be viewable by business dates and not on the day it was loaded. |
| IPT-7 | The Primary Key of a Point in Time. | PIT Primary Keys consist of:-
|
| IPT-8 | Naming of Point in Time. | The naming convention of a PIT is <Entity>_PIT For example STORE_PIT PROMOTION_PIT |
| IPT-9 | Structure of the Point in Time. | Detailed below are the standard columns and naming convention that must be present for each PIT:-
|
| IPT-10 | Stubbing a Satellite. | Stubs with a low date are not being created in Satellites. This means that in a PIT table a <null> Load Datetime can exist for some Satellites that contain no data. |
Bridge (BR)
| Standard | Overview | Detail |
|---|---|---|
| IBR-1 | The Bridge Table is a combination of primary keys and natural keys spread across multiple Hubs and Links. They can be thought of as “base level Fact Tables”. | They provide a snapshot of key structures and are generally not temporal in nature. That said, because Bridge Tables live within the Business Data Vault, they can also house computed fields, and / or temporality. |
| IBR-2 | Bridge tables are physicalized. | The tables will be created as physical tables in the Business Data Vault. |
| IBR-3 | Bridge tables enable virtualisation downstream. | The most common use of a Bridge table in the Business Data Vault is to allow one or more Presentation Layer objects to be created as Views. |
| IBR-4 | The grain of the Bridge table matches the expected downstream usage. | The grain does not need to match any particular combination of Link tables, but rather the expected downstream target. This is because the primary purpose of a Bridge table is to allow virtualisation of the Presentation Layer target (typically via a View). |
| IBR-5 | The Bridge table can meet two technical purposes:-
| Columns are denormalised onto the Bridge table as per the requirements stated on the left. |
| IBR-6 | The Primary Key of a Bridge. | Bridge Primary Keys consist of:-
|
| IBR-7 | Naming of Bridge. | The naming convention of a Bridge is <Entity>_BR For example SALES_TRANS_LINE_BR ITEM_MRCHSE_HIERARCHY_BR |
| IBR-8 | Structure of the Bridge. | Detailed below are the standard columns and naming convention that must be present for each Bridge:-
|
| IBR-9 | Column Naming Conversion | Since Bridge table is to support downstream/ PL objects, It is not required to follow NK[1,2,3] naming convention for Natural Keys. Rather it can adhere to PL naming conventions. |
Reporting (RP)
| Standard | Overview | Detail |
|---|---|---|
| IRP-1 | A Reporting table is used to simplify and store calculations that are used downstream. | This allows a single version of the truth target to be created in the Business Data Vault. |
| IRP-2 | The Reporting table will contain metrics, KPIs, intermediate calculations, results etc. | A variety of calculations can be created in Reporting tables on a case by case basis. |
| IRP-3 | The grain of a Reporting table will be suitable for the downstream use. | Reporting tables do not have to mirror the Primary Keys of objects in the Raw Data Vault, for example a rollup to KPI level can occur for a transactional feed. |
| IRP-4 | Reporting tables are physicalized. | The tables will be created as physical tables in the Business Data Vault. |
| IRP-5 | Naming of Reporting table. | The naming convention of a Reporting table is <Entity>_REP For example SALES_TRANS_REP FINANCE_KPI_REP |
| IRP-6 | Structure of the Reporting table. | Detailed below are the standard columns and naming convention that must be present for each Reporting table:-
|
| IRP-7 | Column Naming Conversion | Since reporting table is to support downstream/ PL objects, It is not required to follow NK[1,2,3] naming convention for Natural Keys. Rather it can adhere to PL naming conventions. |
PII Data in Integration Layer
| Standard | Overview | Detail |
|---|---|---|
| PII-1 | Clear text PII data will be held in columns classified as PII. | The column with PII data should be classified as PII_Customer or PII_Colleague as appropriate in MUNIN. It is advised that the Data Classification mentioned in the mapping specifications should be in sync with the Data Classifiation provided in MUNIN. |
| PII-2 | Access to PII columns will be restricted. | Access to PII columns in database objects will be restricted using appropriate role permissions using Barbossa. The Barbossa roles should be created/configured with appropriate policies to provide access to database objects containing PII data. |
| PII-3 | Natural Keys are Clear text PII. | Hash the PII attribute and store under NK column. Store the Clear Text PII as an additional attribute in the HUB. The attribute need not have a prefix 'PII'. For example : Current state: For NECTAR_COLLECTOR_CARD_HUB - NECTAR_COLLECTOR_CARD_NK1, PII_NECTAR_COLLECTOR_CARD_NUM Future state : NECTAR_COLLECTOR_CARD_NK1, NECTAR_COLLECTOR_CARD_NUM - PII prefix can be removed as there is no need to explicitly include the prefix as classification. |
| PII-4 | Natural Keys are non-PII but some of the Attributes are PII | PII clear text value can be stored in suitable RDV objects - SAT/LINK. Column names has to be stored without 'PII' Prefix. |