Technical Column Standards
urn:js:virtue:aspire:standard:22.1
TL;DR
This document describes the “why, what and how” of Technical Metadata in ADW.
Definition
Technical Metadata helps make sense of the data itself and enables tracking of non-functional requirements.
Technical Metadata columns should be added to the end of each physicalised object in the Physical Data Model. The transformation tool is responsible for writing them.
Technical Metadata is applicable to every physicalised object in ADW, regardless of whether it is loaded via batch or streaming.
It is not applicable to non-physicalised objects such as views.
Requirements
There are three top-level requirements that the Technical Metadata aims to fulfil.
1. Data Lineage
It’s important to know the provenance of the data as it moves through each layer of Aspire, so that the data is easy to understand and use.
| Requirement | Description | Metadata Columns |
|---|---|---|
| Uniquely identify the record within the Aspire ecosystem | An identifier retained through all transformation processes (until it is obscured by aggregation and/or joining with other data). | RECORD_ID |
| Identify where the data was generated | A reference to system where each row of data originated. | SOURCE_SYSTEM_CD |
| Identify when the data was generated | The date at which the record was created/modified in the data store in which the data resides. | LOAD_TS |
| Identify the data which precedes this | Link each Aspire object to the objects which immediately precede it. This may be a table, file, stream, service or any other component. | SOURCE_PATH, JOB_NAME |
| Identify relationships in the data | Where a relationship between two data entities has been created then the context for this should be provided in the data to allow consumers to understand and choose which records they require. For example if a link table has Argos and JS data, then the relationship code may indicate which brand the link records were created for. | REASON_CD |
2. Data Transformation
Track the rules applied to attributes for data quality or interpretation purposes.
| Requirement Name | Description | Associated Columns |
|---|---|---|
| Identify the data processing technology used | Show the technical capability used to move or transform the data at the macro level such as SQLDBM, Glue, or Hera. This is used for root cause analysis. | ETL_FRAMEWORK |
| Uniquely identify the processing run | A job run ID which can be traced back through the system for fault resolution and to fix any data corruption. | BATCH_ID |
| Identify records deleted within Data Vault | Capture which rows have been logically removed from the data vault for fault resolution and downstream consumption decisions or refresh. | RECORD_DELETED_FLAG |
3. Data Consumability
| Requirement Name | Description | Associated Columns |
|---|---|---|
| Identify when the data arrived in Aspire | Not all source systems contain business event timestamps, so as a second-best, capture when the file was generated/landed in Aspire. | RECORD_ARRIVAL_TS |
| Easily identify changing data | Simplify change detection by hashing all non-key columns (key columns do not change), concatenated as ^-separated strings.* | HASH_DIFF |
*Not to be confused with a hash key (the hash of all primary keys), which is not required. Also note that change detection can be done in SQL, which may not require creating an additional column.
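The SQL-only route can be illustrated with a minimal sketch; the table and column names (stg_customer for incoming data, sat_customer for the current target, natural key a, non-key columns d and e) are hypothetical:

```sql
-- Minimal sketch: change detection in plain SQL, without a HASH_DIFF column.
-- Compare each incoming row against the latest row for the same natural key.
WITH current_sat AS (
    SELECT *
    FROM sat_customer
    QUALIFY ROW_NUMBER() OVER (PARTITION BY a ORDER BY load_ts DESC) = 1
)
SELECT s.a, s.d, s.e
FROM stg_customer s
LEFT JOIN current_sat t ON t.a = s.a
WHERE t.a IS NULL                      -- new natural key
   OR NVL(t.d, '') <> NVL(s.d, '')     -- changed non-key value
   OR NVL(t.e, '') <> NVL(s.e, '');
```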
Applicability
| Layer/Schema | STAGE | RDV | RDV | RDV | RDV | RDV | BDV | BDV | BDV | BDV | PL | PL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Column Name/Table Type |  | HUB | SAT | LINK | TXN LINK | REF | BR-Tx | BR-Ref | PIT | REP | DIM | FACT |
| BATCH_ID*# | Y | Y | Y | Y | Y | Y |  |  |  |  |  |  |
| RECORD_ARRIVAL_TS | Y | Y | Y | Y | Y | Y | Y | Y | Y |  |  |  |
| HASH_DIFF | Y | Y | Y | Y |  |  |  |  |  |  |  |  |
| JOB_NAME | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y** | Y** |
| LOAD_TS | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y** | Y** |
| REASON_CD |  |  |  | Y | Y |  | Y*+ | Y*+ |  |  |  |  |
| RECORD_DELETED_FLAG | Y | Y | Y | Y | Y | Y |  |  |  |  |  |  |
| RECORD_ID | Y | Y | Y | Y | Y | Y |  |  |  |  |  |  |
| SOURCE_PATH*# | Y | Y | Y | Y | Y | Y | Y*+ | Y*+ | Y*+ | Y*+ |  |  |
| SOURCE_SYSTEM_CD | Y | Y | Y | Y | Y | Y | Y*+ | Y*+ | Y*+ | Y*+ |  |  |
| VALID_TO_TS |  |  |  |  |  |  |  |  |  |  | Y |  |
| VALID_FROM_TS |  |  |  |  |  |  |  |  |  |  | Y |  |
| ETL_FRAMEWORK | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |  |
Key:
- ** If the table is physicalised
- *# Only applicable for batch processing
- *+ Not applicable for domain-level consolidation or aggregations; applicable only for record-level consolidation
Implementation
ADW_STAGE
Batch use cases
Technical columns must remain as relational columns in staging; the requirement to persist technical columns in a single variant column does not apply to this layer.
Embedding the technical metadata into a single variant column via COPY INTO is also a challenge, so as part of HERA's design discussions it was agreed that HERA will populate technical metadata into relational columns.
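For illustration only, a staging table under this standard might be shaped as follows; the table and business columns are hypothetical, with the technical columns kept relational and placed at the end of the object as required:

```sql
-- Hypothetical staging table: business payload first, technical columns last.
CREATE TABLE ADW_STAGE.CUSTOMER (
    CUSTOMER_NO          varchar,
    CUSTOMER_NAME        varchar,
    BUSINESS_EVENT_TS    timestamp_ntz,  -- source change timestamp (data attribute, not technical metadata)
    RECORD_ID            varchar,        -- UUID_STRING()
    BATCH_ID             varchar,        -- UUID_STRING(<namespace uuid>, source path)
    SOURCE_SYSTEM_CD     varchar,
    SOURCE_PATH          varchar,        -- METADATA$FILENAME
    JOB_NAME             varchar,
    ETL_FRAMEWORK        varchar,
    RECORD_ARRIVAL_TS    timestamp_ntz,
    RECORD_DELETED_FLAG  varchar,
    LOAD_TS              timestamp_ntz   -- CURRENT_TIMESTAMP() at load
);
```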
Streaming use cases
The technical metadata single-variant approach is not applicable here either. However, depending on the technical capability of the data loading tool, there are two options:
1) Load as relational columns.
2) Load as part of the RECORD_CONTENT variant column.
These options are currently under discussion and once the solution is finalised this section will be updated.
DV layers
Technical Metadata is persisted into a single JSON (variant) type column named TECHNICAL_METADATA.
Having a single column rather than multiple columns keeps the data structures and data models clear.
Exceptions
In certain cases, the Technical Metadata should be kept in separate table columns instead of being nested as described above.
These cases are described below.
- LOAD_TS: if used as part of a compound key (historicised targets, e.g. SAT tables).
- SOURCE_SYSTEM_CD: if used as part of a compound key (RDV targets).
- REASON_CD: if used as part of a compound key (RDV targets).
- VALID_FROM_TS: not a true technical field; it is used for business purposes.
- VALID_TO_TS: as above.
- RECORD_ARRIVAL_TS: if used as part of a compound key.
- Any technical columns on which the table needs to be clustered for performance reasons.
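Putting this together, a minimal sketch of a DV-layer load might look as follows; the satellite, its compound key and the source view are hypothetical. LOAD_TS and SOURCE_SYSTEM_CD stay relational because they form part of the compound key, while the remaining technical metadata is nested:

```sql
-- Minimal sketch (hypothetical names): compound-key columns stay relational,
-- all other technical metadata is nested into the TECHNICAL_METADATA variant.
INSERT INTO RDV.CUSTOMER_SAT
    (CUSTOMER_HK, LOAD_TS, SOURCE_SYSTEM_CD, CUSTOMER_NAME, TECHNICAL_METADATA)
SELECT
    v.CUSTOMER_HK,
    v.LOAD_TS,                 -- part of the compound key, kept relational
    v.SOURCE_SYSTEM_CD,        -- part of the compound key, kept relational
    v.CUSTOMER_NAME,
    OBJECT_CONSTRUCT(
        'RECORD_ID',           v.RECORD_ID,
        'BATCH_ID',            v.BATCH_ID,
        'SOURCE_PATH',         v.SOURCE_PATH,
        'JOB_NAME',            'staging_customer_to_RDV_customer_sat',
        'ETL_FRAMEWORK',       'DBT',
        'RECORD_ARRIVAL_TS',   v.RECORD_ARRIVAL_TS,
        'RECORD_DELETED_FLAG', v.RECORD_DELETED_FLAG
    )                          AS TECHNICAL_METADATA
FROM TRANSFORM.CUSTOMER_SAT_VW v;

-- Individual fields can be read back with the variant path syntax:
SELECT TECHNICAL_METADATA:BATCH_ID::varchar AS BATCH_ID FROM RDV.CUSTOMER_SAT;
```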
Metadata definitions
Mandatory fields
| Column | Data Type | Duplicated at Batch Level? | Purpose | Specification | Example | Is Cascaded | HERA |
|---|---|---|---|---|---|---|---|
| BATCH_ID | varchar | Y | Uniquely identifies the specific transform execution responsible for loading the record | The ETL framework must record a BATCH_ID in its logs to enable traceability to the data. For data sets ingested through streams, BATCH_ID is a single ID created in the ETL framework for the set of records consumed from the stream at a point in time. For this reason, BATCH_ID is not required for streams unless microbatched. RECORD_ID captures the UUID at the record level for both streaming and batched data. | daf7f295-7541-4f67-afc5-871976e28f3f | Yes, created at staging and cascaded to other layers. | BATCH_IDs configured in Hera are calculated using UUID_STRING(uuid, name), where name is set to the source path to generate a unique batch ID for every batch. |
| RECORD_ARRIVAL_TS | datetime | N | Indicates when a record is effective/received within Aspire | Populated by the ETL framework and used to measure the effectivity of the record within Aspire. If a BUSINESS_EVENT_TS is provided by source, then BUSINESS_EVENT_TS (not this column) should be used to derive VALID_FROM_TS/VALID_TO_TS in the BDV/PL layer. Note that RECORD_ARRIVAL_TS was originally called EFFECTIVE_FROM_TS, but has been renamed to more clearly denote that this field captures the arrival timestamp of the record into Aspire. The timestamp at which the source entity changed should be captured under the data attribute BUSINESS_EVENT_TS. For S3-based processes, RECORD_ARRIVAL_TS is derived from the SOURCE_PATH using the following code: to_date(split_part(metadata$filename, '/', 3), 'YYYYMMDD')::timestamp_ntz AS RECORD_ARRIVAL_TS. For streaming loads that skip S3, RECORD_ARRIVAL_TS is the same as the staging LOAD_TS. In the worst case, if this value cannot be extracted from the source path, then RECORD_ARRIVAL_TS is the LOAD_TS of that record in staging. Note that the above code has to be customised per use case, as it depends on the source path. | 2019-01-20 00:00:00 | Yes, existing pipelines cascade from staging. | NA, as the pipelines have to derive this in transform schema views from the source path. |
| ETL_FRAMEWORK | varchar | Y | Identifies the tool and server responsible for executing the loading/movement/transformation | This enables the correct system/service to be queried with the BATCH_ID to retrieve execution logs (if any). | DBT, SnowSQL or Glue, appended with the name of the server/instance | Yes, cascaded from staging. | Hera - Snowpipe |
| JOB_NAME | varchar | Y | Identifies the source and target of the ETL transformation | Description of the transformation taking place. Naming format: {source layer}_{source location}_to_{target layer}_{target location}. Each transformation must include the Aspire Source → Target layer definition (e.g. Raw → Staging, Staging → RDV, RDV → BDV, RDV → PL, BDV → PL, PL → Information). Each transformation must include the S3 bucket name or table names affected. In cases of fixing records for data issues (updates to RECORD_DELETED_FLAG/DQ fixes), populate this field with ‘ERROR «JIRA/TICKET_REFERENCES»’. References could be from Jira/ServiceNow. | staging_GOLAPP_DIGINT_to_RDV_link_rpnapp_replenishment_tran_link | No, specific to the job loading the data in the layer. | Snowpipe name |
| LOAD_TS | datetime | Y | Identifies when this record was written into the current table | Used as part of the PK for all historised targets (Links, Satellites, Reference etc.). It is a non-PK column for Hubs and Transactional Links. | 2019-01-20 07:30:10.986 | No, specific to the job loading the data in the layer. | CURRENT_TIMESTAMP() |
| RECORD_DELETED_FLAG | varchar | N | Identifies records deleted (at source or within the data warehouse) due to erroneous processing | Contains Y or N. Present in all Satellites, Non-Transactional Links and Bridge Tables (Reference Data); used to flag that a Natural Key is no longer being supplied by the source system. This helps to identify the records that are deleted at the source system. The field is also used to soft delete erroneous data that has been loaded into DV and consumed for reporting. Refer to the RDV immutability standards for example scenarios and how to implement the record deleted flag. Note that this field is updated on an existing record in cases of processing/human errors, whereas a new record with the updated flag has to be inserted when a source record deletion has to be captured. | N/Y | No; defaults to N. Downstream processing has to ignore records that are deleted. | N for all new records |
| SOURCE_PATH | varchar | Y | Identifies the full file location of a given source file | This field captures the full AWS S3 file path using the METADATA$FILENAME construct. | s3path/YYYYMMDD/filename | Yes, cascaded from staging. | Metadata$Filename |
| SOURCE_SYSTEM_CD | varchar | Y | Uniquely identifies the source system feed the data pipeline has ingested from | Source system code referencing the Source System feed, which in turn connects to the source system (captured in the metadata tables as OPERATIONAL_SYSTEM_CD) from which this data was retrieved. The Data Architects maintain the master source system codes in METADATA_PL.SOURCE_SYSTEM; more details here. |  | Yes, cascaded from staging. | Value from HERA config file |
| RECORD_ID | varchar | N | Uniquely identifies each record through the Aspire estate | A globally unique identifier (GUID) used to track records throughout the Aspire estate. This GUID should remain constant throughout Staging & RDV, such that a record can be tracked through all of these layers. A record should have a GUID added at the earliest point of contact with the Aspire ecosystem. It is generated in the Staging layer using the UUID_STRING() function. | daf7f295-7541-4f67-afc5-871976e28f3f | Yes, cascaded from staging. | UUID_STRING() |
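As a hedged illustration of how these mandatory columns could be populated when loading the hypothetical staging table sketched earlier from an S3 stage (the stage, file format, namespace UUID and literal values are all assumptions; actual Hera behaviour may differ):

```sql
-- Minimal sketch: deriving the mandatory technical columns while loading
-- staging from a JSON file on an S3 external stage. All names hypothetical.
INSERT INTO ADW_STAGE.CUSTOMER
SELECT
    $1:customer_no::varchar                  AS CUSTOMER_NO,
    $1:customer_name::varchar                AS CUSTOMER_NAME,
    $1:created_modified_date::timestamp_ntz  AS BUSINESS_EVENT_TS,
    UUID_STRING()                            AS RECORD_ID,          -- unique per record
    UUID_STRING('6ba7b810-9dad-11d1-80b4-00c04fd430c8',
                METADATA$FILENAME)           AS BATCH_ID,           -- v5 UUID keyed on the source path
    'GOL'                                    AS SOURCE_SYSTEM_CD,   -- from pipeline config
    METADATA$FILENAME                        AS SOURCE_PATH,
    'raw_customer_to_staging_customer'       AS JOB_NAME,
    'Snowpipe'                               AS ETL_FRAMEWORK,
    TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 3),
            'YYYYMMDD')::timestamp_ntz       AS RECORD_ARRIVAL_TS,  -- from the dated path segment
    'N'                                      AS RECORD_DELETED_FLAG,
    CURRENT_TIMESTAMP()                      AS LOAD_TS
FROM @ADW_STAGE.CUSTOMER_EXT_STAGE (FILE_FORMAT => 'JSON_FMT');
```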
Optional fields
| Column | Data Type | Duplicated at Batch Level? | Purpose | Specification | Example | Is Cascaded | HERA |
|---|---|---|---|---|---|---|---|
| VALID_FROM_TS | datetime | N | Earliest date time from which the record is applicable | Used only for Type 2 SCD tables where effective from & to dates are required. Although this is not a typical technical column, it is included as part of the dimension template. | 2019-01-20 07:30:10.986 | No, applicable only for PL. | NA |
| VALID_TO_TS | datetime | N | Latest date time to which the record is applicable | See above. NB: this field will be persisted as a physical column, separate from the one containing all other technical columns. | 2019-01-20 07:30:10.986 | No, applicable only for PL. | NA |
| REASON_CD | varchar | Y | 6 characters to uniquely identify the reason why a link record was created | Used in RDV Link tables to identify how the Hubs are joined together. The same Hubs can be linked together for multiple reasons, and these reasons are captured in REASON_CD. A REASON_CD can be shared between records in many Link tables if the same business event caused the data to be created. The Data Architects maintain the master reason codes in METADATA_PL.TABLE_REASON; more details here. | STRORG | Yes, cascaded from RDV to BDV layers for record-level consolidation. | NA |
| HASH_DIFF | varchar | N | Used for change data capture (CDC) within the database table | This is not mandatory in Staging, especially if the increments to the downstream layer are calculated using streams; in such cases, HASH_DIFF can be calculated in the transform schema views between Staging and RDV to perform the CDC. NB: technical columns plus natural key columns are excluded from the hash, and a separator (^) is required. For example, if a table has 3 natural key columns (a, b, c) plus 2 further columns (d, e) plus 5 technical columns (v, w, x, y, HASH_DIFF), the Snowflake SQL would be: HASH_DIFF = MD5(NVL(d, '') \|\| '^' \|\| NVL(e, '')). Where teams have to process datasets that get split into multiple SATs, each requiring a different HASH_DIFF to track changes in the respective table, it is advisable to calculate this value in the transform schema views. Teams can choose to do CDC through use-case-appropriate mechanisms (SQL / MD5). | B7CB15D659039B8649E3E86D5C26E74B | No, unless there is a one-to-one mapping of records between layers. | NA |
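Expanding the worked example from the HASH_DIFF row above into runnable Snowflake SQL (assuming d and e are varchar columns; other data types would need casting, and my_source_table is a hypothetical name):

```sql
-- HASH_DIFF for the example above: natural keys (a, b, c) and technical
-- columns are excluded; non-key columns d and e are concatenated with the
-- ^ separator and hashed.
SELECT
    a, b, c, d, e,
    MD5(NVL(d, '') || '^' || NVL(e, '')) AS HASH_DIFF
FROM my_source_table;
```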
BUSINESS_EVENT_TS
A source data attribute that captures the source entity's change timestamp. This is not a technical metadata attribute. Any new source feed must be mandated to include BUSINESS_EVENT_TS as part of the data feed.
This is not the record arrival timestamp, nor the timestamp at which the data became effective within the database; it is the actual timestamp of the change at source. If we receive a created or updated timestamp from source as part of the dataset, it needs to be stored under the name BUSINESS_EVENT_TS.
There are scenarios where the record at source is amended at a time that doesn’t define the effectivity of the record. For example, a record of a product’s future promotion:
{
"record_id": "001",
"created_modified_date": "2021-02-02 09:00",
"promotion_name": "Doritos - buy one, get one free"
"promotion_effective_date": "2030-01-01 00:00"
}
Both created_modified_date and promotion_effective_date are liable to change if the promotion’s start is rescheduled.
The correct field to create a Type 2 SCD on is created_modified_date.
The SCD should not be created on promotion_effective_date as fluctuations in the event effectivity date (e.g. the promotion launch is brought forward) would result in the dimension having the wrong chronological order.
Existing tables not meeting the standard
The cost of adapting all current tables to comply with the new standard, plus adapting the ETL to populate the new columns, far outweighs the benefit of having only one approach to technical columns.
Therefore, there is no expectation that all existing tables will be updated to meet this new standard. It is only mandatory for:
- New tables, and
- Existing tables which require a missing technical column.
That way there are only two kinds of tables: those following the old technical column standard and those following the new one.