Technical Column Standards
urn:js:virtue:aspire:standard:22.1
TL;DR
This document describes the “why, what and how” of Technical Metadata in ADW.
Definition
Technical Metadata helps make sense of the data itself and enables tracking of non-functional requirements.
Technical Metadata columns should be added to the end of each physicalised object in the Physical Data Model. The transformation tool is responsible for writing them.
Technical Metadata is applicable to every physicalised object in ADW, regardless of whether it is loaded via batch or streaming.
It is not applicable to non-physicalised objects such as views.
Requirements
There are three top-level requirements that the Technical Metadata aims to fulfil.
1. Data Lineage
It’s important to know the provenance of the data as it moves through each layer of Aspire, so that the data is easy to understand and use.
| Requirement | Description | Metadata Columns |
|---|---|---|
| Uniquely identify the record within the Aspire ecosystem | An identifier retained through all transformation processes (until it is obscured by aggregation and/or joining with other data). | RECORD_ID |
| Identify where the data was generated | A reference to system where each row of data originated. | SOURCE_SYSTEM_CD |
| Identify when the data was generated | The date at which the record was created/modified in the data store in which the data resides. | LOAD_TS |
| Identify the data which precedes this | Link each Aspire object to the objects which immediately precede it. This may be a table, file, stream, service or any other component. | SOURCE_PATH, JOB_NAME |
| Identify relationships in the data | Where a relationship between two data entities has been created then the context for this should be provided in the data to allow consumers to understand and choose which records they require. For example if a link table has Argos and JS data, then the relationship code may indicate which brand the link records were created for. | REASON_CD |
2. Data Transformation
Track the rules applied to attributes for data quality or interpretation purposes.
| Requirement Name | Description | Associated Columns |
|---|---|---|
| Identify the data processing technology used | Show the technical capability used to move or transform the data at the macro level such as SQLDBM, Glue, or Hera. This is used for root cause analysis. | ETL_FRAMEWORK |
| Uniquely identify the processing run | A job run ID which can be traced back through the system for fault resolution and to fix any data corruption. | BATCH_ID |
| Identify records deleted within Data Vault | Capture which rows have been logically removed from the data vault for fault resolution and downstream consumption decisions or refresh. | RECORD_DELETED_FLAG |
3. Data Consumability
| Requirement Name | Description | Associated Columns |
|---|---|---|
| Identify when the data arrived in Aspire | Not all source systems contain business event timestamps, so as a second-best, capture when the file was generated/landed in Aspire. | RECORD_ARRIVAL_TS |
| Easily identify changing data | Simplify change detection by hashing all non-key columns (key columns do not change), concatenated as ^-separated strings.* | HASH_DIFF |
*Not to be confused with a hash key (the hash of all primary keys), which is not required. Also note that change detection can be done in SQL, which may not require creating an additional column.
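The SQL-only route can be illustrated with a minimal sketch; the table and column names (stg_customer for incoming data, sat_customer for the current target, natural key a, non-key columns d and e) are hypothetical:

```sql
-- Minimal sketch: change detection in plain SQL, without a HASH_DIFF column.
-- Compare each incoming row against the latest row for the same natural key.
WITH current_sat AS (
    SELECT *
    FROM sat_customer
    QUALIFY ROW_NUMBER() OVER (PARTITION BY a ORDER BY load_ts DESC) = 1
)
SELECT s.a, s.d, s.e
FROM stg_customer s
LEFT JOIN current_sat t ON t.a = s.a
WHERE t.a IS NULL                      -- new natural key
   OR NVL(t.d, '') <> NVL(s.d, '')     -- changed non-key value
   OR NVL(t.e, '') <> NVL(s.e, '');
```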
Applicability
| Layer/Schema | STAGE | RDV | RDV | RDV | RDV | RDV | BDV | BDV | BDV | BDV | PL | PL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Column Name/Table Type |  | HUB | SAT | LINK | TXN LINK | REF | BR-Tx | BR-Ref | PIT | REP | DIM | FACT |
| BATCH_ID*# | Y | Y | Y | Y | Y | Y |  |  |  |  |  |  |
| RECORD_ARRIVAL_TS | Y | Y | Y | Y | Y | Y | Y | Y | Y |  |  |  |
| HASH_DIFF | Y | Y | Y | Y |  |  |  |  |  |  |  |  |
| JOB_NAME | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y** | Y** |
| LOAD_TS | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y** | Y** |
| REASON_CD |  |  |  | Y | Y |  | Y*+ | Y*+ |  |  |  |  |
| RECORD_DELETED_FLAG | Y | Y | Y | Y | Y | Y |  |  |  |  |  |  |
| RECORD_ID | Y | Y | Y | Y | Y | Y |  |  |  |  |  |  |
| SOURCE_PATH*# | Y | Y | Y | Y | Y | Y | Y*+ | Y*+ | Y*+ | Y*+ |  |  |
| SOURCE_SYSTEM_CD | Y | Y | Y | Y | Y | Y | Y*+ | Y*+ | Y*+ | Y*+ |  |  |
| VALID_TO_TS |  |  |  |  |  |  |  |  |  |  | Y |  |
| VALID_FROM_TS |  |  |  |  |  |  |  |  |  |  | Y |  |
| ETL_FRAMEWORK | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |  |
Key:
- ** If the table is physicalised
- *# Only applicable for batch processing
- *+ Not applicable for domain-level consolidation or aggregations; applicable only for record-level consolidation
Implementation
ADW_STAGE
Batch use cases
Technical columns must remain as relational columns in staging; the requirement to persist technical columns in a single variant column does not apply to this layer.
Embedding the technical metadata into a single variant column via COPY INTO is also a challenge, so as part of HERA's design discussions it was agreed that HERA will populate technical metadata into relational columns.
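For illustration only, a staging table under this standard might be shaped as follows; the table and business columns are hypothetical, with the technical columns kept relational and placed at the end of the object as required:

```sql
-- Hypothetical staging table: business payload first, technical columns last.
CREATE TABLE ADW_STAGE.CUSTOMER (
    CUSTOMER_NO          varchar,
    CUSTOMER_NAME        varchar,
    BUSINESS_EVENT_TS    timestamp_ntz,  -- source change timestamp (data attribute, not technical metadata)
    RECORD_ID            varchar,        -- UUID_STRING()
    BATCH_ID             varchar,        -- UUID_STRING(<namespace uuid>, source path)
    SOURCE_SYSTEM_CD     varchar,
    SOURCE_PATH          varchar,        -- METADATA$FILENAME
    JOB_NAME             varchar,
    ETL_FRAMEWORK        varchar,
    RECORD_ARRIVAL_TS    timestamp_ntz,
    RECORD_DELETED_FLAG  varchar,
    LOAD_TS              timestamp_ntz   -- CURRENT_TIMESTAMP() at load
);
```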
Streaming use cases
The technical metadata single-variant approach is not applicable here either. However, depending on the technical capability of the data loading tool, there are two options:
1) Load as relational columns.
2) Load as part of the RECORD_CONTENT variant column.
These options are currently under discussion and once the solution is finalised this section will be updated.
DV layers
Technical Metadata is persisted into a single JSON (variant) type column named TECHNICAL_METADATA.
Having a single column rather than multiple columns keeps the data structures and data models clear.
Exceptions
In certain cases, the Technical Metadata should be kept in separate table columns instead of being nested as described above.
These cases are described below.
- LOAD_TS: if used as part of a compound key (historicised targets, e.g. SAT tables).
- SOURCE_SYSTEM_CD: if used as part of a compound key (RDV targets).
- REASON_CD: if used as part of a compound key (RDV targets).
- VALID_FROM_TS: not a true technical field; it is used for business purposes.
- VALID_TO_TS: as above.
- RECORD_ARRIVAL_TS: if used as part of a compound key.
- Any technical columns on which the table needs to be clustered for performance reasons.
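Putting this together, a minimal sketch of a DV-layer load might look as follows; the satellite, its compound key and the source view are hypothetical. LOAD_TS and SOURCE_SYSTEM_CD stay relational because they form part of the compound key, while the remaining technical metadata is nested:

```sql
-- Minimal sketch (hypothetical names): compound-key columns stay relational,
-- all other technical metadata is nested into the TECHNICAL_METADATA variant.
INSERT INTO RDV.CUSTOMER_SAT
    (CUSTOMER_HK, LOAD_TS, SOURCE_SYSTEM_CD, CUSTOMER_NAME, TECHNICAL_METADATA)
SELECT
    v.CUSTOMER_HK,
    v.LOAD_TS,                 -- part of the compound key, kept relational
    v.SOURCE_SYSTEM_CD,        -- part of the compound key, kept relational
    v.CUSTOMER_NAME,
    OBJECT_CONSTRUCT(
        'RECORD_ID',           v.RECORD_ID,
        'BATCH_ID',            v.BATCH_ID,
        'SOURCE_PATH',         v.SOURCE_PATH,
        'JOB_NAME',            'staging_customer_to_RDV_customer_sat',
        'ETL_FRAMEWORK',       'DBT',
        'RECORD_ARRIVAL_TS',   v.RECORD_ARRIVAL_TS,
        'RECORD_DELETED_FLAG', v.RECORD_DELETED_FLAG
    )                          AS TECHNICAL_METADATA
FROM TRANSFORM.CUSTOMER_SAT_VW v;

-- Individual fields can be read back with the variant path syntax:
SELECT TECHNICAL_METADATA:BATCH_ID::varchar AS BATCH_ID FROM RDV.CUSTOMER_SAT;
```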
Metadata definitions
Mandatory fields
| Column | Data Type | Duplicated at Batch Level? | Purpose | Specification | Example | Is Cascaded | HERA |
|---|---|---|---|---|---|---|---|
| BATCH_ID | varchar | Y | Uniquely identifies the specific transform execution responsible for loading the record | The ETL framework must record a BATCH_ID in its logs to enable traceability to the data. For data sets ingested through streams, BATCH_ID is a single ID created in the ETL framework for the set of records consumed from the stream at a point in time. For this reason, BATCH_ID is not required for streams unless microbatched. RECORD_ID captures the UUID at the record level for both streaming and batched data. | daf7f295-7541-4f67-afc5-871976e28f3f | Yes, created at staging and cascaded to other layers. | BATCH_IDs configured in Hera are calculated using UUID_STRING(uuid, name), where name is set to the source path to generate a unique batch ID for every batch. |
| RECORD_ARRIVAL_TS | datetime | N | Indicates when a record is effective/received within Aspire | Populated by the ETL framework and used to measure the effectivity of the record within Aspire. If a BUSINESS_EVENT_TS is provided by source, then BUSINESS_EVENT_TS (not this column) should be used to derive VALID_FROM_TS/VALID_TO_TS in the BDV/PL layer. Note that RECORD_ARRIVAL_TS was originally called EFFECTIVE_FROM_TS, but has been renamed to more clearly denote that this field captures the arrival timestamp of the record into Aspire. The timestamp at which the source entity changed should be captured under the data attribute BUSINESS_EVENT_TS. For S3-based processes, RECORD_ARRIVAL_TS is derived from the SOURCE_PATH using the following code: to_date(split_part(metadata$filename, '/', 3), 'YYYYMMDD')::timestamp_ntz AS RECORD_ARRIVAL_TS. For streaming loads that skip S3, RECORD_ARRIVAL_TS is the same as the staging LOAD_TS. In the worst case, if this value cannot be extracted from the source path, then RECORD_ARRIVAL_TS is the LOAD_TS of that record in staging. Note that the above code has to be customised per use case, as it depends on the source path. | 2019-01-20 00:00:00 | Yes, existing pipelines cascade from staging. | NA, as the pipelines have to derive this in transform schema views from the source path. |
| ETL_FRAMEWORK | varchar | Y | Identifies the tool and server responsible for executing the loading/movement/transformation | This enables the correct system/service to be queried with the BATCH_ID to retrieve execution logs (if any). | DBT, SnowSQL or Glue, appended with the name of the server/instance | Yes, cascaded from staging. | Hera - Snowpipe |
| JOB_NAME | varchar | Y | Identifies the source and target of the ETL transformation | Description of the transformation taking place. Naming format: {source layer}_{source location}_to_{target layer}_{target location}. Each transformation must include the Aspire Source → Target layer definition (e.g. Raw → Staging, Staging → RDV, RDV → BDV, RDV → PL, BDV → PL, PL → Information). Each transformation must include the S3 bucket name or table names affected. In cases of fixing records for data issues (updates to RECORD_DELETED_FLAG/DQ fixes), populate this field with ‘ERROR «JIRA/TICKET_REFERENCES»’. References could be from Jira/ServiceNow. | staging_GOLAPP_DIGINT_to_RDV_link_rpnapp_replenishment_tran_link | No, specific to the job loading the data in the layer. | Snowpipe name |
| LOAD_TS | datetime | Y | Identifies when this record was written into the current table | Used as part of the PK for all historised targets (Links, Satellites, Reference etc.). It is a non-PK column for Hubs and Transactional Links. | 2019-01-20 07:30:10.986 | No, specific to the job loading the data in the layer. | CURRENT_TIMESTAMP() |
| RECORD_DELETED_FLAG | varchar | N | Identifies records deleted (at source or within the data warehouse) due to erroneous processing | Contains Y or N. Present in all Satellites, Non-Transactional Links and Bridge Tables (Reference Data); used to flag that a Natural Key is no longer being supplied by the source system. This helps to identify the records that are deleted at the source system. The field is also used to soft delete erroneous data that has been loaded into DV and consumed for reporting. Refer to the RDV immutability standards for example scenarios and how to implement the record deleted flag. Note that this field is updated on an existing record in cases of processing/human errors, whereas a new record with the updated flag has to be inserted when a source record deletion has to be captured. | N/Y | No; defaults to N. Downstream processing has to ignore records that are deleted. | N for all new records |
| SOURCE_PATH | varchar | Y | Identifies the full file location of a given source file | This field captures the full AWS S3 file path using the METADATA$FILENAME construct. | s3path/YYYYMMDD/filename | Yes, cascaded from staging. | Metadata$Filename |
| SOURCE_SYSTEM_CD | varchar | Y | Uniquely identifies the source system feed the data pipeline has ingested from | Source system code referencing the Source System feed, which in turn connects to the source system (captured in the metadata tables as OPERATIONAL_SYSTEM_CD) from which this data was retrieved. The Data Architects maintain the master source system codes in METADATA_PL.SOURCE_SYSTEM; more details here. |  | Yes, cascaded from staging. | Value from HERA config file |
| RECORD_ID | varchar | N | Uniquely identifies each record through the Aspire estate | A globally unique identifier (GUID) used to track records throughout the Aspire estate. This GUID should remain constant throughout Staging & RDV, such that a record can be tracked through all of these layers. A record should have a GUID added at the earliest point of contact with the Aspire ecosystem. It is generated in the Staging layer using the UUID_STRING() function. | daf7f295-7541-4f67-afc5-871976e28f3f | Yes, cascaded from staging. | UUID_STRING() |
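As a hedged illustration of how these mandatory columns could be populated when loading the hypothetical staging table sketched earlier from an S3 stage (the stage, file format, namespace UUID and literal values are all assumptions; actual Hera behaviour may differ):

```sql
-- Minimal sketch: deriving the mandatory technical columns while loading
-- staging from a JSON file on an S3 external stage. All names hypothetical.
INSERT INTO ADW_STAGE.CUSTOMER
SELECT
    $1:customer_no::varchar                  AS CUSTOMER_NO,
    $1:customer_name::varchar                AS CUSTOMER_NAME,
    $1:created_modified_date::timestamp_ntz  AS BUSINESS_EVENT_TS,
    UUID_STRING()                            AS RECORD_ID,          -- unique per record
    UUID_STRING('6ba7b810-9dad-11d1-80b4-00c04fd430c8',
                METADATA$FILENAME)           AS BATCH_ID,           -- v5 UUID keyed on the source path
    'GOL'                                    AS SOURCE_SYSTEM_CD,   -- from pipeline config
    METADATA$FILENAME                        AS SOURCE_PATH,
    'raw_customer_to_staging_customer'       AS JOB_NAME,
    'Snowpipe'                               AS ETL_FRAMEWORK,
    TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 3),
            'YYYYMMDD')::timestamp_ntz       AS RECORD_ARRIVAL_TS,  -- from the dated path segment
    'N'                                      AS RECORD_DELETED_FLAG,
    CURRENT_TIMESTAMP()                      AS LOAD_TS
FROM @ADW_STAGE.CUSTOMER_EXT_STAGE (FILE_FORMAT => 'JSON_FMT');
```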
Optional fields
| Column | Data Type | Duplicated at Batch Level? | Purpose | Specification | Example | Is Cascaded | HERA |
|---|---|---|---|---|---|---|---|
| VALID_FROM_TS | datetime | N | Earliest date time from which the record is applicable | Used only for Type 2 SCD tables where effective from & to dates are required. Although this is not a typical technical column, it is included as part of the dimension template. | 2019-01-20 07:30:10.986 | No, applicable only for PL. | NA |
| VALID_TO_TS | datetime | N | Latest date time to which the record is applicable | See above. NB: this field will be persisted as a physical column, separate from the one containing all other technical columns. | 2019-01-20 07:30:10.986 | No, applicable only for PL. | NA |
| REASON_CD | varchar | Y | 6 characters to uniquely identify the reason why a link record was created | Used in RDV Link tables to identify how the Hubs are joined together. The same Hubs can be linked together for multiple reasons, and these reasons are captured in REASON_CD. A REASON_CD can be shared between records in many Link tables if the same business event caused the data to be created. The Data Architects maintain the master reason codes in METADATA_PL.TABLE_REASON; more details here. | STRORG | Yes, cascaded from RDV to BDV layers for record-level consolidation. | NA |
| HASH_DIFF | varchar | N | Used for change data capture (CDC) within the database table | This is not mandatory in Staging, especially if the increments to the downstream layer are calculated using streams; in such cases, HASH_DIFF can be calculated in the transform schema views between Staging and RDV to perform the CDC. NB: technical columns plus natural key columns are excluded from the hash, and a separator (^) is required. For example, if a table has 3 natural key columns (a, b, c) plus 2 further columns (d, e) plus 5 technical columns (v, w, x, y, HASH_DIFF), the Snowflake SQL would be: HASH_DIFF = MD5(NVL(d, '') \|\| '^' \|\| NVL(e, '')). Where teams have to process datasets that get split into multiple SATs, each requiring a different HASH_DIFF to track changes in the respective table, it is advisable to calculate this value in the transform schema views. Teams can choose to do CDC through use-case-appropriate mechanisms (SQL / MD5). | B7CB15D659039B8649E3E86D5C26E74B | No, unless there is a one-to-one mapping of records between layers. | NA |
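Expanding the worked example from the HASH_DIFF row above into runnable Snowflake SQL (assuming d and e are varchar columns; other data types would need casting, and my_source_table is a hypothetical name):

```sql
-- HASH_DIFF for the example above: natural keys (a, b, c) and technical
-- columns are excluded; non-key columns d and e are concatenated with the
-- ^ separator and hashed.
SELECT
    a, b, c, d, e,
    MD5(NVL(d, '') || '^' || NVL(e, '')) AS HASH_DIFF
FROM my_source_table;
```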
BUSINESS_EVENT_TS
A source data attribute that captures the source entity's change timestamp. This is not a technical metadata attribute. Any new source feed must be mandated to include BUSINESS_EVENT_TS as part of the data feed.
This is not the record arrival timestamp, nor the timestamp at which the data became effective within the database; it is the actual timestamp of the change at source. If we receive a created or updated timestamp from source as part of the dataset, it needs to be stored under the name BUSINESS_EVENT_TS.
There are scenarios where the record at source is amended at a time that doesn’t define the effectivity of the record. For example, a record of a product’s future promotion:
{
"record_id": "001",
"created_modified_date": "2021-02-02 09:00",
"promotion_name": "Doritos - buy one, get one free"
"promotion_effective_date": "2030-01-01 00:00"
}
Both created_modified_date and promotion_effective_date are liable to change if the promotion’s start is rescheduled.
The correct field to create a Type 2 SCD on is created_modified_date.
The SCD should not be created on promotion_effective_date as fluctuations in the event effectivity date (e.g. the promotion launch is brought forward) would result in the dimension having the wrong chronological order.
Existing tables not meeting the standard
The cost of adapting all current tables to comply with the new standard, plus adapting the ETL to populate the new columns, far outweighs the benefit of having only one approach to technical columns.
Therefore, there is no expectation that all existing tables will be updated to meet this new standard. It is only mandatory for:
- New tables, and
- Existing tables which require a missing technical column.
That way there are only two kinds of tables: those following the old technical column standard and those following the new one.