Staging Table Structure

urn:js:virtue:aspire:pattern:.

TL;DR

These are patterns for staging table structures in the ASPIRE Data Warehouse.

Instructions

The following are patterns for staging tables in the ASPIRe Data Warehouse based on whether they are populated from semi-structured or structured data sources.

Semi-Structured

For semi-structured data the table definition should be:

Column Name	Type	Content
MESSAGE	VARIANT	S3 object record payload (JSON, Parquet, AVRO, ORC)
FILE_NAME	VARCHAR(2000)	S3 object name (metadata@filename if PIPE)
ROW_FILE_NUMBER	NUMBER	S3 object record number (metadata@row_file_number if PIPE)
LOAD_TS	TIMESTAMP_NTZ(9)	Timestamp of data load into staging table
BATCH_ID	VARCHAR(2000)	Used to uniquely identify the specific transform execution responsible for the loading the record
JOB_NAME	VARCHAR(2000)	Used to identify the source and target of the ETL transformation
RECORD_DELETED_FLAG	VARCHAR(1)	Used to identify records deleted at Source / records deleted within data warehouse due to erroneous processing
RECORD_ID	VARCHAR(2000)	Used to uniquely identify each record through the Aspire estate
SOURCE_PATH	VARCHAR(2000)	Used to identify the full file location of a given source file
SOURCE_SYSTEM_CD	VARCHAR(2000)	Used to uniquely identify the source system feed the data pipeline has ingested from
ETL_FRAMEWORK	VARCHAR(2000)	Used to identify the tool & server responsible for executing the loading/movement/transformation

Structured

For structured data the table should contain a column for each data item in the feed but include the following as standard:

Column Name	Type	Content
FILE_NAME	VARCHAR(2000)	S3 object name (metadata@filename if PIPE)
ROW_FILE_NUMBER	NUMBER	S3 object record number (metadata@row_file_number if PIPE)
LOAD_TS	TIMESTAMP_NTZ(9)	Timestamp of data load into staging table
BATCH_ID	VARCHAR(2000)	Used to uniquely identify the specific transform execution responsible for the loading the record
JOB_NAME	VARCHAR(2000)	Used to identify the source and target of the ETL transformation
RECORD_DELETED_FLAG	VARCHAR(1)	Used to identify records deleted at Source / records deleted within data warehouse due to erroneous processing
RECORD_ID	VARCHAR(2000)	Used to uniquely identify each record through the Aspire estate
SOURCE_PATH	VARCHAR(2000)	Used to identify the full file location of a given source file
SOURCE_SYSTEM_CD	VARCHAR(2000)	Used to uniquely identify the source system feed the data pipeline has ingested from
ETL_FRAMEWORK	VARCHAR(2000)	Used to identify the tool & server responsible for executing the loading/movement/transformation