Link Search Menu Expand Document

Staging Table Structure

urn:js:virtue:aspire:pattern:.

TL;DR

These are patterns for staging table structures in the ASPIRE Data Warehouse.

Instructions

The following are patterns for staging tables in the ASPIRe Data Warehouse based on whether they are populated from semi-structured or structured data sources.

Semi-Structured

For semi-structured data the table definition should be:

Column Name Type Content
MESSAGE VARIANT S3 object record payload (JSON, Parquet, AVRO, ORC)
FILE_NAME VARCHAR(2000) S3 object name (metadata@filename if PIPE)
ROW_FILE_NUMBER NUMBER S3 object record number (metadata@row_file_number if PIPE)
LOAD_TS TIMESTAMP_NTZ(9) Timestamp of data load into staging table
BATCH_ID VARCHAR(2000) Used to uniquely identify the specific transform execution responsible for the loading the record
JOB_NAME VARCHAR(2000) Used to identify the source and target of the ETL transformation
RECORD_DELETED_FLAG VARCHAR(1) Used to identify records deleted at Source / records deleted within data warehouse due to erroneous processing
RECORD_ID VARCHAR(2000) Used to uniquely identify each record through the Aspire estate
SOURCE_PATH VARCHAR(2000) Used to identify the full file location of a given source file
SOURCE_SYSTEM_CD VARCHAR(2000) Used to uniquely identify the source system feed the data pipeline has ingested from
ETL_FRAMEWORK VARCHAR(2000) Used to identify the tool & server responsible for executing the loading/movement/transformation

Structured

For structured data the table should contain a column for each data item in the feed but include the following as standard:

Column Name Type Content
FILE_NAME VARCHAR(2000) S3 object name (metadata@filename if PIPE)
ROW_FILE_NUMBER NUMBER S3 object record number (metadata@row_file_number if PIPE)
LOAD_TS TIMESTAMP_NTZ(9) Timestamp of data load into staging table
BATCH_ID VARCHAR(2000) Used to uniquely identify the specific transform execution responsible for the loading the record
JOB_NAME VARCHAR(2000) Used to identify the source and target of the ETL transformation
RECORD_DELETED_FLAG VARCHAR(1) Used to identify records deleted at Source / records deleted within data warehouse due to erroneous processing
RECORD_ID VARCHAR(2000) Used to uniquely identify each record through the Aspire estate
SOURCE_PATH VARCHAR(2000) Used to identify the full file location of a given source file
SOURCE_SYSTEM_CD VARCHAR(2000) Used to uniquely identify the source system feed the data pipeline has ingested from
ETL_FRAMEWORK VARCHAR(2000) Used to identify the tool & server responsible for executing the loading/movement/transformation