Staging Table Structure
urn:js:virtue:aspire:pattern:.
TL;DR
These are patterns for staging table structures in the ASPIRE Data Warehouse.
Instructions
The following are patterns for staging tables in the ASPIRe Data Warehouse based on whether they are populated from semi-structured or structured data sources.
Semi-Structured
For semi-structured data the table definition should be:
| Column Name | Type | Content |
|---|---|---|
| MESSAGE | VARIANT | S3 object record payload (JSON, Parquet, AVRO, ORC) |
| FILE_NAME | VARCHAR(2000) | S3 object name (metadata@filename if PIPE) |
| ROW_FILE_NUMBER | NUMBER | S3 object record number (metadata@row_file_number if PIPE) |
| LOAD_TS | TIMESTAMP_NTZ(9) | Timestamp of data load into staging table |
| BATCH_ID | VARCHAR(2000) | Used to uniquely identify the specific transform execution responsible for the loading the record |
| JOB_NAME | VARCHAR(2000) | Used to identify the source and target of the ETL transformation |
| RECORD_DELETED_FLAG | VARCHAR(1) | Used to identify records deleted at Source / records deleted within data warehouse due to erroneous processing |
| RECORD_ID | VARCHAR(2000) | Used to uniquely identify each record through the Aspire estate |
| SOURCE_PATH | VARCHAR(2000) | Used to identify the full file location of a given source file |
| SOURCE_SYSTEM_CD | VARCHAR(2000) | Used to uniquely identify the source system feed the data pipeline has ingested from |
| ETL_FRAMEWORK | VARCHAR(2000) | Used to identify the tool & server responsible for executing the loading/movement/transformation |
Structured
For structured data the table should contain a column for each data item in the feed but include the following as standard:
| Column Name | Type | Content |
|---|---|---|
| FILE_NAME | VARCHAR(2000) | S3 object name (metadata@filename if PIPE) |
| ROW_FILE_NUMBER | NUMBER | S3 object record number (metadata@row_file_number if PIPE) |
| LOAD_TS | TIMESTAMP_NTZ(9) | Timestamp of data load into staging table |
| BATCH_ID | VARCHAR(2000) | Used to uniquely identify the specific transform execution responsible for the loading the record |
| JOB_NAME | VARCHAR(2000) | Used to identify the source and target of the ETL transformation |
| RECORD_DELETED_FLAG | VARCHAR(1) | Used to identify records deleted at Source / records deleted within data warehouse due to erroneous processing |
| RECORD_ID | VARCHAR(2000) | Used to uniquely identify each record through the Aspire estate |
| SOURCE_PATH | VARCHAR(2000) | Used to identify the full file location of a given source file |
| SOURCE_SYSTEM_CD | VARCHAR(2000) | Used to uniquely identify the source system feed the data pipeline has ingested from |
| ETL_FRAMEWORK | VARCHAR(2000) | Used to identify the tool & server responsible for executing the loading/movement/transformation |