Persist ADW Staging
urn:js:virtue:aspire:proposal:6.1
TL;DR
This proposal suggests persisting all raw data in ADW Staging schema tables following load from S3 Raw.
Rationale
This proposal has the following perceived benefits:
- We avoid the need to reload data from S3 Raw when new use-cases arise that require additional data to be loaded into the RDV. Having all data available in staging increases flexibility and decreases time to deliver.
- Provides an accessible source for raw data, either via direct staging table access or via unload to S3 (for curated-like consumption).
- Simplified historic reconciliation of RDV
- DQ on raw data values can be achieved using Snowbocrop against a single format, as opposed to Robocrop, which needs to cater for multiple formats.
- Opens up opportunities to simplify data retention activities and move S3 to be a retention compliant backup for Snowflake.
Impact on S3 landing
- S3 Raw for non-sensitive data remains unchanged - full history can be retained if required.
- S3 Raw for sensitive data becomes transient (circa 30 days lifespan), with retention-compliant history being copied back after each retention treatment run.
Impact on existing and in-flight build pipelines
The recommendation would be to build a new pipeline from raw -> staging to run alongside existing pipelines, and then plan a migration from old to new.
Steps could be:
- Create a new staging table
- Build the pipeline from Raw -> Staging (see proposed pattern below)
- Perform a historic load using a bulk Snowflake load
- Start incremental updates to keep the staging table up to date
- On a planned date, swap the old staging table out for the new one (an atomic table swap, as sketched below) and continue ETL from the new staging table
- Decommission existing pipeline load from curated -> staging and remove old table from staging
- Where there is no consumer use-case for curated, we can decommission the old pipeline entirely and remove the curated data and bucket from S3
- Where there is a consumer use-case for curated, the old pipeline can continue to run, or we can look at an unload from a PL object sitting over staging as a replacement, delivering to a new consumer layer bucket - consumers can then port across once ready so that curated can be decommissioned
Depending on the status of an in-flight build, we would either deliver it and then convert as above, or stop and build the new pattern from scratch. The choice will depend on delivery dates and priorities.
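To make the cutover step concrete, the following is a minimal sketch, assuming a hypothetical STAGING.CUSTOMER table and a hypothetical @RAW_DB.PUBLIC.S3_RAW_STAGE external stage (neither is defined by this proposal); it shows the historic bulk load followed by an atomic table swap on the planned date.

```sql
-- Hypothetical object names for illustration only.
-- Assumes STAGING.CUSTOMER_NEW has been created to the new standard
-- (see 'Staging Lake principles and standards' below).

-- Historic bulk load from the raw stage into the new staging table.
COPY INTO STAGING.CUSTOMER_NEW (RAW_PAYLOAD, SOURCE_FILE)
  FROM (SELECT $1, METADATA$FILENAME
        FROM @RAW_DB.PUBLIC.S3_RAW_STAGE/customer/)
  FILE_FORMAT = (TYPE = 'JSON');

-- On the planned cutover date: an atomic swap keeps downstream ETL pointing at
-- the same table name while the data behind it moves to the new pipeline.
ALTER TABLE STAGING.CUSTOMER SWAP WITH STAGING.CUSTOMER_NEW;

-- Once the old pipeline is decommissioned, the old table (now under the _NEW
-- name after the swap) can be dropped.
DROP TABLE STAGING.CUSTOMER_NEW;
```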
Future architecture view
The following diagram provides a view of the proposed ASPIRe architecture:
Key points:
- The S3 curated and information layers are removed in favour of a new ‘Consumer Layer’
- ADW Staging becomes a persisted, immutable area which contains full history of raw data
- Processing of data from S3 Raw -> S3 Curated is stopped in favour of use-case specific unloads from ADW into the consumer layer
- DQ, Lockbox hashing and Heimdallr all move to work on ADW exclusively
- Heimdallr treats data in the staging lake and unloads compliant data to S3 for backup purposes
- An alternative is to replicate the staging lake to GCP for added protection against single cloud platform failure - raw could then become transient
- All data presented to a consumer is via ADW Presentation Layer objects
- Unloads to the consumer layer from staging can be via views - this provides flexibility over what is unloaded, and masking will be applied as per the unloading role (see the sketch after this list)
- PL views over the staging lake can provide quick access to data for consumer analysis and reporting but would need to be tactical only
- Strategic delivery from the PL should be via the Data Vault layers
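A minimal sketch of the unload-via-view approach, assuming hypothetical PRESENTATION and STAGING schemas, an illustrative payload shape, and a hypothetical external stage pointing at the consumer layer bucket (none of these names are defined by this proposal):

```sql
-- Hypothetical names for illustration only.
CREATE OR REPLACE VIEW PRESENTATION.CUSTOMER_EXTRACT_V AS
  SELECT
      RAW_PAYLOAD:customer_id::VARCHAR AS CUSTOMER_ID,
      RAW_PAYLOAD:email::VARCHAR       AS EMAIL,  -- masking applies as per the unloading role
      LOAD_TIMESTAMP
  FROM STAGING.CUSTOMER;

-- Unload the view's result set to the consumer layer bucket.
COPY INTO @CONSUMER_DB.PUBLIC.CONSUMER_LAYER_STAGE/customer/
  FROM (SELECT * FROM PRESENTATION.CUSTOMER_EXTRACT_V)
  FILE_FORMAT = (TYPE = 'PARQUET');
```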
Staging load pattern
The following diagram provides a suggested pattern for loading data into the staging lake:

This pattern incorporates these recent decisions:
- Decouple curated
- Intelligent Tiering by default
- S3 Object Ownership
Using Snowpipe we have the option of automated or manual ingestion runs, although automated would be recommended to simplify and accelerate delivery to staging.
It is accepted that the data hitting the raw bucket that is the source for the pipe needs to be in a VARIANT-compliant format (JSON, Parquet, XML, ORC, Avro). This may require a conversion step prior to landing in raw; if so, we should recommend JSON, given it is the most efficient format for Snowflake to ingest into VARIANT.
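As a sketch of the automated option, the pipe below reuses the hypothetical stage and staging table from the examples above and assumes S3 event notifications have been wired to the pipe's notification channel; the manual alternative keeps AUTO_INGEST = FALSE and triggers loads via the Snowpipe REST API.

```sql
-- Hypothetical object names for illustration only.
CREATE OR REPLACE PIPE STAGING.CUSTOMER_PIPE
  AUTO_INGEST = TRUE  -- triggered by S3 event notifications on the raw bucket
AS
  COPY INTO STAGING.CUSTOMER (RAW_PAYLOAD, SOURCE_FILE)
    FROM (SELECT $1, METADATA$FILENAME
          FROM @RAW_DB.PUBLIC.S3_RAW_STAGE/customer/)
    FILE_FORMAT = (TYPE = 'JSON');
```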
Staging Lake principles and standards
- Staging tables should all follow a common structure and naming convention (an illustrative DDL sketch follows this list)
- Staging tables should be of standard table type
- Staging data will be kept data retention compliant by Heimdallr
- Data is immutable apart from updates for data retention treatment - this avoids Time Travel costs other than when data is removed for reload or retention, and ensures consistency of data extracted over time
- Time Travel should be set to a low number (say 3 days max) to allow for easy recovery but minimise costs when reloads occur (note fail-safe of 7 days still provides a recovery point)
- Change Tracking should be enabled on each table by default - this enables easy selection of data inserted over any time period
- Staging is not a consumption layer - any consumption needs to be via a PL object that uses a staging table as a source
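An illustrative DDL sketch of a staging table that follows these principles, using the hypothetical STAGING.CUSTOMER structure from the earlier examples: low Time Travel retention, Change Tracking on by default, and an example of selecting rows inserted over a time window via the CHANGES clause.

```sql
-- Hypothetical table and column names for illustration only.
CREATE TABLE IF NOT EXISTS STAGING.CUSTOMER (
    RAW_PAYLOAD     VARIANT,
    SOURCE_FILE     VARCHAR,
    LOAD_TIMESTAMP  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
DATA_RETENTION_TIME_IN_DAYS = 3   -- low Time Travel retention to minimise cost
CHANGE_TRACKING = TRUE;           -- enables incremental selection via CHANGES

-- Select only the rows inserted in the last day using change tracking metadata.
SELECT *
FROM STAGING.CUSTOMER
  CHANGES (INFORMATION => APPEND_ONLY)
  AT (TIMESTAMP => DATEADD('day', -1, CURRENT_TIMESTAMP()));
```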
Implications
None.