
Decouple Curated

urn:js:virtue:aspire:proposal:3.1

TL;DR

In the pipeline that ingests data from S3 into staging, decouple the write to curated and make it optional.

Rationale

Barbossa will leverage Snowflake column level masking to control access to PII and other highly sensitive data. It will also allow us to present plaintext sensitive data to roles that have the privileges. This will remove the need for PII specific schemas and allow us to present all data together in the RDV and higher layers.
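To make the masking behaviour concrete, the sketch below mimics the semantics of a Snowflake column-level masking policy (a `CASE WHEN CURRENT_ROLE() IN (...) THEN val ELSE mask END` expression) in plain Python. The role name and masked value are illustrative assumptions, not agreed names:

```python
# Illustrative sketch of Snowflake column-level masking semantics:
# what a query returns depends on the querying role, not on how the
# data is stored. Role names here are hypothetical.

PRIVILEGED_ROLES = {"PII_READER"}  # assumption: roles granted plaintext access


def apply_masking_policy(value: str, current_role: str) -> str:
    """Mimic: CASE WHEN CURRENT_ROLE() IN (...) THEN val ELSE '*****' END."""
    if current_role in PRIVILEGED_ROLES:
        return value   # privileged roles see plaintext sensitive data
    return "*****"     # all other roles see a masked value


print(apply_masking_policy("jane.doe@example.com", "PII_READER"))  # plaintext
print(apply_masking_policy("jane.doe@example.com", "ANALYST"))     # *****
```

Because masking is applied at query time per role, the same table can serve all consumers, which is what removes the need for PII-specific schemas.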

The current pipeline pattern is to process raw data into curated, where sensitive data is encrypted. If we continue with this pattern, a decryption step will be required before the data is loaded into ADW. The proposal is therefore to remove the need for decryption by loading data directly from our raw buckets into ADW staging and decoupling curated from the pipeline.

This also provides the opportunity to make curated optional and use-case specific. Under the current pattern, curated data is produced for all feeds but is often not consumed by any use-case. This creates unnecessary processing and storage expense which, with curated decoupled from the raw → staging pipeline, could be avoided.
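A minimal sketch of the decoupled flow described above: the staging load runs unconditionally, while the curated write becomes an independent, opt-in step. Function, stage, and feed names are hypothetical, not agreed implementation details:

```python
# Sketch of the decoupled pipeline: raw -> ADW staging always runs,
# raw -> curated only when a use-case requires it. Names are illustrative.

def load_staging(feed: str) -> str:
    # Load raw S3 objects directly into ADW staging (no decryption step).
    return f"COPY INTO staging.{feed} FROM @raw_stage/{feed}/"


def load_curated(feed: str) -> str:
    # Produced only on request, on a use-case specific cadence.
    return f"curated write for {feed}"


def run_pipeline(feed: str, curated_required: bool = False) -> list[str]:
    steps = [load_staging(feed)]   # always deliver to ADW staging
    if curated_required:           # curated is opt-in, per use-case
        steps.append(load_curated(feed))
    return steps
```

Because neither step depends on the other, a failure or schedule change in the curated write cannot delay delivery to ADW.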

Architecture

Current architecture pattern:

[diagram: current architecture pattern]

Decoupling curated and loading staging direct from raw would then look like this:

[diagram: decoupled curated with direct raw → staging load]

Benefits

  • No need to decrypt sensitive data held in curated to leverage the benefits of Barbossa
  • Decreased latency and complexity for delivery of data to ADW
  • The decoupling of curated means we can manage both loads into staging and curated independently without one impacting the other
  • Reduction in compute and storage costs where curated data is not required
  • Curated data can be produced independently of the staging load, increasing flexibility to change while also allowing use-case specific content and cadence
  • Ability to include plaintext sensitive data in the curated object where the use-case requires it - Barbossa could in the future manage this and align roles across the data lake and ADW
  • We could look to remove field level encryption as it serves no purpose for consumers of curated data - we could consider replacing it with hashed values if this provides benefits to consumers
  • Reduction in scope for data retention indexing and treatment
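On the point about replacing field-level encryption with hashed values: a keyed hash is deterministic, so consumers could still join on the hashed field, but the plaintext is not recoverable from it. A minimal sketch, assuming HMAC-SHA256 and a key that would in practice come from a secrets manager:

```python
# Sketch of hashing a sensitive field instead of encrypting it:
# deterministic (joinable) but not reversible. Key handling is illustrative.

import hashlib
import hmac

HASH_KEY = b"example-key"  # assumption: sourced from a secrets manager in practice


def hash_field(value: str) -> str:
    """Deterministically hash a sensitive field with HMAC-SHA256."""
    return hmac.new(HASH_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Same input always yields the same digest, enabling joins across datasets.
assert hash_field("customer-123") == hash_field("customer-123")
assert hash_field("customer-123") != hash_field("customer-124")
```

Unlike encryption, this removes any need for a decryption step downstream, which is consistent with the aims of this proposal.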

Post-change data flows

The following diagram provides a picture of the architecture after this change:

[diagram: post-change data flows]

Key changes:

  • The Curated Layer is removed - it will no longer be a complete, democratised layer, and use-cases that continue will push to the new ‘Consumer Layer’
  • The Information layer is renamed to the ‘Consumer Layer’ and will be the point from which consumers access S3 data - this layer will be populated from:
    • The Raw Layer
    • ADW (all layers)
    • From a stream where micro-batching is required
  • (Decision pending) Robocrop will work directly over ADW only and will no longer DQ raw data in S3 - if raw data DQ is required it can be done on ADW staging
  • Lockbox will be used between Raw -> Consumer layers if the use-case requires consumer data to join to ADW, but in the main will be used to facilitate hashing on ADW

Existing pipelines

There is no immediate need for existing pipelines to change, but as soon as plaintext sensitive data is able to flow into the RDV, and an existing pipeline needs to deliver it, the move should be made. The aim is for the decoupled pattern to be the standard going forward, so eventually all pipelines should move to it.

Impact

This proposal creates impact in other areas, detailed below. The aim of this decision is to agree that we should decouple curated. If this is agreed we can then look to put proposals out to deal with the impacted areas.

  • DQ - Robocrop currently DQs raw data and includes the output in the curated file for onward loading into ADW. With curated becoming optional, other options need to be considered for how, and whether, we continue with DQ on raw data lake S3 objects.
  • Options for moving data from raw -> staging - should we standardise on an approach?
  • Access control on ADW staging - if we are putting sensitive data in staging we will need to define suitable access controls.
  • Options for processing data from raw -> curated - if curated is now optional, and simplified due to encryption being deprecated, we should re-assess our options on how it is produced and efficiently managed.

Implications

None.