Separate Data Environment Layers

urn:js:virtue:aspire:proposal:35.1

TL;DR

The different layers (STAGE / RDV etc…) of a data environment (e.g. ADW_PROD) should be separated into individual structures. This allows greater flexibility in environment structure, and a reduced potential blast radius in failure scenarios.

Rational

Snowflake uses Databases and Schemas as structures to manage the organisation of Data. These objects are intrinsically linked with the RBAC security controls that Snowflake implements.

We currently place all of the data from our DEV, PREPROD and PROD data environments into a single database. We also have a number of layers within each data environment:

Stage (and Temp)
Raw Data Vault (RDV)
Business Data Vault (BDV)
Presentation Layer (PL)

Other than the Presentation Layer, each of these layers are stored within a single Snowflake Schema. The result is that the STAGE, RDV, and BDV schemas contain a very large number of tables, from most squads / teams. On the date of writing (2022-01-05), these schemas had 508, 778, 171 tables respectively. When these tables were excluded, the median number of tables in an ADW_PROD schema was 20.

Several issues arise from having schemas that are shared across the entire data environments:

Increased schema complexity, reduced usability, and clashes
Increased blast radius
Reduced granularity of security controls

Schema complexity, reduced usability, and name clashes

Having hundreds of tables within a schema makes traversing and understanding the schema difficult, especially if you wish to utilise the schema in any way other than direct access to a table that you know the name of.

More tables also means a higher likelihood of name clashes, particularly when you have all teams working in the same space. This means that you need to start adding additional naming conventions to tables to reduce the likelihood of collisions.

Blast radius

As Snowflake provides a large number of functions at the schema / database level, when items are grouped together within these objects, it is more probable that many objects will be affected by commands. DROP SCHEMA, ALTER SCHEMA, REVOKE-type, and GRANT-type commands in the current structure would impact objects across our entire data estate, rather than being limited to one IRM or domain.

This increases the impact of any incidents that may cause data exposure, data loss or interruptions to data ingestion / access.

Granularity of Security controls

To reduce the manual overhead of object management, security controls around ETL and deployment (CICD / Object Ownership) processes are often controlled at the schema-level in Snowflake. This is true for the Barbossa system being used to manage our RBAC controls in Snowflake.

Generally CICD / ETL users have access to all of the objects within a schema. Given that the Stage, RDV and BDV schemas are some of the most critical schemas for our pipelines, as they form the layers of our data environments, most teams require access to them for their CICD / ETL processes. As such, this means that CICD / ETL processes have wide ranging, and often privileged access within these schema, and therefore to a much larger number of tables than necessary.

If instead, the layers of the data environment were separated into schemas / database combinations that provided IRM-level segregation of the data environment layers, then these controls could be targeted with granularity at only the schemas necessary for that pipeline, and thus reduce the security risk.

Proposal - Separate, Namespaced Schemas

Separate the layers within a database using naming conventions for schemas e.g.

ADW_PROD.ADW_STAGE_IRM_A
ADW_PROD.ADW_STAGE_IRM_B
ADW_PROD.ADW_RDV_IRM_A
…

Benefits

Reduced schema complexity
Reduced blast radius for operations within a database layer
Increased granularity of security controls
Data environment contained within a single database
Entire environment can be cloned in a single statement
Naming convention will ensure that the different layers are grouped together visually in the database
Tags can easily be applied to an entire environment

Implications

Relies on a naming convention to separate the data layers within a database

Migration Method

It is suggested that if this proposal were to pass:

on a specified date, new items should be added as defined by the new standard, and
old items should be migrated over time as resource / effort allows rather than mandating a sudden migration of all existing objects.

The timelines for this would be determined in agreement with the engineering management and leadership due to the effort required.

Other Options Considered

Separate Databases

Have separate databases for each data layer within an environment, for example:

ADW_PROD_STAGE
ADW_PROD_TEMP
ADW_PROD_RDV
ADW_PROD_BDV
ADW_PROD_PL

Benefits

Reduced schema complexity
Reduced blast radius for operations within a database layer
Increased granularity of security controls
Sharing / replicating data is easier because there is a reduced number of objects in each DB
Reduced number of schemas in each DB, making them more approachable
Maximum granularity of control, and reduction in blast radius
Minimal need for naming conventions within schemas
Cloning a layer is easier and will have a reduced number of objects within it. This will also reduce the confusion of a clone layer referring to an object in the original rather than within the clone.
Per-layer settings such as time travel can be more easily applied to each layer, as they will be passed down to all the schemas within the layer

Implications

MicroStrategy can have issues querying across databases
Cloning an entire environment is more difficult
Tag inheritance within an environment requires duplicating across each layer

Do Nothing

Keep the status quo.

Benefits

No change to current working practices
No migration work required

Implications

See Rational.