Separate Data Environment Layers
urn:js:virtue:aspire:proposal:35.1
TL;DR
The different layers (STAGE / RDV etc…) of a data environment (e.g. ADW_PROD) should be separated into individual structures. This allows greater flexibility in environment structure, and a reduced potential blast radius in failure scenarios.
Rational
Snowflake uses Databases and Schemas as structures to manage the organisation of Data. These objects are intrinsically linked with the RBAC security controls that Snowflake implements.
We currently place all of the data from our DEV, PREPROD and PROD data environments into a single database. We also have a number of layers within each data environment:
- Stage (and Temp)
- Raw Data Vault (RDV)
- Business Data Vault (BDV)
- Presentation Layer (PL)
Other than the Presentation Layer, each of these layers are stored within a single Snowflake Schema. The result is that the STAGE, RDV, and BDV schemas contain a very large number of tables, from most squads / teams. On the date of writing (2022-01-05), these schemas had 508, 778, 171 tables respectively. When these tables were excluded, the median number of tables in an ADW_PROD schema was 20.
Several issues arise from having schemas that are shared across the entire data environments:
- Increased schema complexity, reduced usability, and clashes
- Increased blast radius
- Reduced granularity of security controls
Schema complexity, reduced usability, and name clashes
Having hundreds of tables within a schema makes traversing and understanding the schema difficult, especially if you wish to utilise the schema in any way other than direct access to a table that you know the name of.
More tables also means a higher likelihood of name clashes, particularly when you have all teams working in the same space. This means that you need to start adding additional naming conventions to tables to reduce the likelihood of collisions.
Blast radius
As Snowflake provides a large number of functions at the schema / database level, when items are grouped together within these objects, it is more probable that many objects will be affected by commands. DROP SCHEMA, ALTER SCHEMA, REVOKE-type, and GRANT-type commands in the current structure would impact objects across our entire data estate, rather than being limited to one IRM or domain.
This increases the impact of any incidents that may cause data exposure, data loss or interruptions to data ingestion / access.
Granularity of Security controls
To reduce the manual overhead of object management, security controls around ETL and deployment (CICD / Object Ownership) processes are often controlled at the schema-level in Snowflake. This is true for the Barbossa system being used to manage our RBAC controls in Snowflake.
Generally CICD / ETL users have access to all of the objects within a schema. Given that the Stage, RDV and BDV schemas are some of the most critical schemas for our pipelines, as they form the layers of our data environments, most teams require access to them for their CICD / ETL processes. As such, this means that CICD / ETL processes have wide ranging, and often privileged access within these schema, and therefore to a much larger number of tables than necessary.
If instead, the layers of the data environment were separated into schemas / database combinations that provided IRM-level segregation of the data environment layers, then these controls could be targeted with granularity at only the schemas necessary for that pipeline, and thus reduce the security risk.
Proposal - Separate, Namespaced Schemas
Separate the layers within a database using naming conventions for schemas e.g.
- ADW_PROD.ADW_STAGE_IRM_A
- ADW_PROD.ADW_STAGE_IRM_B
- ADW_PROD.ADW_RDV_IRM_A
- …
Benefits
- Reduced schema complexity
- Reduced blast radius for operations within a database layer
- Increased granularity of security controls
- Data environment contained within a single database
- Entire environment can be cloned in a single statement
- Naming convention will ensure that the different layers are grouped together visually in the database
- Tags can easily be applied to an entire environment
Implications
- Relies on a naming convention to separate the data layers within a database
Migration Method
It is suggested that if this proposal were to pass:
- on a specified date, new items should be added as defined by the new standard, and
- old items should be migrated over time as resource / effort allows rather than mandating a sudden migration of all existing objects.
The timelines for this would be determined in agreement with the engineering management and leadership due to the effort required.
Other Options Considered
Separate Databases
Have separate databases for each data layer within an environment, for example:
- ADW_PROD_STAGE
- ADW_PROD_TEMP
- ADW_PROD_RDV
- ADW_PROD_BDV
- ADW_PROD_PL
Benefits
- Reduced schema complexity
- Reduced blast radius for operations within a database layer
- Increased granularity of security controls
- Sharing / replicating data is easier because there is a reduced number of objects in each DB
- Reduced number of schemas in each DB, making them more approachable
- Maximum granularity of control, and reduction in blast radius
- Minimal need for naming conventions within schemas
- Cloning a layer is easier and will have a reduced number of objects within it. This will also reduce the confusion of a clone layer referring to an object in the original rather than within the clone.
- Per-layer settings such as time travel can be more easily applied to each layer, as they will be passed down to all the schemas within the layer
Implications
- MicroStrategy can have issues querying across databases
- Cloning an entire environment is more difficult
- Tag inheritance within an environment requires duplicating across each layer
Do Nothing
Keep the status quo.
Benefits
- No change to current working practices
- No migration work required
Implications
See Rational.