
Ingestion Standards for Graph Database

urn:js:virtue:aspire:proposal:5.1

TL;DR

This decision seeks to agree a set of Principles and Standards for aligning the use of graph databases with ASPIRE.

Rationale

Use cases within Sainsbury’s that require highly connected datasets with many complex relationships are becoming more frequent, e.g. Data Lineage: tracing data back through data pipelines to its authoritative source, and forwards to its usage. Graph databases use graph structures, with nodes, edges, and properties, to represent and store data for semantic queries, and are best suited to handling these use cases.


Principles & Standards

  1. Data captured for graph DB ingestion (through the various processes) must be landed in the ASPIRE Data Lake raw bucket before any filtering, cleansing or aggregation logic is applied
  2. Data from the raw data lake should be loaded into the Snowflake database to enforce common business entities and data modelling standards
  3. Cleansing, DQ and any other transformation rules defining the nodes and edges should be applied outside the graph database (i.e. in Snowflake)
  4. Data for the graph database should be sourced from Snowflake objects depending on the use case: (4a) RDV when no transformation or business alignment is required; (4b) BDV when transformed and business-aligned data is required; or (4c) PL when the data must also be visualised using traditional methods (e.g. Microstrategy) in addition to the graph database use case
  5. The graph database should be used for traversal purposes only; no transformation logic or rules should live within it
  6. ASPIRE principles and standards must be observed for any PII or commercially sensitive data
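Principles 3 and 5 mean the node and edge definitions are derived upstream, before anything reaches the graph database. As a minimal sketch of that idea, the snippet below turns flat, already-cleansed lineage rows into distinct node and edge lists; the record shape (`source_object`, `target_object`, `pipeline`) is an illustrative assumption, not the actual ASPIRE schema.

```python
# Hypothetical sketch: deriving graph nodes and edges from cleansed lineage
# rows *before* they reach the graph database (principles 3 and 5).
# The row fields below are assumptions for illustration only.

def build_graph_payload(rows):
    """Turn flat lineage rows into distinct, sorted nodes and edges."""
    nodes, edges = set(), set()
    for row in rows:
        nodes.add(row["source_object"])
        nodes.add(row["target_object"])
        # Edge direction follows the data flow: source -> target.
        edges.add((row["source_object"], row["target_object"], row["pipeline"]))
    return sorted(nodes), sorted(edges)

# Example: two pipeline steps sharing an intermediate table.
rows = [
    {"source_object": "raw.orders", "target_object": "bdv.orders",
     "pipeline": "load_orders"},
    {"source_object": "bdv.orders", "target_object": "pl.orders_fact",
     "pipeline": "publish_orders"},
]
nodes, edges = build_graph_payload(rows)
# nodes -> ['bdv.orders', 'pl.orders_fact', 'raw.orders']
```

The graph database then only ingests and traverses these pre-built nodes and edges; it never hosts the derivation logic itself.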

Approaches

Option 1 - Persist the data into Snowflake RDV/BDV/PL Schemas

  • Load the data from the proposed MVP sources (Snowflake, Mazel, Microstrategy) into RDV without any transformation
  • Cleansing, DQ and other transformations to be done in BDV
  • Make the output available in the PL layer in the form of facts and dimensions (currently there are no reporting requirements for this layer)
  • Extract the data from the PL layer into the raw S3 bucket and push it to TigerGraph

Option 2 - Persist the data into Snowflake RDV/BDV Schemas

  • Load the data from the proposed MVP sources (Snowflake, Mazel, Microstrategy) into RDV without any transformation
  • Cleansing, DQ and other transformations to be done in BDV
  • Extract the data from the RDV/BDV layer into the raw S3 bucket and push it to TigerGraph

With no reporting requirements, we recommend Option 2. History for the data will be built up over time, so if reporting requirements emerge the PL layer can be built in the future.
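The final hand-off in Option 2, extracting node and edge records into the raw S3 bucket for the graph database, could look like the sketch below, which writes the records as CSV files to a staging directory standing in for S3. The file names and column headings are assumptions for illustration, not an agreed contract with the TigerGraph loading job.

```python
# Hypothetical sketch of the Option 2 hand-off: node and edge records
# extracted from Snowflake RDV/BDV are written as CSV files to a staging
# path (standing in for the raw S3 bucket) for a graph loading job.
# File names and columns are illustrative assumptions.
import csv
from pathlib import Path

def stage_for_graph(nodes, edges, out_dir):
    """Write nodes and edges as CSVs and return the staged file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "nodes.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["object_name"])  # header row for the loading job
        w.writerows([n] for n in nodes)
    with open(out / "edges.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["source", "target", "pipeline"])
        w.writerows(edges)
    return [p.name for p in sorted(out.iterdir())]

# Usage, e.g.:
# stage_for_graph(["raw.orders", "bdv.orders"],
#                 [("raw.orders", "bdv.orders", "load_orders")],
#                 "/tmp/lineage_stage")
```

In the real pipeline the extract would be an unload from Snowflake to S3 followed by a bulk load into the graph database; the point of the sketch is that the staged files carry fully-formed nodes and edges, consistent with principle 5.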

Sample Architecture

[Sample architecture diagram]

Summary

The solution architecture above has been designed for the Data Lineage MVP in accordance with the principles listed previously. Data lineage metadata will be provided by three sources (initial MVP only): Snowflake, Maazel and Microstrategy. File extracts in different formats are delivered to a raw S3 bucket. The data is then ingested into ASPIRE, where the three different formats are cleansed and transformed into a consistent dataset and made available in the Snowflake database. Data supporting the specific Data Lineage use case is then extracted to S3 and ingested into the graph DB for analysis and querying. The graph DB in this use case is a managed service from TigerGraph.

Implications

None.