Proposal for Common Integration Model (CIM)
urn:js:virtue:aspire:proposal:13.1
TL;DR
Proposal to enhance the Aspire Data workflow and address the problems it poses for the Aspire Assembly Line vision. The Common Integration Model (CIM) is a backbone data model used to describe published data.
Rationale
Aspire Assembly Line
The Aspire Assembly Line is our strategic approach to building analytics and data products in Aspire. It outlines how data inputs should be used to drive our analytical products. However, our current on-boarding workflows are still very manual and do not capture the data required to create a robust, supportable solution. This results in several pain points, and as the volume and variety of data we ingest into Aspire grows, the lack of consistent processes and metadata will impact our ability to move at pace.
| Problem | Impact |
|---|---|
| Non-standardised documentation of data contracts | Supportability is reduced: only the squads who built a pipeline know about its publishers and any data quality issues associated with the data |
| No exposure/publication of these data contracts | Stakeholders and ADA can only make use of the data after it has landed in the Presentation Layer (PL), limiting their ability to move fast |
| No standard automation of pipeline setup | This limits our ability to understand the implications of changes to pipeline processes |
| SLAs not clearly defined | End consumers of data products have no understanding of how delayed their data will be if problems arise |
| Ownership and accountability | There is a lack of clear accountability for each step of the assembly line |
| Hard to capture adherence to principles and standards | This makes it harder to carry out audits and secure the ASPIRE platform |
The recommendation is to introduce a new Common Integration Model (CIM) and associated processes to enable us to capture essential metadata for our data.
What is the Common Integration Model?
The Common Integration Model (CIM) is a backbone Data Model used to describe published data. In essence the CIM will hold the information about our data and integration points to internal capabilities, enabling us and our products to be metadata driven.
Data Model
The data model below shows how metadata attributes can be arranged. This way the attributes are not confined and rigid, allowing the CIM to evolve as other capabilities mature and new ones are added.
Mapping CIM to the ASPIRE Assembly Line
Below is how the CIM maps to the steps of the Aspire Assembly Line (steps 1 to 4, up to Data Into ASPIRE).
Metadata Collection
In this component, Data Publishers will be able to tell us about their data and submit related metadata.
Here we have a couple of different journeys which Publishers will be able to take:
- First contact - In this journey, data publishers will describe their dataset along with its metadata, schema, and other attributes needed to support the ASPIRE assembly line. This metadata will be validated and checked for completeness (optional/mandatory attributes), after which each set of data will be given a unique URI/URN.
- Data Verification - In this journey, data publishers will have the option to submit actual data, which will be verified against the metadata and schema definitions they have provided. Simple schema checks will be carried out to make sure that the data conforms to the schema provided.
- Updates - In this journey, data publishers will have the ability to submit updates about their dataset. This journey will follow the same process as first contact but will not generate a new URN/URI; instead it will create a new version of the CIM.
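The first-contact and update journeys could be sketched as below. The mandatory attribute set, function name, and dictionary shape are illustrative assumptions, not an agreed interface:

```python
import uuid
from typing import Optional

# Mandatory attributes for a submission; an illustrative subset only --
# the real list would come from the agreed CIM definition.
MANDATORY = {"SourceName", "DataClassification", "IRM", "PublicationFormat"}

def register_submission(metadata: dict, existing: Optional[dict] = None) -> dict:
    """Validate a submission and mint (first contact) or version (update) a CIM."""
    missing = MANDATORY - metadata.keys()
    if missing:
        raise ValueError(f"Incomplete submission, missing: {sorted(missing)}")
    if existing is None:
        # First contact: assign a new unique reference at version 1.0
        return {"UniqueReference": str(uuid.uuid4()), "Version": "1.0", **metadata}
    # Update: keep the existing URN and create a new version of the CIM instead
    major, minor = existing["Version"].split(".")
    return {"UniqueReference": existing["UniqueReference"],
            "Version": f"{major}.{int(minor) + 1}",
            **metadata}
```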
Metadata Verification and Publication
This component will cater for validating and publishing the metadata in the form of a CIM. There is a verification gate here: the metadata will have to pass internal checks, e.g. data governance and data security checks, just as in our current workflows, before it is fully accepted and published. (Not all details from the data publisher can be trusted, as publishers may be external parties who have no reference to our IRMs, data classifications, and data security processes.) These verifications can be automated by sending notifications to the relevant internal parties, who can assess and approve the requests.
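A minimal sketch of the verification gate, assuming data governance and data security are the two approving parties; the party names and class shape are illustrative:

```python
# Internal parties that must sign off before a CIM is published
# (illustrative; the real set would mirror the current workflow checks).
REQUIRED_APPROVALS = {"data_governance", "data_security"}

class VerificationGate:
    """Tracks internal approvals for a submitted CIM."""

    def __init__(self):
        self.approvals = set()

    def approve(self, party: str) -> None:
        if party not in REQUIRED_APPROVALS:
            raise ValueError(f"Unknown approving party: {party}")
        self.approvals.add(party)

    @property
    def published(self) -> bool:
        # A CIM is only fully accepted once every internal party has signed off
        return self.approvals == REQUIRED_APPROVALS
```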
Metadata Storage
In this component, the validated and approved CIM will be stored. This acts as a store that users and services of ASPIRE can browse to see which data publishers are sending data to ASPIRE and what datasets they are providing.
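As a sketch, the store could expose a simple lookup of datasets per publisher. This assumes a flattened view of the CIM attributes; the class and method names are hypothetical:

```python
# Minimal in-memory sketch of the metadata store: index approved CIMs by
# unique reference and let ASPIRE users query datasets per source system.
class CIMStore:
    def __init__(self):
        self._by_urn = {}

    def put(self, cim: dict) -> None:
        """Store (or overwrite) an approved CIM, keyed by its unique reference."""
        self._by_urn[cim["UniqueReference"]] = cim

    def datasets_for_publisher(self, source_system: str) -> list:
        """List the extracts a given source system is publishing into ASPIRE."""
        return [c["ExtractName"] for c in self._by_urn.values()
                if c.get("SourceSystemDetails") == source_system]
```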
Some key points
- The Common Integration Model (CIM) supports a number of steps within the Aspire Assembly line
- A CIM will define the metadata we need to understand and use a set of observable data
- With a CIM, Aspire data ingestion pipelines will only accept data that conforms to the CIM principles
- We move towards being metadata driven, which is an industry standard approach to solving this data problem
How will the Common Integration Model & tooling add value to the workflow
| Stakeholder | Value Add |
|---|---|
| Data Publishers | <li>Empowered</li><li>Self-servicing</li> |
| Data Squads | <li>Standardised and version-controlled documentation-as-code</li><li>Promotes collaboration</li><li>Can focus on outcomes</li><li>Can easily support cross-squad pipelines</li> |
| Data Governance | <li>Mandatory artefacts will be evidenced</li><li>Auditable workflows</li><li>Integration with Munin and Barbossa</li> |
| Infosec | <li>Easily run audits and traces on workflows</li> |
| ADA Squads | <li>Will have the ability to quickly search data incoming to ASPIRE</li><li>More inclusive</li> |
Implications
- With the CIM there are no implications for any existing data feeds or processes.
- The idea would be to use CIM for new feeds and slowly migrate existing feeds to CIM in the future.
Appendix
Let's take an example - the current workflow for data ingestion
Currently when a data publisher wants to send their data to ASPIRE, they need to go through a couple of sessions with our engineering teams in which they describe their data and how ASPIRE can access it. Below is a simplified workflow of the processes involved:
Key pieces of information produced
- Information about the Data Publisher
- Information about the data (metadata)
- Classification of the data
- Code for infrastructure to cater for ingestion
- An endpoint for the Data Publisher
As we move towards the ASPIRE assembly line, it makes sense to standardise how data is ingested and progressed across ASPIRE (Data In). This proposal is part of a larger proposal for CIM and is only aimed at publishers of data who will send data to ASPIRE. This helps us create standardised data contracts with the publisher. In the near future, this metadata can be used to aid in automation of the ingest pipelines, creating a sort of self-serve for publishers.
Metadata required to ingest data into Aspire includes:
- Ownership
- Data Classification
- Schema
- Data Quality Rules
- IRM
- Data Dictionary
Stakeholders and ADA can only make use of the data after it has landed in the Presentation Layer (PL). At the moment, this mostly occurs towards the end of the project, and it is expensive for the stakeholders or ADA to ask for changes (such as missing data items or performance improvements).
Below is a workflow of how the journey will look from a Publisher's point of view:
Benefits of CIM
- Standardised and version controlled documentation-as-code for our integrations
- Adherence to Group Data Governance Framework
- Improved Ways of Working
- Accelerated Integrated Futures Program
Example CIM
Based on the information described above, here is an example of what the CIM can look like and the attributes which would be used to describe a feed:
```yaml
# Sample CIM attributes for Ingestion
CIMDetails:
  UniqueReference: 5de26437-9157-40de-a63d-1f131d20c51a
  Version: 1.0
Base:
  SourceName: "Logistics Orders"
  SourceSystemDetails: "Logistics Service"
  ExtractName: "Deliveries"
  DataClassification: "Commercial"
  IRM: "Logistics"
  Brand: "Sainsburys"
  Description: "Logistics orders data for Sainsburys"
TechnicalDetails:
  PublicationFrequency: "Daily"
  Latency: None
  DocumentationLink: "https://sainsburys-confluence.valiantys.net/pages/DOCUMENT"
  ExpectedDataVolumes: 25
  ExpectedDataVolumeUnit: "MB"
  ExternalConstraints: "OnPrem"
DataDetails:
  PublicationMode: "Batch"
  PublicationFormat: "JSON"
  LogicalDataModel: "https://sainsburys-confluence.valiantys.net/pages/DATAMODEL"
  DataSchema: "https://sainsburys-confluence.valiantys.net/pages/SCHEMA"
SupportDetails:
  ServiceNow: "(CI Name)"
  ServiceNowResolverGroup:
  ServiceHours:
  ServiceLevelAgreements:
  ChangeCycle:
TeamDetails:
  OwnerTeamName:
  OwnerContacts:
  TeamDL:
  EscalationContact:
```
The idea here is that we will have enough information to carry out basic checks and verify what data producers are sending. Once they have been verified, we can send them back an endpoint for their data and possibly create a feedback loop for rejected data.
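As a sketch of such a basic check, a payload could be verified against the declared PublicationFormat and a set of expected field names; the function name and field-set argument are illustrative, and only the JSON case is shown:

```python
import json

def verify_payload(cim: dict, raw: bytes, expected_fields: set) -> bool:
    """Check a publisher payload against the CIM's declared format and fields.

    Returns False for rejected data (the feedback-loop case); only JSON
    is sketched here.
    """
    if cim["DataDetails"]["PublicationFormat"] != "JSON":
        raise NotImplementedError("Only JSON checks are sketched here")
    try:
        records = json.loads(raw)
    except json.JSONDecodeError:
        return False  # does not even parse: reject and notify the publisher
    # Every record must carry at least the expected field names
    return all(expected_fields <= record.keys() for record in records)
```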
What can be done with the CIM
Once publishers have given us the information about their data and how we can access it, we can easily use this metadata to set up data pipelines and start ingesting into Snowflake STAGING. Based on the CIM metadata, we have a set of attributes which can be used to automate the setup of the base infrastructure needed to ingest that data.
| # | Component | Description |
|---|---|---|
| 1 | Big Black Box (BBB) | This is a framework which will leverage the metadata and create/deploy artifacts (common terraform modules) |
| 2 | Infrastructure Automation | This will take the YAML doc and create the necessary components for the new feed. Here standard DevOps can be used to construct a set of terraform modules that can be triggered. |
| 3 | Move to Staging (ETL) | This will setup the data pipelines for ingesting the publisher’s data. |
| 4 | Maazel | Passes on the statuses and CIM messages to other processes; will be used as a message bus. |
| 5 | LockBox/Snowbocrop | Capabilities that have input to processing |
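As a sketch of step 2, the CIM metadata could be translated into Terraform input variables for the common modules. The variable names here are assumptions, not an agreed contract; Terraform reads `*.tfvars.json` files natively, so plain JSON is enough:

```python
import json

def cim_to_tfvars(cim: dict) -> str:
    """Derive Terraform input variables for a new feed from CIM metadata.

    Variable names are illustrative; a real mapping would be agreed with
    the owners of the common Terraform modules.
    """
    tfvars = {
        "feed_name": cim["Base"]["ExtractName"].lower(),
        "classification": cim["Base"]["DataClassification"],
        "publication_mode": cim["DataDetails"]["PublicationMode"],
        "expected_volume_mb": cim["TechnicalDetails"]["ExpectedDataVolumes"],
    }
    # Terraform accepts *.tfvars.json, so serialised JSON is usable directly
    return json.dumps(tfvars, indent=2)
```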
Next Steps
- Complex Data Structures
- Integration of Capabilities
- GUI/Frontend for Metadata collection
- Endpoints for publishers