Proposal for Common Integration Model (CIM)
urn:js:virtue:aspire:proposal:13.1
TL;DR
Proposal to enhance the Aspire Data workflow and address the problems it poses for the Aspire Assembly Line vision. The Common Integration Model (CIM) is a backbone data model used to describe published data.
Rationale
Aspire Assembly Line
The Aspire Assembly Line is our strategic approach to building analytics and data products in Aspire. It outlines how data inputs should be used to drive our analytical products. However, our current on-boarding workflows are still very manual and do not capture the data required to create a robust, supportable solution. This results in several pain points, and as the volume and variety of data we ingest into Aspire grows, the lack of consistent processes and metadata will impact our ability to move at pace.
| Problem | Impact |
|---|---|
| Non-standardised documentation of data contracts | Supportability is reduced: only the squads who built a pipeline know about its publishers and any data quality issues associated with the data |
| No exposure/publication of these data contracts | Stakeholders and ADA can only make use of the data after it has landed in the Presentation Layer (PL), limiting their ability to move fast |
| No standard automation of pipeline setup | This limits our ability to understand the implications of changes to pipeline processes |
| SLAs not clearly defined | End consumers of data products have no understanding of how delayed their data will be if problems arise |
| Ownership and accountability | There is a lack of clear accountability for each step of the assembly line |
| Hard to capture adherence to principles and standards | This makes it harder to carry out audits and secure the ASPIRE platform |
The recommendation is to introduce a new Common Integration Model (CIM) and associated processes to enable us to capture essential metadata for our data.
What is the Common Integration Model?
The Common Integration Model (CIM) is a backbone Data Model used to describe published data. In essence the CIM will hold the information about our data and integration points to internal capabilities, enabling us and our products to be metadata driven.
Data Model
The data model below shows how metadata attributes can be arranged. This way the attributes are not confined and rigid, allowing the CIM to evolve as other capabilities mature and new ones are added.
Mapping CIM to the ASPIRE Assembly Line
Below is how the CIM maps to the steps of the Aspire Assembly Line (steps 1 to 4, up to Data Into ASPIRE).
Metadata Collection
In this component, Data Publishers will be able to tell us about their data and submit related metadata.
Here we have a couple of different journeys which Publishers will be able to take:
- First contact - In this journey, data publishers will describe their dataset along with its metadata, schema, and other attributes needed to support the ASPIRE assembly line. This metadata will be validated and checked for completeness (optional/mandatory attributes), after which each set of data will be given a unique URI/URN.
- Data Verification - In this journey, data publishers will have the option to submit actual data, which will be verified against the metadata and schema definitions they have provided. Simple schema checks will be carried out to make sure that the data conforms to the schema provided.
- Updates - In this journey, data publishers will have the ability to submit updates about their dataset. This journey will follow the same process as first contact but will not generate a new URN/URI; instead it will create a new version of the CIM.
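The first-contact and update journeys could be sketched as below. The mandatory attribute set, function name, and dictionary shape are illustrative assumptions, not an agreed interface:

```python
import uuid
from typing import Optional

# Mandatory attributes for a submission; an illustrative subset only --
# the real list would come from the agreed CIM definition.
MANDATORY = {"SourceName", "DataClassification", "IRM", "PublicationFormat"}

def register_submission(metadata: dict, existing: Optional[dict] = None) -> dict:
    """Validate a submission and mint (first contact) or version (update) a CIM."""
    missing = MANDATORY - metadata.keys()
    if missing:
        raise ValueError(f"Incomplete submission, missing: {sorted(missing)}")
    if existing is None:
        # First contact: assign a new unique reference at version 1.0
        return {"UniqueReference": str(uuid.uuid4()), "Version": "1.0", **metadata}
    # Update: keep the existing URN and create a new version of the CIM instead
    major, minor = existing["Version"].split(".")
    return {"UniqueReference": existing["UniqueReference"],
            "Version": f"{major}.{int(minor) + 1}",
            **metadata}
```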
Metadata Verification and Publication
This component will cater for validating and publishing the metadata in the form of a CIM. There is a verification gate here: the metadata will have to pass internal checks, e.g. data governance and data security checks, just as in our current workflows, before it is fully accepted and published. (Not all details from the data publisher can be trusted, as publishers may be external parties who have no reference to our IRMs, data classifications, and data security processes.) These verifications can be automated by sending notifications to the relevant internal parties, who can assess and approve the requests.
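A minimal sketch of the verification gate, assuming data governance and data security are the two approving parties; the party names and class shape are illustrative:

```python
# Internal parties that must sign off before a CIM is published
# (illustrative; the real set would mirror the current workflow checks).
REQUIRED_APPROVALS = {"data_governance", "data_security"}

class VerificationGate:
    """Tracks internal approvals for a submitted CIM."""

    def __init__(self):
        self.approvals = set()

    def approve(self, party: str) -> None:
        if party not in REQUIRED_APPROVALS:
            raise ValueError(f"Unknown approving party: {party}")
        self.approvals.add(party)

    @property
    def published(self) -> bool:
        # A CIM is only fully accepted once every internal party has signed off
        return self.approvals == REQUIRED_APPROVALS
```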
Metadata Storage
In this component, the validated and approved CIM will be stored. This acts as a store that users and services of ASPIRE can browse to see which data publishers are sending data to ASPIRE and what datasets they are providing.
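As a sketch, the store could expose a simple lookup of datasets per publisher. This assumes a flattened view of the CIM attributes; the class and method names are hypothetical:

```python
# Minimal in-memory sketch of the metadata store: index approved CIMs by
# unique reference and let ASPIRE users query datasets per source system.
class CIMStore:
    def __init__(self):
        self._by_urn = {}

    def put(self, cim: dict) -> None:
        """Store (or overwrite) an approved CIM, keyed by its unique reference."""
        self._by_urn[cim["UniqueReference"]] = cim

    def datasets_for_publisher(self, source_system: str) -> list:
        """List the extracts a given source system is publishing into ASPIRE."""
        return [c["ExtractName"] for c in self._by_urn.values()
                if c.get("SourceSystemDetails") == source_system]
```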
Some key points
- The Common Integration Model (CIM) supports a number of steps within the Aspire Assembly line
- A CIM will define the metadata we need to understand and use a set of observable data
- With a CIM, Aspire data ingestion pipelines will only accept data that conforms to the CIM principles
- We move towards being metadata driven, which is an industry standard approach to solving this data problem
How will the Common Integration Model & tooling add value to the workflow
| Stakeholder | Value Add |
|---|---|
| Data Publishers | <li>Empowered</li><li>Self-servicing</li> |
| Data Squads | <li>Standardised and version-controlled documentation-as-code</li><li>Promotes collaboration</li><li>Can focus on outcomes</li><li>Can easily support cross-squad pipelines</li> |
| Data Governance | <li>Mandatory artefacts will be evidenced</li><li>Auditable workflows</li><li>Integration with Munin and Barbossa</li> |
| Infosec | <li>Easily run audits and traces on workflows</li> |
| ADA Squads | <li>Will have the ability to quickly search data incoming to ASPIRE</li><li>More inclusive</li> |
Implications
- With the CIM there are no implications for any existing data feeds or processes.
- The idea would be to use CIM for new feeds and slowly migrate existing feeds to CIM in the future.
Appendix
Let's take an example - the current workflow for data ingestion
Currently when a data publisher wants to send their data to ASPIRE, they need to go through a couple of sessions with our engineering teams in which they describe their data and how ASPIRE can access it. Below is a simplified workflow of the processes involved:
Key pieces of information produced
- Information about the Data Publisher
- Information about the data (metadata)
- Classification of the data
- Code for infrastructure to cater for ingestion
- An endpoint for the Data Publisher
As we move towards the ASPIRE assembly line, it makes sense to standardise how data is ingested and progressed across ASPIRE (Data In). This proposal is part of a larger proposal for CIM and is only aimed at publishers of data who will send data to ASPIRE. This helps us create standardised data contracts with the publisher. In the near future, this metadata can be used to aid in automation of the ingest pipelines, creating a sort of self-serve for publishers.
Metadata required to ingest data into Aspire includes:
- Ownership
- Data Classification
- Schema
- Data Quality Rules
- IRM
- Data Dictionary
Stakeholders and ADA can only make use of the data after it has landed in the Presentation Layer (PL). At the moment, this mostly occurs towards the end of the project, and it is expensive for the stakeholders or ADA to ask for changes (such as missing data items or performance improvements).
Below is a workflow of how the journey will look from a Publisher's point of view:
Benefits of CIM
- Standardised and version controlled documentation-as-code for our integrations
- Adherence to Group Data Governance Framework
- Improved Ways of Working
- Accelerated Integrated Futures Program
Example CIM
Based on the information described above, here is an example of what the CIM can look like and the attributes which would be used to describe a feed:
```yaml
# Sample CIM attributes for Ingestion
CIMDetails:
  UniqueReference: 5de26437-9157-40de-a63d-1f131d20c51a
  Version: 1.0
Base:
  SourceName: "Logistics Orders"
  SourceSystemDetails: "Logistics Service"
  ExtractName: "Deliveries"
  DataClassification: "Commercial"
  IRM: "Logistics"
  Brand: "Sainsburys"
  Description: "Logistics orders data for Sainsburys"
TechnicalDetails:
  PublicationFrequency: "Daily"
  Latency: None
  DocumentationLink: "https://sainsburys-confluence.valiantys.net/pages/DOCUMENT"
  ExpectedDataVolumes: 25
  ExpectedDataVolumeUnit: "MB"
  ExternalConstraints: "OnPrem"
DataDetails:
  PublicationMode: "Batch"
  PublicationFormat: "JSON"
  LogicalDataModel: "https://sainsburys-confluence.valiantys.net/pages/DATAMODEL"
  DataSchema: "https://sainsburys-confluence.valiantys.net/pages/SCHEMA"
SupportDetails:
  ServiceNow: "(CI Name)"
  ServiceNowResolverGroup:
  ServiceHours:
  ServiceLevelAgreements:
  ChangeCycle:
TeamDetails:
  OwnerTeamName:
  OwnerContacts:
  TeamDL:
  EscalationContact:
```
The idea here is that we will have enough information to carry out basic checks and verify what data producers are sending. Once they have been verified, we can send them back an endpoint for their data and possibly create a feedback loop for rejected data.
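As a sketch of such a basic check, a payload could be verified against the declared PublicationFormat and a set of expected field names; the function name and field-set argument are illustrative, and only the JSON case is shown:

```python
import json

def verify_payload(cim: dict, raw: bytes, expected_fields: set) -> bool:
    """Check a publisher payload against the CIM's declared format and fields.

    Returns False for rejected data (the feedback-loop case); only JSON
    is sketched here.
    """
    if cim["DataDetails"]["PublicationFormat"] != "JSON":
        raise NotImplementedError("Only JSON checks are sketched here")
    try:
        records = json.loads(raw)
    except json.JSONDecodeError:
        return False  # does not even parse: reject and notify the publisher
    # Every record must carry at least the expected field names
    return all(expected_fields <= record.keys() for record in records)
```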
What can be done with the CIM
Once publishers have given us the information about their data and how we can access it, we can easily use this metadata to set up data pipelines and start ingesting into Snowflake STAGING. Based on the CIM metadata, we have a set of attributes which can be used to automate the setup of the base infrastructure needed to ingest that data.
| # | Component | Description |
|---|---|---|
| 1 | Big Black Box (BBB) | This is a framework which will leverage the metadata and create/deploy artifacts (common terraform modules) |
| 2 | Infrastructure Automation | This will take the YAML doc and create the necessary components for the new feed. Here standard DevOps can be used to construct a set of terraform modules that can be triggered. |
| 3 | Move to Staging (ETL) | This will setup the data pipelines for ingesting the publisher’s data. |
| 4 | Maazel | Passes on the statuses and CIM messages to other processes; will be used as a message bus. |
| 5 | LockBox/Snowbocrop | Capabilities that have input to processing |
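As a sketch of step 2, the CIM metadata could be translated into Terraform input variables for the common modules. The variable names here are assumptions, not an agreed contract; Terraform reads `*.tfvars.json` files natively, so plain JSON is enough:

```python
import json

def cim_to_tfvars(cim: dict) -> str:
    """Derive Terraform input variables for a new feed from CIM metadata.

    Variable names are illustrative; a real mapping would be agreed with
    the owners of the common Terraform modules.
    """
    tfvars = {
        "feed_name": cim["Base"]["ExtractName"].lower(),
        "classification": cim["Base"]["DataClassification"],
        "publication_mode": cim["DataDetails"]["PublicationMode"],
        "expected_volume_mb": cim["TechnicalDetails"]["ExpectedDataVolumes"],
    }
    # Terraform accepts *.tfvars.json, so serialised JSON is usable directly
    return json.dumps(tfvars, indent=2)
```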
Next Steps
- Complex Data Structures
- Integration of Capabilities
- GUI/Frontend for Metadata collection
- Endpoints for publishers