
Aspire Data Warehouse RDV Immutability Standards

urn:js:virtue:aspire:standard:28.1

TL;DR

Refer to the immutability chart in this standard when handling data in the RDV.

Definition

As per the data vault methodology, data once written should not be updated or deleted; upserts and soft deletes are used to track changes instead. Various scenarios are handled in the data warehouse on a day-to-day basis, and the immutability chart in this standard defines the remediation pattern for each such RDV scenario.
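To make the CDC mechanics concrete, below is a minimal, illustrative Python sketch of the insert-only hash-diff comparison a data vault load typically performs. This is not the Aspire pipeline code; the row shape and the column names business_key, hash_diff and load_ts are assumptions for illustration only.

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(attributes: dict) -> str:
    """Hash the descriptive attributes in a fixed column order."""
    payload = "||".join(str(attributes[k]) for k in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cdc_delta(staged_rows, current_satellite_rows):
    """Insert-only Data Vault CDC sketch: emit a staged row only when its
    business key is new or its hash-diff differs from the latest satellite
    row, so nothing already written is ever updated or deleted."""
    latest = {r["business_key"]: r["hash_diff"] for r in current_satellite_rows}
    load_ts = datetime.now(timezone.utc).isoformat()
    for row in staged_rows:
        hd = hash_diff(row["attributes"])
        if latest.get(row["business_key"]) != hd:  # new key or changed attributes
            yield {"business_key": row["business_key"],
                   "hash_diff": hd,
                   "load_ts": load_ts,
                   **row["attributes"]}
```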

Immutability chart

| Scenario | Remediation pattern | Is Stage immutable? | Is RDV immutable? | Recommendations / additional comments |
|---|---|---|---|---|
| Source entity changes: additions and modifications (row-level changes) | Data Vault CDC | Yes | Yes | In the case of transactional data it is an insert-only pattern, hence no Data Vault CDC is performed. |
| Source entity deletions (row-level changes), e.g. a daily full file excludes deleted entries | Data Vault CDC plus logic between Stage and RDV in Tran Views | Yes | Yes | Staging will not contain the dropped-out records in cases where deletions are calculated between Stage and RDV. Calculating deletions is usually done for datasets qualifying for SAT or Link objects, as it is not generally applicable to datasets that are transactional in nature. |
| Addition of a new column to an existing object | 1) Truncate, change the structure and reload/replay the history. 2) Add a new object and load the history. 3) Add the column to the existing object and load incremental data when no history is to be loaded. | Yes | Yes | All approaches are recommended and should not change the existing history of changes captured in the object. If for any reason data values in history files are expected to change the track of history, the impact has to be discussed with the business. |
| Invalid identical duplicate dataset, e.g. live streams replaying data due to an error, the nature of the source publishing mechanism, or a data source based on a backup | Deduplicate between Staging and RDV (see the dedup sketch after this chart) | Yes | Not applicable | The deduplication process is usually built in as part of historic pipelines. If incremental daily files contain identical duplicates, this must be raised with the source for remediation. If the source is unable to provide a fix, the data must be deduped in the incremental pipelines. In this case such data will not land in RDV, as it is filtered between Staging and RDV. |
| Valid identical duplicates, e.g. multiple redemptions of a single item in a single transaction | Load to RDV | Yes | Yes | In cases where identical duplicates have to be allowed for business scenarios, so that those duplicates are accounted for in reporting, deduping is not required. |
| Historic data refresh | Truncate and replay the data | Yes | Yes | If for any reason data values in history files are expected to change the track of history, the impact has to be discussed with the business. Kindly note that the technical metadata columns will change as we are replaying the data. A history refresh when adding a new column might result in an increased number of rows for every business key, but it should not change the actual values of the existing attributes. |
| PII data exposed | Data classification | Yes | Yes | |
| Data retention, e.g. data to be treated for regulatory compliance | Heimdallr | No | No | Data retention will be done by Heimdallr. The approach recommended by Heimdallr must be followed. |
| Pipeline/human ETL errors | Delete from RDV and reprocess | Yes | No | If identified as part of the daily load, records can be deleted from RDV and reprocessed. Where the issue is identified only after a period, and the data has already been refreshed into decision-making reports, the recommendation is to soft delete the wrongly loaded records using RECORD_DELETED_FLAG in RDV for future auditing (see the soft-delete sketch after this chart); in this case run query stats to confirm stakeholder access history and communicate accordingly. Currently the practice in staging is to delete such wrongly loaded records, reload, and then process to RDV. Going forward it is recommended to use RECORD_DELETED_FLAG in staging as well, to keep staging immutable for these pipeline errors and to enable operational error statistics reporting against source data quality. Keeping the data persistent in staging is also important for Heimdallr. |
| Impermissible data exposed in Aspire, e.g. data flowing through Aspire which we are legally not allowed to process | Delete the data from RDV and Staging | No | No | This is a rare scenario concerning data protection and data governance. The data must be removed from the layers where users access it. |
| DQ checks | Manual fixes | Yes, for the original records | No | Format changes to keep the data in sync for referential integrity can be made through an update. DQ fixes should not generally change data values, as those changes have to come through change data control. After manually fixing the loaded data, the pipeline also has to be amended for ongoing correction. |
| Source data structure changes, e.g. attribute and value re-alignments in the source feed | Pipeline alignment and Data Vault CDC | Yes | Yes (in cases of soft deletions) / No | Data pipelines have to be aligned to reflect the source structure changes in staging; the data in RDV then has to be deleted (soft deleted if it has already been consumed for reporting) before continuing with the Data Vault CDC process. This ensures the records sent by the source are correctly reflected in RDV. For transactional links CDC is not applicable; refer to the scenario "Backfill bulk data". |
| Backfill bulk data, e.g. data missed for a period of time, or data wrongly sent for a period of time that needs correction | Purge and reload / merge | Yes | No | Purge and reload has to be assessed for business/use-case impact and avoided if it would result in losing history changes. If the reason for backfilling is source driven, or the records in RDV have already been consumed for reporting, the backfill has to be achieved by soft deleting the old records using RECORD_DELETED_FLAG and adding new records. |
| Data obscenities, e.g. customer comments which are not real obscenities but still require redaction because they are near to obscenity | Manual fixes | Yes, for the original records | No | Redact the data with wildcard characters (*** / —). This is an update to existing data, hence RDV is not immutable. As with other manual fixes, update JOB_NAME for those records with 'ERROR:«Jira Ticket /References»'. Business approval of the action is required. The original data has to be held in staging for future consolidation/debugging purposes. |
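The chart's "deduplicate between Staging and RDV" pattern can be pictured with the sketch below: a hypothetical, in-memory Python illustration rather than the actual pipeline, assuming rows arrive as dictionaries with hashable column values. Staging itself keeps every record as received, so the volume of duplicates sent by the source stays measurable there.

```python
def dedupe_identical(staged_rows):
    """Filter fully identical duplicate rows on the way from Staging to RDV.
    A duplicate means identical across *all* columns; only the first
    occurrence is forwarded to RDV."""
    seen = set()
    for row in staged_rows:
        fingerprint = tuple(sorted(row.items()))  # assumes hashable values
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield row
```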
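Similarly, the soft-delete remediation for pipeline/human ETL errors can be sketched as follows. RECORD_DELETED_FLAG and the 'ERROR:«Jira Ticket /References»' JOB_NAME format come from this standard; identifying the faulty batch by a load_ts column and flagging with "Y" are assumptions, and in practice this would be an UPDATE against the RDV table rather than an in-memory loop.

```python
def soft_delete_faulty_load(rdv_rows, faulty_load_ts, jira_ref):
    """Soft delete records from a wrongly loaded batch instead of hard
    deleting them: flag them and stamp JOB_NAME with the error reference
    so they remain auditable and can be filtered out downstream."""
    for row in rdv_rows:
        if row["load_ts"] == faulty_load_ts:       # rows from the faulty batch (assumed filter)
            row["RECORD_DELETED_FLAG"] = "Y"       # soft-delete flag per this standard
            row["JOB_NAME"] = f"ERROR:{jira_ref}"  # suggested format per this standard
    return rdv_rows
```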

Conclusion

1) RDV must stay immutable for any source entity changes.
2) RDV does not have to be immutable for pipeline errors, DQ data format changes, or corrections.
3) Issues identified as part of the daily load can be treated with delete and reload, as these are considered data movement failures rather than data changes or data issues.
4) For issues identified after the data has been utilised for decision-making and reporting, although the data needs to be corrected, hard deletes are not recommended. Instead, soft delete the data using RECORD_DELETED_FLAG so it remains traceable for any further queries. When updating faulty loads, the JOB_NAME technical metadata column has to be updated with relevant information; suggested format: 'ERROR:«Jira Ticket /References»'. This helps to separate errored, soft-deleted records from source-deleted entries. Downstream calculations have to include RECORD_DELETED_FLAG and/or JOB_NAME to filter down to the valid records, depending on the use case (see the sketch after this list).
5) Staging should be kept immutable in all applicable scenarios, both to measure the quality of data sent by the source and to be able to replay the data to RDV if required. Where we want to account for the number of duplicates sent by the source, this can be retrieved from staging.
6) Data retention is an exception scenario where the underlying records have to be treated and anonymised.
7) As with any other standard, this standard is recommended for any new pipeline or change-pipeline requests.
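As a rough illustration of the downstream filtering described in point 4, the sketch below drops soft-deleted records and records tagged as load errors. It assumes the same hypothetical row shape as the earlier sketches; matching errored records by the 'ERROR:' JOB_NAME prefix is an assumption based on the suggested format above, not a prescribed query.

```python
def valid_records(rdv_rows):
    """Downstream filter per conclusion point 4: exclude soft-deleted
    records and records tagged as load errors via JOB_NAME."""
    return [row for row in rdv_rows
            if row.get("RECORD_DELETED_FLAG") != "Y"
            and not str(row.get("JOB_NAME", "")).startswith("ERROR:")]
```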