
Source Once, Use Many Times

urn:js:virtue:aspire:principle:8.1

TL;DR

The data feed should include all records, entities and attributes at the lowest level of granularity.

Rationale

To facilitate reuse, establish an agreement for a comprehensive data feed of all the relevant data from the source system, with periodic updates – preferably deltas – and the data left untransformed from source as far as possible. The feed should include all records, entities and attributes at the lowest level of granularity. This ensures that the wide range of user groups (including data scientists) have the data needed for their current and foreseeable requirements.

Note: When applying this principle, due consideration must be given to legal, regulatory and supplier constraints. In addition, data must be sourced from the master source of the data and not from an intermediate source such as another data warehouse.

The key drivers are:

  • Promote reuse – Offers the best possibility for reuse of data across the organisation and ensures it is fit for purpose.
  • Reduce burden on data suppliers – A comprehensive record-level data feed will reduce repetition and duplication, thus ensuring efficient use of supplier effort.
  • Future-proof data feeds – Creating a comprehensive record-level data feed at the outset eliminates the need for rework at a later date as requirements change, thus reducing subsequent delivery timelines, effort and expenditure.
  • Simplify the sourcing process – Not applying logic to restrict or manipulate the data content at source minimises the impact on the source system (both during development and at runtime) and speeds up the delivery of data.
  • Enable data discovery – Providing a full dataset enables the discovery of previously unknown patterns.
  • Avoid skewed results – Transforming the data from source before it arrives could potentially skew the results.
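As a minimal sketch of the pattern these drivers describe – an untransformed, record-level extract with delta updates driven by a high-water mark – the following uses an in-memory SQLite table; the `orders` table, its columns and the watermark values are purely hypothetical illustrations, not part of any specific source system:

```python
import sqlite3

# Hypothetical source table: every record, every attribute, lowest granularity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, "
    "amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", 10.0, "2024-01-01"),
        (2, "bob", 25.5, "2024-01-03"),
        (3, "carol", 7.2, "2024-01-05"),
    ],
)

def extract_delta(conn, watermark):
    """Select every column (SELECT *), untransformed and at record level,
    restricted only to rows changed since the last successful extract."""
    return conn.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (watermark,)
    ).fetchall()

# First run: a watermark far in the past yields the full feed of all records.
full_feed = extract_delta(conn, "1970-01-01")
# Subsequent runs: only the delta since the previous high-water mark.
delta = extract_delta(conn, "2024-01-02")
print(len(full_feed), len(delta))  # 3 2
```

The point of the sketch is the absence of filtering or transformation logic beyond the watermark predicate: the extract carries all columns as-is, leaving fitness-for-purpose decisions to the consumers rather than the source.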

Implications

The potential implications are:

  • Extra initial effort – Ingesting a comprehensive record-level data feed takes extra effort. However, this increase is not linear, and it will be significantly less than the effort required to modify the data feed at a later date.
  • Higher storage costs – The larger datasets will incur higher storage costs. Over time as more of the data is used, this cost will become less of an issue. Also, the costs incurred for the additional storage will be significantly lower than the cost of modifying the data feed at a later date.
  • Increased network traffic – The larger datasets will result in higher network traffic. Over the long term, however, having a single extract of data for onward use will have a lower impact than multiple smaller extracts, and should give a predictable network load.
  • Higher up-front cost – The full sourcing cost is incurred at the outset, even where it goes beyond the initial data requirements. However, this should reduce the need to go back to the source system for subsequent data requirements.
  • Effective data management – To manage the received data effectively and efficiently requires a holistic understanding of the data requirements and usage across the organisation.
  • Effective data governance – Data governance processes and policies need to be in place to ensure that data can be found and used effectively.
  • Consider future data needs – Potential future requirements also need to be taken into consideration when negotiating the agreement with the supplier.