Data Sharing with External Consumers
urn:js:virtue:aspire:proposal:2.1
TL;DR
This is a proposal to codify the patterns and anti-patterns for data sharing.
Rationale
Table of Contents
- Target Audience
- Summary
- Current Issues with data sharing
- Modern Data Sharing
- Common Use Cases and Options
- Option - Snowflake Data Sharing
- Option - Snowflake Data Sharing, Different Region or Different Cloud Platform
- Option - Snowflake Data Sharing, 3rd party not a Snowflake Customer
- Option - Rest APIs
- Option - S3 Gateway Endpoint
- Option - Presigned URL via central management
- Option - Data sharing via SFTP (Backward compatability for legacy applications)
- Anti Patterns
- JS Policies and Controls
Target Audience
This write-up focuses on data sharing use cases from ASPIRE to outside organisations, covering the following stakeholders:
- B2B systems
- B2C users
- Engineering resources
Summary
Within DataTech there are several mechanisms used to share internal data with external consumers and systems. Most squads have been getting by with ad-hoc approaches, from SFTP and data shares via S3 through to emailing files to consumers. None of these mechanisms is fully auditable or secure. Moreover, SFTP carries significant operational risk: data can be sent to the wrong 3rd party, and encrypting the data prior to transit brings key rotation and management overhead.
Currently there is no defined standard detailing the processes and mechanisms to use for sharing internal data, which makes it hard for data providers and consumers to share data easily and to ensure data consistency. The aim of this write-up is to agree a standard set of ways in which we can safely and securely share our data with external consumers, maintaining consistency across the squads without prejudice towards either pull or push mechanisms.
Current Issues with data sharing
Traditionally, data sharing is a slow process that reduces Sainsbury's ability to execute quickly. This creates a myriad of challenges, including the following:
- Multiple, static, and out-of-sync versions of the same data residing in different environments and across multiple data silos
- No practical single source of truth or governance exists for data currently shared within the Sainsbury's organisation
- Critical business decisions are made based on outdated, incomplete, or inaccurate data
- The number of potential data breaches and accidental data loss/disclosure risks multiplies, along with their associated costs, such as breach notifications and damage to the Sainsbury's brand
- No audit trail for data that has been shared
Modern Data Sharing
With modern data sharing options inside a cloud-built data platform, in a matter of minutes you (as a data provider) can enable live access to any of your data for any number of data consumers, inside or outside our organisation. We can share data across internal business towers, with business partners across our eco-system, and with external organisations to easily support richer analytics, data-driven initiatives, new business models, and possibly new revenue streams.
- Eliminates movement and copying of data : Modern data sharing offers direct, real-time access to live data in a secure, managed, and controlled environment.
- Provides ready-to-use data: Data consumers get the full capabilities of a data warehouse, allowing them to query and analyse shared data as soon as they’re given access to it. They can combine shared data with their own data.
- Protects personally identifiable information (PII) and complies with industry requirements: Data providers can easily and securely share data while ensuring no PII or sensitive data is compromised by using advanced sharing functions, while still enabling the data consumer to make full use of the data.
- Enables data sharing without added costs: Modern data sharing eliminates the duplicative costs of building the infrastructure needed to store shared data, since data consumers view the shared data directly from the data provider without having to copy or move data.
- Enables data sharing with unlimited data providers and consumers: A modern cloud data platform can serve an unlimited number of data providers and consumers, with full transactional integrity and data consistency.
SFTP Operational Risk and Technical Overhead (sending data to the wrong endpoint)
Although SFTP is an encrypted transfer protocol, the data is only encrypted in transit, so if you send data to the wrong endpoint the recipient can read it. The risk is that it is entirely possible to send data to the wrong endpoint, i.e. you send Batman's data to The Joker. Well-known institutions have been caught out by this risk, including the London Stock Exchange and SWIFT.
This risk can only be mitigated by having a public/private key pair for each recipient endpoint, encrypting the data prior to transit so that only the intended recipient can decrypt it upon receipt. This solution represents a significant technical overhead.
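One practical form of the per-recipient check described above is to pin each approved endpoint's SSH host-key fingerprint at onboarding time and refuse to transfer unless the key presented during the handshake matches. The sketch below illustrates the idea with the standard library; the endpoint name, registry, and "host key" bytes are all hypothetical, and a real implementation would obtain the host key from the SSH handshake (e.g. via an SFTP client library).

```python
import hashlib

# Hypothetical registry mapping each approved recipient endpoint to the
# SHA-256 fingerprint of its SSH host key, recorded at onboarding time.
KNOWN_ENDPOINTS = {
    "sftp.partner-a.example.com":
        "b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c",
}

def fingerprint(host_key: bytes) -> str:
    """SHA-256 fingerprint of a raw SSH host key."""
    return hashlib.sha256(host_key).hexdigest()

def safe_to_send(endpoint: str, host_key: bytes) -> bool:
    """Allow a transfer only when the endpoint is registered AND the host
    key presented during the handshake matches the pinned fingerprint."""
    expected = KNOWN_ENDPOINTS.get(endpoint)
    return expected is not None and fingerprint(host_key) == expected
```

A transfer job would call `safe_to_send` before any file is pushed, turning the "wrong endpoint" failure mode into a hard error rather than a silent disclosure.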
Common Use Cases and Options
The matrix below shows a sample of common use cases and the preferred options with which data can be exposed.
NOTE: SFTP has been listed as a secondary option due to its operational risks and overhead.
| # | Criteria | Options | Suggestion |
|---|---|---|---|
| 1 | 3rd party is a Snowflake customer in the same region and on the same cloud provider | Snowflake Data Sharing | Mandatory |
| 2 | 3rd party is a Snowflake customer in a different region or on a different cloud platform | Snowflake Data Sharing | Recommended |
| | | Snowflake Data Sharing with Reader Account | Recommended |
| | | API | Recommended |
| | | S3 Data Sharing | Recommended |
| | | SFTP | Suggested |
| | | Email/FTP/Manual Sharing | Prohibited |
| 3 | 3rd party is not a Snowflake customer | Snowflake Data Sharing with Reader Account | Recommended |
| | | API | Recommended |
| | | S3 Data Sharing | Recommended |
| | | SFTP | Suggested |
| | | Email/FTP/Manual Sharing | Prohibited |
Option - Snowflake Data Sharing
Assuming the 3rd party we want to share data with is also a Snowflake customer within the same region and on the same cloud provider, Snowflake Data Sharing enables sharing selected objects in a database in your account with other Snowflake accounts:
- Tables
- External tables
- Secure views
- Secure materialized views
- Secure UDFs
Process Flow
Pros:
- No Data Movement
- Real Time Access
- Consistent Data Across Multiple Consumers
- Better Data Governance (Controlled, Customised Views)
- Simple to Implement (No extracts to build, no APIs to write, no additional software to install etc)
Cons:
- Cannot clone or perform any DML changes on a table that was imported from a share
- Consumers must use their warehouses efficiently, or their costs could increase
Further information: https://docs.snowflake.com/en/user-guide/data-sharing-intro.html
Option - Snowflake Data Sharing, Different Region or Different Cloud Platform
Assuming the 3rd party we want to share data with is a Snowflake customer in a different region or on a different cloud platform:
Note: Snowflake uses database replication to allow data providers to securely share data with data consumers across different regions and cloud platforms. A new account needs to be created in the same region as the 3rd party, and the data needs to be replicated to that account in order to create a share.
Process Flow
Pros:
- Real Time Access
- Consistent Data Across Multiple Consumers, as data providers only need to create one copy of the dataset per region, not a copy per consumer
- Better Data Governance (Controlled, Customised Views)
- Simple to Implement (No extracts to build, no APIs to write, no additional software to install etc)
Cons:
- Data needs to be Replicated to the same region and cloud provider
- Secure data sharing across regions or cloud platforms is not allowed when one or more external tables are part of the share
- Entire Database needs to be replicated
- Refresh is charged (Credits involved)
- Sharing to or from Virtual Private Snowflake (VPS) is currently not supported
Further information: https://docs.snowflake.com/en/user-guide/secure-data-sharing-across-regions-plaforms.html
Option - Snowflake Data Sharing, 3rd party not a Snowflake Customer
Assuming the 3rd party is not a Snowflake customer, a Snowflake Reader Account can be created to share data. A read-only share can be created, with the compute costs billed to the provider account (and recharged to the 3rd party commercially if required). Threshold limits can be set on the compute warehouse to control these costs if needed.
Process Flow
Pros :
- No Data Movement
- Real Time Access
- Consistent Data Across Multiple Consumers
- Better Data Governance (Controlled, Customised Views)
- Audit and access logs
- Data can be revoked as needed
Cons :
- Credits charged to Provider Account (Can restrict the credit usage through Resource Monitor)
- Bad queries can lead to additional costs
- No DML Operations can be performed on the Reader Account
Further information: https://docs.snowflake.com/en/user-guide/data-sharing-reader-create.html
Option - Rest APIs
Build APIs that fetch the data from Snowflake or S3, with consumers holding appropriate keys/tokens and associated permissions/roles to pull the data.
Pros:
- Secured - via SSL/TLS and API keys/tokens
- Consistent Data Across Multiple Consumers
Cons :
- Lack of central Governance
- Overhead of key rotation and management
- More development time, ongoing maintenance requirements, and support to provide
- Management of multiple API versions (interface changes)
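The keys/tokens-plus-roles check described above boils down to authenticating the caller and then authorising them for the specific dataset. The hypothetical handler below sketches that two-step check with bearer tokens; the key names, dataset names, and in-code key store are illustrative only (real keys would live in a secrets manager behind an API gateway).

```python
from http import HTTPStatus

# Hypothetical store of issued API keys and the datasets each key may read.
# In practice these would be held in a secrets manager, never in code.
API_KEYS = {
    "key-partner-a": {"sales_summary"},
}

def handle_request(headers: dict, dataset: str) -> HTTPStatus:
    """Authorise a data-pull request using a bearer token.

    Three outcomes: 401 (missing/unknown key), 403 (key lacks access
    to this dataset), 200 (allowed to pull the data).
    """
    auth = headers.get("Authorization", "")
    token = auth.removeprefix("Bearer ").strip()
    scopes = API_KEYS.get(token)
    if scopes is None:
        return HTTPStatus.UNAUTHORIZED
    if dataset not in scopes:
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK
```

Separating the 401 and 403 cases matters operationally: a spike in 403s indicates a known partner probing data they are not entitled to, which is exactly the kind of event the audit trail should capture.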
Option - S3 Gateway Endpoint
Consumers can implement a gateway endpoint to our S3 bucket; we provide a role, and they pull data over the AWS backbone without traversing the internet.
Pros:
- Secured - Data will be accessed through the AWS Backbone
- Data staged within our environments
- Lifecycle policies can be applied to enforce an access deadline
Cons:
- Lack of Central Data Governance
- Overhead of lifecycle policies
- Storage costs could rise if large objects are shared for very long periods
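The lifecycle policy mentioned above can address both cons at once by expiring shared extracts automatically. A rule of the kind intended might look like the fragment below (the rule ID and prefix are hypothetical; the expiry window would be agreed with the consumer):

```json
{
  "Rules": [
    {
      "ID": "expire-shared-extracts",
      "Filter": { "Prefix": "shared/partner-a/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Applied to the sharing bucket, this deletes objects under the shared prefix 30 days after they are written, capping both the access window and the storage cost.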
Option - Presigned URL via central management
We can provide 3rd party consumers with a pre-signed URL granting secure access to S3 objects for a limited time period; they use the URL to pull the data.
Pros:
- Consumers can access S3 objects without the need for AWS credentials or IAM permissions
- Access and Permission can be controlled by the S3 Bucket Owner
Cons:
- Temporary access: consumers can't access the data once the expiry time has lapsed and must request a new URL
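Real S3 pre-signed URLs are produced by AWS Signature Version 4 signing (e.g. boto3's `generate_presigned_url`); the stdlib sketch below only illustrates the underlying idea of a time-limited, tamper-evident link. The base URL, object key, and signing secret are all hypothetical, and this is not a substitute for the AWS mechanism.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical signing secret; with real pre-signed URLs this role is
# played by the bucket owner's AWS credentials.
SECRET = b"hypothetical-signing-secret"

def sign_url(base_url, key, expires_in, now=None):
    """Return a time-limited URL for an object key (simplified analogue
    of a pre-signed URL). `now` is injectable for testing."""
    expires = int(now if now is not None else time.time()) + expires_in
    sig = hmac.new(SECRET, f"{key}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{base_url}/{key}?" + urlencode({"Expires": expires, "Signature": sig})

def url_is_valid(url, now=None):
    """Accept only unexpired URLs whose signature matches the object key,
    so changing the key or the expiry invalidates the link."""
    parsed = urlparse(url)
    qs = parse_qs(parsed.query)
    expires = int(qs["Expires"][0])
    key = parsed.path.lstrip("/")
    expected = hmac.new(SECRET, f"{key}:{expires}".encode(), hashlib.sha256).hexdigest()
    current = int(now if now is not None else time.time())
    return current < expires and hmac.compare_digest(expected, qs["Signature"][0])
```

Because the signature covers both the object key and the expiry, a consumer cannot extend the deadline or point the URL at a different object, which is what makes the temporary-access model safe to hand to parties with no AWS credentials.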
Option - Data sharing via SFTP (Backward compatibility for legacy applications)
Process Flow
Pros :
- Quick and dirty!
- Can be used to link up with legacy technologies
Cons :
- Data must be encrypted prior to transit
- Data could be sent to the wrong server if endpoints are not handled properly
- Key rotation and management overhead
- No real-time data sharing
- Lack of governance and audit trail
Anti Patterns
These are some of the options which are unsafe and should be avoided:
- FTPs
- Emails
- Manual Sharing
- Copy and Paste
- Screen Snapshots
JS Policies and Controls
- Data Governance & Information Security Policy
- Data Handling Policy
- JS Identity & Access Management Policy
- JS Cryptography Policy
- Keeping Our Information Safe Policy
- Cloud Security Framework
- System Monitoring Security Policy
- Secure Build Policy
- Vulnerability and Patch Management Policy
- Network Security Policy
Implications
None.