Data Sharing Pattern for External Consumers
urn:js:virtue:aspire:pattern:.
TL;DR
Data sharing pattern for consumers external to JS.
Instructions
These are in order of preference.
Anti Patterns
The following options are unsafe and should be avoided:
- FTPs
- Emails
- Manual Sharing
- Copy and Paste
- Screen Snapshots
Pattern 1 - Snowflake Data Sharing
For Same Region Same Cloud
Assuming the 3rd party with which we want to share data is also a Snowflake customer and is within the same Region and the same Cloud Provider.
Process Flow
Pros:
- No Data Movement
- Real Time Access
- Consistent Data Across Multiple Consumers
- Better Data Governance (Controlled, Customised Views)
- Simple to Implement (No extracts to build, no APIs to write, no additional software to install etc)
Cons:
- Cannot clone or perform any DML changes on a table that was imported from a share
- Consumers must use their warehouses efficiently, otherwise their costs could increase
Further information: https://docs.snowflake.com/en/user-guide/data-sharing-intro.html
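The provider-side steps for a same-region share can be sketched as a short list of SQL statements. The helper below is a minimal sketch, not our actual implementation: the function name and every object name (share, database, schema, view, consumer account) are hypothetical, and it assumes the objects being shared are secure views that already exist.

```python
def build_share_statements(share, database, schema, secure_views, consumer_accounts):
    """Return the SQL a provider would run to expose secure views through a share.

    All identifiers are supplied by the caller; nothing here is a real object name.
    """
    statements = [
        f"CREATE SHARE {share};",
        # The share needs USAGE on the database and schema before any object grant.
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share};",
    ]
    for view in secure_views:
        statements.append(
            f"GRANT SELECT ON VIEW {database}.{schema}.{view} TO SHARE {share};"
        )
    # Consumer accounts are given as organisation.account identifiers (assumed values).
    accounts = ", ".join(consumer_accounts)
    statements.append(f"ALTER SHARE {share} ADD ACCOUNTS = {accounts};")
    return statements


if __name__ == "__main__":
    for sql in build_share_statements(
        "partner_share", "analytics", "shared", ["orders_v"], ["myorg.partner1"]
    ):
        print(sql)
```

Generating the statements rather than executing them keeps the sketch independent of any connector or credentials; the output can be reviewed before it is run by a role allowed to create shares.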
For Different Region or Different Cloud Platform
Assuming the 3rd party with which we want to share data is a Snowflake customer in a different Region or on a different Cloud Platform:
Note: Snowflake uses database replication to allow data providers to securely share data with data consumers across different regions and cloud platforms. A new account needs to be created in the same region as the 3rd party, and the data needs to be replicated to that new account before a share can be created from it.
Process Flow
Pros:
- Real Time Access
- Consistent Data Across Multiple Consumers, as data providers only need to create one copy of the dataset per region, not one copy per consumer
- Better Data Governance (Controlled, Customised Views)
- Simple to Implement (No extracts to build, no APIs to write, no additional software to install etc)
Cons:
- Data needs to be Replicated to the same region and cloud provider
- Secure data sharing across regions or cloud platforms is not allowed when the share contains one or more external tables
- Entire Database needs to be replicated
- Refresh is charged (Credits involved)
- Sharing to or from Virtual Private Snowflake (VPS) is currently not supported
Further information: https://docs.snowflake.com/en/user-guide/secure-data-sharing-across-regions-plaforms.html
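The replication step described in the note above can be sketched as two sets of SQL statements: one run in the source account and one run in the new account created in the consumer's region. This is a minimal sketch under stated assumptions: the function name, database name, and account identifiers are hypothetical, and it uses plain database replication rather than any other replication mechanism.

```python
def build_replication_statements(database, source_account, target_account):
    """Return (source_sql, target_sql) for replicating one database across regions.

    Account identifiers are assumed to be in organisation.account form.
    """
    # Run in the source (primary) account: allow the target account to hold a replica.
    source_sql = [
        f"ALTER DATABASE {database} ENABLE REPLICATION TO ACCOUNTS {target_account};",
    ]
    # Run in the new account in the consumer's region: create and refresh the replica.
    target_sql = [
        f"CREATE DATABASE {database} AS REPLICA OF {source_account}.{database};",
        # Each refresh consumes credits, so the schedule is a cost decision.
        f"ALTER DATABASE {database} REFRESH;",
    ]
    return source_sql, target_sql


if __name__ == "__main__":
    source_sql, target_sql = build_replication_statements(
        "analytics", "myorg.primary_acct", "myorg.eu_acct"
    )
    print("\n".join(source_sql + target_sql))
```

Once the replica is refreshed, the share itself is created in the new account in the same way as for a same-region consumer.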
3rd party not a Snowflake Customer
Assuming the 3rd party is not a Snowflake customer, a Snowflake Reader Account can be created to share data. The share is read-only. Compute credits used by the Reader Account are charged to the provider account, so any recharge of those costs to the 3rd party has to be arranged separately. Resource monitors can be set on the compute warehouse to control the costs if needed.
Process Flow
Pros:
- No Data Movement
- Real Time Access
- Consistent Data Across Multiple Consumers
- Better Data Governance (Controlled, Customised Views)
- Audit and access logs
- Data can be revoked as needed
Cons:
- Credits charged to Provider Account (Can restrict the credit usage through Resource Monitor)
- Bad queries can lead to additional costs
- No DML Operations can be performed on the Reader Account
Further information: https://docs.snowflake.com/en/user-guide/data-sharing-reader-create.html
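Creating a Reader Account and capping its credit usage can be sketched as follows. This is a minimal sketch, not a definitive procedure: the function name and all identifiers are hypothetical, the warehouse is assumed to exist already in the Reader Account, and the 100 percent suspend trigger is only one possible threshold.

```python
def build_reader_account_statements(
    account, admin_name, admin_password, warehouse, monitor, credit_quota
):
    """Return (provider_sql, reader_sql) for a reader account with a credit cap."""
    # Double any single quote so the password stays inside the SQL string literal.
    escaped = admin_password.replace("'", "''")
    # Run in the provider account.
    provider_sql = [
        f"CREATE MANAGED ACCOUNT {account} ADMIN_NAME = {admin_name}, "
        f"ADMIN_PASSWORD = '{escaped}', TYPE = READER;"
    ]
    # Run inside the reader account, logged in as its administrator.
    reader_sql = [
        f"CREATE RESOURCE MONITOR {monitor} WITH CREDIT_QUOTA = {credit_quota} "
        "TRIGGERS ON 100 PERCENT DO SUSPEND;",
        f"ALTER WAREHOUSE {warehouse} SET RESOURCE_MONITOR = {monitor};",
    ]
    return provider_sql, reader_sql


if __name__ == "__main__":
    provider_sql, reader_sql = build_reader_account_statements(
        "partner_reader", "partner_admin", "change-me", "reader_wh", "reader_monitor", 50
    )
    # Only the reader-side statements are printed, to keep the password out of logs.
    print("\n".join(reader_sql))
```

The generated statement containing the password should be passed straight to the session that executes it and never written to logs or tickets.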
Pattern 2 - REST APIs
Build APIs that fetch the data from Snowflake or S3, using appropriate keys/tokens and the associated permissions/roles to pull the data.
Pros:
- Secured - via SSL/TLS and API keys/tokens
- Consistent Data Across Multiple Consumers
Cons:
- Lack of Central Governance
- Overhead of key rotation and management
- More development time, ongoing maintenance, and support effort
- Management of multiple versions of APIs (interface changes)
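The key/token check at the front of such an API can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the header name `X-API-Key`, the function names, and the consumer names are hypothetical, and a real service would sit behind TLS and load the stored hashes from a secrets store rather than from memory.

```python
import hashlib
import hmac
from typing import Mapping, Optional

API_KEY_HEADER = "X-API-Key"  # hypothetical header name, not an agreed standard


def hash_key(api_key: str) -> str:
    """Hash a key so that only digests, never raw keys, need to be stored."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def identify_consumer(
    headers: Mapping[str, str], key_hashes: Mapping[str, str]
) -> Optional[str]:
    """Return the consumer name matching the presented key, or None if unauthorised."""
    presented = headers.get(API_KEY_HEADER)
    if not presented:
        return None
    presented_hash = hash_key(presented)
    matched = None
    # Compare against every stored hash in constant time to avoid timing leaks.
    for consumer, stored_hash in key_hashes.items():
        if hmac.compare_digest(presented_hash, stored_hash):
            matched = consumer
    return matched


if __name__ == "__main__":
    hashes = {"partner-a": hash_key("key-a"), "partner-b": hash_key("key-b")}
    print(identify_consumer({"X-API-Key": "key-b"}, hashes))
```

Mapping each key to a named consumer makes rotation and revocation a per-consumer operation, which is where most of the key management overhead listed above comes from.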
Pattern 3 - S3 Gateway Endpoint
Consumers can implement a gateway endpoint to our S3 bucket. We provide a role, and they pull the data over the AWS backbone without traversing the internet.
Pros:
- Secured - Data will be accessed through the AWS Backbone
- Data staged within our environments
- Lifecycle policies can be applied to enforce an access deadline
Cons:
- Lack of Central Data Governance
- Overhead of lifecycle policies
- Storage costs could rise if large objects are shared for very long periods
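The access deadline mentioned above can be expressed as an S3 lifecycle rule that expires shared objects after a fixed number of days. The helper below is a minimal sketch: the function name, prefix, and retention period are hypothetical, and it only builds the configuration document, in the shape accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
def build_expiry_lifecycle(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    rule_id = f"expire-{prefix.strip('/').replace('/', '-')}-after-{days}d"
    return {
        "Rules": [
            {
                "ID": rule_id,
                # Only objects under this prefix are affected by the rule.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Objects are removed this many days after they were created.
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    config = build_expiry_lifecycle("shared/partner-a/", 30)
    print(config)
    # Applying it needs boto3 and AWS credentials (bucket name is an assumption):
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="example-shared-bucket", LifecycleConfiguration=config
    # )
```

Scoping the rule to a per-consumer prefix keeps the deadline from affecting other data staged in the same bucket.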
Pattern 4 - Presigned URL via central management
We can provide 3rd Party consumers with a pre-signed URL for secure access to S3 objects for a limited time period - they use the URL to pull the data.
Pros:
- Consumers can access S3 objects without the need for AWS credentials or IAM permissions
- Access and Permission can be controlled by the S3 Bucket Owner
Cons:
- Temporary Access: Consumers can’t access the data once the expiry time has lapsed; they must request a new URL
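Issuing a time-limited URL can be sketched with boto3's `generate_presigned_url`. This is a minimal sketch, not the central management service itself: the function name, bucket, and key are hypothetical, and the S3 client is passed in so the expiry check can be exercised without AWS credentials.

```python
# Signature Version 4 presigned URLs can be valid for at most seven days.
MAX_EXPIRY_SECONDS = 7 * 24 * 60 * 60


def presign_download(s3_client, bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a presigned GET URL for one object, valid for expires_in seconds."""
    if not 0 < expires_in <= MAX_EXPIRY_SECONDS:
        raise ValueError(
            f"expires_in must be between 1 and {MAX_EXPIRY_SECONDS} seconds"
        )
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Example usage (needs boto3 and AWS credentials; names are assumptions):
# import boto3
# url = presign_download(
#     boto3.client("s3"), "example-shared-bucket", "exports/report.csv", 900
# )
```

The URL carries the permissions of the credentials that signed it, so the signing role should be limited to the objects intended for sharing.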
Pattern 5 - Data sharing via SFTP (Backward compatibility for legacy applications)
Process Flow
Pros:
- Quick and dirty!
- Can be used to link up with legacy technologies
Cons:
- Data needs to be encrypted prior to transit
- Data could potentially be sent to the wrong server if not handled properly
- Key Rotation and Management overhead
- Not suitable for sharing Real Time Data
- Lack of Governance and Audit Trail