
Asynchronous Query Manager for Snowflake

urn:js:virtue:aspire:proposal:32.1

TL;DR

A service to run on-demand queries against Snowflake asynchronously, removing the need for idle client compute time.

Rationale

Currently, queries against the Snowflake Data Warehouse run synchronously. For transactional queries this is fine. However, queries related to data import or export, or that require complex joins, can run for a very long time. In these situations the client system is left running (at cost) but performing no actions, simply waiting for the query to return. This is wasteful. Here we look at possible options for solving this problem.
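The difference between the two interaction patterns can be sketched as follows. This is a minimal, illustrative simulation: `StubConnection` is a hypothetical stand-in for a real Snowflake connection (the real snowflake-connector-python exposes a similar pair of calls, `cursor.execute_async()` and `connection.get_query_status()`), and the query simply "finishes" after a few status checks.

```python
import time

class StubConnection:
    """Simulates Snowflake's asynchronous-query interface for illustration."""

    def __init__(self):
        self._queries = {}
        self._next_id = 0

    def execute_async(self, sql):
        # Returns immediately with a query id; the client is free to exit
        # while the warehouse keeps working on the query.
        self._next_id += 1
        qid = "query-%d" % self._next_id
        self._queries[qid] = 3  # status checks remaining until "done"
        return qid

    def get_query_status(self, qid):
        if self._queries[qid] > 0:
            self._queries[qid] -= 1
            return "RUNNING"
        return "SUCCESS"

conn = StubConnection()
qid = conn.execute_async("COPY INTO target FROM @stage")  # no blocking here

# A later, separate invocation (rather than an idle client) checks for the
# result; in the proposed service this check is driven by a scheduled event.
while conn.get_query_status(qid) == "RUNNING":
    time.sleep(0)  # stand-in for waiting until the next scheduled check

final_status = conn.get_query_status(qid)
```

The key point is that no compute sits blocked between `execute_async` and the status check; the polling invocations are short-lived and cheap.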

Decision Drivers & Background

  1. Reducing idle client compute time, and therefore cost
  2. Currently no feature in Snowflake that enables this kind of interaction pattern
  3. Minimising polling architectures
  4. Minimising duplication of Snowflake interaction patterns across teams
    1. A number of teams are tackling the problem of issuing long-running queries against Snowflake. Each team appears to implement a bespoke solution to each instance of the problem, creating additional complexity and duplication.
  5. Want to implement a robust, reusable pattern for issuing queries against Snowflake.

Use Cases

  1. Where complex orchestration is required before ingest jobs are executed - this orchestration is not currently possible using Tasks & Streams
  2. Running queries in a serverless architecture without having to set up container clusters or VMs
  3. A service managing queries can put controls in place to aid with cost management - e.g. cancelling queries that unexpectedly run for a long period of time
  4. Enabling async queries for GUI applications

Possible future opportunities:

  1. Support for running queries against Snowflake via single sign-on, rather than requiring dedicated Snowflake users to be created
    1. Access for such users could potentially be determined by their position in the organisation

Considered Options

  1. Option 1 - Construct an Asynchronous Query Manager Service
  2. Option 2 - Make use of Snowbot
  3. Option 3 - Do nothing, run synchronous queries
  4. Option 4 - Utilise Tasks and Streams

Decision Outcome

Chosen option: “[option 1] - Construct an Asynchronous Query Manager Service”, because it enables on-demand queries to be run against Snowflake asynchronously, removing the need for idle client compute time. The service can easily be wrapped in an external orchestration tool, and thus scheduled if required. The other options do not provide the same level of orchestration as this mechanism.

Implications

Positive Consequences

  • Cost savings from reduced idle client compute time. The cost of running the AQM should fall within the AWS free-tier.
  • Can be integrated with Voyager to replace the Fargate cluster running to execute queries against Snowflake.
  • Currently required by the Bank ASPIRE PoC and Orange squads, so it meets the current needs of multiple squads and provides an opportunity to demonstrate developing a capability using cross-team resource.
  • Easily integrated with other orchestration solutions, such as the Orchestration Capability being developed, or AWS Step Functions.

Negative Consequences

  • Uses Snowflake preview features to get a true event-driven pattern. An alternative option is available, but it is a centralised polling pattern.
  • Small additional effort required during Bank ASPIRE PoC to add the wrapper to present the capability as a service that other teams can utilise.

Pros and Cons of the Options

[option 1] - Construct an Asynchronous Query Manager Service

Implement a service which provides asynchronous interaction with Snowflake: an Asynchronous Query Manager Service.

Details: AQM - Proposal Overview

  • Good, because …
    • central service, which provides a common capability across the org and reduces duplication of functionality within teams
    • enables event-based interaction with Snowflake, therefore minimising idle time. This also makes it easier to integrate Snowflake queries as part of StepFunctions, and other event-based orchestration mechanisms.
    • polling effort centralised, so that admin-query load on snowflake is minimised (any system would need to be polling-based because it is not possible to extract events from Snowflake)
    • access permissions are maintained as the query is run with the requester’s snowflake credentials
    • easily integrated with the planned orchestration system
    • small micro-service which can easily be dropped into existing or new workflows
    • abstracts interaction patterns with Snowflake away from individual teams
    • serverless architecture means the cost of doing this is likely to be very minimal or fall completely within the AWS free tier whilst also being highly scalable
  • Bad, because…
    • streaming of data from Snowflake as the query runs is not supported. However, in this case, as you are waiting to act on data as it is returned, a synchronous interaction pattern is more suitable.
    • some (albeit minimal) additional load is placed on the Snowflake management service from the status queries against the query table
    • return events would be delayed by up to one minute, as polling frequency is limited by the CloudWatch Events schedule
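The core of the polling mechanism described above could be sketched as follows. This is a hypothetical, in-memory illustration: `QueryStore`, `poll_once`, and the event shape are stand-ins invented for this sketch, not a real AQM API. Each invocation of `poll_once` represents one scheduled tick (e.g. a CloudWatch Events rule firing each minute) that resolves finished queries and emits completion events for downstream consumers.

```python
class QueryStore:
    """In-memory stand-in for the AQM's table of in-flight queries."""

    def __init__(self):
        self.pending = {}   # query_id -> requester
        self.events = []    # completion events "emitted" downstream

    def emit_event(self, query_id, status, requester):
        self.events.append({"query_id": query_id, "status": status,
                            "requester": requester})

def poll_once(store, check_status):
    """One scheduled tick: resolve every finished query, keep the rest."""
    still_pending = {}
    for query_id, requester in store.pending.items():
        status = check_status(query_id)  # one status lookup per query
        if status in ("SUCCESS", "FAILED", "CANCELLED"):
            store.emit_event(query_id, status, requester)
        else:
            still_pending[query_id] = requester
    store.pending = still_pending

# Example tick: one query has finished, one is still running.
store = QueryStore()
store.pending = {"q1": "alice", "q2": "bob"}
statuses = {"q1": "SUCCESS", "q2": "RUNNING"}
poll_once(store, lambda qid: statuses[qid])
```

Centralising this loop in one service is what keeps the admin-query load on Snowflake minimal: one poller checks all in-flight queries, rather than every team running its own.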

[option 2] - Make use of Snowbot

Snowbot enables the scheduling of SQL queries against Snowflake. It may be possible to implement a wrapper around Snowbot to enable an event-based interaction pattern.

Details: SnowBot

  • Good, because …
    • makes use of existing systems
  • Bad, because…
    • a not-insignificant amount of work would be required to enable the event-driven workflow using Snowbot: 
      • currently no events are emitted by Snowbot. This would need to be added to the system, either from a DynamoDB stream or from the container service.
      • currently no dynamic scheduling system for Snowbot; scheduling is configured through a web interface. Additional work would be needed to enable dynamic scheduling, e.g. by creating one-off CloudWatch Events as input triggers.
    • queries are executed as a service user rather than as the requester’s own user, so the requester’s permissions are not applied
    • does not address the issue of a container sitting idle waiting for a query response, causing wasted compute effort and cost
    • delay in query start, as execution would have to wait for the next CloudWatch Event that could be scheduled according to cron

[option 3] - Run synchronous queries

Do nothing; continue running queries using synchronous methods.

  • Good, because …
    • known pattern & security
  • Bad, because…
    • wasted idle compute resource waiting for queries to return, and therefore wasted expenditure
    • duplicated effort as teams implement their own methods of doing this, or try to implement isolated event-based processing

[option 4] - Utilise Snowflake Tasks and Streams

It is possible to create orchestration using Tasks and Streams by triggering Tasks regularly, but only executing them when a Stream has data in it. By this mechanism, event-like functionality can be created in Snowflake, whereby Tasks execute in response to new data. It is also possible to daisy-chain Tasks off each other, thereby creating an orchestration tree of multiple actions occurring as a result of a single input action.
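The pattern above can be sketched in Snowflake SQL. The stream, task, table, and warehouse names here are illustrative; the mechanism itself (a scheduled task gated on `SYSTEM$STREAM_HAS_DATA`, with a child task chained via `AFTER`) is standard Snowflake functionality.

```sql
-- Stream records changes to RAW_EVENTS.
CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw_events;

-- Task fires on a schedule but only executes when the stream holds data.
CREATE OR REPLACE TASK process_raw_events
  WAREHOUSE = my_wh
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO curated_events SELECT * FROM raw_events_stream;

-- Daisy-chained task: no schedule of its own, runs after the task above.
CREATE OR REPLACE TASK aggregate_events
  WAREHOUSE = my_wh
  AFTER process_raw_events
AS
  INSERT INTO event_counts
  SELECT CURRENT_DATE, COUNT(*) FROM curated_events;
```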

For more information see: Snowflake Create Task, and ETL using Snowflake Streams and Tasks.

In addition, Snowflake Notifications are due to arrive in the near future, which would mean we would no longer need to rely on External Functions to get events out of Snowflake. However, there is a question around linking a notification back to a useful event in order to take action based on it.

  • Good, because …
    • Banana is already building the solution
    • Native Snowflake integration / functionality
    • Good for streaming data and other tasks that need to be executed without pre-conditions
  • Bad, because…
    • Joining events requires complex orchestration, although this can be worked around with External Functions
    • No out-of-the-box graphical view. This would need to be created from the Task table
    • Difficult to debug, as logs would need to be pulled out of TASK_HISTORY, then the error searched for and traced through to its consequences.
      • Banana is working on this issue using External Functions, and native support for this is on the way
    • Streams have a 1-1 mapping with tables; it is not possible to generate a stream from multiple tables, and it is not currently possible to create a stream from a view.
    • It is theoretically possible to trigger tasks by injecting data into a table to trigger a stream, which would then trigger a task, but this would be very hacky and not fit for purpose

Appendix

Migrated From Confluence

Original Author: Goodhand, James