Closed Bug 1338365 Opened 8 years ago Closed 6 years ago

Document review process for creating custom datasets

Categories

(Data Platform and Tools :: Documentation and Knowledge Repo (RTMO), defect, P3)

Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

There should be a defined organizational process for generating derived datasets in Spark before querying and visualizing them in Redash. An additional review process will allow dataset use cases to grow organically and improve the overall utility of Redash. The only necessary background for creating a new dataset should be a basic understanding of SQL plus a guide with links to the appropriate documentation. This process has the potential to be automated.

OUTLINE:
* tutorial notebook with examples
* skeleton notebook for forking
* accompanying documentation describing:
  - benefits and limitations of custom datasets in the context of Redash
  - locations to store data
  - the review process
* how to format a request (table name, table location, frequency)

DESCRIPTION:
User-defined datasets/tables can make interactive querying in Redash significantly faster by offloading the hard work to a dedicated Spark job. Particularly complex queries on Presto run on the order of 30+ minutes and often fail because of high memory usage. These datasets act as domain-specific preprocessing of upstream datasets, possibly taking advantage of joins, filtering, or Python-based user-defined functions.

Individuals or teams who want a more performant Redash experience with detailed queries can easily create a dataset/table. Such a dataset would be defined by a scheduled ATMO Jupyter notebook. An opinionated notebook with configurable boilerplate will provide an easy development experience for those only experienced with SQL (see the sketch at the end of this description). A separate guide or detailed notebook should explain the process of adding a new dataset. The user should be familiar with save directories in S3 and should know how to keep track of their own data usage.

These notebooks should be signed off by someone on the Data Platform team, ensuring that the data is being saved to a sane location. Adding the table to the parquet2hive service should follow review approval. Automating this process could require substantial changes to the data pipeline, such as improving integration with our job scheduling system or automatic dataset detection with Hive. A stricter expiration policy should apply to custom datasets, with the potential for a custom dataset to be promoted to a supported dataset.
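To make the boilerplate concrete, here is a minimal sketch of what such a scheduled notebook job could look like, assuming a PySpark environment like the one ATMO clusters provide. The bucket, dataset, and column names are hypothetical placeholders rather than agreed-upon locations, and the parquet2hive registration step is only noted in a comment since its exact invocation depends on the service setup.

# Sketch of an opinionated "custom dataset" job: read an upstream dataset,
# apply domain-specific preprocessing, and write partitioned Parquet to S3.
# All paths and columns below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("custom_dataset_example").getOrCreate()

# Upstream dataset to preprocess (illustrative path).
upstream = spark.read.parquet("s3://example-upstream-bucket/main_summary/v4/")

# Domain-specific preprocessing: filters and an aggregation; joins or
# Python UDFs would slot in here as well.
derived = (
    upstream
    .filter(F.col("submission_date_s3") >= "20170201")
    .filter(F.col("normalized_channel") == "release")
    .groupBy("submission_date_s3", "country")
    .agg(F.countDistinct("client_id").alias("client_count"))
)

# Write to a versioned, partitioned location so later schema changes are
# manageable and the review can verify the data lands in a sane place.
output_path = "s3://example-derived-bucket/custom/my_team_dataset/v1/"
(
    derived
    .repartition("submission_date_s3")
    .write
    .mode("overwrite")
    .partitionBy("submission_date_s3")
    .parquet(output_path)
)

# After review approval, the S3 prefix would be registered with parquet2hive
# so the table becomes queryable from Presto/Redash.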
Assignee: nobody → amiyaguchi
Points: --- → 2
Priority: -- → P2
A preliminary tutorial has been created in [1]. It shows a simple example of querying using a sqlContext. The ATMO tutorial [2] is solid introductory material for working with Spark. I want to avoid duplicating information, because we don't need multiple general tutorials. Instead, documentation supporting this bug should make the process of accessing data through Redash transparent and approachable.

[1] https://gist.github.com/acmiyaguchi/5bec2bf0e180025c1e1c1c72331db86a/d420aeb75ad604d4cb0926b59ade20addfaa8608
[2] https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark
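For anyone skimming this bug, the kind of sqlContext query the gist demonstrates looks roughly like the sketch below; the table and column names are illustrative only and assume the dataset is already registered in the Hive metastore.

# Rough sketch of a sqlContext-based query in an ATMO Jupyter notebook.
# Table/column names are illustrative and assume an existing Hive-registered
# dataset.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

df = sqlContext.sql("""
    SELECT normalized_channel, count(*) AS n
    FROM longitudinal
    GROUP BY normalized_channel
    ORDER BY n DESC
""")
df.show()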
I started a cookbook at [0] for this. You should expand on it there and add your example notebooks. The downside of using just SQL is that you then can't have a user-defined dataset based on raw pings. Though I do like the idea of having a nice template where users can just put in the few transformations they need; one possible shape is sketched below.

[0] https://github.com/harterrt/telemetry-docs/blob/master/cookbooks/create_a_dataset.adoc
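A sketch of such a template, under the assumption that the only user-edited pieces are the dataset name, the output location, and a single SQL transformation (all placeholder values here). As noted above, a SQL-only skeleton like this cannot start from raw pings.

# Sketch of a fork-and-fill notebook skeleton. Users edit only the three
# values in the configurable section; names and paths are placeholders,
# not an agreed-upon convention.
from pyspark.sql import SparkSession

# --- user-configurable section ----------------------------------------
DATASET_NAME = "my_team_dataset"                      # becomes the table name
OUTPUT_BUCKET = "s3://example-derived-bucket/custom"  # reviewed storage location
TRANSFORMATION_SQL = """
    SELECT submission_date_s3, normalized_channel, count(*) AS pings
    FROM main_summary
    GROUP BY submission_date_s3, normalized_channel
"""
# -----------------------------------------------------------------------

spark = SparkSession.builder.appName(DATASET_NAME).getOrCreate()
result = spark.sql(TRANSFORMATION_SQL)

output_path = "{}/{}/v1/".format(OUTPUT_BUCKET, DATASET_NAME)
result.write.mode("overwrite").parquet(output_path)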
Component: Metrics: Pipeline → Documentation and Knowledge Repo (RTMO)
Product: Cloud Services → Data Platform and Tools
Points: 2 → 3
Summary: Define a process to create custom tables/dataset for Redash using Spark → Document review process for creating custom datasets
Depends on: 1349065, 1349070
Priority: P2 → P3
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED