Closed Bug 1338365 Opened 8 years ago Closed 6 years ago

Document review process for creating custom datasets

Categories

(Data Platform and Tools :: Documentation and Knowledge Repo (RTMO), defect, P3)

Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

There should be a defined organizational process for generating derived datasets in Spark before querying and visualizing them in Redash. An additional review process will allow dataset use cases to grow organically and improve the overall utility of Redash. The only necessary background for creating a new dataset should be a basic understanding of SQL plus a guide with links to the appropriate documentation. This process has the potential to be automated.

OUTLINE:
* tutorial notebook with examples
* skeleton notebook for forking
* accompanying documentation describing:
  - benefits and limitations of custom datasets in the context of Redash
  - locations to store data
  - the review process
* how to format a request (table name, table location, frequency)

DESCRIPTION:
User-defined datasets/tables can make interactive querying in Redash significantly faster by offloading the hard work to a dedicated Spark job. Particularly complex queries on Presto run on the order of 30+ minutes and often fail because of high memory usage. These datasets act as domain-specific preprocessing of upstream datasets, possibly taking advantage of joins, filtering, or Python-based user-defined functions.

Individuals or teams who want a more performant Redash experience with detailed queries can easily create a dataset/table. Such a dataset would be defined by a scheduled ATMO Jupyter notebook. An opinionated notebook with configurable boilerplate will provide an easy development experience for those only experienced with SQL (see the sketch at the end of this description). A separate guide or detailed notebook should explain the process of adding a new dataset. The user should be familiar with save directories in S3 and should know how to keep track of their own data usage.

These notebooks should be signed off by someone on the Data Platform team, ensuring that the data is being saved to a sane location. Adding the table to the parquet2hive service should follow review approval. Automating this process could require substantial changes to the data pipeline, such as improving integration with our job scheduling system or automatic dataset detection with Hive. A stricter expiration policy should apply to custom datasets, with the potential for a custom dataset to be promoted to a supported dataset.
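To make the boilerplate concrete, here is a minimal sketch of what such a scheduled notebook job could look like, assuming a PySpark environment like the one ATMO clusters provide. The bucket, dataset, and column names are hypothetical placeholders rather than agreed-upon locations, and the parquet2hive registration step is only noted in a comment since its exact invocation depends on the service setup.

# Sketch of an opinionated "custom dataset" job: read an upstream dataset,
# apply domain-specific preprocessing, and write partitioned Parquet to S3.
# All paths and columns below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("custom_dataset_example").getOrCreate()

# Upstream dataset to preprocess (illustrative path).
upstream = spark.read.parquet("s3://example-upstream-bucket/main_summary/v4/")

# Domain-specific preprocessing: filters and an aggregation; joins or
# Python UDFs would slot in here as well.
derived = (
    upstream
    .filter(F.col("submission_date_s3") >= "20170201")
    .filter(F.col("normalized_channel") == "release")
    .groupBy("submission_date_s3", "country")
    .agg(F.countDistinct("client_id").alias("client_count"))
)

# Write to a versioned, partitioned location so later schema changes are
# manageable and the review can verify the data lands in a sane place.
output_path = "s3://example-derived-bucket/custom/my_team_dataset/v1/"
(
    derived
    .repartition("submission_date_s3")
    .write
    .mode("overwrite")
    .partitionBy("submission_date_s3")
    .parquet(output_path)
)

# After review approval, the S3 prefix would be registered with parquet2hive
# so the table becomes queryable from Presto/Redash.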
Assignee: nobody → amiyaguchi
Points: --- → 2
Priority: -- → P2
A preliminary tutorial has been created in [1]. It shows a simple example of querying using a sqlContext. The ATMO tutorial [2] is solid introductory material for working with Spark. I want to avoid duplicating information, because we don't need multiple general tutorials. Instead, documentation supporting this bug should make the process of accessing data through Redash transparent and approachable.

[1] https://gist.github.com/acmiyaguchi/5bec2bf0e180025c1e1c1c72331db86a/d420aeb75ad604d4cb0926b59ade20addfaa8608
[2] https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark
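For anyone skimming this bug, the kind of sqlContext query the gist demonstrates looks roughly like the sketch below; the table and column names are illustrative only and assume the dataset is already registered in the Hive metastore.

# Rough sketch of a sqlContext-based query in an ATMO Jupyter notebook.
# Table/column names are illustrative and assume an existing Hive-registered
# dataset.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

df = sqlContext.sql("""
    SELECT normalized_channel, count(*) AS n
    FROM longitudinal
    GROUP BY normalized_channel
    ORDER BY n DESC
""")
df.show()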
I started a cookbook at [0] for this. You should expand on it there and add your example notebooks. The downside of using just SQL is that you then can't have a user-defined dataset based on raw pings. Though I do like the idea of having a nice template where users can just put in the few transformations they need; one possible shape is sketched below.

[0] https://github.com/harterrt/telemetry-docs/blob/master/cookbooks/create_a_dataset.adoc
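A sketch of such a template, under the assumption that the only user-edited pieces are the dataset name, the output location, and a single SQL transformation (all placeholder values here). As noted above, a SQL-only skeleton like this cannot start from raw pings.

# Sketch of a fork-and-fill notebook skeleton. Users edit only the three
# values in the configurable section; names and paths are placeholders,
# not an agreed-upon convention.
from pyspark.sql import SparkSession

# --- user-configurable section ----------------------------------------
DATASET_NAME = "my_team_dataset"                      # becomes the table name
OUTPUT_BUCKET = "s3://example-derived-bucket/custom"  # reviewed storage location
TRANSFORMATION_SQL = """
    SELECT submission_date_s3, normalized_channel, count(*) AS pings
    FROM main_summary
    GROUP BY submission_date_s3, normalized_channel
"""
# -----------------------------------------------------------------------

spark = SparkSession.builder.appName(DATASET_NAME).getOrCreate()
result = spark.sql(TRANSFORMATION_SQL)

output_path = "{}/{}/v1/".format(OUTPUT_BUCKET, DATASET_NAME)
result.write.mode("overwrite").parquet(output_path)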
Component: Metrics: Pipeline → Documentation and Knowledge Repo (RTMO)
Product: Cloud Services → Data Platform and Tools
Points: 2 → 3
Summary: Define a process to create custom tables/dataset for Redash using Spark → Document review process for creating custom datasets
Depends on: 1349065, 1349070
Priority: P2 → P3
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED