Closed
Bug 1338365
Opened 8 years ago
Closed 6 years ago
Document review process for creating custom datasets
Categories
(Data Platform and Tools :: Documentation and Knowledge Repo (RTMO), defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: amiyaguchi, Assigned: amiyaguchi)
References
Details
There should be a defined organizational process for generating derived datasets in Spark before querying and visualizing them in Redash. A review process would allow dataset use cases to grow organically and improve the overall utility of Redash. The only background needed to create a new dataset should be a basic understanding of SQL, plus a guide with links to the appropriate documentation.
This process has the potential to be automated.
OUTLINE:
* tutorial notebook with examples
* skeleton notebook for forking
* accompanying documentation describing:
- benefits and limitations of custom datasets in the context of Redash
- locations to store data
- review process
* how to format request (table name, table location, frequency)
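The request format above (table name, table location, frequency) could be checked mechanically before review. Below is a minimal sketch of such a validator; the field names and the allowed frequency values are hypothetical, not a defined schema from this bug:

```python
import re

# Hypothetical set of allowed scheduling frequencies; illustrative only.
ALLOWED_FREQUENCIES = {"daily", "weekly", "monthly"}

def validate_request(request):
    """Return a list of problems with a dataset request dict (empty if OK)."""
    problems = []
    name = request.get("table_name", "")
    # Hive-friendly identifiers: lowercase letters, digits, underscores.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        problems.append("table_name must be lowercase letters, digits, underscores")
    location = request.get("table_location", "")
    if not location.startswith("s3://"):
        problems.append("table_location must be an s3:// path")
    if request.get("frequency") not in ALLOWED_FREQUENCIES:
        problems.append("frequency must be one of %s" % sorted(ALLOWED_FREQUENCIES))
    return problems
```

A check like this could run in the skeleton notebook itself, so a malformed request fails before it reaches a human reviewer.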
DESCRIPTION:
User-defined datasets/tables can make interactive querying in Redash significantly faster by offloading the hard work to a dedicated Spark job. Particularly complex queries on Presto take 30+ minutes and often fail because of high memory usage. These datasets act as domain-specific preprocessing on upstream datasets, possibly taking advantage of joins, filtering, or Python-based user-defined functions.
Individuals or teams who want a more performant Redash experience with detailed queries can easily create a dataset/table, defined by a scheduled ATMO Jupyter notebook. An opinionated notebook with configurable boilerplate will provide an easy development experience for those only experienced with SQL.
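Such an opinionated skeleton might look like the following. This is a sketch only: the upstream table (`main_summary`), the columns, and the S3 output location are hypothetical, and the Spark import is deferred so the SQL-building helper can be exercised without a cluster:

```python
# Sketch of a scheduled notebook job that materializes a derived dataset.
# Output bucket and path are placeholders, not a sanctioned location.
OUTPUT_PATH = "s3://example-bucket/custom_datasets/error_counts/v1"

def build_query(channel):
    """Build the SQL a user would customize; kept pure so it is easy to test."""
    return (
        "SELECT submission_date, COUNT(*) AS n_errors "
        "FROM main_summary "  # hypothetical upstream table
        "WHERE channel = '{}' "
        "GROUP BY submission_date"
    ).format(channel)

def run(channel="release"):
    # Deferred import: only needed when actually running on a Spark cluster.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("error_counts").getOrCreate()
    df = spark.sql(build_query(channel))
    # Overwrite keeps the scheduled job idempotent across reruns.
    df.write.mode("overwrite").parquet(OUTPUT_PATH)
```

In the envisioned template, a SQL-only user would edit little more than the query string and the output path.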
A separate guide or detailed notebook should explain the process of adding a new dataset. The user should be familiar with save directories in S3 and with monitoring their own data usage.
These notebooks should be signed off by someone on the Data Platform team, ensuring that the data is saved to a sane location. Adding the table to the parquet2hive service should follow review approval.
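The parquet2hive step, roughly, registers the reviewed Parquet output in S3 as a Hive external table. A simplified, hypothetical rendition of the DDL generation involved (the real parquet2hive tool derives the schema from the Parquet files themselves):

```python
def hive_ddl(table, s3_path, columns):
    """Render a CREATE EXTERNAL TABLE statement for Parquet data in S3.

    `columns` is a list of (name, hive_type) pairs. This is a toy stand-in
    for what the parquet2hive tool automates.
    """
    cols = ", ".join("{} {}".format(n, t) for n, t in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {} ({}) "
        "STORED AS PARQUET LOCATION '{}'".format(table, cols, s3_path)
    )
```

Keeping this step behind review approval means the reviewer gates both the S3 layout and the table name that becomes visible in Redash.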
Automating this process could require substantial changes to the data pipeline, such as improving integration with our job scheduling system or automatic dataset detection with Hive. A stricter expiration policy should apply to custom datasets, with the potential for a dataset to graduate into a supported one.
Updated•8 years ago
Assignee: nobody → amiyaguchi
Points: --- → 2
Priority: -- → P2
Comment 1•8 years ago
A preliminary tutorial has been created in [1]. It shows a simple example of querying with a sqlContext. The ATMO tutorial [2] is solid introductory material for working with Spark.
I want to avoid duplicating information, because we don't need multiple general tutorials. Instead, documentation supporting this bug should make the process around accessing data through Redash transparent and approachable.
[1] https://gist.github.com/acmiyaguchi/5bec2bf0e180025c1e1c1c72331db86a/d420aeb75ad604d4cb0926b59ade20addfaa8608
[2] https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark
Comment 2•8 years ago
I started a cookbook at [0] for this. You should expand on it there and add your example notebooks. The downside of using just SQL is that you can't have a user-defined dataset based on raw pings. That said, I do like the idea of having a nice template where users can just put in the few transformations they need.
[0] https://github.com/harterrt/telemetry-docs/blob/master/cookbooks/create_a_dataset.adoc
Updated•7 years ago
Component: Metrics: Pipeline → Documentation and Knowledge Repo (RTMO)
Product: Cloud Services → Data Platform and Tools
Updated•7 years ago
Points: 2 → 3
Summary: Define a process to create custom tables/dataset for Redash using Spark → Document review process for creating custom datasets
Updated•7 years ago
Priority: P2 → P3
Comment 3•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED