Closed
Bug 1243528
Opened 9 years ago
Closed 8 years ago
Store validated incoming main pings in Parquet files on S3
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P5)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: rvitillo, Unassigned)
References
Details
(Whiteboard: [loasis])
Raw "main" pings, which are stored as partitioned JSON blobs, are commonly read from Spark for exploratory analyses. We should explore storing main pings in Parquet files, which would reduce processing costs (no JSON parsing, no need to fetch many fields when only a few are read) and storage size through e.g. per-column dictionary compression.
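As a rough illustration of the idea (not the actual job), here is a minimal Spark sketch that reads a day of raw JSON pings and rewrites them as Parquet on S3; the bucket names and partition layout are hypothetical, and the snippet uses the current SparkSession API:

import org.apache.spark.sql.SparkSession

object MainPingsToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("main-pings-to-parquet")
      .getOrCreate()

    // Hypothetical source and destination; the real bucket layout differs.
    val rawJson    = "s3://telemetry-raw/main/submission_date=20160128/"
    val parquetOut = "s3://telemetry-parquet/main/v1/submission_date=20160128/"

    // Spark infers a schema from the JSON blobs once, then writes the rows
    // out column by column, so downstream readers that touch only a few
    // fields avoid both the JSON parsing and the unread columns.
    spark.read.json(rawJson)
      .write
      .mode("overwrite")
      .parquet(parquetOut)

    spark.stop()
  }
}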
This bug is going to require a separate schema from the one defined in Bug 1243523, since the format of some of the fields (e.g. histograms) has to change to guarantee the best possible performance at read time.
The job could be implemented either with Scala/Spark, which would require defining an Avro schema, or as a Heka filter, though Heka currently lacks Parquet support.
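To make the schema point concrete, here is a hypothetical sketch in Scala of a read-optimized row shape; the field names are illustrative, not the schema this bug would define. The idea is that each histogram becomes a typed structure rather than an embedded JSON string, so Parquet can encode and compress each column independently and readers never parse JSON:

// Illustrative only: a bucket -> count map instead of a JSON blob per histogram.
case class Histogram(
  bucketCount: Int,
  histogramType: Int,
  sum: Long,
  values: Map[Int, Long]
)

case class MainPingRow(
  documentId: String,
  clientId: String,
  submissionDate: String,
  appVersion: String,
  histograms: Map[String, Histogram]
)

// With Spark, a Dataset[MainPingRow] maps directly onto a Parquet schema:
//   import spark.implicits._
//   rows.toDS().write.parquet("s3://telemetry-parquet/main/v1/")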
As we produce more and more custom derived datasets, it's not yet entirely clear whether our "master dataset", i.e. telemetry-2, will remain as popular as it is right now. If most of our users end up working with derived datasets, we might as well drop this bug.
Updated•9 years ago
Whiteboard: [loasis]
Reporter
Comment 1•9 years ago
Note that this work, if completed, could accelerate the development and reduce the build time of derived datasets.
Reporter
Updated•9 years ago
Points: --- → 3
Priority: -- → P5
Reporter
Updated•9 years ago
Reporter
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Reporter
Comment 2•8 years ago
I am closing this bug as we have the main summary dataset now.
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard