Closed Bug 1243528 Opened 9 years ago Closed 8 years ago

Store validated incoming main pings in Parquet files on S3

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P5)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: rvitillo, Unassigned)

References

Details

(Whiteboard: [loasis])

Raw "main" pings are commonly read from Spark for exploratory analyses, which are stored in partitioned json blobs. We should explore the option of storing main pings in Parquet files which will reduce the processing costs (no json parsing, no need to fetch many fields if only few are read) and storage size through e.g. per-column dictionary compression. This Bug is going to require a separate schema from the one defined in Bug 1243523 since the format of some of the fields has to change (e.g. histograms) to guaruantee the best possible performances at read time. The job could either be implemented with Scala/Spark, which will require to define an Avro schema, or as a Heka filter, which currently lacks support for Parquet though. As we are producing more and more custom derived datasets, it's not yet entirely clear if our "master dataset", i.e. telemetry-2, is going to be as popular as it is right now. If most of our users are going to end up working with derived datasets we might as well decide to drop this Bug.
Depends on: 1243524
Whiteboard: [loasis]
Note that this work, if completed, could accelerate both the development and the build time of derived datasets.
Blocks: 1251580
Points: --- → 3
Priority: -- → P5
Blocks: 1255752
No longer blocks: 1251580
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
I am closing this bug as we have the main summary dataset now.
Product: Cloud Services → Cloud Services Graveyard