Closed Bug 1243528 Opened 9 years ago Closed 8 years ago

Store validated incoming main pings in Parquet files on S3

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P5)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: rvitillo, Unassigned)

References

Details

(Whiteboard: [loasis])

Raw "main" pings are commonly read from Spark for exploratory analyses, which are stored in partitioned json blobs. We should explore the option of storing main pings in Parquet files which will reduce the processing costs (no json parsing, no need to fetch many fields if only few are read) and storage size through e.g. per-column dictionary compression. This Bug is going to require a separate schema from the one defined in Bug 1243523 since the format of some of the fields has to change (e.g. histograms) to guaruantee the best possible performances at read time. The job could either be implemented with Scala/Spark, which will require to define an Avro schema, or as a Heka filter, which currently lacks support for Parquet though. As we are producing more and more custom derived datasets, it's not yet entirely clear if our "master dataset", i.e. telemetry-2, is going to be as popular as it is right now. If most of our users are going to end up working with derived datasets we might as well decide to drop this Bug.
Depends on: 1243524
Whiteboard: [loasis]
Note that this work, if completed, could accelerate both the development and the build time of derived datasets.
Blocks: 1251580
Points: --- → 3
Priority: -- → P5
Blocks: 1255752
No longer blocks: 1251580
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
I am closing this bug as we have the main summary dataset now.
Product: Cloud Services → Cloud Services Graveyard