Closed
Bug 1329023
Opened 8 years ago
Closed 8 years ago
ReDash errors when parsing parquet files output by SyncView
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: tcsc, Assigned: robotblake)
References
Details
(Whiteboard: [fxsync])
Attachments
(2 files)
Sometimes the parquet data files output by SyncView.scala can't be read by ReDash. Example error: https://irccloud.mozilla.com/pastebin/2A4o9YPP
I've been unable to replicate this outside of ReDash (e.g. using a Spark cluster), but it only happens on some queries inside ReDash (though it does happen on most non-trivial ones), and I'm not 100% certain that what I tried could have reproduced the issue in any case.
This takes over from bug 1314864, which I had been keeping open to indicate that "this doesn't quite work yet".
Updated•8 years ago
Component: Telemetry → Metrics: Pipeline
Product: Toolkit → Cloud Services
Comment 1•8 years ago
Also note that this has broken our existing Sync dashboard:
https://sql.telemetry.mozilla.org/dashboard/sync-error-dashboard?p_start_date=2016-10-01%2000:00:00.000&p_end_date=2016-10-16%2000:00:00.000&p_version=5&p_min_err=1
It also blocks an ongoing cross-functional effort to repair bookmark data, since we can't build the dashboards needed to measure our current corruption types and the impact of our upcoming repair efforts.
https://docs.google.com/document/d/1WDg71fGVTLyQ3jSqB2ZEdUepoZ36nasxvNYppxesnV0/edit
https://docs.google.com/spreadsheets/d/1iIvkEURziNFVZwzXCkRdt8oWQrDRl4sEr0YYikquMEI/edit#gid=0
Updated•8 years ago
Assignee: nobody → bimsland
Points: --- → 3
Priority: -- → P1
Assignee
Comment 2•8 years ago
Sorry for the slow reply. I'm still looking into this issue and have been hitting a bit of a wall: initially we thought it might be due to a schema mismatch, but that looks fine. The issue still appears in newer versions of Presto, so reaching out to the Presto developers to see if they have any ideas is another possibility.
:amiyaguchi is going to jump in and help with this as well, since he has more experience with Spark than I do.
Flags: needinfo?(amiyaguchi)
Comment 3•8 years ago
I've replicated the bug using Spark [1], so we can cross Presto off the suspect list.
[1] https://gist.github.com/acmiyaguchi/4610f1a19748908a7314feb0d550fcae
Flags: needinfo?(amiyaguchi)
Assignee
Comment 4•8 years ago
It looks like there are some corrupt (or invalid UTF-8) engine.name fields in a row, which is causing the failure. I'm attempting to write some code to get a dump of what is actually in those fields.
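The actual dump code would live in the Scala batch view, but for illustration (the helper and its name are hypothetical), rendering a suspect string field's raw bytes might look like:

```python
def hexdump_field(value: str) -> str:
    """Render the raw UTF-8 bytes of a string field for inspection.

    The 'surrogatepass' error handler keeps lone surrogates (which a
    strict UTF-8 encoder would reject) visible in the dump instead of
    raising immediately.
    """
    raw = value.encode("utf-8", errors="surrogatepass")
    return " ".join(f"{b:02x}" for b in raw)

# A valid ASCII engine name dumps cleanly:
print(hexdump_field("bookmarks"))  # 62 6f 6f 6b 6d 61 72 6b 73
# A lone surrogate shows up as the invalid UTF-8 sequence ed b0 80:
print(hexdump_field("\udc00"))
```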
Reporter
Comment 5•8 years ago
Worth noting that we only allow a small set of (entirely ASCII) names [0] through as of bug 1311557. The fix for that bug should be everywhere by now, but I'm willing to add a similar check to the SyncView Scala code if necessary.
That said, the old code was able to handle the same pings without issue, so I *suspect* it's not a problem with the input. (Shouldn't the JSON parsing fail if the UTF-8 is invalid anyway?)
[0]: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/telemetry.js#57-58
Comment 6•8 years ago
I've written a test case demonstrating that the JSON parser does fail on invalid UTF-8 [0]. In slightly more detail, I use the invalid lone surrogate "\udc00" as a test case. Java's default behavior lets "\udc00" survive parsing, because strings are UTF-16 internally, so I constructed a more realistic case of an invalid UTF-8 byte string. The latter case fails at parsing, which should be caught in the batch view logic.
The underlying bug can still be in the batch view script, but maybe the problem is more complicated than a corrupted string. We'll have to dig deeper and reverse engineer the problematic fields in the parquet binary.
[0] https://github.com/mozilla/telemetry-batch-view/compare/mozilla:7139c4a...acmiyaguchi:34fb269
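The linked test case is Scala; as a sketch only, the same two scenarios can be reproduced in Python, whose `json` module and UTF-8 codec behave analogously to the JVM case described above (escaped lone surrogates survive parsing of already-decoded text, but invalid raw bytes fail to decode):

```python
import json

# Case 1: an escaped lone surrogate survives JSON parsing, because the
# parser operates on already-decoded text (UTF-16 on the JVM).
parsed = json.loads('{"name": "\\udc00"}')
assert parsed["name"] == "\udc00"

# ...but the resulting string cannot be encoded as valid UTF-8.
try:
    parsed["name"].encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable

# Case 2: a raw byte string containing invalid UTF-8 fails before the
# JSON parser ever sees it -- the failure the batch view should catch.
try:
    json.loads(b'{"name": "\xed\xb0\x80"}'.decode("utf-8"))
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert not decoded
```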
Comment 7•8 years ago
As a small update, I have a simplified query that is broken because of this bug [0]: a very small Spark query counting the number of engine names. This doesn't tell us anything new; it just makes the failure easy to replicate.
[0] https://gist.github.com/acmiyaguchi/4610f1a19748908a7314feb0d550fcae
Comment 8•8 years ago
I am suspicious of the parquet-mr (reference implementation) library.
Blake and I have been testing the invalid partition locally using the tooling provided by parquet-tools [0]. Running the dump utility causes parquet to blow up; more specifically, the following throws an exception.
> java -jar target/parquet-tools-1.6.0rc3-SNAPSHOT.jar \
> dump --columns engines.list.element.name --no-color \
> part-r-00003-dc7b201d-2055-4b95-bcdd-17bc325fb6e7.snappy.parquet
I also ran the file through the C++ implementation [1], using the parquet-scan and parquet_reader tools. This can be replicated with the following command, where column 14 refers to engines.name:
> ./build/latest/parquet-scan --column=14 part-r-00003-dc7b201d-2055-4b95-bcdd-17bc325fb6e7.snappy.parquet
The C++ tool decoded the parquet file without issues, which suggests the bug could be in the parquet-mr decoder (or the encoder, or possibly the C++ implementation).
The next steps will involve debugging/dumping the process where the exception occurs. I want to add this edge case to parquet's test suite (if it makes sense). I also want to regenerate the SyncView dataset for 20161218, maybe running it again will help.
[0] https://github.com/apache/parquet-mr/tree/master/parquet-tools
[1] https://github.com/apache/parquet-cpp
Comment 9•8 years ago
Rerunning the SyncView job for 20161218 seems to fix the issue on the test case. The job fails if a dataset already exists, which makes rerunning the backfill from Airflow more difficult, so I will be manually replacing problematic datasets with regenerated data to validate that this mitigates the issue.
We still don't know the root cause of this issue. The build of parquet-mr that generated the original corrupted data is at commit 32c46643845ea8a705c35d4ec8fc654cc8ff816d.
Comment 10•8 years ago
I wrote the date wrong in comment 8; it should be 20161208. Running the SyncView job again does not fix the issue.
Comment 11•8 years ago
Comment 12•8 years ago
We can try binary search to pinpoint the raw ping(s) triggering this failure. One way to do that is to modify the summaries method [1] of the Dataset class to read just a segment of the files from S3. Once we've found a raw file on S3 that causes the failure, it should be easy to determine which pings in that file are the problematic ones.
[1] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/heka/Dataset.scala#L31
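The bisection idea above can be sketched as follows. This is an illustrative outline, not project code: the `fails` predicate stands in for a run of the SyncView job over a subset of the S3 files, and all names are hypothetical.

```python
def find_failing_file(files, fails):
    """Bisect a list of input files to isolate one that triggers the failure.

    `fails(subset)` should run the job over just that subset of files and
    report whether it crashes. Assumes the failure is deterministic and
    that at least one bad file is present in `files`.
    """
    lo, hi = 0, len(files)
    # Invariant: a bad file lies somewhere in files[lo:hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # If the first half reproduces the crash, narrow to it;
        # otherwise a bad file must be in the second half.
        if fails(files[lo:mid]):
            hi = mid
        else:
            lo = mid
    return files[lo]
```

With ~O(log n) job runs this narrows thousands of S3 objects down to a single file, after which the problematic pings can be inspected directly.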
Comment 13•8 years ago
I've found that repartitioning the data before writing to parquet makes the dataset readable again. An interesting side effect is that there are now 0.05% more rows than there were previously.
I am writing a script to fix the invalid dates as they crop up in the original query, and will make the repair notebook available when I'm done with it.
Comment 14•8 years ago
Comment 15•8 years ago
I have a notebook to identify broken partitions [0]. Once the pull request in comment 14 is merged, rerunning a troublesome day should fix the problem.
I think the underlying problem lies somewhere between the RDD-to-DataFrame conversion and the collect-to-parquet encoding logic. Repartitioning shuffles and reorganizes the data in memory, which may fix its internal representation. I have no definitive proof, only a workaround.
[0] https://gist.github.com/acmiyaguchi/e319fc2af5c61bd5229a3e13bcde67c1
Comment 16•8 years ago
I reran the dashboard this morning and everything seems to be good.
This dashboard runs fine now:
https://sql.telemetry.mozilla.org/dashboard/sync-error-dashboard?p_start_date=2017-01-01%2000:00:00.000&p_end_date=2017-01-30%2000:00:00.000&p_version=5&p_min_err=1
I can even tell that now that we are collecting data from release, I need to rethink how I plot my graphs. :-/
Comment 17•8 years ago
:robotblake ran the query again on presto, and now the dashboard seems to be in working order according to :adavis in comment 16. This bug was solved by repartitioning the data in TelemetryBatchView::SyncView before writing to s3. Bug 1333896 has been filed to target the underlying issue identified here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Cloud Services → Cloud Services Graveyard