Closed
Bug 1329023
Opened 8 years ago
Closed 8 years ago
ReDash errors when parsing parquet files output by SyncView
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: tcsc, Assigned: robotblake)
References
Details
(Whiteboard: [fxsync])
Attachments
(2 files)
Sometimes the parquet data files output by SyncView.scala can't be read by ReDash. Example error: https://irccloud.mozilla.com/pastebin/2A4o9YPP
I've been unable to replicate this outside of ReDash (e.g. using a Spark cluster), but it only happens on some queries inside ReDash (though it does happen on most non-trivial ones), and I'm not 100% certain that what I tried could have reproduced the issue in any case.
This takes over from bug 1314864, which I had been keeping open to indicate that "this doesn't quite work yet".
Updated•8 years ago
Component: Telemetry → Metrics: Pipeline
Product: Toolkit → Cloud Services
Comment 1•8 years ago
Also note that this has broken our existing Sync dashboard:
https://sql.telemetry.mozilla.org/dashboard/sync-error-dashboard?p_start_date=2016-10-01%2000:00:00.000&p_end_date=2016-10-16%2000:00:00.000&p_version=5&p_min_err=1
It also blocks an ongoing cross-functional effort to repair bookmark data, since we can't build the dashboards needed to measure our current corruption types and the impact of our upcoming repair efforts.
https://docs.google.com/document/d/1WDg71fGVTLyQ3jSqB2ZEdUepoZ36nasxvNYppxesnV0/edit
https://docs.google.com/spreadsheets/d/1iIvkEURziNFVZwzXCkRdt8oWQrDRl4sEr0YYikquMEI/edit#gid=0
Updated•8 years ago
Assignee: nobody → bimsland
Points: --- → 3
Priority: -- → P1
Assignee
Comment 2•8 years ago
Sorry for the slow reply. I'm still looking into this issue and have been hitting a bit of a wall: initially we thought it might be due to a schema mismatch, but that looks fine. The issue still appears in newer versions of Presto, so reaching out to the Presto developers to see if they have any ideas is another possibility.
:amiyaguchi is going to jump in and help with this as well, since he has more experience with Spark than I do.
Flags: needinfo?(amiyaguchi)
Comment 3•8 years ago
I've replicated the bug using Spark [1], so we can cross Presto off the suspect list.
[1] https://gist.github.com/acmiyaguchi/4610f1a19748908a7314feb0d550fcae
Flags: needinfo?(amiyaguchi)
Assignee
Comment 4•8 years ago
It looks like there are some corrupt (or invalid UTF-8) engine.name fields in a row, which is causing the failure. I'm attempting to write some code to get a dump of what is actually in those fields.
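The actual dump code would live in the Scala batch view, but for illustration (the helper and its name are hypothetical), rendering a suspect string field's raw bytes might look like:

```python
def hexdump_field(value: str) -> str:
    """Render the raw UTF-8 bytes of a string field for inspection.

    The 'surrogatepass' error handler keeps lone surrogates (which a
    strict UTF-8 encoder would reject) visible in the dump instead of
    raising immediately.
    """
    raw = value.encode("utf-8", errors="surrogatepass")
    return " ".join(f"{b:02x}" for b in raw)

# A valid ASCII engine name dumps cleanly:
print(hexdump_field("bookmarks"))  # 62 6f 6f 6b 6d 61 72 6b 73
# A lone surrogate shows up as the invalid UTF-8 sequence ed b0 80:
print(hexdump_field("\udc00"))
```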
Reporter
Comment 5•8 years ago
Worth noting that we only allow a small set of (entirely ASCII) names [0] through as of bug 1311557. The fix for that bug should be everywhere by now, but I'm willing to add a similar check to the SyncView Scala code if necessary.
That said, the old code was able to handle the same pings without issue, so I *suspect* it's not a problem with the input. (Shouldn't the JSON parsing fail if the UTF-8 is invalid anyway?)
[0]: https://dxr.mozilla.org/mozilla-central/source/services/sync/modules/telemetry.js#57-58
Comment 6•8 years ago
I've written a test case demonstrating that the JSON parser does fail on invalid UTF-8 [0]. In slightly more detail, I use the invalid lone surrogate "\udc00" as a test case. Java's default behavior lets "\udc00" survive parsing, because strings are UTF-16 internally, so I constructed a more realistic case of an invalid UTF-8 byte string. The latter case fails at parsing, which should be caught in the batch view logic.
The underlying bug can still be in the batch view script, but maybe the problem is more complicated than a corrupted string. We'll have to dig deeper and reverse engineer the problematic fields in the parquet binary.
[0] https://github.com/mozilla/telemetry-batch-view/compare/mozilla:7139c4a...acmiyaguchi:34fb269
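The linked test case is Scala; as a sketch only, the same two scenarios can be reproduced in Python, whose `json` module and UTF-8 codec behave analogously to the JVM case described above (escaped lone surrogates survive parsing of already-decoded text, but invalid raw bytes fail to decode):

```python
import json

# Case 1: an escaped lone surrogate survives JSON parsing, because the
# parser operates on already-decoded text (UTF-16 on the JVM).
parsed = json.loads('{"name": "\\udc00"}')
assert parsed["name"] == "\udc00"

# ...but the resulting string cannot be encoded as valid UTF-8.
try:
    parsed["name"].encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
assert not encodable

# Case 2: a raw byte string containing invalid UTF-8 fails before the
# JSON parser ever sees it -- the failure the batch view should catch.
try:
    json.loads(b'{"name": "\xed\xb0\x80"}'.decode("utf-8"))
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert not decoded
```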
Comment 7•8 years ago
As a small update, I have a simplified query that is broken because of this bug [0]: a very small Spark query counting the number of engine names. This doesn't tell us anything new; it just makes the failure easy to replicate.
[0] https://gist.github.com/acmiyaguchi/4610f1a19748908a7314feb0d550fcae
Comment 8•8 years ago
I am suspicious of the parquet-mr (reference implementation) library.
Blake and I have been testing the invalid partition locally using the tooling provided by parquet-tools [0]. Running the dump utility causes parquet to blow up; more specifically, the following throws an exception.
> java -jar target/parquet-tools-1.6.0rc3-SNAPSHOT.jar \
> dump --columns engines.list.element.name --no-color \
> part-r-00003-dc7b201d-2055-4b95-bcdd-17bc325fb6e7.snappy.parquet
I also ran the file through the C++ implementation [1], using the parquet-scan and parquet_reader tools. This can be replicated with the following command, where column 14 refers to engines.name:
> ./build/latest/parquet-scan --column=14 part-r-00003-dc7b201d-2055-4b95-bcdd-17bc325fb6e7.snappy.parquet
The C++ tool decoded the parquet file without issues, which suggests the bug could be in the parquet-mr decoder (or the encoder, or possibly the C++ implementation).
The next steps will involve debugging/dumping the process where the exception occurs. I want to add this edge case to parquet's test suite (if it makes sense). I also want to regenerate the SyncView dataset for 20161218, maybe running it again will help.
[0] https://github.com/apache/parquet-mr/tree/master/parquet-tools
[1] https://github.com/apache/parquet-cpp
Comment 9•8 years ago
Rerunning the SyncView job for 20161218 seems to fix the issue on the test case. The job fails if a dataset already exists, which makes rerunning the backfill from Airflow more difficult, so I will be manually replacing problematic datasets with regenerated data to validate that this mitigates the issue.
We still don't know the root cause of this issue. The build of parquet-mr that generated the original corrupted data is at commit 32c46643845ea8a705c35d4ec8fc654cc8ff816d.
Comment 10•8 years ago
I wrote the date wrong in comment 8; it should be 20161208. Running the SyncView job again does not fix the issue.
Comment 11•8 years ago
Comment 12•8 years ago
We can try binary search to pinpoint the raw ping(s) triggering this failure. One way to do that is to modify the summaries method [1] of the Dataset class to read just a segment of the files from S3. Once we've found a raw file on S3 that causes the failure, it should be easy to determine which pings in that file are the problematic ones.
[1] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/heka/Dataset.scala#L31
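The bisection idea above can be sketched as follows. This is an illustrative outline, not project code: the `fails` predicate stands in for a run of the SyncView job over a subset of the S3 files, and all names are hypothetical.

```python
def find_failing_file(files, fails):
    """Bisect a list of input files to isolate one that triggers the failure.

    `fails(subset)` should run the job over just that subset of files and
    report whether it crashes. Assumes the failure is deterministic and
    that at least one bad file is present in `files`.
    """
    lo, hi = 0, len(files)
    # Invariant: a bad file lies somewhere in files[lo:hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # If the first half reproduces the crash, narrow to it;
        # otherwise a bad file must be in the second half.
        if fails(files[lo:mid]):
            hi = mid
        else:
            lo = mid
    return files[lo]
```

With ~O(log n) job runs this narrows thousands of S3 objects down to a single file, after which the problematic pings can be inspected directly.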
Comment 13•8 years ago
I've found that repartitioning the data before writing to parquet makes the dataset readable again. An interesting side effect is that there are now 0.05% more rows than there were previously.
I am writing a script to fix the invalid dates as they crop up in the original query, and will make the repair notebook available when I'm done with it.
Comment 14•8 years ago
Comment 15•8 years ago
I have a notebook to identify broken partitions [0]. Once the pull request in comment 14 is merged, rerunning a troublesome day should fix the problem.
I think the underlying problem lies somewhere between the RDD-to-DataFrame conversion and the collect-to-parquet encoding logic. Repartitioning shuffles and reorganizes the data in memory, which may fix its internal representation. I have no definitive proof, only a workaround.
[0] https://gist.github.com/acmiyaguchi/e319fc2af5c61bd5229a3e13bcde67c1
Comment 16•8 years ago
I reran the dashboard this morning and everything seems to be good.
This dashboard runs fine now:
https://sql.telemetry.mozilla.org/dashboard/sync-error-dashboard?p_start_date=2017-01-01%2000:00:00.000&p_end_date=2017-01-30%2000:00:00.000&p_version=5&p_min_err=1
I can even tell that now that we are collecting data from release, I need to rethink how I plot my graphs. :-/
Comment 17•8 years ago
:robotblake ran the query again on presto, and now the dashboard seems to be in working order according to :adavis in comment 16. This bug was solved by repartitioning the data in TelemetryBatchView::SyncView before writing to s3. Bug 1333896 has been filed to target the underlying issue identified here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Cloud Services → Cloud Services Graveyard