Closed Bug 1149666 Opened 10 years ago Closed 10 years ago

Export recent nightly Telemetry V4 data by clientId

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: mreid, Mentored)

References

Details

We want to extract a similar data set to bug 1149664 for comparison purposes. Similar to bug 1140094, but with updated data, it should include:
- Document Type: "main"
- Application Name: "Firefox"
- Channel: "nightly"
- App Version: 39.0a1
- Sample: 100% (disk-space permitting)
- Date range: forever (effectively since Feb. 26th)

Output will be in the same form as bug 1140094 unless otherwise specified.
Blocks: 1134661
Yep, the format was fine. Less than a 100% sample is fine too -- as discussed over email, a couple thousand records (for local processing) is sufficient at this stage. Thanks, Mark!
Assignee: nobody → trink
Priority: -- → P2
Can we export it into a file structure so that it's easy to query individual IDs? e.g. <clientID>/dump.xz That will make it really easy to combine these with FHR data.
I added a tarball to peach-gw at ~mreid/bug1149666.data.tar.gz. It contains a sample of ~1600 clientIds, split up into a separate directory per client. Within each file, the records are formatted as described in bug 1140094. It may still require some post-processing to get it into a form that works for you, since several of the important pieces of the document are now split out of the main payload into separate fields. Benjamin, please take a look at this file and see if it suits your needs. I will pass the export code along to trink if any further changes to the format are needed.
Flags: needinfo?(benjamin)
mreid: permissions error on peach trying to read that new data extract:

tar (child): /home/mreid/bug1149666.data.tar.gz: Cannot open: Permission denied
Sorry, I forgot to update the permissions - I updated them last night, should be readable now.
(In reply to Mark Reid [:mreid] from comment #3)
> It may still require some post-processing to get it into a form that works
> for you, since several of the important pieces of the document are now split
> out of the main payload into separate fields.

Mark, can you elaborate on the details here? I'm now finding v4 submissions that seem to be missing ['payload']['info'], so my little v2/v4 comparison script is breaking. Is this expected as part of the splitting you mention?
Oh, I think I see what happened... in some of these files (why not all? what's going on?), a lot of content has moved from the document structure itself to the "metadata" section (or whatever it's called), and what was found at e.g. [1]['payload']['info'] is now at [0]['payload.info']. Unfortunately, the data in this dump also contains a million extra escape characters, and the JSON that has been moved to the new locations is not parsed when the rest of the document is parsed.

Mark, can you double-check the script that dumps the data to see if it's doing something that adds a bunch of backslashes everywhere?

Also, it's pretty important for the v2/v4 comparison work that the data layout not change -- the comparison process will grind to a halt if the data is a moving target. If the data is disassembled upstream in the pipeline, it really needs to be reassembled in the original, stable format for processing and analysis. I will make a note to that effect in the per-clientId API bug.
Upon more evaluation, I'm sorry to say that this data set is not really usable for my purposes of comparing v2 and v4. :-( I have the following needs:

1) I need the pings to be formatted as specified in the discussions we had in Jan and Feb about the v4 packet structure, because I need to be able to refer to the docs.
2) I need the pings to be parse-able as JSON directly, without backslash-stripping hacks, because this is hacky and error-prone.
3) I need the ping data to be consistent across all the subsessions I'm going to examine (at this point, even subsessions belonging to the same clientId have different sets of fields in different places), because I don't want to have to maintain a bunch of special-case rules about where to look for data (again: hacky and error-prone).

This reassembly needs to happen upstream of the Metrics Team -- I could try to do it, but that adds an extra layer of code to mistrust. And I guess it really fits within the requirements of the per-client API, so maybe it will have to wait until that is ready?

In any case, my work on v2/v4 parity (and therefore the Metrics Team's ability to sign off on the change) is blocked on getting a new extract of data that is grouped by client and consistently reassembled. I can file a new bug if needed.
(In reply to brendan c from comment #8)
> Upon more evaluation, I'm sorry to say that this data set is not really
> usable for my purposes of comparing v2 and v4. :-(
>
> In any case, my work on v2/v4 parity (and therefore, the Metric Teams
> ability to sign off on the change) is blocked on getting a new extract of
> data that is grouped by client and consistently reassembled. I can file a
> new bug if needed.

This is the right place to sort out a usable format - I knew the export I provided had some rough edges, I just didn't want to do extra work if it was suitable as-is.

The "backslash-filled" fields are JSON-encoded strings, so you can handle them by JSON-parsing the string value. This can be done as part of the export to save you that step, and we can reconstitute the payload in its original format. Stay tuned for an updated export.
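The "parse the string value" fix described above amounts to one extra `json.loads` pass over the string-valued fields (the extra backslashes are just the escaping of JSON nested inside JSON). A minimal sketch -- the `reassemble` helper and the exact record shape are illustrative assumptions, not the pipeline's actual code:

```python
import json


def reassemble(record):
    """Parse any string-valued fields that hold JSON-encoded text.

    Fields such as "payload.info" arrive as escaped JSON strings;
    calling json.loads on the value itself recovers the nested
    dict/list. Plain strings that aren't valid JSON are kept as-is.
    """
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            try:
                out[key] = json.loads(value)
            except ValueError:
                out[key] = value  # ordinary string field, keep it
        else:
            out[key] = value
    return out


# Example: a field double-encoded the way the dump above was.
raw = {"payload.info": "{\"reason\": \"shutdown\", \"sessionId\": \"abc\"}"}
fixed = reassemble(raw)  # fixed["payload.info"] is now a dict
```

Doing this once in the export, as proposed, is cleaner than every downstream consumer re-implementing the same unwrapping.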
Please check the export code into the data-pipeline repo.
Flags: needinfo?(mreid)
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(mreid)
Flags: needinfo?(benjamin)
Resolution: --- → FIXED
Re-assigned to Mark to actually run it.
Assignee: trink → mreid
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated data file at ~mreid/bug1149666.20150413.tar.gz. Please let me know if the data format looks OK.
Flags: needinfo?(bcolloran)
Perfect Mark, thank you.
Flags: needinfo?(bcolloran)
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
mreid, is the sampling here random or are you using a heuristic like "all the clientIDs that end in 01"? If there's a heuristic, I could pull a matching sample of FHR records fairly cheaply.
Flags: needinfo?(mreid)
Flags: needinfo?(mreid)
Product: Cloud Services → Cloud Services Graveyard