Closed
Bug 1149666
Opened 10 years ago
Closed 10 years ago
Export recent nightly Telemetry V4 data by clientId
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Assigned: mreid, Mentored)
References
Details
We want to extract a similar data set to bug 1149664 for comparison purposes.
Similar to bug 1140094, but with updated data, it should include:
- Document Type: "main"
- Application Name: "Firefox"
- Channel: "nightly"
- App Version: 39.0a1
- Sample: 100% (disk-space permitting)
- Date range: forever (effectively since Feb. 26th)
Output will be in the same form as bug 1140094 unless otherwise specified.
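As a sketch, the selection criteria above amount to a per-ping filter like the following (field names such as `docType` and `appUpdateChannel` are illustrative guesses at the submission metadata, not a documented schema from this bug):

```python
def matches(ping):
    """Illustrative filter for the extract criteria listed above.

    Assumes each ping is a dict carrying its routing dimensions at the
    top level; the key names here are assumptions, not confirmed by
    this bug.
    """
    return (
        ping.get("docType") == "main"
        and ping.get("appName") == "Firefox"
        and ping.get("appUpdateChannel") == "nightly"
        and ping.get("appVersion") == "39.0a1"
    )
```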
Yep, the format was fine. Less than a 100% sample is fine too; as discussed over email, a couple thousand records (for local processing) are sufficient at this stage.
Thanks, Mark!
Updated•10 years ago
Assignee: nobody → trink
Priority: -- → P2
Comment 2•10 years ago
Can we export it into a file structure so that it's easy to query individual IDs? e.g.
<clientID>/dump.xz
That will make it really easy to combine these with FHR data.
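The proposed layout can be sketched as a small loader (a hypothetical helper; the `<clientID>/dump.xz` layout is from this comment, and the assumption that each dump holds one JSON document per line is mine, not confirmed by the bug):

```python
import json
import lzma
import os

def load_client(base_dir, client_id):
    """Load all records for one client from a <clientId>/dump.xz layout.

    Assumes newline-delimited JSON inside each xz file; both the layout
    and the format are assumptions based on the proposal above.
    """
    path = os.path.join(base_dir, client_id, "dump.xz")
    with lzma.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With a layout like this, pulling a single clientId for comparison with FHR data is one directory lookup rather than a scan of the whole export.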
Assignee
Comment 3•10 years ago
I added a tarball to peach-gw in ~mreid/bug1149666.data.tar.gz
It contains a sample of ~1600 clientIds, split up into a separate directory per client. Within each file, the records are formatted as described in bug 1140094.
It may still require some post-processing to get it into a form that works for you, since several of the important pieces of the document are now split out of the main payload into separate fields.
Benjamin, please take a look at this file and see if it suits your needs.
I will pass the export code along to trink if any further changes to the format are needed.
Flags: needinfo?(benjamin)
mreid: permissions error on peach trying to read that new data extract
tar (child): /home/mreid/bug1149666.data.tar.gz: Cannot open: Permission denied
Assignee
Comment 5•10 years ago
Sorry, I forgot to update the permissions. I updated them last night; it should be readable now.
(In reply to Mark Reid [:mreid] from comment #3)
> It may still require some post-processing to get it into a form that works
> for you, since several of the important pieces of the document are now split
> out of the main payload into separate fields.
Mark, can you elaborate on the details here? I'm now finding v4 submissions that seem to be missing ['payload']['info'], so my little v2/v4 comparison script is breaking. Is this expected as part of the splitting you mention?
Oh, I think I see what happened... in some of these files (why not all? what's going on?), a lot of stuff has moved from the document structure itself to the "metadata" section (or whatever), and what was found at e.g. [1]['payload']['info'] is now at [0]['payload.info'].
Unfortunately, the data in this dump also contains a million extra escape characters, and the JSON that has been moved to the new locations is not parsed when the rest of the document is parsed. Mark, can you double-check the script that dumps the data to see if it's doing something that adds a bunch of backslashes everywhere?
Also, it's pretty important for the v2/v4 comparison work that the data layout not change; the comparison process will grind to a halt if the data is a moving target. If the data is disassembled upstream in the pipeline, it really needs to be reassembled in the original, stable format for processing and analysis. I will make a note to that effect in the per-clientId API bug.
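The layout shift described above can be sketched as an accessor that tries the flattened location first and falls back to the original nesting (the field names and list positions are inferred from the comments in this bug, not from a documented schema):

```python
import json

def payload_info(docs):
    """Return the payload/info section from either observed layout.

    docs is the list of records for one submission. In the new layout
    the flattened 'payload.info' key may hold a JSON-encoded string,
    so decode it if needed; in the old layout the data sits nested
    under the second record. Both layouts are assumptions based on
    this bug's comments.
    """
    flat = docs[0].get("payload.info")
    if flat is not None:
        return json.loads(flat) if isinstance(flat, str) else flat
    # Fall back to the original nested location.
    return docs[1]["payload"]["info"]
```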
Upon more evaluation, I'm sorry to say that this data set is not really usable for my purposes of comparing v2 and v4. :-(
I have the following needs
1) I need the pings to be formatted as specified in the discussions we had in Jan and Feb about the v4 packet structure, because I need to be able to refer to the docs.
2) I need the pings to be parseable as JSON directly, without backslash-stripping hacks, because this is hacky and error-prone.
3) I need the ping data to be consistent across all the subsessions I'm going to examine (at this point, even subsessions belonging to the same clientId have different sets of fields in different places), because I don't want to have to maintain a bunch of special case rules about where to look for data (again: hacky and error-prone).
This reassembly needs to happen upstream of the Metrics Team -- I could try to do it, but that adds an extra layer of code to mistrust. And I guess it really fits within the requirements of the per-client API, so maybe it will have to wait until that is ready?
In any case, my work on v2/v4 parity (and therefore, the Metrics Team's ability to sign off on the change) is blocked on getting a new extract of data that is grouped by client and consistently reassembled. I can file a new bug if needed.
Assignee
Comment 9•10 years ago
(In reply to brendan c from comment #8)
> Upon more evaluation, I'm sorry to say that this data set is not really
> usable for my purposes of comparing v2 and v4. :-(
>
> In any case, my work on v2/v4 parity (and therefore, the Metrics Team's
> ability to sign off on the change) is blocked on getting a new extract of
> data that is grouped by client and consistently reassembled. I can file a
> new bug if needed.
This is the right place to sort out a usable format - I knew the export I provided had some rough edges, I just didn't want to do extra work if it was suitable as-is.
The "backslash-filled" fields are JSON-encoded strings, so you can handle them by JSON-parsing the string value. This can be done as part of the export to save you that step, and we can reconstitute the payload in its original format. Stay tuned for an updated export.
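A sketch of that handling, assuming the escaped fields are top-level string values holding serialized JSON (which keys are affected is not specified in this bug, so this tries any string that looks like a JSON document):

```python
import json

def expand_nested_json(record):
    """Decode string fields that contain JSON-encoded documents.

    The extra backslashes come from JSON-in-JSON encoding; parsing the
    string value a second time removes them. No specific keys are
    assumed: any string starting with '{' or '[' is tried.
    """
    for key, value in list(record.items()):
        if isinstance(value, str) and value[:1] in ("{", "["):
            try:
                record[key] = json.loads(value)
            except ValueError:
                pass  # not actually JSON; leave it alone
    return record
```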
Comment 10•10 years ago
Please check the export code into the data-pipeline repo.
Flags: needinfo?(mreid)
Comment 11•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(mreid)
Flags: needinfo?(benjamin)
Resolution: --- → FIXED
Comment 12•10 years ago
Re-assigned to Mark to actually run it.
Assignee: trink → mreid
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 13•10 years ago
Updated data file at ~mreid/bug1149666.20150413.tar.gz
Please let me know if the data format looks OK.
Flags: needinfo?(bcolloran)
Assignee
Updated•10 years ago
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 15•10 years ago
mreid, is the sampling here random or are you using a heuristic like "all the clientIDs that end in 01"? If there's a heuristic, I could pull a matching sample of FHR records fairly cheaply.
Updated•10 years ago
Flags: needinfo?(mreid)
Assignee
Comment 16•10 years ago
The heuristic is: crc32(clientId) % 100 == 1
Code reference:
https://github.com/mozilla-services/data-pipeline/blob/master/heka/sandbox/decoders/extract_telemetry_dimensions.lua#L95
Flags: needinfo?(mreid)
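For reference, the same check in Python (a sketch; it assumes the linked Lua decoder computes a standard CRC-32 over the UTF-8 bytes of the clientId, which is worth verifying against the source before relying on it):

```python
import zlib

def in_sample(client_id, modulus=100, target=1):
    """Mirror the crc32(clientId) % 100 == 1 sampling heuristic.

    Assumes a standard CRC-32 (as in zlib) over the clientId's UTF-8
    bytes; the exact CRC variant used by the Lua decoder is an
    assumption here.
    """
    return zlib.crc32(client_id.encode("utf-8")) % modulus == target
```

A matching FHR sample could then be pulled by applying the same predicate to FHR clientIDs.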
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard