Closed Bug 1149666 Opened 10 years ago Closed 10 years ago

Export recent nightly Telemetry V4 data by clientId

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: mreid, Mentored)

References

Details

We want to extract a similar data set to bug 1149664 for comparison purposes. Similar to bug 1140094, but with updated data, it should include:
- Document Type: "main"
- Application Name: "Firefox"
- Channel: "nightly"
- App Version: 39.0a1
- Sample: 100% (disk-space permitting)
- Date range: forever (effectively since Feb. 26th)

Output will be in the same form as bug 1140094 unless otherwise specified.
Blocks: 1134661
Yep, the format was fine. Less than a 100% sample is fine too -- as discussed over email, a couple thousand records (for local processing) is sufficient at this stage. Thanks, Mark!
Assignee: nobody → trink
Priority: -- → P2
Can we export it into a file structure so that it's easy to query individual IDs? e.g. <clientID>/dump.xz That will make it really easy to combine these with FHR data.
I added a tarball to peach-gw at ~mreid/bug1149666.data.tar.gz. It contains a sample of ~1600 clientIds, split up into a separate directory per client. Within each file, the records are formatted as described in bug 1140094. It may still require some post-processing to get it into a form that works for you, since several of the important pieces of the document are now split out of the main payload into separate fields. Benjamin, please take a look at this file and see if it suits your needs. I will pass the export code along to trink if any further changes to the format are needed.
Flags: needinfo?(benjamin)
mreid: permissions error on peach trying to read that new data extract:

tar (child): /home/mreid/bug1149666.data.tar.gz: Cannot open: Permission denied
Sorry, I forgot to update the permissions - I updated them last night, should be readable now.
(In reply to Mark Reid [:mreid] from comment #3)
> It may still require some post-processing to get it into a form that works
> for you, since several of the important pieces of the document are now split
> out of the main payload into separate fields.

Mark, can you elaborate on the details here? I'm now finding v4 submissions that seem to be missing ['payload']['info'], so my little v2/v4 comparison script is breaking. Is this expected as part of the splitting you mention?
Oh, I think I see what happened... in some of these files (why not all? what's going on?), a lot of content has moved from the document structure itself to the "metadata" section (or whatever it's called), and what was found at e.g. [1]['payload']['info'] is now at [0]['payload.info']. Unfortunately, the data in this dump also contains a million extra escape characters, and the JSON that has been moved to the new locations is not parsed when the rest of the document is parsed.

Mark, can you double-check the script that dumps the data to see if it's doing something that adds a bunch of backslashes everywhere?

Also, it's pretty important for the v2/v4 comparison work that the data layout not change -- the comparison process will grind to a halt if the data is a moving target. If the data is disassembled upstream in the pipeline, it really needs to be reassembled in the original, stable format for processing and analysis. I will make a note to that effect in the per-clientId API bug.
Upon more evaluation, I'm sorry to say that this data set is not really usable for my purposes of comparing v2 and v4. :-( I have the following needs:

1) I need the pings to be formatted as specified in the discussions we had in Jan and Feb about the v4 packet structure, because I need to be able to refer to the docs.
2) I need the pings to be parse-able as JSON directly, without backslash-stripping hacks, because this is hacky and error-prone.
3) I need the ping data to be consistent across all the subsessions I'm going to examine (at this point, even subsessions belonging to the same clientId have different sets of fields in different places), because I don't want to have to maintain a bunch of special-case rules about where to look for data (again: hacky and error-prone).

This reassembly needs to happen upstream of the Metrics Team -- I could try to do it, but that adds an extra layer of code to mistrust. And I guess it really fits within the requirements of the per-client API, so maybe it will have to wait until that is ready?

In any case, my work on v2/v4 parity (and therefore the Metrics Team's ability to sign off on the change) is blocked on getting a new extract of data that is grouped by client and consistently reassembled. I can file a new bug if needed.
(In reply to brendan c from comment #8)
> Upon more evaluation, I'm sorry to say that this data set is not really
> usable for my purposes of comparing v2 and v4. :-(
>
> In any case, my work on v2/v4 parity (and therefore, the Metric Teams
> ability to sign off on the change) is blocked on getting a new extract of
> data that is grouped by client and consistently reassembled. I can file a
> new bug if needed.

This is the right place to sort out a usable format - I knew the export I provided had some rough edges, I just didn't want to do extra work if it was suitable as-is.

The "backslash-filled" fields are JSON-encoded strings, so you can handle them by JSON-parsing the string value. This can be done as part of the export to save you that step, and we can reconstitute the payload in its original format. Stay tuned for an updated export.
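The "parse the string value" fix described above amounts to one extra `json.loads` pass over the string-valued fields (the extra backslashes are just the escaping of JSON nested inside JSON). A minimal sketch -- the `reassemble` helper and the exact record shape are illustrative assumptions, not the pipeline's actual code:

```python
import json


def reassemble(record):
    """Parse any string-valued fields that hold JSON-encoded text.

    Fields such as "payload.info" arrive as escaped JSON strings;
    calling json.loads on the value itself recovers the nested
    dict/list. Plain strings that aren't valid JSON are kept as-is.
    """
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            try:
                out[key] = json.loads(value)
            except ValueError:
                out[key] = value  # ordinary string field, keep it
        else:
            out[key] = value
    return out


# Example: a field double-encoded the way the dump above was.
raw = {"payload.info": "{\"reason\": \"shutdown\", \"sessionId\": \"abc\"}"}
fixed = reassemble(raw)  # fixed["payload.info"] is now a dict
```

Doing this once in the export, as proposed, is cleaner than every downstream consumer re-implementing the same unwrapping.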
Please check the export code into the data-pipeline repo.
Flags: needinfo?(mreid)
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(mreid)
Flags: needinfo?(benjamin)
Resolution: --- → FIXED
Re-assigned to Mark to actually run it.
Assignee: trink → mreid
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated data file at ~mreid/bug1149666.20150413.tar.gz. Please let me know if the data format looks OK.
Flags: needinfo?(bcolloran)
Perfect Mark, thank you.
Flags: needinfo?(bcolloran)
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
mreid, is the sampling here random or are you using a heuristic like "all the clientIDs that end in 01"? If there's a heuristic, I could pull a matching sample of FHR records fairly cheaply.
Flags: needinfo?(mreid)
Flags: needinfo?(mreid)
Product: Cloud Services → Cloud Services Graveyard