Closed Bug 1525600 Opened 6 years ago Closed 5 years ago

Investigate glean.baseline.duration field with NULL length

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(4 files)

From bug 1520182 comment 9, it looks like 4 of the considered baseline pings have a NULL duration. All the rest have the unit 'second' as expected. There were also 299 pings with a duration of 0 seconds.

We should investigate the NULL values and confirm that the 0s durations are legitimate values for very short sessions rather than a bug in the metric type itself.

Blocks: 1491345
Priority: -- → P3
Whiteboard: [telemetry:mobilesdk:m?]
Blocks: 1525603

This shows my lack of knowledge of the backend infrastructure, but shouldn't the NULL values have been detected as violating the schema? timespan values are declared as "type": "integer". When JSON schema validation fails for a ping, what happens to it?

As for the 0 values, it looks like the conversion routine we are using to convert from nanoseconds to seconds is truncating (see convert). Maybe we should always round up?
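For illustration, a minimal Kotlin sketch of the difference between truncating and rounding up when converting nanoseconds to seconds (this is not the actual `convert` routine referenced above; the function names below are made up):

```kotlin
import java.util.concurrent.TimeUnit

// Integer division, as TimeUnit does it, truncates: any duration under one
// second reports as 0.
fun truncatedSeconds(elapsedNanos: Long): Long =
    TimeUnit.NANOSECONDS.toSeconds(elapsedNanos)

// Ceiling division instead reports at least 1 second for any non-zero duration.
fun roundedUpSeconds(elapsedNanos: Long): Long {
    val nanosPerSecond = TimeUnit.SECONDS.toNanos(1)
    return (elapsedNanos + nanosPerSecond - 1) / nanosPerSecond
}

fun main() {
    val halfSecond = 500_000_000L
    println(truncatedSeconds(halfSecond)) // prints 0
    println(roundedUpSeconds(halfSecond)) // prints 1
}
```

Either behavior is defensible; the point is only that with truncation, 0-second durations are expected for sub-second sessions.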

(In reply to Michael Droettboom [:mdroettboom] from comment #1)

This shows my lack of knowledge of the backend infrastructure, but shouldn't the NULL values have been detected as violating the schema? timespan values are declared as "type": "integer". When JSON schema validation fails for a ping, what happens to it?

This is a good question for Frank :)

As for the 0 values, it looks like the conversion routine we are using to convert from nanoseconds to seconds is truncating (see convert). Maybe we should always round up?

I think this is a good idea. Moreover, that's what Chris suggests as well at the end of his analysis :)

Flags: needinfo?(fbertsch)

To my knowledge, schema failures will dump the record into the telemetry-errors stream.

(In reply to Chris H-C :chutten from comment #3)

To my knowledge, schema failures will dump the record into the telemetry-errors stream.

This is correct. The ping will not appear in the STMO table if it is not valid.

I'm not sure how that value could be NULL in the raw data, since it is both required and an integer. My only thought is that the JSON representation of the integer value did not match the int64 representation, and since the parquet writer can't cast it to a string, it has to write it as NULL.

Can we first create a few example pings (one with duration missing, one with duration.value NULL, and any other variants) and verify whether they match the JSON schema we have? If we include them in the validation section of mozilla-pipeline-schemas, we can also test the parquet writer.

Flags: needinfo?(fbertsch)
Whiteboard: [telemetry:mobilesdk:m?] → [telemetry:mobilesdk:m6]
Attached file GitHub Pull Request (deleted) —
Assignee: nobody → mdroettboom

Mike, did you by any chance run the validation from comment 4? What was the outcome of that?

Flags: needinfo?(mdroettboom)

Yeah -- I'm working on the validation from comment 4, but :frank and I are both stumped by the way in which it's failing. I've reached out to trink for more info.

Flags: needinfo?(mdroettboom)

In investigating why we might be getting null timespan values in the data, I’m trying to write an expected failing test for mozilla-pipeline-schemas, but running into a snag. The timespan value in the JSON schema is "type": "integer", but I’ve set it to null in the expected failing test.

When I run the tests, I get the following output, which leads me to think that the parquet output is still being run on this test (even though it should have failed against the JSON schema).

1: 1551123550588450968 [info] input.test_json_schemas detaching received: 0 msg:                                                                                                         
1: 1551123551089097226 [debug] output.test_parquet_schemas writing      heka.telemetry_errors.1                                                                                          
1: 1551123551089331305 [info] hindsight input plugins have exited; cascading shutdown initiated                                                                                          
1: 1551123551089404073 [info] hindsight stop signal received                                                                                                                             
1: 1551123551106236909 [debug] output.test_parquet_schemas writing      heka.telemetry_duplicates.1                                                                                      
1: 1551123551108320178 [debug] output.test_parquet_schemas writing      heka.moz_ingest_uri_errors.1                                                                                     
1: 1551123551109491683 [debug] output.test_parquet_schemas writing      glean.baseline.1                                                                                                 
1: 1551123551134329636 [debug] output.test_parquet_schemas writing      glean.events.1                                                                                                   
1: 1551123551148678859 [debug] output.test_parquet_schemas writing      glean.metrics.1
1: 1551123551157320637 [debug] output.test_parquet_schemas writing      glean.metrics.1
1: 1551123551157990943 [error] output.test_parquet_schemas terminated: process_message() column 'value' is required
1: 1551123551158019818 [info] output.test_parquet_schemas detaching received: 1 msg: process_message() column 'value' is required
Flags: needinfo?(mtrinkala)
Attached file GitHub Pull Request (deleted) —

When I set the timespan value to null, I see the schema failure:

1: 1551303681931551208 [debug] input.test_json_schemas loading ../../validation/glean/baseline.1.all.pass.json
1: 1551303681931668555 [info] input.test_json_schemas detaching received: 1 msg: process_message() run/input/test_json_schemas.lua:117: SchemaURI: #/properties/metrics/properties/timespan/additionalProperties/properties/value Keyword: type DocumentURI: #/metrics/timespan/examples.timespan_example/value

or when I remove it:
1: 1551303925940361491 [debug] input.test_json_schemas loading ../../validation/glean/baseline.1.all.pass.json
1: 1551303925940666503 [info] input.test_json_schemas detaching received: 1 msg: process_message() run/input/test_json_schemas.lua:117: SchemaURI: #/properties/metrics/properties/timespan/additionalProperties Keyword: required DocumentURI: #/metrics/timespan/examples.timespan_example

Do you have a branch you are working on that I can take a look at?

Never mind. I added a fail file and can reproduce it. The message is still injected even on failure. We should flag it differently if we want to test a parquet output failure too, or simply not inject it if we don't care.

Flags: needinfo?(mtrinkala)

:trink: I've investigated this a bit more. It seems that the test harness doesn't support failing tests for parquet (is that correct?). There don't appear to be any failing parquet tests in the repo at the moment. Trying to make that work, it seems that writer:dissect_record throws a C++ exception which isn't caught, taking the whole process down. Fixing this probably requires converting that exception into a return value and then doing the expected-failure check as is currently done for JSON. All of that seems like a bigger task than I was hoping for... so it's probably best to just not pass failing JSON tests along to parquet as a stopgap for now. Does all that make sense?

Flags: needinfo?(mtrinkala)
Attached file GitHub Pull Request (deleted) —

The C++ exception is caught and reported back to the plugin (that is the error message you see) here: https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/parquet/parquet.cpp#L1660

Currently the plugin lets it pass through as a fatal error, as it wasn't designed to test error cases; it simply fails the test. You would use a pcall here https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/f1f2bad05ccd4cffa44815f42b8dec218f27d695/tests/hindsight/run/output/test_parquet_schemas.lua#L95 to trap and handle it. The metadata on the message would also have to be adjusted so the test can verify it was supposed to fail.

If we don't want to add negative testing for the parquet output, this PR is fine. The reason we added negative testing for the JSON schemas was a mistake in a schema that allowed bad data through (this is less likely in the parquet schema), but I will let mreid make the call (mreid, if you want it, please assign this bug to me; otherwise close it and merge the PR).

Flags: needinfo?(mtrinkala) → needinfo?(mreid)

I don't think we need to support negative testing for parquet - skipping the ".fail" examples before attempting to write parquet matches what we do on ingestion, where schema validation failure means a message won't be written out as parquet.

So yes, let's take this as-is. Thanks!

Flags: needinfo?(mreid)

Cool. Thanks for the feedback. Sorry I missed the details about what the parquet plugin was doing... :)

So, we've confirmed that a time value of null is rejected by the JSON schema, and therefore (we assume) is not making it into the data. But it sounds like it's possible that a value passes the JSON schema and still ends up as NULL in parquet. Is there a way I could look at the raw JSON pings that came in, so I could inspect the values and try to cross-reference them with what ended up in parquet? Or any other general thoughts about what might be happening in the pipeline?

Flags: needinfo?(fbertsch)

I recommend we wait a bit until we have the Glean data in BQ, and see if the issue is still present. I don't want us to spend too much time debugging a tool that we're going to stop using fairly soon! Does that sound reasonable?

Flags: needinfo?(fbertsch) → needinfo?(mdroettboom)

Sounds good.

Flags: needinfo?(mdroettboom)
Assignee: mdroettboom → nobody
Whiteboard: [telemetry:mobilesdk:m6] → [telemetry:mobilesdk:backlog]
Component: Telemetry → Glean: SDK
Product: Toolkit → Data Platform and Tools
Version: Trunk → unspecified

I may have found a reproducible path of execution in Glean that generates null values for the duration field. In fact, it didn't even include the field in the ping.

I discovered this while working with the Debug View, so cheers to that being a great tool for debugging!

Steps to reproduce:

  1. Make sure that the integrating client application is not running, even in the background.
  2. Using the GleanDebugActivity from the command line, launch the application with the sendPings command (I also used tagPings to help me see the pings in the Debug View).

At this point Glean sends the pings and launches the application. I believe it may be handling sendPings before the application is launched (or fully launched, i.e. before Glean.initialize() has been called). This means that Glean is not initialized, and/or the application has not gone through the lifecycle events needed to set the duration, so it is null.

I also noticed that, when continuing to send pings using the GleanDebugActivity, all subsequent pings sent after the application was loaded ended up with a zero duration. I believe this is because the duration is set as part of Glean handling the go-to-background event, and this doesn't happen when the GleanDebugActivity is used to send pings.

(In reply to Travis Long from comment #21)

I may have found a reproducible path of execution in Glean that generates null values for the duration field. In fact, it didn't even include the field in the ping.

This is not really generating nulls, but rather not including the field at all (which can be expected).

Hey Frank,

it looks like we're still seeing NULLs (see the great analysis from Chris, "Foreground Session Durations" section) in Firefox for FireTV data.

Does this mean we should investigate this, or is the data still coming from the old tooling path?

Flags: needinfo?(fbertsch)

Not sure if this is helpful or not, but I just managed to catch a series of ping payloads from hindsight that had the invalid value error for the duration. There are 3-4 pings, starting with seq #1, from the same client that all have the error.

This query has some examples: https://sql.telemetry.mozilla.org/queries/63902/source

I also ran a query to get the total ping_counts and client_counts with NULL glean_baseline_duration:

Row  is_null   ping_count  client_count
1    Null      169874      22586
2    Not Null  97749       31385

So it looks like this is happening a lot....

Flags: needinfo?(fbertsch)

You'll see a ton of this: {"glean.error.invalid_value":{"glean.baseline.duration":2}}

Given that errors are being recorded -- I think the only reason that can happen is if stop is called before start. Both of these are called from a LifecycleObserver, which feels like it should always call start before stop, but I wouldn't be surprised if there were an Android bug there.

Going to investigate if this occurs more often on a particular version of Android...
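To illustrate the ordering concern above, here is a minimal Kotlin sketch of a lifecycle observer driving a timespan metric. The TimespanLike interface and the class name are hypothetical stand-ins for illustration, not Glean's actual API:

```kotlin
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner

// Hypothetical stand-in for a timespan metric with start/stop semantics.
interface TimespanLike {
    fun start()
    fun stop()
}

class BaselineDurationObserver(private val duration: TimespanLike) : DefaultLifecycleObserver {
    override fun onStart(owner: LifecycleOwner) {
        duration.start() // begin timing the foreground session
    }

    override fun onStop(owner: LifecycleOwner) {
        // If Android ever delivers onStop without a matching onStart (or the
        // process was recreated in between), this stop() has no running
        // timespan to close, which would surface as an invalid_value error
        // and a missing/NULL duration in the ping.
        duration.stop()
    }
}
```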

Assignee: nobody → alessio.placitelli
Priority: P3 → P1
Whiteboard: [telemetry:mobilesdk:backlog] → [telemetry:mobilesdk:m7]
Whiteboard: [telemetry:mobilesdk:m7] → [telemetry:glean-rs:m7]

Whoops, sorry, I meant to take bug 1566757

Assignee: alessio.placitelli → nobody
Priority: P1 → P3
Whiteboard: [telemetry:glean-rs:m7] → [telemetry:glean-rs:backlog]
Assignee: nobody → mdroettboom
Whiteboard: [telemetry:glean-rs:backlog] → [telemetry:glean-rs:m8]
Assignee: mdroettboom → nobody

The null values here may be related to bug 1586881, and we should go back and look for these once the fix for that lands. This is highly correlated with that bug -- 90% of the time that NULL values appear, there is also an error reported for glean.baseline.duration.

TL;DR from above: The 0 values are not errors -- they just indicate durations of less than 1 second, so I've renamed this bug accordingly.

Summary: Investigate glean.baseline.duration field with NULL and 0s length → Investigate glean.baseline.duration field with NULL length
Assignee: nobody → mdroettboom
Whiteboard: [telemetry:glean-rs:m8] → [telemetry:glean-rs:m11]

No longer seeing NULL values on Fenix nightly, so closing this bug.

We are still getting errors reported for baseline.duration, however, which is distinct from the value being set to null, and bug 1587548 remains open for that reason.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED