Closed Bug 1525600 Opened 6 years ago Closed 5 years ago

Investigate glean.baseline.duration field with NULL length

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(4 files)

From bug 1520182 comment 9, it looks like 4 of the considered baseline pings have a NULL duration. All the rest have the unit 'second' as expected. There were also 299 pings with a duration of 0 seconds.

We should investigate the NULL values and confirm that the 0s durations are legitimate values for very short sessions rather than a bug in the metric type itself.

Blocks: 1491345
Priority: -- → P3
Whiteboard: [telemetry:mobilesdk:m?]
Blocks: 1525603

This shows my lack of knowledge of the backend infrastructure, but shouldn't the NULL values have been detected as violating the schema? timespan values are declared as "type": "integer". When JSON schema validation fails for a ping, what happens to it?

As for the 0 values, it looks like the conversion routine we are using to convert from nanoseconds to seconds is truncating (see convert). Maybe we should always round up?
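For illustration, a minimal Kotlin sketch of the difference between truncating and rounding up when converting nanoseconds to seconds (this is not the actual `convert` routine referenced above; the function names below are made up):

```kotlin
import java.util.concurrent.TimeUnit

// Integer division, as TimeUnit does it, truncates: any duration under one
// second reports as 0.
fun truncatedSeconds(elapsedNanos: Long): Long =
    TimeUnit.NANOSECONDS.toSeconds(elapsedNanos)

// Ceiling division instead reports at least 1 second for any non-zero duration.
fun roundedUpSeconds(elapsedNanos: Long): Long {
    val nanosPerSecond = TimeUnit.SECONDS.toNanos(1)
    return (elapsedNanos + nanosPerSecond - 1) / nanosPerSecond
}

fun main() {
    val halfSecond = 500_000_000L
    println(truncatedSeconds(halfSecond)) // prints 0
    println(roundedUpSeconds(halfSecond)) // prints 1
}
```

Either behavior is defensible; the point is only that with truncation, 0-second durations are expected for sub-second sessions.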

(In reply to Michael Droettboom [:mdroettboom] from comment #1)

This shows my lack of knowledge of the backend infrastructure, but shouldn't the NULL values have been detected as violating the schema? timespan values are declared as "type": "integer". When JSON schema validation fails for a ping, what happens to it?

This is a good question for Frank :)

As for the 0 values, it looks like the conversion routine we are using to convert from nanoseconds to seconds is truncating (see convert). Maybe we should always round up?

I think this is a good idea. Moreover, that's what Chris suggests as well at the end of his analysis :)

Flags: needinfo?(fbertsch)

To my knowledge, schema failures will dump the record into the telemetry-errors stream.

(In reply to Chris H-C :chutten from comment #3)

To my knowledge, schema failures will dump the record into the telemetry-errors stream.

This is correct. The ping will not appear in the STMO table if it is not valid.

I'm not sure how that value could be NULL in the raw data, since it is both required and an integer. My only thought is that the JSON representation of the integer value did not match the int64 representation, and since the parquet writer can't cast it to a string, it has to write it as NULL.

Can we first create a few example pings (one with duration missing, one with duration.value NULL, and any other variants) and verify whether they match the JSON schema we have? If we include them in the validation section of mozilla-pipeline-schemas, we can also test the parquet writer.

Flags: needinfo?(fbertsch)
Whiteboard: [telemetry:mobilesdk:m?] → [telemetry:mobilesdk:m6]
Attached file GitHub Pull Request (deleted) —
Assignee: nobody → mdroettboom

Mike, did you by any chance run the validation from comment 4? What was the outcome of that?

Flags: needinfo?(mdroettboom)

Yeah -- I'm working on the validation from comment 4, but :frank and I are both stumped by the way in which it's failing. I've reached out to trink for more info.

Flags: needinfo?(mdroettboom)

In investigating why we might be getting null timespan values in the data, I’m trying to write an expected failing test for mozilla-pipeline-schemas, but running into a snag. The timespan value in the JSON schema is "type": "integer", but I’ve set it to null in the expected failing test.

When I run the tests, I get the following output, which leads me to think that the parquet output is still being run on this test (even though it should have failed against the JSON schema).

1: 1551123550588450968 [info] input.test_json_schemas detaching received: 0 msg:                                                                                                         
1: 1551123551089097226 [debug] output.test_parquet_schemas writing      heka.telemetry_errors.1                                                                                          
1: 1551123551089331305 [info] hindsight input plugins have exited; cascading shutdown initiated                                                                                          
1: 1551123551089404073 [info] hindsight stop signal received                                                                                                                             
1: 1551123551106236909 [debug] output.test_parquet_schemas writing      heka.telemetry_duplicates.1                                                                                      
1: 1551123551108320178 [debug] output.test_parquet_schemas writing      heka.moz_ingest_uri_errors.1                                                                                     
1: 1551123551109491683 [debug] output.test_parquet_schemas writing      glean.baseline.1                                                                                                 
1: 1551123551134329636 [debug] output.test_parquet_schemas writing      glean.events.1                                                                                                   
1: 1551123551148678859 [debug] output.test_parquet_schemas writing      glean.metrics.1
1: 1551123551157320637 [debug] output.test_parquet_schemas writing      glean.metrics.1
1: 1551123551157990943 [error] output.test_parquet_schemas terminated: process_message() column 'value' is required
1: 1551123551158019818 [info] output.test_parquet_schemas detaching received: 1 msg: process_message() column 'value' is required
Flags: needinfo?(mtrinkala)
Attached file GitHub Pull Request (deleted) —

When I set the timespan value to null, I see the schema failure:

1: 1551303681931551208 [debug] input.test_json_schemas loading ../../validation/glean/baseline.1.all.pass.json
1: 1551303681931668555 [info] input.test_json_schemas detaching received: 1 msg: process_message() run/input/test_json_schemas.lua:117: SchemaURI: #/properties/metrics/properties/timespan/additionalProperties/properties/value Keyword: type DocumentURI: #/metrics/timespan/examples.timespan_example/value

or when I remove it:
1: 1551303925940361491 [debug] input.test_json_schemas loading ../../validation/glean/baseline.1.all.pass.json
1: 1551303925940666503 [info] input.test_json_schemas detaching received: 1 msg: process_message() run/input/test_json_schemas.lua:117: SchemaURI: #/properties/metrics/properties/timespan/additionalProperties Keyword: required DocumentURI: #/metrics/timespan/examples.timespan_example

Do you have a branch you are working on that I can take a look at?

Never mind. I added a fail file and can reproduce it. The message is still injected even on failure. We should flag it differently if we want to test a parquet output failure too, or simply not inject it if we don't care.

Flags: needinfo?(mtrinkala)

:trink: I've investigated this a bit more. It seems that the test harness doesn't support failing tests for parquet (is that correct?). There don't appear to be any failing parquet tests in the repo at the moment. Trying to make that work, it seems that writer:dissect_record throws a C++ exception which isn't caught, taking the whole process down. Fixing this probably requires converting that exception into a return value and then doing the expected-failure check as is currently done for JSON. All of that seems like a bigger task than I was hoping for... so it's probably best to just not pass failing JSON tests along to parquet as a stopgap for now. Does all that make sense?

Flags: needinfo?(mtrinkala)
Attached file GitHub Pull Request (deleted) —

The C++ exception is caught and reported back to the plugin (that is the error message you see) here: https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/parquet/parquet.cpp#L1660

Currently the plugin lets it pass through as a fatal error, as it wasn't designed to test error cases; it simply fails the test. You would use a pcall here https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/f1f2bad05ccd4cffa44815f42b8dec218f27d695/tests/hindsight/run/output/test_parquet_schemas.lua#L95 to trap and handle it. The metadata on the message would also have to be adjusted so the test can verify it was supposed to fail.

If we don't want to add negative testing for the parquet output, this PR is fine. The reason we added negative testing for the JSON schemas was a mistake in a schema that allowed bad data through (this is less likely in the parquet schema), but I will let mreid make the call (mreid, if you want it, please assign this bug to me; otherwise close it and merge the PR).

Flags: needinfo?(mtrinkala) → needinfo?(mreid)

I don't think we need to support negative testing for parquet - skipping the ".fail" examples before attempting to write parquet matches what we do on ingestion, where schema validation failure means a message won't be written out as parquet.

So yes, let's take this as-is. Thanks!

Flags: needinfo?(mreid)

Cool. Thanks for the feedback. Sorry I missed the details about what the parquet plugin was doing... :)

So, we've confirmed that a time value of null is rejected by the JSON schema, and therefore (we assume) is not making it into the data. But it sounds like it's possible that a value passes the JSON schema and still ends up as NULL in parquet. Is there a way I could look at the raw JSON pings that came in, so I could inspect the values and try to cross-reference them with what ended up in parquet? Or any other general thoughts about what might be happening in the pipeline?

Flags: needinfo?(fbertsch)

I recommend we wait a bit until we have the Glean data in BQ, and see if the issue is still present. I don't want us to spend too much time debugging a tool that we're going to stop using fairly soon! Does that sound reasonable?

Flags: needinfo?(fbertsch) → needinfo?(mdroettboom)

Sounds good.

Flags: needinfo?(mdroettboom)
Assignee: mdroettboom → nobody
Whiteboard: [telemetry:mobilesdk:m6] → [telemetry:mobilesdk:backlog]
Component: Telemetry → Glean: SDK
Product: Toolkit → Data Platform and Tools
Version: Trunk → unspecified

I may have found a reproducible path of execution in Glean that generates null values for the duration field. In fact, it didn't even include the field in the ping.

I discovered this while working with the Debug View, so cheers to that being a great tool for debugging!

Steps to reproduce:

  1. Make sure that the integrating client application is not running, even in the background.
  2. Using the GleanDebugActivity from the command line, launch the application with the sendPings command (I also used tagPings to help me see the pings in the Debug View).

At this point Glean sends the pings and launches the application. I believe it may be handling sendPings before the application is launched (or fully launched, i.e. before Glean.initialize() has been called). This means that Glean is not initialized, and/or the application has not gone through the lifecycle events needed to set the duration, so it is null.

I also noticed that, when continuing to send pings using the GleanDebugActivity, all subsequent pings sent after the application was loaded ended up with a zero duration. I believe this is because the duration is set as part of Glean handling the go-to-background event, and this doesn't happen when the GleanDebugActivity is used to send pings.

(In reply to Travis Long from comment #21)

I may have found a reproducible path of execution in Glean that generates null values for the duration field. In fact, it didn't even include the field in the ping.

This is not really generating nulls, but rather not including the field at all (which can be expected).

Hey Frank,

it looks like we're still seeing NULLs (see the great analysis from Chris, "Foreground Session Durations" section) in Firefox for FireTV data.

Does this mean we should investigate this, or is the data still coming from the old tooling path?

Flags: needinfo?(fbertsch)

Not sure if this is helpful or not, but I just managed to catch a series of ping payloads from hindsight that had the invalid value error for the duration. There are 3-4 pings, starting with seq #1, from the same client that all have the error.

This query has some examples: https://sql.telemetry.mozilla.org/queries/63902/source

I also ran a query to get the total ping_counts and client_counts with NULL glean_baseline_duration:

Row  is_null   ping_count  client_count
1    Null      169874      22586
2    Not Null  97749       31385

So it looks like this is happening a lot....

Flags: needinfo?(fbertsch)

You'll see a ton of this: {"glean.error.invalid_value":{"glean.baseline.duration":2}}

Given that errors are being recorded -- I think the only reason that can happen is if stop is called before start. Both of these are called from a LifecycleObserver, which feels like it should always call start before stop, but I wouldn't be surprised if there were an Android bug there.

Going to investigate if this occurs more often on a particular version of Android...
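To illustrate the ordering concern above, here is a minimal Kotlin sketch of a lifecycle observer driving a timespan metric. The TimespanLike interface and the class name are hypothetical stand-ins for illustration, not Glean's actual API:

```kotlin
import androidx.lifecycle.DefaultLifecycleObserver
import androidx.lifecycle.LifecycleOwner

// Hypothetical stand-in for a timespan metric with start/stop semantics.
interface TimespanLike {
    fun start()
    fun stop()
}

class BaselineDurationObserver(private val duration: TimespanLike) : DefaultLifecycleObserver {
    override fun onStart(owner: LifecycleOwner) {
        duration.start() // begin timing the foreground session
    }

    override fun onStop(owner: LifecycleOwner) {
        // If Android ever delivers onStop without a matching onStart (or the
        // process was recreated in between), this stop() has no running
        // timespan to close, which would surface as an invalid_value error
        // and a missing/NULL duration in the ping.
        duration.stop()
    }
}
```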

Assignee: nobody → alessio.placitelli
Priority: P3 → P1
Whiteboard: [telemetry:mobilesdk:backlog] → [telemetry:mobilesdk:m7]
Whiteboard: [telemetry:mobilesdk:m7] → [telemetry:glean-rs:m7]

Whoops, sorry, I meant to take bug 1566757

Assignee: alessio.placitelli → nobody
Priority: P1 → P3
Whiteboard: [telemetry:glean-rs:m7] → [telemetry:glean-rs:backlog]
Assignee: nobody → mdroettboom
Whiteboard: [telemetry:glean-rs:backlog] → [telemetry:glean-rs:m8]
Assignee: mdroettboom → nobody

The null values here may be related to bug 1586881, and we should go back and look for these once the fix for that lands. This is highly correlated with that bug -- 90% of the time that NULL values appear, there is also an error reported for glean.baseline.duration.

TL;DR from above: The 0 values are not errors -- they just indicate durations of less than 1 second, so I've renamed this bug accordingly.

Summary: Investigate glean.baseline.duration field with NULL and 0s length → Investigate glean.baseline.duration field with NULL length
Assignee: nobody → mdroettboom
Whiteboard: [telemetry:glean-rs:m8] → [telemetry:glean-rs:m11]

No longer seeing NULL values on Fenix nightly, so closing this bug.

We are still getting errors reported for baseline.duration, however, which is distinct from the value being set to null, and bug 1587548 remains open for that reason.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED