Make slow SQL telemetry available in BigQuery
Categories
(Data Platform and Tools :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: lina, Assigned: ascholtz)
References
(Depends on 1 open bug)
Details
(Whiteboard: [dataquality])
Attachments
(1 file)
(deleted), text/x-github-pull-request
(I'm not sure if this component is the right place for this bug; please feel free to move it if it's not!)
I wanted to whip up a dashboard showing slow SQL statements, to keep an eye on the performance of new bookmark sync and other Places queries.
I wasn't able to query moz-fx-data-shared-prod.telemetry_live.main_v4.payload.slow_sql, possibly because it doesn't have a schema definition yet. Do we need to add one?
However, I'm also wondering whether the current format of the slow SQL data can be fed into BigQuery. Looking at the docs, and at Services.telemetry.slowSQL in Desktop, I see a structure like this:
{ "mainThread": Stats, "otherThreads": Stats }
where Stats is an object whose keys are the sanitized SQL statements, and each value is a two-element array: [hitCount, totalTime]. For a complete example:
{
  "mainThread": {},
  "otherThreads": {
    "UPDATE moz_places SET frecency = ...": [2, 271]
  }
}
Are those dynamic keys going to be a problem for the pipeline? Do we need to transform them into something (an array of { sql, hitCount, totalTime } objects?) on the backend, so that we can validate and get them into BQ?
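For illustration, the complete example above might then become something like this (these field names are just the ones suggested, not an agreed format):

{
  "mainThread": [],
  "otherThreads": [
    { "sql": "UPDATE moz_places SET frecency = ...", "hitCount": 2, "totalTime": 271 }
  ]
}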
Thanks!
Reporter
Comment 1•5 years ago
> Do we need to transform them into something (an array of { sql, hitCount, totalTime } objects?) on the backend, so that we can validate and get them into BQ?
Or, can we store them as JSON strings, and define SQL functions to extract the statements and values, like we do for histograms?
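For example, something like this (sketch only; a payload.slow_sql STRING column holding the raw JSON doesn't exist yet):

-- Sketch only: assumes a hypothetical payload.slow_sql STRING column
-- holding the raw slowSQL JSON.
SELECT
  JSON_EXTRACT(payload.slow_sql, '$.otherThreads') AS other_threads
FROM
  `moz-fx-data-shared-prod.telemetry.main`
WHERE
  DATE(submission_timestamp) = "2019-10-30"
LIMIT 100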
Comment 2•5 years ago
This is indeed going to additional_properties in main pings since it's not fully defined in the schema. For example:
SELECT
json_extract(additional_properties, '$.payload.slowSQL')
FROM
`moz-fx-data-shared-prod.telemetry.main`
WHERE
DATE(submission_timestamp) = "2019-10-30"
LIMIT 100
Which returns mostly nulls, but got one hit: {"mainThread":{"UPDATE moz_cache SET DataSiz...
It would be possible to store these as JSON strings. But the structure also sounds well-defined enough that we could express it in JSON schema and have that flow to fields in BQ. The dynamic keys would show up as a REPEATED key/value STRUCT, where the key is the SQL statement and the value would likely be an anonymous tuple.
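Querying would then look roughly like this (sketch only; payload.slow_sql.other_threads and the key/value field names are assumptions about the eventual schema):

-- Sketch only: the nested field names below are assumptions about
-- the eventual schema for the dynamic keys.
SELECT
  entry.key AS sql_statement,
  entry.value
FROM
  `moz-fx-data-shared-prod.telemetry.main`,
  UNNEST(payload.slow_sql.other_threads) AS entry
WHERE
  DATE(submission_timestamp) = "2019-10-30"
LIMIT 100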
Reporter
Comment 3•5 years ago
Here's a query I've created, using the data in additional_properties: https://sql.telemetry.mozilla.org/queries/65832
Defining a schema for slowSQL would go a long way toward making it easier to analyze. Something in the pipeline strips the escape sequences for newlines, backslashes, and quotes, so we can't just pass additional_properties.slowSQL to JSON.parse in a SQL function.
It would also be helpful to normalize statements before recording them (bug incoming), and to record the DB name in a separate field instead of in a comment in the SQL string. I added some more notes at the top of the query.
Comment 4•5 years ago
For reference, here is some PySpark code to process the slowSQL payload. It uses raw pings from the old AWS workflow to get the payload. CMD 4 onward could be useful for slowSQL analyses.
Updated•5 years ago
Reporter
Updated•4 years ago
Updated•4 years ago
Comment 5•4 years ago
Assignee
Updated•4 years ago
Assignee
Comment 6•4 years ago
The schema changes have been deployed and payload.slow_sql can now be queried in BigQuery.
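For example (minimal sketch; see the table schema for the exact nested field names):

SELECT
  payload.slow_sql
FROM
  `moz-fx-data-shared-prod.telemetry.main`
WHERE
  DATE(submission_timestamp) = "2020-06-01"  -- any recent date
LIMIT 10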
Assignee
Updated•1 year ago