Closed Bug 1367023 Opened 7 years ago Closed 6 years ago

Need a Slow SQL periodic report for Sqlite consumers perf triage

Categories

(Data Platform and Tools :: General, enhancement, P3)

enhancement
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mak, Assigned: ccd, NeedInfo)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [DataPlatform])

Attachments

(1 file)

From some time the Slow SQL telemetry dashboard has been down, as a result we are lacking any kind of data to identify and prioritize bottlenecks and janks due to Sqlite usage in the product.
We still have quite a few components using Sqlite and none of the current Talos tests are good to evaluate their impact.

While it's possible a dashboard is not the right solution to this problem, having some kind of periodic report would be of great help to identify new and existing performance issues in the way consumers use Sqlite.

Thus, I'd like to kick-off a discussion about what we can do to have this kind of periodic report.
What I care about is something like a top-tenish of SQL offenders on and off the main-thread, where by that we care both about which queries take more time on a single execution, but also which queries are executed the most often in short timeframes (blocks of 1 hour or such).

Ni? mreid to report what we had available in the previous Slow SQL dashboard.

PS: I'm not 100% sure this is in the right component, there are a lot of components about data collection and I couldn't find a documentation about which one to use for new requirements :(
Flags: needinfo?(mreid)
Blocks: StorageJank
Component: Metrics: Data Tools → Datasets: General
Product: Cloud Services → Data Platform and Tools
Attached file SlowSQL.ipynb (deleted) —
This is the notebook that previously powered the SlowSQL dashboard.

Historically, each day's data was processed, then the current week's daily files were combined into a weekly summary, which was published to a web-facing S3 bucket. The SlowSQL dashboard consumed these weekly summary files directly.
Flags: needinfo?(mreid)
bug 1399879 is one case where this report would have helped us noticing a bug in the wild.
Marking as P3, since I do not anticipate our team being able to handle this in Q4. Let me know if this is in error.
Points: --- → 3
Priority: -- → P3
Bug 1461753 is another case where this could have been useful (if done weekly or such), since we started having a lot of queries taking more time than usual and causing other queries on the main-thread to be slow.

We currently still don't have a good way to monitor increase/decrease in the number and time of slow sql queries.
Whiteboard: [DataPlatform]
We ran into this need again with bug 1462046, is it something you might be able to help with? Thanks.
Flags: needinfo?(rharter)
Flags: needinfo?(mreid)
Adding Corey and myself to the bug. I'm sure we can help.
Thanks, Liz. :joy and Corey will follow up with you to figure out what analysis needs to be done.
Flags: needinfo?(rharter)
Flags: needinfo?(mreid)
V1 version of the SlowSQL dataset awaiting to be published on re.dash for subsequent dashboarding and other potential uses: https://bugzilla.mozilla.org/show_bug.cgi?id=1465272
Corey, if we can we help you anyway with this, just let us know.
Working on getting the dataset publishing expedited on #fx-metrics. Looks like the Bugzilla template referenced in docs.telemetry.mozilla.org was off, which is why it has been taking so long!
Initial draft of Slow SQL dashboard on re.dash: https://sql.telemetry.mozilla.org/dashboard/slowsql_1
Edits as per Marco and I's meeting added
Would it be possible to sort the app version in the select fields? finding a version in the middle of many is not trivial.
Assignee: nobody → cdowhygelund
Status: NEW → ASSIGNED
Flags: needinfo?(cdowhygelund)
The ascending sort on app version has been implemented. 

In addition the weekly update job has been implemented and successfully run.
Flags: needinfo?(cdowhygelund)
(In reply to Marco Bonardo [::mak] from comment #13)
> Would it be possible to sort the app version in the select fields? finding a
> version in the middle of many is not trivial.

Implemented
I think for further improvements of bug fixes we could file separate bugs. The current state looks good enough.

Is the dashboard position final, should it be moved? Could it be linked on the main telemetry.mozilla.org page?
Otherwise, I think we can mark this as fixed.
Flags: needinfo?(cdowhygelund)
I believe the location is final with respect to this bug. Moving it to telemetry.mozilla.org should be a separate bug ticket.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(cdowhygelund)
Resolution: --- → FIXED

:mak Is this dashboard still being utilized? The underlying job is utilizing a deprecated library, in addition to the dashboard relying on AWS infrastructure which is being decommissioned. Therefore, it needs to be transitioned over to GCP if it is being used.

Flags: needinfo?(mak)

Lina is working on something related in bug 1592263. Having a way to visualize slow queries is always useful, though it may be enough to be able to query the data as Lina is doing, provided that method is not going away. If so, we could get rid of this dashboard.
ni? Lina just as an heads up.

Flags: needinfo?(mak) → needinfo?(lina)

For reference, the slowSQL job used to populate this deprecated dashboard.

Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: