Closed Bug 909798 Opened 11 years ago Closed 11 years ago

Add ability to Ouija slaves graph to display acceptable failure rate for a given slave type

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

References

Details

(Whiteboard: [good first bug][mentor=dminor][lang=python])

Attachments

(1 file, 1 obsolete file)

0001-show-failure-rates-switch-between-failure-rates.patch, 11 years ago, akruglov (deleted), patch, dminor: review-
resolved merge conflict, 11 years ago, akruglov (deleted), patch, dminor: review+

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Description

11 years ago

There are many instances where a specific slave runs into hardware or OS configuration issues and fails more frequently than its peers. This is easy to detect when looking back at history because the slave will show an abnormal number of failures. It would be nice to have another column in the slaves table that shows the maximum acceptable % of failures. This should be done by looking at the machine name, e.g. 't-w864-ix-093'; you can determine the type by stripping off the -[0-9]+ at the end, so it would be 't-w864-ix'. There are many families of machines, and by looking at the total number of jobs and the total number of failures (excluding retries), we could determine a % failure for a given platform.
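The name-to-type step described above could be sketched in Python roughly as follows. This is only an illustration, not the code in Ouija; the helper name `platform_of` is hypothetical, and it assumes every slave name ends in a hyphen followed by digits.

```python
import re

# Trailing "-<digits>" suffix that identifies an individual machine.
_SUFFIX = re.compile(r"-[0-9]+$")


def platform_of(slave_name):
    """Return the slave type by stripping a trailing -[0-9]+ (hypothetical helper)."""
    return _SUFFIX.sub("", slave_name)


print(platform_of("t-w864-ix-093"))  # t-w864-ix
```

Names without a numeric suffix are returned unchanged, so a malformed name ends up as its own "platform" rather than raising an error.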

akruglov

Comment 1

11 years ago

But the slaves table contains data about slaves, not platforms. How should the acceptable failure rate be displayed in this case? I see only one solution: calculate it for a given platform and then show it for every machine of that type. Is that correct? Where should this column go? After "Passes" and before "Total"?

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 2

11 years ago

Pretty much right on. If you have all the data from your query for the slaves, you can parse the slave name into a platform. Given that platform, you can calculate the platform failure rate (PFR). When displaying the information for a specific slave, it would also be nice to know the total failure rate of that slave (numfailures/numjobs).

As for comparing this to the platform in general, I have a couple of thoughts:
1) it would be redundant to have a column with the PFR, but it would be nice to have
2) calculating the PFR will be flawed if we have a few slaves with abnormally high failure rates.

For 1, I think we can live with it. It might look like:

name, failures, retries, infra, total, % failure, expected % failure
tegra-314, 21, 20, 9, 500, 10.0%, 3.14%
tegra-157, 3, 6, 1, 500, 2.0%, 3.14%

In this case, we can quickly determine that tegra-314 should be looked into and tegra-157 is operating normally.

Bonus points for making this a sortable table on any given column (I recall a javascript library to do this easily) :)
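The comparison in that example could be expressed as a small check like the one below. It is a sketch only: the function name `needs_attention` and the tuple layout are made up for illustration, and the numbers are the two example rows from the comment rather than real data.

```python
# Example rows from the comment: (name, % failure, expected % failure).
rows = [
    ("tegra-314", 10.0, 3.14),
    ("tegra-157", 2.0, 3.14),
]


def needs_attention(rows):
    """Return the names whose failure rate exceeds the expected platform rate."""
    return [name for name, rate, expected in rows if rate > expected]


print(needs_attention(rows))  # ['tegra-314']
```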

akruglov

Comment 3

11 years ago

A few questions:
1. How do I calculate the failure rate for a platform? Is it PFR = all failures (on all slaves of the given platform) / all runs (failures + retries + passes on all slaves of the given platform), or all failures / (all runs - retries)?
2. What is infra? This column is not present at the moment; should I add it too?
3. You didn't mention how to mitigate point 2 in your reply (abnormally high failure rates on several slaves that skew the statistics used for the PFR calculation).

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 4

11 years ago

1) PFR = (sum of all failures on all slaves for a given platform) / (all runs on all slaves for the given platform). I am not sure whether we should exclude retries. Retries usually indicate a failure and help point out problems, but if the problem is with a build/test/harness, then we will retry a lot and rack up failures on a lot of machines. For now let's exclude them; bonus points for a toggle ;)
2) Infra is infrastructure-related failures, specifically things like DNS failures, power outages, etc. These are rare, but they sometimes include hardware failures on the specific machine under test. They should be denoted as a different failure type.
3) I don't have a solution to mitigate high failure rates propping up the overall statistics. For now we can live with it, although I am open to suggestions.

Thanks for making sure you understand this bug and do the right thing. Looking forward to your patch.
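A rough Python sketch of the PFR from point 1, with retries left out of the run count, might look like this. The input shape (slave name mapped to a dict of counts), the key names, and the function name are all assumptions made for illustration; the real query results in Ouija may be structured differently, and the sample counts are invented.

```python
import re
from collections import defaultdict


def platform_failure_rates(slaves):
    """Map platform -> failure rate in percent, with retries excluded.

    `slaves` maps slave name -> {'fail', 'retry', 'infra', 'pass'} counts
    (a hypothetical shape, not the actual Ouija data model).
    """
    fails = defaultdict(int)
    runs = defaultdict(int)
    for name, counts in slaves.items():
        platform = re.sub(r"-[0-9]+$", "", name)  # strip the machine number
        fails[platform] += counts["fail"]
        # Retries are deliberately left out of the run count.
        runs[platform] += counts["fail"] + counts["infra"] + counts["pass"]
    # Skip platforms with no counted runs to avoid dividing by zero.
    return {p: 100.0 * fails[p] / runs[p] for p in runs if runs[p]}


# Invented counts, for illustration only.
sample = {
    "tegra-314": {"fail": 20, "retry": 5, "infra": 0, "pass": 80},
    "tegra-157": {"fail": 0, "retry": 0, "infra": 0, "pass": 100},
}
print(platform_failure_rates(sample))  # {'tegra': 10.0}
```

Note that the one bad slave in the sample pulls the whole platform to 10%, which is the skew described in point 3.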

akruglov

Comment 5

11 years ago

1) Should that be another checkbox to toggle it?
2) How can I recognize such failures? I looked at the values stored in testtype, result, and buildtype in the database and found nothing resembling infra failures.
3) Perhaps I can dig into statistics, but it has been a long time since I studied it at university :)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 6

11 years ago

OK, I can find all the colors here: http://54.215.155.53/data/results?platform=android4.0

test failure: orange
infra: red
retry: blue
passing: green

If making a checkbox to include/exclude retries is doable, I vote for that. Let me know if that helps at all.

akruglov

Comment 7

•

11 years ago

OK, I added infra results. The failure rate is now calculated as (num of fails * 100) / (num of fails + num of infra + num of passes). The server side is also ready to calculate the failure rate including retries, as (num of fails * 100) / (num of fails + num of retries + num of infra + num of passes). I can submit a patch for that right now. I need a bit more time to add the checkbox for including retries in the failure rate calculation. Could we move sorting into a separate issue?
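The two formulas in this comment can be written as one function with a flag. This is a sketch under assumptions, not the code from the patch: the function name and the `include_retries` parameter are hypothetical, and returning 0.0 when there are no jobs is a choice made here that the thread does not discuss.

```python
def failure_rate(fails, retries, infra, passes, include_retries=False):
    """Failure rate in percent, following the two formulas quoted above."""
    total = fails + infra + passes
    if include_retries:
        total += retries
    if total == 0:
        return 0.0  # assumption: report 0% when no jobs were counted
    return fails * 100.0 / total


# Invented counts, for illustration only.
print(failure_rate(10, 10, 0, 90))                        # retries excluded
print(failure_rate(10, 10, 0, 90, include_retries=True))  # retries included
```

A checkbox like the one discussed would then only need to switch the `include_retries` argument.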

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 8

11 years ago

Let's do the sortable tables in a different bug. I have filed bug 919960 to track that work.

akruglov

Comment 9

11 years ago

Attached patch 0001-show-failure-rates-switch-between-failure-rates.patch (obsolete) (deleted)

Attachment #811673 - Flags: review?(dminor)

Dan Minor [:dminor]

Comment 10

11 years ago

Comment on attachment 811673: 0001-show-failure-rates-switch-between-failure-rates.patch

Review of attachment 811673:

The changes look good, but unfortunately the patch you attached does not apply cleanly to the latest ouija changes from github and needs to be rebased. In case you haven't done this before, merging from the github ouija master into your local master and then running 'git rebase master' from your local branch is probably the easiest way to do this. Once the patch is updated, I'll be happy to take another look at it. Thanks!

Attachment #811673 - Flags: review?(dminor) → review-

akruglov

Comment 11

11 years ago

Attached patch resolved merge conflict (deleted)

Thanks, Dan! I resolved the merge conflict.

Attachment #811673 - Attachment is obsolete: true

Attachment #812245 - Flags: review?(dminor)

Dan Minor [:dminor]

Comment 12

11 years ago

Comment on attachment 812245: resolved merge conflict

Review of attachment 812245:

Great work, thanks!

Attachment #812245 - Flags: review?(dminor) → review+

Dan Minor [:dminor]

Comment 13

11 years ago

Committed here: https://github.com/dminor/ouija/commit/b0889c6390f92eb53f8b2b8aeb1f175e54885be7 and in production.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Categories

(Testing :: General, defect)
