Open Bug 1529764 Opened 6 years ago Updated 3 years ago

Backfill regresses/regressed-by fields for older bugs

Categories

(bugzilla.mozilla.org :: Bulk Bug Edit Requests, task)

Production
task
Not set
normal

ASSIGNED

People

(Reporter: emceeaich, Assigned: marco)

References

(Blocks 1 open bug)

Details

(Whiteboard: [october-2019-bmo-triage])

Once the Regresses/Regressed-By fields in bug 1461492 go live, we should clean up data in recent bugs so we can provide tools like BugBug with useful training data for spotting regressions, and do other analysis.

Going through 20 years of bugs is not feasible, but we should look at the past year's regressions and get them updated.

There are ~5,800 regressions (https://mzl.la/2Xgy4e9) since 2018-01-01. ~4,000 of them have the 'blocks' field set (which had been used as the regressed-by field), so we could update those.

So questions:

  • How far back to clean up?
  • What about the existing regressions where we haven't found the change set that introduced the bug?
Flags: needinfo?(mcastelluccio)
No longer blocks: 1461492
Depends on: 1461492

Assigning to myself because I've been thinking about this and I'm still thinking about this.

We need to be careful to update old Regresses/Regressed-By fields only when we are really sure, because we don't want to introduce noise (for ML and other tools it's better to have missing data than wrong data).

Assignee: nobody → mcastelluccio
Status: NEW → ASSIGNED
Flags: needinfo?(mcastelluccio)

A few ideas:

  • Parse mozregression comments - This is bound to be almost 100% correct.
  • Parse the uplift request comment - Given that the field in the template was "[Feature/Bug causing the regression]", we can't be sure that the bug ID here is actually a regressor; it could be the tracking bug for the feature.
  • Parse comments to find "caused by bug XXX" / "regressed by bug XXX" - This won't always be correct, e.g. the phrase could be part of a question ("was this regressed by bug XXX?") or a denial ("no, this wasn't regressed by bug XXX!").
  • Changes to "blocks"/"blocked-by" at the same time as adding "regression" or setting "has_regression_range" to "yes" or removing "regressionwindow-wanted".
  • Using the SZZ algorithm (very far from being 100% accurate).
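A minimal sketch of the comment-parsing idea above, assuming comments are available as plain strings. The regular expressions, the rough sentence splitter, and the function names are all hypothetical and only illustrate how questions and denials could be filtered out; they are not an existing tool.

```python
import re

# Phrases taken from the idea list; the exact pattern is an assumption.
CANDIDATE_RE = re.compile(
    r"\b(?:caused|regressed)\s+by\s+bug\s+#?(\d+)", re.IGNORECASE
)
# Hypothetical negation cues such as "no", "not", "never", "wasn't".
NEGATION_RE = re.compile(r"\b(?:not|no|never)\b|n't\b", re.IGNORECASE)


def split_sentences(text):
    """Very rough sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_regressors(comment_text):
    """Return bug IDs named as regressors in plain declarative sentences."""
    found = []
    for sentence in split_sentences(comment_text):
        if sentence.endswith("?"):
            continue  # a question, e.g. "was this regressed by bug 123?"
        if NEGATION_RE.search(sentence):
            continue  # a denial, e.g. "this wasn't regressed by bug 123!"
        for match in CANDIDATE_RE.finditer(sentence):
            bug_id = int(match.group(1))
            if bug_id not in found:
                found.append(bug_id)
    return found
```

This deliberately errs on the side of dropping candidates: a sentence with any negation cue is skipped entirely, which fits the goal of preferring missing data over wrong data.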

We can also mix some of these (e.g. parse mozregression comment and check that the mentioned bug is in "blocks"/"blocked_by").
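The mixing idea could be sketched roughly as follows, assuming each bug is a dict with an "id" plus optional "blocks" and "depends_on" lists ("depends_on" being the REST name for "blocked-by"). The record shape and the function names are assumptions for illustration; the candidate IDs would come from whichever parsing rule is used.

```python
def cross_check(bug, candidates):
    """Split candidate regressor IDs into confirmed and unconfirmed.

    A candidate counts as confirmed only if it also appears in the bug's
    "blocks" or "depends_on" list (assumed record shape).
    """
    linked = set(bug.get("blocks", [])) | set(bug.get("depends_on", []))
    confirmed = [c for c in candidates if c in linked]
    unconfirmed = [c for c in candidates if c not in linked]
    return confirmed, unconfirmed


def build_review_list(bugs, candidates_by_bug):
    """Return (bug_id, candidate_id, status) rows for manual review."""
    rows = []
    for bug in bugs:
        candidates = candidates_by_bug.get(bug["id"], [])
        confirmed, unconfirmed = cross_check(bug, candidates)
        rows.extend((bug["id"], c, "confirmed") for c in confirmed)
        rows.extend((bug["id"], c, "needs-review") for c in unconfirmed)
    return rows
```

Keeping the unconfirmed candidates in the output, rather than discarding them, leaves room for reviewing them manually later.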

I'll try to check how many links we'd generate by applying these rules over the past 1-2 years. Maybe it'll be possible to review them manually.

Summary: Clean up Regression Information → Backfill regresses/regressed-by fields for older bugs

Marco, is this still needed?

Flags: needinfo?(mcastelluccio)

I still plan to do this, though it's definitely not high priority.

Flags: needinfo?(mcastelluccio)

(In reply to Marco Castelluccio [:marco] from comment #2)

Additionally, we can parse the "If not all supported branches, which bug introduced the flaw?" field from security request comments.

I've built a script to apply these rules to generate a list (bugs between 2017-04-29 and 2019-04-29), but it is pretty long (more than 1000 entries). I could go through them manually from time to time, but it's going to take a while (if I did 5 per day, it would take a year :D).

(In reply to Marco Castelluccio [:marco] from comment #5)

If given the bug ID list, I can write a script to run on the server to make the changes automatically. It could also be done in a way where emails are not sent out.

(In reply to David Lawrence [:dkl] from comment #6)

Unfortunately not all the items in the list are correct, as those fields were all used inconsistently (except maybe the security form one, but it is a very small subset).
