Closed
Bug 1272532
Opened 8 years ago
Closed 8 years ago
exceptions:KeyError: in jobs._load_jobs
Categories
(Tree Management :: Treeherder: Data Ingestion, defect)
Tree Management
Treeherder: Data Ingestion
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: camd, Assigned: wlach)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
This seems to be happening when we get jobs and retries of that job in the same chunk of builds4hr. This happens on both stage and prod. For example:
exceptions:KeyError: '64600b5dba1a152c60205c63ea03af8400f9e95d'
The retry job in question that's trying to be set to retried AND completed is here:
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=a97eec37ec2d4ded719ad996d75c431b7043fa5a&filter-searchStr=44f22fd30969183ab9c21f8ef6d95fb509345b32&selectedJob=9314573
This seems to have started with the change in commit:
https://github.com/mozilla/treeherder/commit/104787282d0db965ea948342ce1f2926936f3d27#diff-a03f6e7c1747a7ef236e211422cc7c5eR1781
The logic changed there with handling retries. I'm still looking into it.
Reporter | ||
Comment 1•8 years ago
|
||
It's possible we could get away with just changing line 1781 to be:
if retry_guid_root in jobs_to_update:
del jobs_to_update[retry_guid_root]
But I don't quite understand why that guid isn't in there. I *think* it's because the completed and retried are in the same batch of jobs being submitted.
The logic changed here a bit. But I think this is the problem:
https://github.com/mozilla/treeherder/commit/104787282d0db965ea948342ce1f2926936f3d27#diff-a03f6e7c1747a7ef236e211422cc7c5eR1764
That line is still referencing ``job_id_lookup`` but in the old code, ``job_id_lookup`` was getting modified in the for-loop. Now we're modifying ``jobs_to_update``, but still getting the keys from ``job_id_lookup``, so the list of keys may not exist in ``jobs_to_update``
Will: Would you check my reasoning on this?
Flags: needinfo?(wlachance)
Reporter | ||
Comment 2•8 years ago
|
||
I should have mentioned in comment 1, the fix might be just to change:
lookup_keys = job_id_lookup.keys()
to
lookup_keys = jobs_to_update.keys()
Assignee | ||
Comment 4•8 years ago
|
||
(In reply to Cameron Dawson [:camd] from comment #1)
> It's possible we could get away with just changing line 1781 to be:
>
> if retry_guid_root in jobs_to_update:
> del jobs_to_update[retry_guid_root]
>
> But I don't quite understand why that guid isn't in there. I *think* it's
> because the completed and retried are in the same batch of jobs being
> submitted.
>
> The logic changed here a bit. But I think this is the problem:
>
> https://github.com/mozilla/treeherder/commit/
> 104787282d0db965ea948342ce1f2926936f3d27#diff-
> a03f6e7c1747a7ef236e211422cc7c5eR1764
>
> That line is still referencing ``job_id_lookup`` but in the old code,
> ``job_id_lookup`` was getting modified in the for-loop. Now we're modifying
> ``jobs_to_update``, but still getting the keys from ``job_id_lookup``, so
> the list of keys may not exist in ``jobs_to_update``
>
> Will: Would you check my reasoning on this?
This makes sense to me. I would create a unit test that reproduces this problem, then fix it. :)
Flags: needinfo?(wlachance)
Assignee | ||
Comment 6•8 years ago
|
||
Since I caused this problem, it's probably fair that I take responsibility for fixing it.
Assignee: nobody → wlachance
Comment 7•8 years ago
|
||
Assignee | ||
Comment 8•8 years ago
|
||
I believe the above PR identifies and fixes the problem. I'm going to postpone asking for review until I've had a chance to give this more thought, but leaving the solution here in case there's an emergency.
Assignee | ||
Comment 9•8 years ago
|
||
(In reply to William Lachance (:wlach) from comment #8)
> I believe the above PR identifies and fixes the problem. I'm going to
> postpone asking for review until I've had a chance to give this more
> thought, but leaving the solution here in case there's an emergency.
For example, I'm somewhat concerned about the possibility of there being two retry guids in the first place -- is there anything we can/should do to try to prevent that from happening? This will probably be easier to fix at the root once we move away from datasource.
Assignee | ||
Comment 10•8 years ago
|
||
Comment on attachment 8754517 [details]
[treeherder] wlach:1272532 > mozilla:master
As discussed, I'm not sure if this 100% fixes the problem but spending more time on this doesn't seem like a good investment (given that we'll be changing the way job ingestion works completely when we move away from datasource)
Attachment #8754517 -
Flags: review?(emorley)
Updated•8 years ago
|
Attachment #8754517 -
Flags: review?(emorley) → review+
Comment 11•8 years ago
|
||
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/4d143c055f6ebfec3b2541125c0a45213c3705fb
Bug 1272532 - Handle case where we get two different retry guids for a job
This is an obscure case (that we should probably handle better), but I
believe this patch should at least fix the new relic exception in the
common cases (specifically when we get the same retry with a different
guid in the same batch of jobs). This patch also adds a bunch of unit
tests to test ingesting batches of jobs containing retries. There are
many cases which still fail (which would probably never occur in real
life), but I'm going to put off fixing those for now -- see comments
in the unit tests for details.
https://github.com/mozilla/treeherder/commit/969d063b522ac560e767468ecc4549a9a5cea164
Merge pull request #1507 from wlach/1272532
Bug 1272532 - Handle case where we get two different retry guids for a job
Assignee | ||
Comment 12•8 years ago
|
||
Fixed enough for now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•