Closed Bug 1466923 Opened 6 years ago Closed 6 years ago

Test verification of web-platform tests sometimes reports PASS for steps with test failures

Categories

(Testing :: web-platform-tests, enhancement)

Version: Version 3
Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gbrown, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

Follow-up on https://bugzilla.mozilla.org/show_bug.cgi?id=1466740#c2.

Consider: https://treeherder.mozilla.org/logviewer.html#?job_id=181826441&repo=mozilla-inbound&lineNumber=1329-1343

02:37:46     INFO -  ## All results ##
02:37:46     INFO -
02:37:46     INFO -  ### /css/css-scoping/shadow-host-with-before-after.html ###
02:37:46     INFO -  | Subtest | Results | Messages |
02:37:46     INFO -  |---------|---------|----------|
02:37:46     INFO -  |         | FAIL    |          |
02:37:46     INFO -
02:37:46     INFO -  ::: Running tests in a loop 10 times : PASS
02:37:46     INFO -  ::: Running tests in a loop with restarts 5 times : PASS
02:37:46     INFO -  ::: Running tests in a loop 10 times with flags chaos_mode_flags=3 : PASS
02:37:46     INFO -  ::: Running tests in a loop with restarts 5 times with flags chaos_mode_flags=3 : PASS
02:37:46     INFO -  :::
02:37:46     INFO -  ::: Test verification PASS
02:37:46     INFO -  :::
02:37:47     INFO -  Return code: 0

If we know that a test has failed, why list PASS for each step, and for overall test verification? It seems inconsistent and confusing. (On the other hand, it's not too serious, because the task is still orange.)
:jgraham -- I suspect you understand this code much better than I. Should we be checking the 'results' here, in addition to 'inconsistent'? https://dxr.mozilla.org/mozilla-central/rev/d8f180ab74921fd07a66d6868914a48e5f9ea797/testing/web-platform/tests/tools/wptrunner/wptrunner/stability.py#267-277
Flags: needinfo?(james)
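For concreteness, here is a minimal sketch, with hypothetical names rather than the actual stability.py code, of the two signals being discussed: consistency between iterations, and agreement with the expected (metadata) result.

def verify(runs):
    """runs: list of per-iteration dicts mapping test id -> (status, expected)."""
    inconsistent = []
    unexpected = []
    for test_id in runs[0]:
        statuses = {run[test_id][0] for run in runs}
        if len(statuses) > 1:
            inconsistent.append(test_id)   # result differs between iterations
        status, expected = runs[0][test_id]
        if status != expected:
            unexpected.append(test_id)     # result differs from the metadata
    # Today only `inconsistent` decides the stability PASS/FAIL; the question
    # here is whether `unexpected` should also be taken into account.
    return len(inconsistent) == 0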
It depends what you think the purpose of test-verify is. In my model the defining feature of test-verify is that it checks stability, and the PASS at the end only refers to whether the test is stable or not. Putting FAIL for a test that consistently FAILs is wrong because, assuming it's marked as failing in the metadata, nothing is wrong. That's not a situation that occurs for other test harnesses, because they typically don't allow any non-passing tests, but it is something we need in wpt.

In this specific case it seems like the test itself FAILed, stably, on Windows. So what we see is TEST-UNEXPECTED-FAIL from the test not matching expectations, FAIL (only) in the list of results obtained, and PASS for the stability check.

This is perhaps confusing, but it is presenting all the information. I'm not sure it would be less confusing to make some of these things degenerate; e.g. marking the stability check as FAIL if the test has an unexpected result might also make people believe the test is unstable when that isn't actually the case.
Flags: needinfo?(james)
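A small hypothetical illustration of that distinction, using the case from comment 0, where the test FAILs on every iteration but no metadata marks it as an expected FAIL:

runs = ["FAIL"] * 10        # same result on every iteration
expected = "PASS"           # no metadata entry, so a pass is expected

is_stable = len(set(runs)) == 1      # True: the stability check reports PASS
is_expected = runs[0] == expected    # False: TEST-UNEXPECTED-FAIL is logged
# The job still turns orange because of the unexpected result, even though
# every stability step reports PASS.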
(In reply to James Graham [:jgraham] from comment #2)

Thanks James, that's quite helpful.

> In my model the defining feature of test-verify is that it checks stability,

I might agree with you, but I'm not 100% sure...

> Putting FAIL for a test that consistently FAILs is wrong because, assuming
> it's marked as failing in the metadata, nothing is wrong.

I agree completely with that case. I am still bothered by the case -- as in comment 0 -- where a test fails consistently but it is not marked as failing in the metadata. If there is no metadata expectation that the test fails, but TVw finds the test failing consistently, is that "stable"?
> If there is no metadata expectation that the test fails, but TVw finds the
> test failing consistently, is that "stable"?

Yes, IMO. I should note that we use the same code for stability checking upstream, and in that case we don't have expectation data, so any test that happens to not pass on Firefox or Chrome would be regarded as "unstable" if we made stability depend on the results themselves rather than just on the consistency of results between runs.

Obviously for the m-c case we always have expectation data by the time TVw runs, so we could have different behaviour there, but that would have to be a flag of some sort, since we can't tell the difference between missing expectation data and expecting a pass, and that would add some complexity.

Since we always fail the job correctly and the issue is "just" how hard it is to interpret the log, I'd rather not change anything here. Ideally we would start scheduling things so that changing a web-platform-test always causes the relevant test to run in the normal jobs, which I think would make things clearer here.
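A sketch of the kind of flag described above; the option name and function are hypothetical, not the actual wptrunner interface:

def stability_status(runs, expected=None, fail_on_unexpected=False):
    if len(set(runs)) > 1:
        return "FAIL"    # genuinely inconsistent between iterations
    if fail_on_unexpected and expected is not None and runs[0] != expected:
        return "FAIL"    # stable, but not what the metadata expects
    return "PASS"

# Upstream stability runs: expected is None, so only consistency matters and a
# browser that simply doesn't pass the test is not reported as "unstable".
# On mozilla-central, expectation data exists by the time TVw runs, so the flag
# could be enabled there, at the cost of the extra complexity mentioned above.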
Great discussion. I think this is either a P5 or a WONTFIX, given that the job reports the proper green or orange colour when finished. If we correctly identify the failing test, then we point the right person (developer or sheriff) at the right problem.

The remaining question is how we can reduce confusion when reading a log file and figuring out what failed and why. Can we print out the expected status from web-platform-tests? Seeing that a test's status is FAIL but that FAIL was also the expected status would reduce confusion; of course, any output like this would need to appear in the right place so it can be read easily (thinking of debug logs and the hundreds of lines of spew between useful nuggets of information).
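As a rough sketch of that suggestion, using a hypothetical formatting helper rather than existing wptrunner output code, the results table could carry the expected status next to the observed one:

def format_result_row(subtest, status, expected):
    note = "" if status == expected else "UNEXPECTED"
    return "| %s | %s | expected %s | %s |" % (subtest, status, expected, note)

print(format_result_row("", "FAIL", "FAIL"))
# prints: |  | FAIL | expected FAIL |  |
# An expected failure is then visibly different from an unexpected one.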
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX