Bugzilla

Comment 1

•

7 years ago

Looks like the safety check we put here is failing:

http://searchfox.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#646

Perhaps the issue is that the binary heap does not have a stable sort and there are two nsITimers with the same mTimeout.
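To make the stable-sort hypothesis concrete, here is a minimal sketch. It is not the real TimerThread code: `Entry`, `seq`, `LaterThan`, `Push`, `PopFront` and `PopOrderForEqualTimeouts` are names made up for illustration. `std::push_heap`/`std::pop_heap` give no ordering guarantee between elements that compare equal, so if ties ever mattered, one option would be an insertion counter as a tie-breaker:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a timer entry; NOT the real nsTimerImpl.
struct Entry {
  double timeoutMs;   // plays the role of mTimeout
  std::uint64_t seq;  // insertion order, used only to break ties
};

// The std heap functions build a max-heap with respect to the comparator,
// so a "later than" comparator keeps the earliest entry at the front.
inline bool LaterThan(const Entry& a, const Entry& b) {
  if (a.timeoutMs != b.timeoutMs) {
    return a.timeoutMs > b.timeoutMs;
  }
  return a.seq > b.seq;  // equal timeouts: older insertion wins
}

inline void Push(std::vector<Entry>& heap, Entry e) {
  heap.push_back(e);
  std::push_heap(heap.begin(), heap.end(), LaterThan);
}

inline Entry PopFront(std::vector<Entry>& heap) {
  std::pop_heap(heap.begin(), heap.end(), LaterThan);
  Entry e = heap.back();
  heap.pop_back();
  return e;
}

// Pushes five entries with identical timeouts and returns the order in
// which they come back out.
inline std::vector<std::uint64_t> PopOrderForEqualTimeouts() {
  std::vector<Entry> heap;
  for (std::uint64_t i = 0; i < 5; ++i) {
    Push(heap, Entry{10.0, i});
  }
  std::vector<std::uint64_t> order;
  while (!heap.empty()) {
    order.push_back(PopFront(heap).seq);
  }
  return order;
}
```

With the tie-breaker the pop order for equal timeouts is the insertion order; without it, the order among equal elements is unspecified.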

What do you think Olli?

Blocks: 1368493

Flags: needinfo?(bugs)

Updated

•

7 years ago

Component: Selection → XPCOM

Comment 2

•

7 years ago

But we compare TimeStamps, not Timers.
This is puzzling.

Flags: needinfo?(bugs)

Comment 3

•

7 years ago

Oh, we do indeed.  It would be interesting to know the state of the list when that assert failed.

Comment 4

•

7 years ago

Perhaps we should use std::priority_queue:

http://en.cppreference.com/w/cpp/container/priority_queue

It uses a binary heap underneath, but handles all the push_heap/pop_heap calls itself.  That would let this method just use a normal iterator without worrying about corrupting the list.

(Sorry, I was not aware of priority_queue before just now.)

Comment 5

•

7 years ago

Oh, I don't think priority_queue exposes iterators.  Sorry.
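For reference, a rough sketch of the trade-off. Since std::priority_queue has no begin()/end(), one way to keep iteration is to hold the heap in a plain vector and call the heap algorithms in a small wrapper. `TimeoutHeap` and `CountDueBy` are hypothetical names, and plain doubles stand in for TimeStamps:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Min-heap of timeouts stored in a vector so it can still be iterated.
class TimeoutHeap {
 public:
  void Push(double timeoutMs) {
    mItems.push_back(timeoutMs);
    // std::greater turns the default max-heap into a min-heap.
    std::push_heap(mItems.begin(), mItems.end(), std::greater<double>());
  }

  // Precondition: !Empty().
  double Front() const { return mItems.front(); }

  // Precondition: !Empty().
  void Pop() {
    std::pop_heap(mItems.begin(), mItems.end(), std::greater<double>());
    mItems.pop_back();
  }

  bool Empty() const { return mItems.empty(); }

  // Read-only iteration, in heap layout order (NOT sorted order).
  std::vector<double>::const_iterator begin() const { return mItems.begin(); }
  std::vector<double>::const_iterator end() const { return mItems.end(); }

 private:
  std::vector<double> mItems;
};

// Example of something priority_queue cannot do directly: scan all entries.
inline int CountDueBy(const TimeoutHeap& heap, double deadlineMs) {
  int count = 0;
  for (double t : heap) {
    if (t <= deadlineMs) {
      ++count;
    }
  }
  return count;
}
```

Only const iterators are exposed, so callers cannot break the heap invariant while scanning.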

Comment 6

•

7 years ago

Attached patch timer_thread_debug.diff (obsolete) (deleted)

May need to add more debug code, but hopefully this would reveal something.

Attachment #8876213 - Flags: review?(bkelly)

Comment 7

•

7 years ago

Comment on attachment 8876213
timer_thread_debug.diff

Review of attachment 8876213:
-----------------------------------------------------------------

::: xpcom/threads/TimerThread.cpp
@@ +648,5 @@
> +#ifdef DEBUG
> +  if (!mTimers.IsEmpty()) {
> +    if (firstTimeStamp != mTimers[0]->Timeout()) {
> +      TimeStamp now = TimeStamp::Now();
> +      printf("firstTimeStamp %f, mTimers[0]->Timeout() %f, "

Should this be printf_stderr?

Attachment #8876213 - Flags: review?(bkelly) → review+

Updated

•

7 years ago

Keywords: leave-open

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 8

•

7 years ago

Attached patch printf_stderr (deleted)

Attachment #8876213 - Attachment is obsolete: true

Pulsebot

Comment 9

•

7 years ago

Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6fdd4f87cc0e
investigate why assertion fails on Windows, r=bkelly

Comment 10

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/6fdd4f87cc0e

Comment hidden (Intermittent Failures Robot)

4 failures in 814 pushes (0.005 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 2
* autoland: 2

Platform breakdown:
* windows7-32-vm: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-06-12&endday=2017-06-18&tree=all

Comment hidden (Intermittent Failures Robot)

3 failures in 892 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 2
* mozilla-inbound: 1

Platform breakdown:
* windows7-32-vm: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-06-19&endday=2017-06-25&tree=all

Comment hidden (Intermittent Failures Robot)

2 failures in 718 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* pine: 1
* autoland: 1

Platform breakdown:
* windows7-32-vm: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-06-26&endday=2017-07-02&tree=all

Comment hidden (Intermittent Failures Robot)

1 failures in 822 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-07-17&endday=2017-07-23&tree=all

Comment hidden (Intermittent Failures Robot)

2 failures in 1008 pushes (0.002 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 2

Platform breakdown:
* windows7-32: 1
* windows10-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-07-24&endday=2017-07-30&tree=all

Comment hidden (Intermittent Failures Robot)

1 failures in 908 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-08-21&endday=2017-08-27&tree=all

Comment hidden (Intermittent Failures Robot)

1 failures in 1032 pushes (0.001 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-09-11&endday=2017-09-17&tree=all

Comment hidden (Intermittent Failures Robot)

1 failures in 885 pushes (0.001 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1371438&startday=2017-09-25&endday=2017-10-01&tree=all

Comment 20

•

7 years ago

Our extra printf has been in for a while.  The values don't seem to differ by much when this hits:

  firstTimeStamp 2.000000, mTimers[0]->Timeout() 1.000000, initialFirstTimer 0EEE04C0, current first 0F7F63C0

From:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=133455725&lineNumber=28509

What do you guys think is the next step?  It definitely seems like our windows TimeStamp code just sometimes reorders things on our automation windows VM test machines.  Could we force these to just avoid the high performance counters from the start so we don't have to suffer the switch in the middle of a test?

For example, see bug 1391693 comment 101 for another place this issue showed up for me.

Flags: needinfo?(honzab.moz)

Flags: needinfo?(bugs)

Comment 21

•

7 years ago

I'm not familiar enough with Windows timestamps to understand what is buggy there.

Flags: needinfo?(bugs)

Comment 22

•

7 years ago

(In reply to Ben Kelly [:bkelly] from comment #20)
> Our extra printf has been in for a while.  Doesn't seem like big changes
> when this impacts:
> 
>   firstTimeStamp 2.000000, mTimers[0]->Timeout() 1.000000, initialFirstTimer
> 0EEE04C0, current first 0F7F63C0

If you are getting 1ms difference then I'm afraid it's not caused by the GetTickCount() fallback.  If timestamp has already switched to use it, |now| would only contain the 16ms resolution value.  Hence, calculation of a duration (= something_captured_earlier - now) cannot be 2 or 1 ms.  It would be 0 or 16 ms, because we would subtract two GetTickCount() captured values.
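The arithmetic behind that, as a simplified sketch: I assume an exact 16 ms tick here, while the real counter is closer to 15.6 ms, and `kTickMs`, `CoarseNowMs` and `CoarseDurationMs` are made-up names, not anything in TimeStamp_windows.cpp. If both endpoints are snapped to tick boundaries, their difference can only be a multiple of the tick:

```cpp
#include <cassert>
#include <cstdint>

// Assumed tick granularity, for illustration only.
constexpr std::int64_t kTickMs = 16;

// Models a coarse counter: returns the last tick boundary at or before
// realMs (non-negative input assumed).
constexpr std::int64_t CoarseNowMs(std::int64_t realMs) {
  return (realMs / kTickMs) * kTickMs;
}

// Duration between two coarse readings; always a multiple of kTickMs.
constexpr std::int64_t CoarseDurationMs(std::int64_t startMs,
                                        std::int64_t endMs) {
  return CoarseNowMs(endMs) - CoarseNowMs(startMs);
}
```

So two events 2 ms apart report either 0 ms or 16 ms, depending on whether a tick boundary falls between them, but never 1 or 2 ms.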

I rather suspect this is because the jitter of QueryPerformanceCounter (the high res counter we use) is too small to engage the fallback switch - it doesn't reach the threshold, but is large enough to cause these assertion failures.

It could also be that TSC is reported stable by the VM but in reality it isn't.  In that case we don't even bother detecting jitter to potentially fallback.

> 
> From:
> 
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> central&job_id=133455725&lineNumber=28509
> 
> What do you guys think is the next step?  It definitely seems like our
> windows TimeStamp code just sometimes reorders things on our automation
> windows VM test machines.  Could we force these to just avoid the high
> performance counters from the start so we don't have to suffer the switch in
> the middle of a test?

I would rather go the opposite way first - let's have an env var (TimeStamp is used too early for prefs) to control this. 0 - default behavior as now (fallback detection), 1 - force high res all the time, 2 - force low res all the time.

Let's run the tests with hi-res forced for a while to eliminate the fallback bit.

I can file a bug if you agree.
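Roughly what I have in mind for the 0/1/2 switch, as a sketch only. The variable name `HYPOTHETICAL_TIMESTAMP_MODE` and the names `TimerMode`, `ParseTimerMode` and `TimerModeFromEnv` are placeholders I made up, not an existing setting or API:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

enum class TimerMode {
  AutoDetect = 0,    // default behavior as now (fallback detection)
  ForceHighRes = 1,  // always use the high res counter
  ForceLowRes = 2    // always use the low res counter
};

// Maps the raw env value to a mode. Unset or unrecognized values keep the
// current default so a typo cannot change behavior silently.
inline TimerMode ParseTimerMode(const char* value) {
  if (!value) {
    return TimerMode::AutoDetect;
  }
  if (std::strcmp(value, "1") == 0) {
    return TimerMode::ForceHighRes;
  }
  if (std::strcmp(value, "2") == 0) {
    return TimerMode::ForceLowRes;
  }
  return TimerMode::AutoDetect;
}

// Reads the (hypothetical) environment variable. std::getenv works before
// prefs are available, which is the reason for using an env var at all.
inline TimerMode TimerModeFromEnv() {
  return ParseTimerMode(std::getenv("HYPOTHETICAL_TIMESTAMP_MODE"));
}
```

Keeping the parsing separate from the getenv call makes the mapping easy to test without touching the process environment.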

> 
> For example, see bug 1391693 comment 101 for another place this is issue
> showed up for me.

Flags: needinfo?(honzab.moz) → needinfo?(bkelly)

Comment 23

•

7 years ago

And to add - I don't know exactly how the two timestamps you are comparing are captured, but to be honest, I think the assertion that fails here is a bit naive.  Two different timestamps regardless if captured one by one simply may have some delay between them.  You should add a reasonable threshold, like (a - b).ToMillis() < 2 or so.  Note that if two threads are capturing these two timestamps, or even just one in close succession, on Windows the thread scheduling may introduce delays.
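Something along these lines, sketched with std::chrono instead of mozilla::TimeStamp so it stands alone. `TimePoint` and `WithinTolerance` are names I made up, and the 2 ms value is just the example threshold from above:

```cpp
#include <cassert>
#include <chrono>

using TimePoint = std::chrono::time_point<std::chrono::steady_clock,
                                          std::chrono::nanoseconds>;

// True if a and b differ by less than `tolerance`, in either direction.
inline bool WithinTolerance(TimePoint a, TimePoint b,
                            std::chrono::milliseconds tolerance) {
  const auto diff = a > b ? a - b : b - a;  // absolute difference
  return diff < tolerance;
}
```

The check is symmetric, so it tolerates the two timestamps swapping order as long as they stay close together.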

Comment 24

•

7 years ago

(In reply to Honza Bambas (:mayhemer) from comment #23)
> And to add - I don't know exactly how the two timestamps you are comparing
> are captured, but to be honest, I think the assertion that fails here is a
> bit naive.  Two different timestamps regardless if captured one by one
> simply may have some delay between them.  You should add a reasonable
> threshold, like (a - b).ToMillis() < 2 or so.  Note that if two threads are
> capturing these two timestamps, or even just one in close succession, on
> Windows the thread scheduling may introduce delays.

The assertion is basically checking:

1. Capture t1 with TimeStamp::Now()
2. Capture t2 with TimeStamp::Now()
3. Sort t1 and t2 into a list with soonest first using TimeStamp::operator<(), etc.  Result is [t1, t2]
4. Some time later re-sort the list.  Assert that first element is still t1.

The assertion that is failing is that t2 is now sorting before t1.
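Boiled down to a toy model (this is not the TimerThread code; `EarliestIndex` and `FirstStillFirst` are made-up names, and the doubles stand for whatever value each timestamp appears to have at the time of each sort):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the entry that would sort first (smallest observed value).
inline std::size_t EarliestIndex(const std::vector<double>& observedMs) {
  return static_cast<std::size_t>(
      std::min_element(observedMs.begin(), observedMs.end()) -
      observedMs.begin());
}

// The invariant from steps 3 and 4: the entry that sorted first at step 3
// must still sort first at step 4. Element i in both vectors refers to the
// same timestamp, observed at the two different sort times.
inline bool FirstStillFirst(const std::vector<double>& atStep3,
                            const std::vector<double>& atStep4) {
  return EarliestIndex(atStep3) == EarliestIndex(atStep4);
}
```

If the observed values are the same at both sorts the invariant trivially holds. It can only fail if the same timestamps compare differently the second time, which is what the 2.000000 vs 1.000000 output suggests.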

I don't see how this could happen if TimeStamp values were stable.  Also, it only happens on automation VM windows machines.

If any fuzzing needs to be done I think TimeStamp::operator<() should be taking care of that, no?

Flags: needinfo?(bkelly)

Comment 25

•

7 years ago

(In reply to Honza Bambas (:mayhemer) from comment #22)
> > What do you guys think is the next step?  It definitely seems like our
> > windows TimeStamp code just sometimes reorders things on our automation
> > windows VM test machines.  Could we force these to just avoid the high
> > performance counters from the start so we don't have to suffer the switch in
> > the middle of a test?
> 
> I would rather go the opposite way first - let's have an env var (TimeStamp
> is used to early for prefs) to control this. 0 - default behavior as now
> (fallback detection), 1 - force high res all the time, 2 - force low res all
> the time.
> 
> Let's run the tests with hi-res forced for awhile to eliminate the fallback
> bit.

I think this would be great to try.

Comment 26

•

7 years ago

(In reply to Ben Kelly [:bkelly] from comment #24)
> (In reply to Honza Bambas (:mayhemer) from comment #23)
> > And to add - I don't know exactly how the two timestamps you are comparing
> > are captured, but to be honest, I think the assertion that fails here is a
> > bit naive.  Two different timestamps regardless if captured one by one
> > simply may have some delay between them.  You should add a reasonable
> > threshold, like (a - b).ToMillis() < 2 or so.  Note that if two threads are
> > capturing these two timestamps, or even just one in close succession, on
> > Windows the thread scheduling may introduce delays.
> 
> The assertion is basically checking:
> 
> 1. Capture t1 with TimeStamp::Now()
> 2. Capture t2 with TimeStamp::Now()
> 3. Sort t1 and t2 into a list with soonest first using
> TimeStamp::operator<(), etc.  Result is [t1, t2]
> 4. Some time later re-sort the list.  Assert that first element is still t1.
> 
> The assertion that is failing is that t2 is now sorting before t1.

Thanks for this explanation.  This was the missing bit.

But when [1] shows 1 and 2 ms, we have not fallen back to GetTickCount64 yet.  It would return a 0 or 16 ms difference, see [2] - we return the difference of values (inside TimeStamp) captured with GetTickCount64 only, which SHOULD have only ~15.6 ms resolution.  This is a software counter (inside the Windows kernel) that is updated on context switch and is, to my knowledge, insensitive to timeBeginPeriod.

[1] https://dxr.mozilla.org/mozilla-central/rev/19b32a138d08f73961df878a29de6f0aad441683/xpcom/threads/TimerThread.cpp#656
[2] https://dxr.mozilla.org/mozilla-central/rev/19b32a138d08f73961df878a29de6f0aad441683/mozglue/misc/TimeStamp_windows.cpp#329

> 
> I don't see how this could happen if TimeStamp values were stable.  Also, it
> only happens on automation VM windows machines.
> 
> If any fuzzing needs to be done I think TimeStamp::operator<() should be
> taking care of that, no?

I'll file a bug and try to find time to write a patch to control the qpc/gtc/auto-detect usage.