Closed Bug 1139856 Opened 10 years ago Closed 9 years ago

Futex wakeups may sometimes be lost

Categories

(Core :: JavaScript Engine, defect)

x86_64
All
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox41 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

There is some evidence that futex wakeups are sometimes lost:

https://bugzilla.mozilla.org/show_bug.cgi?id=1137465
https://github.com/lars-t-hansen/parlib-simple/issues/26

Bug 1137465 comment 6 speculates that this may be a result of interrupt logic but that is not well substantiated.

This reproduces only intermittently but I have pretty good luck losing functionality within one minute by running the mandelbrot-animated-simd demo from github.com/lars-t-hansen on 8 cores on my MacBook Pro.  Viewed in a debugger, all the workers are in futexWait and the main thread is not doing anything in particular; it's not hung.  There's actually no hard evidence that a wakeup is lost, but losing one is consistent with what happens and the system state.
Harder to reproduce now, but still reproduces.  (I had hoped this was caused by bug 1146364.)

The full reference is github.com/lars-t-hansen/parlib-simple/demo/mandelbrot-animation-simd.
I have a better test case now that stresses futexes much more and I've been able to track this down.

The bug is a race condition between futexWait, requestInterrupt, and futexWake.  In futexWait the waiting thread takes a lock, and then waits on a condition with the lock released.  While that thread is sleeping an interrupt fires, requestInterrupt acquires the lock, and requestInterrupt wakes the waiting thread before releasing the lock again.  The wakeup changes the state of the futex to a value that indicates the futex is not waiting.  If a futexWake then arrives (and takes the lock) just after this, before the waiting thread has woken up (and has reclaimed the lock), then the wake will be lost (dropped on the floor) because the futex is indicated as not waiting.

The code is safe, in that all variables are manipulated within proper critical sections, but really the lock that is acquired by requestInterrupt should be transfered without unlocking to futexWait, where it now has to be reclaimed.  That transfer seems complicated; the better fix is probably to change the meaning of an existing state value or to introduce an additional state value to cover this case, and to adapt the futex system to handle the additional state.
Attached patch bug1139856-no-lost-wakeup.diff (obsolete) (deleted) — Splinter Review
Treat the existing intermediate state WokenForJSInterrupt as a waiting state.  The change makes intuitive sense, as WokenForJSInterrupt is morphed into  WaitingInterrupted once the waiting thread wakes up, and the latter state value is already treated as a waiting state.
Attachment #8604603 - Flags: review?(luke)
Comment on attachment 8604603 [details] [diff] [review]
bug1139856-no-lost-wakeup.diff

Review of attachment 8604603 [details] [diff] [review]:
-----------------------------------------------------------------

Hah, good find.

::: js/src/builtin/AtomicsObject.cpp
@@ +1036,5 @@
> +    // WokenForJSInterrupt for a short time before it actually wakes
> +    // up and goes into WaitingInterrupted.  In those states the
> +    // worker is still waiting, and if an explicit wake arrives the
> +    // worker transitions to Woken.  See further comments in wait()
> +    // below and also see bug #1139856.

I think the reason you added this comment (and we forgot to add this case initially) is that the name WokenForJSInterrupt sounds like we're not waiting when in fact we still are.  How about WaitingNotifiedForInterrupt?  That makes it clear that the transition of states here is:
  Waiting -> WaitingNotifiedForInterrupt -> WaitingInterrupted
and that all three are isWaiting() so no scary comment is necessary.

@@ +1170,5 @@
>          state_ = Woken;
>          return;
>      }
> +    if (state_ == WokenForJSInterrupt)
> +        return;

This initially confused me until I saw that this only hits when reason != WakeExplicit.  How about moving this into the switch (in the WakeForJSInterrupt/WaitingNotifiedForInterrupt case) where it is then quite obvious in meaning.  That being said, if we hit this case, we've already called PR_NotifyCondVar() and the waiter hasn't been woken yet (or we wouldn't have our lock), so isn't it benign to just ignore this case?
(In reply to Luke Wagner [:luke] from comment #4)
> Comment on attachment 8604603 [details] [diff] [review]
> bug1139856-no-lost-wakeup.diff
> 
> Review of attachment 8604603 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Hah, good find.

I'm quite happy now.  You should have seen me while I was looking, though.

> ::: js/src/builtin/AtomicsObject.cpp
> @@ +1036,5 @@
> > +    // WokenForJSInterrupt for a short time before it actually wakes
> > +    // up and goes into WaitingInterrupted.  In those states the
> > +    // worker is still waiting, and if an explicit wake arrives the
> > +    // worker transitions to Woken.  See further comments in wait()
> > +    // below and also see bug #1139856.
> 
> I think the reason you added this comment (and we forgot to add this case
> initially) is that the name WokenForJSInterrupt sounds like we're not
> waiting when in fact we still are.  How about WaitingNotifiedForInterrupt? 
> That makes it clear that the transition of states here is:
>   Waiting -> WaitingNotifiedForInterrupt -> WaitingInterrupted
> and that all three are isWaiting() so no scary comment is necessary.

I'm in favor of this naming change and will make it.

> @@ +1170,5 @@
> >          state_ = Woken;
> >          return;
> >      }
> > +    if (state_ == WokenForJSInterrupt)
> > +        return;
> 
> This initially confused me until I saw that this only hits when reason !=
> WakeExplicit.  How about moving this into the switch (in the
> WakeForJSInterrupt/WaitingNotifiedForInterrupt case) where it is then quite
> obvious in meaning.  That being said, if we hit this case, we've already
> called PR_NotifyCondVar() and the waiter hasn't been woken yet (or we
> wouldn't have our lock), so isn't it benign to just ignore this case?

It is in fact benign to ignore the case altogether, but I was feeling pedantic: isWaiting() is now true for any of the three states, and if another interrupt is requested before the thread wakes up to process the WokenForJSInterrupt state then we will get in here again; I felt that it was better to handle that situation explicitly even if I could have ignored it.

Re where we put the code, I probably don't care very much.  I'll try moving it where you suggest and if it looks plausible I'll leave it there.
Attachment #8604603 - Attachment is obsolete: true
Attachment #8604603 - Flags: review?(luke)
Attachment #8604716 - Flags: review?(luke)
Comment on attachment 8604716 [details] [diff] [review]
bug1139856-no-lost-wakeup.diff, take 2

Review of attachment 8604716 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/AtomicsObject.cpp
@@ +1036,5 @@
> +    // WaitingNotifiedForInterrupt for a short time before it actually
> +    // wakes up and goes into WaitingInterrupted.  In those states the
> +    // worker is still waiting, and if an explicit wake arrives the
> +    // worker transitions to Woken.  See further comments in wait()
> +    // below and also see bug #1139856.

Explicitly describing the transition is friendly, but I wouldn't mention the bug#, since it makes it look like this is some weird hack when it's just a consequence of the essential state machine.
Attachment #8604716 - Flags: review?(luke) → review+
https://hg.mozilla.org/mozilla-central/rev/2e7ce565cfc4
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: