Open Bug 273649 Opened 20 years ago Updated 2 years ago

PR_CallOnce unsafe on some SMP platforms : PowerPC and Itanium

Categories

(NSPR :: NSPR, defect, P4)

Tracking

(Not tracked)

People

(Reporter: julien.pierre, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: FIPS)

Attachments

(2 files, 2 obsolete files)

Chris Elving found this problem in PR_CallOnce as part of the investigation of bug 237934. Apparently, a separate bug was never filed on PR_CallOnce. Here is the original bug description:

When I read Wan-Teh's update, I looked at the source to PR_CallOnce and found the following:

    PR_IMPLEMENT(PRStatus) PR_CallOnce(
        PRCallOnceType *once,
        PRCallOnceFN    func)
    {
        if (!_pr_initialized) _PR_ImplicitInitialization();

        if (!once->initialized) {
            if (PR_AtomicSet(&once->inProgress, 1) == 0) {
                once->status = (*func)();
                PR_Lock(mod_init.ml);
                once->initialized = 1;
                PR_NotifyAllCondVar(mod_init.cv);
                PR_Unlock(mod_init.ml);
            } else {
                PR_Lock(mod_init.ml);
                while (!once->initialized) {
                    PR_WaitCondVar(mod_init.cv, PR_INTERVAL_NO_TIMEOUT);
                }
                PR_Unlock(mod_init.ml);
            }
        }
        return once->status;
    }

Here, once->initialized is inspected outside of mod_init.ml. I believe this is unsafe on SMP platforms that don't guarantee store order; immediately after CPU0 executes func and sets once->initialized = 1, it's possible for CPU1's cache to contain once->initialized == 1 WITHOUT containing the result of func. For example, it could be possible for CPU1 to return from PR_CallOnce only to discover a NULL symWrapKeysLock. Do you agree that PR_CallOnce is potentially unsafe?
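To make the hazard concrete, here is a sketch of the problematic interleaving on a weakly ordered machine (CPU roles and timing are illustrative):

    /*
     * CPU0 (first caller)              CPU1 (later caller)
     * -------------------              -------------------
     * once->status = (*func)();
     * once->initialized = 1;
     *                                  loads once->initialized, sees 1
     *                                  skips the lock, returns
     *                                  loads data written by func() --
     *                                  may still observe stale values,
     *                                  because no barrier orders CPU0's
     *                                  two stores or CPU1's two loads
     */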
Wan-Teh, While the bug only affects a limited set of platforms, it would be nice to fix it in NSPR 4.6. The generic fix, I believe, would be to always call:

    PR_Lock(mod_init.ml);

before inspecting once->initialized. However, that would be very inefficient and should be done only on the platforms that truly need it. Chris mentioned that HP PA-RISC 2.0 is a platform that could theoretically suffer from this, but he wasn't able to actually reproduce a problem on it.
Priority: -- → P4
Target Milestone: --- → 4.6
Attached patch test program for this condition (obsolete) (deleted) — Splinter Review
This is a little test program I wrote to try to reproduce the condition. It uses PR_CallOnce to fill an array, and then checks the content of the array from multiple threads to make sure the array really has been filled. The array size probably should be made tunable, and I'm not sure the notification mechanism I'm using for the threads is the most efficient. The program takes two arguments: duration in seconds and number of threads. It should only be run for a short period of time, since the same array is used, and it is expected that after a small number of operations the caches on all CPUs will become coherent. The program could be made smarter and try to fill multiple arrays with different values until something is inconsistent, using multiple PR_CallOnce calls. I didn't attempt that. So far, I was unable to find any problem on a quad-CPU HP PA-RISC box. Is there any other architecture that's likely to show this problem?
PR_CallOnce can be made correct by adding a memory barrier instruction for the case where once->initialized is true, like this:

    if (!once->initialized) {
        ...
    } else {
        mb();
    }

We can have two implementations. The default implementation always takes the slow path. Then, for those platforms where we know how to implement mb() (including those where mb() is not necessary, such as x86), we use the implementation I described above along with the definition of the mb() macro.

This code is broken on multiprocessor systems with weak memory consistency models. I suggest you do the experiments on PowerPC multiprocessor systems running IBM AIX, Linux, or Mac OS X, or Alpha multiprocessor systems running Tru64 Unix or Linux. Your test program should be modified to have one thread that will get to run the initialization routine:

    sleep(1);  /* ensure the other threads are spinning */
    PR_CallOnce(&dummy, &fill);

and other threads that will see once->initialized to be true and then check the initialized data. You need to modify the definition of the PRCallOnceType structure to declare 'initialized' with 'volatile'. Then, those other threads should spin waiting until dummy.initialized is true:

    while (!dummy.initialized) {
        ;  /* you can also call sched_yield() */
    }
    PR_CallOnce(&dummy, &fill);

You should call the appropriate pthread function or system call to ensure that the initializer thread is running on a different CPU from the other threads. (Look for "processor affinity" related functions.)
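Putting the code from the original description together with the proposed else branch gives a sketch like the following; mb() is a placeholder for a platform memory barrier, not an existing NSPR API:

    PR_IMPLEMENT(PRStatus) PR_CallOnce(PRCallOnceType *once, PRCallOnceFN func)
    {
        if (!_pr_initialized)
            _PR_ImplicitInitialization();

        if (!once->initialized) {
            if (PR_AtomicSet(&once->inProgress, 1) == 0) {
                once->status = (*func)();
                PR_Lock(mod_init.ml);
                once->initialized = 1;
                PR_NotifyAllCondVar(mod_init.cv);
                PR_Unlock(mod_init.ml);
            } else {
                PR_Lock(mod_init.ml);
                while (!once->initialized) {
                    PR_WaitCondVar(mod_init.cv, PR_INTERVAL_NO_TIMEOUT);
                }
                PR_Unlock(mod_init.ml);
            }
        } else {
            mb();  /* read barrier: order the load of once->initialized
                      before the loads of whatever func() wrote */
        }
        return once->status;
    }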
Attachment #182425 - Flags: superreview?(julien.pierre.bugs)
Attachment #182425 - Flags: review?(chris.elving)
Wan-Teh, Thanks for the tips; I will try to improve my test program with them. I'm hoping to keep it portable so I can try it on multiple architectures. Regarding the new implementation of PR_CallOnce, it appears to be correct. I didn't give it r+ because:
a) I haven't yet been able to reproduce the problem and verify that the patch fixes it.
b) I know you didn't intend to do it in this pass, but I think before we can check this in, we should add conditional compilation to avoid the additional PR_Lock / PR_Unlock in the 99% of calls to PR_CallOnce where the function was previously called. That would require identifying the vulnerable platforms first. The default should probably be to assume vulnerability, to be on the safe side, and then define macros for each platform we know not to be vulnerable, e.g. HAVE_TSO.
I do have access to PowerPC AIX machines, so I will try that next, and retry on HP-UX with the improved test program. I don't have access to any Macs. I'll check for an Alpha box, but I know we no longer build on that.
Attached patch updated test program (deleted) — Splinter Review
- removed duration argument
- added argument for buffer size
- made the test write one element in the array every 4 KB rather than the full array (in the hope that multiple cache lines are involved in the writes, and to prevent a full cache flush if the CPU cache fills up)
- launched multiple threads spinning waiting for dummy.initialized, then calling PR_CallOnce and check()
- launched one thread, after the spinning threads, to do the PR_CallOnce and the initial fill

I tried this on a PowerPC AIX box, so far without any success, even with very large buffers. I don't know how to query the number of CPUs on AIX, though, so I'm not sure the test was relevant (it wouldn't be if it was a single-CPU box). In the spinning threads, I removed the PR_CallOnce call completely, and I still didn't get any failure; i.e., the memory written by the function was in the expected state in all threads immediately after the initialized flag became non-zero. I would leave this running in a script overnight just to try my luck, except the AIX box isn't dedicated to our group. The HP box was hung, so I couldn't try there today.
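For reference, a self-contained sketch of the spin/fill structure described above (buffer size, stride, thread count, and names are illustrative; this is not the attached program):

    #include <stdio.h>
    #include "nspr.h"

    #define BUFSIZE (16 * 1024 * 1024)
    #define STRIDE  4096                 /* one write per 4 KB */

    static PRCallOnceType dummy;
    static unsigned char buf[BUFSIZE];

    static PRStatus fill(void)
    {
        int i;
        for (i = 0; i < BUFSIZE; i += STRIDE)
            buf[i] = 0xA5;
        return PR_SUCCESS;
    }

    static void check(void)
    {
        int i;
        for (i = 0; i < BUFSIZE; i += STRIDE)
            if (buf[i] != 0xA5)
                fprintf(stderr, "stale read at offset %d\n", i);
    }

    static void SpinThread(void *arg)
    {
        /* NB: 'initialized' is not volatile, so an optimizing compiler
           may hoist this load; see Wan-Teh's suggestion above to declare
           it volatile for the test. */
        while (!dummy.initialized)
            ;
        PR_CallOnce(&dummy, fill);
        check();
    }

    static void FillThread(void *arg)
    {
        PR_Sleep(PR_SecondsToInterval(1)); /* let the spinners get going */
        PR_CallOnce(&dummy, fill);
    }

    int main(void)
    {
        int i;
        PRThread *spinners[4], *filler;

        for (i = 0; i < 4; i++)
            spinners[i] = PR_CreateThread(PR_USER_THREAD, SpinThread, NULL,
                                          PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                          PR_JOINABLE_THREAD, 0);
        filler = PR_CreateThread(PR_USER_THREAD, FillThread, NULL,
                                 PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                 PR_JOINABLE_THREAD, 0);

        for (i = 0; i < 4; i++)
            PR_JoinThread(spinners[i]);
        PR_JoinThread(filler);
        return 0;
    }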
Attachment #182231 - Attachment is obsolete: true
Wan-Teh, Do you think the updated test program is adequate? I haven't been able to reproduce the problem at all with it. If it doesn't exist on any current platform, I would be inclined to close this defect as WORKSFORME.
This is a valid bug. A test failure proves the existence of a bug; a test success doesn't prove the non-existence of bugs. You can search for "double-checked locking", the pattern this code is a known instance of. The difficulty of creating a test that exhibits this bug means the correctness of our fix can only be ensured by code reviews, and any attempt to do without locking in the default code path must be reviewed by experts on that CPU architecture.
Wan-Teh, I have read about the subject. I agree with you that a test success is not 100% proof. But the fix you proposed is rather drastic, and we should not make a change for any platform where it isn't needed. You said earlier that you believe the change should be made for HP PA-RISC, PowerPC, and Alpha. Do you believe that should be done unconditionally? Do we know experts on those CPUs who can say whether a change is required for all machines using those architectures, or perhaps only on a subset?
Comment on attachment 182450 [details] [diff] [review]
updated test program

One way to improve this test is to "inline" PR_CallOnce() and check() in SpinThread, and unroll the for loop in check(). NSPR's "system" test program reports the number of CPUs. You must run the test on a multiprocessor system. You can also ask your friends at IBM about functions that bind threads to CPUs. FillThread and SpinThread need to run on different CPUs. If we can make this test fail, it will be a wonderful contribution to multithreaded programming.
The AIX box I used was multiprocessor (dual). The HP boxes are still down, I'll have to check with IT. I believe the SpinThreads are running on different CPUs, because they are started first and are CPU intensive. When running with a number of SpinThreads greater than the number of CPUs, the scheduler has no choice but to assign them different CPUs. Of course, they may not always stay there indefinitely. The FillThread starts with a delay (which I have tried to adjust, with no effect). By the time it starts running, there are already SpinThreads scheduled on one or more other CPU(s).
Comment on attachment 182425 [details] [diff] [review]
Portable, correct implementation of PR_CallOnce.

Marking r- because the locking of mod_init.ml should be platform conditional.
Attachment #182425 - Flags: superreview?(julien.pierre.bugs) → superreview-
This patch does two things:
1) It #ifdefs the two implementations of PR_CallOnce based on whether the CPU supports total store order. I didn't #ifdef every line of code that changed, because the body of code was too small and the result was unreadable. Instead, there is a single #ifdef with one block of code for each implementation.
2) It creates an internal function called _pr_CallOnce which handles the case of calling a function with or without arguments. PR_CallOnce and PR_CallOnceWithArg now just call that static function, instead of duplicating an otherwise almost identical implementation.
This patch does not currently define _PR_NO_TSO for any platform, since none has been identified that definitely needs it. We still need the CPU experts to step in to make that determination.
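A sketch of the shape the patch describes (the helper's exact signature and the _PR_NO_TSO branch below are guesses at the patch's contents, not the patch itself):

    static PRStatus _pr_CallOnce(PRCallOnceType *once,
                                 PRCallOnceFN func,
                                 PRCallOnceWithArgFN funcWithArg, void *arg)
    {
        if (!_pr_initialized)
            _PR_ImplicitInitialization();

    #ifdef _PR_NO_TSO
        /* Pessimistic path: every caller inspects once->initialized under
         * mod_init.ml, so the lock's barriers order that load against the
         * loads of data written by the init function. */
        PR_Lock(mod_init.ml);
        while (!once->initialized) {
            if (PR_AtomicSet(&once->inProgress, 1) == 0) {
                PR_Unlock(mod_init.ml);
                once->status = func ? (*func)() : (*funcWithArg)(arg);
                PR_Lock(mod_init.ml);
                once->initialized = 1;
                PR_NotifyAllCondVar(mod_init.cv);
            } else {
                PR_WaitCondVar(mod_init.cv, PR_INTERVAL_NO_TIMEOUT);
            }
        }
        PR_Unlock(mod_init.ml);
    #else
        /* TSO path: the original lock-free fast path. */
        if (!once->initialized) {
            if (PR_AtomicSet(&once->inProgress, 1) == 0) {
                once->status = func ? (*func)() : (*funcWithArg)(arg);
                PR_Lock(mod_init.ml);
                once->initialized = 1;
                PR_NotifyAllCondVar(mod_init.cv);
                PR_Unlock(mod_init.ml);
            } else {
                PR_Lock(mod_init.ml);
                while (!once->initialized) {
                    PR_WaitCondVar(mod_init.cv, PR_INTERVAL_NO_TIMEOUT);
                }
                PR_Unlock(mod_init.ml);
            }
        }
    #endif
        return once->status;
    }

    PR_IMPLEMENT(PRStatus) PR_CallOnce(PRCallOnceType *once, PRCallOnceFN func)
    {
        return _pr_CallOnce(once, func, NULL, NULL);
    }

    PR_IMPLEMENT(PRStatus) PR_CallOnceWithArg(PRCallOnceType *once,
                                              PRCallOnceWithArgFN func,
                                              void *arg)
    {
        return _pr_CallOnce(once, NULL, func, arg);
    }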
Attachment #182425 - Attachment is obsolete: true
Attachment #183973 - Flags: review?(wtchang)
We should figure out the optimized implementations for the most common processors today:
- AMD64/EM64T
- Itanium
- PowerPC
- SPARC
- x86

For x86 and AMD64/EM64T, no assembly code is needed because these processors have "sequential consistency". For PowerPC, you just need an "isync" instruction after the if statement. (See The PowerPC Architecture, Section E.1.2 "Lock Acquisition and Release", p. 254.) This requires assembly code. For Itanium, we can definitely use an "mf" instruction after the if statement, but we should be able to code the if statement using an "ld.acq" (ordered load) instruction. This also requires assembly code. I don't know if SPARC needs a memory barrier here.

A better name for the _PR_NO_TSO macro would be _PR_HAVE_MB (memory barrier), and we would implement the MB() macro as inline assembly. Then we would write the code like this:

    #ifdef _PR_HAVE_MB
        if (!once->initialized) {
            ...
            return once->status;
        }
        MB();
        return once->status;
    #else
        PR_Lock(mod_init.ml);
        if (!once->initialized) {
            ...
        }
        PR_Unlock(mod_init.ml);
        return once->status;
    #endif

Also, C's inline function feature is not universally supported.
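A sketch of what the MB() definitions might look like with GCC-style inline assembly (the platform tests and the x86 choice of a compiler-only barrier are assumptions; MSVC and the HP compiler would need their own spellings):

    #if defined(__powerpc__) || defined(__ppc__)
    #define _PR_HAVE_MB
    #define MB() __asm__ __volatile__("isync" : : : "memory")
    #elif defined(__ia64__)
    #define _PR_HAVE_MB
    #define MB() __asm__ __volatile__("mf" : : : "memory")
    #elif defined(__i386__) || defined(__x86_64__)
    #define _PR_HAVE_MB
    /* The processor keeps loads ordered; only the compiler must not reorder. */
    #define MB() __asm__ __volatile__("" : : : "memory")
    #endif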
Wan-Teh, SPARC does not need a memory barrier for this case. Do we have any HP PA-RISC CPU expert? Sun still ships products on that architecture. Do the compilers for PowerPC and Itanium allow inline assembly? On AIX, the old IBM compilers did not; I don't know if that's changed. For Mac, I know gcc does, but there are other compilers available for it that may not.
Summary: PR_CallOnce unsafe on some SMP platforms → PR_CallOnce unsafe on some SMP platforms : PowerPC and Itanium
Julien, we can look for a PA-RISC architecture manual, which will have this type of information. Whether inline assembly is available depends on the compilers, not on the processor architectures. For example, Windows, Linux, and HP-UX all run on Itanium; we need to figure out how to do inline assembly in MSVC, GCC, and the HP compiler.
I found PA-RISC 1.1 and 2.0 architecture manuals online. The memory barrier instruction is "sync". I also realized that the optimized implementation needs another memory barrier right before setting once->initialized to 1. The pseudocode I showed earlier is missing this memory barrier.
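In code terms, the point is that the optimized path needs a barrier on both sides of the flag (a sketch; MB() as above, which on PA-RISC would expand to a "sync" instruction per the architecture manuals):

    /* writer: inside the first caller */
    once->status = (*func)();
    MB();                    /* order func()'s stores before the flag store */
    once->initialized = 1;

    /* reader: the lock-free fast path */
    if (once->initialized) {
        MB();                /* order the flag load before the data loads */
        return once->status;
    }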
Comment on attachment 182425 [details] [diff] [review]
Portable, correct implementation of PR_CallOnce.

Removing review request from obsolete patch.
Attachment #182425 - Flags: review?(chris.elving)
Blocks: 354609
QA Contact: wtchang → nspr
The target milestone is already released. Resetting target milestone.
Target Milestone: 4.6 → ---
Comment on attachment 183973 [details] [diff] [review]
conditionally add implementation based on _PR_NO_TSO define

I'm cancelling this old review request due to the additional work requested by Wan-Teh in comment 14.
Attachment #183973 - Flags: review?(wtc)
Wan-Teh points out that this may be needed on SPARC too. See http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view#membar-ops-3c . I am trying to test one of those functions but am getting a link error right now. It seems that membar_consumer() isn't available from any library in /usr/lib. At least,

    find /usr/lib -name \*.so | xargs nm | grep membar_consumer

came up empty.
On my x64 box, I found the following in /usr/lib/amd64/libc.so:

    [14362] | 317328| 4|FUNC |GLOB |0 |12 |_membar_consumer
    [16209] | 317328| 4|FUNC |WEAK |0 |12 |membar_consumer

    (dbx) dis membar_consumer
    0x000000000004d790: lfence
    0x000000000004d793: ret

and in /usr/lib/libc.so:

    [7836] | 145776| 6|FUNC |GLOB |0 |11 |_membar_consumer
    [6274] | 145776| 6|FUNC |WEAK |0 |11 |membar_consumer

    (dbx) dis membar_consumer
    0xfeef3970: _membar_exit       : lock xorl $0x00000000,(%esp)
    0xfeef3975: _membar_exit+0x0005: ret
Julien, does Solaris SPARC have membar_enter? On Solaris 10, we may be able to use the new membar_consumer function. See the membar_ops(3C) man page: http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view#membar-ops-3c With Visual C++ 2003 .NET (7.1) or later, we may be able to use the _ReadWriteBarrier compiler intrinsic function: http://msdn2.microsoft.com/en-us/library/f20w0x5e(VS.80).aspx But the man page says it delineates the boundaries of code for compiler optimizations, and doesn't mention the processor's memory consistency model. So I'm not sure if it is the right function.
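If membar_ops(3C) turned out to be usable, the placement would presumably mirror the MB() sketch above (a sketch, assuming the Solaris 10 <atomic.h> declarations; not a reviewed patch):

    #include <atomic.h>  /* membar_producer(), membar_consumer() on Solaris 10 */

    /* writer side, in the thread that ran func() */
    once->status = (*func)();
    membar_producer();      /* order func()'s stores before the flag store */
    once->initialized = 1;

    /* reader side, the lock-free fast path */
    if (once->initialized) {
        membar_consumer();  /* order the flag load before the data loads */
        return once->status;
    }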
Wan-Teh, No, there isn't any library containing any membar symbol on SPARC. It's hard to believe that any x86 CPU would have relaxed store order, but it sounds like it may be possible with x64.
Whiteboard: FIPS
I think we need to resolve this. We should either fix it or mark it WONTFIX. Let's not expand the scope of this bug to include CPU architectures other than the ones originally reported; we can file new bugs for other architectures. Maybe Julien should take this bug.
Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: wtc → nobody
