Closed Bug 1555306 Opened 5 years ago Closed 3 years ago

Collision on NSS find_objects()/sqlite3PagerSharedLock() when fetching OCSP causes extreme variance in page load times

Categories

(Core :: Security: PSM, task, P1)

Desktop
Windows 10
task

Tracking


RESOLVED DUPLICATE of bug 1694542
Performance Impact high
Tracking Status
firefox69 --- affected

People

(Reporter: acreskey, Assigned: jesup)

References

Details

(Keywords: perf:pageload, Whiteboard: [psm-assigned])

Attachments

(3 files)

It was observed that there is extreme variance in the page load times for https://www.washingtonpost.com

e.g. results from ~2000ms to ~5000ms when testing against a recorded http session.

Finding the root cause may help us pinpoint and reduce noise in our runtime.

Notes:
Browsertime was used against static sites recorded with WebPageReplay.
The attached graph shows the distribution of 2 sets of 25 cold loads on the 2017 reference laptop.
Both sets of results, straight_wpr_0 and straight_wpr_1, were made with identical configurations, i.e. outside of noise you would expect the same results.

jesup:
I've attached captured profiles through browsertime.

geckoProfile-7 and geckoProfile-25 are of the more common fast load times (~2100ms)
geckoProfile-19 and geckoProfile-22 are two of the slowest loads (~5000ms)

Added Socket,JS to the thread list and also turned off responsiveness.

There's a video which autoplays on this recording, by the way.

I noticed that in the slow loads the network requests visible in the content process are generally "waiting for socket thread" for 4800+ms.

Added 'SSL,DNS' to the thread list.

Fast load (2164ms)
http://bit.ly/30PixDL

Slow load (4781ms)
http://bit.ly/30PYDbQ

I think it's the OCSP requests in the SSL cert threads that have to wait for the main thread?

Some profiles from live sites to see if the OCSP delays are a byproduct of the WebPageReplay recording:

Faster live site: 3357ms
http://bit.ly/2W2XcDv

Slow live site: 6281ms (why are there no samples from the parent process in the first 5 seconds?)
http://bit.ly/2VZliPd

Note that I did see long (e.g. 862ms) DoOCSPRequest calls in the SSL Cert threads on the slow live site.

Depends on: 1555801

SSL threads are colliding with each other for an sqlite DB access critical section, and with the SocketThread.

Clip from #necko:
(base profile: http://bit.ly/2KbbJdI)
[15:14] jesup Socket thread is 90% in ssl3_HandleHandshakeMessage(), split evenly between CERT_NewTempCertificate(), HandshakeCallback() and tls13_HandlePostHelloHandshakeMessage() basically
...
[ lots of blocking involving sqlite3PagerSharedLock() ]
...
[15:30] jesup both the socketthread and the SSL Cert#2 thread are calling find_objects and blocking on a critical section
[15:33] mayhemer I'm honestly lost in the profile, hard to figure out where delays are in the graphs and where to actually start
[15:36] jesup Summary: the socketthread and the SSL Cert#2 thread are calling find_objects() and blocking on a critical section. This eats the first 1/3. then SSL cert #2 dispatches to mainthread (which likely hands off to Socket Thread) the ssl connection to get the OCSP, and 1/2 of that is CERT_NewTempCertificate() (and waiting on the DB), and 1/2 is tls13_HandlePostHelloHandshakeMessage() (which...
[15:36] jesup ...blocks on find_objects() again)
[15:38] jesup let me find who owns that find_objects critical section...
[15:40] jesup aha, the lock on the sqlite is within find_objects also, but apparently not in the critical section
[15:40] jesup aha, it's fighting with SSL Cert #5
[15:41] jesup Look at this profile, with find_objects highlighted: http://bit.ly/2Ke3qOA
[15:43] jesup Note that in the two sections where SocketThread is blocked on the critical section in find_objects(), SSL cert #5 is in find_objects. And SSL Cert #6 is also fighting for it

[15:46] mayhemer in other words, we are doing IO under the lock!
[15:46] mayhemer I like the busy waits sqlite does.. 2x 100ms sleep
[15:48] acreskey jesup: good find! I think longterm that sqllite DB is going away
[15:50] mayhemer I just wonder how to fix this. this seems to me like a nss bug, since it calls a sync API to read data (always a bad thing)
[15:51] mayhemer note that this may also go to HW security tokens, not just disk

Moving to Networking; this might be more of an NSS issue/design issue, but it's at least partly an issue in how Networking interfaces with NSS.

Component: Performance → Networking
Summary: Extreme variance in page load times on www.washingtonpost.com site - recorded http session, reference laptop → Collision on NSS find_objects()/sqlite3PagerSharedLock() when fetching OCSP causes extreme variance in page load times
Whiteboard: [qf:investigate] → [qf:p1:pageload]

This looks like a nss bug to me.

Component: Networking → Security: PSM

The socket thread is blocked by calling SSL_ForceHandshake from nsNSSSocketInfo::DriveHandshake:

RtlEnterCriticalSection
struct nssCryptokiInstanceStr * * find_objects(struct NSSTokenStr *, struct nssSessionStr *, struct CK_ATTRIBUTE *, unsigned long, unsigned int, <unnamed-tag> *)
nssToken_FindObjectsByTemplate
nssToken_FindCertificateByIssuerAndSerialNumber
nssTrustDomain_FindCertificateByIssuerAndSerialNumber
NSSTrustDomain_FindCertificateByEncodedCertificate
CERT_NewTempCertificate
ssl3_HandleHandshakeMessage
ssl3_HandleNonApplicationData
ssl3_HandleRecord
ssl3_GatherCompleteHandshake
SSL_ForceHandshake
nsNSSSocketInfo::DriveHandshake()
bool mozilla::net::nsHttpConnection::EnsureNPNComplete(nsresult & const, unsigned int & const)
mozilla::net::nsHttpConnection::OnSocketWritable()

Cache I/O thread is deserializing security info:

find_objects
nssToken_FindObjectsByTemplate
nssToken_FindCertificateByIssuerAndSerialNumber
nssTrustDomain_FindCertificateByIssuerAndSerialNumber
nssTrustDomain_FindCertificateByEncodedCertificate
CERT_NewTempCertificate
CERT_DecodeCertFromPackage
nsNSSCertificate::InitFromDER(char*, int)
nsNSSCertificate::Read(nsIObjectInputStream*)
nsBinaryInputStream::ReadObject(bool, nsISupports**)
NS_ReadOptionalObject(nsIObjectInputStream*, bool, nsISupports**)
mozilla::psm::TransportSecurityInfo::Read(nsIObjectInputStream*)
nsBinaryInputStream::ReadObject(bool, nsISupports**)
NS_DeserializeObject(nsTSubstring<char> const&, nsISupports**)
mozilla::net::CacheEntry::GetSecurityInfo(nsISupports**)
mozilla::net::nsHttpChannel::OpenCacheInputStream(nsICacheEntry*, bool, bool)
mozilla::net::nsHttpChannel::OnCacheEntryCheck(nsICacheEntry*, nsIApplicationCache*, unsigned int*)
mozilla::net::CacheEntry::InvokeCallback(mozilla::net::CacheEntry::Callback&)
mozilla::net::CacheEntry::InvokeCallbacks(bool)
mozilla::net::CacheEntry::InvokeCallbacks()
mozilla::net::CacheEntry::OnFileReady(nsresult, bool)
non-virtual thunk to mozilla::net::CacheEntry::OnFileReady(nsresult, bool)
mozilla::net::CacheFile::OnMetadataRead(nsresult)
non-virtual thunk to mozilla::net::CacheFile::OnMetadataRead(nsresult)
mozilla::net::CacheFileMetadata::OnDataRead(mozilla::net::CacheFileHandle*, char*, nsresult)
mozilla::net::ReadEvent::Run()
mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)
mozilla::net::CacheIOThread::ThreadFunc()

We're seeing this sort of problem crop up regularly in test runs; here's one from foxnews:
https://perfht.ml/2X4bwjw
2s OCSP response - 1s "waiting for Socket Thread", 1s waiting for DNS. From previous investigations, we'd likely find (like above) that we have a bunch of SSL threads and the Socket thread fighting over the sqlite DB via find_objects().
Also this profile https://perfht.ml/2QIC4RE (simpleperf on a geckoview_example startup) shows Cache2 IO hitting find_objects() as well, for deserialization of security info stored in cache entries.

Is there any way to reduce the delay when we collide (perhaps change the 100ms backoff in sqlite, per mayhemer), or to make these operations async in some manner? Or even serialize all of these onto a single thread (even if it blocks callers) to avoid the 100ms backoff behavior?
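For illustration, here is roughly what tuning that backoff could look like through the public sqlite3 C API, assuming we had access to the database handle softoken opens (we don't today, so this is a sketch of the mechanism, not a proposed patch):

#include <sqlite3.h>

// Custom busy handler: retry quickly with short sleeps instead of the long
// default backoff. `count` is how many times sqlite has already invoked us
// for the current lock attempt; returning 0 gives up (caller sees
// SQLITE_BUSY), non-zero retries.
static int QuickRetryBusyHandler(void* /* closure */, int count) {
  if (count > 200) {
    return 0;
  }
  sqlite3_sleep(1);  // back off ~1ms per attempt
  return 1;
}

void TuneBusyBehavior(sqlite3* db) {
  // Either install the custom handler...
  sqlite3_busy_handler(db, QuickRetryBusyHandler, nullptr);
  // ...or just cap the total wait with the built-in timeout handler:
  // sqlite3_busy_timeout(db, 50);  // milliseconds
}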

I hear the sqlite DB is to be replaced with an rkv store - when would this occur? Bug#? Would that impact this pattern/delay?

Flags: needinfo?(dkeeler)

The NSS sqlite DB is not being replaced with rkv. You're probably thinking of intermediate preloading (bug 1535662), which we're still in the process of shipping (it's only enabled on Nightly right now). Intermediate preloading helps a bit with this contention by potentially avoiding a call to find_objects (bug 1557092).

I think the issue here is the transaction created by the intermediate caching code importing certificates[0] is blocking readers. Setting PRAGMA journal_mode = WAL; made a test program I wrote that pits a reader thread against a writer thread go orders of magnitude faster, but I'm not sure if we can set and ship that in NSS (for one, the sqlite documentation page says it doesn't work with DBs on networked drives - I don't know if that means "it just won't do WAL and falls back to DELETE" or if it means "it'll fail in byzantine ways if you try"). To reduce the long sleeps we could override the default sqlite busy handler, but that would still block any threads trying to read while the transaction completes. Since that involves syncing to the underlying filesystem, that'll always be slow. Another option would be to stop caching intermediates altogether. In theory once intermediate preloading ships intermediate caching is less important (it's mostly a band-aid anyway to help with misconfigured servers).

[0] See bug 1529044 for the most recent implementation, although the feature has been around for a long time. Also, any code that imports new certificates would have this problem. Luckily, it's rare that users do this.
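For context, a minimal standalone sketch of the kind of reader-vs-writer experiment described above (not the actual test program; the table name and timings are made up). With journal_mode=WAL the reader can proceed against the last committed snapshot while the writer's transaction is open; in the default rollback mode it can end up waiting on the writer's locks:

#include <sqlite3.h>
#include <thread>

int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open("test.db", &db) != SQLITE_OK) return 1;
  sqlite3_exec(db, "PRAGMA journal_mode=WAL;", nullptr, nullptr, nullptr);
  sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS certs (id INTEGER, der BLOB);",
               nullptr, nullptr, nullptr);

  std::thread writer([] {
    sqlite3* wdb = nullptr;
    sqlite3_open("test.db", &wdb);
    // Hold a write transaction open for a while, like the intermediate-cert
    // import does when it inserts a batch of certificates.
    sqlite3_exec(wdb, "BEGIN IMMEDIATE;", nullptr, nullptr, nullptr);
    sqlite3_exec(wdb, "INSERT INTO certs VALUES (1, x'00');", nullptr, nullptr,
                 nullptr);
    sqlite3_sleep(500);
    sqlite3_exec(wdb, "COMMIT;", nullptr, nullptr, nullptr);
    sqlite3_close(wdb);
  });

  std::thread reader([] {
    sqlite3* rdb = nullptr;
    sqlite3_open("test.db", &rdb);
    sqlite3_busy_timeout(rdb, 5000);  // wait rather than failing with SQLITE_BUSY
    sqlite3_exec(rdb, "SELECT COUNT(*) FROM certs;", nullptr, nullptr, nullptr);
    sqlite3_close(rdb);
  });

  writer.join();
  reader.join();
  sqlite3_close(db);
  return 0;
}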

Flags: needinfo?(dkeeler)

I see this is currently done off an Idle runnable... is that causing this collision?

I do see all the SSL threads and the SocketThread calling VerifyCert() (https://perfht.ml/2RbtLhH), and this profile (https://perfht.ml/2R9WjYM) shows SSL Thread 4 running PK11_ImportCerts() throughout this time period.

My concern with using 'idle' priority here is that it doesn't mean we're really idle... this is off-main-thread, so unless there are other SSL runnables queued to this SSL thread, its queues are empty, and the only thing delaying the idle runnable from running is GetIdlePeriodHint(), which looks at upcoming RefreshDriver and timer executions. The net result is that we'll be pretty likely to run it even if we're super-busy on the MainThread.

Idle as currently designed was intended for MainThread usage, unfortunately. You could dispatch an Idle runnable to MainThread that then dispatches a runnable to an SSL thread; that would ensure we run it in (or very close to) a real MainThread idle period. This wouldn't stop all collisions, but I suspect it would make them much less likely.

Perhaps even better would be running this off some "deep idle" callback, with a (soft) time limit to avoid failing to import forever, but waiting for a longer idle period.

Attached patch Non-compiling patch (deleted) — Splinter Review

So unique_ptr<> and lambdas hate each other, and this doesn't compile (nor does any mutation of it I've tried); one could stop trying to abuse lambdas and just create a runnable class that does "if (NS_IsMainThread()) { thread->Dispatch(this, NS_DISPATCH_NORMAL); return; }", and send that runnable to the MainThread idle queue to start (rough sketch at the end of this comment). I'd advise that instead of twisting the code in a knot to make the 3-level-lambda thing work.

I was trying this to a) ensure it works, and b) see if the collisions (mostly) disappear

f?'s to put it on both of your radars. Andrew, if we can get a working patch we can see if this makes the noise go away or lessen.

You could "best case" test it by turning off intermediate cert import like you tried before

Assignee: nobody → rjesup
Attachment #9071884 - Flags: feedback?(dkeeler)
Attachment #9071884 - Flags: feedback?(acreskey)

So, trying to run this on "Idle" on an SSL thread is exactly the same as just Dispatching it; non-MainThreads don't implement PrioritizedEventQueue

When an intermediate cert save is needed, we could Dispatch to the Socket Thread to start (or reset, if running) a timer. When the timer fires, either Dispatch to the SSL Cert threadpool to save, or Dispatch to the MainThread idle queue and, when that runs, Dispatch to the SSL Cert threadpool to save. Perhaps use a 10s or 60s timer, with a deadline so that we won't reset the timer to more than N seconds from the oldest dirty data waiting (example: 30s timer, with a max 2 min dirty time); rough sketch below. It depends on how much we care about possibly not writing it out. (We could also force a write as part of shutdown if dirty, perhaps.)
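A rough sketch of that coalescing logic (names, numbers, and thread-confinement details are illustrative assumptions, not real code; the real thing would keep all of this on one thread, e.g. the socket thread):

#include <algorithm>
#include "nsITimer.h"
#include "nsCOMPtr.h"
#include "mozilla/TimeStamp.h"

class IntermediateSaveScheduler {
 public:
  static constexpr uint32_t kDelayMs = 30 * 1000;       // per-request delay
  static constexpr uint32_t kMaxDirtyMs = 2 * 60 * 1000;  // cap from oldest request

  // Called whenever new intermediates are waiting to be written out.
  void NoteDirty() {
    mozilla::TimeStamp now = mozilla::TimeStamp::Now();
    if (mOldestDirty.IsNull()) {
      mOldestDirty = now;
    }
    // Never push the save further out than the deadline for the oldest
    // dirty data.
    uint32_t elapsed =
        static_cast<uint32_t>((now - mOldestDirty).ToMilliseconds());
    uint32_t remaining = elapsed >= kMaxDirtyMs ? 0 : kMaxDirtyMs - elapsed;
    uint32_t delay = std::min(kDelayMs, remaining);

    if (!mTimer) {
      NS_NewTimerWithFuncCallback(getter_AddRefs(mTimer), &OnTimer, this,
                                  delay, nsITimer::TYPE_ONE_SHOT,
                                  "IntermediateSaveScheduler");
    } else {
      mTimer->Cancel();
      mTimer->InitWithNamedFuncCallback(&OnTimer, this, delay,
                                        nsITimer::TYPE_ONE_SHOT,
                                        "IntermediateSaveScheduler");
    }
  }

 private:
  static void OnTimer(nsITimer*, void* aClosure) {
    auto* self = static_cast<IntermediateSaveScheduler*>(aClosure);
    self->mOldestDirty = mozilla::TimeStamp();
    // Hand the actual save off to the SSL cert thread pool (or via a
    // main-thread idle hop first, as discussed above).
    // DispatchSaveToCertThread();
  }

  nsCOMPtr<nsITimer> mTimer;
  mozilla::TimeStamp mOldestDirty;
};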

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #15)

I see this is currently done off an Idle runnable... is that causing this collision?

I don't think that's the issue - it's that we have many threads wanting to read the sqldb and some threads wanting to occasionally write to the sqldb (the fact that NSS decoding a certificate results in reading the sqldb is an unfortunate consequence of design decisions made in NSS).

Perhaps even better would be running this off some "deep idle" callback, with a (soft) time limit to avoid failing to import forever, but waiting for a longer idle period.

Right - we basically want to do this when the network is idle. Maybe something like that already exists?
Also, it's not really a problem if importing gets delayed indefinitely (until shutdown, basically) because as long as we hold those certificates alive in memory, we'll be able to find them during verification, which is why we're saving them in the first place. Of course, saving a bunch of intermediates at shutdown might have performance characteristics we want to avoid.

One thing to keep in mind is that when we ship intermediate preloading, we have the option of saving certificates in cert_storage (rkv), which won't have this problem. (The issue is there are some crash bugs and performance issues we have to solve before we can ship that.)

Comment on attachment 9071884 [details] [diff] [review]
Non-compiling patch

Review of attachment 9071884 [details] [diff] [review]:
-----------------------------------------------------------------

Like I said, we want to do this when the network is idle, if we can. I think this is the right sort of thing, though.
Attachment #9071884 - Flags: feedback?(dkeeler) → feedback+

Necko currently can't provide any net-idle notification (like that there are no requests happening for a while) but we have a mechanism that we could relatively easily base such a notification on.

I can try out some of these variations in the next bit.
I'd like to find a scenario in CI where I can see the impact of this -- the tp6 mobile cold load tests should do it.
But definitely I will need to do some local performance measurement for sites like foxnews.com.

Comment on attachment 9071884 [details] [diff] [review]
Non-compiling patch

Review of attachment 9071884 [details] [diff] [review]:
-----------------------------------------------------------------

At one point I had tried out dispatching the import to the main thread (where it was eventually done on idle).

As a new starting point, I made a comparison of our current cert import behaviour against "don't import at all":
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=8f91877992a12b1dcd514bf80c11d1a6638446b9&newProject=try&newRevision=133b78c7f0d1eab73b209f3421f5626e2121dae4&framework=10

What I find interesting is that, at least on the Moto G5 (Android), I'm not seeing a performance improvement from skipping the import. (This does line up with measurements I had taken a few months back.) We now have desktop cold loads, so I'm triggering more jobs there, though we have a limited selection of sites.
Attachment #9071884 - Flags: feedback?(acreskey) → feedback+

You may need to do the test in browsertime where it was first seen, not in raptor, especially if there are network effects here (like really fast OCSP responses, or something)

Priority: -- → P1
Whiteboard: [qf:p1:pageload] → [psm-assigned][qf:p1:pageload]

Ah, from a slack conversation (thank you Dana) I realized why I'm not seeing intermediate cert import in our raptor tests.
We import the mitmproxy CA cert into Firefox so we can connect directly to mitmproxy to return the archived http responses.
And so no intermediates are encountered during testing. (Same thing when using WebPageReplay.)

FYI, I just found out that we don't do many OCSP requests on Android because, by pref, we only make them for EV certificates:
https://searchfox.org/mozilla-central/rev/5e660d3dfcba897c8501e3fda1d415565a096e7e/mobile/android/app/mobile.js#421

A lock convoy through CERT_NewTempCertificate() and find_objects() seems to be implicated in bug 1577944 (see comment 2) via a new (non-OCSP) path to CERT_NewTempCertificate() that was added in bug 1551177.

I believe bug 1694542 addressed this - Randell, can you confirm?

Flags: needinfo?(rjesup)

forwarding to andrew

Flags: needinfo?(rjesup) → needinfo?(acreskey)

Let me just double check this on the reference laptop. I'll need a day for it to stabilize from OS updates.

Flags: needinfo?(acreskey)

I haven't been able to find any cases where the OCSP requests are blocked.

Trying the usual sites, I can see the requests, but they are completing quite quickly.

Great - thanks!

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [psm-assigned][qf:p1:pageload] → [psm-assigned]