Intermittent segfault in Python testing
Categories
(Data Platform and Tools :: Glean: SDK, defect, P2)
Tracking
(Not tracked)
People
(Reporter: mdroettboom, Assigned: mdroettboom)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
https://circleci.com/gh/mozilla/glean/65637
glean-core/python/tests/test_dispatcher.py ........ [ 7%]
glean-core/python/tests/test_glean.py .......s...s....ss.............. [ 35%]
glean-core/python/tests/test_loader.py make: *** [Makefile:68: test-python] Segmentation fault (core dumped)
Comment 2•4 years ago
Another crash: https://app.circleci.com/pipelines/github/mozilla/glean/4605/workflows/9658e5a8-88f0-4d49-9cf7-d220fcdcc2c3/jobs/66708/steps
It might be time we reproduce this.
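One way to reproduce an intermittent crash like this locally is to rerun the test command until the process dies from a signal. The following is only a minimal sketch, assuming a POSIX system (where `subprocess` reports death by signal as a negative return code). The helper name `run_until_segfault` and the attempt limit are hypothetical and not part of the Glean repository.

```python
import signal
import subprocess


def run_until_segfault(cmd, max_attempts=100):
    """Run cmd repeatedly; return the 1-based attempt that died with SIGSEGV, else None."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        # On POSIX, a return code of -N means the child was killed by signal N.
        if result.returncode == -signal.SIGSEGV:
            return attempt
    return None
```

It could be called with whatever command runs the Python tests locally, for example `run_until_segfault(["make", "test-python"])`, using the `test-python` target named in the CI log above.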
Comment 3•4 years ago
Comment 4•4 years ago
Added saving coredump artifacts to CI. That alone seems to have fixed it, since I haven't seen this since ;)
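The CI change itself is not shown in this thread. As a complementary, in-process approach, a test session can raise its own core-file limit and turn on the standard library's `faulthandler`, which dumps the Python stack of every thread when a fatal signal arrives. This is a sketch under assumptions: the function name and log path are hypothetical, and `resource` is only available on Unix.

```python
import faulthandler
import resource


def enable_crash_diagnostics(log_path="python-crash.log"):
    """Raise the core-dump size limit and dump Python tracebacks on fatal signals."""
    # Raise the soft core-file limit to the hard limit so a core file can be written.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    # faulthandler needs a real file descriptor that stays open for the whole
    # process lifetime, so the file object is returned to the caller to hold on to.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Note that `faulthandler` only reports Python-level frames; a native backtrace still has to come from the core file.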
Comment 5•4 years ago
Here's a recent Python testing segfault: https://app.circleci.com/pipelines/github/mozilla/glean/4937/workflows/820ec954-f0b1-481e-baee-d35fc0fad978/jobs/73951/steps
Comment 6•4 years ago
Comment 7•4 years ago
Here's another segfault, this time with a coredump: https://app.circleci.com/pipelines/github/mozilla/glean/4958/workflows/0cd2547e-a819-4c33-a102-f3218d4926dd/jobs/74356
Comment 8•4 years ago
Here's the backtrace:
#0 0x00007f04d8d9ea0d in mdb_env_reader_dest (ptr=0x7f04da61c080) at /home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/lmdb-rkv-sys-0.9.6/lmdb/libraries/liblmdb/mdb.c:4483
#1 0x00007f04dae9d1a1 in __nptl_deallocate_tsd () at pthread_create.c:301
#2 0x00007f04dae9dfc4 in __nptl_deallocate_tsd () at pthread_create.c:256
#3 start_thread (arg=<optimized out>) at pthread_create.c:497
#4 0x00007f04dac414cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Comment 9•4 years ago
I guess, ouch?
:jan-erik Flagging you in case you have any ideas of where to look next or any ideas based on things you may know about lmdb.
Comment 10•4 years ago
At least there's some logic to the backtrace -- the segfaults started to appear around the time that threading was added to the Python bindings.
There's some evidence that this kind of thing occurs if the parent thread exits before joining on the child thread: https://stackoverflow.com/questions/26308066/segmentation-fault-in-pthread-create
The Glean Python bindings actually do that already, but there might be a race condition around that somewhere. In any event, it's possible that cleaning up the threading details may be a legitimate fix to this short of figuring out what the heck is going on in lmdb.
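The pattern described here, where the parent always joins the child thread before the process exits, can be sketched as follows. This is an illustration only: the `Worker` class is hypothetical and is not the Glean dispatcher. It assumes a single background thread fed from a queue and uses `atexit` to join it before interpreter teardown.

```python
import atexit
import queue
import threading


class Worker:
    """Hypothetical background worker, not the actual Glean implementation."""

    def __init__(self):
        self._tasks = queue.Queue()
        # daemon=True so interpreter shutdown does not block on this thread
        # before the atexit handler has a chance to send the stop sentinel.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        atexit.register(self.shutdown)

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # Sentinel: stop the loop.
                break
            task()

    def submit(self, task):
        self._tasks.put(task)

    def shutdown(self, timeout=5.0):
        """Run queued tasks, stop the thread and wait for it. Safe to call twice."""
        if self._thread.is_alive():
            self._tasks.put(None)
            self._thread.join(timeout)
```

Because the sentinel is queued behind any pending tasks, the thread finishes outstanding work and runs its own cleanup while the parent is still alive.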
Comment 11•4 years ago
Possible issue with mdb_env_reader_dest: https://www.openldap.org/lists/openldap-bugs/201809/msg00009.html
Comment 12•4 years ago
Also related: https://github.com/erthink/ReOpenLDAP/issues/48
Comment 13•4 years ago
I know Firefox encountered some other crashes with LMDB in the past.
:vporof, have you seen the above issue before?
Comment 14•4 years ago
Another data point -- this is quite likely a testing-environment-only problem. In the unit tests we create/destroy the LMDB environment with every test. In normal use, this only happens once per process so it would be hard to hit.
Comment 15•4 years ago
Comment 16•4 years ago
I haven't seen that issue yet. Because of the prevalence of these new crashes, the current plan is to move all stores that use RKV to the safe-mode storage driver and away from LMDB. The work is happening in https://bugzilla.mozilla.org/show_bug.cgi?id=1612550
I'll make this block rkv-perf-mode in the meantime though.
Comment 17•4 years ago
We haven't seen this crash in a long while, so it may now be less prevalent given recent Python threading changes. In addition, I'm confident this is a testing-framework-only thing.