Closed
Bug 817762
Opened 12 years ago
Closed 11 years ago
Mozpools states seem stuck
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dividehex, Assigned: dustin)
References
Details
Attachments
(8 files, 1 obsolete file)
patch (deleted) | mcote: review+
patch (deleted) | mcote: review+
patch (deleted) | Atoll: review+
patch (deleted) | catlee: review+
patch (deleted) | dividehex: review+
patch (deleted) | dividehex: review+ | dustin: checked-in+
patch (deleted) | dividehex: review+ | dustin: checked-in+
patch (deleted) | dividehex: review+ | dustin: checked-in+
I ran into an issue while installing the panda-android.2 image on chassis 2. Everything seemed to be going well until the state machine entered android_rebooting. Once it entered that state, it never changed.
2012-12-03T11:20:01 statemachine writing pxe_config panda-android.2, boot config {} to db
2012-12-03T11:20:01 statemachine entering state start_pxe_boot
2012-12-03T11:20:01 statemachine entering state pxe_power_cycling
2012-12-03T11:20:01 bmm setting PXE config to 'panda-android.2'
2012-12-03T11:20:01 bmm initiating power cycle
2012-12-03T11:20:04 statemachine entering state pxe_booting
2012-12-03T11:21:15 syslog second stage URL found: http://10.12.48.27/scripts/android-second-stage.sh
2012-12-03T11:21:15 syslog wget success: --2012-12-03 11:21:15-- http://10.12.48.27/scripts/android-second-stage.sh#012Connecting to 10.12.48.27:80... connected.#012HTTP request sent, awaiting response... 200 OK#012Length: 1829 (1.8K) [application/x-sh]#012Saving to: `/opt/scripts/second-stage.sh'#012#012 0K . 100% 9.53M=0s#012#0122012-12-03 11:21:15 (9.53 MB/s) - `/opt/scripts/second-stage.sh' saved [1829/1829]
2012-12-03T11:21:15 syslog Executing second-stage.sh
2012-12-03T11:21:15 syslog beginning android-second-stage.sh
2012-12-03T11:21:15 syslog setting time with ntpdate
2012-12-03T11:21:24 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_downloading/
2012-12-03T11:21:24 bmm clearing PXE config
2012-12-03T11:21:24 statemachine entering state android_downloading
2012-12-03T11:21:24 syslog got android image panda-android.2
2012-12-03T11:21:24 syslog fetching http://10.12.48.27/artifacts/panda-android.2/system.tar.bz2
2012-12-03T11:21:36 syslog fetching http://10.12.48.27/artifacts/panda-android.2/userdata.tar.bz2
2012-12-03T11:21:38 syslog fetching http://10.12.48.27/artifacts/panda-android.2/boot.tar.bz2
2012-12-03T11:21:39 syslog Formatting partitions
2012-12-03T11:21:51 syslog Mounting partitions
2012-12-03T11:21:51 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_extracting/
2012-12-03T11:21:51 statemachine entering state android_extracting
2012-12-03T11:21:52 syslog Extracting artifacts
2012-12-03T11:23:43 syslog Unmounting partitions
2012-12-03T11:23:58 syslog Imaging complete. Rebooting
2012-12-03T11:23:58 syslog Submitting lifeguard event at http://10.12.48.27/api/device/panda-0022/event/android_rebooting/
2012-12-03T11:23:58 statemachine entering state android_rebooting
Assignee
Comment 1•12 years ago
My guess is that something got wedged in the power-cycling code, since all of the affected pandas were on the same relay board, and other operations were proceeding apace.
Amy had something similar happen in rack 10, where she had mis-formatted the relay info.
It's possible that this happens when an "unexpected" error (which the misformatted relay info might trigger) occurs.
Comment 2•12 years ago
Does this only affect a handful of devices? Only Android?
Asking because jhopkins was having issues with b2g pandas not coming up post-image.
I'm not sure if that's this bug or something else?
Assignee
Comment 3•12 years ago
It has nothing to do with the image on the pandas. The manifestation is that a state that should time out (e.g., *_rebooting) never does. It may be related to what jhopkins is seeing. I restarted mozpool on mobile-imaging-001 for him a few hours ago.
Assignee
Comment 4•12 years ago
I've been tailing the logs on mobile-services for a while now, and I haven't seen this happen again.
Assignee
Comment 5•12 years ago
I also can't reproduce this at home. I've tried:
- incorrectly formatted relay hosts (no :'s, like Amy's)
- correctly formatted relay hosts with a bogus hostname
- correctly formatted relay hosts with a bogus bank #
In all cases, I get a traceback in the log, and things proceed as expected.
Assignee
Comment 6•12 years ago
Still no replication here. Another possibility is that a MySQL connection hangs. We do all of the MySQL transactions synchronously, so if this hung the devicemanager or requestmanager thread, all timeouts would stop. I'm not sure how to work around that, so I don't want to try to address it without more evidence.
Assignee
Comment 8•12 years ago
My notes from debugging this on mobile-imaging-001.p1 (which has now been restarted and is back in service):
mozpool-s 17890 apache 3u IPv4 21022167 0t0 TCP localhost:8010->localhost:49740 (ESTABLISHED)
mozpool-s 17890 apache 4u IPv4 21020638 0t0 TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38556->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache 5u IPv4 19798499 0t0 TCP *:8010 (LISTEN)
mozpool-s 17890 apache 6u IPv4 21020575 0t0 TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38554->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache 7u IPv4 21020014 0t0 TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38542->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
mozpool-s 17890 apache 9u IPv4 21020152 0t0 TCP mobile-imaging-001.p1.releng.scl1.mozilla.com:38545->stage-rw-vip.db.scl3.mozilla.com:mysql (CLOSE_WAIT)
all CLOSE_WAIT = server closed, client didn't get the memo yet. Doing an API call will open one in ESTABLISHED.
on a "running" system, these are all ESTABLISHED, and anything reading/writing those fd's would get an error, so this is unlikely a MySQL-originated hang
(gdb) info threads
11 Thread 0x7f6b3a7ae700 (LWP 17899) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
10 Thread 0x7f6b39dad700 (LWP 17900) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
9 Thread 0x7f6b393ac700 (LWP 17901) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
8 Thread 0x7f6b23fff700 (LWP 17902) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
7 Thread 0x7f6b235fe700 (LWP 17903) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
6 Thread 0x7f6b22bfd700 (LWP 17904) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
5 Thread 0x7f6b221fc700 (LWP 17905) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
4 Thread 0x7f6b217fb700 (LWP 17906) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
3 Thread 0x7f6b20dfa700 (LWP 17907) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
2 Thread 0x7f6b0bfff700 (LWP 17908) 0x000000395bc0d6c0 in sem_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7f6b45d27700 (LWP 17890) 0x000000368d8dc273 in poll () from /lib64/libc.so.6
device list API call works
power-control API calls work
everything works except states that need to time out, e.g., pc_rebooting
state transitions (handle_event) seem to work anyway, and use the same LockByName instance
- handle_event (for bogus event "fooie") successful on every panda on the server
- slowness is Apache, grr
Oldest state_timeout value is for panda-0159 and panda-0141 (tie), from when they went to b2g_rebooting:
device.panda-0141 INFO - [2012-12-05 18:57:37,035] entering state b2g_rebooting
device.panda-0159 INFO - [2012-12-05 18:57:37,409] entering state b2g_rebooting
So whatever failed, failed then? Is it global or panda-specific?
-> maybe the tie is responsible??
request.1643 INFO - [2012-12-05 18:57:14,447] entering state closed
127.0.0.1:40744 - - [05/Dec/2012 18:57:14] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40747 - - [05/Dec/2012 18:57:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
device.panda-0141 INFO - [2012-12-05 18:57:37,035] entering state b2g_rebooting *****
127.0.0.1:40748 - - [05/Dec/2012 18:57:37] "HTTP/1.1 GET /api/device/panda-0141/event/b2g_rebooting/" - 200 OK
device.panda-0159 INFO - [2012-12-05 18:57:37,409] entering state b2g_rebooting *****
127.0.0.1:40749 - - [05/Dec/2012 18:57:37] "HTTP/1.1 GET /api/device/panda-0159/event/b2g_rebooting/" - 200 OK
device.panda-0137 INFO - [2012-12-05 18:57:38,313] entering state b2g_rebooting (also failed)
127.0.0.1:40750 - - [05/Dec/2012 18:57:38] "HTTP/1.1 GET /api/device/panda-0137/event/b2g_rebooting/" - 200 OK
device.panda-0143 INFO - [2012-12-05 18:57:47,087] entering state b2g_rebooting (also failed) ..
127.0.0.1:40751 - - [05/Dec/2012 18:57:47] "HTTP/1.1 GET /api/device/panda-0143/event/b2g_rebooting/" - 200 OK
device.panda-0158 INFO - [2012-12-05 18:57:48,686] entering state b2g_rebooting
127.0.0.1:40752 - - [05/Dec/2012 18:57:48] "HTTP/1.1 GET /api/device/panda-0158/event/b2g_rebooting/" - 200 OK
device.panda-0127 INFO - [2012-12-05 18:57:50,703] entering state b2g_rebooting
127.0.0.1:40753 - - [05/Dec/2012 18:57:50] "HTTP/1.1 GET /api/device/panda-0127/event/b2g_rebooting/" - 200 OK
device.panda-0104 INFO - [2012-12-05 18:57:54,091] entering state b2g_rebooting
127.0.0.1:40754 - - [05/Dec/2012 18:57:54] "HTTP/1.1 GET /api/device/panda-0104/event/b2g_rebooting/" - 200 OK
127.0.0.1:40755 - - [05/Dec/2012 18:57:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1644 INFO - [2012-12-05 18:57:55,004] entering state finding_device
request.1644 INFO - [2012-12-05 18:57:55,046] assigning requested device panda-0106
request.1644 INFO - [2012-12-05 18:57:55,086] request succeeded
request.1644 INFO - [2012-12-05 18:57:55,086] entering state contacting_lifeguard
request.1644 INFO - [2012-12-05 18:57:55,144] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:57:55,169] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:57:55,192] ignored event free in state failed_pxe_booting
127.0.0.1:40758 - - [05/Dec/2012 18:57:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:57:55,211] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1644 INFO - [2012-12-05 18:57:55,211] entering state closed
127.0.0.1:40756 - - [05/Dec/2012 18:57:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40759 - - [05/Dec/2012 18:57:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
device.panda-0150 INFO - [2012-12-05 18:57:58,719] entering state b2g_rebooting
127.0.0.1:40760 - - [05/Dec/2012 18:57:58] "HTTP/1.1 GET /api/device/panda-0150/event/b2g_rebooting/" - 200 OK
127.0.0.1:40761 - - [05/Dec/2012 18:58:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40762 - - [05/Dec/2012 18:58:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40763 - - [05/Dec/2012 18:58:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40764 - - [05/Dec/2012 18:58:13] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40765 - - [05/Dec/2012 18:58:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
request.1645 INFO - [2012-12-05 18:58:14,586] entering state finding_device
request.1645 INFO - [2012-12-05 18:58:14,628] assigning requested device panda-0085
request.1645 INFO - [2012-12-05 18:58:14,641] request succeeded
request.1645 INFO - [2012-12-05 18:58:14,641] entering state contacting_lifeguard
request.1645 INFO - [2012-12-05 18:58:14,700] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:58:14,780] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0085 WARNING - [2012-12-05 18:58:14,803] ignored event free in state failed_b2g_downloading
127.0.0.1:40768 - - [05/Dec/2012 18:58:14] "HTTP/1.1 POST /api/device/panda-0085/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:58:14,821] "POST /api/device/panda-0085/event/free/ HTTP/1.1" 200 2
request.1645 INFO - [2012-12-05 18:58:14,822] entering state closed
127.0.0.1:40766 - - [05/Dec/2012 18:58:14] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40769 - - [05/Dec/2012 18:58:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
127.0.0.1:40770 - - [05/Dec/2012 18:58:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1646 INFO - [2012-12-05 18:58:55,355] entering state finding_device
request.1646 INFO - [2012-12-05 18:58:55,397] assigning requested device panda-0106
request.1646 INFO - [2012-12-05 18:58:55,410] request succeeded
request.1646 INFO - [2012-12-05 18:58:55,410] entering state contacting_lifeguard
request.1646 INFO - [2012-12-05 18:58:55,501] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:58:55,526] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:58:55,548] ignored event free in state failed_pxe_booting
127.0.0.1:40773 - - [05/Dec/2012 18:58:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:58:55,566] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1646 INFO - [2012-12-05 18:58:55,566] entering state closed
127.0.0.1:40771 - - [05/Dec/2012 18:58:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40774 - - [05/Dec/2012 18:58:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
127.0.0.1:40775 - - [05/Dec/2012 18:59:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40776 - - [05/Dec/2012 18:59:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40777 - - [05/Dec/2012 18:59:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40778 - - [05/Dec/2012 18:59:14] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40779 - - [05/Dec/2012 18:59:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
request.1647 INFO - [2012-12-05 18:59:14,969] entering state finding_device
request.1647 INFO - [2012-12-05 18:59:15,010] assigning requested device panda-0085
request.1647 INFO - [2012-12-05 18:59:15,025] request succeeded
request.1647 INFO - [2012-12-05 18:59:15,025] entering state contacting_lifeguard
request.1647 INFO - [2012-12-05 18:59:15,218] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:59:15,244] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0085 WARNING - [2012-12-05 18:59:15,267] ignored event free in state failed_b2g_downloading
127.0.0.1:40782 - - [05/Dec/2012 18:59:15] "HTTP/1.1 POST /api/device/panda-0085/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:59:15,285] "POST /api/device/panda-0085/event/free/ HTTP/1.1" 200 2
request.1647 INFO - [2012-12-05 18:59:15,285] entering state closed
127.0.0.1:40780 - - [05/Dec/2012 18:59:15] "HTTP/1.1 POST /api/device/panda-0085/request/" - 409 Conflict
127.0.0.1:40783 - - [05/Dec/2012 18:59:21] "HTTP/1.1 GET /api/request/1638/status/" - 200 OK
127.0.0.1:40784 - - [05/Dec/2012 18:59:51] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
127.0.0.1:40785 - - [05/Dec/2012 18:59:54] "HTTP/1.1 GET /api/request/1630/status/" - 200 OK
request.1648 INFO - [2012-12-05 18:59:55,708] entering state finding_device
request.1648 INFO - [2012-12-05 18:59:55,748] assigning requested device panda-0106
request.1648 INFO - [2012-12-05 18:59:55,762] request succeeded
request.1648 INFO - [2012-12-05 18:59:55,762] entering state contacting_lifeguard
request.1648 INFO - [2012-12-05 18:59:55,817] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-05 18:59:55,843] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0106 WARNING - [2012-12-05 18:59:55,866] ignored event free in state failed_pxe_booting
127.0.0.1:40788 - - [05/Dec/2012 18:59:55] "HTTP/1.1 POST /api/device/panda-0106/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-05 18:59:55,885] "POST /api/device/panda-0106/event/free/ HTTP/1.1" 200 2
request.1648 INFO - [2012-12-05 18:59:55,885] entering state closed
127.0.0.1:40786 - - [05/Dec/2012 18:59:55] "HTTP/1.1 POST /api/device/panda-0106/request/" - 409 Conflict
127.0.0.1:40789 - - [05/Dec/2012 18:59:56] "HTTP/1.1 GET /api/request/1632/status/" - 200 OK
<---- timeout for panda-0141 and panda-0159 should have happened here ---->
127.0.0.1:40790 - - [05/Dec/2012 19:00:01] "HTTP/1.1 GET /api/request/1634/status/" - 200 OK
127.0.0.1:40791 - - [05/Dec/2012 19:00:03] "HTTP/1.1 GET /api/request/1635/status/" - 200 OK
127.0.0.1:40792 - - [05/Dec/2012 19:00:04] "HTTP/1.1 GET /api/request/1636/status/" - 200 OK
127.0.0.1:40793 - - [05/Dec/2012 19:00:14] "HTTP/1.1 GET /api/request/1633/status/" - 200 OK
127.0.0.1:40794 - - [05/Dec/2012 19:00:14] "HTTP/1.1 GET /api/request/1637/status/" - 200 OK
Assignee
Comment 9•12 years ago
Attachment #689234 -
Flags: review?(mcote)
Comment 10•12 years ago
Comment on attachment 689234 [details] [diff] [review]
logging.patch
Cool cool.
Attachment #689234 -
Flags: review?(mcote) → review+
Assignee
Comment 11•12 years ago
With the logging in place,
device INFO - [2012-12-06 12:39:45,679] tick
device ERROR - [2012-12-06 12:39:45,686] _run failed
Traceback (most recent call last):
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 44, in run
self._run();
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 65, in _run
for machine_name in self._get_timed_out_machine_names():
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/lifeguard/devicemachine.py", line 61, in _get_timed_out_machine_names
return data.get_timed_out_devices(self.imaging_server_id)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 218, in get_timed_out_devices
return get_timed_out(model.devices, model.devices.c.name, imaging_server_id)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 199, in get_timed_out
& (tbl.c.imaging_server_id == imaging_server_id)))
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1449, in execute
params)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement
compiled_sql, distilled_params
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context
context)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context
context)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 331, in do_execute
cursor.execute(statement, parameters)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/cursors.py", line 117, in execute
self.errorhandler(self, exc, value)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 189, in defaulterrorhandler
raise errorclass, errorvalue
ProgrammingError: (ProgrammingError) (1146, u"Table 'mozpool.devices' doesn't exist") 'SELECT devices.name \nFROM devices \nWHERE devices.state_timeout < %s AND devices.imaging_server_id = %s' (datetime.datetime(2012, 12, 6, 12, 39, 45, 679912), 201L)
device INFO - [2012-12-06 12:39:45,688] _run returned
(and I know why that specific error happened -- I had to quickly drop and re-create the table to work around a MySQL bug).
I'm willing to blame earlier failures on similar things, and assume that the traceback just didn't make it into the logs (although it should..). So, time to bulletproof that loop.
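As a rough sketch of what "bulletproofing" means here (illustrative only, not the actual mozpool patch; the helper names are assumptions): catch and log anything the timeout-handling pass raises, so a single bad query can't silently kill the polling thread.

import logging
import time
import traceback

log = logging.getLogger('statedriver')

def poll_for_timeouts(get_timed_out_machine_names, handle_timeout, interval=10):
    # Run forever; a DB error (e.g. "Table 'mozpool.devices' doesn't exist")
    # gets logged and the loop carries on instead of dying.
    while True:
        try:
            for name in get_timed_out_machine_names():
                handle_timeout(name)
        except Exception:
            log.error("_run failed\n%s", traceback.format_exc())
        time.sleep(interval)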
Assignee
Comment 12•12 years ago
This leaves a not-unreasonable amount of logging in, I think.
Attachment #689396 -
Flags: review?(mcote)
Comment 13•12 years ago
Comment on attachment 689396 [details] [diff] [review]
bug817762.patch
Review of attachment 689396 [details] [diff] [review]:
-----------------------------------------------------------------
::: mozpool/statedriver.py
@@ +42,1 @@
> try:
Missed this in the last patch, but having a try/except inside a try/finally seems like overkill. :)
Attachment #689396 -
Flags: review?(mcote) → review+
Assignee
Comment 14•12 years ago
Mark also pointed out on IRC that the thread is not a daemon, so it makes interrupting the process a bit hard (note that web.py's worker threads do the same thing - they don't quit until Firefox drops the HTTP/1.1 connection).
Assignee
Comment 15•12 years ago
I landed this with a flattened try syntax and thd.setDaemon(1) added.
I'm going to optimistically call this done.
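For illustration only (not the committed patch), the daemon-thread part amounts to something like this:

import threading

def start_poll_thread(poll_loop):
    # A daemon thread no longer blocks interpreter shutdown, so killing or
    # interrupting the service works even while the loop is sleeping/blocked.
    thd = threading.Thread(target=poll_loop)
    thd.setDaemon(True)  # same effect as the thd.setDaemon(1) mentioned above
    thd.start()
    return thd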
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee
Comment 17•12 years ago
optimism is just future disappointment..
request WARNING - [2013-02-06 14:21:20,200] polling thread still running at 104186s; not starting another
device WARNING - [2013-02-06 14:21:20,200] polling thread still running at 104186s; not starting another
this is on the staging server.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 18•12 years ago
Looking back in the logs, I see a few times when a single timeout loop spilled past 10 seconds, but recovered afterward. These really are just warnings.
device WARNING - [2013-02-05 03:19:17,396] polling thread still running at 10s; not starting another
device.fakepanda-2043 INFO - [2013-02-05 03:19:17,271] entering state free
device INFO - [2013-02-05 03:19:18,356] handling timeout on fakepanda-2044
fakedevices.fakepanda-2043 DEBUG - [2013-02-05 03:19:18,360] pinged; result=True (state self_test_ok)
device WARNING - [2013-02-05 03:19:18,826] polling thread still running at 11s; not starting another
device.fakepanda-2044 INFO - [2013-02-05 03:19:18,599] entering state free
device INFO - [2013-02-05 03:19:19,808] handling timeout on panda-0010
fakedevices.fakepanda-2044 DEBUG - [2013-02-05 03:19:19,810] pinged; result=True (state self_test_ok)
device WARNING - [2013-02-05 03:19:20,310] polling thread still running at 12s; not starting another
device.panda-0010 INFO - [2013-02-05 03:19:20,062] entering state free
device INFO - [2013-02-05 03:19:22,145] handling timeout on fakepanda-2052
later, some failures caused by losing a db connection:
request WARNING - [2013-02-05 09:25:03,688] polling thread still running at 10s; not starting another
device WARNING - [2013-02-05 09:25:03,865] polling thread still running at 10s; not starting another
request WARNING - [2013-02-05 09:25:04,689] polling thread still running at 11s; not starting another
device WARNING - [2013-02-05 09:25:04,866] polling thread still running at 11s; not starting another
request WARNING - [2013-02-05 09:25:05,791] polling thread still running at 12s; not starting another
device WARNING - [2013-02-05 09:25:05,968] polling thread still running at 12s; not starting another
request WARNING - [2013-02-05 09:25:07,003] polling thread still running at 13s; not starting another
device WARNING - [2013-02-05 09:25:07,180] polling thread still running at 13s; not starting another
request WARNING - [2013-02-05 09:25:08,336] polling thread still running at 14s; not starting another
device WARNING - [2013-02-05 09:25:08,512] polling thread still running at 14s; not starting another
request WARNING - [2013-02-05 09:25:09,802] polling thread still running at 16s; not starting another
device WARNING - [2013-02-05 09:25:09,977] polling thread still running at 16s; not starting another
request WARNING - [2013-02-05 09:25:11,414] polling thread still running at 17s; not starting another
device WARNING - [2013-02-05 09:25:11,590] polling thread still running at 17s; not starting another
request WARNING - [2013-02-05 09:25:13,188] polling thread still running at 19s; not starting another
device WARNING - [2013-02-05 09:25:13,363] polling thread still running at 19s; not starting another
request WARNING - [2013-02-05 09:25:15,139] polling thread still running at 21s; not starting another
device WARNING - [2013-02-05 09:25:15,314] polling thread still running at 21s; not starting another
request WARNING - [2013-02-05 09:25:17,286] polling thread still running at 23s; not starting another
device WARNING - [2013-02-05 09:25:17,460] polling thread still running at 23s; not starting another
request WARNING - [2013-02-05 09:25:19,646] polling thread still running at 25s; not starting another
device WARNING - [2013-02-05 09:25:19,821] polling thread still running at 25s; not starting another
request WARNING - [2013-02-05 09:25:22,241] polling thread still running at 28s; not starting another
device WARNING - [2013-02-05 09:25:22,418] polling thread still running at 28s; not starting another
request WARNING - [2013-02-05 09:25:25,098] polling thread still running at 31s; not starting another
device WARNING - [2013-02-05 09:25:25,274] polling thread still running at 31s; not starting another
request WARNING - [2013-02-05 09:25:28,240] polling thread still running at 34s; not starting another
device WARNING - [2013-02-05 09:25:28,416] polling thread still running at 34s; not starting another
request WARNING - [2013-02-05 09:25:31,696] polling thread still running at 38s; not starting another
device WARNING - [2013-02-05 09:25:31,872] polling thread still running at 38s; not starting another
request WARNING - [2013-02-05 09:25:35,498] polling thread still running at 41s; not starting another
device WARNING - [2013-02-05 09:25:35,674] polling thread still running at 41s; not starting another
request WARNING - [2013-02-05 09:25:39,679] polling thread still running at 46s; not starting another
device WARNING - [2013-02-05 09:25:39,856] polling thread still running at 45s; not starting another
request WARNING - [2013-02-05 09:25:44,279] polling thread still running at 50s; not starting another
device WARNING - [2013-02-05 09:25:44,455] polling thread still running at 50s; not starting another
request WARNING - [2013-02-05 09:25:49,339] polling thread still running at 55s; not starting another
device WARNING - [2013-02-05 09:25:49,515] polling thread still running at 55s; not starting another
request WARNING - [2013-02-05 09:25:54,899] polling thread still running at 61s; not starting another
device WARNING - [2013-02-05 09:25:55,076] polling thread still running at 61s; not starting another
request WARNING - [2013-02-05 09:26:01,022] polling thread still running at 67s; not starting another
device WARNING - [2013-02-05 09:26:01,198] polling thread still running at 67s; not starting another
request WARNING - [2013-02-05 09:26:07,756] polling thread still running at 74s; not starting another
device WARNING - [2013-02-05 09:26:07,929] polling thread still running at 74s; not starting another
request WARNING - [2013-02-05 09:26:15,165] polling thread still running at 81s; not starting another
device WARNING - [2013-02-05 09:26:15,337] polling thread still running at 81s; not starting another
request WARNING - [2013-02-05 09:26:23,312] polling thread still running at 89s; not starting another
device WARNING - [2013-02-05 09:26:23,482] polling thread still running at 89s; not starting another
request WARNING - [2013-02-05 09:26:32,273] polling thread still running at 98s; not starting another
device WARNING - [2013-02-05 09:26:32,446] polling thread still running at 98s; not starting another
request WARNING - [2013-02-05 09:26:42,124] polling thread still running at 108s; not starting another
device WARNING - [2013-02-05 09:26:42,306] polling thread still running at 108s; not starting another
request WARNING - [2013-02-05 09:26:52,970] polling thread still running at 119s; not starting another
device WARNING - [2013-02-05 09:26:53,147] polling thread still running at 119s; not starting another
request WARNING - [2013-02-05 09:27:04,900] polling thread still running at 131s; not starting another
device WARNING - [2013-02-05 09:27:05,075] polling thread still running at 131s; not starting another
request WARNING - [2013-02-05 09:27:18,021] polling thread still running at 144s; not starting another
device WARNING - [2013-02-05 09:27:18,194] polling thread still running at 144s; not starting another
Traceback (most recent call last):
File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 239, in process
return self.handle()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 230, in handle
return self._delegate(fn, self.fvars, args)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 420, in _delegate
return handle_class(cls)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/web/application.py", line 396, in handle_class
return tocall(*args)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/templeton/handlers.py", line 65, in wrap
results = json.dumps(func(*a, **kw), cls=DateTimeJSONEncoder)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/bmm/handlers.py", line 86, in GET
return data.list_pxe_configs()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 445, in list_pxe_configs
conn = sql.get_conn()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 49, in get_conn
return get_engine().connect()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
return self._connection_cls(self, **kwargs)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
self.__connection = connection or engine.raw_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
return self.pool.unique_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
return _ConnectionFairy(self).checkout()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 474, in checkout
self)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/event.py", line 377, in __call__
fn(*args, **kw)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 15, in _checkout_listener
cursor.execute("SELECT 1")
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/cursors.py", line 117, in execute
self.errorhandler(self, exc, value)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 187, in defaulterrorhandler
raise Error(errorclass, errorvalue)
Error: (<class 'socket.error'>, error(104, 'Connection reset by peer'))
127.0.0.1:38413 - - [05/Feb/2013 09:27:31] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 500 Internal Server Error
request WARNING - [2013-02-05 09:27:32,455] polling thread still running at 158s; not starting another
device WARNING - [2013-02-05 09:27:32,626] polling thread still running at 158s; not starting another
request WARNING - [2013-02-05 09:27:48,329] polling thread still running at 174s; not starting another
device WARNING - [2013-02-05 09:27:48,505] polling thread still running at 174s; not starting another
request WARNING - [2013-02-05 09:28:05,789] polling thread still running at 192s; not starting another
device WARNING - [2013-02-05 09:28:05,972] polling thread still running at 192s; not starting another
request WARNING - [2013-02-05 09:28:24,997] polling thread still running at 211s; not starting another
device WARNING - [2013-02-05 09:28:25,181] polling thread still running at 211s; not starting another
request WARNING - [2013-02-05 09:28:46,130] polling thread still running at 232s; not starting another
device WARNING - [2013-02-05 09:28:46,313] polling thread still running at 232s; not starting another
request WARNING - [2013-02-05 09:29:09,369] polling thread still running at 255s; not starting another
device WARNING - [2013-02-05 09:29:09,561] polling thread still running at 255s; not starting another
127.0.0.1:38414 - - [05/Feb/2013 09:29:31] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
request WARNING - [2013-02-05 09:29:34,939] polling thread still running at 281s; not starting another
device WARNING - [2013-02-05 09:29:35,120] polling thread still running at 281s; not starting another
request WARNING - [2013-02-05 09:30:03,061] polling thread still running at 309s; not starting another
device WARNING - [2013-02-05 09:30:03,243] polling thread still running at 309s; not starting another
127.0.0.1:38417 - - [05/Feb/2013 09:30:27] "HTTP/1.1 GET /api/device/panda-0021/state/" - 200 OK
127.0.0.1:38416 - - [05/Feb/2013 09:30:27] "HTTP/1.1 GET /api/device/panda-0010/state/" - 200 OK
request WARNING - [2013-02-05 09:30:33,997] polling thread still running at 340s; not starting another
device WARNING - [2013-02-05 09:30:34,181] polling thread still running at 340s; not starting another
request WARNING - [2013-02-05 09:31:08,022] polling thread still running at 374s; not starting another
device WARNING - [2013-02-05 09:31:08,208] polling thread still running at 374s; not starting another
request WARNING - [2013-02-05 09:31:45,464] polling thread still running at 411s; not starting another
device WARNING - [2013-02-05 09:31:45,628] polling thread still running at 411s; not starting another
request WARNING - [2013-02-05 09:32:26,644] polling thread still running at 452s; not starting another
device WARNING - [2013-02-05 09:32:26,808] polling thread still running at 452s; not starting another
..
it never recovered from that one.
Immediate plan is to add some monitoring for this (probably by touching a file on every startup, and monitoring that file's age with Nagios). I'll need to stare at the code some more to figure out what's really going wrong here. The MySQL error is probably a big hint.
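One possible shape for that watchdog (an assumption here; the real implementation lives in bug 838925, and the path below is hypothetical) is to touch a heartbeat file from the running service and let a Nagios file-age check alert when it goes stale:

import os

HEARTBEAT = '/var/run/mozpool/heartbeat'  # hypothetical path

def touch_heartbeat():
    # Update the file's mtime; a Nagios check such as
    #   check_file_age -w 300 -c 600 -f /var/run/mozpool/heartbeat
    # then alerts if the service stops refreshing it.
    with open(HEARTBEAT, 'a'):
        pass
    os.utime(HEARTBEAT, None)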
Assignee
Comment 19•12 years ago
bug 838925 for the monitoring.
Assignee
Comment 20•12 years ago
http://stackoverflow.com/questions/132058/showing-the-stack-trace-from-a-running-python-application may be helpful for debugging this
Assignee
Comment 21•12 years ago
In my dev environment, I tried cutting off traffic from mozpool to mysql using iptables -j DROP. I do see the timeout loop hang, which is not ideal -- but it comes back after the iptables rule is deleted. This is the case even if I restart mysql in the interim - the ongoing queries eventually get an RST, at which time they raise an exception, which is caught and logged, and life goes on normally.
I would blame synchronous calls to `requests`, but those only occur in the mozpool layer, and both layers are seeing the timeouts at the same time.
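The exact rule isn't quoted above; assuming the standard MySQL port, the test would have been along these lines:

# blackhole outgoing traffic to the MySQL server (port 3306) to simulate a hang
iptables -A OUTPUT -p tcp --dport 3306 -j DROP
# ...watch the timeout loop stall...
# remove the rule again; pending queries eventually get an RST and raise
iptables -D OUTPUT -p tcp --dport 3306 -j DROP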
Assignee
Comment 22•12 years ago
Still can't replicate :(
Assignee
Comment 23•12 years ago
I *may* have replicated this. At least, I have a thread that's hung for 3558s, after blocking and unblocking traffic to the mysql server repeatedly.
It's blocked somewhere in DeviceMachine.handle_timeout().
My guess is that I managed to block traffic right after a query had been sent, but before the query response was returned. So from the mozpool system, it looks like the query is still pending (with a full TCP connection and everything). From the DB server, it probably tried to send a few packets for the response, got no ACKs, and finally closed the socket.
So I think that the fix is to add timeouts to all DB calls and/or set SO_KEEPALIVE and tweak the kernel to use something less than 2h for that timer.
Assignee
Comment 24•12 years ago
Poking around with gdb shows a bunch of threads in __libc_recv. The fd is optimized out, unfortunately, but that's strong evidence we're hung in a read here.
Assignee
Comment 25•12 years ago
It's funny how all roads lead back to asynchronous programming being a good idea :)
Otherwise, the best fix I can find for this is to use multiprocessing to put all DB connections in a subprocess, and shoot that process in the head if it takes longer than desired to return a result. That's pretty awful.
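Purely to illustrate the rejected idea (nothing like this was implemented), the multiprocessing approach would look roughly like the sketch below; note that each call would also need to open its own DB connection in the child process, which is part of what makes it awful:

import multiprocessing

def _call(q, fn, args):
    q.put(fn(*args))

def call_with_timeout(fn, args=(), timeout=30):
    # Run a DB call in a child process and terminate it if it hangs.
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=_call, args=(q, fn, args))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()  # "shoot that process in the head"
        raise RuntimeError("DB call timed out after %ss" % timeout)
    return q.get()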
Assignee
Comment 26•12 years ago
Or use SO_KEEPALIVE. I've confirmed that MySQLdb sets SO_KEEPALIVE, so it only remains to dial the kernel's timeout to something < 2h.
Assignee
Comment 27•12 years ago
Attachment #712617 -
Flags: review?(rsoderberg)
Comment 28•12 years ago
Comment on attachment 712617 [details] [diff] [review]
bug817762.patch
Review of attachment 712617 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, feel free to ignore my bikeshed about timeout values below.
::: modules/tweaks/manifests/tcp_keepalive.pp
@@ +1,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# Configure the kernel to send keepalive packets every minute on idle tcp
s/every minute/60 seconds/, to match the value below ("every minute" is technically equivalent to "60 seconds", but the unit conversion obscures the link between this paragraph and that value).
@@ +2,5 @@
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# Configure the kernel to send keepalive packets every minute on idle tcp
> +# connections, and fail after 5m of no response (5 probes). This is pretty
I don't think that itvl=60 is a good choice here. If the remote isn't responding, it probably isn't going to respond by waiting 60 seconds between probes; a delay of 5 seconds between probes should be more than sufficient?
300 second timeouts exceed the useful timeout window for most applications. What is the maximum timeout that provides a tolerable experience?
The tcp_keepalive_time here of "60 seconds" implies that your app needs to know within 60 seconds that the connection has gone away, and so you would use need time=40, itvl=5, probes=4 to make sure that the socket is terminated at T=60s for being idle.
If the maximum application timeout window is 300 seconds, then you would use e.g. time=280, itvl=5, probes=4.
@@ +9,5 @@
> +# failure at the other end of the connection.
> +
> +class tweaks::tcp_keepalive {
> + case $::operatingsystem {
> + CentOS, Ubuntu: {
You'll have better compatibility with either case $::osfamily { RedHat, Debian: ... } or case $::kernel { Linux: ... }
@@ +11,5 @@
> +class tweaks::tcp_keepalive {
> + case $::operatingsystem {
> + CentOS, Ubuntu: {
> + sysctl::value {
> + 'net.ipv4.tcp_keepalive_time':
# wait N seconds since last data packet (or probe response) received before sending the first probe
@@ +13,5 @@
> + CentOS, Ubuntu: {
> + sysctl::value {
> + 'net.ipv4.tcp_keepalive_time':
> + value => 60;
> + 'net.ipv4.tcp_keepalive_itvl':
# wait N seconds before a keepalive probe fails (times out)
@@ +15,5 @@
> + 'net.ipv4.tcp_keepalive_time':
> + value => 60;
> + 'net.ipv4.tcp_keepalive_itvl':
> + value => 60;
> + 'net.ipv4.tcp_keepalive_probes':
# terminate the connection after N probes have timed out
@@ +17,5 @@
> + 'net.ipv4.tcp_keepalive_itvl':
> + value => 60;
> + 'net.ipv4.tcp_keepalive_probes':
> + value => 5;
> + }
Add per-line docs for each value, as these settings are particularly tricky to understand without context (and even the explanatory paragraph above isn't necessarily enough).
Attachment #712617 -
Flags: review?(rsoderberg) → review+
Comment 29•12 years ago
(In reply to Richard Soderberg [:atoll] from comment #28)
> use need time=40, itvl=5, probes=4 to make sure that the socket is
> terminated at T=60s for being idle.
Correction: to make sure that the socket is terminated at T=60s for being dead.
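For reference, the arithmetic here is: a dead peer is detected after roughly tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds of silence (the kernel spells the middle sysctl net.ipv4.tcp_keepalive_intvl). With the suggested time=40, itvl=5, probes=4 that is 40 + 5*4 = 60s. The runtime equivalent of such a tuning (values illustrative, not necessarily what landed) would be:

# first probe after 40s idle, then one probe every 5s; declare the peer
# dead after 4 unanswered probes => connection torn down ~60s after last data
sysctl -w net.ipv4.tcp_keepalive_time=40
sysctl -w net.ipv4.tcp_keepalive_intvl=5
sysctl -w net.ipv4.tcp_keepalive_probes=4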
Assignee
Comment 30•12 years ago
"a few minutes" is OK for this application. I picked 3. $::operatingsystem is a very common pattern around puppetagain, so I kept that pattern, although you're right that $::kernel is equivalent.
Assignee
Comment 31•12 years ago
I landed this. Let's hope it works :)
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Assignee
Comment 32•12 years ago
This is still occurring for shorter times.. On mobile-imaging-002, I see some pauses for about 900s and about 1050s. There's still no obvious cause in the logs - the timeout before is for a device in the free state, so it should just be pinging.
I see the same on mobile-imaging-005, 003, and 004. All at the same time - beginning at
device WARNING - [2013-02-23 04:39:12,276] polling thread still running at 10s; not starting another
:cyborgshadow, I see
[04:35:55] <nagios-scl3> cyborgshadow: Downtime for backup2.db.scl3.mozilla.com scheduled for 4:00:00
right around that time. Did anything related happen on the buildbot cluster?
Group: infra
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 33•12 years ago
FTR, racks 2 to 5 are not in use by the automation since we don't yet have enough load.
I could set rack 2 and see if it clears this.
FYI, racks 1 to 5 are allocated for b2g pandas.
Assignee
Comment 34•12 years ago
I assume it's good/bad luck which racks are affected.
Assignee
Comment 35•12 years ago
Assignee
Comment 36•12 years ago
All of the servers failed with something like
request WARNING - [2013-02-28 13:48:30,166] polling thread still running at 211s; not starting another
device WARNING - [2013-02-28 13:48:38,886] polling thread still running at 211s; not starting another
at exactly that time. They then recovered. That was the Great Switch Failure Of 2013-02-28. So that's good news. However, one didn't recover, on mobile-imaging-007:
device WARNING - [2013-02-28 13:48:30,985] polling thread still running at 211s; not starting another
...
device WARNING - [2013-03-01 07:41:20,658] polling thread still running at 64581s; not starting another
note that only the device loop failed, not the requests loop.
Assignee
Comment 37•12 years ago
Unfortunately, all of the cool introspection techniques require Python's symbols, which we don't build for the custom package.
The relevant threads are, I think,
Thread 14 (Thread 0x7fd5c7228700 (LWP 25028)):
#0 0x0000003e9dade513 in select () from /lib64/libc.so.6
#1 0x00007fd5cb7b51d9 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/time.so
#2 0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#3 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#5 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#6 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#7 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#8 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003e9dae592d in clone () from /lib64/libc.so.6
Process 25028 attached - interrupt to quit
select(0, NULL, NULL, NULL, {17, 489110}) = 0 (Timeout)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(2, "device WARNING - [2013-03-01 08:"..., 104) = 104
select(0, NULL, NULL, NULL, {60, 240069}
** the Lifeguard StateDriver, repeatedly logging about the polling thread still running
Thread 13 (Thread 0x7fd5c5e26700 (LWP 25030)):
#0 0x0000003e9dade513 in select () from /lib64/libc.so.6
#1 0x00007fd5cb7b51d9 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/time.so
#2 0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#3 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#5 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#6 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#7 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#8 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003e9dae592d in clone () from /lib64/libc.so.6
Process 25030 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 727435}) = 0 (Timeout)
clone(child_stack=0x7fd5c5424ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd5c54259d0, tls=0x7fd5c5425700, child_tidptr=0x7fd5c54259d0) = 12054
futex(0x29819a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd5bc0013c0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
select(0, NULL, NULL, NULL, {10, 0}) = 0 (Timeout)
clone(child_stack=0x7fd5c5424ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd5c54259d0, tls=0x7fd5c5425700, child_tidptr=0x7fd5c54259d0) = 12057
futex(0x29819a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x29819a0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>
** the Mozpool StateDriver, repeatedly cloning off a new timeout handling thread
Thread 2 (Thread 0x7fd5961fc700 (LWP 2196)):
#0 0x0000003e9de0e8ec in recv () from /lib64/libpthread.so.0
#1 0x00007fd5c9a7fa81 in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/_socket.so
#2 0x00007fd5c9a7fcae in ?? () from /opt/mozpool/frontend/lib/python2.7/lib-dynload/_socket.so
#3 0x00007fd5d1a3dec6 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#4 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#5 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#6 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#7 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#8 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#9 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#10 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#11 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#12 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#13 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#14 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#15 0x00007fd5d1a3d24d in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#16 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#17 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#18 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#19 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#20 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#21 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#22 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#23 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#24 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#25 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#26 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#27 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#28 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#29 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#30 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#31 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#32 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#33 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#34 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#35 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#36 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#37 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#38 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#39 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#40 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#41 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#42 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#43 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#44 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#45 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#46 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#47 0x00007fd5d1a3d24d in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#48 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#49 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#50 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#51 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#52 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#53 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#54 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#55 0x00007fd5d19fbec0 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#56 0x00007fd5d19f21b8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#57 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#58 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#59 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#60 0x00007fd5d1a3dae7 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#61 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#62 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#63 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#64 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#65 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#66 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#67 0x00007fd5d19beec8 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#68 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#69 0x00007fd5d1a3c5ac in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#70 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#71 0x00007fd5d1a3edd2 in PyEval_EvalFrameEx () from /tools/python27/lib/libpython2.7.so.1.0
#72 0x00007fd5d1a3f6be in PyEval_EvalCodeEx () from /tools/python27/lib/libpython2.7.so.1.0
#73 0x00007fd5d19bedc1 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#74 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#75 0x00007fd5d19a256f in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#76 0x00007fd5d198ff03 in PyObject_Call () from /tools/python27/lib/libpython2.7.so.1.0
#77 0x00007fd5d1a37493 in PyEval_CallObjectWithKeywords () from /tools/python27/lib/libpython2.7.so.1.0
#78 0x00007fd5d1a75692 in ?? () from /tools/python27/lib/libpython2.7.so.1.0
#79 0x0000003e9de077f1 in start_thread () from /lib64/libpthread.so.0
#80 0x0000003e9dae592d in clone () from /lib64/libc.so.6
Process 2196 attached - interrupt to quit
recvfrom(5, ^C <unfinished ...>
[root@mobile-imaging-007 ~]# lsof -p 25019
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
mozpool-s 25019 apache 5u IPv4 53125226 0t0 TCP mobile-imaging-007.p7.releng.scl1.mozilla.com:48075->buildbot-rw-vip.db.scl3.mozilla.com:mysql (ESTABLISHED)
meaning that it's waiting for incoming data from the DB. Per above, that TCP connection should see traffic every 120s, if not every 5s.
[root@mobile-imaging-007 ~]# tcpdump tcp port 48075
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
shows no output over ~10m. Is SO_KEEPALIVE set? Linux doesn't provide a way to inspect another process's socket options from the outside, but we can do it in gdb.
(gdb) call (int*)malloc(4)
$1 = (int *) 0x2c87990
(gdb) call (int*)malloc(4)
$2 = (int *) 0x2bc56d0
(gdb) call *(int *)$2 = 4
$3 = 4
(gdb) call getsockopt(5, 1, 9, $1, $2)
$4 = 0
(gdb) p *(int *)$1
$5 = 0
(gdb) quit
$1 is *optval and $2 is *optlen, so *(int *)$1 is the value of SO_KEEPALIVE. The $5 = 0 means "disabled", where 1 would mean "enabled", as confirmed by creating a socket by hand in another process. So at least for some connections, Python-MySQL is not setting SO_KEEPALIVE, as I thought in comment 26. A grep of the Python-MySQL source for SO_KEEPALIVE confirms this. Comment 26 was based on reading an strace of a new connection being made, but must have been a mis-read of the source.
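For anyone decoding the gdb call: level 1 is SOL_SOCKET and optname 9 is SO_KEEPALIVE on Linux. From inside a Python process (not an option for this already-running service, hence gdb) the same check is just:

import socket

def keepalive_enabled(sock):
    # Returns 1 if SO_KEEPALIVE is set on this socket, 0 if not.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)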
Assignee
Comment 38•12 years ago
Sorry, I forgot we're using PyMySQL and not Python-MySQL. PyMySQL doesn't set SO_KEEPALIVE either, but at least it's in Python and thus monkey-patchable.
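The actual patch is attached rather than quoted; as a sketch of the general shape only (the attribute holding PyMySQL's raw socket has varied across versions, so the names below are assumptions), the fix sets SO_KEEPALIVE on the driver's socket whenever a connection is handed out, e.g. from the same kind of SQLAlchemy pool-checkout listener already visible in the tracebacks above (mozpool/db/sql.py's _checkout_listener):

import logging
import socket

log = logging.getLogger('db.sql')

def set_keepalive(dbapi_conn):
    # PyMySQL keeps its raw socket on the connection object; probe the
    # attribute names used by different versions.
    sock = getattr(dbapi_conn, 'socket', None) or getattr(dbapi_conn, '_sock', None)
    if sock is None:
        log.warning("no raw socket found on %r; cannot set SO_KEEPALIVE", dbapi_conn)
        return
    log.debug("setting SO_KEEPALIVE on MySQL socket %d", sock.fileno())
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)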
Assignee
Comment 39•12 years ago
Assignee
Updated•12 years ago
Group: infra
Assignee
Comment 40•12 years ago
Unfortunately, this fix isn't possible with MySQLdb -- it uses the mysql client API, which hides the socket even from the C code, let alone the Python. The patch at least logs loudly in that case.
Attachment #720043 -
Flags: review?(catlee)
Comment 41•12 years ago
Comment on attachment 720043 [details] [diff] [review]
bug817762-set-keepalive.patch
Review of attachment 720043 [details] [diff] [review]:
-----------------------------------------------------------------
If we ever need to use the mysql driver, we could iterate over all open fd's and try to set the keepalive socket option on them.
Attachment #720043 -
Flags: review?(catlee) → review+
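That fallback never proved necessary, but for illustration (an assumption, not shipped code) it would amount to walking the process's open file descriptors and setting the option on anything that turns out to be a socket:

import os
import socket

def keepalive_all_fds():
    # Best effort: dup each open fd as a socket; setsockopt fails with
    # ENOTSOCK on non-sockets and those are simply skipped.
    for fd in os.listdir('/proc/self/fd'):
        try:
            s = socket.fromfd(int(fd), socket.AF_INET, socket.SOCK_STREAM)
        except (OSError, socket.error):
            continue
        try:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        except socket.error:
            pass
        finally:
            s.close()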
Assignee
Comment 42•12 years ago
Richard helpfully points out that the MySQL client libraries which Python-MySQL talks to *do* set this option. So there's no need to work around this with Python-MySQL. I've already landed attachment 720043 [details] [diff] [review], but I'll make some comment-only updates to indicate that this is a complete solution. I'll update the PyMySQL issue too.
Assignee
Comment 43•12 years ago
I just deployed mozpool-2.0.3 in bug 848738, which contains only this fix.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Assignee
Comment 44•12 years ago
request.13980 INFO - [2013-03-11 15:01:02,522] entering state device_not_found
device WARNING - [2013-03-11 15:01:03,141] polling thread still running at 295639s; not starting another
request WARNING - [2013-03-11 15:01:12,357] polling thread still running at 10s; not starting another
request WARNING - [2013-03-11 15:01:13,359] polling thread still running at 11s; not starting another
request WARNING - [2013-03-11 15:01:14,460] polling thread still running at 12s; not starting another
request WARNING - [2013-03-11 15:01:15,672] polling thread still running at 13s; not starting another
request WARNING - [2013-03-11 15:01:17,004] polling thread still running at 14s; not starting another
request WARNING - [2013-03-11 15:01:18,470] polling thread still running at 16s; not starting another
request WARNING - [2013-03-11 15:01:20,083] polling thread still running at 17s; not starting another
request WARNING - [2013-03-11 15:01:21,856] polling thread still running at 19s; not starting another
request WARNING - [2013-03-11 15:01:23,807] polling thread still running at 21s; not starting another
request WARNING - [2013-03-11 15:01:25,953] polling thread still running at 23s; not starting another
request WARNING - [2013-03-11 15:01:28,314] polling thread still running at 25s; not starting another
request WARNING - [2013-03-11 15:01:30,910] polling thread still running at 28s; not starting another
request WARNING - [2013-03-11 15:01:33,767] polling thread still running at 31s; not starting another
request WARNING - [2013-03-11 15:01:36,908] polling thread still running at 34s; not starting another
request WARNING - [2013-03-11 15:01:40,364] polling thread still running at 38s; not starting another
request WARNING - [2013-03-11 15:01:44,166] polling thread still running at 41s; not starting another
request WARNING - [2013-03-11 15:01:48,348] polling thread still running at 46s; not starting another
request WARNING - [2013-03-11 15:01:52,947] polling thread still running at 50s; not starting another
request WARNING - [2013-03-11 15:01:58,007] polling thread still running at 55s; not starting another
request.13980 INFO - [2013-03-11 15:02:02,619] entering state closed
device WARNING - [2013-03-11 15:02:03,441] polling thread still running at 295699s; not starting another
device WARNING - [2013-03-11 15:03:03,742] polling thread still running at 295759s; not starting another
device WARNING - [2013-03-11 15:04:04,043] polling thread still running at 295820s; not starting another
db.sql DEBUG - [2013-03-11 15:04:13,723] setting SO_KEEPALIVE on MySQL socket 3
db.sql DEBUG - [2013-03-11 15:04:33,744] setting SO_KEEPALIVE on MySQL socket 4
device WARNING - [2013-03-11 15:05:04,307] polling thread still running at 295880s; not starting another
db.sql DEBUG - [2013-03-11 15:05:33,817] setting SO_KEEPALIVE on MySQL socket 6
device WARNING - [2013-03-11 15:06:04,608] polling thread still running at 295940s; not starting another
device WARNING - [2013-03-11 15:07:04,908] polling thread still running at 296001s; not starting another
db.sql DEBUG - [2013-03-11 15:07:13,912] setting SO_KEEPALIVE on MySQL socket 9
device WARNING - [2013-03-11 15:08:05,209] polling thread still running at 296061s; not starting another
db.sql DEBUG - [2013-03-11 15:08:13,975] setting SO_KEEPALIVE on MySQL socket 8
device WARNING - [2013-03-11 15:09:05,510] polling thread still running at 296121s; not starting another
device WARNING - [2013-03-11 15:10:05,810] polling thread still running at 296182s; not starting another
device WARNING - [2013-03-11 15:11:06,111] polling thread still running at 296242s; not starting another
device WARNING - [2013-03-11 15:12:06,411] polling thread still running at 296302s; not starting another
device WARNING - [2013-03-11 15:13:06,712] polling thread still running at 296362s; not starting another
device WARNING - [2013-03-11 15:14:07,013] polling thread still running at 296423s; not starting another
db.sql DEBUG - [2013-03-11 15:14:14,354] setting SO_KEEPALIVE on MySQL socket 3
db.sql DEBUG - [2013-03-11 15:14:34,375] setting SO_KEEPALIVE on MySQL socket 4
device WARNING - [2013-03-11 15:15:07,313] polling thread still running at 296483s; not starting another
db.sql DEBUG - [2013-03-11 15:15:34,448] setting SO_KEEPALIVE on MySQL socket 6
device WARNING - [2013-03-11 15:16:07,614] polling thread still running at 296543s; not starting another
device WARNING - [2013-03-11 15:17:07,914] polling thread still running at 296604s; not starting another
db.sql DEBUG - [2013-03-11 15:17:14,543] setting SO_KEEPALIVE on MySQL socket 9
device WARNING - [2013-03-11 15:18:08,215] polling thread still running at 296664s; not starting another
db.sql DEBUG - [2013-03-11 15:18:14,607] setting SO_KEEPALIVE on MySQL socket 8
device WARNING - [2013-03-11 15:19:08,516] polling thread still running at 296724s; not starting another
and from the sec mozpool:
device WARNING - [2013-03-11 15:23:34,240] polling thread still running at 293767s; not starting another
request WARNING - [2013-03-11 15:23:36,863] polling thread still running at 10s; not starting another
request WARNING - [2013-03-11 15:23:37,864] polling thread still running at 11s; not starting another
request WARNING - [2013-03-11 15:23:38,965] polling thread still running at 12s; not starting another
request WARNING - [2013-03-11 15:23:40,177] polling thread still running at 13s; not starting another
request WARNING - [2013-03-11 15:23:41,509] polling thread still running at 14s; not starting another
request WARNING - [2013-03-11 15:23:42,975] polling thread still running at 16s; not starting another
request WARNING - [2013-03-11 15:23:44,588] polling thread still running at 17s; not starting another
request WARNING - [2013-03-11 15:23:46,361] polling thread still running at 19s; not starting another
request WARNING - [2013-03-11 15:23:48,312] polling thread still running at 21s; not starting another
request WARNING - [2013-03-11 15:23:50,458] polling thread still running at 23s; not starting another
request WARNING - [2013-03-11 15:23:52,819] polling thread still running at 25s; not starting another
request WARNING - [2013-03-11 15:23:55,415] polling thread still running at 28s; not starting another
request ERROR - [2013-03-11 15:23:56,863] failure in _tick
Traceback (most recent call last):
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 78, in _tick
self.poll_for_timeouts()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 101, in poll_for_timeouts
for machine_name in self._get_timed_out_machine_names():
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/mozpool/requestmachine.py", line 71, in _get_timed_out_machine_names
return data.get_timed_out_requests(self.imaging_server_id)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 491, in get_timed_out_requests
return get_timed_out(model.requests, model.requests.c.id, imaging_server_id)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/data.py", line 276, in get_timed_out
res = sql.get_conn().execute(select(
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/sql.py", line 49, in get_conn
return get_engine().connect()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
return self._connection_cls(self, **kwargs)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
self.__connection = connection or engine.raw_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
return self.pool.unique_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
return _ConnectionFairy(self).checkout()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 401, in __init__
rec = self._connection_record = pool._do_get()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 738, in _do_get
(self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
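(For context on that last error: SQLAlchemy's QueuePool hands out at most pool_size + max_overflow connections and makes any further checkout wait up to pool_timeout seconds before raising. The 5/10/30 figures in the traceback are SQLAlchemy's defaults; the sketch below just spells the knobs out, with an illustrative URL rather than mozpool's real configuration. The error means all 15 permitted connections were checked out -- presumably stuck -- and a 16th caller gave up after 30s.)

from sqlalchemy import create_engine

engine = create_engine(
    'mysql+pymysql://user:pass@dbhost/mozpool',  # illustrative URL
    pool_size=5,       # steady-state connections kept in the pool
    max_overflow=10,   # extra connections allowed under load
    pool_timeout=30,   # seconds a checkout waits before raising TimeoutError
)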
Assignee | ||
Comment 45•12 years ago
|
||
The sec server is running 2.0.2 (without the fix), but mobile-imaging-001 is running the fixed version:
/var/log/mozpool.log.1:root INFO - [2013-03-07 10:44:06,038] Mozpool-2.0.3 server starting
so something else is amiss.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 46•12 years ago
|
||
Same on 003 and 004. The heartbeat monitor is missing these because only one of the two timeout loops (for mozpool and lifeguard) is failing.
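(A sketch of the gap being described: a heartbeat that only asks "did some loop tick recently?" stays green while one of the two loops is wedged, whereas tracking each loop separately would catch it. Names here are invented for illustration, not the actual monitor.)

import time

last_tick = {'mozpool': time.time(), 'lifeguard': time.time()}

def record_tick(loop_name):
    # called at the end of each successful polling pass
    last_tick[loop_name] = time.time()

def heartbeat_ok(max_age=300):
    # healthy only if *every* loop has ticked within max_age seconds
    now = time.time()
    return all(now - t < max_age for t in last_tick.values())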
Assignee | ||
Comment 47•12 years ago
|
||
This isn't actually blocking 802317.
I need some help on this. As you can see from the above, I'm beyond grasping at straws at this point.
No longer blocks: 802317
Assignee | ||
Comment 48•12 years ago
|
||
I verified that the socket fd's do have SO_KEEPALIVE set via the gdb trick above.
I shut off access to the db using iptables. When the connection was restored, there were some "Connection reset by peer" errors, followed by normal operation - no wedging.
Comment 49•12 years ago
|
||
I think mozpool on mobile-imaging-001.p1.releng.scl1.mozilla.com got wedged as part of the firewall maintenance tonight.
20:00 < nagios-releng> Wed 20:00:37 PDT [433] mobile-imaging-001.p1.releng.scl1.mozilla.com:http is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/http)
20:01 < nagios-releng> Wed 20:01:07 PDT [434] panda-0102.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:07 PDT [435] panda-0094.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:28 PDT [436] panda-0099.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
20:01 < nagios-releng> Wed 20:01:28 PDT [437] panda-0122.p1.releng.scl1.mozilla.com is DOWN :(Host Check Timed Out)
I logged in as root@mobile-imaging-001.p1.releng.scl1.mozilla.com and killed
apache 30477 30475 0 Mar11 ? 00:16:45 /opt/mozpool/frontend/bin/python2.7 /opt/mozpool/frontend/bin/mozpool-server 8010
nagios started being happy again, and there was much rejoicing.
Assignee | ||
Comment 50•12 years ago
|
||
Well, there was less sadness anyway.
I learned about sys._current_frames() at PyCon. It might be worth patching mozpool to dump that periodically if this is indeed still the problem. The logs from the 13th are gone, unfortunately.
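(A sketch of what that periodic dump could look like; the function and logger names are illustrative, not the patch that was eventually written.)

import sys
import logging
import threading
import traceback

log = logging.getLogger('statedriver')

def dump_all_frames():
    # map thread idents back to names so the output is readable
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, frame in sys._current_frames().items():
        stack = ''.join(traceback.format_stack(frame))
        log.warning("thread %s (%s):\n%s", ident, names.get(ident, '?'), stack)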
Assignee | ||
Comment 51•12 years ago
|
||
Surely this debugging output will lead me to a solution!
This only happens rarely now, so it could still take a while...
Attachment #732293 -
Flags: review?(jwatkins)
Reporter | ||
Comment 52•12 years ago
|
||
Comment on attachment 732293 [details] [diff] [review]
bug817762-curframes.patch
Will this be backed out once the cause is discovered and fixed?
r+
Attachment #732293 -
Flags: review?(jwatkins) → review+
Assignee | ||
Comment 53•12 years ago
|
||
This and a pile of other patches were made on this bug's behalf. The patch is landed, although it won't take effect until a new deployment.
Assignee | ||
Comment 54•12 years ago
|
||
We haven't seen this recur in a month now. I'll give it another month, and then remove some of the debugging code.
Assignee | ||
Comment 55•11 years ago
|
||
36h ago, when there was a Zayo failure and ~90s of BGP reconvergence:
device WARNING - [2013-05-29 01:56:23,320] polling thread still running at 98s; not starting another
request WARNING - [2013-05-29 01:56:30,549] polling thread still running at 98s; not starting another
but they all recovered. Another week, and we'll start backing this out.
Comment 56•11 years ago
|
||
dustin, is this related? There are 30 minutes of nothing.
2013-06-03T07:18:55 statemachine entering state operation_complete
2013-06-03T07:18:55 statemachine sending imaging result 'complete' to Mozpool
2013-06-03T07:18:55 statemachine entering state ready
2013-06-03T07:49:16 sut connecting to SUT agent
"If you see an "Imaging complete" line, and then big fat nothing -- no pinging or anything -- then that's a repeat of bug 817762, which makes pandas sad (literally). Re-open the bug and find Dustin. "
Assignee | ||
Comment 57•11 years ago
|
||
Nope - that looks normal, actually. It imaged the board, and then didn't do anything to it while it was reserved, then did a SUT check 40m later, after the reservation was closed.
Assignee | ||
Comment 58•11 years ago
|
||
Bug 817762: Roll back
Revert "Bug 817762: write out current frames whenever a timeout thread
is wedged; r=dividehex" but leave the monitoring of the polling thread's
duration, as this gives useful output during network failures, etc.
Soften the logging a little bit by not logging until the hang has lasted
for 80s.
Attachment #759209 -
Flags: review?(jwatkins)
Reporter | ||
Comment 59•11 years ago
|
||
Attachment #759209 -
Flags: review?(jwatkins) → review+
Assignee | ||
Updated•11 years ago
|
Attachment #759209 -
Flags: checked-in+
Assignee | ||
Updated•11 years ago
|
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 60•11 years ago
|
||
This just recurred on -010.
> 127.0.0.1:52550 - - [11/Jun/2013 11:48:56] "HTTP/1.1 GET /api/request/208647/status/" - 200 OK
> sut.cli ERROR - [2013-06-11 11:49:06,060] Exception initiating DeviceManager!: Remote Device Error: unable to connect to panda-0071.p10.releng.scl1.mozilla.com after 1 attempts
> 127.0.0.1:52551 - - [11/Jun/2013 11:49:08] "HTTP/1.1 GET /api/request/208644/status/" - 200 OK
> 127.0.0.1:52552 - - [11/Jun/2013 11:49:11] "HTTP/1.1 GET /api/request/208650/status/" - 200 OK
> device INFO - [2013-06-11 11:49:14,965] handling timeout on panda-0072
> request INFO - [2013-06-11 11:49:16,060] handling timeout on 208650
> 127.0.0.1:52553 - - [11/Jun/2013 11:49:26] "HTTP/1.1 GET /api/request/208643/status/" - 200 OK
> 127.0.0.1:52555 - - [11/Jun/2013 11:49:34] "HTTP/1.1 GET /api/request/208646/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:49:55,610] setting SO_KEEPALIVE on MySQL socket 13
> db.pool DEBUG - [2013-06-11 11:49:56,549] setting SO_KEEPALIVE on MySQL socket 15
> 127.0.0.1:52560 - - [11/Jun/2013 11:49:56] "HTTP/1.1 GET /api/request/208647/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:08,733] setting SO_KEEPALIVE on MySQL socket 18
> 127.0.0.1:52563 - - [11/Jun/2013 11:50:08] "HTTP/1.1 GET /api/request/208644/status/" - 200 OK
> 127.0.0.1:52565 - - [11/Jun/2013 11:50:11] "HTTP/1.1 GET /api/request/208650/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:18,054] setting SO_KEEPALIVE on MySQL socket 20
> db.pool DEBUG - [2013-06-11 11:50:27,101] setting SO_KEEPALIVE on MySQL socket 22
> 127.0.0.1:52569 - - [11/Jun/2013 11:50:27] "HTTP/1.1 GET /api/request/208643/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:34,490] setting SO_KEEPALIVE on MySQL socket 24
> 127.0.0.1:52572 - - [11/Jun/2013 11:50:34] "HTTP/1.1 GET /api/request/208646/status/" - 200 OK
> db.pool DEBUG - [2013-06-11 11:50:34,744] setting SO_KEEPALIVE on MySQL socket 26
> device WARNING - [2013-06-11 11:50:36,441] polling thread still running at 81s; not starting another
> 127.0.0.1:52576 - - [11/Jun/2013 11:50:37] "HTTP/1.1 GET /api/relay/panda-relay-004/test/" - 200 OK
> 127.0.0.1:52577 - - [11/Jun/2013 11:50:38] "HTTP/1.1 GET /api/relay/panda-relay-006/test/" - 200 OK
> 127.0.0.1:52578 - - [11/Jun/2013 11:50:39] "HTTP/1.1 GET /api/relay/panda-relay-002/test/" - 200 OK
> 127.0.0.1:52580 - - [11/Jun/2013 11:50:39] "HTTP/1.1 GET /api/bmm/pxe_config/list/" - 200 OK
> device WARNING - [2013-06-11 11:50:44,589] polling thread still running at 89s; not starting another
> 127.0.0.1:52583 - - [11/Jun/2013 11:50:48] "HTTP/1.1 GET /api/relay/panda-relay-079/test/" - 200 OK
> device WARNING - [2013-06-11 11:50:53,553] polling thread still running at 98s; not starting another
> device WARNING - [2013-06-11 11:51:03,413] polling thread still running at 108s; not starting another
> device WARNING - [2013-06-11 11:51:14,259] polling thread still running at 119s; not starting another
> device WARNING - [2013-06-11 11:51:26,189] polling thread still running at 131s; not starting another
> device WARNING - [2013-06-11 11:51:39,313] polling thread still running at 144s; not starting another
> device WARNING - [2013-06-11 11:51:53,749] polling thread still running at 158s; not starting another
> device WARNING - [2013-06-11 11:52:09,628] polling thread still running at 174s; not starting another
> device WARNING - [2013-06-11 11:52:27,084] polling thread still running at 192s; not starting another
> device WARNING - [2013-06-11 11:52:46,298] polling thread still running at 211s; not starting another
> device WARNING - [2013-06-11 11:53:07,433] polling thread still running at 232s; not starting another
> device WARNING - [2013-06-11 11:53:30,662] polling thread still running at 255s; not starting another
> device WARNING - [2013-06-11 11:53:56,236] polling thread still running at 281s; not starting another
> device WARNING - [2013-06-11 11:54:24,367] polling thread still running at 309s; not starting another
> device WARNING - [2013-06-11 11:54:55,311] polling thread still running at 340s; not starting another
> device WARNING - [2013-06-11 11:55:29,349] polling thread still running at 374s; not starting another
> device WARNING - [2013-06-11 11:56:06,781] polling thread still running at 411s; not starting another
> device WARNING - [2013-06-11 11:56:47,968] polling thread still running at 453s; not starting another
> device WARNING - [2013-06-11 11:57:33,273] polling thread still running at 498s; not starting another
> device WARNING - [2013-06-11 11:58:23,108] polling thread still running at 548s; not starting another
> device WARNING - [2013-06-11 11:59:17,927] polling thread still running at 602s; not starting another
> device WARNING - [2013-06-11 12:00:18,228] polling thread still running at 663s; not starting another
> root INFO - [2013-06-11 12:01:01,564] Mozpool-4.1.1 server starting
> db.pool DEBUG - [2013-06-11 12:01:01,658] setting SO_KEEPALIVE on MySQL socket 3
> db.pool DEBUG - [2013-06-11 12:01:01,690] setting SO_KEEPALIVE on MySQL socket 4
> request INFO - [2013-06-11 12:01:01,696] handling timeout on 208643
> device INFO - [2013-06-11 12:01:01,704] handling timeout on panda-0061
> db.pool DEBUG - [2013-06-11 12:01:01,744] setting SO_KEEPALIVE on MySQL socket 6
> db.pool DEBUG - [2013-06-11 12:01:09,643] setting SO_KEEPALIVE on MySQL socket 10
> db.pool DEBUG - [2013-06-11 12:01:11,695] setting SO_KEEPALIVE on MySQL socket 12
> db.pool DEBUG - [2013-06-11 12:01:17,705] setting SO_KEEPALIVE on MySQL socket 14
> db.pool DEBUG - [2013-06-11 12:01:18,431] setting SO_KEEPALIVE on MySQL socket 17
> db.pool DEBUG - [2013-06-11 12:01:28,213] setting SO_KEEPALIVE on MySQL socket 19
> db.pool DEBUG - [2013-06-11 12:01:35,736] setting SO_KEEPALIVE on MySQL socket 22
> db.pool DEBUG - [2013-06-11 12:01:47,746] setting SO_KEEPALIVE on MySQL socket 24
> device WARNING - [2013-06-11 12:02:23,155] polling thread still running at 81s; not starting another
> device WARNING - [2013-06-11 12:02:31,304] polling thread still running at 89s; not starting another
> device WARNING - [2013-06-11 12:02:40,268] polling thread still running at 98s; not starting another
> device WARNING - [2013-06-11 12:02:50,128] polling thread still running at 108s; not starting another
> device WARNING - [2013-06-11 12:03:00,974] polling thread still running at 119s; not starting another
> device WARNING - [2013-06-11 12:03:12,904] polling thread still running at 131s; not starting another
> device WARNING - [2013-06-11 12:03:26,027] polling thread still running at 144s; not starting another
> device WARNING - [2013-06-11 12:03:40,463] polling thread still running at 158s; not starting another
> device WARNING - [2013-06-11 12:03:56,342] polling thread still running at 174s; not starting another
> device WARNING - [2013-06-11 12:04:13,810] polling thread still running at 192s; not starting another
> device WARNING - [2013-06-11 12:04:33,024] polling thread still running at 211s; not starting another
> device WARNING - [2013-06-11 12:04:54,157] polling thread still running at 232s; not starting another
> device WARNING - [2013-06-11 12:05:17,406] polling thread still running at 255s; not starting another
> device WARNING - [2013-06-11 12:05:42,980] polling thread still running at 281s; not starting another
> device WARNING - [2013-06-11 12:06:11,111] polling thread still running at 309s; not starting another
> device WARNING - [2013-06-11 12:06:42,055] polling thread still running at 340s; not starting another
> request.208643 WARNING - [2013-06-11 12:01:01,756] HERE
> request WARNING - [2013-06-11 12:02:23,174] polling thread still running at 81s; not starting another
> db.pool DEBUG - [2013-06-11 12:07:08,132] setting SO_KEEPALIVE on MySQL socket 203
> device INFO - [2013-06-11 12:07:08,133] handling timeout on panda-0062
> request.208645 INFO - [2013-06-11 12:01:04,179] Request closed.
Note that this continued to time out for 5m *after* I restarted it ("Mozpool-4.1.1 server starting"). Note, too, the times on the log entries -- they're *way* out of order.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 61•11 years ago
|
||
There were a bunch of dead items reported on AWS buildbot masters right after this, so maybe there was a network blip?
Assignee | ||
Comment 62•11 years ago
|
||
And the other weird thing here is that it fixed itself this time.
Assignee | ||
Comment 63•11 years ago
|
||
The mystery of the mis-ordered log entries is solved. The message time is set early in the call to _log, and then it runs through the relevant handlers. Well, for 'request.###', the first handler is writing to the logs table in the DB. It would seem that all of the DB connections were a bit tied up at that time.
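(To illustrate the mechanism -- this is a toy, not the real handler: the timestamp in each line is the LogRecord's creation time, but the record only reaches the file after every earlier handler has run, so a database handler blocked behind a busy connection pool delays the file line by however long it blocked.)

import logging
import time

class SlowDBHandler(logging.Handler):
    def emit(self, record):
        time.sleep(120)  # stand-in for an INSERT stuck behind a tied-up pool

log = logging.getLogger('request.12345')   # hypothetical logger name
log.addHandler(SlowDBHandler())            # runs first, blocks
log.addHandler(logging.StreamHandler())    # sees the record two minutes late
log.warning("this line's timestamp is set now, but it is emitted at now+120s")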
Looking on another imaging server:
device WARNING - [2013-06-11 12:04:15,207] polling thread still running at 904s; not starting another
So, I think that the DB just got overloaded at this time. Mozpool weathered it as best it could.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 64•11 years ago
|
||
DB load graphs show a spike at almost exactly 9am pacific, but nothing at noon pacific. So, this could be network or it could be DB.
Reporter | ||
Comment 66•11 years ago
|
||
mozpool-server on mobile-imaging-009.p9 hung today @ ~1646 PDT. Restarting mozpool-server recovered it. This also caused the nagios checks for the pandas and relays in p9 to alert since the checks are dependent on the mozpool http API.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Comment 67•11 years ago
|
||
First Traceback started at 1530 PDT. I would be surprised if this wasn't related to the major SCL3 network outage taking place today.
===============================
device ERROR - [2013-07-08 15:30:11,388] failure in _tick
Traceback (most recent call last):
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 81, in _tick
self.poll_for_timeouts()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/statedriver.py", line 115, in poll_for_timeouts
for machine_name in self._get_timed_out_machine_names():
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/lifeguard/devicemachine.py", line 73, in _get_timed_out_machine_names
return self.db.devices.list_timed_out(self.imaging_server_id)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/base.py", line 100, in list_timed_out
& (tbl.c.imaging_server_id == imaging_server_id)))
File "/opt/mozpool/frontend/lib/python2.7/site-packages/mozpool/db/pool.py", line 65, in execute
conn = self.engine.connect()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2471, in connect
return self._connection_cls(self, **kwargs)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 878, in __init__
self.__connection = connection or engine.raw_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2557, in raw_connection
return self.pool.unique_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 184, in unique_connection
return _ConnectionFairy(self).checkout()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 402, in __init__
conn = self.connection = self._connection_record.get_connection()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 324, in get_connection
self.connection = self.__connect()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/pool.py", line 344, in __connect
connection = self.__pool._creator()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
return dialect.connect(*cargs, **cparams)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 281, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/__init__.py", line 93, in Connect
return Connection(*args, **kwargs)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 510, in __init__
self._connect()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 677, in _connect
self._request_authentication()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 719, in _request_authentication
self._send_authentication()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 767, in _send_authentication
auth_packet = MysqlPacket(self)
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 200, in __init__
self.__recv_packet()
File "/opt/mozpool/frontend/lib/python2.7/site-packages/pymysql/connections.py", line 206, in __recv_packet
raise OperationalError(2013, "Lost connection to MySQL server during query")
OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query') None None
Assignee | ||
Comment 68•11 years ago
|
||
Yes, that tb is normal during a MySQL failure. I'll be curious to know how long the hang was, and whether it was greater than the socket timeout.
Reporter | ||
Comment 69•11 years ago
|
||
For reference to the scl3 outage:
https://bugzilla.mozilla.org/show_bug.cgi?id=891128
Assignee | ||
Comment 70•11 years ago
|
||
On mobile-imaging-001, it looks like MySQL went away at about 2013-07-08 16:11 (pacific). About 180s after that, the "Lost connection.." errors start. There is a recovery a few minutes later, followed by another spate of polling failures. The last "polling thread still running" is at 16:22:52. There were some additional MySQL failures at 16:29:54, but those did not affect the timeout loop.
On mobile-imaging-009, things started at 16:10:26 with some lost connections. Mozpool treated those as failed operations, handled within the normal flow of the device state machine. The timeout loop failed at the same times as on mobile-imaging-001. It saw the same MySQL failures at around 16:29:54.
However, its device polling thread continued to hang:
> device WARNING - [2013-07-08 17:25:11,049] polling thread still running at 4462s; not starting another
There were lots of errors about
> Timeout: HTTPConnectionPool(host='mobile-imaging-009.p9.releng.scl1.mozilla.com', port=80): Request timed out. (timeout=30)
and eventually
> socket.error: [Errno 24] Too many open files
followed by a bunch of failures to connect to MySQL that are probably hiding the same underlying errno. Then the host was restarted.
Without seeing what happened, I can't say much more than that. This could partly be a resource overload - too many async HTTP connections to self waiting for a long `requests` timeout, while lifeguard is happily initiating more connections. But that doesn't explain the state of the device loop.
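(If the overload hypothesis is right, one mitigation would be to cap how many of those self-directed requests can be in flight at once. A sketch only; the limit and function are invented, not mozpool code.)

import threading
import requests

MAX_INFLIGHT = 20
_slots = threading.BoundedSemaphore(MAX_INFLIGHT)

def ping_self(url):
    # refuse, rather than queue, when too many requests are already stuck
    if not _slots.acquire(False):
        raise RuntimeError("too many self-requests already in flight")
    try:
        return requests.get(url, timeout=30)
    finally:
        _slots.release()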
Assignee | ||
Comment 71•11 years ago
|
||
This also affected mobile-imaging-008.
I hate threads.
I'm tired of this bug.
Let's change the "still running" warning to kill the entire process when it reaches, say, 600s. Supervisord will restart the process.
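(A minimal sketch of that approach, with the threshold and names illustrative -- the real change is the suicide patch reviewed below. os._exit() is used deliberately so the exit can't itself be blocked by wedged threads.)

import os
import logging

log = logging.getLogger('statedriver')
MAX_POLL_SECONDS = 600

def check_poll_duration(elapsed):
    if elapsed > MAX_POLL_SECONDS:
        log.critical("polling thread wedged for %ds; exiting so supervisord can restart us", elapsed)
        logging.shutdown()  # flush handlers; nothing else is trustworthy at this point
        os._exit(1)
    elif elapsed > 80:
        log.warning("polling thread still running at %ds; not starting another", elapsed)

(As discussed in comments 72 and 76, the version that ultimately landed also sends a notification email before exiting.)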
Reporter | ||
Comment 72•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #71)
> This also affected mobile-imaging-008.
>
> I hate threads.
>
> I'm tired of this bug.
>
> Let's change the "still running" warning to kill the entire process when it
> reaches, say, 600s. Supervisord will restart the process.
I think that's a good idea as long as it gets reported somewhere (nagios, email, etc) with 'loud noises and bright flashing lights'
Assignee | ||
Comment 73•11 years ago
|
||
This failed on -007 as well. None of the threads were blocked reading or writing from a socket -- they were all in sem_wait, which indicates some kind of sync primitive.
Assignee | ||
Comment 75•11 years ago
|
||
I didn't add any notification to this, aside from the logging. I forgot to, but on reflection I don't see why we would -- what action can we take when this occurs, other than needlessly re-opening this bug again?
Assignee | ||
Comment 76•11 years ago
|
||
Comment on attachment 774158 [details] [diff] [review]
bug817762-suicide.patch
I'll revise to send an email.
Attachment #774158 -
Attachment is obsolete: true
Attachment #774158 -
Flags: review?(jwatkins)
Assignee | ||
Comment 77•11 years ago
|
||
Attachment #774687 -
Flags: review?(jwatkins)
Assignee | ||
Comment 78•11 years ago
|
||
Attachment #774696 -
Flags: review?(jwatkins)
Reporter | ||
Comment 79•11 years ago
|
||
Comment on attachment 774687 [details] [diff] [review]
bug817762-suicide-r2.patch
Review of attachment 774687 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm
Attachment #774687 -
Flags: review?(jwatkins) → review+
Reporter | ||
Updated•11 years ago
|
Attachment #774696 -
Flags: review?(jwatkins) → review+
Reporter | ||
Comment 80•11 years ago
|
||
This closed the trees tonight :-{ SHIPIT PLS
https://bugzilla.mozilla.org/show_bug.cgi?id=893511
Assignee | ||
Updated•11 years ago
|
Attachment #774687 -
Flags: checked-in+
Assignee | ||
Updated•11 years ago
|
Attachment #774696 -
Flags: checked-in+
Assignee | ||
Comment 81•11 years ago
|
||
The preceding two patches will be in Mozpool-4.1.2 (bug 893757)
Assignee | ||
Comment 82•11 years ago
|
||
I pushed a fix to the puppet patch in https://hg.mozilla.org/build/puppet/rev/32837fc8e15d
Assignee | ||
Comment 83•11 years ago
|
||
It occurs to me that the problem here is probably two-part now:
1. DB error causes an exception, which skips or breaks lock cleanup
2. Threads pile up on the lock
So a possible way to duplicate this is to make all DB methods randomly raise exceptions, and see if I can get a failure.
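(A sketch of that fault-injection idea; the wrapper below is hypothetical, not mozpool code: wrap every DB method so it randomly raises, then watch whether a lock stays held afterwards.)

import functools
import random

class FlakyDB(object):
    """Proxy that makes a fraction of DB method calls raise, to try to
    reproduce the exception-skips-lock-cleanup sequence."""
    def __init__(self, real_db, failure_rate=0.05):
        self._db = real_db
        self._rate = failure_rate

    def __getattr__(self, name):
        attr = getattr(self._db, name)
        if not callable(attr):
            return attr
        @functools.wraps(attr)
        def wrapper(*args, **kwargs):
            if random.random() < self._rate:
                raise RuntimeError("injected DB failure in %s" % name)
            return attr(*args, **kwargs)
        return wrapper

(If the hypothesis holds, the fix is the usual one: take the lock in a with block or try/finally so an exception in the DB call can't skip the release.)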
Updated•11 years ago
|
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
Assignee | ||
Comment 84•11 years ago
|
||
So the root cause here is loooong DB queries, blocked while the entire DB server is locked for many minutes. Some of the failures we saw were only that and nothing more.
There seems to be a deadlock that occurs, too, in this situation. That may occur when something that is supposed to take only N seconds doesn't even *start* for N seconds.
At any rate, the suicide code seems to be holding this together, and getting production Mozpool off of the buildbot DB cluster (which has the multi-minute locks) will help, too (bug 897109). There's nothing more to do here.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED