Closed
Bug 771218
Opened 12 years ago
Closed 12 years ago
Ongoing problems with Socorro staging
Categories
(mozilla.org Graveyard :: Server Operations, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: laura, Assigned: bburton)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
(deleted), text/plain
Yesterday, the two staging processors each died, unable to get connections to PostgreSQL or HBase. This has happened several times this week. When we have noticed and restarted, things come back to life.
I'm assuming network or Zeus trouble - can someone please take a look?
From a timing perspective:
socorro-processor2.stage.metrics.phx1 stopped at 2012-07-04 05:46:36,199
socorro-processor1.stage.metrics.phx1 stopped at 2012-07-04 18:24:04,516
Comment 1•12 years ago
Are these times UTC?
Reporter
Comment 2•12 years ago
Those times are from the logs on the machines, so I guess it's whatever the system time is.
Comment 3•12 years ago
Who from server ops is going to examine Zeus?
Comment 4•12 years ago
This problem continues. After restarting the processors on both socorro-processor2.stage.metrics.phx1 and socorro-processor1.stage.metrics.phx1 at 2012-07-05 10:30:31, they ran for eleven hours and then suddenly lost their connections to PostgreSQL at 2012-07-05 21:10:17.
This problem is holding us back from getting our next release ready.
Assignee
Comment 5•12 years ago
Per IRC, I'll take a look at Zeus
Assignee: server-ops → bburton
Status: NEW → ASSIGNED
Assignee
Comment 6•12 years ago
After digging through the event log on pp-zlb09, I found the following entries, which show a blip between 9:44 PM and 9:57 PM PST yesterday. That is after the timestamp :lars noted, but within the same hour.
I'm adding :jakem and :cshields for some input.
:ravi, do you see anything on the network side for the timestamps below?
[05/Jul/2012:21:44:43 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.65:9090': Timeout
[05/Jul/2012:21:44:44 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.65:9090 nodefail Node 10.8.100.65 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:48 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.62:9090': Timeout
[05/Jul/2012:21:44:49 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodefail Node 10.8.100.62 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:50 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.66:9090': Timeout
[05/Jul/2012:21:44:51 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodefail Node 10.8.100.66 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:55 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.66:9090'.
[05/Jul/2012:21:44:56 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodeworking Node 10.8.100.66 is working again
[05/Jul/2012:21:44:56 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.63:9090': Timeout
[05/Jul/2012:21:44:56 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodefail Node 10.8.100.63 has failed - A monitor has detected a failure
[05/Jul/2012:21:44:58 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.67:9090': Timeout
[05/Jul/2012:21:44:59 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodefail Node 10.8.100.67 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:01 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.63:9090'.
[05/Jul/2012:21:45:02 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodeworking Node 10.8.100.63 is working again
[05/Jul/2012:21:45:02 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.69:9090': Timeout
[05/Jul/2012:21:45:03 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.69:9090 nodefail Node 10.8.100.69 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:24 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.67:9090'.
[05/Jul/2012:21:45:25 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodeworking Node 10.8.100.67 is working again
[05/Jul/2012:21:45:30 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.62:9090'.
[05/Jul/2012:21:45:31 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodeworking Node 10.8.100.62 is working again
[05/Jul/2012:21:45:37 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.63:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:38 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodefail Node 10.8.100.63 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:44 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.66:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:45 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodefail Node 10.8.100.66 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:47 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.67:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:48 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodefail Node 10.8.100.67 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:50 -0700] WARN monitors/socorro-thrift-check monitorfail Monitor has detected a failure in node '10.8.100.62:9090': Monitor exited, exit code 1, no output generated
[05/Jul/2012:21:45:51 -0700] SERIOUS pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodefail Node 10.8.100.62 has failed - A monitor has detected a failure
[05/Jul/2012:21:45:51 -0700] SERIOUS pools/socorro-thrift-stage:9090 pooldied Pool has no back-end nodes responding
[05/Jul/2012:21:56:51 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.65:9090'.
[05/Jul/2012:21:56:52 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.65:9090 nodeworking Node 10.8.100.65 is working again
[05/Jul/2012:21:56:52 -0700] INFO pools/socorro-thrift-stage:9090 poolok Pool now has working nodes
[05/Jul/2012:21:56:53 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.69:9090'.
[05/Jul/2012:21:56:54 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.69:9090 nodeworking Node 10.8.100.69 is working again
[05/Jul/2012:21:57:08 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.63:9090'.
[05/Jul/2012:21:57:09 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.63:9090 nodeworking Node 10.8.100.63 is working again
[05/Jul/2012:21:57:15 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.66:9090'.
[05/Jul/2012:21:57:16 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.66:9090 nodeworking Node 10.8.100.66 is working again
[05/Jul/2012:21:57:19 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.67:9090'.
[05/Jul/2012:21:57:20 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.67:9090 nodeworking Node 10.8.100.67 is working again
[05/Jul/2012:21:57:21 -0700] INFO monitors/socorro-thrift-check monitorok Monitor is working for node '10.8.100.62:9090'.
[05/Jul/2012:21:57:22 -0700] INFO pools/socorro-thrift-stage:9090 nodes/10.8.100.62:9090 nodeworking Node 10.8.100.62 is working again
Comment 7•12 years ago
Please note: the problems in the logs in comment #6 involve HBase. The problem we're seeing is with Postgres.
The Postgres problem happened again today at 2012-07-06 10:58:44. This time it affected socorro-processor2.stage.metrics.phx1, but not socorro-processor1.stage.metrics.phx1.
Comment 8•12 years ago
What is the actual check testing?
Logs for the switches these hosts are connected to are clean for the times mentioned. I bumped logging level to "info" from "notice" to give us a better chance to catch something, but I don't see any indication of network issues.
I also checked the core. It is clean. STP has been quiet for a number of days now. Still looking, but I'm not seeing anything on the network side.
Reporter
Comment 10•12 years ago
Busted again right now, FWIW
Assignee
Comment 11•12 years ago
(In reply to Ravi Pina [:ravi] from comment #8)
> What is the actual check testing?
The check is a ping check
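For context, a Zeus external health monitor is essentially a program that exits 0 when the node is healthy and nonzero when it is not, which matches the "Monitor exited, exit code 1" entries in comment 6. A minimal sketch of a TCP-level "ping"-style check in that spirit (this is an illustration, not the actual socorro-thrift-check monitor):

```python
# Hedged sketch of a TCP "ping"-style health check: succeed if a TCP
# connection to host:port completes within the timeout. Illustrative
# only; not the real Zeus monitor used for socorro-thrift-check.
import socket


def tcp_ping(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Wrapped in a script that calls `sys.exit(0 if tcp_ping(host, port) else 1)`, this follows the exit-code convention implied by the Zeus log entries above.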
Assignee
Comment 12•12 years ago
Looking at socorro-processor2.stage.metrics.phx1
/var/log/messages only has puppetd run info
/var/log/dmesg is auditd messages
I don't see anything with yum since 6/27 and those were just audited changes
I'm going to hand this to the SRE team to look at and see if there is anything in the hardware logs, or whether there might be a network driver issue.
Assignee
Updated•12 years ago
Assignee: bburton → server-ops
Comment 13•12 years ago
We got processors+stageDB working now. However, we don't know what caused the original problem; by the time we cleaned it up, the original issue had ceased.
Updated•12 years ago
Priority: -- → P1
Summary: Ongoing problems with Socorro staging - network related? → Ongoing problems with Socorro staging
Comment 14•12 years ago
Monitored all night last night. Processors reported connection failure at 4:10AM Pacific. In the pgbouncer logs, we have:
2012-07-11 04:08:31.272 2789 LOG Stats: 0 req/s, in 108 b/s, out 154 b/s,query 4383 us
2012-07-11 04:08:33.711 2789 LOG C-0x1bffa40: breakpad/processor@10.8.70.208:48083 closing because: client unexpected eof (age=57431)
2012-07-11 04:08:36.516 2789 LOG C-0x1bffe78: breakpad/processor@10.8.70.208:48107 closing because: client unexpected eof (age=57433)
2012-07-11 04:08:41.510 2789 LOG C-0x1bff770: breakpad/processor@10.8.70.208:48063 closing because: client unexpected eof (age=57438)
2012-07-11 04:08:42.519 2789 LOG C-0x1bff8d8: breakpad/processor@10.8.70.208:48075 closing because: client unexpected eof (age=57439)
2012-07-11 04:08:42.519 2789 LOG C-0x1bff338: breakpad/processor@10.8.70.208:48028 closing because: client unexpected eof (age=57439)
2012-07-11 04:08:48.380 2789 LOG C-0x1bff608: breakpad/processor@10.8.70.208:48043 closing because: client unexpected eof (age=57445)
2012-07-11 04:08:54.385 2789 LOG C-0x1bffd10: breakpad/processor@10.8.70.208:48102 closing because: client unexpected eof (age=57451)
2012-07-11 04:09:31.273 2789 LOG Stats: 0 req/s, in 111 b/s, out 156 b/s,query 4147 us
2012-07-11 04:10:31.274 2789 LOG Stats: 0 req/s, in 116 b/s, out 165 b/s,query 4073 us
That seems to confirm that the problem is the loss of the network connection between the processor and the database server.
Interestingly, though, the monitor doesn't lose its connection until 7AM:
2012-07-11 06:58:42.363 PDT,"monitor","breakpad",18123,"10.8.70.208:15950",4ffd7d30.46cb,1,"idle",2012-07-11 06:18:40 PDT,6/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,""
So, I think we need a Zeus expert to start looking at Zeus for periodically recurring issues. Thanks!
Assignee
Comment 15•12 years ago
Current assessment is that an issue with the leap second "fix" we pushed with puppet was the source of the problem: it was causing the seconds value of the clock to be reset to :20 somewhat randomly. This has been fixed and staging has been restarted.
We'll monitor for 24 hours and then mark this RESOLVED FIXED, or investigate further, as needed.
Assignee
Updated•12 years ago
Assignee: server-ops → bburton
Reporter
Comment 16•12 years ago
processor2 is having PG connection difficulties again, via mpressman:
2012-07-11 12:17:18,488 ERROR - Thread-9 - DatabaseError: error with no message from the libpq
Comment 17•12 years ago
2012-07-11 12:17:18,486 ERROR - Thread-9 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-11 12:17:18,486 ERROR - Thread-9 - error with no message from the libpq
2012-07-11 12:17:18,487 ERROR - Thread-9 - trace back follows:
2012-07-11 12:17:18,487 ERROR - Thread-9 - Traceback (most recent call last):
2012-07-11 12:17:18,488 ERROR - Thread-9 - File "/data/socorro/application/socorro/processor/processor.py", line 526, in processJob
threadLocalCursor.execute("update jobs set starteddatetime = %s where id = %s", (startedDateTime, jobId))
2012-07-11 12:17:18,488 ERROR - Thread-9 - DatabaseError: error with no message from the libpq
Something BAD happened in Thread-9:
Traceback (most recent call last):
File "/data/socorro/application/socorro/lib/threadlib.py", line 128, in run
aFunction(arguments)
File "/data/socorro/application/socorro/processor/processor.py", line 478, in processJobWithRetry
result = self.processJob(jobTuple)
File "/data/socorro/application/socorro/processor/processor.py", line 651, in processJob
threadLocalDatabaseConnection.rollback()
InterfaceError: connection already closed
connection already closed
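The interesting part of the traceback above is that the error handler itself died: calling rollback() on an already-dead connection raises InterfaceError ("connection already closed"), which kills the thread. A hedged sketch of a defensive pattern for that failure mode (not Socorro's actual code; DeadConnectionError is a stand-in for psycopg2.InterfaceError so the example needs no database):

```python
# Sketch of guarding rollback() against a dead connection. In real
# psycopg2 code you would catch psycopg2.InterfaceError /
# psycopg2.OperationalError; DeadConnectionError stands in here.

class DeadConnectionError(Exception):
    """Stand-in for driver errors such as psycopg2.InterfaceError."""


def rollback_or_reconnect(connection, connect_fn):
    """Try to roll back; if the connection is dead, open a fresh one."""
    try:
        connection.rollback()
        return connection              # rollback worked; connection usable
    except DeadConnectionError:
        try:
            connection.close()         # closing a dead handle may also fail
        except DeadConnectionError:
            pass
        return connect_fn()            # hand back a brand-new connection
```

The design point is simply that cleanup paths must tolerate the same connection loss that triggered them, instead of assuming the handle is still alive.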
Comment 18•12 years ago
There was a mention during the meeting about the file leapsecond_2012_06_30 in /tmp
Is this supposed to be removed? It still exists on processor2
Comment 19•12 years ago
Just to clear this up for the bug: cshields mentioned that the existence of the file is of no concern.
Assignee
Comment 20•12 years ago
There was nothing logged in Zeus for Socorro during the noon hour, as far as I can see.
I'm going to ask an SRE to check the server.
Reporter
Comment 21•12 years ago
processor2 is down again, started paging at
Date/Time: 07-12-2012 04:58:36
Comment 22•12 years ago
These might be a red herring, but I found these messages in socorro1.stage.db.phx1:/var/log/pgbouncer/pgbouncer-processor.log:
2012-07-12 08:23:32.150 2789 LOG S-0x1c1d558: breakpad/processor@unix:5432 closing because: server lifetime over (age=3600)
2012-07-12 08:23:58.750 2789 LOG S-0x1c1d558: breakpad/processor@unix:5432 new connection to server
2012-07-12 08:24:04.843 2789 LOG S-0x1c1ca18: breakpad/processor@unix:5432 new connection to server
2012-07-12 08:30:53.063 2789 LOG S-0x1c1bc08: breakpad/processor@unix:5432 closing because: server lifetime over (age=3600)
They might just reflect normal activity; I'm not sure.
Comment 23•12 years ago
Ashish,
That's normal activity.
Comment 24•12 years ago
Interesting. I only see one error in the pgBouncer log for the period 04:00:00 to 06:00:00: one client unexpectedly disconnected. Otherwise, all activity during that period looks normal; in fact, there's evidence of continued processing during that period.
Are you sure *all* processors disconnected, rather than just one of them?
Comment 25•12 years ago
Only processor2 had trouble with connections, on all threads. processor1 continued working without trouble.
Comment 26•12 years ago
Ah, that would be this, then:
2012-07-12 04:53:00.124 2789 LOG C-0x1bfec30: breakpad/processor@10.8.70.208:1746 closing because: client unexpected eof (age=41861)
As noted above, "client unexpected eof" means the client vanished off the network without disconnecting cleanly. That points to either network breakage or a client crash.
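The "client unexpected eof" condition can be illustrated with a loopback socket pair: the server's recv() returns an empty byte string when the peer's TCP stream ends, whether the client closed cleanly or simply vanished, and pgbouncer presumably flags it "unexpected" because no protocol-level goodbye (PostgreSQL clients normally send a Terminate message before closing) arrived first. A minimal sketch, not pgbouncer code:

```python
# Loopback demo of reading EOF from a peer that disappears without
# sending any application-level disconnect message.
import socket
import threading

def serve_one(server_sock, result):
    conn, _ = server_sock.accept()
    data = conn.recv(1024)      # blocks until data arrives or EOF
    # b'' means the peer's TCP stream ended; without a protocol-level
    # disconnect message, the application can't tell a clean close()
    # from the client process or host vanishing.
    result.append(data)
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
result = []
t = threading.Thread(target=serve_one, args=(server, result))
t.start()

client = socket.socket()
client.connect(server.getsockname())
client.close()                  # disappear without saying goodbye

t.join()
server.close()
print(result)                   # [b''] -- bare EOF, no goodbye
```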
Comment 27•12 years ago
At the same time, the client reported: "DatabaseError: error with no message from the libpq"
I believe that means, from the client's perspective, the server vanished off the network.
In the past, in similar situations, we've gotten error messages that say "server closed the connection unexpectedly".
Comment 28•12 years ago
More evidence that this is a network issue. The monitor -- a separate process, one which connects to PostgreSQL directly rather than using pgbouncer -- is also randomly losing connections:
Monitor Log:
2012-07-12 03:43:06,138 ERROR - jobCleanupThread - Caught Error: <class 'psycopg2.InterfaceError'>
2012-07-12 03:43:06,138 ERROR - jobCleanupThread - connection already closed
PostgreSQL Log:
2012-07-12 03:43:06.048 PDT,"monitor","breakpad",29015,"10.8.70.208:8589",4ffea7df.7157,1,"idle",2012-07-12 03:33:03 PDT,10/0,0,LOG,08P01,"unexpected EOF on client connection",,,,,,,,,""
Assignee
Comment 29•12 years ago
I've been double-checking things. As the above comments note, connections to Postgres are dying from both socorroadm.stage and socorro-processor[1-2].stage.
socorroadm.stage is an ESX VM, socorro-processor[1-2].stage are HP blades
Both the monitor and processor apps are using 10.8.70.126 as their databaseHost, which is a VIP on pp-zlb09
A review of the Zeus logs shows db errors only being recorded on 07/06 for the month of July
[root@pp-zlb09.phx log]# grep socorro /usr/local/zeus/zxtm/log/errors | grep Jul | grep db
[06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-processor-pool nodes/10.8.70.100:6433 nodefail Node 10.8.70.100 has failed - A monitor has detected a failure
[06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-processor-pool pooldied Pool has no back-end nodes responding
[06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-web-pool pooldied Pool has no back-end nodes responding
[06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-web-pool nodes/10.8.70.100:6432 nodefail Node 10.8.70.100 has failed - A monitor has detected a failure
[06/Jul/2012:15:25:02 -0700] SERIOUS pools/db-socorro01-rw-web-pool pooldied Pool has no back-end nodes responding
[06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-processor-pool nodes/10.8.70.100:6433 nodeworking Node 10.8.70.100 is working again
[06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-processor-pool poolok Pool now has working nodes
[06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-web-pool poolok Pool now has working nodes
[06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-web-pool nodes/10.8.70.100:6432 nodeworking Node 10.8.70.100 is working again
[06/Jul/2012:15:33:23 -0700] INFO pools/db-socorro01-rw-web-pool poolok Pool now has working nodes
All the Postgres traffic to 10.8.70.126 gets sent to 10.8.70.125
I'd been focusing on trying to find a fault on the processor blades, but given comment 28 and IRC discussion, I am now reviewing the Postgres host.
Assignee
Comment 30•12 years ago
Found a possible storage/driver issue in the dmesg log; opened bug 773390 to have Ops look into it.
Assignee
Comment 31•12 years ago
Per bug 773390 and IRC, updates have been applied to the db server, the processors and monitor have been restarted and :lars confirmed stage is running.
We'll wait 24 hours to see if the issue persists.
Comment 32•12 years ago
No joy.
Staging processor #2 has again lost its connection to Postgres.
processor log:
2012-07-12 14:54:05,356 ERROR - Thread-10 - DatabaseError: error with no message from the libpq
server log:
2012-07-12 14:53:40.052 6137 LOG C-0x15fc4a0: breakpad/processor@10.8.70.208:28101 closing because: client unexpected eof (age=4500)
Comment 33•12 years ago
Just a note that this problem is not confined to connections to Postgres. This morning, both staging processors are having a bad time trying to talk to HBase at 10.8.100.209.
It started at 6:05:
2012-07-13 06:05:58,810 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection
2012-07-13 06:05:58,811 CRITICAL - Thread-5 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2012-07-13 06:05:58,811 CRITICAL - Thread-5 - the connection is not viable. retries fail:
Comment 34•12 years ago
Interestingly, the monitor appears to have no trouble at all talking to HBase.
Assignee
Comment 35•12 years ago
:mpressman commented on IRC over the weekend that things had been stable.
Is that correct?
Comment 36•12 years ago
Both staging processors spent all day Sunday trying to connect to HBase without success. Today, the processors are whiling away their time trying to talk to a silent Postgres.
Assignee
Comment 37•12 years ago
I'm not sure what to look at next. I'm not seeing any consistent errors in the Zeus logs, and this doesn't seem to be isolated to specific hosts; previously, netops confirmed seeing no network issues from the processor hosts.
I'll try to talk with Corey, Jason, and some other webops folks tomorrow morning to discuss the next best step.
Comment 38•12 years ago
I think it'd help to come up with a simpler, reproducible test case. It's hard to tell whether anything is fixed, since we can't make the problem happen on demand, and it feels like we're still not 100% sure where the problem is (please correct me if this is wrong).
Would it make sense to set up a simple persistent connection across the hosts in question (e.g. with netcat)? If *that* can't stay connected, then we know it's not, e.g., the software involved.
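A sketch of the kind of persistent-connection canary described above, in Python rather than netcat (host and port are placeholders): hold one idle TCP connection and report when the OS declares it dead. Enabling SO_KEEPALIVE makes a silent network failure eventually surface as an error on recv().

```python
# Hedged sketch of a persistent-connection canary: open one TCP
# connection, enable kernel keepalives, and block until the connection
# ends, reporting how long it survived. Host/port are placeholders.
import socket
import time

def watch(host, port):
    """Hold one TCP connection open; return seconds until it ended."""
    sock = socket.create_connection((host, port), timeout=30)
    sock.settimeout(None)   # after connecting, block indefinitely on recv
    # Kernel keepalive probes turn a silent network failure into an
    # OSError on recv() once the OS's keepalive timers expire.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    opened = time.time()
    try:
        while True:
            if not sock.recv(4096):   # b'' means the peer is gone (EOF)
                break
    except OSError as exc:
        print("socket error: %s" % exc)
    finally:
        sock.close()
    lived = time.time() - opened
    print("connection to %s:%s ended after %.0f seconds" % (host, port, lived))
    return lived
```

Run against, say, the Postgres VIP, a canary like this would give disconnect timestamps independent of the Socorro software stack, which is exactly what the on-demand reproduction above asks for.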
Assignee
Updated•12 years ago
Whiteboard: [pending webops discussion]
Comment 39•12 years ago
This bug is long and confusing. Can someone provide the data flow(s) with IP and port pairs here?
Reporter
Comment 40•12 years ago
OK. Dropped connections and timeouts have been seen with the following flows:
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432? I don't recall whether it's direct (5432) or through pgbouncer(6432). solarce/mpressman/jberkus/sheeri can confirm) (PostgreSQL)
Ravi, is that what you need?
Reporter
Comment 41•12 years ago
IP details are here: https://mana.mozilla.org/wiki/display/websites/crash-stats.mozilla.com+(Socorro)#crash-statsmozillacomSocorro-MonitorAdminhost
Comment 42•12 years ago
Laura,
> * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com ->
> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432? I don't recall
> whether it's direct (5432) or through pgbouncer(6432).
It's port 6433 actually.
We've also seen dropped connections between the monitor and dev.db (port 5432). Not sure which server the monitor is.
Comment 43•12 years ago
BTW, since we've never seen issues like this one in production, I'm wondering if this might be related to the virtualization system the staging monitor & processors run on?
Assignee
Comment 44•12 years ago
(In reply to [:jberkus] Josh Berkus from comment #43)
> BTW, since we've never seen issues like this one in production, I'm
> wondering if this might be related to the virtualization system staging
> monitor & processors run on?
socorroadm.stage is a VM.
socorro-processor[1-2].stage are HP blades.
Assignee
Comment 45•12 years ago
(In reply to Laura Thomson :laura from comment #40)
> OK. Dropped connections and timeouts have been seen with the following
> flows:
> * socorroadm.stage.private.phx1.mozilla.com (monitor) ->
> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase)
> * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com ->
> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase)
> * socorro-processor{1,2}.stage.metrics.phx1.mozilla.com ->
> socorro1.stage.db.phx1.mozilla.com:5432 (or is this 6432? I don't recall
> whether it's direct (5432) or through pgbouncer(6432).
> solarce/mpressman/jberkus/sheeri can confirm) (PostgreSQL)
>
> Ravi, is that what you need?
Monitor connections on VM:
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)
Processor connections on HP blades:
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)
Comment 46•12 years ago
The metrics ADU export has been failing on Socorro Stage every night, and seems related to the ongoing problems. For example:
ERROR 25-07 20:49:36,124 An I/O error occured while sending to the backend.
Not a very informative error message, but at least it gives the time.
I'm connecting to socorro-stage-zlb.db.phx1.mozilla.com, and running the job manually worked when I tried it earlier this week (24-07 11:38:57,118 to be specific). Please let me know if I can provide any other information.
Assignee
Comment 47•12 years ago
(In reply to Ravi Pina [:ravi] from comment #39)
> This bug is long and confusing. Can someone provide the data flow(s) with
> IP and port pairs here?
Ravi, the short version is that we're seeing connection failures to the Postgres and HBase VIPs on Zeus. The VIPs are on separate Zeus servers (pp-zlb09 and pp-zlb10). When we see connection failures in the logs of the applications connecting to the VIPs, I am not seeing corresponding entries in the Zeus error logs at or around the time of the failures.
The data flows are as follows:
Monitor connections on an ESX VM:
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)
Processor connections on two HP blades:
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb10.phx.mozilla.net)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb09.phx.mozilla.net)
I've worked with SREs to review the hardware of the Postgres master and the processor blades at points throughout diagnosing this bug, and that has checked out so far. Given that, and the failures from multiple servers to multiple VIPs on different Zeus nodes, it's hard not to look at the network as the commonality, but the Zeus cluster is also a common point.
Is there additional info you'd need to help assess if there is a possible networking related issue?
Would you recommend we try moving these VIPs to the internal (db) Zeus cluster as the next step?
Comment 48•12 years ago
(In reply to Brandon Burton [:solarce] from comment #47)
> Would you recommend we try moving these VIPs to the internal (db) Zeus
> cluster as the next step?
This needs to be done anyway, and would help in other situations as well. Are they using the same VIPs as anything prod? If not can we disrupt staging at any time to make this change?
Comment 49•12 years ago
The staging VIPs are different from the prod ones.
Comment 50•12 years ago
Oh, and stage is already pretty disrupted, so just let us know when you're taking it down. No confirmation needed.
Assignee
Comment 51•12 years ago
(In reply to [:jberkus] Josh Berkus from comment #50)
> Oh, and stage is already pretty disrupted, so just let us know when you're
> taking it down. No confirmation needed.
I'll be working on this over the next 1-2 hours, and you may see disruption at various points. I'll confirm in #breakpad when I'm ready to test things out.
Assignee
Updated•12 years ago
Whiteboard: [pending webops discussion] → [moving Zeus VIPs to internal cluster]
Assignee
Updated•12 years ago
Whiteboard: [moving Zeus VIPs to internal cluster] → [testing VIPs on internal cluster]
Assignee
Comment 52•12 years ago
HBase VIP moved at noon, confirmed working by :lars.
PG VIPs (5432, 6432, 6433) moved at 4:34; awaiting confirmation from the Socorro team. :solarce confirmed telnet works.
Comment 53•12 years ago
We need to watch the PG connections for a 24-hour period to determine if they are better.
By Monday, we should know.
Comment 54•12 years ago
I regret to report that our problems have not been resolved.
Both staging processors and the monitor lost their ability to talk to both HBase and Postgres at various times over the weekend.
socorro-processor1.stage.metrics.phx1:
2012-07-29 06:08:33,142 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-29 06:08:33,143 CRITICAL - Thread-2 - the connection is not viable. retries fail:
socorro-processor2.stage.metrics.phx1:
2012-07-30 06:04:57,996 CRITICAL - Thread-9 - something's gone horribly wrong with the HBase connection
2012-07-30 06:04:57,997 CRITICAL - Thread-9 - the connection is not viable. retries fail:
On the 29th, while processor1 was struggling with no HBase connection, processor2 was unable to talk to Postgres on some (but not all) threads. Because processor2 was unable to talk to Postgres, we don't know what was happening with its HBase connectivity, as the processor won't try to talk to HBase if it cannot talk to Postgres.
The monitor, on the 29th, suffered the same HBase connectivity problem as processor1 for about 40 minutes. Then it recovered.
Six hours later, the monitor lost the Postgres connection. Recovery happened about 9 hours later.
The monitor lost the HBase connection again on the 30th at the same time that processor2 did. Processor1 lost the HBase connection about 5 minutes later.
Reporter
Comment 55•12 years ago
Escalating this to blocker. If we can't get staging sorted out, we will be unable to ship support for rapid betas. The trouble we have had with crontabber supports the fact that we can't ship high-risk changes without a stable staging environment.
Blocks: daily_beta_tracking
Severity: normal → blocker
Assignee
Comment 56•12 years ago
Zeus saw failures for the Postgres VIP, but not the HBase VIP.
There were a couple of failures on 7/27 and many on 7/29. None on 7/30 so far.
Assignee
Comment 57•12 years ago
Hardware involved
Postgres
socorro1.stage.db.phx1 - https://inventory.mozilla.org/en-US/systems/show/1550/
Processors
socorro-processor1.stage.metrics.phx1 - https://inventory.mozilla.org/en-US/systems/show/3921/
socorro-processor2.stage.metrics.phx1 - https://inventory.mozilla.org/en-US/systems/show/3923/
HBase
+ means disabled node in Zeus
10.8.100.62:9090 - hp-node62.phx1 - https://inventory.mozilla.org/en-US/systems/show/2568/
10.8.100.63:9090 - hp-node63.phx1 - https://inventory.mozilla.org/en-US/systems/show/2569/
+10.8.100.64:9090 - hp-node64.phx1 - https://inventory.mozilla.org/en-US/systems/show/2570/
10.8.100.65:9090 - hp-node65.phx1 - https://inventory.mozilla.org/en-US/systems/show/2571/
10.8.100.66:9090 - hp-node66.phx1 - https://inventory.mozilla.org/en-US/systems/show/2572/
10.8.100.67:9090 - hp-node67.phx1 - https://inventory.mozilla.org/en-US/systems/show/2572/
+10.8.100.68:9090 - hp-node68.phx1 - https://inventory.mozilla.org/en-US/systems/show/2574/
10.8.100.69:9090 - hp-node69.phx1 - https://inventory.mozilla.org/en-US/systems/show/2575/
Zeus
pp-zlb01.phx1 - https://inventory.mozilla.org/en-US/systems/show/1545/
pp-zlb02.phx1 - https://inventory.mozilla.org/en-US/systems/show/1561/
Assignee
Comment 58•12 years ago
The data flows are now
Monitor connections on an ESX VM:
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb01.phx.mozilla.net)
* socorroadm.stage.private.phx1.mozilla.com (monitor) -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb01.phx.mozilla.net)
Processor connections on two HP blades:
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-thrift-stage.zlb.phx1.mozilla.com:9090 (Thrift/HBase) (on pp-zlb01.phx.mozilla.net)
* socorro-processor{1,2}.stage.metrics.phx1.mozilla.com -> socorro-stage-zlb.db.phx1.mozilla.com:5432 (Postgres) (on pp-zlb01.phx.mozilla.net)
Assignee | ||
Comment 59•12 years ago
Tracking changes made so far:
1. jberkus realized the socorro-processor virtual server was using the 5432 backend and not the 6433 backend. I fixed this
[30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor confmod Configuration file modified
[30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor vsreconf Virtual server reconfigured on port 6433
2. cshields found that the virtual server connection timeout for postgres is set to 0 in prod, he made the same changes on stage in Zeus
[30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web confmod Configuration file modified
[30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web vsreconf Virtual server reconfigured on port 6432
[30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor confmod Configuration file modified
[30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor vsreconf Virtual server reconfigured on port 6433
[30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg confmod Configuration file modified
[30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg vsreconf Virtual server reconfigured on port 5432
Tracking additional issues seen so far:
1. HBase nagios alerts
23:04:29 nagios-phx1 | [168] hp-node68.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1
23:06:49 nagios-phx1 | [172] hp-node64.phx1.mozilla.com:Zeus - Port 9090 is CRITICAL: ERROR: Received noSuchName(2) error-status at error-index 1
Neither Zeus nor the socorro processors saw an error. Aj has acked the alerts and we're waiting on feedback from :tmary about the state of the nodes
2. socorro-processor2 connections to postgres interrupted
First from 13:43 - 13:47
2012-07-30 13:43:46,953 ERROR - Thread-1 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:43:46,955 ERROR - Thread-1 - DatabaseError: error with no message from the libpq
2012-07-30 13:44:11,010 ERROR - Thread-2 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:44:11,012 ERROR - Thread-2 - DatabaseError: error with no message from the libpq
2012-07-30 13:44:47,088 ERROR - Thread-8 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:44:47,089 ERROR - Thread-8 - DatabaseError: error with no message from the libpq
2012-07-30 13:45:29,180 ERROR - Thread-11 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:45:29,182 ERROR - Thread-11 - DatabaseError: error with no message from the libpq
2012-07-30 13:47:35,453 ERROR - Thread-7 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 13:47:35,455 ERROR - Thread-7 - DatabaseError: error with no message from the libpq
and then at 15:07, five minutes after the timeout config change, so it could be related to that:
2012-07-30 15:07:23,217 ERROR - Thread-9 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,219 ERROR - Thread-9 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,222 ERROR - Thread-1 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,224 ERROR - Thread-1 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,227 ERROR - Thread-10 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,228 ERROR - Thread-10 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,231 ERROR - Thread-3 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,233 ERROR - Thread-3 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,236 ERROR - Thread-4 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,238 ERROR - Thread-4 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,241 ERROR - Thread-2 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,242 ERROR - Thread-2 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,245 ERROR - Thread-8 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,247 ERROR - Thread-8 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,249 ERROR - Thread-7 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,250 ERROR - Thread-7 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,254 ERROR - Thread-11 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,255 ERROR - Thread-11 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,258 ERROR - Thread-5 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,259 ERROR - Thread-5 - DatabaseError: error with no message from the libpq
2012-07-30 15:07:23,262 ERROR - Thread-6 - Caught Error: <class 'psycopg2.DatabaseError'>
2012-07-30 15:07:23,264 ERROR - Thread-6 - DatabaseError: error with no message from the libpq
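The burst of identical "error with no message from the libpq" lines is what psycopg2 raises when the socket underneath an open connection is silently dropped (e.g. by a load balancer timeout); the only recovery is to discard the handle and reconnect. A minimal sketch of a reconnect-and-retry wrapper for that failure mode — illustrative only, not Socorro's actual processor code, with `TransientDBError` standing in for `psycopg2.DatabaseError`:

```python
import time


class TransientDBError(Exception):
    """Stand-in for errors like psycopg2.DatabaseError ('error with no
    message from the libpq'), which typically mean the TCP connection
    to the backend was dropped out from under us."""


def with_reconnect(operation, connect, retries=3, backoff=0.0):
    """Run operation(conn); on a transient error, reconnect and retry.

    `connect` is a zero-argument factory returning a fresh connection.
    Retries use exponential backoff; the last failure is re-raised.
    """
    conn = connect()
    for attempt in range(retries):
        try:
            return operation(conn)
        except TransientDBError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
            conn = connect()  # drop the dead handle, open a fresh one
```

With a wrapper like this, a brief VIP blip costs a reconnect rather than a dead worker thread.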
Assignee
Comment 60•12 years ago
Summary of today: Nagios checks were added, a couple more Zeus settings (specifically the Postgres connection timeouts) were set on stage to match prod, and a push to stage and prod was done (unrelated to this issue, but a change nonetheless).
Per IRC with cshields, lars, and jberkus, we'll monitor through tomorrow to see how things behave with the Zeus changes.
Comment 61•12 years ago
There have been no Postgres or HBase/Thrift connection problems overnight since Monday. This is very encouraging.
Severity: blocker → normal
Comment 62•12 years ago
The earlier change of severity from blocker to normal was inadvertent.
Severity: normal → blocker
Comment 63•12 years ago
Sorry to report that soon after my comment #61, the staging processors lost their connection to HBase:
2012-07-31 06:12:22,627 CRITICAL - Thread-11 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2012-07-31 06:12:22,628 CRITICAL - Thread-11 - the connection is not viable. retries fail:
This lasted until about 07:24, when the processors recovered.
Severity: blocker → normal
Assignee
Comment 64•12 years ago
This was seen on processor1, processor2, and socorroadm.stage, though more times on the processors
processor1
2012-07-31 07:12:51,461 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:14,044 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:14,313 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:37,876 CRITICAL - Thread-10 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:51,143 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:51,369 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-31 07:24:13,510 CRITICAL - Thread-11 - something's gone horribly wrong with the HBase connection
2012-07-31 07:24:50,623 CRITICAL - Thread-11 - something's gone horribly wrong with the HBase connection
processor2
2012-07-31 07:12:48,702 CRITICAL - Thread-5 - something's gone horribly wrong with the HBase connection
2012-07-31 07:12:53,481 CRITICAL - Thread-2 - something's gone horribly wrong with the HBase connection
2012-07-31 07:14:26,559 CRITICAL - Thread-7 - something's gone horribly wrong with the HBase connection
2012-07-31 07:17:04,527 CRITICAL - Thread-8 - something's gone horribly wrong with the HBase connection
2012-07-31 07:19:05,589 CRITICAL - Thread-6 - something's gone horribly wrong with the HBase connection
2012-07-31 07:22:28,071 CRITICAL - Thread-3 - something's gone horribly wrong with the HBase connection
2012-07-31 07:23:05,122 CRITICAL - Thread-3 - something's gone horribly wrong with the HBase connection
socorroadm.stage
2012-07-31 07:14:41,999 DEBUG - MainThread - somethings gone horribly wrong with HBase
2012-07-31 07:14:53,071 DEBUG - MainThread - somethings gone horribly wrong with HBase
Nagios did not alert on anything for socorro stage today.
Zeus did not see any errors; the last thing it sees is the config changes from yesterday:
[30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web confmod Configuration file modified
[30/Jul/2012:15:02:03 -0700] INFO vservers/db-socorro-stage-web vsreconf Virtual server reconfigured on port 6432
[30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor confmod Configuration file modified
[30/Jul/2012:15:02:04 -0700] INFO vservers/db-socorro-stage-processor vsreconf Virtual server reconfigured on port 6433
[30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg confmod Configuration file modified
[30/Jul/2012:15:02:05 -0700] INFO vservers/db-socorro-stage-pg vsreconf Virtual server reconfigured on port 5432
Also, we have not seen any Postgres errors since the changes yesterday.
I'm going to get tcpdumps running on the active HBase nodes, the processors, and socorroadm.stage
Assignee
Comment 65•12 years ago
Per :tmary on IRC, I enabled hp-node64 and hp-node68, so far they look fine to Zeus
[31/Jul/2012:10:18:28 -0700] INFO pools/socorro-thrift-stage:9090 confmod Configuration file modified
Assignee
Comment 66•12 years ago
Updates on what I've done today.
I started tcpdumps at 11:30am, as outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=778831#c2 (each running under screen, as root)
I've been monitoring the logs with the following commands
pp-zlb01 - grep 2012 /usr/local/zeus/zxtm/log/errors | grep socorro
socorroadm.stage - grep DatabaseError /var/log/socorro/socorro-monitor.log and grep HBase /var/log/socorro/socorro-monitor.log
socorro-processor[1-2].stage - grep HBase /var/log/socorro/socorro-processor.log and grep DatabaseError /var/log/socorro/socorro-processor.log
Assignee
Comment 67•12 years ago
… I've been monitoring the logs and have not seen any new issues since the errors lars and I noted between 6-7:30am.
If we see further errors, working with NetOps to review the tcpdumps is the next best step.
I'll be out on PTO tomorrow, until 8/8, so if there is anything I was doing that anyone is unsure how to do themselves, please let me know tonight; I'll check bugmail before I take off for the airport tomorrow.
Comment 68•12 years ago
On Wednesday, August 1, the processors lost their connection to HBase at 6:10am PDT; the problem remains ongoing at 6:56am PDT.
2012-08-01 06:10:12,203 CRITICAL - Thread-11 - the connection is not viable. retries fail:
Interestingly, the monitor has not lost its connection to HBase and continues blithely along assigning jobs to processors that cannot do their work.
Severity: normal → blocker
Comment 69•12 years ago
There is a consistent timing to the HBase outages.
Since July 3rd, the outages all begin within 10 minutes of 6am.
Prior to that:
July 3 - 7:30am and 10:00am
July 2 - 12:50am
July 1 - 3:00am
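The pattern described above can be made visible by bucketing the failure timestamps by hour of day — a recurring scheduled job (like the prod -> stage refresh) shows up as a spike in one bucket. A quick illustrative script; the timestamps below are a hypothetical sample in the same format as the log lines quoted in this bug, not a full extract:

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of failure timestamps pulled from processor logs.
failures = [
    "2012-07-31 06:12:22", "2012-08-01 06:10:12", "2012-08-02 06:02:24",
    "2012-07-03 07:30:00", "2012-07-02 00:50:00", "2012-07-01 03:00:00",
]

# Count failures per hour of day; a daily scheduled job produces a
# spike in a single bucket.
by_hour = Counter(
    datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in failures
)

# Print a crude histogram, one row per hour that saw failures.
for hour, n in sorted(by_hour.items()):
    print(f"{hour:02d}:00  {'#' * n}")
```

With real data, the 06:00 bucket dominating would point straight at whatever runs at that time.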
Comment 70•12 years ago
This debugging is using up all the disk space on socorroadm.stage.private.phx1 with tcpdump logs:
19G /root/778831
That job can't keep running and storing those logs locally. I'll compress the old ones to buy some time, but it'll still need to be killed or redone to copy things off-machine in the next 24h, and solarce is on PTO. If no one else has any suggestions for where to copy that data, I'm going to need to kill the tcpdump job.
(3.1 GB of logs in /var/log/socorro, much of it in /var/log/socorro/syslog, isn't helping, but that's small compared to the tcpdumps.)
Comment 71•12 years ago
Right on schedule, the HBase connection from the staging processors and monitor failed at 6:02am PDT. We don't know why, but we can predict the time of failure. The error message is the same as all previous times:
2012-08-02 06:02:24,807 CRITICAL - MainThread - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2012-08-02 06:02:24,807 CRITICAL - MainThread - the connection is not viable. retries fail:
Comment 72•12 years ago
(In reply to K Lars Lohn [:lars] [:klohn] from comment #71)
> right on schedule, the HBase connection to staging processors and monitor
> failed at 6:02am PDT. So we don't know why, but we can predict the time of
> failure. The error message is the same as all previous times:
Since we can predict the failures, could we not set up a cron job to engage and disengage the tcpdump around the time at which the behaviour is expected to occur ? (re: comment #70)
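One way to sketch that — a hypothetical root crontab on a capture host, not something deployed in this bug; the interface name and filenames are placeholders, with the capture directory matching the /root/778831 path mentioned in comment 70. tcpdump's -C/-W flags rotate and cap the capture files, which also addresses the disk-space problem:

```shell
# m h dom mon dow  command
# Start a size-rotated capture at 05:50, just before the ~6am failure
# window: -C 100 rolls the file every ~100 MB and -W 10 keeps at most
# 10 files, capping disk use at roughly 1 GB per run.
50 5 * * * /usr/sbin/tcpdump -i eth0 -s 0 -C 100 -W 10 -w /root/778831/hbase-6am.pcap port 9090 >/dev/null 2>&1
# Stop the capture at 08:00, after the failure window has passed.
0 8 * * * /usr/bin/pkill -f hbase-6am.pcap
```

This trades continuous coverage for a bounded, predictable amount of capture data around the window where the failure is expected.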
Reporter
Comment 73•12 years ago
Update: we now know there are three separate causes of failure, and we have a plan.
1. PG connection timeouts. When these were moved to a new VIP, it was noticed that the timeout value was set too low (300ms) and different from prod. This has been fixed.
2. PG connection timeouts on a Sunday around noon. This is due to the data being refreshed from prod -> stage. Being on a Sunday, this is the least critical of the three, but I would like to address it if possible. That is in the hands of mpressman.
3. HBase connection timeouts each day at around 6am. This is due to the data being refreshed from prod->stage. This is on tmary to fix.
Comment 74•12 years ago
:laura - if these three failure modes are now known, do you still need the tcpdump data to diagnose or can I kill the process before it fills up disk again?
Reporter
Comment 75•12 years ago
pir: I think you can kill it for now at least. Thanks!
Comment 76•12 years ago
laura: killed, dumps compressed. There's enough slack space after compression that the dumps don't need to be deleted immediately, but if we're sure they're not required I'll delete them too.
Comment 77•12 years ago
> 3. HBase connection timeouts each day at around 6am. This is due to the
> data being refreshed from prod->stage. This is on tmary to fix.
No of simultaneous mappers has been reduced to 8 - during tests with reduced number of mappers, there were no errors in logs
--
Comment 78•12 years ago
The export of ADI data to Socorro staging Postgres failed again last night. Are there still known outstanding problems to be resolved with the database?
Reporter
Comment 79•12 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #78)
> The export of ADI data to Socorro staging postgres failed again last night.
> are there still known outstanding problems to be resolved with the database?
I didn't think so, but apparently I was wrong.
cshields, solarce: can somebody take a look at what happened with Postgres?
dre: tmary: did the Hbase adjustments get us through the 6am work without problems this morning?
Comment 80•12 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #78)
> The export of ADI data to Socorro staging postgres failed again last night.
> are there still known outstanding problems to be resolved with the database?
I need exact times and source/dest IPs/ports for this please... (or point me in the right direction to RTFM)
Comment 81•12 years ago
(In reply to Mark Reid from comment #46)
> The metrics ADU export has been failing on Socorro Stage every night, and
> seems related to the ongoing problems. For example:
>
> ERROR 25-07 20:49:36,124 An I/O error occured while sending to the backend.
>
> Not a very informative error message, but at least it gives the time.
>
> I'm connecting to socorro-stage-zlb.db.phx1.mozilla.com, and running the job
> manually worked when I tried it earlier this week (24-07 11:38:57,118 to be
> specific). Please let me know if I can provide any other information.
These errors are still happening each night. Here is the latest log message:
ERROR 02-08 20:46:00,788 - Delete adu records - An error occurred, processing will be stopped:
Couldn't execute SQL: DELETE FROM raw_adu WHERE date = '2012-08-02'
An I/O error occured while sending to the backend.
Comment 82•12 years ago
(In reply to Corey Shields [:cshields] from comment #80)
> I need exact times and source/dest IPs/ports for this please... (or point
> me in the right direction to RTFM)
Time in my previous comment.
Source: etl1.metrics.scl3.mozilla.com
Dest: socorro-stage-zlb.db.phx1.mozilla.com
Port: 5432
User: breakpad_metrics
Comment 83•12 years ago
The HBase trouble was not resolved. I see at least 50 HBase connection retry failures in the processors between 6am and 7:30am PDT.
Comment 84•12 years ago
Daniel,
It's more likely that our changes to the network require updates to your configuration. Please send:
a) what time the ADI update runs on staging
b) what IP address (or URI) you're hitting
c) username/password (via email).
Thanks!
Comment 85•12 years ago
Josh,
All the information you are requesting is in comment 81 and comment 82 above, with the exception of the password. We can e-mail you the password we are using, but given that the job completes when run manually, I don't think this is likely to be a password issue.
Comment 86•12 years ago
Daniel,
Yes, Mreid found me on IRC. The password is verified correct.
He says the job runs at 20:46 Pacific, +/- 5min.
Reporter
Comment 87•12 years ago
(In reply to T [:tmary] Meyarivan from comment #77)
> > 3. HBase connection timeouts each day at around 6am. This is due to the
> > data being refreshed from prod->stage. This is on tmary to fix.
>
> No of simultaneous mappers has been reduced to 8 - during tests with reduced
> number of mappers, there were no errors in logs
>
> --
Following up: reducing the mappers from 8 to 1 solves the problem, but it also means the refresh will take 8-10 hours each day to complete. I've asked tmary to spec out adding some more nodes and file a bug for hardware. It may be complicated since he says there is no space in the current set of switches (for hp-nodeXX servers) and rewiring will be required. He also said once prod is upgraded we may be able to borrow nodes from sec-phx1. Daniel, can you comment on that? I'm not sure I want to borrow from prod.
Reporter
Comment 88•12 years ago
With regard to Postgres/ADI, I suggest we watch the network during the import and try to see what's causing the failure. Need an ops/pg volunteer - maybe both.
Comment 89•12 years ago
(In reply to Laura Thomson :laura from comment #87)
> asked tmary to spec out adding some more nodes and file a bug for hardware.
> It may be complicated since he says there is no space in the current set of
> switches (for hp-nodeXX servers) and rewiring will be required.
Let me know if this problem can be solved with additional space. DC Ops has additional capacity coming online soon in phx1, we can make accommodations for you.
Comment 90•12 years ago
I was out on Friday, Aug 3, but on returning, I find new timing to the HBase failures in the staging processor logs.
On Aug 3, there are periodic CRITICAL log entries throughout the day. On Aug 4, HBase connection failures begin at 02:16 and last for three minutes. The problem recurs at 02:50 and then, in brief bursts, continues every fifteen to sixty minutes through the night. It is 07:40 now; the last failure I see was at 07:19.
I will continue to watch periodically through the weekend.
Comment 91•12 years ago
On Monday morning, processor2 continues to have connectivity issues with HBase. Processor1 is having no trouble at all. We're not out of the woods on this issue.
Reporter
Comment 92•12 years ago
Another update: ADIs did not get processed again today, which is a PG/network issue.
Comment 93•12 years ago
(In reply to Laura Thomson :laura from comment #92)
> Another update: ADIs did not get processed again today, which is a
> PG/network issue.
I don't know what that means.
Comment 94•12 years ago
Re: Postgres & ADI - the export succeeded last night. Looks like it was failing due to a bug on our end. I'll keep an eye on it.
Reporter
Comment 95•12 years ago
Summary of the four issues found (and some fixed):
- (FIXED) PostgreSQL connections failing; fixed by changing the virtual server connection timeout for Postgres to 0, the same as in prod (comment #59)
- (FIXED) HBase refresh used up too many resources, making HBase drop connections. Fixed by reducing the number of mappers from 8 to 1 (comment #77). Now the refresh takes 8 hours, so we should add capacity - bug to be filed.
- (FIXED) ADI import was timing out due to a bug on the Metrics end.
- (WONTFIX) PostgreSQL connections timeout during data refresh, starting Sundays at noon PT. We can live with this.
It took us a long time to converge on these because, since the problems all began around the same time, we assumed a single root cause. We'll try not to make that (understandable) mistake again.
I'm going to call this good, and ask solarce to close the bug. Thanks very much to all ops, netops, metrics, DBAs, and Socorro team who worked on getting us stable.
Assignee
Comment 96•12 years ago
Marking as resolved. Thanks so much to everyone who helped dig into and resolve these issues.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee
Updated•12 years ago
Whiteboard: [testing VIPs on internal cluster]
Comment 97•12 years ago
I'm glad we were able to get this resolved. Awesome work, everyone!
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard