Closed Bug 1387407 Opened 7 years ago Closed 7 years ago

hg.mozilla.org pushlog for Try isn't showing pushes for recent commits

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Unassigned)

References

Details

inbound and autoland seem to be fine, but Try hasn't shown any new jobs for commits pushed in the last 100 minutes, which one can see at https://hg.mozilla.org/try/shortlog
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Priority: -- → P1
On New Relic:
https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors#/show/a2ee18a3-7911-11e7-931c-0242ac110012_8314_12387/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=rinpuv

File "/app/treeherder/etl/tasks/pulse_tasks.py", line 30, in store_pulse_resultsets
File "/app/treeherder/etl/resultset_loader.py", line 42, in process
File "/app/treeherder/etl/resultset_loader.py", line 237, in transform
File "/app/treeherder/etl/resultset_loader.py", line 244, in fetch_resultset
exceptions:IndexError: list index out of range

Corresponds to:
https://github.com/mozilla/treeherder/blob/3648340ab4155a24747a68ddde3f5b8a0eb24f99/treeherder/etl/resultset_loader.py#L244
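
For context, a hedged reconstruction of how that line can fail when the pushlog has no entry for a push (the function and payload shape here are illustrative, not the actual Treeherder source):

import requests

def fetch_resultset(push_full_json_url):
    """Illustrative sketch: fetch pushlog data for a single push.

    If hg.mozilla.org has no pushlog entry for the push, the "pushes"
    dict in the version=2 json-pushes payload comes back empty, and
    indexing into its keys raises IndexError: list index out of range,
    matching the New Relic traceback above.
    """
    data = requests.get(push_full_json_url, timeout=30).json()
    # Expected shape: {"pushes": {"210595": {...}}, "lastpushid": N}
    push_id = list(data["pushes"].keys())[0]   # IndexError when empty
    return data["pushes"][push_id]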

It looks like the hg.mozilla.org pushlog is broken. The most recent entries here are from 2 hours ago:
https://hg.mozilla.org/try/pushloghtml
https://hg.mozilla.org/try/json-pushes

...whereas the most recent commits are from minutes ago:
https://hg.mozilla.org/try/shortlog

This is the last push that the API returns:
https://hg.mozilla.org/try/json-pushes?version=2&full=1&startID=210594&endID=210595

The next push ID (210596) doesn't exist:
https://hg.mozilla.org/try/json-pushes?version=2&full=1&startID=210595&endID=210596

Note also that when visiting a specific recent commit, the "push id" and related fields all show as "unknown":
https://hg.mozilla.org/try/rev/9b6f166ff58b
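
For reference, a small hedged sketch of the kind of probe used above: given a push ID range, it asks the version=2 json-pushes API which IDs are actually present (the repo URL and IDs are just the examples from this bug):

import requests

REPO = "https://hg.mozilla.org/try"

def missing_push_ids(start_id, end_id):
    """Return push IDs in (start_id, end_id] absent from json-pushes."""
    url = "%s/json-pushes?version=2&startID=%d&endID=%d" % (REPO, start_id, end_id)
    pushes = requests.get(url, timeout=30).json().get("pushes", {})
    expected = set(range(start_id + 1, end_id + 1))
    return sorted(expected - {int(pid) for pid in pushes})

# At the time of this bug, this reported [210596]: the next push ID was
# missing even though the corresponding commits were in the repo.
print(missing_push_ids(210595, 210596))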
Assignee: emorley → nobody
Status: ASSIGNED → NEW
Component: Treeherder: Data Ingestion → Mercurial: hg.mozilla.org
Product: Tree Management → Developer Services
Version: --- → unspecified
Summary: New jobs/pushes don't show up on Try → hg.mozilla.org pushlog for Try isn't showing pushes for recent commits
For "maybe-related" information, Mercurial 4.2 was just deployed half a day ago, in https://bugzilla.mozilla.org/show_bug.cgi?id=1385979
Attempted to run replicatesync to make sure there weren't any issues with replication, and it's not happy with try:

13:32 hgssh4.dmz.scl3#/var/hg/venv_tools/bin/hg replicatesync
** unknown exception encountered, please report by visiting
** https://mercurial-scm.org/wiki/BugTracker
** Python 2.7.5 (default, Nov  6 2016, 00:28:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]
** Mercurial Distributed SCM (version 4.2.2)
** Extensions loaded: blackbox, clonebundles, obsolescencehacks, pushlog, serverlog, readonly, vcsreplicator
Traceback (most recent call last):
  File "/var/hg/venv_tools/bin/hg", line 45, in <module>
    mercurial.dispatch.run()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 83, in run
    status = (dispatch(req) or 0) & 255
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 164, in dispatch
    ret = _runcatch(req)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 295, in _runcatch
    return _callcatch(ui, _runcatchfunc)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 303, in _callcatch
    return scmutil.callcatch(ui, func)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/scmutil.py", line 146, in callcatch
    return func()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 285, in _runcatchfunc
    return _dispatch(req)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 912, in _dispatch
    cmdpats, cmdoptions)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 648, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 920, in _runcommand
    return cmdfunc()
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/dispatch.py", line 909, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **strcmdopt)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/mercurial/util.py", line 1080, in check
    return func(*args, **kwargs)
  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 317, in replicatecommand
    sendreposyncmessage(ui, repo)
  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 224, in sendreposyncmessage
    partition=repo.replicationpartition)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 128, in record_hg_repo_sync
    }, partition=partition)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/vcsreplicator/producer.py", line 49, in send_message
    self.topic, partition, msg)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 349, in send_messages
    return self._send_messages(topic, partition, *msg)
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/producer/base.py", line 390, in _send_messages
    fail_on_error=self.sync_fail_on_error
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 480, in send_produce_request
    (not fail_on_error or not self._raise_on_response_error(resp))]
  File "/var/hg/venv_tools/lib/python2.7/site-packages/kafka/client.py", line 247, in _raise_on_response_error
    raise resp
kafka.common.FailedPayloadsError


and then kafka gets sad and needs restarting. I don't know that this is actually related to the problem at hand, however.
Flags: needinfo?(klibby)
ok, looks like replication of pushlog data has stalled:

hgssh4.dmz.scl3# ls -al /repo/hg/mozilla/try/.hg/pushlog2.db
-rw-rw-r-- 1 hg scm_level_1 159708160 Aug  4 13:57 /repo/hg/mozilla/try/.hg/pushlog2.db

hgweb11.dmz.scl3# ls -al /repo/hg/mozilla/try/.hg/pushlog2.db
-rw-rw-r-- 1 hg hg 159698944 Aug  4 10:32 /repo/hg/mozilla/try/.hg/pushlog2.db
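
A hedged sketch of an equivalent check that reads the highest push ID directly out of both SQLite files (the `pushlog` table and `id` column names are assumptions about the pushlog2.db schema; the mirror copy would have to be fetched locally first):

import sqlite3

# Master path from the listing above; the mirror path is a local copy
# fetched from hgweb11 for comparison (hypothetical location).
MASTER_DB = "/repo/hg/mozilla/try/.hg/pushlog2.db"
MIRROR_DB = "/tmp/pushlog2.db.hgweb11"

def last_push_id(path):
    # Assumes a `pushlog` table with an integer `id` primary key.
    with sqlite3.connect(path) as conn:
        (max_id,) = conn.execute("SELECT MAX(id) FROM pushlog").fetchone()
    return max_id

print("master:", last_push_id(MASTER_DB))
print("mirror:", last_push_id(MIRROR_DB))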
Aug 03 07:42:45 hgssh4.dmz.scl3.mozilla.com java[1054]: [2017-08-03 07:42:45,397] ERROR [Replica Manager on Broker 11]: Error when processing fetch request for partition [pushdata,4] offset 4012838 from consumer with correlation id 18. Possible cause: Request for offset 4012838 but we only have log segments in the range 1972025 to 2040814. (kafka.server.ReplicaManager)
(In reply to Kendall Libby [:fubar] from comment #5)
> Aug 03 07:42:45 hgssh4.dmz.scl3.mozilla.com java[1054]: [2017-08-03
> 07:42:45,397] ERROR [Replica Manager on Broker 11]: Error when processing
> fetch request for partition [pushdata,4] offset 4012838 from consumer with
> correlation id 18. Possible cause: Request for offset 4012838 but we only
> have log segments in the range 1972025 to 2040814.
> (kafka.server.ReplicaManager)

Or that's a red herring, since it predates the point where pushlog data stops on the clients.
Flags: needinfo?(gps)
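For completeness, a hedged sketch of how one could check whether a consumer's offset has fallen outside the broker's retained range for try's partition, using kafka-python (the topic and partition are taken from the log line above; the broker address and group id are placeholders, not production values):

from kafka import KafkaConsumer, TopicPartition

tp = TopicPartition("pushdata", 4)   # try's partition, per the log line
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="hgweb-replication",   # placeholder
                         enable_auto_commit=False)
consumer.assign([tp])

earliest = consumer.beginning_offsets([tp])[tp]
latest = consumer.end_offsets([tp])[tp]
committed = consumer.committed(tp)   # None if nothing committed

print("retained: %d..%d, committed: %s" % (earliest, latest, committed))
if committed is not None and not (earliest <= committed <= latest):
    print("consumer offset outside retained range (OffsetOutOfRange)")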
:gps suggested bouncing kafka on all of the nodes, but it hasn't helped (and replicatesync still throws the above error). he'll be online in < 30min to take a look
The issue is now resolved. I'm still trying to piece together what happened.

I do know there was a failure writing to the pushlog for one autoland push. There are a few changesets in the repo with no pushlog entry. This almost certainly confused Treeherder. It may have confused aspects of the replication mechanism.
Timeline of events:

09:40:59 VYV03354@nifty.ne.jp connects via ssh
09:41:11 VYV03354@nifty.ne.jp transaction starts
09:41:18 bind-autoland@mozilla.com connects via ssh
09:41:19 VYV03354@nifty.ne.jp transaction to add 4 changesets with head d8ded36356b0 closes
09:41:19 hgweb mirrors start connecting to replicate changes
09:41:21 hgweb mirrors start fetching bundles
09:41:23 bind-autoland@mozilla.com transaction starts
09:41:25 VYV03354@nifty.ne.jp disconnects
09:41:28 hgweb mirrors start fetching pushlog data
09:41:30 hgweb11 confirms replication of VYV03354@nifty.ne.jp's push
09:41:31 bind-autoland@mozilla.com transaction starts to close
09:41:37 bind-autoland@mozilla.com transaction closes; added 14 changesets with head 82464b3a6ebd
09:41:37 hgweb mirrors start fetching bind-autoland's transaction
09:41:37 bind-autoland@mozilla.com disconnects
09:41:50 hgweb11 confirms replication of autoland's push
09:43:03 mikokm@gmail.com connects via ssh, submits a push with 4 changesets and head 5b4ed2075b1c and hgssh appears happy
09:43:54 hgweb11 confirms replication of mikokm's push
...
XX:XX:XX various other pushes to try complete successfully
...
12:19:XX this bug filed
13:06:XX emorley makes noise in #vcs
13:24:XX fubar answers ping
13:27:47 fubar runs `hg replicatesync`
13:27:57 command fails with FailedPayloadsError attempting to write to Kafka
13:28:12 pid 23830 fails with RequestTimedOutError writing to Kafka (not sure what process this is)
13:28:20 first Nagios check fails due to failing to write heartbeat test message to partition 4 (try's partition)
13:29:21 first nagios alert in #vcs
13:29:51 jkingston@mozilla.com establishes ssh connection; push experiences problems
13:31:24 kafka restarted on hgssh4
XX:XX:XX kafka bounced on other machines and again on hgssh4; other attempts to run `hg replicatesync`
...
16:XX:XX gps runs `hg pull` on hgweb machines; lots of missing changesets pulled in
16:XX:XX gps also runs `hg replicatesync`, manages to make Kafka unhappy

The server log for the autoland push includes this traceback:

  File "/var/hg/version-control-tools/pylib/vcsreplicator/vcsreplicator/hgext.py", line 359, in wireprotodispatch
    return orig(repo, proto, command)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/wireproto.py", line 570, in dispatch
    return func(repo, proto, *args)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/wireproto.py", line 982, in unbundle
    proto._client())
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/exchange.py", line 1777, in unbundle
    lockandtr[2].close()
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/transaction.py", line 43, in _active
    return func(self, *args, **kwds)
  File "/var/hg/venv_pash/lib/python2.7/site-packages/mercurial/transaction.py", line 490, in close
    self._postclosecallback[cat](self)
  File "/var/hg/version-control-tools/hgext/pushlog/__init__.py", line 245, in onpostclose
    conn.commit()
OperationalError: database is locked

Essentially, SQLite raised an exception when attempting to close the transaction that adds the pushlog entry. The Mercurial transaction had already completed. The changeset data was there, but the pushlog data was never committed.
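
A hedged sketch of that ordering (the callback registration mirrors the onpostclose frame in the traceback, but the body and table names here are simplified assumptions, not the real pushlog code):

def record_push(mercurial_tr, pushlog_conn, pushid, nodes):
    # Pushlog rows are written while Mercurial's transaction is open,
    # but the SQLite COMMIT runs from a post-close callback, i.e. only
    # after the Mercurial transaction has already closed.  If that
    # COMMIT fails with "database is locked", the changesets remain in
    # the repo while the pushlog rows are lost.
    for node in nodes:
        pushlog_conn.execute(
            "INSERT INTO changesets (pushid, node) VALUES (?, ?)",
            (pushid, node))

    def onpostclose(tr):
        pushlog_conn.commit()   # raised sqlite3.OperationalError here

    mercurial_tr.addpostclose("pushlog", onpostclose)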

So, the summary of events is:

1. An error writing to the pushlog meant we recorded Mercurial data without a pushlog entry for head 82464b3a6ebd and the 13 preceding changesets
2. Treeherder started choking because of the missing pushlog data
3. Upon suspicion that things were wedged, `hg replicatesync` was executed. This overwhelmed and wedged Kafka.
4. Heartbeat checks and pushes stopped working completely
5. Kafka was bounced
6. 3-5 were repeated a few times
7. Eventually, `hg pull` was run on the hgweb machines to obtain the missing changesets, and everything was restored

I see the following potential follow-up actions:

1. Fix pushlog so this database lock issue doesn't happen (I suspect we're not using SQLite optimally and are running into a race between readers and a writer)
2. Treeherder could potentially react more robustly when pushlog data is missing (although failing fast has its merits, and I'd like to be able to say that hg.mo guarantees pushlog data exists)
3. Figure out what is causing Kafka to get wedged during `hg replicatesync` and prevent that from happening (I think it chokes on a very large message)
Flags: needinfo?(gps)
(In reply to Gregory Szorc [:gps] from comment #9)
> 2. Treeherder to potentially react more robustly if pushlog data is missing
> (although fail fast applies and I'd like to say that hg.mo can guarantee
> pushlog data existence)

I could be wrong, but I don't think there was any issue on the Treeherder side. 

Each push received over Pulse is ingested separately, and the `push_full_json_url` property is used to determine which json-pushes URL to retrieve. However, none of the more recent pushes existed in the pushlog - not just those 13. For example, the 9b6f166ff58b commit mentioned in comment 1 didn't have a pushlog entry, and that was ~25 pushes after the last good push at the time of the incident.
Regarding the interaction between Mercurial and SQLite, the architectural problem is that we don't have atomic transactions. Mercurial has its transaction. SQLite has its transaction. A failure to close SQLite's transaction doesn't roll back Mercurial's transaction.

Way back in bug 966545 I improved the pushlog code so that it is at least scoped to Mercurial's transaction. Before that, if you aborted a push, we would orphan data in the SQLite database and things would get confused. For several years, we told people not to ^C pushes because it could corrupt repos. After that change, we roll back the SQLite transaction if Mercurial aborts its transaction.

Transactions are fundamentally a hard problem. More so when you try to link discrete software components to the same logical transaction.

The thing making it difficult in this case is there is no easy way to "undo" the last committed transaction in Mercurial and SQLite. If you have discrete transactions, you need a way to undo the last transaction if a subsequent transaction fails in order to make them behave as an atomic unit. Lacking a mechanism to do this, we have to hope and pray or work outside the box.

Up to this day, we had somehow never seen (or at least never recorded an instance of) the SQLite transaction failing after the Mercurial transaction closed. So the "hope and pray" solution has worked. But now that it has failed, we should probably do something. Ideas I see:

1) Create a copy of the SQLite database on transaction open. Save SQLite transaction before Mercurial's. Use Mercurial's built-in restore-file-on-rollback/copy-file-on-close primitives to lean on file copies instead of SQLite transactions to achieve atomic transactions. (There will likely be a perf hit and the change is quite hacky.)
2) Improve usage of SQLite so we don't have database locks and/or retry harder when a lock is observed. This is still "hope and pray" but it might be good enough.
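
A hedged sketch of what option 2's "retry harder" could look like, wrapping the commit in a bounded retry when SQLite reports the database as busy (the retry count and delay are arbitrary illustration values):

import sqlite3
import time

def commit_with_retry(conn, attempts=5, delay=0.5):
    # "database is locked" surfaces as sqlite3.OperationalError; any
    # other error is re-raised immediately.
    for attempt in range(attempts):
        try:
            conn.commit()
            return
        except sqlite3.OperationalError as e:
            if "database is locked" not in str(e) or attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))   # simple linear backoff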
(In reply to Ed Morley [:emorley] from comment #10)
> Each push received over Pulse is ingested separately, and the
> `push_full_json_url` property used to determine what jsonpushlog URL to
> retrieve. However none of the more recent pushes existed in the pushlog -
> not just those 13. For example, the 9b6f166ff58b commit mentioned in comment
> 1 didn't have a pushlog entry, and that was ~25 pushes after the last good
> push at the time of the incident.

OK. Then it appears the hgweb machines didn't replicate pushlog entries until I ran a manual `hg pull`. We don't log the raw command output when running some replication commands, so I can't confirm this, but it is a plausible explanation. Additional follow-ups would be:

* Log output from replication commands to facilitate forensics
* Make pushlog data mirroring more robust to missing pushlog data
Depends on: 1387645
The pushlog SQLite databases don't use WAL, which means there is contention between readers and writers. Unfortunately, I don't think we can use WAL because the pushlog on hgssh is hosted on NFS. The WAL index is in shared memory, which means it won't work across machines.

While we typically only have one master, a shared memory index for SQLite feels brittle to me. I don't want to be dealing with issues from multiple machines attempting to access the SQLite database and touching the index.

You can disable the use of shared memory with the WAL index. https://sqlite.org/wal.html#noshm. But this requires locking_mode=EXCLUSIVE, which prevents concurrent readers. We need concurrent reads for replication requests.
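
For concreteness, the no-shared-memory WAL configuration from that SQLite page boils down to the following pragmas, and it's the EXCLUSIVE locking requirement that rules it out here (illustrative sketch only):

import sqlite3

conn = sqlite3.connect("pushlog2.db")
# Per https://sqlite.org/wal.html#noshm, WAL without the shared-memory
# index requires exclusive locking, which blocks the concurrent readers
# that replication requests depend on.
conn.execute("PRAGMA locking_mode = EXCLUSIVE")
conn.execute("PRAGMA journal_mode = WAL")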

So I think the WAL is out.

Refreshing my brain on transaction semantics (https://www.sqlite.org/lang_transaction.html#immediate), we /could/ move to an IMMEDIATE transaction from deferred. This will obtain a RESERVED lock immediately, thus locking out other writers. However, a RESERVED lock will also lock out other readers. That's not desirable because it will prevent replication.

I think the best we can do is retry harder when the COMMIT fails.

That being said, if we did actually see SQLITE_LOCKED (https://www.sqlite.org/rescode.html#locked) from Python, that means the lock is coming from within the Python process rather than from a reader. Perhaps we are inadvertently opening multiple SQLite connections from the same Python process? Due to the way the pushlog object is cached on the Mercurial repo, this is certainly possible...
According to the SQLite source code "database is locked" corresponds to SQLITE_BUSY. So this is likely a reader locking out a writer.

We should increase the busy timeout when we commit the transaction to maximize our chances for success.
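
That is essentially what the change pushed later in this bug does ("pushlog: increase busy_timeout"); a minimal sketch of the mechanism, with an illustrative timeout value rather than the deployed one:

import sqlite3

conn = sqlite3.connect("/repo/hg/mozilla/try/.hg/pushlog2.db")
# With a busy timeout set, SQLite blocks and retries internally for up
# to this many milliseconds before returning SQLITE_BUSY ("database is
# locked"), instead of failing the COMMIT immediately.
conn.execute("PRAGMA busy_timeout = 30000")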
This seems to have happened again today: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9fd674cc9b840320a1f39421141863058c74e45f and https://treeherder.mozilla.org/#/jobs?repo=try&revision=a4a9037ae138067133c64a7d4b83321a26aa8c12 are two pushes I made this morning.

The latest entry on https://hg.mozilla.org/try/pushloghtml is [was] from ~90 minutes ago.
Tried running 'hg pull' on the webheads to catch try back up, but I'm clearly missing something, as it was a no-op. There weren't any 'database is locked' errors that I could find on hgssh4; presumably they occurred elsewhere.
Flags: needinfo?(gps)
Correct me if I'm wrong, but SQLite only has database-wide locks, correct? Is it the right solution for us if we're having problems with concurrency? In MySQL with InnoDB, for instance, locking is generally row-level, so lock contention is much less likely.
We use SQLite because it is simple and fits the requirements (assuming we use it correctly). Using something more complex like MySQL introduces more failure modes, operational complexity, etc.
No new pushes on Treeherder for Try for the last 2 hours. Kendall, please take a look.
Flags: needinfo?(klibby)
Restarted kafka, just in case; pulled data over to the webheads.
Flags: needinfo?(klibby)
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/0795a1fde163
pushlog: increase busy_timeout
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I suspect that change (which I'm deploying now) will wallpaper over the issue.

But so far I have not been able to reproduce in tests the exact failure we're seeing in production. Speaking of tests, I intend to land some more test coverage for database contention, but that doesn't need to block the server change.
This broke again a few minutes ago. Had to do an `hg pull` to fix. Ugh.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: hg42
Flags: needinfo?(gps)
I suspect bug 1297153 may be partially responsible for some of the failures.

Anyway, I don't think we've had a pushlog failure since the timeout increase landed and the Try heads were pruned over the past several days. While I'm still concerned about what caused this, I'm willing to call it resolved once bug 1297153 lands.
There is no indication that this is still a problem.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED