Closed Bug 1661955 Opened 4 years ago Closed 4 years ago

News messages aren't fetched

Categories

(MailNews Core :: Networking: NNTP, defect, P1)

Thunderbird 81

Tracking

(thunderbird_esr78 unaffected, thunderbird81 affected)

RESOLVED FIXED
82 Branch
Tracking Status
thunderbird_esr78 --- unaffected
thunderbird81 --- affected

People

(Reporter: Paenglab, Assigned: benc)

References

Details

(Keywords: regression)

Attachments

(5 files, 4 obsolete files)

Bug 1657493 fixed the crash when accessing news messages but since then the messages aren't downloaded in 99% of the times. News messages that are already downloaded before (locally available) are still readable. Before the fix of bug 1657493 there where already some messages that where still readable and didn't crash. So probably the same message are the ones that are still fetch-able.

This is a blocker as newsgroups, one of TB's base features, aren't usable.

Depends on: 1633198

This is a 100% regression of bug 1657493 attachment 9172251 [details] [diff] [review] because in fact it simply disables the news fetch. No fetch -> no crash.
That was a good idea to prevent crashes in the beta build. But now it's not helpful because we still need to fix the crash.

So in my humble opinion we should revert attachment 9172251 [details] [diff] [review], reopen bug 1657493 and then hunt the crash.

Assignee: nobody → benc
Attached file Stack trace (obsolete) (deleted) —
Stack trace for the crash:
Attached file backtrace.txt (deleted) —

I have attached a backtrace just before it actually crashes. The problem seems to be that DocumentLoadListener::RedirectToRealChannel is called indirectly from within DocumentLoadListener::OpenDocument, and attempts to resolve the mOpenPromise promise. However, DocumentLoadListener::OpenDocument only creates mOpenPromise at the very end, so within RedirectToRealChannel, mOpenPromise is still a null pointer, and resolving that is not going to go well. I'm not familiar enough with how this is intended to work to say whether this is a problem in DocumentLoadListener or in nsNNTPProtocol, but it should be possible to work around in DocumentLoadListener by creating mOpenPromise on first use (either in OpenDocument or in RedirectToRealChannel).

Attachment #9174112 - Attachment is obsolete: true
Attached patch workaround.patch (deleted) — Splinter Review

That really did turn out to be easy to work around: news messages work fine for me locally with the attached patch, tested on top of Thunderbird 81.0b2. Like I said, I have no idea whether this is actually the right way to fix things. If it is not, I hope it will at least help in getting a proper fix.

Thanks Harald - your diagnosis of the crash tallies with what I've been looking at (ie DocumentLoadListener::mOpenPromise being null being the crash point).
Unfortunately, the patch doesn't work for me (freshly subscribed to a big newsgroup, all messages unread). I can read the first message I click on, but subsequent messages crash.

I'm pretty sure it's something being done wrong in nsNNTPProtocol rather than on the M-C side.
I'm still slowly picking my way up through the callstack, figuring out what it's trying to do. It definitely does some weird things with the caching (during nsNNTPProtocol::AsyncOpen(), it eventually ends up using itself to feed into the cache entry, as well as being the endpoint channel being used by the docshell, which seems a little odd).

If I bypass the caching, it works the same as with workaround.patch - it works for the first message I click on, then crashes trying to read subsequent ones.

One thing I'm looking at now: It might just be because our nsNNTPProtocol::AsyncOpen() invokes the listener's OnStartRequest() directly. The nsIChannel documentation isn't totally clear, but I suspect AsyncOpen() is expected to return before OnStartRequest() is invoked... that would certainly ensure DocumentLoadListener::mOpenPromise was set up.

Debug build?
With a NoDebug build I can read News with Haralds patch on Trunk minus attachment 9172251 [details] [diff] [review].
Only the cursor doesn't stop spinning over folder , header and thread pane.

The debug build stops at this assertion:
https://dxr.mozilla.org/comm-central/source/netwerk/base/nsNetUtil.cpp#252

some notes to myself, mainly :-)

Quick overview: it fails trying to recycle existing nntp connections for new requests.

The crash occurs because of this line in nsNNTPProtocol::LoadUrl():

if (m_fromCache) nsMsgProtocol::OnStartRequest(nullptr);

(this is the line that was commented out in Bug 1657493).

LoadUrl() is called (eventually) by AsyncOpen(), and the offending line means that the listener's OnStartRequest() is called early, which none of the channel stuff can handle - there's an (implict?) assumption that listener callbacks (OnStartRequest()/OnStopRequest()/OnDataAvailable) aren't invoked until after AsyncOpen() returns, which seems like a reasonable condition to me.
So the line above violates that condition.

The m_fromCache flag indicates that this is a cached connection (not cached data! This is really confusing, seeing how many cached-data calls are in the crashing callstack!). A cached connection is an already-open nsNNTPProtocol object, connected to the server, and m_fromCache means we're reusing it.
When setting up a new request, it'll first see if any of the cached (ie already open) connections are idle. If not, and we're still below the "max_cached_connections" pref setting (default 2), it'll create a new nsNTTPProtocol object and use that. As a last resort, it'll instead create a nsNNTPMockChannel to hold all the request details and place it in a queue for when a real nsNTTPProtocol object becomes available.
Also, whenever nsNTTPProtocol objects finish their request, they submit themselves for reuse by any queued nsNNTPMockChannels (via nsNntpIncomingServer::PrepareForNextUrl()).
Cached nsNTTPProtocol objects also have a timeout - if they're idle long enough, they're discarded.

I can get it all working again by hacking things to never recycle connections (I just hacked the timeout code to always trigger, and set my "max_cached_connections" to something huge).
Tomorrow I shall put together a stopgap patch with non-recycling connections to at least get it working again.

The if (m_fromCache) nsMsgProtocol::OnStartRequest(nullptr); line appears to have been added to try and fix up some bad state when recycling nsNNTPProtocol objects. The associated comment says it's the loadGroup, but I'm not totally convinced by this. A little code archeology to do there.
Anyway, what I need to figure out now: what is wrong with the state of recycled nsNNTPProtocol objects that causes them to stall? It'll turn out to be something stupidly simple, I'm sure : -)

Attached patch 1661955-evil-hack-no-connection-reuse-1.patch (obsolete) (deleted) — Splinter Review

Nasty little hack to disable all NNTP connection reuse. Slow and awful, obviously, but it works.
Not aiming to land it (hoping to work out a proper fix), but dumping it here just in case.

A few more notes:
The if (m_fromCache) nsMsgProtocol::OnStartRequest(nullptr); line which causes all the trouble dates from waaaaay back. Changeset 0, the original import from CVS in 2008 or so. Which makes me less nervous about removing it and figuring out something cleaner.

The nntp connection reuse happens in two ways:

  1. in nsNntpIncomingServer::GetNntpConnection() - new requests check to see if any of the existing channels are idle (and haven't timed out). If so, it reuses one.
  2. When nsNNTPProtocol finishes a request, it checks to see if the nsNntpIncomingServer has any queued requests (as nsNntpMockChannel objects), and if so tells a mock channel to reuse it.

The two paths use slightly different mechanisms.
I suspect it'd be cleaner to unify them - always return an nsNntpMockChannel, and underneath bind them to nsNNTPProtocol objects as they become available (also some renaming to help clarify the roles: nsNntpMockChannel -> nsNNTPChannel and nsNNTPProtocol -> nsNNTPConnection, say).

Also, it looks like there's a logic bug in nsNntpIncomingServer::GetNntpConnection() - if there's an idle connection but it's timed out, it doesn't go back to look for other idle connections. I could see a situation where new requests would get queued, but never actually acted upon...

Joshua, any insights?

Flags: needinfo?(Pidgeot18)
Version: unspecified → Thunderbird 81

Quick update - I've got a handle on the causes of this now and am working on a solution:
To serve up individual messages, the nsNNTPProtocol (which handles real NNTP connections) is presented to the docshell as a nsIChannel. However, because the nsNNTPProtocol is a persistent connection (a sockettransport, managed by it's base nsMsgProtocol class), it doesn't appear (to the docshell) to follow the usual OnStartRequest/OnDataAvailable.../OnStopRequest patterns. Previously this was fudged by manually calling OnStartRequest inside our AsyncOpen(), to give the docshell a kick, but this was always a little dodgey. nsIChannels are meant to be single-shot things, handling a single request then going away. Not for representing persistent connections like we have here. We were lucky it worked for a long time, but now the fudged OnStartRequest-during-AsyncOpen() no longer works.
Got a couple of approaches to try, but I think the best one is to continue my current intention of using making sure the docshell always just deals through nsIMockChannel rather than sometimes using nsNNTPProtocol directly.

Here's the least-intrusive fix I can come up with.
It still leaves a whole heap of dodgy stuff in the NNTP code, but it Works For Me (tm) and I don't think it'll break anything else that's not already broken.

I'll craft a few followup patches to add some notes and tidy up a few loose ends I spotted, but unless there are any big issues, I'll leave the big intrusive nsNNTPProtocol/nsMockChannel cleanup for now. I think its best to consider that in terms of a larger cleanup of all the nsMsgProtocol-based classes.

Attachment #9175053 - Attachment is obsolete: true
Attachment #9175940 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9175940 [details] [diff] [review] [checked in] 1661955-fix-stalled-nntp-1.patch Review of attachment 9175940 [details] [diff] [review]: ----------------------------------------------------------------- Seems to work thx! r=mkmelin
Attachment #9175940 - Flags: review?(mkmelin+mozilla) → review+
Status: NEW → ASSIGNED
Target Milestone: --- → 82 Branch

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/81b009ff299a
Fix stalled message display when using persistent NNTP connections. r=mkmelin

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Attachment #9175940 - Attachment description: 1661955-fix-stalled-nntp-1.patch → [checked in] 1661955-fix-stalled-nntp-1.patch
Attached patch 1661955-followup-nntp-comments-1.patch (obsolete) (deleted) — Splinter Review

Just a followup to add some comments on things that really confused me when I was digging about in NNTP code.

Attachment #9176805 - Flags: review?(mkmelin+mozilla)

Cruft removal: nsINNTPProtocol.isCachedConnection is no longer used.

Attachment #9176807 - Flags: review?(mkmelin+mozilla)
Comment on attachment 9176807 [details] [diff] [review] 1661955-followup-remove-isCachedConnection-1.patch Review of attachment 9176807 [details] [diff] [review]: ----------------------------------------------------------------- Please remove m_fromCache as well, looks like this removes the last usages of it
Attachment #9176807 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 9176805 [details] [diff] [review] 1661955-followup-nntp-comments-1.patch Review of attachment 9176805 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/news/src/nsNNTPProtocol.cpp @@ +1978,5 @@ > } > > +/** > + * Handler for NNTP_READ_ARTICLE state when being used as an nsIChannel > + * (eg for displaying an Article in a nsDocShell). The nsIChannel counterpart nit: e.g., and I guess article lowercase? ::: mailnews/news/src/nsNntpIncomingServer.cpp @@ +487,5 @@ > > +/** > + * Find an available nsNNTPConnection. Can create new connections as long as > + * we stay under the maximum connection limit. > + * If none are available, returns NS_OK and nullptr. Good to get it documented, but not really ok per coding conventions in xpcom. An ok should not really allow null out.
Attachment #9176805 - Flags: review?(mkmelin+mozilla) → review+
Attachment #9176805 - Attachment is obsolete: true
Attachment #9177015 - Flags: review+
Attachment #9176807 - Attachment is obsolete: true
Attachment #9177016 - Flags: review+

(In reply to Magnus Melin [:mkmelin] from comment #18)

Good to get it documented, but not really ok per coding conventions in
xpcom. An ok should not really allow null out.

Technically it's not part of the XPCOM interface (It's just a helper function in nsNNTPConnection), but yeah, I agree.
For now I just wanted to keep that patch to comments, but that function would be one of the first to go in a proper cleanup of the NNTP code.

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/f81bd27cd9a9
Followup. Add some comments on NNTP implementation. r=mkmelin
https://hg.mozilla.org/comm-central/rev/2719fb20cee2
Followup. Remove now-unused nsINNTPProtocol.isCachedConnection attribute. r=mkmelin

Flags: needinfo?(Pidgeot18)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: