Closed Bug 1661955 Opened 4 years ago Closed 4 years ago

News messages aren't fetched

Tracking

(thunderbird_esr78 unaffected, thunderbird81 affected)

Status:

RESOLVED FIXED

Milestone:

82 Branch

Tracking Flags:

Tracking

Status

thunderbird_esr78

---

unaffected

thunderbird81

---

affected

People

(Reporter: Paenglab, Assigned: benc)

References

Details

(Keywords: regression)

Attachments

(5 files, 4 obsolete files)

Stack trace 4 years ago Harald van Dijk (deleted), text/plain		Details
backtrace.txt 4 years ago Harald van Dijk (deleted), text/plain		Details
workaround.patch 4 years ago Harald van Dijk (deleted), patch		Details \| Diff \| Splinter Review
1661955-evil-hack-no-connection-reuse-1.patch 4 years ago Ben Campbell (deleted), patch		Details \| Diff \| Splinter Review
[checked in] 1661955-fix-stalled-nntp-1.patch 4 years ago Ben Campbell (deleted), patch	mkmelin : review+	Details \| Diff \| Splinter Review
1661955-followup-nntp-comments-1.patch 4 years ago Ben Campbell (deleted), patch	mkmelin : review+	Details \| Diff \| Splinter Review
1661955-followup-remove-isCachedConnection-1.patch 4 years ago Ben Campbell (deleted), patch	mkmelin : review+	Details \| Diff \| Splinter Review
1661955-followup-nntp-comments-2.patch 4 years ago Ben Campbell (deleted), patch	benc : review+	Details \| Diff \| Splinter Review
1661955-followup-remove-isCachedConnection-2.patch 4 years ago Ben Campbell (deleted), patch	benc : review+	Details \| Diff \| Splinter Review

Richard Marti (:Paenglab)

Reporter

Description

•

4 years ago

Bug 1657493 fixed the crash when accessing news messages but since then the messages aren't downloaded in 99% of the times. News messages that are already downloaded before (locally available) are still readable. Before the fix of bug 1657493 there where already some messages that where still readable and didn't crash. So probably the same message are the ones that are still fetch-able.

This is a blocker as newsgroups, one of TB's base features, aren't usable.

Wayne Mery (:wsmwk)

Updated

•

4 years ago

Depends on: 1633198

Alfred Peters

Comment 1

•

4 years ago

This is a 100% regression of bug 1657493 attachment 9172251 [details] [diff] [review] because in fact it simply disables the news fetch. No fetch -> no crash.
That was a good idea to prevent crashes in the beta build. But now it's not helpful because we still need to fix the crash.

So in my humble opinion we should revert attachment 9172251 [details] [diff] [review], reopen bug 1657493 and then hunt the crash.

Alfred Peters

Updated

•

4 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1662967

Alfred Peters

Updated

•

4 years ago

See Also: https://bugzilla.mozilla.org/show_bug.cgi?id=1662967 →

Ben Campbell

Assignee

Updated

•

4 years ago

Assignee: nobody → benc

Harald van Dijk

Comment 2

•

4 years ago

Attached file Stack trace (obsolete) (deleted) — Details

Stack trace for the crash:

Harald van Dijk

Comment 3

•

4 years ago

Attached file backtrace.txt (deleted) — Details

I have attached a backtrace just before it actually crashes. The problem seems to be that DocumentLoadListener::RedirectToRealChannel is called indirectly from within DocumentLoadListener::OpenDocument, and attempts to resolve the mOpenPromise promise. However, DocumentLoadListener::OpenDocument only creates mOpenPromise at the very end, so within RedirectToRealChannel, mOpenPromise is still a null pointer, and resolving that is not going to go well. I'm not familiar enough with how this is intended to work to say whether this is a problem in DocumentLoadListener or in nsNNTPProtocol, but it should be possible to work around in DocumentLoadListener by creating mOpenPromise on first use (either in OpenDocument or in RedirectToRealChannel).

Attachment #9174112 - Attachment is obsolete: true

Harald van Dijk

Comment 4

•

4 years ago

Attached patch workaround.patch (deleted) — Details — Splinter Review

That really did turn out to be easy to work around: news messages work fine for me locally with the attached patch, tested on top of Thunderbird 81.0b2. Like I said, I have no idea whether this is actually the right way to fix things. If it is not, I hope it will at least help in getting a proper fix.

Ben Campbell

Assignee

Comment 5

•

4 years ago

Thanks Harald - your diagnosis of the crash tallies with what I've been looking at (ie DocumentLoadListener::mOpenPromise being null being the crash point).
Unfortunately, the patch doesn't work for me (freshly subscribed to a big newsgroup, all messages unread). I can read the first message I click on, but subsequent messages crash.

I'm pretty sure it's something being done wrong in nsNNTPProtocol rather than on the M-C side.
I'm still slowly picking my way up through the callstack, figuring out what it's trying to do. It definitely does some weird things with the caching (during nsNNTPProtocol::AsyncOpen(), it eventually ends up using itself to feed into the cache entry, as well as being the endpoint channel being used by the docshell, which seems a little odd).

If I bypass the caching, it works the same as with workaround.patch - it works for the first message I click on, then crashes trying to read subsequent ones.

One thing I'm looking at now: It might just be because our nsNNTPProtocol::AsyncOpen() invokes the listener's OnStartRequest() directly. The nsIChannel documentation isn't totally clear, but I suspect AsyncOpen() is expected to return before OnStartRequest() is invoked... that would certainly ensure DocumentLoadListener::mOpenPromise was set up.

Alfred Peters

Comment 6

•

4 years ago

Debug build?
With a NoDebug build I can read News with Haralds patch on Trunk minus attachment 9172251 [details] [diff] [review].
Only the cursor doesn't stop spinning over folder , header and thread pane.

The debug build stops at this assertion:
https://dxr.mozilla.org/comm-central/source/netwerk/base/nsNetUtil.cpp#252

Ben Campbell

Assignee

Comment 7

•

4 years ago

some notes to myself, mainly :-)

Quick overview: it fails trying to recycle existing nntp connections for new requests.

The crash occurs because of this line in nsNNTPProtocol::LoadUrl():

if (m_fromCache) nsMsgProtocol::OnStartRequest(nullptr);

(this is the line that was commented out in Bug 1657493).

LoadUrl() is called (eventually) by AsyncOpen(), and the offending line means that the listener's OnStartRequest() is called early, which none of the channel stuff can handle - there's an (implict?) assumption that listener callbacks (OnStartRequest()/OnStopRequest()/OnDataAvailable) aren't invoked until after AsyncOpen() returns, which seems like a reasonable condition to me.
So the line above violates that condition.

The m_fromCache flag indicates that this is a cached connection (not cached data! This is really confusing, seeing how many cached-data calls are in the crashing callstack!). A cached connection is an already-open nsNNTPProtocol object, connected to the server, and m_fromCache means we're reusing it.
When setting up a new request, it'll first see if any of the cached (ie already open) connections are idle. If not, and we're still below the "max_cached_connections" pref setting (default 2), it'll create a new nsNTTPProtocol object and use that. As a last resort, it'll instead create a nsNNTPMockChannel to hold all the request details and place it in a queue for when a real nsNTTPProtocol object becomes available.
Also, whenever nsNTTPProtocol objects finish their request, they submit themselves for reuse by any queued nsNNTPMockChannels (via nsNntpIncomingServer::PrepareForNextUrl()).
Cached nsNTTPProtocol objects also have a timeout - if they're idle long enough, they're discarded.

I can get it all working again by hacking things to never recycle connections (I just hacked the timeout code to always trigger, and set my "max_cached_connections" to something huge).
Tomorrow I shall put together a stopgap patch with non-recycling connections to at least get it working again.

The if (m_fromCache) nsMsgProtocol::OnStartRequest(nullptr); line appears to have been added to try and fix up some bad state when recycling nsNNTPProtocol objects. The associated comment says it's the loadGroup, but I'm not totally convinced by this. A little code archeology to do there.
Anyway, what I need to figure out now: what is wrong with the state of recycled nsNNTPProtocol objects that causes them to stall? It'll turn out to be something stupidly simple, I'm sure : -)

Ben Campbell

Assignee

Comment 8

•

4 years ago

Attached patch 1661955-evil-hack-no-connection-reuse-1.patch (obsolete) (deleted) — Details — Splinter Review

Nasty little hack to disable all NNTP connection reuse. Slow and awful, obviously, but it works.
Not aiming to land it (hoping to work out a proper fix), but dumping it here just in case.

Ben Campbell

Assignee

Comment 9

•

4 years ago

A few more notes:
The if (m_fromCache) nsMsgProtocol::OnStartRequest(nullptr); line which causes all the trouble dates from waaaaay back. Changeset 0, the original import from CVS in 2008 or so. Which makes me less nervous about removing it and figuring out something cleaner.

The nntp connection reuse happens in two ways:

in nsNntpIncomingServer::GetNntpConnection() - new requests check to see if any of the existing channels are idle (and haven't timed out). If so, it reuses one.
When nsNNTPProtocol finishes a request, it checks to see if the nsNntpIncomingServer has any queued requests (as nsNntpMockChannel objects), and if so tells a mock channel to reuse it.

The two paths use slightly different mechanisms.
I suspect it'd be cleaner to unify them - always return an nsNntpMockChannel, and underneath bind them to nsNNTPProtocol objects as they become available (also some renaming to help clarify the roles: nsNntpMockChannel -> nsNNTPChannel and nsNNTPProtocol -> nsNNTPConnection, say).

Also, it looks like there's a logic bug in nsNntpIncomingServer::GetNntpConnection() - if there's an idle connection but it's timed out, it doesn't go back to look for other idle connections. I could see a situation where new requests would get queued, but never actually acted upon...

Magnus Melin [:mkmelin]

Comment 10

•

4 years ago

Joshua, any insights?

Wayne Mery (:wsmwk)

Updated

•

4 years ago

Flags: needinfo?(Pidgeot18)

Wayne Mery (:wsmwk)

Updated

•

4 years ago

status-thunderbird81: --- → affected

status-thunderbird_esr78: --- → unaffected

Version: unspecified → Thunderbird 81

Ben Campbell

Assignee

Comment 11

•

4 years ago

Quick update - I've got a handle on the causes of this now and am working on a solution:
To serve up individual messages, the nsNNTPProtocol (which handles real NNTP connections) is presented to the docshell as a nsIChannel. However, because the nsNNTPProtocol is a persistent connection (a sockettransport, managed by it's base nsMsgProtocol class), it doesn't appear (to the docshell) to follow the usual OnStartRequest/OnDataAvailable.../OnStopRequest patterns. Previously this was fudged by manually calling OnStartRequest inside our AsyncOpen(), to give the docshell a kick, but this was always a little dodgey. nsIChannels are meant to be single-shot things, handling a single request then going away. Not for representing persistent connections like we have here. We were lucky it worked for a long time, but now the fudged OnStartRequest-during-AsyncOpen() no longer works.
Got a couple of approaches to try, but I think the best one is to continue my current intention of using making sure the docshell always just deals through nsIMockChannel rather than sometimes using nsNNTPProtocol directly.

Ben Campbell

Assignee

Comment 12

•

4 years ago

Attached patch [checked in] 1661955-fix-stalled-nntp-1.patch (deleted) — Details — Splinter Review

Here's the least-intrusive fix I can come up with.
It still leaves a whole heap of dodgy stuff in the NNTP code, but it Works For Me (tm) and I don't think it'll break anything else that's not already broken.

I'll craft a few followup patches to add some notes and tidy up a few loose ends I spotted, but unless there are any big issues, I'll leave the big intrusive nsNNTPProtocol/nsMockChannel cleanup for now. I think its best to consider that in terms of a larger cleanup of all the nsMsgProtocol-based classes.

Attachment #9175053 - Attachment is obsolete: true

Attachment #9175940 - Flags: review?(mkmelin+mozilla)

Magnus Melin [:mkmelin]

Comment 13

•

4 years ago

Comment on attachment 9175940 [details] [diff] [review] [checked in] 1661955-fix-stalled-nntp-1.patch Review of attachment 9175940 [details] [diff] [review]: ----------------------------------------------------------------- Seems to work thx! r=mkmelin

Attachment #9175940 - Flags: review?(mkmelin+mozilla) → review+

Magnus Melin [:mkmelin]

Updated

•

4 years ago

Status: NEW → ASSIGNED

Keywords: checkin-needed-tb

Target Milestone: --- → 82 Branch

Pulsebot

Comment 14

•

4 years ago

Pushed by mkmelin@iki.fi:
https://hg.mozilla.org/comm-central/rev/81b009ff299a
Fix stalled message display when using persistent NNTP connections. r=mkmelin

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

Keywords: checkin-needed-tb

Resolution: --- → FIXED

Wayne Mery (:wsmwk)

Updated

•

4 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1657493

Alfred Peters

Updated

•

4 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1666146

Ben Campbell

Assignee

Updated

•

4 years ago

Attachment #9175940 - Attachment description: 1661955-fix-stalled-nntp-1.patch → [checked in] 1661955-fix-stalled-nntp-1.patch

Ben Campbell

Assignee

Comment 15

•

4 years ago

Attached patch 1661955-followup-nntp-comments-1.patch (obsolete) (deleted) — Details — Splinter Review

Just a followup to add some comments on things that really confused me when I was digging about in NNTP code.

Attachment #9176805 - Flags: review?(mkmelin+mozilla)

Ben Campbell

Assignee

Comment 16

•

4 years ago

Attached patch 1661955-followup-remove-isCachedConnection-1.patch (obsolete) (deleted) — Details — Splinter Review

Cruft removal: nsINNTPProtocol.isCachedConnection is no longer used.

Attachment #9176807 - Flags: review?(mkmelin+mozilla)

Magnus Melin [:mkmelin]

Comment 17

•

4 years ago

Comment on attachment 9176807 [details] [diff] [review] 1661955-followup-remove-isCachedConnection-1.patch Review of attachment 9176807 [details] [diff] [review]: ----------------------------------------------------------------- Please remove m_fromCache as well, looks like this removes the last usages of it

Attachment #9176807 - Flags: review?(mkmelin+mozilla) → review+

Magnus Melin [:mkmelin]

Comment 18

•

4 years ago

Comment on attachment 9176805 [details] [diff] [review] 1661955-followup-nntp-comments-1.patch Review of attachment 9176805 [details] [diff] [review]: ----------------------------------------------------------------- ::: mailnews/news/src/nsNNTPProtocol.cpp @@ +1978,5 @@ > } > > +/** > + * Handler for NNTP_READ_ARTICLE state when being used as an nsIChannel > + * (eg for displaying an Article in a nsDocShell). The nsIChannel counterpart nit: e.g., and I guess article lowercase? ::: mailnews/news/src/nsNntpIncomingServer.cpp @@ +487,5 @@ > > +/** > + * Find an available nsNNTPConnection. Can create new connections as long as > + * we stay under the maximum connection limit. > + * If none are available, returns NS_OK and nullptr. Good to get it documented, but not really ok per coding conventions in xpcom. An ok should not really allow null out.

Attachment #9176805 - Flags: review?(mkmelin+mozilla) → review+

Ben Campbell

Assignee

Comment 19

•

4 years ago

Attached patch 1661955-followup-nntp-comments-2.patch (deleted) — Details — Splinter Review

Attachment #9176805 - Attachment is obsolete: true

Attachment #9177015 - Flags: review+

Ben Campbell

Assignee

Comment 20

•

4 years ago

Attached patch 1661955-followup-remove-isCachedConnection-2.patch (deleted) — Details — Splinter Review

Attachment #9176807 - Attachment is obsolete: true

Attachment #9177016 - Flags: review+

Ben Campbell

Assignee

Comment 21

•

4 years ago

(In reply to Magnus Melin [:mkmelin] from comment #18)

Good to get it documented, but not really ok per coding conventions in
xpcom. An ok should not really allow null out.

Technically it's not part of the XPCOM interface (It's just a helper function in nsNNTPConnection), but yeah, I agree.
For now I just wanted to keep that patch to comments, but that function would be one of the first to go in a proper cleanup of the NNTP code.

Ben Campbell

Assignee

Updated

•

4 years ago

Keywords: checkin-needed-tb

Pulsebot

Comment 22

•

4 years ago

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/f81bd27cd9a9
Followup. Add some comments on NNTP implementation. r=mkmelin
https://hg.mozilla.org/comm-central/rev/2719fb20cee2
Followup. Remove now-unused nsINNTPProtocol.isCachedConnection attribute. r=mkmelin

Keywords: checkin-needed-tb

Joshua Cranmer [:jcranmer]

Updated

•

4 years ago

Flags: needinfo?(Pidgeot18)

You need to log in before you can comment on or make changes to this bug.