Closed Bug 1469117 Opened 6 years ago Closed 6 years ago

Socket timeout is left engaged when set to infinity and later back

Categories

(Core :: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: gds, Assigned: mayhemer)

References

Details

(Whiteboard: [necko-triaged][tb])

Attachments

(1 file, 3 obsolete files)

User Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0
Build ID: 20180327091359

Steps to reproduce:

At least Msg B needs to have mime parts so that a bodystructure is fetched to see the problem; maybe also Msg A but not sure. Also, this occurs when there is no offline store for the folder containing A and B (so memory cache is used) and probably with attachments displayed inline.
1. Open Msg A
2. Wait about 120 seconds (2 minutes)
3. Open Msg B


Actual results:

When Msg B is selected, the header (to, from , subject etc) from Msg A remains visible and the message body area becomes blank. Also, if Msg A has attachments, they remain visible in the attachment pane. Nothing from Msg B becomes visible, even after a long wait. Also, the progress bar at the lower right remains in motion and never stops.


Expected results:

Of course, Msg B should become visible when it is selected. and all vestiges of Msg A should be gone after Msg B is opened. And the progress bar should quickly become blank. 

I think the timing of this is important. If Msg B is quickly selected instead of waiting a few minutes there seems to be no problem. Also, if a long time occurs before Msg B is selected, there is no problem. The problem seems to occur when the time between Msg A and Msg B is maybe 2-5 minutes.

Note: This problem was originally noticed and somewhat described in 
Bug 1454542 Comment 8 - 14. 
Also, a similar problem can occur with Msg B displayed correctly and not yet cached attachments to Msg B are opened, but this is usually harder to duplicate. This is mostly what is described in Bug 1454542.
I think what is happening is that when Msg B is opened for the first time and it is not yet in cache, Bodystructure() function is called that does a SendData() to fetch the bodystructure. The same tcp/ip connection that was used to access Msg A is used. However, if the time since the last access to Msg A data is more than 100 seconds a NS_ERROR_NET_TIMEOUT errors occurs when the SendData() response line is attempted to be obtained. The SendData() does not return and error, but the "fetch bodystructure" command is never sent. 

The IMAP server never sends any indication of timeout but it seems to be independently determined by thunderbird. After the 100 seconds, based on wireshark info, the connection is still good. I can see tb initiate the disconnect that is properly acknowledged by the IMAP server. The server never sends a timeout or a disconnect on its own.

Once tb disconnects, it then establishes a new connection and retries the "fetch bodystruture" that is correctly sent and the server responds properly on the new connection. However, for yet unknown reasons, tb ignores this retry response and it is never stored in cache or used by "listener" and displayed on the screen, so the progress indicator keeps moving.

If Msg X is selected and Msg B is quickly re-selected, the Msg B cache is marked as readable (aNew=0) but the header data is empty (and probably the body data) so the cache entry is doomed. This causes the cache to be written (aNew=1) that finally fetches and caches the correct data so Msg B is then displayed.

I think a similar thing happens when not yet cached attachments to Msg Z are opened within a certain timing window.
Reading the code, it appears that NS_ERROR_NET_TIMEOUT is controlled by mailnews.tcptimeout that defaults to 100 seconds. I think tcptimeout is intended to watchdog the imap transaction so that ..._NET_TIMEOUT occurs if the response to tb takes more than 100s. I don't see any previous imap transactions taking anywhere near 100s before the ..._TIMEOUT error is seen on the bodystructure command response check. It seems to be acting more as an idle timer and occurs when there is no activity on the connection for 100s. But I need to look closer...
I checked again with 52.8 and the same profile/configuration. The problem seems to be a regression since I cannot duplicate the problem with 52.8. I only see it with the trunk build.

I do see some post 52.8 changes in nsImapProtocol.cpp relating to "proxies" that might affect the places where m_transport->SetTimeout() occur but don't know if they are significant.

Also, the places where SetTimeout() is called seem to assume the timeout measures the imap request/response time. However, looking at the code in the mozilla/netwerk area that handle the socket timeout, it seems to be more of an activity timer. So you only get a timeout indication if the socket is inactive for the set timeout period, default 100s; it really has nothing to do with timing a given imap request/response activity.

What I see is that if I force some network activity within 100s between opening various messages (e.g., check new messages) I don't see a problem. This probably keeps the socket active so the timeout never hits.
Attached patch netwerk.diff -- proposed fix for socket timer (obsolete) (deleted) — Splinter Review
I think there is basic problem with the socket timer. When an imap operation occurs, tb code sets the timers to either 100s (if sending/receiving) or 160s (when doing a new connection). Once the operation is done, tb then sets the timer to PR_MAX_UINT32 that causes the timer to be disabled (or disengaged).

When the timer is set to the big value to disable it, it correctly does not flag a timeout as long as it remains set. However, when another imap request occurs and the timeout is set back to 100, the timer starts back using the previous starting point value, called mPollStartEpoch, and, if more than 100s have elapsed since the last imap activity, immediately flags a timeout.

A my fix is to set mPollStartEpoch to zero in nsSocketTransportService::SocketContext::TimeoutIn() when the MAX_UINT16 timeout value is detected. To get it to compile I had to remove several "const" attributes, so there is probably a better way. But this does seem to fix the problem.

Note: I enabled socket logging to determine what was happening. Also, after I found the problem I checked the dxr "blame" and see that there are two recent changes in the exact same area. This may explain why I see no problem with v52.x.

Bug 1446117 -- fix regression
Bug 1444160 -- original changes to timer
Attachment #8986624 - Flags: feedback?(jorgk)
The netwerk.diff does fix the problem, or maybe just masks it. With the change removed, tb detects the bogus TIMEOUT error and still does what seems like a successful retry on a new connection, but the correct response data just goes to the bit bucket and not to the listener and cache like it should, so the display remains incomplete. I still don't know why.
Comment on attachment 8986624 [details] [diff] [review]
netwerk.diff -- proposed fix for socket timer

The code in question here is from M-C's netwerk/ so we need to find a peer for this module. No better choice than Honza, the author of bug 1444160 and bug 1446117.

Honza, is there a problem that crept in in these two bugs?

I'm not sure what this const business is about, but you'll know.
Attachment #8986624 - Flags: feedback?(jorgk) → feedback?(honzab.moz)
Comment on attachment 8986624 [details] [diff] [review]
netwerk.diff -- proposed fix for socket timer

Review of attachment 8986624 [details] [diff] [review]:
-----------------------------------------------------------------

thanks for catching this!  fx is not using timeouts that much as tb does, so we have no tests/paths to catch this...  reading comment 4 (thanks for it!) it seem better to add another branch at [1] like:

} else {
  s.MaybeResetEpoch();
}

where:

void MaybeResetEpoch()
{
  if (mPollStartEpoch && mHandler->mPollTimeout == UINT16_MAX) {
    mPollStartEpoch = 0;
  }
}

note that your patch may not work when TimeoutIt is not called between manipulations with the timeout value on the socket.

[1] https://searchfox.org/mozilla-central/rev/403038737ba75af3842ba6b43b6e2fb47eb06609/netwerk/base/nsSocketTransportService2.cpp#1249
Attachment #8986624 - Flags: feedback?(honzab.moz) → feedback-
Honza, can you handle this? Not much point typing up your suggestion and then having it reviewed by you ;-)

Thankfully this hasn't regressed into mozilla60.
Flags: needinfo?(honzab.moz)
Assignee: nobody → honzab.moz
Status: UNCONFIRMED → ASSIGNED
Component: Untriaged → Networking
Ever confirmed: true
Flags: needinfo?(honzab.moz)
Priority: -- → P2
Product: Thunderbird → Core
Summary: Open msg A, wait a few minutes, Open Msg B. Msg A header remains displayed with body area blank and with lower-right progress bar moving → Socket timeout is left engaged when set to infinity and later back
Whiteboard: [necko-triaged][tb]
Blocks: 1444160
Attached patch bug1469117_socket_timeout.patch (obsolete) (deleted) — Splinter Review
Like so?
Attachment #9002233 - Flags: review?(honzab.moz)
Attachment #9002233 - Flags: feedback?(gds)
Comment on attachment 9002233 [details] [diff] [review]
bug1469117_socket_timeout.patch

Review of attachment 9002233 [details] [diff] [review]:
-----------------------------------------------------------------

I tested your updated patch and it also fixes the original problem described in comment 0.
Attachment #9002233 - Flags: feedback?(gds) → feedback+
Attachment #8986624 - Attachment is obsolete: true
(In reply to gene smith from comment #10)
> Comment on attachment 9002233 [details] [diff] [review]
> bug1469117_socket_timeout.patch
> 
> Review of attachment 9002233 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I tested your updated patch and it also fixes the original problem described
> in comment 0.

Thank for verifying this!  I will still update the patch a bit and let it review from a necko peer.  I want this to be a method on SocketContext.
Comment on attachment 9002233 [details] [diff] [review]
bug1469117_socket_timeout.patch

Thanks!  I updated your patch with few details and submitted for peer review.
Attachment #9002233 - Attachment is obsolete: true
Attachment #9002233 - Flags: review?(honzab.moz)
Attachment #9002460 - Attachment is obsolete: true
Comment on attachment 9002748 [details]
Bug 1469117 - Make sure we reset network socket timeout epoch to 0 when timeout on a socket is turned off

Dragana Damjanovic [:dragana] has approved the revision.
Attachment #9002748 - Flags: review+
Keywords: checkin-needed
Pushed by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e28a399f05e8
Make sure we reset network socket timeout epoch to 0 when timeout on a socket is turned off r=dragana
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/e28a399f05e8
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: