Closed Bug 541367 Opened 15 years ago Closed 11 years ago

Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports TCP Window Scaling and max 128KB is set. Router tolerate the large 128KB. Router's bug is suspected.)

Categories

(MailNews Core :: Networking: SMTP, defect)

x86
Windows XP
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: dtsweb, Unassigned)

References

()

Details

(Keywords: hang, Whiteboard: [gs])

Attachments

(5 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1 When sending messages with attachments, Thunderbird hangs during the sending process and eventually the send times out. (The problem may also exist with large e-mails though I have not tested that.) Looking at a packet trace, you see that the process suddenly starts producing massive amounts of TCP Dup ACK packets and TCP retransmissions. Using a different mailer works fine. Likewise when sending through a different account at a different provider (The problematic provider uses CommuniGate Pro on their end, not sure what version). Reproducible: Always Steps to Reproduce: 1. Compose new message 2. Attach file (anywhere from 200k - several megabytes) 3. Send mail Actual Results: Sending message dialog appears, Status is 'Delivering message' and the Progress climbs up to a random number (6%, 75%, 98%, ... depending on the message) and then gets stuck for a while until eventually this error message appears: "Sending of message failed. The message could not be sent because the connection to SMTP server ..... timed out. Try again or contact your network administrator." Expected Results: Message should be sent. A TCP dump is available but I need to scrub it for private data before uploading it .... This bug might be related to the bugs 535203 and 535070.
Forgot to mention this: The problem started after the upgrade to Thunderbird 3.0 and is happening on multiple machines in the same network.
This is the data part of a mail sending process. You can see it failing at packet Nr 180. I removed the authentication and mail headers for security/anti-spam reasons, the EHLO is below: 220 netline.lu ESMTP CommuniGate Pro is listening to you! EHLO [10.78.78.181] 250-netline.lu we trust you [10.78.78.181] 250-DSN 250-SIZE 250-STARTTLS 250-AUTH LOGIN PLAIN CRAM-MD5 DIGEST-MD5 250-ETRN 250-TURN 250-ATRN 250-NO-SOLICITING 250-HELP 250-PIPELINING 250 EHLO AUTH CRAM-MD5 334 removed removed 235 yyy.yyy@naturpark-our.lu relaying authenticated MAIL FROM:<yyy.yyy@naturpark-our.lu> SIZE=5199163 250 yyy.yyy@naturpark-our.lu sender accepted RCPT TO:<xxx.xxx@naturpark-our.lu> 250 xxx.xxx@naturpark-our.lu will leave the Internet DATA 354 Enter mail, end with "." on a line by itself
Didn't delve far enough into the code to be sure, but I suspect the retransmission of unacknowledged packets would be the responsibility of the OS's TCP stack, not Thunderbird's "netwerk" code. Note that the sending side is far ahead of the acknowledgements, and gets very much farther ahead with each (selective) ack; what's the window scaling value for this session ? Is it possible that the sending side is sending more than a windows' worth of data, which is being ignored by the receiver -- thus requiring everything to be retransmitted multiple times once a single packet is not received correctly ?
Attached file Session start (SYN, SYN/ACK, ACK) (deleted) —
To be honest, I would've thought it was the sending OS too, but the symptoms just don't match (They have no issues with any other mail provider, or network service for that matter, the problem is happening on multiple machines on the same network and it seems to have appeared after TB3 was installed.) I've uploaded the start of the session. I'm not really very familiar or good with TCP/IP networking but maybe you see anything out of the ordinary there that correlates to the rest of the trace.
I'm not that great at TCP/IP either -- at least not without "The Book" and/or the relevant RFCs beside me. The SYN doesn't indicate any window-scaling being used. Looking more closely at the end of the previously-attached capture, it appears that packets are being dropped/corruped along the way, as can be seen by the selective ACKs. The failure of the receiver to acknowledge the receipt of the of the (retransmitted) bytes 129620 through 131072 (relative) in the 11-second time period (at the end of the capture) would lead one to believe there's either a congested network segment or a "flaky" link somewhere. Such a conjecture can also be used to explain the periodic "flood" of packets from one side or the other, as congestion clears and backup-up router buffers are emptied (router buffer overflow can also explain the packet loss).
(In reply to comment #6) > Such a conjecture can also be used to explain the periodic "flood" of packets from one side or the other, > as congestion clears and backup-up router buffers are emptied (router buffer overflow can also explain the packet loss). If such problem happens, retry is executed at TCP or IP level. It may force long time to send. Tb3's default timeout is 100sec, and it may be small for such environment. Increase mailnews.tcptimeout value via Config Editor and restart Tb. (I don't know mail.smtpserver.smtpN.tcptimeout is usable or not.) Still error occurs almost always? FYI. Tb3 side SMTP/Socket level log with timesamp may help your diagnosis. > https://wiki.mozilla.org/MailNews:Logging > http://www.mozilla.org/projects/netlib/http/http-debugging.html > http://www.mozilla.org/projects/nspr/reference/html/prlog.html#25328 > (SMTP only) > SET NSPR_LOG_MODULES=timestamp,smtp:5 > (Socket level log too) > SET NSPR_LOG_MODULES=timestamp,smtp:5,nsSocketTransport:5,nsHostResolver:5
Old firmware version of rooter can be a cause. Outbound mail scan(port scan type, or proxy type) by ant-virus software may produce such problem. Small MTU, too small or too large TCP Window size may also be a cause. > http://kb.mozillazine.org/Can_not_send_large_attachments > bug 69931 Please be careful when test with large TCP window size(bug 532348).
As long as mailnews.tcptimeout is not used within TB itself, but is instead passed as a parameter to the OS's TCP layer to set the TCP timeout value, then a value of 100 seconds should not be a problem. The provided capture shows packets are not reaching the destination, as indicated by the selective ACKs and the lack of acknowledgement of the retransmitted packets at the end of the provided capture. Question for the reporter: Is there additional capture data beyond that in the attached capture ? That is, do retransmission limits or timeouts eventually cause the TCP session to fail ?
13:34:20.308794 IP 212.24.195.252.25 > 10.78.78.188.1105: . ack 103055 win 45012 <nop,nop,sack 2 {82727:84179}{132095:139355}> Definitely not correct; SACK sequence number is less than the ACK sequence number. This only occurred for one packet, though, and the TB end did not retransmit any data prior to 103055, so this is just an observation rather than a problem. The real problem is that packets just aren't getting through to the receiver. 13:34:18.823021 IP 10.78.78.188.1105 > 212.24.195.252.25: P 105959:107411(1452) ack 547 win 64989 13:34:20.007325 IP 212.24.195.252.25 > 10.78.78.188.1105: . ack 82727 win 63888 <nop,nop,sack 4 {91439:94343}{95795:103055}{104507:105959}{107411:108863}> (Packet containing 105959 through 107411 not received) 13:34:20.190096 IP 10.78.78.188.1105 > 212.24.195.252.25: P 105959:106719(760) ack 547 win 64989 (Packet containing 105959 through 106719 retransmitted) 13:34:20.942904 IP 212.24.195.252.25 > 10.78.78.188.1105: . ack 106719 win 41348 <nop,nop,sack 3 {108863:110815}{142259:148067}{132095:140807}> (Packet containing 105959 through 106719 acknowledged) 13:34:20.942921 IP 10.78.78.188.1105 > 212.24.195.252.25: . 106719:107411(692) ack 547 win 64989 (Packet containing 106719 through 107411 retransmitted) 13:34:22.118047 IP 10.78.78.188.1105 > 212.24.195.252.25: . 106719:108171(1452) ack 547 win 64989 (Packet containing 106719 through 107411 -- and up to 108171 -- retransmitted again) 13:34:24.649285 IP 10.78.78.188.1105 > 212.24.195.252.25: . 106719:108171(1452) ack 547 win 64989 13:34:29.555310 IP 10.78.78.188.1105 > 212.24.195.252.25: . 106719:108171(1452) ack 547 win 64989 13:34:39.179983 IP 10.78.78.188.1105 > 212.24.195.252.25: . 106719:108171(1452) ack 547 win 64989 13:34:58.429333 IP 10.78.78.188.1105 > 212.24.195.252.25: . 106719:108171(1452) ack 547 win 64989 (Packet containing 106719 through 107411 -- and up to 108171 -- retransmitted 4 more times, never acknowledged, trace ends) I note that the last packet shown above is only 30 seconds after the first transmission of the data; there should have been at least one additional retransmission attempt, and -- if still not correctly received -- eventually one end or the other should have terminated the TCP session (FIN/RST).
Make that "40 seconds after the first transmission".
Successful transmission of large file by TB2. Note TB2 seems to send in 4KB "chunks", and is never more than 5-6KB "ahead" of the acknowledgements from the receiver.
Failed transmission of large email via TB3; both this and the TB2 capture should be "filtered" for port 25. Note TB3 will get almost 50KB "ahead" of receiver; could this cause the "network congestion" noted in previous posts, particularly for your "garden variety" home routers ? For more extensive comments, see: http://getsatisfaction.com/mozilla_messaging/topics/waiting_time_for_connecting_to_smtp_expired#reply_1915734
Attachment 426911 [details] does not show a retransmission timeout, but does exhibit the same "congestion" symptoms noted in Attachment 424641 [details] (receiver missing packets, responses come in "groups"). Attachment 426911 [details] ends with multiple RSTs, followed by multiple ACKs (??), followed by a few more RSTs, from the SMTP server, only 4 seconds after the session was initiated.
what is the conclusion - NTF, server, thunderbird, more info needed?
Component: General → Message Compose Window
QA Contact: general → message-compose
Version: unspecified → 3.0
(In reply to comment #16) > what is the conclusion - NTF, server, thunderbird, more info needed? The problem in the case of attachment 426911 [details] was (finally, on 4/20) determined to be a Linksys WRT54G router (at the latest firmware update), coupled with the SMTP server's large TCP window sizes (the session failed when the window opened to about 128K). Packet captures from both ends (TB client, SMTP server) allowed me to determine that the SMTP server was not the one sending the RSTs (it was apparently the router, after having "choked" on the outstanding data). Sessions to SMTP servers which advertised smaller window sizes, or sessions bypassing the router, were not affected. Interestingly, the person experiencing the problem said that he had no problem sending the same email using WindowsMail; I can only surmise that either WM can't send data as fast (thereby not overrunning the router), or that it will not send more than 65K (or so) without waiting for acknowledgement -- I did not examine captures from WM.
(In reply to comment #17) > SMTP server's large TCP window sizes (the session failed when the window opened to about 128K). > Packet captures from both ends (TB client, SMTP server) allowed me to determine that the SMTP server was not the one sending the RSTs > (it was apparently the router, after having "choked" on the outstanding data). > Sessions to SMTP servers which advertised smaller window sizes, or sessions > bypassing the router, were not affected. >(snip) > Interestingly, the person experiencing the problem said that he had no problem > sending the same email using WindowsMail; I can only surmise that either WM > can't send data as fast (thereby not overrunning the router), or that it will > not send more than 65K (or so) without waiting for acknowledgement > -- I did not examine captures from WM. It's not surprizing. AFAIK, Mozilla familiy supports TCP's feature of reducing/increasing Window size dynamically(such as, if overflow happens, decrease, if no error, increase). Router's issue is probably "frequent buffer over flow in router, and no care for over flow, then retry" if TCP layer uses larger than 64K bytes Window size(many IP packets per a_unit_of_time than expected for router). It's reported in other similar bug, so I wrote next in my comment #8. > Old firmware version of rooter can be a cause.
Summary: Sending mail with attachment (even small ones) hangs → Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports dynamic TCP Window Size and max 128KB is set. Router can't be torelant with the large 128KB. Router's bug)
Router (noted in comment 17) was at latest firmware -- but the problem didn't occur when router was bypassed. I should point out that the symptoms of that case (as seen in packet captures) are not the same as those in the previous cases in this bug report. Since -- for those cases -- we only have packet captures from the client (TB) side, it is not possible to determine which packets are being lost (i.e., data from TB, or ack from server ?), or why.
(In reply to comment #19) > I should point out that the symptoms of that case (as seen in packet captures) > are not the same as those in the previous cases in this bug report. There are two different cases in this bug? (1) Michel Meyers(bug opener)'s case: Comment 0, original report Comment 4, attachment 424516 [details] Session start (SYN, SYN/ACK, ACK) Comment 10, attachment 424641 [details] Full trace from start to finish Comment 11, analysis of trace data by you (2) Your(Michael A. Pasek) case: Comment 14, attachment 426911 [details] Wireshark capture of TB3 large email transmission Comment 17, Your analysis. It's found router's problem If so, I think your case(clearer than any other known similar phenomena) is better to be kept in separate bug, for future analysis of this bug and other similar phenomena, as reference for analysis of other similar ones.
Summary: Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports dynamic TCP Window Size and max 128KB is set. Router can't be torelant with the large 128KB. Router's bug) → Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports dynamic TCP Window Size and max 128KB is set. Router doesn't look torelant with the large 128KB. Router's bug is suspected.)
The problem may be that of a router in both cases, but without data from the "other end" (for the original reporter's case), that is not possible to determine. Splitting the "router" case to a separate bug report may prove useful, but -- on the other hand -- it may server to obscure that there may be many underlying reasons for the same outward symptom. The SSL case (Bug 566548) has symptoms nearly identical (if not identical) to the original problem report in this bug report. Since TCP would not care whether or not SSL is involved, perhaps that case should be combined with this one ?
Michel Meyers(bug opener), can you check with network.tcp.sendbuffer=65536? (See bug 566548, please)
Blocks: 566548, 538283
Summary: Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports dynamic TCP Window Size and max 128KB is set. Router doesn't look torelant with the large 128KB. Router's bug is suspected.) → Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports TCP Window Scaling and max 128KB is set. Router doesn't look torelant with the large 128KB. Router's bug is suspected.)
Severity: major → critical
Keywords: hang
David who would be best qualified to have a look at that biesi ?
(In reply to comment #23) > David who would be best qualified to have a look at that biesi ? Yes, I guess biesi.
(In reply to comment #24) > (In reply to comment #23) > > David who would be best qualified to have a look at that biesi ? > > Yes, I guess biesi. Biesi could you give us a few hint on what to do and where to look to fully analyse this issue ?
(In reply to comment #25) > (In reply to comment #24) > > (In reply to comment #23) > > > David who would be best qualified to have a look at that biesi ? > > > > Yes, I guess biesi. > > Biesi could you give us a few hint on what to do and where to look to fully > analyse this issue ? Biesi says he doesn't have good ideas.
Component: Message Compose Window → Networking: SMTP
Product: Thunderbird → MailNews Core
QA Contact: message-compose → networking.smtp
Version: 3.0 → Trunk
I think this should be closed as "INVALID"; it's not a TB problem. Two users (see Comment #17 and http://getsatisfaction.com/mozilla_messaging/topics/tb_vonage_router) have reported that the problem did NOT occur when the router was removed from the path. Also, the workaround of lowering the "sendbuffer" size to 64KB has worked for at least 4 or 5 people so far. I think Bug 566548 can also be closed as "INVALID", since it also appears to have the same cause.
Linking to GS topic that was used to gather the information used to provide the data for analysis. I believe (as noted therein) that Bug 566548 should be DUP'd to this bug -- rather than, as suggested in Comment #27, closed as INVALID, even though it is. It would be better to link the two so as to indicate that the upper-level protocols don't matter WRT this problem and its symptoms.
Whiteboard: [gs]
I don't think duping is appropriate for INVALID bugs like this bug, and purpose of this bug and that bug is different. this bug : analysis of network activity, finding culprit that bug : based on this bug, finding why problem occurs, finding workaround I think next is adequate action. keep both open for a while with keeping dependency, for ease of search. close both as INVALID indepenently. Changing to new for ease of search.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Whiteboard: [gs] → [gs] [will be closed as INVALID]
I also experience this problem. It gives the error message "sending of message failed. The message could not be sent beacuse the connection to SMTP server smtp.dsl.telkomsa.net timed out....." But in MOST cases although it gives the error message, the message (plus attachement)is actually successfully send thru and is received by the person on the other side. But not in ALL cases is the message actually send successfully and this makes it very difficult to know if the receiver have received it or not. I use Windows XP Pro and has Thunderbird 3.1.6
Encountered similar things to this and bug 566548: Sending of mails with large attachments over TLS (STARTTLS) on Windows XP SP3 fails with a server timeout message. Interestingly, saving the message as a draft, closing the edit window and then opening and sending the draft massively increases the chances of a successful send, although still not to 100%. Just wiresharked the connection on the client side, and found out this: 1. Both sides advertise SACK capabilities. 2. Neither side advertises window scaling. 3. Server mackes use of SACK repeatedly, and clients retransmits missing packets. 4. Server (or router?) resets connection just milliseconds after last duplicate ACK and retransmit. So no real timeout seems to be involved. 5. For the last data packet sent by the client, wireshark reports 133337 bytes in flight. A bit earlier there was a maximum of 133359. 6. The rwindow advertised by the server has been at 64240 for some time, and has never been higher than this. I have the pcap saved here, it's about 352K. I would be willing to upload it if you require it, but would prefer only doing so if necessary, and perhaps rather send it via personal mail to individual developers. Given the fact that the receive window is smaller than the bytes in flight, I would believe it likely that the server is actually correct in resetting this connection. Perhaps the Windows TCP stack disregards packets which have been SACKed but not ACKed when calculating the amount of data to send. Which doesn't seem a good idea, as SACKed packets can't be delivered to the server application, therefore have to remain queued, therefore occupy buffer space. Is Thunderbird in any way involved in this computation, i.e. how much data to send for a given receive window when a given amount of data has not yet been ACKed/SACKed and is therefore still in flight? If so, where's the code for this? For the effect of the network.tcp.sendbuffer property I could only find calls to setsockopt, making this almost certainly a OS bug, not a TB bug. If this comment in fact is the same bug, do you want to adjust the subject? If not, should I file a new report? I guess I'll try disabling SACK support on the OS level, as described here: http://www.speedguide.net/articles/windows-2kxp-registry-tweaks-157
Melanie - thx for the detailed investigation. The Thunderbird mail code is not involved in this computation at all. I highly doubt that the core Mozilla networking code is either, but I can't swear to thta.
(In reply to comment #31) > I guess I'll try disabling SACK support on the OS level, as described here: > http://www.speedguide.net/articles/windows-2kxp-registry-tweaks-157 Didn't help, at least not in all cases. Haven't wiresharked any connections since then yet.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
Summary: Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports TCP Window Scaling and max 128KB is set. Router doesn't look torelant with the large 128KB. Router's bug is suspected.) → Sending mail with attachment (even small ones) hangs, eventually timeout (Server supports TCP Window Scaling and max 128KB is set. Router tolerate the large 128KB. Router's bug is suspected.)
Whiteboard: [gs] [will be closed as INVALID] → [gs]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: