Closed Bug 1337713 Opened 8 years ago Closed 8 years ago

TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed

Categories

(Thunderbird :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 54.0

People

(Reporter: jorgk-bmo, Assigned: aceman)

Details

Attachments

(3 files, 1 obsolete file)

TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed started mysteriously on C-C on Mon Feb 6, 2017, 22:21:23 with https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=6483fa45f5b2330a94edbb24ff7235170bce7830 Those patches have since been backed out and the crash is still there. The previous push at Mon Feb 6, 2017, 18:45:51 was based on the same M-C version 12c02bf624c48903b155428f7c8a419ba7. Today I did a try run with M-C at that version (but with C-C at tip) https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=c1540a60fbb5fc65bb8415642ef7e00c8205cd4a and it crashed just the same. Now, the enlightenment here comes from the Aurora tree, where the crash also started. Last good: Requested: Mon Feb 6, 10:41:21 Started: Mon Feb 6, 10:41:25 Ended: Mon Feb 6, 11:22:30 First bad: Requested: Tue Feb 7, 10:37:21 Started: Tue Feb 7, 10:37:24 Ended: Tue Feb 7, 11:14:25 So my bet is that something changed on those Mozilla servers between Mon Feb 6, 2017, 18:45:51 and Mon Feb 6, 2017, 22:21:23. I don't even understand the error we can see in the logs: 01:52:27 INFO - TEST-START | C:\slave\test\build\tests\mozmill\composition\test-attachment-reminder.js | test_attachment_vs_filelink_reminder 01:52:27 INFO - Timeout: bridge.execFunction("c6b7dd91-ed1a-11e6-8ffe-002590e6a953", bridge.registry["{63dd3940-ca67-4b40-b1ae-ebc31658be26}"]["runTestDirectory"], ["C:\\slave\\test\\build\\tests\\mozmill\\composition"]) 01:52:27 WARNING - TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed 01:52:27 INFO - Timeout: bridge.set("2237f00f-ed1b-11e6-83e7-002590e6a953", Components.utils.import('resource://mozmill/modules/mozmill.js')) What's that "filelink" business? As detailed in bug comment #104, I ran test-attachment-reminder.js locally and it didn't fail. However, I noticed a second message panel under the various confirmation panels, and the test constantly goes "bing". Aceman, I'm really distraught by this, so your help here would be much appreciated.
Flags: needinfo?(acelists)
Attached patch 1337713-mozmill-crash.patch (v1). (obsolete) (deleted) — Splinter Review
It appears this this needs 'attachment-helpers'. Hmm, how did that ever work then?
I was wrong. 'add_attachment' comes from the compose helpers. Another try with that test disabled: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=949b72c806a58e8678577912ba1c34d377b6d3db
OK, those tries all failed due to current bustage. I had to fix bug 1337734 first. Here's another try disabling the test: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=3ff238b69169df5762a9f724219fbcae2333bb25
OK, with the test disabled, Mozmill passes. So I've done something which appears to be very unpopular: A non-purist temporary measure. I've disabled the test. This test is apparently crashing the Mozmill suite which means that other tests in the same directory will not run. So we lose testing coverage (and nervous people like myself start seeing crashes everywhere and back out perfectly harmless changesets). https://hg.mozilla.org/comm-central/rev/50e1f832608564d859d9a1cdb64591bdebc4689d
Attachment #8834821 - Attachment is obsolete: true
I have been bitten by this test failure maybe one in five times or one in ten times when I run mozmill test under local Linux environment. So I think the real cause is timing-related, but I am not sure what type of timing-related issue. (I am showing a patch to the test and adding wait/sleep seemed to help a bit (?). I am attaching a log of local runs that show the success of this particular test and the failure run of this particular test. (Like I said, the error occurs maybe once in 5-10 times now with my patch. Before, without my patch, the frequency was much higher.) In the log, I also attach a local modification for the function when I tried to figure out what is causing the failure. My current theory is a timing issue of popup window for modal error dialog OR that the dialog does NOT appear sometimes. Why? Not sure, but there are some *OTHER* serious errors such as AlertHook is missing on the message handler (?). Maybe an error condition is triggered somewhere/somehow, but it is not properly handled because of the lack of Alert Handler. Thus the failure is not properly propagated to the test client? Pure Guess. [And why and what the original error then? I think when AlertHook is implemented and prints out some helpful information, we will know.] Thus, I think it is the timing issue between the TB process and test client that operates on the TB process. Anyway I am attaching a log and the patch in it. Note: In the log from test_attachment_vs_filelink_reminder, you see lines like step-01 step-01.5 step-02 step-03 step-04 step-04.5 step-05 step-05.5 these lines are printed by my patch to see where the timeout occurs. In the successful run, all the lines are printed. In the failure run, all but the last step-05.5 are printed. This means that we experienced the timeout in the following code: Inside click_send_and_handle_send_error(cwc), I think + cwc.sleep(2000); // add time to see the proper dialog. + dump("step-05\n"); + // we may see sending failed due to filelink not working (always on + // local build test cases under linux. click_send_and_handle_send_error(cwc); + dump("step-05.5\n"); close_window(cwc); Note also, that even in this short test, we see the following JS errors/warnings. Very disturbing. ======================================== NS_ERROR ======================================== 1 JavaScript error: resource:///modules/activity/alertHook.js, line 48: NS_ERROR_UNEXPECTED: Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIMsgMailNewsUrl.server] ======================================== ASSERTION ======================================== Count/sum is modulo [without the first PID] 4 [6911] ###!!! ASSERTION: not-null m_mime_delivery_state: 'm_mime_delivery_state != nullptr', file /NREF-COMM-CENTRAL/comm-central/mailnews/compose/src/nsMsgAttachmentHandler.cpp, line 1104 ======================================== ^WARNING ======================================== First with NS_ENSURE macros: Count/sum is modulo [without the first PID] 7 [6911] WARNING: NS_ENSURE_TRUE(msgDocShellItem) failed: file /NREF-COMM-CENTRAL/comm-central/mailnews/base/src/nsMsgWindow.cpp, line 93 4 [6911] WARNING: NS_ENSURE_TRUE(ps) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/editor/libeditor/HTMLEditor.cpp, line 2948 3 [6911] WARNING: NS_ENSURE_TRUE(sheet) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/editor/libeditor/HTMLEditor.cpp, line 2944 3 [6911] WARNING: NS_ENSURE_TRUE(name) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/dom/base/nsDOMAttributeMap.cpp, line 374 3 [6911] WARNING: NS_ENSURE_TRUE(mInput) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/components/autocomplete/nsAutoCompleteController.cpp, line 1700 3 [6911] WARNING: NS_ENSURE_TRUE(editor) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/extensions/spellcheck/src/mozInlineSpellChecker.cpp, line 722 3 [6911] WARNING: NS_ENSURE_TRUE(aNode) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/editor/libeditor/EditorBase.cpp, line 3416 2 [6911] WARNING: NS_ENSURE_TRUE(editor) failed: file /NREF-COMM-CENTRAL/comm-central/mozilla/extensions/spellcheck/src/mozInlineSpellChecker.cpp, line 1801 1 [6911] WARNING: NS_ENSURE_SUCCESS(rv, nullptr) failed with result 0x8000FFFF: file /NREF-COMM-CENTRAL/comm-central/mailnews/base/util/nsMsgProtocol.cpp, line 1531 Second without NS_ENSURE macros: Count/sum is modulo [without the first PID] 2 [6911] WARNING: We should have hit the document element...: file /NREF-COMM-CENTRAL/comm-central/mozilla/layout/xul/BoxObject.cpp, line 174 1 [7178] WARNING: 'NS_FAILED(rv)', file /NREF-COMM-CENTRAL/comm-central/mozilla/dom/indexedDB/ActorsParent.cpp, line 18809 PS: Disabling the test certainly gets us move forward, but I hope that someone can look into the cause of this. Yes, I deem disabling a rather unpopular move here. (I know we are busted with unreleased atom issue, too.) It seems that underlying stricter checks and some infrastructure changes [that may have tweaked the timing of program execution] have finally allow us to tackle hitherto-neglected bugs head on. I have seen this error so many times in the past on my PC and did not realize until of late, that I am not running the tests in the same composition directory past this failing function when it DOES fail. That is why I tinkered by creating the patch and caused the failure rate tolerable at least locally... Maybe there *COULD* be linux vs Windows issue even. TIA
Yes I think this test file is particularly problematic and it often gave failures on c-c servers that couldn't be reproduced locally by anyone. It seems only Chiaki can see it. > I don't even understand the error we can see in the logs: > 01:52:27 INFO - TEST-START | > C:\slave\test\build\tests\mozmill\composition\test-attachment-reminder.js | > test_attachment_vs_filelink_reminder > 01:52:27 INFO - Timeout: > bridge.execFunction("c6b7dd91-ed1a-11e6-8ffe-002590e6a953", > bridge.registry["{63dd3940-ca67-4b40-b1ae- > ebc31658be26}"]["runTestDirectory"], > ["C:\\slave\\test\\build\\tests\\mozmill\\composition"]) > 01:52:27 WARNING - TEST-UNEXPECTED-FAIL | Disconnect Error: Application > unexpectedly closed > 01:52:27 INFO - Timeout: > bridge.set("2237f00f-ed1b-11e6-83e7-002590e6a953", > Components.utils.import('resource://mozmill/modules/mozmill.js')) I think this happens when the mozmill test is waiting for some event or some widget to appear that has to be clicked, but that one never appears. The test times out and mozmill takes down the whole session. Chiaki narrowed it down to failing 'click_send_and_handle_send_error(cwc)' which matches my theory. The comment that Chiaki added at that spot is disturbing: "// we may see sending failed due to filelink not working" So we only "may" see the error? If it does not appear always and consistently, then the test is wrong to always expect it and try to click the dialog. I can look at this. My first experiment would be to change the: assert_notification_displayed(cwc, kBoxId, "bigAttachment", true) to: wait_for_notification_to_show(cwc, kBoxId, "bigAttachment") as the notification may need some time to show up.
Flags: needinfo?(acelists)
Interestingly, I can now see he failure locally. When we attach the attachment that trigger filelink, the "attaching file" progress dialog is still up when we click sending the message. The "sending error" dialog pops up too. There are both dialogs at once and then the test times out.
Sorry, I expressed myself badly in comment #0: The sending error pops up behind the sending progress dialogue and it goes "bing". But that's intentional since there is "click_send_and_handle_send_error()". So you're saying that now another dialogue interferes, the "attaching file"?
Yes, I see the "attaching file" dialog while it fetches the "http://www.example.com/1" file. Then on top I get the "sending error" dialog. I tried the http://www.example.com/1 link in Firefox and it gives an "encoding error" page. But we do no need to fetch unreliable remote resources. I pushed a patch that attaches a local file instead, https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=dad7a38398492dcc537117d3f2e38c1d26fd2c7b . That makes the test pass for me (no "attaching dialog").
I did the same thing, I tried the URL in FF. However, we do this all the time, so why does it fail here? https://dxr.mozilla.org/comm-central/search?q=regexp%3Aadd_attach.*example&redirect=false
BTW, Mac opt doesn't build, so no need to even ask ;-)
The timing *may* be related to host CPU load issue, too. I mean CPU load goes up and mozmill test processing becomes a little jerkey if there is a compilation going on in parallel (or video playing on the host windows: I am running Linux inside VirtualBox under windows 10, and sometimes Firefox under the host windows 10 occupies a large memory footprint and consumes 15% of CPU, etc.) http://www.example.com/ is resolved by a special magic inside the test harness to point to a mozilla-supplied data (instead of going to the outside DNS lookup.) This processing path may take longer than usual and the response from fake server at http://www.example.com/1 to TB binary may be slower than usual. Then the situation described may happen now hat I read aceman's comment 10 and comment 13. BTW, I am surprised to learn that I had been the only person seeing this bug locally for such a long time (!) My setup of Linux in the VirtualBox on my home PC seems to trigger so many hither-to-unexperienced dormant issues. It is a good test box in that sense :-) TIA
> http://www.example.com/ is resolved by a special magic inside the test > harness to point to a > mozilla-supplied data (instead of going to the outside DNS lookup.) Magnus do you know anything about this? Do the http://www.example.com/1 URLs return anything useful? See e.g. at https://dxr.mozilla.org/comm-central/rev/d406e5ce548c2e80f2dcb82b14207f088e921355/mail/test/mozmill/cloudfile/test-cloudfile-notifications.js#116 . But in the particular test here, it seems just setting the attachment size properly even if the attachment contents is fake, it triggers the filelink notification.
Flags: needinfo?(mkmelin+mozilla)
(In reply to ISHIKAWA, Chiaki from comment #16) > My setup of Linux in the VirtualBox on my home PC seems to trigger so many > hither-to-unexperienced dormant issues. It is a good test box in that sense Yes, VM may be a good setup to produce unexpected conditions, like unusuall HDD stalls or CPU being used outside on the host so the OS in VM stalls. I use to underclock my CPU (fix to lowest idle frequency) to look at timing issues. I have tried it now for this test and I have found a third dialog at play here. 1. There is the "attaching file" dialog. 2. Then there flashes some "Confirm" dialog saying "There was a problem including the file http://www.example.com/1" in the message. Would you like to continue sending the message without this file?. This one is immediatelly dismissed. 3. Then on top comes the "Sending error" that hangs there. Then I found it is not a problem of timing. If I add sleep(10000) after adding the attachment, the "attaching file" dialog closes, then the compose waits. Then the test clicks send and then the "There was a problem " dialog appears and the sending fails. So it looks like the URL can't be accessed properly and the composition can't be send. I think the test at cloudfile/test-cloudfile-notifications.js that also use the http://www.example.com/ URLs may get away with it as they never try to send the composition. They even do not close the windows so it gets messy. I see 2 options: 1. I update the test to expect the third dialog. But I don't know if it always appears (the test was pasing up to now, or randomly for random users). 2. I attach a local file, as in the patch at https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=4e7923199969d7401b7b99eab760e174b3e91899 (the test passes except on Windows where the path I used doesn't seem accepted).
Attached patch patch (deleted) — Splinter Review
Assignee: nobody → acelists
Status: NEW → ASSIGNED
Attachment #8835753 - Flags: review?(jorgk)
(In reply to :aceman from comment #18) > (In reply to ISHIKAWA, Chiaki from comment #16) > > My setup of Linux in the VirtualBox on my home PC seems to trigger so many > > hither-to-unexperienced dormant issues. It is a good test box in that sense > > Yes, VM may be a good setup to produce unexpected conditions, like unusuall > HDD stalls or CPU being used outside on the host so the OS in VM stalls. > > I use to underclock my CPU (fix to lowest idle frequency) to look at timing > issues. I should have thought of that (!) I used to cringe at unnecessary redrawing of objects on the screen when I ran 750MHz AMD Duron CPU when the majority of developers had already moved to 1.5GHz - 2GHz CPUs. I thought that by forcing the developers to use an outdated slow CPU, we may find and remove unnecessary redrawing of objects on the screen when user interacts with programs. I cannot recall the details now, but I may have found the extra dialog about sending mail failure when I ran the test under valgrind. > I have tried it now for this test and I have found a third dialog at play > here. > > 1. There is the "attaching file" dialog. > 2. Then there flashes some "Confirm" dialog saying "There was a problem > including the file http://www.example.com/1" in the message. Would you like > to continue sending the message without this file?. This one is immediatelly > dismissed. Even under my VM setting and many local tests, I have not noticed this. Probably because as the message is dismissed immediately as you say. Maybe I should try to confirm if I can see this under valgrind/memcheck. > 3. Then on top comes the "Sending error" that hangs there. > > Then I found it is not a problem of timing. If I add sleep(10000) after > adding the attachment, the "attaching file" dialog closes, then the compose > waits. > Then the test clicks send and then the "There was a problem " dialog appears > and the sending fails. > So it looks like the URL can't be accessed properly and the composition > can't be send. I wonder why the URL can't be accessed properly. If so, the test mentioned below, test-cloudfile-notifications.js ought to be fixed as well (?) > > I think the test at cloudfile/test-cloudfile-notifications.js that also use > the http://www.example.com/ URLs may get away with it as they never try to > send the composition. They even do not close the windows so it gets messy. > > I see 2 options: > 1. I update the test to expect the third dialog. But I don't know if it > always appears (the test was pasing up to now, or randomly for random users). > 2. I attach a local file, as in the patch at > https://treeherder.mozilla.org/#/jobs?repo=try-comm- > central&revision=4e7923199969d7401b7b99eab760e174b3e91899 (the test passes > except on Windows where the path I used doesn't seem accepted). Great that you could find the proper URL in the patch in Comment 19, which works for Windows too. Maybe mozilla compilation farm should have a very slow CPU machine for testing purposes :-)
Aceman have you looked at what they do differently here so it doesn't fail: https://dxr.mozilla.org/comm-central/search?q=regexp%3Aadd_attach.*example&redirect=false
I answered that in comment 18. Those tests never click Send. I said the attaching error dialog appears upon clicking Send, not as a result of add_attachment(). I have now confirmed that by running test-cloudfile-notifications.js and clicking Send manually. The error about attaching http://www.example.com/* shows.
Comment on attachment 8835753 [details] [diff] [review] patch OK, thanks, let's not spend more time on that than necessary. Sorry about missing a prior comment, too many fires at once. I'll check it in after the next M-C merge in the afternoon.
Attachment #8835753 - Flags: review?(jorgk) → review+
Comment on attachment 8835753 [details] [diff] [review] patch Review of attachment 8835753 [details] [diff] [review]: ----------------------------------------------------------------- ::: mail/test/mozmill/composition/test-attachment-reminder.js @@ +434,5 @@ > let kOfferThreshold = "mail.compose.big_attachments.threshold_kb"; > let maxSize = Services.prefs.getIntPref(kOfferThreshold, 0) * 1024; > + let file = Services.dirsvc.get("ProfD", Components.interfaces.nsIFile); > + file.append("panacea.dat"); > + add_attachment(cwc, "file://" + file.path, maxSize); This isn't really going to work on windows is it? Probably Services.io.newFileURI(file).spec
(In reply to :aceman from comment #17) > Magnus do you know anything about this? > Do the http://www.example.com/1 URLs return anything useful? See e.g. at Not that I'm aware of.
Flags: needinfo?(mkmelin+mozilla)
(In reply to Magnus Melin from comment #24) > This isn't really going to work on windows is it? Probably > Services.io.newFileURI(file).spec I don't get it. I was made to believe that there was a successful try run (comment #19): https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=c876273933b2abcbb32e8bb344a1f79c9052e5ae&selectedJob=76091070 So it works on Windows, no? What's going on?
Aceman explained to me in a PM that at first he used "file://panacea.dat" which failed. The current code has "file://" + file.path. I'll compare this to Services.io.newFileURI(file).spec before landing.
https://hg.mozilla.org/comm-central/rev/009752635869e308f08ee9819abf80e1159e6b2d I went with Magnus' suggestion since the result looked more civilised ;-)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 54.0
Aceman was really interested in the difference, so I'll document it here: file:///c:/mozilla-source/comm-central/obj-x86_64-pc-mingw32/_tests/mozmill/mozmillprofile/panacea.dat file://c:\mozilla-source\comm-central\obj-x86_64-pc-mingw32\_tests\mozmill\mozmillprofile\panacea.dat
Thanks Magnus, my nsIFile.path seems to have worked also on Windows, but the Services.io.newFileURI(file).spec seems more correct as the add_attachment/AddAttachment() handles URLs.
Thanks Jorg for comparing the produced strings and fixing the patch.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: