Closed Bug 483902 Opened 16 years ago Closed 15 years ago

leopard/tiger talos boxes require flash upgrade to run new pageset

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: anodelman)

I have yet to see moz-central browsers complete a tp run using the new page set on leopard or tiger. The observed behavior is that the browser successfully completes 5-7 cycles and then stops loading new pages. No crash stacks are collected. Some errors found in the JS console include:

Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIFileOutputStream.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: file:///Users/mozqa/talos-slave/mac-trunk/Minefield.app/Contents/MacOS/components/nsSessionStore.js :: sss_writeFile :: line 2519" data: no]
Source File: file:///Users/mozqa/talos-slave/mac-trunk/Minefield.app/Contents/MacOS/components/nsSessionStore.js
Line: 2519

Error: uncaught exception: [Exception... "Illegal document.domain value" code: "1009" nsresult: "0x805303f1 (NS_ERROR_DOM_BAD_DOCUMENT_DOMAIN)" location: "http://localhost/page_load_test/pages/www.worldofwarcraft.com/www.worldofwarcraft.com/new-hp/js/functions.js Line: 5"]

Error: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIFileOutputStream.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: file:///Users/mozqa/talos-slave/mac-trunk/Minefield.app/Contents/MacOS/components/nsSessionStore.js :: sss_writeFile :: line 2519" data: no]
Source File: file:///Users/mozqa/talos-slave/mac-trunk/Minefield.app/Contents/MacOS/components/nsSessionStore.js
Line: 2519

Error: uncaught exception: [Exception... "Component returned failure code: 0x80570019 (NS_ERROR_XPC_CANT_CREATE_WN) [nsIJSCID.getService]" nsresult: "0x80570019 (NS_ERROR_XPC_CANT_CREATE_WN)" location: "JS frame :: chrome://global/content/macWindowMenu.js :: checkFocusedWindow :: line 7" data: no]

The sessionstore write failure has been observed during several tests. The sessionstore file does exist and is writable, so it's unknown why this error is being thrown.
I have also seen pages fail to load correctly: they appear as if style sheets have not been loaded, and are displayed as a simple column of HTML without images, animation or layout. This could indicate a caching error. It is also possible that the new page set is in some way causing a failure in the pageloader extension, thus halting the pageloader and stopping the test from advancing. This would have to do with interfering with the onload handlers or some other component of the pageloader.
Just to clarify, moz-central browsers are able to complete the new tp test on winxp, vista and ubuntu. You can see these machines cycling on the MozillaTest waterfall.
Assignee: nobody → anodelman
I had a theory that what I was seeing was caching errors, so I added: browser.cache.disk.capacity : 0 to the talos prefs configuration. The tp test then ran to completion. I'm going to install this pref to both tiger/leopard and see if I can get more than a single successful run.
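For anyone wanting to try the same pref outside of talos, here is a minimal sketch of what the change amounts to at the profile level. It assumes the pref ends up in the profile's user.js; the talos prefs configuration itself may use a different file format. The helper name `set_pref` and the temporary profile directory are hypothetical, while `user_pref(...)` is the usual prefs-file syntax.

```python
import json
import os
import tempfile


def set_pref(profile_dir, name, value):
    """Append a user_pref() line to user.js in the given profile directory."""
    # json.dumps renders ints, booleans and strings the way prefs files
    # expect them: 0, true/false, "quoted string".
    line = "user_pref(%s, %s);\n" % (json.dumps(name), json.dumps(value))
    with open(os.path.join(profile_dir, "user.js"), "a") as prefs:
        prefs.write(line)
    return line


# Hypothetical throwaway profile directory, used only for this illustration.
profile_dir = tempfile.mkdtemp(prefix="talos-profile-")

# A capacity of 0 leaves the disk cache no room to store entries.
written = set_pref(profile_dir, "browser.cache.disk.capacity", 0)
```

A browser started with that profile would pick the pref up on launch.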
Lots of successful runs overnight. Tiger still occasionally freezes up, but it looks more like the failures that we see occasionally on the production talos moz-central testers. I would be pretty confident in saying that the issue here lies with a corrupted or damaged cache being created on mac.
Assignee: anodelman → nobody
Component: Release Engineering → Networking: Cache
Product: mozilla.org → Core
QA Contact: release → networking.cache
Version: other → Trunk
Moved to Core/Networking-Cache as I seem to be generating a corrupted cache when I run the new pageset on tiger/leopard.
Here's a profile used by a frozen leopard test box: http://people.mozilla.org/~anodelman/profile.zip
Are there any specific pages that are hanging more often?
I haven't seen any pattern in the pages that it gets stuck on.
I put an http log of a failed run up at http://campd.org/stuff/cache.log.gz

It's giant, but if you search for 'cacheMap', you'll see that at some point the cache is getting confused:

-1605746784[50a960]: Destroying nsHttpTransaction @7889700
-1605746784[50a960]: nsHttpChannel::FinalizeCacheEntry [this=d6c5130]
-1605746784[50a960]: calling OnStopRequest
-1605746784[50a960]: CACHE: Flush [90df8358 doomed=0]
-1605746784[50a960]: CACHE: DeleteStorage [90df8358 0]
-1605746784[50a960]: WARNING: cacheMap->DeleteStorage() failed.: file ../../../../mozilla/netwerk/cache/src/nsDiskCacheStreams.cpp, line 517
WARNING: cacheMap->DeleteStorage() failed.: file ../../../../mozilla/netwerk/cache/src/nsDiskCacheStreams.cpp, line 517
-1605746784[50a960]: CACHE: DeleteRecord [90df8358]
-1605746784[50a960]: ###!!! ASSERTION: Flush() failed: 'NS_SUCCEEDED(rv)', file ../../../../mozilla/netwerk/cache/src/nsDiskCacheStreams.cpp, line 461
###!!! ASSERTION: Flush() failed: 'NS_SUCCEEDED(rv)', file ../../../../mozilla/netwerk/cache/src/nsDiskCacheStreams.cpp, line 461
-1605746784[50a960]: nsHttpChannel::CloseCacheEntry [this=d6c5130]
-1605746784[50a960]: Deactivating entry 98166f0
-1605746784[50a960]: Removed deactivated entry 98166f0 from mActiveEntries
-1605746784[50a960]: CACHE: disk DeactivateEntry [98166f0 90df8358]
-1605746784[50a960]: CACHE: WriteDiskCacheEntry [90df8358]
-1605746784[50a960]: CACHE: UpdateRecord [90df8358]
-1605746784[50a960]: ###!!! ASSERTION: record not found: 'Not Reached', file ../../../../mozilla/netwerk/cache/src/nsDiskCacheMap.cpp, line 462
###!!! ASSERTION: record not found: 'Not Reached', file ../../../../mozilla/netwerk/cache/src/nsDiskCacheMap.cpp, line 462
-1605746784[50a960]: WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF: file ../../../../mozilla/netwerk/cache/src/nsDiskCacheMap.cpp, line 867
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF: file ../../../../mozilla/netwerk/cache/src/nsDiskCacheMap.cpp, line 867
-1605746784[50a960]: CACHE: DeleteStorage [90df8358 0]
-1605746784[50a960]: CACHE: DeleteStorage [90df8358 1]
-1605746784[50a960]: CACHE: DeleteRecord [90df8358]
-1605746784[50a960]: ###!!! ASSERTION: deleting dirty buffer: 'mBufDirty == PR_FALSE', file ../../../../mozilla/netwerk/cache/src/nsDiskCacheStreams.cpp, line 750
###!!! ASSERTION: deleting dirty buffer: 'mBufDirty == PR_FALSE', file ../../../../mozilla/netwerk/cache/src/nsDiskCacheStreams.cpp, line 750
-1605746784[50a960]: Destroying nsHttpChannel @d6c5130

This pattern repeats itself a few more times before talos gets stuck. This definitely seems to be related. Runs that failed always seemed to generate this failure, and runs that succeeded never did. The talos machine I was looking at stopped reproducing it, but it should be possible to reproduce again, using the new talos pageset and the pageloader extension - the rest of the talos setup didn't seem to be necessary.
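Since the log is giant, a small script may be an easier way to check whether a given run hit this pattern. The following is only a sketch: the marker strings are taken from the excerpt in the comment above, and the function names are hypothetical. Note that each warning and assertion shows up twice in the log (with and without the thread prefix), so the counts come out doubled.

```python
import gzip
import re
from collections import Counter

# Markers copied from the log excerpt: the DeleteStorage warning and any
# "###!!! ASSERTION: <message>:" line.
MARKERS = re.compile(
    r"cacheMap->DeleteStorage\(\) failed|###!!! ASSERTION: ([^:]+):"
)


def count_cache_failures(lines):
    """Count cache warning/assertion markers in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = MARKERS.search(line)
        if match:
            # group(1) is only set for ASSERTION lines.
            counts[match.group(1) or "DeleteStorage() failed"] += 1
    return counts


def scan_log(path):
    """Scan a gzip-compressed log file without unpacking it to disk."""
    with gzip.open(path, "rt", errors="replace") as log:
        return count_cache_failures(log)
```

Calling `scan_log()` on a downloaded copy of the log would return a counter keyed by failure message; going by the observation above, a run that succeeded would be expected to produce an empty one.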
I'm no longer seeing this behavior when running talos with the new page set.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
I've changed my mind here, and I think that this is still occurring. I'd like some confirmation from dcamp before I attempt to close it again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Is there any up-to-date documentation about how to set up this tp test? https://wiki.mozilla.org/Performance:Tinderbox_Tests seems to be obsolete.
Is it possible to download the new page set somewhere? StandaloneV1_5.zip seems to contain a different set, since it has no www.worldofwarcraft.com page.
jst - any update here? Is there anything else that I can provide that would help the hunt? (Have already responded to comment #13 on irc and provided the pageset in question).
I've downloaded the pageset and I'm able to reproduce the bug. Now I'm trying to find out what's wrong with the cache.
(In reply to comment #15)
> I've downloaded the pageset and I'm able to reproduce the bug. Now I'm trying
> to find out what's wrong with the cache.

Not trying to add any pressure, but any update? Obviously, we're anxious to use the new pageset in production, but I'm also concerned that this might be a FF3.5 blocker.
It took me so long because it takes hours to reproduce the bug. But in the end the problem is quite simple. There is no problem with the cache. The cache gets confused because OpenNSPRFileDesc() in nsDiskCacheStreamIO::OpenCacheFile() fails to create/open a cache file. For some reason the firefox process sometimes doesn't close the file "/Library/Internet Plug-Ins/Flash Player.plugin/Contents/Resources/Flash Player.rsrc". So when viewing sites with flash content, after some time firefox reaches the limit on open files, which is 256 on Mac OS X by default. Is there anybody who knows the plugin code and can look at the reason why this happens?
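The failure mode described above (descriptors leak until the per-process limit is hit, after which every further open fails) can be illustrated with a small sketch. It assumes a Unix-like system and uses a deliberately lowered limit of 64 rather than the 256 mentioned above, so that it finishes quickly; the function name is hypothetical and os.devnull stands in for the leaked file.

```python
import errno
import os
import resource


def open_until_exhausted(soft_limit=64):
    """Leak descriptors under a lowered limit; return (number opened, errno)."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    leaked = []
    try:
        while True:
            # Never closed inside the loop, mimicking the leaked handle.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return len(leaked), exc.errno
    finally:
        # Clean up and restore the original limit.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


opened, err = open_until_exhausted()
print(opened, errno.errorcode.get(err))
```

The number opened is somewhat below the limit because the process already holds a few descriptors (stdin, stdout, stderr and so on). Once the limit is reached, any unrelated open in the same process, such as creating a cache file, would fail in the same way.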
(In reply to comment #17)
> It took me so long because it takes hours to reproduce the bug. But in the end
> the problem is quite simple. There is no problem with the cache. Cache gets
> confused because OpenNSPRFileDesc() in nsDiskCacheStreamIO::OpenCacheFile()
> fails to create/open a cache file. For some reason firefox process sometimes
> doesn't close file "/Library/Internet Plug-Ins/Flash
> Player.plugin/Contents/Resources/Flash Player.rsrc". So when viewing sites with
> flash content after some time firefox reaches a limit of opened files which is
> 256 on Mac OS X by default.

Thanks Michal, Alice. Sounds like this new pageset did trip over something that could be a blocker, hence nom'd.

> Is there anybody who knows plugin code and can look at the reason why this
> happens?
Flags: blocking1.9.1?
Do we know if this is a problem in our code or Flash? If it's flash keeping file handles open, I'd say minus.
Michal, thanks for digging in here! Josh, is "/Library/Internet Plug-Ins/Flash Player.plugin/Contents/Resources/Flash Player.rsrc" by any chance a file that we open repeatedly in the plugin code and forget to ever close, or is this a file descriptor leak in the flash player?
QA Contact: networking.cache → joshmoz
Not blocking final release, looking into it for 3.5.x, based on the fact that so far the only people to run into the problem are releng getting the page set to run, not users. Josh: please see comment 20 in a hurry, and renominate this if you think my judgement is wrong, thanks.
Component: Networking: Cache → Plug-ins
Flags: wanted1.9.1.x?
Flags: blocking1.9.1?
Flags: blocking1.9.1-
QA Contact: joshmoz → plugins
Is the box running Flash 9? See bug 397053, you might need to upgrade to Flash 10 to fix this. Mac OS X 10.5.7 might come with Flash 10, iirc Apple updated it, but I could be misremembering. If that is true though then all you need to do is update to 10.5.7.
(In reply to comment #21)
> Not blocking final release, looking into it for 3.5.x, based on the fact that
> so far the only people to run into the problem are releng getting the page set
> to run, not users.

Beltzner: we're hitting this in our new Top100 website pageset. It's not clear which specific page(s) are causing this, but as all of them are in the top 100 websites, it feels like it might quickly become an urgent requirement to fix. Hence the blocker request. The nom- is fine for now, but depending on tests below, I may re-nom.

> Josh: please see comment 20 in a hurry, and renominate this if you think my
> judgement is wrong, thanks.

(In reply to comment #22)
> Is the box running Flash 9? See bug 397053, you might need to upgrade to Flash
> 10 to fix this. Mac OS X 10.5.7 might come with Flash 10, iirc Apple updated
> it, but I could be misremembering. If that is true though then all you need to
> do is update to 10.5.7.

I've just looked at 5 talos leopard machines, and they had:
OSX 10.5.2
Flash: 9.0.115

Josh: We intentionally do *not* upgrade s/w on Talos machines unless we *need* to, because changes like this typically cause changes in perf data results. That means recalibrating results, discussions about what to do with regenerating results for historical milestone releases, and a serious downtime! However, if that's what it takes, so be it.

Alice: on *one* staging talos machine could you see if:
- the Flash upgrade by itself fixes the problem?
- an O.S. upgrade *and* Flash upgrade fixes the problem?

If either of these works, then we should all regroup and figure out the next step.
(In reply to comment #23)
> Beltzner: we're hitting this in our new Top100 website pageset. Not clear which
> specific page(s) are causing this, but as all of them are in the top 100

Pages containing flash are:
http://localhost/page_load_test/pages/www.youtube.com/www.youtube.com/index.html
http://localhost/page_load_test/pages/www.imdb.com/www.imdb.com/index.html
http://localhost/page_load_test/pages/www.bbc.co.uk/www.bbc.co.uk/index.html
http://localhost/page_load_test/pages/www.nicovideo.jp/www.nicovideo.jp/index.html
http://localhost/page_load_test/pages/www.gamespot.com/www.gamespot.com/index.html
http://localhost/page_load_test/pages/www.blogfa.com/www.blogfa.com/index.html
http://localhost/page_load_test/pages/www.maktoob.com/www.maktoob.com/index.html
http://localhost/page_load_test/pages/www.spiegel.de/www.spiegel.de/index.html
http://localhost/page_load_test/pages/www.jugem.jp/jugem.jp/index.html
http://localhost/page_load_test/pages/www.marca.com/www.marca.com/index.html
http://localhost/page_load_test/pages/www.ku6.com/www.ku6.com/index.html
http://localhost/page_load_test/pages/www.it168.com/www.it168.com/index.html
http://localhost/page_load_test/pages/www.corriere.it/www.corriere.it/index.html
http://localhost/page_load_test/pages/www.people.com.cn/www.people.com.cn/index.html
http://localhost/page_load_test/pages/www.minijuegos.com/www.minijuegos.com/index.html
http://localhost/page_load_test/pages/www.yam.com/www.yam.com/index.html
http://localhost/page_load_test/pages/www.nnm.ru/www.nnm.ru/index.html

Running tp test only with these pages speeds up the failure.

> I've just looked at 5 talos leopard machines, and they had:
> OSX 10.5.2
> Flash: 9.0.115

I have OSX 10.4.10 with flash 9.0.22. Upgrading just flash to 10.0.22.87 seems to help.
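A reduced run over only the Flash pages, as described in the comment above, could be prepared by filtering the full page list. This is a rough sketch that assumes the pageloader manifest is a plain text file with one URL per line, which is not confirmed anywhere in this bug; the function name is hypothetical and the site names are the ones listed above.

```python
from urllib.parse import urlparse

# Site directories of the Flash pages listed in the comment above.
FLASH_SITES = {
    "www.youtube.com", "www.imdb.com", "www.bbc.co.uk", "www.nicovideo.jp",
    "www.gamespot.com", "www.blogfa.com", "www.maktoob.com", "www.spiegel.de",
    "www.jugem.jp", "www.marca.com", "www.ku6.com", "www.it168.com",
    "www.corriere.it", "www.people.com.cn", "www.minijuegos.com",
    "www.yam.com", "www.nnm.ru",
}

PAGES_PREFIX = "/page_load_test/pages/"


def flash_only(manifest_lines):
    """Keep only manifest URLs whose site directory is in FLASH_SITES."""
    kept = []
    for line in manifest_lines:
        url = line.strip()
        path = urlparse(url).path
        if not path.startswith(PAGES_PREFIX):
            continue
        # The first path component after the prefix is the site directory.
        site = path[len(PAGES_PREFIX):].split("/", 1)[0]
        if site in FLASH_SITES:
            kept.append(url)
    return kept
```

Writing the returned URLs to a new manifest, one per line, would give a page set that only exercises the Flash pages.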
bz and msintov seemed to imply there was an underlying Gecko bug causing Flash to keep eating up the file descriptors; see bug 397053 comment 30 and bug 397053 comment 33. If you didn't want to upgrade Flash and didn't want to fix the constant "reopening of the resource map" or whatever, you could also up the file descriptor limit as we did as a work-around in Camino (which would make the bug harder, but not impossible, to trigger, both for Talos and for actual users who aren't using Flash 10); see bug 401138.
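For reference, raising the file descriptor limit from within a process looks roughly like the sketch below. It assumes a Unix-like system; the target of 1024 is an arbitrary illustration, not the value used in Camino, and the function name is hypothetical.

```python
import resource


def raise_fd_limit(target=1024):
    """Try to raise the soft open-file limit to `target`.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit can never exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # Some systems cap the value below the reported hard limit;
            # keep the existing limit in that case.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print(raise_fd_limit())
```

As noted above, this only postpones the failure: a process that keeps leaking descriptors will still reach the higher limit eventually.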
Doh. I filed bug 496344 to track hunting that down. It's nice to have a testcase to test fixes against!
Upgrading flash has allowed the pageset to run to completion on leopard. Should we:
- upgrade flash on all mac talos boxes
- upgrade flash on all talos boxes (to try and get parity)
- something else?
Component: Plug-ins → Networking: Cache
QA Contact: plugins → networking.cache
Component: Networking: Cache → Plug-ins
QA Contact: networking.cache → plugins
Possible fix for upgrading flash throughout talos slave pool with bug 475383. Have it working on stage using moz-central builds. But the fix for loading plugins from profiles isn't on 1.9.1 or Firefox3.0.
Assignee: nobody → anodelman
Component: Plug-ins → Release Engineering
Product: Core → mozilla.org
QA Contact: plugins → release
Summary: leopard/tiger talos freezing on new page set → leopard/tiger talos boxes require flash upgrade to run new pageset
Version: Trunk → other
(In reply to comment #28)
> Possible fix for upgrading flash throughout talos slave pool with bug 475383.
>
> Have it working on stage using moz-central builds. But the fix for loading
> plugins from profiles isn't on 1.9.1 or Firefox3.0.

Beltzner: Gentle ping - can we get approval to land this, so we can enable tp4 talos on the mozilla-1.9.1 branch?
Fixed by downloading plugins per talos run - thus they can be updated at will and installed through the profile used by talos during testing.
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering