Closed Bug 184204 Opened 22 years ago Closed 22 years ago

initially generated XUL.mfasl is corrupt and breaks chrome/GUI

Categories

(Core :: XPCOM, defect, P2)

DEC
Linux
defect

Tracking

()

RESOLVED DUPLICATE of bug 201574
mozilla1.4alpha

People

(Reporter: mach1, Assigned: alecf)

References

Details

Attachments

(9 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (X11; U; Linux alpha; en-US; rv:1.3a) Gecko/20021207
Build Identifier: Mozilla/5.0 (X11; U; Linux alpha; en-US; rv:1.3a) Gecko/20021207

When started with no profile (i.e., no ~/.mozilla) the browser starts fine and
runs properly. But when the browser is terminated and restarted, several
javascript and other errors are reported and the main window displays
non-responsive, bare-bones versions of all controls.

Reproducible: Always

Steps to Reproduce:
1. rm -R ~/.mozilla
2. ./mozilla (and exit once browser starts up correctly)
3. ./mozilla (this produces the errors and incorrect display)

Actual Results:  
The incorrect display appears to contain all of the controls, but in a minimal
form. For example, all of the menu items display only the first two letters of
the labels and the main document area is very small and positioned in the upper
left corner rather than filling the remaining space in the window. None of the
controls is operable.

Expected Results:  
The browser should have restarted and behaved exactly as it did when started
with no user profile.
I don't see this problem with PC-Linux on the trunk, and didn't see it with
1.2.1 for Alpha-Linux.

what build options did you use?
I built the browser yesterday (07-DEC-2002) in a clean directory using the
typical procedure:

cvs co -f mozilla/client.mk
cd mozilla
gmake -f client.mk

My .mozconfig file contains the single line:

ac_add_options --enable-crypto

This is the same procedure I have used for many months with no problems. The
reported behavior first appeared in a build I made on 23-NOV-2002.
Hrm... perhaps the automagically generated profile files are not good.  None of
the files is actually critical (they should all be regenerated if they're
missing).  So perhaps, once it's busted, go in your profile dir and delete
individual files and see if it works again (even just once)
You are correct. When I delete ~/.mozilla/default/....slt/XUL.mfasl and restart
the browser the main window displays properly. When I exit, the file is
recreated and the browser fails to display the window properly when it is restarted.

So now what? Would you like me to send a copy of XUL.mfasl? If so, what type of
attachment should I use?

please just gzip/bzip2 it and create an attachment here.  I don't have the
expertise you need.

==> XP/Widget/XUL
Assignee: asa → hyatt
Component: Browser-General → XP Toolkit/Widgets: XUL
QA Contact: asa → shrir
Summary: Functions OK when first started, but incorrectly displays main window when restarted → initially generated XUL.mfasl is corrupt and breaks chrome/GUI
ccing people who should really be in on this...
Yes, please attach the file. If it is too large to be attached, you can either
'split(1)' the file into two parts, or just mail it to me <jrgm@netscape.com>

> ... When I exit, the file is recreated ...

Just to be clear, is the file recreated when you _start_ the browser, or when
you _exit_ the browser. (i.e., start with no XUL.mfasl file in place; when 
you have started, check the timestamp on that file; then wait one minute and
then exit. Does the timestamp or filesize change on exit. [It should
not change at exit]).

> The reported behavior first appeared in a build I made on 23-NOV-2002.

What was the date of the last build that you made that did not have this 
problem?
Assignee: hyatt → ben
Status: UNCONFIRMED → NEW
Ever confirmed: true
Attached file gzip version of XUL.mfasl (deleted) —
I have attached a gzip version of XUL.mfasl. The file is created when the
browser starts (i.e., its size and date do not change upon exit or even restart
of the browser).

I am not sure which build was the most recent one that did not exhibit the
problem. I believe it was probably either 09-NOV-2002 or 16-NOV-2002 (I
typically do a build every week or two, but only keep the most recent one or two).
Thanks for attaching the XUL.mfasl. Unfortunately, I cannot reproduce this
bug with either trunk builds on win32 or Linux (builds which I've hacked
to bypass some mtime/checksum/path sanity checks that would normally 
prevent one from running a fastload file from another build environment).

Is there anything else I can do to assist? I am not familiar with the browser
code, but in my experience with my own code, problems related to reading and
writing binary files on my Alpha often arise because of unconscious assumptions
I make based on my experience with 32-bit machines that do not apply to the
Alpha's 64-bit architecture with its preference for aligned data.

Is the format of the XUL.mfasl file common across all architectures or is it
peculiar to each platform? If common, has there been a recent change to the code
that makes size, alignment, or byte order assumptions that are not valid for the
Alpha or other 64-bit machines?
Blocks: FastLoadHang
Reply to comment 13:

Can you try again with a recent build ? (see bug 169777 comment 137)
The build I attempted on 18-JAN-2003 failed as did the build I attempted this
morning. The last successful build was on 11-JAN-2003. As a result, I cannot
test the recent fix.
I just built 1.3b for Alpha to see if I could reproduce this, and I don't see
exactly the same behavior as described in comment 0, but instead get a very
badly constructed composer window on the second launch.  Deleting the XUL.mfasl
fixes it and the next startup is successful.

Is there any useful debugging info I can retrieve by littering the code with
printfs?
Attached file gzipped XUL.mfasl (obsolete) (deleted) —
this is the XUL.mfasl I get after deleting the XUL.mfasl, and then doing:
mozilla -P clean about:blank
Are you sure you are running the 1.3b build that you built? 

The fastload file you attached gives the patch to the chrome directory as 
'/home/browser/07dec2002/mozilla/dist/bin/chrome' and the timestamps of the
XUL.mfasl inside the zip file and of the jar files that were used to generate 
the contents of the fastload file are all from Dec 7th, 2002.

To answer an earlier question...

> Is the format of the XUL.mfasl file common across all architectures or is it
> peculiar to each platform? If common, has there been a recent change to the 
> code that makes size, alignment, or byte order assumptions that are not valid 
> for the Alpha or other 64-bit machines?

The format is common across architectures (with the one OS dependent item being
the format of file paths, e.g., "C:\foobar" vs. "/home/foobar").

The binary data is stored in network order, so there is (or should not be) any
big- vs. little-endian issues. I'm not sure whether there might be implicit 
alignment issues that affect Alpha.

But I guess I'd like to know first if a current trunk build does demonstrate
any problems (particularly after the most recent checkin by brendan this 
morning). [Sorry, I have to be a little skeptical of Andrew's attached 
XUL.mfasl since the timestamps indicate that this was not a 1.3b build that
was used to generate that file].
Attached file the real XUL.mfasl (deleted) —
ya, what I just attached was actually mach1's that I had laying around.  this
one is for real.
Attachment #114613 - Attachment is obsolete: true
> Is there any useful debugging info I can retrieve by littering the code with
> printfs?

Is this a debug build? Are there any assertions, or odd warnings in stdout? 
> Is this a debug build? Are there any assertions, or odd warnings in stdout?

it was an optimized build.  :(  wishful thinking that I wouldn't see this bug
and I could actually use the build.

I'm building CVS-by-date to narrow the regression.  12-01-2002 has the bug.
I can build one with debug.
I'd suggest building debug to track this down, but first checking that it 
does not occur in a trunk build (there have been possibly relevant changes
since 1.3b was finished).

These lines from mach1's earlier logs are suspicious, particularly the file
channel failure to load. Why are those strings truncated?

  > ...
  > WEBSHELL+ = 2
* > CSS Error (chrome://navigator/content/navigator.xul :1.1): Expected ':' but 
found 'm'.  Declaration dropped.
  > Note: verifyreflow is disabled
  > ...
  > 
  > charset = ISO, file nsFontMetricsGTK.cpp, line 2234
* > Error reading file jar:resource:///chrome/comm.jar!/content/navigator/rd
* > Error reading file jar:resource:///chrome/comm.jar!/content/navigator/r
* > Error reading file jar:resource:///chrome/comm.jar!/content/navigator/c
  > WEBSHELL+ = 3
  > ...

Andrew, Question: is the pattern that you are observing something like:
  1) delete XUL.mfasl file; 
  2) start browser; all OK every time; quit
  3) start again; open composer; not OK every time; quit
  4) go back to 1, and repeat this pattern
1) delete XUL.mfasl file; 
2) start browser; all OK every time; quit
3) start again; open composer; not OK every time; quit
4) go back to 1, and repeat this pattern

yes.  that's it absolutely.
I'm pulling CVS head now...
Okay, so that pattern suggests (to me) that this is a different issue than
the fastload bug of 169777 and variants. In fact, this may not be a problem
with fastload code per se, but due to the fact that the existence of a valid 
fastload file can alter the code paths that are followed elsewhere. But it will 
be good to see what a tip debug build says first, and then backtrack. 

[Sorry that this got deferred. I had assumed that this bug would be taken care 
of with the other fastload fixes (although that may still be the case)].
I disabled debug on my CVS tip build in an attempt to workaround bug 190794, but
got the build working eventually.  It still has this bug.  I'll try a respin
with debug enabled tonite.
after playing CVS-by-date for a while, I've determined that this became busted
with the second checkin for bug 177401 (Nov 14).  (checkin comment: "the
previous checkin had a typo which disabled fastload entirely!"  duh.)

applying shanmu's 64bit alignment patch (bug 177401 comment 16, adjusted to be
not OSF-specific) actually fixes this bug.  Is there a way I can verify that it
isn't just ignoring the fastload file like it was before the second checkin?

does this mean that NS_SWAP16 is totally busted on Alpha Linux?  How can I test
that?

I still need to verify that the patch fixes the trunk...
Does that patch need to handle both places where we NS_SWAP16 in that method? 
Or it it only doing the swap inside the big buffer that breaks?  (The other
place is where we deal with a possible carryover byte.)
The other place, where we deal with a possible carryover byte, is safe.
In fact, this patch is trying to be consistent in the way outString->Append 
is used.
Attached patch debugging patch (deleted) — Splinter Review
this patch tries both ways (NS_SWAP16 and manually swapping), compares the
results and prints them out.  If the two results differ, it prints both and
also the unswapped unicodeSegment[i/2].

sometimes it works fine:
0 61 0
1 70 0
2 70 0
3 6c 0
4 69 0
5 63 0
6 61 0
7 74 0
8 69 0
9 6f 0
10 6e 0
11 2f 0

but sometimes it doesn't.  it seems to screw up every 4th element.  the
unswapped unicodeSegment[i/2] appears to be fine.
0 6e 0
1 61 0 0 0 61 0
2 76 0
3 69 0
4 67 0
5 61 0 0 0 61 0
6 74 0
7 6f 0
8 72 0
9 2d 0 0 0 2d 0
10 74 0
11 6f 0
12 6f 0
13 6c 0 0 0 6c 0
14 62 0
15 6f 0
16 78 0
0 74 0 0 0 74 0
1 6f 0
2 6f 0
3 6c 0
4 62 0 0 0 62 0
5 6f 0
6 78 0
7 2d 0
8 74 0 0 0 74 0
9 6f 0
10 70 0
it looks like Mozilla 1.3 is going to miss the Alpha/Linux boat, but it sounds
like this bug would easy to fix and the fix would help out other 64bit platforms
(by making the code 64bit-clean).

nominating for 1.4a

XUL no longer seems to be the proper component
==> xpcom
Assignee: ben → dougt
Component: XP Toolkit/Widgets: XUL → XPCOM
Flags: blocking1.4a?
QA Contact: shrir → scc
I'll take this.

/be
Assignee: dougt → brendan
Priority: -- → P2
Target Milestone: --- → mozilla1.4alpha
Status: NEW → ASSIGNED
Flags: blocking1.4a? → blocking1.4a-
I'm not getting anywhere reading code.  If there's an aliasing bug, and the
Linux Alpha compiler is assuming pointers to different types can't alias the
same piece of memory, I don't see it (although the carryover byte case does cast
the bytes array to (PRUnichar*) -- could that be the problem?  Stores scheduled
for the bytes array initializer, even if not committed to memory, should be
coherently found in a write buffer on the same CPU by the loads from the cast).

Cc'ing bbaetz, bouncing to alecf.

/be
Assignee: brendan → alecf
Status: ASSIGNED → NEW
I don't think that gcc does aliasing in c++ before 3.3, so thats not it. Is
someone casting void*s to an int, or something?

I don't suppose you can get a small test program of some sort?
I finally figured out what the difference was between the cases where it did
work and didn't work (comment 30).

It fails when &(unicodeSegment[i/2]) ends in 7 or f (0x1202d2e8f).  this
explains why (even when it does fail) it only fails every fourth time.  The long
stretches when it does work are a result of &(unicodeSegment[i/2]) being even.

also, when it does fail, ((unicodeSegment[i/2]) >> 8) is 0.  it seems to lose
the byte when it shifts it across an 8-byte boundary.

I'll attach a small C program that shows this.
Attached file simple C program (deleted) —
output included at top
Sounds like a compiler bug, although I would've expected a bus error on all the
misaligned PRUnichar loads, at least on some architectures.  It would be best to
reduce the testcase to one that didn't rely on misaligned loads, and on casting
addresses to different pointer type.  Can that be done, or does the bug go away?


Should nsBinaryStream.cpp stop using a bytes array and just shift and bitwise-or
to construct the two-byte sequence containing the carryover byte?  If that works
around the bug, I say alecf should check it in for 1.4a.

Can someone search the gcc bug database?

/be
Avoid misaligned loads and cast-to-pointer-to-wider-type aliasing.


This is the patch to apply.  I haven't tested it yet.  Reviewers may prefer the
cvs diff -wpu8 that's coming up next.

/be
Andrew, can you show the assembly code, or disassemble the function from the
object file using objdump, for your test program?

/be
Attached file objdump -d ns_swap16 (deleted) —
I can stepi through execution with gdb if that would be easier to read.
Andrew, can you try the patch (attachment 118472 [details] [diff] [review])?

/be
Flags: blocking1.4a- → blocking1.4a+
the patch works great.
does the misaligned if branch need to be #ifdef IS_LITTLE_ENDIAN ?
No, big-endian CPUs (most RISCs, last I checked -- a while ago) don't like
misaligned loads and stores, either.  Someone more up-to-date on CPU
architectures and the memory models that compilers erect on them should comment.

I'd like to get this patch in for 1.4a.  Alecf, could you review?

/be
Flags: blocking1.4a+ → blocking1.4a-
Comment on attachment 118472 [details] [diff] [review]
proposed fix, plus useless trailing whitespace elimination

But do read the diff -w patch for less eyestrain.

/be
Attachment #118472 - Flags: review?(alecf)
did you really mean to set this to 1.4a- or was that a miscue. (Comments 
suggest it was).
Flags: blocking1.4a- → blocking1.4a?
Comment on attachment 118472 [details] [diff] [review]
proposed fix, plus useless trailing whitespace elimination

seems reasonable.. a configure test would be nice so that we didn't have to do
a count>1 test on all architectures
Attachment #118472 - Flags: review?(alecf) → review+
1.4a is out.
Flags: blocking1.4a? → blocking1.4a-
Status: NEW → ASSIGNED
Darin's got the money patch in bug 201574.  Please cc: yourselves there, I'm
gonna dup this against that bug.

/be

*** This bug has been marked as a duplicate of 201574 ***
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: