Open Bug 61363 (latemeta)
Opened 24 years ago
Updated 3 years ago
Make sure that chardetng-triggered encoding reload is read from the cache
Categories
(Core :: DOM: HTML Parser, enhancement, P5)
Tracking
NEW
Future
People
(Reporter: pollmann, Unassigned)
Details
(Keywords: helpwanted, testcase, Whiteboard: Please read comment 115.)
Attachments
(1 file)
patch (deleted)
This is a follow-on to bug 27006. We need to come up with a "real" fix for this
problem. Since I'm hoping that we can somehow force the reload to come from the
cache instead of the server, I'm starting out with this assigned to Gagan. I'll
send out an email trying to get a meeting set up so we can work out more details.
->to cache
Assignee: gagan → neeti
Component: Networking → Networking: Cache
QA Contact: tever → gordon
Target Milestone: --- → M1
Eric, what do you need from the cache and/or http?
Target Milestone: mozilla0.9 → mozilla0.9.1
Darin, this looks like a dup of the other <meta> charset bug you're working on.
Assignee: gordon → darin
Comment 6•24 years ago
actually, i'm going to use this bug to track this problem. we need:
1) support for overlapped i/o in the disk cache.
2) ability to background the first load and let it finish on its own.
If we had 1), then I wonder whether, for blocked layout, uninterrupted streaming of
data to the cache would be a better solution than filling up the pipes and
subsequently taking the socket off the select list. This way our network
requests would never pause for layout/other blocking events -- that means we'd be
fast. Maybe we need a PushToBackground (with a better name) on nsIRequest. The
implementation of PushToBackground would simply take all "end" listeners off and
continue to stream data to the cache. So consumers that are currently killing
our first channel would just push it to the background and make new requests for the
same URL. What say?
Comment 8•24 years ago
agreed.. we do need some sort of communication from the parser to http to tell
it to keep going "in the background"... there are two options i see...
1) parser could just eat all the data; then, http would not even need to be made
aware of what's going on.
2) parser could return a special error code from OnDataAvailable that would
instruct HTTP to not call OnDataAvailable anymore, but to just continue
streaming the data into the cache (sketched after this comment)... this error
code could perhaps be NS_BASE_STREAM_CLOSED.
I'm not sure that option 2 would be that much more efficient than option 1...
option 1 would be a lot easier to implement, but option 2 could be used by
any client of http.
gagan: i'm not sure we can completely background the download... as this would
require pushing it to another thread, which would be difficult.
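To make option 2 concrete, here is a minimal standalone sketch. This is a model of the idea, not actual necko code: the listener interface is simplified, and the NS_BASE_STREAM_CLOSED value is a made-up placeholder, not the real error code.

#include <cstddef>
#include <vector>

typedef unsigned int nsresult;
const nsresult NS_OK = 0;
const nsresult NS_BASE_STREAM_CLOSED = 0x80000001u; // placeholder value

struct StreamListener {
  virtual nsresult OnDataAvailable(const char* data, std::size_t count) = 0;
  virtual ~StreamListener() {}
};

// Channel-like producer: the cache entry always receives the full stream;
// once the consumer returns NS_BASE_STREAM_CLOSED it is detached and the
// rest of the load finishes "in the background" into the cache only.
void Pump(StreamListener* listener, const std::vector<char>& network,
          std::vector<char>& cacheEntry) {
  for (std::size_t i = 0; i < network.size(); ++i) {
    cacheEntry.push_back(network[i]);
    if (listener &&
        listener->OnDataAvailable(&network[i], 1) == NS_BASE_STREAM_CLOSED) {
      listener = 0; // stop calling OnDataAvailable; keep streaming to cache
    }
  }
}

A parser-side listener would return NS_BASE_STREAM_CLOSED once it had decided to restart with a new charset, letting the load complete into the cache untouched.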
either way we need overlapped I/O in the cache. Setting that as the first target
and giving this to gordon. This is a serious bug (ibench and double-post).
Assignee: darin → gordon
Keywords: topperf
Comment 10•24 years ago
*** Bug 78018 has been marked as a duplicate of this bug. ***
Comment 11•24 years ago
*** Bug 78494 has been marked as a duplicate of this bug. ***
Updated•24 years ago
Whiteboard: want for mozilla 0.9.1
Comment 12•24 years ago
In order to implement overlapped I/O in the cache, I'll need to finish Disk
Cache Level 2, which includes the necessary stream wrappers.
Depends on: 72507
Keywords: nsenterprise
Comment 14•23 years ago
can't make it to 0.9.2. pushing over...
Target Milestone: mozilla0.9.2 → mozilla0.9.3
Comment 16•23 years ago
Removing nsenterprise nomination. Adding nsBranch.
Keywords: nsenterprise → nsBranch
Comment 17•23 years ago
Gordon/Gagan - This looks like a good one to take. How close are you to resolving
this one? If it can't be finished this week, pls mark it as nsbranch- for this
round.
Comment 18•23 years ago
We are not close on this. It's very doubtful this will be ready to land in the
next month.
Keywords: mozilla1.0
Comment 19•23 years ago
any chance this will make the MachV train?
Comment 20•23 years ago
Darin and I are backing off of supporting overlapped I/O in the cache (which was
the reason I was given this bug). We need to review the severity and potential
fixes, since necko has changed quite a bit since this bug was originally
reported. I'll meet with him and update the bug with our current thoughts.
Comment 21•23 years ago
cc'ing shaver, since i know he has comments on this one.
If you submit a POST form, and the returned HTML has a charset -- as is the case
with a number of e-commerce sites in Canada, where we have accents and things --
then you get the scary "resubmit your data?" dialog, sometimes twice. That
dialog is doubly scary when you're slinging around your credit card with
non-refundable tickets, so I've had to spin up IE for some purchases to keep my
blood pressure down.
I don't understand why we have to go to the network or the cache for this. When
we hit a <meta charset> tag, we just need to go back and fix up attribute values
to match the new character set, and then make sure that future content is
charset-parsed appropriately. I don't think it's ever possible for the charset
to change the structure of the document, because in that case we might not
really have seen <meta> and the whole thing collapses on itself.
"Overlapping I/O" sounds like a win other things (multiple copies of an image on
a page, where the second one is requested while the first one is still coming
in?), to be honest, but I don't think the right fix here involves any I/O driven
by a <meta>, just attribute fixup. And since overlapped I/O seems to be rocket
science, why not let a DOM/parser guy take a swing at it?
Comment 23•23 years ago
agreed... falling back on the cache/necko is just a hack solution at best.
-> parser
shaver: btw, imagelib already gates image requests to avoid multiple hits on the
disk cache / network for an image that appears more than once on a page.
Assignee: gordon → harishd
Component: Networking: Cache → Parser
QA Contact: gordon → moied
Comment 24•23 years ago
Do we understand the situations where the 'meta charset sniffer' is failing --
thus forcing us to reload the document?
our current sniffing code looks at the first buffer of data for a
meta-charset... so, i'm assuming that in situations where we reload, the server
has sent us a 'small' first packet of data...
is this *really* the case, or has our sniffer broken?
-- rick
Comment 25•23 years ago
Can't answer the question about server reloads of POST documents, but for
the GET case of a document, the sniffer is working (i.e., we don't do
the double GET as long as the meta tag is within 2K bytes of the beginning of
the document; otherwise, we do the double GET).
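For illustration, here is a sketch of the kind of first-buffer sniff being described. This is not the actual parser code: the 2K limit mirrors the behavior above, but the naive substring scan stands in for real tokenization, quoting, and attribute handling.

#include <algorithm>
#include <cctype>
#include <string>

// Illustrative only: scan at most the first 2K for "charset=" and return the
// value; returns "" when not found, in which case a later reload may happen.
std::string SniffMetaCharset(const std::string& firstBuffer) {
  std::string head = firstBuffer.substr(0, 2048);
  std::transform(head.begin(), head.end(), head.begin(),
                 [](unsigned char c) { return (char)std::tolower(c); });
  std::string::size_type pos = head.find("charset=");
  if (pos == std::string::npos) return "";
  pos += 8; // skip past "charset="
  std::string::size_type end = pos;
  while (end < head.size() &&
         (std::isalnum((unsigned char)head[end]) || head[end] == '-' ||
          head[end] == '_')) {
    ++end;
  }
  return head.substr(pos, end - pos);
}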
Comment 26•23 years ago
rpotts: what jrgm said... otherwise, we'd have seen a huge regression on ibench
times.
And duplicate stylesheets and script transclusions from frames in framesets?
Not to hijack this bug with netwerk talk, now that we've punted it back
(correctly, IMO) to the parser guys -- hi, Harish! -- but it seems like this is
a correctness issue for more than just <meta> tags. I don't see another bug in
which to beat this dying horse, but I'll be more than happy to take the
discussion to one that someone finds or files.
Comment 28•23 years ago
shaver: duplicate css and js loads are serialized. hopefully, this is not too
costly in practice.
Comment 29•23 years ago
>Do we understand the situations where the 'meta charset sniffer' is failing --
>thus forcing us to reload the document?
>
>our current sniffing code looks at the first buffer of data for a
>meta-charset... so, i'm assuming that in situations where we reload, the server
>has sent us a 'small' first packet of data...
>
>is this *really* the case, or has our sniffer broken?
First of all, the sniffing code was originally designed as an "imperfect" performance
tuning for "most of the cases" rather than a perfect, all-cases general solution.
You are right, it only looks at the first block. And it is possible in theory for
the meta tag to appear thousands of bytes later (I have seen large js in front of
it before).
Second, even if the meta sniffing code works correctly, we still need the reload
mechanism to work correctly for the charset-detector reload (which works by
examining bytes and doing frequency analysis).
Turn "Character Set: Auto-Detect" from "(Off)" to "All" and visit some
non-Latin-1 text file and you will see the reload kick in.
add shanjian
Why do we need to reload for the charset sniffer? Can't it just look at text
runs and attribute values to do frequency analysis, and then perform the in-place
switchover described above? The document structure had better not change due to
a charset shift, or there's nothing we can do without an explicit and correct
charset value in the headers.
Comment 31•23 years ago
I expect a reload hack will always be "easier" than fixing up a bunch of
strings in content node members. Cc'ing jst. But easier isn't always better
(nor is worse, always). If we can avoid reloads, let's do it.
ftang, is it possible shaver's canadian e-commerce website POST data reloads are
due to universalchardet and not a sniffer failure?
/be
A great test case is this: go to the URL I just added, and click:
[Click Here to Generate Currency Table]
You'll be treated to _four_ POST data alerts, two of each type.
(For bonus marks, just _try_ to use the back button or alt-left to go back in
history.)
Comment 33•23 years ago
Correct me if I'm wrong:
There is code in nsObserverBase::NotifyWebShell() to prevent reload for POST data.
Comment 34•23 years ago
If meta charset sniffing fails then we fall back on the tag-observer mechanism
to |reload| the document with a new charset. However, the code in nsObserverBase
would make sure that we don't reload POST data. Therefore, we should never (I
think) encounter the double-submit problem. The drawback, however, is that the
document wouldn't get the requested charset.
Comment 35•23 years ago
I don't think it's even possible to always correctly do the fixup in the content
nodes after we realize what the charset really should be. The bad conversion
that already happened could've actually lost information if there were
characters in the stream that were not convertible to whatever charset we
converted to, right?
Comment 36•23 years ago
there has to be some way to do this without going back to the cache/network for
the data. remember: the cache isn't guaranteed to be present. we need a
solution for this bug that doesn't involve going back to netlib.
Comment 37•23 years ago
jst: Is there a way to throw away the content nodes, that got generated before
encountering a META tag with charset, without reloading the document?
jst: aren't we storing text and attributes as UCS2 -- unless they were
all-ASCII, in which case we can trivially reinflate. From either of those
conditions, I think we should be able to reconstruct the original (on-wire) text
runs, if we haven't thrown away the original charset info, and then re-decode
with the new character set.
I thought, and that paragraph is very much predicated on this belief, that we
only converted to "native" charsets at the borders of the application: file
names, font/glyph work for rendering, etc. If that's not the case, I will just
go throw myself into traffic now.
Comment 39•23 years ago
If the conversion from the input stream to unicode (using the default charset)
and back to what we had in the input stream is reliably doable, then yes, we
could convert things back and re-convert once we know the correct charset. But
I'm not sure that's doable with our i18n converter code... ftang, thoughts?
Comment 40•23 years ago
Harish, yes, we can reset the document and start over if we have the data to
start over from.
Comment 41•23 years ago
Strange... I don't see any reposts on the www.xe.net URL when I click on
the [Click Here to Generate Currency Table] button.
I'm using Mozilla0.9.8 on Linux.
But when I choose to View-source on the generated table it is blank.
Updated•23 years ago
I'm underwater with 0.9.9 reviews and approvals, but I wanted to toss this up
for discussion. If people agree that it's a viable path, there are a bunch of
improvements that can be made as well: text nodes already know if they're
all-ASCII, for example, though I don't know how to ask them.
Big issues that I need assurance on:
- all parts of the DOM can handle having their cdata/text/attribute string
values set, including <script>, and will DTRT. (I fear re-running scripts!)
- the entire concept of re-encoding in the old charset and then decoding with
the new one is viable (see the sketch after this comment). (Ignore, for now,
the XXX comment about the new buffer having more characters.)
Be gentle, but be thorough!
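To make the proposed fixup concrete, here is a sketch under the assumption that the encodings involved are single-byte and can be modeled as 256-entry tables. All names here are hypothetical; real Gecko converters are considerably more involved.

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

typedef std::vector<char32_t> UniString;

// Hypothetical single-byte codec modeled as a 256-entry decode table.
struct SingleByteCodec {
  char32_t toUnicode[256];
  std::map<char32_t, std::uint8_t> MakeEncoder() const {
    std::map<char32_t, std::uint8_t> enc;
    for (int b = 0; b < 256; ++b) enc[toUnicode[b]] = (std::uint8_t)b;
    return enc;
  }
};

// Fix up one already-decoded string: re-encode with the old codec to recover
// the on-wire bytes, then re-decode those bytes with the new codec. A tree
// walk would apply this to every text node and attribute value. Note that
// enc.at() throws if a character has no source byte -- exactly the lossy
// round-trip worry raised in the following comments.
UniString Redecode(const UniString& wrong, const SingleByteCodec& oldCodec,
                   const SingleByteCodec& newCodec) {
  std::map<char32_t, std::uint8_t> enc = oldCodec.MakeEncoder();
  UniString fixed;
  for (std::size_t i = 0; i < wrong.size(); ++i) {
    std::uint8_t wireByte = enc.at(wrong[i]);      // recover original byte
    fixed.push_back(newCodec.toUnicode[wireByte]); // decode with new charset
  }
  return fixed;
}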
Updated•23 years ago
Keywords: mozilla1.0+
Updated•23 years ago
Keywords: mozilla1.0
Comment 43•23 years ago
I can't say I had a close look or anything, but I really like this approach; it
would be light years ahead of what we have now (assuming it actually works, that
is :-)
Comment 44•23 years ago
We need Ftang's input too.
Comment 45•23 years ago
The proposal doesn't cover encoding the unparsed data with the correct charset
(the new charset).
Btw, if the approach works, can we remove the META tag sniffing code?
Yes, you're right: I forgot to mention that we need to update the parser's
notion of current charset.
smontagu has made me nervous about roundtripping illegal values. I'm hoping
he'll say more here.
Mike
Comment 47•23 years ago
*** Bug 129074 has been marked as a duplicate of this bug. ***
Comment 48•23 years ago
Shaver is working on this. Mike, should I assign this bug to you?
Comment 49•23 years ago
I wish I had paid attention to this bug earlier. I suggested the same approach when
ftang first explained mozilla's doc charset handling. However, I have to say
that the final patch might be more complicated than shaver's patch.
Is it possible to convert text back from unicode to the current character encoding
and reconvert to unicode with the new encoding? I want to share some of my
understanding. Theoretically, the answer is NO. It is true (or practically
true) that unicode covers almost all the native charsets we can
encounter today. But not all code points in a non-unicode encoding are valid.
For example, in iso-8859-1, code point 0x81 is not defined. If the incoming data
stream is encoded in windows-1251, 0x81 is a valid code point. Suppose somehow we use
iso-8859-1 to interpret the text data; code point 0x81 will be converted to
unicode U+FFFD. When we later try to convert this code point back, there is no
way to figure out where it came from. I believe this is the only scenario we
need to worry about. (It is possible that for some encodings, more than one code
point maps to a single unicode code point. If that is the case, it is a bug and
we can always fix it in the unicode conversion module.)
I could not figure out a perfect solution to this problem at this time, but I
would like to suggest 2 approaches for further discussion.
1) Could we always buffer the current page? Probably inside the parser?
2) We could use a series of unassigned code points in unicode for unassigned code
points and change our charset mapping tables. The aim is to make charset
conversion round-trip for any character. For a single-byte encoding, we have
at most 256 code points, and most of them should be assigned. For a multi-byte
encoding, we can interpret an illegal byte sequence byte by byte. This practice
must be kept as internal as possible. This should make mike's approach
feasible.
(We can't ignore the existence of illegal code points on many websites. In many
cases, an illegal code point usually suggests a wrong encoding. Interrupting the
process when meeting an invalid code point does not seem like a good idea.)
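A tiny standalone model of the 0x81 scenario above. The decoder here is an assumption: like the converter behavior described in this comment, it maps the byte that iso-8859-1 leaves unassigned to U+FFFD, which is where the information is lost.

#include <iostream>

// Model decoder: maps the unassigned byte 0x81 to U+FFFD, as described above.
char32_t DecodeByte(unsigned char b) {
  return (b == 0x81) ? 0xFFFD : (char32_t)b;
}

// Trying to go back: U+FFFD carries no memory of the source byte.
bool EncodeBack(char32_t c, unsigned char* out) {
  if (c == 0xFFFD || c > 0xFF) return false; // information already lost
  *out = (unsigned char)c;
  return true;
}

int main() {
  unsigned char wire = 0x81;     // a perfectly valid windows-1251 byte
  char32_t u = DecodeByte(wire); // misinterpreted as iso-8859-1: U+FFFD
  unsigned char back;
  if (!EncodeBack(u, &back)) {
    std::cout << "0x81 -> U+FFFD -> ? : the round trip is lossy\n";
  }
  return 0;
}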
Comment 50•23 years ago
This happens not only with Russian. Set any autodetection and you'll see it.
Comment 52•23 years ago
*** Bug 116217 has been marked as a duplicate of this bug. ***
Comment 53•23 years ago
updating summary
Summary: <meta> with charset should reload from cache, not server → <meta> with charset should NOT cause reload
Comment 54•23 years ago
*** Bug 131524 has been marked as a duplicate of this bug. ***
Comment 55•23 years ago
*** Bug 131966 has been marked as a duplicate of this bug. ***
Comment 56•23 years ago
Shaver: What's the status on this? Can this be done in the 1.0 time frame? If
not, let's move it to a more realistic milestone.
Comment 57•23 years ago
Looks like this is definitely not going to make the m1.0 train (Shaver?).
Giving the bug to Shaver so that he can target it to a more realistic milestone.
Assignee: harishd → shaver
Comment 58•23 years ago
*** Bug 135852 has been marked as a duplicate of this bug. ***
Comment 59•23 years ago
*** Bug 129196 has been marked as a duplicate of this bug. ***
Comment 60•23 years ago
Attempt to reduce dupes by adding "twice" and "two"
Summary: <meta> with charset should NOT cause reload → <meta> with charset should NOT cause reload (loads twice/two times)
Comment 61•23 years ago
*** Bug 117647 has been marked as a duplicate of this bug. ***
Comment 62•23 years ago
*** Bug 139659 has been marked as a duplicate of this bug. ***
Comment 63•23 years ago
*** Bug 102407 has been marked as a duplicate of this bug. ***
Comment 64•23 years ago
Adding topembed to this one since Bug 102407 was marked a duplicate of this. Many
sites from the evangelism effort demonstrate the POSTDATA popup problem. See
more info in Bug 102407.
Adding jaimejr and roger.
Keywords: topembed
Comment 65•23 years ago
topembed+. carrying topembed+ over from Bug 102407.
Updated•23 years ago
Comment 66•23 years ago
Seems like a few customers are interested in this one getting fixed soon. What
are the chances we could have a fix in the next week?
Comment 67•23 years ago
Take note of bug 81253. It definitely wasn't a complete fix for this bug, but it
dealt with the 90% case. Specifically, we do not reload if the META tag is in
the first buffer delivered by the server. Can someone confirm that the new bugs
are cases where the META is not in the first buffer? Or did the code change from
81253 just rot?
Comment 68•23 years ago
> Or did the code change from 81253 just rot?
It ain't rotten.
Comment 69•23 years ago
As vidur notes, for the most common case, if the document returned by GET or
POST has a "<meta http-equiv='Content-Type' content='text/html; charset=...'>"
within the first ~2k returned (and not beyond that point), then we do not
re-request the document.
The other bugs that have been marked as recent dupes involve charset
auto-detection and/or more elaborate form submission scenarios.
Comment 70•23 years ago
We may have to choose not to fix this in the 1.0 time frame because of the complexity
and risk. But we have to fix it sooner or later. It is just unacceptable for
websites that lack a meta charset but involve form submission.
Yeah, the roundtripping of illegal values makes this turn into something like
rocket science. I haven't had good success getting i18n brains on this one, no
doubt because they're swamped with 1.0/nsbeta issues as well.
Let's reconvene for 1.1alpha.
Status: NEW → ASSIGNED
Target Milestone: mozilla1.0 → mozilla1.1alpha
Comment 72•23 years ago
It's important to remember that the patch to
http://bugzilla.mozilla.org/show_bug.cgi?id=81253 looks for the META-charset in
the first buffer of data. This is at most ~2-4k; however, it is whatever the
network hands out to the parser... It 'could' be much less...
Perhaps, some of the remaining problems are due to servers which return a much
smaller block of data in the first response buffer...
-- rick
Comment 73•23 years ago
rick: yup that would also cause problems, but i think a large part of the
problem has to do with charset detection. say there is no meta tag... if we
don't know the charset of the document, and we try to sniff out the charset,
then there'll always be a charset reload. that seems like the killer here IMO.
it seems like we hit this problem *a lot* when "auto-detect" is enabled.
Comment 74•23 years ago
Sorry for the spam, but is there any chance of fixing this?
It's very annoying when using character set autodetection, and Russians etc. must
use this feature. I have heard many questions about this problem in Moz 1.0 PRx and
Netscape 7.0 PR1...
Updated•22 years ago
Whiteboard: [ADT2] → [ADT2 RTM]
Comment 75•22 years ago
A short term solution that we're considering is:
1) The default charset for a page should be set to the charset of the referrer
(both in the link and the form submission case). This is dealt with by bug 143579.
2) Auto-detection should not happen when there's POST data associated with a page.
Some pages may not be rendered correctly, but this solution should deal with the
common case. Reassigning this bug to Shanjian.
Comment 76•22 years ago
I am going to handle this problem in 102407 using the above approach, and leave this
bug open for the future.
Comment 77•22 years ago
Jaime, you might want to remove some keywords in this bug.
Comment 78•22 years ago
thanks shanjian!
removing nsbeta1+/[adt2 RTM], and strongly suggesting drivers remove mozilla1.0+
and EDT remove topembed+, as the short-term solution (safer, saner) will be
addressed in bug 102407, relegating this issue to WFM or just edge cases.
Keywords: nsbeta1+
Updated•22 years ago
Whiteboard: [ADT2 RTM]
Comment 79•22 years ago
Just porting my comment from bug 102407:
Why can't you keep loading the document to the end even though the meta charset
says it's in another charset and, after the document has finished, reload the
document from the cache in the same way that viewing source (finally!) works?
The performance could degrade, but at least Mozilla would be doing the right
thing -- and for big files, reloading the full thing from the cache would be
faster than loading from the server anyway. Asynchronous loading to the cache
would be cool, but it's needed for a feature that isn't used that much.
Performance can be improved later if *really* seen as important, but how often
does the charset change between page loads anyway? 9 times out of 10, I've seen
this bug because automatic charset detection has detected the charset
incorrectly and reloads the document even though it should be doing nothing.
I put up a little test at http://www.cc.jyu.fi/~mira/moz/moztest.php which uses
cookies to save the last 7 page loading times and changes the charset every now
and then, and sends the meta charset after 2K. Automatic reloading shows up as
sub-second reload times and flashing in the browser view.
Comment 81•22 years ago
We should look at fixing this one for the next release, because it is a
performance issue.
Comment 83•22 years ago
*** Bug 158331 has been marked as a duplicate of this bug. ***
Comment 84•22 years ago
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity. Only changing open bugs to
minimize unnecessary spam. Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: major → critical
Comment 85•22 years ago
*** Bug 88701 has been marked as a duplicate of this bug. ***
Updated•21 years ago
Summary: <meta> with charset should NOT cause reload (loads twice/two times) → <meta> with charset and autodetection should NOT cause reload (loads twice/two times)
Comment 87•21 years ago
*** Bug 171425 has been marked as a duplicate of this bug. ***
Comment 88•21 years ago
*** Bug 77702 has been marked as a duplicate of this bug. ***
Comment 89•21 years ago
*** Bug 137936 has been marked as a duplicate of this bug. ***
Comment 90•21 years ago
Could we update the target milestone for this 3 year old bug? I think we missed
1.2 ;-)
Updated•21 years ago
Assignee: shanjian → parser
Status: ASSIGNED → NEW
Priority: P2 → P1
Updated•21 years ago
Target Milestone: mozilla1.2beta → Future
Comment 91•21 years ago
*** Bug 235160 has been marked as a duplicate of this bug. ***
Comment 92•20 years ago
*** Bug 248610 has been marked as a duplicate of this bug. ***
Comment 93•20 years ago
*** Bug 287569 has been marked as a duplicate of this bug. ***
Comment 94•19 years ago
Status report? This bug has been marked "Severity: Critical" and "Priority: 1",
has the keyword "dataloss", and still there hasn't been even a status update in
the last 3+ years?
Can somebody comment on how hard it would be to implement my suggestion in
comment #79? I haven't hacked on Mozilla's C++ source, so I have no idea.
Here's the suggested algorithm again, reworded:
1. In case the meta charset (or any other heuristic) tells Mozilla that it's
using the incorrect charset, raise a flag that the document is displayed with an
incorrect character set.
2. Regardless of this problem, keep going until the document has been fully
transferred so that Mozilla has a full copy of it in the cache.
3. Reload the page from the cache with the correct charset. (I'm hoping that the
cache has a *binary* copy of the transferred data, not something that has gone
through the parser and is therefore hosed anyway.) If the View Source feature
can work without reloading the POSTed page, then it should be possible to reload
from the cache, too.
Comment 95•19 years ago
thank you for volunteering
Updated•19 years ago
Assignee: mira → mrbkap
Priority: P3 → --
QA Contact: moied → parser
Target Milestone: Future → ---
Updated•19 years ago
Priority: -- → P3
Target Milestone: --- → mozilla1.9alpha
Comment 96•19 years ago
I can't seem to reproduce this on the site in the URL. Can someone please update the URL with a testcase that shows this?
Target Milestone: mozilla1.9alpha → Future
Comment 97•19 years ago
(In reply to comment #96)
> I can't seem to reproduce this on the site in the URL. Can someone please
> update the URL with a testcase that shows this?
I was about to change the url from http://www.xe.net/ict/ to one I mentioned in comment #79 (http://www.cc.jyu.fi/~mira/moz/moztest.php) but I wasn't allowed to. The test case changes between iso-8859-1 and iso-8859-15 every second. Hit "Reload page via GET" link a couple of times (wait a few seconds between tries) to see the problem. The test case uses cookies for timing the requests from a single browser. You should be able to see the euro sign when the page text says "iso-8859-15" and there should be a generic currency sign when page text says "iso-8859-1". With GET this is true (the page loads twice if there's a problem) whereas with POST you get incorrect rendering. I have View - Character Encoding - Auto-Detect - (Off) set in case that matters.
I still see the problem with Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20051108 Firefox/1.6a1.
A workaround for this bug is to include the meta charset declaration in the first 2048 bytes of a file.
Comment 98•18 years ago
hi,
i can confirm the problem for the new testcase, testing
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3 ID:2007030919
and
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a4pre) Gecko/20070417 Minefield/3.0a2pre ID:2007041704 [cairo]
Comment 100•16 years ago
I'm still seeing this problem on Firefox 2.0.0.14 ...
I use a page with
<META HTTP-EQUIV="Content-Language" CONTENT="ro">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-2">
and all pages that contain this are requested twice from the server the first time they are requested. If they have already been requested during the current session, the server receives only one request.
This error is a big problem for me, since on generated pages, data is processed twice and I get incorrect data in the database because of this...
The only workaround so far was to
Comment 101•16 years ago
bug 463175 is reporting the same reload-twice problem. Always reproducible.
This bug still exists in 3.1.10.
This bug is serious, causes dataloss, and makes modalDialog stop working.
It tells me the default behaviour does not support "extended" charsets, which it really should.
If the page is somehow faulty, raise a flag and inform the user. This reload behaviour seems like a fix with good intentions, but it is not a good idea for any page that is dynamic, submits data back, or is used in a RIA.
Updated•16 years ago
Flags: blocking1.9.2?
Comment 103•16 years ago
This bug needs a test case. The test case should be on the web. That way we can see how other browsers interact with the same web page. The test case should clearly show how data loss occurs. It should also show how the modal dialog stops working. Any other defects should be clearly shown. There should be "Expected Behavior" and "Actual Behavior." The test case should be entered into the URL field of this bug report.
Comment 104•16 years ago
Check bug 463175. I made a test page where you can see the load-twice behaviour. The test page is in the URL field for that bug. That's where I thought it should go; I am new to this forum, still trying to get my head around how it works...
Here is that page:
http://beta.backbone.se/tools/test/loadtwice/loadtwice.html
Simple, reproducible, happens every time, no swapping of META tags needed as suggested above. I do not know why the URL for this bug points to xe.com.
Here is how showModalDialog stops working: When the modal dialog is first opened, the arguments are fine. But since the double-load behaviour then loads the page once more, the arguments are lost (returns null). This is actually how we first found this bug. It took a long time to backtrack that this was actually happening to ALL non-cached pages in FF without a charset tag. Difficult to believe, but there it is.
Data loss can occur in all sorts of ways. See the thread above.
Since I am a bit taken aback by this, I would like to emphasize three side effects that might speed things up on your end (I guess, if the right people read it). I take the chance of barking up the wrong tree, and if I do, please accept my apologies.
Three serious side effects I can think of:
1. Developers need to include the correct charset tag in ALL pages.
This is actually something we should have done in the first place. But for a big site or system, this is months of hard work for any developer.
Ironically this includes all(?) pages in this forum ;-)
2. Performance. New pages load twice; dynamic pages using ids in urls load twice.
3. Browser statistics are wrong. A large chunk of the FF penetration figures should be taken out. This might be the most serious.
Since I like irony, I will make a little experiment by writing the letter ä in this comment. Voila, this page is hit twice the first time it is requested by FF! Same as the page with bug 463175.
For the developer that cannot wait for this bug to be fixed, here is what we had to do (fixes side effect 1 above):
1. Convert all pages to UTF-8
2. Pray that the developer tool or HTML editor you use has support for UTF-8
3. Place the correct charset meta tag in the header of all pages
4. If your webserver is IIS, all parameters must be URI-encoded or you lose all extended characters
5. Rewrite your cookie routines to support extended characters as well
This took us four months, and still not everything is in place :-(
Hope this helps.
Good luck!
Comment 105•16 years ago
I suggest renaming this bug to:
<meta> with charset and autodetection OR charset missing, should NOT cause reload (loads twice/two times)
Comment 106•16 years ago
Testcase:
1. Set Character Encoding to Auto Detect.
2. Go to URL: http://beta.backbone.se/tools/test/loadtwice/loadtwice.html
Expected Results: Page loads once
Actual Results: Page loads twice
Flags: wanted1.9.2?
Keywords: testcase
Summary: <meta> with charset and autodetection should NOT cause reload (loads twice/two times) → <meta> tag with charset and autodetection OR charset missing, should NOT cause reload (loads twice/two times)
Comment 107•15 years ago
Unfortunately I don't think we can fix this for 1.9.2 as this is far from a trivial problem to fix, and we don't have anyone right now with the time to spend on this.
However, if people feel there's value in making the effects of this when it comes to showModalDialog() go away (i.e. if we preserve dialog arguments across reloads), I think we could do *that* for 1.9.2.
I'd like to hear what people think about doing the showModalDialog() part of this only for 1.9.2. I know it sucks to not fix the underlying problem here now, but as I said, it's not very easy to fix in our code, and I'd rather see us fix this for the HTML5 parser than worrying about it in the current parser code. Leaving this nominated for now until I hear some thoughts here.
Updated•15 years ago
Assignee: mrbkap → nobody
The HTML5 spec prescribes reparsing when the <meta> is so far from the start of the file that the prescan doesn't find it.
As for chardet, I've made the HTML5 parser only run chardet (if enabled) over the prescan buffer, so chardet-related reparses should be eliminated. However, the HTML5 parser needs more testing in CJK and Cyrillic locales to assess whether the setup is good enough.
Updated•15 years ago
Flags: wanted1.9.2?
Flags: wanted1.9.2-
Flags: blocking1.9.2?
Flags: blocking1.9.2-
Information for Web authors seeing this problem and finding this report here in Bugzilla:
This problem can be 100% avoided by the Web page author by using HTML correctly as required by the HTML specification. There are three different solutions, any one of which can be used:
1) Configure your server to declare the character encoding in the Content-Type HTTP header. For example, if your HTML document is encoded as UTF-8 (the preferred encoding for Web pages), make your servers send the HTTP header
Content-Type: text/html; charset=utf-8
instead of
Content-Type: text/html
This solution works with any character encoding supported by Firefox.
OR
2) Make sure that you declare the character encoding of your HTML document using a "meta" element within the first 1024 bytes of your document. That is, if you are using UTF-8 (as you should considering that UTF-8 is the preferred encoding for Web pages), start your document with
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<title>…
and don't put comments, scripts or other stuff before <meta charset=utf-8>.
This solution works with any character encoding supported by Firefox except UTF-16 encodings, but UTF-16 should not be used for interchange anyway.
OR
3) Start your document with a BOM (byte order mark). If you're using UTF-8, make the first three bytes of your file be 0xEF, 0xBB, 0xBF. You probably should not use this method unless you're sure that the software you are using won't accidentally delete these three bytes.
This solution works only with UTF-8 and UTF-16, but UTF-16 should not be used for interchange anyway, which is why I did not give the magic bytes for UTF-16.
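For example, a page generator could emit the BOM explicitly. A minimal sketch; the output file name is hypothetical:

#include <cstdio>

int main() {
  std::FILE* f = std::fopen("page.html", "wb"); // hypothetical output file
  if (!f) return 1;
  const unsigned char bom[3] = {0xEF, 0xBB, 0xBF}; // UTF-8 BOM, first 3 bytes
  std::fwrite(bom, 1, sizeof bom, f);
  std::fputs("<!DOCTYPE html>\n<html>\n<head>\n<title>...</title>\n", f);
  std::fclose(f);
  return 0;
}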
- -
As for fixing this:
This bug is WONTFIX for practical purposes, since fixing this would take a substantial amount of work for very little gain. Anyone capable of fixing this will probably always have higher-priority things to work on.
But if this was to be fixed, the first step would be figuring out what WebKit and IE do. Without actually figuring that out, here are a couple of ideas how this could be fixed:
1) In the case of a late meta, if we want to continue to honor late metas (which isn't a given), we should keep the bytes that the HTML parser has already consumed and keep consuming the network stream into that buffer while causing the docshell to do a renavigation without hitting the network again but instead restarting the parser with the buffer mentioned earlier in this sentence.
2) In the case of chardet, it might be theoretically possible to replace chardet with a multi-encoding decoder with an internal buffer. The decoder would work like this: As long as the incoming bytes are ASCII-only, the decoder would immediately emit the corresponding Basic Latin characters. Upon seeing a non-ASCII byte, the decoder would accumulate bytes into its internal buffer until it can commit to a guess about their encoding. Upon committing to the guess, the decoder would emit its internal buffer decoded according to the guessed encoding. Thereafter, the decoder would act just like a normal decoder for that encoding. (A sketch follows at the end of this comment.)
But it would be a bad idea to pursue these ideas without first carefully finding out what WebKit and IE do. I hear that WebKit gets away with much less complexity in this area compared to what Gecko implements.
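A sketch of what idea 2 might look like, speculative per the caveat above: TryCommit() is a stub for the actual guessing heuristic, and DecodeCommitted() is an identity placeholder for the post-commit decoder.

#include <cstddef>
#include <vector>

// Pass ASCII through, buffer from the first non-ASCII byte, commit to a
// guess, then emit the buffer and continue as a normal decoder.
struct GuessingDecoder {
  bool committed;
  std::vector<unsigned char> pending;

  GuessingDecoder() : committed(false) {}

  // Stub: a real implementation would run frequency analysis over 'pending'
  // and refuse to commit until it is confident enough.
  bool TryCommit() const { return pending.size() >= 1024; }

  // Stub for the post-commit decoder; identity mapping as a placeholder.
  char32_t DecodeCommitted(unsigned char b) const { return b; }

  std::vector<char32_t> Feed(const unsigned char* data, std::size_t len) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < len; ++i) {
      unsigned char b = data[i];
      if (committed) {
        out.push_back(DecodeCommitted(b));
      } else if (b < 0x80 && pending.empty()) {
        out.push_back(b); // ASCII-only so far: emit immediately
      } else {
        pending.push_back(b); // hold bytes (in order) until we can guess
        if (TryCommit()) {
          committed = true;
          for (std::size_t j = 0; j < pending.size(); ++j)
            out.push_back(DecodeCommitted(pending[j]));
          pending.clear();
        }
      }
    }
    return out;
  }
};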
Alias: latemeta
Severity: major → enhancement
Priority: P3 → --
Summary: <meta> tag with charset and autodetection OR charset missing, should NOT cause reload (loads twice/two times) → Late charset <meta> or autodetection (chardet) should NOT cause reload (loads twice/two times)
Whiteboard: [adt2] → Please read comment 110.
Comment 112•6 years ago
I can't seem to reproduce this on the site in the URL. Can someone please update the URL with a testcase that shows this? There is code in nsObserverBase::NotifyWebShell() to prevent reload for POST data.
A great test case is this: go to the URL I just added, and click to get currency table
https://www.timehubzone.com/currencies
Flags: needinfo?(datehubzone)
> There is code in nsObserverBase::NotifyWebShell() to prevent reload for a POST data.
It should still be possible to reproduce this in the case of GET requests.
> A great test case is this: go to the URL I just added, and click to get currency table
> https://www.timehubzone.com/currencies
That site declares the encoding both on the HTTP layer and in early <meta>, so it shouldn't be possible to see this reload case there.
In general, I don't expect us to add complexity to cater for this long-tail legacy issue. If we want to never reload, we should revert bug 620106 and then stop honoring late <meta>, too.
Comment 114•3 years ago
Moving open bugs with topperf keyword to triage queue so they can be reassessed for performance priority.
Performance Impact: --- → ?
Keywords: topperf
The late meta aspect was fixed in bug 1701828.
The page can still be reloaded in the case where it doesn't declare an encoding and the detector's guess at the end of the stream differs from the guess made at </head>. The telemetry for how often this happened expired, and I've been too busy with other things to reinstate telemetry in this area.
In any case:
- Any page can avoid this perf problem by declaring the encoding, and pages that people browse the most declare their encoding.
- Even before bug 1701828, which extended the number of bytes that are considered for the initial guess, the detector-triggered reload case affected less than 1.5% of unlabeled page loads globally.
I think it's not useful to try to eliminate the remaining reload case, since it's better for pages to be readable than performantly unreadable.
I'm leaving this bug open for checking that the reload comes from the cache, though.
Severity: normal → S4
Flags: needinfo?(datehubzone)
Keywords: dataloss
Priority: -- → P5
Summary: Late charset <meta> or autodetection (chardet) should NOT cause reload (loads twice/two times) → Make sure that chardetng-triggered encoding reload is read from the cache
Whiteboard: Please read comment 110. → Please read comment 115.
Updated•3 years ago
Performance Impact: ? → ---
Updated•3 years ago
Restrict Comments: true