Closed Bug 212308 Opened 22 years ago Closed 21 years ago

javascript error when using unescape function in UTF8

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows XP
defect
Not set
critical

RESOLVED FIXED

People

(Reporter: omercier, Unassigned)

Details

(Keywords: intl)

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624
The unescape function doesn't work when I use it in an HTML page that is in the UTF-8 charset.
Reproducible: Always
Steps to Reproduce:
1. Go to the URL http://195.25.243.18/ppp/unescape.html
Actual Results: The page doesn't load, and I get this error in the JavaScript console:
Error: uncaught exception: [Exception... "Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIDOMWindowInternal.unescape]" nsresult: "0x8000ffff (NS_ERROR_UNEXPECTED)" location: "JS frame :: http://195.25.243.18/ppp/unescape.html :: <TOP_LEVEL> :: line 7" data: no]
That's because the string, once unescaped, is expected to be in the character encoding of the page (from which it can then be converted into the UTF-16 encoding JavaScript uses natively, to be returned to the script). Your string, when unescaped, is not valid UTF-8. I assume it's valid ISO-8859-1 or something like that, but there is no reason for Mozilla to try that charset on a UTF-8 page...
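To make "not valid UTF-8" concrete, here is a minimal sketch using the standard decodeURIComponent function, which always reads percent-escapes as UTF-8 bytes. It is only a stand-in for illustration, not the DOM unescape() code path discussed in this bug; the helper name tryDecodeAsUtf8 is made up.

```javascript
// decodeURIComponent interprets %XX escapes as UTF-8 bytes.
// tryDecodeAsUtf8 is a hypothetical helper, not part of any browser API.
function tryDecodeAsUtf8(escaped) {
  try {
    return { ok: true, text: decodeURIComponent(escaped) };
  } catch (err) {
    // A lone %E9 is an incomplete UTF-8 sequence, so a URIError is thrown.
    return { ok: false, error: err.name };
  }
}

const loneByte = tryDecodeAsUtf8('%E9');        // ISO-8859-1 style escape for "é"
const fullSequence = tryDecodeAsUtf8('%C3%A9'); // UTF-8 style escape for "é"

console.log(loneByte);
console.log(fullSequence);
```

The first call fails and the second succeeds, which is the difference between an ISO-8859-1 style escape and a UTF-8 style escape for the same character.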
---> DOM for handling. In the browser, the DOM unescape() function supersedes the JS Engine implementation.
Assignee: rogerl → dom_bugs
Component: JavaScript Engine → DOM Level 0
QA Contact: pschwartau → ashishbhatt
The URL has changed: http://www.agdf.com/ppp/unescape.html What can I do to resolve this bug?
Don't mix different charsets in the same page? I'm not sure what we can do here that would not break thousands of other pages, really... The issue is that once we unescape the string, all we have is a sequence of _bytes_. To write it to the document, we need to convert the bytes to characters. To do this, we need to assume that the byte stream is a character string represented in some encoding. We have to guess the encoding. We guess, reasonably, that it's the same as the encoding of the page itself (which is what it is 99.99% of the time). Any suggestions on what we should change there?
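The two steps described above (escapes to bytes, then bytes to characters under an assumed encoding) can be sketched roughly as follows. This is an illustration only, not Mozilla's implementation: the helper names and the sample string 'caf%E9' are made up, and TextDecoder is assumed to be available with support for the fatal option.

```javascript
// Hypothetical helper: turn "%XX" escapes into raw bytes. Any other
// character is assumed to be ASCII and is copied through as its code value.
function percentDecodeToBytes(input) {
  const bytes = [];
  for (let i = 0; i < input.length; i++) {
    const hex = input.slice(i + 1, i + 3);
    if (input[i] === '%' && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(input.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// ISO-8859-1 maps each byte directly to the code point of the same value.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// Strict UTF-8 decoding: returns null instead of throwing on ill-formed input.
function decodeUtf8OrNull(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (e) {
    return null;
  }
}

const raw = percentDecodeToBytes('caf%E9');  // bytes 63 61 66 E9
const asLatin1 = decodeLatin1(raw);          // decodes under an ISO-8859-1 guess
const asUtf8 = decodeUtf8OrNull(raw);        // fails under a UTF-8 guess
const utf8Escaped = decodeUtf8OrNull(percentDecodeToBytes('caf%C3%A9'));

console.log(asLatin1, asUtf8, utf8Escaped);
```

The same four bytes give a readable string under one guess and a decoding failure under the other, which is why the choice of assumed encoding decides whether the call succeeds.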
window.unescape() is DOM Level 0: http://www.mozilla.org/docs/dom/domref/dom_window_ref123.html#1022042
document.write() is DOM HTML Levels 1 and 2: http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929/level-one-html.html#ID-75233634 http://www.w3.org/TR/2003/REC-DOM-Level-2-HTML-20030109/html.html#ID-75233634
document.write() should work on nothing but Unicode text. window.unescape() unescapes a two-digit (one-octet) hex value. The escaping is defined in RFC 2396: http://www.apps.ietf.org/rfc/rfc2396.html#sec-2
There is no way to represent all Unicode characters in one octet, so unescape() can't work directly on Unicode text. Its only option is to treat the binary data it receives as encoded text, decode it into Unicode text using the encoding of the file that carries the data, and return the result.
Here's the list of states that you want your "%E9" to pass through during its life:
Binary Data->(Text decoder)->Unicode Text->(Parser)->MarkUp->(DOM)->JavaScript Code->(JavaScript)->ASCII string of Escaped Binary Data->(Escape Sequence Decoder)->Binary Data->(Text decoder)->Unicode Text->(Parser)->MarkUp->(DOM)->Document text->(Renderer)->Screen image
The part that gets screwed up is:
ASCII string of Escaped Binary Data->(Escape Sequence Decoder)->Binary Data->(Text decoder)->Unicode Text
All of that is done by window.unescape(). In particular, the text decoder inside it fails. It tries to use UTF-8 for decoding, because your document indicates through its Content-Type tag that it is UTF-8 encoded. The document also contains a UTF-8 BOM. The octet %E9 has its high bit set, which in UTF-8 means that it is part of a multibyte-encoded character. However, there are no other bytes to complete it, which makes it an ill-formed UTF-8 sequence. http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf#G11165
The bottom line: the text decoder inside window.unescape() fails because the binary data you gave it is ill-formed UTF-8. Thus the error. Marking INVALID.
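The bit-level argument about %E9 can be checked with a small sketch. The function names are hypothetical, and the check covers only the lead-byte and continuation-byte layout of UTF-8; it does not reject overlong forms or surrogates, so it is not a full validator.

```javascript
// Sequence length implied by a UTF-8 lead byte, or 0 if the byte cannot
// start a sequence (a continuation byte or an invalid lead byte).
function utf8SequenceLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx: single-byte (ASCII)
  if ((byte & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((byte & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((byte & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0;
}

// Continuation bytes have the form 10xxxxxx.
function isContinuation(byte) {
  return (byte & 0xc0) === 0x80;
}

// Structural check only: every lead byte must be followed by the number of
// continuation bytes it announces.
function hasWellFormedLayout(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const len = utf8SequenceLength(bytes[i]);
    if (len === 0 || i + len > bytes.length) return false;
    for (let k = 1; k < len; k++) {
      if (!isContinuation(bytes[i + k])) return false;
    }
    i += len;
  }
  return true;
}

const bits = (0xe9).toString(2);                     // "11101001"
const announced = utf8SequenceLength(0xe9);          // 3
const loneE9 = hasWellFormedLayout([0xe9]);          // false: two bytes missing
const eAcute = hasWellFormedLayout([0xc3, 0xa9]);    // true: UTF-8 for U+00E9

console.log(bits, announced, loneE9, eAcute);
```

0xE9 announces a three-byte sequence, so a lone %E9 is rejected for lack of two continuation bytes.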
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
Depends on: 44272
This bug was valid and was fixed by the fix for bug 44272. See that bug for why this one was valid. Reopening now. I'm gonna resolve it as fixed in a moment.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
sorry for spamming, but this is the right thing to do :-)
Status: UNCONFIRMED → RESOLVED
Closed: 21 years ago
Keywords: intl
Resolution: --- → FIXED