Closed Bug 805080 Opened 12 years ago Closed 11 years ago

Remove UTF-8 handling from SpiderMonkey

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: terrence, Assigned: terrence)

References

Details

(Whiteboard: [leave open][js:t])

Attachments

(2 files)

v0 12 years ago Terrence Cole [:terrence] (deleted), patch	luke : review+ terrence : checkin+	Details \| Diff \| Splinter Review
v0: move UTF-8 api to CompileOptions 12 years ago Terrence Cole [:terrence] (deleted), patch	luke : review+	Details \| Diff \| Splinter Review

Terrence Cole [:terrence]

Assignee

Description

•

12 years ago

The current UTF-8 handling code in SpiderMonkey is super-gross and completely untested. If you wanted to use it, you have to turn it on dynamically before creating the first runtime. Once turned on, it replaces ASCII in most, but not all cases. This adds an unnecessary extra branch to every Inflate call and, much worse, makes any call to Inflate a potential GC. The actual UTF-8 implementation itself appears to be cobbled together and not from any well-known (and well tested library). This code appears to have been strapped on by an embedder in a totally ad-hoc manner and is causing us real problems right now. We should remove it until we can build something that integrates well with the rest of SpiderMonkey.

Terrence Cole [:terrence]

Assignee

Comment 1

•

12 years ago

After more investigation, it looks like CTypes has started using the UTF-8 functions, because they had public visibility, for no good reason. However, CTypes doesn't set js_CStringsAreUTF8, which makes these functions go off the rails if there is an error in the UTF-8 string. So there is at least one bug in CTypes currently because of this weirdness.

Terrence Cole [:terrence]

Assignee

Comment 2

•

12 years ago

Attached patch v0 (deleted) — Details — Splinter Review

It's hard to express just how totally messed up our text encoding situation is right now. I have epsilon less sanity after untangling this gordian knot of misdirection and lies, so I'm in a good position to explain at the moment. Imagine that Laurel and Hardy are doing their "Who's on first" sketch, only instead of being about baseball it's about text encodings and the argument is between SpiderMonkey and the Internet instead of Laurel and Hardy. That's the level of confusion we are talking about here. This patch is my attempt to explain to SpiderMonkey that "Who's a UTF-8 string" does not imply in any way that it should decode Who as a CESU-8 string, and similar awful punning. There is still a great deal of work to do, but I'd really like to get this in-tree ASAP: I feel we have a duty to save the sanity of anyone who might try to pass a UTF-8 string into our CompileUTF8 functions and expect a anything but comedy to come out.

Assignee: general → terrence

Status: NEW → ASSIGNED

Attachment #676682 - Flags: review?(luke)

Luke Wagner [:luke]

Comment 3

•

12 years ago

Comment on attachment 676682 [details] [diff] [review] v0 Review of attachment 676682 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for excising all this dead and (in my experience) confusing logic. ::: js/src/jsapi.cpp @@ +6246,5 @@ > size_t necessaryLength = GetDeflatedStringLength(NULL, chars, str->length()); > if (necessaryLength == size_t(-1)) > return size_t(-1); > + if (writtenLength != length) > + JS_NOT_REACHED("CStrings are NOT UTF-8"); Seems like we'd want either JS_ASSERT or JS_Assert (if we want a release-mode assert). ::: js/src/jsapi.h @@ +4965,2 @@ > JS_CompileUCScript(JSContext *cx, JSObject *obj, > + const jschar *ucs2, size_t length, Norbert was telling us not to call it UCS2 (because it isn't). "twoByteChar" would be more honest but, given that we call ascii 'ascii' and utf8 'utf8' and the type 'jschar' tells us "this is a two-byte char", perhaps we can leave all these cases you changed to 'ucs2' as 'chars'.

Attachment #676682 - Flags: review?(luke) → review+

Terrence Cole [:terrence]

Assignee

Comment 4

•

12 years ago

Green try run at: https://tbpl.mozilla.org/?tree=Try&rev=785b9a9345ee Pushed at: https://hg.mozilla.org/integration/mozilla-inbound/rev/b2620ac72e94 (In reply to Luke Wagner [:luke] from comment #3) > Norbert was telling us not to call it UCS2 (because it isn't). > "twoByteChar" would be more honest but, given that we call ascii 'ascii' and > utf8 'utf8' and the type 'jschar' tells us "this is a two-byte char", > perhaps we can leave all these cases you changed to 'ucs2' as 'chars'. You are correct. I would like to make the character type names actual C++ types when we convert these interfaces to mozilla::Range subclasses.

Blocks: 805081

Whiteboard: [leave open]

Terrence Cole [:terrence]

Assignee

Updated

•

12 years ago

Depends on: 807151

Ed Morley [:emorley]

Comment 5

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/b2620ac72e94

Nicolas B. Pierron [:nbp]

Updated

•

12 years ago

Depends on: 807518

Terrence Cole [:terrence]

Assignee

Comment 6

•

12 years ago

Attached patch v0: move UTF-8 api to CompileOptions (deleted) — Details — Splinter Review

Now that we have CompileOptions, we really don't need the absurd diaspora of Compile varieties. This reigns in some of the madness by removing the API entries that are UTF-8 specific. Since I was in the area, this also kills off the NOPed parts of our legacy API. I was planning to do more here, but a quick visual inspection of https://developer.mozilla.org/en-US/docs/SpiderMonkey/JSAPI_Reference shows that somewhere from a third to half of our API surface is either obsoleted or deprecated. You have to start somewhere, but I can easily move this into a different patch or even bug if you'd like.

Attachment #679430 - Flags: review?(luke)

Luke Wagner [:luke]

Comment 7

•

12 years ago

Comment on attachment 679430 [details] [diff] [review] v0: move UTF-8 api to CompileOptions Review of attachment 679430 [details] [diff] [review]: ----------------------------------------------------------------- Rock it. (Sorry for the delay)

Attachment #679430 - Flags: review?(luke) → review+

Terrence Cole [:terrence]

Assignee

Updated

•

12 years ago

Attachment #676682 - Flags: checkin+

Terrence Cole [:terrence]

Assignee

Comment 8

•

12 years ago

Part2 pushed at: https://hg.mozilla.org/integration/mozilla-inbound/rev/139f53c8e5d8 Green try run at: https://tbpl.mozilla.org/?tree=Try&rev=2652920cb0f5

Ryan VanderMeulen [:RyanVM]

Comment 9

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/139f53c8e5d8

Flags: in-testsuite+

Terrence Cole [:terrence]

Assignee

Updated

•

12 years ago

Depends on: 811060

Naveed Ihsanullah [:naveed]

Updated

•

12 years ago

Whiteboard: [leave open] → [leave open][js:t]

Daniel Holbert [:dholbert]

Updated

•

12 years ago

Depends on: 816888

Terrence Cole [:terrence]

Assignee

Comment 10

•

11 years ago

I think all of the CESU-8 is gone as well as the weird modal stuff. If there is more badness to trim, we can do it in other bugs.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Remove UTF-8 handling from SpiderMonkey

Categories

(Core :: JavaScript Engine, defect)

Tracking

()

People

(Reporter: terrence, Assigned: terrence)

References

Details

(Whiteboard: [leave open][js:t])

Crash Data

Security

(public)

User Story

Attachments

(2 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Updated

Comment 5

Updated

Comment 6

Comment 7

Updated

Comment 8

Comment 9

Updated

Updated

Updated

Comment 10

Attachment

General

Description

File Name

Content Type