Closed Bug 757049 Opened 13 years ago Closed 12 years ago

GTK immodule receives invalid cursor position when the surrounding text contains any non-BMP Unicode character

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla15

People

(Reporter: yukawa, Assigned: masayuki)

References

Details

Attachments

(1 file, 2 obsolete files)

Patch 13 years ago Masayuki Nakano [:masayuki] (he/him)(JST, +0900) (deleted), patch		Details \| Diff \| Splinter Review
Patch 13 years ago Masayuki Nakano [:masayuki] (he/him)(JST, +0900) (deleted), patch	karlt : review+	Details \| Diff \| Splinter Review
Patch 13 years ago Masayuki Nakano [:masayuki] (he/him)(JST, +0900) (deleted), patch	karlt : review+	Details \| Diff \| Splinter Review

Yohei Yukawa

Reporter

Description

•

13 years ago

This is a corner case of Bug 353776.

If I understood correctly, there exists a confusion between UTF-16 and the number of Unicode characters in the current implementation of nsGtkIMModule::OnRetrieveSurroundingNative.

nsGtkIMModule::GetCurrentParagraph returns aCursorPos as the number of UTF-16 elements.
http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l1469

Then the returned aCursorPos will be passed as the second argument of g_utf8_offset_to_pointer, which expects the number of Unicode characters instead.
http://hg.mozilla.org/mozilla-central/file/7e24f87253c3/widget/gtk2/nsGtkIMModule.cpp#l923

As a result, an invalid cursor position will be passed to gtk_im_context_set_surrounding when the surrounding text contains any non-BMP Unicode character before the cursor location. The current implementation should work as long as there is no non-BMP character before the cursor.

Please let me know if I need to elaborate more.

Thanks,

Karl Tomlinson (:karlt)

Updated

•

13 years ago

Blocks: 353776

Status: UNCONFIRMED → NEW

Ever confirmed: true

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 1

•

13 years ago

Hmm... There is another issue. I'm not sure whether IVS is an independent character of UTF-8 in gnome. The document said nothing about IVS.

http://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-strlen

Karl Tomlinson (:karlt)

Comment 2

•

13 years ago

I expect variation selectors to count as separate characters in UTF-8.
http://unicode.org/reports/tr37/ says
"An Ideographic Variation Sequence (IVS) is a sequence of two coded characters, the first being a character with the Unified_Ideograph property, the second being a variation selector character in the range U+E0100 to U+E01EF."

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 3

•

13 years ago

Attached patch Patch (obsolete) (deleted) — Details — Splinter Review

I think that this the safest fix. I think that it doesn't make we handle all undocumented behavior of glib APIs. This make the cost increase. But it must not be too expensive.

Assignee: nobody → masayuki

Status: NEW → ASSIGNED

Attachment #625985 - Flags: review?(karlt)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 4

•

13 years ago

> it doesn't make we handle all undocumented behavior of glib APIs.

I meant it doesn't make sense...

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 5

•

13 years ago

Attached patch Patch (obsolete) (deleted) — Details — Splinter Review

Oops. I attached wrong file. Sorry for the spam.

Attachment #625985 - Attachment is obsolete: true

Attachment #625985 - Flags: review?(karlt)

Attachment #625986 - Flags: review?(karlt)

Karl Tomlinson (:karlt)

Comment 6

•

13 years ago

Comment on attachment 625986 [details] [diff] [review]
Patch

>+    // Note that we shouldn't use g_utf8_offset_to_pointer() in this situation.
>+    // cursorPos is cursor position in UTF16 string.  If we want to convert
>+    // it to number of UTF-8 characters, we need to know the exactly same rules
>+    // of the glib APIs.  E.g., How do they count IVS sequence? How do they
>+    // handle broken surrogate pairs or IVS sequence at converting from UTF-16?
>+    // For safety, we should convert the string before cursor from UTF-16 to
>+    // UTF-8 by a glib API.  This must not be too expensive.

I expect these are not real issues.
Characters in g_utf8_offset_to_pointer are Unicode characters.  These are well
defined in UTF-8.  No characters (not even variation selectors) are treated
specially.  If uniStr does not contain valid UTF-16 (e.g. unpaired surrogates)
then g_utf16_to_utf8() will fail.

However, GLib doesn't provide a better method to use here.

If you feel like trying to tidy this up further so that text is only converted
once and zero/Length() special-case optimizations are not necessary,
AppendUTF16toUTF8 could be used with Gecko string classes to first convert the
portion before the cursor and then the rest.

Attachment #625986 - Flags: review?(karlt) → review+

Karl Tomlinson (:karlt)

Comment 7

•

13 years ago

The same bug exists in the delete-surrounding signal handler, where offset and
n_chars measure Unicode characters, but I assume nsSelectionEvent wants UTF-16
code point counts.

Should that be fixed at the same time?

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 8

•

13 years ago

(In reply to Karl Tomlinson (:karlt) from comment #7)
> The same bug exists in the delete-surrounding signal handler, where offset
> and
> n_chars measure Unicode characters, but I assume nsSelectionEvent wants
> UTF-16
> code point counts.
> 
> Should that be fixed at the same time?

Ah, I guess so. IME developers might be able to avoid this bug if they know this. In such situation, both of them should work in the same way.

Karl Tomlinson (:karlt)

Comment 9

•

13 years ago

Thinking about it a bit more, retrieve-surrounding and delete-surrounding differ in that the former wants a byte offset while the latter uses character counts, so I'm guessing it would be fine to fix retrieve-surrounding first.

Also delete-surrounding uses offsets from the cursor position, so there would only be a problem if there was a non-BMP character either in the deleted text or between the deleted text and the cursor.  That is possibly less likely than having a non-BMP character in the surrounding text before the cursor.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 10

•

13 years ago

But IME may use both of them same time and compute the offset from the result of retrieve-surrounding. Anyway, I have a patch for both bugs, I'll test them tonight and tomorrow. I'll post the new patch soon.

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 11

•

13 years ago

Attached patch Patch (deleted) — Details — Splinter Review

How about this?

testing on tryserver:
https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=3665a4af9925

Attachment #625986 - Attachment is obsolete: true

Attachment #627985 - Flags: review?(karlt)

Karl Tomlinson (:karlt)

Comment 12

•

12 years ago

Comment on attachment 627985 [details] [diff] [review]
Patch

>+    NS_ENSURE_TRUE(!utf8Str.IsEmpty(), NS_ERROR_FAILURE);

This shouldn't be necessary.
The offset and length checks already deal with this.

>+    gchar* charAtOffset =
>+        g_utf8_offset_to_pointer(utf8Str.get(), offsetInUTF8Characters);
>+    if (!charAtOffset) {
>+        PR_LOG(gGtkIMLog, PR_LOG_ALWAYS,
>+            ("    FAILED, charAtOffset is NULL"));
>+        return NS_ERROR_FAILURE;
>+    }
>+
>+    gchar* charAtEnd =
>+        g_utf8_offset_to_pointer(utf8Str.get(), endInUTF8Characters);
>+    if (!charAtEnd) {
>+        PR_LOG(gGtkIMLog, PR_LOG_ALWAYS,
>+            ("    FAILED, charAtEnd is NULL"));
>+        return NS_ERROR_FAILURE;
>+    }

g_utf8_offset_to_pointer won't return NULL so don't bother checking the
result.  It can only imply a false sense of security.

Attachment #627985 - Flags: review?(karlt) → review+

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 13

•

12 years ago

Okay, I added aNChar check. That's never been checked.

https://hg.mozilla.org/integration/mozilla-inbound/rev/291c61234c4a

Target Milestone: --- → mozilla15

Phil Ringnalda (:philor)

Comment 14

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/291c61234c4a

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Yohei Yukawa

Reporter

Comment 15

•

12 years ago

I confirmed that undo command and reverse conversion command of Mozc 1.6.1187.102 are working fine on Firefox 15.0 / Ubuntu 12.04 even when surrounding text contains non-BMP characters.

Thank you for fixing this issue so quickly!

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Assignee

Comment 16

•

12 years ago

-> v. per comment 15

Thank you Yukawa-san.

Status: RESOLVED → VERIFIED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

GTK immodule receives invalid cursor position when the surrounding text contains any non-BMP Unicode character

Categories

(Core :: Widget: Gtk, defect)

Tracking

()

People

(Reporter: yukawa, Assigned: masayuki)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file, 2 obsolete files)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Attachment

General

Description

File Name

Content Type