Closed Bug 413784 Opened 17 years ago Closed 17 years ago

Search for a non-English term in the URL doesn't match

Categories

(Firefox :: Bookmarks & History, defect)

VERIFIED FIXED
Firefox 3 beta3

People

(Reporter: erwan, Assigned: erwan)

References

(Blocks 1 open bug, )

Attachments

(3 files, 4 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b3pre) Gecko/2008012320 Minefield/3.0b3pre
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b3pre) Gecko/2008012320 Minefield/3.0b3pre

Since the DB holds the URI in its URI-encoded form, search terms typed in decoded form don't match it.

For example, this URL:
http://flocktest.wordpress.com/2008/01/23/%E3%81%BD%E3%81%BD/
Becomes, decoded:
http://flocktest.wordpress.com/2008/01/23/ぽぽ/

A search for "ぽぽ" will return no result. It should return that page.
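
The mismatch can be reproduced outside the browser. The following is only a sketch using Python's standard library, not the Places code: it compares the search term against the stored (encoded) URL and shows two ways the comparison can be made to succeed.

```python
from urllib.parse import quote, unquote

# The URL as stored (percent-encoded UTF-8) and the term as the user types it.
stored_url = "http://flocktest.wordpress.com/2008/01/23/%E3%81%BD%E3%81%BD/"
term = "ぽぽ"

# A raw substring test against the stored form fails.
print(term in stored_url)            # False

# It succeeds once the term is percent-encoded as UTF-8...
print(quote(term))                   # %E3%81%BD%E3%81%BD
print(quote(term) in stored_url)     # True

# ...or once the stored URL is decoded before comparing.
print(term in unquote(stored_url))   # True
```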


Reproducible: Always

Steps to Reproduce:
1. Visit http://flocktest.wordpress.com/2008/01/23/ぽぽ/
2. Search for "ぽぽ" in the URL bar

Actual Results:  
No result (or other non-related pages)

Expected Results:  
http://flocktest.wordpress.com/2008/01/23/ぽぽ/ appears in the results set

I had to create that dummy page because most pages with non-English characters in their URL appear to have the same terms in the title. The bug would not reproduce in this case because, with a match on the title, the pages would show.

Some solutions have been discussed in bug 389465:
* do a decodeURI before the search, and search for both terms. That would increase the cost of a query.
* store a decoded version in the DB. That would increase the cost of indexing and the size of the DB.
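
To make the trade-off between the two options concrete, here is a rough sketch against an in-memory SQLite database. The `pages` table, its `url_decoded` column, and both function names are made up for illustration; they are not the Places schema or API.

```python
import sqlite3
from urllib.parse import quote, unquote

URL = "http://flocktest.wordpress.com/2008/01/23/%E3%81%BD%E3%81%BD/"


def search_both_forms(conn, term):
    """Option 1: test the raw term and its encoded form at query time."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE instr(url, ?) > 0 OR instr(url, ?) > 0",
        (term, quote(term)),
    )
    return [row[0] for row in rows]


def search_decoded_column(conn, term):
    """Option 2: test the raw term against a decoded copy kept in the DB."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE instr(url_decoded, ?) > 0", (term,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical schema: option 2 pays for the extra column on every insert.
conn.execute("CREATE TABLE pages (url TEXT, url_decoded TEXT)")
conn.execute("INSERT INTO pages VALUES (?, ?)", (URL, unquote(URL)))

print(search_both_forms(conn, "ぽぽ"))
print(search_decoded_column(conn, "ぽぽ"))
```

Option 1 does twice the comparisons per row; option 2 keeps the query simple but stores every URL twice.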
Severity: minor → normal
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: PC → All
Version: unspecified → Trunk
bug 389465 comment #10:
> (Things aren't that bad in this case anyway because the title gets matched..)

I beg to differ.  If you only consider Wikipedia-style websites, that may be correct, but not even Wikipedia follows this on every kind of page.  Example:

<http://fa.wikipedia.org/w/index.php?title=%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C&action=edit>


I think this is a serious loss of functionality for non-English users, especially those whose languages use few Latin letters, if any.  Therefore, I'm requesting the blocking flag on this bug.
Flags: blocking-firefox3?
I agree that the general problem is really bad if we can't match in URLs, but like I said, in that particular case of Wikipedia, it's not too bad.

However, there's a belief that users will start focusing on matching in titles rather than in URLs. That's why there's an emphasis on the title in the display, and why the title is searched before the URL when querying.
Here is a common problem, and a serious one for power users.  Suppose I open two Wikipedia pages, one with an ASCII-only URL and one with an IRI, e.g. http://en.wikipedia.org/wiki/Durs_Grunbein and http://en.wikipedia.org/wiki/Durs_Gr%C3%BCnbein. When I try to open them again later and want to pick the URL I'm looking for, it's not there.

Of course this is not a big deal for German users, whose language has only a few non-ASCII letters, but it makes displaying and searching URLs useless for all users of non-Latin scripts.
Edward, what do you think about just unescaping the query string?
* We just add the cost of one unescape to the query; we don't do a double query or mess with the DB
* If the user types the escaped string (like "%D8%B5") it will not match. But I don't think anyone is ever going to do this kind of search.
Attached patch encodeURI the search string for URL match (obsolete) (deleted) — Splinter Review
Since the URLs are URIencoded in the DB, I changed the SQL query to use an encoded version to match on the URL (but still use the non-encoded version to match on the title).

That relies on a native implementation of encodeURI that I put inline in the file, because I didn't know where to put it. Maybe it would be better somewhere else?
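
For readers who can't see the deleted attachment, this is roughly what an inline encodeURI looks like. It is a sketch in Python that follows the ECMAScript definition of encodeURI (leave letters, digits, and a fixed set of punctuation alone; percent-encode the UTF-8 bytes of everything else). It is not the code from the patch, and `encode_uri` is a name chosen here.

```python
import string

# Characters that ECMAScript's encodeURI leaves untouched:
# letters, digits, the reserved set ; / ? : @ & = + $ , plus #,
# and the unreserved marks - _ . ! ~ * ' ( )
UNESCAPED = set(string.ascii_letters + string.digits + ";/?:@&=+$,#" + "-_.!~*'()")


def encode_uri(text):
    """Percent-encode text the way encodeURI does, using UTF-8 bytes."""
    out = []
    for ch in text:
        if ch in UNESCAPED:
            out.append(ch)
        else:
            # Each byte of the UTF-8 encoding becomes %XX (uppercase hex).
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(encode_uri("ぽぽ"))       # %E3%81%BD%E3%81%BD
print(encode_uri("Grünbein"))   # Gr%C3%BCnbein
```

Note that a literal "%" typed by the user is itself encoded (as %25) by this scheme, so an already-escaped search term would not match the stored URL.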
Attachment #299335 - Flags: review?
Attachment #299335 - Flags: review? → review?(dietrich)
It's not really a problem now, but later if we allow the user to type multiple words like "page title ぽ" to search in both the title and url at the same time, we won't know which ones to escape or not.

I suppose we would do something like
(title LIKE 'page' OR url LIKE encode('page')) AND (title LIKE 'title' OR url LIKE encode('title')) AND (title LIKE 'ぽ' OR url LIKE encode('ぽ')) // pretend there are % wildcards around each term
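
A runnable version of that idea might look like the sketch below. It is written in Python against SQLite, with a made-up `pages` table and helper names, so treat it as an illustration rather than the autocomplete code. One detail the pseudo-SQL glosses over: an encoded term is full of "%" characters, which LIKE treats as wildcards, so they are escaped here with an ESCAPE clause.

```python
import sqlite3
from urllib.parse import quote

URL = "http://flocktest.wordpress.com/2008/01/23/%E3%81%BD%E3%81%BD/"


def like_pattern(text):
    """Escape LIKE wildcards in text, then wrap it in %...% for substring match."""
    escaped = text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return "%" + escaped + "%"


def build_query(search_string):
    """One (title OR url) clause per whitespace-separated token, ANDed together."""
    tokens = search_string.split()
    if not tokens:
        return "SELECT url FROM pages WHERE 0", []
    clause = "(title LIKE ? ESCAPE '\\' OR url LIKE ? ESCAPE '\\')"
    sql = "SELECT url FROM pages WHERE " + " AND ".join([clause] * len(tokens))
    params = []
    for token in tokens:
        # Raw token for the title, percent-encoded token for the URL.
        params += [like_pattern(token), like_pattern(quote(token))]
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [(URL, "page title"), ("http://example.org/", "page title")],
)

sql, params = build_query("page title ぽ")
print(conn.execute(sql, params).fetchall())
```

Because every token is tried in both forms, there is no need to guess which tokens were meant for the title and which for the URL.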
Comment on attachment 299335 [details] [diff] [review]
encodeURI the search string for URL match

Oops, no longer applies... I'm merging now and I'll submit a new patch then.
Attachment #299335 - Attachment is obsolete: true
Attachment #299335 - Flags: review?(dietrich)
Attached patch encodeURI the search string for URL match (obsolete) (deleted) — Splinter Review
Patch merged with recent changes - working on a unit test.
Attachment #299358 - Attachment is obsolete: true
Attachment #299363 - Flags: review?(dietrich)
Broke again, new version.

Dietrich: is there any patch in the pipe that I should know about?
Attachment #299363 - Attachment is obsolete: true
Attachment #299639 - Flags: review?(dietrich)
Attachment #299363 - Flags: review?(dietrich)
merged again
Attachment #299639 - Attachment is obsolete: true
Attachment #300093 - Flags: review?(dietrich)
Attachment #299639 - Flags: review?(dietrich)
Attached patch v2 (deleted) — Splinter Review
Thanks for the patches Erwan. I've reimplemented the escaping and merged the patch on top of a few other changes like..

Bug 414285 - Refactor AutoCompleteTagsSearch token splitting code and persist tokens
Bug 401869 - Allow multiple words search in Auto-complete/Location Bar
Attachment #300333 - Flags: review?(dietrich)
Depends on: 414285
Attachment #300093 - Flags: review?(dietrich)
Attachment #300333 - Flags: review?(dietrich)
Thanks for looking into this and providing patches Erwan.

Checking in toolkit/components/places/tests/unit/test_413784.js;
/cvsroot/mozilla/toolkit/components/places/tests/unit/test_413784.js,v  <--  test_413784.js
initial revision: 1.1
done
Assignee: nobody → erwan
No longer depends on: 414285
This should be fixed by bug 407974.
Status: NEW → RESOLVED
Closed: 17 years ago
Depends on: 407974
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → Firefox 3 beta3
Verified FIXED using Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9b3pre) Gecko/2008020419 Minefield/3.0b3pre; see the screenshot in comment 16, above.
Status: RESOLVED → VERIFIED
Flags: blocking-firefox3? → blocking-firefox3+
Blocks: fx35-l10n-fa
No longer blocks: Persian-Fx3.5
No longer blocks: fx35-l10n-fa
Blocks: Persian
Bug 451915 - move Firefox/Places bugs to Firefox/Bookmarks and History. Remove all bugspam from this move by filtering for the string "places-to-b-and-h".

In Thunderbird 3.0b, you do that as follows:
Tools | Message Filters
Make sure the correct account is selected. Click "New"
Conditions: Body   contains   places-to-b-and-h
Change the action to "Delete Message".
Select "Manually Run" from the dropdown at the top.
Click OK.

Select the filter in the list, make sure "Inbox" is selected at the bottom, and click "Run Now". This should delete all the bugspam. You can then delete the filter.

Gerv
Component: Places → Bookmarks & History
QA Contact: places → bookmarks