Open Bug 1276600 Opened 8 years ago Updated 2 years ago

improve asian language support in bookmarks/history query and search results

Categories

(WebExtensions :: General, defect, P5)

Tracking

(Not tracked)

People

(Reporter: yuki, Unassigned)

References

Details

(Whiteboard: [design-decision-approved]triaged)

My addon "XUL/Migemo" https://addons.mozilla.org/en-US/firefox/addon/xulmigemo/ provides the ability to search history and bookmarks from the sidebar and the Places Organizer, with custom search conditions based on regular expressions. In other words, it applies a custom filter to the search results. Once XUL addons are no longer supported, we need a WebExtensions way to do the same.

The addon replaces the "load" method of a bookmark/history tree to replace search results:
https://github.com/piroor/xulmigemo/blob/master/content/xulmigemo/places/bookmarksPanelOverlay.js
https://github.com/piroor/xulmigemo/blob/master/modules/places.jsm
In short, the addon searches all bookmark/history entries and filters them with regular expressions generated from the user input. As a result, the addon can find webpages in bookmarks and history while ignoring differences in phonetic modifiers (like accents).
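
A minimal sketch of this kind of filtering, not XUL/Migemo's actual implementation: it builds a regular expression from the user input that tolerates combining accent marks, then filters entries by title and URL. The function names and the `{title, url}` entry shape are assumptions made for illustration.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a pattern in which every base character of the input may be
// followed by any number of combining marks (accents).
function buildAccentInsensitivePattern(input) {
  const baseChars = [...input.normalize("NFD")].filter(
    (ch) => !/\p{M}/u.test(ch)
  );
  const source = baseChars
    .map((ch) => escapeRegExp(ch) + "\\p{M}*")
    .join("");
  return new RegExp(source, "iu");
}

// Keep only the entries whose title or URL matches the generated pattern.
// Entries are decomposed (NFD) so accents appear as separate marks.
function filterEntries(entries, input) {
  const pattern = buildAccentInsensitivePattern(input);
  return entries.filter((entry) =>
    pattern.test(`${entry.title} ${entry.url}`.normalize("NFD"))
  );
}
```

With this sketch, the input "cafe" would match an entry titled "Café" as well as one titled "Cafeteria".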

This is similar to bug 1276598 but different: that bug is about the location bar (awesomebar), while this one is about the sidebars and the Places Organizer.
Whiteboard: [design-decision-needed]triaged
Hi Piro! This bug has been added to the agenda for the April 4 WebExtensions APIs triage. 

Call in info: https://wiki.mozilla.org/Add-ons/Contribute/Triage#Details_.26_How_to_Join

Meeting agenda: https://docs.google.com/document/d/1V4NP4tWnjHigS2lAosCLfkU2FTcrQnoQzzXZmmB1uzk/edit#

Hope to see you there!
Chatted with Piro on IRC. omnibox provides the ability to get user input and add search suggestion results. For CJK, the major limitation is the query capability on history and bookmarks. We boiled this down to the following:

a) add a search score (currently frecency) to HistoryItem and BookmarkTreeNode
   This allows the addon to choose how to sort the omnibox SuggestResult array
b) add basic logic (AND/OR) support to history.search and bookmarks.search queries
   Some queries are simply not possible with the current AND-only query
c) bonus if a single search function could be implemented that covers both history and bookmarks

If the awesomebar rearranges the omnibox SuggestResult array, adding the search score to the SuggestResult object would let the addon suggest the intended order to omnibox.
Summary: Provide ability to apply custom filter for search results of bookmarks and histories → improve asian language support in bookmarks/history query and search results
This is an extra description for the point: "improve asian language support".

Incremental search features in Firefox (awesomebar, search boxes for bookmarks and history) are very useful for many languages, but they are not enough for CJK languages. For example, these four different terms all share the pronunciation "nihongo" in Japanese: "にほんご", "ニホンゴ", "日本語", and "二本後". We Japanese users really need to find all of them with a single incremental input: "niho...". This is why the "OR" operator is required for searching history and bookmarks. Moreover, "OR" and "AND" operators will appear together in one query, such as "Firefox AND (にほんご OR ニホンゴ OR 日本語 ...)" generated from the single input "Firefox nihongo".
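
A rough sketch of the expansion described above, assuming a tiny hypothetical reading dictionary (`READINGS`); a real addon would use a much larger dictionary. Each whitespace-separated term becomes an OR group, and the groups are ANDed together. The names `expandInput` and `formatQuery` are made up for this example.

```javascript
// Hypothetical dictionary: romanized reading -> written forms.
const READINGS = new Map([
  ["nihongo", ["にほんご", "ニホンゴ", "日本語", "二本後"]],
]);

// Turn the input into AND-ed groups of OR-ed alternatives.
// A term is expanded when it is a prefix of a known reading, so an
// incremental input such as "niho" already picks up the candidates.
function expandInput(input, readings) {
  return input
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((term) => {
      const variants = [];
      for (const [reading, words] of readings) {
        if (reading.startsWith(term.toLowerCase())) {
          variants.push(...words);
        }
      }
      // Keep the raw term so that romanized titles still match.
      return [term, ...variants];
    });
}

// Render the groups in the "A AND (B OR C)" notation used in this bug.
function formatQuery(groups) {
  return groups
    .map((group) =>
      group.length === 1 ? group[0] : `(${group.join(" OR ")})`
    )
    .join(" AND ");
}
```

For the input "Firefox niho", this sketch yields "Firefox AND (niho OR にほんご OR ニホンゴ OR 日本語 OR 二本後)".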
Marco, it was suggested you might have good input on b) in comment 3, in order to achieve what Piro described in comment 4.  Really wanting to understand if there are any implications in supporting some kind of interface to boolean logic in the queries we have, such as the one here (which is what the webextensions bookmarks.query uses):

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/places/Bookmarks.jsm#1327


I haven't looked for where history.query is implemented, but it has the same basic functionality.  Alternately it may be possible to add a query to the omnibox api that could look something like this in an addon:

Some off-the-top-of-my-head thoughts on what it could look like:

omnibox.onInputChanged.addListener((text, suggest) => {
  let query = getTermsForText(text);
  // where query may look like ["Firefox", ["にほんご", "ニホンゴ", "日本語"]]
  // which may change internally to something like "Firefox AND (にほんご OR ニホンゴ OR 日本語)"
  history.query({query}).then(results => {
    // possibly further process results, before finishing
    suggest(results);
  });
});


omnibox.onInputChanged.addListener((text, suggest) => {
  let query = getTermsForText(text);
  // where query may look like ["Firefox", ["にほんご", "ニホンゴ", "日本語"]]
  // which may change internally to something like "Firefox AND (にほんご OR ニホンゴ OR 日本語)"
  omnibox.searchTerms({query}).then(results => {
    // possibly further process results, such as ordering, before finishing
    suggest(results);
  });
});
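
For illustration only, the nested-array query shape in the snippets above could be evaluated against a title or URL like this. It assumes top-level elements are ANDed and a nested array is an OR group; `matchesQuery` is a hypothetical helper, not part of any existing API.

```javascript
// Evaluate a query such as ["Firefox", ["にほんご", "ニホンゴ", "日本語"]]
// against a piece of text. Top-level terms must all match (AND);
// a nested array matches when any of its alternatives matches (OR).
function matchesQuery(query, text) {
  const haystack = text.toLowerCase();
  return query.every((term) =>
    Array.isArray(term)
      ? term.some((alt) => haystack.includes(alt.toLowerCase()))
      : haystack.includes(term.toLowerCase())
  );
}
```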
Flags: needinfo?(mak77)
(In reply to Shane Caraveo (:mixedpuppy) from comment #4)
> Marco, it was suggested you might have good input on b) in comment 3, in
> order to achieve what Piro described in comment 4. 

Proper character matching and folding is a complex problem in general.
The first thing we are architecturally missing is a fulltext index with a decent tokenization and folding.
It's a problem we didn't even solve ourselves for the awesomebar (bug 1340487), and if we'd solve that, we'd probably not need any special API, maybe not even an add-on, to get decent matching...
This is something we are evaluating to resource in the next year, but with Quantum and Photon in the middle I can't give a good estimate on that.

> Really wanting to
> understand if there are any implications in supporting some kind of
> interface to boolean logic in the queries we have, such as the one here

The only implication is that the current bookmarks.query API was basically a quick mock to provide API compatibility, but it's not the way to go: the bookmarks and history APIs are supposed to manage entries, not to search them. Long term we need an async querying component in Places that can work similarly to nsINavHistoryQuery, which is still synchronous and thus has horrible perf.
The interface or look of such API is not defined atm, it could be similar to the current XPCOM one, an ORM, or something else.

> Alternately it may be possible to add a query to the omnibox
> api that could look something like this in an addon:

Currently the omnibox API is one-way, the add-on gives results to the location bar. It could also work the other way around, so the add-on can ask the locationbar to give it a certain number of results with certain filters and parameters.
Adding AND/OR support to the location bar would not be trivial though, so the add-on should do that by itself. The problem with this is that, still due to the lack of fulltext, some results may take a long time to return, so the add-on should not wait for all the results before pumping out its results.
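
A sketch of the "don't wait for everything" idea, under the assumption that the addon runs one search per variant and forwards each batch as soon as it arrives. `searchIncrementally`, `searchFn` and `onResults` are hypothetical names; in an extension, `searchFn` could wrap `browser.history.search`.

```javascript
// Run one search per variant in parallel and hand over each batch of
// new results as soon as that search finishes, instead of waiting for
// the slowest one. URLs already delivered are skipped.
async function searchIncrementally(variants, searchFn, onResults) {
  const seenUrls = new Set();
  await Promise.all(
    variants.map(async (text) => {
      const items = await searchFn(text);
      const fresh = [];
      for (const item of items) {
        if (!seenUrls.has(item.url)) {
          seenUrls.add(item.url);
          fresh.push(item);
        }
      }
      if (fresh.length > 0) {
        onResults(fresh);
      }
    })
  );
}
```

The returned promise resolves once every variant has been searched, so the caller can still tell when the result set is complete.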

Everything boils down to whether we want to extend the omnibox API with a filtered search and then the add-on will have to do its own ANDing and ORing of results, or whether we need a more general Places querying API that can do AND / OR / SORTING and so on.
The former is likely cheaper to do, but may not be as flexible and useful as the latter.
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [::mak] from comment #5)
> Proper character matching and folding is a complex problem in general.
> The first thing we are architecturally missing is a fulltext index with a
> decent tokenization and folding.
> It's a problem we didn't even solve ourselves for the awesomebar (bug
> 1340487), and if we'd solve that, we'd probably not need any special API,
> maybe not even an add-on, to get decent matching...

Even if a better full-text search lands in Places, it won't solve my problem completely. My use case requires synonym search based on an arbitrary dictionary. In fact, several full-text search engines have a feature for this purpose. For example, Elasticsearch has the "synonyms" parameter:
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-synonyms.html
Groonga also has a more generic feature named "query expander":
http://groonga.org/docs/reference/query_expanders/tsv.html

A new parameter for passing synonym information would work as I expect, for example:

omnibox.searchTerms({
  query: "Firefox nihongo",
  synonyms: [
    ["nihongo", "にほんご", "ニホンゴ", "日本語", "二本後"]
  ]
})
Thus, WHAT I ACTUALLY NEED is NOT "boolean operators". Instead, what is required is some mechanism to inject synonym information into bookmark/history searches!
"Required" may be too strong a statement. The boolean operators would work; it's a matter of where it is best to handle that, and whether or not it is something we should implement in general.

You can currently get synonyms by making multiple calls to history|bookmarks.search and pulling out the top results for each.  I think you're limited in the number of results you can send to omnibox.  The limiting factor in that is the lack of a score (or frecency in Firefox) to help with sorting the results.  I know this isn't the most performant way to handle it, but it would work.
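
A sketch of that workaround, assuming one `browser.history.search` call per synonym. Since no score is exposed, it falls back to `visitCount` as a stand-in for ordering, which is an assumption of this example and not equivalent to frecency. `mergeTopResults` and `searchSynonyms` are hypothetical helper names.

```javascript
// Merge several result lists, drop duplicate URLs, and keep the top
// `limit` entries. visitCount is used as a rough ordering because the
// search APIs do not return a score.
function mergeTopResults(resultLists, limit) {
  const byUrl = new Map();
  for (const list of resultLists) {
    for (const item of list) {
      if (!byUrl.has(item.url)) {
        byUrl.set(item.url, item);
      }
    }
  }
  return [...byUrl.values()]
    .sort((a, b) => (b.visitCount || 0) - (a.visitCount || 0))
    .slice(0, limit);
}

// Issue one history search per synonym and merge the results.
// Must run inside an extension with the "history" permission.
async function searchSynonyms(synonyms, limit) {
  const lists = await Promise.all(
    synonyms.map((text) =>
      browser.history.search({ text, startTime: 0, maxResults: limit })
    )
  );
  return mergeTopResults(lists, limit);
}
```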

Marco, do you have any thoughts on returning frecency as part of the query that webextensions uses for those search APIs?
Flags: needinfo?(mak77)
(In reply to Shane Caraveo (:mixedpuppy) from comment #8)
> Marco, do you have any though on returning frecency as part of the query
> that webextensions uses for those search APIs?

I don't think it would be a problem to return frecency, with the caveat that some special kinds of results may not have one, in which case we'd probably return -1. But normal history and bookmarks entries all have their frecency.
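
If such a value were exposed, an addon could order its suggestions roughly as sketched here. The `frecency` property on result items is hypothetical (it is the proposal discussed in this bug, not an existing field), and entries without one are treated as -1, as suggested above.

```javascript
// Sort items by a hypothetical `frecency` property, highest first.
// Items without a numeric frecency are treated as -1 and sink to the end.
function rankByFrecency(items) {
  const score = (item) =>
    typeof item.frecency === "number" ? item.frecency : -1;
  return [...items].sort((a, b) => score(b) - score(a));
}
```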
Flags: needinfo?(mak77)
After reading through the last few comments, it sounds like we are on the path to getting something working, so marking as approved.
Priority: -- → P5
Whiteboard: [design-decision-needed]triaged → [design-decision-approved]triaged
(In reply to YUKI "Piro" Hiroshi from comment #3)
> Incremental search features in Firefox (awesomebar, searchboxes for
> bookmarks and histories) are very useful for many languages, but not enough
> for CJK languages. For example, all these four different terms have same
> pronunciation "nihongo" in Japanese: "にほんご", "ニホンゴ", "日本語", and "二本後". We
> Japanese people really require to search them with just single incremental
> input: "niho...". This is the reason why the "OR" operator is required to
> search histories/bookmarks. Moreover, "OR" and "AND" operators will appear
> in one query, like "Firefox AND (にほんご OR ニホンゴ OR 日本語 ...)" from single input
> "Firefox nihongo".

This is also true about Vietnamese to some extent, in which diacritics often prevent incremental search from working and diacritic folding tends to worsen accuracy. (This is in large part because tone marks that appear in the middle of the word are usually entered after all the base letters in the word.) A simple example: a search for "xoa" would ideally bring up results for "xoa", "xóa", and "xoá", whereas "xóa" should bring up results for "xóa" and "xoá" but not "xoa". Searches for "tan", "tân", and "tán" would all bring up results for "tấn".

An extension such as AVIM <https://github.com/1ec5/avim/> would be able to use this feature to generate words with additional diacritics as synonyms.
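
The matching rule in the Vietnamese example above can be approximated as follows. This is a simplified sketch, not AVIM's logic: it compares whole words, requires the base letters to be equal, and accepts a candidate when it carries at least the diacritics typed in the query, regardless of which letter they sit on. Letters that do not decompose under NFD, such as "đ", are not handled. The function names are made up for this example.

```javascript
// Split a word into its base letters and its combining marks (NFD).
function splitMarks(word) {
  const base = [];
  const marks = [];
  for (const ch of word.normalize("NFD").toLowerCase()) {
    if (/\p{M}/u.test(ch)) {
      marks.push(ch);
    } else {
      base.push(ch);
    }
  }
  return { base: base.join(""), marks };
}

// A candidate matches when it has the same base letters as the query
// and contains every diacritic of the query (position is ignored, so
// "xóa" and "xoá" are treated alike).
function looselyMatches(queryWord, candidateWord) {
  const query = splitMarks(queryWord);
  const candidate = splitMarks(candidateWord);
  if (query.base !== candidate.base) {
    return false;
  }
  const remaining = [...candidate.marks];
  return query.marks.every((mark) => {
    const index = remaining.indexOf(mark);
    if (index === -1) {
      return false;
    }
    remaining.splice(index, 1);
    return true;
  });
}
```

Under these assumptions, "xoa" matches "xoa", "xóa" and "xoá", while "xóa" matches "xóa" and "xoá" but not "xoa".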
Product: Toolkit → WebExtensions
Bulk move of bugs per https://bugzilla.mozilla.org/show_bug.cgi?id=1483958
Component: Untriaged → General
Severity: normal → S3