Closed Bug 499990 Opened 15 years ago Closed 15 years ago

Locale-aware collation

Categories

(Toolkit :: Storage, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: adw, Assigned: adw)

References

Details

(Keywords: dev-doc-complete)

Attachments

(1 file, 10 obsolete files)

Currently in Places to sort strings from the database while respecting the application's locale, we have to get our entire set of results and then sort using nsICollation.compareString(). In order to push sorting from C++ down to SQL, we need a locale-aware collation in Storage. Such a collation would be useful for all consumers building on the platform that need a string sort respectful of locale, which really should be every consumer that needs a string sort period.
Assignee: nobody → adw
Status: NEW → ASSIGNED
Attached patch WIP v1 (obsolete) (deleted) — Splinter Review
Keeps an nsICollation pointer in the Connection class that's protected by a lock and exposes Connection::localeCompare(). Test passes so far.
Attached patch WIP v1.1 (obsolete) (deleted) — Splinter Review
Added test...
Attachment #384803 - Attachment is obsolete: true
You will actually want to build on top of bug 494828 and use the mutex it adds.
Attached patch WIP v2 (obsolete) (deleted) — Splinter Review
Gets rid of the lock mentioned in comment 1, since SQLite serializes the execution of concurrently executing statements, as Shawn and I discussed today. Added a note about that to the source. Got rid of the new method mentioned in comment 1. Instead, lazily create the nsICollation instance and expose a getter, Connection::getLocaleCollation(). Need to investigate whether a UTF-16 version is needed -- I noticed the custom functions we're creating handle both UTF-8 and UTF-16. I'm also thinking that while I'm here it might be useful to create both case-sensitive and -insensitive versions. It would be trivial.
Attachment #384804 - Attachment is obsolete: true
Actually, why do Storage's custom functions handle UTF-16 at all? Looking over the code, doesn't Storage convert all text to UTF-8 before putting it in the DB? And aren't statements also converted to UTF-8 before execution? Just trying to figure out what I need to do here.
Attached patch WIP v3 (obsolete) (deleted) — Splinter Review
This creates the following collations: locale locale_case_sensitive locale_accent_sensitive locale_case_accent_sensitive That covers all the bases of nsICollation: http://mxr.mozilla.org/mozilla-central/source/intl/locale/idl/nsICollation.idl#62 I think case insensitivity is the common... case. That's what "locale" is. To add sensitivity you specify one of the others. Shawn, what do you think about that? If it's OK with you, I'll work on tests for the full set.
Attachment #385016 - Attachment is obsolete: true
That set of four sounds good to me.
(In reply to comment #5) > Actually, why do Storage's custom functions handle UTF-16 at all? Looking over > the code, doesn't Storage convert all text to UTF-8 before putting it in the > DB? And aren't statements also converted to UTF-8 before execution? Just > trying to figure out what I need to do here. AFAIK, consumers could force it to use UTF-16 if they wanted to.
Attached patch for review v1 (obsolete) (deleted) — Splinter Review
Includes tests and UTF-16 support.
Attachment #385051 - Attachment is obsolete: true
Attachment #385440 - Flags: review?(sdwilsh)
Attached patch for review v1.1 (obsolete) (deleted) — Splinter Review
Sorry, for review v1 was based off of the patch in bug 494828. This one is based off the current tree.
Attachment #385440 - Attachment is obsolete: true
Attachment #385457 - Flags: review?(sdwilsh)
Attachment #385440 - Flags: review?(sdwilsh)
Comment on attachment 385457 [details] [diff] [review] for review v1.1 http://reviews.visophyte.org/r/385457/diff/#index_header (you'll likely want to go here to see the diff with comments inline) on file: storage/src/mozStorageConnection.h line 180 > // This isn't accessed but is used to make sure that the connections do This should be the last member of the class, so add new things above it please. on file: storage/src/mozStorageConnection.h line 184 > /** nit: please use @note Also, this comment is not completely accurate. It should say something about it being OK to not protect it with a mutex because it is only ever used from SQLite callbacks, and those are serialized within the same process. on file: storage/src/mozStorageConnection.cpp line 260 > , mLocaleCollation(nsnull) You do not need to initialize a nsCOMPtr to null in the constructor (strangely, we do that with other stuff :/) on file: storage/src/mozStorageConnection.cpp line 272 > mLocaleCollation = nsnull; Also don't need to set it to null at destruction - the nsCOMPtr's destructor will do this for you. on file: storage/src/mozStorageConnection.cpp line 306 > if (!mLocaleCollation) { I would prefer to have an early return here if mLocaleCollation is not null. Saves on the indenting. on file: storage/src/mozStorageConnection.cpp line 330 > mLocaleCollation = nsnull; XPCOM rules say that the outparam will not be written to if a failure is given, so you do not have to null this out. on file: storage/src/mozStorageSQLFunctions.h line 60 > /** I'd like to see this in a new file please (along with the implementation). How about collations.[h|cpp]? on file: storage/src/mozStorageSQLFunctions.h line 106 > * Custom UTF-8 collating sequence that respects the application's locale. Is this stuff application-locale or is it OS-locale? Just checking. on file: storage/src/mozStorageSQLFunctions.h line 112 > * The length of the first string. I'd prefer something like "The length of characters of aStr1." on file: storage/src/mozStorageSQLFunctions.h line 114 > * The first string. The first string what? Please be a bit more descriptive here. on file: storage/src/mozStorageSQLFunctions.h line 119 > * @return strcmp semantics. I'd prefer if you documented what the semantics are. I usually have to look it up just to make sure. on file: storage/src/mozStorageSQLFunctions.h line 200 > * The length of the first string. Is this length in bytes or characters? The distinction is important here. on file: storage/src/mozStorageSQLFunctions.cpp line 180 > NS_WARNING("Connection has no collation"); Use NS_ERROR here please on file: storage/src/mozStorageSQLFunctions.cpp line 188 > nsresult rv = coll->CompareString(aComparisonStrength, str1, str2, &res); Why don't you just call the UTF-16 version of this function once you convert? It's the same code with that minor exception. You may even want to make two versions of the UTF-16 function - one that takes an nsAString and one that takes a const void *. The latter would just call the former, and then all of your logic would be in one function instead of two. on file: storage/src/mozStorageSQLFunctions.cpp line 190 > NS_WARNING("Collation compare string failed"); NS_ERROR here as well please on file: storage/src/mozStorageSQLFunctions.cpp line 214 > PRInt32 This should really return an int. On 64-bit systems we have a mismatch of types. on file: storage/src/mozStorageSQLFunctions.cpp line 224 > NS_WARNING("Connection has no collation"); NS_ERROR please on file: storage/src/mozStorageSQLFunctions.cpp line 234 > NS_WARNING("Collation compare string failed"); NS_ERROR please on file: storage/src/mozStorageSQLFunctions.cpp line 255 > localeCollation8}, nit: place each sub-item on a new line please: {"locale", SQLITE_UTF8, localeCollation8}, on file: storage/src/mozStorageSQLFunctions.cpp line 273 > for (unsigned i = 0; SQLITE_OK == rv && i < NS_ARRAY_LENGTH(collations); ++i) { should probably use size_t here on file: storage/test/unit/test_locale_collation.js line 1 > /* ***** BEGIN LICENSE BLOCK ***** Missing the right header per storage/style.txt on file: storage/test/unit/test_locale_collation.js line 16 > * The Initial Developer of the Original Code is Mozilla Corp. should be a new line after is. Also spell out Corp please on file: storage/test/unit/test_locale_collation.js line 101 > desc: "Async test", It would be OK to write this test with the synchronous methods too. The same SQLite code would run, and it might make debugging test failures easier in the future. Your call on this. on file: storage/test/unit/test_locale_collation.js line 139 > ]; It would also be useful to test your UTF-16 callback. You can switch SQLite into UTF-16 mode by setting http://www.sqlite.org/pragma.html#pragma_encoding accordingly. on file: storage/test/unit/test_locale_collation.js line 141 > /////////////////////////////////////////////////////////////////////////////// Think you want a line after this like: //// Helper Functions Also, please move this section above the tests. on file: storage/test/unit/test_locale_collation.js line 160 > } nit: no braces needed
Attachment #385457 - Flags: review?(sdwilsh) → review-
(In reply to comment #11) > on file: storage/src/mozStorageSQLFunctions.h line 106 > > * Custom UTF-8 collating sequence that respects the application's locale. > > Is this stuff application-locale or is it OS-locale? Just checking. Application locale. That's what Places is currently doing, and I think it makes sense for the platform also. What do you think? As you say, OS locale is the other possible choice: http://mxr.mozilla.org/mozilla-central/source/intl/locale/idl/nsILocaleService.idl > on file: storage/test/unit/test_locale_collation.js line 101 > > desc: "Async test", > > It would be OK to write this test with the synchronous methods too. The same > SQLite code would run, and it might make debugging test failures easier in the > future. Your call on this. Does this mean it would be OK to use sync methods instead of async ones, or it would be good to write another test like this but in addition to it that uses sync methods? I didn't make it clear, but the purpose of this test is to use collations with multiple concurrently executing statements and make sure nothing blows up. I guess it's not too rigorous though.
(In reply to comment #12) > Application locale. That's what Places is currently doing, and I think it > makes sense for the platform also. What do you think? As you say, OS locale > is the other possible choice: Application locale is fine, but I just wanted to make sure that is what we a) wanted and b) got. > Does this mean it would be OK to use sync methods instead of async ones, or it > would be good to write another test like this but in addition to it that uses > sync methods? I didn't make it clear, but the purpose of this test is to use > collations with multiple concurrently executing statements and make sure > nothing blows up. I guess it's not too rigorous though. I'm fine with just sync methods. It's an API guarantee that all async calls on the same connection will be serialized, so throwing a bunch of statements at it isn't going to actually do anything.
Attached patch for review v2 (obsolete) (deleted) — Splinter Review
Addressed comment 11 and comment 13. (In reply to comment #11) > on file: storage/src/mozStorageSQLFunctions.cpp line 273 > > for (unsigned i = 0; SQLITE_OK == rv && i < NS_ARRAY_LENGTH(collations); ++i) { > > should probably use size_t here I copied this over from mozStorageSQLFunctions.cpp, so I took the liberty of fixing it there, too. Also, writing the UTF-16 tests exposed some fallacious thinking I had about length given in characters vs. bytes (it's bytes), and that's documented in localeCollationHelper16() and other function comments. And, tests are now fully sync.
Attachment #385457 - Attachment is obsolete: true
Attachment #385556 - Flags: review?(sdwilsh)
Attachment #385556 - Flags: review?(sdwilsh) → review+
Comment on attachment 385556 [details] [diff] [review] for review v2 http://reviews.visophyte.org/r/385556/diff/#index_header on file: storage/src/Makefile.in line 88 > collations.cpp \ I have a better idea for a name that is slightly less generic: SQLCollations.cpp on file: storage/src/collations.h line 16 > * The Original Code is unicode functions code. that's not right ;) on file: storage/src/collations.h line 40 > #ifndef _mozStorageCollations_h_ actually, our define should be in a different form: mozilla_storage_collations_h_ (and probably mozilla_storage_SQLCollations_h if you change the filename) on file: storage/src/collations.h line 79 > * @param aComparisonStrength You have this parameter in all the docs, but it's not actually in any of the functions. on file: storage/src/collations.cpp line 16 > * The Original Code is unicode functions code. not unicode functions ;) on file: storage/src/collations.cpp line 163 > registerCollations(sqlite3 *aDB, Connection *aConn) Storage style guidelines say the parameters should be on their own line. on file: storage/test/unit/test_locale_collation.js line 148 > while (stmt.executeStep()) { > results.push(stmt.row.t); > } nit: don't need braces here Also, use NS_ERROR instead of NS_WARNING for the lazy getter in mozStorageConnection.cpp r=sdwilsh with these changes addressed.
It's not clear why you can't use nsDependentString - why do you need a null-terminated string exactly?
###!!! ASSERTION: nsTDependentString must wrap only null-terminated strings: 'mData[mLength] == 0', file ../../dist/include/nsTDependentString.h, line 67 http://mxr.mozilla.org/mozilla-central/source/xpcom/string/public/nsTDependentString.h#67
About strings being passed to the collation callbacks non-null-terminated, huzzah: http://www.mail-archive.com/sqlite-users@sqlite.org/msg35634.html This confirms my own testing. I paired down the unit test to just working off the strings ["a", "b"] instead of reading test data from a file and stepped through it in gdb. My UTF-8 collation callback is getting passed as one of the strings: a\002\017a which is followed by a repeating sequence of bytes: \323\301\261\241 and the corresponding length of this string as passed in through a different parameter is 1 byte. Not to mention the fact that nsDependentString was asserting that the buffer passed to it in the UTF-16 callback wasn't null-terminated. Now on to making sure nsICollation is safe to touch from multiple threads.
Use nsDependentSubstring to wrap non null-terminated buffers.
Attached patch for review v3 (obsolete) (deleted) — Splinter Review
Thanks for the tip, Benjamin. This addresses comment 15 and comment 19. I ran this patch on all three platforms, looks good, and I pushed it to tryserver just now.
Attachment #385556 - Attachment is obsolete: true
I should note that this is a new API, and so it should be sr'd.
Attachment #387201 - Flags: superreview?(vladimir)
Tryserver looks good except for two oranges, both of which are known: bug 488596 and bug 493359. I added comments to them. https://build.mozilla.org/tryserver-builds/dwillcoxon@mozilla.com-1246982390/
nsCollationUnix::CompareString() makes a couple of calls to the POSIX function setlocale(). It calls it first to set the locale to that of the given collation, does the string comparison, then calls it again to restore the old locale. According to [1], "The locale state is common to all threads within a process." Plus, Mozilla code for the OS X and Windows cases looks fine, but the only info I could find about reentrancy re: their respective system calls was scant speculation about OS X on a Xerces mailing list. So it looks like I'll have to synchronize at least localeCollationHelper(), which is the function that actually calls CompareString() on the Connection's nsICollation. [1] http://www.opengroup.org/onlinepubs/009695399/functions/setlocale.html
Comment on attachment 387201 [details] [diff] [review] for review v3 Need a new patch that synchronizes access to the collation across all threads. Planning to move the nsICollation instance up from Connection to the Service singleton. localeCollationHelper() will move from mozilla::storage to mozilla::storage::Service. The Service already keeps a lock, will use that one to sync localeCollationHelper() also.
Attachment #387201 - Attachment is obsolete: true
Attachment #387201 - Flags: superreview?(vladimir)
Attached patch for review v4 (obsolete) (deleted) — Splinter Review
Will run on tryserver again.
Attachment #387658 - Flags: review?(sdwilsh)
Tryserver looked good again; there was the same known orange, bug 488596.
Comment on attachment 387658 [details] [diff] [review] for review v4 http://reviews.visophyte.org/r/387658/ on file: storage/src/SQLCollations.h line 93 > /** > * Custom UTF-8 collating sequence that respects the application's locale. > * Comparison is case-sensitive and accent-insensitive. This is called by > * SQLite. > * > * @param aService > * The Service that owns the nsICollation used by this collation. > * @param aLen1 > * The number of bytes (not characters) in aStr1. > * @param aStr1 > * The string to be compared against aStr2. It will be passed in by > * SQLite as a non-null-terminated char* buffer. > * @param aLen2 > * The number of bytes (not characters) in aStr2. > * @param aStr2 > * The string to be compared against aStr1. It will be passed in by > * SQLite as a non-null-terminated char* buffer. > * @return aStr1 - aStr2. That is, if aStr1 < aStr2, returns a negative number. > * If aStr1 > aStr2, returns a positive number. If aStr1 == aStr2, > * returns 0. > */ Bytes is the number of characters in this situation actually. Ditto for all the utf-8 comments. on file: storage/src/SQLCollations.cpp line 50 > /** > * Helper function for the UTF-8 locale collations. > * > * @param aService > * The Service that owns the nsICollation used by this collation. > * @param aLen1 > * The number of bytes (not characters) in aStr1. > * @param aStr1 > * The string to be compared against aStr2 as provided by SQLite. It > * must be a non-null-terminated char* buffer. > * @param aLen2 > * The number of bytes (not characters) in aStr2. Same here re: bytes and characters. on file: storage/src/SQLCollations.cpp line 121 > nsDependentSubstring str1(buf1, buf1 + (aLen1 / 2)); > nsDependentSubstring str2(buf2, buf2 + (aLen2 / 2)); Instead of dividing by two, could you please divide by sizeof(PRUnichar). Let the compiler figure out the size just in case. (yeah, we don't do that all over in storage, but we should be doing that) on file: storage/src/SQLCollations.cpp line 141 > {"locale", SQLITE_UTF8, localeCollation8}, > {"locale_case_sensitive", SQLITE_UTF8, localeCollationCaseSensitive8}, > {"locale_accent_sensitive", SQLITE_UTF8, localeCollationAccentSensitive8}, > {"locale_case_accent_sensitive", SQLITE_UTF8, localeCollationCaseAccentSensitive8}, > {"locale", SQLITE_UTF16, localeCollation16}, > {"locale_case_sensitive", SQLITE_UTF16, localeCollationCaseSensitive16}, > {"locale_accent_sensitive", SQLITE_UTF16, localeCollationAccentSensitive16}, > {"locale_case_accent_sensitive", SQLITE_UTF16, localeCollationCaseAccentSensitive16}, nit: {"locale", SQLITE_UTF8, localeCollation8}, ... on file: storage/src/mozStorageConnection.cpp line 349 > srv = registerCollations(mDBConn, > static_cast<Service *>(mStorageService.get())); Let's just change the constructor of Connection to take a Service *, and change the nsCOMPtr to an nsRefPtr<Service> so we can avoid the cast. on file: storage/src/mozStorageConnection.cpp line 352 > mDBConn = nsnull; er...we should be calling sqlite3_close on this before we null it out. Ditto for registering functions... :( on file: storage/src/mozStorageService.h line 100 > /** > * Used for locking around calls when initializing connections so that we > * can ensure that the state of sqlite3_enable_shared_cache is sane. > */ > PRLock *mLock; This comment is now out of date. Please update it. on file: storage/src/mozStorageService.cpp line 192 > nsAutoLock lock(mLock); Hey, we should convert our PRLock usage to mozilla::Mutex here like we have for the newer mutexes on Connection. The variable should be changed to mMutex too. on file: storage/src/mozStorageService.cpp line 210 > nsICollation * > Service::getLocaleCollation() > { We should assert that our lock is held here. on file: storage/test/unit/test_locale_collation.js line 57 > // The UTF-16-encoded database file we create. > var gUtf16File; Just use a memory database for this? r=sdwilsh
Attachment #387658 - Flags: review?(sdwilsh) → review+
Attached patch for review v5 (obsolete) (deleted) — Splinter Review
Shawn, I'm asking sr for this but you might want to give it a quick once-over to make sure the mutex change is OK.
Attachment #387658 - Attachment is obsolete: true
Attachment #388349 - Flags: superreview?(vladimir)
Comment on attachment 388349 [details] [diff] [review] for review v5 Hrm, wonder if we can fix the unix collation stuff to avoid setlocale and the thread safety hazard. Can we somehow audit/ensure that our code avoids calling the utf8-version of the collation function if at all possible? A new allocation per string per comparison really sucks. Code's fine otherwise, though.
Attachment #388349 - Flags: superreview?(vladimir) → superreview+
Shawn can speak to this better than I can, but according to the SQLite docs, "Each database file manages text as either UTF-8, UTF-16BE (big-endian), or UTF-16LE (little-endian). Internally and in the disk file, the same text representation is used everywhere." [1] And, "Once an encoding has been set for a database, it cannot be changed." [2] I'll be using the collations to sort the Places database, which is UTF-8 AFAIK, so I'm not sure if avoiding the conversion is possible... :\ (The situation looks even worse on Unix, where the wide char string is converted back to narrow before the comparison (nsCollationUnix.cpp).) Incidentally, the existing Places string sort (nsNavHistoryResult.cpp) also converts its args to UTF-16 before passing off to nsICollation.CompareString(). [1] http://www.sqlite.org/version3.html [2] http://www.sqlite.org/pragma.html
Databases default to UTF-8, so sqlite we will tend to get it via UTF-8.
Whiteboard: [can land]
Attached patch for checkin (deleted) — Splinter Review
Removes an unnecessary conversion from nsRefPtr to its raw ptr, thanks Shawn.
Attachment #388349 - Attachment is obsolete: true
Flags: in-testsuite+
Keywords: dev-doc-needed
Whiteboard: [can land]
Target Milestone: --- → mozilla1.9.2a1
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
What's the takeaway from this patch for typical developers? Just that sorting of sqlite data (such as in Places) now respects locales?
Yes. Any time you need to sort textual data from the user, you should use these locale collations. It would be good if the docs briefly mentioned the importance of doing so; the introduction to [1] is a good summary. [1] http://unicode.org/reports/tr10/
Sheppy, forgot to say that I blogged about this with explanation and an example. If you'd like, you can copy and paste: http://blog.mozilla.com/adw/2009/07/19/locale-sensitive-collations-in-storage/
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: