Closed
Bug 499990
Opened 15 years ago
Closed 15 years ago
Locale-aware collation
Categories
(Toolkit :: Storage, defect)
Toolkit
Storage
Tracking
()
RESOLVED
FIXED
mozilla1.9.2a1
People
(Reporter: adw, Assigned: adw)
References
Details
(Keywords: dev-doc-complete)
Attachments
(1 file, 10 obsolete files)
(deleted),
patch
|
Details | Diff | Splinter Review |
Currently in Places to sort strings from the database while respecting the application's locale, we have to get our entire set of results and then sort using nsICollation.compareString(). In order to push sorting from C++ down to SQL, we need a locale-aware collation in Storage. Such a collation would be useful for all consumers building on the platform that need a string sort respectful of locale, which really should be every consumer that needs a string sort period.
Assignee | ||
Updated•15 years ago
|
Assignee: nobody → adw
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•15 years ago
|
||
Keeps an nsICollation pointer in the Connection class that's protected by a lock and exposes Connection::localeCompare(). Test passes so far.
Comment 3•15 years ago
|
||
You will actually want to build on top of bug 494828 and use the mutex it adds.
Assignee | ||
Comment 4•15 years ago
|
||
Gets rid of the lock mentioned in comment 1, since SQLite serializes the execution of concurrently executing statements, as Shawn and I discussed today. Added a note about that to the source. Got rid of the new method mentioned in comment 1. Instead, lazily create the nsICollation instance and expose a getter, Connection::getLocaleCollation().
Need to investigate whether a UTF-16 version is needed -- I noticed the custom functions we're creating handle both UTF-8 and UTF-16.
I'm also thinking that while I'm here it might be useful to create both case-sensitive and -insensitive versions. It would be trivial.
Attachment #384804 -
Attachment is obsolete: true
Assignee | ||
Comment 5•15 years ago
|
||
Actually, why do Storage's custom functions handle UTF-16 at all? Looking over the code, doesn't Storage convert all text to UTF-8 before putting it in the DB? And aren't statements also converted to UTF-8 before execution? Just trying to figure out what I need to do here.
Assignee | ||
Comment 6•15 years ago
|
||
This creates the following collations:
locale
locale_case_sensitive
locale_accent_sensitive
locale_case_accent_sensitive
That covers all the bases of nsICollation: http://mxr.mozilla.org/mozilla-central/source/intl/locale/idl/nsICollation.idl#62
I think case insensitivity is the common... case. That's what "locale" is. To add sensitivity you specify one of the others.
Shawn, what do you think about that? If it's OK with you, I'll work on tests for the full set.
Attachment #385016 -
Attachment is obsolete: true
Comment 7•15 years ago
|
||
That set of four sounds good to me.
Comment 8•15 years ago
|
||
(In reply to comment #5)
> Actually, why do Storage's custom functions handle UTF-16 at all? Looking over
> the code, doesn't Storage convert all text to UTF-8 before putting it in the
> DB? And aren't statements also converted to UTF-8 before execution? Just
> trying to figure out what I need to do here.
AFAIK, consumers could force it to use UTF-16 if they wanted to.
Assignee | ||
Comment 9•15 years ago
|
||
Includes tests and UTF-16 support.
Attachment #385051 -
Attachment is obsolete: true
Attachment #385440 -
Flags: review?(sdwilsh)
Assignee | ||
Comment 10•15 years ago
|
||
Sorry, for review v1 was based off of the patch in bug 494828. This one is based off the current tree.
Attachment #385440 -
Attachment is obsolete: true
Attachment #385457 -
Flags: review?(sdwilsh)
Attachment #385440 -
Flags: review?(sdwilsh)
Comment 11•15 years ago
|
||
Comment on attachment 385457 [details] [diff] [review]
for review v1.1
http://reviews.visophyte.org/r/385457/diff/#index_header (you'll likely want to go here to see the diff with comments inline)
on file: storage/src/mozStorageConnection.h line 180
> // This isn't accessed but is used to make sure that the connections do
This should be the last member of the class, so add new things above it
please.
on file: storage/src/mozStorageConnection.h line 184
> /**
nit: please use @note
Also, this comment is not completely accurate. It should say something about
it being OK to not protect it with a mutex because it is only ever used from
SQLite callbacks, and those are serialized within the same process.
on file: storage/src/mozStorageConnection.cpp line 260
> , mLocaleCollation(nsnull)
You do not need to initialize a nsCOMPtr to null in the constructor
(strangely, we do that with other stuff :/)
on file: storage/src/mozStorageConnection.cpp line 272
> mLocaleCollation = nsnull;
Also don't need to set it to null at destruction - the nsCOMPtr's destructor
will do this for you.
on file: storage/src/mozStorageConnection.cpp line 306
> if (!mLocaleCollation) {
I would prefer to have an early return here if mLocaleCollation is not null.
Saves on the indenting.
on file: storage/src/mozStorageConnection.cpp line 330
> mLocaleCollation = nsnull;
XPCOM rules say that the outparam will not be written to if a failure is
given, so you do not have to null this out.
on file: storage/src/mozStorageSQLFunctions.h line 60
> /**
I'd like to see this in a new file please (along with the implementation).
How about collations.[h|cpp]?
on file: storage/src/mozStorageSQLFunctions.h line 106
> * Custom UTF-8 collating sequence that respects the application's locale.
Is this stuff application-locale or is it OS-locale? Just checking.
on file: storage/src/mozStorageSQLFunctions.h line 112
> * The length of the first string.
I'd prefer something like "The length of characters of aStr1."
on file: storage/src/mozStorageSQLFunctions.h line 114
> * The first string.
The first string what? Please be a bit more descriptive here.
on file: storage/src/mozStorageSQLFunctions.h line 119
> * @return strcmp semantics.
I'd prefer if you documented what the semantics are. I usually have to look
it up just to make sure.
on file: storage/src/mozStorageSQLFunctions.h line 200
> * The length of the first string.
Is this length in bytes or characters? The distinction is important here.
on file: storage/src/mozStorageSQLFunctions.cpp line 180
> NS_WARNING("Connection has no collation");
Use NS_ERROR here please
on file: storage/src/mozStorageSQLFunctions.cpp line 188
> nsresult rv = coll->CompareString(aComparisonStrength, str1, str2, &res);
Why don't you just call the UTF-16 version of this function once you convert?
It's the same code with that minor exception. You may even want to make two
versions of the UTF-16 function - one that takes an nsAString and one that
takes a const void *. The latter would just call the former, and then all of
your logic would be in one function instead of two.
on file: storage/src/mozStorageSQLFunctions.cpp line 190
> NS_WARNING("Collation compare string failed");
NS_ERROR here as well please
on file: storage/src/mozStorageSQLFunctions.cpp line 214
> PRInt32
This should really return an int. On 64-bit systems we have a mismatch of
types.
on file: storage/src/mozStorageSQLFunctions.cpp line 224
> NS_WARNING("Connection has no collation");
NS_ERROR please
on file: storage/src/mozStorageSQLFunctions.cpp line 234
> NS_WARNING("Collation compare string failed");
NS_ERROR please
on file: storage/src/mozStorageSQLFunctions.cpp line 255
> localeCollation8},
nit: place each sub-item on a new line please:
{"locale",
SQLITE_UTF8,
localeCollation8},
on file: storage/src/mozStorageSQLFunctions.cpp line 273
> for (unsigned i = 0; SQLITE_OK == rv && i < NS_ARRAY_LENGTH(collations); ++i) {
should probably use size_t here
on file: storage/test/unit/test_locale_collation.js line 1
> /* ***** BEGIN LICENSE BLOCK *****
Missing the right header per storage/style.txt
on file: storage/test/unit/test_locale_collation.js line 16
> * The Initial Developer of the Original Code is Mozilla Corp.
should be a new line after is. Also spell out Corp please
on file: storage/test/unit/test_locale_collation.js line 101
> desc: "Async test",
It would be OK to write this test with the synchronous methods too. The same
SQLite code would run, and it might make debugging test failures easier in the
future. Your call on this.
on file: storage/test/unit/test_locale_collation.js line 139
> ];
It would also be useful to test your UTF-16 callback. You can switch SQLite
into UTF-16 mode by setting http://www.sqlite.org/pragma.html#pragma_encoding
accordingly.
on file: storage/test/unit/test_locale_collation.js line 141
> ///////////////////////////////////////////////////////////////////////////////
Think you want a line after this like:
//// Helper Functions
Also, please move this section above the tests.
on file: storage/test/unit/test_locale_collation.js line 160
> }
nit: no braces needed
Attachment #385457 -
Flags: review?(sdwilsh) → review-
Assignee | ||
Comment 12•15 years ago
|
||
(In reply to comment #11)
> on file: storage/src/mozStorageSQLFunctions.h line 106
> > * Custom UTF-8 collating sequence that respects the application's locale.
>
> Is this stuff application-locale or is it OS-locale? Just checking.
Application locale. That's what Places is currently doing, and I think it makes sense for the platform also. What do you think? As you say, OS locale is the other possible choice:
http://mxr.mozilla.org/mozilla-central/source/intl/locale/idl/nsILocaleService.idl
> on file: storage/test/unit/test_locale_collation.js line 101
> > desc: "Async test",
>
> It would be OK to write this test with the synchronous methods too. The same
> SQLite code would run, and it might make debugging test failures easier in the
> future. Your call on this.
Does this mean it would be OK to use sync methods instead of async ones, or it would be good to write another test like this but in addition to it that uses sync methods? I didn't make it clear, but the purpose of this test is to use collations with multiple concurrently executing statements and make sure nothing blows up. I guess it's not too rigorous though.
Comment 13•15 years ago
|
||
(In reply to comment #12)
> Application locale. That's what Places is currently doing, and I think it
> makes sense for the platform also. What do you think? As you say, OS locale
> is the other possible choice:
Application locale is fine, but I just wanted to make sure that is what we a) wanted and b) got.
> Does this mean it would be OK to use sync methods instead of async ones, or it
> would be good to write another test like this but in addition to it that uses
> sync methods? I didn't make it clear, but the purpose of this test is to use
> collations with multiple concurrently executing statements and make sure
> nothing blows up. I guess it's not too rigorous though.
I'm fine with just sync methods. It's an API guarantee that all async calls on the same connection will be serialized, so throwing a bunch of statements at it isn't going to actually do anything.
Assignee | ||
Comment 14•15 years ago
|
||
Addressed comment 11 and comment 13.
(In reply to comment #11)
> on file: storage/src/mozStorageSQLFunctions.cpp line 273
> > for (unsigned i = 0; SQLITE_OK == rv && i < NS_ARRAY_LENGTH(collations); ++i) {
>
> should probably use size_t here
I copied this over from mozStorageSQLFunctions.cpp, so I took the liberty of fixing it there, too.
Also, writing the UTF-16 tests exposed some fallacious thinking I had about length given in characters vs. bytes (it's bytes), and that's documented in localeCollationHelper16() and other function comments.
And, tests are now fully sync.
Attachment #385457 -
Attachment is obsolete: true
Attachment #385556 -
Flags: review?(sdwilsh)
Updated•15 years ago
|
Attachment #385556 -
Flags: review?(sdwilsh) → review+
Comment 15•15 years ago
|
||
Comment on attachment 385556 [details] [diff] [review]
for review v2
http://reviews.visophyte.org/r/385556/diff/#index_header
on file: storage/src/Makefile.in line 88
> collations.cpp \
I have a better idea for a name that is slightly less generic:
SQLCollations.cpp
on file: storage/src/collations.h line 16
> * The Original Code is unicode functions code.
that's not right ;)
on file: storage/src/collations.h line 40
> #ifndef _mozStorageCollations_h_
actually, our define should be in a different form:
mozilla_storage_collations_h_ (and probably mozilla_storage_SQLCollations_h if
you change the filename)
on file: storage/src/collations.h line 79
> * @param aComparisonStrength
You have this parameter in all the docs, but it's not actually in any of the
functions.
on file: storage/src/collations.cpp line 16
> * The Original Code is unicode functions code.
not unicode functions ;)
on file: storage/src/collations.cpp line 163
> registerCollations(sqlite3 *aDB, Connection *aConn)
Storage style guidelines say the parameters should be on their own line.
on file: storage/test/unit/test_locale_collation.js line 148
> while (stmt.executeStep()) {
> results.push(stmt.row.t);
> }
nit: don't need braces here
Also, use NS_ERROR instead of NS_WARNING for the lazy getter in mozStorageConnection.cpp
r=sdwilsh with these changes addressed.
Comment 16•15 years ago
|
||
It's not clear why you can't use nsDependentString - why do you need a null-terminated string exactly?
Assignee | ||
Comment 17•15 years ago
|
||
###!!! ASSERTION: nsTDependentString must wrap only null-terminated strings: 'mData[mLength] == 0', file ../../dist/include/nsTDependentString.h, line 67
http://mxr.mozilla.org/mozilla-central/source/xpcom/string/public/nsTDependentString.h#67
Assignee | ||
Comment 18•15 years ago
|
||
About strings being passed to the collation callbacks non-null-terminated, huzzah:
http://www.mail-archive.com/sqlite-users@sqlite.org/msg35634.html
This confirms my own testing. I paired down the unit test to just working off the strings ["a", "b"] instead of reading test data from a file and stepped through it in gdb. My UTF-8 collation callback is getting passed as one of the strings:
a\002\017a
which is followed by a repeating sequence of bytes:
\323\301\261\241
and the corresponding length of this string as passed in through a different parameter is 1 byte. Not to mention the fact that nsDependentString was asserting that the buffer passed to it in the UTF-16 callback wasn't null-terminated.
Now on to making sure nsICollation is safe to touch from multiple threads.
Comment 19•15 years ago
|
||
Use nsDependentSubstring to wrap non null-terminated buffers.
Assignee | ||
Comment 20•15 years ago
|
||
Thanks for the tip, Benjamin. This addresses comment 15 and comment 19. I ran this patch on all three platforms, looks good, and I pushed it to tryserver just now.
Attachment #385556 -
Attachment is obsolete: true
Comment 21•15 years ago
|
||
I should note that this is a new API, and so it should be sr'd.
Assignee | ||
Updated•15 years ago
|
Attachment #387201 -
Flags: superreview?(vladimir)
Assignee | ||
Comment 22•15 years ago
|
||
Tryserver looks good except for two oranges, both of which are known: bug 488596 and bug 493359. I added comments to them.
https://build.mozilla.org/tryserver-builds/dwillcoxon@mozilla.com-1246982390/
Assignee | ||
Comment 23•15 years ago
|
||
nsCollationUnix::CompareString() makes a couple of calls to the POSIX function setlocale(). It calls it first to set the locale to that of the given collation, does the string comparison, then calls it again to restore the old locale. According to [1], "The locale state is common to all threads within a process." Plus, Mozilla code for the OS X and Windows cases looks fine, but the only info I could find about reentrancy re: their respective system calls was scant speculation about OS X on a Xerces mailing list.
So it looks like I'll have to synchronize at least localeCollationHelper(), which is the function that actually calls CompareString() on the Connection's nsICollation.
[1] http://www.opengroup.org/onlinepubs/009695399/functions/setlocale.html
Assignee | ||
Comment 24•15 years ago
|
||
Comment on attachment 387201 [details] [diff] [review]
for review v3
Need a new patch that synchronizes access to the collation across all threads. Planning to move the nsICollation instance up from Connection to the Service singleton. localeCollationHelper() will move from mozilla::storage to mozilla::storage::Service. The Service already keeps a lock, will use that one to sync localeCollationHelper() also.
Attachment #387201 -
Attachment is obsolete: true
Attachment #387201 -
Flags: superreview?(vladimir)
Assignee | ||
Comment 25•15 years ago
|
||
Will run on tryserver again.
Attachment #387658 -
Flags: review?(sdwilsh)
Assignee | ||
Comment 26•15 years ago
|
||
Tryserver looked good again; there was the same known orange, bug 488596.
Comment 27•15 years ago
|
||
Comment on attachment 387658 [details] [diff] [review]
for review v4
http://reviews.visophyte.org/r/387658/
on file: storage/src/SQLCollations.h line 93
> /**
> * Custom UTF-8 collating sequence that respects the application's locale.
> * Comparison is case-sensitive and accent-insensitive. This is called by
> * SQLite.
> *
> * @param aService
> * The Service that owns the nsICollation used by this collation.
> * @param aLen1
> * The number of bytes (not characters) in aStr1.
> * @param aStr1
> * The string to be compared against aStr2. It will be passed in by
> * SQLite as a non-null-terminated char* buffer.
> * @param aLen2
> * The number of bytes (not characters) in aStr2.
> * @param aStr2
> * The string to be compared against aStr1. It will be passed in by
> * SQLite as a non-null-terminated char* buffer.
> * @return aStr1 - aStr2. That is, if aStr1 < aStr2, returns a negative number.
> * If aStr1 > aStr2, returns a positive number. If aStr1 == aStr2,
> * returns 0.
> */
Bytes is the number of characters in this situation actually. Ditto for all
the utf-8 comments.
on file: storage/src/SQLCollations.cpp line 50
> /**
> * Helper function for the UTF-8 locale collations.
> *
> * @param aService
> * The Service that owns the nsICollation used by this collation.
> * @param aLen1
> * The number of bytes (not characters) in aStr1.
> * @param aStr1
> * The string to be compared against aStr2 as provided by SQLite. It
> * must be a non-null-terminated char* buffer.
> * @param aLen2
> * The number of bytes (not characters) in aStr2.
Same here re: bytes and characters.
on file: storage/src/SQLCollations.cpp line 121
> nsDependentSubstring str1(buf1, buf1 + (aLen1 / 2));
> nsDependentSubstring str2(buf2, buf2 + (aLen2 / 2));
Instead of dividing by two, could you please divide by sizeof(PRUnichar). Let
the compiler figure out the size just in case. (yeah, we don't do that all
over in storage, but we should be doing that)
on file: storage/src/SQLCollations.cpp line 141
> {"locale", SQLITE_UTF8, localeCollation8},
> {"locale_case_sensitive", SQLITE_UTF8, localeCollationCaseSensitive8},
> {"locale_accent_sensitive", SQLITE_UTF8, localeCollationAccentSensitive8},
> {"locale_case_accent_sensitive", SQLITE_UTF8, localeCollationCaseAccentSensitive8},
> {"locale", SQLITE_UTF16, localeCollation16},
> {"locale_case_sensitive", SQLITE_UTF16, localeCollationCaseSensitive16},
> {"locale_accent_sensitive", SQLITE_UTF16, localeCollationAccentSensitive16},
> {"locale_case_accent_sensitive", SQLITE_UTF16, localeCollationCaseAccentSensitive16},
nit:
{"locale",
SQLITE_UTF8,
localeCollation8},
...
on file: storage/src/mozStorageConnection.cpp line 349
> srv = registerCollations(mDBConn,
> static_cast<Service *>(mStorageService.get()));
Let's just change the constructor of Connection to take a Service *, and
change the nsCOMPtr to an nsRefPtr<Service> so we can avoid the cast.
on file: storage/src/mozStorageConnection.cpp line 352
> mDBConn = nsnull;
er...we should be calling sqlite3_close on this before we null it out. Ditto
for registering functions... :(
on file: storage/src/mozStorageService.h line 100
> /**
> * Used for locking around calls when initializing connections so that we
> * can ensure that the state of sqlite3_enable_shared_cache is sane.
> */
> PRLock *mLock;
This comment is now out of date. Please update it.
on file: storage/src/mozStorageService.cpp line 192
> nsAutoLock lock(mLock);
Hey, we should convert our PRLock usage to mozilla::Mutex here like we have
for the newer mutexes on Connection. The variable should be changed to mMutex
too.
on file: storage/src/mozStorageService.cpp line 210
> nsICollation *
> Service::getLocaleCollation()
> {
We should assert that our lock is held here.
on file: storage/test/unit/test_locale_collation.js line 57
> // The UTF-16-encoded database file we create.
> var gUtf16File;
Just use a memory database for this?
r=sdwilsh
Attachment #387658 -
Flags: review?(sdwilsh) → review+
Assignee | ||
Comment 28•15 years ago
|
||
Shawn, I'm asking sr for this but you might want to give it a quick once-over to make sure the mutex change is OK.
Attachment #387658 -
Attachment is obsolete: true
Attachment #388349 -
Flags: superreview?(vladimir)
Comment on attachment 388349 [details] [diff] [review]
for review v5
Hrm, wonder if we can fix the unix collation stuff to avoid setlocale and the thread safety hazard.
Can we somehow audit/ensure that our code avoids calling the utf8-version of the collation function if at all possible? A new allocation per string per comparison really sucks.
Code's fine otherwise, though.
Attachment #388349 -
Flags: superreview?(vladimir) → superreview+
Assignee | ||
Comment 30•15 years ago
|
||
Shawn can speak to this better than I can, but according to the SQLite docs, "Each database file manages text as either UTF-8, UTF-16BE (big-endian), or UTF-16LE (little-endian). Internally and in the disk file, the same text representation is used everywhere." [1] And, "Once an encoding has been set for a database, it cannot be changed." [2]
I'll be using the collations to sort the Places database, which is UTF-8 AFAIK, so I'm not sure if avoiding the conversion is possible... :\ (The situation looks even worse on Unix, where the wide char string is converted back to narrow before the comparison (nsCollationUnix.cpp).)
Incidentally, the existing Places string sort (nsNavHistoryResult.cpp) also converts its args to UTF-16 before passing off to nsICollation.CompareString().
[1] http://www.sqlite.org/version3.html
[2] http://www.sqlite.org/pragma.html
Comment 31•15 years ago
|
||
Databases default to UTF-8, so sqlite we will tend to get it via UTF-8.
Assignee | ||
Updated•15 years ago
|
Whiteboard: [can land]
Assignee | ||
Comment 32•15 years ago
|
||
Removes an unnecessary conversion from nsRefPtr to its raw ptr, thanks Shawn.
Attachment #388349 -
Attachment is obsolete: true
Comment 33•15 years ago
|
||
Flags: in-testsuite+
Keywords: dev-doc-needed
Whiteboard: [can land]
Target Milestone: --- → mozilla1.9.2a1
Updated•15 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 34•15 years ago
|
||
What's the takeaway from this patch for typical developers? Just that sorting of sqlite data (such as in Places) now respects locales?
Assignee | ||
Comment 35•15 years ago
|
||
Yes. Any time you need to sort textual data from the user, you should use these locale collations. It would be good if the docs briefly mentioned the importance of doing so; the introduction to [1] is a good summary.
[1] http://unicode.org/reports/tr10/
Assignee | ||
Comment 36•15 years ago
|
||
Sheppy, forgot to say that I blogged about this with explanation and an example. If you'd like, you can copy and paste:
http://blog.mozilla.com/adw/2009/07/19/locale-sensitive-collations-in-storage/
Comment 37•15 years ago
|
||
This is now documented:
https://developer.mozilla.org/en/Storage#Collation_%28sorting%29
Keywords: dev-doc-needed → dev-doc-complete
You need to log in
before you can comment on or make changes to this bug.
Description
•