Closed Bug 403655 Opened 17 years ago Closed 16 years ago

update eTLD file for gecko 1.9

Categories

(Core :: Networking, defect, P2)

RESOLVED FIXED
mozilla1.9beta3

People

(Reporter: dwitte, Assigned: pamg.bugs)

Attachments

(6 files, 9 obsolete files)

now that more consumers (cookies, soon navhistory) are using the effective TLD service, it would be nice to review what we have and update it if necessary. (see bug 342314 for the original checkin of effective_tld_names.dat). there are also a few comments in the file indicating incomplete entries (e.g. the city.state.us system, but that's pretty lengthy to fix). Jo, Ruben, do you have any interest in helping out here? i'm not sure how up to date the wiki (http://wiki.mozilla.org/TLD_List) is, but that seems like a good starting point for anyone who wants to help out.
Although I do not have any experience editing a wiki, the MozillaWiki TLD List currently indicates .AQ (Antarctica) is not used. Running a quick Google search, two Antarctic-related sites using TLD .aq came up: http://http://www.comnap.aq/ http://www.ats.aq/
Although I do not have any experience editing a wiki, the MozillaWiki TLD List currently indicates .AQ (Antarctica) is not used. Running a quick Google search, two Antarctic-related sites using TLD .aq came up: http://www.comnap.aq/ http://www.ats.aq/
Comment 2 is the correct comment, if comment 1 can be deleted, please delete.
I was going to fix bug 373013 and bug 370569 (Brazil / Korea), but I haven't found a diff or patch yet that can work with utf-8. Can I check in a complete new file ? I should also visit every TLD-registry (again ...) to see if something has been changed. Note that the wiki-page is way out of date. This information is quite volatile.
(In reply to comment #4) > I was going to fix bug 373013 and bug 370569 (Brazil / Korea), but I haven't > found a diff or patch yet that can work with utf-8. Can I check in a complete > new file ? wow, nice work! i tested my diff here (linux) and it seems to work with utf8... if you're having trouble, you can either attach the file here or email it to me and i'll diff it. (note that if bug 402013 lands, it might make sense to convert the TLD file to an IDN ASCII encoding, to reduce the legwork the normalizer has to do on read-in... which would have the side effect of solving your utf8 issues. ;)
ftr, http://publicsuffix.org/ allows registries to submit modifications to the list.
(In reply to comment #4)
> I was going to fix bug 373013 and bug 370569 (Brazil / Korea), but I haven't found a diff or patch yet that can work with utf-8. Can I check in a complete new file ?
$ echo $LANG
pt_BR.UTF-8
and diff works fine with utf-8. Maybe your system language is not utf-8.
> I should also visit every TLD-registry (again ...) to see if something has been changed. Note that the wiki-page is way out of date. This information is quite volatile.
For Brazil it's still the same.
.aq is in use; for random reasons, I once owned a domain in .aq. PublicSuffix.org does allow submission of fixes; they come to me :-) And I have some pending, which I can pass on to whoever takes on this task. We also wanted to ping all the registries to help us with data gathering, and I arranged that the message could be sent out by ICANN - but we never got around to it, because publicsuffix.org wasn't finished. If anyone wants to take on the small bit of web design work required there, that would be great too. As for converting from UTF-8 to ASCII, I'll leave that to the technical people, but even if it was ASCII, it would be nice to have the full IDN version of each domain as a comment. Gerv
(In reply to comment #8) > PublicSuffix.org does allow submission of fixes; they come to me :-) And I have > some pending, which I can pass on to whoever takes on this task. would you be okay with filing bugs as you get these, maybe in batches, and cc'ing relevant people? alternatively, if someone here is willing to be your contact for updating this stuff, that might work. as long as someone gets this information. ;) for now, could you attach this info here? assuming there's nothing private in it... > We also wanted to ping all the registries to help us with data gathering, and I > arranged that the message could be sent out by ICANN - but we never got around > to it, because publicsuffix.org wasn't finished. If anyone wants to take on the > small bit of web design work required there, that would be great too. that would be great to do. just thinking out loud, does moco have any in-house web design guys that we could ping about doing this? (i can do the pinging, if necessary.) i'll also send Ruben an email in case he's interested. > As for converting from UTF-8 to ASCII, I'll leave that to the technical people, > but even if it was ASCII, it would be nice to have the full IDN version of each > domain as a comment. absolutely, that'd be no problem.
i forgot to ask - what exactly needs to be done to get publicsuffix.org up to scratch? i searched bugzilla, but couldn't find anything apart from bug 373190.
Dan: thanks for picking up this ball and running with it :-) It's past 11pm here now, but I'll try and get you the info you need ASAP. Gerv
I have received an email from Dan regarding this and have replied to him. The basic gist is that I have the second version of the site ready with the changes that Gerv suggested. I had a few problems with logging into SVN and therefore zipped up the site and sent it to Gerv some time ago. I'm guessing it must have got lost in the mountains of email, but I can email it again so that it can be uploaded. Sorry I haven't spoken about this earlier - I have been very busy with university! Ruben
thanks Ruben! we should definitely get your svn problems cleared up; please file a bug under mozilla.org/server operations (even if you just need technical help), that's the fastest way to get someone to look at it.
Yeah, the problems are at my end so I'll have to sort them out at some point. In the meantime, I'll attach a zipped copy of the new site that someone else can upload to SVN.
Attached file PublicSuffix.org v2 (obsolete) (deleted) —
Attached file Email from Apache dude (deleted) —
This is an email I got from an Apache guy with some updates to the list.
One guy commented on the site: "I eventually found the 'format' section under the 'submit' link, but that's far from intuitive. The format and examples should get their own dedicated page." The only other stuff I had was from Ruben, and it's now attached here. I do think the website needs a usability once-over - we need to make sure that we are correctly serving both registrars (who want to submit changes to their part) and list users (who want to see the whole list and learn the parsing rules). But I'm hoping you guys can run with this ball. Thanks for picking this back up. This is important work; more and more things are relying on this list being correct for security. Once the site is good and up to date, compose a _short_ email to send to the registrars, saying "We are doing this... Your bit of the list is attached... Lots of big companies are using it (name them)... It's in your interests to make sure it's up to date, because otherwise X, Y and Z will break... Here's a link to a page with the update procedure." Then, I'll see about getting it sent out. Gerv
Depends on: 370569, 373013
Attached file PublicSuffix.org v2.1 (deleted) —
Moved format and submissions to their own page and linked to it from the homepage.
Attachment #289672 - Attachment is obsolete: true
as a note, the .tv subdomains need to be added to the file too; there's currently just the "tv" tld in there.
(In reply to comment #19)
> as a note, the .tv subdomains need to be added to the file too; there's currently just the "tv" tld in there.
Actually, no. I haven't yet found an explanation on their NIC website (www.tv), but <http://en.wikipedia.org/wiki/.tv> tells us (for what it's worth) that second level domains are allowed. There should be a few third-level domains like gov.tv, but I haven't yet found such a list. If you're talking about co.tv, then the "tv" rule will still suffice.
(In reply to comment #20) > If you're talking about co.tv, then the "tv" rule will still suffice. oh, so we don't want to consider co.tv an eTLD?
Why would it be ? co.tv, www.co.tv and bubu.co.tv all point to the same host (66.232.143.122) : co.tv is the domain name. If co.tv was an eTLD, then you couldn't surf to it. Try it with gov.tv and www.gov.tv.
ok - i wasn't thinking of it being a legit site or not as the test, but more whether we want to prevent co.tv from being able to set domain cookies. (i.e. whether it serves also as an eTLD for other, unrelated sites.) if it doesn't, then don't mind me :)
Got the following email recently:
From: hsowa@bfk.de
Subject: some small errors in effective_tld_names.dat
Hi! I found some small errors in the effective_tld_names.dat list supplied with firefox/mozilla.
1. Line 1750: gáŋgaviika (.no tld forgotten)
2. Line 1912: Lærdal (.no tld forgotten)
Thanks in advance! bye, Hannes
Jo: are you working on the list for Firefox 3? We are using this in anger now, so it needs to be good.
Gerv
(In reply to comment #16)
> Created an attachment (id=289922)
> Email from Apache dude
>
> This is an email I got from an Apache guy with some updates to the list.
I'm incorporating these changes, with these remarks:
- I took the 3 .us domains.
- The .uk domains are already covered under *.uk. The wildcard rules are indeed difficult to understand: they don't point to domains, but to top-level domains. *.uk matches every eTLD, including ones that don't exist yet, like blah.uk.
- The same goes for the .tr domains (Turkey).
- Those 3 .hk domains are actually exceptions to the old *.hk rule. But since second level eTLDs are now permitted, it doesn't matter anymore.
- italy.it doesn't exist, according to Google.
- pro.az is now included.
- All .au domains are already covered by *.au (it's even mentioned in the comment).
- eu.int isn't supposed to exist anymore, but is apparently still used, so I added it again.
- .br was covered in bug 370569.
- gov.it & edu.it are now corrected.
- All uppercase characters are now converted.
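To make the wildcard point above concrete, here is a minimal Python sketch of how a rule such as *.uk can be compared against a candidate suffix. The function name matches_rule is hypothetical and this is not the code in nsEffectiveTLDService; it only assumes that a rule and a suffix are compared label by label, and it leaves exception rules out entirely.

```python
def matches_rule(suffix: str, rule: str) -> bool:
    """Return True if `rule` covers `suffix` as an effective TLD.

    Hypothetical helper for illustration: a "*" label in the rule
    matches exactly one arbitrary label in the suffix.
    """
    suffix_labels = suffix.lower().split(".")
    rule_labels = rule.lower().split(".")
    if len(suffix_labels) != len(rule_labels):
        return False
    return all(r == "*" or r == s for s, r in zip(suffix_labels, rule_labels))


if __name__ == "__main__":
    # "*.uk" covers every second-level name under uk, existing or not.
    for candidate in ("co.uk", "blah.uk", "uk", "example.co.uk"):
        print(candidate, matches_rule(candidate, "*.uk"))
```

Under these assumptions blah.uk is treated as an eTLD even though nobody has registered it, which is why a separate co.uk entry adds nothing once *.uk is in the file.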
Attached file new etld list (slightly corrupted) (obsolete) (deleted) —
This is the new full eTLD list (not a diff yet). It contains the changes for bug 370569 (Brazil), bug 373013 (Korea), the Russian changes from bug 342314 comment 20 and the list from comment 16. But I'm having some trouble editing it, since various editors tend to destroy the UTF-8 characters, even when they're supposed to be compatible. I have been scratching my head to remember what editor I used last time. I think it was one on my old Mac, and I might try again when I'm at home. The real editor that I used to edit the list is Firefox itself. You can find the original document in <http://wiki.mozilla.org/User:Jhermans/scratch>, if you open that document to edit it. Select everything inside the form-field with select-all, then copy. In the meantime, I'm going home now, and I'm trying to find the correct editor again. When I have the correct file, I'll try to generate the diff again.
Attached file new etld list (possibly correct) (obsolete) (deleted) —
Hmmm ... my editor apparently didn't save it in Unicode after all (I thought I checked it by opening it in Firefox first). Let's try again. Note : the characters that give the most problems (not present in ISO-8859-1) are marked with the comment "utf8 !"
Attachment #299031 - Attachment is obsolete: true
Remind me again why the list contains UTF-8 characters rather than punycode? (I guess, though, that if it was punycode, we'd probably want the UTF-8 just above it anyway so people knew what it said...) Gerv
Attached file new eTLD list (deleted) —
The previous editor apparently saved it as utf-16, but this should be the correct one. The only editor that I know that does it correctly, and the one I used before, is old TextEdit on Mac OS X. I guess that Seamonkey would be fine too.
Attachment #299033 - Attachment is obsolete: true
(In reply to comment #28) > Remind me again why the list contains UTF-8 characters rather than punycode? just historical - no good reason. Jo, if this really is giving you trouble across editors, please feel free to convert all the entries to punycode - i'll gladly r+ a patch to do that. (it'd be nice to leave the utf8 in a comment beside the entry, but i'm guessing that would get corrupted - how badly?).
(In reply to comment #28)
> Remind me again why the list contains UTF-8 characters rather than punycode? (I guess, though, that if it was punycode, we'd probably want the UTF-8 just above it anyway so people knew what it said...)
>
> Gerv
If nsEffectiveTLDService::AddEffectiveTLDEntry() is converting the UTF-8 strings to ACE, then I guess it would be possible. But note that I didn't actually type all those names, I copied them from the various websites. I can convert all strings if you like, but as you say, we would probably still need to mention the utf-8 name in a comment, otherwise people wouldn't recognize them. Note : this time, attachment 299051 is the correct one. I'll try tomorrow to find a diff that doesn't mangle the data either.
My diff works fine on Linux, configured with UTF-8 as the default system encoding. Maybe I'll even attach a diff before you wake up.
Attached patch Converting the eTLD list from UTF8 to ACE (obsolete) (deleted) — Splinter Review
Attachment #299081 - Flags: review?
Attachment #299081 - Flags: review? → review?(dwitte)
Attached patch Converting, take 2 (obsolete) (deleted) — Splinter Review
Converting to punycode, plus:
- addressing my last comment
- removing the UTF8 entirely, to make it easier for people on non-UTF8 systems to make patches
- adding a comment at the start (feel free to correct my English or the sentence itself)
Attachment #299081 - Attachment is obsolete: true
Attachment #299086 - Flags: review?(dwitte)
Attachment #299081 - Flags: review?(dwitte)
Attached patch Fixing a minor issue with previous patch (obsolete) (deleted) — Splinter Review
I've seen that the extra blank line at the end of the file is strictly required, otherwise the last rule won't be applied. This is the same as the last patch, without the last line removal.
Attachment #299086 - Attachment is obsolete: true
Attachment #299092 - Flags: review?(dwitte)
Attachment #299086 - Flags: review?(dwitte)
Attached patch Followup (WIP) (obsolete) (deleted) — Splinter Review
This patch incorporates the changes from Jo's last list on top of the patch I've submitted. Should attachment 299092 be approved, this patch would make a (possible) final list. I'd suggest removing the "utf8 !" comments. If so, it's better to remove them here than in the other patch. This way we make that file plain ascii and then everyone else can make diffs cleanly.
Comment on attachment 299092 Fixing a minor issue with previous patch
>Index: netwerk/dns/src/effective_tld_names.dat
>===================================================================
>@@ -1,8 +1,11 @@
>+// All entries on this file should be on punnycode (ACE)
>+// rather than UTF8.
Perhaps instead:
// All entries in this file should be in ASCII or ACE (punycode) encodings.
>-`øksnes.no
>+xn--`ksnes-bya.no
the ` looks like a slip of the keyboard - can you verify?
looks good, r=me, provided we remove all the "utf8!" comments in the followup patch. i didn't check all the conversions are correct - can you tell us how you did it? i'm not too worried about readability of the punycode entries, since there are websites that will do the conversion if people care (see e.g. http://idnaconv.phlymail.de/). would be great to get this, and the followup, in for b3 - i'll see if we can get blanket sr/moa for these changes...
Attachment #299092 - Flags: review?(dwitte) → review+
followup note to Ruben: might need to note the file is ACE encoded on the website...
(In reply to comment #38)
> (From update of attachment 299092)
> >Index: netwerk/dns/src/effective_tld_names.dat
> >===================================================================
> >@@ -1,8 +1,11 @@
> >+// All entries on this file should be on punnycode (ACE)
> >+// rather than UTF8.
>
> Perhaps instead:
> // All entries in this file should be in ASCII or ACE (punycode) encodings.
OK.
> >-`øksnes.no
> >+xn--`ksnes-bya.no
The ` is ascii... so it passed the conversion. Thanks for the catch: http://www.norid.no/domenenavnbaser/whois/index.php3?charset=UTF-8&query=%C3%B8ksnes.no&sok=s%C3%B8k
> looks good, r=me, provided we remove all the "utf8!" comments in the followup patch. i didn't check all the conversions are correct - can you tell us how you did it? i'm not too worried about readability of the punycode entries, since there are websites that will do the conversion if people care (see e.g. http://idnaconv.phlymail.de/).
I modified nsEffectiveTLDService::LoadOneEffectiveTLDFile to write its rules out to a new file, so I got a new converted etld file on startup. The most important line is this: mIDNService->ConvertUTF8toACE(rule, rule); Moving the write-to-file step to below TruncateAtWhitespace will clean out the "utf8 !"s. It's possible to do it all in JS instead of C++ and put it in an extension, but that would take more time.
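The stray backtick surviving the conversion can be reproduced outside the browser. The sketch below is Python rather than the C++ used for the patch, and to_ace and suspicious_labels are hypothetical helper names. It applies raw punycode per label via Python's built-in "punycode" codec and deliberately skips the nameprep step that a full IDN conversion would perform, so treat it as an illustration of the mechanism, not a replacement for ConvertUTF8toACE.

```python
import re


def to_ace(hostname: str) -> str:
    """Convert each non-ASCII label to its xn-- form (raw punycode only)."""
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def suspicious_labels(hostname: str) -> list:
    """List labels that contain anything other than letters, digits or "-".

    A lone "*" is allowed because the list uses it as a wildcard.
    """
    return [
        label
        for label in hostname.split(".")
        if label != "*" and not re.fullmatch(r"[A-Za-z0-9-]+", label)
    ]


if __name__ == "__main__":
    print(to_ace("øksnes.no"))    # xn--ksnes-uua.no
    print(to_ace("`øksnes.no"))   # xn--`ksnes-bya.no
    print(suspicious_labels(to_ace("`øksnes.no")))
```

Punycode copies ASCII characters through untouched, so the backtick ends up inside the xn-- label instead of causing an error. A character check like suspicious_labels is one possible way to catch that kind of typo after conversion.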
Attached patch Addressing comments and joining both patches (obsolete) (deleted) — Splinter Review
I had a better idea: join both patches. The first part wouldn't pass sr anyway, because the file we were using had some issues other than those fixed in attachment 299094 and the one you mentioned. We also save a lot of time this way. Compare with attachment 299051. Note: there were a few white-space changes; I've removed the spaces at the ends of lines.
Attachment #299092 - Attachment is obsolete: true
Attachment #299094 - Attachment is obsolete: true
Attachment #299109 - Flags: review?(dwitte)
(In reply to comment #39) > followup note to Ruben: might need to note the file is ACE encoded on the > website... > Thanks for the note; I'll make sure I change it and i'll upload the whole new site this weekend.
You've added 2nd level domains from CentralNIC, NetRegistry and eu.org. It seems to me that we shouldn't do that until we have an explicit request from the company concerned. Not wanting to flip-flop on the UTF-8 thing, but it seems to me that putting them in the file in their full form does make the file a heck of a lot easier to read. Example: can someone tell me if bådåddjå.no is in the list? Well, not easily, because it's not written that way. If the list were UTF-8 and Mozilla would need to convert the entire list to punycode on startup, I completely agree that we need to eliminate that unnecessary step. But I still think there is value in having the UTF-8 values in there as comments, even if we are repeating ourselves. Gerv
(In reply to comment #45) > If the list were UTF-8 and Mozilla would need to convert the entire list to > punycode on startup, I completely agree that we need to eliminate that > unnecessary step. But I still think there is value in having the UTF-8 values > in there as comments, even if we are repeating ourselves. This file isn't edited that often; can we agree that the win of easy readability trumps the lesser win of easy editability? Don't forget that repeating/duplicating makes it much easier to have divergence between the two.
(In reply to comment #45)
> Not wanting to flip-flop on the UTF-8 thing, but it seems to me that putting them in the file in their full form does make the file a heck of a lot easier to read. Example: can someone tell me if bådåddjå.no is in the list? Well, not easily, because it's not written that way.
>
> If the list were UTF-8 and Mozilla would need to convert the entire list to punycode on startup, I completely agree that we need to eliminate that unnecessary step. But I still think there is value in having the UTF-8 values in there as comments, even if we are repeating ourselves.
All utf8 entries are converted to punycode on each startup. Do you want to try only converting the utf8 entries to punycode, without adding the other entries, and measure the startup perf impact? Making diffs with the utf8 files has been a real pain for Jo on Mac. Since diff has context lines, the utf8 in comments would still get in the way. And how often will someone want to look at the list to see if a domain is missing?
> All utf8 entries are converted to punnycode on each startup. Seriously? We should add a build step to do the conversion; I'm sure there's something to do it in Python. > And, how often will someone wants to look at the list to see if a domain is > missing? I imagine audits wouldn't be that uncommon at release time and, say, if other browsers wanted to swipe this from us but still do their own QA to be safe.
(In reply to comment #48)
> Seriously? We should add a build step to do the conversion; I'm sure there's something to do it in Python.
http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsEffectiveTLDService.cpp#405
nsEffectiveTLDService::LoadOneEffectiveTLDFile is called on startup. It calls AddEffectiveTLDEntry for each non-comment, non-empty line, which processes the line and calls NormalizeHostname for the host itself:
http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsEffectiveTLDService.cpp#282
which checks if it's ascii, then converts it to lower case, or converts it to ACE otherwise. If the file is already normalized, there is no need to call NormalizeHostname.
> > And, how often will someone want to look at the list to see if a domain is missing?
>
> I imagine audits wouldn't be that uncommon at release time and, say, if other browsers wanted to swipe this from us but still do their own QA to be safe.
So they could use any web app, such as http://idnaconv.phlymail.de/, or some script, or someone could even build an extension.
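The load path described in this comment can be sketched in a few lines of Python. This is a paraphrase of the described behaviour, not a port of nsEffectiveTLDService.cpp: the names load_rules and normalize_hostname are chosen for illustration, "//" is assumed to start a comment line, and Python's built-in "idna" codec stands in for the IDN service.

```python
def normalize_hostname(host: str) -> str:
    """ASCII fast path: lower-case only. Otherwise convert to ACE."""
    if host.isascii():
        return host.lower()
    return host.lower().encode("idna").decode("ascii")


def load_rules(lines) -> set:
    """Collect normalized rules from the lines of an eTLD file."""
    rules = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip empty lines and comment lines
        rule = line.split()[0]  # truncate at whitespace, dropping e.g. "// utf8 !"
        rules.add(normalize_hostname(rule))
    return rules


if __name__ == "__main__":
    sample = ["// Italy", "GOV.IT", "", "Lærdal.no  // utf8 !", "*.uk"]
    for rule in sorted(load_rules(sample)):
        print(rule)
```

In this sketch only the non-ASCII entry goes through the IDN conversion, while the ASCII entries take the cheap lower-casing branch.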
(In reply to comment #49) > nsEffectiveTLDService::LoadOneEffectiveTLDFile is called on startup, it calls > AddEffectiveTLDEntry for each non-comment non empty line, which processes the > line and call NormalizeHostname for the host itself: > http://mxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsEffectiveTLDService.cpp#282 > > which checks if it's ascii, then converts it to lower case or convert to ACE > otherwise. > > If the file is already normalized, there is no need to call NormalizeHostname. Yowza. So we're "needlessly" converting/lowercasing the whole file at startup, every startup. That's pointless work we really don't need to do -- I'll take a look at that and see how we can make improvements here. > So they could use any web app, such http://idnaconv.phlymail.de/, some script > or even someone could build an extension. That's not ease of use by any means. :-)
(In reply to comment #50)
> Yowza. So we're "needlessly" converting/lowercasing the whole file at startup, every startup. That's pointless work we really don't need to do -- I'll take a look at that and see how we can make improvements here.
It looks like the eTLD service can load arbitrary eTLD.dat files - at least from the profile folder. So we can't guarantee that the file will be normalized. But it's possible to convert the file to punycode while building, save the ascii+punycode file in the app folder and pass a flag saying there is no need to convert it again while loading. We could even keep an eTLD.dat.txt and an eTLD.dat in the same folder, check timestamps, and rebuild eTLD.dat if needed. I'm using this mechanism to generate the converted file on startup. It's trivial to make it add the UTF8 entries as comments before the punycode. So, if there's consensus, I can do that.
> That's not ease of use by any means. :-)
Would making Composer punycode-aware count as ease of use?
(In reply to comment #50) > Yowza. So we're "needlessly" converting/lowercasing the whole file pretty sure you're chasing ghosts here - it may seem "shock! horror!" but the fact is this is gonna be small potatoes. the majority of lines will go through the fast path (ASCII), and the UTF8 entries won't be significant since there aren't many of them. turning off normalization on read-in is dangerous, in case a) something slips by us (we've found two obvious typos in the file already), b) the user replaces the tld file, or has supplemental ones. however, i completely agree it'd be nice to do this better (and avoiding the file read altogether *would* be significant). preprocessing the file at build time, and using static data, would be nice - if we did that, such that we can guarantee the data is normalized, and we either remove or special-case the supplemental file capabilities, then it'd be viable.
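A rough Python sketch of the build-time idea discussed in the last two comments follows: regenerate a normalized file only when the hand-edited source is newer, and keep the UTF-8 form as a comment above each converted entry. The function names and file layout are assumptions for illustration, and the real work happened in bug 414122, whose implementation is not shown in this thread.

```python
import os


def needs_rebuild(source_path: str, output_path: str) -> bool:
    """True if the output is missing or older than the source file."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(output_path)


def preprocess(source_path: str, output_path: str) -> None:
    """Write a lower-cased, ACE-encoded copy of the eTLD source file."""
    with open(source_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as out:
        for raw in src:
            line = raw.strip()
            if not line or line.startswith("//"):
                out.write(line + "\n")  # keep comments and blank lines
                continue
            rule = line.split()[0].lower()
            if rule.isascii():
                out.write(rule + "\n")
            else:
                # Keep the readable UTF-8 form as a comment above the ACE form.
                out.write("// " + rule + "\n")
                out.write(rule.encode("idna").decode("ascii") + "\n")


def build(source_path: str, output_path: str) -> bool:
    """Run the preprocessing step only when it is needed."""
    if needs_rebuild(source_path, output_path):
        preprocess(source_path, output_path)
        return True
    return False
```

Because user-supplied files in the profile folder would bypass this step, a loader built on it would still need to normalize those files on read-in, as noted above.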
nominating for blocking - we should definitely roll in these updates before release (and we'll use this bug to cover all of them).
Flags: blocking1.9?
Bug 414122 moves the processing into the build process for great justice. (In reply to comment #51) > To make Composer punnycode aware would be easy of use? Using a proper UTF-8-supporting editor is ease of use; there's not really a shortage of them these days. For example, TextWrangler works just peachy for me over here on Unicode.
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
(In reply to comment #54)
> Bug 414122 moves the processing into the build process for great justice.
>
> (In reply to comment #51)
> > Would making Composer punycode-aware count as ease of use?
>
> Using a proper UTF-8-supporting editor is ease of use; there's not really a shortage of them these days. For example, TextWrangler works just peachy for me over here on Unicode.
TextWrangler does not run on any operating system I have access to. I have been using Firefox as an editor, but since it doesn't save any files (I guess Seamonkey might do the trick), I have been copying the data into various editors on Linux, Windows, Solaris and Mac OS X. But only TextEdit did what I asked : File->New, Edit->Paste, File->Save, without any changes to the data. You would be surprised how few I found that can complete such a simple task. Most decided to save it as UTF-16 when they discovered a character that can't be present in ISO-8859-1 (the ŋ in .no domains, and the Taiwanese ones). Or they did it in UTF-8, but they mangled the character anyway.
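The failure modes listed here (a silent switch to UTF-16, a legacy single-byte encoding, characters replaced by "?") can be checked mechanically before attaching a file. The following Python sketch is one possible pre-flight check and not an existing tool; check_encoding is a hypothetical name, and treating any "?" in a line as suspect is a heuristic that assumes hostnames never legitimately contain one.

```python
import codecs


def check_encoding(data: bytes) -> list:
    """Return a list of human-readable problems found in the raw file bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return ["file starts with a UTF-16 byte order mark"]
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8 at byte offset %d" % exc.start]

    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # "?" or U+FFFD usually means an editor replaced a character.
        if "?" in line or "\ufffd" in line:
            problems.append("line %d: possible mangled character" % number)
    if not text.endswith("\n"):
        problems.append("missing trailing newline")
    return problems


if __name__ == "__main__":
    print(check_encoding("leaŋgaviika.no\n".encode("utf-8")))
    print(check_encoding("leaŋgaviika.no\n".encode("utf-16")))
    print(check_encoding(b"lea?gaviika.no\n"))
```

The trailing-newline check is included because an earlier comment in this bug observed that the last rule is not applied without it.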
TextWrangler is OS X; gEdit also handles UTF-8, and that's Linux. How do both of these not work (especially the Linux one, since Linux at this point is basically pure UTF-8, no ASCII to be seen)?
Jo, you can try Emacs: Mac OS X: http://aquamacs.org/ Win: http://www.ourcomments.org/Emacs/EmacsW32.html Linux: Your distro package or http://ftp.gnu.org/pub/gnu/emacs/ There is a builtin diff that should work against CVS cleanly. If you can't find a way to make the diff work for you with utf8, you can upload the whole file and someone can make a diff. Your last list should have "lea?gaviika.no" and "`øksnes.no" corrected to "leaŋgaviika.no" and "øksnes.no". Other than that, there are Gerv's comments about 2nd level eTLDs.
Comment on attachment 299109 Addressing comments and joining both patches: It looks like most people would rather not convert the file, and Jeff is taking care of pre-processing the file in another bug.
Attachment #299109 - Attachment is obsolete: true
Attachment #299109 - Flags: review?(dwitte)
(In reply to comment #56)
> TextWrangler is OS X; gEdit also handles UTF-8, and that's Linux. How do both of these not work (especially the Linux one, since Linux at this point is basically pure UTF-8, no ASCII to be seen)?
I'm running OS X 10.2.8 - TextWrangler requires 4.0. And Linux - I'm using an emasculated version at the office (which is also why I can't find a diff that works correctly), where gEdit doesn't run. Neither does Firefox btw, since that requires a newer version of glib. I even tried vi as an editor (I didn't bother with emacs, I haven't fired that up in at least 4 years) - but it still mangles the data (it saved either in ASCII or in UTF-16). PS : emacs is installed on OS X - no need to download it. Your remark about "leaŋgaviika.no" and "øksnes.no" is exactly what I'm talking about - every time you think it's correct, and check and recheck the file, you always discover there's a problem. It doesn't even matter where or how it happens - the entire data is very fragile ! That's why I applaud all efforts to simplify it, like only using punycode in the input file.
Emacs 21 doesn't handle unicode very well; I needed some black magic to make it work. But editing utf-8 files anywhere is possible without much effort... if your system is configured to use a western encoding, vi and emacs will pick that as the default. By evaluating these:
(set-keyboard-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
or typing C-x RET t utf-8 and C-x RET k utf-8, I could make emacs open that eTLD file as utf8 on a system configured to use iso-8859-1. I couldn't make the Composer in SM edit that file; it opens the file as read-only when it actually contains utf8 characters. Another approach would be, if Gerv and Jeff are OK with it, to keep the list on the web-site as a reference and use punycode in the source code. But I think everyone agreed we should focus on getting the list as complete as possible for b3. It's better to leave optimizations and changes in the format for later, even though the pre-processing Jeff is doing would be nice for 1.9.
(In reply to comment #55) > But only TextEdit did what I asked : > File->New. Edit->Paste, File->Save, without any changes to the data. You would > be surprised how many I found that can complete such a simple task. Most > decided to save it UTF-16 when they discovered a character that can't be > present in ISO-8859-1 (the ŋ in .no domains, and the Taiwanese ones). Or they > did it in UTF-8, but they mangled the character anyway. So just explicitly switch to UTF-8 before pasting? Or open the previously existing file, which should already be UTF-8? Either way I don't really see the problem. Using UTF-8 rather than punycode makes the file better accessible and maintainable.
alright - let's do this thing in UTF8. punycode would be nice, but Waldo's right, readability beats editability. sorry Jo - but, the least we can do is roll diffs for you if you attach full files here. i have verbal moa=biesi for all updates to this file for 1.9. asrail (or Jo), if you want to roll together a final patch with Jo's changes, i'll review it and we can land for b3. (please do remove those existing "utf8!" comments - they look a bit silly!)
I'm not sure what the best thing to do is about Gerv's comments on 2nd level eTLDs, i.e. the ones from CentralNIC, NetRegistry and eu.org. I can make a patch without that controversial part, which is minor. So that part could land even after b3, if desired.
yeah, let's leave those out for now, can land later. gerv, can we just ask them about adding those domains? seems like the responsibility is generally on us to keep our file up to date. (also, do we already have domains listed from those registrars? if so, and assuming we didn't ask before, why ask now?)
Attached patch diff -w version of the patch (deleted) — Splinter Review
Since there were a few whitespace changes, uploading a diff -w. The other one is intended to land, this one is for comparison.
I've removed the "free za" subdomains, since they are in the same category as eu.org and the others. And Dan, we don't have domains from those registrars yet.
Comment on attachment 299640 Jo's list without 2nd level eTLDs (checked in) r=dwitte
Attachment #299640 - Flags: review?(dwitte) → review+
Comment on attachment 299640 [details] [diff] [review] Jo's list without 2nd level eTLDs (checked in) landed this patch - leaving this bug open for followup updates for 1.9.
Attachment #299640 - Attachment description: Jo's list without 2nd level eTLDs → Jo's list without 2nd level eTLDs (checked in)
(In reply to comment #64) > gerv, can we just ask them about adding those domains? seems like the > responsibility is generally on us to keep our file up to date. (also, do we > already have domains listed from those registrars? if so, and assuming we > didn't ask before, why ask now?) Here's my thought. Ideally, the controller of each TLD or pseudo-TLD would be responsible for their section of the file. But until that is the case, we have a duty to be very cautious. Say, for example, we-provide-great-urls.com currently makes domains available in country-specific areas, e.g.: my-company.uk.we-provide-great-urls.com someone-else.de.we-provide-great-urls.com So we come along and put uk.we-provide-great-urls.com and de.we-provide-great-urls.com into the TLD list. However, unbeknown to us, this company is planning to launch a great new service with even shorter URLs, allowing you to do: mycompany.we-provide-great-urls.com They launch this service on the same day Firefox 3 comes out. But disaster - no-one with one of these new domains can set cookies in Firefox. (That's what would happen, right?) In other words: we should not be adding restrictions in the sub-domain space of private companies without their express permission. Gerv
(In reply to comment #70) > They launch this service on the same day Firefox 3 comes out. But disaster - > no-one with one of these new domains can set cookies in Firefox. (That's what > would happen, right?) well, as you pose the problem, no - mycompany.we-provide-great-urls.com would still be able to set cookies and everything else, *unless* we added a *.we-provide-great-urls.com rule, but that really seems like something we wouldn't do (for exactly this reason). (wildcard rules should be used with care!) so, i'm not sure what you think given this information, though i do agree we should be cautious. i just want to make sure things don't get stalled, since more and more consumers are using this list for security-related purposes.
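The difference between exact rules and a wildcard rule in the example above can be sketched roughly as follows. This is a simplified model written for illustration, not the actual eTLD service code: it handles only exact and `*.` rules (no exception rules), and `effective_tld` and `is_etld` are hypothetical names.

```python
def effective_tld(host, rules):
    """Return the eTLD of host under a simplified rule set.

    Exact rules ("co.uk") and wildcard rules ("*.co") only. If no rule
    matches, the last label of the host is returned.
    """
    labels = host.lower().split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
        # A rule "*.parent" makes every direct child of parent an eTLD.
        parent = ".".join(labels[i + 1:])
        if parent and "*." + parent in rules:
            return candidate
    return labels[-1]


def is_etld(host, rules):
    """True if the host itself is an eTLD under the given rules."""
    return effective_tld(host, rules) == host.lower()


exact = {"com", "uk.we-provide-great-urls.com", "de.we-provide-great-urls.com"}
wildcard = {"com", "*.we-provide-great-urls.com"}
new_host = "mycompany.we-provide-great-urls.com"

print(effective_tld(new_host, exact))     # falls through to the "com" rule
print(is_etld(new_host, wildcard))        # the wildcard turns the host into an eTLD
print(effective_tld("foo.bar.co", {"*.co"}))
```

With only the two exact rules, the new short host is an ordinary host under `com`. Only the wildcard rule would turn it into an eTLD itself, which is the case where setting cookies breaks.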
Ruben - how's the site update going? can you ping me once it's live, and I (or you!) can draft up an email to send out to registrars (comment 17).
Dan, I've updated the site to mention that the list is in UTF-8 format (I gather that's the latest consensus) and I've uploaded everything to the SVN - it just needs to be replicated onto the live site now...
(In reply to comment #74) > There's a small typo ;), whenever the eTLD.dat is next updated: > > http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/netwerk/dns/src/effective_tld_names.dat&rev=1.3&mark=1148#1148 I needed to kick off the unit test boxen, so I just fixed it.
Flags: tracking1.9+
Beltzner: did you mean to mark this blocking1.9+? Removing the tracking-1.9+ flag without a comment makes it look like a mistake.
Flags: blocking1.9?
(In reply to comment #69) > (From update of attachment 299640 [details] [diff] [review]) > landed this one - leaving this one open for followup updates for 1.9. ... (In reply to comment #76) > Beltzner: did you mean to mark this blocking1.9+? Removing the tracking-1.9+ > flag without a comment makes it look like a mistake. Oops. I should have commented and marked this wanted-next+. I'm assuming based on comment 69 that we've got what we need to ship, any further updates would be sauce for the goose.
Flags: wanted-next+
Flags: blocking1.9?
Flags: blocking1.9-
yep, further updates here will be for the love of saucy geese.
The current .dat file has "*.co" as the rule for .co, but the corresponding page at http://en.wikipedia.org/wiki/.co seems to give a small set of specific TLDs. Is this the right bug to request a review on this? I'm not sure of the process here :(
It's not a problem: registrations are only possible at the third level in Colombia, so *.co matches all second-level domains as effective TLDs, including any future second-level domain. This way we don't have to specify all of them. The wildcards in this file are a bit different from the regular expressions you may be used to, because they're supposed to match eTLDs, not domains. Now, if we could convince those registrars not to mix 2nd level and 3rd level domains together, then the file could be a lot simpler ...
Sure, I understand the eTLD match versus domain match issue; my concern was that right now we would think that "foo.bar.co" was an address in the valid eTLD bar.co, when in fact it's not a valid address at all. It would be nice to distinguish these cases.
the purpose of the eTLD service isn't to determine if an address is valid, it's just to provide the (supposed) eTLD of a given host. providing a wrong answer in the case of an invalid host is acceptable.
Attached patch Missing eTLDs for .ru (obsolete) (deleted) — Splinter Review
The current eTLD file says for .ru: "there should be geo-names like msk.ru, but I didn't find a list". One of our users found the list: http://www.ripn.net:8082/nic/dns/geo_list.html http://www.ripn.net:8082/nic/dns/generic_domains.html This patch adds the missing .ru geographic domains and ac.ru, which is also missing.
Can I suggest that the www.centralnic.com domains are added? People can register under the following "eTLDs": eu.com uk.com uk.net us.com cn.com de.com jpn.com kr.com no.com za.com br.com ar.com ru.com sa.com se.com se.net hu.com gb.com gb.net qc.com uy.com ae.org
Also, currently the eTLD system defaults to assuming that an unknown eTLD has minimum length (i.e. "www.example.invalid" is assumed to have an eTLD of "invalid"). Isn't this "failing dangerously?" - i.e. wouldn't it be safer to assume that in an unknown situation we should default to restrictive rather than permissive?
The centralnic domains are not true eTLDs, merely domain names for which a company sells out subdomains. An eTLD of "invalid" in the example you give is correct, and doing this differently could have unknown consequences on intranet sites.
If the CentralNIC domains are not "true eTLDs" then either you or I must be considerably misunderstanding the meaning of the 'e' in "eTLDs". As I understand it, "domain names for which a company sells out subdomains" is precisely what this list *is*. A web site of "example.uk.com" is *not* the same organisation as "otherexample.uk.com" and sharing cookies between them is a security risk. I don't particularly approve of the "uk.com" style domains either, but like it or not, it is a fact that they *are* eTLDs. An eTLD of "invalid" in my example is not correct. However, I don't know the implications for intranet sites, it may be that a deliberate and reasoned decision has been made that breaking intranet sites is worse than the potential security risk for sites on unknown eTLDs, i.e. that for practical reasons the "wrong" answer is deliberately given - in which case I won't argue.
CentralNIC is not the registrar for .com, .net, or .org. It merely owns some domains in them. Even if it's selling subdomains to customers, foo.eu.com and bar.eu.com are in the same top-level domain, ".com". This is not different than any other ISP that supplies its customers with subdomains on the ISP's domain. On what grounds do you say that "invalid" is not the correct eTLD for "www.example.invalid"? The effective top-level domain of a host that does not match any known rules is the last subcomponent. This is what browsers have always done--before the time of eTLD data the TLD of a site was _always_ considered to be the last subcomponent; this merely limits that fallback case to when we don't have more specific data. This is also compatible with new TLDs -- if ICANN or similar announces .coolnewdomain, existing Firefox instances will do the right thing with it even before they're upgraded to a version with more specific eTLD data.
Re CentralNIC, you still appear to be missing the significance of the "e" in "eTLD". Yes, "example.uk.com" and "otherexample.uk.com" are under the same "top-level domain". They are *not*, however, under the same "eTLD". (For that matter, example.co.uk and otherexample.co.uk are under the same TLD.) Also if ICANN announces ".coolnewdomain", existing Firefox instances will only "do the right thing" if it happens that end users register directly under the TLD. If they actually register under "a.coolnewdomain" and "b.coolnewdomain" (like ".pro", for example) then Firefox is certainly not "doing the right thing" - that's the whole point of the eTLD list in the first place. As I said above, denying that there's a problem is pointless. Saying "yes there's a problem but it's insoluble and we've thought about it and decided upon what we believe is the least-worst solution" is fine, if it's true. Are you actually a Mozilla committer or am I wasting my time arguing with you? ;-)
Your comments about CentralNIC are true, but they apply just as well to geocities.com or comcast.net or any other ISP or hosting service that lets users create their own virtual subdomains. We have to draw the line somewhere, and we've chosen to draw it at official registrars.
To follow Pam's comments... You still haven't said what you think the correct TLD for no-rule scenarios is. You are right that our current behavior is suboptimal if a new domain is announced using registration only under subcomponents of that domain, but the scope of the damage is limited: basically, we fall back to the pre-eTLD behavior. What is your alternative?
Pam - I guess that's fair enough. I suggest that it might be an idea to write the policy down somewhere though, if a policy has been decided upon. Even just a comment at the top of the eTLD data file would be a big improvement. I'm not sure I 100% agree with it but I won't argue about it. Peter - sorry, I thought I made it obvious in my first comment on the subject what I thought the behaviour should be: given "foo.bar.unknown", it should default to "most restrictive", i.e. refuse to share cookies with anything not ending in "foo.bar.unknown", just the same as if the site was "example.co.uk" or "example.com". As already mentioned, if it's already been thought about and a policy decision has been made that the security benefits of this approach are outweighed by practical breakages that would occur, that's fine but I think it should be documented (if it isn't already and I haven't missed it!)
If you use the "most restrictive" policy on unknown TLDs, then there is the chance that if ICANN approves a new TLD, previous versions of Firefox etc will break sites under that TLD since they will not allow cookies to be set/read. It is true that many more people regularly update their Mozilla products than, say, IE, but there is still a chance which must be considered.
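The two fallback policies argued over in the last few comments can be compared with a small sketch. Everything here is an assumption made for illustration: `KNOWN_TLDS` stands in for the rule file, the function names are hypothetical, and the cookie check is a deliberately simplified model (host-only cookies are always allowed) rather than the real cookie code.

```python
KNOWN_TLDS = {"com", "uk"}  # hypothetical stand-in for the rule file


def etld_permissive(host):
    """Current fallback: the eTLD of an unknown host is its last label."""
    return host.split(".")[-1]


def etld_restrictive(host):
    """Proposed alternative: an unknown TLD makes the whole host the eTLD."""
    last = host.split(".")[-1]
    return last if last in KNOWN_TLDS else host


def may_set_cookie(host, cookie_domain, etld):
    """Simplified check: may host set a cookie scoped to cookie_domain?"""
    if cookie_domain == host:
        return True  # assumption: host-only cookies are always allowed
    is_parent = host.endswith("." + cookie_domain)
    below_etld = cookie_domain.endswith("." + etld(host))
    return is_parent and below_etld


site = "www.example.coolnewdomain"
print(may_set_cookie(site, "example.coolnewdomain", etld_permissive))   # True
print(may_set_cookie(site, "example.coolnewdomain", etld_restrictive))  # False

shared = "foo.a.coolnewdomain"
print(may_set_cookie(shared, "a.coolnewdomain", etld_permissive))       # True
print(may_set_cookie(shared, "a.coolnewdomain", etld_restrictive))      # False
```

Under these assumptions the first pair shows the breakage the restrictive policy would cause on a newly approved TLD, and the second pair shows the cookie sharing the permissive policy allows when registrations happen below the new TLD.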
Should this bug be retired and a new one created? Here's a patch incorporating the previous addition (missing eTLDs for .ru), adding .rs and .me information, and updating info for several TLDs based on posted registrar information (with occasional small additions from Wikipedia).
Assignee: nobody → pamg.bugs
Attachment #318014 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #331017 - Flags: review?
Attachment #331017 - Flags: review? → review?(jo.hermans)
Yes, please create a new bug for further additions. Thanks :-) Gerv
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Attachment #331017 - Flags: review?(jo.hermans)
Moved to bug 447815.