Closed Bug 1369738 Opened 7 years ago Closed 7 years ago

Create XML sitemap for www.mozilla.org

Categories

(www.mozilla.org :: General, enhancement)

Production
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pgerman, Assigned: pmac)

Attachments

(1 file)

We would like to create an XML sitemap for pages on www.mozilla.org. I have prepared a draft prioritized list of pages: https://bugzilla.mozilla.org/show_bug.cgi?id=906882 XML sitemaps should follow this structure: https://www.sitemaps.org/protocol.html
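For reference, here is a minimal sketch of what a document following the sitemaps.org structure looks like when generated in Python. It is an illustration only, not the generator that was eventually written for bedrock; the function name `build_sitemap` and the example URLs are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a <urlset> document with one <url><loc> entry per URL.

    build_sitemap is a hypothetical helper name used for illustration.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap(["https://www.mozilla.org/en-US/firefox/"]))
```

Only `<loc>` is required for each entry; optional elements such as `<lastmod>` are left out of this sketch.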
Please reference https://bugzilla.mozilla.org/show_bug.cgi?id=906882 for additional historical context. @Eric can you confirm who we should be working with on this on the dev side?
Flags: needinfo?(erenaud)
I just confirmed with pmac that he can help here; CCing him.
Flags: needinfo?(erenaud)
Blocks: 906882
Summary: Create XML sitemap → Create XML sitemap for www.mozilla.org
Assignee: nobody → pmac
Status: NEW → ASSIGNED
Looking at the doc mentioned in bug 906882 [0], I just wanted to double check that we only want those specific urls in that doc included in the sitemap unless noted by a ".*" at the end? Do I have that right? And we only want it to automatically include new pages under those specific areas as well?
Flags: needinfo?(pgerman)
Correct, that is the intention. Do you see any risk of inadvertently excluding content? Example: we're making a page for Firefox Focus, which we would want included, but it is not currently on the site. Is there a good way to show that?
Flags: needinfo?(pgerman) → needinfo?(pmac)
That is the concern. If we're just manually setting this up for the pages we want, then we do risk forgetting to keep it updated. Kohei's old PR discovered URLs and added them automatically, but we probably have less control that way. So it's a tradeoff. We should be able to manually manage the top-level URLs and have it discover the ones beneath them (like Focus, which I presume will live under /firefox). We'll be doing that for areas like /firefox/features/*.
Flags: needinfo?(pmac)
I've got a good start on this. You can see the current proposed output of the sitemap generator here: https://gist.github.com/pmac/d2b1a7a38f49d7ade4cbcfdcc505268a PG and I were discussing something on IRC that I wanted to capture here. I'm unclear on whether the sitemap should include all of the locale URLs. We're already indicating the alternate "hreflang" URLs to search engines in the pages themselves, via <link> tags in the head of the documents. Google tells us that we can get this information to them in one of three ways: 1) in <link> tags in the head, 2) in Link: HTTP headers in the response, or 3) in the sitemap: https://support.google.com/webmasters/answer/189077?hl=en&ref_topic=2370587 Since we're already doing #1, I'm reluctant to duplicate that in the sitemap: we generate these lists in different places in the code, so the two could end up confusing or even contradictory. I'm unsure which place is best, however, and PG said on IRC that he thought perhaps including both would be good. Current state of the work is that they are being added to the sitemap, but I can easily remove them. PG said that he needed some time to do more research, so I'm waiting for the conclusion of that. The sitemap is being generated successfully now, so whichever way we go I don't think there's much more to do to get this merged and finished.
(In reply to Paul [:pmac] McLanahan from comment #7) If we didn't include all the locales, which locales would we include? Would we just have the URLs without the locale (i.e. www.mozilla.org/firefox/)? The concern I have with a sitemap of locale-less URLs like www.mozilla.org/firefox/ is that they are not canonical. Search engines see www.mozilla.org/firefox/ as just a 301 redirect to another URL, such as www.mozilla.org/en-US/firefox/, which is canonical for the en-US locale. Search engines only give rank to URLs that are not redirects.
I did some research and it seems like Google and other search engines will ignore 301 redirects in sitemaps. So, if every URL in the sitemap is a 301 to another URL (i.e. the canonical locales), then the entire sitemap could be ignored. I also found this note in the Google support article from comment 7, specifically about the homepage, which may or may not apply to our use case: "For language/country selectors or auto-redirecting homepages, you should add an annotation for the hreflang value "x-default" as well:" <link rel="alternate" href="http://example.com/" hreflang="x-default" /> Apple has a well-structured website with a clear hierarchy; they have search sitelinks working and also include all of the locales in their sitemap: https://www.apple.com/sitemap.xml
We could just provide the en-US URLs in the sitemap, then rely on our <link> tags on those pages to inform the google-bot of all of the translations. That should work. Apple seems to simply include every URL (including translations) as <url> entries instead of the way Goog's docs say to do it with hreflang. There is a lot of conflicting info out there unfortunately.
(In reply to Paul [:pmac] McLanahan from comment #9) Yeah, lots of conflicting information for sure. Of these three choices: a) All canonical URLs for all locales b) en-US canonical URLs only c) locale-less URLs The relative priority would be: a > b > c. Option B (en-US only) seems better than option C (locale-less URLs). Between options A and B, what is the technical cost difference? (Ignoring the fact that we have link hreflang in the HTML.)
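To make the three options concrete, here is a small sketch of the URL list each one would produce. It assumes a mapping from locale-less path to the locales in which that path is available; the mapping shape, the `sitemap_urls` name, and the sample data are illustrative assumptions, not bedrock code.

```python
BASE = "https://www.mozilla.org"


def sitemap_urls(pages, option):
    """Return the URLs to list in the sitemap for option 'a', 'b', or 'c'.

    pages maps a locale-less path (e.g. "/firefox/") to the locales in
    which that page is available. This shape is assumed for illustration.
    """
    urls = []
    for path, locales in sorted(pages.items()):
        if option == "a":
            # a) all canonical URLs for all locales
            urls.extend(f"{BASE}/{locale}{path}" for locale in sorted(locales))
        elif option == "b":
            # b) en-US canonical URLs only
            if "en-US" in locales:
                urls.append(f"{BASE}/en-US{path}")
        elif option == "c":
            # c) locale-less URLs, which redirect to a locale-specific URL
            urls.append(f"{BASE}{path}")
        else:
            raise ValueError(f"unknown option: {option!r}")
    return urls


if __name__ == "__main__":
    sample = {"/firefox/": ["en-US", "de"]}
    for opt in "abc":
        print(opt, sitemap_urls(sample, opt))
```

Under these assumptions, moving between options is a one-branch change in the generator, so the deciding factor is what search engines do with the result rather than implementation effort.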
I think the cost is about the same. If you look at the current output you'll see all of the locale hreflang links in there already. It's the duplication I'm concerned about. I think we should only have the hreflang in either the sitemap or the code for the pages, but not both. And I think it's more appropriate and useful in the page source. So I'm thinking either B or C is the way to go for the sitemap. My understanding is that search engines should then find the pages and the other localized versions because of the hreflang link tags there.
@pg @hoosteeno: what do you think about starting with option B (en-US only), seeing how it performs, and then deciding next steps before going full-in with option C?
Flags: needinfo?(pgerman)
Flags: needinfo?(hoosteeno)
Based on this article from Google: https://support.google.com/webmasters/answer/2620865?hl=en It sounds like the 'ideal' state is to also do hreflang-style translations in the sitemap and include every locale's version of the site. If this is too complicated and we want to get something out the door quickly to see how it works, I recommend option b) as well.
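As an illustration of the hreflang-in-sitemap approach, the sketch below emits one `<url>` entry per locale and attaches an `xhtml:link rel="alternate"` element for every available locale, including the entry's own. The function name, input shape, and example data are assumptions made for this sketch; it is not the code that shipped.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_sitemap(base, pages):
    """Build a sitemap whose entries carry hreflang alternates.

    pages maps a locale-less path to its available locales (assumed shape).
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for path, locales in sorted(pages.items()):
        for locale in sorted(locales):
            entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            loc = ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc")
            loc.text = f"{base}/{locale}{path}"
            # List every translation of this page, including this one.
            for alt in sorted(locales):
                ET.SubElement(
                    entry,
                    f"{{{XHTML_NS}}}link",
                    {
                        "rel": "alternate",
                        "hreflang": alt,
                        "href": f"{base}/{alt}{path}",
                    },
                )
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_hreflang_sitemap("https://www.mozilla.org",
                                 {"/firefox/": ["en-US", "de"]}))
```

The output grows with the square of the locale count per page, which is one practical reason to try the en-US-only option first.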
Flags: needinfo?(pgerman)
I believe I've got this working, and have deployed it to my demo. Have a look: https://bedrock-demo-pmac.us-west.moz.works/sitemap.xml As a bonus (because of Kohei's original work) we also get: https://bedrock-demo-pmac.us-west.moz.works/all-urls.json Which is a JSON mapping of URLs with all of the locales in which those URLs are available. Could be helpful for something I think. Let me know what you think and especially if you see anything missing. I still need to update robots.txt to specify the sitemap URL. Working on that now.
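The robots.txt change mentioned above amounts to adding a `Sitemap:` line, which the sitemaps.org protocol defines for this purpose. A minimal sketch of doing that idempotently, with a hypothetical helper name and an illustrative sitemap URL:

```python
def add_sitemap_line(robots_txt, sitemap_url):
    """Append a 'Sitemap: <url>' directive unless it is already present.

    add_sitemap_line is a hypothetical helper, not part of bedrock.
    """
    directive = f"Sitemap: {sitemap_url}"
    lines = robots_txt.splitlines()
    if directive in lines:
        return robots_txt
    # Separate the directive from the preceding rule group with a blank line.
    return "\n".join(lines + ["", directive]) + "\n"


if __name__ == "__main__":
    print(add_sitemap_line("User-agent: *\nDisallow: /admin/\n",
                           "https://www.mozilla.org/sitemap.xml"))
```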
Attached file: Link to pull request on GitHub (deleted)
(In reply to Paul [:pmac] McLanahan from comment #14) Wow, pretty awesome work :pmac! I ran the sitemap through a few checkers and it passed with a 100% score without any errors or warnings. :) pmac: Quick question: are we doing all bedrock URLs in the sitemap or just the ones that :pg specified in the comment 4 doc?
Flags: needinfo?(pmac)
> pmac: Quick question: are we doing all bedrock URLs in the sitemap or just the ones the :pg specified in the comment 4 doc? It's all of them. It's being generated based on URL configs in the site, so it will stay up-to-date no matter what pages are added. It is regenerated on every deployment. We can tweak the generation to include or exclude any URLs we want though, so if you see something in there that we don't want to highlight or notice something missing just let me know. I believe it covers all the ones from the doc though.
Flags: needinfo?(pmac)
(In reply to Paul [:pmac] McLanahan from comment #17) Nice :pmac! Justin/PG: can you confirm if there are any specific URLs you want to exclude from the auto-generated list of URLs? Pmac: what do you think about removing any URL from the XML sitemap that would be caught by the robots.txt filter? https://www.mozilla.org/robots.txt URLs that match those patterns seem like prime initial candidates to remove, since we would be telling search engines about the URLs and then blocking them in robots.txt.
I agree 100% with excluding URLs that we don't want robots to crawl, and I think those two sources should be aligned. It doesn't make a lot of sense to exclude URLs that the crawler can find and we want the crawler to index. I added a couple outstanding questions to the PR: https://github.com/mozilla/bedrock/pull/4907#issuecomment-309148884
Flags: needinfo?(hoosteeno)
> what do you think about removing any URL from the XML sitemap that would be caught by the robots.txt filter? We should exclude those. I thought I had, but must have missed some. I'll check.
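One way to keep the two sources aligned is to run every candidate sitemap URL through the robots.txt rules before writing it out. The sketch below uses Python's standard-library `urllib.robotparser`; the helper name and the sample rules and URLs are made up for illustration, and this is not necessarily how the PR does it.

```python
from urllib.robotparser import RobotFileParser


def crawlable_urls(urls, robots_txt, user_agent="*"):
    """Return only the URLs that the robots.txt rules allow to be fetched.

    crawlable_urls is a hypothetical helper; robots_txt is the file's text.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /admin/\n"  # illustrative rules only
    candidates = [
        "https://example.org/admin/login",
        "https://example.org/en-US/firefox/",
    ]
    print(crawlable_urls(candidates, rules))
```

As far as I know, the standard-library parser matches rules by path prefix only, so wildcard patterns in a robots.txt file would need a different matcher.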
I have a number we can use to see how well the sitemap works. Right now www.mozilla.org has 384 URLs that have organic traffic or backlinks, but no internal links[0]. A sitemap should make this number approach 0. We'll watch! [0] https://docs.google.com/spreadsheets/d/1_Zh7WqMA8NZ29gb3K4OM3shCbyHf5YFBRjsO82HU6DA/edit#gid=1122704084
Pmac updated his code and here's the latest XML sitemap: https://bedrock-demo-pmac.us-west.moz.works/sitemap.xml Robots.txt: https://bedrock-demo-pmac.us-west.moz.works/robots.txt Looks like we are good here! :pg :hoosteeno, "ship it?"
Let's ship this now. I can add to webmaster tools and run tools against it to see how it's working. I also have copious tooling to help understand whether it has any effect on organic traffic patterns.
Agreed. :D
Commits pushed to master at https://github.com/mozilla/bedrock
https://github.com/mozilla/bedrock/commit/4002e7167a5390a0b46b601e58e83bbe17321a0c
Fix bug 1369738: Add XML sitemap
Add a management command to generate a sitemap and update deployment to generate and include it in the builds. Thanks to @kyoshino in PR #1333 for most of the basis of this work.
https://github.com/mozilla/bedrock/commit/9fd1dddb8e4760bf4a79aead08336124af024d64
Merge pull request #4907 from pmac/add-sitemap-xml-1369738
Fix bug 1369738: Add XML sitemap
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
No longer blocks: 906882