Closed Bug 906882 Opened 11 years ago Closed 6 years ago

XML sitemap missing from www.mozilla.org

Categories

(www.mozilla.org :: Pages & Content, enhancement, P2)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1369738

People

(Reporter: cmore, Unassigned)

Details

(Whiteboard: [kb=1128714] )

Attachments

(5 files, 2 obsolete files)

Attached file GSI's examples with lang references (deleted) —
Based on the recommendations of our SEO audit and other research, we should create an XML sitemap that is easily crawlable by search engines and includes references to all languages.
No longer depends on: 906879
Priority: -- → P2
Whiteboard: [kb=1086766]
:kohei: What are your thoughts on making a dynamic sitemap.xml file that covers only the URLs/locales that are in bedrock and expands over time as more pages and locales move into bedrock?
I think we can have a dynamic sitemap in some way... As more pages are migrated to Bedrock, the sitemap will become more complete. I'll find out how :)

I found a useful script at http://djangosnippets.org/snippets/1434/
Assignee: nobody → kohei.yoshino
Status: NEW → ASSIGNED
OS: Mac OS X → All
Hardware: x86 → All
Attached file Full sitemap.xml including l10n (obsolete) (deleted) —
The implementation was not difficult. My rough code is here, and sample output is attached.
https://github.com/kyoshino/bedrock/commit/d0b485075d343c1e650842395dc637b4ee662a13

Issues:

* It takes about 30 seconds to respond. The translation list of each page is based on the template name, but there's no easy way to get the template name of each URL. I had to send an HTTP request to each page.

* As you can see, the output is redundant. The file size will eventually exceed the 50 MB limit.
https://support.google.com/webmasters/answer/183668?hl=en#1

Possible solution:

* Include only /en-US/ URLs in the sitemap. Search engines can still discover each page's alternate URLs, which we already implemented in Bug 481550.
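To illustrate why the full output is redundant, here is a minimal sketch (not the code from the commit above) of a sitemap in which every localized URL lists every other locale as an alternate. The `/{locale}{path}` URL layout, the function name `build_urlset`, and the sample paths and locales are assumptions for illustration only. Each page contributes one `<url>` per locale, and each of those carries one `xhtml:link` per locale, so the file grows with the square of the locale count.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
MAX_BYTES = 50 * 1024 * 1024  # the 50 MB limit mentioned in the comment above

# Serialize the sitemap namespace as the default one and xhtml with a prefix.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_urlset(base, paths, locales):
    """Return sitemap XML bytes with hreflang alternates for every locale.

    Assumes (hypothetically) that every path exists in every locale and
    that localized URLs look like <base>/<locale><path>.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for path in paths:
        for locale in locales:
            url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
            loc.text = f"{base}/{locale}{path}"
            # One alternate link per locale: this is the redundant part.
            for alt in locales:
                ET.SubElement(
                    url,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": alt,
                     "href": f"{base}/{alt}{path}"},
                )
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    data = build_urlset("https://www.mozilla.org", ["/firefox/"], ["en-US", "fr"])
    print(len(data), "bytes; under limit:", len(data) < MAX_BYTES)
```

Restricting `locales` to `["en-US"]` gives the reduced output proposed above, at the cost of dropping the localized URLs from the sitemap itself.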
Attached file sitemap.xml (obsolete) (deleted) —
Sent PR: https://github.com/mozilla/bedrock/pull/1217

To avoid the issues noted above, I only included English URLs. An output sitemap.xml file is attached.
(In reply to Kohei Yoshino [:kohei] from comment #3)
> Created attachment 803081 [details]
> Full sitemap.xml including l10n
> 
> The implementation was not difficult. My rough code is here, and an output
> is attached.
> https://github.com/kyoshino/bedrock/commit/d0b485075d343c1e650842395dc637b4ee662a13
> 
> Issues:
> 
> * It takes about 30 seconds to respond. The translation list of each page is
> based on the template name, but there's no easy way to get the template name
> of each URL. I had to send an HTTP request to each page.
> 
> * As you can see, the output is redundant. The file size will exceed a 50 MB
> limit in the future.
> https://support.google.com/webmasters/answer/183668?hl=en#1
> 
> Possible solution:
> 
> * Including only /en-US/ URLs in the sitemap. Search engines can still
> recognize each page's alternate URLs that we already have implemented in Bug
> 481550.

Do you have a link, or can you attach an example of a complete sitemap.xml file that would include all locales? Over 50 MB? Wow.

An en-US-only sitemap.xml doesn't help SEO much at all.

jgmize had an idea: paginate the sitemap using the Django pagination feature.

jgmize and kohei, can you two sync up?
Flags: needinfo?(kohei.yoshino)
Flags: needinfo?(jmize)
(In reply to Chris More [:cmore] from comment #5)
> Do you have a link or can you attach an example complete sitemap.xml file
> that would include all locales? over 50MB? wow.

Attachment 803081 [details] in my Comment 3 is a complete sitemap. It's only 1.63 MB for now, but more and more pages are being migrated to and translated on Bedrock...

> en-US only sitemap.xml doesn't help SEO much at all.

Canonical URLs on each page might help, but of course, a complete sitemap would be helpful.

> jgmize had an idea: Use sitemap pagination and use the Django pagination
> feature. 

I'll check it out this afternoon!
Flags: needinfo?(kohei.yoshino)
I just regenerated a complete sitemap. Now it's 3.1 MB with 859 URLs. I will try to:

* Use a cron job to retrieve URLs, including localized pages
* Split the complete URL list by locale or by a fixed number of URLs, using a Sitemap index file
https://support.google.com/webmasters/answer/71453
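The split-by-locale plan can be sketched roughly as follows. This is not the code from the pull request; the function names and the `sitemap-<locale>.xml` file naming are hypothetical, and the sketch assumes every URL has the locale as its first path segment.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)


def group_by_locale(urls):
    """Group absolute URLs by their first path segment, e.g. 'en-US'.

    Assumes every URL path starts with /<locale>/.
    """
    groups = {}
    for url in urls:
        locale = urlsplit(url).path.split("/")[1]
        groups.setdefault(locale, []).append(url)
    return groups


def build_sitemap_index(base, locales):
    """Return a sitemap index that points at one child sitemap per locale.

    The child file names (sitemap-<locale>.xml) are a made-up convention.
    """
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for locale in sorted(locales):
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        loc = ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"{base}/sitemap-{locale}.xml"
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    urls = [
        "https://www.mozilla.org/en-US/firefox/",
        "https://www.mozilla.org/en-US/about/",
        "https://www.mozilla.org/fr/firefox/",
    ]
    groups = group_by_locale(urls)
    print(build_sitemap_index("https://www.mozilla.org", groups).decode("utf-8"))
```

Splitting by locale keeps each child file small and lets a cron job regenerate one locale at a time; splitting by a fixed URL count would work the same way with a different grouping function.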
Attached file Pull Request on GitHub (deleted) —
Attachment #821100 - Attachment description: pull reques → Pull Request on GitHub
Attachment #803297 - Attachment is obsolete: true
Attached file Sitemap Index (deleted) —
Attachment #803081 - Attachment is obsolete: true
Attached file Sitemap (en-US) (deleted) —
Flags: needinfo?(jmize)
Whiteboard: [kb=1086766] → [kb=1128714]
Severity: normal → enhancement
Attached file sitemap_urls (deleted) —
Summary: Create a dynamic XML sitemap of top-level URLs in [Bedrock] → Create a dynamic XML sitemap of all indexable URLs in [Bedrock]
Any update on the sitemap bug?
All: Given everything else we are working on now, let's put this on hold until later in Q2. I still think it will help, but we have bigger priorities now.
Status: ASSIGNED → NEW
Here's a good example of an XML sitemap for a website that has a lot of sub-sites with their own sub-navigation: https://www.apple.com/sitemap.xml
Now is a great time for us to resurrect this effort. It's high on the list of marketing priorities[0]. 

An optimal approach would:
* generate this sitemap from a more authoritative source than HTTP crawls (e.g. from bedrock itself)
* give us an opportunity to choose the priority of certain elements in the sitemap (e.g. Firefox marketing pages) in an effort to shape search results.

[0] https://docs.google.com/spreadsheets/d/1fizrZ92kNr6sJSMizxl343OF7F-BCJHEUG5TfWj1Gs8/edit#gid=466760365
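As a rough sketch of the priority idea, the sitemap protocol's optional `<priority>` element (0.0 to 1.0, default 0.5) could be set from a small rule table. The rule table, the path prefixes, and the function names below are hypothetical, and search engines treat the value as a hint at most.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)

# Hypothetical rules: first matching path prefix wins.
PRIORITY_RULES = [
    ("/firefox/", "1.0"),
    ("/about/", "0.8"),
]
DEFAULT_PRIORITY = "0.5"  # the protocol's default when none is given


def priority_for(path):
    """Return the priority string for a locale-free path such as /firefox/new/."""
    for prefix, priority in PRIORITY_RULES:
        if path.startswith(prefix):
            return priority
    return DEFAULT_PRIORITY


def url_entry(base, locale, path):
    """Build one <url> element with <loc> and <priority> children."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = f"{base}/{locale}{path}"
    ET.SubElement(url, f"{{{SITEMAP_NS}}}priority").text = priority_for(path)
    return url


if __name__ == "__main__":
    entry = url_entry("https://www.mozilla.org", "en-US", "/firefox/new/")
    print(ET.tostring(entry, encoding="unicode"))
```

Generating entries from a path list held by the application, rather than from crawl results, is what makes this kind of per-section rule practical.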
One more thing: we also need to make sure sitemap.xml is linked from robots.txt:

https://www.mozilla.org/robots.txt

i.e.

"Sitemap: https://www.mozilla.org/sitemap.xml"

Please note that the sitemap URL in robots.txt should be the full absolute URL and not relative like the rest of the URLs in the file. See examples at https://www.apple.com/robots.txt (bottom) and https://www.google.com/robots.txt (bottom)
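A quick way to check this is sketched below using Python's standard `urllib.robotparser` (its `site_maps()` method requires Python 3.8 or later). The helper names and the sample robots.txt content are made up for illustration; the second `Sitemap:` line is deliberately relative to show what the check catches.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt):
    """Return the values of all Sitemap: lines in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


def is_absolute(url):
    """True if the URL has an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Sitemap: https://www.mozilla.org/sitemap.xml\n"
        "Sitemap: /sitemap.xml\n"
    )
    for url in sitemap_urls(sample):
        print(url, "OK" if is_absolute(url) else "NOT ABSOLUTE")
```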
Doh, I totally missed the Django sitemap framework the last time I baked my pull request ;) I'm happy to work on this again but my question now is: will the sitemap include all pages on Bedrock or only major pages? The purpose of Bug 1369738 is the latter, I guess...
(In reply to Kohei Yoshino [:kohei] from comment #20)
> Doh, I totally missed the Django sitemap framework the last time I baked my
> pull request ;) I'm happy to work on this again but my question now is: will
> the sitemap include all pages on Bedrock or only major pages? The purpose of
> Bug 1369738 is the latter, I guess...

Peter German has worked on a spreadsheet to capture all of the URLs to be included in v1.0 of this sitemap: https://docs.google.com/spreadsheets/d/1Sq-o-R9XjO9VPKaOL-aprOIiWNgvFttZ8XHeSquAWH4/edit#gid=1400086798

Peter: what is the difference between this bug and bug 1369738? If there is no difference, we should keep this bug for historical context; if the bugs are different but related, they should be linked together and given distinct titles.
Flags: needinfo?(pgerman)
I was asked to create a new bug for this. I'll reference this for context.
Flags: needinfo?(pgerman)
Summary: Create a dynamic XML sitemap of all indexable URLs in [Bedrock] → XML sitemap missing from www.mozilla.org
Depends on: 1369738
I think Bug 1369738 has covered this.
Assignee: kohei.yoshino → nobody
No longer blocks: 629786
Status: NEW → RESOLVED
Closed: 6 years ago
No longer depends on: 1369738
Resolution: --- → DUPLICATE