Closed Bug 1198731 Opened 9 years ago Closed 4 years ago

Reader mode skips <h1> headers

Categories

(Toolkit :: Reader Mode, defect, P3)

Desktop
All
defect

Tracking

RESOLVED FIXED
86 Branch
Tracking Status
firefox86 --- fixed

People

(Reporter: hub, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [reader-mode-readability-algorithm])

Reader mode skips <h1> headers in the following article: http://blog.francetvinfo.fr/deja-vu/2015/08/26/trois-best-sellers-qui-faillirent-passer-a-la-trappe.html This makes the article much harder to understand. It happens on both Desktop (Nightly) and Android (Beta).
Of course the page totally abuses headers and should be using <h2> or <h3> instead, because it also uses <h1> for the article title. We remove <h1>s because they normally echo the article title which we present separately at the top of the page. We can probably do something different for multiple <h1>s but it would be additional work, and error-prone. Ideally the page should get fixed.
Priority: -- → P3
Whiteboard: [reader-mode-readability-algorithm]
The article's title is in an <h1>, within a <header> separate from the body, which I believe is technically okay, even if not recommended. (But it uses multiple <h1>s in the body, as siblings instead of having one per <section>.) Additional <h1>s can also be used within <blockquote>s and <td>s.

> Certain elements are said to be sectioning roots, including blockquote and td elements. These elements can have their own outlines, but the sections and headings inside these elements do not contribute to the outlines of their ancestors.

http://w3c.github.io/html/sections.html#headings-and-sections

Though technically valid, multiple <h1>s are not recommended, because browsers have not been keen to implement the alternative. https://github.com/whatwg/html/issues/83
Blocks: 1286221
Given how common this is for README.md files on GitHub or GitLab, I would think that maybe we should check for duplication (on the first <h1>) and otherwise just render them as-is.
Hopefully this comment doesn't violate etiquette guideline #4, but in an attempt to add extra context to the discussion at https://github.com/whatwg/html/issues/83 posted above:
- I figure one of the primary purposes of Reader is to be as compatible with a wide selection of web pages as possible
- As a great deal of websites optimise their content for Google SEO purposes, looking at what's being recommended in those types of communities may be of relevance
- Matt Cutts of Google has, in the past, been influential here, and said the following: https://stackoverflow.com/a/2738438/210865 which is likely to be heeded by a lot of content creators
Right, this is still a valid bug and we should do better here, but right now I don't know anyone who has cycles to work on it. If people are interested, I'd be happy to take pull requests at https://github.com/mozilla/readability that try to address this by detecting whether there are multiple H1s and/or checking for similarity between the title and <h1>s, and/or removing the first <h1> but not the others, or something.
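As a rough illustration of the "remove the first <h1> but not the others" idea, here is a minimal sketch. It assumes the caller has already extracted the title and the text of each <h1> in document order; `normalizeText` and `h1IndicesToRemove` are hypothetical names, not part of Readability's API.

```javascript
// Hypothetical helper (not Readability API): collapse whitespace and
// lowercase so that trivial formatting differences do not matter.
function normalizeText(text) {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Given the article title and the text of every <h1> in document order,
// return the indices of the <h1>s that look safe to drop.
// Assumption: only the first <h1> is ever a candidate, and only when it
// repeats the title exactly after normalization.
function h1IndicesToRemove(title, h1Texts) {
  if (h1Texts.length === 0) {
    return [];
  }
  const first = normalizeText(h1Texts[0]);
  return first !== "" && first === normalizeText(title) ? [0] : [];
}
```

With this rule, a page whose first <h1> repeats the title loses only that heading, while a README-style page with several unrelated <h1>s keeps all of them.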
To add 1 bullet point to comment #7:
- Content management systems and their WYSIWYG editors are likely presenting all 6 levels for users to choose from as "heading styles", so anything is possible, semantically valid or not.

I don't know the correct solution, but I see that bug 1369997 also mentions that a fix was needed for <h2> elements. Is the actual problem that there is no single root node for the outline tree? Should one be added as a pseudo element when needed (from <title>, or as a kind of "<h0>", or push other levels down)?

I'm curious, why don't you just ditch the <title> contents instead?

After all, the top <h1> is also what a user would see in the content area at the top in regular mode. <title> often has unnecessary details/SEO things like "XYZ ARTICLE - blah site" while <h1> is often more succinct and/or has a more detailed subtitle. Maybe a leading <h1> should just take <title>'s place in general, even if there's just one <h1>, and even if it has similar text to <title>. (The exception being when there is no leading <h1> or anything else that is clearly the heading; then it seems very reasonable to fall back to <title>.)

(In reply to jonas from comment #10)

> I'm curious, why don't you just ditch the <title> contents instead?
>
> After all, the top <h1> is also what a user would see in the content area at the top in regular mode. <title> often has unnecessary details/SEO things like "XYZ ARTICLE - blah site" while <h1> is often more succinct and/or has a more detailed subtitle.

We actually take the title displayed in reader mode from metadata when present, which is more reliable than either <title> or the top <h1>. Where we do rely on <title>, there's code to strip out the SEO stuff.
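To make "strip out the SEO stuff" concrete, here is a minimal sketch of one possible approach. It is not the code Readability uses; the separator set and the "keep the segment with the most words" rule are assumptions, and `stripSiteName` is a hypothetical name.

```javascript
// Hypothetical helper: count whitespace-separated words in a string.
function wordCount(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

// Hypothetical sketch: split the <title> text on a separator that is
// surrounded by spaces (" - ", " | ", en dash, or "»") and keep the segment
// with the most words, on the assumption that the article title is longer
// than the site name. Ties go to the earliest segment.
function stripSiteName(rawTitle) {
  const title = rawTitle.trim();
  const parts = title.split(/\s+[|\-\u2013\u00bb]\s+/);
  if (parts.length < 2) {
    return title; // no separator found, nothing to strip
  }
  let best = parts[0];
  for (const part of parts.slice(1)) {
    if (wordCount(part) > wordCount(best)) {
      best = part;
    }
  }
  return best.trim();
}
```

Because the separator must have whitespace on both sides, hyphenated words such as "best-sellers" are left alone.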

> Maybe a leading <h1> should just take <title>'s place in general, even if there's just one <h1>, and even if it has similar text to <title>. (The exception being when there is no leading <h1> or anything else that is clearly the heading; then it seems very reasonable to fall back to <title>.)

Then we'd do the wrong thing in cases like this, right?

<title>Firefox is awesome</title>
<h1>NiceTown Local Paper</h1>
<h2>Firefox is awesome</h2>

Well yes, and when <h1> has more information / a more detailed subtitle, which I sometimes do on my own pages (since the SEO / more general stuff in <title> eats up space, so I put more brief titles there).

So yes, there are cases where this is a bad idea, and I still think the best idea is just to ALWAYS show <h1> if available. Again, after all, it is what is usually at the top as a heading in the content area, which I'd say usually is for a reason. The <title> or metadata only makes sense to me if there is no obvious alternative like a clear <h1> available.

I'm just thinking, couldn't you do a similarity score based on similar length & similar words based on some common word edit distance metric? And then only throw out the <h1> if it's relatively similar, and not like twice the length with way more info?

Surely there must be some heuristic that while for sure will fail again in some corner cases, at least does some of the common cases more justice than just dropping the <h1> without ever even looking at its contents
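One way such a heuristic could look, as a minimal sketch: instead of a true edit distance it uses word overlap, scoring the fraction of the heading's words that also occur in the title. The function names, the tokenizer, and the 0.75 threshold are all assumptions for illustration, and this is not necessarily what the eventual fix does.

```javascript
// Hypothetical tokenizer: lowercase, then split on anything that is not a
// Unicode letter or digit, so accented words stay intact.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Fraction (0..1) of the heading's tokens that also occur in the title.
// A heading that is much longer than the title scores low, because most of
// its words are new information.
function headingSimilarity(title, heading) {
  const titleTokens = new Set(tokenize(title));
  const headingTokens = tokenize(heading);
  if (headingTokens.length === 0) {
    return 0;
  }
  const shared = headingTokens.filter((t) => titleTokens.has(t)).length;
  return shared / headingTokens.length;
}

// Treat the heading as a duplicate of the title only above the threshold
// (0.75 is an arbitrary assumption, not a tuned value).
function headingDuplicatesTitle(title, heading, threshold = 0.75) {
  return headingSimilarity(title, heading) >= threshold;
}
```

Under these assumptions an <h1> reading "Firefox is awesome!" would be dropped for the title "Firefox is awesome", while an <h1> that adds a long subtitle, or an unrelated one such as "NiceTown Local Paper", would be kept.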

Depends on: 1685571

Fixed by bug 1685571.

Status: NEW → RESOLVED
Closed: 4 years ago
OS: Unspecified → All
Hardware: Unspecified → Desktop
Resolution: --- → FIXED
Target Milestone: --- → 86 Branch
Version: unspecified → Trunk