Should HTML Parsing in Browsers Be Standardized?

manthony · Published in Geek Culture · 4 min read · Mar 15, 2021

The answer may lie between modern browser lexing and parsing and the rigidity of XHTML of the past.

A good HTML parser puts lipstick on a pig.

With the advent of graphical web browsers in the early 1990s, Netscape Navigator made an important decision, one that, despite later pushes for more rigidity such as XHTML, has shaped the internet as we know it today in HTML5:

They decided that the HTML parser their browser implemented would not strictly parse markup.

Strictly parse? So what does that mean? HTML parsers have an important job when you visit a web page. Not only do browsers, made up of millions of lines of C++ code, use HTTP to request resources on your behalf, but they also have the taxing job of displaying the returned content on your screen, more specifically in your browser window.

This process looks a bit like the following pipeline: HTML parsing → DOM tree → style resolution → layout → paint.

Note: this is a WebKit model, but all modern browsers share the HTML parsing step.

Any resources that are HTML files are run through an HTML parser, which constructs a DOM tree made of groups of elements known as nodes. A separate CSS parser applies the styling to each of these nodes. A layout process determines exactly where in the browser each of those nodes should be displayed. Finally, the requested website is painted on your screen. But what if the HTML file is… well… not good, semantic HTML?
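To make the DOM-building step concrete, here is a minimal sketch, in Python with the stdlib html.parser module, of how a parser might turn start and end tags into a tree of nodes. This is an illustrative toy, not any browser's actual implementation:

```python
from html.parser import HTMLParser

class Node:
    """A DOM-like element node with a tag name, parent, and children."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a tree of Nodes from start/end tag events."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")   # synthetic document root
        self.current = self.root        # insertion point

    def handle_starttag(self, tag, attrs):
        # A start tag opens a new node under the current insertion point.
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # An end tag moves the insertion point back up one level.
        if self.current.parent is not None:
            self.current = self.current.parent

builder = TreeBuilder()
builder.feed("<html><body><div><p>Hello</p></div></body></html>")
```

After feeding well-formed markup, builder.root holds a tree whose shape mirrors the nesting of the tags; a real browser would then hand this tree to style resolution and layout.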

It works anyway. That is correct: it just works. These underlying HTML parsers have become a sort of black box that takes in countless lines of HTML, leniently accounts for all of the errors within them, and, valid markup or not, assembles a DOM tree from the result. That is to say that even this markup…

<html>
<mytag>
</mytag>
<div>
<p>
</div>
Really lousy HTML
</p>
</html>

…will work just fine.
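You can observe this leniency outside the browser too. Python's stdlib html.parser, for instance, happily tokenizes the broken markup above without ever raising an error; it just reports the tags it sees, mis-nested or not (a sketch for illustration, not browser code):

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Records every start and end tag the parser encounters."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

lousy = "<html><mytag></mytag><div><p></div>Really lousy HTML</p></html>"
logger = EventLogger()
logger.feed(lousy)   # no exception, despite the unknown tag and broken nesting
```

The unknown <mytag> and the </div> that closes before its inner <p> are passed through as ordinary events; deciding what tree to build from them is left entirely to whoever consumes the stream, which is exactly the situation browsers are in.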

So what is the issue?

The issue as I see it is that HTML specifications have never really given any sort of hard and fast rules around how browsers should handle these different errors. In a world where everyone gets a hand in building a better, faster internet, I think that we neglect the fact that parsing the markup is an issue for several reasons.

1. The parsers themselves need to resolve these errors, which is not always easy to do. Take a look at this comment from WebKit on error handling:

“…we have to take care of at least the following error conditions:

The element being added is explicitly forbidden inside some outer tag. In this case we should close all tags up to the one which forbids the element, and add it afterwards.

We are not allowed to add the element directly. It could be that the person writing the document forgot some tag in between (or that the tag in between is optional). This could be the case with the following tags: HTML HEAD BODY TBODY TR TD LI (did I forget any?).

We want to add a block element inside an inline element. Close all inline elements up to the next higher block element.

If this doesn’t help, close elements until we are allowed to add the element, or ignore the tag.”

2. Since there is no standardization between the browsers, different parsers can resolve these errors in different ways.

Let’s revisit the example butchered markup from earlier.

<html>
<mytag>
</mytag>
<div>
<p>
</div>
Really lousy HTML
</p>
</html>

Since each of the leading modern browsers has its own implementation of these increasingly complex parsers, one browser might reconcile the above markup to this:

<div>
<p>
Really lousy HTML
</p>
</div>

…and another, to this:

<p>
Really lousy HTML
</p>

…and a third, to this:

Really lousy HTML

…with no wrapping tags at all! Because it is up to each browser to decide what the end product should be, these differences add to the inconsistencies that developers and users experience across browsers.
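To see how two recovery policies can diverge, here is a toy sketch of two hypothetical rules (not any real browser's algorithm) applied to the same mis-nested sequence, where </div> arrives before the inner </p> has been closed:

```python
def recover_ignore_stray_end(tokens):
    """Policy A: ignore any end tag that doesn't match the innermost open element."""
    stack, out = [], []
    for kind, tag in tokens:
        if kind == "start":
            stack.append(tag)
            out.append(("start", tag))
        elif stack and stack[-1] == tag:
            stack.pop()
            out.append(("end", tag))
        # else: stray end tag, silently dropped

    return out

def recover_close_until_match(tokens):
    """Policy B: force-close open elements until the end tag finds its match."""
    stack, out = [], []
    for kind, tag in tokens:
        if kind == "start":
            stack.append(tag)
            out.append(("start", tag))
        elif tag in stack:
            while stack[-1] != tag:
                out.append(("end", stack.pop()))  # auto-close intervening tags
            stack.pop()
            out.append(("end", tag))
        # else: end tag with no matching open element, dropped

    return out

# The mis-nested sequence: <div><p></div> ... </p>
tokens = [("start", "div"), ("start", "p"), ("end", "div"), ("end", "p")]
```

Policy A leaves the div open and keeps the paragraph inside it; Policy B closes the paragraph early so the div can close where the author wrote </div>. Both are defensible readings of the same broken input, and that is precisely the ambiguity the article is pointing at.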

3. It is lazy and isn’t working towards a better, faster internet

HTML5 focused on syntactic features and semantic elements, support for SVG, and new attributes, among other improvements. It even made some suggestions about how errors can be resolved within parsers. We are clearly willing to build a better, faster internet, but it seems to me that a drastic improvement could be made by revisiting the rigidity and control we put into our markup: standardize it across browsers, and make everything from semantic HTML to accessibility a requirement, not a feature, within the structure we create for our web applications.

Software engineer and Veteran from Chicago.