Failure of the HTML P element as structural markup (reprise)

There has been quite a bit of discussion on the W3C lists regarding the need for a DI (definition item) element; one such thread includes a reply from Hixie that he cc’d to www-style. As far as I can tell, he is against DI: he thinks that it only tries to solve a presentational (CSS-side) issue. fantasai, however, thinks that the lack of DI is really a structural (HTML-side) issue. I side with her: it is a structural issue, and I would go further and say that it is part of a larger problem of failing to identify the correct structures that certain “structural” elements are supposed to represent.

DL (the parent element of the still-imaginary DI) then, in my opinion, joins P and BLOCKQUOTE (and maybe even more as-yet-unidentified elements) in having been incorrectly defined. They really started off as presentational elements, and were then later arbitrarily redeclared (but never redefined) by fiat as structural. When we then tried to actually use them as structural markup, we found them wanting.

There is enough discussion about DI already, so let me try to explain why I say even P and BLOCKQUOTE, as currently defined, are not structural markup.

Of paragraphs

I am willing to make the claim that any writer (a translator is a writer), editor, or even proofreader will be able to confirm that P cannot possibly be structural the way it is currently defined. But how so? Ironically, the W3C website itself contains examples of how P fails to represent true paragraphs.

On the page Facts about W3C, we find the following Patent Policy:

In February 2004, W3C adopted a Patent Policy for Working Groups to enable continued innovation and widespread adoption of Web standards developed by the World Wide Web Consortium. The W3C Patent Policy is designed to:

  • Facilitate the development of W3C Recommendations by W3C Working Groups;
  • Promote the widespread implementation of those Recommendations on a Royalty-Free (RF) basis;
  • Address issues related to patents that arise during and after the development of a Recommendation.

In August 2011, W3C adopted a Community Contributor License Agreement with Royalty-Free patent licensing terms and permissive copyright for W3C Community and Business Groups. See also the Final Specification Agreement, which further increases patent protection around Community and Business Group Specifications.

Now what is the structure of the above piece of quoted text? Clearly, this is a single paragraph with an embedded list; i.e.,

P
  In February 2004,… is designed to:
  UL
    LI
      Facilitate the development…;
    LI
      Promote the widespread implementation…;
    LI
      Address issues….
  In August 2011,… Business Group Specifications.

But can P — as defined by HTML (even HTML5) — represent this structure? The answer is no. Instead, we have to represent it as the illogical

P
  In February 2004,… is designed to:
UL
  LI
    Facilitate the development…;
  LI
    Promote the widespread implementation…;
  LI
    Address issues….
P
  In August 2011,… Business Group Specifications.

as if the second half of the paragraph had no relation to the first half.

The above illogical representation is in fact precisely how the paragraph is encoded on the W3C site. Yet this is also clearly wrong: even if you argue that the second half need not belong to the same paragraph, you still have to acknowledge that the real structure has to be at least

P
  In February 2004,… is designed to:
  UL
    LI
      Facilitate the development…;
    LI
      Promote the widespread implementation…;
    LI
      Address issues….
P
  In August 2011,… Business Group Specifications.

simply by virtue of how the text is punctuated. Even if we treated the second P as a separate paragraph, as long as we are honest about what the true structure has to be, it is still impossible to represent the first paragraph correctly using P and UL as currently defined.

In effect, because of the way the P element is defined in the SGML DTD, P can never enclose a BLOCKQUOTE, UL, OL, or DL. Yet it is perfectly natural in English (as well as in other languages) for real paragraphs to contain quotations or lists.
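
To make the mechanism concrete, here is a minimal Python sketch of what happens when a list is written inside a P. It is not a real HTML parser: it uses the tokenizer from the standard library's html.parser and applies only one rule, namely that an open P is implicitly ended by the next block-level start tag. The CLOSES_P set, the OutlineParser class, and the outline function are names made up for this illustration, and the rule set is deliberately partial.

```python
from html.parser import HTMLParser

# Start tags that implicitly end an open P. This is a partial list chosen
# for the illustration; the real rules cover many more elements.
CLOSES_P = {"p", "ul", "ol", "dl", "blockquote", "div"}


class OutlineParser(HTMLParser):
    """Records an indented outline, applying only the implicit-end-of-P rule."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the elements that are currently open
        self.outline = []  # (depth, label) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.stack:
            # Close everything up to and including the open P.
            while self.stack.pop() != "p":
                pass
        self.outline.append((len(self.stack), tag.upper()))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # An end tag with no matching open element (such as </p> after the
        # P was already closed implicitly) is simply ignored in this sketch.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.outline.append((len(self.stack), text))


def outline(markup):
    """Return the (depth, label) outline for a fragment of markup."""
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.outline


# The paragraph written the way a writer would want to write it.
markup = (
    "<p>In February 2004,… is designed to:"
    "<ul><li>Facilitate the development…;</li>"
    "<li>Promote the widespread implementation…;</li>"
    "<li>Address issues….</li></ul>"
    "In August 2011,… Business Group Specifications.</p>"
)

for depth, label in outline(markup):
    print("  " * depth + label)
```

Under these assumptions the printed outline puts UL beside P rather than inside it, and the text after the list ends up outside any paragraph at all. Real browsers differ in the details (an HTML5 parser, for instance, turns the stray end tag into an empty paragraph), but the list still never ends up inside the P.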

The sad truth is that HTML 2.0 merely redeclared P as representing a paragraph without redefining it in a way that would have given it the actual ability to represent real paragraphs. In retrospect, the original pre–HTML 2.0 specification was actually more correct in describing P as merely a “paragraph break”, and this is what P as currently defined really is: presentational markup that is incapable of representing paragraphs-with-embedded-lists.

This disconnect between ideology (that P should be structural markup) and reality (that P is in fact useless as true structural markup) stemmed from a fundamental mistake made way back during HTML 2.0’s design phase that was never fixed; and because P requires no end tag, it can in fact never be fixed without causing widespread breakage.

So what will need to be done to fix the problem? Unfortunately, because fixing P at this point will break too many sites, we will need to create a completely new element that is correctly defined so that it can represent real paragraphs.

In the meantime, we can at least be honest about it and revise the HTML5 specification to stop pretending that P is capable of representing a paragraph.

Of block quotations

How about BLOCKQUOTE? Actually, this is even more fun because the incorrectness of the current definition is even more obvious.

Why? Because we may want to quote more than one paragraph. For example, we might want to quote from W3C’s “Schema” page:

Checking a document against a Schema is known as validating against that schema; for a DTD, this is just validating, but for any other type of schema the type is mentioned, such as XSD Validation or Relax-NG validation.

Validating against a schema is an important component of quality assurance.

What would be the structure of this two-paragraph quotation? Obviously, it must be

BLOCKQUOTE
  P
    Checking a document… Relax-NG validation.
  P
    Validating against a schema… quality assurance.

Unfortunately, this would be invalid HTML. Because of the way BLOCKQUOTE is defined in the DTD, you cannot have a P embedded within a BLOCKQUOTE. So you must represent this as either the illogical

BLOCKQUOTE
  Checking a document… Relax-NG validation.
BLOCKQUOTE
  Validating against a schema… quality assurance.

which does not even display correctly by default, or the just-as-illogical

BLOCKQUOTE
  Checking a document… Relax-NG validation.
  BR /
  BR /
  Validating against a schema… quality assurance.

which does not even make sense and again is not guaranteed to display correctly.

BLOCKQUOTE is completely ill-prepared to handle multi-paragraph quotations. We have not even considered quotations with embedded lists, or the not-too-hypothetical quotation-with-an-embedded-paragraph-that-in-turn-has-an-embedded-quotation.

One might say that because BLOCKQUOTE’s end tag is mandatory, this problem could at least theoretically be fixed. In practice, I strongly doubt this will ever happen.

The mythical disease of divitis

Short of fixing HTML5 (slim chance), what can we do right now? Surprisingly for some, we can actually represent real paragraphs, block quotations, and lists right now, using DIV.

In HTML 4.0, DIV was explicitly described as a way to “add structure”. As far as original intent goes, DIV was primarily structural, and insertion of random DIVs for styling purposes was never even intended. The real problem with using DIV is that the presentation will not degrade correctly in text browsers. However, if P, BLOCKQUOTE, and DL are not doing their jobs, perhaps we might as well create some interim conventions to use DIV to replace P, BLOCKQUOTE, and, if the need arises, DL.
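
As a rough illustration of what such an interim convention might look like, the Python sketch below writes the Patent Policy paragraph as a single DIV and prints the resulting nesting. The class name "para" is purely hypothetical, as is the outline helper. The markup is parsed as XML with the standard library's xml.etree.ElementTree only because this fragment happens to be well-formed; the sketch assumes, rather than demonstrates, that an HTML parser keeps the same nesting, since DIV has no implicit end tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical convention: a DIV with class "para" stands for a real paragraph.
markup = """
<div class="para">
  In February 2004,… is designed to:
  <ul>
    <li>Facilitate the development…;</li>
    <li>Promote the widespread implementation…;</li>
    <li>Address issues….</li>
  </ul>
  In August 2011,… Business Group Specifications.
</div>
"""


def outline(element, depth=0):
    """Return indented outline lines for an element, its text, and its children."""
    label = element.tag.upper()
    css_class = element.get("class")
    if css_class:
        label += "." + css_class
    lines = ["  " * depth + label]
    if element.text and element.text.strip():
        lines.append("  " * (depth + 1) + element.text.strip())
    for child in element:
        lines.extend(outline(child, depth + 1))
        # Text that follows a child still belongs to the current element.
        if child.tail and child.tail.strip():
            lines.append("  " * (depth + 1) + child.tail.strip())
    return lines


root = ET.fromstring(markup.strip())
print("\n".join(outline(root)))
```

Here the list and the sentence after it both stay inside the one DIV, which is the structure the paragraph actually has.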

We could even say the much-loathed “divitis” might in fact be only an imaginary disease. Rather than being a disease, it is really a symptom of HTML’s lack of true structural elements. Until we can have elements that can represent real-life paragraphs, block quotations and definition lists, DIV may in fact be more structural and more semantic than the allegedly-structural P, BLOCKQUOTE and DL.