02 Semantic Markup

The goal is to prepare a semantically well-formed HTML document. How do we go about doing this?

Organizing the Content

Document structure before HTML5 was organized according to the hierarchy of header tags, which went from the most general to more specific, following the way content is created.

How is content created? In a well-written document, the topic is expressed by the title and is then supported by subtopics. This structure is duplicated in HTML: the main header expresses the title, and the subheaders express the subtopics.

This hierarchy of headers used to be the only structure an HTML document had. The title is an <h1> header that explains what the document is about. Second-level topics are given an <h2> header, and if there are third- or fourth-level topics, they receive an <h3> or <h4>. Headers become more specific as the levels go down to <h6>, the most specific header available in HTML. There is no <h7>.
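
As a rough sketch of such a hierarchy (the page title and topic names are invented for illustration), the headers of a page about coffee might be nested like this:

<h1>A Guide to Coffee</h1>
<p>Introduction to the topic.</p>
<h2>Brewing Methods</h2>
<h3>Espresso</h3>
<h3>Pour Over</h3>
<h2>Storing Beans</h2>

Each <h3> belongs to the <h2> that precedes it, just as each <h2> belongs to the <h1>.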

Here Come the Machines, or The Semantic Web

We can understand the content of a web page, because we are cognizant beings. The World Wide Web was originally built for human consumption, and although everything on it is machine-readable, this data is not machine-understandable. It is very hard to automate anything on the Web, and because of the volume of information the Web contains, it is not possible to manage it manually.

It has been a goal of the W3C, and of Tim Berners-Lee in particular, to overcome this problem. To that end, Tim Berners-Lee coined the term semantic web, and has simplified it to something he calls linked data. You can see this idea embodied in a discussion of statistics by Swedish professor Hans Rosling.

The goal of the semantic web project is to make machines understand, as far as possible, the meaning of the content from the structure and meta information contained in the markup itself. This would allow automated agents, like the bots that crawl the web, to link up information in a meaningful way.

Such agents would automatically locate related information on behalf of the user. That would allow us to cut through the noise, so to speak.

The semantic web has been gaining ground, and the W3C currently (2015) describes it as follows:

In addition to the classic “Web of documents” W3C is helping to build a technology stack to support a “Web of data,” the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data.

HTML vs XHTML

HTML is tolerant of human differences in coding and allows for errors. That made it difficult for machines to understand the semantic meaning of the content. The W3C introduced XHTML as a way to clean up the excesses of human error. It saw XHTML as the future of the web, and was fully prepared to release XHTML2 back in 2005-6. Good thing for us that didn’t happen.

Based on XML, the Extensible Markup Language, XHTML2 would have made the web much friendlier for machines to repurpose, but at a price. Humans would have been required to write code with machine-like accuracy in creating all of their web pages. To enforce that, draconian error handling would have refused to render pages that were not XHTML2 compliant.

A group of browser makers (Apple, Mozilla Foundation and Opera) revolted and came together in 2004 to form the Web Hypertext Application Technology Working Group (WHATWG). They created HTML5 in response to XHTML2, making it forgiving of human error, standardizing how errors are to be resolved, and introducing new tags that would allow for a more semantic web.

The messy world of human error won over the more perfect coding world of machines. XHTML 2.0 was canned by the W3C and allowed to expire in 2009, which officially recognized HTML5 as the way of the future. It should be noted that there is an XHTML5 version, for machines to read and write, of course, but we need not be concerned with that.

You can view the differences between HTML and XHTML in this Table. To provide but one example, while XHTML requires every start and end tag, HTML is quite cavalier about it: both the start and end tags are optional for the html, head and body elements. A well-formed HTML5 document can start and end with the content itself. End tags are also optional for li, dt, p, tr and td. I personally started to leave these out, as the W3C site itself leaves them out. I figure that they set the standard by which we are to measure our web pages. Roll over in your grave, XHTML2.
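
A minimal sketch of what this tolerance allows (the title and text are made up for illustration): the document below omits the html, head and body tags as well as the closing p and li tags, and the parser fills them in. The doctype and title are kept because a conforming HTML5 document still calls for them.

<!DOCTYPE html>
<title>Shopping List</title>
<p>Things to buy:
<ul>
<li>Bread
<li>Milk
</ul>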

Tag Soup

The model for marking up a page is to let the CSS do all of the styling. This is most easily done by using the <div> tag for block elements and the <span> tag for inline elements, and then giving them id or class attributes: <div id="header">. The <div> tag stands for division. It is a generic element that, along with other block elements, can be thought of as a box that divides the page. The <span> tag “spans” a run of inline content. Inline content flows like the characters in a line of text, and it can be pictures, links or the characters that make up a paragraph.

Everything can be marked up with these <div> and <span> tags, and before HTML5, this is what a lot of people did, to the exclusion of using the other HTML tags.
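
A sketch of what such markup tends to look like (the id and class names are invented for illustration, since every author made up their own):

<div id="header">
  <div class="title">My Travel Blog</div>
</div>
<div id="content">
  <div class="post">
    <span class="date">March 1</span>
    <span class="text">We arrived in Lisbon today.</span>
  </div>
</div>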

The problem is that these are generic tags that do not impart any meaning to the content. The overuse of these tags, called “classitis” and “divitis”, of which I have been guilty in the past, contributed to a lack of semantically well-formed web pages. Being semantically neutral, the markup could not be counted on to help machines understand the meaning of the content.

The solution is simple. Use the different HTML tags to describe the content, and only use the <div> and <span> tags if necessary.

Imparting Semantic Meaning to the Markup

The content can be structured so that it becomes more semantically meaningful by using the header tags, specific tags, microformats and meta tags. HTML5 then comes along and introduces a number of new tags that help determine the meaning of the document’s content.

Document Hierarchy

As mentioned above, before HTML5, documents were structured using the header tags <h1> through <h6>. This does not always sync up with a designer’s gut reaction, which visualizes header 1 as bigger and bolder than header 2, because that is how they appear in the default browser style sheets. Designers then use these header tags according to their idea of the visual hierarchy.

This may or may not be correct, as the design’s visual hierarchy does not necessarily follow the semantic requirement that header tags reflect the structural meaning of the content. It’s possible, for example, to make an <h1> smaller and less bold than an <h3> if that is what the layout calls for.
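
As a sketch of that last point (the sizes are arbitrary values chosen for illustration), a style sheet can shrink the <h1> and enlarge the <h3> without touching the semantic hierarchy of the markup:

<style>
  h1 { font-size: 1em; font-weight: normal; }
  h3 { font-size: 1.5em; font-weight: bold; }
</style>
<h1>Annual Report</h1>
<h2>Finances</h2>
<h3>Revenue</h3>

The headers still read <h1>, <h2>, <h3> to a machine, whatever they look like on screen.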

Before HTML5, each document should have only one <h1> tag that expresses the title and purpose of the page, with all of the subsequent content organized under the <h2> through <h6> headers. With HTML5, each sectioning element, such as <section> or <article>, can have its own hierarchy of <h1> to <h6> headers, making it much easier to structure complicated documents.

Semantic Code Elements

Tags that describe the content are:

<cite> Citation, used to cite a source of information.
<code> Computer or Programming code.
<del> Deleted text, used to mark a word or phrase removed in an edit.
<dfn> Definition, used to mark the defining instance of a term.
<dl> Definition List. Similar to <ul> and <ol>, but uses <dt> (definition term) and <dd> (definition description) to show terms and definitions.
<em> Emphasis, displayed as italicized text by default.
<ins> Inserted text, used to mark text you have added in an edit at a later date.
<kbd> Keyboard input, used for text the user is to type.
<ol> Ordered List.
<samp> Sample output, used to show sample output from programming code.
<ul> Unordered List.
<var> Variable, used to represent a variable in programming code.
<strong> Strong emphasis on a word or phrase, displayed as bold text by default.
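
A short sketch of several of these tags in use (the sentences are invented for illustration):

<p>A <dfn>microformat</dfn> is an agreed-upon set of class names.</p>
<p>Press <kbd>Ctrl</kbd> + <kbd>S</kbd> to save, and <em>never</em> skip this step.</p>
<p>Set <var>width</var> with <code>width = 10</code>; the program prints <samp>Width is 10</samp>.</p>
<p>The meeting is on <del>Tuesday</del> <ins>Wednesday</ins>. <strong>Do not be late.</strong></p>
<dl>
  <dt>HTML</dt>
  <dd>HyperText Markup Language</dd>
</dl>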

MicroFormats

Microformats are agreed-upon class names used to tag certain information. Instead of making up your own name for a class, which would be specific only to your document, you use a name that has become universally recognizable by convention. This is quite handy for things like contact information or calendars. The hCard and hCalendar microformats, for example, are modeled on the vCard and iCalendar standards used by applications such as Apple’s Address Book and iCal.

In this example, the contact information is presented with generic markup:

<div>
<div>Joe Doe</div>
<div>The Example Company</div>
<div>604-555-1234</div>
<a href="http://example.com/">http://example.com/</a>
</div>

With hCard microformat markup, that becomes:

<div class="vcard">
<div class="fn">Joe Doe</div>
<div class="org">The Example Company</div>
<div class="tel">604-555-1234</div>
<a class="url" href="http://example.com/">http://example.com/</a>
</div>
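
Calendar entries work the same way. As a sketch, assuming the hCalendar class names vevent, summary, dtstart and location (the event details are invented for illustration):

<div class="vevent">
  <div class="summary">Web Design Meetup</div>
  <abbr class="dtstart" title="2015-03-01">March 1, 2015</abbr>
  <div class="location">Vancouver Public Library</div>
</div>

The title attribute carries the date in a machine-readable form, while the visible text stays friendly to human readers.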

Meta Tags

Meta tags describe the content. They are used by bots to identify page content. This used to be especially important for search engine optimization, but it was abused. Google may still take some of them into account, but they no longer have the weight they used to have.

Here is a list of some of the meta tags that can appear in the head of the web page. You can use these yourself. Just fill in the _missing_fields_ and delete unwanted tags.

  • <meta name="author" content="All: _author_name_; mailto:_your@mail.here_" />
  • <meta name="owner" content="_owner_of_the_page_" />
  • <meta name="generator" content="_name_of_your_editor_" />
  • <meta name="publisher" content="_publisher_of_page_" />
  • <meta name="resource-type" content="document" />
  • <meta name="page-topic" content="_page_topic_" />
  • <meta name="doc-rights" content="_copyright_status_(Copyrighted Work / Public Domain)_" />
  • <meta name="language" content="_page_language_" />

HTML5

HTML5 added structural elements that provide additional semantic meaning, replacing the divs that would otherwise have marked up the page. Incorporate these elements into the document as part of its semantic structure. You can find out more about HTML5 elements here.

The HTML5 tags that help to structure your document:

  • main is an element that can be used only once per page. It represents the main content of the body. The main element may not be a descendant of an article, aside, footer, header or nav element.

  • section represents a generic document or application section. It can be used together with the h1, h2, h3, h4, h5, and h6 elements to indicate the document structure.

  • article represents an independent piece of content of a document, such as a blog entry or newspaper article.

  • aside represents a piece of content that is only slightly related to the rest of the page, such as content that is complementary to the main content (taken from the XHTML 2.0 specification).

  • hgroup represents the header of a section, grouping its main heading with any subheadings.

  • header represents a group of introductory or navigational aids.

  • footer represents a footer for a section and can contain information about the author, copyright information, et cetera.

  • nav represents a section of the document intended for navigation.

  • figure represents a piece of self-contained flow content, typically referenced as a single unit from the main flow of the document.

    <figure>
    <video src="ogg"></video>
    <figcaption>Example</figcaption>
    </figure>

  • figcaption provides an optional caption for the content of a figure.
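
Put together, these elements can replace most of the generic divs in a page. The following is only a sketch of one possible arrangement (the headings, link targets and text are invented for illustration):

<body>
  <header>
    <h1>My Travel Blog</h1>
    <nav>
      <a href="/">Home</a>
      <a href="/about">About</a>
    </nav>
  </header>
  <main>
    <article>
      <h2>Arriving in Lisbon</h2>
      <p>We arrived in Lisbon today.</p>
      <aside>Lisbon is the capital of Portugal.</aside>
    </article>
  </main>
  <footer>
    <p>Written by Joe Doe</p>
  </footer>
</body>

Here the tag names themselves tell a bot which part is navigation, which is the main content and which is a side note, without relying on author-chosen id or class names.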