The Semantic Web

The web uses a rapidly growing number of programming languages, but HTML, the first and most basic building block of the web, is not one of them. Most developers will tell you, with some disdain, that HTML is not a programming language. They are correct that it is not programming, but it is a language. A programming language provides instructions using computer analogues of all the basic constructions of human language: nouns, verbs, descriptors, punctuation, and rules of grammar. A markup language (the “ML” in HTML is the hint) is like a written language that consists solely of adjectives. It only describes the context of another language. To illustrate, HTML tags are the adjectives that go before, after, or between the English-language content that makes up a complete webpage. The <a> tag says that the text it encloses is a link, and the tag’s href attribute further describes where that link leads. So, for example, the line:

<a href="http://www.google.com">Clicking this text will take you to Google</a>

is HTML’s way of attaching intent to the text that appears on the webpage. The dictionary for HTML contains 109 adjectives in its most recent edition, and the job of every current browser is to interpret them as consistently as possible, though each browser infers the intent of those words slightly differently. Unless your goal in life is to learn all the ins and outs of HTML, though, there is no reason to cover any more of these fundamentals.

Instead, I’d like to talk about the part of HTML that makes me enjoy my job: semantic markup. It’s a deliberate effort by the people who use and design HTML to make their adjectives give better context to what they actually describe. For example, the tag <b> has been the way to mark text as bold since the very early days of HTML. However, <b> is poor vocabulary, because it says how the words should look rather than what they mean. If you want to describe the importance of the text, you should use <strong> instead. By default, browsers render <strong> the same way as <b>, but <strong> decouples the appearance from the intent. After all, the text that you want to stand out as strong may not actually appear bold in your design, so why describe it that way?
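To make the distinction concrete, here is a minimal sketch. The class name "warning" and the color are made up for illustration; the only point is that <strong> keeps its meaning even when the stylesheet chooses not to render it bold.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Strong versus bold</title>
  <style>
    /* Hypothetical design choice: important text is red and normal weight, not bold. */
    strong.warning {
      font-weight: normal;
      color: #b00020;
    }
  </style>
</head>
<body>
  <!-- <b> only says "draw this bold"; it says nothing about importance. -->
  <p>Please <b>save your work</b> before you continue.</p>

  <!-- <strong> says "this is important"; the stylesheet decides how that looks. -->
  <p>Please <strong class="warning">save your work</strong> before you continue.</p>
</body>
</html>
```

If the design changes later, only the style rule needs to change. The markup still describes the text correctly.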

Another goal of semantic HTML is to help tell the browser which parts of the page are actually important by using words that describe the common structure of a page. Most websites have a header, which contains the logo and a few important links. Nearly as many also have a footer with some supplemental information. Between these, there is an area for your page content, and blog or news sites might have one or more articles in the content area. Prior to the most recent standard for HTML, all of these areas were usually described with a single multi-purpose building block called a <div>, an area of the page that is divided off from the other parts of the page. Now, each of these areas has its own word to describe it. One place this is useful is web accessibility. People who rely on screen reading software make up a surprising 14% of users in some studies, far more than are likely to visit a web page using some older versions of common browsers. So why not make it easier for their software to infer the intent of your content, and to know which text is navigation and which is the body of your article? Computers aren’t very good at figuring out the intent behind human language, so semantic markup helps them to know what you want them to look at.
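As a rough sketch of the page structure described above, the skeleton below uses the structural tags in place of generic <div> blocks. The site name, file name, and link targets are placeholders, not taken from any real site.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example news site</title>
</head>
<body>
  <!-- Header: the logo and a few important links. -->
  <header>
    <img src="logo.png" alt="Example News">
    <nav>
      <a href="/">Home</a>
      <a href="/about">About</a>
    </nav>
  </header>

  <!-- Main content area, holding one or more articles. -->
  <main>
    <article>
      <p>First article text goes here.</p>
    </article>
    <article>
      <p>Second article text goes here.</p>
    </article>
  </main>

  <!-- Footer: supplemental information. -->
  <footer>
    <p>Contact details and other supplemental information.</p>
  </footer>
</body>
</html>
```

Screen reading software can use tags like <nav> and <main> to tell navigation apart from the body of the page.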

As you may know, HTML is also a hierarchically structured language. Inside an <article> you have a set of paragraphs, each wrapped in <p> tags, mirroring the way paragraphs work in ordinary writing. Each of those <p> enclosures contains your paragraph text, with additional tags that describe the semantic nature of individual words or phrases. This also means that semantic markup can describe the relationships between one piece of content and another. The tag <h1> denotes a heading: an important line of text that describes all the content that follows it, such as the headline of a newspaper article. When search engines view your page, they decide what the most important takeaway from your content is largely by looking at the contents of any <h1> tags. The second-most important thing will be an <h2> tag, and so on. Every page should have something that is the most important thing for a user to know when reading it, but does that mean there should only be one thing at each level of importance? Not necessarily, if the headline carries the semantic context of what it refers to. You may have a page with multiple news articles, where each headline is the most important thing in the context of its own article. Each <article> may have its own <h1>, telling a search engine or screen reader that this is the most important information within that article. Another <h1> at the top of your page would summarize that the page contains a collection of articles, or whatever else is important for the user to know about that page. A tip for the SEO-minded: the thing that is almost universally not the most important thing on your page is your logo or company name. Search engines have other ways of figuring this out, and users don’t need to be reminded of where they are.
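Here is a small sketch of the heading arrangement described above: one <h1> for the page as a whole, and one <h1> inside each <article>. The headlines and text are invented. How much weight search engines and screen readers give to headings nested this way varies, so treat this as an illustration of the idea, not a guarantee of how software will read it.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Today's articles</title>
</head>
<body>
  <!-- The most important thing about the page as a whole. -->
  <h1>Today's articles</h1>

  <article>
    <!-- The most important thing within this article. -->
    <h1>City council approves new park</h1>
    <p>The council voted <em>unanimously</em> on Tuesday.</p>

    <!-- A second-level heading inside the same article. -->
    <h2>What happens next</h2>
    <p>Construction is expected to begin in the spring.</p>
  </article>

  <article>
    <h1>Library extends weekend hours</h1>
    <p>The main branch will stay open until 8 p.m. on Saturdays.</p>
  </article>
</body>
</html>
```

Some authors prefer to use <h2> for the article headlines on a page like this, so that the heading numbers alone reflect the nesting.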

Web developers also want to help computers understand the relationship between text and real-world objects. The HTML5 standard falls short in this regard, but other kinds of markup that work with HTML fill the gap. A schema is a descriptor that attaches to an HTML tag and connects the content to the real-world thing it represents. This matters for people who want to find content in a specific context that HTML does not yet describe. If you have a product that you want to sell, you want to be able to tell the software that all the text in this part of your website describes this product, and which specific pieces refer to the price, availability, manufacturer, and product description. If your content describes a recipe, you want to make it easy for a computer to understand how many servings it provides, what ingredients it calls for, and how long it will take to cook. Lastly, when you write an article on the web, you want people to know the name of the author who wrote it, as well as when and where it was originally published. As with semantic HTML, these “microdata” descriptors live alongside the content they describe and give context to the element they inhabit.
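The product case above might look something like the following sketch. It assumes the schema.org vocabulary, which this post does not name, and the product, company, and price are all invented for illustration.

```html
<!-- itemscope marks this block as one "thing"; itemtype says what kind of thing. -->
<div itemscope itemtype="https://schema.org/Product">
  <h2 itemprop="name">Wooden Building Blocks</h2>
  <p itemprop="description">A set of 50 painted wooden blocks.</p>

  <!-- The manufacturer is itself a thing (an organization) with its own name. -->
  <p>Made by
    <span itemprop="manufacturer" itemscope itemtype="https://schema.org/Organization">
      <span itemprop="name">Example Toy Company</span>
    </span>
  </p>

  <!-- The offer groups the price and availability details. -->
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="priceCurrency" content="USD">
    Price: $<span itemprop="price">24.99</span>
    <link itemprop="availability" href="https://schema.org/InStock">In stock
  </div>
</div>
```

The visible text is unchanged. The extra attributes only tell software which piece of text is the name, the price, and so on.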

I’ve described many of the good reasons for thinking of HTML as a semantic and descriptive language for your content, but I have not explained why it makes me enjoy my job as much as it does. In truth, I can’t really give a satisfactory answer. I’m sure there’s something in the way I think that sees a need for clean and orderly structure in my day-to-day work, an OCD-like compulsion for systematic taxonomy, like a biologist who feels the urge to find the right classification for every new species. I want to think I can use my work to describe the context around it in a way that approaches perfection, even as the adjectives I have available are comparatively sparse and subject to change. I do not deny that there have been times when I spent upwards of an hour changing my mind over whether one tag or another better describes the intent of the content. But as in our day-to-day language, the importance of communicating intent cannot be overstated. At the end of the day, I see the web as a method of communication that has the capacity to send ideas beyond the limits of our natural language, even if the words that HTML has invented so far are closer to the vocabulary of our caveman days.