Embedded Semantic Markup, schema.org, the Common Crawl, and Web Data Commons: Big Web Data

Jason Ronallo
Associate Head, Digital Library Initiatives
North Carolina State University Libraries

@ronallo
jason_ronallo@ncsu.edu

Hi! I'm Jason Ronallo at NC State.

Outline

I'm going to cover a lot of ground in a short period of time, just enough to give you an idea about each of these topics, to begin to see why this is important, and to give you ideas about what's becoming possible. Much of this is the necessary preface for the research I've begun.

How Search Engines Work

  1. Robots crawl the Web
  2. Process and index crawl data
  3. Try to answer search queries with the most relevant results

friendly robot 

This is basically how search engines work.
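The three steps on the slide can be sketched as a toy in Python. This is only an illustration under loud assumptions: the `PAGES` dictionary stands in for crawled pages (the example.org URLs and the helper names `tokenize`, `build_index`, and `search` are all hypothetical), and ranking is a bare count of matching words, which is far simpler than what any real search engine does.

```python
import re
from collections import defaultdict

# Step 1 (assumed): pretend a robot already crawled these hypothetical pages.
PAGES = {
    "http://example.org/a": "Jason Ronallo works at NCSU Libraries",
    "http://example.org/b": "Libraries describe rare and unique materials",
    "http://example.org/c": "Bug sprays and pets video",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Step 2: build an inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Step 3: rank URLs by how many query words they contain (ties broken by URL)."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for url in index.get(token, ()):
            scores[url] += 1
    return sorted(scores, key=lambda url: (-scores[url], url))


INDEX = build_index(PAGES)
```

Calling `search(INDEX, "ncsu libraries")` ranks the page that contains both words ahead of the page that contains only one. Notice that the index only knows about words, not about what those words mean.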

The problem is that, for the most part, they have to do natural language processing to pull out semantics, and that is really hard to do. Even the smart minds at big companies like Google can only do so much.

Embedded Semantic Markup

Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.

This is part of the reason why embedded semantic markup is important. Here's a snippet that contains embedded markup, but you can't see the markup on the rendered page.

Embedded Semantic Markup

<span itemscope
   itemtype="http://schema.org/Person">
   <a itemprop="url"
      href="http://twitter.com/ronallo">
      <span itemprop="name">Jason Ronallo</span>
   </a> is the <span itemprop="jobTitle">
     Associate Head of Digital Library
     Initiatives</span> at
   <span itemprop="affiliation" itemscope
     itemtype="http://schema.org/Library">
     <span itemprop="name"> <a itemprop="url"
       href="http://lib.ncsu.edu">NCSU Libraries</a></span>
  </span>.
</span>

But here's what the markup looks like. A few attributes have been added to simple HTML to give the data more structure.

That's all that embedded semantic markup is. Embedded semantic markup provides the syntax (some extra markup) to structure data in your HTML pages.
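To see what a robot can get out of those extra attributes, here is a minimal sketch in Python that reads the slide's snippet back as structured data. `MicrodataParser` is a hypothetical toy written for this example on top of the standard library's `html.parser`. It is not a real Microdata library: it assumes well-formed markup and only understands `itemscope`, `itemtype`, and `itemprop` (it ignores `itemref`, `itemid`, `<meta content>`, and the rest of the Microdata rules).

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class MicrodataParser(HTMLParser):
    """Toy extractor (hypothetical): handles itemscope, itemtype, itemprop only."""

    def __init__(self):
        super().__init__()
        self.items = []        # top-level items found in the document
        self._item_stack = []  # items whose element is still open
        self._open = []        # one frame per open (non-void) element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        owner = self._item_stack[-1] if self._item_stack else None
        frame = {"owner": owner, "prop": None, "text": [], "is_item": False}
        if "itemscope" in attrs:
            # A new item starts here; nest it under its owner if it is a property.
            item = {"type": attrs.get("itemtype"), "properties": {}}
            if prop and owner is not None:
                owner["properties"].setdefault(prop, []).append(item)
            else:
                self.items.append(item)
            self._item_stack.append(item)
            frame["is_item"] = True
        elif prop and owner is not None:
            if tag == "a":
                # For links, the property value is the href, not the link text.
                owner["properties"].setdefault(prop, []).append(attrs.get("href"))
            else:
                frame["prop"] = prop  # value is the element's text, known at the end tag
        self._open.append(frame)

    def handle_data(self, data):
        # Text counts toward every open element that is collecting a text value.
        for frame in self._open:
            if frame["prop"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        frame = self._open.pop()
        if frame["prop"]:
            value = " ".join("".join(frame["text"]).split())  # collapse whitespace
            frame["owner"]["properties"].setdefault(frame["prop"], []).append(value)
        if frame["is_item"]:
            self._item_stack.pop()


# The markup from the slide, copied as-is.
SNIPPET = """
<span itemscope
   itemtype="http://schema.org/Person">
   <a itemprop="url"
      href="http://twitter.com/ronallo">
      <span itemprop="name">Jason Ronallo</span>
   </a> is the <span itemprop="jobTitle">
     Associate Head of Digital Library
     Initiatives</span> at
   <span itemprop="affiliation" itemscope
     itemtype="http://schema.org/Library">
     <span itemprop="name"> <a itemprop="url"
       href="http://lib.ncsu.edu">NCSU Libraries</a></span>
  </span>.
</span>
"""

parser = MicrodataParser()
parser.feed(SNIPPET)
parser.close()
person = parser.items[0]
library = person["properties"]["affiliation"][0]
```

After this runs, `person` is a dictionary typed as a schema.org Person with `url`, `name`, and `jobTitle` properties, and `library` is the nested Library item with its own `name` and `url`. No natural language processing was needed to get there.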

Think of this like hidden annotations.

Why use embedded semantic markup?

Since your eyes are more often on the web site, embedding the data there can be better than trying to keep your data in sync with some external XML serialization.

We often have very rich metadata for the resources we describe in our databases. In the past, schemes to expose this metadata through HTML and the Web led to a lot of dumbing down. Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.

Schema.org

* Numbers are from the last time I checked, early in 2013.

Using embedded semantic markup to structure your data is great, but if you're not using a vocabulary that someone else understands it is kind of pointless. This is where schema.org comes in. (Read slide.)

Examples

I'm going to show you a couple of very quick examples of what we've done so far with embedded semantic markup and schema.org at NCSU. I'm sure the Duke folks will be showing you more examples in a bit.

screenshot of home page of rare and unique materials site 

OK, here's a digital collections site where we've implemented embedded semantic markup and schema.org.

Schema.org types on the Rare & Unique Materials site

These are the types of things that we're describing on that site. And looking at Google Webmaster Tools, we know these things are getting indexed.

Rich Snippets: Rare and Unique Materials Video

YKK 

Third result in Google video search for "bug sprays and pets."

So the main benefit we get out of all this right now is what Google calls Rich Snippets.

This search result has a video thumbnail, the duration, and a bit from the transcript of the video as the description. Rich snippets are really the only thing that Google has said it will use this data for.

You can see how having this extra information can make a particular search result stand out and be more likely to be clicked on. So it improves discoverability.

But how else could this benefit libraries and archives if all of this gets pushed further?

Library Home Page