Hi. I'm Jason Ronallo, the Associate Head of Digital Library Initiatives at NCSU Libraries. Much of the work that I've done has been with digital special collections, and especially with improving the discoverability of these collections on the open Web.
I'm going to give an introduction to each of these pieces of the title, what they are, and how we can use them, and then I'm going to show a little of the original research that I'm doing. So let's jump into it.
How Search Engines Work
Robots crawl the Web
Process and index crawl data
Try to answer search queries with the most relevant results
But first I'd like to talk for a moment about how search engines work. Robots crawl the web. They process and index crawl data. Finally they try to get folks to relevant pages that match search queries. It is that step #2 that we're going to focus on today. There's a lot in that. The point I'd like to make about it now is that search engines have begun to reach the limits of what they can do with natural language processing alone. The problem is that there's a limit to what meaning you can pull out just from HTML tags and the text content.
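To make those three steps concrete, here is a toy sketch in Python. It is only an illustration, not how any real search engine works: the two "crawled" pages and their example.org URLs are made up, and the index is a plain word-to-URL lookup built from the text content alone.

```python
import re
from collections import defaultdict

# Step 1 stand-in: hypothetical pages a robot has already fetched.
CRAWLED_PAGES = {
    "http://example.org/staff": "<p>Jason Ronallo is the Associate Head at NCSU Libraries.</p>",
    "http://example.org/recipes": "<article>Cupcake recipe with reviews</article>",
}

def strip_tags(html):
    """Drop HTML tags and keep only the text content."""
    return re.sub(r"<[^>]+>", " ", html)

def build_index(pages):
    """Step 2: map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in re.findall(r"[a-z0-9]+", strip_tags(html).lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Step 3: return the URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(CRAWLED_PAGES)
print(search(index, "ncsu libraries"))
```

An index like this knows which words appear on which page, but nothing in it says that "Jason Ronallo" is a person's name or that "NCSU Libraries" is where he works. That is the limit I'm pointing at.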
HTML5 has added a bunch of new semantic elements like article. This can let us pick the article out of a page for a distraction-free reading experience.
Trapped Knowledge
But that still doesn't tell us much about what the content is about. There's still a lot of knowledge trapped in HTML pages that's difficult to get out.
Here's a simple statement. It is easy for us as humans to know what this means, but you can imagine how much more complex it would be to try to instruct a computer to pull out these same pieces of data, especially if this was within a longer text.
There's actually some embedded semantic markup on this page. You can't see it?
Embedded Semantic Markup Is Hidden Annotations Meant for Machines
Well that's because embedded semantic markup is a bunch of hidden annotations on the page meant for machines.
Embedded Semantic Markup Exposed
Person has the properties name, url, jobTitle, and affiliation. The affiliation is with a Library that has a name and url.
Here's the embedded semantic markup exposed. You can see that this whole thing describes a Person who has some properties like a name, url, jobTitle, and affiliation. Broken down in this way, the statement becomes easy for machines to make sense of.
Embedded Semantic Markup Structure
Embedded Semantic Markup HTML
<span itemscope itemtype="http://schema.org/Person"><a itemprop="url" href="http://twitter.com/ronallo"><span itemprop="name">Jason Ronallo</span></a> is the <span itemprop="jobTitle">
Associate Head of Digital Library
Initiatives</span> at
<span itemprop="affiliation" itemscope itemtype="http://schema.org/Library"><span itemprop="name"><a itemprop="url" href="http://lib.ncsu.edu">
NCSU Libraries</a></span></span>.
</span>
Here's our example HTML. I'm using the Microdata syntax for the embedded semantic markup. I won't get into the particulars, but you can see that there are some extra attributes like itemscope, itemtype, and itemprop added to the HTML.
You can extract the embedded semantic markup and serialize it as JSON.
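Here is a minimal sketch of that extraction step in Python, using only the standard library. It is not a conforming Microdata parser: the class name MicrodataSketch and the function extract_microdata are made up for this illustration, it handles only the three cases the example needs (text values, a href values, and nested items), and the itemtype values are written as full schema.org URLs, which is an assumption based on the RDF output for the same example. The JSON shape (items, type, properties) loosely follows the one described in the Microdata specification.

```python
import json
from html.parser import HTMLParser

# Elements that never get an end tag, so they are not tracked as "open".
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class MicrodataSketch(HTMLParser):
    """Collects Microdata items; handles only text, <a href>, and nested items."""

    def __init__(self):
        super().__init__()
        self.items = []          # top-level items found on the page
        self.item_stack = []     # items whose itemscope element is still open
        self.open_elements = []  # one record per open, non-void element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        owner = self.item_stack[-1] if self.item_stack else None
        record = {"item": None, "owner": None, "names": [], "text": []}
        if "itemscope" in attrs:
            # A new item starts here; it may also be a property of its parent item.
            item = {"type": (attrs.get("itemtype") or "").split(), "properties": {}}
            record["item"] = item
            if names and owner is not None:
                for name in names:
                    owner["properties"].setdefault(name, []).append(item)
            else:
                self.items.append(item)
            self.item_stack.append(item)
        elif names and owner is not None:
            if tag == "a" and "href" in attrs:
                # For links, the property value is the URL, not the link text.
                for name in names:
                    owner["properties"].setdefault(name, []).append(attrs["href"])
            else:
                # Otherwise the value is the element's text, gathered until it closes.
                record["owner"], record["names"] = owner, names
        self.open_elements.append(record)

    def handle_data(self, data):
        for record in self.open_elements:
            if record["owner"] is not None:
                record["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_elements:
            return
        record = self.open_elements.pop()
        if record["owner"] is not None:
            value = " ".join("".join(record["text"]).split())
            for name in record["names"]:
                record["owner"]["properties"].setdefault(name, []).append(value)
        if record["item"] is not None:
            self.item_stack.pop()

def extract_microdata(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return {"items": parser.items}

EXAMPLE_HTML = """
<span itemscope itemtype="http://schema.org/Person"><a itemprop="url"
href="http://twitter.com/ronallo"><span itemprop="name">Jason Ronallo</span></a>
is the <span itemprop="jobTitle">
Associate Head of Digital Library
Initiatives</span> at
<span itemprop="affiliation" itemscope itemtype="http://schema.org/Library"><span
itemprop="name"><a itemprop="url" href="http://lib.ncsu.edu">
NCSU Libraries</a></span></span>.
</span>
"""

data = extract_microdata(EXAMPLE_HTML)
print(json.dumps(data, indent=2))
```

The result is one Person item whose affiliation property holds a nested Library item, which is the same structure the exposed markup showed.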
RDF (Turtle)
@prefix md: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix schema: <http://schema.org/> .

<> md:item ( [ a schema:Person;
      schema:affiliation [ a schema:Library;
          schema:name "NCSU Libraries";
          schema:url <http://lib.ncsu.edu> ];
      schema:jobTitle "Associate Head of Digital Library Initiatives";
      schema:name "Jason Ronallo";
      schema:url <http://twitter.com/ronallo> ] );
  rdfa:usesVocabulary schema: .
Or serialize to some RDF representation.
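As a rough sketch of how the same nested data can become RDF, the Python below walks a hand-written copy of the Person item and prints N-Triples, a simpler line-per-statement relative of Turtle. The person dictionary, the to_ntriples function, and the URL_PROPERTIES set are all made up for this illustration, and it leaves out the md:item list and rdfa:usesVocabulary statements that a real Microdata-to-RDF tool produces.

```python
from itertools import count

SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hand-written copy of the example item (an assumption, not tool output).
person = {
    "type": "Person",
    "properties": {
        "name": "Jason Ronallo",
        "url": "http://twitter.com/ronallo",
        "jobTitle": "Associate Head of Digital Library Initiatives",
        "affiliation": {
            "type": "Library",
            "properties": {"name": "NCSU Libraries", "url": "http://lib.ncsu.edu"},
        },
    },
}

# Properties whose values are written as IRIs instead of string literals.
URL_PROPERTIES = {"url"}

def to_ntriples(item, counter=None, lines=None):
    """Return (blank node label, list of N-Triples lines) for a nested item."""
    counter = counter if counter is not None else count()
    lines = lines if lines is not None else []
    node = f"_:b{next(counter)}"
    lines.append(f"{node} <{RDF_TYPE}> <{SCHEMA}{item['type']}> .")
    for name, value in item["properties"].items():
        predicate = f"<{SCHEMA}{name}>"
        if isinstance(value, dict):
            # A nested item becomes its own blank node, linked from the parent.
            obj, _ = to_ntriples(value, counter, lines)
        elif name in URL_PROPERTIES:
            obj = f"<{value}>"
        else:
            obj = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
        lines.append(f"{node} {predicate} {obj} .")
    return node, lines

root, triples = to_ntriples(person)
print("\n".join(triples))
```

Each property of the Person and of the Library becomes one statement, and the affiliation is simply a statement whose object is the Library's blank node.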
Types of Embedded Semantic Markup
Microformats
RDFa (Lite)
Microdata
These are the different syntaxes that are most commonly used for embedded semantic markup. The example that I've shown is in the Microdata syntax. I mention them now since we'll see them again when we get to the research I've done.
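To show how the three syntaxes differ on the page, here is the same name marked up in each, along with a crude Python check for which syntax a snippet uses. The detect_syntaxes function and the list of microformats class names are assumptions made for this sketch. It is a heuristic only: for instance, a property attribute also turns up in Open Graph meta tags, so a match is a hint and not proof.

```python
from html.parser import HTMLParser

# A few well-known microformats root class names (not a complete list).
MICROFORMAT_CLASSES = {
    "vcard", "vevent", "hentry", "hreview", "hrecipe",
    "h-card", "h-entry", "h-event",
}

class SyntaxDetector(HTMLParser):
    """Records which embedded-markup syntaxes appear to be used."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs or "itemprop" in attrs:
            self.found.add("microdata")
        if "vocab" in attrs or "typeof" in attrs or "property" in attrs:
            self.found.add("rdfa")
        classes = set((attrs.get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.found.add("microformats")

def detect_syntaxes(html):
    detector = SyntaxDetector()
    detector.feed(html)
    detector.close()
    return detector.found

MICROFORMATS = '<span class="vcard"><span class="fn">Jason Ronallo</span></span>'
RDFA_LITE = ('<span vocab="http://schema.org/" typeof="Person">'
             '<span property="name">Jason Ronallo</span></span>')
MICRODATA = ('<span itemscope itemtype="http://schema.org/Person">'
             '<span itemprop="name">Jason Ronallo</span></span>')

for label, html in [("Microformats", MICROFORMATS),
                    ("RDFa Lite", RDFA_LITE),
                    ("Microdata", MICRODATA)]:
    print(label, sorted(detect_syntaxes(html)))
```

Microformats reuse the class attribute, while RDFa Lite and Microdata each add their own attributes to the HTML.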
Why use embedded semantic markup?
A way to structure data in HTML
To communicate with machines
Your eyes are on the Web site (Maintain this data in one place and keep it in sync)
Rich Metadata to Rich Embedded Data
Embedded semantic markup is a syntax for structuring data in HTML when you need to communicate unambiguously with machines. Since your eyes are more often on the web site, it can be better than trying to keep your data in sync with some external XML serialization. These syntaxes also allow us the chance to go from rich metadata to rich embedded data.
That last point is something I'd like to stress. Too often in libraries when we're looking to exchange data or expose it on the Web we dumb it down. (OAI-PMH led to a lot of this.) You might also remember folks trying to put Dublin Core into the header of HTML.
We often have very rich metadata for the resources we describe in our databases.
Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.
The End of Dumbed Down Metadata
I hope we're nearing the end of dumbed down metadata.
Vocabularies for Understanding
Embedded semantic markup alone only gives the syntax and isn't useful on its own. We also need vocabularies so that we understand each other and get past dumbed down metadata.
Schema.org
This is where a vocabulary like schema.org comes in to help. There are all kinds of vocabularies that can be used for describing the content of Web pages, but I'll focus on schema.org for its ease of use, particular use cases, and growing implementation base.
Schema.org
Shared, Web-scale, single-stop vocabulary for describing the content of Web pages.
Released 2011
Maintained by the major search engines (Bing, Google, Yahoo, Yandex)
Everything is a Thing
407+ Types of Things (Numbers from early 2013)
545+ Properties of Things
Everything from Airport to Library to Volcano
Expanding and open to proposals to update the schema (see SchemaBibEx W3C Community Group)
Single site for documentation. Easy to use. No fragmentation.
[Read slide.] Yes, you can even describe a Volcano, which peculiarly has a property for phone number.
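The phone number makes more sense once you see that types inherit properties from their parents. The Python sketch below models a tiny, hand-copied slice of the hierarchy. The PARENT and OWN_PROPERTIES tables are my own abbreviation for illustration, with only a few properties per type, so check the schema.org site for the real lists.

```python
# Parent of each type in this small slice of the hierarchy.
PARENT = {
    "Thing": None,
    "Place": "Thing",
    "Landform": "Place",
    "Volcano": "Landform",
    "Person": "Thing",
}

# A few properties each type declares itself (abbreviated on purpose).
OWN_PROPERTIES = {
    "Thing": ["name", "url", "description", "image"],
    "Place": ["address", "geo", "telephone"],
    "Landform": [],
    "Volcano": [],
    "Person": ["jobTitle", "affiliation"],
}

def ancestry(type_name):
    """Return the chain of types from Thing down to type_name."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = PARENT[type_name]
    return chain[::-1]

def all_properties(type_name):
    """Return the type's own properties plus everything it inherits."""
    properties = []
    for ancestor in ancestry(type_name):
        properties.extend(OWN_PROPERTIES[ancestor])
    return properties

print(" > ".join(ancestry("Volcano")))
print(all_properties("Volcano"))
```

A Volcano is a Landform, which is a Place, and a Place can have a telephone. So the Volcano gets a phone number by inheritance, not because anyone thought volcanoes needed one.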
Here's the schema.org page showing the tree of all the types of things you can describe with it.
This kind of simple documentation makes it easier to implement.
Why use Schema.org?
Growing implementation base.
Software implementations (CMS).
With implementations and known consumers, other consumers will follow.
There are lots of reasons to use schema.org. The implementation base is growing. Software such as content management systems is implementing it. And when the data is out there, others will come along, discover it, and use it.
Improve Discoverability on the Open Web
But the main reason right now is to improve the discoverability of your services and collections on the open web. There are many facets to how to improve discoverability. In part it means improving things in Google.
Rich Snippets
The way Google has promised to use embedded semantic markup with schema.org is for what it calls rich snippets.
You've probably already seen search results snippets in Google like this. There's a lot of information about this recipe page. You see an image of a cupcake, the number of reviews and stars, how long it takes to cook, and the number of calories. It even includes a list of some of the ingredients.
And you can imagine how the click-through rates on a search snippet like this could be higher than on a normal one. This is the main reason folks are currently using it.
Library Examples
NCSU Libraries
Future Possibilities
So, to give you an idea of how this can apply to libraries and digital collections, I'll show you some examples of how we've implemented embedded semantic markup and schema.org at NCSU Libraries, and then show a couple of ideas on how we might see it being used in the future.