Jason Ronallo
Associate Head, Digital Library Initiatives
North Carolina State University Libraries
@ronallo
jason_ronallo@ncsu.edu
This is basically how search engines work.
The problem is that, for the most part, they have to do natural language processing to pull out the semantics. That is really hard to do. Even the smart minds at big companies like Google can only do so much.
Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.
<span itemscope
      itemtype="http://schema.org/Person">
  <a itemprop="url"
     href="http://twitter.com/ronallo">
    <span itemprop="name">Jason Ronallo</span>
  </a> is the
  <span itemprop="jobTitle">Associate Head of
    Digital Library Initiatives</span> at
  <span itemprop="affiliation" itemscope
        itemtype="http://schema.org/Library">
    <span itemprop="name">
      <a itemprop="url"
         href="http://lib.ncsu.edu">NCSU Libraries</a>
    </span>
  </span>.
</span>
But here's what the markup looks like. A few attributes have been added to simple HTML to give the data more structure.
That's all that embedded semantic markup is. Embedded semantic markup provides the syntax (some extra markup) to structure data in your HTML pages.
Think of this like hidden annotations. Since your eyes are more often on the website, it can be better than trying to keep your data in sync with some external XML serialization.
We often have very rich metadata for the resources we describe in our databases. In the past, schemes to expose this metadata through HTML and the Web led to a lot of dumbing down. Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.
* Numbers from the last time I checked, early in 2013.
Third result in Google video search for "bug sprays and pets."
So the main benefit we get out of all this right now is what Google calls Rich Snippets.
This search result has a video thumbnail, the duration, and a bit from the transcript of the video as the description. Rich snippets are really the only thing that Google has said it will use this data for.
You can see how having this extra information can make a particular search result stand out and be more likely to be clicked on. So it improves discoverability.
But how else could this benefit libraries and archives if all of this gets pushed further?
Embedded semantic markup includes the Libraries' name, URL, logo (hidden), address, and telephone number.
Hunt Library hours: 7:00 a.m. - 11:00 p.m.
Other good things we could add would be hours of operation, our exact geographic location, and events happening in the Libraries.
Who is familiar with Google Now?
It is like an automatic personal assistant that learns about your habits and gives you helpful information. If you enable Google Now you'll see information about how long it would take for you to get from home to work on the next bus.
This is totally speculative about where this could go, but wouldn't it be cool if it showed students the Libraries' hours for the day, when and where their study group is meeting, and what events are happening in the library? This is the kind of thing that becomes possible when lots of this data is published on the Web and combined with the student's data.
Get some rough idea of the landscape of use of embedded semantic markup and schema.org among academic institutions and academic libraries.
So one of the things I'd like to do with this data is learn how many academic institutions are using these technologies, and especially how academic libraries are using them. What are their patterns of use? How could we improve things for libraries in this new Big Web data environment? I've got lots of other questions.
The problem is that I'd need lots of data to answer these questions. I certainly don't have the means to go out and crawl the Web, store all the data, and parse it.
This is where the Common Crawl comes in to help. They don't crawl as much as Google, but it is still a lot. (Read slide.)
If you want to use this data, it is free. But parsing it all will cost you money. Not as much as crawling the Web on your own would, but still something.
Domains with Triples | 2,286,277 |
Typed Entities | 1,811,471,956 |
Triples/Statements | 7,350,953,995 |
Cost to extract the data from the Common Crawl: $398
And this is where the Web Data Commons comes in. It parses the whole Common Crawl corpus to extract all of the embedded semantic markup into RDF triples. It makes all the data available for free.
Again we're talking some Big Data here with over 7 billion triples.
Even so, the size of this data set is more approachable to just download and play with.
_:node6eecc231551a72e90e7efb3dc3fc26 http://schema.org/Photograph/name "Mary Travers singing live on stage" http://d.lib.ncsu.edu/collections/catalog/0228376 .
An N-Quad is an RDF statement that also includes a context element at the end. The context is the URL of the HTML page from which the data was extracted.
The line-based format makes it easier to do some rough parsing.
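As a minimal sketch (my own rough Python, not the Web Data Commons tooling), here is one way a line like the example above could be pulled apart. It assumes the simplified form shown, where the context URL is the last element before the final period, and the file name is only a placeholder.

def parse_nquad(line):
    # Assumes a well-formed line ending in " ." like the example above.
    body = line.strip()[:-2].rstrip()
    subject, predicate, rest = body.split(" ", 2)
    # The object may be a quoted literal containing spaces, so peel the
    # context (the page URL) off from the right instead.
    obj, context = rest.rsplit(" ", 1)
    return subject, predicate, obj, context

with open("wdc-extract.nq") as quads:  # placeholder file name
    for line in quads:
        subject, predicate, obj, context = parse_nquad(line)
        print(context)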
Before we get into some of the results, let's cover a little bit of vocabulary.
Web Data Commons publishes its data as N-Quads. So what's an N-Quad?
All of this is very much a crude pass!
I wanted to contain the big data element a bit so I used some crude methods to just begin to get some usable data out.
OK, let's see what we're left with.
All triples | 7,350,953,995 |
All .edu | 8,178,985 |
duke.edu | 58,867 |
nccu.edu | 79 |
ncsu.edu | 9,339 |
unc.edu | 52,751 |
These are all the statements that contain the text (.edu, duke.edu, nccu.edu, ncsu.edu, unc.edu) anywhere in the N-Quad.
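Here is a hedged sketch of that kind of crude pass in Python, counting N-Quads that mention each string anywhere in the line; the file name is a placeholder for a local copy of a Web Data Commons extract.

from collections import Counter

domains = [".edu", "duke.edu", "nccu.edu", "ncsu.edu", "unc.edu"]
counts = Counter()

with open("wdc-extract.nq") as quads:  # placeholder file name
    for line in quads:
        for domain in domains:
            if domain in line:         # match anywhere in the N-Quad
                counts[domain] += 1

for domain in domains:
    print(domain, counts[domain])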
Domain | Unique contexts | Library domain | Unique contexts |
duke.edu | 55,344 | library.duke.edu | 1,123 |
nccu.edu | 2 | n/a* | |
ncsu.edu | 664 | lib.ncsu.edu | 155 |
unc.edu | 2,837 | lib.unc.edu | 503 |
These are the numbers of unique contexts (HTML pages) that are included in the Common Crawl and that contain some embedded semantic markup that Web Data Commons has extracted.
* Uncertain how to target just NCCU Libraries.
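A rough guess at how counts like these could be reproduced, reusing the parse_nquad() sketch from earlier and matching only on the context URL; the domain list and file name are placeholders.

domains = ["duke.edu", "library.duke.edu", "nccu.edu",
           "ncsu.edu", "lib.ncsu.edu", "unc.edu", "lib.unc.edu"]
contexts = {domain: set() for domain in domains}

with open("edu-subset.nq") as quads:    # placeholder .edu subset file
    for line in quads:
        _, _, _, context = parse_nquad(line)
        for domain in domains:
            # A substring match, so library subdomains also count
            # toward their parent university domain.
            if domain in context:
                contexts[domain].add(context)

for domain in domains:
    print(domain, len(contexts[domain]))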
Of all 8,178,985 .edu N-Quads, 145,351 use schema.org.
Domain | schema.org N-Quads | Library domain | schema.org N-Quads |
duke.edu | 1,901 | library.duke.edu | 1,660 |
nccu.edu | 3 | | |
ncsu.edu | 326 | lib.ncsu.edu | 102 |
unc.edu | 301 | lib.unc.edu | 25 |
* These numbers look at the whole quad and not just the context. So these universities and libraries might not actually be using schema.org (or may have been using schema.org but the documents that have schema.org have not been crawled by the Common Crawl).
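To make that caveat concrete, here is a small sketch of the difference between matching on the whole quad (as these numbers do) and matching on the context URL only; lib.ncsu.edu is just the example domain and the file name is a placeholder.

by_whole_quad = 0    # schema.org and the domain appear anywhere in the quad
by_context_url = 0   # stricter: the page itself is on lib.ncsu.edu

with open("edu-subset.nq") as quads:    # placeholder .edu subset file
    for line in quads:
        if "schema.org" not in line:
            continue
        if "lib.ncsu.edu" in line:
            by_whole_quad += 1
        _, _, _, context = parse_nquad(line)
        if "lib.ncsu.edu" in context:
            by_context_url += 1

print(by_whole_quad, by_context_url)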
Libraries and archives can be both producers and consumers of this data.
So I think we can have a part to play both as producers of this data and as consumers of it.
So that's what others might do with this data, but what could Libraries do?
It could enable new services.
This might help us identify data sets we'd like to preserve.
NCSU sites that use embedded semantic markup (Microdata) and Schema.org:
@ronallo
jason_ronallo@ncsu.edu