Jason Ronallo
Associate Head, Digital Library Initiatives
North Carolina State University Libraries
@ronallo
jason_ronallo@ncsu.edu
This is basically how search engines work.
The problem is that, for the most part, they have to do natural language processing to pull out the semantics. That is really hard to do. Even the smart minds at big companies like Google can only do so much.
Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries.
<span itemscope
      itemtype="http://schema.org/Person">
  <a itemprop="url"
     href="http://twitter.com/ronallo">
    <span itemprop="name">Jason Ronallo</span>
  </a> is the
  <span itemprop="jobTitle">Associate Head of
    Digital Library Initiatives</span> at
  <span itemprop="affiliation" itemscope
        itemtype="http://schema.org/Library">
    <span itemprop="name">
      <a itemprop="url"
         href="http://lib.ncsu.edu">NCSU Libraries</a>
    </span>
  </span>.
</span>
But here's what the markup looks like. A few attributes have been added to simple HTML to give the data more structure.
That's all that embedded semantic markup is. Embedded semantic markup provides the syntax (some extra markup) to structure data in your HTML pages.
Think of this like hidden annotations. Since your eyes are more often on the website, it can be better than trying to keep your data in sync with some external XML serialization.
We often have very rich metadata for the resources we describe in our databases. In the past, schemes to expose this metadata through HTML and the Web led to a lot of dumbing down. Using embedded semantic markup like Microdata or RDFa Lite allows us to expose richer metadata with more structure through HTML.
* Numbers from the last time I checked, early in 2013.
Third result in Google video search for "bug sprays and pets."
So the main benefit we get out of all this right now is what Google calls Rich Snippets.
This search result has a video thumbnail, the duration, and a bit from the transcript of the video as the description. Rich snippets are really the only thing that Google has said it will use this data for.
You can see how having this extra information can make a particular search result stand out and be more likely to be clicked on. So it improves discoverability.
But how else could this benefit libraries and archives if all of this gets pushed further?
Embedded semantic markup includes the Libraries' name, URL, logo (hidden), address, and telephone number.
Hunt Library hours: 7:00 a.m. - 11:00 p.m.
Other good things we could add would be hours of operation, our exact geographic location, and events happening in the Libraries.
Who is familiar with Google Now?
It is like an automatic personal assistant that learns about your habits and gives you helpful information. If you enable Google Now you'll see information about how long it would take for you to get from home to work on the next bus.
This is totally speculative about where this could go, but wouldn't it be cool if it showed students the Libraries' hours for the day, when and where their study group is meeting, and what events are happening in the library? This is the kind of thing that becomes possible when lots of this data is published on the Web and combined with the student's data.
Get some rough idea of the landscape of use of embedded semantic markup and schema.org among academic institutions and academic libraries.
So one of the things I'd like to do with this data is learn how many academic institutions are using these technologies, and especially how academic libraries are using them. What are their patterns of use? How could we improve things for libraries in this new Big Web data environment? I've got lots of other questions.
The problem is that I'd need lots of data to answer these questions. I certainly don't have the means to go out and crawl the Web, store all the data, and parse it.
This is where the Common Crawl comes in to help. They don't crawl as much as Google, but it is still a lot. (Read slide.)
If you want to use this data, it is free. But parsing it all will cost you money. Not as much as crawling the Web on your own would, but still something.
Domains with Triples | 2,286,277 |
Typed Entities | 1,811,471,956 |
Triples/Statements | 7,350,953,995 |
Cost to extract the data from the Common Crawl: $398
And this is where the Web Data Commons comes in. It parses the whole Common Crawl corpus to extract all of the embedded semantic markup into RDF triples. It makes all the data available for free.
Again we're talking some Big Data here with over 7 billion triples.
Even so, the size of this data set is more approachable to just download and play with.
_:node6eecc231551a72e90e7efb3dc3fc26 http://schema.org/Photograph/name "Mary Travers singing live on stage" http://d.lib.ncsu.edu/collections/catalog/0228376 .
An N-Quad is an RDF statement that also includes a context element at the end. The context is the URL of the HTML page from which the data was extracted.
The line-based format makes it easier to do some rough parsing.
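As a minimal sketch (my own rough Python, not the Web Data Commons tooling), here is one way a line like the example above could be pulled apart. It assumes the simplified form shown, where the context URL is the last element before the final period, and the file name is only a placeholder.

def parse_nquad(line):
    # Assumes a well-formed line ending in " ." like the example above.
    body = line.strip()[:-2].rstrip()
    subject, predicate, rest = body.split(" ", 2)
    # The object may be a quoted literal containing spaces, so peel the
    # context (the page URL) off from the right instead.
    obj, context = rest.rsplit(" ", 1)
    return subject, predicate, obj, context

with open("wdc-extract.nq") as quads:  # placeholder file name
    for line in quads:
        subject, predicate, obj, context = parse_nquad(line)
        print(context)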
Before we get into some of the results, let's cover a little bit of vocabulary.
Web Data Commons publishes its data as N-Quads. So what's an N-Quad?
All of this is very much a crude pass!
I wanted to contain the big data element a bit so I used some crude methods to just begin to get some usable data out.
OK, let's see what we're left with.
All triples | 7,350,953,995 |
All .edu | 8,178,985 |
duke.edu | 58,867 |
nccu.edu | 79 |
ncsu.edu | 9,339 |
unc.edu | 52,751 |
These are all the statements that contain the text (.edu, duke.edu, nccu.edu, ncsu.edu, unc.edu) anywhere in the N-Quad.
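Here is a hedged sketch of that kind of crude pass in Python, counting N-Quads that mention each string anywhere in the line; the file name is a placeholder for a local copy of a Web Data Commons extract.

from collections import Counter

domains = [".edu", "duke.edu", "nccu.edu", "ncsu.edu", "unc.edu"]
counts = Counter()

with open("wdc-extract.nq") as quads:  # placeholder file name
    for line in quads:
        for domain in domains:
            if domain in line:         # match anywhere in the N-Quad
                counts[domain] += 1

for domain in domains:
    print(domain, counts[domain])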
Domain | Unique contexts | Library domain | Unique contexts |
duke.edu | 55,344 | library.duke.edu | 1,123 |
nccu.edu | 2 | n/a* | |
ncsu.edu | 664 | lib.ncsu.edu | 155 |
unc.edu | 2,837 | lib.unc.edu | 503 |
These are the numbers of unique contexts (HTML pages) that are included in the Common Crawl and that contain some embedded semantic markup that Web Data Commons has extracted.
* Uncertain how to target just NCCU Libraries.
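A rough guess at how counts like these could be reproduced, reusing the parse_nquad() sketch from earlier and matching only on the context URL; the domain list and file name are placeholders.

domains = ["duke.edu", "library.duke.edu", "nccu.edu",
           "ncsu.edu", "lib.ncsu.edu", "unc.edu", "lib.unc.edu"]
contexts = {domain: set() for domain in domains}

with open("edu-subset.nq") as quads:    # placeholder .edu subset file
    for line in quads:
        _, _, _, context = parse_nquad(line)
        for domain in domains:
            # A substring match, so library subdomains also count
            # toward their parent university domain.
            if domain in context:
                contexts[domain].add(context)

for domain in domains:
    print(domain, len(contexts[domain]))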
Of all 8,178,985 .edu N-Quads, 145,351 use schema.org.
Domain | schema.org N-Quads | Library domain | schema.org N-Quads |
duke.edu | 1,901 | library.duke.edu | 1,660 |
nccu.edu | 3 | | |
ncsu.edu | 326 | lib.ncsu.edu | 102 |
unc.edu | 301 | lib.unc.edu | 25 |
* These numbers look at the whole quad and not just the context. So these universities and libraries might not actually be using schema.org (or may have been using schema.org but the documents that have schema.org have not been crawled by the Common Crawl).
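To make that caveat concrete, here is a small sketch of the difference between matching on the whole quad (as these numbers do) and matching on the context URL only; lib.ncsu.edu is just the example domain and the file name is a placeholder.

by_whole_quad = 0    # schema.org and the domain appear anywhere in the quad
by_context_url = 0   # stricter: the page itself is on lib.ncsu.edu

with open("edu-subset.nq") as quads:    # placeholder .edu subset file
    for line in quads:
        if "schema.org" not in line:
            continue
        if "lib.ncsu.edu" in line:
            by_whole_quad += 1
        _, _, _, context = parse_nquad(line)
        if "lib.ncsu.edu" in context:
            by_context_url += 1

print(by_whole_quad, by_context_url)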
Libraries and archives can be both producers and consumers of this data.
So I think we can have a part to play both as producers of this data and as consumers of it.
So that's what others might do with this data, but what could Libraries do?
It could enable new services.
This might help us identify data sets we'd like to preserve.
NCSU sites that use embedded semantic markup (Microdata) and Schema.org:
@ronallo
jason_ronallo@ncsu.edu