NCSU Libraries
IIIF Implementation

http://ronallo.com/presentations/iiif-demo/

For the impatient: http://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true&f%5Bispartof_facet%5D%5B%5D=Nubian+Message

Jason Ronallo
Head of Digital Library Initiatives
NCSU Libraries

Please Note

This deck was put together very quickly for a demo on the IIIF community call of September 14, 2016.

Implemented IIIF APIs

Existing Infrastructure

SCAMS

Special Collections Asset Management System

Manages descriptive metadata.

SAL

Our public digital collections interface (see "Public Interface (SAL)" below).

Image Server: Djatoka

Motivations

(other than wanting to migrate off of Djatoka)

We had been out in front with discovery (esp. SEO) and interfaces for digital special collections. Often a model for others.

We already had versions of what I'm talking about:
A pan/zoom paginated viewer, search inside, an API

But nothing as good as what can be done now with IIIF-compatible tools. Being out in front for us now means adopting IIIF and pushing it forward, because we get so many benefits from it.

Development Process: Promises

Oh noes! It's a month until Harambee, and we promised to have search inside for the Nubian Message before this event!

Most of this development was done over the course of a month, with 2-3 weeks of intensive work. We had already migrated the JP2 images from the Djatoka profile to one that works well with existing image servers.

Worked closely in a small team.

No worries. We like to have all the pieces strung together into a minimal product, learn more about the problems, and improve each of them over time.

We had had a bespoke implementation of search inside for the Technician newspaper. It wasn't good.

Image Server: Eyebright

Eyebright: A traditional medicinal herb used to relieve eye strain.

Why another image server?

Caching

Caches to the file system in a transparent way.

Caches in directories the same way a Level 0 implementation would.

Most image requests never hit the image server, just the web server.

Easy performance win.
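
How the cache path falls out of the request, as a minimal sketch (CACHE_ROOT and the helper are assumptions, not Eyebright's actual code):

require 'fileutils'

CACHE_ROOT = '/access-images/iiif' # assumed cache location

# Map IIIF Image API parameters onto a Level 0 style directory layout.
def cache_path(identifier, region, size, rotation, quality, format)
  File.join(CACHE_ROOT, identifier, region, size, rotation,
            "#{quality}.#{format}")
end

path = cache_path('nubian-message-1996-02-22_0001',
                  'full', '90,', '0', 'default', 'jpg')
# => "/access-images/iiif/nubian-message-1996-02-22_0001/full/90,/0/default.jpg"

# After rendering a derivative once, write it to this path so the web
# server serves every later request as a static file:
# FileUtils.mkdir_p(File.dirname(path))
# File.binwrite(path, image_bytes)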

Profiles

Common URL patterns across the various applications we manage.

Only clear images from the cache that do not match these patterns, so cleaning doesn't have to crawl deeply: if the region is not "full" or "square", remove it. Runs as a cron job.

TODO: Do not expire any derivatives for our most frequently used images. These should always be as fast as possible.
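
A sketch of what that cron job's pruning rule might look like, assuming the Level 0 style layout above (paths and the kept-region list are assumptions):

require 'fileutils'

CACHE_ROOT = '/access-images/iiif' # assumed cache location
KEEP_REGIONS = ['full', 'square', '!square', 'square!']

# Each image's cache directory has one subdirectory per region, so only
# the second level needs to be examined; pixel-crop regions such as
# "0,0,6099,6099" are removed, everything else is kept.
Dir.glob(File.join(CACHE_ROOT, '*', '*')).each do |region_dir|
  next unless File.directory?(region_dir)
  region = File.basename(region_dir)
  FileUtils.rm_rf(region_dir) unless KEEP_REGIONS.include?(region)
end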

UV requests for
/full/90,/0/default.jpg

Keep this image because it is small and helps UV load faster. We keep these around regardless of when they were last used.

iiif/nubian-message-1996-02-22_0001
├── full
│   └── 90,
│       └── 0
│           └── default.jpg
└── info.json

Static Image Generator

A just-in-time static site generator for images.

Simple cache expiration.

Improved cache maintainability.

Feature Extension

gravityBangs

Problems

Other applications would rather know as little as possible about the images themselves.

full image: portrait

Seguy

square: center gravity

!square: top left gravity

square!: bottom right gravity

So now we record this information about some images, and it is accessible to all the other applications that use the images.

Cache cleaning easier/simpler

Top left (!square) would otherwise be:
/iiif/segIns_001/0,0,6099,6099/350,/0/default.jpg

square!

/iiif/segIns_001/0,2500,6099,6099/350,/0/default.jpg
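
A sketch of how the gravity names can resolve to standard x,y,w,h regions; segIns_001's dimensions (6099x8599) are inferred from the two URLs above:

# Translate a gravityBangs region name into a standard IIIF region.
# Sketch of the idea, not the real code.
def gravity_square(width, height, name)
  side = [width, height].min # the square uses the shorter edge
  case name
  when 'square'  then x, y = (width - side) / 2, (height - side) / 2 # center
  when '!square' then x, y = 0, 0                                    # top left
  when 'square!' then x, y = width - side, height - side             # bottom right
  end
  "#{x},#{y},#{side},#{side}"
end

gravity_square(6099, 8599, '!square') # => "0,0,6099,6099"
gravity_square(6099, 8599, 'square!') # => "0,2500,6099,6099"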

Migration from Djatoka:
Better Images!

OCR and Search Inside with Ocracoke

Job Queues

No UI (Yet)

Trigger jobs from the command line:

bin/rake \
ocracoke:queue_from_ncsu_id[LD3928-A23-1947]

Public API

Rake uses a simplistic API in our public Blacklight app:

https://d.lib.ncsu.edu/collections/catalog/LD3928-A23-1947.json

{
  "fileName": "LD3928-A23-1947",
  "images": [
    "LD3928-A23-1947_0001",
    "LD3928-A23-1947_0002",
    "LD3928-A23-1947_0003",
    "LD3928-A23-1947_0004"
  ]
}

If you have any kind of API or way to get an identifier for a resource and a list of its page images, you can queue OCR jobs the same way.
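
A sketch of that fetch-and-queue step (the enqueueing itself is stubbed out here, since the job class is internal to Ocracoke):

require 'net/http'
require 'json'

id  = 'LD3928-A23-1947'
uri = URI("https://d.lib.ncsu.edu/collections/catalog/#{id}.json")
doc = JSON.parse(Net::HTTP.get(uri))

doc['images'].each do |image_id|
  # In the real app this would enqueue an OCR job for the page image.
  puts "queue OCR for #{doc['fileName']} / #{image_id}"
end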

API: Queuing

Simple token-based API

curl -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"resource": "gng00126",
       "images": ["gng00126_001",
          "gng00126_002",
          "gng00126_003",
          "gng00126_004"]}' \
  -H "Authorization: Token token=token, user=scams" \
  -k http://localhost:8090/api/ocr_resource

API: Notifications

When a job is complete a notification can be sent to another application.
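
Roughly like this, though the endpoint and payload here are entirely made up for illustration:

require 'net/http'
require 'json'

# Hypothetical callback endpoint in the other application (e.g. SCAMS).
uri = URI('https://scams.example.org/api/ocr_complete')
req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
req.body = { 'resource' => 'gng00126', 'status' => 'complete' }.to_json
Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }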

Outputs: txt, hOCR, JSON, PDF

Page Image: text, hOCR, JSON word boundaries
Resource: concatenated text, PDF with embedded text

/access-images/
└── ocr
    └── LD
        ├── LD3928-A23-1947
        │   ├── LD3928-A23-1947.pdf
        │   └── LD3928-A23-1947.txt
        ├── LD3928-A23-1947_0001
        │   ├── LD3928-A23-1947_0001.hocr
        │   ├── LD3928-A23-1947_0001.json
        │   └── LD3928-A23-1947_0001.txt
        ├── LD3928-A23-1947_0002
        │   ├── LD3928-A23-1947_0002.hocr
        │   ├── LD3928-A23-1947_0002.json
        │   └── LD3928-A23-1947_0002.txt
        ...

Workflow between applications

Content Search API

https://ocr.lib.ncsu.edu/search/nubian-message-1995-04-13?q=afrikan

Strings are probably the wrong thing to use, but it works!

{ "hits": [
  {
    "@type": "search:Hit",
    "annotations": [
        "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation0",
        "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation1"]
  }
]}

Hit Highlighting and
Bounding Boxes

Canvas not dereferenceable yet:
https://iiif.lib.ncsu.edu/iiif/nubian-message-1995-04-13_0011/canvas#xywh=497,4775,153,37

{"resources": [{
  "@id": "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation0",
  "@type": "oa:Annotation",
  "motivation": "sc:painting",
  "resource": {"@type": "cnt:ContentAsText",
               "chars": "Afrikan"},
  "on":
    "https://iiif.lib.ncsu.edu/iiif/nubian-message-1995-04-13_0011/canvas#xywh=497,4775,153,37"
}]}
Where does it get the bounding boxes?

Library of Congress Approach

Extract the bounding boxes for each word from the OCR (hOCR or ALTO).

Make a hash where the keys are words on the page and the values are bounding boxes.

Solr provides hit highlights. Extract those from the search.

For each hit, look up the highlighted word in the hash to get its bounding boxes on the page.

http://blogs.loc.gov/thesignal/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/
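
A sketch of the extraction step for hOCR, using Nokogiri (library choice assumed) to build the word-to-boxes hash from Tesseract's ocrx_word spans:

require 'nokogiri'

# hOCR marks each word up like:
#   <span class='ocrx_word' title='bbox 1694 3875 1899 3925; x_wconf 77'>Panther</span>
def word_boundaries(hocr)
  boxes = Hash.new { |h, k| h[k] = [] }
  Nokogiri::HTML(hocr).css('.ocrx_word').each do |word|
    title = word['title'].to_s
    md = title.match(/bbox (\d+) (\d+) (\d+) (\d+)/) or next
    x0, y0, x1, y1 = md.captures
    boxes[word.text.strip] << { 'x0' => x0, 'y0' => y0,
                                'x1' => x1, 'y1' => y1,
                                'c' => title[/x_wconf (\d+)/, 1] }
  end
  boxes
end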

JSON Word Boundaries

{"Panther": [{
    "x0": "1694",
    "y0": "3875",
    "x1": "1899",
    "y1": "3925",
    "c": "77" },
    { "x0": "1899", "y0": "1543", "x1": "4219", "y1": "1745", "c": "85"
  }],
  "Seale": [{
    "x0": "2983",
    "y0": "2451",
    "x1": "3086",
    "y1": "2496",
    "c": "88"},
    {"x0": "2921", "y0": "2638", "x1": "3015", "y1": "2678", "c": "88"}]
}
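
And a sketch of the lookup step: for each term Solr highlighted on a page, find its boxes and turn them into the xywh fragments that go on the canvas URI (in the real flow the hash comes from the per-page JSON file shown above):

def xywh(box)
  x, y = box['x0'].to_i, box['y0'].to_i
  "#xywh=#{x},#{y},#{box['x1'].to_i - x},#{box['y1'].to_i - y}"
end

boundaries = { 'Panther' => [
  { 'x0' => '1694', 'y0' => '3875', 'x1' => '1899', 'y1' => '3925', 'c' => '77' } ] }

['Panther'].flat_map { |term| Array(boundaries[term]).map { |box| xywh(box) } }
# => ["#xywh=1694,3875,205,50"]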

What would be a good indexing strategy for these, instead of retrieving the JSON files off the filesystem for each search?

#xywh=0,0,0,0

UV requires the fragment hash to get you to the right page.

Sometimes the tokenization or word boundaries are different between what Solr indexes and what goes in the word boundaries JSON file. (That's a bug!)

Phrase searching is difficult. Phrase suggestions are difficult.

Some of the OCR is complete garbage.

Suggestions

https://ocr.lib.ncsu.edu/suggest/nubian-message-1995-04-13?q=afri

Uses the newer Suggester in Solr for suggestions.

This was the most difficult part of it all to get working even halfway well.

Ocracoke code

https://github.com/NCSU-Libraries/ocracoke

A production prototype application!

Quick start: vagrant up with an Ansible provisioner

Issue #1: Make it easy to provide an IIIF-compliant Content Search API

Manifests

http://d.lib.ncsu.edu/collections/catalog/nubian-message-2003-04-01/manifest

Uses jbuilder templates.
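
A heavily pared-down sketch of what one of those templates can look like; @document, its methods, and the URLs here are placeholders, not the real template:

json.set! '@context', 'http://iiif.io/api/presentation/2/context.json'
json.set! '@id', "https://d.lib.ncsu.edu/collections/catalog/#{@document.id}/manifest"
json.set! '@type', 'sc:Manifest'
json.label @document.title
json.sequences do
  json.child! do
    json.set! '@type', 'sc:Sequence'
    json.canvases @document.image_ids do |image_id|
      json.set! '@id', "https://iiif.lib.ncsu.edu/iiif/#{image_id}/canvas"
      json.set! '@type', 'sc:Canvas'
      json.label image_id
    end
  end
end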

Many of the @ids are made up!

I could have used more examples of manifests of different kinds to work from.

The Tripoli validator was super helpful.

Public Interface (SAL)

http://d.lib.ncsu.edu/collections/

Full text:
http://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true

Nubian Message:
http://d.lib.ncsu.edu/collections/catalog?f%5Bispartof_facet%5D%5B%5D=Nubian+Message

UV


Arbitrary Collection Manifests

http://d.lib.ncsu.edu/collections/catalog/manifest?f[ispartof_facet][]=Nubian+Message&f[resource_decade_facet][]=1990s

Sitemap

http://d.lib.ncsu.edu/collections/sal-sitemap.xml

iiif-discuss thread

<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:iiif="http://iiif.io/api/presentation/2.1/">
  <url>
    <loc>http://d.lib.ncsu.edu/collections/catalog/bh020301401</loc>
    <lastmod>2015-06-01T18:02:34Z</lastmod>
    <iiif:manifest>https://d.lib.ncsu.edu/collections/catalog/bh020301401/manifest</iiif:manifest>
  </url>
  ...
</urlset>

Questions