NCSU Libraries
IIIF Implementation

http://ronallo.com/presentations/iiif-demo/

For the impatient: http://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true&f%5Bispartof_facet%5D%5B%5D=Nubian+Message

Jason Ronallo
Head of Digital Library Initiatives
NCSU Libraries

Please Note

This deck was put together very quickly for a demo on the IIIF community call of September 14, 2016.

Implemented IIIF APIs

Existing Infrastructure

SCAMS

Special Collections Asset Management System

Manages descriptive metadata.

SAL

Our public digital collections interface (see "Public Interface (SAL)" below).

Image Server: Djatoka

Motivations

(other than wanting to migrate off of Djatoka)

We had been out in front with discovery (esp. SEO) and interfaces for digital special collections. Often a model for others.

We already had versions of what I'm talking about:
A pan/zoom paginated viewer, search inside, an API

But nothing as good as what can be done now with IIIF-compatible tools. Being out in front for us now means adopting IIIF and pushing it forward, because we get so many benefits from it.

Development Process: Promises

Oh noes! It's a month until Harambee, and we promised to have search inside for the Nubian Message before this event!

Most of this development was done over the course of a month, with 2-3 weeks of intensive work. We had already migrated the JP2 images from the Djatoka profile to one that works well with existing image servers.

Worked closely in a small team.

No worries. We like to have all the pieces strung together into a minimal product, learn more about the problems, and improve each of them over time.

We had had a bespoke implementation of search inside for the Technician newspaper. It wasn't good.

Image Server: Eyebright

Eyebright: A traditional medicinal herb used to relieve eye strain.

Why another image server?

Caching

Caches to the file system in a transparent way.

Caches in directories the same way a Level 0 implementation would.

Most image requests never hit the image server, just the web server.

Easy performance win.
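
How the cache path falls out of the request, as a minimal sketch (CACHE_ROOT and the helper are assumptions, not Eyebright's actual code):

require 'fileutils'

CACHE_ROOT = '/access-images/iiif' # assumed cache location

# Map IIIF Image API parameters onto a Level 0 style directory layout.
def cache_path(identifier, region, size, rotation, quality, format)
  File.join(CACHE_ROOT, identifier, region, size, rotation,
            "#{quality}.#{format}")
end

path = cache_path('nubian-message-1996-02-22_0001',
                  'full', '90,', '0', 'default', 'jpg')
# => "/access-images/iiif/nubian-message-1996-02-22_0001/full/90,/0/default.jpg"

# After rendering a derivative once, write it to this path so the web
# server serves every later request as a static file:
# FileUtils.mkdir_p(File.dirname(path))
# File.binwrite(path, image_bytes)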

Profiles

Common URL patterns across the various applications we manage.

Only clear images from the cache that do not match these patterns, so cleaning doesn't have to crawl deeply: if the region is not "full" or "square", remove it. Runs as a cron job.

TODO: Do not expire any derivatives for our most frequently used images. These should always be as fast as possible.
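
A sketch of what that cron job's pruning rule might look like, assuming the Level 0 style layout above (paths and the kept-region list are assumptions):

require 'fileutils'

CACHE_ROOT = '/access-images/iiif' # assumed cache location
KEEP_REGIONS = ['full', 'square', '!square', 'square!']

# Each image's cache directory has one subdirectory per region, so only
# the second level needs to be examined; pixel-crop regions such as
# "0,0,6099,6099" are removed, everything else is kept.
Dir.glob(File.join(CACHE_ROOT, '*', '*')).each do |region_dir|
  next unless File.directory?(region_dir)
  region = File.basename(region_dir)
  FileUtils.rm_rf(region_dir) unless KEEP_REGIONS.include?(region)
end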

UV requests for
/full/90,/0/default.jpg

Keep this image because it is small and helps UV load faster. We keep these around regardless of when they were last used.

iiif/nubian-message-1996-02-22_0001
├── full
│   └── 90,
│       └── 0
│           └── default.jpg
└── info.json

Static Image Generator

A just-in-time static site generator for images.

Simple cache expiration.

Improved cache maintainability.

Feature Extension

gravityBangs

Problems

Other applications would rather know as little as possible about the images themselves.

full image: portrait

Seguy

square: center gravity

!square: top left gravity

square!: bottom right gravity

So now we record this information about some images, and it is accessible to all the other applications that use the images.

Cache cleaning easier/simpler

Top left (!square) would otherwise be:
/iiif/segIns_001/0,0,6099,6099/350,/0/default.jpg

square!

/iiif/segIns_001/0,2500,6099,6099/350,/0/default.jpg
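
A sketch of how the gravity names can resolve to standard x,y,w,h regions; segIns_001's dimensions (6099x8599) are inferred from the two URLs above:

# Translate a gravityBangs region name into a standard IIIF region.
# Sketch of the idea, not the real code.
def gravity_square(width, height, name)
  side = [width, height].min # the square uses the shorter edge
  case name
  when 'square'  then x, y = (width - side) / 2, (height - side) / 2 # center
  when '!square' then x, y = 0, 0                                    # top left
  when 'square!' then x, y = width - side, height - side             # bottom right
  end
  "#{x},#{y},#{side},#{side}"
end

gravity_square(6099, 8599, '!square') # => "0,0,6099,6099"
gravity_square(6099, 8599, 'square!') # => "0,2500,6099,6099"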

Migration from Djatoka:
Better Images!

OCR and Search Inside with Ocracoke

Job Queues

No UI (Yet)

Trigger jobs from the command line:

bin/rake \
ocracoke:queue_from_ncsu_id[LD3928-A23-1947]

Public API

Rake uses a simplistic API in our public Blacklight app:

https://d.lib.ncsu.edu/collections/catalog/LD3928-A23-1947.json

{
  "fileName": "LD3928-A23-1947",
  "images": [
    "LD3928-A23-1947_0001",
    "LD3928-A23-1947_0002",
    "LD3928-A23-1947_0003",
    "LD3928-A23-1947_0004"
  ]
}

If you have any kind of API or way to get an identifier for a resource and a list of its page images, you can queue OCR jobs the same way.
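
A sketch of that fetch-and-queue step (the enqueueing itself is stubbed out here, since the job class is internal to Ocracoke):

require 'net/http'
require 'json'

id  = 'LD3928-A23-1947'
uri = URI("https://d.lib.ncsu.edu/collections/catalog/#{id}.json")
doc = JSON.parse(Net::HTTP.get(uri))

doc['images'].each do |image_id|
  # In the real app this would enqueue an OCR job for the page image.
  puts "queue OCR for #{doc['fileName']} / #{image_id}"
end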

API: Queuing

Simple token-based API

curl -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"resource": "gng00126",
       "images": ["gng00126_001",
          "gng00126_002",
          "gng00126_003",
          "gng00126_004"]}' \
  -H "Authorization: Token token=token, user=scams" \
  -k http://localhost:8090/api/ocr_resource

API: Notifications

When a job is complete a notification can be sent to another application.
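
Roughly like this, though the endpoint and payload here are entirely made up for illustration:

require 'net/http'
require 'json'

# Hypothetical callback endpoint in the other application (e.g. SCAMS).
uri = URI('https://scams.example.org/api/ocr_complete')
req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
req.body = { 'resource' => 'gng00126', 'status' => 'complete' }.to_json
Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }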

Outputs: txt, hOCR, JSON, PDF

Page Image: text, hOCR, JSON word boundaries
Resource: concatenated text, PDF with embedded text

/access-images/
└── ocr
    └── LD
        ├── LD3928-A23-1947
        │   ├── LD3928-A23-1947.pdf
        │   └── LD3928-A23-1947.txt
        ├── LD3928-A23-1947_0001
        │   ├── LD3928-A23-1947_0001.hocr
        │   ├── LD3928-A23-1947_0001.json
        │   └── LD3928-A23-1947_0001.txt
        ├── LD3928-A23-1947_0002
        │   ├── LD3928-A23-1947_0002.hocr
        │   ├── LD3928-A23-1947_0002.json
        │   └── LD3928-A23-1947_0002.txt
        ...

Workflow between applications

Content Search API

https://ocr.lib.ncsu.edu/search/nubian-message-1995-04-13?q=afrikan

Strings are probably the wrong thing to use, but it works!

{ "hits": [
  {
    "@type": "search:Hit",
    "annotations": [
        "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation0",
        "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation1"]
  }
]}

Hit Highlighting and
Bounding Boxes

Canvas not dereferenceable yet:
https://iiif.lib.ncsu.edu/iiif/nubian-message-1995-04-13_0011/canvas#xywh=497,4775,153,37

{"resources": [{
  "@id": "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation0",
  "@type": "oa:Annotation",
  "motivation": "sc:painting",
  "resource": {"@type": "cnt:ContentAsText",
               "chars": "Afrikan"},
  "on":
    "https://iiif.lib.ncsu.edu/iiif/nubian-message-1995-04-13_0011/canvas#xywh=497,4775,153,37"
}]}
Where does it get the bounding boxes?

Library of Congress Approach

Extract the bounding boxes for each word from the OCR (hOCR or ALTO).

Make a hash where the keys are words on the page and the values are bounding boxes.

Solr provides hit highlights. Extract those from the search.

For each hit, look up the highlighted word in the hash to get its bounding boxes on the page.

http://blogs.loc.gov/thesignal/2014/08/making-scanned-content-accessible-using-full-text-search-and-ocr/
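
A sketch of the extraction step for hOCR, using Nokogiri (library choice assumed) to build the word-to-boxes hash from Tesseract's ocrx_word spans:

require 'nokogiri'

# hOCR marks each word up like:
#   <span class='ocrx_word' title='bbox 1694 3875 1899 3925; x_wconf 77'>Panther</span>
def word_boundaries(hocr)
  boxes = Hash.new { |h, k| h[k] = [] }
  Nokogiri::HTML(hocr).css('.ocrx_word').each do |word|
    title = word['title'].to_s
    md = title.match(/bbox (\d+) (\d+) (\d+) (\d+)/) or next
    x0, y0, x1, y1 = md.captures
    boxes[word.text.strip] << { 'x0' => x0, 'y0' => y0,
                                'x1' => x1, 'y1' => y1,
                                'c' => title[/x_wconf (\d+)/, 1] }
  end
  boxes
end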

JSON Word Boundaries

{"Panther": [{
    "x0": "1694",
    "y0": "3875",
    "x1": "1899",
    "y1": "3925",
    "c": "77" },
    { "x0": "1899", "y0": "1543", "x1": "4219", "y1": "1745", "c": "85"
  }],
  "Seale": [{
    "x0": "2983",
    "y0": "2451",
    "x1": "3086",
    "y1": "2496",
    "c": "88"},
    {"x0": "2921", "y0": "2638", "x1": "3015", "y1": "2678", "c": "88"}]
}
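
And a sketch of the lookup step: for each term Solr highlighted on a page, find its boxes and turn them into the xywh fragments that go on the canvas URI (in the real flow the hash comes from the per-page JSON file shown above):

def xywh(box)
  x, y = box['x0'].to_i, box['y0'].to_i
  "#xywh=#{x},#{y},#{box['x1'].to_i - x},#{box['y1'].to_i - y}"
end

boundaries = { 'Panther' => [
  { 'x0' => '1694', 'y0' => '3875', 'x1' => '1899', 'y1' => '3925', 'c' => '77' } ] }

['Panther'].flat_map { |term| Array(boundaries[term]).map { |box| xywh(box) } }
# => ["#xywh=1694,3875,205,50"]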

What would be a good indexing strategy for these, instead of retrieving the JSON files off the filesystem for each search?

#xywh=0,0,0,0

UV requires the fragment hash to get you to the right page.

Sometimes the tokenization or word boundaries are different between what Solr indexes and what goes in the word boundaries JSON file. (That's a bug!)

Phrase searching is difficult. Phrase suggestions are difficult.

Some of the OCR is complete garbage.

Suggestions

https://ocr.lib.ncsu.edu/suggest/nubian-message-1995-04-13?q=afri

Uses the newer Suggester in Solr for suggestions.

This was the most difficult part of it all to get working even halfway well.

Ocracoke code

https://github.com/NCSU-Libraries/ocracoke

A production prototype application!

Quick start: vagrant up with an Ansible provisioner

Issue #1: Make it easy to provide an IIIF-compliant Content Search API

Manifests

http://d.lib.ncsu.edu/collections/catalog/nubian-message-2003-04-01/manifest

Uses jbuilder templates.
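
A heavily pared-down sketch of what one of those templates can look like; @document, its methods, and the URLs here are placeholders, not the real template:

json.set! '@context', 'http://iiif.io/api/presentation/2/context.json'
json.set! '@id', "https://d.lib.ncsu.edu/collections/catalog/#{@document.id}/manifest"
json.set! '@type', 'sc:Manifest'
json.label @document.title
json.sequences do
  json.child! do
    json.set! '@type', 'sc:Sequence'
    json.canvases @document.image_ids do |image_id|
      json.set! '@id', "https://iiif.lib.ncsu.edu/iiif/#{image_id}/canvas"
      json.set! '@type', 'sc:Canvas'
      json.label image_id
    end
  end
end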

Many of the @ids are made up!

I could have used more examples of manifests of different kinds to work from.

The Tripoli validator was super helpful.

Public Interface (SAL)

http://d.lib.ncsu.edu/collections/

Full text:
http://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true

Nubian Message:
http://d.lib.ncsu.edu/collections/catalog?f%5Bispartof_facet%5D%5B%5D=Nubian+Message

UV


Arbitrary Collection Manifests

http://d.lib.ncsu.edu/collections/catalog/manifest?f[ispartof_facet][]=Nubian+Message&f[resource_decade_facet][]=1990s

Sitemap

http://d.lib.ncsu.edu/collections/sal-sitemap.xml

iiif-discuss thread

<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:iiif="http://iiif.io/api/presentation/2.1/">
  <url>
    <loc>http://d.lib.ncsu.edu/collections/catalog/bh020301401</loc>
    <lastmod>2015-06-01T18:02:34Z</lastmod>
    <iiif:manifest>https://d.lib.ncsu.edu/collections/catalog/bh020301401/manifest</iiif:manifest>
  </url>
  ...
</urlset>

Questions