http://ronallo.com/presentations/iiif-demo/
For the impatient: http://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true&f%5Bispartof_facet%5D%5B%5D=Nubian+Message
Jason Ronallo
Head of Digital Library Initiatives
NCSU Libraries
This deck was put together very quickly for a demo on the IIIF community call of September 14, 2016.
Special Collections Asset Management System
Manages descriptive metadata.
We had been out in front with discovery (esp. SEO) and interfaces for digital special collections. Often a model for others.
We already had versions of what I'm talking about:
A pan/zoom paginated viewer, search inside, an API
But nothing as good as what can be done now with IIIF-compatible tools. Being out in front for us now means adopting IIIF and pushing it forward, because we get so many benefits from it.
Oh noes, it is a month until Harambee, and we promised to have search inside for the Nubian Message before this event!
Most of this development was done over the course of a month with 2-3 weeks of intensive work. We had already migrated JP2 images from the Djatoka profile to one that worked well with existing image servers.
Worked closely in a small team.
No worries. We like to have all the pieces strung together into a minimal product, learn more about the problems, and improve each of them over time.
Eyebright: A traditional medicinal herb used to relieve eye strain.
Caches to the file system in a transparent way.
Caches in directories the same way a Level 0 implementation would.
Most image requests never hit the image server, just the web server.
Easy performance win.
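A minimal sketch of the Level 0 style layout, assuming the cache simply mirrors the request path. The function name `cache_path_for` and the cache root are hypothetical, not taken from the production setup. A web server can check this path first and only pass the request on to the image server when the file is missing.

```python
from pathlib import Path, PurePosixPath


def cache_path_for(url_path, cache_root, prefix="/iiif/"):
    """Map an IIIF Image API request path to a file in the cache.

    The cache mirrors the URL layout, so a cached derivative sits at the
    same relative path a Level 0 (static file) implementation would use.
    """
    if not url_path.startswith(prefix):
        raise ValueError(f"not an IIIF path: {url_path}")
    parts = PurePosixPath(url_path[len(prefix):]).parts
    # Either <identifier>/info.json or
    # <identifier>/<region>/<size>/<rotation>/<quality>.<format>
    is_info = len(parts) == 2 and parts[1] == "info.json"
    is_image = len(parts) == 5
    if ".." in parts or not (is_info or is_image):
        raise ValueError(f"unexpected IIIF path: {url_path}")
    return Path(cache_root).joinpath(*parts)
```

For example, `cache_path_for("/iiif/nubian-message-1996-02-22_0001/full/90,/0/default.jpg", "/var/cache/iiif")` points at `full/90,/0/default.jpg` under that image's cache directory.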
Common URLs across various applications we manage:
Only clear images from the cache that do not match these patterns. No need to crawl deeply: if the region is not "full" or "square", remove it. Runs as a cron job.
TODO: Do not expire any derivatives for our most popularly used images. These should always be as fast as possible.
Keep this image because it is small and used to help UV load faster. We keep it around regardless of when it was last used.
iiif/nubian-message-1996-02-22_0001
├── full
│ └── 90,
│ └── 0
│ └── default.jpg
└── info.json
Just in time static site generator for images.
Simple cache expiration.
Improved cache maintainability.
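The expiration rule above ("if not full or square region, remove it") can be sketched roughly like this. This is an illustration under assumptions, not the actual cron job: the function name `expire_cache` is made up, and it only looks one level below each image directory, which is why no deep crawl is needed.

```python
import shutil
from pathlib import Path

# Region directories that are always kept.
KEEP_REGIONS = {"full", "square"}


def expire_cache(cache_root):
    """Remove cached region directories other than 'full' and 'square'.

    Layout assumed: <cache_root>/<identifier>/<region>/<size>/<rotation>/...
    Files such as info.json directly under an identifier are left alone.
    Returns the list of removed region directories.
    """
    removed = []
    for identifier_dir in sorted(Path(cache_root).iterdir()):
        if not identifier_dir.is_dir():
            continue
        for region_dir in sorted(identifier_dir.iterdir()):
            if not region_dir.is_dir() or region_dir.name in KEEP_REGIONS:
                continue
            shutil.rmtree(region_dir)
            removed.append(region_dir)
    return removed
```

Run from cron, a script like this would leave the commonly used full and square derivatives in place and drop the arbitrary tile regions created by pan/zoom viewers.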
Top left (!square) would otherwise be:
/iiif/segIns_001/0,0,6099,6099/350,/0/default.jpg
/iiif/segIns_001/0,2500,6099,6099/350,/0/default.jpg
Trigger jobs from the command line:
bin/rake \
ocracoke:queue_from_ncsu_id[LD3928-A23-1947]
Rake uses a simplistic API in our public Blacklight app:
https://d.lib.ncsu.edu/collections/catalog/LD3928-A23-1947.json
{
"fileName": "LD3928-A23-1947",
"images": [
"LD3928-A23-1947_0001",
"LD3928-A23-1947_0002",
"LD3928-A23-1947_0003",
"LD3928-A23-1947_0004"]
}
Simple token-based API
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"resource": "gng00126",
       "images": ["gng00126_001",
                  "gng00126_002",
                  "gng00126_003",
                  "gng00126_004"]}' \
  -H "Authorization: Token token=token, user=scams" \
  -k http://localhost:8090/api/ocr_resource
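A rough Python equivalent of the two steps above: take the catalog JSON and turn it into the POST for the token API. The function name `build_ocr_request` is hypothetical, and mapping `fileName` to `resource` is an assumption based on the two examples, not something the API documents here.

```python
import json
import urllib.request


def build_ocr_request(catalog_record, api_url, token, user):
    """Build (but do not send) the POST that queues a resource for OCR."""
    payload = {
        # Assumption: the catalog's "fileName" is the OCR resource identifier.
        "resource": catalog_record["fileName"],
        "images": list(catalog_record["images"]),
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Authorization": f"Token token={token}, user={user}",
        },
    )
```

Passing the returned request to `urllib.request.urlopen` would send it; building and sending are kept apart so the payload can be inspected first.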
When a job is complete a notification can be sent to another application.
Page Image: text, hOCR, JSON word boundaries
Resource: concatenated text, PDF with embedded text
/access-images/
└── ocr
└── LD
├── LD3928-A23-1947
│ ├── LD3928-A23-1947.pdf
│ └── LD3928-A23-1947.txt
├── LD3928-A23-1947_0001
│ ├── LD3928-A23-1947_0001.hocr
│ ├── LD3928-A23-1947_0001.json
│ └── LD3928-A23-1947_0001.txt
├── LD3928-A23-1947_0002
│ ├── LD3928-A23-1947_0002.hocr
│ ├── LD3928-A23-1947_0002.json
│ └── LD3928-A23-1947_0002.txt
...
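The directory layout above can be expressed as a small path helper. A sketch only: the name `ocr_path` is made up, and the rule that the top-level directory is the first two characters of the identifier is an assumption read off the single example shown.

```python
from pathlib import PurePosixPath

# Outputs per page image and per resource, as listed above.
PAGE_EXTENSIONS = ("txt", "hocr", "json")
RESOURCE_EXTENSIONS = ("txt", "pdf")


def ocr_path(identifier, extension, root="/access-images/ocr"):
    """Return the path of one OCR output file for a page or a resource."""
    # Assumption: files are sharded by the first two characters of the identifier.
    shard = identifier[:2]
    return PurePosixPath(root) / shard / identifier / f"{identifier}.{extension}"
```

So the hOCR for the first page is `ocr_path("LD3928-A23-1947_0001", "hocr")` and the PDF for the whole resource is `ocr_path("LD3928-A23-1947", "pdf")`.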
https://ocr.lib.ncsu.edu/search/nubian-message-1995-04-13?q=afrikan
Strings probably the wrong thing to use, but it works!
{ "hits": [
{
"@type": "search:Hit",
"annotations": [
"urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation0",
"urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation1"]
}
]}
Canvas not dereferenceable yet:
https://iiif.lib.ncsu.edu/iiif/nubian-message-1995-04-13_0011/canvas#xywh=497,4775,153,37
{"resources": [{
"@id": "urn:nubian-message-1995-04-13:nubian-message-1995-04-13_0011:annotation0",
"@type": "oa:Annotation",
"motivation": "sc:painting",
"resource": {"@type": "cnt:ContentAsText",
"chars": "Afrikan"},
"on":
"https://iiif.lib.ncsu.edu/iiif/nubian-message-1995-04-13_0011/canvas#xywh=497,4775,153,37"
}]}
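A sketch of how a response of this shape could be put together from word boxes. The function names are hypothetical, the box keys (`x0`, `y0`, `x1`, `y1`) are borrowed from the word boundaries JSON, and grouping all annotations for a page into one hit is an assumption based on the example above.

```python
def box_to_xywh(box):
    """Convert corner coordinates into the x,y,w,h of a media fragment."""
    x0, y0, x1, y1 = (int(box[k]) for k in ("x0", "y0", "x1", "y1"))
    return f"{x0},{y0},{x1 - x0},{y1 - y0}"


def search_response(resource, image, word, boxes,
                    canvas_base="https://iiif.lib.ncsu.edu/iiif"):
    """Build annotations and one hit for every box of a matched word."""
    annotations = []
    for n, box in enumerate(boxes):
        annotations.append({
            # URN pattern copied from the example response.
            "@id": f"urn:{resource}:{image}:annotation{n}",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": word},
            "on": f"{canvas_base}/{image}/canvas#xywh={box_to_xywh(box)}",
        })
    hit = {"@type": "search:Hit",
           "annotations": [a["@id"] for a in annotations]}
    return {"resources": annotations, "hits": [hit]}
```

A box with corners (497, 4775) and (650, 4812) yields the fragment `xywh=497,4775,153,37` seen in the example.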
Extract the bounding boxes for each word from the OCR (hOCR or ALTO).
Make a hash where the keys are words on the page and the values are bounding boxes.
Solr provides hit highlights. Extract those from the search.
For each hit, look up the word in the hash to get its bounding boxes.
{"Panther": [{
"x0": "1694",
"y0": "3875",
"x1": "1899",
"y1": "3925",
"c": "77" },
{ "x0": "1899", "y0": "1543", "x1": "4219", "y1": "1745", "c": "85"
}],
"Seale": [{
"x0": "2983",
"y0": "2451",
"x1": "3086",
"y1": "2496",
"c": "88"},
{"x0": "2921", "y0": "2638", "x1": "3015", "y1": "2678", "c": "88"}]
}
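A hash like the one above could be produced from hOCR along these lines. This is a sketch with the standard library only, not the code in the application; it assumes Tesseract-style hOCR where each word is a `span` with class `ocrx_word` and a `title` holding `bbox` and `x_wconf`, and that `c` is the word confidence.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF = re.compile(r"x_wconf (\d+)")


class WordBoxParser(HTMLParser):
    """Collect {word: [bounding boxes]} from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = defaultdict(list)
        self._pending = None  # box of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag != "span" or "ocrx_word" not in classes:
            return
        title = attrs.get("title") or ""
        match = BBOX.search(title)
        if match:
            box = dict(zip(("x0", "y0", "x1", "y1"), match.groups()))
            conf = CONF.search(title)
            if conf:
                box["c"] = conf.group(1)
            self._pending, self._text = box, []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._pending is not None:
            word = "".join(self._text).strip()
            if word:
                self.words[word].append(self._pending)
            self._pending = None


def word_boxes(hocr):
    """Parse an hOCR string into a plain dict keyed by word."""
    parser = WordBoxParser()
    parser.feed(hocr)
    return dict(parser.words)
```

Keys are kept exactly as they appear in the OCR, so any normalization (case, punctuation) would need to match whatever Solr does when tokenizing.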
What would be a good indexing strategy for these, instead of retrieving the JSON files from the filesystem for each search?
UV requires the fragment hash to get you to the right page.
Sometimes the tokenization or word boundaries are different between what Solr indexes and what goes in the word boundaries JSON file. (That's a bug!)
Phrase searching is difficult. Phrase suggestions are difficult.
Some of the OCR is complete garbage.
https://ocr.lib.ncsu.edu/suggest/nubian-message-1995-04-13?q=afri
Uses the newer Suggester in Solr for suggestions.
This was the most difficult part of it all to get working halfway good enough.
https://github.com/NCSU-Libraries/ocracoke
A production prototype application!
Quick start: vagrant up, with an Ansible provisioner
Issue #1: Make it easy to provide a IIIF-compliant Content Search API
http://d.lib.ncsu.edu/collections/catalog/nubian-message-2003-04-01/manifest
Uses jbuilder templates.
Many of the @ids are made up!
I could have used more examples of manifests of different kinds to work from.
The Tripoli validator was super helpful.
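For anyone else looking for examples, here is a minimal sketch of a Presentation API 2.x manifest with a search service, written as a Python function rather than a jbuilder template. The function name and the page dimensions passed in are hypothetical, the Level 2 image profile is an assumption about the image server, and the URL patterns are the ones shown elsewhere in this deck.

```python
def build_manifest(resource, label, pages):
    """Build a manifest dict; pages is a list of (image_id, width, height)."""
    canvases = []
    for n, (image, width, height) in enumerate(pages, start=1):
        service_id = f"https://iiif.lib.ncsu.edu/iiif/{image}"
        canvas_id = f"{service_id}/canvas"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": str(n),
            "width": width,
            "height": height,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{service_id}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "format": "image/jpeg",
                    "width": width,
                    "height": height,
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": service_id,
                        # Assumption: the image server supports Level 2.
                        "profile": "http://iiif.io/api/image/2/level2.json",
                    },
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"http://d.lib.ncsu.edu/collections/catalog/{resource}/manifest",
        "@type": "sc:Manifest",
        "label": label,
        # Content Search service, using the 0.9-era context and profile.
        "service": {
            "@context": "http://iiif.io/api/search/0/context.json",
            "@id": f"https://ocr.lib.ncsu.edu/search/{resource}",
            "profile": "http://iiif.io/api/search/0/search",
        },
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }
```

Serializing the result with `json.dumps` gives something that can be run through a validator before wiring it into a viewer.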
http://d.lib.ncsu.edu/collections/
Full text:
http://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true
Nubian Message:
http://d.lib.ncsu.edu/collections/catalog?f%5Bispartof_facet%5D%5B%5D=Nubian+Message
navDate or ordering other than relevancy not implemented yet
http://d.lib.ncsu.edu/collections/sal-sitemap.xml
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:iiif="http://iiif.io/api/presentation/2.1/">
<url>
<loc>http://d.lib.ncsu.edu/collections/catalog/bh020301401</loc>
<lastmod>2015-06-01T18:02:34Z</lastmod>
<iiif:manifest>https://d.lib.ncsu.edu/collections/catalog/bh020301401/manifest</iiif:manifest>
</url>