Closing in on Client-side IIIF Content Search
Published: 2016-09-25 09:30 -0400
It sounds like client-side search inside may at some point be feasible for a IIIF-compatible viewer, so I wanted to test the idea a bit further. This time I’m not going to try to paint a bounding box over an image like in my last post, but just use client-side search results to create IIIF Content Search API JSON that could be passed to a more capable viewer.
This page is a test for that. Some of what I need in a Presentation manifest I’ve only deployed to staging. From there this example uses an issue from the Nubian Message. First, you can look at how I created the lunr index using this gist. I did not have to use the manifest to do this, but it seemed like a nice little reuse of the API since I’ve begun to include seeAlso links to hOCR for each canvas. The manifest2lunr
tool isn’t very flexible right now, but it does successfully download the manifest and hOCR, parse the hOCR, and create a data file with everything we need.
In the data file are included the pre-created lunr.js index and the documents including the OCR text. What was extracted into documents and indexed is the the text of each paragraph. This could be changed to segment by lines or some other segment depending on the type of content and use case. The id/ref/key for each paragraph combines the identifier for the canvas (shortened to keep index size small) and the x, y, w, h that can be used to highlight that paragraph. We can just parse the ref that is returned from lunr to get the coordinates we need. We can’t get back from lunr.js what words actually match our query so we have to fake it some. This limitation also means at this point there is no reason to go back to our original text for anything just for hit highlighting. The documents with original text are still in the original data should the client-side implementation evolve some in the future.
Also included with the data file is the URL for the original manifest the data was created from and the base URLs for creating canvas and image URLs. These base URLs could have a better, generic implementation with URL templates but it works well enough in this case because of the URL structure I’m using for canvases and images.
Now we can search and see the results in the textareas below.
Raw results that lunr.js gives us are in the following textarea. The ref includes everything we need to create a canvas URI with a xywh fragment hash.
Resulting IIIF Content API JSON-LD:
Since I use the same identifier part for canvases and images in my implementation, I can even show matching images without going back to the presentation manifest. This isn’t necessary in a fuller viewer implementation since the content search JSON already links back to the canvas in the presentation manifest, and each canvas already contains information about where to find images.
I’ve not tested if this content search JSON would actually work in a viewer, but it seems close enough to begin fiddling with until it does. I think in order for this to be feasible in a IIIF-compatible viewer the following would still need to happen:
- Some way to advertise this client-side service and data/index file via a Presentation manifest.
- A way to turn on the search box for a viewer and listen to events from it.
- A way to push the resulting Content Search JSON to the viewer for display.
What else would need to be done? How might we accomplish this? I think it’d be great to have something like this as part of a viable option for search inside for static sites while still using the rest of the IIIF ecosystem and powerful viewers like UniversalViewer.