Client-side Search Inside for Images with Bounding Boxes

Published: 2016-09-24 11:00 -0400

It is possible to create a Level 0 IIIF Image API implementation with just static images and an info.json. And some institutions are probably pre-creating Presentation API manifests or even hand-crafting them. All that’s required then is to put those files up behind any web server with no other application code running and you can provide the user with a great viewing experience.
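As a rough illustration of how little a static Level 0 setup needs, here is a sketch that builds the info.json a client would read. It assumes the Image API 2.x context and the level0 profile URI; the function name `levelZeroInfo`, the example.org identifier, and the halving size pyramid are made up for this example and are not taken from any real deployment.

```javascript
// Build a minimal IIIF Image API 2.x info.json for a Level 0 (static files) image.
// `levels` is a hypothetical parameter: how many pre-generated sizes exist,
// each half the width and height of the next.
function levelZeroInfo(id, width, height, levels) {
  const sizes = [];
  for (let i = levels - 1; i >= 0; i--) {
    const factor = 2 ** i;
    sizes.push({
      width: Math.ceil(width / factor),
      height: Math.ceil(height / factor),
    });
  }
  return {
    "@context": "http://iiif.io/api/image/2/context.json",
    "@id": id,
    protocol: "http://iiif.io/api/image",
    width: width,
    height: height,
    // Level 0 clients may only request what is listed here,
    // because there is no server code to resize on demand.
    profile: ["http://iiif.io/api/image/2/level0.json"],
    sizes: sizes,
  };
}

// Example: write the result to <id>/info.json next to the static image files.
console.log(JSON.stringify(levelZeroInfo("https://example.org/iiif/page-0007", 4000, 3000, 3), null, 2));
```

Each entry in `sizes` would correspond to a pre-rendered file on disk, so any plain web server can answer every request the client is allowed to make.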

The one piece that currently requires a server-side application component is the IIIF Content Search API. This usually involves a search index like Solr as well as application code in front of it to convert the results to JSON-LD. I’ve implemented search inside using the Content Search API via Ocracoke. With decent client-side search from libraries like lunr.js, it ought to be possible to create a search inside experience even for a completely static site.

Here’s a simple example:

This works first of all because the page has been OCR’d with Tesseract which outputs hOCR. (I developed Ocracoke in part to help with automating an OCR workflow.) The hOCR output is basically HTML that also includes the bounding boxes of sections of the page based on the size of the digitized image. We can then use this information to draw boxes over top of the corresponding portion of the image. So how do we use search to find the section of the page to highlight?
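To make the hOCR structure concrete, here is a minimal sketch of pulling paragraph bounding boxes out of hOCR. It assumes Tesseract-style markup where each paragraph is a `p` element with class `ocr_par` and a `title` attribute containing `bbox x0 y0 x1 y1`, with `class` appearing before `title`. The function names are hypothetical, and the regex approach is only for illustration; in a browser a real HTML parser would be more robust, and HTML entities are not decoded here.

```javascript
// Extract "bbox x0 y0 x1 y1" from an hOCR title attribute value.
function parseBbox(title) {
  const m = /\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.exec(title);
  if (!m) return null;
  const [x0, y0, x1, y1] = m.slice(1).map(Number);
  return { x0, y0, x1, y1 };
}

// Find every ocr_par paragraph and record its text and bounding box.
// Assumes the class attribute precedes the title attribute in the tag.
function parseParagraphs(hocr) {
  const parPattern =
    /<p\b[^>]*class=['"]ocr_par['"][^>]*title=['"]([^'"]*)['"][^>]*>([\s\S]*?)<\/p>/g;
  const paragraphs = [];
  for (const match of hocr.matchAll(parPattern)) {
    const bbox = parseBbox(match[1]);
    // Drop the nested line/word spans and collapse whitespace to get plain text.
    const text = match[2].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
    if (bbox && text) {
      paragraphs.push({ id: String(paragraphs.length), text: text, bbox: bbox });
    }
  }
  return paragraphs;
}
```

The coordinates that come out are in pixels of the full-size digitized image, which is why they need rescaling before they can be drawn over a smaller derivative.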

For simplicity’s sake, the first step was to use an image of known size. It is possible to do hit highlighting in a tiling pan/zoom viewer like OpenSeadragon, as UniversalViewer shows, but a fixed-size image keeps this example simple. The page image at 20% of the original size fits within the width of this site: https://iiif.lib.ncsu.edu/iiif/ua011_006-001-cn0001-032-001_0007/full/pct:20/0/default.jpg

I then used some code from Ocracoke to rescale the original hOCR to create bounding box coordinates that would match on the resized page image. I parsed that resized hOCR file to find all the paragraphs and recorded their position and text in a JSON file.
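The rescaling step is simple arithmetic. The sketch below is not the Ocracoke code; it is a stand-in that assumes each paragraph is an object with `id`, `text`, and a `bbox` of `x0 y0 x1 y1` in full-size pixels, and it multiplies by 0.2 to match the `pct:20` image. The output field names (`left`, `top`, `width`, `height`) are my own choice for this example.

```javascript
// Scale a full-size bounding box down to the resized image.
function scaleBbox(bbox, factor) {
  return {
    x0: Math.round(bbox.x0 * factor),
    y0: Math.round(bbox.y0 * factor),
    x1: Math.round(bbox.x1 * factor),
    y1: Math.round(bbox.y1 * factor),
  };
}

// Turn one paragraph into the flat record stored in the JSON file:
// the text to index plus the rectangle to draw when it matches.
function toDocument(paragraph, factor) {
  const b = scaleBbox(paragraph.bbox, factor);
  return {
    id: paragraph.id,
    text: paragraph.text,
    left: b.x0,
    top: b.y0,
    width: b.x1 - b.x0,
    height: b.y1 - b.y0,
  };
}

// Example: one paragraph from the full-size hOCR, scaled for the pct:20 image.
const exampleParagraph = {
  id: "0",
  text: "Hello world",
  bbox: { x0: 100, y0: 200, x1: 900, y1: 400 },
};
console.log(JSON.stringify([toDocument(exampleParagraph, 0.2)]));
```

Because the image server also rounds the derivative's dimensions, boxes near the page edge may be off by a pixel, which does not matter for paragraph-level highlighting.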

At this point I could have created the lunr.js index file ahead of time to save the client some work. In this example the client requests the JSON file and adds each document to the index. The Fabric.js library is used to create an HTML canvas, add the page image as a background, and draw and remove rectangles for matches over top of the relevant section. Take a look at the JavaScript to see how this all works. It is pretty simple to put all these pieces together to get a decent search inside experience.
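For readers who want the shape of that client code without opening the page source, here is a hedged sketch rather than the demo's actual script. It assumes the lunr.js 2.x builder style (older releases add documents after the index is created) and Fabric.js releases before version 6, and that each document has `id`, `text`, `left`, `top`, `width`, and `height` fields. The libraries are passed in as arguments so the sketch does not assume how they are loaded; all function names are hypothetical.

```javascript
// Build a lunr index over the paragraph documents.
function buildIndex(lunr, docs) {
  return lunr(function () {
    this.ref("id");
    this.field("text");
    // Arrow function keeps `this` bound to the lunr builder.
    docs.forEach((doc) => this.add(doc));
  });
}

// Look up documents by the ref that lunr returns with each result.
function indexById(docs) {
  return Object.fromEntries(docs.map((doc) => [doc.id, doc]));
}

// Run a query and return the rectangle for each matching paragraph.
function rectsForQuery(index, docsById, query) {
  return index
    .search(query)
    .map((result) => docsById[result.ref])
    .filter(Boolean)
    .map((doc) => ({ left: doc.left, top: doc.top, width: doc.width, height: doc.height }));
}

// Remove the previous highlights and draw new ones; returns the new rectangles
// so the caller can pass them back in as `previous` on the next search.
function drawHighlights(fabric, canvas, previous, rectOptions) {
  previous.forEach((rect) => canvas.remove(rect));
  const rects = rectOptions.map(
    (options) =>
      new fabric.Rect({ ...options, fill: "yellow", opacity: 0.4, selectable: false })
  );
  rects.forEach((rect) => canvas.add(rect));
  return rects;
}

// Browser wiring (not run here; "page" and "paragraphs.json" are placeholder names):
// const canvas = new fabric.Canvas("page");
// canvas.setBackgroundImage(imageUrl, canvas.renderAll.bind(canvas));
// const docs = await (await fetch("paragraphs.json")).json();
// const index = buildIndex(lunr, docs), byId = indexById(docs);
// let shown = [];
// shown = drawHighlights(fabric, canvas, shown, rectsForQuery(index, byId, "trustees"));
```

Building the index ahead of time would replace `buildIndex` with loading a serialized index, at the cost of shipping a second file alongside the paragraph JSON.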

If you gave this a try, you’ll have noticed that this implementation does not highlight words but sections of the page. It might be possible to make this work for individual words, but it would increase the total size of the documents as the bounding boxes for each word would need to be retained. Indexing each word separately would also disrupt the ability to do phrase searching. There’s some discussion in lunr.js issues about adding the ability to get term start positions within a text that may make this possible in the future without these drawbacks. I had originally considered just getting the user to the correct page, but I think targeting some segment of the page is a reasonable compromise.

I don’t use the IIIF Content Search API in this demonstration, but it ought to be enough of a proof of concept to show the way towards a viewer that can support a completely static site including search inside. Does anyone have ideas or thoughts on how a static version of content search could be identified in a IIIF Presentation manifest? Without a URL service point what might this look like?

Preliminary Inventory of Digital Collections

Incomplete thoughts on digital libraries.
