About this Website

The Dictionary of New Zealand Biography (1940) by Guy Hardy Scholefield is available on the NZ History website as two PDFs (Volume 1, Volume 2). Surprisingly, the Dictionary does not include an index of names. A text search is possible (thanks to OCR), but quickly locating a specific name is difficult.

I decided to index the Dictionary so that names could be matched against searches on my 🌳 Ancestor Search Helper site. This project was also a way to try out various AI tools in my workflow.

Copyright

My understanding is that the Dictionary is released under the Creative Commons Attribution-NonCommercial 3.0 New Zealand Licence (according to the NZ History website footer).

I offer this indexed version of Scholefield's Dictionary of New Zealand Biography as a freely available resource, and I believe that this is consistent with the spirit of the Creative Commons licence.

Indexing Notes

I expected the indexing to be straightforward, but hit an early hurdle. Simply copying and pasting the text of the PDFs appeared to work, but on closer inspection many entries were garbled because the original OCR had not separated the columns of text correctly.

A fresh OCR was performed on the PDF files with WebAssembly PDF Viewer and Editor. This produced plain TXT files, each covering 50 PDF pages, which could then be combined into a single file for each of Volumes 1 and 2.

The new OCR output was largely satisfactory, apart from a few pages where the scanner was slanted and the columns again became confused. These pages were corrected manually.

Cleanup began by removing lines containing a single capitalised word (eg, 'MURTON', the running head printed at the top of each page) and removing the printed page numbers.
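As an illustration only, this cleanup step could be sketched in Python along the following lines. The two patterns are assumptions: that a page heading such as 'MURTON' is a line holding one all-capitals ASCII word, and that a printed page number is a line holding only digits. The function name strip_page_furniture is hypothetical and this is not the script actually used.

```python
import re

# Assumption: a page heading is a line holding one all-capitals word.
RUNNING_HEAD = re.compile(r"[A-Z][A-Z'\-]+")
# Assumption: a printed page number is a line holding only digits.
PAGE_NUMBER = re.compile(r"\d{1,4}")


def strip_page_furniture(text: str) -> str:
    """Drop page-heading lines and printed page-number lines."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # fullmatch means the whole line must fit the pattern, so an entry
        # line such as "MURTON, ..." followed by text is kept.
        if RUNNING_HEAD.fullmatch(stripped) or PAGE_NUMBER.fullmatch(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)
```

A rule like this would also drop a one-word name that sits alone on a line, so the output still needs checking by eye.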

The next step was to index the entries by name and place them into a CSV (spreadsheet) file. A simple PHP script was written with the aid of GPT-4. Each entry was identified by finding capitalised words at the beginning of a line (eg, SAVAGE, MICHAEL JOSEPH).
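The original script was PHP and is not reproduced here. A rough Python sketch of the same idea, finding a run of capitalised words at the start of a line, might look like the following. The regular expression, the function name split_entries, and the dictionary keys are all assumptions made for illustration.

```python
import re

# Assumption: a headword is one or more all-capitals words, optionally
# followed by a comma and all-capitals given names.
HEADWORD = re.compile(
    r"^(?P<surname>[A-Z][A-Z'\-]+(?: [A-Z][A-Z'\-]+)*)"
    r"(?:,\s+(?P<given>[A-Z][A-Z'\-]+(?: [A-Z][A-Z'\-]+)*))?"
)


def split_entries(lines):
    """Group lines into entries, starting a new entry at each headword."""
    entries = []
    current = None
    for line in lines:
        match = HEADWORD.match(line)
        if match:
            current = {
                "surname": match.group("surname"),
                "given": match.group("given") or "",
                "text": line,
            }
            entries.append(current)
        elif current is not None:
            # Continuation line: append to the entry in progress.
            current["text"] += "\n" + line
    return entries
```

A pattern like this treats any line that begins with an all-capitals word as a new entry, so abbreviations at the start of a line produce false matches, and names containing macrons or initials are missed. That is in keeping with the manual editing described on this page.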

A fair amount of manual editing was done for Māori chiefs, who were often recorded with one-word names, and/or multiple aliases. Aliases were indexed into a separate 'Also Known As' column.

The PDF page numbers were inserted into the TXT files by the OCR software. These were recognised with a preg_match call and saved against each entry, and the page number line was then removed from the text of the biography.
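The exact form of the page marker written by the OCR software is not recorded here, so the sketch below assumes a hypothetical marker line such as "--- Page 12 ---". It uses Python's re.match in place of PHP's preg_match. It carries the most recent page number forward and drops the marker line itself.

```python
import re

# Hypothetical marker format; the real OCR output may differ.
PAGE_MARKER = re.compile(r"^-+\s*Page\s+(\d+)\s*-+$")


def tag_lines_with_pages(lines):
    """Return (pdf_page, line) pairs, with marker lines removed."""
    current_page = None  # stays None until the first marker is seen
    tagged = []
    for line in lines:
        match = PAGE_MARKER.match(line.strip())
        if match:
            current_page = int(match.group(1))
            continue  # the marker is not part of any biography
        tagged.append((current_page, line))
    return tagged
```

With lines tagged this way, the page saved against an entry would be the page attached to its first line.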

Further cleanup included:

Inspection of the resulting CSV revealed that hyphenated names and very long names which spanned multiple lines had often not been detected correctly, requiring manual fixes.

Lastly, each entry was given a unique 'handle' for URL purposes (eg, william-bayly-2).
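One plausible way to produce handles of this shape is sketched below in Python. It assumes the first occurrence of a name gets no suffix and later duplicates get -2, -3 and so on. The source shows only the single example william-bayly-2, so that numbering rule, the macron stripping, and the function names slugify and assign_handles are assumptions.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lower-case a name and join its parts with hyphens."""
    # Decompose accented letters, then drop the combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def assign_handles(names):
    """Give each name a handle, numbering repeats from 2 upwards."""
    seen = {}
    handles = []
    for name in names:
        base = slugify(name)
        seen[base] = seen.get(base, 0) + 1
        count = seen[base]
        handles.append(base if count == 1 else f"{base}-{count}")
    return handles
```

For example, assign_handles(["William Bayly", "William Bayly"]) returns ["william-bayly", "william-bayly-2"].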

The final output is a reasonably clean CSV file, from which the content of this site is retrieved.

Closing notes

The majority of entries have at least a date of death and often a date of birth, but the date formatting is too inconsistent to reliably index with an algorithm.

The text of the Introduction was straightforward to format from plain text to HTML, with the exception of the long table of sources in the Bibliography section, which was difficult to transcribe accurately without a lot of manual correction. Eventually I was able to use the Claude 3 'Opus' model, which transcribed screenshots of the Bibliography pages directly to HTML with excellent accuracy.

About Me

I'm Luke Howison, a web developer based in Lower Hutt. I'm building a suite of free digital research tools.