HathiTrust Non-Consumptive Research Pilot: Analysing the New Zealand Corpus

The project aims to undertake the first New Zealand analysis of the HathiTrust corpus, assessing the nature, range, and quality of its New Zealand content.

HathiTrust describes itself as “a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. There are more than sixty partners in HathiTrust, and membership is open to institutions worldwide.” The HathiTrust Digital Library stores the collections of partner institutions in digital form, preserving them securely and providing varying degrees of access depending on copyright status. The collection currently holds over 10 million scanned books. Primary contributors include Google, the Internet Archive, and major US research libraries.

Because of copyright restrictions, not all of the books, including many on New Zealand topics, are available to the New Zealand public or to scholars. HathiTrust, however, is developing ‘non-consumptive research services’ that will enable researchers to query the available dataset (currently approximately 2 million books) and have results returned to them without reading the underlying texts directly. Queries can be made either through the HathiTrust API or through Meandre workbenches. Topic models are one of the primary outputs of the Meandre workbenches. Meandre itself is described as a ‘semantic enabled web-driven, dataflow execution environment’ tailored for digital humanities applications. It is part of the Software Environment for the Advancement of Scholarly Research (SEASR) package. Workbenches are accessible online, with other components available as Eclipse plugins.
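To make the idea of non-consumptive querying concrete, the following is a minimal Python sketch under stated assumptions. It is not the HathiTrust API or a Meandre component: the volume identifiers, the sample texts, and the function names tokenize and term_counts are all hypothetical. It only illustrates the principle that a researcher submits a query and receives derived results (here, word counts) rather than the text of the books.

```python
import re
from collections import Counter

# Hypothetical stand-in for volumes a researcher cannot read directly.
# The identifiers and texts are invented for illustration only.
RESTRICTED_VOLUMES = {
    "vol-001": "Wellington is the capital of New Zealand. "
               "New Zealand has two main islands.",
    "vol-002": "A history of sheep farming in Canterbury and Otago.",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(query_terms, volumes=RESTRICTED_VOLUMES):
    """Return per-volume counts for the queried terms only.

    The caller sees aggregate counts keyed by volume identifier,
    never the text of the volumes themselves.
    """
    wanted = {term.lower() for term in query_terms}
    results = {}
    for volume_id, text in volumes.items():
        counts = Counter(tok for tok in tokenize(text) if tok in wanted)
        results[volume_id] = dict(counts)
    return results
```

A call such as term_counts(["Zealand", "Otago"]) returns a dictionary of counts per volume. A real service would run far richer analyses, such as the topic models mentioned above, but the shape of the exchange is similar: a query goes in, and only derived data comes out.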

Submitted by Tim McNamara on