robertmarchman's blog

The non-Google public domain dataset

For the past few weeks I've been working with HathiTrust's 300,000-document non-Google digitized public domain collection. The dataset is easy to get your hands on: it is public domain and has no access restrictions, so all it takes is a quick email to HathiTrust. They prefer to distribute the collection via rsync, and once your IP address is authorized on their server you're good to go.
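For anyone who hasn't pulled a collection over rsync before, here is a rough sketch of what the transfer command might look like once your IP is authorized. The host name, module name, and destination path are placeholders I made up for illustration; HathiTrust gives you the real ones when they grant access. The sketch only builds and prints the command, so nothing is transferred until you run it yourself.

```python
import shlex


def build_rsync_command(host, module, dest, dry_run=False):
    """Build an rsync command that pulls from an rsync daemon module.

    host, module, and dest are hypothetical placeholders; substitute
    the values supplied by the data provider.
    """
    cmd = ["rsync", "-av", "--partial"]  # archive mode, verbose, keep partial files
    if dry_run:
        cmd.append("--dry-run")  # list what would be copied without copying it
    cmd.append(f"{host}::{module}/")  # rsync daemon syntax: host::module/path
    cmd.append(dest)
    return cmd


if __name__ == "__main__":
    # Placeholder values, not the real HathiTrust server or module.
    command = build_rsync_command(
        "rsync.example.org", "pd_collection", "/data/hathitrust", dry_run=True
    )
    print(shlex.join(command))
```

Doing a dry run first is a cheap way to check that the authorization works and to see how much data is about to arrive before committing the disk space.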

Submitted by robertmarchman on

HathiTrust Non-consumptive Research Pilot: Analysing the New Zealand Corpus

Arrived in Auckland on Friday. What more could I ask for? Beautiful weather (well, at least for the first two days), a fantastic city (or so say my first impressions), and an interesting project to work on. For the rest of the summer (another benefit - I get to escape from winter in the northern hemisphere!) I’ll be working on the HathiTrust pilot, sponsored by James Smithies.

Submitted by robertmarchman on