Natural Language Processing

To support Material Culture in the 19th Century German Novel, the HDW is developing a number of new capabilities:

  • Natural Language Processing, including tokenization, part-of-speech tagging, stemming, and lemmatization.
  • The management of large-scale databases. The latest version of the project database holds more than 37 million tokens.
  • The use of standard linguistic taxonomies. We've started using WordNet, mostly as a way of getting used to handling large taxonomies and of evaluating the usefulness of taxonomy-driven methods against smaller sets of documents in English.

We recently acquired a copy of GermaNet, a German equivalent of WordNet, and have just started examining it.
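
As a rough illustration of the first two capabilities listed above, tokenization and storing tokens in a database, here is a minimal sketch in Python using only the standard library. It is not the HDW's actual pipeline: the table layout, the function names (tokenize, store_tokens, make_db), the naive regex tokenizer, and the sample sentence are all assumptions invented for this example, and a real system would add part-of-speech tags and lemmas from a proper NLP toolkit.

```python
import re
import sqlite3

# Matches runs of letters only; Unicode-aware, so umlauts and ß are kept.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def make_db():
    """Create an in-memory database with a hypothetical one-row-per-token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens (doc_id TEXT, position INTEGER, form TEXT)"
    )
    # An index on the token form keeps lookups fast as the table grows.
    conn.execute("CREATE INDEX idx_tokens_form ON tokens (form)")
    return conn


def store_tokens(conn, doc_id, tokens):
    """Insert one row per token, recording its position in the document."""
    conn.executemany(
        "INSERT INTO tokens (doc_id, position, form) VALUES (?, ?, ?)",
        [(doc_id, i, tok) for i, tok in enumerate(tokens)],
    )
    conn.commit()


# Invented sample sentence, used only to exercise the code.
sample = "Der Tisch stand am Fenster; auf dem Tisch lag ein Buch."

conn = make_db()
store_tokens(conn, "sample", tokenize(sample))

# Count how often one word form occurs across the stored tokens.
count = conn.execute(
    "SELECT COUNT(*) FROM tokens WHERE form = ?", ("tisch",)
).fetchone()[0]
print(count)
```

Storing one row per token with its position keeps the original word order recoverable while still allowing simple frequency queries by word form.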