Early Print Genre Classification Project

Summer workshop report, 2016

Humanities Digital Workshop

Summer 2016


Project Summary:


Around 40,000 of the roughly 53,000 texts in EEBO-TCP have subject heading tags. These tags come from the Library of Congress Subject Headings (LCSH), and they identify people, places, topics, time periods, formats, or genres pertaining to each work. The MARC 21 bibliographic format records which of these categories each tag belongs to. While the EEBO-TCP tags themselves do not retain MARC information, the ESTC bibliographic data keeps a record of the top-level tag categories in its MARC records. We were able to use the ESTC metadata to cross-check tags about genre: first we found all tags marked as "genre" or "form" in the ESTC metadata, and then we searched for those tags in the EEBO-TCP subject headings themselves. This gave us a more or less complete list of the texts that are marked with information about genre.
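The cross-check described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the input layouts (a list of `(term, category)` pairs for ESTC and a dictionary of headings per text for EEBO-TCP), the text IDs, and the sample headings are all hypothetical, and splitting headings on `--` to separate subdivisions is an assumption about how the headings are written.

```python
def normalize(term):
    # Assumption: matching should ignore case, outer whitespace, and a trailing full stop.
    return term.strip().rstrip(".").strip().lower()


def genre_tagged_texts(estc_tags, eebo_headings):
    """Return {text_id: sorted genre terms} for texts with a genre/form heading.

    estc_tags: iterable of (term, category) pairs (hypothetical layout).
    eebo_headings: dict mapping text_id -> list of subject heading strings.
    """
    # Step 1: collect every term that ESTC marks as "genre" or "form".
    genre_terms = {
        normalize(term)
        for term, category in estc_tags
        if category in {"genre", "form"}
    }

    # Step 2: look for those terms in the EEBO-TCP subject headings.
    matches = {}
    for text_id, headings in eebo_headings.items():
        parts = set()
        for heading in headings:
            # Assumption: subdivisions within a heading are separated by "--".
            parts.update(normalize(part) for part in heading.split("--"))
        found = sorted(parts & genre_terms)
        if found:
            matches[text_id] = found
    return matches


# Hypothetical sample data for illustration only.
estc_tags = [
    ("Sermons", "genre"),
    ("London (England)", "place"),
    ("Broadsides.", "form"),
]
eebo_headings = {
    "text-1": ["Sermons -- Early works to 1800", "London (England)"],
    "text-2": ["Plague -- England"],
    "text-3": ["Broadsides"],
}
result = genre_tagged_texts(estc_tags, eebo_headings)
```

Exact string matching after normalization will miss variant spellings of the same heading, which is one reason the resulting list is only "more or less" complete.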

Many of these genre categories overlap, are incomplete, or otherwise need an expert eye. To complete our task of identifying texts by genre, we hope to compile as complete an account of these tags as possible. Simultaneously, we hope to build a feature set allowing us to analyze the corpus intrinsically.


To create a feature set, we parsed the XML tags from the marked-up EEBO-TCP texts into bigrams and counted them for each document. Using XML trigrams is infeasible because of the number of possible combinations. We also built LDA topic models of the texts, and we then used principal component analysis (PCA) to visualize both the XML bigram counts and the topic models. For the topic models, the number of principal components needed to explain 95% of the variance levelled off: it was roughly the same for the 200-topic and 300-topic models.
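As a rough illustration of the tag-bigram counting, the sketch below uses Python's standard library. It assumes that a "bigram" means two tags that open one after the other in document order, which is only one possible reading; the function name and the sample markup are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_bigram_counts(xml_text):
    """Count pairs of consecutive element tags in document order."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the tree depth-first, i.e. in the order tags open.
    # Drop any "{namespace}" prefix so that tags compare by local name.
    tags = [element.tag.split("}")[-1] for element in root.iter()]
    return Counter(zip(tags, tags[1:]))


# Hypothetical fragment for illustration only.
sample = (
    "<TEXT><BODY><LIST><LI>a</LI><LI>b</LI><LI>c</LI></LIST>"
    "<P>d</P></BODY></TEXT>"
)
counts = tag_bigram_counts(sample)
```

With V distinct tag names there are at most V² possible bigrams but V³ possible trigrams, so the feature space grows by a further factor of V when moving to trigrams.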


For supervised classification, we used a Naive Bayes classifier on the combined results of the PCA of the XML bigrams and the 100-topic LDA model. A low-variance filter and normalization were used in feature selection. We used a 70-30 train-test split, and the whole set was composed of equal numbers of tagged documents and randomly sampled untagged documents. The composition of the set affects the accuracy of the classifier, so it is important to justify whatever composition and split are used in the future. Leave-one-out cross-validation for the smaller subgenres remains unfinished and would be useful.
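The composition of the set and the 70-30 split can be sketched as below. This covers only the sampling and splitting steps, not the classifier itself; the function name, the fixed seed, and the document IDs are hypothetical, and the sketch assumes there are at least as many untagged documents as tagged ones.

```python
import random


def balanced_split(tagged_ids, untagged_ids, train_fraction=0.7, seed=0):
    """Build a balanced labelled set and split it into train and test parts.

    Returns two lists of (doc_id, label) pairs, where label 1 means tagged
    and label 0 means untagged.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    tagged_ids = list(tagged_ids)
    # Sample as many untagged documents as there are tagged ones.
    sampled_untagged = rng.sample(list(untagged_ids), len(tagged_ids))

    examples = [(doc_id, 1) for doc_id in tagged_ids]
    examples += [(doc_id, 0) for doc_id in sampled_untagged]
    rng.shuffle(examples)

    cut = round(train_fraction * len(examples))
    return examples[:cut], examples[cut:]


# Hypothetical IDs for illustration only.
tagged = [f"T{i}" for i in range(50)]
untagged = [f"U{i}" for i in range(200)]
train, test = balanced_split(tagged, untagged)
```

Because the untagged half is a random sample, a different seed gives a different set, which is one concrete way the composition of the set can affect the measured accuracy.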


Ideas for future improvement include counting repeated tags differently to reduce how far the counts are skewed towards <LI,LI> or similar bigrams of repeated tags. Additionally, using an alternative to PCA with a discrete number of topics, such as a community detection algorithm, might increase future accuracy in using semi-supervised or unsupervised methods to determine indigenous topics. Using better word vectorization would be another way to achieve this.
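One possible way to count repeated tags differently is to collapse each run of identical consecutive tags into a single tag before forming bigrams. The sketch below is an assumption about what such a scheme could look like, not a method the project has adopted; the function name and the sample tag sequence are hypothetical.

```python
from collections import Counter
from itertools import groupby


def collapsed_bigram_counts(tags):
    """Count tag bigrams after collapsing runs of identical consecutive tags."""
    # groupby groups adjacent equal items, so each run contributes one tag.
    collapsed = [tag for tag, _run in groupby(tags)]
    return Counter(zip(collapsed, collapsed[1:]))


# Hypothetical tag sequence: two lists with several items each.
tags = ["LIST", "LI", "LI", "LI", "LI", "P", "P", "LIST", "LI", "LI"]
counts = collapsed_bigram_counts(tags)
```

This removes self-pairs such as <LI,LI> entirely. If the presence of repetition is itself informative, a gentler option would be to keep the raw counts and scale them down, for example by taking their logarithm.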


Current Tag Hierarchy:


Slides from intermediate presentation to colleagues.

Slides from 7-minute final presentation.