
We used 61 raw text files from 13 different authors found in the public-domain Project Gutenberg (http://www.gutenberg.org/). Specifically, we downloaded the text files from the Gutenberg Dataset (https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html).

 

We then wrote code using Python’s Natural Language Toolkit (nltk) library to process these text files for stylistic attributes. To standardize these attributes, we normalized the counts per 1000 tokens (instances of words) whenever we measured the frequency of a specific word or punctuation mark. In total, we searched for 18 stylistic attributes (a sketch of this extraction step follows the list):

 

  • total length of text

  • mean word length

  • mean sentence length

  • standard deviation of sentence lengths

  • weighted number of commas, semicolons, quotation marks, and exclamation marks per 1000 tokens

  • weighted number of occurrences of and, but, however, if, that, more, must, might, this, and very per 1000 tokens

 

We derived these features from Hanlein’s research on individual style features.
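For concreteness, here is a minimal sketch of what this extraction step could look like with nltk. The function name, file handling, and the exact word and punctuation lists are illustrative assumptions rather than our exact code:

import string

import nltk
import numpy as np

# nltk.download("punkt") may be needed once for the tokenizers.

# Illustrative lists matching the attributes above (assumptions, not our exact code):
# 4 global statistics + 4 punctuation marks + 10 function words = 18 attributes.
FUNCTION_WORDS = ["and", "but", "however", "if", "that",
                  "more", "must", "might", "this", "very"]
PUNCTUATION_MARKS = [",", ";", '"', "!"]


def extract_features(path):
    """Return the stylistic attributes of one raw text file as a dict."""
    with open(path, encoding="utf-8") as f:
        text = f.read()

    tokens = nltk.word_tokenize(text)
    lowered = [t.lower() for t in tokens]
    words = [t for t in tokens if t not in string.punctuation]
    sentence_lengths = [len(nltk.word_tokenize(s))
                        for s in nltk.sent_tokenize(text)]
    total = len(tokens)

    features = {
        "total_length": total,
        "mean_word_length": np.mean([len(w) for w in words]),
        "mean_sentence_length": np.mean(sentence_lengths),
        "std_sentence_length": np.std(sentence_lengths),
    }
    # Frequencies weighted per 1000 tokens.
    for mark in PUNCTUATION_MARKS:
        features[f"per_1000_{mark}"] = 1000 * text.count(mark) / total
    for word in FUNCTION_WORDS:
        features[f"per_1000_{word}"] = 1000 * lowered.count(word) / total
    return features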

Figure 1: Processed Data

Since our data set was small, we used the entire set for training and measured the accuracy of each algorithm with 10-fold cross-validation. We used three machine learning algorithms: k-Nearest Neighbors, Decision Trees, and Naive Bayes. We varied the features used by our k-NN algorithm, and we used the AdaBoost ensemble method in our decision tree learning.
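A rough sketch of that evaluation setup with scikit-learn is below. The feature matrix is stubbed with placeholder data of the right shape, and the hyperparameters are illustrative assumptions rather than the values we actually used:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data shaped like our feature table: 61 texts x 18 attributes,
# labeled with 13 authors. Substitute the matrix built by the extraction step.
rng = np.random.default_rng(0)
X = rng.normal(size=(61, 18))
y = rng.integers(0, 13, size=61)

classifiers = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    # AdaBoost's default base learner is a depth-1 decision tree (stump).
    "AdaBoost (decision trees)": AdaBoostClassifier(n_estimators=50),
    "Naive Bayes": GaussianNB(),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv)  # 10-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")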
