
We used 61 raw text files from 13 different authors found in the public-domain Project Gutenberg (http://www.gutenberg.org/). Specifically, we downloaded the text files from the Gutenberg Dataset (https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html).

 

We then wrote code using Python’s Natural Language Toolkit (nltk) library to process these text files for stylistic attributes. To standardize these attributes, we normalized the counts per 1000 tokens (instances of words) whenever we measured the frequency of a specific word or punctuation mark. In total, we searched for 18 stylistic attributes (a sketch of this extraction step follows the list):

 

  • total length of text

  • mean word length

  • mean sentence length

  • standard deviation of sentence lengths

  • weighted number of commas, semicolons, quotation marks, and exclamation marks per 1000 tokens

  • weighted number of occurrences of and, but, however, if, that, more, must, might, this, and very per 1000 tokens

 

We derived these features from Hanlein’s research on individual style features.
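For concreteness, here is a minimal sketch of what this extraction step could look like with nltk. The function name, file handling, and the exact word and punctuation lists are illustrative assumptions rather than our exact code:

import string

import nltk
import numpy as np

# nltk.download("punkt") may be needed once for the tokenizers.

# Illustrative lists matching the attributes above (assumptions, not our exact code):
# 4 global statistics + 4 punctuation marks + 10 function words = 18 attributes.
FUNCTION_WORDS = ["and", "but", "however", "if", "that",
                  "more", "must", "might", "this", "very"]
PUNCTUATION_MARKS = [",", ";", '"', "!"]


def extract_features(path):
    """Return the stylistic attributes of one raw text file as a dict."""
    with open(path, encoding="utf-8") as f:
        text = f.read()

    tokens = nltk.word_tokenize(text)
    lowered = [t.lower() for t in tokens]
    words = [t for t in tokens if t not in string.punctuation]
    sentence_lengths = [len(nltk.word_tokenize(s))
                        for s in nltk.sent_tokenize(text)]
    total = len(tokens)

    features = {
        "total_length": total,
        "mean_word_length": np.mean([len(w) for w in words]),
        "mean_sentence_length": np.mean(sentence_lengths),
        "std_sentence_length": np.std(sentence_lengths),
    }
    # Frequencies weighted per 1000 tokens.
    for mark in PUNCTUATION_MARKS:
        features[f"per_1000_{mark}"] = 1000 * text.count(mark) / total
    for word in FUNCTION_WORDS:
        features[f"per_1000_{word}"] = 1000 * lowered.count(word) / total
    return features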

Figure 1: Processed Data

Since our data set was small, we used the entire set for training and measured the accuracy of each algorithm with 10-fold cross-validation. We used three machine learning algorithms: k-Nearest Neighbors, Decision Trees, and Naive Bayes. We varied the features used by our k-NN algorithm, and we used the AdaBoost ensemble method in our decision tree learning.
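A rough sketch of that evaluation setup with scikit-learn is below. The feature matrix is stubbed with placeholder data of the right shape, and the hyperparameters are illustrative assumptions rather than the values we actually used:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data shaped like our feature table: 61 texts x 18 attributes,
# labeled with 13 authors. Substitute the matrix built by the extraction step.
rng = np.random.default_rng(0)
X = rng.normal(size=(61, 18))
y = rng.integers(0, 13, size=61)

classifiers = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    # AdaBoost's default base learner is a depth-1 decision tree (stump).
    "AdaBoost (decision trees)": AdaBoostClassifier(n_estimators=50),
    "Naive Bayes": GaussianNB(),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv)  # 10-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")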
