RESULTS

Our most accurate result, 91.8% cross-validation (CV) accuracy, came from the K-Nearest Neighbor (K-NN) algorithm with K = 1, a Euclidean distance measure, and neighbor votes weighted by 1/distance (Figure 2). We believe that K = 1 performed best because the dataset was small. With K = 1, a text’s single nearest neighbor was likely to be another text by the same author. Increasing K, however, would likely pull in neighboring texts by different authors, thereby decreasing accuracy.

Figure 2: K-NN Chart
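To make the effect of K concrete, here is a minimal sketch of distance-weighted K-NN in plain Python. It is not our project code: the function name `knn_predict`, the two-number feature vectors, and the labels `author_a` and `author_b` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=1):
    """Predict a label for `query` from (feature_vector, label) pairs.

    Uses Euclidean distance and weights each neighbor's vote by 1/distance.
    """
    # Sort all training texts by distance to the query and keep the k closest.
    neighbors = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        if distance == 0:
            return label  # an exact match decides the vote outright
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)

# Hypothetical toy data: one text by author_a, two texts by author_b.
train = [
    ((18.0, 7.5), "author_a"),
    ((15.0, 5.0), "author_b"),
    ((15.5, 5.2), "author_b"),
]
query = (17.0, 6.6)

print(knn_predict(train, query, k=1))  # author_a
print(knn_predict(train, query, k=3))  # author_b
```

In this toy set the single closest text is by the correct author, but with K = 3 the two more distant texts by the other author outvote it, which mirrors the behavior we describe above.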

We also used decision trees, which produced a 10-fold CV accuracy of 49%. We believe this accuracy was much lower because of the variability of style within an author’s body of work: two of an author’s texts were likely to be similar, but the entire body was not. To increase the decision tree’s accuracy, we used the ensemble method AdaBoost (Figure 3). As we increased the number of boosting iterations, the 10-fold CV accuracy also increased. However, it never matched the accuracy of the K-NN algorithm.

Figure 3: Decision Tree with AdaBoost Chart
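The idea that more boosting iterations can raise accuracy can be sketched with a small AdaBoost implementation. This is a simplified illustration under stated assumptions, not our project code: it handles only two classes (labels +1 and -1), uses one-level decision stumps as the weak learner instead of full decision trees, and reports training accuracy on made-up data instead of 10-fold CV accuracy. All function names are hypothetical.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """One-level tree: predict `polarity` at or below the threshold, else its opposite."""
    return polarity if row[feature] <= threshold else -polarity

def best_stump(X, y, weights):
    """Search all features, thresholds and polarities for the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(w for row, label, w in zip(X, y, weights)
                            if stump_predict(row, feature, threshold, polarity) != label)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost(X, y, rounds):
    """Train `rounds` stumps, reweighting the examples after each round."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = best_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, feature, threshold, polarity))
        # Misclassified examples gain weight; correctly classified ones lose weight.
        weights = [w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, X, y):
    return sum(predict(ensemble, row) == label for row, label in zip(X, y)) / len(X)

# Made-up one-feature data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
for rounds in (1, 2, 3):
    print(rounds, accuracy(adaboost(X, y, rounds), X, y))
```

On this toy data a single stump misclassifies two of the six points, while three boosting rounds classify all of them correctly.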

Finally, we determined which attributes were most important in predicting authorship (Figure 4).

Figure 4: Correlation Coefficients of Attributes
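One way to rank attributes by correlation is sketched below. This is an assumption-laden illustration, not a record of how Figure 4 was produced: it assumes the Pearson coefficient and a 0/1 indicator for a single author as the target, and the attribute names and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

def rank_attributes(columns, target):
    """Return (name, coefficient) pairs sorted by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Invented values for four texts; target is 1 when the text is by one given author.
columns = {
    "mean_word_len": [4.2, 4.6, 4.5, 4.1],
    "sentence_len_std": [7.5, 8.0, 3.0, 3.5],
}
is_given_author = [1, 1, 0, 0]

for name, coefficient in rank_attributes(columns, is_given_author):
    print(f"{name}: {coefficient:.3f}")
```

Sorting by absolute value treats strong negative correlations as equally informative as strong positive ones.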

From our analysis, we conclude that the standard deviation of sentence length in an author’s text is the most important feature in determining authorship and that K-NN was the most accurate algorithm for predicting authorship. We also note that author styles are variable, so higher values of K are less likely to predict authorship accurately. However, in most cases, at least one text by an author is similar to another text by the same author.
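For readers who want to compute the sentence-length feature themselves, the sketch below shows one possible definition. The details are assumptions, not a description of our extraction code: sentences are split on runs of `.`, `!` and `?`, length is counted in whitespace-separated words, and the population standard deviation is used.

```python
import re
import statistics

def sentence_lengths(text):
    """Word counts per sentence, splitting on runs of ., ! or ?."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text)]
    return [len(s.split()) for s in sentences if s]

def sentence_length_std(text):
    """Population standard deviation of sentence length, in words."""
    return statistics.pstdev(sentence_lengths(text))

sample = "One two three. Four five. Six seven eight nine."
print(sentence_lengths(sample))
print(sentence_length_std(sample))
```

A naive splitter like this miscounts abbreviations such as "Mr." as sentence ends, so a real pipeline would likely need a more careful tokenizer.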

One challenge we considered was the homogeneity of the dataset: the majority of the authors were white men. Additionally, since the texts were in the public domain, every text selected was written before 1923, meaning that these attributes may not be relevant for modern-day texts.

A future direction to explore is allowing users to input their own text and see whose writing style it most resembles.

Stylometric Analysis of Open-Source Literature

EECS 349 Machine Learning, Northwestern University