Distinguishing fiction from non-fiction with complex networks

David M. Larue; Lincoln D. Carr; Linnea K. Jones; Joe T. Stevanak

ORAL

Abstract

Complex network measures are applied to networks constructed from English-language texts to demonstrate their initial viability for textual analysis. The corpus consists of novels and short stories obtained from Project Gutenberg and news stories obtained from NPR. Each text is mapped to an unweighted, undirected network whose nodes are the unique word stems in the text, with an edge connecting two stems if they occur within a fixed number of words of each other anywhere in the text. Various combinations of complex network measures are computed for each text's network. Fisher's linear discriminant analysis is then used to construct a parameter that optimizes the separation of the texts by genre. Success rates in the 70% range for correctly distinguishing fiction from non-fiction were obtained with edges defined as co-occurrence within four words, using 400-word samples from 400 texts in each of the two genres, for some combinations of measures such as the power-law exponents of degree distributions and clustering coefficients.
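
The network construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the tokens are assumed to be already stemmed, and "within four words" is interpreted here as two stems being at most `window` positions apart. The clustering coefficient is shown as one example of the measures mentioned.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(tokens, window=4):
    """Build an unweighted, undirected co-occurrence network.

    Nodes are the unique tokens (assumed to be word stems already).
    Two tokens share an edge if they appear at most `window` positions
    apart anywhere in the token sequence (an assumed reading of
    "within four words").
    """
    adjacency = defaultdict(set)
    for i, word in enumerate(tokens):
        adjacency[word]  # register the node even if it gains no edges
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:  # skip self-loops
                adjacency[word].add(other)
                adjacency[other].add(word)
    return dict(adjacency)


def local_clustering(adjacency, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: undefined cases count as zero
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)
```

Under these assumptions, a 400-word sample would be lowercased, split into words, and stemmed, and the resulting list passed to `build_cooccurrence_network(tokens, window=4)`. Measures such as `average_clustering` would then become the features supplied to the discriminant analysis.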

March 7, 2014, 12:15 PM – March 7, 2014, 12:27 PM

Authors

David M. Larue
- Colorado School of Mines
Lincoln D. Carr
- Colorado School of Mines
Linnea K. Jones
- Colorado School of Mines
Joe T. Stevanak
- Colorado School of Mines