A Named Entity Recognition and Extraction Method for Mid-Distance Reading in Spanish


Computational literary studies usually rely on systems such as Stanford NER to identify characters, locations, and other entities in texts. This approach is highly accurate for English, and it is probably also the best option for large corpora, where a certain number of missed tags can be tolerated. The accuracy is partly due to the wide availability of digitized English-language texts for training the tagger, which in turn allows computational analysis to be conducted at larger scales (K. Bode; T. Underwood; M. Jockers; etc.). Automatic entity recognition taggers are nonetheless inadequate for languages scarcely represented in the digital cultural record (Risam), even though the majority of Internet users are non-English speakers (Whose Knowledge?). How, then, can we make up for the lack of reliable computational analysis tools for languages not extensively supported by automatic tagging?

Drawing on the results of a character network analysis of 19th-century Spanish novels, I present the practicalities of creating a database of entities with several variables. This database makes it possible, on the one hand, to normalize entity recognition across the full, compound, or informal names of characters and, on the other, to easily extract and weight their presence and connections in the texts. I show how I applied this method, combined with literary analysis, to study the interaction between fictional and historical characters.
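As a rough illustration of the kind of entity database described above, the following Python sketch maps name variants to a canonical entity, counts mentions, and weights connections by co-occurrence within a paragraph. Everything in it is an assumption made for illustration: the entity table, the `type` variable, the sample sentences, and the paragraph-level co-occurrence rule are hypothetical and are not taken from the database or corpus used in the study.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity table: canonical name -> variables describing the entity.
# The names, variants, and "type" values are illustrative assumptions only.
ENTITIES = {
    "Gabriel Araceli": {"type": "fictional",
                        "variants": ["Gabriel Araceli", "Gabriel", "Gabrielillo"]},
    "Inés": {"type": "fictional",
             "variants": ["Inés", "Inesilla"]},
    "Manuel Godoy": {"type": "historical",
                     "variants": ["Manuel Godoy", "Godoy", "Príncipe de la Paz"]},
}


def build_matcher(entities):
    """Return a regex over all variants and a variant -> canonical lookup."""
    lookup = {}
    for canonical, record in entities.items():
        for variant in record["variants"]:
            lookup[variant] = canonical
    # Longest variants first, so a full name wins over a bare first name.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b")
    return pattern, lookup


def analyze(paragraphs, entities):
    """Count normalized mentions and paragraph-level co-occurrences."""
    pattern, lookup = build_matcher(entities)
    mentions = Counter()
    edges = Counter()
    for paragraph in paragraphs:
        found = [lookup[m.group(0)] for m in pattern.finditer(paragraph)]
        mentions.update(found)
        # One weighted edge per pair of distinct entities sharing a paragraph.
        for pair in combinations(sorted(set(found)), 2):
            edges[pair] += 1
    return mentions, edges


# Invented sample sentences, used only to exercise the functions.
SAMPLE = [
    "Gabrielillo habló con Inesilla sobre Godoy.",
    "El Príncipe de la Paz no conocía a Gabriel Araceli.",
    "Inés esperaba a Gabriel.",
]

mentions, edges = analyze(SAMPLE, ENTITIES)

# Keep only connections that link a fictional and a historical character.
mixed = {
    pair: weight
    for pair, weight in edges.items()
    if ENTITIES[pair[0]]["type"] != ENTITIES[pair[1]]["type"]
}

print(mentions)
print(edges)
print(mixed)
```

In this sketch the informal forms "Gabrielillo" and "Gabriel" are both counted under "Gabriel Araceli", and the `mixed` dictionary isolates the fictional-historical connections. A curated table of this kind sits between fully manual tagging and automatic NER: the variants are entered by hand once, and the extraction is then repeatable across texts.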

Finally, a work-in-progress mapping that involves bilingual data helps me further argue in favor of a system that we can situate between manual TEI-XML tagging and automatic NER.

Presentation slides in PDF here