Chances and Challenges of Computer Assisted Authorship Attribution

Research Paper (undergraduate) 2014 18 Pages

English Language and Literature Studies - Literature


Table of Contents

1. Introduction

2. Defining 'Author', 'Authorship' and 'Authorship Marker'

3. Authorship Attribution
3.1 Traditional Approach
3.2 Computer Assisted Approach

4. Computer Assisted Authorship Attribution
4.1 Quantitative Techniques
4.2 Challenges of CAAA
4.3 Chances of CAAA
4.4 Further Scope of Application

5. Conclusion

Works cited

1. Introduction

"Attribution studies, in order to succeed, need a linguistic theory and methodology responsive to the fundamental feature of natural languages, that they weave together words of all kinds in order to create meaning" (Vickers 2009,135)

Is a certain piece written by Shakespeare or is it not? This question, and others like it concerning different authors and a plethora of anonymous works, has been asked many times but remains unanswered in many cases.

Scholars of authorship attribution have attempted to solve this problem for as long as written language has existed. To do so, the work under discussion is generally compared with the corpus of works known to derive from a certain author, with respect to any characteristic that is considered distinctive.

Nevertheless, opinions are deeply divided regarding both the features and peculiarities which have to be considered significant for authorship attribution and the methods which should be applied to determine, find and compare the authorship markers.

Linguistics provides the tools for a detailed consideration of language. With regard to authorship attribution, lexical choices and the structure of phrases, sentences and texts are the most important areas, because authors leave a personal thumbprint on a text through their choice of words and the way they arrange these words in phrases, sentences and semantic relations throughout their works.

The traditional way to assign authorship to a piece of writing requires close reading of the anonymous piece and of one or more selected works by the candidate author, in order to spot similar passages, word choices, semantic relations and syntactic structures. If there is a sufficient amount of overlap, the formerly anonymous text may be added to the author's corpus.

Unfortunately, this method not only takes a lot of time but can also hardly be generalized, because it requires extensive prior knowledge of the author, his or her style and the era in which the work was written, and because unique passages are mostly considered the most significant evidence.

A more recent approach in authorship studies arose with the emergence of computers and their increasing capability to process natural language. This approach is also based on the identification of authorship markers and the comparison of their occurrences across works, but it uses statistical frequency distributions and calculated likelihoods rather than human judgment and a deep understanding of language.

To the present day, neither of these two approaches has produced methods that are satisfactory overall, universally applicable and generally reliable. One can argue, however, that implementing human linguistic knowledge in language processing programs could finally provide a fast and valid method for authorship attribution.

This paper explains and discusses that assertion. To this end, the next chapter (2) defines the terms author, authorship and authorship marker, to provide a basis for the remarks that follow. The third chapter briefly outlines the differences between the traditional approach to authorship attribution and computer assisted authorship attribution, to contextualize the paper.

The main part of this work is chapter four, where the chances of computer assisted authorship attribution are weighed against its general challenges, after the most recent techniques have been explained. Chapter 4.4 draws a connection to forensic linguistics, another field of application in which authorship attribution is becoming increasingly important.

The last chapter summarizes the key points of the paper and outlines the most important findings concerning (computer assisted) authorship attribution in general and its impact on literary history in particular.

2. Defining 'Author', 'Authorship' and 'Authorship Marker'

For any attempt to determine the author of a play, poem, book or any other piece of writing, it is crucial first to define what 'author' and 'authorship' mean and, accordingly, which markers shall be considered significant for applying those definitions to the piece under discussion.

This chapter works out the definitions applied in the present text, in order to provide a basis for the following study of computer assisted authorship attribution. In "Defining Authorship", Harold Love presents several distinct definitions of authorship and names collaborative authorship, precursory authorship, executive authorship, declarative authorship and revisionary authorship as subcategories (cf. Chapter 3).

The difference between these forms lies in the group of persons who are considered responsible and influential, and in the nature of their involvement. The executive author is the actor, that is, the person who actually writes; the revisionary author corrects, adds and deletes; the declarative author commissions or implements a work; and the precursory author is responsible for ideas and topics stated in one of his or her own executive writings. The most common notion today, however, is that of the executive author, the one name printed on the cover of the book. Nonetheless, it cannot be denied that it is impossible to produce even an idea, let alone write a text, without some influence from others.

In this text, 'the author' or the name of the author shall be regarded as a label rather than as a reference to a single, specific person. Speaking of Shakespeare's plays, for example, refers to the canon of works which are considered to be Shakespearean drama and bear his name in the original version. The term 'authorship' shall comprise all those persons, events and conditions which are significant and influential for the actual piece of writing. 'Authorship markers', accordingly, are those features which the works in an author's canon share with one another and which distinguish them from the works of others.

From these definitions, the following understanding of authorship studies emerges: authorship studies aim to detect common features in a single text or a corpus of texts and to derive a pattern of authorship markers from them (or to state the nature of these markers at the outset of the analysis). Applied to any other text, this pattern should make it possible to predict with high probability whether that text should be ascribed to the corpus or not. The corpus is generally referred to by a label, usually the name of an author, regardless of whether the author is historically confirmed as a person and whether the circumstances and influential persons contributing to the text are known.

3. Authorship Attribution

In keeping with the definitions established in the previous chapter, "attribution studies" are regarded here as "attempts to distinguish the traces of agency, that cohere in pieces of writing, sometimes discovering one singular trace but often a subtle entanglement of several or many" (Love, 32).

This chapter provides a brief distinction between the approaches to authorship attribution considered in this text. The terms "traditional approach" and "computer assisted approach" are taken from, and defined according to, Brian Vickers's essay "Shakespeare and Authorship Studies in the Twenty-First Century" (2011).

Rather than comparing the two approaches in as much detail as possible, this chapter only places them in their historical context and describes their relation to each other and their use in authorship studies in general. This helps to clarify the current state and further requirements of authorship attribution studies, in line with the general objective of this paper.

3.1 Traditional Approach

The so-called traditional approach to authorship attribution is as old as written language and was already being discussed around 300 BC. Performing traditional authorship studies means attempting to assign authorship by close reading of the work under discussion and of the corpus of plays already assigned to the author, and by comparing "the full spectrum of dramatic language, from the minutia of verbal contractions to the larger significance of repeated words and concepts" (Vickers 2009, 114).

Thus, the traditional approach to authorship attribution is very precise, as it considers language and dramatic style in their syntactic characteristics as well as in their semantic relations and pragmatic use, and takes both qualitative and quantitative measures into account. Standard practice for determining the author in the traditional way includes a detailed analysis of the piece under discussion and the comparative corpus in direct line-by-line comparison, also in the context of its time and place of origin, and an attempt to discover similarities and overlaps of any kind, provided that it has been stipulated beforehand which similarities and overlaps will be considered significant for the examination in question.

However, the traditional approach also has its disadvantages, as the whole procedure is highly dependent on the person or group performing it and on their individual concept of authorship and authorship markers. Specific results for one author or a single piece of writing are therefore hardly transferable or generalizable, which makes traditional authorship attribution a very time-consuming and protracted field of study and often leaves ample scope for opposing interpretations.

3.2 Computer Assisted Approach

The 'non-traditional' or 'computer assisted approach' emerged from the need to find a generally valid, fast and comparable way of assigning authorship. This rather new field of authorship studies and its possibilities have grown with the general spread and improvement of computers. Whereas improved hardware enables researchers to access and process larger amounts of data, natural language and its processing have only slowly come into the focus of software development. Because it deals with natural language, as opposed to programming languages, this approach is sometimes also called 'linguistic processing'.

Just like the traditional approach, computer assisted authorship attribution is based on a comparison between two texts, or between a text and a corpus of already assigned texts. Put briefly, the computer is used to calculate the most frequent or most significant features as authorship markers, according to the guidelines which the researchers consider helpful. Such guidelines include, for example, a focus on function words or on lexical words, or the exclusion of certain information, parts of speech or single words.
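
To make the idea of calculated authorship markers more concrete, the following is a minimal sketch in Python, not a method taken from the literature discussed here. It assumes a deliberately tiny, hypothetical list of function words, a simple regular-expression tokenizer, and the mean absolute difference of relative frequencies as the comparison measure; all names in it (FUNCTION_WORDS, tokenize, profile, distance) are illustrative only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words (an assumption for
# illustration); actual studies would work with much longer lists.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "a", "that", "it"]


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def profile(text, markers=FUNCTION_WORDS):
    """Return the relative frequency of each marker word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # An empty text has no measurable profile; report zeros.
        return {word: 0.0 for word in markers}
    return {word: counts[word] / total for word in markers}


def distance(profile_a, profile_b):
    """Mean absolute difference between two profiles over the same markers."""
    return sum(abs(profile_a[w] - profile_b[w]) for w in profile_a) / len(profile_a)


if __name__ == "__main__":
    # Placeholder strings; in practice these would be full texts.
    assigned = "The king and the queen went to the castle in the north."
    disputed = "It is said that a ghost walks in the halls of the castle."
    print(distance(profile(assigned), profile(disputed)))
```

A smaller distance only indicates that the two texts use the chosen words at similar rates. Which words are counted and how the profiles are compared are exactly the kind of guidelines that the researchers have to set in advance.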

However, past and current attempts at computer assisted authorship attribution have hardly produced any satisfactory results, as they have worked with quantitative methods and measures only, relying solely on counting and comparing the most frequent words and their distribution in texts and leaving out every other level of language.



Institution / College: Justus-Liebig-University Giessen


