
Literary Genre and Authorial Style

Funded by the British Academy and Jisc under the Digital Research in the Humanities scheme (DRH18\180084).


Project Abstract

In attribution study (as in literary studies more broadly), literary genre is assumed to affect an author’s style. To mitigate its possible effects, attribution tests generally generate unique stylistic ‘profiles’ or authorial ‘fingerprints’ from samples in the same literary genre as the text in question. A poem of unknown or disputed authorship, for example, will ordinarily be compared only with poems of known authorship. This critical assumption is intuitive but largely untested. The proposed project will interrogate this long-standing assumption about the effect of literary genre on authorial style and, by quantifying and accounting for these effects, explore new ways to attribute authorship using generically diverse texts. To do so, it will use the statistical procedures and machine-learning techniques associated with computational stylistics to analyse samples of drama, poetry, and prose drawn from the EEBO-TCP Phase I corpus of open-access, machine-readable transcriptions of early modern English texts printed between 1473 and 1700.


Computational stylistics uses the computer to identify and analyse complex linguistic patterns that a human reader cannot readily discern. Its primary application is in authorship attribution, in which a document of uncertain authorship is compared with the unique stylistic ‘profiles’ (or authorial ‘fingerprints’) of potential candidates generated from their acknowledged works. Such studies are often hampered when an author’s oeuvre contains too few works in the same literary genre: the conventional wisdom is that plays must be compared only with other plays, poems only with other poems, and so on.

Although hundreds of plays (now in digital form) survive from the ‘Golden Age’ of the English theatre (1567–1642), only a minority of the dramatists active during this period have left us two or more sole-authored, well-attributed plays to test, and for many methods this is too small a sample. For many of these playwrights, the pool of sample writing could be substantially enlarged if only we could confidently use their non-dramatic writing as well – their poems, prose narratives, and so on – in constructing their stylistic profiles. Were it possible to do so, we might begin to resolve many unanswered questions about dramatic authorship, gain valuable insights into the working habits of early modern playwrights and processes of collaboration and rivalry, and build a more accurate, holistic picture of the English Renaissance theatre and its development. However, we simply do not know if, in general, the aspects of a writer’s style that we can capture with computers remain the same in different literary genres. That is the question this project will attempt to answer.

Research Questions and Objectives

Our question, then, is: ‘Does literary genre affect an author’s style?’ Are there stylistic differences between Shakespeare-the-poet and Shakespeare-the-playwright? If so, how significant are these differences? To identify or disqualify potential authors, attribution tests generally rely on samples of a shared literary genre to generate stylistic profiles for comparison with works of uncertain authorship. There are obvious cases where this limitation is absurd: to use an oft-cited example, no one attempting to authenticate a suicide note would demand samples of other suicide notes from all suspects. In such cases, might samples of a different genre (e.g. journal entries, letters, email) be used instead?

A version of the ‘suicide note’ problem hinders attribution studies of English Renaissance drama: many plays from the period survive, but a substantial proportion has been lost; those that survive are limited in number and unequal in distribution, with the result that the vast majority of potential authorial candidates have fewer than two sole-authored, well-attributed plays and are thereby excluded from attribution testing. As in the hypothetical ‘suicide note’ case, might it be possible to use non-dramatic samples to attribute dramatic works?

The proposed project will explore means of significantly enlarging the pool of candidate playwrights eligible for attribution testing of English Renaissance drama, regardless of how many plays by each survive. If stylistic differences across literary genres can be measured and taken into account, then playwrights previously excluded from attribution study of the drama because few or none of their plays survive can be included on the strength of their poetry and/or prose. Thomas Lodge, for example, wrote for the professional theatre – Francis Meres named him among ‘the best for comedy’ in his Palladis tamia (1598) – but current practice would exclude him from attribution testing of the drama because only one of his two surviving plays is sole-authored and well-attributed. However, in addition to practising as a physician, Lodge was also a prolific poet and writer of narrative fiction; thus, if allowances can be made for the effects of literary genre on authorial style, Lodge’s substantial canon of sole-authored, well-attributed poetry and prose makes possible his inclusion in attribution testing of drama. There are many cases much like Lodge’s.


It is not clear just how little writing is required, in principle, to perform reliable authorship attribution. Clearly, at some point as the samples get smaller, any aspects of idiosyncratic style become swamped by random variations arising from local concerns, such as the author’s need to convey a particular mood or create the idiolect of a fictional character. Our necessarily imperfect methods become unreliable before this theoretical threshold is reached, and although the methods might be improved, we may well already be working close to the theoretical limit of what is possible with small samples. Moreover, our existing attribution methods become demonstrably more reliable as our sample sizes increase, so we have good reason to extend, augment, or otherwise enrich the available data to meet the criteria for existing attribution methods (e.g. by expanding the dataset to include more relevant material, or by supplementing the existing data with additional metadata). In particular, we need to know whether our methods are undermined when we try to compare writings in one genre (plays, poems, prose narratives) with writings in another.

To see if we may safely use more evidence than we currently do, the proposed project will employ methods of computational stylistics – from standard statistical procedures (e.g. Principal Components Analysis) to innovative machine-learning techniques (e.g. Delta, Random Forests, Word Adjacency Networks, Zeta) – to analyse the stylistic differences and affinities in a representative sample of English Renaissance playwrights with two or more sole-authored, well-attributed plays and two or more sole-authored, well-attributed works of poetry and/or prose. This initial sample will comprise 120 works by 7 authors (George Chapman, Thomas Dekker, Robert Greene, Thomas Heywood, John Lyly, George Peele, and William Shakespeare), and will be drawn from the existing Early English Books Online Text Creation Partnership (EEBO-TCP) Phase I corpus of machine-readable transcriptions of English printed books, 1473–1700, already in the public domain. The next phase of the project will expand the range of samples to include authors with two or more sole-authored, well-attributed works of poetry and/or prose, but with only one sole-authored, well-attributed play – in other words, writers who would ordinarily be excluded from attribution testing of drama. This enlarged sample will introduce an additional 30 works by 6 new authors (Francis Beaumont, Henry Chettle, Thomas Lodge, Thomas Nashe, Cyril Tourneur, and George Wilkins). In each phase, attribution tests will be conducted by extracting a random selection of the samples to treat as if they were of unknown authorship and using only the remaining samples to generate stylistic profiles.
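
The hold-out procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's implementation: it uses Burrows's Delta in its common textbook form (the mean absolute difference between z-scored word frequencies), and the function names `delta_attribution` and `holdout_trial`, the data layout, and the choice to compare against each author's averaged profile are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def zscores(vector, means, stdevs):
    """Standardise a frequency vector against reference means and deviations."""
    return [(x - m) / s if s > 0 else 0.0
            for x, m, s in zip(vector, means, stdevs)]


def delta_attribution(known, unknown):
    """Attribute `unknown` to the author with the smallest Burrows's Delta.

    known:   dict mapping author -> list of frequency vectors (one per work)
    unknown: a single frequency vector of the same length
    Returns (best_author, {author: delta_score}).
    """
    all_vectors = [v for vectors in known.values() for v in vectors]
    means = [mean(col) for col in zip(*all_vectors)]
    stdevs = [pstdev(col) for col in zip(*all_vectors)]
    z_unknown = zscores(unknown, means, stdevs)
    scores = {}
    for author, vectors in known.items():
        # Assumption: each author's profile is the average of their works.
        profile = [mean(col) for col in zip(*vectors)]
        z_profile = zscores(profile, means, stdevs)
        scores[author] = mean(abs(a - b) for a, b in zip(z_profile, z_unknown))
    return min(scores, key=scores.get), scores


def holdout_trial(samples, n_holdout, seed=0):
    """Hold out random samples, profile authors from the rest, report accuracy.

    samples: list of (author, frequency_vector) pairs.
    At least one sample must remain after holding out n_holdout of them.
    """
    rng = random.Random(seed)
    held = set(rng.sample(range(len(samples)), n_holdout))
    known = {}
    for i, (author, vector) in enumerate(samples):
        if i not in held:
            known.setdefault(author, []).append(vector)
    correct = 0
    for i in held:
        author, vector = samples[i]
        guess, _ = delta_attribution(known, vector)
        correct += guess == author
    return correct / n_holdout
```

Each frequency vector would hold the relative frequencies of a fixed list of marker words in one work. Repeating `holdout_trial` with different seeds, and with profiles built from one genre and tested on another, is one way such cross-genre reliability could be estimated.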

As a proof-of-concept, the project will focus on analysing ‘function’ words – i.e., common words with a primarily grammatical function (such as ‘a’, ‘the’, ‘at’, ‘to’) as opposed to ‘lexical’ or ‘content’ words – since patterns in the frequency and distribution of function words have been shown to be a significant and reliable marker of style, independent of topic. To ensure accurate counting and analysis, the EEBO-TCP transcriptions must be enriched with TEI-XML mark-up to ‘tag’ function words and to distinguish between different homograph forms – i.e., function words with shared spelling but different grammatical functions, such as the noun (abstract, concrete, proper), adjective, and verb forms of the word ‘will’.
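
As a rough illustration of what counting function words involves, the following Python sketch computes the relative frequency of each word in a short list. The word list, the function name `function_word_profile`, and the simple letters-only tokeniser are assumptions made for the example; a real study would use a much longer curated list and the tagged transcriptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list for illustration only.
FUNCTION_WORDS = ["a", "the", "at", "to", "and", "of"]


def function_word_profile(text, words=FUNCTION_WORDS):
    """Return each listed word's share of all tokens in `text`."""
    # Naive tokeniser: lower-case runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: counts[w] / total for w in words}


profile = function_word_profile("The cat sat at the door to the hall")
```

Because this sketch matches on spelling alone, it would count every ‘will’ as the same word whether noun or verb, which is the kind of error the homograph mark-up described above is intended to prevent.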


  • Dr Brett Greatley-Hirsch – Principal Investigator
  • Dr Emily Mayne – Postdoctoral Research Fellow
  • Dr Rachel White – Postdoctoral Research Fellow

© 2011– Brett Greatley-Hirsch