Big data and the study of reading

By Daniel Allington and Andrew Salway

We’re really looking forward to running the workshop on big data and digital reading on 6 March 2014. Here is your required reading… just kidding, but we’ve selected two discussion pieces that we think could be interesting to talk about, so if you could have a look at them ahead of the workshop and post any initial thoughts below, that would be brilliant.

Both of the pieces relate to the first two words of our title.

What is ‘big data’? Or: how big is ‘big’?

To a computer scientist, ‘big data’ has a fairly precise meaning: it refers to any dataset that exceeds the limits of commonly used tools. Clearly, that’s a moving target, as the capabilities of software and hardware are constantly being expanded. Nonetheless, ‘big data’ is a meaningful concept for the scientists and programmers who must work around the limits of current technology to cope with the staggering volumes of data now being produced.

If we were to adapt the computer scientist’s definition to the tools used in the humanities and social sciences, ‘big’ data would probably be something much smaller. In English studies, for example, the most commonly used ‘tool’ is the technique of close reading, so one could arguably use the words ‘big data’ in reference to any text or collection of texts too large for an individual researcher to subject to close analysis. As has been observed in a discussion of the various species of ‘big data’ hype, ‘[s]ince humanists usually still work with small numbers of examples, any study with n > 50 is in danger of being described as an example of “big data.”’ (Underwood, 2013) In sociology, by comparison, the primary quantitative research tool has been the sample survey, generally limited to a few hundred or thousand respondents – so one might perhaps be inclined to start calling sociological data ‘big’ once it includes records for hundreds of thousands of individuals.

But such numbers are not what the words ‘big data’ usually call to mind. In practice, ‘big data’ tends to mean a ‘data-driven’ rather than ‘hypothesis-driven’ approach to research, which aims not to formulate hypotheses and then test them against data collected for its relevance to those hypotheses, but to start with a dataset and then identify patterns within it, often through the use of ‘machine learning’ algorithms. This is sometimes described as ‘inductive’ rather than ‘deductive’ research, which probably sounds more radical from the point of view of a scientist than from that of a humanist – although it is necessarily quantitative, and as such quite alien to much traditional humanist research, especially in literary studies.

‘Big data’ as an approach to social scientific research

The first discussion piece we’d like to draw your attention to is a talk on ‘What Big Data Means for Social Science’ by economist Sendhil Mullainathan (2013). This is available at the following address in both video and transcript form:

In the main part of his talk, Mullainathan argues that by collecting as much data as possible – including data that we have no specific reason for believing to be relevant – and then searching algorithmically for the variables, or combinations of variables, that predict the outcomes we are interested in, we can avoid being led up the garden path by our hypotheses. In the context of Mullainathan’s main (imaginary) example – a false medical hypothesis that appears to be supported by data only because the data collected was too narrow to show that something unexpected was going on – this seems highly convincing, although the talk as a whole could be regarded as something of a ‘sales pitch’ for data-driven research. By contrast, the position Mullainathan takes in response to audience questions is far more nuanced, and shows keen awareness of the problems of the approach he has been espousing.

In this context, it should be observed that the purely statistical approach to linguistics which Mullainathan provides as a real-world model for the scientific use of big data from 06:01 to 07:05 is highly controversial. It has been very successful in answering certain kinds of questions (see Halevy, Norvig, and Pereira, 2009), but because of its indifference to the meaning of the behaviour it models, linguists such as Noam Chomsky have argued that this should not be considered success in any ‘sense that science has ever been interested in.’ (Pinker and Chomsky 2011, parag. 2) At 25:56, Mullainathan is challenged by the philosopher Daniel Dennett, who makes the suggestion (related to Chomsky’s argument) that the ‘big data’ approach may yield predictions, but not understanding; at 31:44, Mullainathan is challenged by the psychologist Daniel Kahneman, who argues that a variable that does not have much predictive power may still be relevant and interesting. From 29:14 to 30:40, Mullainathan argues that there are limits to the ‘big data’ approach which cannot be solved simply through more data or more computing power, and that theory-driven hypothesis testing will always be required.

‘Big data’ as an approach to humanist research

The second discussion piece is ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’ by historian Tim Hitchcock (2013a), a talk subsequently published in textual form on Hitchcock’s blog. It is available at the following address:

Hitchcock is the lead academic behind a very large humanist research project: Old Bailey Online. He argues both that the potential for close reading of historical documents has been greatly extended with the availability of linked data, and that visualisations of quantitative data can help us to see individual events in terms of long-term trends. But he also asks whether ‘our practice as humanists and historians is being driven by the technology, rather than being served by it’ (2013a, parag. 72) due to a tendency to ‘ask questions we know that computers can answer’ (parag. 71). He goes on as follows:

in choosing to move towards a ‘big data’ approach… and in adopting the forms of representation and analysis that come with big data, all of us are naturally being pushed subtly towards a kind of social science, and a kind of positivism, which has been profoundly out of favour for at least the last thirty years. (Hitchcock 2013a, parag. 70)

Note that in a comment on Brian Lennon’s (2013) discussion of this piece, Hitchcock (2013b) agrees that this critique of the digital humanist version of ‘big data’ could be taken further. It should also be recognised that similar points to Hitchcock’s have been made in critiques of humanist positivism that do not directly tie its rise to the notion of ‘big data’ (see, for example, Eyers 2013).

‘Big data’ and reading

The internet provides unprecedented opportunities for gathering very large datasets about reading – digital and otherwise. As such, there is the potential for an apparent revolution in reader study, comparable to the explosion of qualitative audience research that followed the realisation that ‘naturally occurring’ data on media consumption practices was easily accessible via the internet. But in view of Hitchcock’s observations about positivism, there is perhaps a still greater need to guard against the assumption that online data can, as one scholar astutely put it, ‘unproblematically unveil those cultural processes and mechanisms which cultural studies has been positing’ (Hills 2002, p. 175). There are many reasons for this, including the fact that, as Jen Schradie has shown, internet research leaves working-class people under-represented because ‘people with lower levels of income and education are not accessing or creating online content nearly as much as people with a college degree and a comfortable middle-class lifestyle.’ (2013, parag. 5; see also Schradie, 2011) This means that when we study a phenomenon – reading, say, or political activism – through analysis of internet ‘big data’, we may in fact be studying only an elite variant of that phenomenon. For example, Schradie’s research on 35 political organisations in the US found that ‘[t]hree of the most active… offline have virtually no online presence’, while ‘[o]ne… does not come up on Google searches’ and ‘[h]alf… are not active on Twitter.’ (2013, parag. 10) Such disparities will of course be multiplied in an international context, given disparities in internet access. And they are paralleled by disparities in the ability to carry out ‘big data’ research, which, as Ben Williamson has observed, ‘concentrate[s] data analysis and knowledge production in a few highly resourced research centres, including the R&D labs of corporate technology companies.’ (2014, parag. 10) Indeed, long before the term ‘big data’ became a political, commercial, and academic buzzword, it was recognised that the industrial creation and analysis of vast social datasets far exceeded the capacities of scholarly research – and argued that the most urgent project for social science might therefore be one of ‘critically engaging with the extensive data sources which now exist, and not least, campaigning for access to such data where they are currently private.’ (Savage and Burrows 2007, p. 896) Online retail giant Amazon’s collection of information on the behaviour and preferences of readers can thus, for example, be seen as an opportunity for research, in that some of this information is available to members of the public, including academics – but it can also be seen as a phenomenon that in itself requires the most critical form of scholarly attention.

Ahead of the workshop…

At the workshop, we’ll be asking what ‘big data’ could bring – in every sense – to humanist and social-scientific research on digital reading. We hope that this process of discussion will start in responses to the video and essay posted above. In addition to general points about ‘big data’, it would be great to read about the themes and questions that you are interested in with regard to (digital) reading, and how these may, or may not, benefit from ‘big data’ and its associated assumptions.


Eyers, Tom (2013). ‘The perils of the “digital humanities”: new positivisms and the fate of literary theory’. Postmodern Culture 23 (2). Available online at
Halevy, Alon, Norvig, Peter, and Pereira, Fernando (2009). ‘The unreasonable effectiveness of data’. IEEE Intelligent Systems 24 (2): 8-12. Available online at
Hills, Matt (2002). Fan cultures. London and New York: Routledge.
Hitchcock, Tim (2013a). ‘Big Data for Dead People: Digital Readings and the Conundrums of Positivism’. Keynote address, Reading Historical Sources in the Digital Age, 5 December, Centre virtuel de la connaissance sur l’Europe, Luxembourg. Published online, 9 December. Accessed 24 Jan 2014 at
Hitchcock, Tim (2013b). Comment on ‘On Digital Humanities “Surprisism”’, December. Accessed 24 Jan 2014 at
Lennon, Brian (2013). ‘On Digital Humanities “Surprisism”’, 11 December. Accessed 24 January 2014 at
Mullainathan, Sendhil (2013). ‘What Big Data Means for Social Science’. July, HeadCon ’13. Published online, 11 November. Accessed 24 Jan 2014 at
Pinker, Steven and Chomsky, Noam (2011). ‘Pinker/Chomsky Q&A from MIT150 panel’. Accessed 28 February 2014 at
Savage, Mike and Burrows, Roger (2007). ‘The coming crisis of empirical sociology’. Sociology 41 (5): 885-899.
Schradie, Jen (2011). ‘The digital production gap: the digital divide and Web 2.0 collide’. Poetics 39 (2): 145-168.
Schradie, Jen (2013). ‘Big data not big enough? How the digital divide leaves people out’. 31 July. Accessed 1 March at
Underwood, Ted (2013). ‘Against (talking about) “big data”’. 10 May. Accessed 27 January 2014 at
Williamson, Ben (2014). ‘The end of theory in digital social research?’ 20 January. Accessed 1 March at

6 comments on “Big data and the study of reading”
  1. Bronwen Thomas says:

    Thanks, both, looking forward to the workshop!! This is a really helpful introduction to the issues. Our online survey for RRO did bear out the idea that the readers on the forums we looked at were from a narrow band in terms of social class, but then this is probably more, or as much, to do with who reads literary/middlebrow fiction as with who has access to the internet?

    • Daniel Allington says:

      Thanks for the comment, Bronwen. And that’s a good example: the reading of literary and middlebrow fiction and the publishing of content on the Internet are both markers of high social status, so a study that relied on data scraped from user-produced content might be expected to have a literary/middlebrow bias. When one is doing survey research like yours, such a bias can be uncovered by asking demographic questions of your respondents (which I remember you did). When one is harvesting data from websites, it’s more likely to be obscured. This could lead to a systematically distorted picture of reading (both digital and otherwise). For example, the Cultural Capital and Social Exclusion survey found romance to be the only book genre whose frequency of reading was inversely correlated with the educational level of the reader, so Jen Schradie’s findings about educational level and digital production would suggest that the reading of this genre will be under-represented on book blogs and probably also on websites like BookCrossing and LibraryThing.

      In practice, though, we may find that some forms of online content production have a very different dynamic than others. Of perhaps the greatest relevance to our concerns is the fact that you can post a review direct from a Kindle, which arguably means there’s less of a barrier to Amazon customer reviewing than to certain other forms of online content production, such as running a WordPress blog – and this might mean better representation of other kinds of readers, reading material, and reading experiences. For example, Barry Unsworth’s Booker-prizewinning Sacred Hunger, published in 1992, currently has just 79 customer reviews, while Diana Gabaldon’s Outlander, a successful romance novel published the same year, has 3602. Even Michael Ondaatje’s The English Patient, also published in 1992 and benefiting not only from a shared Booker win with Sacred Hunger but from a screen adaptation that made hundreds of millions of dollars at the box office and pretty well cleaned up at the Oscars, has only 369: an order of magnitude lower than Outlander’s figure.

      • Bronwen Thomas says:

        Some interesting stats there. What I like about this kind of approach is that it does throw up surprises. And it can be possible to dig deeper into the data to try to understand the context.

        • Simon Frost says:

          Dear Daniel and Andrew
          Thank you very much for that. Terrific. Can I pick up on your inductive and deductive distinction? This has been kicked around by philosophers for years and never really resolved, as far as I understand it. If there is a resolution, then it is unsatisfactory and has something to do with abduction (abductive reasoning rather than carrying off someone), and counter narrative. To take your example of relatively big data. We know that stylistics or literary form in novelistic fiction over the long 19thC moves roughly from romanticism to Victorian realism to early modernism. And we know this from the few dozen novels we study. (Can anyone think of more than two dozen?) But, according to a source like Sutherland, there were around 7000 people in the Victorian era alone who could legitimately call themselves novelists, many producing more than one work. Elaine Showalter famously does much the same thing from a gender perspective. We know from our smaller (curricular) data sample that there were very few women novelists, but from Showalter we know they ran into their hundreds and from what a data sample such as At The Circulating Library can tell us, a very significant number indeed of those 7000 novelists seem to be women. So, from a small sample (the one we usually teach), English 19thC literature is predominantly male, but from a larger sample it is predominantly female. Or in terms of the history of stylistics – let’s say a large number of those 7000 novelists were using, say, elements of melodrama with heavy characterisation, should we then have to say that the 19thC was the age of melodrama and character, to which romanticism, realism and modernism are nothing more than interesting anomalies? I am inclined to think the former, that the 19thC was an age of melodrama, but I can think of arguments for both sides.
          There’s perhaps not a resolution but … for what it’s worth … abductive reasoning is when you derive a result from a hypothesis that doesn’t have to be proved. It’s like guessing. My car won’t start. I forgot to put petrol in my car. The premise here does not guarantee the conclusion (the car may have a flat battery, or may have just enough petrol left to start but something else is wrong etc.) but the premise provides sufficient conditions for an explanation although not necessary conditions for the explanation. Abductive reasoning provides working explanatory models that require constant re-evaluation as further evidence and bodies of data come to light – just like life.
          The second part-resolution (again not satisfactory) is counter-narrative. The hypothesis that most Victorian writers were women may eventually be proved wrong (some cache of androgynous writing waiting to be discovered, perhaps) but it can be valuable not because of its truth value but in how it provides a counter to a dominant narrative. The shift is from a truth claim to a political strategy.
          Getting back to big data – if research is dealing with staggering volumes of data that is always increasing then we are always under threat that the next cache of data may prove the first hypothesis entirely wrong. Surely analysing big data (which is always being overtaken by bigger data) becomes problematic only when we want to make a truth claim – i.e. isn’t the problem partly the need for truth (regardless of inductive or deductive method)? Alternatively, guessing a temporary answer abductively that is constantly re-examined, or strategically making a counter narrative informed through big data, unsatisfying as it may be in terms of analytical philosophy, may make more sense.

        • I agree with you, Bronwen, although what I think you’re picking up on is the search for explanations that has traditionally motivated scholarly research – and which Chomsky sees as absent from the radical approaches to ‘big data’ being pushed by some researchers.

  2. Anouk says:

    First I want to thank Daniel & Andrew for the excellent seminar they gave on this material at Sheffield. It was very well put together and I came away with lots of food for thought.

    Secondly, apologies for coming late to this discussion on the blog, but I wanted to throw into the mix a piece by Tim Harford on the perils of big data: “Big data: are we making a big mistake?” Harford points to some examples that undermine the putative analytical infallibility of big data, for example the fact that a few years on, Google Flu Trends started to get it wrong when compared to the CDC’s data about the spread of flu (something that, rather wonderfully, might be attributable to Google’s own auto-complete algorithms suggesting diagnoses when users entered their symptoms). The famous Target-uses-big-data-to-figure-out-that-a-teenager-is-pregnant-before-her-father-does story is also qualified: we aren’t told how many false positives there are, i.e. how many people to whom Target sent coupons for baby paraphernalia when they weren’t expecting a baby.

    Overall I find his critiques of the statistical pitfalls to be very compelling. I suspect, though, that most of us working with big(-ish) data in relation to the study of reading aren’t using statistical methods as our primary approach – or indeed making use of them at all – so I wonder about how we can take such critiques on board, and be appropriately suspicious about the datasets we’re using, when we are looking for patterns other than numerical ones (eg. linguistic patterning of the sort corpus analysis can turn up, which I realise can be subjected to statistical analysis but does not have to be).
