This post is a slightly revised version of a presentation I gave at the University of Florida Digital Humanities Bootcamp on January 28, 2016. It represents the current iteration of one approach I take to introducing digital humanities, providing an entry point focused on conceptual issues rather than particular tools or projects. To that end, following Ted Underwood’s lead, I try to locate the use of computational tools in relation to the search tools that are a naturalized but poorly understood part of current research practices.
When I think about conceptualizing digital humanities research, I often return to an analogy that Janet Murray made in 1997 in Hamlet on the Holodeck, a book that was a staple of early digital history courses. To highlight the challenge of the new digital medium, Murray pointed to the early history of film-making. The first films were photoplays, created by pointing a static camera at a stagelike set: they were an additive art form, still depending on formats derived from earlier technologies – photography plus theater. Filmmakers changed a mere recording technology into an expressive medium by seizing on the unique physical properties of film: the way the camera could be moved; the way the lens could open, close, and change focus; the way celluloid processed light; the way strips of film could be cut up and reassembled. When they panned and zoomed the camera, switched from low to high angles, cut between two scenes and created montages, they made movies. To move past the photo-play stage in our use of the digital medium requires, Murray argued, identifying its essential properties, “the qualities comparable to the variability of the lens, the movability of the camera, and the editability of film.” Her focus was on the web as a medium for storytelling, and the properties she identified reflect that focus: “Digital environments,” Murray argued, “are procedural, participatory, spatial and encyclopedic.”1
In more recent years, the focus of digital humanities research has shifted to tools that use the procedural properties of computers to examine artifacts not to create narratives. In addition to using the digital medium to “reshape the representation, sharing and discussion of knowledge,” humanities scholars are now using algorithms as a research tool that enables us to ask new questions and develop new methodologies. Working with those computational tools poses the same challenge to get beyond practices derived from older technologies that Murray identified in regard to storytelling on the web. It requires recognizing that they have a property that Murray did not discuss: computational tools are deformative.
My first goal here is to try to follow Murray’s lead and focus us on the deceptively straightforward task of thinking about the deformative property of computers when we conceive how to make use of computational tools. Deformation offers a method of both discovery and interpretation. One of the challenges in discussing DH research is distinguishing those two purposes, and ensuring that we conceive our use of computational tools in the appropriate terms. My second goal is to disentangle discovery and interpretation, and to promote the use of digital tools for discovery, counter to a tendency to devalue such uses as not ‘game-changing’ enough.
Beginning with discovery has not been my typical starting place when helping people conceptualize digital humanities research projects. I usually begin with a research question, which I then try to match with a digital tool that will help answer it. In doing so, I’m effectively starting in the middle of the research process, without considering what role digital tools could play in framing questions. That approach reflects a broader inattention to discovery in how humanities scholars conceive research, including a lack of description of how sources are found and analyzed in published scholarship. My own discipline, history, is particularly reticent about discussing research methods. Nonetheless, as with most humanities scholars, historians’ questions emerge to a large extent apart from their sources – from some combination of the issues in the world in which they live, what the scholarly literature says, and what the sources say. As much as we try to keep an eye out for answers to our questions and pay attention to what else is there, looking at parts and at the whole are two incommensurate perspectives. Trying to employ both approaches would leave you cross-eyed. Even when historians do start by identifying a source and then seeing what it contains, their reading is still shaped by their preconceptions and interests, by what they bring to the sources. Scholars generally browse – which means they are not reading every word. They stop when something attracts their attention – when some text “catches their eye” as it crosses the page. Likewise, historians’ sense of patterns in the sources we browse reflects our subjectivity: our retention of content is idiosyncratic, not systematic.
Digital tools offer a perspective on what sources contain that is derived from the documents themselves, not shaped outside them. Computational tools take artifacts apart, so they can be reordered and transformed to reveal patterns they contain. It is in regard to text analysis that the deformative property of the digital has been most extensively discussed (I find Stephen Ramsay’s Reading Machines to be the most insightful account2) and is most readily understood by humanities scholars. Text analysis tools deform texts by atomizing them. A computer processes texts as a meaningless string of characters; it does not know what a word is. An algorithm has to define a word, splitting a long string (text) into shorter strings (words) by looking for spaces and punctuation. This process – tokenization – is not a quantitative operation, but pattern recognition. Algorithms can identify not only words, but also more complex patterns – a string of ordered words rather than an unordered bag of words. Computers also process texts with what Ramsay describes as “relentless exactitude”: they unerringly discover every instance of a feature regardless of the size of the text to which they are applied.3
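To make the idea of tokenization concrete, here is a minimal sketch in Python. The regular expression is one possible definition of a "word" (a run of letters, optionally with an internal apostrophe); it is an assumption for illustration, not the rule used by any particular tool, and the sample sentence is invented.

```python
import re

def tokenize(text):
    """Split a string into lowercase tokens by pattern matching.

    The pattern below is the algorithm's working definition of a word:
    a run of letters, optionally followed by an apostrophe and more letters.
    Anything else (spaces, punctuation, digits) is treated as a boundary.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample text, used only to show the behavior.
sample = "The camera moved; the lens opened. The camera's eye saw."
tokens = tokenize(sample)    # an unordered "bag of words" can be built from this
bigrams = ngrams(tokens, 2)  # ordered two-word sequences
```

Changing the pattern changes what counts as a word, which is one reason the same text can yield different counts in different tools.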
Computational tools often couple algorithms with visualizations, which allow the user to observe and understand information. We are not talking about illustrations of an interpretation – which is how humanities scholars typically conceive visualizations – but representations for interpretation. Visualizations make mathematical abstraction more accessible, and are particularly effective in providing a way of seeing patterns in data, of seeing the whole and the parts. They enhance the perspective that computational tools offer by combining the deformative property of computers with their visual property.
It is not necessary to have big data in order to use digital tools to get a perspective on what your sources contain. There is no minimum amount of material needed for a tool to function. Nor is the perspective offered by an algorithm continuous with that of a human researcher to the extent that when the scale of data is small enough, a researcher will see everything that an algorithm does. With a relatively small set of sources, I may be able to recognize more features and patterns within those sources than I could working with more material, but my view is still shaped by preconceptions and constraints different from those that operate in algorithms. Thus, while digital tools are often presented as a necessary and inevitable response to the availability of more material than a human could read, as an extension of the human researcher, they don’t just read faster, they read differently. They combine the encyclopedic property of computers with the procedural, deformative, property I have been highlighting.
Nor is it necessary to master the mathematics behind a computational tool to use it for discovery. The existence of a range of open source tools makes it possible to use algorithms to get a perspective on your sources without having to write scripts or code. The limits to what those tools can do are balanced by their relative ease of use.4 Algorithms are not guaranteed to produce new insights, but freely available tools significantly reduce the cost in time and effort of getting their perspective on your sources. Whatever that perspective reveals is simply a starting point – as Trevor Owens puts it, “The results of a particular computational or statistical tool don’t need to be treated as facts, but instead can be used as part of an ongoing exploration.”5 An algorithm points to particular elements in sources, which a researcher can then make sense of by close reading or other methods. Any interpretation that comes from the spark provided by a digital tool can be elaborated and validated by other approaches. There are also digital tools that can be used to construct interpretations. Relying on the tool to justify your interpretation, to answer your question, does require understanding its mathematical basis sufficiently to establish the validity of the evidence it produced. But using a digital tool for discovery does not necessitate using a digital tool for interpretation.
Using digital tools for open-ended discovery is not a dramatic departure from current research practices. For nearly two decades, we have been unreflectively using a digital tool – search – to test hypotheses. Humanities scholars have a tendency to see search as simply a bibliographic tool, akin to flipping through a card catalog to find a book based on its author or title. But such tools have long since given way to keyword and full text searches, which, while having similar interfaces, are deformative in the same way as text mining tools.6 Keyword search, by searching multiple fields in a record, disrupts the hierarchies and categories of information established in the past. Full-text search examines every publication in a database, not simply those that have become canonical, and every word in a publication, working from the bottom up – from words in a text – rather than from the top down – from the journal to the issue to the article, as a researcher in a library would. In the case of archival collections, search can remove information from the context of the institution that structured the collection; it can, in Tim Hitchcock’s words, de-center institutions in favor of individuals.7
However, search is a limited method. It struggles to deal with what lies outside a set of results. In returning only the terms one enters, a search filters out any alternative hypotheses. For historians, this poses particular challenges, as the language and ways of organizing knowledge in the past often differ significantly from contemporary terms and patterns of thought. If we use the wrong search terms, we literally misread our sources. Working with interfaces that tell us how many results we found without reference to how many results were possible provides us with no sense of how representative our results are. Moreover, the increasing numbers of results generated by search have resulted in many search tools incorporating algorithms to sort results by relevance. Those algorithms are often proprietary – as is the case with those used by Google and ProQuest – but even when they’re not, it’s rare to find any explanation of how they work.
Ngram viewers address the uncertain significance of search results by putting them in context, and creating a visualization of those results. The best known of those tools is the Google Books Ngram viewer: it identifies how often a search term appears in each year of a corpus, and presents those results on a line graph. This tool is limited by the lack of any way to access the specific texts from which the search results derive, a consequence of the copyright status of much of the material in Google Books. Ngram viewers built for other sets of documents do not have that constraint. The Bookworm ngram viewer Ben Schmidt built for Chronicling America, the Library of Congress’s collection of public domain digitized American newspapers from the period 1836-1922, allows the user to click on any year in the graph to see a list of the stories from that year in which their search term appears, and to access the full text of those stories. Likewise the Bookworm Lindsay King and Peter Leonard built of Vogue magazine includes links to texts in the ProQuest database of the magazine, but the texts are available only to users from institutions that subscribe to that database. This is also the case with Chronicle, the New York Times’ ngram viewer, which offers links to the title and first sentence of texts, but requires a subscription to see full articles.
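The core calculation behind an ngram viewer can be sketched in a few lines of Python. This is an illustrative approximation only: the tiny corpus is invented, the function names are hypothetical, and real viewers such as Google Books Ngram or Bookworm apply their own tokenization and smoothing. The point is that each year's hits are divided by the total words for that year, which supplies the context that a bare count of search results lacks.

```python
from collections import Counter

# Invented miniature corpus of (year, text) pairs, for illustration only.
corpus = [
    (1900, "the telephone rang and the telephone rang again"),
    (1900, "a letter arrived by post"),
    (1910, "the telephone is everywhere now"),
    (1910, "the motor car and the telephone changed the town"),
]

def yearly_relative_frequency(corpus, term):
    """Return {year: occurrences of term / total words in that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        words = text.lower().split()
        totals[year] += len(words)        # the denominator: everything that could match
        hits[year] += words.count(term)   # the numerator: what the search found
    return {year: hits[year] / totals[year] for year in sorted(totals)}

def documents_for_year(corpus, term, year):
    """Return the texts from one year that contain the term (the 'click-through')."""
    return [text for y, text in corpus if y == year and term in text.lower().split()]

freq = yearly_relative_frequency(corpus, "telephone")
stories_1910 = documents_for_year(corpus, "telephone", 1910)
```

Plotting `freq` as a line graph gives the familiar ngram view, and `documents_for_year` mirrors the ability to move from a point on the graph back to the underlying texts.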
However, like search, ngram viewers are limited to only showing the terms that a researcher enters. Tools for open-ended discovery highlight and escape that constraint. In the remainder of my time I want to briefly highlight some tools and examples of their use that I hope are suggestive of the possibilities of using tools for discovery – word frequency, topic modeling, and mapping, the focus of my own work.
Word frequency tools count words in a document or a corpus of documents. They most obviously differ from search in that they count every word, rather than only those that a researcher queries. The numerical results of those counts are coupled with visualizations. The most widely used word frequency tool, Voyant, is characterized by the multiple different ways that it presents data, in an initially crowded interface (much improved in the recently released Voyant 2.0). The juxtaposed presentations highlight how visualizations tend to reduce the amount of information presented, in service of drawing attention to some aspect of the data. The multiple frames and threads of information present in the interface also work to draw users into exploring the results and visualizations, avoiding the tendency to simply look that comes from the long tradition of encountering visualizations that are illustrations. Using tools for discovery is a process. The vision of Voyant’s developers Stéfan Sinclair and Geoffrey Rockwell is that the approach to computational tools should be to try things out. Tinkering with the settings of tools and accumulating perspectives from many tools can be beneficial, as one tool may help you notice something that is worth exploring in more detail with another tool.8
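As a rough sketch of what a word frequency tool does under the hood, the following Python counts every word in a text rather than only a queried term. The stop word list and sample text are invented for illustration, and this is not how Voyant itself is implemented.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop word list; real tools ship much longer ones.
STOP_WORDS = frozenset({"the", "a", "of", "and", "to", "in"})

def word_frequencies(text, stop_words=frozenset()):
    """Count every word in the text, optionally skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Invented sample text.
text = "The whale and the sea. The sea and the ship of the whale."

raw_counts = word_frequencies(text)          # every word, including "the"
counts = word_frequencies(text, STOP_WORDS)  # content words only
top_words = counts.most_common(2)            # the two most frequent remaining words
```

Comparing `raw_counts` with `counts` is a small instance of the tinkering the paragraph describes: a single setting, the stop word list, changes what the tool brings to your attention.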
Word frequency tools only get you some way toward insights about what a source contains, leaving the user to establish the meaning of words and the context of their use. To provide more insight about the themes and ideas in texts, topic modeling examines semantic relationships. These algorithms produce possible topics by identifying clusters of words that appear in proximity to each other, that is, in the same context. The algorithm divides the texts into as many topics as the user specifies – a number that does not necessarily reflect how many topics are in the text. Forty topics is generally accepted to be the number most likely to strike the balance between lumping and splitting the themes of a corpus – but getting the number right is a process of trial and error. The algorithm creates a model of probable topics, not a picture of the topics in a corpus. And the topics it produces can be something other than the themes of a corpus: they might also identify specific historical events, notable stylistic features, or systematic transcription errors, to name only a handful of non-thematic topical modes. It is “the task of the interpreter [researcher] to decide, through further investigation, whether a topic’s meaning is overt, covert, or simply illusory.”9
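The following Python sketch is not a topic model. It does not implement the probabilistic algorithm used by tools such as MALLET, and the documents are invented. It only illustrates the raw signal that topic modeling builds on: words that repeatedly appear in the same documents. Here that signal is measured by simple document-level co-occurrence counting.

```python
from collections import Counter
from itertools import combinations

# Invented documents, already reduced to content words, for illustration only.
docs = [
    "ship sea whale captain",
    "whale sea harpoon ship",
    "court judge trial verdict",
    "judge verdict court jury",
]

def cooccurrence(docs):
    """Count how many documents each pair of words shares.

    Pairs are stored in alphabetical order so that ("sea", "ship") and
    ("ship", "sea") are counted as the same pair.
    """
    pairs = Counter()
    for doc in docs:
        vocab = sorted(set(doc.split()))
        pairs.update(combinations(vocab, 2))
    return pairs

pairs = cooccurrence(docs)
strongest = [pair for pair, n in pairs.items() if n > 1]  # pairs seen in more than one document
```

In this toy corpus the pairs that recur fall into a seafaring group and a courtroom group, and no pair crosses between them. A topic model pursues the same intuition probabilistically, and, as the paragraph notes, it is still up to the researcher to decide what a resulting cluster means.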
A new book, The Historian’s Macroscope, offers a fantastic introduction to the range of freely available topic modeling tools.10 This book is particularly valuable because it offers a perspective on text analysis from outside literary studies. Disciplines do scholarship differently, as Amy Earhart reminds us in her recent analysis of digital literary studies,11 so the almost total dominance of discussion of text mining by literary critics has made it more difficult for scholars in other disciplines to see how these tools can be applied to their research. Graham, Milligan and Weingart effectively chart what is possible with tools of differing complexity. The Topic Modeling Tool, a graphical user interface for the command line tool MALLET, creates topics, allows a user to link through to the documents that contain each topic, and to look at documents and see which topics they contain. However, you can’t tweak or fine tune the model. MALLET, without the GUI, on the command line, does allow users to fine tune the model, to remove stop words, and filter out or include numbers. R, the statistical programming language, is a more powerful tool for creating topic models, though more complex to learn, and offers more options for visualization.
It is worth emphasizing that visualization is not an intrinsic part of MALLET. My students are often disappointed that the results of the TMT and MALLET are only tables of text, not the word clouds and line charts tracing change over time of Robots Reading Vogue, or the grids and bar graphs of Signs@40 (which uses Andrew Goldstone’s dfr-browser). The additional work required to construct visualizations, which are crucial to making the results of topic modeling accessible, is an obstacle to humanities scholars using this tool. However, Lauren Klein and her collaborators at Georgia Tech, with the support of the NEH, are building a tool, TOME, that integrates visualizations useful to humanities scholars.12
Mapping tools are deformative in a different way than word frequency and topic modeling tools. Maps take apart artifacts and reorder them in terms of the spatial information they contain – textual descriptions of locations become geo-coordinates, texts become maps. Spatial patterns that are hidden in texts and tables – often because they are so fragmentary or limited that humanities scholars treat them as ephemera – become visible on maps. The interactive, iterative features of digital maps allow them to be explored in the same manner as the results of text analysis tools. Not only are mapped sources placed in their geographic contexts, but selections of those sources can be mapped, different layers of sources can be juxtaposed, and the scale can be zoomed from the level of individual buildings out to neighborhoods, cities, and regions.
As with text analysis, free or open source web mapping platforms and tools are available to scholars to use for discovery. Google My Maps features automatic georeferencing – conversion of location names and addresses into coordinates – and the ability to create maps of points, polygons, lines and layers. CartoDB adds a range of quantitative map types – cluster maps, heatmaps – as well as torque maps that animate time. Neatline offers the ability to build more handcrafted, annotated maps focused on telling stories, with timelines, and linked to digital collections of material (built in Omeka). A mapping platform built by humanities scholars for humanities scholars, Neatline accommodates uncertainty and ambiguity in data to a far greater extent than other web mapping tools.
Mapping tools can be used in conjunction with text analysis. A recent example is “Space, Nation, and the Triumph of Region,” an article by Cameron Blevins published in the Journal of American History in 2014, which analyzed a Texas newspaper to explore what it revealed about the imagined geography of the United States in a period understood as one of integration and incorporation. He found that, at odds with the prevailing view, the newspaper was focused on region, not nation.13 To find this pattern, Blevins first extracted all the references to places in the newspaper using a text analysis tool called Named Entity Extraction, which is trained to identify all the place names in a text. He then visualized the results on a map – his data is both spatial (places and their geo-coordinates) and quantitative (the number of appearances of each place in the newspaper, indicated by the size of the circle).14
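The general workflow of extracting places, counting them, and sizing map markers can be sketched in Python. This is not Blevins's method or code: it swaps the trained named entity tool for a simple lookup against a hand-made gazetteer, the sample text is invented, and the coordinates are approximate. All names here are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "houston": (29.76, -95.37),
    "galveston": (29.30, -94.80),
    "chicago": (41.88, -87.63),
}

def place_counts(text, gazetteer):
    """Count mentions of known places (a crude stand-in for named entity extraction)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in gazetteer)

def markers(counts, gazetteer):
    """Build one map marker per place, most-mentioned first.

    The radius grows with the square root of the count, so that the
    circle's area, not its width, is proportional to the mentions.
    """
    return [
        {"place": place, "lat": gazetteer[place][0], "lon": gazetteer[place][1],
         "radius": math.sqrt(n)}
        for place, n in counts.most_common()
    ]

# Invented sample text.
text = ("Cotton left Houston for Galveston. Houston merchants wrote to Chicago, "
        "and Houston waited.")
counts = place_counts(text, GAZETTEER)
points = markers(counts, GAZETTEER)
```

The resulting list of points holds both kinds of data the paragraph describes, spatial (coordinates) and quantitative (a radius derived from the count), and could be handed to any web mapping tool that accepts latitude, longitude, and a size.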
Mapping tools can also be used without text analysis tools and with qualitative data. In Digital Harlem, the project on 1920s Harlem which I helped create, we work with street addresses, which in Manhattan are a combination of named avenues and numbered cross streets that can’t be identified by text mining in the way that place names can, even if our sources were machine-readable, which, like many historical records, they are not. Nor is our data quantitative. The maps the site produces feature points derived from a wide variety of sources. This form of mapping highlights another dimension of the deformative nature of mapping – the way mapping provides a way of integrating material from a wide range of sources on the basis of geographic location. This example combines information from the reports of an undercover investigator hired by a white anti-prostitution organization, lists of speakeasies and articles published in black newspapers, and legal records, to provide a picture of the different venues that made up Harlem’s Prohibition-era nightlife.
If you find these examples suggestive of what can be gained from using digital tools for discovery, I need to temper your enthusiasm by briefly noting what you need in order to do research in that way.
- You need digitized sources – and depending on your field, only a very small portion of the sources you are interested in will be machine-readable and ready to be fed into a computational tool.
- Even if digitized sources do exist, you need to be able to access them, and do so at scale – which means that sources you access through proprietary databases in your library’s collection will likely not be available for this approach (although this is changing as librarians push vendors to provide this access).
- And even if they exist and are accessible, digitized sources are often of relatively poor quality, marred by gaps, poor metadata, and poor optical character recognition.
However, we are reaching a point where it is becoming more possible to create your own digitized sources. Digital cameras are now ubiquitous, with even those in phones able to capture images of documents, and archives generally allow their use, making it possible to quickly gather large quantities of digitized material. The new software tool that Sean Takats and I are developing at RRCHNM with the support of the Mellon Foundation, Tropy, will provide a way to attach metadata to those images and organize them. Off-the-shelf OCR software can make most printed text in digital images machine-readable with sufficient accuracy for discovery, if not interpretation. If I have succeeded at all in my goals, it should also be clear that we’re at a point where it is worth taking the time to digitize our sources, as it allows us to make use of the deformative property of computational tools to approach and frame our research in new ways.
1. Janet Murray, Hamlet on the Holodeck: The Future of Narrative in Cyberspace (Cambridge, Mass., 1998), 65-97.
2. Stephen Ramsay, Reading Machines: Toward an Algorithmic Criticism (University of Illinois Press, 2011).
3. Ramsay, 32.
4. As Matthew Jockers and Ted Underwood note, out of the box open source tools “are never a complete solution. Since every research question is different (almost by definition), each entails some problems that resist standardization. Idiosyncratic types of metadata need to be gathered, special-purpose analyses need to be performed, and results need to be translated into visualizations that address a specific question.” Matthew Jockers and Ted Underwood, “Text-Mining the Humanities,” in A New Companion to Digital Humanities, eds Susan Schreibman, Ray Siemens, and John Unsworth (Wiley-Blackwell, 2016).
5. Trevor Owens, “Discovery and Justification are Different: Notes on Science-ing the Humanities,” (November 19, 2012).
6. Ted Underwood, “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago,” Representations 127 (2014): 64-72.
7. Tim Hitchcock, “Digital Searching and the Re-formulation of Historical Knowledge,” The Virtual Representation of the Past, eds Mark Greengrass and Lorna Hughes (2008).
8. Stéfan Sinclair & Geoffrey Rockwell, “Text Analysis and Visualization: Making Meaning Count,” in A New Companion to Digital Humanities, eds Susan Schreibman, Ray Siemens, and John Unsworth (Wiley-Blackwell, 2016).
9. Lauren Klein, Jacob Eisenstein, and Iris Sun, “Exploratory Thematic Analysis for Digitized Archival Collections,” Digital Scholarship in the Humanities, 30, Supplement 1 (2015), 131.
10. Shawn Graham, Ian Milligan, and Scott Weingart, The Historian’s Macroscope: Exploring Big Historical Data (Imperial College Press, 2015).
11. Amy Earhart, Traces of the Old, Uses of the New: The Emergence of Digital Literary Studies (University of Michigan Press, 2015).
12. Klein et al.
13. Cameron Blevins, “Space, Nation, and the Triumph of Region: A View of the World from Houston,” Journal of American History 101, no. 1 (June 2014), 122-147.
14. Cameron Blevins, “Mining and Mapping the Production of Space: A View of the World from Houston” (2014).