HIST680: Text Mining with Voyant

Module 7: Voyant 2.0

In this activity you will learn to use Voyant, a free tool for exploring patterns in words or phrases in a text. A web-based version of the tool can be accessed at http://voyant-tools.org.

You can also download a version to run locally on your own computer (instructions: http://docs.voyant-tools.org/resources/run-your-own/voyant-server/). Voyant generally crashes less often and runs more quickly when you run it on your own computer, but the visualizations you produce there cannot be embedded in other web sites (such as your blog). For that reason, we are using the web-based version.

This activity is inspired by Dan Royles’ lesson plan.

 

You will be learning to use the five default Voyant 2.0 tools (for other tools, some of which have yet to be upgraded to Voyant 2.0, see the documentation).

  • Cirrus: a word cloud that displays the highest-frequency terms. The more often a word appears in the text, the larger it appears in the cloud. You may remember looking at word clouds of digital humanities definitions in module 2.
  • Reader: a pane for reading the text of the corpus, with a bar at the bottom in which each document appears as a block sized according to its length.
  • Trends: a line graph depicting the distribution of a term’s occurrence across a corpus or a document.
  • Summary: basic information about the text(s), such as number of words, the length of documents, vocabulary density, and distinctive words for each document.
  • Contexts: a table that shows each occurrence of a word with the segments of text that directly precede and follow it throughout the corpus or document.

 

The dataset

The dataset for this activity is the WPA Slave Narratives. This collection consists of more than two thousand interviews with former slaves from seventeen states collected in the years 1936-1938 by staff of the Federal Writers’ Project of the Works Progress Administration. The interviews are available in the public domain as images and uncorrected OCR as part of the Library of Congress American Memory site, and as transcriptions as part of Project Gutenberg. These interviews are a complex source; make sure you read the background material before you work with them.

How did we create this dataset?

The files you will be using were created by downloading the .txt file of the transcription of each volume from Project Gutenberg. [Three of the volumes are annotated as having been altered by the transcriber, making them both not accurate copies of the originals and inconsistent with the other volumes. For those volumes, the uncorrected OCR text file was downloaded from the Library of Congress and manually corrected.] The introductory material, including the title page and the lists of interviewees and illustrations, and the Project Gutenberg information at the end of each volume were removed from each file. Multiple volumes from a state were combined into a single file, producing 17 files.
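The trimming described above was done by hand, but the same kind of cleanup can be sketched in a few lines of Python. This is only an illustration, not the script actually used to prepare the course files: it relies on the "*** START OF" / "*** END OF" marker lines that Project Gutenberg plain-text files conventionally use (the exact wording of the markers varies between files).

```python
from pathlib import Path

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between Project Gutenberg's START/END marker lines.
    If no markers are found, the text is returned unchanged."""
    lines = raw.splitlines()
    start = next((i for i, l in enumerate(lines) if l.startswith("*** START OF")), -1)
    end = next((i for i, l in enumerate(lines) if l.startswith("*** END OF")), len(lines))
    return "\n".join(lines[start + 1:end]).strip()

def combine_volumes(paths):
    """Concatenate the cleaned text of several volumes from one state into one string."""
    return "\n\n".join(
        strip_gutenberg_boilerplate(Path(p).read_text(encoding="utf-8")) for p in paths
    )
```

Note that this sketch does not remove a volume's own front matter (title page, lists of interviewees and illustrations); that step still had to be done manually for these files.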

 

Open a window in your browser and go to Voyant at http://voyant-tools.org.

Paste the URLs below into the “Add Texts” box – make sure each is on its own line – then click the “Reveal” button.


http://drstephenrobertson.com/SlaveNarratives/Alabama_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Arkansas.txt

http://drstephenrobertson.com/SlaveNarratives/Florida_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Georgia.txt

http://drstephenrobertson.com/SlaveNarratives/Indiana_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Kansas_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Kentucky_v2.txt

http://drstephenrobertson.com/SlaveNarratives/MIssissippi_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Maryland_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Missouri_v2.txt

http://drstephenrobertson.com/SlaveNarratives/NorthCarolina.txt

http://drstephenrobertson.com/SlaveNarratives/Ohio_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Oklahoma_v2.txt

http://drstephenrobertson.com/SlaveNarratives/SouthCarolina.txt

http://drstephenrobertson.com/SlaveNarratives/Tennessee_v2.txt

http://drstephenrobertson.com/SlaveNarratives/Texas.txt

http://drstephenrobertson.com/SlaveNarratives/Virginia_v2.txt

 

Note: these files are all plain text (.txt); Voyant can also read PDF, Microsoft Word, and other formats.

 

When Voyant opens, the screen is divided into five windows, each containing a tool: Cirrus, Reader, Trends, Summary, and Contexts.

  • Rolling your cursor over the bar that appears next to each tool reveals the options for that tool:

  • the left-hand button is for export (which includes opening that window in a tab of its own)
  • the middle button allows you to choose another tool to open in this window (see the documentation for additional tools)
  • the right-hand button – which is only available for some tools – provides options to adjust the tool
  • ‘?’ provides help for each tool

In Voyant, each file you loaded is called a document; all the files together are called a corpus.

 

  1. Analyze the corpus of interviews

 

Begin with the Cirrus word cloud, which shows word frequency for the corpus or for an individual document. By default a word cloud displays every word in the document, including common words that appear in every text. To produce a more revealing visualization, Voyant automatically removes a list of common words, called stop words, from the visualization.

However, as you know from the background reading, dialect is a particular feature of these interviews, and common words in dialect are not part of the stop word list that Voyant uses by default. You can edit that list by clicking on the options button in the Cirrus tool. An options window will appear:

  • Click Edit List
  • Scroll to the bottom of the list and paste in the words below [this list is based on one created by Dan Royles – feel free to add additional words].

ain’t
ain’t
an’
an’
atter
couldn’t
couldn’t
dar
dat
dat’s
dem
den
dere
dey
didn’t
dis
don’t
em
en
en’
fer
fo
fo’
git
iffen
i’s
i’se
mo
mo’
neber
nothin
nothin’
roun’
roun
whar
whut
wid
wuz

  • Click Save
  • Click Confirm
  • The word cloud will now reformat with the words you added removed – and since we left the ‘apply globally’ box ticked, those words will also disappear from the other tools
  • Note: some of the .txt files were created using OCR software, which does not always recognize words correctly, producing errors in the transcriptions
  • If you hover over a word in the cloud, the number of times it appears in the corpus will be displayed
  • The Terms slider on the bottom left of the Cirrus tool allows you to adjust the number of words that appear in the cloud
    • If you click on Terms in the menu bar (next to Cirrus), the window changes to a list of words and their frequency in the corpus

  • If you click on Links in the menu bar, the window changes to a visualization of the highest-frequency words that occur close to the selected terms
    • Click on a word to highlight the words collocated with it
    • Double-click on a word to fetch more words that appear close to it
    • The Context slider on the bottom right adjusts how close to the search terms the displayed words must be – if you slide it to the right, words further away are included
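The frequency counting and stop-word filtering that Cirrus performs can be sketched in a few lines of Python. This is a minimal illustration, not Voyant's implementation: the stop word set here is just a handful of standard English words plus a few dialect spellings from the list above.

```python
import re
from collections import Counter

# A small sample: standard English stop words plus dialect spellings
# from the list pasted into Cirrus above (truncated for illustration).
STOP_WORDS = {"the", "and", "a", "to", "of", "in", "i", "dey", "dat", "wuz", "dem", "dis"}

def top_terms(text: str, n: int = 10):
    """Case-fold, tokenize, drop stop words, and return the n most frequent terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS).most_common(n)
```

Run against a full document, a counter like this produces the term list behind both the cloud and the Terms table; the stop word edit simply changes which tokens are excluded before counting.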

What does the Cirrus tool tell you about the corpus of interviews – what topics does it suggest are the focus of the interviews? (In making that assessment, don’t look only at individual words, but consider how individual words might be related).

 

Roll over the menu bar for the Cirrus tool and select the left-hand (export) icon. This opens the export options. Whatever you have open in the Cirrus window – a word cloud, a table of terms, or a links graph – is what will be exported.

  • Leave the default option, “a URL for this view (tools and data)”, and click “Export.” This opens a new window containing only the contents of the Cirrus tool: copy the url for this new window.

  • The Export View option also provides code to allow you to embed an interactive view of this window in a web page, including your WordPress blog (once you install an iframe plugin) and your Omeka exhibit. Select the “an HTML snippet…” option, click “Export,” copy the code that appears, and paste it into the text view in WordPress or the HTML view in Omeka.
  • The Export Visualization option generates a static image of the window – note that many image viewers cannot open the SVG format

Provide URLs of the corpus word cloud and any other visualizations from the Cirrus tool that you think reveal the nature of the interviews.

Corpus word cloud URL:

Other helpful visualizations:

 

2. Compare documents/interviews from different states

A: How frequently do the most common words in the corpus appear in individual documents?

Choose two words that you think are the most revealing in the Cirrus cloud of the corpus.

For each word, click on that word in the Cirrus cloud: a graph of its frequency across the corpus will appear at the bottom of the Reader tool and in the Trends tool. Explore what each tool tells you about each word you selected.

 

  • Reader
      • Each document appears in the bar at the bottom of the window as a colored block. The wider the block, the more words the document contains.

 

      • When you select a word, a line graph appears that shows the frequency of the selected word across each document
      • The vertical blue line indicates what part of the corpus is being displayed in the Reader tool window
      • Click on a word in the Reader window – a frequency graph for that word will appear in the Trends tool, and that word will appear in the Contexts tool, beginning with the instance that you clicked on

 

  • Contexts

 

    • This tool shows all of the appearances of a selected word with the words to its left and right
    • Clicking on the + to the left of an example opens a window showing more of the text surrounding that example of the word
    • The Context slider on the bottom of the window adjusts how many words of surrounding text are displayed for each appearance of the word
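The Contexts display is a classic keyword-in-context (KWIC) table. A minimal sketch of what it computes – not Voyant's own code, and using naive whitespace tokenization – might look like this:

```python
def kwic(text: str, keyword: str, window: int = 3):
    """Keyword-in-context: for each occurrence of keyword, return the words
    to its left, the matching token itself, and the words to its right."""
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows
```

Each row corresponds to one line of the Contexts table; widening the `window` parameter is the equivalent of dragging the Context slider.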

 

  • Trends

 

    • The line graph shows the frequency of the selected word in each document
    • The default measure is relative frequency. If you click on the button on the bottom right of the tool (the one with four horizontal lines) and select Frequencies, you can switch to raw frequencies instead.

Change the scale to see the frequency of the selected word within specific documents instead of the total frequency for the document.

  • Click on the “Scale” button on the bottom right of the Trends tool, click on Documents, and choose a state/document – export the graph to a separate window so that you can compare it to the graph for the corpus and to the graph for another document
  • The Trends graph now shows how widely the word is used within the document – that is, in how many interviews it appears
  • Click on the “Scale” button and choose a second state/document – export it to a separate window so that you can compare the two graphs to each other and to the graph for the corpus

What do the frequencies of the most common words in the corpus – across different documents, and within two of those documents – tell you about the nature of the differences between interviews from various states?

Please provide URLs of 6 Trends graphs – across the corpus for each word, and then within 2 documents for each word:

Word 1

Trend graph across the corpus URL

Trend graph within document 1 URL

Trend graph within document 2 URL

 

Word 2

Trend graph across the corpus URL

Trend graph within document 1 URL

Trend graph within document 2 URL

 

B: How do the overall word frequencies in individual documents compare with the overall pattern in the corpus, and with the overall pattern in other individual documents?

  • Use the export button to open a new tab in your browser containing the word cloud for the corpus
  • Return to the tab in your browser with the full Voyant window
  • Use the Scale button in the Cirrus tool to create word clouds for at least three different documents – export each to a separate window to allow you to compare them to the corpus word cloud and to each other

What do the different word clouds tell you about differences in the interviews from each state?

 

URLs of each of the 4 word clouds:

Word Cloud 1 URL:

Word Cloud 2 URL:

Word Cloud 3 URL:

Word Cloud 4 URL:

 

 

C: How frequently do the words that are most distinctive to each document appear in the corpus and in the document to which they are distinctive?

The Summary tool displays information on each document, as well as the corpus.

  • Document list – the longest and shortest documents in terms of number of words
  • Vocabulary density – the highest and lowest ratios of unique words to total words in a document
  • Distinctive words in each document – compared to the rest of the corpus
  • The “items” slider at the bottom of the tool adjusts how many items appear next to each of those summaries
  • Clicking on a word in the Summary tool loads the first instance of that word in the document or corpus in the Reader tool and in the Contexts tool
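Vocabulary density is a simple ratio, and can be sketched in Python as follows. This is an illustration of the idea rather than Voyant's implementation, so its tokenization may not match Voyant's exactly.

```python
import re

def vocabulary_density(text: str) -> float:
    """Unique words divided by total words: higher values indicate a more
    varied vocabulary, lower values indicate more repetition."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)
```

A long document that repeats the same words will score low; a short document can score artificially high, which is worth keeping in mind when comparing state files of very different lengths.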

Select two distinctive words, each from a different document (don’t choose place names – why they are distinctive to different documents is fairly obvious).

  • Click the word in the Summary tool
  • Click on the first example of the word in the Contexts tool – that will load instances of that word in the Reader tool
  • Click on an instance of that word in the Reader tool to feature it in the graphs in the Reader tool and the Trends tool showing its frequency in the corpus

 

  • Click on the Scale button in the Trends tool, select Documents, and then select the document in which the word you chose is distinctive. The Trends graph now shows how widely the word is used within that document – that is, in how many interviews it appears.

 

  • Click on the Scale button in the Contexts tool, select Documents, and then select the document in which the word you chose is distinctive – does Contexts help explain any differences between the Trends graphs?

What do the distinctive words tell you about differences in the interviews from each state?

 

Please provide URLs of graphs of each distinctive word in the corpus, and in the document in which it is distinctive (4 graphs total):

 

Graph of word 1 in the corpus URL

Graph of word 2 in the corpus URL

Graph of word 1 in the document in which it is distinctive URL

Graph of word 2 in the document in which it is distinctive URL

 

PORTFOLIO BLOG POST: