This glossary is based on one Stephen Robertson created for the American Historical Association, which was in turn based on a glossary created for the 2016 Doing Digital History Institute. It has been expanded with additional terms related to the internet, the web, digitization, preservation, search, and copyright.
3D models: A computational method for spatial analysis. Three-dimensional visualizations are created by specialized software using geometric data. Objects can be created, placed in scenes, and produced in physical form by 3D printers. Within 3D modeling there is variation between models which are static, such as re-creations of architectural spaces, and models which are meant to be experienced, such as those used in simulations which can be explored. SketchUp, which has a free version, is the most widely used 3D modeling software in digital humanities (https://www.sketchup.com/). 3D models are also being created in game development platforms, such as the Unity game engine.
API (Application Program Interface): Software that allows two web applications to communicate. Commonly used to access data in an online database. Museums, libraries and archives can provide an API as a way to share their databases, providing users with the ability to gather large quantities of data more efficiently than with the search tools and interfaces provided for the database.
Augmented reality, AR: A form of visualization that overlays information on a user’s view of a real-world environment and the objects in it, to create a composite picture that alters perception of the environment (as distinct from VR, which replaces the real-world environment with a simulation). AR is commonly displayed on smartphones and tablets, using the device’s camera and GPS to determine what information to display and where to display it. Uses of AR in digital humanities include augmenting locations with historical photographs of the place, augmenting classical statues with the colors in which they were originally painted, and augmenting displays with additional descriptions.
Backend (aka control panel or dashboard): Administrative side of software where you can make technical and content changes that is not visible or accessible to visitors to the site (not public-facing).
Black box: A tool that can be used to input and output data without any knowledge of its inner workings. In some cases how a tool works is unknown because it is not made available for inspection, usually because its creators wish to retain control over the underlying code. See Relevance.
Blog: A web site that contains discrete, short, often informal entries (posts) that appear in reverse chronological order and can combine text, multimedia and links. Originally a form of online diary that allowed readers to leave public comments to which authors could respond, blogs have evolved into venues for commentary on a variety of topics by public figures, institutions and journalists as well as individuals, including scholars. Most blogs are published using free content management systems designed for that purpose such as WordPress and Blogger, and are freely available.
Born digital: Material that originates in digital form; in contrast to material that is digitized, which originated in another form. Common forms of born digital content are photographs taken with digital cameras, web pages, and electronic records like email and spreadsheets.
Coaxial cable, cable: A copper cable built with a metal shield and insulation to block interference, primarily used for TV. Cable has more bandwidth than DSL, but that bandwidth is shared with other users, so speeds can be reduced if there are many other people on the network. TV cable provides the majority of internet access in the US (2019).
CMS (Content Management System): A computer program that allows content to be published, edited, and modified from a central interface. A CMS typically provides an interface that removes the need for the user to write in a programming language or markup language, although that option is also often available. CMSs are often used to run websites containing blogs and digital collections.
CSV (Comma Separated Values): A file with a set of information, where each value is separated by a comma or other specified character (; | /). Can be created with spreadsheet software like Excel, Google Sheets, or Numbers; when a spreadsheet is saved as a CSV file, the values in the rows and columns are separated by a comma or other specified character. Most databases, many CMS platforms (e.g., Omeka), and digital tools can import CSV files, making them a commonly used means of transferring information.
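Reading and writing CSV files can be sketched with Python's standard library; the records below are invented for illustration:

```python
import csv
import io

# A small set of records such as a collection spreadsheet might contain
rows = [
    {"title": "Letter, 1863", "creator": "J. Smith", "date": "1863-07-04"},
    {"title": "Diary page", "creator": "A. Jones", "date": "1864-01-15"},
]

# Write the records as CSV (here to an in-memory buffer; a file works the same way)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "creator", "date"])
writer.writeheader()
writer.writerows(rows)

# Read them back: each row becomes a dict keyed by the header line
buf.seek(0)
records = list(csv.DictReader(buf))
print(records[0]["creator"])  # J. Smith
```

The header row is what lets a CMS importer map each column onto a database field.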
Computational methods/tools: Programming and software that analyzes data; the most commonly used methods in digital humanities are text analysis, spatial analysis, and network analysis. Using computational methods requires transforming historical sources into data by extracting information and features, and creating structured data by normalizing them to fit the chosen categories in service of particular research goals. The results of computational analysis are generally presented in visualizations, such as maps, graphs and charts.
Corpus Linguistics: Builds on text analysis to elucidate meaning by examining syntactic and semantic structures larger than single words. A corpus of texts is annotated with tags for parts of speech, and for the different modifying functions and relations that a word can have in different contexts. The corpus is analyzed by combining search and collocation to use context to establish the meaning of a word.
Data Center: A facility that serves as a central repository for servers, storage systems, network routers and firewalls, and the cabling and physical racks used to connect them. A data center requires infrastructure including uninterruptable power supplies and back-up generators, ventilation and cooling systems, and access to the internet.
Data cleaning: The process of detecting and correcting (or removing) incomplete records or data with inconsistent spelling or formatting from a database.
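A minimal sketch of data cleaning in Python, assuming a hypothetical lookup table of known spelling variants (the place names and variants are invented for illustration):

```python
# Hypothetical table mapping known variant spellings to a canonical form
VARIANTS = {"n.y.": "New York", "ny": "New York", "new york": "New York"}

def clean_place(value):
    """Trim whitespace, lower-case for matching, and map known variants."""
    key = value.strip().lower()
    return VARIANTS.get(key, value.strip())

raw = ["NY", " n.y. ", "New York", "Boston"]
cleaned = [clean_place(v) for v in raw]
print(cleaned)  # ['New York', 'New York', 'New York', 'Boston']
```

Real projects use tools such as OpenRefine for this work, but the principle is the same: detect inconsistent values and normalize them to one form.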
Database: A form of structured data in which related information is organized into fields (a single item of data), records (a complete set of fields; a row in a spreadsheet) and files (a collection of records). Also software that enables you to enter, organize, store, and retrieve information in a database.
Digitization: The conversion of analog content into a digital format. The creation of digital images by photography or scanning is the most common form of digitization, used in the case of documents, photographs, artworks, or objects. Sound and moving images can also be digitized, by re-recording video and audio onto digital media. See also JPEG; TIFF; Pixel; Resolution.
Digital Archive: A collection of digitized sources organized, described with metadata, and made accessible through an online interface. In the context of digital humanities, the term generally refers to a collection brought together online from a variety of different physical collections and locations. Archivists would generally not consider such a collection to be an archive; in that field, the term archive is only used to refer to material created by an originating organization or person or by a third party brought together in a repository.
Distant Reading: From Franco Moretti, a term for using text analysis to look for patterns over large corpora of texts.
DOI (Digital object identifier): A managed, persistent link to an online publication. To obtain a DOI you must register with a DOI Registration Agency, which collects metadata about publications and assigns them DOI names. If the URL of the publication changes, the publisher must update the DOI metadata for the DOI to continue to link to the publication.
DSL (digital subscriber line): Technologies used to transmit digital data over copper telephone lines, taking advantage of capacity in the lines not used by phone by using a modulation scheme to create separate frequencies for phone and internet. Without DSL digital data needs to be converted into analog data by a dial-up modem to be transmitted over telephone lines, and is transmitted much more slowly.
Domain name, domain: A unique identifier for a resource on the internet such as a web server, web site or web app; used to translate the numerical IP addresses employed by internet protocols, as part of a URL. Domain names are used to establish a unique identity for a project. A domain name can include one of a number of top level domains (eg .com). Second level domains, what precedes the top level domain, are a string of text and numbers up to 253 characters. Individuals, organizations and projects often use their name as a second level domain. Anyone can obtain a domain name by registering it with a domain name registrar, who charges an annual fee.
DPI (dots per inch): The resolution or detail of a printed image. The number of dots per inch is based on the number of pixels in an image file, but in printing the pixels take the form of droplets of ink. The total number of pixels is distributed across the size of the printed image, so the detail of the image depends on the combination of the total pixels of the image and the size of the print. To print a 4 x 6 inch photo using the professional standard of 300 dpi, the photo needs to be 1200 pixels by 1800 pixels, a total of 2,160,000 pixels. An image with fewer pixels could be printed at the same quality if printed at a smaller size. DPI is also used as a measure of the image quality needed for OCR software to work effectively: a 300 dpi image will allow the software to recognize text in fonts of size 10 pt or larger; a 400-600 dpi image is needed if the text is in fonts 9 pt or smaller.
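The print-size arithmetic above can be expressed as a small helper function (a sketch; the 4 x 6 inch, 300 dpi figures come from the entry):

```python
def pixels_needed(width_in, height_in, dpi=300):
    """Pixel dimensions required to print at a given size and dpi."""
    return (width_in * dpi, height_in * dpi)

w, h = pixels_needed(4, 6, dpi=300)
print(w, h, w * h)  # 1200 1800 2160000
```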
Dublin Core: An internationally recognized metadata standard for describing any conceivable resource, comprising 15 elements, including “title,” “description,” “date,” and “format.” Dublin Core is used in Omeka, an open source content management system for publishing resources online widely used in digital humanities.
Emulation: Hardware or software that enables a computer system to behave like another computer system, and run software or use devices created for that system. In digital preservation, emulation is an alternative to updating and migrating digital objects as new systems become available, with the advantage of maintaining much of the original look and feel of the digital object.
Fiber Optic cable: A cable made of glass not wire that uses pulses of light to transmit information. Fiber optic cable provides higher bandwidth and can transmit data over longer distances than copper wire and DSL. Telephone companies began to replace copper wire networks with fiber optic cable in the 1980s. An increasing proportion of the internet uses fiber optic cable, although television cable still provides the majority of internet access in the US (2019).
FTP (File Transfer Protocol) Client: A program that lets a user transfer a file from their computer to a web server so that it can be available or viewed online.
Full-Text Search: A search that examines every word in stored documents to find matches for the search criteria. Previous forms of search typically examined only metadata associated with documents, such as title, author, date of publication and subject classifications. Full-text search became more widely available with the mass digitization of texts that began in the 1990s. See also Google search; Image Search; Reverse Image Search.
Generous interface: A visual interface for search results that presents the scale and richness of a collection. Rather than a list of search results, a generous interface provides overviews to establish context and maintain orientation while revealing detail at multiple scales. Where browse-based interfaces present collections as alphabetical lists, the overviews in a generous interface are based on selected features (e.g. year, subject, color). Coined by Mitchell Whitelaw.
GIS (Geographic Information Systems): Software that combines a database and a mapping application to relate information to a location. ArcGIS is the best known example of this software; it is a commercial product with a steep learning curve, designed primarily for social scientists working with quantitative data. An open source alternative is QGIS. See also Web Mapping.
Georeferencing: Transforming place names and addresses into coordinates for mapping; this step is also called geocoding. The term also refers to aligning a scanned map or image with geographic coordinates so that it can be overlaid on a modern map.
GitHub: A web-based platform for sharing code and other kinds of files, built around the Git version control system; widely used to host open source projects.
GLAM: Acronym for Galleries, Libraries, Archives, Museums.
Google Search: A search engine that conducts a text search of the index of the web created by Google’s web crawlers, with relevant results displayed as a list on a page. The algorithms that sort the results to place the most relevant at the top of the list look at language models to try to identify the meaning of the query, other search results with similar keywords, what other users have chosen from results for that search, sites linked to by other sites (Page Rank), the usability of sites and information about the person conducting the search such as location, search history, and search settings. Google does not provide more specific information on its algorithms (see black box). Other search engines are available to search the web (Bing, Yahoo, DuckDuckGo), each of which has its own index and algorithms to sort results by relevance.
GUI (graphical user interface): see User interface
Hard Disk Drive (hard drive): A magnetic data storage device that uses rapidly rotating disks (platters) coated with magnetic material paired with magnetic heads that read and write data to the disks. Hard disks are the major form of storage in personal computers and servers, although solid state memory is becoming more widely used. Hard drives have a lifespan of 3-5 years.
Hosting: see Web Hosting
HTML (HyperText Markup Language): A markup language that uses tags to describe the structure of a web page and to specify the format of text (font, bold, italics), the header of a page, etc. HTML is now commonly used in conjunction with CSS, a stylesheet language that modifies the design and appearance of HTML elements and offers an easier way of defining the style of a site.
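As a sketch of how tags give a page structure that software can work with, the standard-library HTML parser below extracts link targets from a small fragment (the fragment is invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = '<p>See the <a href="https://www.historians.org">AHA</a> site.</p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['https://www.historians.org']
```

Web crawlers use the same idea at scale: parse each page's tags, collect its links, and follow them.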
HTTP (Hypertext Transfer Protocol): The system of rules used to communicate data in the web; it appears as the first part of a URL or web address. HTTP requests are sent by clients such as web browsers to servers, which return responses. If the URL is valid and the connection is established, the server sends a webpage and related files or other content. HTTPS is an extension of HTTP that uses an encrypted protocol to provide secure communication.
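An HTTP request as a client would construct it can be sketched with Python's standard library; the request is built but never sent, so no network connection is made (the URL and header value are illustrative):

```python
from urllib.request import Request

# Build a GET request for a page, with a client-identifying header
req = Request("https://historians.org/index.html",
              headers={"User-Agent": "example-client/1.0"})

print(req.get_method())  # GET  (no body data, so the method defaults to GET)
print(req.host)          # historians.org
print(req.selector)      # /index.html
```

A server receiving this request would respond with a status code and, if the path is valid, the page and related files.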
Image Search: A text search of an index of images that returns a list of image files identified based on keywords and metadata, such as title and caption, associated with image files. Algorithms rank image search results for relevance based on that textual information. Image search does not search the image files themselves. You can search image files using Reverse Image Search, but only for images that match an image you provide.
Interface: see User interface
Internet (interconnected network): A global system of interconnected computer networks – a network of networks – that use a variety of different telecommunications technologies (DSL, TV cable, fiber optics) and relay information using the Internet Protocol (IP). The large number of redundant network links and the lack of central control make the internet very resilient. Services delivered on the internet include the Web, social media, and email.
IP address (internet protocol address): A unique numerical label assigned to devices connected to the internet. An IP address is used to identify a device, and where it is in a network, which allows a path to the machine to be established. Most users do not use IP addresses to access the web; instead they use a more easily remembered domain name, which a DNS (domain name server) translates into an IP address.
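What an IP address looks like as data can be sketched with the standard library (the addresses below are illustrative examples, not addresses of any particular service):

```python
import ipaddress

# Parse a public IPv4 address and inspect it
addr = ipaddress.ip_address("93.184.216.34")
print(addr.version, addr.is_private)  # 4 False

# Addresses like those a home router assigns are flagged as private:
home = ipaddress.ip_address("192.168.1.10")
print(home.is_private)  # True
```

A DNS lookup is what turns a domain name such as historians.org into a numerical address of this form.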
ISP (internet service provider): An organization that provides access to the internet and other related services such as hosting. ISPs can connect to the network and transmit data using different technologies, including copper wire telephone lines, television cable, and fiber optic cable. Tier 1 ISPs operate the global internet backbone. Other ISPs buy access to the internet from Tier 1 ISPs and sell it to users. In 2019, just over half of internet access in the US was provided by cable TV providers, and most of the remainder by telephone companies.
JPEG: An image format that uses lossy compression – which compresses images by discarding some data when they are edited and saved. The most commonly used format in digital cameras and for storing and transmitting image files online. See also TIFF; Pixel; Resolution.
KML (Keyhole Markup Language); KMZ file: An XML-based markup language that uses tags to describe geographic information about a place that can be displayed on maps. Originally developed for Google Earth. KMZ files are compressed KML files.
LAMP (Linux, Apache, MySQL, PHP/Python): An open source software bundle that is used to create web sites and web applications: Linux is the operating system, Apache is the webserver, MySQL is the database, PHP/Python is the scripting language.
LMS (Learning Management System): A content management system designed for teaching and learning, offering the ability to organize content by classes and courses, design quizzes, and manage grades and monitor the activity of students. The best known example is Blackboard.
Lossless compression: see TIFF; GIF
Lossy compression: see JPEG
Machine Learning: Algorithms that automate analysis by taking a sample of training data and progressively building a statistical model to categorize or classify data. Commonly used when the features and patterns of the data are too fuzzy to make it feasible to use strict instructions to sort the data.
Markup language: A computer language that uses tags to define elements within a document. The language contains standard words rather than code so is human readable. The two most popular markup languages are HTML and XML. Historians and literary scholars often use an adaptation of XML called TEI to identify and mark up particular non-technical elements of a document (e.g. people or places). See also KML.
Metadata: Data about data, or information that describes an item. Metadata is what you read in library catalog records or museum collections management systems. Standardized metadata uses agreed-on spelling, language, date formats etc in order to allow metadata to be compared. Metadata standards or schemas are sets of structured and standardized metadata, developed to describe resources for a particular purpose or community. Dublin Core is a widely used metadata standard for describing digital and physical resources.
Modem: A device that provides a connection to the internet, and encodes data for transmission over networks of telephone wire, television cable or fiber optic cable and decodes transmitted data. Most home modems are now combined with a router to allow devices to establish a connection to the internet via wi-fi rather than by using Ethernet cables plugged into the modem.
Natural Language Processing (NLP): Algorithms that identify features of language, such as the part of speech of each word, the basic form of a word (lemmatization), nouns that refer to real-world entities like people, places, events, and organizations (named entity recognition, or NER), and the relationships of the words in a sentence (dependency parsing).
Network Analysis: A computational method that uses network graphs to visualize and measure non-spatial relationships between people, groups or information. These graphs render the components of the network as nodes and the relationships between them as edges or links, and allow multiple types of both nodes and edges. The resulting networks can describe which entities are most central to those relationships, or the density or degree of centralization of the whole network. Gephi has been the open source network visualization software most commonly used in the digital humanities (https://gephi.org/).
OCR (Optical Character Recognition): Software that converts digital images (photographs, scans) of text to machine readable text that can be analyzed with computational methods. Generally only effective for text in modern typefaces (although machine learning algorithms are being developed to convert older typefaces and handwriting). See DPI.
Omeka: An open-source content management system which uses an item (object/image/document) as the primary piece (as opposed to WordPress, which uses the post) and Dublin Core metadata to describe items. Omeka is commonly used for the creation of digital collections and for exhibitions based on those collections, and by archives, libraries and museums, and in classrooms. www.omeka.org
Open source: Software whose source code is made freely available and can be modified and redistributed, encouraging open collaboration on development of the software. Examples widely used in the field of digital humanities are content management systems such as WordPress, Omeka, and Scalar, and computational tools such as Voyant and Gephi.
Pixel: A physical point that is the smallest component of a digital image (the word is a combination of pix, for picture, and element). The total number of pixels is a measure of the resolution or detail of an image. Pixel is also used in a variety of other contexts: to express the number of pixels on a printed image, or on a display (screen, monitor), or in a digital camera photosensor element. See DPI.
Primary Sources: Documents or artifacts created by a witness to or participant in an event. They can be firsthand testimony or evidence created during the time period that you are studying. Types of primary sources may include diaries, letters, interviews, oral histories, photographs, newspaper articles, government documents, poems, novels, plays, and music.
Relevance: Criteria employed by an algorithm to sort search results so that the “best” results appear at the top of the list. A range of different criteria can be used to determine relevance, but information on which criteria a specific search engine uses is rarely available. See Google search and Black box.
Resolution: The detail an image holds. The resolution of digital images is usually measured as the number of pixels in an image. An image that is 2048 pixels in width and 1536 pixels in height has a total of 3,145,728 pixels, or 3.1 megapixels. The resolution of print is usually measured as dpi (dots per inch), which is derived from the number of pixels but refers to droplets of ink, with 300 dpi accepted as the professional standard for a quality printed image. The resolution of displays (screens, monitors) is usually measured in pixels as width x height, but since it is dependent on the size of the display, ppi (pixels per inch) is a more meaningful measure. For example, MacBook models introduced in 2015 or later have 2304 x 1440 pixels at 226 pixels per inch.
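The megapixel and ppi arithmetic above can be sketched as follows (the 12-inch diagonal assumed below matches the MacBook example's 226 ppi, but is an assumption introduced here for illustration):

```python
import math

# Megapixels: total pixels divided by one million
width_px, height_px = 2048, 1536
megapixels = width_px * height_px / 1_000_000
print(round(megapixels, 1))  # 3.1

def ppi(w_px, h_px, diagonal_in):
    """Pixels per inch of a display, from pixel dimensions and diagonal size."""
    return math.hypot(w_px, h_px) / diagonal_in

# A 2304 x 1440 display with an assumed 12-inch diagonal:
print(round(ppi(2304, 1440, 12)))  # 226
```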
Responsive web design: The design of web pages to render well on a variety of devices and windows or screen sizes. Since the default design for web pages generally assumes they will be viewed on a computer monitor, responsive web design means ensuring that those pages also render well on a phone or tablet.
Reverse Image Search: A search for an image based on a mathematical model of the features such as colors, points, lines, and textures in a submitted image file. That model is then used to generate a search of an index of images. Google’s reverse image search returns a set of images that match the submitted image as well as visually similar images and a list of websites that contain those images. By contrast, image search is a search of text associated with images.
Robots.txt: A file that provides web crawlers with instructions in a specific format about which pages or files they can and cannot request from a web site. It is up to the web crawler whether to obey the instructions in a robots.txt file. Crawlers used by the Internet Archive and Google do obey the instructions (although content blocked by robots.txt can still be indexed by Google if it is linked to from other parts of the web). See Google Search.
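How a well-behaved crawler checks robots.txt rules can be sketched with the standard-library parser; the rules and URLs below are an invented example, and no network access is involved:

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt: all crawlers are asked to avoid /private/
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyCrawler", "https://example.org/private/notes.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.org/exhibits/"))           # True
```

As the entry notes, nothing enforces these answers; a crawler chooses whether to consult them before requesting a page.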
Router: A networking device that forwards data packets between networks, sending them to another router until they reach their destination. Home routers forward data between a computer in your home and the internet via a modem and a connection provided by an ISP; most modern home devices combine a router and modem.
Scalar: An open source content management system for publishing long-form digital texts. It is designed to allow for publications to be organized in nested, recursive and non-linear formats, and for annotation of a variety of media. https://scalar.me/anvc/scalar/
Secondary Sources: Sources that analyze a scholarly question, often using primary sources as evidence. Types of secondary sources include books and articles about a topic.
Server: see Web Server
Spatial analysis: A computational method that involves mapping and other forms of visualization that employ spatial data to analyze historical processes. Mapping involves georeferencing location information to generate coordinates that can be mapped and visualizing that data using GIS software, web mapping platforms, or programming with open source tools such as Leaflet and Openlayers.
SQL (Structured Query Language): A programming language used to query, insert, update and modify data in a database. WordPress uses SQL to manage the database that stores information about your site, as one component of a CMS.
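A minimal sketch of SQL queries, using Python's built-in SQLite database (the table and data are invented for illustration):

```python
import sqlite3

# An in-memory database with one table of collection items
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (title TEXT, year INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [("Letter", 1863), ("Diary", 1864), ("Map", 1863)])

# Query: how many items per year?
rows = conn.execute(
    "SELECT year, COUNT(*) FROM items GROUP BY year ORDER BY year"
).fetchall()
print(rows)  # [(1863, 2), (1864, 1)]
```

A CMS such as WordPress issues queries of this shape behind the scenes every time it assembles a page from its database.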
Structured Data: Data organized in a database or with markup tags, where each element fits a field in a table or has a label in a markup language. Structured data can be analyzed using computational methods. See also unstructured data.
SVG (Scalable Vector Graphic): An XML-based image format; as it is based on markup language, an SVG image can be edited as code in a text editor. SVG images can be created in graphics software such as Adobe Illustrator and Sketch.
Text analysis (aka text mining): The computational analysis of textual data – words – in digitized documents. The simplest text analysis algorithms discard word order to count the frequency of words in a corpus of documents. Voyant is an open source tool for simple text analysis (https://voyant-tools.org/). This form of text analysis can also be used to measure and compare the similarity of texts by counting the words and phrases they have in common. Other forms of text analysis build on those algorithms to try to identify the semantic relationships between words, and consequently the concepts in texts; see Corpus Linguistics; Distant reading; Topic Modeling.
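The simplest form of text analysis, counting word frequencies while discarding word order, can be sketched in a few lines (the sample sentence is illustrative):

```python
from collections import Counter
import re

text = "The past is a foreign country: they do things differently in the past."

# Lower-case and extract runs of letters, discarding punctuation and order
words = re.findall(r"[a-z']+", text.lower())
freq = Counter(words)

print(freq["the"], freq["past"])  # 2 2
```

This "bag of words" is the starting point for the more elaborate methods (similarity measures, topic modeling) described in the entry.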
TEI (Text Encoding Initiative): A set of guidelines that define an XML markup language format to tag textual components (eg word, sentence) and concepts (eg person, place). TEI is widely used in literary studies and in digital editions of texts.
Tier 1 ISP: An internet service provider that operates part of the global internet backbone and can reach every other network on the internet. Tier 1 ISPs exchange information with each other without cost in peering agreements, and then charge other ISPs for connections to their networks.
TIFF (Tagged Image File Format); TIF: An image file format supported by a wide variety of software that uses lossless compression – meaning no image quality is lost when the file is edited or saved — and consequently is the file format used for preserving images. Other common lossless file formats are PNG and GIF. See also JPEG; Resolution; Pixel.
Tokenization: An algorithm that splits a string of characters into pieces such as individual words, phrases or even whole sentences by looking for space, punctuation or line breaks. The tokens produced by this process are used in computational text analysis.
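A minimal tokenizer along these lines, splitting on anything that is not a word character (a sketch using a simple regular expression, not a production tokenizer):

```python
import re

def tokenize(text):
    """Split text into lower-case tokens at punctuation and whitespace."""
    return [t for t in re.split(r"[^\w']+", text.lower()) if t]

print(tokenize("Dr. Smith's house, c. 1863!"))
# ['dr', "smith's", 'house', 'c', '1863']
```

The example shows why tokenization involves judgment calls: abbreviations like "Dr." and "c." lose their periods, and a different rule might keep or split the possessive.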
Top Level Domain: The last part of a domain name that identifies computers or services connected to the internet. The two major types of top level domain are country code (those with the largest number of domain names are .cn (China), .tk (Tokelau) and .de (Germany)) and generic (those with the largest number of domain names are .com, .net and .org). Management of most top-level domains is handled by the Internet Corporation for Assigned Names and Numbers (ICANN).
Topic modeling: A form of text analysis that uses algorithms to capture semantic features by identifying clusters of words – topics — that are more likely to appear in proximity to each other. The algorithm divides the texts into as many topics as the user specifies to produce a model of the possible themes of the corpus. It is up to the researcher to determine the meaning of those topics; a topic could capture stylistic features or systematic OCR errors as well as themes.
Unstructured Data: Data that is not organized in a database or with markup tags. The text documents that humanities scholars commonly study, for example, are unstructured data; they can have elements of structure, such as the date, sender and recipient information in a letter, but not all the text fits those categories. Information in unstructured data needs to be tagged in a consistent way or extracted and organized in a database before it can be analyzed using computational methods such as mapping and network analysis. Unstructured textual data can be analyzed with computational methods such as text analysis, topic modeling and corpus linguistics. See also structured data.
URL (Uniform Resource Locator): Commonly referred to as a web address, a URL specifies the location of a web site or web application and a mechanism for retrieving it. It is usually displayed in a web browser above the page in an address bar. A typical URL includes a protocol for how the data is transmitted (usually http or https), a domain name identifying the location of a resource (eg historians.org), and a file name (eg index.html) identifying a specific web page or a database query (/?page_id=21).
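The parts of a URL named above can be inspected with Python's standard library (the URL below is illustrative):

```python
from urllib.parse import urlparse

url = "https://historians.org/blog/index.html?page_id=21"
parts = urlparse(url)

print(parts.scheme)  # https          (the protocol)
print(parts.netloc)  # historians.org (the domain name)
print(parts.path)    # /blog/index.html
print(parts.query)   # page_id=21     (a database query)
```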
User Interface: The space where interactions between humans and computers occur. The most common user interface is a graphical user interface (GUI) that combines tactile elements (via a keyboard, mouse or touch screen) and visual elements (graphical display). Computer operating systems currently use concepts related to the desktop to help users interact more easily with the computer: the monitor as the top of the desk, with objects such as documents and folders placed on it. See also Generous interface.
Virtual reality, VR: A computer-generated simulation that immerses the user in a three-dimensional environment with which they can interact. Current technology uses headsets to generate images, sounds, and sensations, sometimes augmented by controllers that transmit vibrations and other tactile sensations.
Visualization, data visualization: Placing data in a visual context in order to analyze and communicate it; encompasses images, diagrams, graphs, maps and animations. Most computational methods produce visualizations. Visualizations in digital humanities are commonly research tools produced to explore data, but they can also be used to communicate arguments.
Web: A service delivered on the internet consisting of a series of interconnected web pages and resources stored on a web server and retrieved and displayed by a software application called a web browser.
Web archive: Content collected from the web in order to preserve and provide long term access to information available online. Collection is typically done automatically using web crawlers. The information collected includes web pages, CSS style sheets, images, video and metadata. The largest web archiving organization is the Internet Archive, which aims to archive the whole web. National and local agencies are also creating web archives of specific domains.
Web hosting: Providing a web server on which files, instances of CMS and web publishing platforms, and web applications/software can be made available on the internet. Some free hosting is available, usually only for specific platforms and with limited functionality and advertising. For example, a free WordPress site is available through WordPress.com, and a free Omeka site is available through omeka.net. Users of that hosting do not need to manage the servers in any way, so they are easy to use, but in both instances only some of the platform's features are available. A dedicated or managed hosting service leases space on its web servers, on which clients can store files and install software of their choice. Dedicated hosting requires an annual payment and some knowledge to manage, though both the cost and the skill required are diminishing. Reclaim Hosting, a service widely used in higher education in the US, offers hosting beginning at $30/year (2018) and one-click installation of platforms such as WordPress, Omeka, and Scalar that handles the most complex aspects of installing software.
Web mapping: Platforms such as Google Maps that offer online access to geographical data and APIs that allow users to create custom maps. An alternative to GIS used widely in digital humanities. Open source web mapping software developed for the humanities include Neatline (a set of plugins for Omeka) and Palladio.
Web Site: A collection of web pages stored on a web server connected to the internet. Web sites are now typically created using a CMS such as WordPress or Omeka, but they can simply be a set of files written in HTML.
Word Cloud: A visualization of word frequency that gives greater prominence to words that appear more frequently in a source text: the larger a word appears in the visualization, the more frequently it occurs in the text.
WordPress: An open source content management system originally developed for blogs. WordPress allows the creation of pages and posts; pages do not have a publication date and are intended for static content in a fixed location; posts have a publication date and appear in reverse chronological order, and can be tagged and categorized. Additional features can be added to a WordPress site by installing plugins.
WYSIWYG (“What You See Is What You Get”): An interface for creating and editing content that displays the content as it will appear when published, providing an alternative to interfaces that display the tags and markup language used to produce that appearance. The classic WordPress editorial interface provided one tab to view the content as it would appear (Visual) and a second tab to view the markup that produced that appearance (Text).
XML (EXtensible Markup Language): A markup language that uses tags to describe the content that it is identifying: title, author, year, genre etc. XML files are a form of structured data that can be analyzed using computational methods.
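A small sketch of XML's descriptive tags, parsed with Python's standard library (the record is invented for illustration; the title and author are those of George Eliot's 1871 novel):

```python
import xml.etree.ElementTree as ET

# Tags name the content they enclose, making the data both human- and
# machine-readable
doc = """<book>
  <title>Middlemarch</title>
  <author>George Eliot</author>
  <year>1871</year>
</book>"""

root = ET.fromstring(doc)
print(root.find("author").text)  # George Eliot
print(root.find("year").text)    # 1871
```

Because each element is labeled, the same record could be loaded into a database field by field, which is what makes XML a form of structured data.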