Named Entity Recognition (NER): A form of Natural Language Processing in which algorithms identify words and phrases referring to people, places, and organizations.
Natural Language Processing (NLP): Algorithms that identify features of language such as the part of speech of each word, the base form of a word (lemmatization), nouns that refer to real-world entities such as people, places, events, and organizations (NER), and the relationships between the words in a sentence (dependency parsing).
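As an illustration, a minimal sketch using the open source spaCy library (assuming spaCy and its small English model, en_core_web_sm, are installed; the sentence is invented):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # a small pretrained English pipeline
    doc = nlp("Ada Lovelace worked with Charles Babbage in London.")

    for token in doc:
        # part of speech, base form (lemma), and dependency relation
        print(token.text, token.pos_, token.lemma_, token.dep_)

    for ent in doc.ents:
        # named entities: people (PERSON), places (GPE), organizations (ORG)
        print(ent.text, ent.label_)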
Network Analysis: A computational method that uses network graphs to visualize and measure non-spatial relationships between people, groups, or information. These graphs render the components of the network as nodes and the relationships between them as edges or links, and allow multiple types of both nodes and edges. The resulting graphs can reveal which entities are most central to those relationships, as well as the density or degree of centralization of the whole network. Gephi has been the open source network visualization software most commonly used in the digital humanities (https://gephi.org/).
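A minimal sketch of these measures using the open source NetworkX library (the people and relationships are invented for illustration):

    import networkx as nx

    # A toy correspondence network: nodes are people, edges letters exchanged
    G = nx.Graph()
    G.add_edges_from([("Alice", "Bob"), ("Alice", "Carol"),
                      ("Bob", "Carol"), ("Carol", "Dave")])

    print(nx.degree_centrality(G))  # which nodes are most central
    print(nx.density(G))            # density of the whole network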
OCR (Optical Character Recognition): Software that converts digital images (photographs, scans) of text into machine-readable text that can be analyzed with computational methods. It is generally effective only for text in modern typefaces, although machine learning algorithms are being developed to handle older typefaces and handwriting. See DPI.
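A minimal sketch using the pytesseract wrapper (assuming the Tesseract OCR engine is installed; "scan.png" is a hypothetical image file):

    from PIL import Image
    import pytesseract  # wrapper around the Tesseract OCR engine

    # Convert a digital image of text into machine-readable text
    text = pytesseract.image_to_string(Image.open("scan.png"))
    print(text)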
Omeka: An open-source content management system that uses an item (object/image/document) as its primary unit of content (as opposed to WordPress, which uses the post) and Dublin Core metadata to describe items. Omeka is commonly used by archives, libraries, museums, and classrooms to create digital collections and exhibitions based on those collections. www.omeka.org
Open access: Material made freely available online. Usually refers to published peer-reviewed research made available without cost to the reader.
Open source: Software whose source code is made freely available and can be modified and redistributed, encouraging open collaboration on development of the software. Examples widely used in the field of digital humanities are content management systems such as WordPress, Omeka, and Scalar, and computational tools such as Voyant and Gephi.
Pixel: A physical point that is the smallest component of a digital image (the word is a combination of pix, for picture, and element). The total number of pixels is a measure of the resolution or detail of an image. Pixel is also used in a variety of other contexts: to express the number of pixels on a printed image, or on a display (screen, monitor), or in a digital camera photosensor element. See DPI.
Plugin: Software that adds a specific feature to an existing computer program. Used in WordPress and Omeka.
Primary Sources: Documents or artifacts created by a witness to or participant in an event. They can be firsthand testimony or evidence created during the time period that you are studying. Types of primary sources may include diaries, letters, interviews, oral histories, photographs, newspaper articles, government documents, poems, novels, plays, and music.
Programming language: A formal language consisting of instructions for computers, used to create programs that implement specific algorithms telling a computer what to do and how to do it. Each language has its own vocabulary and a syntax or grammar for organizing instructions. Languages commonly used in digital humanities include R, Python, JavaScript, and Ruby/Ruby on Rails.
Relevance: Criteria employed by an algorithm to sort search results so that the “best” results appear at the top of the list. A range of different criteria can be used to determine relevance, but information on which criteria a specific search engine uses is rarely available. See Google search and Black box.
Resolution: The detail an image holds. The resolution of digital images is usually measured as the number of pixels in an image. An image that is 2048 pixels in width and 1536 pixels in height has a total of 3,145,728 pixels or 3.1 megapixels. The resolution of print is usually measured as dpi (dots per inch), which is derived from the number of pixels but refers to droplets of ink, with 300 dpi accepted as the professional standard for a quality printed image. The resolution of displays (screens, monitors) is usually measured in pixels as width x height, but since it is dependent on the size of the display, ppi (pixels per inch) is a more meaningful measure. For example, MacBook models introduced in 2015 or later have 2304 x 1440 pixels at 226 pixels per inch.
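The arithmetic behind these measures, as a short sketch (assuming the MacBook panel's 12-inch diagonal):

    import math

    # Total pixels: the image example above
    print(2048 * 1536)  # 3145728 pixels, i.e. about 3.1 megapixels

    # Pixels per inch: diagonal in pixels divided by diagonal in inches
    ppi = math.hypot(2304, 1440) / 12
    print(round(ppi))  # about 226 ppi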
Responsive web design: The design of web pages to render well on a variety of devices and windows or screen sizes. Since the default design for web pages generally assumes they will be viewed on a computer monitor, responsive web design means ensuring that those pages also render well on a phone or tablet.
Reverse Image Search: A search for an image based on a mathematical model of features such as colors, points, lines, and textures in a submitted image file. That model is then used to generate a search of an index of images. Google’s reverse image search returns a set of images that match the submitted image as well as visually similar images and a list of websites that contain those images. By contrast, image search is a search of text associated with images.
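One much simplified model of image features is a perceptual hash; a sketch using the imagehash library, with hypothetical file names (Google's actual model is more sophisticated and not public):

    from PIL import Image
    import imagehash  # perceptual hashing: a compact model of image features

    h1 = imagehash.phash(Image.open("query.jpg"))
    h2 = imagehash.phash(Image.open("candidate.jpg"))

    # A small difference between hashes suggests visually similar images
    print(h1 - h2)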
Robots.txt: A file that provides web crawlers with instructions in a specific format about which pages or files they can and cannot request from a website. It is up to the web crawler whether to obey the instructions in a robots.txt file. Crawlers used by the Internet Archive and Google do obey them (although content blocked by robots.txt can still be indexed by Google if it is linked to from other parts of the web). See Google Search.
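A minimal robots.txt, with a hypothetical path:

    # Ask all crawlers not to request anything under /private/
    User-agent: *
    Disallow: /private/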
Router: A networking device that forwards data packets between networks, sending them to another router until they reach their destination. Home routers forward data between a computer in your home and the internet via a modem and a connection provided by an ISP; most modern home devices combine a router and modem.
Scalar: An open source content management system for publishing long-form digital texts. It is designed to allow for publications to be organized in nested, recursive and non-linear formats, and for annotation of a variety of media. https://scalar.me/anvc/scalar/
Search: See Google search (web search); full-text search; image search; reverse image search.
Secondary Sources: Sources that analyze a scholarly question, often using primary sources as evidence. Types of secondary sources include books and articles about a topic.
Server: See Web server.
Spatial analysis: A computational method that involves mapping and other forms of visualization that employ spatial data to analyze historical processes. Mapping involves georeferencing location information to generate coordinates, and then visualizing those data using GIS software, web mapping platforms, or programming with open source tools such as Leaflet and OpenLayers.
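A minimal sketch using the folium library, a Python wrapper around Leaflet (the coordinates and label are illustrative):

    import folium  # Python wrapper around the Leaflet web mapping library

    # Plot a georeferenced point on a web map
    m = folium.Map(location=[48.8566, 2.3522], zoom_start=12)
    folium.Marker([48.8566, 2.3522], popup="Paris, 1871").add_to(m)
    m.save("map.html")  # writes a self-contained web map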
SQL (Structured Query Language): A programming language used to query, insert, update, and modify data in a database. As one component of its CMS, WordPress uses SQL to manage the database that stores information about your site.
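A minimal sketch of SQL statements, run here against a throwaway SQLite database (the table and data are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a temporary in-memory database
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO posts (title) VALUES (?)", ("Hello, world",))

    # Query the data back out
    for row in conn.execute("SELECT id, title FROM posts"):
        print(row)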
Structured Data: Data organized in a database or with markup tags, where each element fits a field in a table or has a label in a markup language. Structured data can be analyzed using computational methods. See also unstructured data.
SVG (Scalable Vector Graphic): An XML-based image format; as it is based on markup language, an SVG image can be edited as code in a text editor. SVG images can be created in graphics software such as Adobe Illustrator and Sketch.
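A minimal SVG image, editable as plain text in any editor:

    <svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
      <!-- a red circle described entirely in markup -->
      <circle cx="50" cy="50" r="40" fill="red"/>
    </svg>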
Text analysis (also known as text mining): The computational analysis of textual data (words) in digitized documents. The simplest text analysis algorithms discard word order and count the frequency of words in a corpus of documents. Voyant is an open source tool for simple text analysis (https://voyant-tools.org/). This form of text analysis can also be used to measure and compare the similarity of texts by counting the words and phrases they have in common. Other forms of text analysis build on those algorithms to try to identify the semantic relationships between words, and consequently the concepts in texts; see Corpus Linguistics; Distant reading; Topic Modeling.
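A minimal sketch of the simplest form, word-frequency counting (the two-document corpus is invented):

    from collections import Counter
    import re

    corpus = ["It was the best of times", "It was the worst of times"]

    # Discard word order; count word frequencies across the corpus
    counts = Counter(word
                     for doc in corpus
                     for word in re.findall(r"[a-z']+", doc.lower()))
    print(counts.most_common(3))  # e.g. [('it', 2), ('was', 2), ('the', 2)]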
TEI (Text Encoding Initiative): A set of guidelines that define an XML markup language format to tag textual components (e.g., word, sentence) and concepts (e.g., person, place). TEI is widely used in literary studies and in digital editions of texts.
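A minimal TEI-style fragment tagging a person and a place (the sentence is invented):

    <p>On 14 July, <persName>Olympe de Gouges</persName> wrote from
      <placeName>Paris</placeName>.</p>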
Tier 1 ISP: An internet service provider that operates part of the global internet backbone and can reach every other network on the internet. Tier 1 ISPs exchange traffic with each other without cost under peering agreements, and charge other ISPs for connections to their networks.
TIFF (Tagged Image File Format); TIF: An image file format supported by a wide variety of software that uses lossless compression (meaning no image quality is lost when the file is edited or saved) and consequently is the file format used for preserving images. Other common lossless file formats are PNG and GIF. See also JPEG; Resolution; Pixel.
Tokenization: An algorithm that splits a string of characters into pieces such as individual words, phrases, or even whole sentences by looking for spaces, punctuation, or line breaks. The tokens produced by this process are used in computational text analysis.
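A naive tokenizer, as a sketch (real tokenizers handle many more edge cases):

    import re

    def tokenize(text):
        # Split on anything that is not a word character
        return re.findall(r"\w+", text.lower())

    print(tokenize("Dr. Smith arrived; she left at 9:00."))
    # ['dr', 'smith', 'arrived', 'she', 'left', 'at', '9', '00']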
Tool: A term for software used in the digital humanities.
Top Level Domain: The last part of a domain name that identifies computers or services connected to the internet. The two major types of top level domain are country code (those with the largest number of domain names are .cn (China), .tk (Tokelau) and .de (Germany)) and generic (those with the largest number of domain names are .com, .net and .org). Management of most top-level domains is handled by the Internet Corporation for Assigned Names and Numbers (ICANN).
Topic modeling: A form of text analysis that uses algorithms to capture semantic features by identifying clusters of words (topics) that are more likely to appear in proximity to each other. The algorithm divides the texts into as many topics as the user specifies, producing a model of the possible themes of the corpus. It is up to the researcher to determine the meaning of those topics; a topic could capture stylistic features or systematic OCR errors as well as themes.
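A minimal sketch using scikit-learn's latent Dirichlet allocation, one common topic modeling algorithm, with an invented corpus and two topics:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["grain harvest weather crops",
            "railway steam engine coal",
            "harvest weather rain crops"]

    # Count word frequencies, then fit a model with two topics
    vec = CountVectorizer()
    counts = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # Each topic is a distribution over words; the researcher interprets it
    vocab = vec.get_feature_names_out()
    for topic in lda.components_:
        print([vocab[i] for i in topic.argsort()[-3:]])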