Lesson 1.2: TEI-XML and Intelligent Search
Introduction
TEI is based on XML (Extensible Markup Language), a flexible markup language that represents and stores textual data in a hierarchical structure, making information easier to analyze, retrieve, and preserve. Unlike HTML, which is designed primarily for web display, XML does not define how data should be displayed; rather, it provides a structure for representing complex relationships in data. As Lou Burnard explains, TEI-XML gives us a framework for representing whatever is considered of importance about the text, not just its appearance, so that software can act on the distinctions identified, generating new visualisations and new perspectives (9).
XML is highly adaptable, allowing users to encode information using custom tags that define the structure and meaning of content. This makes XML ideal for organizing textual data in TEI projects, where structured markup facilitates advanced search capabilities, fosters data interoperability, and ensures long-term preservation of digitally encoded texts.
XML Features for Structured Text Analysis
- Self-descriptive: Tags are not predefined (users create their own meaningful tags).
- Hierarchical: XML data forms a tree of nested elements, which makes content easy to categorize and retrieve.
- Extensible: New elements can be added without breaking the structure.
- Interoperable: XML works across different platforms and systems.
- Adaptable: XML separates content from presentation; it does not dictate how data is displayed.
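To make the hierarchical feature concrete, here is a minimal Python sketch that parses a small TEI fragment and prints its element tree. The sample text is invented for illustration, and the helper names (local_name, outline) are hypothetical; only the standard-library xml.etree.ElementTree module is assumed.

```python
import xml.etree.ElementTree as ET

# Invented sample; the element names are standard TEI, the content is not from a real edition.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <div type="letter">
      <p>Dear <persName>John</persName>, I write from <placeName>London</placeName>.</p>
    </div>
  </body>
</text>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.split("}", 1)[-1]


def outline(element, depth=0):
    """Return one indented line per element, mirroring the nesting of the XML tree."""
    lines = ["  " * depth + local_name(element.tag)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(SAMPLE))
print("\n".join(tree_lines))
```

Each level of indentation in the output corresponds to one level of nesting, which is the structure that later queries rely on.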
How XML Enables Intelligent Search in TEI
TEI leverages XML to enhance the searchability and retrieval of textual data. Unlike plain text, XML provides a structured way to represent a document’s content, allowing for semantic tagging, metadata inclusion, and hierarchical organization. These features enable a more advanced kind of search, which Lou Burnard calls intelligent search (8). For example, he explains, in such a search ‘London’ as the name of a place in Canada is distinguished from that of a place in England, or the surname of an author (8). By enabling nuanced searches that distinguish between different contexts and meanings, TEI helps scholars express deeper interpretive insights and supports more inclusive and accurate textual analysis across disciplines.
Structured Data for Precise Queries
One of XML’s key advantages is its ability to structure textual data hierarchically, meaning that elements such as paragraphs, sections, and even words can be explicitly marked up. This structure allows search engines and computational tools to distinguish between different types of information. For example, in a TEI-encoded document, a user could search for a specific name only within <persName> tags, filtering out irrelevant mentions of the same word in other contexts.
Example Comparison:
- Plain text search: Searching for “John” in a plain text file returns all instances, whether it refers to a person, a location (e.g., “John Street”), or a reference in another context (e.g., “Dear John”).
- TEI-based XML search: Searching for <persName> elements containing “John” ensures that only named persons matching “John” appear in the results, excluding unrelated mentions.
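The comparison above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a recommended toolchain: the sample paragraph is invented, and the standard-library xml.etree.ElementTree module stands in for whatever search tool a project actually uses.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample text with three different uses of "John".
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <p><persName>John Smith</persName> lived on <placeName>John Street</placeName>.</p>
  <p>She sent him a "Dear John" letter.</p>
</body>"""

root = ET.fromstring(SAMPLE)

# Plain-text style search: count every occurrence of the string, whatever it refers to.
full_text = "".join(root.itertext())
plain_hits = full_text.count("John")

# Structured search: keep only <persName> elements whose text contains "John".
person_hits = [
    "".join(el.itertext())
    for el in root.findall(".//tei:persName", TEI_NS)
    if "John" in "".join(el.itertext())
]

print(plain_hits)   # 3
print(person_hits)  # ['John Smith']
```

The plain search reports three matches, while the structured search returns only the one string that an encoder marked as a personal name.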
Semantic Markup for Meaning-Based Search
XML allows TEI users to apply semantic tagging, which means encoding texts with meaningful labels rather than relying on basic string-matching techniques. This capability enhances context-aware searches, enabling researchers to retrieve results based on concepts rather than mere keywords.
Example Comparison:
- Plain text search: Searching for “queen” will return results including “queen bee,” “Queen Elizabeth,” and “queen-sized bed.”
- TEI-based XML search: Searching within <roleName> ensures that only mentions of royal titles like “Queen Elizabeth” or “Queen Victoria” appear in results, filtering out unrelated uses.
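As a rough sketch of this kind of meaning-based search, the Python below looks for "queen" only where it has been encoded as a role inside a personal name. The sample sentences are invented, and nesting <roleName> inside <persName> is one common encoding choice assumed here, not the only possible one.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample text; <persName> and <roleName> are standard TEI elements.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>The queen bee left the hive.</p>
  <p><persName><roleName>Queen</roleName> Victoria</persName> never owned a queen-sized bed.</p>
</body>"""

root = ET.fromstring(SAMPLE)

# Keyword search: every case-insensitive occurrence of "queen".
keyword_hits = "".join(root.itertext()).lower().count("queen")

# Semantic search: full names of persons whose markup includes a "queen" role.
royal_names = []
for person in root.findall(".//tei:persName", TEI_NS):
    role = person.find("tei:roleName", TEI_NS)
    if role is not None and "queen" in (role.text or "").lower():
        royal_names.append("".join(person.itertext()).strip())

print(keyword_hits)  # 3
print(royal_names)   # ['Queen Victoria']
```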
Metadata-Driven Search Capabilities
Metadata is a key part of XML-based TEI because it provides useful details about texts, like the author’s name, when it was published, and its historical context, enabling more precise filtering and organization. This structured approach helps users search and categorize texts more effectively, allowing them to retrieve relevant materials based on attributes such as date, author, or document type.
Example Comparison:
- Plain text search: Searching for “Shakespeare” retrieves all references, whether he is mentioned as an author, a character, or a subject of discussion.
- TEI-based XML search: A query filtering <author> ensures only works written by Shakespeare appear, rather than texts that merely mention him.
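The difference between a mention and an authorship statement can be sketched as follows. The two documents are invented, "Jane Doe" is a placeholder author, and make_doc is a hypothetical helper that builds a deliberately abbreviated header (a schema-valid TEI header needs more elements than shown).

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
AUTHOR_PATH = "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author"
TITLE_PATH = "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title"


def make_doc(title, author, body):
    """Build a tiny TEI document (hypothetical helper; header abbreviated, not schema-valid)."""
    return ET.fromstring(f"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>{title}</title><author>{author}</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>{body}</p></body></text>
</TEI>""")


corpus = [
    make_doc("Hamlet", "William Shakespeare", "To be, or not to be."),
    make_doc("On Tragedy", "Jane Doe", "Shakespeare shaped the genre."),
]


def title_of(doc):
    return doc.findtext(TITLE_PATH, default="", namespaces=TEI_NS)


# Plain-text style: any document in which the string appears anywhere.
mentions = [title_of(d) for d in corpus if "Shakespeare" in "".join(d.itertext())]

# Metadata-driven: only documents whose header names Shakespeare as author.
written_by = [
    title_of(d)
    for d in corpus
    if "Shakespeare" in d.findtext(AUTHOR_PATH, default="", namespaces=TEI_NS)
]

print(mentions)    # ['Hamlet', 'On Tragedy']
print(written_by)  # ['Hamlet']
```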
XPath and XQuery for Advanced Search
TEI-encoded XML documents benefit from powerful query languages such as XPath and XQuery, which are designed for searching and extracting information from XML documents. In TEI-encoded texts, they facilitate fine-grained search and text analysis.
XPath is a language used to navigate the structure of an XML document and retrieve specific elements based on their position within the hierarchy. It enables searches based on hierarchical relationships within the text, allowing users to define paths to find elements nested within specific sections or structures.
Example of XPath in TEI: Find all <title> elements, regardless of their depth within <teiHeader>, ensuring precise retrieval of document titles.
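A minimal sketch of this XPath example, run from Python, might look like the following. It assumes the standard-library xml.etree.ElementTree module, which supports only a subset of XPath, and an invented sample document; the tei: prefix is bound to the TEI namespace in the TEI_NS dictionary.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: two titles in the header, one title mentioned in the body.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Letters from London</title></titleStmt>
      <sourceDesc><bibl><title>Collected Letters</title></bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>He was reading <title>Hamlet</title>.</p></body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Every <title> at any depth below <teiHeader>.
header_titles = [t.text for t in root.findall(".//tei:teiHeader//tei:title", TEI_NS)]

# Every <title> anywhere in the document, for comparison.
all_titles = [t.text for t in root.findall(".//tei:title", TEI_NS)]

print(header_titles)  # ['Letters from London', 'Collected Letters']
print(all_titles)     # ['Letters from London', 'Collected Letters', 'Hamlet']
```

Restricting the path to the header excludes the title that is merely mentioned in the body text.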
XQuery builds upon XPath and provides additional functionalities for filtering, transforming, and structuring results from XML documents. It enables complex text retrieval operations, such as finding all instances of a term appearing within footnotes but not in main text.
Example of XQuery in TEI: Search for all <p> elements in the text body, but return only those from documents in which “Shakespeare” is listed as the author.
How XQuery Extends XPath:
- XPath is mainly used to locate and select XML nodes.
- XQuery retrieves, filters, transforms, and organizes data into structured outputs.
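The retrieve, filter, transform, and organize steps can be imitated in Python to show the shape of such a query. This is not XQuery itself: it is a sketch of the same logic using the standard-library xml.etree.ElementTree module, with an invented corpus and hypothetical helpers (make_doc, paragraphs_by_author). The comments mark which step loosely corresponds to the for, where, return, and order by clauses of an XQuery expression.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def make_doc(title, author, paragraphs):
    """Build an abbreviated TEI document (hypothetical helper, not schema-valid)."""
    body = "".join(f"<p>{text}</p>" for text in paragraphs)
    return ET.fromstring(
        f'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        f"<teiHeader><fileDesc><titleStmt>"
        f"<title>{title}</title><author>{author}</author>"
        f"</titleStmt></fileDesc></teiHeader>"
        f"<text><body>{body}</body></text></TEI>"
    )


def paragraphs_by_author(corpus, author_name):
    """Return body paragraphs grouped by title, for documents by the given author."""
    results = []
    for doc in corpus:  # "for": iterate over documents
        author = doc.findtext(".//tei:titleStmt/tei:author", default="", namespaces=TEI_NS)
        if author_name not in author:  # "where": filter on header metadata
            continue
        results.append({  # "return": build a structured result
            "title": doc.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS),
            "paragraphs": ["".join(p.itertext()) for p in doc.findall(".//tei:body//tei:p", TEI_NS)],
        })
    return sorted(results, key=lambda r: r["title"])  # "order by": sort the output


corpus = [
    make_doc("Macbeth", "William Shakespeare", ["Tomorrow, and tomorrow, and tomorrow."]),
    make_doc("On Tragedy", "Jane Doe", ["Shakespeare shaped the genre."]),
    make_doc("Hamlet", "William Shakespeare", ["To be, or not to be.", "The rest is silence."]),
]

results = paragraphs_by_author(corpus, "Shakespeare")
for entry in results:
    print(entry["title"], entry["paragraphs"])
```

The essay that only mentions Shakespeare is left out, and the remaining paragraphs come back grouped and sorted by title rather than as a flat list of nodes.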
By using XPath and XQuery, TEI-encoded XML documents become highly searchable and adaptable, allowing users to query texts based on structure, not just keywords.
Example Comparison:
- Plain text search: A search for “revolution” returns all results, regardless of where it appears in the document.
- TEI-based XML search with XPath/XQuery: A user can query for “revolution” only within <note> elements, retrieving only footnotes discussing the term.
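A sketch of the note-only search is given below. It assumes, for illustration, that footnotes are encoded as <note place="bottom">; projects differ in how they mark footnotes, so the attribute value should be adjusted to the encoding actually used. The sample text is invented.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: the term appears once in running text and once in a footnote.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>The revolution began in spring.<note place="bottom">On the term revolution, see the preface.</note></p>
  <p>Prices rose sharply.<note place="bottom">Grain figures are estimates.</note></p>
</body>"""

root = ET.fromstring(SAMPLE)

# Plain-text style: occurrences anywhere in the document.
all_hits = "".join(root.itertext()).lower().count("revolution")

# Structured: only notes placed at the bottom of the page that contain the term.
note_hits = [
    "".join(note.itertext())
    for note in root.findall(".//tei:note[@place='bottom']", TEI_NS)
    if "revolution" in "".join(note.itertext()).lower()
]

print(all_hits)   # 2
print(note_hits)  # ['On the term revolution, see the preface.']
```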
Machine-Readable Texts
Because XML is machine-readable, TEI-encoded texts can be indexed and processed by search engines in ways that plain text cannot. Instead of searching blindly through unstructured text, indexing mechanisms can categorize content by author, title, date, genre, or thematic elements. This allows for filtered and ranked results and enables more effective cross-corpus searching, allowing users to analyze large text collections with structured queries that differentiate between different types of content.
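One way to picture such indexing is a small field-aware index built in Python. This is a sketch under stated assumptions, not a description of how any particular search engine works: the two documents, the field names in FIELDS, and the helpers make_doc, build_index, and search are all hypothetical, and ranking is left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical field names mapped to the paths whose text should be indexed.
FIELDS = {
    "author": ".//tei:titleStmt/tei:author",
    "title": ".//tei:titleStmt/tei:title",
    "place": ".//tei:body//tei:placeName",
}


def make_doc(title, author, body_xml):
    """Build an abbreviated TEI document (hypothetical helper, not schema-valid)."""
    return ET.fromstring(
        f'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        f"<teiHeader><fileDesc><titleStmt>"
        f"<title>{title}</title><author>{author}</author>"
        f"</titleStmt></fileDesc></teiHeader>"
        f"<text><body>{body_xml}</body></text></TEI>"
    )


def build_index(corpus):
    """Map (field, value) pairs to the set of document ids in which they occur."""
    index = defaultdict(set)
    for doc_id, doc in corpus.items():
        for field, path in FIELDS.items():
            for element in doc.findall(path, TEI_NS):
                value = "".join(element.itertext()).strip()
                if value:
                    index[(field, value)].add(doc_id)
    return index


def search(index, field, term):
    """Return sorted ids of documents whose indexed value for `field` contains `term`."""
    hits = set()
    for (indexed_field, value), doc_ids in index.items():
        if indexed_field == field and term in value:
            hits |= doc_ids
    return sorted(hits)


# Invented two-document corpus.
corpus = {
    "novel": make_doc("The Call of the Wild", "Jack London",
                      "<p>Buck lived in the <placeName>Santa Clara Valley</placeName>.</p>"),
    "letter": make_doc("A Letter Home", "Jane Doe",
                       "<p>I arrived in <placeName>London</placeName> today.</p>"),
}

index = build_index(corpus)
print(search(index, "place", "London"))   # ['letter']
print(search(index, "author", "London"))  # ['novel']
```

Because each value is stored together with its field, the same string "London" leads to different documents depending on whether the query asks for a place or an author.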
Work Cited
Suggested Readings
- TEI Consortium. A Gentle Introduction to XML. TEI: Guidelines for Electronic Text Encoding and Interchange, TEI Consortium, 2025, tei-c.org/release/doc/tei-p5-doc/en/html/SG.html. Accessed February 20, 2025.
- Birnbaum, David J. What is XML and Why Should Humanists Care? An Even Gentler Introduction to XML. Digital Humanities obdurodon.org, 7 Dec. 2024, dh.obdurodon.org/what-is-xml.xhtml. Accessed February 20, 2025.
- A shamelessly short intro to XML for DH beginners (includes TEI). LaTeX Ninja’ing and the Digital Humanities, 2 Feb. 2022, latex-ninja.com/2022/02/02/a-shamelessly-short-intro-to-xml-for-dh-beginners-includes-tei/. Accessed February 20, 2025.
- Hawkins, Kevin S. “Introduction to XML for Text.” ultraslavonic.info, 31 Oct. 2019, ultraslavonic.info/intro-to-xml/. Accessed February 20, 2025.