Introduction
Hypothes.is is an open-source annotation tool that lets you add highlights and comments to web pages. It is easy to use through a browser extension or by embedding its JavaScript client, but you may also want to back up the annotations you have accumulated or reuse them in other formats such as TEI/XML.
This article introduces how to export annotations using the Hypothes.is API and convert them to TEI/XML.
Obtaining an API Key
- Log in to Hypothes.is
- Go to Developer settings
- Click “Generate your API token” to create an API key
Save the obtained key in a .env file.
Exporting Annotations
API Basics
The base URL for the Hypothes.is API is https://api.hypothes.is/api. Authentication is done via the Authorization: Bearer <API_KEY> header.
Key endpoints:
| Endpoint | Purpose |
|---|---|
| GET /api/profile | Get authenticated user’s profile |
| GET /api/search | Search annotations |
| GET /api/annotations/{id} | Get individual annotation |
Script
Every step, from export through TEI/XML conversion, is consolidated in a single script, hypothes_export.py.
https://github.com/nakamura196/hypothes-export/blob/main/hypothes_export.py
Below, the main processing is excerpted and explained.
Loading .env and API Calls
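As a rough illustration of this step, the sketch below reads the key from a .env file and builds an authenticated request using only the standard library. It is not taken from hypothes_export.py: the variable name HYPOTHESIS_API_KEY and the helper names load_env, build_request, and api_get are assumptions made for this example, and the real script may use a package such as python-dotenv or requests instead.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

API_BASE = "https://api.hypothes.is/api"


def load_env(path=".env"):
    """Read KEY=VALUE lines from a .env file into a dict (blank lines and # comments are skipped)."""
    env = {}
    env_file = Path(path)
    if not env_file.exists():
        return env
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def build_request(endpoint, api_key, params=None):
    """Build a GET request for an endpoint such as "/search" with the Bearer header set."""
    url = API_BASE + endpoint
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def api_get(endpoint, api_key, params=None):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(endpoint, api_key, params)) as response:
        return json.load(response)
```

With a .env file containing a line like HYPOTHESIS_API_KEY=..., calling api_get("/profile", load_env()["HYPOTHESIS_API_KEY"]) would return the authenticated user's profile as a dict.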
Fetching All Annotations (with Pagination)
The Search API returns a maximum of 200 results per request, so all annotations are fetched by incrementing the offset.
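The offset loop described above might look like the following sketch. The function name fetch_all and the injected fetch_page callable are hypothetical; fetch_page stands for whatever code performs GET /api/search with the given query parameters and returns the decoded JSON, which the Search API delivers as an object with "total" and "rows" keys.

```python
def fetch_all(fetch_page, user, limit=200):
    """Collect every annotation for a user by advancing the offset page by page.

    fetch_page(params) is expected to return a dict with a "rows" list and a "total" count.
    """
    annotations = []
    offset = 0
    while True:
        page = fetch_page({"user": user, "limit": limit, "offset": offset})
        rows = page.get("rows", [])
        annotations.extend(rows)
        offset += len(rows)
        # Stop on an empty page or once every reported result has been collected.
        if not rows or offset >= page.get("total", 0):
            break
    return annotations
```

The user parameter takes the account form acct:username@hypothes.is. Passing the fetch function in as an argument keeps the loop testable without network access.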
Execution
Annotation Data Structure
Each annotation in the exported JSON has a structure based on the W3C Web Annotation Data Model.
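To make the shape concrete, here is a trimmed annotation written as a Python dict. The field names follow what the Hypothes.is API returns, but every value is made up for illustration, and the helper find_selector is a hypothetical name introduced here.

```python
# Illustrative annotation record; all values are invented for this example.
annotation = {
    "id": "EXAMPLE_ID",
    "user": "acct:example@hypothes.is",
    "uri": "https://example.org/article",
    "text": "A comment on the passage.",
    "tags": ["example"],
    "document": {"title": ["Example Article"]},
    "target": [
        {
            "source": "https://example.org/article",
            "selector": [
                {
                    "type": "RangeSelector",
                    "startContainer": "/div[1]/p[2]",
                    "startOffset": 10,
                    "endContainer": "/div[1]/p[2]",
                    "endOffset": 29,
                },
                {"type": "TextPositionSelector", "start": 1520, "end": 1539},
                {
                    "type": "TextQuoteSelector",
                    "exact": "standoff annotation",
                    "prefix": "use a ",
                    "suffix": " approach",
                },
            ],
        }
    ],
}


def find_selector(annotation, selector_type):
    """Return the first selector of the given type, or None if the annotation has none."""
    for target in annotation.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == selector_type:
                return selector
    return None
```

Page-level notes without a highlight carry no selector list, which is why the helper falls back to None.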
Three Types of Selectors
Hypothes.is records the text position of annotation targets using three types of selectors.
| Selector | Mechanism | Robustness |
|---|---|---|
| RangeSelector | Specifies position using XPath on the DOM | Fair - Vulnerable to HTML structure changes |
| TextPositionSelector | Specifies by character offset position | Fair - Shifts with text additions/deletions |
| TextQuoteSelector | Specifies by target text + surrounding context | Excellent - Can re-anchor via fuzzy match |
When the source document changes, Hypothes.is attempts these selectors as fallbacks in sequence. TextQuoteSelector performs fuzzy matching including prefix/suffix, making it the most robust, but if the target text itself is deleted or significantly modified, the annotation becomes “orphaned.”
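The fallback idea can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm the Hypothes.is client uses: it skips RangeSelector entirely, replaces fuzzy matching with exact string search, and uses prefix/suffix only to choose between multiple occurrences. The function name reanchor is hypothetical.

```python
def reanchor(text, position=None, quote=None):
    """Return (start, end) of the annotated span in text, or None if it cannot be found."""
    exact = quote.get("exact") if quote else None

    # 1. Trust the character offsets only if they still point at the quoted text.
    if position and exact:
        start, end = position["start"], position["end"]
        if text[start:end] == exact:
            return start, end

    # 2. Otherwise search for the quoted text and prefer the occurrence
    #    whose surrounding context matches the stored prefix and suffix.
    if exact:
        prefix = quote.get("prefix", "")
        suffix = quote.get("suffix", "")
        best = None
        index = text.find(exact)
        while index != -1:
            end = index + len(exact)
            score = int(text[:index].endswith(prefix)) + int(text[end:].startswith(suffix))
            if best is None or score > best[0]:
                best = (score, index)
            index = text.find(exact, index + 1)
        if best is not None:
            return best[1], best[1] + len(exact)

    # 3. Nothing matched: the annotation would be treated as orphaned.
    return None
```

A real implementation would tolerate small edits inside the quoted text itself, which exact search cannot do.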
Conversion to TEI/XML
The exported JSON is converted to TEI/XML format.
Mapping Strategy
| Hypothes.is | TEI/XML |
|---|---|
| Target document (URI, title) | <sourceDesc><bibl> |
| Group by document | <div> |
| Each annotation | <ab> |
| Highlighted text (TextQuoteSelector.exact) | <quote> |
| Comment body | <note type="annotation"> |
| Tags | <note type="tag"> |
Conversion Logic
Quote text is extracted from TextQuoteSelector and mapped to TEI elements.
Annotations are grouped by URI and output in the structure <div> -> <ab> -> <quote> / <note>. See the source code for details.
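A minimal sketch of this mapping, using only xml.etree.ElementTree, is shown below. It follows the table above but is not the code from hypothes_export.py: the helper names el, quote_text, and to_tei, the header text, and the use of <title> and <ref> inside <bibl> are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def el(parent, tag, text=None, **attrs):
    """Append a TEI-namespaced child element with optional text and attributes."""
    node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
    node.text = text
    return node


def quote_text(annotation):
    """Return TextQuoteSelector.exact, or an empty string if there is no quote."""
    for target in annotation.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == "TextQuoteSelector":
                return selector.get("exact", "")
    return ""


def to_tei(annotations):
    """Group annotations by URI and serialize them as div -> ab -> quote / note."""
    ET.register_namespace("", TEI_NS)
    by_uri = {}
    for annotation in annotations:
        by_uri.setdefault(annotation.get("uri", ""), []).append(annotation)

    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = el(el(tei, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", "Hypothes.is annotations")
    el(el(file_desc, "publicationStmt"), "p", "Exported via the Hypothes.is API")
    source_desc = el(file_desc, "sourceDesc")
    body = el(el(tei, "text"), "body")

    for uri, group in by_uri.items():
        # One <bibl> and one <div> per annotated document.
        titles = group[0].get("document", {}).get("title") or [uri]
        bibl = el(source_desc, "bibl")
        el(bibl, "title", titles[0])
        el(bibl, "ref", target=uri)
        div = el(body, "div")
        for annotation in group:
            ab = el(div, "ab")
            el(ab, "quote", quote_text(annotation))
            if annotation.get("text"):
                el(ab, "note", annotation["text"], type="annotation")
            for label in annotation.get("tags", []):
                el(ab, "note", label, type="tag")
    return ET.tostring(tei, encoding="unicode")
```

ElementTree escapes characters such as & and < in comment bodies automatically, which is the main reason to build the tree rather than concatenate strings.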
Output Example
Source Document Changes and Annotation Consistency
Hypothes.is annotations use a “standoff annotation” approach, stored separately from the source document. Therefore, when the source document changes, annotation positions may shift.
- Minor changes: Often re-anchored via TextQuoteSelector fuzzy matching
- Major changes: Annotations become “orphaned” and are no longer linked to their target locations
By exporting to TEI/XML, the highlighted target text is recorded in <quote> elements, so the correspondence with the source document is at least preserved as a record.
Summary
- The Hypothes.is API allows programmatic retrieval of your annotations
- TextQuoteSelector’s exact/prefix/suffix are the most important fields for identifying annotation target text
- Converting to TEI/XML enables storage and reuse in a format widely used in humanities research
- However, be aware of anchoring shifts due to source document changes
The source code is published on GitHub.