Introduction
Long-term preservation of digital data is an important challenge for libraries, archives, and research institutions. Various factors such as changes in data formats, software obsolescence, and the evolution of storage technologies threaten the sustainability of digital information.
In this article, I introduce OCFL (Oxford Common File Layout), one solution to this challenge, covering its concepts, significance, and implementation examples.
What is OCFL
OCFL (Oxford Common File Layout) is a specification for storing digital information in a structured, transparent, and predictable manner. Its development involved editors from institutions including the Bodleian Libraries at the University of Oxford and Stanford University Libraries, and it is now maintained as a community-driven open specification.
Official Definition of OCFL
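The specification describes OCFL as an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner, designed to promote long-term object management best practices within digital repositories.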
Five Key Principles of OCFL
- Completeness - The repository can be fully reconstructed from storage files
- Parsability - A structure understandable by both humans and machines
- Robustness - Resilience against errors and corruption
- Versioning - Maintains the change history of objects
- Storage Diversity - Supports various storage infrastructures
Why OCFL is Needed
Challenges of Digital Preservation
Traditional digital preservation has the following challenges:
- Vendor lock-in - Dependency on specific systems or software
- Difficulty of migration - Complex data migration during system updates
- Lack of transparency - Unclear how data is stored
- Long-term readability - No guarantee that data can be read decades later
Solutions Provided by OCFL
OCFL addresses these challenges with the following approaches:
- Simple file structure - Uses standard file systems
- JSON metadata - Stores management information in human-readable format
- Hash-based integrity verification - Can detect data corruption
- Transparent versioning - All change history is traceable
Basic Structure of OCFL
An OCFL repository has the following hierarchical structure:
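As a rough illustration of that hierarchy, the sketch below builds an empty skeleton of a storage root holding one object and lists the resulting paths. The file names follow the naming conventions of the OCFL 1.1 specification (marker files, `inventory.json` with a digest sidecar, `vN/content`). The object directory name `object-001`, the helper `build_skeleton`, and the empty placeholder files are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path


def build_skeleton(root: Path) -> list[str]:
    """Create an empty, illustrative OCFL-style layout and return its file paths."""
    obj = root / "object-001"  # hypothetical object directory name
    files = [
        root / "0=ocfl_1.1",                   # storage root marker file
        obj / "0=ocfl_object_1.1",             # object root marker file
        obj / "inventory.json",                # current inventory
        obj / "inventory.json.sha512",         # digest sidecar for the inventory
        obj / "v1" / "inventory.json",         # inventory as of version 1
        obj / "v1" / "inventory.json.sha512",
        obj / "v1" / "content" / "data.txt",   # the stored payload file
    ]
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()  # placeholders only; real files carry content
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    layout = build_skeleton(Path(tmp))
    print("\n".join(layout))
```

Everything an application needs in order to interpret the object sits next to the content itself, which is what makes the layout readable with nothing more than a file browser.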
inventory.json - The Heart of OCFL
inventory.json is the most important element of an OCFL object, containing the following information:
Key Elements of inventory.json
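- manifest - A map from each file's digest to the content path(s) where those bytes are physically stored
- versions - One entry per version, recording its creation timestamp, message, and user information
- state - Inside each version entry, a map from digests to the logical file paths that make up that version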
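To make these elements concrete, here is a hand-built inventory for an object with a single file, expressed as a Python dictionary and printed as JSON. The top-level field names follow the OCFL 1.1 specification. The object identifier, timestamp, user name, and message are made-up values for illustration.

```python
import hashlib
import json

content = b"hello OCFL\n"
digest = hashlib.sha512(content).hexdigest()

inventory = {
    "id": "urn:example:object-001",  # hypothetical object identifier
    "type": "https://ocfl.io/1.1/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v1",
    # manifest: digest -> content path(s), relative to the object root
    "manifest": {digest: ["v1/content/data.txt"]},
    "versions": {
        "v1": {
            "created": "2024-01-01T00:00:00Z",  # illustrative timestamp
            "message": "Initial version",
            "user": {"name": "Example User"},   # illustrative user
            # state: digest -> logical path(s) visible in this version
            "state": {digest: ["data.txt"]},
        }
    },
}

print(json.dumps(inventory, indent=2))
```

The same digest appears as a key in both `manifest` and `state`: that shared key is the link between a logical file name and the bytes stored on disk.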
Differences Between OCFL and Git
OCFL is similar to Git in that it performs version management, but their purposes and design philosophies differ significantly.
| Item | OCFL | Git |
|---|---|---|
| Primary purpose | Long-term digital preservation | Collaborative work in software development |
| Design philosophy | Simplicity and transparency | Efficiency and functionality |
| File format | Standard files and JSON | Git-specific object store (compressed objects and binary packfiles) |
| Readability | Directly readable by humans | Tools required |
| Branches | Not supported | Core feature |
| Merging | Not supported | Core feature |
| Deduplication | Content addressing via the manifest | Content addressing plus delta compression in packfiles |
| Longevity goal | Content stays interpretable without the original software | Optimized for current tools |
| Target users | Archives and libraries | Software developers |
Why Git is Not Sufficient
Git is an excellent version management system, but it has the following challenges for long-term preservation:
- Complexity - The internal structure of the .git directory is complex and difficult to understand without Git tools
- Tool dependency - A Git client is required to read data
- Binary format - Packfiles are efficient but raise concerns about future tool compatibility
- Feature overload - Has many features unnecessary for preservation
OCFL prioritizes simplicity and transparency, ensuring that data can be accessed without special tools.
Implementation Examples in This Repository
This repository implements the main features of OCFL using the Python OCFL Core library.
Implementation Example 1: Creating a Simple OCFL Object
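A minimal sketch of what creating a one-version object involves, written with only the Python standard library rather than any particular OCFL library's API. The function name `create_object`, the object identifier, and the user name are hypothetical, and the output has not been checked with an OCFL validator, so treat it as an outline of the steps rather than a reference implementation.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_object(obj_root: Path, object_id: str, files: dict[str, bytes]) -> dict:
    """Write a single-version (v1) OCFL-style object and return its inventory."""
    content_dir = obj_root / "v1" / "content"
    content_dir.mkdir(parents=True, exist_ok=True)

    manifest: dict[str, list[str]] = {}
    state: dict[str, list[str]] = {}
    for logical_path, data in sorted(files.items()):
        target = content_dir / logical_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        digest = hashlib.sha512(data).hexdigest()
        manifest.setdefault(digest, []).append(f"v1/content/{logical_path}")
        state.setdefault(digest, []).append(logical_path)

    inventory = {
        "id": object_id,
        "type": "https://ocfl.io/1.1/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(timespec="seconds"),
                "message": "Initial version",
                "user": {"name": "Example User"},  # hypothetical user
                "state": state,
            }
        },
    }

    # Object marker file, then the inventory plus its digest sidecar,
    # written both at the object root and inside the version directory.
    (obj_root / "0=ocfl_object_1.1").write_text("ocfl_object_1.1\n")
    inv_bytes = json.dumps(inventory, indent=2).encode("utf-8")
    inv_digest = hashlib.sha512(inv_bytes).hexdigest()
    for folder in (obj_root, obj_root / "v1"):
        (folder / "inventory.json").write_bytes(inv_bytes)
        (folder / "inventory.json.sha512").write_text(f"{inv_digest} inventory.json\n")
    return inventory


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "object-001"  # hypothetical object directory
    inv = create_object(root, "urn:example:object-001", {"data.txt": b"hello OCFL\n"})
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    print("\n".join(created))
```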
Implementation Example 2: Version Management
In object-003 of this repository, the update history of files is managed:
- v1: Initial version of data.txt
- v2: Update of data.txt + addition of additional.txt
This structure enables the following:
- Restoration to any version
- Tracking of change history
- File integrity verification
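The following sketch shows how a version can be resolved from the inventory alone. It mirrors the two versions described above, but the inventory is trimmed to the relevant keys, the digests `d1` to `d3` are short placeholders rather than real SHA-512 values, and the helper names `files_at`, `digests_at`, and `diff_versions` are hypothetical.

```python
inventory = {
    "head": "v2",
    "manifest": {
        "d1": ["v1/content/data.txt"],        # placeholder digests, not real hashes
        "d2": ["v2/content/data.txt"],
        "d3": ["v2/content/additional.txt"],
    },
    "versions": {
        "v1": {"state": {"d1": ["data.txt"]}},
        "v2": {"state": {"d2": ["data.txt"], "d3": ["additional.txt"]}},
    },
}


def digests_at(inv: dict, version: str) -> dict[str, str]:
    """Map each logical path in `version` to the digest of its content."""
    state = inv["versions"][version]["state"]
    return {path: digest for digest, paths in state.items() for path in paths}


def files_at(inv: dict, version: str) -> dict[str, str]:
    """Map each logical path in `version` to a content path holding its bytes."""
    return {
        path: inv["manifest"][digest][0]
        for path, digest in digests_at(inv, version).items()
    }


def diff_versions(inv: dict, old: str, new: str) -> dict[str, list[str]]:
    """Summarize which logical paths were added, removed, or modified."""
    before, after = digests_at(inv, old), digests_at(inv, new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(p for p in before.keys() & after.keys() if before[p] != after[p]),
    }


print(files_at(inventory, "v1"))
print(diff_versions(inventory, "v1", "v2"))
```

Restoring v1 amounts to copying each content path returned by `files_at(inventory, "v1")` to its logical path, and the change history falls out of comparing the `state` blocks of two versions.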
Implementation Example 3: Deduplication
In object-004, files with the same content are added under different names:
Result:
- Logical file count: 3
- Physical file count: 1
Because the manifest is keyed by digest, files with identical content can be stored as a single physical file and referenced from several logical paths, which saves storage.
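The counts above can be reproduced with a small in-memory sketch. The file names `a.txt`, `b.txt`, and `c.txt` and the helper `build_maps` are hypothetical; the point is only that three logical paths with identical bytes collapse to one manifest entry when an implementation chooses to deduplicate.

```python
import hashlib


def build_maps(files: dict[str, bytes], version: str = "v1"):
    """Build manifest and state maps, storing each distinct content only once."""
    manifest: dict[str, list[str]] = {}
    state: dict[str, list[str]] = {}
    for logical_path, data in files.items():
        digest = hashlib.sha512(data).hexdigest()
        # Every logical path is recorded in the version state...
        state.setdefault(digest, []).append(logical_path)
        # ...but only the first file with a given digest gets a content path.
        manifest.setdefault(digest, [f"{version}/content/{logical_path}"])
    return manifest, state


files = {  # hypothetical file names with identical content
    "a.txt": b"same bytes\n",
    "b.txt": b"same bytes\n",
    "c.txt": b"same bytes\n",
}
manifest, state = build_maps(files)

logical_count = sum(len(paths) for paths in state.values())
physical_count = sum(len(paths) for paths in manifest.values())
print(f"logical: {logical_count}, physical: {physical_count}")  # logical: 3, physical: 1
```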
Advantages of OCFL
1. Long-term Sustainability
- Simple structure - No dependency on complex tools
- Standard technologies - Only file systems and JSON
- Clear specification - A published open standard
2. Data Integrity
- Hash-based verification - Can verify the integrity of all files
- Change tracking - All changes are recorded
- Corruption detection - Recorded checksums let a validation run flag damaged or altered files
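A fixity check of this kind can be sketched in a few lines: recompute the digest of every content path listed in the manifest and report any mismatch. The helper name `verify_manifest` is hypothetical, the sketch assumes SHA-512 as the digest algorithm, and it reads each file fully into memory, which a production validator would avoid for large files.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(obj_root: Path, manifest: dict[str, list[str]]) -> list[str]:
    """Return content paths that are missing or whose SHA-512 digest does not match."""
    problems = []
    for expected, content_paths in manifest.items():
        for rel in content_paths:
            path = obj_root / rel
            if not path.is_file():
                problems.append(rel)  # file listed in the manifest is missing
                continue
            actual = hashlib.sha512(path.read_bytes()).hexdigest()
            if actual != expected.lower():
                problems.append(rel)  # content no longer matches its digest
    return problems


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "v1" / "content" / "data.txt"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"original\n")
    manifest = {hashlib.sha512(b"original\n").hexdigest(): ["v1/content/data.txt"]}

    clean = verify_manifest(root, manifest)
    target.write_bytes(b"tampered\n")  # simulate silent corruption
    corrupted = verify_manifest(root, manifest)
    print(clean, corrupted)
```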
3. Flexibility
- Storage-independent - Local file systems, S3, Glacier, etc.
- Application-independent - Not tied to specific software
- Easy migration - Can be migrated by simple file copying
4. Transparency
- Human-readable - JSON files can be directly inspected
- Auditable - All history is explicit
- Easy to debug - Easy to identify and fix problems
Use Cases for OCFL
Suitable Applications
- Digital archives - Digitization of cultural heritage
- Research data management - Long-term preservation of academic research data
- Compliance - Data preservation based on legal requirements
- Institutional repositories - Publication management for universities and research institutions
Unsuitable Applications
- Frequent updates - Real-time data changes
- Collaborative editing - Simultaneous editing by multiple people
- Branch management - Parallel development workflows
- Temporary data - Data needed only for a short period
Practical Tips
1. Object Design
Design OCFL objects as logically coherent collections of data:
- Good example: A single book with its metadata and images
- Bad example: Multiple unrelated files in a single object
2. Versioning Strategy
When to create versions:
- When there are meaningful changes
- At publication or release timing
- As periodic snapshots
3. Choosing a Digest Algorithm
- SHA-512 - Recommended (default)
- SHA-256 - Alternative option
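- MD5 - Not permitted as the manifest digest algorithm (known collision weaknesses); it can only appear as an additional value in the optional fixity block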
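As a small illustration of this choice, the sketch below computes a content digest in chunks with Python's `hashlib` and rejects algorithms other than the two listed for content addressing. The helper name `stream_digest`, the `ALLOWED` tuple, and the chunk size are assumptions for illustration.

```python
import hashlib
import io

ALLOWED = ("sha512", "sha256")  # algorithms accepted here for content addressing


def stream_digest(stream, algorithm: str = "sha512", chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported digest algorithm: {algorithm}")
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


data = b"hello OCFL\n"
d512 = stream_digest(io.BytesIO(data))
d256 = stream_digest(io.BytesIO(data), "sha256")
print(len(d512), len(d256))  # 128 64 (hex characters)
```

For a file on disk, the same function can be called with a handle opened in binary mode, for example `open(path, "rb")`.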
4. Storage Considerations
- Cloud storage - S3, Azure Blob, Google Cloud Storage
- Tape storage - Large-capacity long-term preservation
- Distributed storage - Ensuring geographic redundancy
Summary
OCFL is a practical and powerful solution for long-term preservation of digital information. The following features make it an ideal choice for libraries, archives, and research institutions:
- Simplicity - No dependency on complex tools
- Transparency - Everything in human-readable format
- Sustainability - Designed to be readable decades later
- Flexibility - Supports various storage options
I hope that through the implementation examples in this repository, you have gained an understanding of the basic concepts and practical methods of OCFL. If you are facing challenges with digital preservation, please consider adopting OCFL.
Appendix: Glossary
- OCFL Object - A collection of digital content with a unique ID
- Inventory - A JSON file recording the history and state of an object
- Manifest - A mapping table of file digest values to physical paths
- Version - The state of an object at a specific point in time
- State - The set of logical file paths in a given version
- Digest - A hash value of a file (for integrity verification)
- Content Directory - The directory where actual files are stored
- Deduplication - A mechanism for storing files with identical content as a single physical file