Introduction

Long-term preservation of digital data is an important challenge for libraries, archives, and research institutions. Various factors such as changes in data formats, software obsolescence, and the evolution of storage technologies threaten the sustainability of digital information.

In this article, I introduce OCFL (Oxford Common File Layout), one solution to this challenge, covering its concepts, significance, and implementation examples.

What is OCFL

OCFL (Oxford Common File Layout) is a specification for preserving digital information in a structured, transparent, and predictable manner. It was developed primarily by the Bodleian Libraries at the University of Oxford and Stanford University Libraries, and has since evolved into a community-driven open standard.

Official Definition of OCFL

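“This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.”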

Five Key Principles of OCFL

  1. Completeness - The repository can be fully reconstructed from storage files
  2. Parsability - A structure understandable by both humans and machines
  3. Robustness - Resilience against errors and corruption
  4. Versioning - Maintains the change history of objects
  5. Storage Diversity - Supports various storage infrastructures

Why OCFL is Needed

Challenges of Digital Preservation

Traditional digital preservation has the following challenges:

  • Vendor lock-in - Dependency on specific systems or software
  • Difficulty of migration - Complex data migration during system updates
  • Lack of transparency - Unclear how data is stored
  • Long-term readability - No guarantee that data can be read decades later

Solutions Provided by OCFL

OCFL addresses these challenges with the following approaches:

  • Simple file structure - Uses standard file systems
  • JSON metadata - Stores management information in human-readable format
  • Hash-based integrity verification - Can detect data corruption
  • Transparent versioning - All change history is traceable
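As a rough illustration of the hash-based integrity check listed above, the following sketch recomputes a file's digest and compares it with a recorded value. It uses only Python's standard hashlib module; the names file_digest and is_intact are made up for this example and are not part of OCFL or of any OCFL library.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha512", chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_intact(path: Path, expected_digest: str, algorithm: str = "sha512") -> bool:
    """Compare the file's current digest with a recorded one (hex case is ignored)."""
    return file_digest(path, algorithm) == expected_digest.lower()
```

If a single byte of the file changes, the recomputed digest no longer matches the recorded one, which is how corruption becomes detectable.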

Basic Structure of OCFL

An OCFL repository has the following hierarchical structure:

    ocfl_root/
        0=ocfl_1.1                      # Marker indicating the OCFL specification version
        ocfl_layout.json                # Layout information
        object-001/                     # OCFL object
            0=ocfl_object_1.1           # Object version marker
            inventory.json              # Complete history of the object
            inventory.json.sha512       # Checksum of the inventory
            v1/                         # Version 1
                inventory.json          # History up to this version
                inventory.json.sha512
                content/                # Actual content
                    sample.txt
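One consequence of this structure is that objects can be located by looking for their marker files, without consulting a database. The sketch below assumes a repository on a local file system whose object roots contain a marker file named 0=ocfl_object_ followed by the version; find_object_roots is a hypothetical helper name, not a library function.

```python
from pathlib import Path
from typing import Iterator


def find_object_roots(storage_root: Path) -> Iterator[Path]:
    """Yield every directory under storage_root that contains an object marker file."""
    # The marker file name starts with "0=ocfl_object_" and ends with the spec version.
    for marker in sorted(storage_root.rglob("0=ocfl_object_*")):
        if marker.is_file():
            yield marker.parent
```

Calling list(find_object_roots(Path("ocfl_root"))) on the structure above would be expected to return the single directory ocfl_root/object-001.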

inventory.json - The Heart of OCFL

inventory.json is the most important element of an OCFL object, containing the following information:

    {
      "id": "object-003",
      "type": "https://ocfl.io/1.1/spec/#inventory",
      "digestAlgorithm": "sha512",
      "head": "v2",
      "contentDirectory": "content",
      "manifest": {
        "4171fdae3bde...": ["v1/content/data.txt"],
        "2d3d718515b3...": ["v2/content/data.txt"],
        "de313e63171...": ["v2/content/additional.txt"]
      },
      "versions": {
        "v1": {
          "created": "2025-11-05T11:32:17.462453+00:00",
          "message": "Version 1: Initial data",
          "user": {
            "name": "Data Manager",
            "address": "manager@example.com"
          },
          "state": {
            "4171fdae3bde...": ["data.txt"]
          }
        },
        "v2": {
          "created": "2025-11-05T11:32:17.462568+00:00",
          "message": "Version 2: Updated data",
          "user": {
            "name": "Data Manager",
            "address": "manager@example.com"
          },
          "state": {
            "2d3d718515b3...": ["data.txt"],
            "de313e63171...": ["additional.txt"]
          }
        }
      }
    }

Key Elements of inventory.json

  • manifest - A mapping table of file hash values to physical paths
  • versions - Creation date, message, and user information for each version
  • state - The state of logical files in each version
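How these three elements fit together can be sketched in a few lines: state gives the digest of each logical file in a version, and manifest gives the stored path for that digest. The inventory below is a shortened, made-up example (its digests are placeholders, not real SHA-512 values), and content_paths is a hypothetical helper name.

```python
from typing import Dict

# Shortened, made-up inventory: the digests are placeholders, not real SHA-512 values.
EXAMPLE_INVENTORY = {
    "head": "v2",
    "manifest": {
        "aaa111": ["v1/content/data.txt"],
        "bbb222": ["v2/content/data.txt"],
        "ccc333": ["v2/content/additional.txt"],
    },
    "versions": {
        "v1": {"state": {"aaa111": ["data.txt"]}},
        "v2": {"state": {"bbb222": ["data.txt"], "ccc333": ["additional.txt"]}},
    },
}


def content_paths(inventory: dict, version: str) -> Dict[str, str]:
    """Map each logical path in the given version to a stored content path."""
    manifest = inventory["manifest"]
    state = inventory["versions"][version]["state"]
    resolved = {}
    for digest, logical_paths in state.items():
        # Every manifest entry for a digest holds the same bytes, so the first one will do.
        stored_at = manifest[digest][0]
        for logical_path in logical_paths:
            resolved[logical_path] = stored_at
    return resolved


print(content_paths(EXAMPLE_INVENTORY, "v1"))
print(content_paths(EXAMPLE_INVENTORY, EXAMPLE_INVENTORY["head"]))
```

The same logical name, data.txt, resolves to a different stored file in v1 and in v2, which is what lets older versions stay untouched when a file is updated.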

Differences Between OCFL and Git

OCFL is similar to Git in that it performs version management, but their purposes and design philosophies differ significantly.

Item              | OCFL                                    | Git
------------------+-----------------------------------------+---------------------------------------------
Primary purpose   | Long-term digital preservation          | Collaborative work in software development
Design philosophy | Simplicity and transparency             | Efficiency and functionality
File format       | Standard files and JSON                 | Git-specific compressed object format
Readability       | Directly readable by humans             | Tools required
Branches          | Not supported                           | Core feature
Merging           | Not supported                           | Core feature
Deduplication     | Content addressing                      | Content addressing plus packfile compression
Guarantee         | Designed to be readable 100 years later | Optimized for current tools
Target users      | Archives and libraries                  | Software developers

Why Git is Not Sufficient

Git is an excellent version management system, but it has the following challenges for long-term preservation:

  1. Complexity - The internal structure of the .git directory is complex and difficult to understand without Git tools
  2. Tool dependency - A Git client is required to read data
  3. Binary format - Packfiles are efficient but raise concerns about future tool compatibility
  4. Feature overload - Has many features unnecessary for preservation

OCFL prioritizes simplicity and transparency, ensuring that data can be accessed without special tools.
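To make the point about access without special tools concrete, here is a minimal sketch that lists an object's change history using nothing beyond Python's standard json module. The inventory text is a shortened illustration containing only the fields the sketch reads, and version_history is a hypothetical helper name.

```python
import json

# Shortened inventory with only the fields this sketch reads (values are illustrative).
INVENTORY_JSON = """
{
  "head": "v2",
  "versions": {
    "v2": {"created": "2025-11-05T11:32:17.462568+00:00", "message": "Version 2: Updated data"},
    "v1": {"created": "2025-11-05T11:32:17.462453+00:00", "message": "Version 1: Initial data"}
  }
}
"""


def version_history(inventory: dict) -> list:
    """Return (version, created, message) tuples, oldest version first."""
    versions = inventory["versions"]
    # Version names are "v" plus a number, optionally zero-padded (v1, v0002).
    ordered = sorted(versions, key=lambda name: int(name[1:]))
    return [
        (name, versions[name].get("created", ""), versions[name].get("message", ""))
        for name in ordered
    ]


for entry in version_history(json.loads(INVENTORY_JSON)):
    print(*entry, sep="  ")
```

The equivalent information in Git lives in compressed objects and normally needs a Git client to read, whereas here a text editor would already be enough.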

Implementation Examples in This Repository

This repository implements the main features of OCFL using the Python OCFL Core library (imported as ocflcore).

Implementation Example 1: Creating a Simple OCFL Object

    from datetime import datetime, timezone

    from ocflcore import OCFLRepository, OCFLObject, OCFLVersion

    # Initialize the repository
    # (root, storage, file_stream, and digest are assumed to be set up beforehand)
    repository = OCFLRepository(root, storage)
    repository.initialize()

    # Create an object and version
    obj = OCFLObject("object-001")
    v1 = OCFLVersion(datetime.now(timezone.utc))
    v1._message = "Initial commit"
    v1.files.add("sample.txt", file_stream, digest)
    obj.versions.append(v1)

    # Add to the repository
    repository.add(obj)

Implementation Example 2: Version Management

object-003 in this repository records the update history of its files across two versions:

  • v1: Initial version of data.txt
  • v2: Update of data.txt + addition of additional.txt

This structure enables the following:

  • Restoration to any version
  • Tracking of change history
  • File integrity verification
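Restoration to any version can be sketched with the standard library alone. The example below assumes an object stored on a local file system with an inventory.json at the object root, as described earlier; restore_version is a hypothetical helper name, and the sketch does not check that logical paths stay inside the target directory.

```python
import json
import shutil
from pathlib import Path
from typing import List


def restore_version(object_root: Path, version: str, target_dir: Path) -> List[Path]:
    """Copy the logical files of one version into target_dir and return the copied paths."""
    inventory = json.loads((object_root / "inventory.json").read_text(encoding="utf-8"))
    manifest = inventory["manifest"]
    state = inventory["versions"][version]["state"]
    restored = []
    for digest, logical_paths in state.items():
        # The manifest says where the bytes with this digest are stored inside the object.
        source = object_root / manifest[digest][0]
        for logical_path in logical_paths:
            destination = target_dir / logical_path
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(source, destination)
            restored.append(destination)
    return sorted(restored)
```

Because every version keeps its own state, restoring v1 after v2 has been written needs no undo logic: the function simply follows the v1 state to the files that v1 referenced.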

Implementation Example 3: Deduplication

In object-004, files with the same content are added under different names:

    # Create 3 files with the same content
    v1.files.add("copy1/file_a.txt", file_stream_a, digest)
    v1.files.add("copy2/file_b.txt", file_stream_b, digest)
    v1.files.add("copy3/file_c.txt", file_stream_c, digest)

Result:

  • Logical file count: 3
  • Physical file count: 1

Because the manifest is keyed by digest, files with identical content can be stored as a single physical file, which saves storage.
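The counting behind this result can be reproduced without any OCFL library. The sketch below is not how the library used in this repository works internally; it is a simplified stand-in, with the hypothetical name plan_version, that builds a manifest and a state from in-memory files and stores each distinct digest only once.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_version(files: Dict[str, bytes], version: str = "v1") -> Tuple[dict, dict]:
    """Build (manifest, state) for one version; files maps logical path to content."""
    manifest: Dict[str, List[str]] = {}
    state: Dict[str, List[str]] = defaultdict(list)
    for logical_path in sorted(files):
        digest = hashlib.sha512(files[logical_path]).hexdigest()
        state[digest].append(logical_path)
        # Only the first file seen with a given digest gets a stored content path.
        manifest.setdefault(digest, [f"{version}/content/{logical_path}"])
    return manifest, dict(state)


same_content = b"identical bytes"
manifest, state = plan_version({
    "copy1/file_a.txt": same_content,
    "copy2/file_b.txt": same_content,
    "copy3/file_c.txt": same_content,
})
print("Logical file count:", sum(len(paths) for paths in state.values()))
print("Physical file count:", sum(len(paths) for paths in manifest.values()))
```

The state still lists all three logical names, so nothing is lost from the reader's point of view; only the stored copies are merged.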

Advantages of OCFL

1. Long-term Sustainability

  • Simple structure - No dependency on complex tools
  • Standard technologies - Only file systems and JSON
  • Clear specification - A published open standard

2. Data Integrity

  • Hash-based verification - Can verify the integrity of all files
  • Change tracking - All changes are recorded
  • Corruption detection - Automatic verification via checksums
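A minimal audit along these lines might look as follows. It assumes a local object whose inventory.json has a sidecar file named after the digest algorithm (for example inventory.json.sha512) that begins with the inventory's digest, and whose manifest keys are hex digests; audit_object and hex_digest are hypothetical names, and a full validator would check considerably more than this.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def hex_digest(path: Path, algorithm: str) -> str:
    """Return the lowercase hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def audit_object(object_root: Path) -> List[str]:
    """Return a list of problems found; an empty list means no mismatch was detected."""
    inventory_path = object_root / "inventory.json"
    inventory = json.loads(inventory_path.read_text(encoding="utf-8"))
    algorithm = inventory["digestAlgorithm"]
    problems = []

    # 1. The sidecar file records the digest of inventory.json itself.
    sidecar = object_root / f"inventory.json.{algorithm}"
    recorded = sidecar.read_text(encoding="utf-8").split()
    if not recorded or recorded[0].lower() != hex_digest(inventory_path, algorithm):
        problems.append("inventory.json does not match its sidecar digest")

    # 2. Every stored file must hash to the manifest key it is listed under.
    for digest, stored_paths in inventory["manifest"].items():
        for stored_path in stored_paths:
            file_path = object_root / stored_path
            if not file_path.is_file():
                problems.append(f"missing: {stored_path}")
            elif hex_digest(file_path, algorithm) != digest.lower():
                problems.append(f"digest mismatch: {stored_path}")
    return problems
```

Checking the inventory first matters: if the inventory itself were corrupted, the digests it lists could not be trusted for the second step.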

3. Flexibility

  • Storage-independent - Local file systems, S3, Glacier, etc.
  • Application-independent - Not tied to specific software
  • Easy migration - Can be migrated by simple file copying

4. Transparency

  • Human-readable - JSON files can be directly inspected
  • Auditable - All history is explicit
  • Easy to debug - Easy to identify and fix problems

Use Cases for OCFL

Suitable Applications

  • Digital archives - Digitization of cultural heritage
  • Research data management - Long-term preservation of academic research data
  • Compliance - Data preservation based on legal requirements
  • Institutional repositories - Publication management for universities and research institutions

Unsuitable Applications

  • Frequent updates - Real-time data changes
  • Collaborative editing - Simultaneous editing by multiple people
  • Branch management - Parallel development workflows
  • Temporary data - Data needed only for a short period

Practical Tips

1. Object Design

Design OCFL objects as logically coherent collections of data:

  • Good example: A single book with its metadata and images
  • Bad example: Multiple unrelated files in a single object

2. Versioning Strategy

When to create versions:

  • When there are meaningful changes
  • At publication or release
  • As periodic snapshots

3. Choosing a Digest Algorithm

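  • SHA-512 - Recommended (the algorithm the specification prefers)
  • SHA-256 - Permitted alternative
  • MD5 - Not suitable for content addressing (known collision weaknesses); at most usable as supplementary fixity information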

4. Storage Considerations

  • Cloud storage - S3, Azure Blob, Google Cloud Storage
  • Tape storage - Large-capacity long-term preservation
  • Distributed storage - Ensuring geographic redundancy

Summary

OCFL is a practical and powerful solution for long-term preservation of digital information. The following features make it an ideal choice for libraries, archives, and research institutions:

  • Simplicity - No dependency on complex tools
  • Transparency - Everything in human-readable format
  • Sustainability - Designed to be readable decades later
  • Flexibility - Supports various storage options

I hope that through the implementation examples in this repository, you have gained an understanding of the basic concepts and practical methods of OCFL. If you are facing challenges with digital preservation, please consider adopting OCFL.

Appendix: Glossary

  • OCFL Object - A collection of digital content with a unique ID
  • Inventory - A JSON file recording the history and state of an object
  • Manifest - A mapping table of file digest values to physical paths
  • Version - The state of an object at a specific point in time
  • State - The set of logical file paths in a given version
  • Digest - A hash value of a file (for integrity verification)
  • Content Directory - The directory where actual files are stored
  • Deduplication - A mechanism for storing files with identical content as a single physical file