Corpus

The versioned data store for TypeScript. Track lineage, deduplicate content, extract structured observations.

Core Features

Type-Safe Stores

Define stores with Zod schemas. Full TypeScript inference with validation on encode and decode.

Content Deduplication

Automatic SHA-256 hashing. Store the same data twice, pay for one copy.
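The content-addressing idea can be sketched with Node's built-in `crypto` module (a conceptual illustration, not Corpus internals): identical payloads hash to the same SHA-256 digest, so a content-addressed store only keeps one copy.

```typescript
import { createHash } from 'node:crypto'

// Hash a canonical JSON encoding of the payload to get a content address.
const content_hash = (value: unknown): string =>
  createHash('sha256').update(JSON.stringify(value)).digest('hex')

const a = content_hash({ title: 'Hello World', body: 'My first article.' })
const b = content_hash({ title: 'Hello World', body: 'My first article.' })
const c = content_hash({ title: 'Hello World', body: 'Edited.' })

console.log(a === b) // identical content, identical hash: stored once
console.log(a === c) // different content, new hash: new blob
```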

Lineage Tracking

Link snapshots with parent references. Build complete data provenance graphs.
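As a sketch of what lineage enables (the exact parent-link API isn't shown on this page, so the snapshot shape below is hypothetical), a provenance graph is just a walk over parent references back to the root:

```typescript
// Hypothetical snapshot records: each version optionally points at a parent.
type Snapshot = { version: string; parent?: string }

const snapshots = new Map<string, Snapshot>([
  ['v1', { version: 'v1' }],
  ['v2', { version: 'v2', parent: 'v1' }],
  ['v3', { version: 'v3', parent: 'v2' }],
])

// Follow parent links from a version to the root to recover its provenance.
const lineage = (version: string): string[] => {
  const chain: string[] = []
  let cur = snapshots.get(version)
  while (cur) {
    chain.push(cur.version)
    cur = cur.parent ? snapshots.get(cur.parent) : undefined
  }
  return chain
}

console.log(lineage('v3')) // ['v3', 'v2', 'v1']
```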

Observations

Extract structured facts from documents with automatic staleness detection.

Quick Start

```typescript
import { z } from 'zod'
import { create_corpus, create_memory_backend, define_store, json_codec } from '@f0rbit/corpus'

const ArticleSchema = z.object({
  title: z.string(),
  body: z.string(),
  tags: z.array(z.string()),
})

const corpus = create_corpus()
  .with_backend(create_memory_backend())
  .with_store(define_store('articles', json_codec(ArticleSchema)))
  .build()

// Store a versioned snapshot
const result = await corpus.stores.articles.put({
  title: 'Hello World',
  body: 'My first article.',
  tags: ['intro', 'getting-started'],
})

if (result.ok) {
  console.log(`Version: ${result.value.version}`)
  console.log(`Hash: ${result.value.content_hash}`)
}
```

Why Corpus?

Errors as Values

No thrown exceptions. Every operation returns a Result<T, CorpusError> that you can pattern match on. Predictable, type-safe error handling.

Functional Composition

Builder pattern for setup, pure functions for operations. No hidden state, no surprises. Compose backends, stores, and observations.

Local-First Ready

Start with the memory backend for development, move to the file backend for persistence, then deploy to Cloudflare D1 + R2 for production. Same API everywhere.

Pluggable Backends

Memory for testing, filesystem for local dev, Cloudflare for production. Combine them with the layered backend for caching and replication.
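One way to picture the layered backend (a conceptual sketch over a minimal get/put interface, not Corpus's actual backend type) is a read-through, write-through cache: reads try the fast layer first and backfill it on a miss, writes go to both layers.

```typescript
// Minimal key-value backend interface, for illustration only.
interface KvBackend {
  get(key: string): Promise<string | undefined>
  put(key: string, value: string): Promise<void>
}

// In-memory backend backed by a Map.
const memory_backend = (): KvBackend => {
  const data = new Map<string, string>()
  return {
    async get(key) { return data.get(key) },
    async put(key, value) { data.set(key, value) },
  }
}

// Compose a fast cache layer over a slower source layer.
const layered = (cache: KvBackend, source: KvBackend): KvBackend => ({
  async get(key) {
    const hit = await cache.get(key)
    if (hit !== undefined) return hit
    const value = await source.get(key)
    if (value !== undefined) await cache.put(key, value) // backfill the cache
    return value
  },
  async put(key, value) {
    await cache.put(key, value) // write-through to both layers
    await source.put(key, value)
  },
})
```

The same shape supports replication (write to both, read from either) and migration (read from old, write to new).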

Extract Structured Facts with Observations

When you need to extract structured data from your documents (sentiment analysis, named entities, key facts), Observations provide first-class support with automatic provenance tracking.

```typescript
import { define_observation_type } from '@f0rbit/corpus'

const SentimentObservation = define_observation_type('sentiment', z.object({
  subject: z.string(),
  score: z.number().min(-1).max(1),
}))

const corpus = create_corpus()
  .with_backend(backend)
  .with_store(define_store('articles', json_codec(ArticleSchema)))
  .with_observations([SentimentObservation])
  .build()

// Store observation pointing back to source document
await corpus.observations.put(SentimentObservation, {
  source: { store_id: 'articles', version: article.version },
  content: { subject: 'product launch', score: 0.8 },
})

// Query observations - stale ones automatically filtered
for await (const obs of corpus.observations.query({ type: 'sentiment' })) {
  console.log(obs.content.subject, obs.content.score)
}
```

Backend Comparison

| Backend    | Storage   | Best For                         |
| ---------- | --------- | -------------------------------- |
| Memory     | In-memory | Testing, prototyping             |
| File       | Local disk | Development, desktop apps       |
| Cloudflare | D1 + R2   | Production, global distribution  |
| Layered    | Composite | Caching, replication, migration  |