Corpus

The versioned data store for TypeScript. Track lineage, deduplicate content, extract structured observations.

Core Features

Type-Safe Stores

Define stores with Zod schemas. Full TypeScript inference with validation on encode and decode.

Content Deduplication

Automatic SHA-256 hashing. Store the same data twice, pay for one copy.
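The content-addressing idea can be sketched with Node's built-in `crypto` module (a conceptual illustration, not Corpus internals): identical payloads hash to the same SHA-256 digest, so a content-addressed store only keeps one copy.

```typescript
import { createHash } from 'node:crypto'

// Hash a canonical JSON encoding of the payload to get a content address.
const content_hash = (value: unknown): string =>
  createHash('sha256').update(JSON.stringify(value)).digest('hex')

const a = content_hash({ title: 'Hello World', body: 'My first article.' })
const b = content_hash({ title: 'Hello World', body: 'My first article.' })
const c = content_hash({ title: 'Hello World', body: 'Edited.' })

console.log(a === b) // identical content, identical hash: stored once
console.log(a === c) // different content, new hash: new blob
```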

Lineage Tracking

Link snapshots with parent references. Build complete data provenance graphs.
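As a sketch of what lineage enables (the exact parent-link API isn't shown on this page, so the snapshot shape below is hypothetical), a provenance graph is just a walk over parent references back to the root:

```typescript
// Hypothetical snapshot records: each version optionally points at a parent.
type Snapshot = { version: string; parent?: string }

const snapshots = new Map<string, Snapshot>([
  ['v1', { version: 'v1' }],
  ['v2', { version: 'v2', parent: 'v1' }],
  ['v3', { version: 'v3', parent: 'v2' }],
])

// Follow parent links from a version to the root to recover its provenance.
const lineage = (version: string): string[] => {
  const chain: string[] = []
  let cur = snapshots.get(version)
  while (cur) {
    chain.push(cur.version)
    cur = cur.parent ? snapshots.get(cur.parent) : undefined
  }
  return chain
}

console.log(lineage('v3')) // ['v3', 'v2', 'v1']
```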

Observations

Extract structured facts from documents with automatic staleness detection.

Quick Start

```typescript
import { z } from 'zod'
import { create_corpus, create_memory_backend, define_store, json_codec } from '@f0rbit/corpus'

const ArticleSchema = z.object({
  title: z.string(),
  body: z.string(),
  tags: z.array(z.string()),
})

const corpus = create_corpus()
  .with_backend(create_memory_backend())
  .with_store(define_store('articles', json_codec(ArticleSchema)))
  .build()

// Store a versioned snapshot
const result = await corpus.stores.articles.put({
  title: 'Hello World',
  body: 'My first article.',
  tags: ['intro', 'getting-started'],
})

if (result.ok) {
  console.log(`Version: ${result.value.version}`)
  console.log(`Hash: ${result.value.content_hash}`)
}
```

Why Corpus?

Errors as Values

No thrown exceptions. Every operation returns a Result<T, CorpusError> that you can pattern match on. Predictable, type-safe error handling.

Functional Composition

Builder pattern for setup, pure functions for operations. No hidden state, no surprises. Compose backends, stores, and observations.

Local-First Ready

Start with the memory backend for development, move to the file backend for persistence, then deploy to Cloudflare D1 + R2 for production. Same API everywhere.

Pluggable Backends

Memory for testing, filesystem for local dev, Cloudflare for production. Combine them with the layered backend for caching and replication.
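One way to picture the layered backend (a conceptual sketch over a minimal get/put interface, not Corpus's actual backend type) is a read-through, write-through cache: reads try the fast layer first and backfill it on a miss, writes go to both layers.

```typescript
// Minimal key-value backend interface, for illustration only.
interface KvBackend {
  get(key: string): Promise<string | undefined>
  put(key: string, value: string): Promise<void>
}

// In-memory backend backed by a Map.
const memory_backend = (): KvBackend => {
  const data = new Map<string, string>()
  return {
    async get(key) { return data.get(key) },
    async put(key, value) { data.set(key, value) },
  }
}

// Compose a fast cache layer over a slower source layer.
const layered = (cache: KvBackend, source: KvBackend): KvBackend => ({
  async get(key) {
    const hit = await cache.get(key)
    if (hit !== undefined) return hit
    const value = await source.get(key)
    if (value !== undefined) await cache.put(key, value) // backfill the cache
    return value
  },
  async put(key, value) {
    await cache.put(key, value) // write-through to both layers
    await source.put(key, value)
  },
})
```

The same shape supports replication (write to both, read from either) and migration (read from old, write to new).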

Extract Structured Facts with Observations

When you need to extract structured data from your documents (sentiment analysis, named entities, key facts), Observations provide first-class support with automatic provenance tracking.

```typescript
import { define_observation_type } from '@f0rbit/corpus'

const SentimentObservation = define_observation_type('sentiment', z.object({
  subject: z.string(),
  score: z.number().min(-1).max(1),
}))

const corpus = create_corpus()
  .with_backend(backend)
  .with_store(define_store('articles', json_codec(ArticleSchema)))
  .with_observations([SentimentObservation])
  .build()

// Store observation pointing back to source document
await corpus.observations.put(SentimentObservation, {
  source: { store_id: 'articles', version: article.version },
  content: { subject: 'product launch', score: 0.8 },
})

// Query observations - stale ones automatically filtered
for await (const obs of corpus.observations.query({ type: 'sentiment' })) {
  console.log(obs.content.subject, obs.content.score)
}
```

Backend Comparison

| Backend    | Storage   | Best For                         |
| ---------- | --------- | -------------------------------- |
| Memory     | In-memory | Testing, prototyping             |
| File       | Local disk | Development, desktop apps       |
| Cloudflare | D1 + R2   | Production, global distribution  |
| Layered    | Composite | Caching, replication, migration  |