Skip to content

Drizzle Schema

The corpus library exports Drizzle ORM schema definitions for direct database access. Use these when you need to query corpus tables directly, join with your own tables, or build custom queries beyond the high-level API.

Tables

corpus_snapshots

The primary table for storing versioned content snapshots.

import { corpus_snapshots } from '@f0rbit/corpus'
ColumnTypeDescription
store_idtextStore identifier (part of composite PK)
versiontextUnique version identifier (part of composite PK)
parentstextJSON array of parent version references
created_attextISO 8601 timestamp when snapshot was created
invoked_attextOptional ISO 8601 timestamp for scheduled invocations
content_hashtextSHA-256 hash of content for deduplication
content_typetextMIME type (e.g., application/json)
size_bytesintegerSize of the stored data
data_keytextKey to retrieve binary data from storage (R2/filesystem)
tagstextOptional JSON array of string tags

Primary Key: (store_id, version)

Indexes:

  • idx_store_created on (store_id, created_at) - efficient latest queries
  • idx_content_hash on (store_id, content_hash) - deduplication lookups
  • idx_data_key on (data_key) - data retrieval

corpus_observations

Table for storing structured observations that reference snapshots.

import { corpus_observations } from '@f0rbit/corpus'
ColumnTypeDescription
idtextPrimary key (unique observation identifier)
typetextObservation type name (e.g., entity_mention)
source_store_idtextStore ID of the referenced snapshot
source_versiontextVersion of the referenced snapshot
source_pathtextOptional JSONPath to specific element
source_span_starttextOptional start of character range
source_span_endtextOptional end of character range
contenttextJSON-encoded observation data
confidencerealOptional confidence score (0.0 to 1.0)
observed_attextISO 8601 timestamp when observation was made
created_attextISO 8601 timestamp when record was stored
derived_fromtextOptional JSON array of source snapshot pointers

Primary Key: id

Indexes:

  • idx_obs_type on (type) - filter by observation type
  • idx_obs_source on (source_store_id, source_version) - find observations for a snapshot
  • idx_obs_type_observed on (type, observed_at) - temporal queries by type
  • idx_obs_type_source on (type, source_store_id) - type queries within a store

Type Exports

Drizzle infers TypeScript types from schema definitions:

import type {
CorpusSnapshotRow,
CorpusSnapshotInsert,
ObservationRow,
ObservationInsert,
} from '@f0rbit/corpus'
TypeDescription
CorpusSnapshotRowSelect type - all columns as returned from queries
CorpusSnapshotInsertInsert type - required and optional columns for inserts
ObservationRowSelect type for observations
ObservationInsertInsert type for observations

Usage Examples

Querying Snapshots Directly

Use Drizzle’s query builder for custom queries:

import { drizzle } from 'drizzle-orm/d1'
import { eq, desc, and } from 'drizzle-orm'
import { corpus_snapshots } from '@f0rbit/corpus'
const db = drizzle(env.D1)
// Get latest 10 snapshots for a store
const recent = await db
.select()
.from(corpus_snapshots)
.where(eq(corpus_snapshots.store_id, 'articles'))
.orderBy(desc(corpus_snapshots.created_at))
.limit(10)
// Find snapshots by content hash (deduplication check)
const existing = await db
.select()
.from(corpus_snapshots)
.where(
and(
eq(corpus_snapshots.store_id, 'articles'),
eq(corpus_snapshots.content_hash, hash)
)
)
.limit(1)

Joining Observations with External Tables

Link observations to your domain tables:

import { drizzle } from 'drizzle-orm/d1'
import { eq, and } from 'drizzle-orm'
import { corpus_observations } from '@f0rbit/corpus'
import { users, documents } from './schema'
const db = drizzle(env.D1)
// Find all entity mentions in a user's documents
const mentions = await db
.select({
observation: corpus_observations,
document: documents,
})
.from(corpus_observations)
.innerJoin(
documents,
and(
eq(corpus_observations.source_store_id, 'documents'),
eq(corpus_observations.source_version, documents.corpus_version)
)
)
.innerJoin(users, eq(documents.user_id, users.id))
.where(
and(
eq(corpus_observations.type, 'entity_mention'),
eq(users.id, userId)
)
)

Aggregating Observations

import { count, sql } from 'drizzle-orm'
// Count observations by type
const counts = await db
.select({
type: corpus_observations.type,
count: count(),
})
.from(corpus_observations)
.groupBy(corpus_observations.type)
// Get observations with confidence above threshold
const confident = await db
.select()
.from(corpus_observations)
.where(sql`${corpus_observations.confidence} >= 0.8`)

Migrations

Include the corpus schemas in your Drizzle config:

drizzle.config.ts
import { defineConfig } from 'drizzle-kit'
export default defineConfig({
schema: [
'./src/schema.ts', // Your app schemas
'./node_modules/@f0rbit/corpus/schema.ts',
'./node_modules/@f0rbit/corpus/observations/schema.ts',
],
out: './drizzle',
dialect: 'sqlite',
})

Generate migrations when corpus updates its schema:

Terminal window
bunx drizzle-kit generate
bunx drizzle-kit migrate

When to Use Direct Queries

Use CaseApproach
Standard CRUD operationsHigh-level corpus API
Custom aggregationsDirect Drizzle queries
Joins with your tablesDirect Drizzle queries
Bulk operationsDirect Drizzle queries
Complex filteringDirect Drizzle queries

The high-level corpus API handles encoding/decoding, versioning, and lineage tracking. Use direct queries when you need raw database access for analytics, reporting, or cross-table operations.