Architecture
This document provides a deep dive into DittoFS’s architecture, design patterns, and internal implementation.
Table of Contents
- Core Abstraction Layers
- Per-Share Block Store Isolation
- Storage Tiers
- Adapter Pattern
- Control Plane Pattern
- Service Layer
- Built-In and Custom Backends
- Directory Structure
- Horizontal Scaling with PostgreSQL
- Durable Handle State Flow
- Phase 12 Engine API + BlockRef + Cache (v0.15.0 A3)
- Phase 13 File-Level Dedup: ObjectID + Merkle Root (v0.15.0 A4)
- Migration & Block-Layout Routing (v0.15.x A5)
Core Abstraction Layers
DittoFS uses a Runtime-centric architecture where the Runtime is the single entrypoint for all operations. This design ensures that both persistent store and in-memory state stay synchronized.
    ┌─────────────────────────────────────────┐
    │            Protocol Adapters            │
    │               (NFS, SMB)                │
    │         pkg/adapter/{nfs,smb}/          │
    └───────────────┬─────────────────────────┘
                    │ GetBlockStoreForHandle(handle)
                    ▼
    ┌─────────────────────────────────────────┐
    │                 Runtime                 │
    │   (Composition layer + sub-services)    │
    │        pkg/controlplane/runtime/        │
    │                                         │
    │  ┌──────────┐  ┌────────┐  ┌──────────┐ │
    │  │ adapters │  │ stores │  │  shares  │ │
    │  │lifecycle │  │registry│  │per-share │ │
    │  └──────────┘  └────────┘  │BlockStore│ │
    │  ┌──────────┐  ┌────────┐  └──────────┘ │
    │  │  mounts  │  │lifecycl│  ┌──────────┐ │
    │  │ tracking │  │ serve  │  │ identity │ │
    │  └──────────┘  └────────┘  │ mapping  │ │
    │                            └──────────┘ │
    │  ┌────────────┐  ┌───────────────────┐  │
    │  │   Store    │  │    Auth Layer     │  │
    │  │ (Persist)  │  │    pkg/auth/      │  │
    │  │ 9 sub-ifs  │  │  AuthProvider,    │  │
    │  │            │  │  IdentityMapper   │  │
    │  └────────────┘  └───────────────────┘  │
    └───────┬───────────────────┬─────────────┘
            │                   │
            ▼                   ▼
    ┌────────────────┐ ┌──────────────────────┐
    │   Metadata     │ │ Per-Share BlockStore │
    │    Stores      │ │   pkg/blockstore/    │
    │                │ │                      │
    │  - Memory      │ │  ┌──────────────┐    │
    │  - BadgerDB    │ │  │ Local Store  │    │
    │  - PostgreSQL  │ │  │ fs / memory  │    │
    │                │ │  └──────┬───────┘    │
    │                │ │         │            │
    │                │ │  ┌──────▼───────┐    │
    │                │ │  │    Syncer    │    │
    │                │ │  │ (async xfer) │    │
    │                │ │  └──────┬───────┘    │
    │                │ │         │            │
    │                │ │  ┌──────▼────────┐   │
    │                │ │  │ Remote Store  │   │
    │                │ │  │ s3 / memory   │   │
    │                │ │  │ (ref counted) │   │
    │                │ │  └───────────────┘   │
    └────────────────┘ └──────────────────────┘

Key Interfaces
1. Runtime (pkg/controlplane/runtime/)
- Single entrypoint for all operations - both API handlers and internal code
- Updates both persistent store AND in-memory state together
- Thin composition layer delegating to 6 focused sub-services:
  - adapters/: Protocol adapter lifecycle management (create, start, stop, delete)
  - stores/: Metadata store registry
  - shares/: Share registration and configuration; owns per-share *engine.BlockStore instances
  - mounts/: Unified mount tracking across protocols
  - lifecycle/: Server startup/shutdown orchestration
  - identity/: Share-level identity mapping
- Key methods:
  - Serve(ctx): Starts all adapters and servers, blocks until shutdown
  - CreateAdapter(ctx, cfg): Saves to store AND starts immediately
  - DeleteAdapter(ctx, type): Stops adapter AND removes from store
  - AddAdapter(adapter): Direct adapter injection (for testing)
  - GetBlockStoreForHandle(ctx, handle): Resolves per-share BlockStore from a file handle via shares.Service
2. Control Plane Store (pkg/controlplane/store/)
- Persistent configuration (users, groups, permissions, adapters)
- Decomposed into 9 sub-interfaces:
  - UserStore, GroupStore, ShareStore, PermissionStore, MetadataStoreConfigStore, BlockStoreConfigStore, AdapterStore, SettingsStore, GuestStore
- Composite Store interface embeds all sub-interfaces
- API handlers accept narrowest interface needed
- SQLite (single-node) or PostgreSQL (distributed)
3. Adapter Interface (pkg/adapter/adapter.go)
- Each protocol implements the Adapter interface
- IdentityMappingAdapter extends Adapter with auth.IdentityMapper for protocol-specific identity mapping
- Adapters receive a Runtime reference to access services
- BaseAdapter provides shared TCP lifecycle, default MapError and MapIdentity stubs
- Lifecycle: SetRuntime() -> Serve() -> Stop()
- Multiple adapters can share the same runtime
- Thread-safe, supports graceful shutdown
4. Auth (pkg/auth/)
- Centralized authentication abstractions shared across all protocols
- AuthProvider interface: CanHandle(token) + Authenticate(ctx, token)
- Authenticator: Chains multiple providers, tries each in order
- Identity: Protocol-neutral authenticated identity (Unix creds, Kerberos, NTLM, anonymous)
- IdentityMapper interface: Converts AuthResult to protocol-specific identity
- Sub-packages:
  - kerberos/: Kerberos AuthProvider with keytab management and hot-reload
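The provider chain described above can be sketched roughly as follows. This is a minimal illustration, not the pkg/auth implementation: the []byte token type, the *Identity return type, ErrNoProvider, prefixProvider, and whoAmI are all hypothetical, and the sketch assumes that the first provider whose CanHandle returns true decides the outcome.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Identity is a simplified stand-in for the protocol-neutral identity.
type Identity struct{ Name string }

// AuthProvider mirrors the two methods named in the text; the token and
// return types are assumptions made for this sketch.
type AuthProvider interface {
	CanHandle(token []byte) bool
	Authenticate(ctx context.Context, token []byte) (*Identity, error)
}

// ErrNoProvider is a hypothetical sentinel for tokens no provider accepts.
var ErrNoProvider = errors.New("no provider can handle token")

// Authenticator tries each provider in registration order.
type Authenticator struct{ providers []AuthProvider }

func (a *Authenticator) Authenticate(ctx context.Context, token []byte) (*Identity, error) {
	for _, p := range a.providers {
		if p.CanHandle(token) {
			return p.Authenticate(ctx, token)
		}
	}
	return nil, ErrNoProvider
}

// prefixProvider is a toy provider that claims tokens by prefix.
type prefixProvider struct{ prefix, user string }

func (p prefixProvider) CanHandle(token []byte) bool {
	return strings.HasPrefix(string(token), p.prefix)
}

func (p prefixProvider) Authenticate(context.Context, []byte) (*Identity, error) {
	return &Identity{Name: p.user}, nil
}

// whoAmI is a small helper for the usage example below.
func whoAmI(a *Authenticator, token string) (string, error) {
	id, err := a.Authenticate(context.Background(), []byte(token))
	if err != nil {
		return "", err
	}
	return id.Name, nil
}

func main() {
	a := &Authenticator{providers: []AuthProvider{
		prefixProvider{"krb5:", "alice"},
		prefixProvider{"ntlm:", "bob"},
	}}
	fmt.Println(whoAmI(a, "ntlm:abc"))
	fmt.Println(whoAmI(a, "unknown"))
}
```

Ordering matters in this design: a broad provider registered early would shadow more specific ones behind it.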
5. MetadataService (pkg/metadata/)
- Central service for all metadata operations
- Routes operations to the correct store based on share name
- Owns LockManager per share (for SMB/NLM byte-range locking)
- Split into focused files:
  - file_create.go, file_modify.go, file_remove.go, file_helpers.go, file_types.go: File operations
  - auth_identity.go, auth_permissions.go: Identity resolution and permission checks
- Protocol handlers should use this instead of stores directly
- storetest/: Metadata store conformance test suite (all implementations must pass)
Recycle bin (trash). The recycle trap lives inside MetadataService.RemoveFile, RemoveDirectory, and Move, gated by a per-share TrashPolicy read through a locked accessor. When the policy enables the bin, an unlink (NFS REMOVE/RMDIR, SMB delete-on-close) or a replace-overwrite (a Move whose destination clobbers an existing node) moves the victim into a single shared #recycle directory at the share root instead of destroying it, preserving the original path subtree and owner. Block deletion is deferred: recycling returns an empty PayloadID so protocol adapters skip the block-deletion step, and a recycled node keeps its content blocks until it is reaped or the bin is emptied. The runtime’s trash.Service (pkg/controlplane/runtime/trash/) owns list/restore/empty and runs a background reaper that enforces the per-share retention-days and max-size policy on an hourly interval (oldest-first eviction). Disabling trash auto-empties the bin.
6. BlockStore (pkg/blockstore/)
- Per-share block storage orchestrator. Each share gets its own *engine.BlockStore instance.
- engine.BlockStore composes local.LocalStore + remote.RemoteStore + engine.Syncer
- Each share gets an isolated local storage directory; remote stores can be shared across shares (ref counted)
- shares.Service owns the lifecycle (create on AddShare, close on RemoveShare)
- Sub-packages:
  - engine/: BlockStore orchestrator. Composes local + remote stores and owns the unified Cache (single CAS-keyed type that absorbed the former readbuffer/ + prefetch.go pair per Phase 12 / CACHE-01), the syncer, and the garbage collector (merged from former readbuffer/, sync/, gc/ packages per TD-01). See pkg/blockstore/engine/cache.go for the Cache type.
  - local/: Local store interface and implementations (fs/ filesystem, memory/ in-memory)
  - remote/: Remote store interface and implementations (s3/ production, memory/ testing)
  - storetest/: Conformance test helpers for new backend implementations
7. Metadata Store (pkg/metadata/store.go)
- Simple CRUD interface for file/directory metadata
- Stores file structure, attributes, permissions
- Implementations:
  - pkg/metadata/store/memory/: In-memory (fast, ephemeral, full hard link support)
  - pkg/metadata/store/badger/: BadgerDB (persistent, embedded, path-based handles)
  - pkg/metadata/store/postgres/: PostgreSQL (persistent, distributed, UUID-based handles)
- File handles are opaque identifiers (implementation-specific format)
Per-Share Block Store Isolation
Each share in DittoFS gets its own *engine.BlockStore instance, providing complete data isolation between shares.
How It Works
1. Share Creation: When a share is added via dfsctl share create, the runtime creates a dedicated BlockStore instance with:
   - An isolated local storage directory (under the configured local store path)
   - A reference to the configured remote store (shared across shares via ref counting)
2. Handle Resolution: Protocol handlers call GetBlockStoreForHandle(ctx, handle) which:
   - Extracts the share name from the file handle
   - Returns the share’s dedicated BlockStore instance
   - There is no global BlockStore
3. Share Removal: When a share is removed, its BlockStore is closed:
   - Local storage directory is cleaned up
   - Remote store reference count is decremented
   - If ref count reaches zero, the remote store connection is closed
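The reference-counting behaviour in steps 1 and 3 can be sketched as a small registry. This is an illustrative sketch only: remoteRegistry, remoteConn, Acquire, and Release are hypothetical names, and keying connections by a config string is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// remoteConn is a hypothetical stand-in for a remote store connection.
type remoteConn struct {
	key    string
	closed bool
}

type entry struct {
	conn *remoteConn
	refs int
}

// remoteRegistry hands out shared connections keyed by remote config.
type remoteRegistry struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newRemoteRegistry() *remoteRegistry {
	return &remoteRegistry{entries: make(map[string]*entry)}
}

// Acquire returns the shared connection for key, opening it on first use.
func (r *remoteRegistry) Acquire(key string) *remoteConn {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[key]
	if !ok {
		e = &entry{conn: &remoteConn{key: key}}
		r.entries[key] = e
	}
	e.refs++
	return e.conn
}

// Release drops one reference and closes the connection at zero.
func (r *remoteRegistry) Release(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[key]
	if !ok {
		return
	}
	e.refs--
	if e.refs == 0 {
		e.conn.closed = true
		delete(r.entries, key)
	}
}

func main() {
	reg := newRemoteRegistry()
	a := reg.Acquire("s3://my-bucket") // first share
	b := reg.Acquire("s3://my-bucket") // second share reuses the connection
	fmt.Println(a == b)
	reg.Release("s3://my-bucket")
	fmt.Println(a.closed) // still held by the second share
	reg.Release("s3://my-bucket")
	fmt.Println(a.closed) // last reference gone
}
```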
Isolation Properties
- Data Isolation: Each share’s local blocks are stored in separate directories
- Cache Independence: The unified Cache is per-share (eviction in one share does not affect others). Inside a share, the cache is keyed by ContentHash, so two files referencing the same chunk via dedup share one cache entry (CACHE-02).
- Remote Sharing: Multiple shares can reference the same remote store (e.g., same S3 bucket); blocks are namespaced by share to prevent collisions
- Lifecycle Independence: Block stores are created/closed with share lifecycle
Storage Tiers
DittoFS uses a three-tier storage model for block data:

    ┌─────────────────────────────────────┐
    │    Cache (In-Memory, CAS-keyed)     │
    │   pkg/blockstore/engine/cache.go    │
    │  - Single type, keyed by ContentHash│
    │  - LRU eviction (D-30)              │
    │  - Internal sequential prefetch     │
    │    (3-trigger threshold, D-29)      │
    │  - Cross-file dedup (CACHE-02)      │
    │  - Configurable budget per share    │
    │    (cache.size_mib, default 256)    │
    │  - Volatile (lost on restart)       │
    └──────────────┬──────────────────────┘
                   │ cache miss
                   ▼
    ┌─────────────────────────────────────┐
    │          Local Block Store          │
    │      pkg/blockstore/local/fs/       │
    │  - Filesystem-backed                │
    │  - Fast access (disk I/O)           │
    │  - Persistent across restarts       │
    │  - Per-share isolated directories   │
    └──────────────┬──────────────────────┘
                   │ block not local
                   ▼
    ┌─────────────────────────────────────┐
    │            Remote Store             │
    │      pkg/blockstore/remote/s3/      │
    │  - S3 or compatible object store    │
    │  - Slowest (network I/O)            │
    │  - Durable (survives node loss)     │
    │  - Shared across shares (ref count) │
    └─────────────────────────────────────┘

Read Path: Engine.ReadAt receives []BlockRef from the caller, locates the
covering blocks via findBlocksForRange (binary search), serves bytes
from local CAS (mmap on linux/darwin, ReadFile on windows; CACHE-06)
or remote CAS (BLAKE3-verified end-to-end, INV-06), and calls Cache.OnRead
to update the per-payload sequential tracker for prefetch hints.
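The binary-search lookup in the read path might look like the sketch below. It is not the engine's findBlocksForRange: the BlockRef fields shown (Offset, Size) are assumptions, and the sketch assumes the slice is sorted by Offset with no overlapping blocks.

```go
package main

import (
	"fmt"
	"sort"
)

// BlockRef is a simplified stand-in; the field names are assumptions.
type BlockRef struct {
	Offset uint64
	Size   uint64
}

// findBlocksForRange returns the blocks overlapping [off, off+n).
// blocks must be sorted by Offset and must not overlap.
func findBlocksForRange(blocks []BlockRef, off, n uint64) []BlockRef {
	if n == 0 {
		return nil
	}
	end := off + n
	// First block whose end lies past the start of the range.
	lo := sort.Search(len(blocks), func(i int) bool {
		return blocks[i].Offset+blocks[i].Size > off
	})
	// First block that starts at or after the end of the range.
	hi := sort.Search(len(blocks), func(i int) bool {
		return blocks[i].Offset >= end
	})
	if lo >= hi {
		return nil // the range falls entirely in a hole
	}
	return blocks[lo:hi]
}

func main() {
	blocks := []BlockRef{{0, 4}, {4, 4}, {8, 4}}
	fmt.Println(len(findBlocksForRange(blocks, 3, 2)))  // spans the first two blocks
	fmt.Println(len(findBlocksForRange(blocks, 12, 4))) // past the last block
}
```

Two searches keep the lookup logarithmic in the number of blocks, which matters for large files chunked into many content-defined blocks.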
Write Path: Engine.WriteAt receives (currentBlocks []BlockRef, data, offset), FastCDC-rechunks the affected range, and returns newBlocks []BlockRef to the caller; the caller persists newBlocks alongside the
metadata transaction (Mtime, Size, etc.). The Syncer asynchronously uploads
Pending FileBlocks to remote CAS.
Eviction:
- Cache: LRU eviction when budget reached. No data loss (local CAS has the data). Cache is per-share but cross-file inside a share (CACHE-02 — same hash referenced by two files shares one entry).
- Local store: Manual eviction via dfsctl store block evict. Only blocks already synced to remote can be evicted (safety check prevents data loss).
Block Store — Hybrid Local Tier (experimental, v0.15.0 Phase 10)
The hybrid local tier is a second write path inside pkg/blockstore/local/fs/,
gated by the use_append_log flag (defaults to false through v0.15.0
Phase 10; flipped to true in Phase 11). When enabled, writes flow through
an append-only log per file; a rollup pool chunks the log via FastCDC,
hashes each chunk with BLAKE3, and persists the chunks under a
content-addressable blocks/{hh}/{hh}/{hex} directory.
Phase 10 is plumbing-only. No existing write path consumes the chunker
or the log in v0.15.0 Phase 10; the engine keeps using the legacy
tryDirectDiskWrite / .blk path. Phase 11 (A2) flips the default,
rewires the syncer to write to the remote CAS keyspace
(cas/{hh}/{hh}/{hex}), and adds mark-sweep GC for the remote cas/
prefix. See Garbage Collection (mark-sweep)
and Block Lifecycle (three-state)
below for the v0.15.0 Phase 11 design that consumes this tier.
Pipeline
    (log header + records)
    logs/{payloadID}.log

    AppendWrite ---> per-file log (append-only) ---------------+
    (per-file mutex)      CRC per record                       |
                                                               v
                              chunkRollup pool (default 2 workers)
                                                               |
                              BLAKE3 + FastCDC
                              (min 1 MiB / avg 4 MiB / max 16 MiB)
                                                               |
                                                               v
                              StoreChunk blocks/{hh}/{hh}/{hex}
                              (.tmp + rename + fsync)
                                                               |
                              CommitChunks atomic:
                                1. metadata.SetRollupOffset (source of truth)
                                2. advanceRollupOffset + fsync log header
                                3. tree.ConsumeUpTo + logBytesTotal.Sub
                                4. non-blocking signal on pressureCh
                                                               |
                                                               v
                              (blocked AppendWrite unblocks)

Layout

    <baseDir>/logs/<payloadID>.log      per-file append-only log
    <baseDir>/blocks/<hh>/<hh>/<hex>    content-addressed chunks (CAS)

Log header (64 bytes): magic DFLG | version | rollup_offset | flags |
created_at | header CRC | 32 B reserved. Record framing:
payload_len (u32 LE) | file_offset (u64 LE) | crc32c (u32 LE) |
payload.
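To make the record framing concrete, here is a rough encoder and decoder for a single record. The field order and widths follow the framing above; everything else is an assumption of this sketch, in particular that the CRC32C covers only the payload bytes, and the names encodeRecord, decodeRecord, and errBadRecord.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// recordHeaderLen is payload_len (4) + file_offset (8) + crc32c (4).
const recordHeaderLen = 4 + 8 + 4

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// errBadRecord is a hypothetical sentinel for truncated or corrupt records.
var errBadRecord = errors.New("truncated record or CRC mismatch")

// encodeRecord frames one write. Assumption: the CRC covers the payload only.
func encodeRecord(fileOffset uint64, payload []byte) []byte {
	buf := make([]byte, recordHeaderLen+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint64(buf[4:12], fileOffset)
	binary.LittleEndian.PutUint32(buf[12:16], crc32.Checksum(payload, castagnoli))
	copy(buf[recordHeaderLen:], payload)
	return buf
}

// decodeRecord parses one record and reports how many bytes it consumed.
func decodeRecord(buf []byte) (fileOffset uint64, payload []byte, n int, err error) {
	if len(buf) < recordHeaderLen {
		return 0, nil, 0, errBadRecord
	}
	plen := int(binary.LittleEndian.Uint32(buf[0:4]))
	if plen < 0 || plen > len(buf)-recordHeaderLen {
		return 0, nil, 0, errBadRecord
	}
	fileOffset = binary.LittleEndian.Uint64(buf[4:12])
	want := binary.LittleEndian.Uint32(buf[12:16])
	payload = buf[recordHeaderLen : recordHeaderLen+plen]
	if crc32.Checksum(payload, castagnoli) != want {
		return 0, nil, 0, errBadRecord
	}
	return fileOffset, payload, recordHeaderLen + plen, nil
}

func main() {
	rec := encodeRecord(4096, []byte("hello"))
	off, payload, n, err := decodeRecord(rec)
	fmt.Println(off, string(payload), n, err)

	rec[len(rec)-1] ^= 0xff // flip one payload byte to simulate a torn write
	_, _, _, err = decodeRecord(rec)
	fmt.Println(err)
}
```

A decoder of this shape is what lets a recovery scan stop at the first record whose CRC does not match.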
Invariants
- INV-03 (rollup_offset monotone): metadata is source of truth; the filesystem header is idempotent derived state. Recovery reconciles header from metadata on boot.
- INV-05 (log length bounded): logBytesTotal <= max_log_bytes per FSStore. Writers block on pressureCh when the budget is exceeded; rollup drains the log and sends a non-blocking signal when bytes are reclaimed.
Crash recovery
Recovery (pkg/blockstore/local/fs/recovery.go) scans logs from
rollup_offset, truncates at first bad CRC, and rebuilds per-file interval
trees. Orphan logs (no metadata referrer, no live FileBlock, mtime older
than orphan_log_min_age_seconds) are swept. Orphan chunks under
blocks/{hh}/{hh}/{hex} are left intact; Phase 11’s mark-sweep GC is what
reclaims them.
Per-FSStore surface
Per CLAUDE.md Rule 4 (block stores are per-share), every hybrid-tier field
— log-fd map, per-file mutex map, interval-tree map, rollup worker pool,
pressure channel, maxLogBytes budget, stabilization window — lives
inside *FSStore. No global state across shares.
Experimental: Do not enable use_append_log in production before
v0.15.0 Phase 11 (A2). Without Phase 11’s mark-sweep GC, the blocks/
directory grows unbounded. See docs/CONFIGURATION.md (use_append_log,
max_log_bytes, rollup_workers, stabilization_ms,
orphan_log_min_age_seconds) and
.planning/phases/10-fastcdc-chunker-hybrid-local-store-a1/10-CONTEXT.md
for full design detail.
Block Lifecycle (three-state, v0.15.0 Phase 11)
Phase 11 (A2) collapses the block lifecycle to three persisted states held
on FileBlock.State indexed by ContentHash. There is no parallel state
in memory, in fd pools, or anywhere else (STATE-03): the metadata store
is the single source of truth, and engine.Syncer is the sole owner of
state transitions (D-15).
    Pending ──claim batch──▶ Syncing ──PUT success + meta txn──▶ Remote
       ▲                        │                                  │
       └──janitor (>claim_timeout)──┘                              │
                                                     (RefCount → 0)│
                                                                   ▼
                                                              GC eligible

- Pending: RefCount ≥ 1; bytes are local; not yet uploaded.
- Syncing: a syncer goroutine has claimed the block; the upload is in flight.
- Remote: PUT to the remote CAS keyspace returned 200 AND the metadata transaction setting State=Remote committed (INV-03: no orphan flag without metadata-txn success).
Restart recovery (D-14): at syncer Start, a one-shot janitor pass
requeues any Syncing row whose last_sync_attempt_at is older than
syncer.claim_timeout (default 10m) back to Pending. CAS keys are
content-defined so a duplicate re-upload writes the same bytes to the
same key — idempotent by construction.
Why a metadata write for every claim? The Pending → Syncing
transition is the serialization point against duplicate uploads across
syncer instances. The batched-claim cost is one txn per tick, in exchange
for exact restart recovery and a single-query introspection of stuck
blocks (State=Syncing AND last_sync_attempt_at < now − 1h).
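The three-state lifecycle and the janitor pass can be modelled in a few lines. This is an in-memory sketch under stated assumptions, not the syncer: the real transitions are metadata transactions, and fileBlock, claim, markRemote, requeueStale, and t0 are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

type BlockState int

const (
	Pending BlockState = iota
	Syncing
	Remote
)

// claimTimeout mirrors the documented syncer.claim_timeout default of 10m.
const claimTimeout = 10 * time.Minute

// t0 is an arbitrary reference instant for the example.
var t0 = time.Date(2026, 4, 25, 14, 30, 0, 0, time.UTC)

// fileBlock keeps only the fields the text mentions.
type fileBlock struct {
	State             BlockState
	LastSyncAttemptAt time.Time
}

// claim moves a Pending block to Syncing and stamps the attempt time.
func claim(b *fileBlock, now time.Time) bool {
	if b.State != Pending {
		return false
	}
	b.State = Syncing
	b.LastSyncAttemptAt = now
	return true
}

// markRemote models "PUT succeeded and the metadata txn committed".
func markRemote(b *fileBlock) bool {
	if b.State != Syncing {
		return false
	}
	b.State = Remote
	return true
}

// requeueStale models the janitor: Syncing blocks whose last attempt is
// older than timeout go back to Pending. It returns the number requeued.
func requeueStale(blocks []*fileBlock, now time.Time, timeout time.Duration) int {
	n := 0
	for _, b := range blocks {
		if b.State == Syncing && now.Sub(b.LastSyncAttemptAt) > timeout {
			b.State = Pending
			n++
		}
	}
	return n
}

func main() {
	a, b := &fileBlock{}, &fileBlock{}
	claim(a, t0)
	claim(b, t0)
	markRemote(a) // upload and metadata commit succeeded
	// b's uploader died; a restart 11 minutes later requeues it.
	n := requeueStale([]*fileBlock{a, b}, t0.Add(11*time.Minute), claimTimeout)
	fmt.Println(n, a.State == Remote, b.State == Pending)
}
```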
Garbage Collection (mark-sweep, v0.15.0 Phase 11)
Phase 11 replaces the previous path-prefix GC with a fail-closed
mark-sweep over the union of every live FileBlock.ContentHash across
shares pointing at the same remote.
Algorithm
- Mark phase. Stream every FileBlock’s ContentHash via the new MetadataStore.EnumerateFileBlocks(ctx, fn) cursor (D-02). The cursor is implemented natively per backend (memory, Badger, Postgres) and never loads the full set into application memory. Hashes are appended to an on-disk live set under <localStore>/gc-state/<runID>/db/ (Badger temp store; D-01). Snapshot time T is captured at the start of the run. Cross-share aggregation keys on remote-store identity (bucket+endpoint+prefix), not share name (D-03), so an object reachable from any share that targets the same remote is considered live.
- Sweep phase. A single RemoteStore.Walk enumerates every CAS object cluster-wide; the backend (e.g. S3) paginates internally. For each key, the engine keeps the object iff the hash is present in the live set OR the object’s LastModified is newer than T − gc.grace_period (default 1h, D-05). Otherwise the engine issues a DELETE.
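The sweep-phase retention rule reduces to a small predicate. The sketch below is an assumption-laden illustration: the live set is an in-memory map rather than the on-disk Badger store, and shouldKeep, liveSet, and snapshotT are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// gracePeriod mirrors the documented gc.grace_period default of 1h.
const gracePeriod = time.Hour

// snapshotT stands in for the snapshot time T captured when the run starts.
var snapshotT = time.Date(2026, 4, 25, 14, 30, 22, 0, time.UTC)

// liveSet is an in-memory stand-in for the on-disk live set.
type liveSet map[string]struct{}

// shouldKeep reports whether the sweep must retain an object: it is kept
// if its hash is live OR it was modified after T minus the grace period.
func shouldKeep(live liveSet, hash string, lastModified, t time.Time, grace time.Duration) bool {
	if _, ok := live[hash]; ok {
		return true
	}
	return lastModified.After(t.Add(-grace))
}

func main() {
	live := liveSet{"aa": {}}
	old := snapshotT.Add(-2 * gracePeriod)
	fresh := snapshotT.Add(-gracePeriod / 2)

	fmt.Println(shouldKeep(live, "aa", old, snapshotT, gracePeriod))   // referenced
	fmt.Println(shouldKeep(live, "bb", fresh, snapshotT, gracePeriod)) // inside grace window
	fmt.Println(shouldKeep(live, "bb", old, snapshotT, gracePeriod))   // unreferenced and old
}
```

The grace window protects objects uploaded after the mark phase enumerated metadata but before their FileBlock row became visible.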
Fail-closed posture (INV-04)
Mark-phase and sweep-phase failures are treated asymmetrically (D-06, D-07):
- Mark errors abort the sweep entirely. Any uncertainty about the live set could lead to deleting referenced data. Sweep workers do not start if the mark phase returned any error.
- Sweep-side per-prefix DELETE errors are captured and the sweep continues. A single S3 503 transient should not waste a successful mark phase. The run summary reports error_count and the first N error samples; garbage that survives a transient is reclaimed on the next run.
gc-state directory layout
    <localStore>/gc-state/
      20260425T143022Z-abc/
        db/                (Badger temp store for the live set)
        incomplete.flag    (removed by MarkComplete; cleaned by next run)
      20260425T153122Z-def/
        db/                (no incomplete.flag: successful run)
      last-run.json        (most recent GCRunSummary)

Each run writes incomplete.flag at start; the next run detects stale
directories (by leftover flag) and deletes them before starting fresh.
Mark is idempotent, so resume-on-restart is intentionally not built, which
keeps the test surface simpler (D-01).
Triggers and observability
- Periodic GC is deferred to a follow-up phase. gc.interval is parsed and validated but unwired in v0.15.0; any non-zero value emits a startup WARN and is otherwise ignored. Schedule via cron until the scheduler ships.
- On-demand via dfsctl store block gc <share> [--dry-run] (D-08, D-09); --dry-run skips DELETEs and prints up to gc.dry_run_sample_size candidate keys (default 1000).
- Observability via structured slog INFO at start/end with run_id, hashes_marked, objects_swept, bytes_freed, duration_ms, error_count, plus a persisted summary at <localStore>/gc-state/last-run.json (D-10). Inspect via dfsctl store block gc-status <share>. Prometheus metrics are intentionally deferred to a metrics phase (D-35).
GC coordinates with the share-snapshots subsystem through a single
rule: manifest-on-disk = block held. Snapshots register a hold
implicitly by writing a manifest.json under
<localStoreDir>/snapshots/<share>/<id>/. GC’s mark phase enumerates
every manifest file at sweep start and unions the referenced hashes
into its retention set, so any block referenced by any snapshot
survives the sweep. The provider that exposes this hold to the GC
layer is SnapshotHoldProvider. No hold flag lives in any database
table — the disk is the source of truth.
See SNAPSHOTS.md for the operator-facing description of the hold semantics, including the delete-vs-GC race window.
See docs/CONFIGURATION.md for every gc.* and syncer.* knob, and
docs/CLI.md for the dfsctl store block gc reference.
Share Snapshots
Share snapshots are point-in-time, reference-based protection for a share’s content. The subsystem produces three artifacts per snapshot on local disk and one row in the control-plane database; it does not copy any block data. See SNAPSHOTS.md for the operator-facing runbook; this section describes the architectural layout and the orchestration flows.
Subsystem layout
| Location | Role |
|---|---|
| pkg/snapshot/ | Verify gate, hash-manifest read/write, helper types. |
| pkg/controlplane/runtime/snapshot.go | Runtime.CreateSnapshot, WaitForSnapshot, RestoreSnapshot, GetSnapshot, ListSnapshots, DeleteSnapshot. Composition over the metadata store, block store, and snapshot store. |
| pkg/controlplane/runtime/snapshot_hold.go | SnapshotHoldProvider: per-share delete lock + manifest-on-disk hold surface for GC. |
| pkg/controlplane/models/snapshot.go | Snapshot GORM model; SnapshotDir, ManifestPath, MetadataDumpPath path helpers. |
| pkg/controlplane/store/snapshots.go | SnapshotStore CRUD (GetSnapshot, ListSnapshots, DeleteSnapshot). |
| pkg/controlplane/api/dto/snapshot.go | Neutral wire DTOs imported by both the REST handler and the apiclient. No GORM types cross the wire. |
| internal/controlplane/api/handlers/snapshot.go | Five REST handlers (Create, List, Get, Delete, Restore), the narrow SnapshotRuntime interface (testability seam), and the single mapSnapshotError sentinel-to-HTTP table. |
| pkg/apiclient/snapshots.go | Typed Go client (6 methods) re-exporting the wire DTOs as type aliases of dto.Snapshot. |
| cmd/dfsctl/commands/share/snapshot/ | Five cobra leaf commands matching the REST surface (create, list, show, delete, restore). |
On-disk artifacts
Every snapshot owns a directory under the share’s local store:

    <localStoreDir>/snapshots/<share>/<snap-id>/
      ├─ metadata.dump   ← engine-native metadata serialization
      └─ manifest.json   ← BLAKE3 hashes of every CAS block the share references

SnapshotDir(localStoreDir), ManifestPath(localStoreDir), and
MetadataDumpPath(localStoreDir) on the Snapshot model compute the
canonical paths. Atomic write is via temp + rename so a partial
manifest never surfaces to the GC enumeration step. The manifest
file’s existence is the GC hold; there is no separate hold record.
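The temp + rename technique mentioned above is commonly written as in the sketch below. This is a generic illustration using the Go standard library, not the snapshot package's code; writeFileAtomic and roundTrip are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the same directory as path
// and renames it into place, so readers see either nothing or the full file.
// A fully durable version would also fsync the parent directory.
func writeFileAtomic(path string, data []byte) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close() // harmless if already closed
			os.Remove(tmp.Name())
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil {
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes a manifest into a scratch directory and reads it back,
// returning the content and the number of directory entries left behind.
func roundTrip(data string) (content string, entries int, err error) {
	dir, err := os.MkdirTemp("", "snap-*")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "manifest.json")
	if err = writeFileAtomic(path, []byte(data)); err != nil {
		return "", 0, err
	}
	got, err := os.ReadFile(path)
	if err != nil {
		return "", 0, err
	}
	list, err := os.ReadDir(dir)
	if err != nil {
		return "", 0, err
	}
	return string(got), len(list), nil
}

func main() {
	fmt.Println(roundTrip(`{"hashes":[]}`))
}
```

Creating the temp file in the destination directory keeps the rename on one filesystem, which is what makes it atomic on POSIX systems.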
Create orchestration
    CreateSnapshot
      ─→ persist Snapshot row (state=creating)
      ─→ DrainAllUploads (skipped if NoVerify)
      ─→ Dump metadata to metadata.dump
      ─→ Build hash manifest from CAS
      ─→ VerifyRemoteDurability (skipped if NoVerify, concurrency = 16)
      ─→ Update row state=ready (or failed) + remote_durable flag

Runtime.CreateSnapshot returns the new snapshot ID immediately and
runs the orchestration in a background goroutine. The REST handler
returns 202 Accepted with a Location header pointing at the
record; callers poll GET /snapshots/{id} until state != "creating".
The CLI’s WaitForSnapshot does that polling on the operator’s
behalf.
NoVerify=true (CLI --no-verify) skips both the upload drain and
the HEAD-probe phase. The snapshot still completes with
remote_durable=false. Restore of a non-durable snapshot then
requires the explicit AllowNonDurable flag (CLI --force).
Restore orchestration
    RestoreSnapshot
      ─→ Pre-flight: refuse if share enabled
      ─→ Verify source snapshot's remote durability (skipped if AllowNonDurable)
      ─→ Pre-restore safety snapshot (ID returned to caller)
      ─→ Close metadata store
      ─→ Reset (via Resetable interface)
      ─→ Restore from metadata.dump
      ─→ HashSetFromMetadataStore walk
      ─→ Post-restore block verify

Runtime.RestoreSnapshot returns (safetySnapshotID, err). The
safety snap ID is set as soon as step 3 succeeds, even if a later
step fails — callers (REST + CLI) surface the ID to the operator so
the rollback path is always available without a separate
ListSnapshots filter. On precheck / pre-verify failure (before
step 3) the safety ID is the empty string.
Per-share delete lock
SnapshotHoldProvider.AcquireDeleteLock(share) returns a release
function around a per-share *sync.RWMutex. The same mutex
serializes CreateSnapshot, RestoreSnapshot, and
DeleteSnapshot on the same share so that:
- Two concurrent delete calls on different snapshots of the same share cannot race the per-snapshot directory wipe against each other.
- A delete cannot race a create whose manifest write would appear in the snapshots directory mid-sweep.
- A restore cannot race a delete of the safety snap it is about to create.
Runtime.DeleteSnapshot is the canonical entry point — handlers
never reach into r.store.DeleteSnapshot directly. The wrapper owns
the lock acquisition, the database row delete, the on-disk
directory wipe, and the lock release.
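A per-share lock that hands back a release function could be sketched like this. Only the AcquireDeleteLock name and the per-share *sync.RWMutex come from the text; holdProvider, newHoldProvider, countUnderLock, and the sync.Once guard on release are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// holdProvider is a hypothetical stand-in for SnapshotHoldProvider.
type holdProvider struct {
	mu    sync.Mutex               // guards the map, not the shares
	locks map[string]*sync.RWMutex // one lock per share
}

func newHoldProvider() *holdProvider {
	return &holdProvider{locks: make(map[string]*sync.RWMutex)}
}

// AcquireDeleteLock blocks until the share's lock is held exclusively and
// returns a release function that is safe to call more than once.
func (p *holdProvider) AcquireDeleteLock(share string) (release func()) {
	p.mu.Lock()
	l, ok := p.locks[share]
	if !ok {
		l = new(sync.RWMutex)
		p.locks[share] = l
	}
	p.mu.Unlock()

	l.Lock()
	var once sync.Once
	return func() { once.Do(l.Unlock) }
}

// countUnderLock runs n goroutines that each bump a plain int while holding
// the share's lock; the result equals n only if the lock serializes them.
func countUnderLock(p *holdProvider, share string, n int) int {
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			release := p.AcquireDeleteLock(share)
			defer release()
			counter++
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	p := newHoldProvider()
	fmt.Println(countUnderLock(p, "/archive", 100))
}
```

Returning a closure rather than exposing the mutex keeps callers from unlocking a share they never locked.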
HTTP surface
Five REST endpoints under /api/v1/shares/{name}/snapshots (admin
only, inherits the existing RequireAdmin middleware):
| Method | Path | Result |
|---|---|---|
| POST | / | 202 Accepted + Location header |
| GET | / | 200 OK + JSON array (empty: [], not null) |
| GET | /{id} | 200 OK + full record |
| DELETE | /{id} | 204 No Content |
| POST | /{id}/restore | 200 OK + {snapshot_id, safety_snapshot_id, share} |
The single mapSnapshotError helper handles the 14 typed sentinels
that can cross the boundary (12 snapshot sentinels + share-not-found
+ nil-guard). The mapping table lives in the handler file as the sole source of truth; future sentinels add a single case.
The Restore handler wraps r.Context() with
context.WithTimeout(ctx, cfg.Snapshot.restore_http_timeout)
(default 30 minutes) to bound runaway restores. The apiclient
mirrors the timeout on the client’s http.Client for the restore
call only (WithRestoreTimeout).
For the full operator runbook see SNAPSHOTS.md.
Dual-Read Window (Phase 11 → Phase 14)
During the v0.15.0 → v0.15.x window, the engine resolves block reads from two coexisting key spaces (D-21, D-22):
- FileBlock.Hash non-zero → CAS path: read from cas/{hh}/{hh}/{hex}, BLAKE3-verified end-to-end (header pre-check on x-amz-meta-content-hash + streaming verifier over the body, INV-06).
- FileBlock.Hash zero → legacy path: read from {payloadID}/block-{N} (FormatStoreKey/ParseStoreKey) with no verification (verification cannot be retroactively applied to data written before BSCAS-06).
Resolution is by metadata key shape (one DB lookup per block), NOT by S3 trial-and-error — there is no doubled GET cost.
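The key-shape decision can be illustrated with a small resolver. This is a sketch, not FormatStoreKey: it assumes the two {hh} components are the first and second bytes of the hex-encoded hash and that the final component is the full hex digest; fileBlock, its fields, and storeKey are hypothetical names.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// fileBlock keeps only what the resolver needs; the fields are assumptions.
type fileBlock struct {
	Hash      [32]byte // zero value means "written before CAS"
	PayloadID string
	Index     int
}

// storeKey picks the remote key from the metadata row alone, so no
// trial-and-error GET against the remote store is needed.
func storeKey(b fileBlock) string {
	if b.Hash != ([32]byte{}) {
		h := hex.EncodeToString(b.Hash[:])
		// Assumed fan-out: first two hex byte pairs, then the full digest.
		return "cas/" + h[0:2] + "/" + h[2:4] + "/" + h
	}
	return fmt.Sprintf("%s/block-%d", b.PayloadID, b.Index)
}

func main() {
	var h [32]byte
	h[0], h[1] = 0xab, 0xcd
	fmt.Println(storeKey(fileBlock{Hash: h}))
	fmt.Println(storeKey(fileBlock{PayloadID: "export/file", Index: 3}))
}
```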
The legacy code path lives Phase 11 → Phase 14 (A5). Phase 14 ships
dfsctl blockstore migrate to re-chunk all legacy data to CAS; Phase
15 (A6) deletes the legacy path entirely. The dual-read code is
intentionally on a deletion clock — anyone touching it should know
its lifespan.
Adapter Pattern
DittoFS uses the Adapter pattern to provide clean protocol abstractions:

    // ProtocolAdapter interface (defined in runtime package to avoid import cycles)
    type ProtocolAdapter interface {
        Serve(ctx context.Context) error
        Stop(ctx context.Context) error
        Protocol() string
        Port() int
    }

    // RuntimeSetter - adapters that need runtime access implement this
    type RuntimeSetter interface {
        SetRuntime(rt *Runtime)
    }

    // Example: NFS Adapter accesses per-share block stores via runtime
    type NFSAdapter struct {
        config  NFSConfig
        runtime *runtime.Runtime
    }

    func (a *NFSAdapter) handleRead(ctx context.Context, req *ReadRequest) {
        // Resolve per-share block store from file handle
        blockStore, err := a.runtime.GetBlockStoreForHandle(ctx, handle)
        if err != nil {
            return // map the error to a protocol status in real code
        }
        // Phase 12: read data via block store with caller-snapshot []BlockRef.
        // Engine binary-searches blocks for the requested range; sparse holes
        // outside any BlockRef are zero-filled (D-21). nil/empty []BlockRef
        // triggers the legacy dual-read shim (D-20).
        n, err := blockStore.ReadAt(ctx, payloadID, attr.Blocks, dest, offset)
        // ...
    }

    // Multiple adapters can run concurrently, sharing the same runtime
    rt := runtime.New(cpStore)
    rt.SetAdapterFactory(createAdapterFactory())
    rt.Serve(ctx) // Loads adapters from store and starts them

Shared adapter helpers (internal/adapter/common)

NFSv3, NFSv4, and SMB v2/3 handlers share a single package of helpers at
internal/adapter/common/ so the three adapters do not each carry a
private copy of the same logic. The package exposes:
- Block-store resolution: common.ResolveForRead / common.ResolveForWrite wrap Runtime.GetBlockStoreForHandle via a narrow BlockStoreRegistry interface (satisfied implicitly by *runtime.Runtime). All three protocols’ READ/WRITE/COMMIT paths route through these two calls.
- Pooled read buffer: common.ReadFromBlockStore returns a BlockReadResult whose Release() is handed to the response encoder, which invokes it after the wire write completes. NFSv3, NFSv4, and SMB regular-file READ all adopt the pool; pipe/symlink READ paths stay on heap allocations by design (documented in SMB.md).
- Phase-12 []BlockRef seam: common.ReadFromBlockStore, common.WriteToBlockStore, and common.CommitBlockStore are the single edit points where Phase 12 (v0.15.0 A3 / META-01 + API-01) will feed resolved []BlockRef into the engine. Handler code stays untouched; Phase 12’s blast radius is confined to common/.
- Metadata error translation: a struct-per-code table (errorMap in common/errmap.go) with NFS3/NFS4/SMB columns; common.MapToNFS3, common.MapToNFS4, and common.MapToSMB are thin accessors. Lock-operation context uses the parallel lockErrorMap (common/lock_errmap.go) which overrides a handful of codes (e.g., ErrLocked → STATUS_LOCK_NOT_GRANTED in lock context vs. STATUS_FILE_LOCK_CONFLICT in general I/O context). Adding a new metadata.ErrorCode is one edit across all three protocols: the struct literal requires every column to be populated, so you cannot ship a code that is missing an NFS or SMB mapping.
See CONTRIBUTING.md “Adding a new metadata.ErrorCode” for the recipe and NFS.md / SMB.md “Error mapping” for protocol-specific notes.
Control Plane Pattern
The Control Plane is the central management component enabling flexible, multi-share configurations.
How It Works
- Named Store Creation: Stores are created with unique names (e.g., “fast-memory”, “s3-archive”)
- Share-to-Store Mapping: Each share references metadata and block stores by name
- Handle Identity: File handles encode both the share ID and file-specific data
- Store Resolution: When handling operations, the runtime decodes the handle to identify the share, then routes to the correct stores
Configuration Example
Stores, shares, and adapters are managed at runtime via dfsctl (persisted in the control plane database):

    # Create named stores (created once, shared across shares)
    ./dfsctl store metadata add --name fast-meta --type memory
    ./dfsctl store metadata add --name persistent-meta --type badger \
      --config '{"path":"/data/metadata"}'

    # Create block stores (local per-share, remote shared across shares)
    ./dfsctl store block add --kind local --name local-cache --type fs \
      --config '{"path":"/data/cache"}'
    ./dfsctl store block add --kind remote --name s3-remote --type s3 \
      --config '{"region":"us-east-1","bucket":"my-bucket"}'

    # Create shares referencing stores by name (each gets its own BlockStore)
    ./dfsctl share create --name /temp --metadata fast-meta --local local-cache
    ./dfsctl share create --name /archive --metadata persistent-meta \
      --local local-cache --remote s3-remote

Benefits

- Per-share isolation: Each share gets its own BlockStore with isolated local storage directory
- Resource Efficiency: Remote stores are shared (ref counted) when multiple shares reference the same config
- Flexible Topologies: Mix local-only and remote-backed storage per-share
- Future Multi-Tenancy: Foundation for per-tenant store isolation
Service Layer
The service layer provides business logic and coordination between stores.
MetadataService
Handles all metadata operations with share-based routing:
// MetadataService - central service for metadata operations
type MetadataService struct {
    stores       map[string]MetadataStore // shareName -> store
    lockManagers map[string]*LockManager  // shareName -> lock manager
}

// Usage by protocol handlers
metaSvc := metadata.New()
metaSvc.RegisterStoreForShare("/export", memoryStore)
metaSvc.RegisterStoreForShare("/archive", badgerStore)

// High-level operations (with business logic)
file, err := metaSvc.CreateFile(authCtx, parentHandle, "test.txt", fileAttr)
entries, err := metaSvc.ReadDir(ctx, dirHandle)

// Byte-range locking (SMB/NLM)
lock, err := metaSvc.AcquireLock(ctx, shareName, handle, offset, length, exclusive)
Write Coordination Pattern
WRITE operations require coordination between metadata and block stores:
// 1. Update metadata (validates permissions, updates size/timestamps);
//    capture the caller-snapshot []BlockRef for the engine.
attr, preSize, preMtime, preCtime, err := metadataStore.WriteFile(handle, newSize, authCtx)
currentBlocks := attr.Blocks // []blockstore.BlockRef sorted by Offset

// 2. Resolve per-share block store from file handle
blockStore, err := rt.GetBlockStoreForHandle(ctx, handle)

// 3. Write actual data via per-share block store; engine FastCDC-rechunks
//    the affected range and returns the new []BlockRef.
newBlocks, err := blockStore.WriteAt(ctx, string(attr.PayloadID), currentBlocks, data, offset)

// 4. Persist newBlocks in the same metadata txn that updates Size/Mtime.
//    Engine never opens the metadata txn itself (API-02).
err = metadataStore.SetFileBlocks(handle, newBlocks, authCtx)

// 5. Post-txn surgical cache invalidation: drop only the hashes that
//    disappeared, preserving warm dedup entries (CACHE-05 / D-35).
removed := diffRemovedHashes(currentBlocks, newBlocks)
blockStore.Cache().InvalidateFile(string(attr.PayloadID), removed)

// 6. Return updated attributes to client for cache consistency
Built-In and Custom Backends
Using Built-In Backends
No custom code required - configure via CLI:
# Create stores
./dfsctl store metadata add --name default-meta --type memory  # or badger, postgres
./dfsctl store block add --kind local --name default-local --type fs \
  --config '{"path":"/data/blocks"}'

# Create share referencing stores
./dfsctl share create --name /export --metadata default-meta --local default-local
Implementing Custom Store Backends
See docs/IMPLEMENTING_STORES.md for detailed implementation guides for:
- Local Store: Implement pkg/blockstore/local.LocalStore interface
- Remote Store: Implement pkg/blockstore/remote.RemoteStore interface
- Metadata Store: Implement pkg/metadata/Store interface
Directory Structure
dittofs/
├── cmd/
│   ├── dfs/                      # Server CLI binary
│   │   ├── main.go               # Entry point
│   │   └── commands/             # Cobra commands (start, stop, config, logs)
│   └── dfsctl/                   # Client CLI binary
│       ├── main.go               # Entry point
│       ├── cmdutil/              # Shared utilities (auth, output, flags)
│       └── commands/             # Cobra commands (user, group, share, store, adapter)
│
├── pkg/                          # Public API (stable interfaces)
│   ├── adapter/                  # Protocol adapter interface
│   │   ├── adapter.go            # Adapter + IdentityMappingAdapter interfaces
│   │   ├── auth.go               # Adapter-level Authenticator interface
│   │   ├── base.go               # BaseAdapter shared TCP lifecycle
│   │   ├── errors.go             # ProtocolError interface
│   │   ├── nfs/                  # NFS adapter implementation
│   │   └── smb/                  # SMB adapter implementation
│   │
│   ├── auth/                     # Centralized authentication abstractions
│   │   ├── auth.go               # AuthProvider, Authenticator, AuthResult
│   │   ├── identity.go           # Identity model, IdentityMapper interface
│   │   └── kerberos/             # Kerberos AuthProvider
│   │       ├── provider.go       # Provider (implements AuthProvider)
│   │       ├── keytab.go         # Keytab hot-reload manager
│   │       └── doc.go            # Package doc
│   │
│   ├── metadata/                 # Metadata layer
│   │   ├── service.go            # MetadataService (business logic, routing)
│   │   ├── store.go              # MetadataStore interface (CRUD)
│   │   ├── file_create.go        # File/directory creation operations
│   │   ├── file_modify.go        # File modification operations
│   │   ├── file_remove.go        # File removal operations
│   │   ├── file_helpers.go       # Shared file operation helpers
│   │   ├── file_types.go         # File-related type definitions
│   │   ├── auth_identity.go      # Identity resolution
│   │   ├── auth_permissions.go   # Permission checking
│   │   ├── cookies.go            # CookieManager (NFS/SMB pagination)
│   │   ├── types.go              # FileAttr, DirEntry, etc.
│   │   ├── errors.go             # Metadata-specific errors
│   │   ├── locking.go            # LockManager for byte-range locks
│   │   ├── storetest/            # Conformance test suite for store implementations
│   │   └── store/                # Store implementations
│   │       ├── memory/           # In-memory (ephemeral)
│   │       ├── badger/           # BadgerDB (persistent)
│   │       └── postgres/         # PostgreSQL (distributed)
│   │
│   ├── blockstore/               # Per-share block storage
│   │   ├── doc.go                # Package documentation
│   │   ├── store.go              # FileBlockStore interface
│   │   ├── types.go              # FileBlock, BlockState types
│   │   ├── errors.go             # BlockStore error types
│   │   ├── chunker/              # FastCDC content-defined chunker (Phase 10 A1)
│   │   │                         # min=1 MiB / avg=4 MiB / max=16 MiB, lvl 2;
│   │   │                         # BLAKE3 hashing; consumed by local rollup pool
│   │   ├── engine/               # BlockStore orchestrator + read cache + syncer + GC
│   │   ├── local/                # Local store interface
│   │   │   ├── fs/               # Filesystem-backed local store
│   │   │   │                     # (+ hybrid append-log + CAS blocks/ tier,
│   │   │   │                     # gated by use_append_log, Phase 10 A1)
│   │   │   └── memory/           # In-memory local store (testing)
│   │   └── remote/               # Remote store interface
│   │       ├── s3/               # S3-backed remote store
│   │       └── memory/           # In-memory remote store (testing)
│   │
│   ├── controlplane/             # Control plane (config + runtime)
│   │   ├── store/                # GORM-based persistent store
│   │   │   ├── interface.go      # 9 sub-interfaces + composite Store
│   │   │   ├── gorm.go           # GORMStore implementation
│   │   │   ├── helpers.go        # Generic GORM helpers
│   │   │   └── ...               # Per-entity implementations
│   │   ├── runtime/              # Ephemeral runtime state
│   │   │   ├── runtime.go        # Composition layer (~500 lines)
│   │   │   ├── adapters/         # Adapter lifecycle sub-service
│   │   │   ├── stores/           # Metadata store registry sub-service
│   │   │   ├── shares/           # Share management sub-service
│   │   │   ├── mounts/           # Unified mount tracking sub-service
│   │   │   ├── lifecycle/        # Serve/shutdown orchestration sub-service
│   │   │   └── identity/         # Identity mapping sub-service
│   │   ├── api/                  # REST API server
│   │   │   ├── server.go         # HTTP server with JWT
│   │   │   └── router.go         # Route definitions
│   │   └── models/               # Domain models (User, Group, Share)
│   │
│   ├── apiclient/                # REST API client library
│   │   ├── client.go             # HTTP client with token auth
│   │   ├── helpers.go            # Generic API client helpers
│   │   └── ...                   # Resource-specific methods
│   │
│   └── config/                   # Configuration parsing
│       ├── config.go             # Main config struct
│       ├── stores.go             # Store creation
│       └── runtime.go            # Runtime initialization
│
├── internal/                     # Private implementation details
│   ├── adapter/common/           # Shared NFS/SMB adapter helpers: block-store
│   │   │                         # resolution (ResolveForRead/Write), pooled
│   │   │                         # ReadFromBlockStore + WriteToBlockStore +
│   │   │                         # CommitBlockStore seams (Phase 12 entry
│   │   │                         # point for []BlockRef), consolidated
│   │   │                         # metadata.ErrorCode -> NFS3/NFS4/SMB
│   │   │                         # mapping table (errmap + content_errmap +
│   │   │                         # lock_errmap).
│   │   ├── resolve.go            # BlockStoreRegistry narrow interface +
│   │   │                         # ResolveForRead/Write
│   │   ├── read_payload.go       # Pooled BlockReadResult + ReadFromBlockStore
│   │   ├── write_payload.go      # WriteToBlockStore + CommitBlockStore seams
│   │   ├── errmap.go             # Struct-per-code table (NFS3/NFS4/SMB columns)
│   │   ├── content_errmap.go     # Block-store content error table (D-08 §2)
│   │   └── lock_errmap.go        # Lock-context error table (D-08 §3)
│   ├── adapter/nfs/              # NFS protocol implementation
│   │   ├── dispatch.go           # RPC procedure routing
│   │   ├── rpc/                  # RPC layer (call/reply handling)
│   │   │   └── gss/              # RPCSEC_GSS framework
│   │   ├── core/                 # Generic XDR codec
│   │   ├── types/                # NFS constants and types
│   │   ├── mount/handlers/       # Mount protocol procedures
│   │   ├── v3/handlers/          # NFSv3 procedures (READ, WRITE, etc.)
│   │   └── v4/handlers/          # NFSv4.0 and v4.1 procedures
│   ├── adapter/smb/              # SMB protocol implementation
│   │   ├── auth/                 # NTLM/SPNEGO authentication
│   │   ├── framing.go            # NetBIOS framing
│   │   ├── dispatch.go           # Command dispatch
│   │   └── v2/handlers/          # SMB2 command handlers
│   ├── controlplane/api/         # API implementation
│   │   ├── handlers/             # HTTP handlers with centralized error mapping
│   │   └── middleware/           # Auth middleware
│   └── logger/                   # Logging utilities
│
├── docs/                         # Documentation
│   ├── ARCHITECTURE.md           # This file
│   ├── CONFIGURATION.md          # Configuration guide
│   └── ...
│
└── test/                         # Test suites
    ├── integration/              # Integration tests (S3, BadgerDB)
    └── e2e/                      # End-to-end tests (real NFS mounts)
Horizontal Scaling with PostgreSQL
The PostgreSQL metadata store enables horizontal scaling for high-availability and high-throughput deployments:
Architecture
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  DittoFS #1 │  │  DittoFS #2 │  │  DittoFS #3 │
│   (Pod 1)   │  │   (Pod 2)   │  │   (Pod 3)   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────────────┼────────────────┘
                        │
                   ┌────▼─────┐
                   │PostgreSQL│
                   │ Cluster  │
                   └──────────┘
Key Features
- Multiple DittoFS Instances: Run multiple instances sharing one PostgreSQL database
- Load Balancing: Use Kubernetes services or external load balancers to distribute requests
- No Session Affinity Required: Any instance can serve any request (stateless design)
- Independent Connection Pools: Each instance maintains its own connection pool (10-15 conns typical)
- Statistics Caching: 5-second TTL cache reduces database load
- ACID Transactions: Ensures consistency across concurrent operations
Deployment Example (Kubernetes)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfs
spec:
  replicas: 3  # Multiple instances for HA
  selector:
    matchLabels:
      app: dfs
  template:
    metadata:
      labels:
        app: dfs
    spec:
      containers:
      - name: dfs
        image: dfs:latest
        ports:
        - containerPort: 12049
          name: nfs
        env:
        - name: DITTOFS_METADATA_POSTGRES_HOST
          value: postgres-service
        - name: DITTOFS_METADATA_POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: dfs-nfs
spec:
  selector:
    app: dfs
  ports:
  - port: 2049
    targetPort: 12049
    protocol: TCP
  type: LoadBalancer
Connection Pool Sizing
Connection pool sizing depends on your workload:
- Light workload (< 10 concurrent clients): max_conns: 10
- Medium workload (10-50 concurrent clients): max_conns: 15
- Heavy workload (50+ concurrent clients): max_conns: 20-25
Formula: max_conns ~ 2 x expected_concurrent_operations
PostgreSQL Limits: Ensure PostgreSQL max_connections > (DittoFS instances x max_conns)
Example: 3 DittoFS instances x 15 conns = 45 total connections needed from PostgreSQL
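The sizing rules above reduce to two small calculations. The sketch below only restates the formula and the PostgreSQL limit from this section; the function names are hypothetical.

```go
package main

import "fmt"

// suggestedMaxConns applies the rule of thumb
// max_conns ~ 2 x expected_concurrent_operations.
func suggestedMaxConns(expectedConcurrentOps int) int {
	return 2 * expectedConcurrentOps
}

// requiredConnections is the total PostgreSQL must be able to serve:
// DittoFS instances x max_conns.
func requiredConnections(instances, maxConns int) int {
	return instances * maxConns
}

// fits reports whether PostgreSQL max_connections is strictly greater
// than the total required by all DittoFS instances.
func fits(pgMaxConnections, instances, maxConns int) bool {
	return pgMaxConnections > requiredConnections(instances, maxConns)
}

func main() {
	fmt.Println(requiredConnections(3, 15)) // 45
	fmt.Println(fits(100, 3, 15))           // true
	fmt.Println(fits(45, 3, 15))            // false: no headroom left
	fmt.Println(suggestedMaxConns(5))       // 10
}
```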
Performance Considerations
- Network Latency: PostgreSQL adds ~1-2ms latency per metadata operation
- Statistics Caching: Reduces expensive queries (disk usage, file counts)
- Query Optimization: All queries use indexed fields for fast lookups
- Transaction Overhead: Short-lived transactions minimize lock contention
Best Practices
- Use Connection Pooling: Keep max_conns reasonable (10-20 per instance)
- Enable TLS: Use sslmode: require or higher in production
- Monitor Connections: Watch PostgreSQL connection count and utilization
- Scale Horizontally: Add DittoFS replicas rather than enlarging the connection pool
- Separate Read Replicas: For read-heavy workloads, consider PostgreSQL read replicas
Durable Handle State Flow
SMB3 durable handles allow open file state to survive client disconnects and (with persistent backends) server restarts. The lifecycle is:
OPEN -[disconnect]-> ORPHANED -[scavenger timeout]-> EXPIRED -[cleanup]-> CLOSED
                        |
                        +-[reconnect]--> RESTORED --> OPEN
                        |
                        +-[conflict/app-instance]--> FORCE_EXPIRED --> CLOSED
Grant: CREATE with a DHnQ/DH2Q context triggers a durability check. If the oplock level and share mode allow it, the server grants a durable handle with a configurable timeout (default 60s).
Disconnect: On connection loss, closeFilesWithFilter checks IsDurable. Durable files are persisted to DurableHandleStore (locks and leases preserved) rather than closed.
Scavenger: A background goroutine (DurableHandleScavenger) runs at 10-second intervals. For each expired handle it performs cleanup: releases byte-range locks, flushes block store caches, then deletes the handle from the store. On server restart, the scavenger adjusts remaining timeouts to account for downtime.
Reconnect: A new session sends CREATE with DHnC/DH2C. The server validates the durable-handle context against stored state (share name, path, username, session key hash, FileID, DesiredAccess, ShareAccess, expiry, and file existence) and restores the OpenFile without data loss.
Conflict: When a new open targets a file with an orphaned durable handle, the scavenger force-expires the orphaned handle to allow the new open to proceed. Cleanup includes releasing byte-range locks and flushing block store caches.
App Instance ID: For Hyper-V failover, a CREATE with a matching AppInstanceId triggers force-close of the old handle, allowing the new VM instance to take over.
Admin API: GET /api/v1/durable-handles lists all active handles with remaining timeout. DELETE /api/v1/durable-handles/{id} force-closes a specific handle.
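The lifecycle above can be modeled as a small transition table. This is a sketch for illustration, not the DittoFS implementation: it assumes the reconnect and conflict branches both leave ORPHANED (as the prose describes), reuses `cleanup` for the unlabeled FORCE_EXPIRED arrow, and introduces a hypothetical `resume` event for the unlabeled RESTORED arrow.

```go
package main

import "fmt"

type state string
type event string

const (
	Open         state = "OPEN"
	Orphaned     state = "ORPHANED"
	Expired      state = "EXPIRED"
	Restored     state = "RESTORED"
	ForceExpired state = "FORCE_EXPIRED"
	Closed       state = "CLOSED"
)

const (
	Disconnect       event = "disconnect"
	ScavengerTimeout event = "scavenger timeout"
	Reconnect        event = "reconnect"
	Conflict         event = "conflict/app-instance"
	Cleanup          event = "cleanup"
	Resume           event = "resume" // hypothetical label; the diagram leaves this arrow unnamed
)

// transitions lists every allowed move; anything else is rejected.
var transitions = map[state]map[event]state{
	Open:         {Disconnect: Orphaned},
	Orphaned:     {ScavengerTimeout: Expired, Reconnect: Restored, Conflict: ForceExpired},
	Restored:     {Resume: Open},
	Expired:      {Cleanup: Closed},
	ForceExpired: {Cleanup: Closed},
}

// next returns the state reached from s on event e, or an error if the
// lifecycle does not allow that move.
func next(s state, e event) (state, error) {
	if to, ok := transitions[s][e]; ok {
		return to, nil
	}
	return s, fmt.Errorf("invalid transition from %s on %q", s, e)
}

func main() {
	s := Open
	for _, e := range []event{Disconnect, Reconnect, Resume} {
		var err error
		if s, err = next(s, e); err != nil {
			fmt.Println(err)
			return
		}
		fmt.Println(s)
	}
}
```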
Phase 12 Engine API + BlockRef + Cache (v0.15.0 A3)
Phase 12 (v0.15.0 A3) reshapes the read path so the engine never imports
pkg/metadata on hot paths and consumes a caller-supplied
[]BlockRef snapshot as the authoritative content list for every file.
BlockRef — the on-the-wire content unit
BlockRef is the 3-tuple of (Hash ContentHash, Offset uint64, Size uint32)
defined in pkg/blockstore/types.go (D-10/D-19). FileAttr.Blocks []BlockRef (in pkg/metadata/file_types.go) is the authoritative,
offset-sorted list of every chunk that composes a file. It is populated
on every sync finalization; the engine binary-searches it via
findBlocksForRange (pkg/blockstore/engine/range.go, D-12).
Storage encodings differ per backend:
- Postgres uses a separate file_block_refs table (D-01..D-04; migration 000012_file_block_refs.up.sql) with PK (file_id, offset) INCLUDE (size, hash), FK ON DELETE CASCADE, hash column BYTEA. Random 4 KiB writes touch 1–2 rows instead of rewriting a ~1.5 MB TOAST blob, which was the VM-workload decision driver.
- Badger and Memory inline-encode Blocks []BlockRef inside the existing FileAttr blob (gob for Badger, typed structs for Memory) via the same omitempty tag for legacy tolerance (D-05).
Engine API (API-01..04)
// pkg/blockstore/engine/engine.go (Phase 12 signatures)
ReadAt(ctx, payloadID, blocks []BlockRef, dest []byte, offset uint64) (int, error)
WriteAt(ctx, payloadID, currentBlocks []BlockRef, data []byte, offset uint64) ([]BlockRef, error)
Truncate(ctx, payloadID, currentBlocks []BlockRef, newSize uint64) ([]BlockRef, error)
Delete(ctx, payloadID, blocks []BlockRef) error
CopyPayload(ctx, srcPayloadID, srcBlocks []BlockRef, dstPayloadID) ([]BlockRef, error)
Range-coverage semantics: findBlocksForRange(blocks, offset, size)
returns the [start, end) index range of the BlockRef slice that overlaps the
requested range, using binary search on the offset-sorted slice. Sparse holes
inside FileAttr.Size are zero-filled (D-21): finding no BlockRef for part of the range is documented behavior, not a bug. Reads past FileAttr.Size
return a short read or EOF.
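The range lookup can be sketched with two binary searches from the standard library. This is a minimal sketch assuming non-overlapping, offset-sorted blocks; it is not the code in `pkg/blockstore/engine/range.go`, and the return shape (a pair of slice indices) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// BlockRef mirrors the 3-tuple described in this document.
type BlockRef struct {
	Hash   [32]byte
	Offset uint64
	Size   uint32
}

// findBlocksForRange returns the half-open index range [start, end) of
// blocks that overlap the byte range [offset, offset+size).
// An empty result (start == end) means the range falls in a sparse hole.
func findBlocksForRange(blocks []BlockRef, offset, size uint64) (int, int) {
	if size == 0 {
		return 0, 0
	}
	limit := offset + size
	// First block whose end lies past the requested offset.
	start := sort.Search(len(blocks), func(i int) bool {
		return blocks[i].Offset+uint64(blocks[i].Size) > offset
	})
	// First block that begins at or after the end of the request.
	end := sort.Search(len(blocks), func(i int) bool {
		return blocks[i].Offset >= limit
	})
	if end < start {
		end = start
	}
	return start, end
}

func main() {
	// Two 4-byte blocks with a hole at [4, 8).
	blocks := []BlockRef{{Offset: 0, Size: 4}, {Offset: 8, Size: 4}}
	fmt.Println(findBlocksForRange(blocks, 2, 8)) // 0 2: both blocks overlap
	fmt.Println(findBlocksForRange(blocks, 4, 4)) // 1 1: hole, caller zero-fills
}
```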
CopyPayload is O(1) in data moved: a single metadata transaction increments
FileBlock.RefCount for every distinct hash in srcBlocks and inserts
the dst rows (D-11). No data is copied. This is the file-level dedup
primitive that Phase 13 (META-02 / BSCAS-04/05) consumes.
MetadataCoordinator (pkg/blockstore/engine/coordinator.go) is the
narrow interface the engine uses to mutate refcounts and persist
FileAttr.Blocks. The engine never opens a metadata txn itself —
the API-02 strict-grep gate enforces zero pkg/metadata imports under
pkg/blockstore/engine/*.go production files except a single justified
exception in gc.go.
Cache (CACHE-01..06)
The Cache type (pkg/blockstore/engine/cache.go) is keyed solely by
ContentHash. It absorbs the former readbuffer/cache.go + standalone
prefetch.go worker pool into a single per-share type with a single
budget (cache.size_mib, default 256 MiB; D-31). Two files reading the
same chunk hit the same entry (CACHE-02 cross-file dedup).
// pkg/blockstore/engine/cache.go (CACHE-04 hint API)
OnRead(payloadID PayloadID, hashes []ContentHash, fileSize uint64)
InvalidateFile(payloadID PayloadID, removedHashes []ContentHash) // CACHE-05 surgical
Sequential prefetch triggers after 3 consecutive sequential reads (D-29 / CACHE-03; raised from Phase 11’s threshold of 2 to suppress speculative prefetch on accidental two-block runs in random-IO workloads). Bounded concurrency: 4 worker goroutines per cache by default. LRU eviction (D-30; ARC/LFU rejected as overkill for v0.15.0).
Single-copy reads: on Linux/Darwin, readFromCAS
(cache_mmap_unix.go) mmaps the local CAS chunk and copy(dest, mapped[offset:]) once (CACHE-06 / D-33). Chunks below 64 KiB use
os.ReadFile (mmap setup overhead dominates tiny reads). Windows uses
os.ReadFile only.
InvalidateFile is surgical (CACHE-05): the caller passes only the
hashes that disappeared from the file, so other files still referencing
those hashes via dedup keep them warm. Invalidation happens
post-txn (D-35) — caller commits new []BlockRef first, then drops
cache entries.
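The set of hashes passed to InvalidateFile is a plain set difference between the old and new block lists. The sketch below shows one way to compute it; `diffRemovedHashes` is named in the write-coordination example, but this body is an assumption, not the project's implementation.

```go
package main

import "fmt"

type ContentHash [32]byte

type BlockRef struct {
	Hash   ContentHash
	Offset uint64
	Size   uint32
}

// diffRemovedHashes returns each hash that appears in oldBlocks but no
// longer appears anywhere in newBlocks, without duplicates.
func diffRemovedHashes(oldBlocks, newBlocks []BlockRef) []ContentHash {
	keep := make(map[ContentHash]struct{}, len(newBlocks))
	for _, b := range newBlocks {
		keep[b.Hash] = struct{}{}
	}
	seen := make(map[ContentHash]struct{})
	var removed []ContentHash
	for _, b := range oldBlocks {
		if _, ok := keep[b.Hash]; ok {
			continue // still referenced by this file
		}
		if _, dup := seen[b.Hash]; dup {
			continue // already reported once
		}
		seen[b.Hash] = struct{}{}
		removed = append(removed, b.Hash)
	}
	return removed
}

func main() {
	a, b, c := ContentHash{1}, ContentHash{2}, ContentHash{3}
	oldBlocks := []BlockRef{{Hash: a}, {Hash: b, Offset: 4}, {Hash: b, Offset: 8}}
	newBlocks := []BlockRef{{Hash: a}, {Hash: c, Offset: 4}}
	removed := diffRemovedHashes(oldBlocks, newBlocks)
	fmt.Println(len(removed), removed[0] == b)
}
```

Hashes that survive in the new list are never reported, so a rewrite that keeps most chunks leaves most of the cache warm.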
Adapter call sites unchanged
All NFS v3/v4 + SMB v2 protocol handlers stay untouched (D-26). The
internal/adapter/common/{ResolveForRead, ResolveForWrite, WriteToBlockStore, ReadFromBlockStore} helpers absorb the new
[]BlockRef threading. The Phase 09 (ADAPT-04) seam pays off here:
Phase 12’s adapter diff is confined to the helpers.
Operator surfaces
- dfsctl blockstore audit-refcounts <share> runs the INV-02 reconciliation audit (∑ FileBlock.RefCount == ∑ len(FileAttr.Blocks)), emits aggregate counts as structured slog INFO, and persists the last-run summary at <localStore>/audit-state/last-inv02.json. See docs/CLI.md for the full reference and docs/FAQ.md for operator guidance.
- Cache and prefetch knobs (cache.size_mib, cache.prefetch_threshold, cache.prefetch_max_depth, cache.prefetch_workers) are documented in docs/CONFIGURATION.md.
Migration window
Phase 12 ships forward-only Postgres migration
000012_file_block_refs.up.sql. Legacy files written before Phase 12
keep using the Phase 11 dual-read shim (D-20: empty/nil []BlockRef
triggers the metadata-driven legacy resolver). Phase 14 ships
dfsctl blockstore migrate to backfill []BlockRef and CAS-keys
atomically; Phase 15 retires the dual-read shim. See
docs/BLOCKSTORE_MIGRATION.md for the operator-facing migration
guide.
Phase 13 File-Level Dedup: ObjectID + Merkle Root (v0.15.0 A4)
Phase 13 (v0.15.0 A4) layers file-level dedup on top of the Phase 12
chunk-level CAS path. Each FileAttr carries an ObjectID — a BLAKE3
Merkle root computed over the file’s BlockRef.Hash values sorted by
Offset, prefixed by the domain-separation tag
dittofs:objectid:v1\x00:
ObjectID = BLAKE3("dittofs:objectid:v1\x00" || h0 || h1 || ... || hN-1)
Implemented in blockstore.ComputeObjectID
(pkg/blockstore/objectid.go). Stable across rename and engine restart
by construction (BLAKE3 + FastCDC are both deterministic; the prefix
protects the output space from per-chunk hash collisions and reserves
room for future input-shape changes via v2/v3).
Lifecycle
- Cleared (zeroed) on first dirty write that mutates FileAttr.Blocks, in the same metadata transaction (D-07).
- Recomputed and persisted at the post-Flush coordinator hook (Syncer.persistFileBlocksAfterFlush → MetadataCoordinator.PersistFileBlocks), in the same metadata transaction that updates FileAttr.Blocks/Size/Mtime (D-05).
- Persisted ONLY on full quiesce, when every block is in Remote state (D-06). Partial flushes leave ObjectID at zero.
A non-zero ObjectID always reflects a fully-Remote consistent
state. Lookups (BSCAS-05 short-circuit) trust this without checking
per-block states. Empty files dedup to one canonical constant
BLAKE3("dittofs:objectid:v1\x00"); legacy pre-Phase-13 files keep
the all-zero sentinel until Phase 14 backfills.
File-level dedup short-circuit (BSCAS-05)
When a file’s BlockRef list is fully Pending (newly chunked, nothing
uploaded yet) and the file has no prior ObjectID, the syncer:
- Computes the provisional ObjectID over the chunker output.
- Calls MetadataStore.FindByObjectID(ctx, objectID).
- On hit: increments RefCount on every distinct hash in the target’s BlockRef list, replaces the file’s BlockRef list with the target’s (deep copy), persists the ObjectID, decrements RefCount on any speculative-only hashes, invalidates orphaned cache entries, and truncates the per-file append log. Zero S3 PUTs.
- On miss: continues the per-block GetByHash + PUT path; ObjectID is finalized at the post-Flush coordinator hook.
Trigger condition (D-09): len(Blocks) > 0 AND every block.State == Pending AND file.ObjectID == zero. This captures fresh-file-create
(VM image clone — primary target) and full-overwrite (cp -f,
dd-overwrite, restore-from-backup). It intentionally excludes the
running-VM hot path (incremental writes already get chunk-level dedup
via Phase 11 GetByHash and would not benefit from file-level
fingerprinting that requires a quiesce).
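The trigger condition translates directly into a predicate. The types below are simplified stand-ins invented for this sketch; only the three-part condition itself comes from the text.

```go
package main

import "fmt"

type BlockState int

const (
	Pending BlockState = iota
	Remote
)

// block and file are simplified stand-ins for the real metadata types.
type block struct{ State BlockState }

type file struct {
	Blocks   []block
	ObjectID [32]byte
}

// shouldTryFileLevelDedup implements the D-09 trigger:
// len(Blocks) > 0 AND every block is Pending AND ObjectID is zero.
func shouldTryFileLevelDedup(f file) bool {
	if len(f.Blocks) == 0 || f.ObjectID != ([32]byte{}) {
		return false
	}
	for _, b := range f.Blocks {
		if b.State != Pending {
			return false
		}
	}
	return true
}

func main() {
	fresh := file{Blocks: []block{{Pending}, {Pending}}}
	running := file{Blocks: []block{{Remote}, {Pending}}} // incremental write on a synced file
	fmt.Println(shouldTryFileLevelDedup(fresh))   // true
	fmt.Println(shouldTryFileLevelDedup(running)) // false
}
```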
Production call chain (post-Plans 13-12 / 13-13)
The end-to-end wiring as of v0.15.0 (Plans 13-12 + 13-13 closed the Phase 13 chain). Reads bottom-up; arrows show synchronous dispatch:
Production call chain (per-write, on quiesce):
protocol handler (NFSv3 COMMIT, NFSv4 COMMIT, SMB CLOSE)
  → internal/adapter/common.CommitBlockStore
    → engine.BlockStore.Flush
      → engine.Syncer.Flush
          ├─[BSCAS-05 short-circuit]
          │   ├─ snapshotPendingBlockRefs(payloadID)     // ListFileBlocks projection
          │   ├─ coordinator.GetFileObjectID(payloadID)  // trigger-condition check
          │   ├─ TrySpeculativeFileLevelDedup
          │   │   ├─ ComputeObjectID(specBlocks)
          │   │   ├─ coordinator.FindByObjectID
          │   │   └─ applyFileLevelDedupHit (one metadata txn):
          │   │       ├─ IncrementRefCount on each target hash
          │   │       ├─ coordinator.PersistFileBlocks(target.Blocks, provisionalObjectID)
          │   │       ├─ DecrementRefCount on speculative-only hashes
          │   │       ├─ Cache.InvalidateFile(removedHashes)
          │   │       └─ local.DeleteAppendLog(payloadID)
          │   └─[hit] return Finalized:true (zero new CAS PUTs)
          │
          └─[BSCAS-04 post-Flush hook (on miss OR no trigger)]
              ├─ drainPayloadToRemote (uploadOne per Pending block)
              ├─ snapshotBlockRefs (every block now Remote)
              └─ persistFileBlocksAfterFlush
                  ├─ ComputeObjectID(blocks)
                  └─ coordinator.PersistFileBlocks(blocks, objectID)
                      └─ runtime coordinator: WithTransaction(GetFileByPayloadID + PutFile)
                         // FileAttr.Blocks AND FileAttr.ObjectID
                         // written in one metadata txn (CR-01)
Both branches finalize FileAttr.ObjectID inside the same metadata
transaction that persists FileAttr.Blocks (D-05). The hit branch
performs zero new CAS PUTs (donor blocks already exist remotely);
the miss branch uploads each Pending block once via uploadOne and
then runs the post-Flush hook.
Source-of-truth file::symbol anchors:
- pkg/blockstore/engine/syncer.go::Flush: entry point + branch selection; snapshotPendingBlockRefs (BSCAS-05 input) and snapshotBlockRefs (BSCAS-04 input) helpers.
- pkg/blockstore/engine/dedup.go::TrySpeculativeFileLevelDedup and applyFileLevelDedupHit: the metadata-side swap.
- pkg/blockstore/engine/dedup.go::persistFileBlocksAfterFlush: the post-Flush coordinator hook.
- pkg/controlplane/runtime/shares/coordinator.go::PersistFileBlocks / GetFileObjectID: runtime forwarders.
Concurrent quiesce: first-committer-wins
Two concurrent flushes of byte-identical content race independently
(no distributed locking). At commit time the partial unique index on
object_id ensures exactly one write succeeds; the loser detects the
conflict (Postgres SQLSTATE 23505 / metadata.ErrConflict on Memory
and Badger), decrements its just-uploaded refs, swaps to the
now-existing target’s BlockRef list, and re-commits. One wasted upload
per loser is acceptable; GC reclaims any orphans. See
pkg/metadata/storetest/objectid_lookup.go for the cross-backend
race conformance scenarios.
Per-backend ObjectID lookup index
MetadataStore.FindByObjectID(ctx, ObjectID) ([]BlockRef, error)
returns (nil, nil) on miss; on hit returns the canonical BlockRef
list of the matching file (per-metadata-store scope, NOT per-share —
D-13). Backends maintain a secondary index:
| Backend | Index |
|---|---|
| Postgres | Partial unique: files_object_id_idx ON files(object_id) WHERE object_id IS NOT NULL (migration 000013_object_id) |
| Badger | Secondary key obj:{hex} -> file_id, maintained inside each Put/Delete write batch |
| Memory | map[ContentHash]uuid, guarded by the existing store mutex |
Zero-valued ObjectID (legacy / pre-quiesce) is excluded from the index
— FindByObjectID(zero) short-circuits to (nil, nil) at every layer
so partial states never trigger a false short-circuit.
Observability
Phase 13 emits slog-only signals (D-20; matches Phase 11 D-35 / Phase 12 D-42 deferral):
- DEBUG: post-Flush ObjectID persisted; short-circuit hit/miss with payloadID, objectID, donor_blocks.
- INFO: cross-VM dedup ratio emitted by the e2e fixture (test/e2e/dedup_vmfleet_test.go, nightly).
No new Prometheus surface; metrics roll into the dedicated observability phase.
Performance gate (D-21)
Hard gate: ≤2% rand-write regression vs BenchmarkRandWriteCAS
baseline. The microbench
(pkg/blockstore/engine/perf_bench_test.go::BenchmarkRandWrite_Phase13Baseline)
mirrors the Phase 12 D-43 paired-bench pattern and is gated by the CI
perf lane (D21_STRICT_GATE=1). ObjectID compute is one BLAKE3 pass
over 32×N bytes per quiesce (sub-millisecond at N=16K BlockRefs);
short-circuit lookup is one indexed query per quiesce. Both fire off
the random-write hot path.
Migration & Block-Layout Routing (v0.15.x A5)
Phase 14 (#425) ships dfsctl blockstore migrate, the offline tool
that converts a v0.13/v0.14 share’s block layout from path-indexed
legacy keys ({payloadID}/block-{idx}) to the v0.15 CAS layout
(cas/{hh}/{hh}/{hex}). Two ARCHITECTURE-level pieces ship alongside
the tool: the per-share block_layout flag, and the engine-level
gate that routes reads through the dual-read shim or the CAS-only
fast path based on that flag.
Per-share block_layout flag
A new field block_layout on metadata.ShareOptions carries the
share’s authoritative layout state (Plan 14-01, D-A6):
type BlockLayout uint8

const (
    BlockLayoutLegacy  BlockLayout = iota // dual-read: shim + CAS
    BlockLayoutCASOnly                    // CAS-only: legacy reads fail loud
)

type ShareOptions struct {
    // ... pre-existing fields ...
    BlockLayout BlockLayout
}
Storage:
| Backend | Layout |
|---|---|
| Postgres | Dedicated block_layout TEXT NOT NULL DEFAULT 'legacy' column on shares (migration 000014_block_layout.up.sql, reversible). Authoritative over the legacy options JSON blob. |
| Badger | Inline-encoded inside the existing ShareOptions blob (gob; omitempty on the new field for forward-compat with pre-Phase-14 rows). |
| Memory | Direct field on the in-process struct. |
ParseBlockLayout("") coerces empty / missing values to
BlockLayoutLegacy so pre-Phase-14 metadata rows decode cleanly
(forward-compat). Unknown values surface
metadata.ErrInvalidBlockLayout rather than silently coercing.
The flag is read once by shares.Service.createBlockStoreForShare
when the share’s per-share *engine.BlockStore is constructed, then
threaded into engine.SyncerConfig.BlockLayout. The engine never
re-reads it during normal operation; the migration tool’s cutover
runs while the daemon is offline so a stale in-memory copy is
impossible.
Dual-read shim and the CAS-only gate
The dual-read shim is the engine code path that resolves block reads
from two coexisting key spaces (see
Dual-Read Window for the
per-block resolution rules). The Phase 14 gate sits one level above
the shim, in engine.Syncer.dispatchRemoteFetch:
      ┌───────────────────────────────────────────┐
      │ engine.Syncer.dispatchRemoteFetch(block)  │
      └────────────────────┬──────────────────────┘
                           │
                           ▼
            block.Hash != ZeroContentHash ?
              ┌────────────┴────────────┐
       yes (CAS shape)          no (legacy shape)
              │                         │
              ▼                         ▼
  remote.ReadBlockVerified     [BlockLayout gate]
              │                         │
              │               ┌─────────┴─────────┐
              │            legacy             cas-only
              │               │                   │
              ▼               ▼                   ▼
         (CAS path)   remote.ReadBlock ErrLegacyReadOnCASOnly
                      (dual-read shim) (fail loud, slog Error)
Concretely:
- block_layout=legacy (the default for upgraded shares before migration): the engine resolves CAS-shaped FileBlocks via the CAS path AND legacy-shaped FileBlocks via the dual-read shim. Both key spaces coexist. This is exactly the Phase 11 → Phase 14 dual-read window described above.
- block_layout=cas-only (set by the migration tool’s cutover txn after integrity passes): legacy-shaped FileBlocks surface engine.ErrLegacyReadOnCASOnly as a fail-loud signal. The function logs at Error with block_id + store_key and returns the wrapped sentinel rather than silently falling through to ReadBlock. This guards against the case where a freshly-cutover share encounters a forgotten legacy FileBlock: the engine fails loud rather than reading from a key that the migration tool already deleted.
The gate is defense-in-depth: the migration tool’s atomic per-file
PutFile already updates every legacy FileBlock to the CAS shape
before flipping block_layout. Encountering a legacy-shaped block
post-cutover indicates either a migration-tool bug, a metadata-store
corruption, or a hand-edited row — all of which are operationally
distinct from a normal dual-read fallback and demand operator
attention rather than a silent legacy read.
Migration tool boundary
The migration tool itself is intentionally offline-only (D-A5) and lives outside the daemon:
- Tool entrypoint: cmd/dfsctl/commands/blockstore/migrate.go, invoked via dfsctl blockstore migrate --share NAME.
- Tool composition root: openOfflineRuntime in cmd/dfsctl/commands/blockstore/migrate_runtime.go. It composes per-share metadata + remote stores directly from the controlplane DB, deliberately bypassing pkg/controlplane/runtime.Runtime so the tool cannot accidentally race a live daemon.
- Tool refuses to run if a daemon is serving the target share: the ensureDaemonOffline PID-file probe is run before any work.
- The tool’s pipeline is: walk → FastCDC re-chunk → GetByHash dedup probe → upload (or IncrementRefCount) → PutFile Blocks + ObjectID → journal Append → integrity HEAD-per-ref → cutover (block_layout flip) → legacy delete sweep. See BLOCKSTORE_MIGRATION.md for the full operator-facing runbook.
Phase 15 (A6) removes the dual-read shim
Phase 15 is intentionally deferred until Phase 14’s migration tool has
been rolled out across production workloads (per-share verification
via dfsctl blockstore migrate status). Once every production share
is block_layout=cas-only, Phase 15 deletes:
- The engine.Syncer.dispatchRemoteFetch legacy fallback branch.
- The Phase 11 D-21 metadata-driven legacy resolver.
- The block_layout=legacy enum variant (collapsed to a single CAS-only routing).
- Every {payloadID}/block-{idx} key-handling code path.
Until Phase 15 ships, anyone touching the dual-read shim should be aware it is on a deletion clock — no new behavior should accumulate there.
Performance Characteristics
DittoFS is designed for high performance through several architectural choices:
- Direct protocol implementation: No FUSE overhead
- Goroutine-per-connection model: Leverages Go’s lightweight concurrency
- Buffer pooling: Reduces GC pressure for large I/O operations
- Streaming I/O: Efficient handling of large files without full buffering
- Three-tier storage: Unified CAS-keyed Cache + local disk + remote store for optimal read latency (Phase 12 collapsed Phase 11’s readbuffer + prefetcher pair into a single Cache type)
- Zero-copy aspirations: Working toward minimal data copying in hot paths
Why Pure Go?
Go provides significant advantages for a project like DittoFS:
- Easy deployment: Single static binary, no runtime dependencies
- Cross-platform: Native support for Linux, macOS, Windows
- Easy integration: Embed DittoFS directly into existing Go applications
- Modern concurrency: Goroutines and channels for natural async I/O
- Memory safety: No buffer overflows or use-after-free vulnerabilities
- Strong ecosystem: Rich standard library and third-party packages
- Fast compilation: Quick iteration during development
- Built-in tooling: Testing, profiling, and race detection included