Architecture

Verbatim Studio is built as a modular, privacy-first system with two deployment models: a standalone desktop application for individuals and a containerized enterprise stack for teams. Both editions share the same core transcription engine, AI pipeline, and search infrastructure — the enterprise edition layers multi-user authentication, team management, and production-grade infrastructure on top via a plugin system.

This document provides a comprehensive technical overview of the system architecture, including component design, model pipeline, data flow, security model, and extension points.

Desktop Edition

The desktop application bundles the entire stack into a single Electron shell. No external services, accounts, or internet connectivity are required after the initial model download.

  • Electron Shell: Chromium renderer, auto-update (electron-updater), system tray, window management
  • React Frontend: Vite-built SPA (recording controls, transcript editor, chat, voice, search)
  • Python Backend: FastAPI server (STT, OCR, LLM, TTS, diarization, embeddings, search, voice)
  • SQLite: recordings, transcripts, metadata, embeddings, settings
  • Local Filesystem: media files, model weights, configuration

Application Shell

The Electron layer provides the native application wrapper. On launch, it spawns the Python backend as a managed child process and loads the React frontend in a Chromium renderer window. Key responsibilities:

  • Process lifecycle — spawns and monitors the FastAPI backend process, handles graceful shutdown, and manages port allocation
  • Auto-update — uses electron-updater to check for new releases, download updates in the background, and apply them on restart
  • System tray — provides a persistent system tray icon with quick-access controls for recording and status monitoring
  • Native integrations — OS-level audio input selection, microphone permissions, and keychain access for credential storage

Frontend Layer

The frontend is a React single-page application built with Vite and served locally by the backend's static file handler. It communicates with the backend exclusively over HTTP and WebSocket connections on localhost. The frontend handles:

  • Recording controls with real-time waveform visualization
  • Live transcript editor with speaker labels, timestamps, and inline editing
  • Chat interface for Max with tool result rendering
  • Voice chat UI with push-to-talk and hands-free modes
  • Semantic search interface with filtering and result highlighting
  • Document viewer with OCR overlay, page navigation, and annotation support
  • Project management dashboard with stats and activity feed

Backend Layer

The Python backend is a FastAPI application that handles all computation-intensive operations. It exposes a REST API and WebSocket endpoints consumed by the frontend and manages the lifecycle of all ML models. The server uses async I/O for request handling and delegates GPU-bound work to dedicated worker threads so that long-running inference does not block the event loop.

  • Route structure — organized by domain: recordings, transcripts, documents, projects, chat, voice, search, settings
  • Model management — lazy-loads models on first use, manages GPU memory allocation, and supports hot-swapping between model sizes
  • WebSocket endpoints — real-time transcription progress, chat streaming, and voice chat audio streaming
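
The lazy-loading and worker-thread behavior described above can be illustrated with a minimal sketch. The `ModelRegistry` class and `run_blocking` helper are hypothetical names chosen for illustration, not Verbatim's actual code; the sketch only assumes that a model is a callable produced by a loader function.

```python
import asyncio
import threading
from typing import Any, Callable, Dict


class ModelRegistry:
    """Hypothetical registry that loads each model the first time it is requested."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        # Registering is cheap: nothing is loaded until get() is called.
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        # The lock ensures concurrent requests trigger only one load per model.
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loaders[name]()
            return self._models[name]

    def unload(self, name: str) -> None:
        # Dropping the reference lets a different model size be loaded later.
        with self._lock:
            self._models.pop(name, None)


async def run_blocking(registry: ModelRegistry, name: str, *args: Any) -> Any:
    """Run a model call in a worker thread so the event loop stays responsive."""

    def _work() -> Any:
        model = registry.get(name)
        return model(*args)

    # asyncio.to_thread (Python 3.9+) moves the blocking call off the event loop.
    return await asyncio.to_thread(_work)
```

In a FastAPI route handler, a call such as `await run_blocking(registry, "stt", audio)` would keep other requests flowing while inference runs.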

Enterprise Edition

The enterprise edition replaces Electron with Docker and SQLite with PostgreSQL, adding multi-user authentication, team management, and a full REST API. Enterprise capabilities are loaded through a plugin system — the core application remains identical to the desktop edition.

  • nginx: reverse proxy (:80 → backend :8000), SSL termination, static asset caching
  • Backend Container, Core (OSS): Transcription, OCR, AI, Voice, Search, Projects, Recordings
  • Backend Container, Enterprise Plugin: Auth, Teams, API Keys, Webhooks, Audit, License, Admin
  • PostgreSQL 16: primary datastore (ACID, pgvector, concurrent access)
  • S3 / Azure Blob: optional media storage (recordings, documents, exports)
  • LLM Provider: OpenAI, Anthropic, Ollama, vLLM, or any compatible API

Container Architecture

The enterprise stack runs as three Docker containers orchestrated by Docker Compose. Each container has a dedicated responsibility:

| Container | Image | Port | Responsibility |
| --- | --- | --- | --- |
| nginx | nginx:1.27-alpine | 80 | Reverse proxy, SSL/TLS termination, static asset serving, request buffering |
| backend | ghcr.io/jongodb/verbatim-enterprise | 8000 | FastAPI application with core + enterprise plugin, frontend SPA, all ML models |
| postgres | postgres:16-alpine | 5432 | Primary datastore; internal only, not exposed to the host network |

Persistent volumes (postgres-data and app-data) survive container restarts and image updates. For production deployments, an external managed PostgreSQL instance (AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL) can replace the bundled container.

Plugin System

Enterprise features are loaded dynamically via Python's entry point mechanism, keeping the core application identical between editions. The pyproject.toml declares:

[project.entry-points."verbatim.plugins"]
enterprise = "verbatim_enterprise.plugin:EnterprisePlugin"

Plugin Lifecycle

On application startup, the backend executes the following sequence:

  1. load_plugins(): Discovers the enterprise plugin via Python entry points. Calls register() on the plugin, which contributes adapters (storage, database), routers (auth, teams, API keys, webhooks, admin), middleware (license, JWT, audit), and SQLAlchemy models to the application registry.
  2. run_startup_hooks(): Executes plugin startup hooks. The enterprise hook initializes the async PostgreSQL engine (asyncpg), replacing the default SQLite engine, and configures connection pooling with the pool_size and max_overflow parameters.
  3. init_db(): Creates all database tables, both core (recordings, transcripts, documents, projects, settings) and enterprise (users, teams, API keys, webhooks, audit logs), in PostgreSQL using SQLAlchemy's create_all().
  4. apply_to_app(): Mounts enterprise routers and middleware onto the FastAPI application instance. Middleware is applied in a specific order to ensure correct request processing (see Middleware Stack below).

This architecture means the desktop edition never loads enterprise code, and the enterprise edition inherits all core functionality without modification. The plugin adds capabilities — it doesn't fork the codebase.
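
As a rough illustration of entry-point discovery, the sketch below uses the standard library's `importlib.metadata`. The `Registry` class and the shape of `register()` are assumptions made for the example; only the group name `verbatim.plugins` comes from the declaration above.

```python
from importlib.metadata import entry_points
from typing import Any, Iterable, List


class Registry:
    """Hypothetical application registry that plugins contribute to."""

    def __init__(self) -> None:
        self.routers: List[Any] = []
        self.middleware: List[Any] = []
        self.startup_hooks: List[Any] = []


def discover(group: str = "verbatim.plugins") -> Iterable[Any]:
    # Keyword selection by group requires Python 3.10 or newer.
    return entry_points(group=group)


def load_plugins(registry: Registry, eps: Iterable[Any]) -> List[str]:
    """Load each entry point and let the plugin register its contributions."""
    loaded = []
    for ep in eps:
        plugin_cls = ep.load()            # imports the module, returns the class
        plugin_cls().register(registry)   # assumed signature: register(registry)
        loaded.append(ep.name)
    return loaded
```

When no package declares an entry point in the group, `discover()` yields nothing and `load_plugins()` returns an empty list, which matches the desktop edition never loading enterprise code.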

Model Pipeline

Verbatim runs multiple specialized ML models locally. Each model is loaded on demand and managed independently. This section documents every model in the pipeline, its role, and its runtime characteristics.

Processing Pipeline

  • Audio Input: microphone, system audio, file import
  • Document Input: PDF, DOCX, XLSX, images
  • Whisper STT: speech → text (5 model sizes)
  • Granite Vision: document → text (OCR)
  • pyannote.audio: speaker diarization
  • nomic-embed: text → embeddings
  • Granite 4.0 Tiny: AI assistant (llama.cpp)
  • Kokoro / Qwen3-TTS: text → speech
  • Search Index + Storage: SQLite FTS5 (desktop) / PostgreSQL + pgvector (enterprise)

Speech-to-Text (Whisper)

The transcription engine uses OpenAI Whisper models with platform-optimized inference runtimes. On macOS with Apple Silicon, inference runs through MLX Whisper for native Metal GPU acceleration. On Windows and in Docker, faster-whisper (built on CTranslate2) provides CUDA-accelerated inference. A CPU fallback is always available.

Five model sizes are supported (tiny through large-v3), ranging from 39M to 1.55B parameters. The base model (74M parameters, ~137 MB) is bundled by default. Larger models are downloaded on demand when selected in Settings. All models support 99+ languages with automatic language detection.

The transcription pipeline processes audio in chunks, streaming partial results to the frontend via WebSocket for real-time display. Audio preprocessing (noise reduction via FFmpeg, EBU R128 normalization) runs before the audio reaches the model when enabled.
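
Chunked processing can be sketched as a generator that slices the sample buffer and reports each chunk's time range, so partial results can be emitted as each chunk finishes. The function name and the 30-second default are assumptions for illustration (Whisper models operate on 30-second windows); the chunking strategy Verbatim actually uses is not described here.

```python
from typing import Iterator, Sequence, Tuple


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = 30.0,
) -> Iterator[Tuple[float, float, Sequence[float]]]:
    """Yield (start_seconds, end_seconds, chunk) for consecutive audio chunks."""
    step = int(sample_rate * chunk_seconds)
    if step <= 0:
        raise ValueError("chunk length must cover at least one sample")
    for start in range(0, len(samples), step):
        chunk = samples[start:start + step]
        # The final chunk may be shorter than the others.
        yield start / sample_rate, (start + len(chunk)) / sample_rate, chunk
```

Each yielded tuple carries enough timing information for the backend to attach timestamps to a partial transcript before sending it to the frontend.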

Speaker Diarization (pyannote.audio)

Speaker identification uses pyannote.audio, a neural speaker diarization framework. The pipeline performs three stages:

  1. Voice Activity Detection (VAD) — identifies speech regions in the audio, filtering silence and background noise
  2. Speaker Embedding Extraction — computes a fixed-dimensional vector representation for each speech segment using a pre-trained speaker embedding model
  3. Agglomerative Clustering — groups segments by speaker identity based on embedding similarity, producing speaker labels for each transcript segment

The diarization model (~100 MB) is downloaded automatically on first use. Diarization runs as a post-processing step after transcription completes.
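
To make the clustering stage concrete, here is a small, dependency-free sketch of agglomerative clustering over speaker embeddings using cosine distance and average linkage. It is not pyannote.audio's implementation; the function names and the 0.5 distance threshold are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_speakers(embeddings: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Return one speaker label per segment embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average linkage: mean pairwise distance between the two clusters.
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # the closest pair is too far apart to be the same speaker
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Number speakers in order of first appearance in the recording.
    clusters.sort(key=min)
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels
```

The resulting labels map one-to-one onto transcript segments, which is why diarization can run as a post-processing step once segment boundaries are known.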

Vision & OCR (Granite Vision)

Document text extraction uses Granite Vision, a multimodal vision-language model. Unlike traditional OCR engines that rely on character-level recognition, Granite Vision understands document layout, table structure, and handwritten text at a semantic level. It processes:

  • Scanned and digital PDFs (multi-page, with layout preservation)
  • Photographs of printed and handwritten text (including cursive)
  • Spreadsheets (XLSX) with cell-level extraction
  • Presentations (PPTX) with slide-by-slide processing
  • Word documents (DOCX) with formatting awareness
  • Raw images (PNG, JPEG, TIFF, WebP)

Granite Vision runs with GPU acceleration on Apple Silicon (Metal) and NVIDIA (CUDA). Extracted text is immediately indexed in the search layer for full-text and semantic retrieval.

Language Model (Granite 4.0 Tiny via llama.cpp)

The AI assistant Max is powered by Granite 4.0 Tiny, a 7B-parameter mixture-of-experts language model that activates only about 1B parameters per token and is built on a hybrid Mamba-2 state-space/transformer architecture. This design keeps memory footprint and inference latency low relative to a dense model of the same size while retaining strong reasoning capabilities.

The model runs through llama.cpp in GGUF format, supporting GPU-accelerated inference on both Apple Metal and NVIDIA CUDA. The inference server supports streaming token generation, which the frontend renders incrementally for responsive chat interactions.

Max has access to 15+ tools through a function-calling interface. When Max decides to use a tool, the backend executes the tool function and returns results to the model for synthesis. Tools include project search, document generation, transcript export, web search, entity extraction, segment highlighting, note creation, and more.
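
The tool-execution loop can be sketched as a registry of named functions plus a dispatcher that runs whichever tool the model requests. Everything in this example is hypothetical, including the `search_project` tool, its toy corpus, and the shape of the `call` dictionary (a name plus JSON-encoded arguments, a common convention for function calling).

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers a function under a tool name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return decorator


@tool("search_project")
def search_project(query: str, limit: int = 5) -> list:
    # Stand-in corpus; a real tool would query the search index.
    corpus = ["budget review notes", "quarterly budget call", "design sync"]
    return [doc for doc in corpus if query.lower() in doc][:limit]


def execute_tool_call(call: dict) -> dict:
    """Run the requested tool and package the result for the model."""
    name = call.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return {"name": name, "error": f"unknown tool: {name}"}
    try:
        args = json.loads(call.get("arguments") or "{}")
        return {"name": name, "result": fn(**args)}
    except (TypeError, ValueError) as exc:
        # Malformed JSON or wrong arguments are reported back, not raised.
        return {"name": name, "error": str(exc)}
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments.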

Enterprise users can connect external LLM providers (OpenAI, Anthropic, or any OpenAI-compatible API endpoint) as an alternative to the local model. LLM calls are made server-side — API keys never reach the browser.

Text-to-Speech (Kokoro / Qwen3-TTS)

Voice chat uses two available TTS models, selectable by the user:

  • Kokoro (82M parameters) — ultra-fast synthesis (<200ms latency), optimized for real-time conversational flow. Best for interactive voice chat where responsiveness matters.
  • Qwen3-TTS (1.7B parameters) — higher-quality synthesis with more natural prosody and intonation. Better for longer-form spoken responses where quality is prioritized.

The TTS pipeline operates in streaming mode — synthesis begins as soon as the LLM produces its first tokens, creating a low-latency LLM→TTS→audio playback chain. Both models run locally with no external API calls.
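
One way to connect a token stream to a synthesizer is to buffer tokens until a sentence boundary and hand each sentence to TTS as soon as it completes. The sketch below assumes sentence-sized chunks, which is an assumption of this example rather than a documented detail of Verbatim's pipeline; the function name is hypothetical.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences that can be synthesized early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever remains when the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()
```

Because the generator yields as soon as the first sentence is complete, playback can begin while the LLM is still producing the rest of the response.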

Embeddings (nomic-embed-text-v1.5)

Semantic search is powered by the nomic-embed-text-v1.5 embedding model. Text from transcripts and documents is converted into high-dimensional vectors and stored in the search index. At query time, the user's search text is embedded with the same model and matched against stored vectors using cosine similarity.

On the desktop edition, vectors are stored in SQLite, with the FTS5 extension providing full-text search alongside them. The enterprise edition uses PostgreSQL with the pgvector extension for vector similarity search at scale with concurrent multi-user access.
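
The query-time matching step reduces to ranking stored vectors by cosine similarity against the query vector. The sketch below shows that ranking in plain Python with a toy in-memory index; the `top_k` helper and the dictionary layout are illustrative assumptions, and in the enterprise edition this comparison would be performed inside PostgreSQL by pgvector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k most similar (doc_id, score) pairs, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The query and the stored text must be embedded with the same model for the scores to be meaningful, since cosine similarity only compares directions within one embedding space.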

Voice Chat Architecture

Voice chat implements a full-duplex conversational AI pipeline — the user speaks, the system transcribes, the LLM responds, and TTS speaks the response back. The architecture uses a dual-LLM design for optimal responsiveness:

  • User Microphone: audio capture with echo suppression
  • Whisper STT: real-time speech recognition
  • Fast LLM: quick conversational responses
  • Primary LLM: complex analysis + tool calls
  • Kokoro / Qwen3-TTS: streaming text-to-speech synthesis
  • Audio Output: speaker playback with echo cancellation

Key implementation details:

  • Echo suppression — prevents the system from transcribing its own TTS output by gating the STT input during playback
  • Streaming pipeline — LLM tokens stream directly to TTS, which begins synthesis immediately, minimizing end-to-end latency
  • Context parity — voice chat has access to all the same tools as text chat (search, export, summarize, etc.)
  • Interrupt handling — the user can interrupt Max mid-response; the system stops generation and TTS immediately
  • Attachment support — transcripts and documents can be attached to the voice session for contextual Q&A

Browser Extension Architecture

The Verbatim Studio browser extension (Chrome) acts as a thin client that communicates with the desktop application running on localhost. No data leaves the user's machine — the extension delegates all processing to the local backend.

  • Audio recording — captures microphone audio from the browser and sends it to the local backend for transcription with speaker detection
  • Screen capture — region-select screenshot tool that sends captured images for OCR processing
  • Chat interface — embedded Max chat panel with page context, text selection, and document attachment support
  • File upload — drag-and-drop document upload for OCR and indexing
  • Search — full-text search across all recordings, documents, and conversations
  • Job tracking — real-time progress indicators for active transcription, OCR, and indexing jobs

The extension requires the Verbatim Studio desktop app to be running. Communication uses the same localhost REST API and WebSocket endpoints that the Electron frontend uses.

Middleware Stack (Enterprise)

In the enterprise edition, every inbound HTTP request passes through three middleware layers before reaching the route handler. Middleware is applied in a strict order to enforce security invariants:

  1. LicenseMiddleware: Validates the license JWT on every request. Checks the HS256 signature against VERBATIM_SECRET_KEY and verifies the expiry claim. If the license has expired, the system enters a 14-day grace period: GET requests return normally (read-only access), but POST/PUT/DELETE return HTTP 402. After the grace period, all requests return HTTP 403.
  2. JWTAuthMiddleware: Validates user authentication. Extracts the JWT from the Authorization header, verifies the signature and expiry, and attaches the authenticated user to the request state (request.state.user). Supports both short-lived access tokens and API key authentication (vst_ prefix). Unauthenticated requests to protected endpoints receive HTTP 401.
  3. AuditLogMiddleware: Logs all mutating requests (POST, PUT, DELETE) to the audit_logs table in PostgreSQL. Each entry records the authenticated user ID, HTTP method, request path, timestamp (UTC), and originating IP address. GET requests are not logged to avoid excessive log volume. Audit entries are immutable: they cannot be modified or deleted through the API.

Public endpoints (/api/auth/login, /api/auth/register, /api/info, /api/license/status, /health) bypass the JWTAuthMiddleware but still pass through LicenseMiddleware.
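
The grace-period rule applied by LicenseMiddleware can be expressed as a small pure function. This is a sketch of the decision logic only, with hypothetical names; it treats every method other than GET as mutating, which is an assumption, since the behavior for methods such as HEAD or PATCH is not stated above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_PERIOD = timedelta(days=14)
READ_ONLY_METHODS = {"GET"}  # assumption: only GET counts as read-only


def license_status_code(expires_at: datetime, method: str, now: datetime) -> Optional[int]:
    """Return None to let the request through, or the HTTP status to reject it with."""
    if now <= expires_at:
        return None  # license is valid
    if now <= expires_at + GRACE_PERIOD:
        # Grace period: reads succeed, writes are refused with 402.
        return None if method.upper() in READ_ONLY_METHODS else 402
    return 403  # grace period is over, everything is blocked
```

Keeping the rule in a pure function makes the boundary cases (the exact expiry instant, the last day of grace) straightforward to test without standing up the middleware.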

Authentication & API Security

Enterprise authentication is built on JWT bearer tokens with a dual-credential system supporting both interactive sessions and programmatic access:

User Authentication (JWT)

  • Registration — new users register with email and password. The first user to register is automatically assigned the admin role. Subsequent users enter a pending state until an admin approves them.
  • Password storage — passwords are hashed with Argon2, a memory-hard algorithm resistant to GPU-accelerated brute-force attacks
  • Token lifecycle — login returns a short-lived access token and a longer-lived refresh token. The frontend uses the refresh token to obtain new access tokens transparently, providing seamless session continuity.

API Keys

  • Format — API keys use a vst_ prefix for easy identification in logs and configurations
  • Storage — keys are stored as SHA-256 hashes. The raw key is displayed exactly once at creation and cannot be retrieved afterward.
  • Scoping — each key is assigned granular scopes (read, write) to limit its capabilities
  • Revocation — any key can be revoked instantly via the API or admin dashboard
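
A minimal sketch of the key-handling scheme described above, using only the standard library. The function names, the key length, and the in-memory scope set are assumptions for illustration; only the vst_ prefix, SHA-256 storage, and read/write scopes come from the text.

```python
import hashlib
import hmac
import secrets
from typing import Set, Tuple

PREFIX = "vst_"


def hash_api_key(raw_key: str) -> str:
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def generate_api_key() -> Tuple[str, str]:
    """Return (raw_key, stored_hash). Only the hash should be persisted."""
    raw_key = PREFIX + secrets.token_urlsafe(32)
    return raw_key, hash_api_key(raw_key)


def verify_api_key(presented: str, stored_hash: str, scopes: Set[str], required_scope: str) -> bool:
    if not presented.startswith(PREFIX):
        return False
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(hash_api_key(presented), stored_hash):
        return False
    return required_scope in scopes
```

Because only the hash is stored, a database leak does not reveal usable keys, and the raw value returned by `generate_api_key()` is the single opportunity to show it to the user.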

Webhook Security

Webhook payloads are signed with HMAC-SHA256 using a per-subscription secret. The signature is included in the X-Webhook-Signature header. Failed deliveries are retried with exponential backoff (up to 5 attempts).
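
Signing and verification with HMAC-SHA256 can be sketched as follows. The hex encoding of the signature and the retry schedule (1-second base delay, doubling each time, five deliveries in total) are assumptions made for the example; the text above specifies only the algorithm, the header name, and the attempt limit.

```python
import hashlib
import hmac
from typing import List


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute the value a sender would place in X-Webhook-Signature (hex assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    # The receiver recomputes the HMAC over the raw body and compares in constant time.
    return hmac.compare_digest(sign_payload(secret, body), signature)


def backoff_delays(max_attempts: int = 5, base_seconds: float = 1.0, factor: float = 2.0) -> List[float]:
    """Delays to wait between consecutive delivery attempts."""
    return [base_seconds * factor ** i for i in range(max_attempts - 1)]
```

A receiver should verify the signature against the exact bytes it received, before parsing the JSON, because re-serializing the payload can change whitespace and invalidate the HMAC.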

Data Storage

The storage layer is designed to keep all data on the user's infrastructure. No data is transmitted to external services unless the user explicitly configures cloud storage or LLM providers.

| Aspect | Desktop | Enterprise |
| --- | --- | --- |
| Primary database | SQLite (local file) | PostgreSQL 16 (with async connection pooling) |
| Vector search | SQLite (embeddings stored in the database; FTS5 for full-text search) | PostgreSQL pgvector (cosine similarity) |
| Media files | Local filesystem | Local Docker volume, S3-compatible, or Azure Blob |
| Model weights | ~/Library/Application Support/@verbatim/ (macOS) | Bundled in Docker image + on-demand downloads |
| Credentials | Fernet AES-128 encryption, OS keychain | Argon2 (passwords), SHA-256 (API keys), env vars |
| Configuration | SQLite settings table + OS keychain | Environment variables (VERBATIM_* prefix) |

Desktop Credential Security

On the desktop edition, sensitive credentials (cloud storage tokens, search provider API keys) are encrypted using Fernet symmetric encryption (AES-128-CBC with HMAC authentication) before being written to disk. Encryption keys are stored in the operating system's native keychain — macOS Keychain or Windows Credential Manager — so they are protected by OS-level security. No credentials are ever stored in plain text.

License Validation

Enterprise licenses are HS256-signed JWTs containing organization metadata, seat counts, and expiry dates. License validation is double-gated to prevent unauthorized use:

  1. Pull Gate (Cloudflare Worker): Before Docker can pull the enterprise image from GHCR, the license JWT is posted to a Cloudflare Worker that validates the signature and expiry. Only valid licenses receive a GHCR read token. This prevents unauthorized image downloads.
  2. Runtime Gate (LicenseMiddleware): At runtime, LicenseMiddleware re-validates the JWT on every API request, checking both the signature (against VERBATIM_SECRET_KEY) and the expiry. Even if an image is obtained without passing the pull gate, the application will not serve requests without a valid license.

When a license expires, the system enters a 14-day grace period with read-only access before blocking all requests entirely. No data is ever deleted due to license expiry. See License Management for full details.
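
For readers unfamiliar with HS256, the following standard-library sketch shows how such a token is signed and how its signature is checked. It is illustrative only: a maintained JWT library would normally be used in practice, and the claim names in the usage example (org, seats) are hypothetical. Expiry handling is deliberately left to the caller, since an expired license still needs to be recognized during the grace period.

```python
import base64
import hashlib
import hmac
import json


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url_encode(json.dumps(claims).encode("utf-8"))
    signature = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims if the signature is valid, otherwise raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    header = json.loads(_b64url_decode(header_b64))
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```

For example, `verify_hs256(sign_hs256({"org": "acme", "seats": 5, "exp": 2000000000}, key), key)` returns the original claims, after which the caller compares `exp` with the current time to decide between normal, grace-period, and blocked behavior.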

Deployment Topology

The enterprise stack is designed for flexibility across deployment environments:

  • Single-server — all three containers on one host. Suitable for small teams (up to ~20 concurrent users).
  • Scaled — add backend replicas behind the nginx reverse proxy for horizontal scaling. Requires an external PostgreSQL instance and shared storage (S3/Azure).
  • Kubernetes — each Docker Compose service maps directly to a Kubernetes Deployment. Use PersistentVolumeClaims for PostgreSQL data, or replace the bundled PostgreSQL container with a managed database service in production.
  • Air-gapped — transfer Docker images via docker save/load. Point the LLM integration at a self-hosted model server (Ollama, vLLM, LocalAI). The entire stack operates with zero internet connectivity.

GPU acceleration for transcription in Docker requires NVIDIA GPUs with the nvidia-container-toolkit installed. Apple Silicon Metal acceleration is not available inside Docker containers.