Architecture
Verbatim Studio is built as a modular, privacy-first system with two deployment models: a standalone desktop application for individuals and a containerized enterprise stack for teams. Both editions share the same core transcription engine, AI pipeline, and search infrastructure — the enterprise edition layers multi-user authentication, team management, and production-grade infrastructure on top via a plugin system.
This document provides a comprehensive technical overview of the system architecture, including component design, model pipeline, data flow, security model, and extension points.
Desktop Edition
The desktop application bundles the entire stack into a single Electron shell. No external services, accounts, or internet connectivity are required after the initial model download.
- Electron Shell: Chromium renderer, auto-update (electron-updater), system tray, window management
- React Frontend: Vite-built SPA with recording controls, transcript editor, chat, voice, search
- Python Backend: FastAPI server handling STT, OCR, LLM, TTS, diarization, embeddings, search, voice
- SQLite: Recordings, transcripts, metadata, embeddings, settings
- Local Filesystem: Media files, model weights, configuration
Application Shell
The Electron layer provides the native application wrapper. On launch, it spawns the Python backend as a managed child process and loads the React frontend in a Chromium renderer window. Key responsibilities:
- Process lifecycle — spawns and monitors the FastAPI backend process, handles graceful shutdown, and manages port allocation
- Auto-update — uses electron-updater to check for new releases, download updates in the background, and apply them on restart
- System tray — provides a persistent system tray icon with quick-access controls for recording and status monitoring
- Native integrations — OS-level audio input selection, microphone permissions, and keychain access for credential storage
Frontend Layer
The frontend is a React single-page application built with Vite and served locally by the backend's static file handler. It communicates with the backend exclusively over HTTP and WebSocket connections on localhost. The frontend handles:
- Recording controls with real-time waveform visualization
- Live transcript editor with speaker labels, timestamps, and inline editing
- Chat interface for Max with tool result rendering
- Voice chat UI with push-to-talk and hands-free modes
- Semantic search interface with filtering and result highlighting
- Document viewer with OCR overlay, page navigation, and annotation support
- Project management dashboard with stats and activity feed
Backend Layer
The Python backend is a FastAPI application that handles all computation-intensive operations. It exposes a REST API consumed by the frontend and manages the lifecycle of all ML models. The server uses async I/O for request handling and delegates GPU-bound work to dedicated worker threads to prevent blocking.
- Route structure — organized by domain: recordings, transcripts, documents, projects, chat, voice, search, settings
- Model management — lazy-loads models on first use, manages GPU memory allocation, and supports hot-swapping between model sizes
- WebSocket endpoints — real-time transcription progress, chat streaming, and voice chat audio streaming
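The lazy-loading and worker-thread behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `ModelRegistry` and `run_inference` are hypothetical names, and a "model" here is any callable returned by a loader function.

```python
import asyncio
import threading
from typing import Any, Callable, Dict


class ModelRegistry:
    """Loads each model at most once, on first use (hypothetical helper)."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        # Hold the lock so concurrent requests trigger only one load.
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loaders[name]()
            return self._models[name]

    def unload(self, name: str) -> None:
        # Dropping the reference lets a different model size be loaded later.
        with self._lock:
            self._models.pop(name, None)


async def run_inference(registry: ModelRegistry, name: str, payload: Any) -> Any:
    def work() -> Any:
        model = registry.get(name)  # may block while weights load
        return model(payload)       # blocking compute

    # asyncio.to_thread keeps the event loop free while the blocking call runs.
    return await asyncio.to_thread(work)
```

A route handler would then `await run_inference(registry, "whisper-base", audio)` without stalling other requests; the model name in that call is illustrative only.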
Enterprise Edition
The enterprise edition replaces Electron with Docker and SQLite with PostgreSQL, adding multi-user authentication, team management, and a full REST API. Enterprise capabilities are loaded through a plugin system — the core application remains identical to the desktop edition.
- nginx: Reverse proxy (:80 → backend :8000), SSL termination, static asset caching
- Backend Container
  - Core (OSS): Transcription, OCR, AI, Voice, Search, Projects, Recordings
  - Enterprise Plugin: Auth, Teams, API Keys, Webhooks, Audit, License, Admin
- PostgreSQL 16: Primary datastore with ACID guarantees, pgvector, concurrent access
- S3 / Azure Blob: Media storage for recordings, documents, exports (optional)
- LLM Provider: OpenAI, Anthropic, Ollama, vLLM, or any compatible API
Container Architecture
The enterprise stack runs as three Docker containers orchestrated by Docker Compose. Each container has a dedicated responsibility:
| Container | Image | Port | Responsibility |
|---|---|---|---|
| nginx | nginx:1.27-alpine | 80 | Reverse proxy, SSL/TLS termination, static asset serving, request buffering |
| backend | ghcr.io/jongodb/verbatim-enterprise | 8000 | FastAPI application with core + enterprise plugin, frontend SPA, all ML models |
| postgres | postgres:16-alpine | 5432 | Primary datastore — internal only, not exposed to host network |
Persistent volumes (postgres-data and app-data) survive container restarts and image updates. For production deployments, an external managed PostgreSQL instance (AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL) can replace the bundled container.
Plugin System
Enterprise features are loaded dynamically via Python's entry point mechanism, keeping the core application identical between editions. The pyproject.toml declares:
[project.entry-points."verbatim.plugins"]
enterprise = "verbatim_enterprise.plugin:EnterprisePlugin"
Plugin Lifecycle
On application startup, the backend executes the following sequence:
load_plugins()
Discovers the enterprise plugin via Python entry points. Calls register() on the plugin, which contributes adapters (storage, database), routers (auth, teams, API keys, webhooks, admin), middleware (license, JWT, audit), and SQLAlchemy models to the application registry.
run_startup_hooks()
Executes plugin startup hooks. The enterprise hook initializes the async PostgreSQL engine (asyncpg), replacing the default SQLite engine. Configures connection pooling with pool_size and max_overflow parameters.
init_db()
Creates all database tables — both core (recordings, transcripts, documents, projects, settings) and enterprise (users, teams, API keys, webhooks, audit logs) — in PostgreSQL using SQLAlchemy's create_all().
apply_to_app()
Mounts enterprise routers and middleware onto the FastAPI application instance. Middleware is applied in a specific order to ensure correct request processing (see Middleware Stack below).
This architecture means the desktop edition never loads enterprise code, and the enterprise edition inherits all core functionality without modification. The plugin adds capabilities — it doesn't fork the codebase.
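The discovery step can be sketched with the standard library's `importlib.metadata`. This is a sketch under assumptions: the `Registry` class and its fields are hypothetical, and only the group name `verbatim.plugins` and the `register()` call come from the description above.

```python
from importlib.metadata import entry_points
from typing import Any, Iterable, List, Optional


class Registry:
    """Collects what plugins contribute (hypothetical shape)."""

    def __init__(self) -> None:
        self.routers: List[Any] = []
        self.middleware: List[Any] = []
        self.startup_hooks: List[Any] = []


def load_plugins(registry: Registry, eps: Optional[Iterable[Any]] = None) -> List[str]:
    """Instantiate every plugin in the 'verbatim.plugins' group and call register()."""
    if eps is None:
        # Selection by keyword requires Python 3.10 or newer.
        eps = entry_points(group="verbatim.plugins")
    loaded = []
    for ep in eps:
        plugin_cls = ep.load()  # imports "module:attr" and returns the attribute
        plugin_cls().register(registry)
        loaded.append(ep.name)
    return loaded
```

When the enterprise package is not installed, the group is empty and the loop does nothing, which is consistent with the desktop edition never loading enterprise code.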
Model Pipeline
Verbatim runs multiple specialized ML models locally. Each model is loaded on demand and managed independently. This section documents every model in the pipeline, its role, and its runtime characteristics.
Processing Pipeline
- Audio Input: Microphone, system audio, file import
- Document Input: PDF, DOCX, XLSX, images
- Whisper STT: Speech → text (5 model sizes)
- Granite Vision: Document → text (OCR)
- pyannote.audio: Speaker diarization
- nomic-embed: Text → embeddings
- Granite 4.0 Tiny: AI assistant (llama.cpp)
- Kokoro / Qwen3-TTS: Text → speech
- Search Index + Storage: SQLite FTS5 (desktop) / PostgreSQL + pgvector (enterprise)
Speech-to-Text (Whisper)
The transcription engine uses OpenAI Whisper models with platform-optimized inference runtimes. On macOS with Apple Silicon, inference runs through MLX Whisper for native Metal GPU acceleration. On Windows and in Docker, CTranslate2 (faster-whisper) provides CUDA-accelerated inference. A CPU fallback is always available.
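As a rough illustration of the runtime selection just described, the choice could be expressed as a small function. The function name and the returned labels are hypothetical, and how CUDA availability is detected is left to the caller.

```python
import platform
from typing import Optional


def pick_whisper_runtime(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    cuda_available: bool = False,
) -> str:
    """Return a label for the inference runtime to use (labels are illustrative)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"          # Apple Silicon, Metal acceleration
    if cuda_available:
        return "faster-whisper-cuda"  # CTranslate2 with CUDA
    return "cpu"                      # fallback that is always available
```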
Five model sizes are supported (tiny through large-v3), ranging from 39M to 1.55B parameters. The base model (74M parameters, ~137 MB) is bundled by default. Larger models are downloaded on demand when selected in Settings. All models support 99+ languages with automatic language detection.
The transcription pipeline processes audio in chunks, streaming partial results to the frontend via WebSocket for real-time display. Audio preprocessing (noise reduction via FFmpeg, EBU R128 normalization) runs before the audio reaches the model when enabled.
Speaker Diarization (pyannote.audio)
Speaker identification uses pyannote.audio, a neural speaker diarization framework. The pipeline performs three stages:
- Voice Activity Detection (VAD) — identifies speech regions in the audio, filtering silence and background noise
- Speaker Embedding Extraction — computes a fixed-dimensional vector representation for each speech segment using a pre-trained speaker embedding model
- Agglomerative Clustering — groups segments by speaker identity based on embedding similarity, producing speaker labels for each transcript segment
The diarization model (~100 MB) is downloaded automatically on first use. Diarization runs as a post-processing step after transcription completes.
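The clustering stage can be illustrated with a small pure-Python sketch. It is not pyannote.audio's implementation: it assumes average linkage over cosine similarity, and the 0.7 stopping threshold is an arbitrary illustrative value.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_speakers(embeddings: List[Sequence[float]], threshold: float = 0.7) -> List[int]:
    """Average-linkage agglomerative clustering; returns a speaker index per segment."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # the closest remaining clusters are too dissimilar to merge
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Number speakers in order of first appearance.
    labels = [0] * len(embeddings)
    for label, members in enumerate(sorted(clusters, key=min)):
        for i in members:
            labels[i] = label
    return labels
```

Each returned index would then be mapped to a speaker label and attached to the matching transcript segment.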
Vision & OCR (Granite Vision)
Document text extraction uses Granite Vision, a multimodal vision-language model. Unlike traditional OCR engines that rely on character-level recognition, Granite Vision understands document layout, table structure, and handwritten text at a semantic level. It processes:
- Scanned and digital PDFs (multi-page, with layout preservation)
- Photographs of printed and handwritten text (including cursive)
- Spreadsheets (XLSX) with cell-level extraction
- Presentations (PPTX) with slide-by-slide processing
- Word documents (DOCX) with formatting awareness
- Raw images (PNG, JPEG, TIFF, WebP)
Granite Vision runs with GPU acceleration on Apple Silicon (Metal) and NVIDIA (CUDA). Extracted text is immediately indexed in the search layer for full-text and semantic retrieval.
Language Model (Granite 4.0 Tiny via llama.cpp)
The AI assistant Max is powered by Granite 4.0 Tiny, a 7B-parameter hybrid Mamba-2/transformer language model. It uses a mixture-of-experts design, so only about 1B parameters are active per token. This keeps the memory footprint small and inference fast while retaining strong reasoning capabilities.
The model runs through llama.cpp in GGUF format, supporting GPU-accelerated inference on both Apple Metal and NVIDIA CUDA. The inference server supports streaming token generation, which the frontend renders incrementally for responsive chat interactions.
Max has access to 15+ tools through a function-calling interface. When Max decides to use a tool, the backend executes the tool function and returns results to the model for synthesis. Tools include project search, document generation, transcript export, web search, entity extraction, segment highlighting, note creation, and more.
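The execute-and-return loop for a tool call might look like the sketch below. The registry, the decorator, the JSON shape of a call (`name` plus `arguments`), and the `search_project` placeholder are all assumptions made for illustration; the actual tool interface is not documented here.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name (hypothetical decorator)."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def search_project(query: str, limit: int = 5) -> list:
    # Placeholder body; a real tool would query the search index.
    return [f"result for {query!r}"][:limit]


def execute_tool_call(call_json: str) -> str:
    """Run one model-issued call and return a JSON string to feed back to the model."""
    call = json.loads(call_json)
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    try:
        return json.dumps({"result": fn(**call.get("arguments", {}))})
    except TypeError as exc:  # bad or missing arguments from the model
        return json.dumps({"error": str(exc)})
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments.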
Enterprise users can connect external LLM providers (OpenAI, Anthropic, or any OpenAI-compatible API endpoint) as an alternative to the local model. LLM calls are made server-side — API keys never reach the browser.
Text-to-Speech (Kokoro / Qwen3-TTS)
Voice chat offers two TTS models, selectable by the user:
- Kokoro (82M parameters) — ultra-fast synthesis (<200ms latency), optimized for real-time conversational flow. Best for interactive voice chat where responsiveness matters.
- Qwen3-TTS (1.7B parameters) — higher-quality synthesis with more natural prosody and intonation. Better for longer-form spoken responses where quality is prioritized.
The TTS pipeline operates in streaming mode — synthesis begins as soon as the LLM produces its first tokens, creating a low-latency LLM→TTS→audio playback chain. Both models run locally with no external API calls.
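One common way to connect a token stream to a synthesizer is to forward text at sentence boundaries, so that speech can start before the full response exists. The sketch below assumes that approach; how Verbatim actually segments text for TTS is not stated, and `sentences_from_tokens` is a hypothetical name.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes one."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush trailing text that has no terminator
```

Each yielded sentence would be handed to the selected TTS model while the LLM continues generating the next one.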
Embeddings (nomic-embed-text-v1.5)
Semantic search is powered by the nomic-embed-text-v1.5 embedding model. Text from transcripts and documents is converted into high-dimensional vectors and stored in the search index. At query time, the user's search text is embedded with the same model and matched against stored vectors using cosine similarity.
On the desktop edition, vectors are stored in SQLite using FTS5 extensions. The enterprise edition uses PostgreSQL with the pgvector extension for efficient vector similarity search at scale with concurrent multi-user access.
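The query-time matching step reduces to ranking stored vectors by cosine similarity against the query vector. A minimal in-memory sketch follows; it assumes a plain dictionary in place of the real index, and `top_k` is a hypothetical helper rather than part of the product's API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A database-backed index performs the same ranking inside the query engine instead of in application code.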
Voice Chat Architecture
Voice chat implements a full-duplex conversational AI pipeline — the user speaks, the system transcribes, the LLM responds, and TTS speaks the response back. The architecture uses a dual-LLM design for optimal responsiveness:
- User Microphone: Audio capture with echo suppression
- Whisper STT: Real-time speech recognition
- Fast LLM: Quick conversational responses
- Primary LLM: Complex analysis + tool calls
- Kokoro / Qwen3-TTS: Streaming text-to-speech synthesis
- Audio Output: Speaker playback with echo cancellation
Key implementation details:
- Echo suppression — prevents the system from transcribing its own TTS output by gating the STT input during playback
- Streaming pipeline — LLM tokens stream directly to TTS, which begins synthesis immediately, minimizing end-to-end latency
- Context parity — voice chat has access to all the same tools as text chat (search, export, summarize, etc.)
- Interrupt handling — the user can interrupt Max mid-response; the system stops generation and TTS immediately
- Attachment support — transcripts and documents can be attached to the voice session for contextual Q&A
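The echo-suppression gate in the list above can be pictured as a small state holder that decides whether a microphone frame should reach the STT stage. This is a sketch only: the class name and the 200 ms tail after playback stops are assumptions, not documented behavior.

```python
class EchoGate:
    """Drops microphone frames while TTS audio is playing (hypothetical helper)."""

    def __init__(self, tail_ms: int = 200) -> None:
        self.tail_ms = tail_ms  # extra hold time to let room echo die out
        self._playing = False
        self._resume_at_ms = 0

    def playback_started(self) -> None:
        self._playing = True

    def playback_stopped(self, now_ms: int) -> None:
        self._playing = False
        self._resume_at_ms = now_ms + self.tail_ms

    def accept(self, now_ms: int) -> bool:
        """True if a frame captured at now_ms should be passed to STT."""
        return not self._playing and now_ms >= self._resume_at_ms
```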
Browser Extension Architecture
The Verbatim Studio browser extension (Chrome) acts as a thin client that communicates with the desktop application running on localhost. No data leaves the user's machine — the extension delegates all processing to the local backend.
- Audio recording — captures microphone audio from the browser and sends it to the local backend for transcription with speaker detection
- Screen capture — region-select screenshot tool that sends captured images for OCR processing
- Chat interface — embedded Max chat panel with page context, text selection, and document attachment support
- File upload — drag-and-drop document upload for OCR and indexing
- Search — full-text search across all recordings, documents, and conversations
- Job tracking — real-time progress indicators for active transcription, OCR, and indexing jobs
The extension requires the Verbatim Studio desktop app to be running. Communication uses the same localhost REST API and WebSocket endpoints that the Electron frontend uses.
Middleware Stack (Enterprise)
In the enterprise edition, every inbound HTTP request passes through three middleware layers before reaching the route handler. Middleware is applied in a strict order to enforce security invariants:
LicenseMiddleware
Validates the license JWT on every request. Checks HS256 signature against VERBATIM_SECRET_KEY and verifies the expiry claim. If the license has expired, the system enters a 14-day grace period: GET requests return normally (read-only access), but POST/PUT/DELETE return HTTP 402. After the grace period, all requests return HTTP 403.
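The expiry and grace-period rules can be condensed into one decision function. This sketch assumes that only GET counts as read-only and that the boundaries are inclusive; neither detail is specified above, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_PERIOD = timedelta(days=14)
READ_ONLY_METHODS = {"GET"}  # assumption: other methods are treated as mutating


def license_status_code(method: str, expires_at: datetime, now: datetime) -> Optional[int]:
    """Return None to let the request through, or the HTTP status to reject with."""
    if now <= expires_at:
        return None  # license still valid
    if now <= expires_at + GRACE_PERIOD:
        # Grace period: reads pass, writes are refused with 402.
        return None if method.upper() in READ_ONLY_METHODS else 402
    return 403  # grace period over, everything is blocked
```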
JWTAuthMiddleware
Validates user authentication. Extracts the JWT from the Authorization header, verifies the signature and expiry, and attaches the authenticated user to the request state (request.state.user). Supports both short-lived access tokens and API key authentication (vst_ prefix). Unauthenticated requests to protected endpoints receive HTTP 401.
AuditLogMiddleware
Logs all mutating requests (POST, PUT, DELETE) to the audit_logs table in PostgreSQL. Each entry records the authenticated user ID, HTTP method, request path, timestamp (UTC), and originating IP address. GET requests are not logged to avoid excessive log volume. Audit entries are immutable — they cannot be modified or deleted through the API.
Public endpoints (/api/auth/login, /api/auth/register, /api/info, /api/license/status, /health) bypass the JWTAuthMiddleware but still pass through LicenseMiddleware.
Authentication & API Security
Enterprise authentication is built on JWT bearer tokens with a dual-credential system supporting both interactive sessions and programmatic access:
User Authentication (JWT)
- Registration — new users register with email and password. The first user to register is automatically assigned the admin role. Subsequent users enter a pending state until an admin approves them.
- Password storage — passwords are hashed with Argon2, a memory-hard algorithm resistant to GPU-accelerated brute-force attacks
- Token lifecycle — login returns a short-lived access token and a longer-lived refresh token. The frontend uses the refresh token to obtain new access tokens transparently, providing seamless session continuity.
API Keys
- Format: API keys use a vst_ prefix for easy identification in logs and configurations
- Storage: keys are stored as SHA-256 hashes. The raw key is displayed exactly once at creation and cannot be retrieved afterward.
- Scoping: each key is assigned granular scopes (read, write) to limit its capabilities
- Revocation: any key can be revoked instantly via the API or admin dashboard
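The format and storage rules above can be sketched with the standard library. The key length and function names are assumptions; only the vst_ prefix and SHA-256 hashing come from the description.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

PREFIX = "vst_"


def create_api_key() -> Tuple[str, str]:
    """Return (raw_key, stored_hash). Only the hash should be persisted."""
    raw = PREFIX + secrets.token_urlsafe(32)
    return raw, hashlib.sha256(raw.encode()).hexdigest()


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it with the stored hash in constant time."""
    if not presented.startswith(PREFIX):
        return False
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Because only the hash is kept, a database leak does not reveal usable keys, which is also why the raw key can be shown only once.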
Webhook Security
Webhook payloads are signed with HMAC-SHA256 using a per-subscription secret. The signature is included in the X-Webhook-Signature header. Failed deliveries are retried with exponential backoff (up to 5 attempts).
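A receiver can check the signature by recomputing the HMAC over the raw request body. The sketch below assumes the header carries a hex-encoded digest and that retry delays double from a one-second base; the encoding and the timing are not specified above.

```python
import hashlib
import hmac
from typing import List


def sign_payload(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 of the raw body, hex-encoded (encoding is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare against the X-Webhook-Signature value in constant time."""
    return hmac.compare_digest(sign_payload(secret, body), header_value)


def backoff_delays(attempts: int = 5, base_seconds: float = 1.0) -> List[float]:
    """Delays before each retry; the first attempt is sent without delay."""
    return [base_seconds * 2 ** i for i in range(attempts - 1)]
```

The signature must be computed over the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace or key order.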
Data Storage
The storage layer is designed to keep all data on the user's infrastructure. No data is transmitted to external services unless the user explicitly configures cloud storage or LLM providers.
| Aspect | Desktop | Enterprise |
|---|---|---|
| Primary database | SQLite (local file) | PostgreSQL 16 (with async connection pooling) |
| Vector search | SQLite FTS5 extensions | PostgreSQL pgvector (cosine similarity) |
| Media files | Local filesystem | Local Docker volume, S3-compatible, or Azure Blob |
| Model weights | ~/Library/Application Support/@verbatim/ (macOS) | Bundled in Docker image + on-demand downloads |
| Credentials | Fernet AES-128 encryption, OS keychain | Argon2 (passwords), SHA-256 (API keys), env vars |
| Configuration | SQLite settings table + OS keychain | Environment variables (VERBATIM_* prefix) |
Desktop Credential Security
On the desktop edition, sensitive credentials (cloud storage tokens, search provider API keys) are encrypted using Fernet symmetric encryption (AES-128-CBC with HMAC authentication) before being written to disk. Encryption keys are stored in the operating system's native keychain — macOS Keychain or Windows Credential Manager — so they are protected by OS-level security. No credentials are ever stored in plain text.
License Validation
Enterprise licenses are HS256-signed JWTs containing organization metadata, seat counts, and expiry dates. License validation is double-gated to prevent unauthorized use:
Pull Gate (Cloudflare Worker)
Before Docker can pull the enterprise image from GHCR, the license JWT is posted to a Cloudflare Worker that validates the signature and expiry. Only valid licenses receive a GHCR read token. This prevents unauthorized image downloads.
Runtime Gate (LicenseMiddleware)
At runtime, LicenseMiddleware re-validates the JWT on every API request, checking both the signature (against VERBATIM_SECRET_KEY) and the expiry. Even if an image is obtained, the application will not serve requests without a valid license.
When a license expires, the system enters a 14-day grace period with read-only access before blocking all requests entirely. No data is ever deleted due to license expiry. See License Management for full details.
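To make the signature and expiry check concrete, here is a standard-library sketch of HS256 verification. It is illustrative only: production code would normally rely on a maintained JWT library, grace-period handling is left out, and `sign_hs256` exists just to produce a token for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and 'exp' are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        sig = _b64url_decode(sig_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("not a JSON object")
    except Exception as exc:
        raise ValueError("malformed token") from exc
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")  # never trust the token's own alg
    signed = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, sig):
        raise ValueError("bad signature")
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("expired")
    return claims
```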
Deployment Topology
The enterprise stack is designed for flexibility across deployment environments:
- Single-server: all three containers on one host. Suitable for small teams (up to ~20 concurrent users).
- Scaled: add backend replicas behind the nginx reverse proxy for horizontal scaling. Requires an external PostgreSQL instance and shared storage (S3/Azure).
- Kubernetes: each Docker Compose service maps directly to a Kubernetes Deployment. Use PersistentVolumeClaims for PostgreSQL data and a managed database service for production.
- Air-gapped: transfer Docker images via docker save/load. Point the LLM integration at a self-hosted model server (Ollama, vLLM, LocalAI). The entire stack operates with zero internet connectivity.
GPU acceleration for transcription in Docker requires NVIDIA GPUs with the nvidia-container-toolkit installed. Apple Silicon Metal acceleration is not available inside Docker containers.