Features
Verbatim Studio provides a complete transcription workflow — from recording to searchable, AI-enhanced transcripts. All core features work in both the desktop app and the enterprise edition. This page covers each capability in detail, including the underlying models and technical implementation.
Live Transcription
Record from any audio source with real-time speech-to-text. The transcription pipeline runs locally — no audio leaves your machine. Supports microphone input, system audio capture, and file import. Drag and drop multiple files for batch processing.
Transcription Engine
Verbatim uses OpenAI Whisper as its speech-to-text backbone, with platform-optimized inference runtimes for maximum performance:
| Platform | Runtime | Acceleration | Notes |
|---|---|---|---|
| macOS (Apple Silicon) | MLX Whisper | Metal GPU | Native M1/M2/M3/M4 acceleration via Apple's MLX framework |
| Windows (x64) | CTranslate2 (faster-whisper) | NVIDIA CUDA | CUDA runtime bundled with the app — no separate install needed |
| Enterprise (Docker) | CTranslate2 (faster-whisper) | NVIDIA CUDA | Requires nvidia-container-toolkit for GPU passthrough |
| All platforms | CPU fallback | None | Always available — slower but functional on any hardware |
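The fallback order in the table can be sketched as a small selection function. The runtime names here are illustrative, not the app's actual identifiers:

```python
def pick_runtime(system: str, machine: str, has_cuda: bool) -> str:
    """Walk the fallback order from the table above (illustrative sketch)."""
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"            # Apple Silicon: Metal via MLX
    if has_cuda:
        return "faster-whisper-cuda"    # CTranslate2 with CUDA acceleration
    return "faster-whisper-cpu"         # universal CPU fallback
```

The CPU path is last precisely because it is always available; GPU paths are preferred whenever the hardware supports them.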
Model Sizes
Five Whisper model sizes are available. Select the best trade-off between speed and accuracy for your use case in Settings → Transcription:
| Model | Parameters | Disk Size | VRAM | Best For |
|---|---|---|---|---|
| tiny | 39M | ~71 MB | ~1 GB | Quick drafts, low-resource devices |
| base | 74M | ~137 MB | ~1 GB | Default — good balance of speed and accuracy |
| small | 244M | ~460 MB | ~2 GB | Better accuracy for accented speech |
| medium | 769M | ~1.5 GB | ~5 GB | High accuracy, technical vocabulary |
| large-v3 | 1.55B | ~3 GB | ~10 GB | Maximum accuracy, multilingual |
The base model ships with the app; larger models are downloaded on demand when selected. GPU acceleration significantly reduces transcription time — a one-hour recording typically takes 2–3 minutes with GPU versus 15–20 minutes on CPU.
Speaker Identification
Automatic speaker diarization identifies who said what in multi-speaker recordings. The diarization pipeline runs after transcription and labels each segment by speaker.
Diarization Model
Speaker identification is powered by pyannote.audio, a neural speaker diarization framework. The model is downloaded automatically on first use (~100 MB). It segments the audio waveform into speaker turns using a combination of voice activity detection, speaker embedding extraction, and agglomerative clustering.
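As a rough sketch of the final labeling step, each transcript segment can be assigned the diarization turn it overlaps most. The data shapes below are assumptions for illustration, not Verbatim's internal format:

```python
def assign_speakers(segments, turns):
    """Label each (start, end, text) transcript segment with the speaker
    whose (start, end, speaker) diarization turn overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = "Unknown", 0.0
        for t_start, t_end, speaker in turns:
            # Overlap is positive only when the intervals intersect.
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best, text))
    return labeled
```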
After diarization, you can rename speakers (e.g., “Speaker 1” → “Dr. Martinez”), merge duplicate speakers, and reassign segments — all from the transcript editor. Speaker labels persist across the transcript and are included in all exports.
AI Assistant (Max)
Max is a built-in AI assistant that works fully offline — no API keys or internet needed. Max has access to 15+ specialized tools and can operate on your transcripts, documents, and projects through natural conversation.
Language Model
The desktop app runs Granite 4.0 Tiny locally via llama.cpp:
| Property | Value |
|---|---|
| Model | Granite 4.0 Tiny |
| Architecture | Mamba-2 (state-space model) |
| Total parameters | 7B |
| Active parameters | 1B (mixture-of-experts) |
| Inference runtime | llama.cpp (GGUF format) |
| GPU acceleration | Apple Metal (macOS) / NVIDIA CUDA (Windows) |
| Context window | Configurable — supports long-context conversations |
Tool System
Max can:
- Search within your active project or across all content
- Generate PDF and DOCX documents from conversation
- Export transcripts to TXT, SRT, or VTT
- Summarize transcripts with structured key points and action items
- Queue AI quality reviews for transcripts
- Color-highlight important transcript segments
- Create notes anchored to timestamps or pages
- Create projects and organize recordings with tags
- Search the web and extract full content from URLs
- Extract entities (action items, decisions, medications, legal terms)
- Translate transcripts to 25+ languages
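A tool system like the one above is typically a registry mapping tool names to handler functions the LLM can invoke by name. A minimal sketch, with hypothetical tool names and return values:

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a handler under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("search_project")
def search_project(query: str) -> str:
    return f"results for {query!r}"

@tool("export_transcript")
def export_transcript(fmt: str) -> str:
    return f"exported as {fmt.upper()}"

def dispatch(name: str, **kwargs) -> str:
    """Route a model-emitted tool call to its handler."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**kwargs)
```

A registry like this keeps individual tools decoupled from the chat loop: adding a capability means registering one more function.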
Enterprise users can connect external LLM providers (OpenAI, Anthropic, or any OpenAI-compatible endpoint) for additional model options. See Configuration for setup details.
Voice Chat
Talk to Max using full-duplex voice conversation. The voice system uses a dual-LLM architecture — a fast model handles real-time conversational flow while the primary model handles complex analysis and tool calls.
Text-to-Speech Models
Two TTS models are available, selectable in Settings:
| Model | Parameters | Latency | Quality |
|---|---|---|---|
| Kokoro | 82M | Ultra-fast (<200 ms) | Good — ideal for real-time conversational flow |
| Qwen3-TTS | 1.7B | Moderate | High — more natural prosody and intonation |
Voice Pipeline
Features include:
- Streaming LLM-to-TTS pipeline — Max speaks while still generating
- Echo suppression prevents feedback loops
- Mute button to toggle microphone without ending the session
- Attach transcripts and documents during voice chat
- Stop button to abort generation mid-response
- Full context parity — voice chat has all the same tools as text chat
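The streaming LLM-to-TTS hand-off depends on chunking the token stream into speakable units. A minimal sketch that flushes each complete sentence to TTS as soon as its terminator arrives (the sentence-boundary rule here is an assumption):

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences from an incremental token stream so the
    TTS engine can start speaking before generation finishes."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush every complete sentence as soon as its terminator appears.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```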
Voice chat works completely offline. The entire pipeline — speech recognition, language model inference, and text-to-speech — runs on your local hardware.
Web Search
Max can search the web and extract full page content from URLs during conversation. Search results appear as visual source cards. Configure your preferred search provider (Tavily or Brave) in Settings. Max synthesizes information across multiple pages and prioritizes live web results over training knowledge.
Entity Extraction
Extract structured data from transcripts — action items, decisions, medications, legal entities, and more. Each extracted entity links back to the source transcript segment with timestamps for verification. View results in the dedicated Entity Extraction panel in the transcript viewer.
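Linking an extracted entity back to its source can be as simple as a case-insensitive scan over timestamped segments. This sketch assumes a (start_time, text) segment shape, which may differ from the app's internal representation:

```python
def link_entity(entity: str, segments):
    """Return the (start_time, text) segments that mention an extracted
    entity, so each result can be verified against the audio."""
    needle = entity.lower()
    return [(start, text) for start, text in segments if needle in text.lower()]
```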
OCR & Document Processing
Import PDFs, images, spreadsheets, and Word documents. OCR extracts text from scanned documents and images, making everything full-text searchable alongside your transcripts.
Vision Model
OCR is powered by Granite Vision, an advanced multimodal vision-language model that handles:
- Scanned and digital PDFs (including multi-page documents)
- Photos of printed and handwritten text (cursive, forms, labels)
- DOCX, XLSX, PPTX documents
- Images in PNG, JPEG, TIFF, and WebP formats
- Plain text and Markdown files
Granite Vision runs locally with GPU acceleration on both Apple Silicon (Metal) and NVIDIA (CUDA) hardware. Extracted text is indexed in the search layer alongside your audio transcripts.
Semantic Search
Search across all your transcripts, documents, and recordings using natural language. Find content by meaning, not just keywords. Results are ranked by relevance and can be filtered by date, project, or type.
Embedding Model
Semantic search is powered by the nomic-embed-text-v1.5 embedding model, which converts text into high-dimensional vectors for similarity matching. The model runs locally and supports long-context embedding for accurate retrieval across lengthy transcripts.
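At its core, vector search ranks stored embeddings by cosine similarity to the query embedding. A dependency-free sketch of that ranking step:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, docs):
    """Rank (doc_id, embedding) pairs by similarity to the query, most
    similar first — the essence of vector-based semantic search."""
    return sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
```

Real embeddings have hundreds of dimensions, but the math is identical.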
| Edition | Index Backend | Search Type |
|---|---|---|
| Desktop | SQLite FTS5 | Keyword + vector similarity |
| Enterprise | PostgreSQL + pgvector | Keyword + vector similarity (concurrent multi-user) |
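The keyword half of the desktop backend can be demonstrated with SQLite's FTS5 extension directly from Python's standard library. The schema here is illustrative, not Verbatim's actual one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE transcripts USING fts5(title, body)")
conn.executemany(
    "INSERT INTO transcripts VALUES (?, ?)",
    [
        ("Standup", "Discussed the release timeline and blockers."),
        ("Clinic visit", "Patient reports improvement after medication change."),
    ],
)
# MATCH performs the full-text query; bm25() ranks by relevance.
rows = conn.execute(
    "SELECT title FROM transcripts WHERE transcripts MATCH ? "
    "ORDER BY bm25(transcripts)",
    ("medication",),
).fetchall()
```

In a hybrid setup, results like these are merged with the vector-similarity ranking before display.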
Project Workspaces
Organize recordings, transcripts, and documents into project workspaces. Each project provides context isolation — when you select a project, Max automatically reads all content within it. Dashboard stats update dynamically when switching between project scopes. Add tags, notes, and custom metadata to any item.
Trash & Recovery
Deleted items are moved to trash instead of being permanently destroyed. Restore recordings, documents, and projects from the Trash page at any time. Trash is automatically emptied after a configurable retention period (30, 60, or 90 days). You can also permanently delete individual items or empty the entire trash manually.
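Retention-based emptying reduces to a cutoff comparison over deletion timestamps. A sketch, assuming trash items are (name, deleted_at) pairs:

```python
from datetime import datetime, timedelta

def purge_expired(trash, now, retention_days=30):
    """Split trash into (kept, purged) lists: items deleted before the
    retention cutoff are purged, everything newer is kept."""
    cutoff = now - timedelta(days=retention_days)
    kept = [item for item in trash if item[1] > cutoff]
    purged = [item for item in trash if item[1] <= cutoff]
    return kept, purged
```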
Export
Export transcripts in multiple formats including plain text, SRT subtitles, VTT, DOCX, and PDF. Max can also generate documents during conversation. Subtitle formats are compatible with YouTube, Vimeo, and other video platforms.
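SRT is a plain-text format: numbered blocks, each with an `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range and the subtitle text. A minimal exporter sketch:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SRT requires."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start, end, text) segments as SRT subtitle blocks."""
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(segments, 1)
    ]
    return "\n".join(blocks)
```

Note the comma (not a period) before milliseconds — that detail is what distinguishes SRT timestamps from VTT's.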
Audio Enhancement
Improve transcription accuracy on noisy recordings with built-in audio preprocessing. Noise reduction removes background noise using FFmpeg audio filters, and volume normalization ensures consistent audio levels (EBU R128). Both run locally before the audio reaches the transcription engine — enable them in Settings → Transcription.
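Both steps map naturally onto standard FFmpeg filters: `afftdn` for FFT-based denoising and `loudnorm` for EBU R128 normalization. Whether Verbatim invokes exactly these filters, with these targets, is an assumption; a command-assembly sketch:

```python
def enhancement_args(infile: str, outfile: str, denoise=True, normalize=True):
    """Assemble an ffmpeg invocation chaining noise reduction (afftdn)
    and EBU R128 loudness normalization (loudnorm). The loudnorm targets
    shown are common defaults, not confirmed app settings."""
    filters = []
    if denoise:
        filters.append("afftdn")
    if normalize:
        filters.append("loudnorm=I=-16:TP=-1.5:LRA=11")
    cmd = ["ffmpeg", "-i", infile]
    if filters:
        cmd += ["-af", ",".join(filters)]  # filters run left to right
    return cmd + [outfile]
```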
Custom Dictionary
Add domain-specific terms to improve transcription accuracy for your field. Medical terminology, legal jargon, brand names, and technical acronyms can all be added via Settings → Transcription. The dictionary biases the speech recognition model toward your vocabulary without any model training or fine-tuning.
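One common way to bias Whisper without fine-tuning is to feed vocabulary through the decoder's initial prompt (faster-whisper exposes this as the `initial_prompt` parameter of `transcribe`). Whether Verbatim's dictionary uses exactly this mechanism is an assumption; a term-packing sketch:

```python
def build_initial_prompt(terms, max_len=200):
    """Pack dictionary terms into a comma-separated prompt string,
    stopping before the length budget is exceeded. Whisper conditions
    its decoding on this text, nudging it toward these spellings."""
    prompt = ""
    for term in terms:
        candidate = f"{prompt}, {term}" if prompt else term
        if len(candidate) > max_len:
            break
        prompt = candidate
    return prompt
```

A length cap matters because Whisper only conditions on a limited prompt window, so the most important terms should come first.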
Filler Word Detection
Analyze transcripts for verbal fillers like “um,” “uh,” “like,” and “you know.” The Filler Detection panel in the transcript view shows total counts, per-word breakdowns, filler rate percentages, and clickable segment links. You can also ask Max to analyze fillers in conversation.
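Basic filler statistics need only tokenization and counting. A sketch using a fixed filler list (the app's actual list and rate formula may differ):

```python
import re
from collections import Counter

FILLERS = {"um", "uh", "like"}

def filler_stats(text: str):
    """Count filler words and the filler rate as a share of all words.
    The two-word filler "you know" is matched separately so it counts
    as one unit."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in FILLERS)
    counts["you know"] = len(re.findall(r"\byou know\b", text.lower()))
    total = sum(counts.values())
    rate = total / max(len(words), 1)
    return counts, total, rate
```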
Transcript Translation
Translate transcripts to 25+ languages directly from the transcript view using the Translate button, or ask Max in chat. Translation runs through the local LLM — no cloud APIs needed. Supports Spanish, French, German, Japanese, Chinese, Korean, Arabic, and many more.
Calendar Integration
Connect your Google Calendar to see upcoming meetings on the dashboard. Events with video links (Google Meet, Zoom, Teams) display a quick “Join” button. Requires Google OAuth — set up in Settings → Transcription.
Post-Transcription Automation
Configure actions that run automatically after every transcription completes. Auto-summarize generates key points and action items immediately. Auto-export saves transcripts in your preferred format without manual intervention. Enable in Settings → Transcription.
Enterprise-Only Features
The enterprise edition adds multi-user capabilities on top of all core features:
| Feature | Desktop | Enterprise |
|---|---|---|
| Live transcription | ✓ | ✓ |
| Speaker identification | ✓ | ✓ |
| AI assistant (Max) with 15+ tools | ✓ | ✓ |
| Voice chat | ✓ | ✓ |
| Web search & URL extraction | ✓ | ✓ |
| Entity extraction | ✓ | ✓ |
| OCR & documents | ✓ | ✓ |
| Semantic search | ✓ | ✓ |
| Project workspaces | ✓ | ✓ |
| Trash & recovery | ✓ | ✓ |
| Export (TXT, SRT, VTT, DOCX, PDF) | ✓ | ✓ |
| Browser extension | ✓ | ✓ |
| Multi-user & teams | — | ✓ |
| SSO & JWT auth | — | ✓ |
| API keys & webhooks | — | ✓ |
| Audit logging | — | ✓ |
| PostgreSQL database | — | ✓ |
| S3 / Azure storage | — | ✓ |
| Docker deployment | — | ✓ |
| External LLM providers | — | ✓ |