Features

Verbatim Studio provides a complete transcription workflow — from recording to searchable, AI-enhanced transcripts. All core features work in both the desktop app and the enterprise edition. This page covers each capability in detail, including the underlying models and technical implementation.

Live Transcription

Record from any audio source with real-time speech-to-text. The transcription pipeline runs locally — no audio leaves your machine. Supports microphone input, system audio capture, and file import. Drag and drop multiple files for batch processing.

Transcription Engine

Verbatim uses OpenAI Whisper as its speech-to-text backbone, with platform-optimized inference runtimes for maximum performance:

| Platform | Runtime | Acceleration | Notes |
| --- | --- | --- | --- |
| macOS (Apple Silicon) | MLX Whisper | Metal GPU | Native M1/M2/M3/M4 acceleration via Apple's MLX framework |
| Windows (x64) | CTranslate2 (faster-whisper) | NVIDIA CUDA | CUDA runtime bundled with the app — no separate install needed |
| Enterprise (Docker) | CTranslate2 (faster-whisper) | NVIDIA CUDA | Requires nvidia-container-toolkit for GPU passthrough |
| All platforms | CPU fallback | None | Always available — slower but functional on any hardware |
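
The fallback order described above can be sketched as a simple selection routine. This is a hypothetical illustration of the logic, not the app's actual code; the platform identifiers are made up for the example:

```python
def pick_runtime(platform: str, gpu_available: bool) -> str:
    """Pick a transcription runtime following the platform table above.

    Hypothetical sketch: platform strings and return values are
    illustrative, not Verbatim's real identifiers.
    """
    if platform == "macos-arm64" and gpu_available:
        return "mlx-whisper (Metal)"
    if platform in ("windows-x64", "enterprise-docker") and gpu_available:
        return "ctranslate2 (CUDA)"
    return "cpu-fallback"  # always available, works on any hardware
```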

Model Sizes

Five Whisper model sizes are available. Select the best trade-off between speed and accuracy for your use case in Settings → Transcription:

| Model | Parameters | Disk Size | VRAM | Best For |
| --- | --- | --- | --- | --- |
| tiny | 39M | ~71 MB | ~1 GB | Quick drafts, low-resource devices |
| base | 74M | ~137 MB | ~1 GB | Default — good balance of speed and accuracy |
| small | 244M | ~460 MB | ~2 GB | Better accuracy for accented speech |
| medium | 769M | ~1.5 GB | ~5 GB | High accuracy, technical vocabulary |
| large-v3 | 1.55B | ~3 GB | ~10 GB | Maximum accuracy, multilingual |

The base model is bundled by default. Larger models are downloaded on-demand when selected. GPU acceleration significantly reduces transcription time — a one-hour recording typically takes 2–3 minutes with GPU versus 15–20 minutes on CPU.
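
As a rough rule of thumb, model choice reduces to "pick the largest model whose VRAM estimate fits your GPU." The helper below is a hypothetical illustration built from the VRAM column of the table above, not part of the app:

```python
# (model, approximate VRAM needed in GB), ordered small to large,
# following the model-size table above.
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large-v3", 10)]

def largest_model_that_fits(vram_gb: float) -> str:
    """Return the largest Whisper model whose VRAM estimate fits."""
    fitting = [name for name, need in MODELS if need <= vram_gb]
    # With no GPU headroom at all, tiny still runs via the CPU fallback.
    return fitting[-1] if fitting else "tiny"
```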

Speaker Identification

Automatic speaker diarization identifies who said what in multi-speaker recordings. The diarization pipeline runs after transcription and labels each segment by speaker.

Diarization Model

Speaker identification is powered by pyannote.audio, a neural speaker diarization framework. The model is downloaded automatically on first use (~100 MB). It segments the audio waveform into speaker turns using a combination of voice activity detection, speaker embedding extraction, and agglomerative clustering.
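
The final labeling step, matching transcript segments to diarized speaker turns, amounts to finding the turn with the greatest time overlap for each segment. A minimal sketch of that idea (hypothetical data shapes, not pyannote's API):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Attach a speaker label to each transcript segment.

    segments: list of dicts with "start"/"end" (from transcription)
    turns:    list of (start, end, speaker) tuples (from diarization)
    Each segment gets the speaker whose turn overlaps it the most.
    """
    labeled = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]))
        labeled.append({**seg, "speaker": best[2]})
    return labeled
```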

After diarization, you can rename speakers (e.g., “Speaker 1” → “Dr. Martinez”), merge duplicate speakers, and reassign segments — all from the transcript editor. Speaker labels persist across the transcript and are included in all exports.

AI Assistant (Max)

Max is a built-in AI assistant that works fully offline — no API keys or internet needed. Max has access to 15+ specialized tools and can operate on your transcripts, documents, and projects through natural conversation.

Language Model

The desktop app runs Granite 4.0 Tiny locally via llama.cpp:

| Property | Value |
| --- | --- |
| Model | Granite 4.0 Tiny |
| Architecture | Mamba-2 (state-space model) |
| Total parameters | 7B |
| Active parameters | 1B (mixture-of-experts) |
| Inference runtime | llama.cpp (GGUF format) |
| GPU acceleration | Apple Metal (macOS) / NVIDIA CUDA (Windows) |
| Context window | Configurable — supports long-context conversations |

Tool System

Max can:

  • Search within your active project or across all content
  • Generate PDF and DOCX documents from conversation
  • Export transcripts to TXT, SRT, or VTT
  • Summarize transcripts with structured key points and action items
  • Queue AI quality reviews for transcripts
  • Color-highlight important transcript segments
  • Create notes anchored to timestamps or pages
  • Create projects and organize recordings with tags
  • Search the web and extract full content from URLs
  • Extract entities (action items, decisions, medications, legal terms)
  • Translate transcripts to 25+ languages
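
Tool systems like this are typically a name-to-function dispatch: the model emits a tool name plus arguments, and the app routes the call to a handler. The sketch below assumes that pattern; the tool name and handler are hypothetical, not Verbatim's real tool API:

```python
def export_transcript(transcript_id: str, fmt: str) -> str:
    """Hypothetical tool handler: pretend to export, return the file path."""
    return f"/exports/{transcript_id}.{fmt}"

# Registry mapping tool names (as the model would emit them) to handlers.
TOOLS = {"export_transcript": export_transcript}

def dispatch(tool_name: str, arguments: dict) -> str:
    """Route a model-issued tool call to its registered handler."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**arguments)
```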

Enterprise users can connect external LLM providers (OpenAI, Anthropic, or any OpenAI-compatible endpoint) for additional model options. See Configuration for setup details.

Voice Chat

Talk to Max using full-duplex voice conversation. The voice system uses a dual-LLM architecture — a fast model handles real-time conversational flow while the primary model handles complex analysis and tool calls.

Text-to-Speech Models

Two TTS models are available, selectable in Settings:

| Model | Parameters | Latency | Quality |
| --- | --- | --- | --- |
| Kokoro | 82M | Ultra-fast (<200ms) | Good — ideal for real-time conversational flow |
| Qwen3-TTS | 1.7B | Moderate | High — more natural prosody and intonation |

Voice Pipeline

Features include:

  • Streaming LLM-to-TTS pipeline — Max speaks while still generating
  • Echo suppression prevents feedback loops
  • Mute button to toggle microphone without ending the session
  • Attach transcripts and documents during voice chat
  • Stop button to abort generation mid-response
  • Full context parity — voice chat has all the same tools as text chat

Voice chat works completely offline. The entire pipeline — speech recognition, language model inference, and text-to-speech — runs on your local hardware.

Web Search

Max can search the web and extract full page content from URLs during conversation. Search results appear as visual source cards. Configure your preferred search provider (Tavily or Brave) in Settings. Max synthesizes information across multiple pages and prioritizes live web results over training knowledge.

Entity Extraction

Extract structured data from transcripts — action items, decisions, medications, legal entities, and more. Each extracted entity links back to the source transcript segment with timestamps for verification. View results in the dedicated Entity Extraction panel in the transcript viewer.
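
Linking each entity back to its source only requires storing the segment's time range alongside the extracted text. A sketch of that record shape, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """One extracted item, linked back to its source transcript segment."""
    kind: str             # e.g. "action_item", "decision", "medication"
    text: str
    segment_start: float  # seconds into the recording
    segment_end: float

def timestamp_link(e: Entity) -> str:
    """Render the segment start as a MM:SS timestamp for verification."""
    m, s = divmod(int(e.segment_start), 60)
    return f"{m:02d}:{s:02d}"
```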

OCR & Document Processing

Import PDFs, images, spreadsheets, and Word documents. OCR extracts text from scanned documents and images, making everything full-text searchable alongside your transcripts.

Vision Model

OCR is powered by Granite Vision, an advanced multimodal vision-language model that handles:

  • Scanned and digital PDFs (including multi-page documents)
  • Photos of printed and handwritten text (cursive, forms, labels)
  • DOCX, XLSX, PPTX documents
  • Images in PNG, JPEG, TIFF, and WebP formats
  • Plain text and Markdown files

Granite Vision runs locally with GPU acceleration on both Apple Silicon (Metal) and NVIDIA (CUDA) hardware. Extracted text is indexed in the search layer alongside your audio transcripts.

Semantic Search

Search across all your transcripts, documents, and recordings using natural language. Find content by meaning, not just keywords. Results are ranked by relevance and can be filtered by date, project, or type.

Embedding Model

Semantic search is powered by the nomic-embed-text-v1.5 embedding model, which converts text into high-dimensional vectors for similarity matching. The model runs locally and supports long-context embedding for accurate retrieval across lengthy transcripts.
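
"Vector similarity" here is typically cosine similarity between the query embedding and each stored embedding, with results ordered by score. A minimal plain-Python illustration (no embedding model involved; real vectors would have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_by_similarity(query_vec, docs):
    """docs: list of (doc_id, vector). Returns ids, most similar first."""
    scored = [(cosine_similarity(query_vec, v), doc_id) for doc_id, v in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```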

| Edition | Index Backend | Search Type |
| --- | --- | --- |
| Desktop | SQLite FTS5 | Keyword + vector similarity |
| Enterprise | PostgreSQL + pgvector | Keyword + vector similarity (concurrent multi-user) |

Project Workspaces

Organize recordings, transcripts, and documents into project workspaces. Each project provides context isolation — when you select a project, Max automatically reads all content within it. Dashboard stats update dynamically when switching between project scopes. Add tags, notes, and custom metadata to any item.

Trash & Recovery

Deleted items are moved to trash instead of being permanently destroyed. Restore recordings, documents, and projects from the Trash page at any time. Trash is automatically emptied after a configurable retention period (30, 60, or 90 days). You can also permanently delete individual items or empty the entire trash manually.

Export

Export transcripts in multiple formats including plain text, SRT subtitles, VTT, DOCX, and PDF. Max can also generate documents during conversation. Subtitle formats are compatible with YouTube, Vimeo, and other video platforms.
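
SRT is a simple numbered-cue format, so subtitle export mostly reduces to timestamp formatting. A sketch of the cue layout (illustrative, not the app's actual exporter):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """segments: list of dicts with "start", "end", "text" keys."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(cues)
```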

Audio Enhancement

Improve transcription accuracy on noisy recordings with built-in audio preprocessing. Noise reduction removes background noise using FFmpeg audio filters, and volume normalization ensures consistent audio levels (EBU R128). Both run locally before the audio reaches the transcription engine — enable them in Settings → Transcription.
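
With FFmpeg, denoising and EBU R128 normalization are audio filters chained via -af. The specific filters below (afftdn for FFT denoising, loudnorm for R128 loudness normalization) are an assumption about which of FFmpeg's filters the app uses; this sketch only builds the command line:

```python
def ffmpeg_preprocess_cmd(src: str, dst: str, denoise: bool, normalize: bool):
    """Build an ffmpeg argument list for audio preprocessing.

    Filter choice is an assumption: afftdn is FFmpeg's FFT-based
    denoiser; loudnorm implements EBU R128 loudness normalization.
    """
    filters = []
    if denoise:
        filters.append("afftdn")
    if normalize:
        filters.append("loudnorm=I=-16:TP=-1.5:LRA=11")
    cmd = ["ffmpeg", "-i", src]
    if filters:
        cmd += ["-af", ",".join(filters)]  # chain filters in order
    return cmd + [dst]
```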

Custom Dictionary

Add domain-specific terms to improve transcription accuracy for your field. Medical terminology, legal jargon, brand names, and technical acronyms can all be added via Settings → Transcription. The dictionary biases the speech recognition model toward your vocabulary without any model training or fine-tuning.
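
One common way to bias Whisper toward custom vocabulary without fine-tuning is to pass the terms through the decoder's initial prompt, which conditions the model toward those spellings; whether Verbatim uses exactly this mechanism is an assumption. A sketch of building such a prompt:

```python
def build_initial_prompt(terms, max_chars: int = 200) -> str:
    """Join dictionary terms into a Whisper-style initial prompt.

    Whisper's initial_prompt parameter conditions the decoder, nudging
    it toward these spellings. Truncated because the prompt token
    budget is limited; 200 chars is an arbitrary illustrative cap.
    """
    prompt = ", ".join(terms)
    return prompt[:max_chars]
```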

Filler Word Detection

Analyze transcripts for verbal fillers like “um,” “uh,” “like,” and “you know.” The Filler Detection panel in the transcript view shows total counts, per-word breakdowns, filler rate percentages, and clickable segment links. Ask Max to analyze fillers conversationally too.
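
The core of such an analysis is whole-word matching over the transcript text. A simplified sketch (it naively counts every occurrence of "like", including non-filler uses, which a real detector would disambiguate):

```python
import re

FILLERS = ["um", "uh", "like", "you know"]

def count_fillers(text: str) -> dict:
    """Per-filler counts using whole-word, case-insensitive matching."""
    counts = {}
    for filler in FILLERS:
        pattern = r"\b" + re.escape(filler) + r"\b"
        counts[filler] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

def filler_rate(text: str) -> float:
    """Fillers as a percentage of total words."""
    words = len(text.split())
    total = sum(count_fillers(text).values())
    return 100.0 * total / words if words else 0.0
```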

Transcript Translation

Translate transcripts to 25+ languages directly from the transcript view using the Translate button, or ask Max in chat. Translation runs through the local LLM — no cloud APIs needed. Supports Spanish, French, German, Japanese, Chinese, Korean, Arabic, and many more.

Calendar Integration

Connect your Google Calendar to see upcoming meetings on the dashboard. Events with video links (Google Meet, Zoom, Teams) display a quick “Join” button. Requires Google OAuth — set up in Settings → Transcription.

Post-Transcription Automation

Configure actions that run automatically after every transcription completes. Auto-summarize generates key points and action items immediately. Auto-export saves transcripts in your preferred format without manual intervention. Enable in Settings → Transcription.
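
A common shape for this kind of automation is a list of callbacks invoked when transcription finishes. The sketch below assumes that pattern; the class and the two stand-in actions are hypothetical:

```python
class PostTranscriptionHooks:
    """Run configured actions, in order, after a transcription completes."""

    def __init__(self):
        self.actions = []

    def register(self, action):
        self.actions.append(action)

    def run(self, transcript: str) -> list:
        # Each action receives the finished transcript and returns a result.
        return [action(transcript) for action in self.actions]

hooks = PostTranscriptionHooks()
hooks.register(lambda t: f"summary of {len(t.split())} words")  # auto-summarize stand-in
hooks.register(lambda t: "exported to transcript.srt")          # auto-export stand-in
```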

Enterprise-Only Features

The enterprise edition adds multi-user capabilities on top of all core features:

| Feature | Desktop | Enterprise |
| --- | --- | --- |
| Live transcription | ✓ | ✓ |
| Speaker identification | ✓ | ✓ |
| AI assistant (Max) with 15+ tools | ✓ | ✓ |
| Voice chat | ✓ | ✓ |
| Web search & URL extraction | ✓ | ✓ |
| Entity extraction | ✓ | ✓ |
| OCR & documents | ✓ | ✓ |
| Semantic search | ✓ | ✓ |
| Project workspaces | ✓ | ✓ |
| Trash & recovery | ✓ | ✓ |
| Export (TXT, SRT, VTT, DOCX, PDF) | ✓ | ✓ |
| Browser extension | – | ✓ |
| Multi-user & teams | – | ✓ |
| SSO & JWT auth | – | ✓ |
| API keys & webhooks | – | ✓ |
| Audit logging | – | ✓ |
| PostgreSQL database | – | ✓ |
| S3 / Azure storage | – | ✓ |
| Docker deployment | – | ✓ |
| External LLM providers | – | ✓ |