Page Overview
Use this page to configure audio input and output in IntraLLM AI: Speech-to-Text (STT) for transcribing audio and Text-to-Speech (TTS) for generating spoken responses.
Speech-to-Text (STT)
Speech-to-Text converts audio or video input into text that can be processed by models, tools, and workflows.
Supported MIME Types
Specifies which audio or video formats are accepted.
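As a rough illustration of how such an allow-list might be applied, the sketch below checks an uploaded file's MIME type against a list of patterns. The pattern syntax (wildcards such as audio/*), the example values, and the function name is_supported are all assumptions made for this example; the page does not specify how the setting is evaluated.

```python
from fnmatch import fnmatch

# Hypothetical allow-list; real values come from the Supported MIME Types setting.
SUPPORTED_MIME_TYPES = ["audio/*", "video/mp4"]


def is_supported(mime_type: str, patterns: list[str] = SUPPORTED_MIME_TYPES) -> bool:
    """Return True if mime_type matches any allowed pattern (wildcards assumed)."""
    return any(fnmatch(mime_type.lower(), pattern) for pattern in patterns)


if __name__ == "__main__":
    for candidate in ("audio/mpeg", "video/mp4", "text/plain"):
        print(candidate, is_supported(candidate))
```

With the example list, any audio type is accepted, video is limited to MP4, and everything else is rejected.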
Speech-to-Text Engine
- Speech-to-Text Engine: selects the backend used for transcription.
- STT Model: selects the Whisper model variant used for transcription (e.g. base).
IntraLLM WebUI uses faster-whisper internally to provide efficient local transcription. Different models offer trade-offs between speed, accuracy, and resource usage.
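For orientation, a minimal sketch of local transcription with the faster-whisper library is shown below. It is not the application's own code: the device and compute_type values, the beam size, and the helper names transcribe and join_segments are illustrative assumptions. Only the model name (base) comes from this page.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe(path: str, model_size: str = "base") -> str:
    """Transcribe an audio file locally with faster-whisper (sketch)."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # device and compute_type are illustrative choices, not documented defaults.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus run metadata.
    segments, _info = model.transcribe(path, beam_size=5)
    return join_segments(segments)
```

Calling transcribe("meeting.wav", model_size="small") would trade some speed for accuracy; the file name here is a placeholder.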
Text-to-Speech (TTS)
Text-to-Speech converts model-generated text into spoken audio output.
Text-to-Speech Engine
- Text-to-Speech Engine: selects the backend used for speech synthesis.
- TTS Model: selects the SpeechT5 model used for speech generation.
- CMU ARCTIC speaker embedding name: specifies the speaker voice used for audio output.
Different embeddings produce different voice characteristics.
IntraLLM WebUI uses SpeechT5 together with CMU ARCTIC speaker embeddings for local speech synthesis.
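The sketch below shows one way SpeechT5 synthesis can be driven through the Hugging Face transformers library. It assumes the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoints and a speaker embedding supplied as a torch tensor of shape (1, 512); how the application resolves a speaker embedding name to such a vector is not described on this page, so that step is left to the caller. The names synthesize and check_embedding_shape are hypothetical.

```python
TTS_MODEL_ID = "microsoft/speecht5_tts"    # assumed checkpoint
VOCODER_ID = "microsoft/speecht5_hifigan"  # assumed vocoder checkpoint
XVECTOR_SHAPE = (1, 512)                   # one 512-dimensional x-vector


def check_embedding_shape(shape) -> None:
    """Raise ValueError unless the speaker embedding has shape (1, 512)."""
    if tuple(shape) != XVECTOR_SHAPE:
        raise ValueError(f"expected shape {XVECTOR_SHAPE}, got {tuple(shape)}")


def synthesize(text: str, speaker_embedding):
    """Return a 1-D waveform tensor (16 kHz) for text in the given voice."""
    # Imported lazily: transformers and torch are heavy optional dependencies.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    check_embedding_shape(speaker_embedding.shape)
    processor = SpeechT5Processor.from_pretrained(TTS_MODEL_ID)
    model = SpeechT5ForTextToSpeech.from_pretrained(TTS_MODEL_ID)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_ID)

    inputs = processor(text=text, return_tensors="pt")
    # The vocoder converts the predicted spectrogram into a waveform.
    return model.generate_speech(
        inputs["input_ids"], speaker_embedding, vocoder=vocoder
    )
```

Passing a different x-vector to synthesize changes the voice while the text and model stay the same.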
Response Splitting
Response splitting controls how text is divided before being sent to the TTS engine.
- Punctuation: splits text into sentences
- Paragraphs: splits text into paragraphs
- None: sends the entire message as a single string
This setting affects:
- naturalness of speech output
- latency of audio generation
- how long responses are broken into segments during playback
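The three splitting modes can be sketched as follows. This is an approximation for illustration only: the mode strings, the function name split_for_tts, and the regular expressions are assumptions, and the application's actual splitting rules may differ (for example in how abbreviations are handled).

```python
import re


def split_for_tts(text: str, mode: str = "punctuation") -> list[str]:
    """Split text into chunks to send to a TTS engine one at a time."""
    text = text.strip()
    if not text:
        return []
    if mode == "punctuation":
        # Break after sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    elif mode == "paragraphs":
        # Break on blank lines.
        parts = re.split(r"\n\s*\n", text)
    elif mode == "none":
        parts = [text]
    else:
        raise ValueError(f"unknown split mode: {mode!r}")
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = "First point. Second point!\n\nA new paragraph?"
    for mode in ("punctuation", "paragraphs", "none"):
        print(mode, split_for_tts(sample, mode))
```

Smaller chunks let playback begin after the first chunk is synthesized, at the cost of more calls to the TTS engine.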
How it works
When audio features are enabled:
- Audio or video input is transcribed using the configured STT engine.
- The resulting text is processed by the selected model and tools.
- If TTS is enabled, the text response is converted back into audio using the configured TTS engine and voice settings.
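The flow above can be summarized in a short sketch. The three callables stand in for the configured STT engine, the selected model, and the TTS engine; their names and signatures are hypothetical and chosen only to make the order of steps explicit.

```python
from typing import Callable, Optional, Tuple


def run_audio_turn(
    audio_path: str,
    transcribe: Callable[[str], str],
    generate: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> Tuple[str, Optional[bytes]]:
    """Run one audio turn: STT, then the model, then optional TTS."""
    transcript = transcribe(audio_path)  # audio or video -> text
    reply = generate(transcript)         # text -> model response
    # TTS is optional: skip synthesis when no engine is supplied.
    audio = synthesize(reply) if synthesize is not None else None
    return reply, audio
```

When synthesize is omitted, the function returns the text reply with no audio, mirroring the case where TTS is disabled.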
Usage considerations
- Smaller STT models transcribe faster but less accurately.
- Larger models improve accuracy but require more compute resources.
- Response splitting can improve perceived responsiveness for long outputs.
- Audio processing runs locally when using local engines.
Quick checklist
- Supported MIME types configured correctly.
- STT engine and model selected.
- TTS engine and speaker embedding configured.
- Response splitting set to match desired playback behavior.
- Sufficient system resources available for audio processing.