Audio

Page Overview

This page configures audio input and output capabilities in IntraLLM AI, including Speech-to-Text (STT) for transcribing audio and Text-to-Speech (TTS) for generating spoken responses.

Speech-to-Text (STT)

Speech-to-Text converts audio or video input into text that can be processed by models, tools, and workflows.

Supported MIME Types

Specifies which audio or video formats are accepted as input for transcription.
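
As an illustration, an allow-list check of this kind might look like the following. This is a minimal sketch, not IntraLLM's actual implementation: the function name `is_supported` and the wildcard pattern syntax (`audio/*`) are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def is_supported(mime_type: str, allowed: list[str]) -> bool:
    """Return True if mime_type matches any pattern in the allow-list.

    Patterns may be exact ("audio/wav") or wildcards ("audio/*").
    An empty allow-list rejects everything; that choice is an
    assumption of this sketch, not documented product behavior.
    """
    mime_type = mime_type.strip().lower()
    return any(fnmatchcase(mime_type, p.strip().lower()) for p in allowed)


allowed_types = ["audio/*", "video/webm"]
print(is_supported("audio/mpeg", allowed_types))  # True
print(is_supported("video/mp4", allowed_types))   # False
```

Matching is case-insensitive here because MIME type names are not case-sensitive.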

Speech-to-Text Engine

Speech-to-Text Engine: selects the backend used for transcription.
STT Model: selects the Whisper model variant used for transcription (e.g., base).

IntraLLM WebUI uses faster-whisper internally to provide efficient local transcription. Different models offer trade-offs between speed, accuracy, and resource usage.
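
For orientation, a standalone transcription with the faster-whisper library could be sketched as below. How IntraLLM calls the library internally is not documented here, so the device, compute type, helper names, and file path are all assumptions chosen for a CPU-only host.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe(path: str, model_size: str = "base") -> str:
    """Transcribe an audio file locally with faster-whisper."""
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # device and compute_type are illustrative, not IntraLLM defaults.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus run metadata.
    segments, info = model.transcribe(path)
    print(f"Detected language: {info.language}")
    return join_segments(segments)


# Example call (needs faster-whisper installed and a real audio file):
# print(transcribe("meeting.wav", model_size="base"))
```

Swapping `model_size` for a larger variant such as `small` or `medium` is where the speed, accuracy, and resource trade-off shows up.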

Text-to-Speech (TTS)

Text-to-Speech converts model-generated text into spoken audio output.

Text-to-Speech Engine

Text-to-Speech Engine: selects the backend used for speech synthesis.
TTS Model: selects the SpeechT5 model used for speech generation.
CMU ARCTIC speaker embedding name: specifies the speaker voice used for audio output.
Different embeddings produce different voice characteristics.

IntraLLM WebUI uses SpeechT5 together with CMU ARCTIC speaker embeddings for local speech synthesis.

Response Splitting

Response splitting controls how text is divided before being sent to the TTS engine.

Punctuation: splits text into sentences
Paragraphs: splits text into paragraphs
None: sends the entire message as a single string

This setting affects:

naturalness of speech output
latency of audio generation
how long responses are broken into segments during playback
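
The three modes above can be sketched as a small splitting function. This is an illustration only: the function name, the lowercase mode strings, and the exact splitting rules (sentence-ending punctuation followed by whitespace, blank lines between paragraphs) are assumptions, and the product's real rules may differ.

```python
import re


def split_response(text: str, mode: str = "punctuation") -> list[str]:
    """Split text into chunks for TTS according to the chosen mode."""
    text = text.strip()
    if not text:
        return []
    if mode == "none":
        # Send the whole message as a single string.
        return [text]
    if mode == "paragraphs":
        # A paragraph break is one or more blank lines.
        parts = re.split(r"\n\s*\n", text)
    elif mode == "punctuation":
        # Split after '.', '!' or '?' when followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown split mode: {mode!r}")
    return [p.strip() for p in parts if p.strip()]


message = "Hello there. How are you?\n\nSecond paragraph!"
for mode in ("punctuation", "paragraphs", "none"):
    print(mode, split_response(message, mode))
```

Smaller chunks let playback start before the full response has been synthesized, at the cost of possible pauses between chunks. A naive punctuation rule like this one also splits after abbreviations such as "e.g.", which a production splitter would need to handle.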

How it works

When audio features are enabled:

Audio or video input is transcribed using the configured STT engine.
The resulting text is processed by the selected model and tools.
If TTS is enabled, the text response is converted back into audio using the configured TTS engine and voice settings.
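
The flow above can be sketched as a single function with pluggable stages. The function and parameter names are hypothetical, and stub callables stand in for the real STT engine, model, and TTS engine.

```python
from typing import Callable, Optional


def handle_audio_turn(
    audio_path: str,
    transcribe: Callable[[str], str],
    generate: Callable[[str], str],
    synthesize: Optional[Callable[[str], bytes]] = None,
) -> tuple[str, Optional[bytes]]:
    """Run one audio turn: STT, then the model, then optional TTS."""
    prompt = transcribe(audio_path)  # step 1: audio or video -> text
    reply = generate(prompt)         # step 2: model and tools produce text
    # Step 3 runs only when a TTS stage is supplied (TTS enabled).
    audio = synthesize(reply) if synthesize else None
    return reply, audio


# Stub callables stand in for real engines in this sketch.
reply, audio = handle_audio_turn(
    "question.wav",
    transcribe=lambda path: "what is the capital of France",
    generate=lambda prompt: f"You asked: {prompt}",
    synthesize=lambda text: text.encode("utf-8"),
)
print(reply)
```

Passing `synthesize=None` models the case where TTS is disabled: the text reply is returned and no audio is produced.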

Usage considerations

Smaller STT models transcribe faster but are less accurate.
Larger models improve accuracy but require more compute resources.
Response splitting can improve perceived responsiveness for long outputs.
Audio processing runs locally when a local engine is selected.

Quick checklist

Supported MIME types configured correctly.
STT engine and model selected.
TTS engine and speaker embedding configured.
Response splitting set to match desired playback behavior.
Sufficient system resources available for audio processing.
