Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) improves answer quality by retrieving relevant context from documents and web sources, then injecting that context into the model prompt using a configurable RAG template. This enables grounded responses with optional citations, while keeping your knowledge local to your instance.

How RAG works in IntraLLM AI

RAG runs as a pipeline:

Ingest: files and sources are uploaded or connected (local documents, web pages, multimedia such as YouTube transcripts).
Extract: content is parsed and normalized (text, tables, and optionally OCR and image descriptions).
Split: text is segmented into chunks (size + overlap) for retrieval.
Embed: each chunk is converted into a vector using the configured embedding model.
Retrieve: the most relevant chunks are selected (Top K, optional hybrid search, optional full-context behavior).
Generate: the retrieved context is inserted into the RAG template and sent to the chat model.
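
The Embed and Retrieve steps above can be sketched in a few lines. This is a minimal illustration, not how IntraLLM AI implements them: the `embed`, `cosine`, and `retrieve` functions are hypothetical names, and a bag-of-words count stands in for a real embedding model so the example runs without extra dependencies.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Invoices are due within 30 days of receipt.",
    "The office is closed on public holidays.",
    "Late invoices incur a 2% monthly fee.",
]
top = retrieve("when are invoices due", chunks, top_k=2)
print(top)
```

With a real embedding model, the ranking would reflect semantic similarity rather than shared words, but the Top K selection works the same way.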

User workflow: using RAG in chat

1) Use uploaded documents

Upload documents in Workspace → Documents.
In chat, reference documents using the # workflow:
- Type # before your query and select the formatted document reference shown above the input.
- Once selected, a document indicator appears near the message input (indicating the context will be included).

2) Use web pages as context

Start the prompt with # followed by a URL to fetch and parse web content (if supported).
Select the formatted URL reference shown above the input to confirm it is being used.

Example:

# https://example.com/page
Summarize the key points and list the requirements.

Tip:

Web pages often include navigation and footers. Prefer a reader-friendly or raw content view for higher-quality extraction.

3) Use YouTube content

If enabled, the YouTube RAG pipeline can retrieve and summarize information from video transcripts/captions when you provide a video URL as context. The workflow is similar to web URLs (provide the link, then run a query referencing it).

Critical operational note for Ollama

If you are using Ollama, ensure your model context length is large enough for the retrieved context to fit. Ollama commonly defaults to a 2048-token context window, so retrieved context that exceeds the window can be truncated or dropped before the model sees it. For better RAG performance, configure the model context length to 8192 tokens or higher where possible.
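
A rough budget check shows why a 2048-token window is tight. The sketch below assumes chunk size is measured in characters, uses the common heuristic of about 4 characters per token (real tokenizers vary), and reserves a hypothetical 512 tokens for the template, the question, and the answer. In Ollama, the context length is typically set through the `num_ctx` parameter.

```python
def estimated_context_tokens(chunk_size_chars: int, top_k: int,
                             chars_per_token: float = 4.0) -> int:
    """Approximate tokens consumed by the retrieved chunks (heuristic only)."""
    return int(chunk_size_chars * top_k / chars_per_token)


def fits_in_window(context_window: int, chunk_size_chars: int, top_k: int,
                   reserved_tokens: int = 512) -> bool:
    """True if retrieved chunks plus a reserve for prompt and answer fit the window."""
    needed = estimated_context_tokens(chunk_size_chars, top_k) + reserved_tokens
    return needed <= context_window


# Example settings from this page: chunk size 1000, Top K 10.
needed = estimated_context_tokens(1000, 10)
print(needed)                            # 2500
print(fits_in_window(2048, 1000, 10))    # False
print(fits_in_window(8192, 1000, 10))    # True
```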

Admin configuration: what this settings page controls

This section maps to the RAG configuration settings (Admin Panel → Settings → Documents / Knowledge).

Content extraction

Content extraction engine (Docling): configure the Docling service endpoint used for document parsing (example: http://<host>:5001).
OCR engine (easyocr): optional OCR for text in images; configure language (example: en).
Describe pictures in documents: optional image understanding to produce short descriptions for images embedded in documents.

Text splitting

Text splitter: select the splitting strategy (example: character-based).
Chunk size: the size of each segment, in the splitter's units (characters for a character-based splitter; example: 1000).
Chunk overlap: overlap between chunks (example: 100) to reduce boundary loss.

Guidance:

Larger chunks preserve more context but can reduce retrieval precision and increase token usage.
Overlap improves continuity but increases index size and compute.
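
To make size and overlap concrete, here is a minimal character-based splitter. It is a sketch only: the product's splitter may respect sentence or paragraph boundaries, and the function name `split_text` is hypothetical.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap that reduces boundary loss. A larger overlap means a smaller step and therefore more chunks to embed and store.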

Embeddings

Embedding model engine: select the embedding backend (example: SentenceTransformers).
Embedding model: select the embedding model (example: sentence-transformers/all-MiniLM-L6-v2).

Important:

If you change the embedding model, you must re-import documents (or fully rebuild vectors). Vectors produced by different models are not comparable, so existing vectors cannot be searched with the new model.
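
One way to guard against mixing models is to record which model built the index and check it before every query. The sketch below is an illustration under assumptions: `IndexInfo` and `check_query_vector` are hypothetical names, not part of the product. The 384-dimension figure is the output size of all-MiniLM-L6-v2.

```python
from dataclasses import dataclass


@dataclass
class IndexInfo:
    """Metadata stored alongside a vector index (hypothetical structure)."""
    embedding_model: str
    dimension: int


def check_query_vector(index: IndexInfo, model_name: str, query_vector: list[float]) -> None:
    """Raise ValueError if the query vector cannot be compared with the index."""
    if model_name != index.embedding_model:
        raise ValueError(
            f"Index was built with {index.embedding_model!r}, not {model_name!r}; "
            "re-import documents or rebuild vectors."
        )
    if len(query_vector) != index.dimension:
        raise ValueError(
            f"Expected {index.dimension} dimensions, got {len(query_vector)}."
        )


info = IndexInfo("sentence-transformers/all-MiniLM-L6-v2", 384)
check_query_vector(info, "sentence-transformers/all-MiniLM-L6-v2", [0.0] * 384)  # passes
```

Checking the model name matters even when two models share the same dimension, because their vector spaces still differ.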

Retrieval

Top K: number of chunks retrieved per query (example: 10).
Hybrid search: optional combination of keyword and vector retrieval to improve recall.
Full context mode: optional behavior to retrieve broader context; useful for summarization but increases token usage.
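
Hybrid search has to merge a keyword ranking and a vector ranking into one list. The source does not say which method IntraLLM AI uses; the sketch below shows reciprocal rank fusion, one common technique, with hypothetical chunk ids and the conventional constant k = 60.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], top_k: int = 10, k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one list of at most top_k ids."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute a larger score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
    return ordered[:top_k]


keyword_hits = ["c2", "c7", "c1"]   # e.g. from a keyword search
vector_hits = ["c1", "c2", "c9"]    # e.g. from embedding similarity
fused = reciprocal_rank_fusion([keyword_hits, vector_hits], top_k=3)
print(fused)  # ['c2', 'c1', 'c7']
```

Chunks that appear in both rankings rise to the top, which is why hybrid search tends to help with queries that mix exact technical terms and looser semantic phrasing.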

RAG template and citations

RAG template: defines how retrieved context and the user query are combined before the model is called.
Citation policy (typical pattern):
- Include inline citations only when sources explicitly provide an identifier (e.g., cite as [1] only when a source has an id).
- Do not emit XML tags in the model response.
- If the answer is not in context, the assistant should say so, and may answer based on general knowledge if allowed by policy.
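
The citation pattern above can be expressed as a simple template fill. This is a sketch under assumptions: the template text, the `{context}` and `{query}` placeholders, and the function names are hypothetical, not the product's default RAG template.

```python
RAG_TEMPLATE = """Use the context below to answer the query.
Cite a source inline as [id] only when the source has an id.
Do not emit XML tags. If the answer is not in the context, say so.

Context:
{context}

Query: {query}"""


def format_context(sources: list[dict]) -> str:
    """Render each source on its own line, prefixed with [id] only if it has one."""
    lines = []
    for source in sources:
        prefix = f"[{source['id']}] " if source.get("id") is not None else ""
        lines.append(prefix + source["text"])
    return "\n".join(lines)


def build_prompt(query: str, sources: list[dict]) -> str:
    """Combine retrieved sources and the user query into the final prompt."""
    return RAG_TEMPLATE.format(context=format_context(sources), query=query)


prompt = build_prompt(
    "When are invoices due?",
    [
        {"id": 1, "text": "Invoices are due within 30 days."},
        {"text": "Holidays are listed on the intranet."},  # no id, so not citable
    ],
)
print(prompt)
```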

Files and uploads

Allowed file extensions: restrict supported uploads (example: pdf, docx, txt).
Max upload size / count: enforce limits (leave empty for unlimited).
Image compression width/height: optionally compress images during upload to reduce size (note: excessive compression can reduce OCR quality).
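
The upload limits amount to a few checks per file. The sketch below assumes the size limit is expressed in megabytes and that an empty value means unlimited, as described above; `validate_upload` is a hypothetical helper, not a product API.

```python
from pathlib import Path
from typing import Optional


def validate_upload(filename: str, size_bytes: int, allowed_extensions: set[str],
                    max_size_mb: Optional[float] = None) -> list[str]:
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    # An empty allow-list is treated as "no restriction" in this sketch.
    if allowed_extensions and extension not in allowed_extensions:
        problems.append(f"extension '{extension}' is not allowed")
    # None mirrors leaving the setting empty (unlimited).
    if max_size_mb is not None and size_bytes > max_size_mb * 1024 * 1024:
        problems.append(f"file exceeds {max_size_mb} MB")
    return problems


allowed = {"pdf", "docx", "txt"}
print(validate_upload("report.PDF", 1024, allowed))
print(validate_upload("big.pdf", 5 * 1024 * 1024, allowed, max_size_mb=1))
```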

Integrations

Google Drive / OneDrive: enable connectors that allow users to load documents from cloud storage (availability depends on your deployment configuration).
Google Drive typically requires a Google Cloud project, enabled APIs, OAuth client configuration, and environment variables such as:
- GOOGLE_DRIVE_API_KEY
- GOOGLE_DRIVE_CLIENT_ID
- GOOGLE_REDIRECT_URI
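
Before enabling the connector, it can help to confirm that these variables are set. The sketch below checks only the three names listed above; your deployment may require others, and `missing_variables` is a hypothetical helper.

```python
import os
from typing import Mapping

REQUIRED_VARIABLES = (
    "GOOGLE_DRIVE_API_KEY",
    "GOOGLE_DRIVE_CLIENT_ID",
    "GOOGLE_REDIRECT_URI",
)


def missing_variables(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


missing = missing_variables()
if missing:
    print("Missing Google Drive settings:", ", ".join(missing))
else:
    print("All listed Google Drive variables are set.")
```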

Operations and maintenance (Danger Zone)

These controls impact system state and should be used carefully:

Reset upload directory: clears the stored uploads.
Reset vector storage/knowledge: clears the vector index and knowledge storage.
Reindex knowledge base vectors: rebuilds vector indexes (useful after corruption, configuration changes, or certain upgrades).

Best practices

Prefer hybrid search for mixed workloads (technical terms + semantic similarity).
Keep Top K moderate (e.g., 5–15) to balance relevance and token costs.
Use full context mode selectively for summarization and deep analysis tasks.
Keep extraction quality high:
- Enable OCR only when you have image-based documents.
- Avoid overly aggressive image compression if OCR is needed.
For Ollama deployments, ensure context length is sufficient (8192+ recommended for RAG-heavy use).

Quick checklist

Documents ingesting correctly (Docling endpoint reachable).
OCR configured only if needed (language set correctly).
Chunk size/overlap tuned to your document types.
Embedding model chosen and stable (re-import required if changed).
Retrieval configured (Top K, hybrid search, optional full context).
RAG template enforces your citation and response rules.
Maintenance actions documented (reset/reindex procedures).
