Retrieval Augmented Generation (RAG)

A guide to how IntraLLM AI retrieves knowledge from documents and web sources, how to use it in chats, and how administrators configure extraction, embeddings, retrieval, citations, and indexing operations.

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) improves answer quality by retrieving relevant context from documents and web sources, then injecting that context into the model prompt using a configurable RAG template. This enables grounded responses with optional citations, while keeping your knowledge local to your instance.

How RAG works in IntraLLM AI

RAG runs as a pipeline:

  • Ingest: files and sources are uploaded or connected (local documents, web pages, multimedia such as YouTube transcripts).
  • Extract: content is parsed and normalized (text, tables, and optionally OCR and image descriptions).
  • Split: text is segmented into chunks (size + overlap) for retrieval.
  • Embed: each chunk is converted into vectors using the configured embedding model.
  • Retrieve: the top relevant chunks are selected (Top K, optional hybrid search, optional full-context behavior).
  • Generate: the retrieved context is inserted into the RAG template and sent to the chat model.

User workflow: using RAG in chat

1) Use uploaded documents

  • Upload documents in Workspace → Documents.
  • In chat, reference documents using the # workflow:
    • Type # before your query and select the formatted document reference shown above the input.
    • Once selected, a document indicator appears near the message input (indicating the context will be included).

2) Use web pages as context

  • Start the prompt with # followed by a URL to fetch and parse web content (if supported).
  • Select the formatted URL reference shown above the input to confirm it is being used.

Example:

# https://example.com/page
Summarize the key points and list the requirements.

Tip:

  • Web pages often include navigation and footers. Prefer a reader-friendly or raw content view for higher-quality extraction.

3) Use YouTube content

If enabled, the YouTube RAG pipeline can retrieve and summarize information from video transcripts/captions when you provide a video URL as context. The workflow is similar to web URLs (provide the link, then run a query referencing it).

Critical operational note for Ollama

If you are using Ollama, ensure your model context length is large enough for retrieved context to fit. Ollama commonly defaults to a 2048-token context window, which can prevent retrieved data from being included. For better RAG performance, configure the model context length to 8192 tokens or higher where possible.

Admin configuration: what this settings page controls

This section maps to the RAG configuration settings (Admin Panel → Settings → Documents / Knowledge).

Content extraction

  • Content extraction engine (Docling): configure the Docling service endpoint used for document parsing (example: http://<host>:5001).
  • OCR engine (easyocr): optional OCR for text in images; configure language (example: en).
  • Describe pictures in documents: optional image understanding to produce short descriptions for images embedded in documents.

Text splitting

  • Text splitter: select the splitting strategy (example: character-based).
  • Chunk size: the size of each segment (example: 1000).
  • Chunk overlap: overlap between chunks (example: 100) to reduce boundary loss.

Guidance:

  • Larger chunks preserve more context but can reduce retrieval precision and increase token usage.
  • Overlap improves continuity but increases index size and compute.

Embeddings

  • Embedding model engine: select the embedding backend (example: SentenceTransformers).
  • Embedding model: select the embedding model (example: sentence-transformers/all-MiniLM-L6-v2).

Important:

  • If you change the embedding model, you must re-import documents (or fully rebuild vectors), because existing vectors will not be compatible.

Retrieval

  • Top K: number of chunks retrieved per query (example: 10).
  • Hybrid search: optional combination of keyword and vector retrieval to improve recall.
  • Full context mode: optional behavior to retrieve broader context; useful for summarization but increases token usage.

RAG template and citations

  • RAG template: defines how retrieved context and the user query are combined before the model is called.
  • Citation policy (typical pattern):
    • Include inline citations only when sources explicitly provide an identifier (e.g., cite as [1] only when a source has an id).
    • Do not emit XML tags in the model response.
    • If the answer is not in context, the assistant should say so, and may answer based on general knowledge if allowed by policy.

Files and uploads

  • Allowed file extensions: restrict supported uploads (example: pdf, docx, txt).
  • Max upload size / count: enforce limits (leave empty for unlimited).
  • Image compression width/height: optionally compress images during upload to reduce size (note: excessive compression can reduce OCR quality).

Integrations

  • Google Drive / OneDrive: enable connectors that allow users to load documents from cloud storage (availability depends on your deployment configuration).
  • Google Drive typically requires a Google Cloud project, enabled APIs, OAuth client configuration, and environment variables such as:
    • GOOGLE_DRIVE_API_KEY
    • GOOGLE_DRIVE_CLIENT_ID
    • GOOGLE_REDIRECT_URI

Operations and maintenance (Danger Zone)

These controls impact system state and should be used carefully:

  • Reset upload directory: clears the stored uploads.
  • Reset vector storage/knowledge: clears the vector index and knowledge storage.
  • Reindex knowledge base vectors: rebuilds vector indexes (useful after corruption, configuration changes, or certain upgrades).

Best practices

  • Prefer hybrid search for mixed workloads (technical terms + semantic similarity).
  • Keep Top K moderate (e.g., 5–15) to balance relevance and token costs.
  • Use full context mode selectively for summarization and deep analysis tasks.
  • Keep extraction quality high:
    • Enable OCR only when you have image-based documents.
    • Avoid overly aggressive image compression if OCR is needed.
  • For Ollama deployments, ensure context length is sufficient (8192+ recommended for RAG-heavy use).

Quick checklist

  • Documents ingesting correctly (Docling endpoint reachable).
  • OCR configured only if needed (language set correctly).
  • Chunk size/overlap tuned to your document types.
  • Embedding model chosen and stable (re-import required if changed).
  • Retrieval configured (Top K, hybrid search, optional full context).
  • RAG template enforces your citation and response rules.
  • Maintenance actions documented (reset/reindex procedures).