Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) improves answer quality by retrieving relevant context from documents and web sources, then injecting that context into the model prompt using a configurable RAG template. This enables grounded responses with optional citations, while keeping your knowledge local to your instance.
How RAG works in IntraLLM AI
RAG runs as a pipeline:
- Ingest: files and sources are uploaded or connected (local documents, web pages, multimedia such as YouTube transcripts).
- Extract: content is parsed and normalized (text, tables, and optionally OCR and image descriptions).
- Split: text is segmented into chunks (size + overlap) for retrieval.
- Embed: each chunk is converted into a vector using the configured embedding model.
- Retrieve: the top relevant chunks are selected (Top K, optional hybrid search, optional full-context behavior).
- Generate: the retrieved context is inserted into the RAG template and sent to the chat model.
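The Embed and Retrieve steps above can be sketched in a few lines. This is only an illustration of the idea, not how IntraLLM AI implements it: `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Invoices are due within 30 days of receipt.",
    "The office is closed on public holidays.",
    "Late invoices incur a 2 percent fee.",
]
context = retrieve("When are invoices due?", chunks, top_k=2)
```

In a real deployment the embedding model produces dense vectors and a vector store handles the ranking, but the shape of the step is the same: embed the query, score each chunk, keep the best Top K.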
User workflow: using RAG in chat
1) Use uploaded documents
- Upload documents in Workspace → Documents.
- In chat, reference documents using the # workflow:
  - Type # before your query and select the formatted document reference shown above the input.
  - Once selected, a document indicator appears near the message input (indicating the context will be included).
2) Use web pages as context
- Start the prompt with # followed by a URL to fetch and parse web content (if supported).
- Select the formatted URL reference shown above the input to confirm it is being used.
Example:
# https://example.com/page
Summarize the key points and list the requirements.
Tip:
- Web pages often include navigation and footers. Prefer a reader-friendly or raw content view for higher-quality extraction.
3) Use YouTube content
If enabled, the YouTube RAG pipeline can retrieve and summarize information from video transcripts/captions when you provide a video URL as context. The workflow is similar to web URLs (provide the link, then run a query referencing it).
Critical operational note for Ollama
If you are using Ollama, ensure your model context length is large enough for retrieved context to fit. Ollama commonly defaults to a 2048-token context window, which can prevent retrieved data from being included. For better RAG performance, configure the model context length to 8192 tokens or higher where possible.
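A rough budget check makes the problem concrete. The sketch below rests on several assumptions that are not part of the product: chunk size is measured in characters, English text averages about 4 characters per token (a coarse heuristic), and `reserved_tokens` is a hypothetical allowance for the template, the question, and the answer.

```python
def fits_context(
    top_k: int,
    chunk_size_chars: int,
    context_window_tokens: int,
    reserved_tokens: int = 1024,   # assumed room for template, question, answer
    chars_per_token: float = 4.0,  # rough heuristic, varies by tokenizer and language
) -> bool:
    """Estimate whether top_k retrieved chunks fit in the model's context window."""
    retrieved_tokens = int(top_k * chunk_size_chars / chars_per_token)
    return retrieved_tokens + reserved_tokens <= context_window_tokens


# Top K = 10 and chunk size = 1000 is roughly 2500 tokens of retrieved context.
fits_default = fits_context(top_k=10, chunk_size_chars=1000, context_window_tokens=2048)
fits_larger = fits_context(top_k=10, chunk_size_chars=1000, context_window_tokens=8192)
```

Under these assumptions the example settings do not fit in a 2048-token window but do fit in 8192. In Ollama, the context length is set through the `num_ctx` model parameter.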
Admin configuration: what this settings page controls
This section maps to the RAG configuration settings (Admin Panel → Settings → Documents / Knowledge).
Content extraction
- Content extraction engine (Docling): configure the Docling service endpoint used for document parsing (example: http://<host>:5001).
- OCR engine (easyocr): optional OCR for text in images; configure language (example: en).
- Describe pictures in documents: optional image understanding to produce short descriptions for images embedded in documents.
Text splitting
- Text splitter: select the splitting strategy (example: character-based).
- Chunk size: the size of each segment (example: 1000).
- Chunk overlap: overlap between chunks (example: 100) to reduce boundary loss.
Guidance:
- Larger chunks preserve more context but can reduce retrieval precision and increase token usage.
- Overlap improves continuity but increases index size and compute.
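To show how chunk size and overlap interact, here is a minimal character-based splitter. It is a sketch under the assumption that both settings are counted in characters; the function name `split_text` is hypothetical and the product's own splitter may handle boundaries differently.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


small = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
large = split_text("x" * 2500, chunk_size=1000, chunk_overlap=100)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so raising the overlap produces more chunks for the same document, which is where the extra index size and compute come from.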
Embeddings
- Embedding model engine: select the embedding backend (example: SentenceTransformers).
- Embedding model: select the embedding model (example: sentence-transformers/all-MiniLM-L6-v2).
Important:
- If you change the embedding model, you must re-import documents (or fully rebuild vectors), because existing vectors will not be compatible.
Retrieval
- Top K: number of chunks retrieved per query (example: 10).
- Hybrid search: optional combination of keyword and vector retrieval to improve recall.
- Full context mode: optional behavior to retrieve broader context; useful for summarization but increases token usage.
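This page does not say how keyword and vector results are combined when hybrid search is enabled. One common technique for merging two ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration; the chunk IDs, the function name, and the constant `k = 60` are assumptions, not product settings.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    top_k: int = 10,
    k: int = 60,  # conventional RRF damping constant, assumed here
) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list of at most top_k IDs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Chunks ranked highly in any list receive a larger contribution.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)[:top_k]


keyword_ranking = ["c2", "c7", "c1"]  # e.g. from a keyword search
vector_ranking = ["c7", "c3", "c2"]   # e.g. from embedding similarity
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking], top_k=3)
```

Chunks that appear in both lists (`c7` and `c2` here) rise to the top, which is the recall benefit hybrid search is meant to provide for queries that mix exact terms with looser semantic matches.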
RAG template and citations
- RAG template: defines how retrieved context and the user query are combined before the model is called.
- Citation policy (typical pattern):
  - Include inline citations only when sources explicitly provide an identifier (e.g., cite as [1] only when a source has an id).
  - Do not emit XML tags in the model response.
  - If the answer is not in context, the assistant should say so, and may answer based on general knowledge if allowed by policy.
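As an illustration of how a template combines retrieved context with the user query, the sketch below fills a simple prompt template. The wording of the template and the placeholder names `$context` and `$query` are hypothetical; the actual template and its variables are whatever is configured in your instance.

```python
from string import Template

# Hypothetical template text; the placeholders are assumptions for illustration.
RAG_TEMPLATE = Template(
    "Use the context below to answer the question.\n"
    "Cite a source as [id] only when it has an id.\n"
    "If the answer is not in the context, say so.\n\n"
    "Context:\n$context\n\n"
    "Question: $query\n"
)


def build_prompt(query: str, sources: list[dict]) -> str:
    """Insert retrieved sources and the user query into the template."""
    lines = []
    for source in sources:
        # Only sources that carry an id get a citable [id] prefix.
        prefix = f"[{source['id']}] " if source.get("id") is not None else ""
        lines.append(prefix + source["text"])
    return RAG_TEMPLATE.substitute(context="\n".join(lines), query=query)


prompt = build_prompt(
    "When are invoices due?",
    [
        {"id": 1, "text": "Invoices are due within 30 days."},
        {"text": "Offices close on holidays."},
    ],
)
```

Sources without an identifier are still included as context but carry no citation marker, which matches the policy of citing only when an id is present.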
Files and uploads
- Allowed file extensions: restrict supported uploads (example: pdf,docx,txt).
- Max upload size / count: enforce limits (leave empty for unlimited).
- Image compression width/height: optionally compress images during upload to reduce size (note: excessive compression can reduce OCR quality).
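The extension and size limits above amount to a simple check at upload time. The sketch below is a hypothetical illustration: the function names are invented, the size limit is assumed to be in megabytes, and an empty extension list is assumed to mean that all extensions are allowed.

```python
from pathlib import Path
from typing import Optional


def parse_extensions(raw: str) -> set[str]:
    """Turn a setting such as 'pdf,docx,txt' into a normalized set."""
    return {e.strip().lower().lstrip(".") for e in raw.split(",") if e.strip()}


def validate_upload(
    filename: str,
    size_bytes: int,
    allowed_extensions: set[str],
    max_size_mb: Optional[float] = None,  # None mirrors "leave empty for unlimited"
) -> bool:
    """Return True if the file passes the extension and size limits."""
    extension = Path(filename).suffix.lower().lstrip(".")
    if allowed_extensions and extension not in allowed_extensions:
        return False
    if max_size_mb is not None and size_bytes > max_size_mb * 1024 * 1024:
        return False
    return True


allowed = parse_extensions("pdf,docx,txt")
ok = validate_upload("report.PDF", 1024, allowed, max_size_mb=5)
```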
Integrations
- Google Drive / OneDrive: enable connectors that allow users to load documents from cloud storage (availability depends on your deployment configuration).
- Google Drive typically requires a Google Cloud project, enabled APIs, OAuth client configuration, and environment variables such as:
  - GOOGLE_DRIVE_API_KEY
  - GOOGLE_DRIVE_CLIENT_ID
  - GOOGLE_REDIRECT_URI
Operations and maintenance (Danger Zone)
These controls impact system state and should be used carefully:
- Reset upload directory: clears the stored uploads.
- Reset vector storage/knowledge: clears the vector index and knowledge storage.
- Reindex knowledge base vectors: rebuilds vector indexes (useful after corruption, configuration changes, or certain upgrades).
Best practices
- Prefer hybrid search for mixed workloads (technical terms + semantic similarity).
- Keep Top K moderate (e.g., 5–15) to balance relevance and token costs.
- Use full context mode selectively for summarization and deep analysis tasks.
- Keep extraction quality high:
  - Enable OCR only when you have image-based documents.
  - Avoid overly aggressive image compression if OCR is needed.
- For Ollama deployments, ensure context length is sufficient (8192+ recommended for RAG-heavy use).
Quick checklist
- Documents ingest correctly (Docling endpoint reachable).
- OCR configured only if needed (language set correctly).
- Chunk size/overlap tuned to your document types.
- Embedding model chosen and stable (re-import required if changed).
- Retrieval configured (Top K, hybrid search, optional full context).
- RAG template enforces your citation and response rules.
- Maintenance actions documented (reset/reindex procedures).