Validated Patterns

RAG AI Quickstart

Validation status:
Sandbox Sandbox
Links:

About the RAG Quickstart pattern

Use retrieval-augmented generation (RAG) to enhance large language models with specialized data sources for more accurate and context-aware responses.

Use case
  • Deploy a RAG-powered chatbot that connects users to internal documentation through a single chat interface.

  • Explore retrieval-augmented generation capabilities including document ingestion, custom system prompts, and agent-based RAG.

  • Use a GitOps approach to provision AI infrastructure including LLM serving, vector storage, and safety guardrails.

    Based on the requirements of a specific implementation, certain details might differ. However, all Validated Patterns that are based on a portfolio architecture, generalize one or more successful deployments of a use case.

Background

This pattern is scaffolding around the RAG AI Quickstart. It provisions the OpenShift cluster with Red Hat OpenShift AI in a configuration suitable for LlamaStack. It deploys NFD and the NVIDIA GPU Operator for LLM inference on GPU nodes and manages secrets through the Validated Patterns framework. On AWS, GPU worker nodes can be provisioned automatically. By default, this pattern uses a CPU-based LLM.

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant external knowledge to improve accuracy, reduce hallucinations, and support domain-specific conversations.

The included demo application features FantaCo, a fictional large enterprise that launched a secure RAG chatbot connecting employees to HR, procurement, sales, and IT documentation. Users can explore the capabilities of RAG by:

  • Exploring FantaCo’s solution

  • Uploading new documents to be embedded

  • Tweaking sampling parameters to influence LLM responses

  • Using custom system prompts

  • Switching between simple and agent-based RAG

About the solution

This pattern deploys a complete RAG pipeline on a single OpenShift cluster by using a GitOps approach. The Validated Patterns framework handles infrastructure provisioning, including GPU operators, AI platform configuration, and secrets management. The RAG AI Quickstart delivers the application layer: document ingestion, embedding, retrieval, and LLM-powered chat.

The solution uses LlamaStack to standardize the building blocks of the AI stack with a consistent interface for model serving, vector storage, and safety guardrails. Kubeflow Pipelines ingests documents, embeds them, and stores them in PostgreSQL with PGVector. At query time, the system retrieves relevant embeddings to ground LLM responses in real data.

About the technology

The following technologies are used in this solution:

Red Hat OpenShift Container Platform

An enterprise-ready Kubernetes container platform built for an open hybrid cloud strategy. It provides a consistent application platform to manage hybrid cloud, public cloud, and edge deployments.

Red Hat OpenShift GitOps

A declarative application continuous delivery tool for Kubernetes based on the ArgoCD project. Application definitions, configurations, and environments are declarative and version controlled in Git.

Red Hat OpenShift AI

A flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. This pattern uses Red Hat OpenShift AI to serve the LLM inference endpoint.

LlamaStack

A standardized framework for building AI applications with Llama models. It provides consistent APIs for model inference, vector storage, safety guardrails, and agentic workflows.

PostgreSQL with PGVector

An open source relational database extended with PGVector for storing and querying vector embeddings used in document retrieval.

all-MiniLM-L6-v2

A sentence transformer model used to generate vector embeddings from documents and queries for similarity search.

Llama 3.2-3B-Instruct

The default large language model used for generating responses. The pattern also supports Llama 3.1-8B and Llama 3.3-70B-Instruct on GPU-equipped clusters.

Llama Guard 3

A safety model that provides content filtering and guardrails to block harmful requests and responses.

Kubeflow Pipelines

A platform for building and deploying ML workflows. This pattern uses Kubeflow Pipelines for document ingestion and embedding.

Streamlit

An open source Python framework used to build the RAG chatbot user interface.

RAG Quickstart architecture

The following figure provides a high-level overview of the RAG Quickstart architecture.

RAG System Architecture
Figure 1. RAG system architecture

The architecture consists of two main pipelines:

  • RAG Pipeline — Handles user queries and generates responses through LlamaStack APIs, with safety guardrails, model serving, and vector retrieval.

  • Ingestion Pipeline — Processes documents from multiple sources, generates embeddings, and stores them in the vector database.

RAG pipeline

The RAG pipeline processes user queries through the following components:

Frontend UI

Provides the user interface for submitting queries and viewing responses. The Streamlit-based UI communicates with the LlamaStack APIs by using REST.

LlamaStack APIs

The central orchestration layer that routes queries to the appropriate backend services. LlamaStack provides a standardized interface for model inference, vector retrieval, tool use, and safety guardrails.

Guard Rails

Screens both incoming queries and outgoing responses for harmful content using Llama Guard. Llama Guard checks incoming queries for prompt injection, manipulative content, and inappropriate requests. It also validates generated responses for harmful content and compliance before returning them to the user.

Model Servers

Serve the LLM for response generation. The pattern supports multiple serving backends including vLLM on Red Hat OpenShift AI, and Ollama for CPU-based deployments. The default model is meta-llama/Llama-3.2-3B-Instruct.

Vector DBs

Store document embeddings in PostgreSQL with PGVector. When a query arrives, the retriever converts it to a vector embedding and performs a similarity search to find relevant document chunks, which are passed as context to the LLM.

Tools

Provide agent-based capabilities for more complex workflows. When agent-based RAG is enabled, LlamaStack can invoke tools to perform multi-step reasoning and retrieval.

Ingestion pipeline

The ingestion pipeline processes documents and updates the knowledge base. Documents can be ingested from three sources:

S3 Bucket

Documents stored in S3-compatible object storage (MinIO) are processed through OpenShift AI Pipelines (Kubeflow) for batch ingestion.

URL

The system downloads documents from web URLs and processes them through a Python script for embedding.

Uploads

Users can upload documents directly through the frontend UI or retriever listener for on-demand ingestion.

All ingestion paths feed into the Retriever and Embedding Service, which uses Docling libraries to chunk documents into appropriate segments and the all-MiniLM-L6-v2 model to generate vector embeddings. The resulting embeddings are stored in PGVector for retrieval.

Deployment architecture

The following table describes the pod structure when deployed on OpenShift:

PodPurposeKey characteristics

Frontend

User interface

Streamlit-based UI, communicates with LlamaStack APIs by using REST

LlamaStack

RAG orchestration

Central application logic, routes queries to model servers, vector DBs, guard rails, and tools

LLM Service

Language model inference

Runs vLLM with Llama models, optimized for GPU utilization, deployed by using KServe InferenceService on Red Hat OpenShift AI

Guard Rails

Content moderation

Runs Llama Guard for input and output safety screening, can be independently scaled

Vector Database

Embedding storage and search

PostgreSQL with PGVector, requires persistent storage, deployed as StatefulSet

Embedding Service

Vector embeddings

Generates embeddings for documents and queries using all-MiniLM-L6-v2

Ingestion Pipeline

Document processing

Kubeflow Pipelines workflows, uses Docling for document chunking, connected to S3-compatible storage (MinIO)

Implementation technologies

ComponentTechnology

Application Framework

LlamaStack

LLM Service

vLLM with meta-llama/Llama-3.2-3B-Instruct

Vector Database

PostgreSQL + PGVector

Container Orchestration

Red Hat OpenShift Container Platform + Red Hat OpenShift AI

Safety Models

meta-llama/Llama-Guard-3-1B

Embedding Model

all-MiniLM-L6-v2

Document Processing

Docling

Pipeline Orchestration

Kubeflow Pipelines

Object Storage

MinIO (S3-compatible)

Frontend

Streamlit