Customizing the RAG AI Quickstart pattern
By default, this pattern runs a CPU-backed LLM and does not require a GPU. Running on CPU limits both the models you can use and the inference speed, so you might want to use a GPU instead.
Enabling GPU support
To enable GPU support, set global.device to gpu in values-global.yaml and push your changes to GitHub. This adds Node Feature Discovery (NFD) and the NVIDIA GPU Operator to the pattern installation and enables the models to run on an NVIDIA accelerator.
If you are running this pattern on an OpenShift cluster on AWS, setting |
Changing models
To update the models, edit overrides/values-cpu.yaml (if global.device is set to cpu) or overrides/values-gpu.yaml (if it is set to gpu).
The default CPU-based model is defined as follows:

global:
  models:
    llama-3-2-3b-instruct-cpu:
      id: meta-llama/Llama-3.2-3B-Instruct
      enabled: true
      resources:
        limits:
          cpu: "6"
          memory: 48Gi
        requests:
          cpu: "2"
          memory: 24Gi
      args:
        - --enable-auto-tool-choice
        - --chat-template
        - /chat-templates/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --dtype
        - auto
        - --max-model-len
        - "16384"
        - --max-num-seqs
        - "1"

You can change this to any vLLM-compatible model, provided that you have accepted its terms and conditions with the Hugging Face account your API token belongs to. You can also adjust the resource parameters as needed for your environment.
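As a rough sketch of such a change, the fragment below swaps the default CPU model for the smaller meta-llama/Llama-3.2-1B-Instruct in overrides/values-cpu.yaml. The choice of model, the entry name llama-3-2-1b-instruct-cpu, and the reduced resource figures are illustrative assumptions and are not taken from the pattern. The chat template and tool-call parser are carried over unchanged on the assumption that they also suit this Llama 3.2 variant.

```yaml
global:
  models:
    # Hypothetical entry name; chosen here only to mirror the default entry.
    llama-3-2-1b-instruct-cpu:
      # Illustrative model choice; accept its terms on Hugging Face first.
      id: meta-llama/Llama-3.2-1B-Instruct
      enabled: true
      resources:
        # Assumed figures for a smaller model; size these for your cluster.
        limits:
          cpu: "4"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
      args:
        - --enable-auto-tool-choice
        - --chat-template
        - /chat-templates/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --dtype
        - auto
        - --max-model-len
        - "16384"
        - --max-num-seqs
        - "1"
```

A model from a different family would likely need a different chat template and tool-call parser, so check the vLLM documentation for that model before reusing these arguments.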
The runtime defaults to vllm/vllm-openai:v0.11.1. If you need a later version, you can override the image:

llm-service:
  deviceConfigs:
    gpu:
      image: vllm/vllm-openai:nightly

The example above sets a GPU-specific container image. To override the CPU-based image instead, use the key |
Defining multiple models
You can define multiple LLM models to be served simultaneously. For example:

global:
  models:
    deepseek-r1:
      id: Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ
      enabled: true
      resources:
        limits:
          cpu: "32"
          memory: 200Gi
        requests:
          cpu: "24"
          memory: 150Gi
      args:
        - --reasoning-parser
        - deepseek_r1
        - --tool-call-parser
        - llama3_json
        - --enable-auto-tool-choice
        - --quantization
        - awq_marlin
        - --dtype
        - float16
        - --max-model-len
        - "65536"
    gpt-oss-120b:
      id: openai/gpt-oss-120b
      enabled: true
      resources:
        limits:
          cpu: "32"
          memory: 200Gi
        requests:
          cpu: "24"
          memory: 150Gi
      args:
        - --tool-call-parser
        - openai
        - --enable-auto-tool-choice

For a complete list of customizable values, see the AI Architecture charts repository.
