Customizing the RAG AI Quickstart pattern

By default, this pattern runs a CPU-backed LLM and does not require a GPU. Running on CPU limits both the models you can use and the inference speed, so you might want to use a GPU instead.

Enabling GPU support

To enable GPU support, set global.device to gpu in values-global.yaml and push your changes to GitHub. This adds Node Feature Discovery (NFD) and the NVIDIA GPU Operator to the pattern installation, which lets the models run on an NVIDIA accelerator.
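
As a minimal sketch, the relevant part of values-global.yaml might look like the following. Only the global.device key comes from the description above; any other keys already present in your file are assumed to stay unchanged.

global:
  # Switch from the default CPU-backed deployment to GPU
  device: gpu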

If you are running this pattern on an OpenShift cluster on AWS, setting global.device to gpu automatically creates a GPU (g6.2xlarge) machine and adds it as a worker node to your cluster.

Changing models

To update the models, edit overrides/values-cpu.yaml (if global.device is set to cpu) or overrides/values-gpu.yaml (if set to gpu).

The default CPU-based model is defined as follows:

global:
  models:
    llama-3-2-3b-instruct-cpu:
      id: meta-llama/Llama-3.2-3B-Instruct
      enabled: true
      resources:
        limits:
          cpu: "6"
          memory: 48Gi
        requests:
          cpu: "2"
          memory: 24Gi
      args:
        - --enable-auto-tool-choice
        - --chat-template
        - /chat-templates/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --dtype
        - auto
        - --max-model-len
        - "16384"
        - --max-num-seqs
        - "1"

You can change this to any vLLM-compatible model, provided that the Hugging Face account tied to your API token has accepted that model's terms and conditions. You can also adjust the resource parameters as needed for your environment.
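
For illustration, a swap to a smaller model in overrides/values-cpu.yaml could look roughly like this. The entry name llama-3-2-1b-instruct-cpu and the resource figures are hypothetical placeholders, not tested recommendations, and the args are copied from the default entry on the assumption that they also suit this model.

global:
  models:
    # Hypothetical entry name; replaces the default 3B entry
    llama-3-2-1b-instruct-cpu:
      id: meta-llama/Llama-3.2-1B-Instruct
      enabled: true
      resources:
        # Illustrative values only; size these for your cluster
        limits:
          cpu: "4"
          memory: 24Gi
        requests:
          cpu: "2"
          memory: 12Gi
      args:
        # Copied unchanged from the default CPU model definition
        - --enable-auto-tool-choice
        - --chat-template
        - /chat-templates/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --dtype
        - auto
        - --max-model-len
        - "16384"
        - --max-num-seqs
        - "1"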

The runtime defaults to vllm/vllm-openai:v0.11.1. If you need a later version, you can override the image:

llm-service:
  deviceConfigs:
    gpu:
      image: vllm/vllm-openai:nightly

The example above sets a GPU-specific container image. To override the CPU-based image instead, use the key llm-service.deviceConfigs.cpu.image.
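
A CPU override follows the same shape. In this sketch the image reference is a hypothetical placeholder; substitute a CPU-capable vLLM image that you have verified for your environment.

llm-service:
  deviceConfigs:
    cpu:
      # Hypothetical image reference, shown only to illustrate the key path
      image: registry.example.com/vllm-cpu:custom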

Defining multiple models

You can define multiple LLM models to be served simultaneously. For example:

global:
  models:
    deepseek-r1:
      id: Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ
      enabled: true
      resources:
        limits:
          cpu: "32"
          memory: 200Gi
        requests:
          cpu: "24"
          memory: 150Gi
      args:
        - --reasoning-parser
        - deepseek_r1
        - --tool-call-parser
        - llama3_json
        - --enable-auto-tool-choice
        - --quantization
        - awq_marlin
        - --dtype
        - float16
        - --max-model-len
        - "65536"
    gpt-oss-120b:
      id: openai/gpt-oss-120b
      enabled: true
      resources:
        limits:
          cpu: "32"
          memory: 200Gi
        requests:
          cpu: "24"
          memory: 150Gi
      args:
        - --tool-call-parser
        - openai
        - --enable-auto-tool-choice

For a complete list of customizable values, see the AI Architecture charts repository.
