Pattern

Lemonade Stand AI Quickstart

Status

Sandbox

Resources

About the Lemonade Stand AI Quickstart pattern

Deploy an AI chatbot with multi-layered safety guardrails on OpenShift to demonstrate content filtering, prompt injection defense, language enforcement, and real-time guardrail monitoring.

Use case

Deploy a guardrailed AI chatbot that filters harmful content, detects prompt injection attempts, enforces language constraints, and blocks competitor mentions through multiple safety layers.
Explore multi-layered AI safety techniques including model-based detectors (HAP, prompt injection), rule-based detectors (language enforcement), and regex-based filtering.
Use a GitOps approach to provision AI guardrails infrastructure including GPU-accelerated model serving, TrustyAI orchestration, and monitoring.

Background

This pattern builds on the Lemonade Stand Assistant AI Quickstart. It provisions the OpenShift cluster with Red Hat OpenShift AI configured for GPU-accelerated inference using vLLM. It deploys the NVIDIA GPU Operator for model serving on GPU nodes and manages secrets through the Validated Patterns framework using HashiCorp Vault and the External Secrets Operator. This pattern generalizes one or more successful deployments of this use case. Implementation details might vary depending on your specific environment and requirements.

Organizations can use the Lemonade Stand AI Quickstart to learn how to implement AI safety guardrails as a multi-layered pipeline. It demonstrates a production-ready approach to:

Serving a language model (Llama 3.2 3B Instruct) with GPU-accelerated inference through vLLM on KServe
Orchestrating multiple safety detectors through TrustyAI Guardrails Orchestrator (FMS Orchestr8) to validate both input and output
Detecting hate speech, abuse, and profanity with the IBM Granite Guardian HAP 125M model
Identifying prompt injection and jailbreak attempts with the DeBERTa v3 Base model
Enforcing language constraints with the Lingua rule-based detector
Blocking competitor product mentions with regex-based pre-filtering in 13+ languages
Monitoring guardrail activation rates in real time through an R Shiny dashboard

About the solution

This pattern deploys a complete guardrailed chatbot on a single OpenShift cluster by using a GitOps approach. You use the Validated Patterns framework to provision infrastructure, including GPU operators, AI platform configuration, and secrets management. The Lemonade Stand Assistant AI Quickstart delivers the application layer: model serving, guardrails orchestration, safety detectors, and monitoring.

User queries flow through a multi-stage safety pipeline. The FastAPI application first applies a regular expression pre-filter to block competitor fruit names across 13 languages. The application forwards queries that pass to the TrustyAI Guardrails Orchestrator, which chains three detector models in sequence: Lingua (language enforcement), IBM Granite Guardian HAP 125M (content safety), and DeBERTa v3 Base (prompt injection detection). Only validated queries reach the LLM. The same detectors validate model output before streaming it back to the user. An R Shiny dashboard provides real-time visibility into guardrail activation rates by polling Prometheus metrics from the FastAPI application.

About the technology

This solution uses the following technologies:

Red Hat OpenShift Container Platform: An enterprise-ready Kubernetes container platform built for an open hybrid cloud strategy. It provides a consistent application platform to manage hybrid cloud, public cloud, and edge deployments.
Red Hat OpenShift GitOps: A declarative application continuous delivery tool for Kubernetes based on the ArgoCD project. Application definitions, configurations, and environments are declarative and version controlled in Git.
Red Hat OpenShift AI: A flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. This pattern uses Red Hat OpenShift AI to manage GPU-accelerated model serving with vLLM via KServe.
TrustyAI / FMS Guardrails Orchestrator: A guardrails orchestration framework that chains multiple safety detectors to validate both input and output of LLM interactions. This pattern uses the FMS Orchestr8 configuration to pipeline Lingua, HAP, and prompt injection detectors.
vLLM: A high-throughput, memory-efficient inference engine for large language models. vLLM serves the Llama 3.2 model with optimized GPU use.
Llama 3.2 3B Instruct: A compact instruction-tuned language model from Meta. This pattern serves the FP8-quantized variant for efficient GPU inference as the chatbot’s conversational backend.
IBM Granite Guardian HAP 125M: A lightweight hate, abuse, and profanity detection model from IBM. This pattern uses it to filter harmful content in both user queries and model responses.
DeBERTa v3 Base Prompt Injection: A fine-tuned DeBERTa model for detecting prompt injection and jailbreak attempts. This pattern uses it to identify attempts to override the chatbot’s system instructions.
Lingua: A rule-based natural language detection library. This pattern uses Lingua to enforce English-only communication with the chatbot.
FMS Chunkers: A sentence-level text chunking service that segments user input and model output for individual detector evaluation.
MinIO: An S3-compatible object storage service. This pattern uses MinIO to store detector model artifacts that are downloaded from Hugging Face during initialization.
R Shiny: A web application framework for R. This pattern uses an R Shiny dashboard to visualize guardrail activation metrics in real time, displaying request counts, blocked inputs and outputs, and per-detector activation rates.
Prometheus: An open source monitoring and alerting toolkit. This pattern uses Prometheus-format metrics exposed by the FastAPI application to track guardrail activations.

Lemonade Stand AI Quickstart architecture

The following figure shows the Lemonade Stand AI Quickstart architecture.

Figure 1. Lemonade Stand AI Quickstart system architecture

The architecture consists of three main layers:

Guardrails Layer — Chains multiple safety detectors through the TrustyAI Guardrails Orchestrator to filter both input and output for content safety, prompt injection, and language compliance.
Inference Layer — Serves the Llama 3.2 3B Instruct model through vLLM with GPU acceleration on KServe.
Application Layer — Provides the chatbot UI with regex-based pre-filtering, Prometheus metrics, and a real-time R Shiny monitoring dashboard.

Guardrails layer

The guardrails layer validates both user input and model output through a chain of safety detectors:

TrustyAI Guardrails Orchestrator (FMS Orchestr8): Orchestrates multiple detector models in a configurable pipeline. Each detector evaluates the input or output independently, and the orchestrator aggregates results to determine whether to allow or block the content.
IBM Granite Guardian HAP 125M: A lightweight model that detects hate speech, abuse, and profanity. Runs on CPU by default with optional GPU acceleration. Threshold: 0.5.
DeBERTa v3 Base Prompt Injection Detector: A fine-tuned DeBERTa model that identifies prompt injection and jailbreak attempts. Runs on CPU by default with optional GPU acceleration. Threshold: 0.5.
Lingua Language Detector: A rule-based language detection service that enforces English-only communication. Deployed as a Kubernetes Deployment (not KServe). Threshold: 0.88.
FMS Chunkers: A gRPC-based text chunking service that segments input and output into sentences for individual detector evaluation.

Inference layer

The inference layer serves the language model and processes chat requests:

vLLM Model Server: Serves the Llama 3.2 3B Instruct model (FP8-quantized) with GPU acceleration. Managed by Red Hat OpenShift AI as a KServe InferenceService with optimized settings including chunked prefill and 95% GPU memory utilization.
NVIDIA GPU Operator: Manages NVIDIA GPU drivers, device plugins, and monitoring on worker nodes. Ensures GPUs are configured and available for model serving workloads.

Application layer

The application layer provides the user interface, pre-filtering, and monitoring:

Lemonade Stand FastAPI Application: A Python FastAPI application that serves the chatbot UI on port 8080. It implements a local regular expression-based detector that blocks competitor fruit names across 13+ languages before forwarding queries to the Guardrails Orchestrator. It also exposes a /metrics endpoint with Prometheus-format guardrail activation metrics and streams LLM responses to the browser through Server-Sent Events (SSE).
R Shiny Monitoring Dashboard: A real-time monitoring dashboard that visualizes guardrail activation metrics. It polls the FastAPI /metrics endpoint and displays total request counts, blocked input and output counts, approved requests, and per-detector activation breakdowns.
MinIO: An S3-compatible object storage service that stores detector model artifacts. Models are downloaded from Hugging Face during initialization and served to the KServe detector runtimes.

Data flow

The following describes the request flow through the guardrails pipeline:

The user sends a message through the chatbot UI.
The FastAPI application applies the regular expression pre-filter to check for competitor fruit names in 13+ languages. If detected, the request is blocked immediately.
The application forwards the query to the TrustyAI Guardrails Orchestrator over HTTPS.
The Guardrails Orchestrator chains the detectors in sequence:
1. Lingua checks that the input is in English.
2. IBM Granite Guardian HAP checks for hate speech, abuse, and profanity.
3. DeBERTa v3 Prompt Injection checks for jailbreak attempts.
If all detectors pass, the orchestrator forwards the query to the vLLM model server.
The LLM generates a response. The same detector chain validates the output (output guardrails).
The validated response streams back to the user through SSE.
The FastAPI application records Prometheus metrics for each detector activation.
The R Shiny dashboard polls the /metrics endpoint to display real-time guardrail statistics.

Deployment architecture

The following table describes the pod structure when you deploy on OpenShift:

Pod	Purpose	Characteristics
Lemonade Stand App	Chatbot UI and API	FastAPI on port 8080, regex pre-filter for competitor fruits, Prometheus metrics endpoint, SSE response streaming
Guardrails Orchestrator	Safety pipeline orchestration	TrustyAI / FMS Orchestr8 on port 8032, chains detector models in sequence for input and output validation
vLLM Model Server	LLM inference	Llama 3.2 3B Instruct (FP8), GPU-accelerated, KServe InferenceService managed by Red Hat OpenShift AI
HAP Detector	Content safety	IBM Granite Guardian HAP 125M, CPU or optional GPU, detects hate speech, abuse, and profanity
Prompt Injection Detector	Jailbreak defense	DeBERTa v3 Base, CPU or optional GPU, detects prompt injection and manipulation attempts
Lingua Detector	Language enforcement	Rule-based English-only detection, lightweight CPU deployment
Chunker Service	Text segmentation	FMS Chunkers, gRPC on port 8085, sentence-level splitting for detector input
MinIO	Model storage	S3-compatible object storage, stores detector model artifacts downloaded from Hugging Face
R Shiny Dashboard	Monitoring	Real-time guardrail activation visualization, polls Prometheus metrics from the FastAPI app
Vault	Secrets management	Stores vLLM API key and other credentials, synced to the cluster by the External Secrets Operator

Pod

Purpose

Characteristics

Lemonade Stand App

Chatbot UI and API

FastAPI on port 8080, regex pre-filter for competitor fruits, Prometheus metrics endpoint, SSE response streaming

Guardrails Orchestrator

Safety pipeline orchestration

TrustyAI / FMS Orchestr8 on port 8032, chains detector models in sequence for input and output validation

vLLM Model Server

LLM inference

Llama 3.2 3B Instruct (FP8), GPU-accelerated, KServe InferenceService managed by Red Hat OpenShift AI

HAP Detector

Content safety

IBM Granite Guardian HAP 125M, CPU or optional GPU, detects hate speech, abuse, and profanity

Prompt Injection Detector

Jailbreak defense

DeBERTa v3 Base, CPU or optional GPU, detects prompt injection and manipulation attempts

Lingua Detector

Language enforcement

Rule-based English-only detection, lightweight CPU deployment

Chunker Service

Text segmentation

FMS Chunkers, gRPC on port 8085, sentence-level splitting for detector input

MinIO

Model storage

S3-compatible object storage, stores detector model artifacts downloaded from Hugging Face

R Shiny Dashboard

Monitoring

Real-time guardrail activation visualization, polls Prometheus metrics from the FastAPI app

Vault

Secrets management

Stores vLLM API key and other credentials, synced to the cluster by the External Secrets Operator

Implementation technologies

Component	Technology
Application Framework	FastAPI (Python)
LLM Service	vLLM with meta-llama/Llama-3.2-3B-Instruct
Guardrails Orchestration	TrustyAI / FMS Orchestr8
Content Safety	IBM Granite Guardian HAP 125M
Prompt Injection Detection	DeBERTa v3 Base
Language Detection	Lingua
Text Chunking	FMS Chunkers
Container Orchestration	Red Hat OpenShift Container Platform + Red Hat OpenShift AI
GPU Management	NVIDIA GPU Operator
Object Storage	MinIO (S3-compatible)
Monitoring	R Shiny + Prometheus
Secrets Management	HashiCorp Vault + External Secrets Operator

Component

Technology

Application Framework

FastAPI (Python)

LLM Service

vLLM with meta-llama/Llama-3.2-3B-Instruct

Guardrails Orchestration

TrustyAI / FMS Orchestr8

Content Safety

IBM Granite Guardian HAP 125M

Prompt Injection Detection

DeBERTa v3 Base

Language Detection

Lingua

Text Chunking

FMS Chunkers

Container Orchestration

Red Hat OpenShift Container Platform + Red Hat OpenShift AI

GPU Management

NVIDIA GPU Operator

Object Storage

MinIO (S3-compatible)

Monitoring

R Shiny + Prometheus

Secrets Management

HashiCorp Vault + External Secrets Operator

Next steps

Edit this page Open a documentation issue