Troubleshooting confidential containers deployments
This page provides solutions to common issues encountered when deploying and operating the Confidential Containers pattern.
General CoCo issues
- Problem
CoCo pods stuck in Pending or ContainerCreating state
- Solution
This is most commonly caused by incomplete MachineConfig application or KataConfig not being ready.
Check if nodes have finished rebooting after MachineConfig updates:
oc get nodes
oc get mcp
Wait for all MachineConfigPools to show UPDATED=True and DEGRADED=False.
Verify the KataConfig is ready:
oc get kataconfig -n openshift-sandboxed-containers-operator
The status should show InProgress: False and the RuntimeClasses should be created (kata-remote for Azure, kata-cc for bare metal).
If the KataConfig is stuck, check the operator logs:
oc logs -n openshift-sandboxed-containers-operator \
  -l name=openshift-sandboxed-containers-operator -f
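To avoid polling by hand, you can block until the pools settle and then confirm the RuntimeClass exists. A minimal sketch, assuming the default master and worker MachineConfigPools:
# Wait for both default MachineConfigPools to report Updated
oc wait mcp/master mcp/worker --for=condition=Updated --timeout=30m
# Confirm the expected RuntimeClass exists (kata-remote on Azure, kata-cc on bare metal)
oc get runtimeclass | grep -E 'kata-remote|kata-cc'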
- Problem
ArgoCD applications not syncing or showing timeouts
- Solution
Check application dependencies and sync order. Some applications depend on others being ready first.
Note: ArgoCD applications are deployed in per-clusterGroup namespaces, not in openshift-gitops. Use oc get applications -A to locate them.
View application health across all namespaces:
oc get applications -A
For stuck applications, check sync status and errors:
oc describe application <app-name> -n <app-namespace>
Common dependency order issues:
Vault must be ready before external-secrets applications
Kyverno must be deployed before workload applications that need cc_init_data injection
cert-manager must be ready before Trustee (which depends on certificates)
Manually sync the stuck application (use --force if needed):
argocd app sync <app-name> --force
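To spot every out-of-sync or unhealthy application in one pass, a convenience one-liner, assuming jq is installed (the sync and health fields are standard Argo CD Application status fields):
oc get applications -A -o json | jq -r '
  .items[]
  | select(.status.sync.status != "Synced" or .status.health.status != "Healthy")
  | "\(.metadata.namespace)/\(.metadata.name): sync=\(.status.sync.status) health=\(.status.health.status)"'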
- Problem
Peer-pod VM provisioning failures on Azure
- Solution
Verify Azure quota, region support, and networking configuration.
The pattern defaults to Standard_DCas_v5 VMs, but you can configure other Azure confidential VM families in values-global.yaml by changing the VM size parameters.
Check that your Azure region supports your chosen confidential VM family (default: Standard_DCas_v5): visit https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/ and search for your VM family in your target region.
Verify quota for confidential VM sizes in your subscription:
Navigate to Azure Portal > Subscriptions > Usage + quotas and filter for "DC" or "EC" families depending on your chosen VM type. Request a quota increase if needed.
Check sandboxed containers operator logs for Azure API errors:
oc logs -n openshift-sandboxed-containers-operator \
  -l name=openshift-sandboxed-containers-operator --tail=100
Verify Azure service principal credentials in Vault:
oc exec -n vault vault-0 -- vault kv get secret/hub/azure
Ensure values-global.yaml has correct Azure networking values (clusterSubnet, clusterNSG, clusterResGroup).
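From your pattern checkout, a quick sanity check is to grep those keys directly; this is a sketch, since the exact nesting depends on your values-global.yaml layout:
grep -nE 'clusterSubnet|clusterNSG|clusterResGroup' values-global.yaml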
- Problem
oc exec denied unexpectedly into a confidential container
- Solution
This is expected behavior for containers with strict policies. Verify which policy the pod is using.
Check the pod’s initdata annotation:
oc get pod <pod-name> -n <namespace> -o yaml | grep coco.io/initdata-configmap
Pods using the initdata ConfigMap have strict policies that deny exec. Pods using debug-initdata allow exec.
The strict policy is a security feature, not a bug. To test CDH functionality interactively, use a pod annotated with coco.io/initdata-configmap: debug-initdata (like the insecure-policy pod in hello-openshift).
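From such a permissive pod you can exercise CDH directly. A hedged sketch, assuming the guest-components REST API listens on its default 127.0.0.1:8006, that curl exists in the container image, and that a resource at default/kbsres1/key1 (a placeholder path) is registered in Trustee:
# Exec succeeds here only because the pod uses the permissive debug-initdata policy
oc exec -n <namespace> <insecure-policy-pod> -- \
  curl -s http://127.0.0.1:8006/cdh/resource/default/kbsres1/key1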
- Problem
CDH not returning secrets or attestation failures
- Solution
Verify KBS TLS certificate propagation and attestation policy configuration.
Check that initdata ConfigMaps exist and contain the KBS TLS certificate:
oc get configmap -n <workload-namespace> initdata -o yaml | grep INITDATA
If the ConfigMap is missing, check if Kyverno propagated it:
oc get configmap -n imperative -l coco.io/type=initdata
The source ConfigMap should exist in the imperative namespace. If it is missing, the init-data-gzipper job may have failed:
oc logs -n imperative jobs/init-data-gzipper --tail=50
Check KBS logs for attestation errors:
oc logs -n trustee-operator-system -l app=kbs -f
Look for messages like Attestation verification failed or PCR mismatch.
Kyverno-specific issues
- Problem
cc_init_data annotation not injected into CoCo pods
- Solution
Verify Kyverno is running and the CoCo pod has the required annotation trigger.
Check Kyverno pods are healthy:
oc get pods -n kyverno
All Kyverno pods should be Running. If not, check logs:
oc logs -n kyverno -l app.kubernetes.io/component=admission-controller
Verify the pod has the coco.io/initdata-configmap annotation:
oc get pod <pod-name> -n <namespace> -o yaml | grep coco.io/initdata-configmap
If it is missing, the deployment/pod template must include this annotation. Kyverno only injects cc_init_data if this annotation is present.
Check if the Kyverno policy exists:
oc get clusterpolicy inject-coco-initdata
Review policy status and events:
oc describe clusterpolicy inject-coco-initdata
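To confirm the mutation fires without deploying anything, a server-side dry-run still passes through mutating admission. A sketch, assuming Kyverno's webhook permits dry-run requests; the namespace and image are placeholders, and the RuntimeClass should match your platform (kata-remote on Azure, kata-cc on bare metal):
oc apply --dry-run=server -o yaml -f - <<'EOF' | grep cc_init_data
apiVersion: v1
kind: Pod
metadata:
  name: kyverno-mutation-probe
  namespace: <workload-namespace>
  annotations:
    coco.io/initdata-configmap: debug-initdata
spec:
  runtimeClassName: kata-remote
  containers:
  - name: probe
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "3600"]
EOF
If the grep returns the io.katacontainers.config.hypervisor.cc_init_data annotation, injection is working.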
- Problem
initdata ConfigMap validation failures
- Solution
The ConfigMap is missing required fields. Check the ValidatingPolicy for requirements.
View validation policy:
oc get validatingpolicy validate-initdata-configmap -o yaml
Required fields in initdata ConfigMaps (a quick field check follows the commands below):
version
algorithm (sha256, sha384, or sha512)
policy.rego (OPA policy)
aa.toml (attestation agent config)
cdh.toml (confidential data hub config)
Check Kyverno policy reports for validation errors:
oc get policyreport -A
oc describe policyreport <report-name> -n <namespace>
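A quick way to check an existing ConfigMap against that field list, as a sketch (it only greps for the field names, so it cannot validate their contents):
for f in version algorithm policy.rego aa.toml cdh.toml; do
  oc get configmap <initdata-configmap> -n <namespace> -o yaml | grep -q "$f" \
    && echo "OK: $f" || echo "MISSING: $f"
done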
- Problem
CoCo pods not picking up new initdata after cert rotation or KBS TLS changes
- Solution
Kyverno’s autogen is disabled by design to ensure rollout restarts pick up new initdata. You must manually restart deployments.
Rollout restart the deployment to pick up new initdata:
oc rollout restart deployment/<deployment-name> -n <namespace>
Verify the new CoCo pods have the updated cc_init_data annotation:
oc get pod <new-pod-name> -n <namespace> -o yaml | \
  grep io.katacontainers.config.hypervisor.cc_init_data
The annotation value should be a long base64-encoded string. If it matches the old value, the ConfigMap may not have been updated. Check the source ConfigMap in the imperative namespace.
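If several deployments in a namespace consume the same initdata, a loop saves repetition. A sketch, noting it restarts every deployment in the namespace, so scope it accordingly:
for d in $(oc get deployments -n <namespace> -o name); do
  oc rollout restart "$d" -n <namespace>
done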
Bare metal issues
- Problem
NFD not detecting TDX or SEV-SNP capabilities
- Solution
Verify BIOS/firmware configuration and kernel module loading.
For Intel TDX:
Check if TDX is enabled in BIOS. Consult your hardware vendor’s documentation for TEE enablement.
Verify the TDX kernel module is loaded:
oc debug node/<node-name> -- chroot /host lsmod | grep tdx
Expected output should include kvm_intel with TDX support.
Check NFD worker logs:
oc logs -n openshift-nfd -l app=nfd-worker
For AMD SEV-SNP:
Check if SEV-SNP is enabled in BIOS.
Verify SEV capabilities:
oc debug node/<node-name> -- chroot /host cat /sys/module/kvm_amd/parameters/sev
Expected output: Y (enabled)
- Problem
PCCS service not starting (Intel TDX)
- Solution
Verify the Intel PCS API key is configured correctly in secrets.
Check PCCS pod logs:
oc logs -n intel-dcap deployment/pccs-deployment
Look for authentication errors or missing API key messages.
Verify the PCCS secret exists and contains the API key:
oc get secret -n intel-dcap pccs-api-key -o yaml
The secret should have a PCCS_API_KEY field (base64 encoded).
If the secret is missing or incorrect, update ~/values-secret-coco-pattern.yaml with your Intel PCS API key and re-run:
./pattern.sh make upgrade
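While debugging you can also repair the secret directly, using the secret name and key shown above; this is a sketch only, and the pattern's upgrade flow remains the supported path:
oc create secret generic pccs-api-key -n intel-dcap \
  --from-literal=PCCS_API_KEY='<your-intel-pcs-api-key>' \
  --dry-run=client -o yaml | oc apply -f -
# Restart PCCS so it picks up the new key
oc rollout restart deployment/pccs-deployment -n intel-dcap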
- Problem
QGS DaemonSet not scheduling (Intel TDX)
- Solution
QGS requires nodes labeled with TDX capability. Verify NFD labeled the nodes correctly.
Check node labels:
oc get nodes --show-labels | grep tdx
Nodes with TDX should have intel.feature.node.kubernetes.io/tdx=true.
If labels are missing, NFD may not have detected TDX. See the "NFD not detecting TDX or SEV-SNP capabilities" troubleshooting above.
Check QGS DaemonSet status:
oc get daemonset -n intel-dcap qgs-daemonset
If DESIRED is 0, no nodes match the nodeSelector. If DESIRED > 0 but READY is 0, check pod events:
oc describe pod -n intel-dcap -l app=qgs
- Problem
KataConfig not creating RuntimeClass on bare metal
- Solution
This can be a timing issue where the operator has not finished reconciling. Check operator logs.
Verify the KataConfig CR exists:
oc get kataconfig -n openshift-sandboxed-containers-operator
Check the KataConfig status and conditions:
oc describe kataconfig -n openshift-sandboxed-containers-operator
Check sandboxed containers operator logs for errors:
oc logs -n openshift-sandboxed-containers-operator \
  -l name=openshift-sandboxed-containers-operator --tail=100
If the RuntimeClass is missing after 10+ minutes, manually trigger reconciliation by adding an annotation:
oc annotate kataconfig example-kataconfig \
  reconcile-trigger="$(date)" --overwrite
GPU issues
- Problem
GPU Operator install plan pending (requires manual approval)
- Solution
This is expected behavior. The pattern uses manual install plan approval for version control.
List pending install plans:
oc get installplan -n nvidia-gpu-operator
Approve the install plan:
oc patch installplan <install-plan-name> -n nvidia-gpu-operator \
  --type merge -p '{"spec":{"approved":true}}'
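To approve whatever plan is pending without copying its generated name, a sketch assuming a single unapproved InstallPlan in the namespace:
plan=$(oc get installplan -n nvidia-gpu-operator \
  -o jsonpath='{.items[?(@.spec.approved==false)].metadata.name}')
oc patch installplan "$plan" -n nvidia-gpu-operator \
  --type merge -p '{"spec":{"approved":true}}'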
- Problem
kata-cc-nvidia-gpu RuntimeClass missing
- Solution
This is often a timing issue. The GPU reconciliation job should trigger RuntimeClass creation.
Check if the reconcile-kataconfig-gpu job has run:
oc get jobs -n imperative reconcile-kataconfig-gpu
Check job logs:
oc logs -n imperative jobs/reconcile-kataconfig-gpu
If the job hasn’t run, it may be waiting for GPU nodes to be labeled. Verify the GPU Operator labeled the nodes:
oc get nodes --show-labels | grep nvidia
Nodes with GPUs should have nvidia.com/gpu.present=true.
Manually trigger KataConfig reconciliation:
oc annotate kataconfig example-kataconfig \
  reconcile-trigger="$(date)" --overwrite -n openshift-sandboxed-containers-operator
- Problem
GPU workload stuck in Pending state
- Solution
Verify IOMMU is enabled and GPUs are bound to VFIO driver.
Check pod events:
oc describe pod <gpu-pod-name> -n gpu-workload
Common issues:
IOMMU not enabled: Check kernel parameters:
oc debug node/<node-name> -- chroot /host cat /proc/cmdline | grep iommu
Expected: intel_iommu=on (Intel) or amd_iommu=on (AMD)
If missing, verify the MachineConfig applied:
oc get mc | grep iommu
Nodes must reboot for IOMMU kernel parameters to take effect.
GPU not bound to VFIO: Check GPU driver binding:
oc debug node/<gpu-node> -- chroot /host lspci -nnk -d 10de:
GPUs should show Kernel driver in use: vfio-pci. If not, check VFIO manager logs:
oc logs -n nvidia-gpu-operator -l app=nvidia-vfio-manager
- Problem
CC Manager not enabling confidential mode on GPU
- Solution
Verify the GPU firmware supports confidential computing and CC Manager is configured correctly.
Check GPU CC Manager logs:
oc logs -n nvidia-gpu-operator -l app=nvidia-cc-manager
Verify the GPU supports CC mode:
oc debug node/<gpu-node> -- chroot /host nvidia-smi -q | grep "CC Mode"
Expected output: CC Mode: Enabled
If CC mode is not supported, the GPU firmware may not have confidential computing capabilities. NVIDIA confidential GPUs (H100, H200, B100, B200) with specific firmware versions support CC mode. Consult NVIDIA confidential computing documentation for supported GPU models and firmware requirements.
Attestation issues
- Problem
Attestation failing with PCR mismatch
- Solution
PCR measurements are stale or were extracted from a different image version.
Check KBS logs for the specific PCR that failed:
oc logs -n trustee-operator-system -l app=kbs --tail=100 | grep PCR
Re-extract PCR measurements from the current peer-pod image:
For Azure:
bash scripts/get-pcr.sh
For bare metal: follow the manual PCR collection procedure for your hardware. See tested environments for guidance.
Update ~/values-secret-coco-pattern.yaml with the new measurements and refresh Vault:
./pattern.sh make upgrade
oc rollout restart deployment/kbs-deployment -n trustee-operator-system
- Problem
TDX attestation failures (Intel)
- Solution
Verify the collateral service (PCCS) is reachable and caching quotes.
Check if Trustee can reach PCCS:
oc exec -n trustee-operator-system deployment/kbs-deployment -- \
  curl -k https://pccs-service.intel-dcap.svc.cluster.local:8042/version
Expected: JSON response with PCCS version information.
If connection fails, verify PCCS is running:
oc get pods -n intel-dcap -l app=pccs
Check PCCS logs for errors fetching collateral from the Intel PCS API:
oc logs -n intel-dcap deployment/pccs-deployment | grep -i error
Verify the Trustee KBS configuration points to the correct PCCS service:
oc get configmap -n trustee-operator-system kbs-config -o yaml | grep collateralService
Expected: pccs-service.intel-dcap.svc.cluster.local:8042
- Problem
SEV-SNP attestation failures (AMD)
- Solution
Verify SEV-SNP is enabled in firmware and certificate chain verification is working.
Check if SEV-SNP is enabled at the kernel level:
oc debug node/<node-name> -- chroot /host cat /sys/module/kvm_amd/parameters/sev
Expected: Y (enabled)
Verify SEV-SNP is enabled in BIOS. Consult the AMD SEV developer documentation and your hardware vendor’s BIOS documentation for SEV-SNP enablement procedures.
Check KBS logs for certificate chain verification errors:
oc logs -n trustee-operator-system -l app=kbs --tail=100 | grep -i "cert\|sev"
AMD SEV-SNP uses a certificate chain-based attestation model, so no external collateral service (like PCCS) is required. The certificate chain is embedded in the attestation evidence.
Operational issues
- Problem
Vault secrets not loaded after initial deployment
- Solution
MCO-driven node reboots during initial pattern deployment can cause Vault secret loading to time out.
Wait for all nodes to finish rebooting:
oc get mcp
All MachineConfigPools should show UPDATED=True and DEGRADED=False.
Re-trigger secret loading:
./pattern.sh make upgrade
Verify Vault is unsealed and healthy:
oc get pods -n vault
oc exec -n vault vault-0 -- vault status
If Vault is sealed, follow the Vault unsealing procedure documented in the Validated Patterns framework.
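For reference, manual unsealing looks like the following sketch; the framework normally unseals Vault automatically, and a quorum of unseal key shares is required:
# Repeat once per required key share until vault status shows Sealed: false
oc exec -n vault vault-0 -- vault operator unseal '<unseal-key-share>'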
- Problem
CoCo pods starting before cc_init_data annotations are ready
- Solution
CoCo pods may start before Kyverno injects the cc_init_data annotation, causing attestation failures.
Delete the pod to trigger recreation with correct annotations:
oc delete pod <pod-name> -n <namespace>
The deployment will recreate the pod, and Kyverno will inject the cc_init_data annotation during admission.
Verify the new pod has the annotation:
oc get pod <new-pod-name> -n <namespace> -o yaml | \
  grep io.katacontainers.config.hypervisor.cc_init_data
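If many pods raced ahead of Kyverno, the following sketch deletes only the pods that lack the injected annotation (annotation name as shown above; the backslash-escaped dots are required jsonpath syntax):
for p in $(oc get pods -n <namespace> -o name); do
  val=$(oc get "$p" -n <namespace> \
    -o jsonpath='{.metadata.annotations.io\.katacontainers\.config\.hypervisor\.cc_init_data}')
  [ -z "$val" ] && oc delete "$p" -n <namespace>
done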
- Problem
TDX attestation failures after cluster rebuild (SGX registration not reset)
- Solution
Stale SGX registration state persists in BIOS/firmware after rebuilding a bare metal TDX cluster.
Before rebuilding a TDX cluster, perform an SGX factory reset in BIOS. The exact procedure varies by hardware vendor. Consult your server vendor’s BIOS documentation or the Intel TDX BIOS setup guide for reset procedures.
Common BIOS settings to check:
SGX Factory Reset (enables clearing of previous registration)
TDX enablement (must be re-enabled after SGX reset)
TME (Total Memory Encryption) settings
Without an SGX reset, the platform’s attestation evidence will not match expected values and Trustee will reject attestation requests.
- Problem
Confidential containers failing due to TEE not enabled in BIOS
- Solution
Verify that TDX or SEV-SNP is actually enabled at the BIOS/firmware level.
For Intel TDX:
Check BIOS settings according to the Intel TDX BIOS setup guide.
Verify TDX is detected by the kernel:
oc debug node/<node-name> -- chroot /host dmesg | grep -i tdx
Expected: messages indicating TDX initialization succeeded.
For AMD SEV-SNP:
Check BIOS settings according to the AMD SEV developer documentation and your hardware vendor’s TEE enablement guide.
Verify SEV-SNP is detected by the kernel:
oc debug node/<node-name> -- chroot /host dmesg | grep -i sev
Expected: messages indicating SEV-SNP initialization succeeded.
If TEE capabilities are not detected at the kernel level, Node Feature Discovery (NFD) will not label nodes, and confidential runtime classes will not be schedulable. Fix the BIOS configuration before proceeding with pattern deployment.
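As a final pre-deployment check covering both vendors, you can summarize TEE-related node labels in one pass. A sketch assuming jq is installed; the exact label keys depend on your NFD rules, so adjust the pattern to match:
oc get nodes -o json | jq -r '
  .items[]
  | .metadata.name as $n
  | .metadata.labels | to_entries[]
  | select(.key | test("tdx|sev"; "i"))
  | "\($n) \(.key)=\(.value)"'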
