Validated Patterns

Troubleshooting common Pattern Deployment issues


Validated Pattern installation process is stuck on deploying Vault


The most common cause is unmet prerequisites, such as the Image Registry not being set up or CephFS not being set as the default StorageClass. Refer to Getting started → Prerequisites and make sure every prerequisite is met before deploying the pattern.
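The default StorageClass prerequisite can be checked from the cluster's StorageClass list. Kubernetes marks the default class with the annotation storageclass.kubernetes.io/is-default-class set to "true". The following is a minimal sketch, assuming you have loaded the output of `oc get storageclass -o json` into a Python dictionary; the function name and the sample class names are hypothetical and used only for illustration.

```python
from typing import List

# Annotation Kubernetes uses to mark the default StorageClass.
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storageclass_list: dict) -> List[str]:
    """Return the names of all StorageClasses annotated as default."""
    defaults = []
    for item in storageclass_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            defaults.append(metadata.get("name", ""))
    return defaults


# Illustrative data shaped like `oc get storageclass -o json` output.
# The class names are placeholders, not the names used on your cluster.
sample = {
    "items": [
        {"metadata": {"name": "example-cephfs",
                      "annotations": {DEFAULT_ANNOTATION: "true"}}},
        {"metadata": {"name": "example-rbd", "annotations": {}}},
    ]
}

print(default_storage_classes(sample))  # ['example-cephfs']
```

An empty result, or a result that does not contain the CephFS class, suggests the prerequisite is not met.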


Downloading the Llama-2-70b-chat-hf model with the download-model Jupyter notebook fails, or TGI deployment fails after the model is downloaded


This is most often caused by network errors during the model download. If you are not sure that the whole model was downloaded, clear the model-bucket bucket in RGW storage using aws-cli or any other method, then download the model again.
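As an illustration of the aws-cli route, the sketch below only assembles the command line so that it can be reviewed before it is run. It assumes aws-cli is already configured with credentials for the RGW endpoint; the endpoint URL shown is a placeholder, and the helper name is hypothetical.

```python
import shlex
from typing import List


def build_clear_bucket_command(endpoint_url: str,
                               bucket: str = "model-bucket") -> List[str]:
    """Build an aws-cli command that deletes every object in the bucket."""
    return [
        "aws", "s3", "rm", f"s3://{bucket}",
        "--recursive",                    # remove all objects under the bucket
        "--endpoint-url", endpoint_url,   # point aws-cli at RGW instead of AWS
    ]


# Placeholder endpoint: replace with the RGW route of your cluster.
cmd = build_clear_bucket_command("https://rgw.example.com")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run once the endpoint has been verified. The command deletes the bucket contents permanently, so check the bucket name first.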


Builds of OPEA chat resources are failing because they cannot pull images


If all Build resources fail with image pull errors, make sure that there is no proxy issue; refer to Getting started → Procedure. Also make sure that the Image Registry is properly set up.


TGI or TEI pods are showing errors


Consult the Troubleshooting PyTorch Model documentation for general Gaudi 2 troubleshooting steps. Gathering extended logs, as described in the troubleshooting steps for "RuntimeError: tensor does not have a device", might be helpful. Then inspect the /var/log/habana_logs directory in the container to see logs from SynapseAI and other components.
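When the log directory contains many files, a small script can narrow down where to look. The following is a rough sketch, assuming the logs are plain-text files and that lines of interest contain words such as "error" or "fail"; both the keyword list and the function name are assumptions, not part of the Habana tooling. Run it inside the container, or against a copy of the directory.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def find_error_lines(log_dir: str,
                     keywords: Sequence[str] = ("error", "fail"),
                     ) -> List[Tuple[str, int, str]]:
    """Return (file path, line number, line text) for each matching line."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open("r", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(keyword in lowered for keyword in keywords):
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits
```

For example, `find_error_lines("/var/log/habana_logs")` returns the matching lines from every file under that directory, which can then be printed or filtered further.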


TGI shows "Cannot allocate connection" or "The condition [ isNicUp(port) ] failed." errors


Review the Disable/Enable NICs guide. Standalone machines that are not configured for scale-up with Gaudi 2 NICs connected to a switch need their external NICs disabled: run the command listed under "To disable Gaudi external NICs, run the following command" in the "Disable/Enable Gaudi 2 External NICs" section.