Validated Patterns

Troubleshooting common Pattern Deployment issues


Problem

The Validated Pattern installation process is stuck while deploying Vault

Solution

The most common cause is unmet prerequisites, for example the Image Registry is not set up or CephFS is not set as the default StorageClass. Refer to Getting started → Prerequisites and make sure every prerequisite is met before proceeding with pattern deployment.
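
As a quick way to confirm the StorageClass prerequisite, the sketch below reads the JSON produced by `oc get storageclass -o json` and lists the classes carrying the standard Kubernetes default-class annotation. The function name and the sample data (including the StorageClass names) are illustrative assumptions, not taken from the pattern itself.

```python
import json

# Standard Kubernetes annotation that marks a StorageClass as the default.
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_classes(storageclass_list):
    """Return the names of all StorageClasses marked as default."""
    names = []
    for item in storageclass_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            names.append(metadata.get("name"))
    return names


# Hypothetical sample; real input would be the output of
# `oc get storageclass -o json` on your cluster.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "ocs-storagecluster-cephfs",
                  "annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}},
    {"metadata": {"name": "ocs-storagecluster-ceph-rbd"}}
  ]
}
""")

print(default_storage_classes(sample))  # ['ocs-storagecluster-cephfs']
```

An empty result, or a result that does not name the CephFS class, suggests the prerequisite is not yet satisfied.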


Problem

Downloading the AI model Llama-2-70b-chat-hf with the download-model Jupyter notebook fails, or the TGI deployment fails after the model is downloaded

Solution

This is most often caused by network errors during the model download. If you are not sure that the whole model was downloaded, clear the model-bucket bucket in RGW storage using aws-cli or any other method, then repeat the download.
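
If you prefer a script to aws-cli, the following is a minimal sketch of clearing the bucket through an S3-compatible client. It assumes a boto3-style client object (`get_paginator` and `delete_objects`) and a non-versioned bucket; the helper name `clear_bucket`, the endpoint placeholder, and the credential handling are assumptions you would adapt to your RGW setup.

```python
def clear_bucket(s3, bucket):
    """Delete every object in `bucket` and return the number deleted.

    `s3` is expected to behave like a boto3 S3 client.
    """
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # A list_objects_v2 page holds at most 1000 keys, which is also
        # the per-request limit of delete_objects.
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
        deleted += len(keys)
    return deleted


# Example usage (requires boto3; the endpoint URL is a placeholder):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://<your-rgw-endpoint>")
#   print(clear_bucket(s3, "model-bucket"))
```

After the bucket is empty, rerun the download-model notebook from the start.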


Problem

Builds of OPEA chat resources are failing because they cannot pull images

Solution

If all Build resources are failing with image pull errors, make sure that there is no proxy issue; refer to Getting started → Procedure. Also make sure that the Image Registry is properly set up.
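
One way to see which proxy settings the cluster is actually using is to inspect the cluster-wide Proxy object. The sketch below assumes the input is the parsed output of `oc get proxy cluster -o json`; the function name and the sample values are hypothetical, and whether this object is the relevant one depends on how the proxy was configured during the Procedure step.

```python
def effective_proxy(proxy_obj):
    """Return the proxy fields from a parsed cluster Proxy object.

    Values in `status` are preferred over `spec`; missing fields
    are reported as empty strings.
    """
    spec = proxy_obj.get("spec") or {}
    status = proxy_obj.get("status") or {}
    return {
        field: status.get(field) or spec.get(field) or ""
        for field in ("httpProxy", "httpsProxy", "noProxy")
    }


# Hypothetical sample; real input would be the output of
# `oc get proxy cluster -o json`.
sample_proxy = {
    "spec": {"httpProxy": "http://proxy.example.com:3128"},
    "status": {"httpProxy": "http://proxy.example.com:3128",
               "noProxy": ".cluster.local,.svc"},
}

for name, value in effective_proxy(sample_proxy).items():
    print(f"{name}: {value or '(not set)'}")
```

Comparing these values with what your network requires can help narrow down whether image pulls fail because of the proxy or because of the registry.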


Problem

TGI or TEI pods are showing errors

Solution

Consult the Troubleshooting PyTorch Model documentation for general Gaudi 2 troubleshooting steps. Gathering extended logs may help, as described in the troubleshooting steps for "RuntimeError: tensor does not have a device". Then inspect the container directory /var/log/habana_logs for logs from SynapseAI and other components.
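
When the log directory contains many files, a small script can help locate the relevant lines. This is a sketch that would run inside the container; it assumes the logs are plain text and that a case-insensitive search for "error" is a useful first filter, which may not match the actual log format. The function name is illustrative.

```python
import os


def find_error_lines(log_dir, needle="error"):
    """Return (path, line_number, line) for each line containing `needle`.

    The match is case-insensitive; unreadable files are skipped.
    """
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle.lower() in line.lower():
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                continue
    return hits


if __name__ == "__main__":
    # Directory named in the troubleshooting step above.
    for path, number, line in find_error_lines("/var/log/habana_logs"):
        print(f"{path}:{number}: {line}")
```

Passing a different `needle` (for example a component name) narrows the search to one source.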


Problem

TGI shows "Cannot allocate connection" or "The condition [ isNicUp(port) ] failed." errors

Solution

Review the Disable/Enable NICs guide. Standalone machines that are not configured for scale-up with Gaudi 2 NICs connected to a switch need their external NICs disabled: run the command listed under "To disable Gaudi external NICs, run the following command" in the "Disable/Enable Gaudi 2 External NICs" section.