Troubleshooting common Pattern Deployment issues
- Problem
Validated Pattern installation process is stuck on deploying Vault
- Solution
The most common cause is unmet prerequisites, for example the Image Registry is not set up or CephFS is not set as the default StorageClass. Refer to the section
Getting started → Prerequisites
and make sure everything is in place before proceeding with pattern deployment.
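As a quick sanity check before deploying, the two prerequisites above can be verified from the CLI. This is a sketch; the CephFS StorageClass name (`ocs-storagecluster-cephfs` here) varies per cluster:

```shell
# Show which StorageClass carries the default annotation -- expect the CephFS class
# (the name ocs-storagecluster-cephfs is an assumption; check your cluster).
oc get storageclass -o custom-columns='NAME:.metadata.name,DEFAULT:.metadata.annotations.storageclass\.kubernetes\.io/is-default-class'

# Verify the internal Image Registry operator is Managed and Available.
oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{"\n"}'
oc get clusteroperator image-registry
```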
- Problem
Downloading the AI model
Llama-2-70b-chat-hf
using the download-model
Jupyter notebook fails, or TGI deployment fails after the model is downloaded
- Solution
This is most often caused by network errors during the model download. If you are not sure whether the whole model was downloaded, clear the bucket
model-bucket
in RGW storage using aws-cli
or any other method, and repeat the model download.
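Clearing the bucket with aws-cli can be done as follows. The endpoint and credentials are placeholders; take them from the secret and config map created for your ObjectBucketClaim:

```shell
# Credentials and endpoint are assumptions -- copy them from the bucket's
# ObjectBucketClaim secret/config map in your cluster.
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
export RGW_ENDPOINT=<rgw-endpoint-url>

# Remove all objects from model-bucket, then confirm it is empty.
aws --endpoint-url "$RGW_ENDPOINT" s3 rm s3://model-bucket --recursive
aws --endpoint-url "$RGW_ENDPOINT" s3 ls s3://model-bucket
```

Once the bucket is empty, rerun the download-model notebook from the beginning.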
- Problem
Builds of OPEA chat resources are failing because they cannot pull images
- Solution
If all Build resources are failing because of image pull errors, make sure there is no proxy issue. Refer to the section
Getting started → Procedure
. Also make sure that the Image Registry is properly set up.
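The cluster-wide proxy settings and the failing builds can be inspected like this (the build name is a placeholder for one of your failing builds):

```shell
# Empty httpProxy/httpsProxy fields mean no cluster-wide proxy is configured.
oc get proxy cluster -o jsonpath='{.spec}{"\n"}'

# List builds that are not succeeding, then check one build's log for pull errors.
oc get builds -A
oc logs build/<failing-build-name> | grep -i pull
```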
- Problem
TGI or TEI pods are showing errors
- Solution
Consult the Troubleshooting PyTorch Model documentation for general Gaudi 2 troubleshooting steps. Gathering extended logs can help, as mentioned in the troubleshooting steps for "RuntimeError: tensor does not have a device". The container directory
/var/log/habana_logs
should then be inspected for logs from SynapseAI and other components.
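One way to gather and retrieve those logs, assuming the TGI workload runs as a Deployment named `tgi` (an assumption; adjust to your resource names):

```shell
# Enable extended SynapseAI console logging on the pod; ENABLE_CONSOLE and
# LOG_LEVEL_ALL are the logging env vars from the Habana troubleshooting docs.
oc set env deployment/tgi ENABLE_CONSOLE=true LOG_LEVEL_ALL=0

# After the pod restarts and the error reproduces, inspect and copy the logs.
oc exec deploy/tgi -- ls /var/log/habana_logs
oc cp <tgi-pod-name>:/var/log/habana_logs ./habana_logs
```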
- Problem
TGI shows "Cannot allocate connection" or "The condition [ isNicUp(port) ] failed." errors
- Solution
Review the Disable/Enable NICs guide. Standalone machines that are not configured for scale-up, that is, whose Gaudi 2 NICs are not connected to a switch, require running the command from "To disable Gaudi external NICs, run the following command" in the "Disable/Enable Gaudi 2 External NICs" section.
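On the node itself, that step boils down to running Habana's NIC management script. The install path is an assumption; adjust it to where the habanalabs qual tools are installed on your image:

```shell
# Disable the Gaudi 2 external (scale-out) NICs on a standalone node.
# Script name and --down/--up flags come from the Habana
# "Disable/Enable Gaudi 2 External NICs" guide; the path may differ.
cd /opt/habanalabs/qual/gaudi2/bin
sudo ./manage_network_ifs.sh --down

# To re-enable the NICs later (e.g. after connecting them to a switch):
# sudo ./manage_network_ifs.sh --up
```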