Troubleshooting common Pattern Deployment issues
- Problem
Validated Pattern installation process is stuck on deploying Vault
- Solution
The most common cause is unmet prerequisites, for example the Image Registry is not set up or CephFS is not set as the default StorageClass. Refer to the section
Getting started → Prerequisites
and make sure everything is in place before proceeding with pattern deployment.
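As a quick sanity check before deploying, the two prerequisites above can be verified from the CLI. This is a sketch; the CephFS StorageClass name (`ocs-storagecluster-cephfs` here) varies per cluster:

```shell
# Show which StorageClass carries the default annotation -- expect the CephFS class
# (the name ocs-storagecluster-cephfs is an assumption; check your cluster).
oc get storageclass -o custom-columns='NAME:.metadata.name,DEFAULT:.metadata.annotations.storageclass\.kubernetes\.io/is-default-class'

# Verify the internal Image Registry operator is Managed and Available.
oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{"\n"}'
oc get clusteroperator image-registry
```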
- Problem
Downloading the AI model
Llama-2-70b-chat-hf
using the download-model
Jupyter notebook fails, or TGI deployment fails after the model is downloaded
- Solution
This is most often caused by network errors during the model download. If you are not sure whether the whole model was downloaded, clear the bucket
model-bucket
in RGW storage using aws-cli
or any other method, and repeat the model download.
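Clearing the bucket with aws-cli can be done as follows. The endpoint and credentials are placeholders; take them from the secret and config map created for your ObjectBucketClaim:

```shell
# Credentials and endpoint are assumptions -- copy them from the bucket's
# ObjectBucketClaim secret/config map in your cluster.
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
export RGW_ENDPOINT=<rgw-endpoint-url>

# Remove all objects from model-bucket, then confirm it is empty.
aws --endpoint-url "$RGW_ENDPOINT" s3 rm s3://model-bucket --recursive
aws --endpoint-url "$RGW_ENDPOINT" s3 ls s3://model-bucket
```

Once the bucket is empty, rerun the download-model notebook from the beginning.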
- Problem
Builds of OPEA chat resources are failing because they cannot pull images
- Solution
If all Build resources are failing because of image pull errors, make sure there is no proxy issue. Refer to the section
Getting started → Procedure
. Also make sure that the Image Registry is properly set up.
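The cluster-wide proxy settings and the failing builds can be inspected like this (the build name is a placeholder for one of your failing builds):

```shell
# Empty httpProxy/httpsProxy fields mean no cluster-wide proxy is configured.
oc get proxy cluster -o jsonpath='{.spec}{"\n"}'

# List builds that are not succeeding, then check one build's log for pull errors.
oc get builds -A
oc logs build/<failing-build-name> | grep -i pull
```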
- Problem
TGI or TEI pods are showing errors
- Solution
Consult the Troubleshooting PyTorch Model documentation for general Gaudi 2 troubleshooting steps. Gathering extended logs can help, as mentioned in the troubleshooting steps for "RuntimeError: tensor does not have a device". The container directory
/var/log/habana_logs
should then be inspected for logs from SynapseAI and other components.
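One way to gather and retrieve those logs, assuming the TGI workload runs as a Deployment named `tgi` (an assumption; adjust to your resource names):

```shell
# Enable extended SynapseAI console logging on the pod; ENABLE_CONSOLE and
# LOG_LEVEL_ALL are the logging env vars from the Habana troubleshooting docs.
oc set env deployment/tgi ENABLE_CONSOLE=true LOG_LEVEL_ALL=0

# After the pod restarts and the error reproduces, inspect and copy the logs.
oc exec deploy/tgi -- ls /var/log/habana_logs
oc cp <tgi-pod-name>:/var/log/habana_logs ./habana_logs
```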
- Problem
TGI shows "Cannot allocate connection" or "The condition [ isNicUp(port) ] failed." errors
- Solution
Review the Disable/Enable NICs guide. Standalone machines that are not configured for scale-up, that is, whose Gaudi 2 NICs are not connected to a switch, require running the command from "To disable Gaudi external NICs, run the following command" in the "Disable/Enable Gaudi 2 External NICs" section.
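On the node itself, that step boils down to running Habana's NIC management script. The install path is an assumption; adjust it to where the habanalabs qual tools are installed on your image:

```shell
# Disable the Gaudi 2 external (scale-out) NICs on a standalone node.
# Script name and --down/--up flags come from the Habana
# "Disable/Enable Gaudi 2 External NICs" guide; the path may differ.
cd /opt/habanalabs/qual/gaudi2/bin
sudo ./manage_network_ifs.sh --down

# To re-enable the NICs later (e.g. after connecting them to a switch):
# sudo ./manage_network_ifs.sh --up
```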