Troubleshooting

Our Issue Tracker

Installation-phase Failures

The framework for deploying the applications and their operators has been made easy for the user by using OpenShift GitOps for continuous deployment (Argo CD). It takes time to deploy everything. You may have to go back and forth between the OpenShift cluster console and the OpenShift GitOps console to check on applications and operators being up and in a ready state.

The applications deployment for the main data center are as follows. First OpenShift GitOps operator will deploy. See the OpenShift Console to see that it is running. Then OpenShift GitOps takes over the rest of the deployment. It deploys the following applications:

Advanced Cluster Management operator in the application acm. this will manage the edge clusters
Red Hat OpenShift AI in the application data-science-cluster and data-science-project for the data science components.
OpenShift Pipelines is deployed in the application pipelines
AMQ Streams is deployed to manage data coming from factories and stored in a data lake.
The data lake uses S3 based storage and is deployed in the production-data-lake application
Testing at the data center is managed by the manuela-test application

Make sure that all these applications are Healthy 💚 and Synced ✅ in the OpenShift GitOps console. If in a state other than Healthy (Progressing, Degraded, Missing, Unknown') then it’s time to dive deeper into that application and see what has happened.

The applications deployed on the factory (edge) cluster are as follows. After a successful importing [1] a factory cluster to the main ACM hub, you should check in the factory cluster’s OpenShift UI to see if the projects open-cluster-manager-agent and open-cluster-manager-agent-addons are running. When these are deployed then OpenShift GitOps operator will be deployed on the cluster. From there OpenShift GitOps deploys the following applications:

stormshift sets up application and AMQ integration components
golang-external-secrets sets up the bits to be able to get secrets from the data center.

[1] ACM has different ways of describing this process based on which tool you are using. Attach, Join, Import are terms associated with bringing a cluster under the management of a hub cluster.

Subscriptions not being installed

Symptom: Install seems to “freeze” at a specific point. Expected operators do not install in the cluster

Cause: It is possible an operator was requested to be installed that isn’t allowed to be installed on this version of OpenShift.

Resolution: In general, use the project-supplied global.options.UseCSV setting of False. This requests the current, best version of the operator available. If a specific CSV (Cluster Service Version) is requested but unavailable, that operator will not be able to install at all, and when an operator fails to install, that may have a cascading effect on other operators.

Potential (Known) Operational Issues

Pipeline Failures

Symptom: Intermittent failures in Pipeline stages

Some sample errors:

level=error msg="Error while applying layer: ApplyLayer io: read/write on closed pipe stdout: {\"layerSize\":7301}\n stderr: "
error creating build container: Error committing the finished image: error adding layer with blob

time="2021-09-29T18:48:27Z" level=fatal msg="Error trying to reuse blob sha256:235f9e6f3559c04d5ee09b613dcab06dbc03ceb93b65ce364afe35c03fd53574 at destination: failed to read from destination repository martjack/iot-software-sensor: 500 (Internal Server Error)

I1006 22:07:47.908257      14 request.go:645] Throttling request took 1.195150708s, request: GET:https://172.30.0.1:443/apis/autoscaling.openshift.io/v1?timeout=32s
PipelineRun started: seed-iot-software-sensor-run-cpzzv
Waiting for logs to be available...
E1006 22:08:27.106369      14 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)
goroutine 487 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1b40ee0, 0x1fe47b0)
 /workspace/pkg/mod/k8s.io/apimachinery@v0.19.7/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
 /workspace/pkg/mod/k8s.io/apimachinery@v0.19.7/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1b40ee0, 0x1fe47b0)

When this happens, the pipeline may not entirely stop running. It is safe to stop/cancel the pipeline run, and desirable to do so, since multiple pipelines attempting to change the repository at the same time could cause more failures.

Resolution: Run make seed in the root of the repository OR re-run the failed pipeline segment (e.g. seed-iot-frontend or seed-iot-consumer).

We’re looking into better long-term fixes for a number of the situations that can cause these situations as #40.

Symptom: Error in “push-*” pipeline tasks

Cause: Multiple processes or people were trying to make changes to the repository at the same time. The state of the repository changed in the middle of the process in such a way that the update was not a “fast-forward” in git terms.

Resolution: Re-run the failed pipeline segment OR run make seed from the root of your fork of the industrial-edge repository.

It is also possible that multiple pipelines were running at the same time and were making conflicting changes. We recommend running one pipeline at a time.

Symptom: Pipelines application perpetually “progressing” and not showing green/healthy. May show “degraded”

Cause: Most likely the application is missing the images that are built by the seed pipeline.

Resolution: Run make seed from the root of your forked repository directory, which will build the images and deploy them to both test and production.

Symptom: There is a “spinny” next to one of the resources in the app that never resolves

Cause: Check for a PersistentVolumeClaim that is not in use.

Resolution: Delete the unused PVC

ArgoCD not syncing

Symptom: ArgoCD shows an error and “Unknown” sync status

Cause: A change has been made in the repository that renders invalid YAML

Resolution: Fix the issue as identified by the error message, and commit and push the fix OR revert the last one.

Certain changes might invalidate objects in ArgoCD, and this will prevent ArgoCD from deploying the change related to that commit. The error message for that situation might look like this (this particular change removed the Image details from the kustomization.yaml file, and we resolved it by re-adding the image entries:

rpc error: code = Unknown desc = Manifest generation error (cached): `/bin/bash -c helm template . --name-template ${ARGOCD_APP_NAME:0:52} -f https://github.com/claudiol/industrial-edge/raw/deployment/values-global.yaml -f https://github.com/claudiol/industrial-edge/raw/deployment/values-datacenter.yaml --set global.repoURL=$ARGOCD_APP_SOURCE_REPO_URL --set global.targetRevision=$ARGOCD_APP_SOURCE_TARGET_REVISION --set global.namespace=$ARGOCD_APP_NAMESPACE --set global.pattern=industrial-edge --set global.valuesDirectoryURL=https://github.com/claudiol/industrial-edge/raw/deployment --post-renderer ./kustomize` failed exit status 1: Error: error while running post render on files: error while running command /tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst/kustomize. error output: ++ dirname /tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst/kustomize + BASE=/tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst + '[' /tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst = /tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst ']' + BASE=./ + cat + echo / /tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst / /tmp/https:__github.com_claudiol_industrial-edge/charts/datacenter/manuela-tst + ls -al total 44 drwxr-xr-x. 3 default root 166 Oct 6 20:59 . drwxr-xr-x. 7 default root 98 Oct 6 20:28 .. -rw-r--r--. 1 default root 1105 Oct 6 20:28 Chart.yaml -rw-r--r--. 1 default root 22393 Oct 6 20:59 helm.yaml -rw-r--r--. 1 default root 98 Oct 6 20:59 kustomization.yaml -rwxr-xr-x. 1 default root 316 Oct 6 20:28 kustomize -rw-r--r--. 1 default root 348 Oct 6 20:28 system-image-builder-role-binding.yaml drwxr-xr-x. 7 default root 115 Oct 6 20:28 templates -rw-r--r--. 1 default root 585 Oct 6 20:28 values.yaml + kubectl kustomize ./ Error: json: cannot unmarshal object into Go struct field Kustomization.images of type []image.Image : exit status 1 Use --debug flag to render out invalid YAML