Troubleshooting tips

Review the following questions when you encounter an issue with Open Horizon. The tips and guides for each question can help you resolve common issues and obtain information to identify root causes.

Are the currently released versions of the Horizon packages installed?
Is the Horizon agent up and actively running?
Is the edge node configured to interact with the Horizon exchange?
Are the required Docker® containers for the edge node running?
Are the expected service container versions running?
Are the expected containers stable?
Are your Docker® containers networked correctly?
Are the dependency containers reachable within the context of your container?
Are your user-defined containers emitting error messages to the log?
Can you use your organization’s instance of the Apache Kafka broker?
Are your containers published to Horizon exchange?
Does your published deployment pattern include all required services and versions?
Troubleshooting tips specific to the OpenShift Container Platform environment
Troubleshooting node errors
Are you encountering an HTTP error while executing deploy-mgmt-hub.sh?

Are the currently released versions of the Horizon packages installed?

Ensure that the Horizon software that is installed on your edge nodes is always on the latest released version.

On a Linux system, you can usually check the version of your installed Horizon packages by running this command:

dpkg -l | grep horizon

You can update your Horizon packages by using the package manager on your system. For example, on an Ubuntu-based Linux system, use the following commands to update Horizon to the current version:

sudo apt update
sudo apt install -y bluehorizon
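
When you compare versions across several edge nodes, it can help to reduce the dpkg listing to package names and versions only. The following is a minimal sketch, not part of the Horizon tooling: the function name horizon_pkg_versions is hypothetical, and it assumes a Debian-based system where dpkg -l marks installed packages with ii in the first column.

```shell
# Print "name version" for every installed package whose name contains "horizon".
# Reads `dpkg -l` style output on stdin; "ii" in column 1 means "installed".
horizon_pkg_versions() {
    awk '$1 == "ii" && $2 ~ /horizon/ { print $2, $3 }'
}

# Usage on a Debian-based edge node:
#   dpkg -l | horizon_pkg_versions
```

Packages that were removed but still have configuration files (state rc) are skipped, so the output reflects only what is currently installed.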

Is the Horizon agent up and actively running?

You can verify that the agent is running by using this Horizon CLI command:

hzn node list | jq .

You can also use the host’s system management software to check on the status of the Horizon agent. For example, on an Ubuntu-based Linux system, you can use the systemctl utility:

sudo systemctl status horizon

A line similar to the following is shown if the agent is active:

Active: active (running) since Thu 2020-10-01 17:56:12 UTC; 2 weeks 0 days ago
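
For scripted health checks, the same systemd state can be tested without reading the status output. This is a sketch under the assumption that the agent runs as a systemd unit named horizon, as in the preceding command; the function name check_agent is hypothetical.

```shell
# Report whether the horizon systemd unit is active.
# Returns 0 when the unit is active, 1 otherwise.
check_agent() {
    if systemctl is-active --quiet horizon; then
        echo "horizon agent: active"
    else
        echo "horizon agent: not active"
        return 1
    fi
}

# Usage: show the full status only when the quick check fails.
#   check_agent || sudo systemctl status horizon
```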

Is the edge node configured to interact with the Horizon exchange?

To verify that you can communicate with the Horizon exchange, run this command:

hzn exchange version

To verify that your Horizon exchange is accessible, run this command:

hzn exchange user list

After your edge node is registered with Horizon, you can verify whether the node is interacting with Horizon exchange by viewing the local Horizon agent configuration. Run this command to view the agent configuration:

hzn node list | jq .configuration.exchange_api
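
The first two checks in this section can be chained so that the first failing step is reported. This is only a sketch: the function name check_exchange is hypothetical, and it assumes that the hzn CLI exits with a nonzero status when a command fails.

```shell
# Run the exchange checks in order and stop at the first failure.
# Assumption: hzn exits nonzero when a command fails.
check_exchange() {
    if ! hzn exchange version >/dev/null 2>&1; then
        echo "hzn exchange version failed: exchange not reachable"
        return 1
    fi
    if ! hzn exchange user list >/dev/null 2>&1; then
        echo "hzn exchange user list failed: check your credentials"
        return 2
    fi
    echo "exchange reachable and credentials accepted"
}

# Usage:
#   check_exchange
```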

Are the required Docker® containers for the edge node running?

When your edge node is registered with Horizon, a Horizon Agbot creates an agreement with your edge node to run the services that are referenced in your gateway type (deployment pattern). If that agreement is not created, complete these checks to troubleshoot the issue.

Confirm that your edge node is in the configured state and has the correct id and organization values. Additionally, confirm that the architecture that Horizon reports is the same architecture that you used in the metadata for your services. Run this command to list these settings:

hzn node list | jq .

If those values are as expected, you can check the agreement status of the edge node by running this command:

hzn agreement list | jq .

If this command does not show any agreements, agreements might have formed and then been cancelled because a problem was discovered. In that case, the agreement can be cancelled before it appears in the output of the previous command. A cancelled agreement shows a status of terminated_description in the list of archived agreements. You can view the archived list by running this command:

hzn agreement list -r | jq .

A problem might also occur before an agreement is created. If this problem occurs, review the event log for the Horizon agent to identify possible errors. Run this command to view the log:

hzn eventlog list

The event log can report the following problems:

The signature of the service metadata, specifically the deployment field, cannot be verified. This error usually means that your signing public key is not imported into your edge node. You can import the key by using the hzn key import -k <pubkey> command. You can view the keys that are imported to your local edge node by using the hzn key list command. You can verify that the service metadata in the Horizon exchange is signed with your key by using this command:
```
hzn exchange service verify -k $PUBLIC_KEY_FILE <service-id>
```

Replace <service-id> with the ID for your service. This ID can resemble the following sample format: workload-cpu2wiotp_${CPU2WIOTP_VERSION}_${ARCH2}.

The path of the Docker® image in the service deployment field is incorrect. Confirm that your edge node can docker pull that image path.
The Horizon agent on your edge node does not have access to the Docker® registry that holds your Docker® images. If the images in the remote registry are not world-readable, you must add the credentials to your edge node by using the docker login command. You need to complete this step only once because the credentials are remembered on the edge node.
If a container is continually restarting, review the container log for details. A container can be continually restarting when it is listed for only a few seconds or remains listed as restarting when you run the docker ps command. You can view the container log for details by running this command:
```
grep --text -E ' <service-id>\[[0-9]+\]' /var/log/syslog
```
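
The agreement checks in this section can be sketched as one triage step: look for active agreements first, and fall back to the archived agreements and the event log only when none exist. The function names no_agreements and triage_agreements are hypothetical, and the sketch assumes that hzn agreement list prints an empty JSON array ([]) when the node has no active agreements.

```shell
# Succeed when stdin holds an empty JSON array (or nothing at all).
# Assumption: `hzn agreement list` prints [] when there are no agreements.
no_agreements() {
    body=$(tr -d '[:space:]')
    [ -z "$body" ] || [ "$body" = "[]" ]
}

# Show the archived agreements and the event log only when no agreement is active.
triage_agreements() {
    if hzn agreement list | no_agreements; then
        echo "No active agreements; showing archived agreements and event log"
        hzn agreement list -r
        hzn eventlog list
    else
        echo "Active agreement found"
    fi
}

# Usage:
#   triage_agreements
```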

Are the expected service container versions running?

Your container versions are governed by an agreement that is created after you add your service to the deployment pattern and after you register your edge node for that pattern. Verify that your edge node has a current agreement for your pattern by running this command:

hzn agreement list | jq .

If you confirmed the correct agreement for your pattern, use this command to view the running containers. Ensure that your user-defined containers are listed and are running:

docker ps

The Horizon agent can take several minutes after the agreement is accepted before the corresponding containers are downloaded, verified, and started. This delay depends mostly on the sizes of the containers themselves, which must be pulled from remote repositories.
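
Because the containers can take several minutes to appear, a polling loop is often more convenient than rerunning docker ps by hand. The following is a minimal sketch; the function name wait_for_container and its default of 30 checks at 10-second intervals are arbitrary choices, not Horizon settings.

```shell
# Poll `docker ps` until a running container whose name contains $1 appears.
# $2 = number of checks (default 30), $3 = seconds between checks (default 10).
wait_for_container() {
    name=$1
    tries=${2:-30}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if docker ps --format '{{.Names}}' | grep -F -q -- "$name"; then
            echo "container matching '$name' is running"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "no container matching '$name' after $tries checks"
    return 1
}

# Usage: wait up to 5 minutes for a service container.
#   wait_for_container YOURSERVICENAME 30 10
```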

Are the expected containers stable?

Check whether your containers are stable by running this command:

docker ps

From the command output, you can see how long each container has been running. If you observe over time that your containers are restarting unexpectedly, check the container logs for errors.
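
To spot unstable containers quickly, you can filter the docker ps status column for containers that are restarting or that started only seconds ago. This is a sketch under stated assumptions: the function name flag_unstable is hypothetical, and it relies on the status wording that Docker typically prints (for example, Restarting (1) 3 seconds ago or Up 4 seconds), which might differ between Docker versions.

```shell
# Read "name<TAB>status" lines and print the containers that look unstable:
# those in the Restarting state, or those that have been up for only seconds.
flag_unstable() {
    awk -F '\t' '
        $2 ~ /^Restarting/ || $2 ~ /^Up (Less than a second|[0-9]+ seconds?)/ {
            print $1 ": " $2
        }'
}

# Usage:
#   docker ps --format '{{.Names}}\t{{.Status}}' | flag_unstable
```

A container that appears in this output on repeated runs is a good candidate for a closer look at its log.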

As a development best practice, consider configuring individual service logging by running the following commands (Linux systems only):

cat <<'EOF' > /etc/rsyslog.d/10-horizon-docker.conf
$template DynamicWorkloadFile,"/var/log/workload/%syslogtag:R,ERE,1,DFLT:.*workload-([^\[]+)--end%.log"

:syslogtag, startswith, "workload-" -?DynamicWorkloadFile
& stop
:syslogtag, startswith, "docker/" -/var/log/docker_containers.log
& stop
:syslogtag, startswith, "docker" -/var/log/docker.log
& stop
EOF
service rsyslog restart

If you complete the previous step, then the logs for your containers are recorded within separate files inside the /var/log/workload/ directory. Use the docker ps command to find the full names of your containers. You can find the log file of that name, with a .log suffix, in this directory.

If individual service logging is not configured, your service logs are added to the system log with all other log messages. To review the data for your containers, search for the container name in the system log output within the /var/log/syslog file. For instance, you can search the log by running a command similar to the following:

grep --text -E 'YOURSERVICENAME\[[0-9]+\]' /var/log/syslog

Are your Docker® containers networked correctly?

Ensure that your containers are attached to the correct Docker® networks so that they can access the services that they require. Run this command to view the Docker® virtual networks that are active on your edge node:

docker network list

To view more information about networks, use the docker inspect X command, where X is the name of the network. The command output lists all containers that run on the virtual network.

You can also run the docker inspect Y command on each container, where Y is the name of the container, to get more information. For instance, review the NetworkSettings section of the output and find the Networks entry. Within this entry, you can view the relevant network ID string and information about how the container is represented on the network. This information includes the container IPAddress and the list of network aliases that the container has on this network.

Alias names are available to all of the containers on this virtual network, and these names are typically used by the containers in your code deployment pattern for discovering other containers on the virtual network. For example, you can name your service myservice. Then, other containers can use that name directly to access it on the network, such as with the command ping myservice. The alias name of your container is specified in the deployment field of its service definition file that you passed to the hzn exchange service publish command.
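
Rather than reading the full docker inspect output, you can ask Docker to print only the network name, IP address, and aliases for a container. The following sketch uses the --format option of docker inspect with a Go template; the function name container_networks is hypothetical.

```shell
# Print one line per network that the container is attached to:
#   <network name> <IP address> [<alias> ...]
container_networks() {
    docker inspect --format \
        '{{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{$net.IPAddress}} {{$net.Aliases}}{{"\n"}}{{end}}' \
        "$1"
}

# Usage: replace Y with the name of your container.
#   container_networks Y
```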

For more information about the commands supported by the Docker command line interface, see Docker® command reference.

Are the dependency containers reachable within the context of your container?

Enter the context of a running container to troubleshoot issues at run time by using the docker exec command. Use the docker ps command to find the identifier of your running container, then use a command that resembles the following to enter the context. Replace CONTAINERID with your container’s identifier:

docker exec -it CONTAINERID /bin/sh

If your container includes bash, you might want to specify /bin/bash at the end of the preceding command instead of /bin/sh.

When inside the container context, you can use commands like ping or curl to interact with the containers it requires and verify connectivity.
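
The same connectivity check can be run without an interactive shell by passing the command directly to docker exec. This is a sketch that assumes the container image includes the ping utility, which minimal images often omit; the function name can_reach is hypothetical.

```shell
# Succeed when the container $1 can ping the network alias $2.
# Assumption: the container image provides `ping`.
can_reach() {
    docker exec "$1" ping -c 1 "$2" >/dev/null 2>&1
}

# Usage:
#   can_reach CONTAINERID myservice && echo reachable || echo unreachable
```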

For more information about the commands supported by the Docker command line interface, see Docker® command reference.

Are your user-defined containers emitting error messages to the log?

If you configured individual service logging, each of your containers logs to a separate file within the /var/log/workload/ directory. Use the docker ps command to find the full names of your containers. Then, look for a file of that name, with the .log suffix, within this directory.

If individual service logging is not configured, your service logs to the system log with all other details. To review the data, search for the container name in the system log output within the /var/log/syslog file. For instance, search the log by running a command similar to:

grep --text -E 'YOURSERVICENAME\[[0-9]+\]' /var/log/syslog
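
Both cases can be covered by one helper that prefers the per-service file and falls back to the system log. This is a minimal sketch: the function name show_service_log and the WORKLOAD_LOG_DIR and SYSLOG_FILE override variables are hypothetical conveniences, and the defaults are the paths named in this section.

```shell
# Print the last 100 log lines for service $1.
# Uses /var/log/workload/<name>.log when it exists, otherwise greps the syslog.
show_service_log() {
    name=$1
    workload_dir=${WORKLOAD_LOG_DIR:-/var/log/workload}
    syslog_file=${SYSLOG_FILE:-/var/log/syslog}
    if [ -f "$workload_dir/$name.log" ]; then
        tail -n 100 "$workload_dir/$name.log"
    else
        grep --text -E "${name}\[[0-9]+\]" "$syslog_file" | tail -n 100
    fi
}

# Usage:
#   show_service_log YOURSERVICENAME | grep -i error
```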

Can you use your organization’s instance of the Apache Kafka broker?

Subscribing to your organization’s Apache Kafka instance can help you verify that your Kafka user credentials are correct. This subscription can also help you verify that your Kafka service instance is running in the cloud, and that your edge node is sending data when data is being published.

To subscribe to your Kafka broker, install the kafkacat program. For example, on an Ubuntu Linux system, use this command:

sudo apt install kafkacat

After installation, you can subscribe by using a command similar to the following example, which reads the credentials from environment variables:

kafkacat -C -q -o end -f "%t/%p/%o/%k: %s\n" -b $EVTSTREAMS_BROKER_URL -X api.version.request=true -X security.protocol=sasl_ssl -X sasl.mechanisms=PLAIN -X sasl.username=token -X sasl.password=$EVTSTREAMS_API_KEY -t $EVTSTREAMS_TOPIC

Where EVTSTREAMS_BROKER_URL is the URL to your Kafka broker, EVTSTREAMS_TOPIC is your Kafka topic, and EVTSTREAMS_API_KEY is your API key for authenticating with Apache Kafka API.
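
An unset variable produces a confusing kafkacat failure, so it can be worth confirming that all three variables are set before you subscribe. The following is a small sketch; the function name check_kafka_env is hypothetical.

```shell
# Print the name of each required variable that is unset or empty.
# Returns 0 when all three are set, 1 otherwise.
check_kafka_env() {
    missing=0
    for var in EVTSTREAMS_BROKER_URL EVTSTREAMS_API_KEY EVTSTREAMS_TOPIC; do
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "missing: $var"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: run the kafkacat subscription only when the check passes.
#   check_kafka_env && echo "ready to subscribe"
```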

If the subscription command is successful, the command blocks indefinitely. The command waits for any publication to your Kafka broker and retrieves and displays any resulting messages. If you do not see any messages from your edge node after a few minutes, review the service log for error messages.

For example, to review the log for the cpu2evtstreams service, run this command:

For Linux and Windows

tail -n 500 -f /var/log/syslog | grep -E 'cpu2evtstreams\[[0-9]+\]:'

For macOS

docker logs -f $(docker ps --filter 'name=-cpu2evtstreams' | tail -n +2 | awk '{print $1}')

Are your containers published to Horizon exchange?

Horizon exchange is the central warehouse for metadata about the code that is published for your edge nodes. If you have not signed and published your code to the Horizon exchange, the code cannot be pulled to your edge nodes, verified, and run.

Run the hzn command with the following arguments to view the list of published code to verify that all of your service containers were successfully published:

hzn exchange service list | jq .
hzn exchange service list $ORG_ID/$SERVICE | jq .

The parameter $ORG_ID is your organization ID, and $SERVICE is the name of the service you are obtaining information about.

Does your published deployment pattern include all required services and versions?

On any edge node where the hzn command is installed, you can use this command to get details about any deployment pattern. Run the hzn command with the following arguments to pull the listing of deployment patterns from the Horizon exchange:

hzn exchange pattern list | jq .
hzn exchange pattern list $ORG_ID/$PATTERN | jq .

The parameter $ORG_ID is your organization ID, and $PATTERN is the name of the deployment pattern you are obtaining information about.
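
The service and pattern lookups can be combined into a single publication check. This is only a sketch: the function name check_published is hypothetical, and it assumes that hzn exits with a nonzero status when the requested service or pattern does not exist in the exchange.

```shell
# Check that service $2 and pattern $3 exist in organization $1.
# Assumption: hzn exits nonzero when the requested resource is not found.
check_published() {
    org=$1
    service=$2
    pattern=$3
    status=0
    if ! hzn exchange service list "$org/$service" >/dev/null 2>&1; then
        echo "service $org/$service: not found in the exchange"
        status=1
    fi
    if ! hzn exchange pattern list "$org/$pattern" >/dev/null 2>&1; then
        echo "pattern $org/$pattern: not found in the exchange"
        status=1
    fi
    return "$status"
}

# Usage:
#   check_published "$ORG_ID" "$SERVICE" "$PATTERN"
```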

Troubleshooting tips specific to the OpenShift Container Platform environment

Review this content to help you troubleshoot common issues with OpenShift Container Platform environments related to Open Horizon. These tips can help you resolve common issues and obtain information to identify root causes.

Are your Open Horizon credentials configured correctly for use in the OpenShift Container Platform environment?

You need an OpenShift Container Platform user account to complete any action within Open Horizon in this environment. You also require an API key created from that account.

To verify your Open Horizon credentials in this environment, run this command:

hzn exchange user list

If a JSON-formatted entry is returned from the Exchange showing one or more users, the Open Horizon credentials are configured properly.

If an error response is returned, you can take steps to troubleshoot your credentials setup.

If the error message indicates an incorrect API key, you can create a new API key by using the commands that are described in the following topic.

See Gather the necessary information and files.

Troubleshooting node errors

Open Horizon publishes a subset of event logs to the exchange that is viewable in the management console. These errors link to troubleshooting guidance.

Image load error
Deployment configuration error
Container start error
OCP edge cluster TLS internal error

Image load error

This error occurs when the service image that is referenced in the service definition does not exist in the image repository. To resolve this error:

Republish the service without the -I flag.

hzn exchange service publish -f <service-definition-file>

Push the service image directly to the image repository.
```
docker push <image name>
```

Deployment configuration error

This error occurs when the deployment configuration in the service definition specifies a bind to a root-protected file. To resolve this error:

Bind the container to a file that is not root protected.
Change the file permissions to allow users to read and write to the file.

Container start error

This error occurs when Docker encounters an error when it starts the service container. The error message might contain details that indicate why the container start failed. Error resolution steps depend on the error. The following errors can occur:

The device is already using a published port that is specified by the deployment configurations. To resolve the error:
- Map a different port to the service container port. The displayed port number does not have to match the service port number.
- Stop the program that is using the same port.
A published port that is specified by the deployment configurations is not a valid port number. Port numbers must be a number in the range 1 - 65535.
A volume name in the deployment configurations is not a valid file path. Volume paths must be specified by their absolute (not relative) paths.
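
The last two rules are easy to check before you publish a service. The following sketch validates a port number and a volume path; the function names valid_port and is_absolute_path are hypothetical and are not part of the hzn CLI.

```shell
# Succeed when $1 is a whole number in the range 1 - 65535.
valid_port() {
    case $1 in
        ''|*[!0-9]*|??????*) return 1 ;;   # empty, non-digit, or more than 5 digits
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Succeed when $1 is an absolute path (starts with a slash).
is_absolute_path() {
    case $1 in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Usage:
#   valid_port 8080 || echo "invalid port"
#   is_absolute_path ./data || echo "volume path must be absolute"
```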

Are you encountering an HTTP error while executing deploy-mgmt-hub.sh?

If you encounter the following error while executing deploy-mgmt-hub.sh, complete the steps that follow the error output:

 ------- Downloading/installing/configuring Horizon agent and CLI...
Downloading the Horizon agent and CLI packages...
Installing the Horizon agent and CLI packages...
Configuring the Horizon agent and CLI...
Publishing /tmp/horizon-all-in-1/agent-install.cfg in CSS as public object agent-install.cfg in the IBM org...
Digital sign with SHA1 will be performed for data integrity. It will delay the MMS object publish.
Start hashing the file...
Data hash is generated. Start digital signing with the data hash...
Digital sign finished.
Error: Encountered HTTP error: Put "http://127.0.0.1:9443/api/v1/objects/IBM/agent_files/agent-install.cfg": read tcp 127.0.0.1:59088->127.0.0.1:9443: read: connection reset by peer calling Model Management Service REST API PUT http://127.0.0.1:9443/api/v1/objects/IBM/agent_files/agent-install.cfg. HTTP status: .
Error: exit code 5 from: publishing /tmp/horizon-all-in-1/agent-install.cfg in CSS as a public object

Export MONGO_IMAGE_TAG:
```
export MONGO_IMAGE_TAG=4.0.6
```
Stop and purge management hub services and agent:
```
./deploy-mgmt-hub.sh -S -P
```
Rerun deploy-mgmt-hub.sh as root:
```
./deploy-mgmt-hub.sh
```

OCP edge cluster TLS internal error

Error from server: error dialing backend: remote error: tls: internal error

If you see this error at the end of the cluster agent-install process or while trying to interact with the agent pod, there might be an issue with the Certificate Signing Requests (CSR) of your OCP cluster.

Check if you have any CSRs in the Pending state:
```
oc get csr
```

Approve the pending CSRs:

oc adm certificate approve <csr-name>

Note: You can approve all of the CSRs with one command:

for i in $(oc get csr | grep Pending | awk '{print $1}'); do oc adm certificate approve "$i"; done

Additional information

For more information, see:

Troubleshooting