Fixing TensorFlow's failure to find the GPU in HPCC Systems
I installed and configured the environment following the setup instructions provided by TensorFlow and CUDA, but TensorFlow running inside HPCC Systems could not detect the GPU. This blog post documents how I resolved the issue.
I ran the following test code:
    IMPORT PYTHON3 AS PYTHON;
Reason: In my previous environment setup, I installed TensorFlow and CUDA as root, but I only set the environment variables in the .bashrc file of the current user. However, HPCC Systems creates a new user named "hpcc" and runs with that user's environment. As a result, LD_LIBRARY_PATH and the other environment variables were not set for the "hpcc" user, causing CUDA and GPU recognition to fail.
I first set a password for the hpcc user:
    sudo passwd hpcc
In Ubuntu, there are two methods to switch to another user:
- su user: The su command asks for the password of the target user (unless you are already root). Without the - (or -l) option it does not start a login shell, so the target user's login environment is not loaded. It only switches to the target user's identity and inherits most of the current user's environment variables.
- sudo -i -u user: This asks for your own password and requires sudo privileges. It starts a login shell as the target user, so login files such as /etc/profile are read and the target user's full environment settings are loaded.
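One way to see the difference between the two methods is to capture the output of env under each and compare the results. The following is a minimal sketch, not part of the original setup: the helper names parse_env and env_diff are hypothetical, and the two dumps are made-up sample data standing in for the output of `su hpcc -c env` and `sudo -i -u hpcc env`.

```python
def parse_env(dump: str) -> dict[str, str]:
    """Parse the output of `env` (one NAME=value per line) into a dict."""
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # ignore lines that are not NAME=value
            env[name] = value
    return env


def env_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (before_value, after_value)} for every variable that differs."""
    names = sorted(set(before) | set(after))
    return {
        n: (before.get(n), after.get(n))
        for n in names
        if before.get(n) != after.get(n)
    }


# Illustrative dumps only (assumed values); real ones would come from
#   su hpcc -c env      and      sudo -i -u hpcc env
su_dump = "HOME=/home/hpcc\nPATH=/usr/bin:/bin\n"
login_dump = (
    "HOME=/home/hpcc\nPATH=/usr/bin:/bin\n"
    "LD_LIBRARY_PATH=/usr/local/cuda/lib64\n"
)

diff = env_diff(parse_env(su_dump), parse_env(login_dump))
print(diff)
```

A variable that shows up only in the second dump is one that is set by the login files. The parser assumes that no value contains a newline, which is good enough for a quick check.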
Following the usual Linux rules for setting environment variables, I added the previously set environment variables to /etc/profile, which login shells read for every user:
    alias python='python3'
After switching to the hpcc user with sudo -i -u hpcc and running env, I found that the settings had taken effect.
Please note that at this point you should not use su hpcc to switch users, because it does not start a login shell and would show the environment inherited from the calling user instead.
However, even with these settings in place, HPCC Systems still could not recognize the GPU.
So I decided to retrieve the environment variables from code running inside HPCC. After restarting HPCC Systems, I ran:
    IMPORT PYTHON3 AS PYTHON;
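A rough sketch of what the Python side of such a check could look like (this is not the exact code from this post, and it leaves out the ECL wrapper). The function name gpu_env_report and the list of variables in GPU_ENV_VARS are assumptions chosen for illustration.

```python
import os

# Variables relevant to CUDA/cuDNN discovery; the exact list is an assumption.
GPU_ENV_VARS = ("LD_LIBRARY_PATH", "CUDNN_PATH", "PATH")


def gpu_env_report(env=None):
    """Return the selected variables as seen by this process (None if unset)."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in GPU_ENV_VARS}


if __name__ == "__main__":
    for name, value in gpu_env_report().items():
        print(f"{name} = {value!r}")
```

Running the same function from a terminal and from inside HPCC and comparing the two reports shows which variables the HPCC process is actually missing.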
I found that the environment variable LD_LIBRARY_PATH was not loaded correctly. Could it be because CUDNN_PATH is set through a Python statement that was not executed correctly? To test this, I changed CUDNN_PATH to
    /usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib
which is the value obtained from the terminal, and after testing, I found that the GPU could be recognized correctly.
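A quick way to confirm whether a given LD_LIBRARY_PATH value can actually supply cuDNN is to look for the library files in each listed directory. The sketch below is only an illustration under stated assumptions: dirs_with_library is a hypothetical helper, the libcudnn file-name prefix is assumed, and the demo uses a temporary directory with an empty placeholder file rather than a real cuDNN install.

```python
import os
import tempfile


def dirs_with_library(ld_library_path: str, prefix: str = "libcudnn") -> list[str]:
    """Return the entries of a colon-separated path that hold a file starting with `prefix`."""
    hits = []
    for entry in ld_library_path.split(":"):
        # Skip empty entries and directories that do not exist.
        if not entry or not os.path.isdir(entry):
            continue
        if any(name.startswith(prefix) for name in os.listdir(entry)):
            hits.append(entry)
    return hits


# Self-contained demo: a temporary tree stands in for the real install.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "cudnn", "lib")
    os.makedirs(good)
    open(os.path.join(good, "libcudnn.so.8"), "w").close()  # placeholder file

    missing = os.path.join(tmp, "missing")  # this directory is never created
    broken = dirs_with_library(missing)
    fixed = dirs_with_library(good + ":" + missing)
    print(broken, fixed == [good])
```

An empty result for the value reported from inside HPCC, next to a non-empty result for the value seen in the terminal, would point to the variable being built incorrectly rather than to a missing installation.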
Additionally, I later stumbled upon an error:
    /etc/profile: line 33: python: command not found
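It turned out that the python command could not be found. The alias python='python3' in /etc/profile presumably did not help here, because bash does not expand aliases in non-interactive shells by default. I resolved this issue by changing python to python3 in the following setting:
    CUDNN_PATH=$(dirname $(python3 -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))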
After making this change, the code ran successfully without any issues.
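To check in advance which interpreter name is really available on PATH, a lookup such as the following may help. This is a minimal sketch; first_available is a hypothetical helper. It relies on shutil.which from the standard library, which searches PATH only and does not see shell aliases, so it reflects what a non-interactive script would find.

```python
import shutil


def first_available(candidates, which=shutil.which):
    """Return the first command name that `which` can resolve on PATH, else None."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


# On a system with only python3 installed, "python" is skipped and "python3" is returned.
print(first_available(["python", "python3"]))
```

The which parameter exists only so that the lookup can be replaced in a test; in normal use the default is enough.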