Friday, 12 February 2021

PyTorch and TensorFlow don't find CUDA after wake-up

So this seems to be a bug. After suspend, nvidia works (i.e. $optirun glxspheres runs fine), yet torch.cuda.is_available() returns "False".

So there are two things one could do: 1) restart, or 2) try to reload nvidia and company.

The second should be done carefully, as I once froze my PC doing it and had to restart anyway. So you have to do (source):

sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
sudo modprobe nvidia
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
sudo modprobe nvidia_uvm
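The unload/reload sequence above can be captured in a small dry-run helper that only prints the commands. The name reload_commands is mine, not from any tool, and since automating rmmod is exactly how one freezes a PC, I'd print the lines and run them by hand:

```python
# The nvidia module stack, in the unload order used above
# (dependents first, base driver last).
MODULES = ["nvidia_uvm", "nvidia_drm", "nvidia_modeset", "nvidia"]

def reload_commands(modules=MODULES):
    """Return the rmmod/modprobe command lines, without executing anything."""
    unload = [f"sudo rmmod {m}" for m in modules]
    # modprobe goes in the reverse order: base driver first.
    load = [f"sudo modprobe {m}" for m in reversed(modules)]
    return unload + load

for cmd in reload_commands():
    print(cmd)
```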
 
Another way is to use "modprobe -r", which resolves the dependency issues for you.
You can find what is in use with (source):
lsmod | grep nvidia
sudo modprobe -r <module found from lsmod> <module you want to remove> 
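Since the tricky part is figuring out which module is used by which, here is a small sketch that parses lsmod output into a name → used-by map. The nvidia_modules helper is my own, shown on sample output (lsmod columns are: name, size, use count, comma-separated users):

```python
def nvidia_modules(lsmod_output):
    """Map each loaded nvidia* module to the list of modules using it."""
    mods = {}
    for line in lsmod_output.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("nvidia"):
            used_by = parts[3].split(",") if len(parts) > 3 else []
            mods[parts[0]] = [u for u in used_by if u]
    return mods

# Sample `lsmod | grep nvidia` output (values made up for illustration):
sample = """nvidia_uvm 1036288 0
nvidia_drm 57344 2
nvidia_modeset 1228800 3 nvidia_drm
nvidia 35241984 101 nvidia_uvm,nvidia_modeset"""

print(nvidia_modules(sample))
```

A module with an empty list (like nvidia_uvm here) can be removed first; nvidia itself has to go last.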
  
A good practice, obviously, is to stop your Jupyter notebook before suspending, which they claim releases the nvidia driver, but I still have to try this.

Thursday, 11 February 2021

How to enable GPU in TensorFlow

First, you have to have CUDA installed. That was not obvious, as it comes with:

$sudo equo install nvidia-cuda-toolkit

So after we install the nvidia-cuda-toolkit, we have to tell Linux where the CUDA bin and libs are:

$nano /home/$USER/.bashrc
Add the following lines:
export PATH="/usr/local/cuda-8.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"

In my case, CUDA is located in /opt/cuda, so that path replaces /usr/local/cuda-8.0 in the lines above. Save the file.

$source /home/$USER/.bashrc
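The two export lines only differ by the CUDA prefix, so they can be generated for whatever path your distro uses. The cuda_env_lines helper below is just my sketch, not part of any tool:

```python
def cuda_env_lines(prefix="/opt/cuda"):
    """Return the two export lines for ~/.bashrc, for a given CUDA prefix."""
    return [
        f'export PATH="{prefix}/bin:$PATH"',
        f'export LD_LIBRARY_PATH="{prefix}/lib64:$LD_LIBRARY_PATH"',
    ]

# Print the lines for a CUDA installed under /opt/cuda:
print("\n".join(cuda_env_lines("/opt/cuda")))
```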

Now you have nvcc (source).

We can test whether it works with this simple code (see the comment "Compile and run a CUDA hello world").

Also you can do

$nvcc -V 

to see the version and build info.

I think at some point I did:

export CUDA_VISIBLE_DEVICES=1

but then I added it to my notebook (before TensorFlow initializes CUDA, since the variables are read at initialization) as:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Now for the fun part: it seems that TensorFlow requires not only CUDA but also cuDNN, which doesn't come with the cuda-toolkit and has to be downloaded separately from Nvidia's website (https://developer.nvidia.com/cudnn), after registration, a survey, and a warning about ethical use of AI (WTF?). The installation is very simple. You download the tar file, untar it, and then you do:

$ sudo cp cuda/include/cudnn*.h /opt/cuda/include
$ sudo cp -P cuda/lib64/libcudnn* /opt/cuda/lib64
$ sudo chmod a+r /opt/cuda/include/cudnn*.h /opt/cuda/lib64/libcudnn*


where /opt/cuda is the path to my CUDA installation (it may also be /usr/local/cuda). Run these commands from the directory that contains the extracted cuda folder (e.g. Downloads), not from inside it. And that's it, TensorFlow works with the GPU.
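One way to sanity-check that the headers landed in the right place is to parse the version defines out of the installed cudnn.h (or cudnn_version.h in cuDNN 8+). The cudnn_version helper below is my sketch, demonstrated on a sample header snippet; on a real system you would pass in the contents of /opt/cuda/include/cudnn_version.h:

```python
import re

def cudnn_version(header_text):
    """Parse the cuDNN version defines out of header file contents."""
    version = {}
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(rf"#define\s+{key}\s+(\d+)", header_text)
        version[key] = int(m.group(1)) if m else None
    return version

# What the real defines look like (version numbers here are an example):
sample_header = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 0
"""

print(cudnn_version(sample_header))
```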



Also to test it, you can use:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))
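To know what the session should print, the same 2x3 @ 3x2 product can be computed in plain Python (this check is mine, not from the post's sources):

```python
# The same matrices as in the TensorFlow example, as nested lists.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Plain matrix multiplication: c[i][j] = sum_k a[i][k] * b[k][j].
c = [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(2)]
     for i in range(2)]

print(c)  # [[22.0, 28.0], [49.0, 64.0]]
```

If the GPU run prints anything else, something is wrong with the setup rather than the math.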


Another way to test CUDA with TensorFlow (note that TF_XLA_FLAGS should be set before TensorFlow is imported, or it may not take effect) is:


python3 -c "import os; os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'; import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

 

or just:

python3 -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"