[CUDA build bug] ld: cannot find -lcudart
After installing PyTorch and a CUDA environment with Conda, we tried to compile a library against the Conda CUDA environment and hit this error:
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/compiler_compat/ld: cannot find -lcudart: No such file or directory
collect2: error: ld returned 1 exit status
error: command '/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/bin/g++' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for diff_gaussian_rasterization_32d
Running setup.py clean for diff_gaussian_rasterization_32d
Failed to build diff_gaussian_rasterization_32d
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (diff_gaussian_rasterization_32d)
❯ which nvcc
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/bin/nvcc
❯ echo $CUDA_HOME
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar
❯ echo $PATH
/home/xxxx/local/bin:/home/xxxx/local/bin:/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/bin:/mnt/data/home/xxxx/miniforge3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
❯ echo $LD_LIBRARY_PATH
/mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/lib:
Looking closer, the symlink in the environment's lib directory turns out to be broken:
❯ ls /mnt/data/home/xxxx/miniforge3/envs/GAGAvatar/lib
libcudart.so -> libcudart.so.12.1.55
libcudart.so.12
libcudart.so.12.1.105
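A dangling link like this can also be found mechanically instead of by eye. The following is a minimal sketch, assuming a POSIX shell; `find_dangling_links` is a hypothetical helper name and not part of Conda or CUDA.

```shell
# List symlinks directly inside a directory whose targets do not exist.
# find_dangling_links is a hypothetical helper, not a Conda or CUDA tool.
find_dangling_links() {
    dir="$1"
    for link in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target does not resolve
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s -> %s\n' "$(basename "$link")" "$(readlink "$link")"
        fi
    done
}

# Example call; CONDA_PREFIX is set by `conda activate`:
# find_dangling_links "$CONDA_PREFIX/lib"
```

In an environment in the state shown above, this should report libcudart.so, because its target libcudart.so.12.1.55 is absent.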
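Digging further, installing PyTorch pulls in cuda-cudart=12.1.105, so the lib directory contains libcudart.so.12.1.105 rather than the libcudart.so.12.1.55 that the symlink points to.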
These are all the packages from the pytorch and nvidia channels that get installed along with PyTorch:
+ pytorch-mutex 1.0 cuda pytorch Cached
+ libcublas 12.1.0.26 0 nvidia Cached
+ libcufft 11.0.2.4 0 nvidia Cached
+ libcusolver 11.4.4.55 0 nvidia Cached
+ libcusparse 12.0.2.55 0 nvidia Cached
+ libnpp 12.0.2.50 0 nvidia Cached
+ cuda-cudart 12.1.105 0 nvidia Cached
+ cuda-nvrtc 12.1.105 0 nvidia Cached
+ libnvjitlink 12.1.105 0 nvidia Cached
+ libnvjpeg 12.1.1.14 0 nvidia Cached
+ cuda-cupti 12.1.105 0 nvidia Cached
+ cuda-nvtx 12.1.105 0 nvidia Cached
+ ffmpeg 4.3 hf484d3e_0 pytorch Cached
+ libjpeg-turbo 2.0.0 h9bf148f_0 pytorch Cached
+ cuda-version 12.6 3 nvidia Cached
+ libcurand 10.3.7.77 0 nvidia Cached
+ libcufile 1.11.1.6 0 nvidia Cached
+ cuda-opencl 12.6.77 0 nvidia Cached
+ cuda-libraries 12.1.0 0 nvidia Cached
+ cuda-runtime 12.1.0 0 nvidia Cached
+ pytorch-cuda 12.1 ha16c6d3_6 pytorch Cached
+ pytorch 2.4.1 py3.12_cuda12.1_cudnn9.1.0_0 pytorch Cached
+ torchtriton 3.0.0 py312 pytorch Cached
+ torchaudio 2.4.1 py312_cu121 pytorch Cached
+ torchvision 0.19.1 py312_cu121 pytorch Cached
And these are the packages installed by cuda-toolkit-12.1.0:
+ cuda-documentation 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvml-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnvvm-samples 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cccl 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-driver-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-profiler-api 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0 21MB
+ cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0 11kB
+ libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0 782kB
+ libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0 54MB
+ libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjitlink 12.1.55 0 nvidia/label/cuda-12.1.0 18MB
+ libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0 3MB
+ cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0 5MB
+ cuda-cuobjdump 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cuxxfilt 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvcc 12.1.66 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvprune 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-gdb 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvdisasm 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvprof 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0 58kB
+ cuda-sanitizer-api 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nsight 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ nsight-compute 2023.1.0.15 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cudart-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-opencl-dev 12.1.56 0 nvidia/label/cuda-12.1.0 Cached
+ libcublas-dev 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft-dev 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ gds-tools 1.6.0.25 0 nvidia/label/cuda-12.1.0 Cached
+ libcufile-dev 1.6.0.25 0 nvidia/label/cuda-12.1.0 Cached
+ libcurand-dev 10.3.2.56 0 nvidia/label/cuda-12.1.0 Cached
+ libcusolver-dev 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse-dev 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp-dev 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjitlink-dev 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjpeg-dev 12.1.0.39 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cupti-static 12.1.62 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-compiler 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvvp 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-command-line-tools 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nsight-compute 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-cudart-static 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc-static 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcublas-static 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft-static 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ libcufile-static 1.6.0.25 0 nvidia/label/cuda-12.1.0 Cached
+ libcurand-static 10.3.2.56 0 nvidia/label/cuda-12.1.0 Cached
+ libcusolver-static 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse-static 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp-static 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjpeg-static 12.1.0.39 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-libraries-dev 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-libraries-static 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-visual-tools 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-tools 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-toolkit 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
These are the packages installed by cuda-toolkit-12.1.1:
+ cuda-documentation 12.1.105 0 nvidia/label/cuda-12.1.1 91kB
+ cuda-nvml-dev 12.1.105 0 nvidia/label/cuda-12.1.1 87kB
+ libnvvm-samples 12.1.105 0 nvidia/label/cuda-12.1.1 33kB
+ cuda-cccl 12.1.109 0 nvidia/label/cuda-12.1.1 1MB
+ cuda-driver-dev 12.1.105 0 nvidia/label/cuda-12.1.1 17kB
+ cuda-profiler-api 12.1.105 0 nvidia/label/cuda-12.1.1 19kB
+ cuda-cudart 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-nvrtc 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-opencl 12.1.105 0 nvidia/label/cuda-12.1.1 11kB
+ libcublas 12.1.3.1 0 nvidia/label/cuda-12.1.1 367MB
+ libcufft 11.0.2.54 0 nvidia/label/cuda-12.1.1 108MB
+ libcufile 1.6.1.9 0 nvidia/label/cuda-12.1.1 783kB
+ libcurand 10.3.2.106 0 nvidia/label/cuda-12.1.1 54MB
+ libcusolver 11.4.5.107 0 nvidia/label/cuda-12.1.1 116MB
+ libcusparse 12.1.0.106 0 nvidia/label/cuda-12.1.1 177MB
+ libnpp 12.1.0.40 0 nvidia/label/cuda-12.1.1 147MB
+ libnvjitlink 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ libnvjpeg 12.2.0.2 0 nvidia/label/cuda-12.1.1 3MB
+ cuda-cupti 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-cuobjdump 12.1.111 0 nvidia/label/cuda-12.1.1 245kB
+ cuda-cuxxfilt 12.1.105 0 nvidia/label/cuda-12.1.1 302kB
+ cuda-nvcc 12.1.105 0 nvidia/label/cuda-12.1.1 55MB
+ cuda-nvprune 12.1.105 0 nvidia/label/cuda-12.1.1 67kB
+ cuda-gdb 12.1.105 0 nvidia/label/cuda-12.1.1 6MB
+ cuda-nvdisasm 12.1.105 0 nvidia/label/cuda-12.1.1 50MB
+ cuda-nvprof 12.1.105 0 nvidia/label/cuda-12.1.1 5MB
+ cuda-nvtx 12.1.105 0 nvidia/label/cuda-12.1.1 Cached
+ cuda-sanitizer-api 12.1.105 0 nvidia/label/cuda-12.1.1 18MB
+ cuda-nsight 12.1.105 0 nvidia/label/cuda-12.1.1 119MB
+ nsight-compute 2023.1.1.4 0 nvidia/label/cuda-12.1.1 808MB
+ cuda-cudart-dev 12.1.105 0 nvidia/label/cuda-12.1.1 381kB
+ cuda-nvrtc-dev 12.1.105 0 nvidia/label/cuda-12.1.1 12kB
+ cuda-opencl-dev 12.1.105 0 nvidia/label/cuda-12.1.1 59kB
+ libcublas-dev 12.1.3.1 0 nvidia/label/cuda-12.1.1 76kB
+ libcufft-dev 11.0.2.54 0 nvidia/label/cuda-12.1.1 14kB
+ gds-tools 1.6.1.9 0 nvidia/label/cuda-12.1.1 43MB
+ libcufile-dev 1.6.1.9 0 nvidia/label/cuda-12.1.1 13kB
+ libcurand-dev 10.3.2.106 0 nvidia/label/cuda-12.1.1 460kB
+ libcusolver-dev 11.4.5.107 0 nvidia/label/cuda-12.1.1 51kB
+ libcusparse-dev 12.1.0.106 0 nvidia/label/cuda-12.1.1 178MB
+ libnpp-dev 12.1.0.40 0 nvidia/label/cuda-12.1.1 525kB
+ libnvjitlink-dev 12.1.105 0 nvidia/label/cuda-12.1.1 15MB
+ libnvjpeg-dev 12.2.0.2 0 nvidia/label/cuda-12.1.1 13kB
+ cuda-libraries 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
+ cuda-cupti-static 12.1.105 0 nvidia/label/cuda-12.1.1 12MB
+ cuda-compiler 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-nvvp 12.1.105 0 nvidia/label/cuda-12.1.1 120MB
+ cuda-command-line-tools 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-nsight-compute 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-cudart-static 12.1.105 0 nvidia/label/cuda-12.1.1 948kB
+ cuda-nvrtc-static 12.1.105 0 nvidia/label/cuda-12.1.1 18MB
+ libcublas-static 12.1.3.1 0 nvidia/label/cuda-12.1.1 389MB
+ libcufft-static 11.0.2.54 0 nvidia/label/cuda-12.1.1 199MB
+ libcufile-static 1.6.1.9 0 nvidia/label/cuda-12.1.1 3MB
+ libcurand-static 10.3.2.106 0 nvidia/label/cuda-12.1.1 55MB
+ libcusolver-static 11.4.5.107 0 nvidia/label/cuda-12.1.1 76MB
+ libcusparse-static 12.1.0.106 0 nvidia/label/cuda-12.1.1 185MB
+ libnpp-static 12.1.0.40 0 nvidia/label/cuda-12.1.1 143MB
+ libnvjpeg-static 12.2.0.2 0 nvidia/label/cuda-12.1.1 3MB
+ cuda-libraries-dev 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
+ cuda-libraries-static 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
+ cuda-visual-tools 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-tools 12.1.1 0 nvidia/label/cuda-12.1.1 1kB
+ cuda-toolkit 12.1.1 0 nvidia/label/cuda-12.1.1 2kB
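Comparing the lists shows that cuda-12.1.1 is the release whose cuda-cudart (12.1.105) lines up with the CUDA 12.1 build of PyTorch. However, installing the CUDA 12.1 build of PyTorch first and then cuda-12.1.1 produces a conflict: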
└─ pytorch-cuda is not installable because it requires
   └─ libcublas >=12.1.0.26,<12.1.3.1 , which conflicts with any installable versions previously reported.
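In other words, the damned CUDA 12.1 build of PyTorch needs a libcublas that matches cuda-toolkit-12.1.0, while with the default nvidia channel its cuda-cudart and related libraries resolve to the versions shipped with cuda-toolkit-12.1.1.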
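To see why the solver refuses, it helps to test the two libcublas versions against the range from the error message. This is a rough sketch, assuming GNU `sort -V` is available; `version_lt` and `in_range` are hypothetical helper names, and `sort -V` only approximates Conda's own version ordering, although the two agree for purely numeric versions like these.

```shell
# version_lt A B: succeeds when version A sorts strictly before version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# in_range V LOW HIGH: succeeds when LOW <= V < HIGH.
in_range() {
    ! version_lt "$1" "$2" && version_lt "$1" "$3"
}

# libcublas 12.1.0.26 ships with cuda-toolkit-12.1.0, 12.1.3.1 with 12.1.1.
for v in 12.1.0.26 12.1.3.1; do
    if in_range "$v" 12.1.0.26 12.1.3.1; then
        echo "libcublas $v: accepted by pytorch-cuda"
    else
        echo "libcublas $v: rejected by pytorch-cuda"
    fi
done
```

The upper bound is exclusive, so the libcublas 12.1.3.1 that cuda-toolkit-12.1.1 brings in falls just outside the allowed range.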
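Since pytorch-cuda strictly requires libcublas >=12.1.0.26,<12.1.3.1, we have no choice but to accommodate PyTorch and install CUDA 12.1.0. But here is the trick: we can replace the nvidia channel in the official PyTorch install command with nvidia/label/cuda-12.1.0, using the following command: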
mamba install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
It then installs CUDA libraries at the same versions as those in the cuda-toolkit-12.1.0 we install!
+ pytorch-mutex 1.0 cuda pytorch Cached
+ libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0 Cached
+ libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0 Cached
+ libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0 Cached
+ libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0 Cached
+ libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0 Cached
+ libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0 3MB
+ cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0 21MB
+ cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0 11kB
+ libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0 782kB
+ libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0 54MB
+ cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0 5MB
+ cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0 58kB
+ cuda-version 12.1 h1d6eff3_3 conda-forge 21kB
+ ffmpeg 4.3 hf484d3e_0 pytorch Cached
+ libjpeg-turbo 2.0.0 h9bf148f_0 pytorch Cached
+ libnvjitlink 12.1.105 hd3aeb46_0 conda-forge 16MB
+ cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ cuda-runtime 12.1.0 0 nvidia/label/cuda-12.1.0 Cached
+ pytorch-cuda 12.1 ha16c6d3_6 pytorch Cached
+ pytorch 2.4.1 py3.12_cuda12.1_cudnn9.1.0_0 pytorch Cached
+ torchtriton 3.0.0 py312 pytorch Cached
+ torchvision 0.19.1 py312_cu121 pytorch Cached
+ torchaudio 2.4.1 py312_cu121 pytorch Cached
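At this point the problem is solved. From now on, whenever we need to install pytorch-cuda and cuda-toolkit, we only have to run the following commands (the order should no longer matter):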
mamba install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia/label/cuda-12.1.0
mamba install nvidia/label/cuda-12.1.0::cuda-toolkit -c nvidia/label/cuda-12.1.0
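Installing cuda-toolkit then simply tops up the subset of CUDA libraries that pytorch-cuda already pulled in. Since everything comes from the same channel, there is naturally no conflict.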
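To confirm that the CUDA packages really do come from one channel after installing, the package listing can be filtered. The sketch below assumes the `conda list` layout of name, version, build and channel columns; `report_foreign_cuda_pkgs` is a hypothetical helper name, and the package-name pattern is my own guess at what counts as a CUDA library here.

```shell
# Reads `conda list`-style rows (name version build channel) on stdin and
# prints the CUDA-related packages whose channel differs from the expected one.
# report_foreign_cuda_pkgs is a hypothetical helper, not a Conda command.
report_foreign_cuda_pkgs() {
    awk -v want="$1" '
        /^#/ { next }    # skip the header comment lines
        $1 ~ /^(cuda-|libcu(blas|fft|rand|solver|sparse|file)|libnpp|libnvj)/ && $4 != want {
            print $1, $2, $4
        }
    '
}

# Example call inside the activated environment:
# conda list | report_foreign_cuda_pkgs nvidia/label/cuda-12.1.0
```

Going by the install log above, this would be expected to flag cuda-version and libnvjitlink, which were resolved from conda-forge rather than from nvidia/label/cuda-12.1.0.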