High-Performance Computing

CUDA installation: nvidia.ko error

April 27, 2010

[root@gpu-node1 cuda_test]# ./a.out
FATAL: Error inserting nvidia (/lib/modules/2.6.18-164.el5PAE/kernel/drivers/video/nvidia.ko): Invalid module format
Checking /var/log/nvidia-installer.log shows the following message:
nvidia: disagrees about version of symbol struct_module.

[root@gpu-node1 install]# dmesg|grep gcc
Linux version 2.6.18-164.el5PAE (mockbuild@builder16.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 04:10:44 EDT 2009
[root@gpu-node1 install]# gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Copyright (C) 2006 Free Software Foundation, I

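Looking at dmesg again, the running kernel is Linux version 2.6.18-164.el5PAE, but the driver was installed with:
./devdriver_3.0_linux_32_195.36.15.run --kernel-source-path /usr/src/kernels/2.6.18-164.el5-i686
so the running kernel and the kernel sources do not match.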
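A mismatch like this can be caught before running the installer by reading the release string recorded in the source tree and comparing it with the running kernel. The following is only a sketch under assumptions: the function names `tree_release` and `tree_matches_kernel` are made up for illustration, and the header locations are the ones used by 2.6.18-era kernels (`include/linux/utsrelease.h`) and by newer kernels (`include/generated/utsrelease.h`). A distribution tree that defines `UTS_RELEASE` more than once would need extra handling.

```shell
#!/bin/sh
# Print the UTS_RELEASE string recorded in a kernel source tree.
# Assumption: the header lives in one of the two locations below.
# $1: kernel source tree, e.g. /usr/src/kernels/2.6.18-164.el5-i686
tree_release() {
    for h in "$1/include/linux/utsrelease.h" \
             "$1/include/generated/utsrelease.h"; do
        if [ -f "$h" ]; then
            sed -n 's/^#define[[:space:]]*UTS_RELEASE[[:space:]]*"\(.*\)".*/\1/p' "$h"
            return 0
        fi
    done
    return 1
}

# Succeed only when the tree was prepared for the given kernel release.
# $1: kernel source tree, $2: kernel release such as the output of uname -r
tree_matches_kernel() {
    [ "$(tree_release "$1")" = "$2" ]
}

# Example (not run here):
#   tree_matches_kernel /usr/src/kernels/2.6.18-164.el5-i686 "$(uname -r)" \
#       || echo "source tree does not match the running kernel"
```

On most distributions the symlink /lib/modules/$(uname -r)/build also points at the tree that belongs to the running kernel, which is another way to cross-check the path.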
Checking grub.conf next, I found two CentOS kernel entries:
default=1
timeout=5
splashimage=(hd0,7)/boot/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-164.el5)
        root (hd0,7)
        kernel /boot/vmlinuz-2.6.18-164.el5 ro root=LABEL=/
        initrd /boot/initrd-2.6.18-164.el5.img
title CentOS (2.6.18-164.el5PAE)
        root (hd0,7)
        kernel /boot/vmlinuz-2.6.18-164.el5PAE ro root=LABEL=/
        initrd /boot/initrd-2.6.18-164.el5PAE.img
title Other
        rootnoverify (hd0,1)
        chainloader +1
title centos64
        rootnoverify (hd0,8)
        chainloader +1
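The system boots CentOS (2.6.18-164.el5PAE) by default. I deleted that entry so that the machine boots the first entry, then reinstalled:
./devdriver_3.0_linux_32_195.36.15.run --kernel-source-path /usr/src/kernels/2.6.18-164.el5-i686
This time it finally succeeded.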
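To see which entry a grub.conf will actually boot, the `default=` index can be resolved to its title. The helper below is a minimal sketch, assuming a GRUB legacy file in the format shown above. The name `grub_default_title` is hypothetical, and it ignores features such as `default saved` and `fallback`.

```shell
#!/bin/sh
# Print the title selected by "default=N" in a GRUB legacy config.
# GRUB legacy numbers title entries from 0 and uses 0 when no default is set.
# $1: path to the config file, e.g. /boot/grub/grub.conf
grub_default_title() {
    awk '
        BEGIN { want = 0 }
        /^default=/   { sub(/^default=/, ""); want = $0 + 0; next }
        /^title[ \t]/ { sub(/^title[ \t]+/, ""); titles[n++] = $0 }
        END { if (want in titles) print titles[want] }
    ' "$1"
}

# Example (not run here):
#   grub_default_title /boot/grub/grub.conf
```

Because the numbering starts at 0, removing an entry shifts every entry after it, so it is worth re-running the check after editing the file and adjusting `default=` if needed.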

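Also, when I first ran yum install kernel kernel-headers kernel-devel, yum used the USTC mirror by default, so the downloaded kernel version differed from the installed one. That produced the error "nvidia: disagrees about version of symbol struct_module."
This error generally means that the running kernel and the kernel sources do not match. When it appears, read the installer output carefully and check the installation log under /var/log.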

So run through the following checks:
Okay, having reconfigured your kernel, did you reinstall your kernel and reboot to the new one, and did you check that grub points to the new kernel image?

Also, when you installed, did you check that you had the /boot partition mounted? Because the Gentoo handbook says to set the /boot partition to noauto, a lot of new users forget to mount their boot partition before copying a new kernel to /boot and get confused.

You may also need to visit /lib/modules/{kernel-version}/kernel/drivers/ and poke around to find the nvidiafb module and remove it.

Something that can be helpful to check you got your kernel rebuild right is enabling the exporting of the currently running kernel config through /proc/config.gz, then you can gzcat /proc/config.gz | grep -i <searchterm>, to check you got it right and loaded the right kernel.
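The last two checks can be scripted. This is a rough sketch rather than a fixed recipe: the function names `find_fb_modules` and `config_value` are invented for illustration, and the second one assumes the kernel exposes /proc/config.gz, which many distribution kernels do not. `CONFIG_FB_NVIDIA` and `CONFIG_FB_RIVA` are the kernel options for the nvidiafb and rivafb drivers.

```shell
#!/bin/sh
# List framebuffer driver modules that can claim the NVIDIA device.
# $1: a module directory, e.g. /lib/modules/$(uname -r)
find_fb_modules() {
    find "$1" -type f \( -name 'nvidiafb.ko*' -o -name 'rivafb.ko*' \)
}

# Print the value of one option from a gzip-compressed kernel config.
# Prints nothing when the option is unset or absent.
# $1: config file such as /proc/config.gz, $2: option name
config_value() {
    gzip -dc "$1" | sed -n "s/^$2=//p"
}

# Example (not run here):
#   find_fb_modules "/lib/modules/$(uname -r)"
#   config_value /proc/config.gz CONFIG_FB_NVIDIA
```

When /proc/config.gz is missing, a plain-text config is often available as /boot/config-$(uname -r) and can be searched with grep directly.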

Quoted from:

http://forums.gentoo.org/viewtopic-t-811924-start-0.html
http://www.linuxquestions.org/questions/linux-general-1/unable-to-install-nvidia-drivers-587637/

-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most
       frequently when this kernel module was built against the wrong or
       improperly configured kernel sources, with a version of gcc that differs
       from the one used to build the target kernel, or if a driver such as
       rivafb/nvidiafb is present and prevents the NVIDIA kernel module from
       obtaining ownership of the NVIDIA graphics device(s).
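One way to test the first cause named in this message is to compare the module's vermagic string with the running kernel. The sketch below assumes `modinfo` from module-init-tools or kmod is installed. The function names are hypothetical, and the sample vermagic string in the comment is illustrative, not taken from the machine described here.

```shell
#!/bin/sh
# Extract the kernel release from a vermagic string, i.e. its first word.
# Illustrative input: "2.6.18-164.el5 SMP mod_unload 686 REGPARM 4KSTACKS gcc-4.1"
vermagic_release() {
    printf '%s\n' "${1%% *}"
}

# Succeed when the module file was built for the running kernel.
# Returns 2 when modinfo cannot read the file.
# $1: path to a module file such as nvidia.ko
module_matches_running() {
    magic=$(modinfo -F vermagic "$1") || return 2
    [ "$(vermagic_release "$magic")" = "$(uname -r)" ]
}

# Example (not run here):
#   module_matches_running \
#       "/lib/modules/$(uname -r)/kernel/drivers/video/nvidia.ko" \
#       || echo "nvidia.ko was built for a different kernel"
```

The rest of the vermagic string also records the compiler version and options such as SMP, which helps with the gcc mismatch mentioned in the same message.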

See the discussion here: http://www.nvnews.net/vbulletin/showthread.php?t=49951&page=3

Looks like I spoke too soon, the problem was solved by using the "-k $(uname -r)"  (thank you jong0357 and the all of you)

Change the command to:
./devdriver_3.0_linux_32_195.36.15.run --kernel-source-path /usr/src/kernels/2.6.18-164.15.1.el5-i686 -k $(uname -r)
