Fixing PyTorch multi-GPU parallel hangs on AMD CPUs

Date: 2019-03-11 23:41:54


dataparallel not working on nvidia gpus and amd cpus

 
 
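Problem:
 
When running on multiple GPUs, the network hangs and never makes progress.
The system is an AMD Ryzen 5 1600X with two TITAN Xp cards.
Previously the two cards were a 2070 and a TITAN Xp, and multi-GPU runs worked then; the only difference was that the two cards had different amounts of video memory...
 
Looking at the logs, they were full of the errors below.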
 
these error messages were found in the dmesg log:

[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
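
To check quickly whether another machine is hitting the same fault, log lines like the ones above can be counted per PCI device. This is a minimal Python sketch, assuming the log text has already been captured as a string (for example from `dmesg`). The function name `count_io_page_faults` and the regular expression are illustrative and are not part of the original post.

```python
import re
from collections import Counter

# Pattern inferred from the log lines shown above (an assumption, not an
# official format description). It captures the PCI address that precedes
# the "AMD-Vi: Event logged [IO_PAGE_FAULT ..." text.
FAULT_RE = re.compile(
    r"(?P<device>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]):\s+"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT\b"
)


def count_io_page_faults(log_text):
    """Return a Counter mapping PCI device address to IO_PAGE_FAULT count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts
```

One way to feed it is `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; on some systems reading the kernel log requires root.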

 

After some searching, this appears to be a bug...
 
Temporary workaround:
 
Edit /etc/default/grub:
 
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX="iommu=soft" # note: this is the line to change ...
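
The change amounts to making sure `iommu=soft` appears inside the quoted `GRUB_CMDLINE_LINUX` value while leaving `GRUB_CMDLINE_LINUX_DEFAULT` alone. The Python sketch below shows that transformation on the file's text only. It does not read or write `/etc/default/grub`, and it assumes the value is written with double quotes as in the listing above. The helper name `add_kernel_param` is hypothetical.

```python
import re

# Matches only the GRUB_CMDLINE_LINUX="..." line, not GRUB_CMDLINE_LINUX_DEFAULT.
# Assumes the value is enclosed in double quotes, as in the listing above.
LINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param="iommu=soft"):
    """Return grub_text with param present in GRUB_CMDLINE_LINUX (idempotent)."""

    def _rewrite(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, replaced = LINE_RE.subn(_rewrite, grub_text)
    if replaced == 0:
        raise ValueError("no double-quoted GRUB_CMDLINE_LINUX line found")
    return new_text
```

Existing parameters on the line are kept, and running the function twice does not add the parameter a second time.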

 

Then run:
sudo update-grub
Finally, reboot.
 
After this, multi-GPU runs work normally.
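
To confirm after the reboot that the kernel really booted with the parameter, the running command line can be inspected. This is a small Python sketch, assuming a Linux system where `/proc/cmdline` is readable. The function names are illustrative.

```python
def kernel_param_active(cmdline, param="iommu=soft"):
    """Return True if param appears as a whole token in the kernel command line."""
    return param in cmdline.split()


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling `kernel_param_active(read_kernel_cmdline())` should return `True` once the GRUB change has taken effect. If it returns `False`, the GRUB configuration was probably not regenerated before rebooting.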

Original article: https://www.cnblogs.com/JiangOil/p/10513906.html