
mxnet mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: an illegal memory access


mxnet 1.6

Error:

Traceback (most recent call last):
  File "train_0723.py", line 455, in <module>
    main()
  File "train_0723.py", line 451, in main
    train_net(args)
  File "train_0723.py", line 445, in train_net
    epoch_end_callback=epoch_cb)
  File "/home/user1/recognition/parall_module_local_v1_gluon_group.py", line 573, in fit
    self.update()
  File "/home/user1/recognition/parall_module_local_v1_gluon_group.py", line 406, in update
    mx.nd.waitall()
  File "/home/user1/miniconda3/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 200, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/user1/miniconda3/lib/python3.7/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:32:38] /home/ubuntu/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
Stack trace:
[bt] (0) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6b41eb) [0x7f76131a51eb]
[bt] (1) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37b2742) [0x7f76162a3742]
[bt] (2) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37e3515) [0x7f76162d4515]
[bt] (3) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37bf6d1) [0x7f76162b06d1]
[bt] (4) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37c2c10) [0x7f76162b3c10]
[bt] (5) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37c2ea6) [0x7f76162b3ea6]
[bt] (6) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37bde84) [0x7f76162aee84]
[bt] (7) /home/user1/miniconda3/bin/../lib/libstdc++.so.6(+0xc8421) [0x7f76aca9d421]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f76bb1f0609]

Troubleshooting:
Note that MXNet executes ops asynchronously, so the error only surfaces at a synchronization point such as mx.nd.waitall() (as in the traceback above), not necessarily at the op that actually caused it. Check the context of the different ndarrays involved:

print(array1.context)

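A slightly fuller, self-contained sketch of this check (the array names are hypothetical stand-ins for whatever feeds the failing op):

import mxnet as mx

# Hypothetical stand-ins for the real operands
array1 = mx.nd.array([1, 0, 1], ctx=mx.cpu())
array2 = mx.nd.array([0.9, 0.1, 0.8], ctx=mx.cpu())

# Every operand of a single op should report the same context
for name, arr in [("array1", array1), ("array2", array2)]:
    print(name, arr.context)
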
Possible causes:

  1. batch_size is too large. Lowering batch_size from 100 to 80 let training run for 8 epochs without the error reappearing.
  2. label and pred are in different contexts, one possibly on the GPU and the other on the CPU. You may have changed one of them yourself with a line like label = labels[i].as_in_context(mx.gpu(0)), leaving the two inconsistent (see the sketch after this list).
  3. A custom op.
  4. A machine-specific issue: on another machine with the same batch_size per GPU, the same GPU model, and the same CPU model (the only difference being 4 GPUs instead of 3), this problem never appeared.
  5. Honestly, I don't know what actually causes it, and there is no solid conclusion online either; corrections from anyone more knowledgeable are welcome.
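
For cause 2, here is a minimal sketch of how label and pred might be aligned (names and values are hypothetical, and it assumes a GPU is available); the idea is to pick one context and move both operands into it, e.g. the context pred already lives in:

import mxnet as mx

ctx = mx.gpu(0)                                # assumes a GPU is available
pred = mx.nd.array([0.9, 0.1, 0.8], ctx=ctx)   # e.g. network output already on the GPU
label = mx.nd.array([1, 0, 1], ctx=mx.cpu())   # label accidentally left on the CPU

label = label.as_in_context(pred.context)      # move label to wherever pred lives
loss = (pred - label) ** 2                     # both operands now share one context
print(loss.context)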

Solution:
Every ndarray involved in an operation (op) must be placed in the same context, e.g. mx.gpu(0) or mx.cpu():

diff1 = diff1.as_in_context(tmp_ctx)   # move every operand into the same context first
diff2 = diff2.as_in_context(tmp_ctx)
bkgradDiff = 2 * diff1 * diff2         # both operands now live in tmp_ctx
bkgradDiff = bkgradDiff.as_in_context(tmp_ctx)

If you really cannot sort it out, just move all of them onto the CPU and do the computation there:

tmp_ctx = mx.cpu()
someNdarray = someNdarray.as_in_context(tmp_ctx)   # as_in_context returns a copy, so reassign it
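
Putting it together, a self-contained sketch of the CPU fallback, reusing the diff1/diff2 names from above with made-up values (remember that as_in_context returns a copy, so the result must be reassigned):

import mxnet as mx

tmp_ctx = mx.cpu()

diff1 = mx.nd.array([0.5, -0.2])       # stand-ins for the real arrays
diff2 = mx.nd.array([0.1, 0.3])

diff1 = diff1.as_in_context(tmp_ctx)   # move every operand to the CPU
diff2 = diff2.as_in_context(tmp_ctx)
bkgradDiff = 2 * diff1 * diff2         # computed entirely in tmp_ctx
print(bkgradDiff.context)              # cpu(0)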