
mxnet stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: unspecified launch failure Stack tr

Popularity: 11   Published: 2023-12-15 16:13:39.0

Full error output:

Traceback (most recent call last):
  File "train_parall_fc7.py", line 409, in <module>
    main()
  File "train_parall_fc7.py", line 406, in main
    train_net(args)
  File "train_parall_fc7.py", line 401, in train_net
    epoch_end_callback = epoch_cb )
  File "/home/user1/pjs/frvt/arcface_attention/recognition/parall_module_local_v1.py", line 561, in fit
    batch_end_callback(batch_end_params)
  File "train_parall_fc7.py", line 313, in _batch_callback
    acc_list = ver_test(mbatch)
  File "train_parall_fc7.py", line 282, in ver_test
    acc1, std1, acc2, std2, xnorm, embeddings_list = verification.test(ver_list[i], model, args.batch_size, 10, None, None)
  File "eval/verification.py", line 252, in test
    _embeddings = net_out[0].asnumpy()
  File "/home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:28:19] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: unspecified launch failure
Stack trace:
  [bt] (0) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7f7d2afbe4cb]
  [bt] (1) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x25b52b2) [0x7f7d2d0c32b2]
  [bt] (2) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x27d988d) [0x7f7d2d2e788d]
  [bt] (3) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x25c1cd1) [0x7f7d2d0cfcd1]
  [bt] (4) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x25c80b0) [0x7f7d2d0d60b0]
  [bt] (5) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x25c8346) [0x7f7d2d0d6346]
  [bt] (6) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x25c0434) [0x7f7d2d0ce434]
  [bt] (7) /home/user1/miniconda3/envs/hardsample/lib/python2.7/site-packages/scipy/sparse/../../../../libstdc++.so.6(+0xc8421) [0x7f7d6d393421]
  [bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f7d747c1609]

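Cause: Probably a bug in the program that leaked host memory (or GPU memory). Once this has happened, simply rerunning the training program does not help: it fails again, this time with a GPU out-of-memory error.
Fix: Reboot the machine (this presumably clears the leftover memory), then run the program again.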
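
To confirm that GPU memory really is still occupied before rebooting or rerunning, something like the following could help. This is only a minimal sketch and not part of the original report: it assumes `nvidia-smi` is on the PATH and reports plain integer values, and the helper names (`parse_gpu_memory`, `busy_gpus`, `query_gpu_memory`) and the 50% threshold are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return rows


def busy_gpus(rows, threshold=0.5):
    """Return indices of GPUs using more than `threshold` of their memory.

    The 0.5 default is an arbitrary illustrative cutoff, not a recommended value.
    """
    return [r["index"] for r in rows
            if r["used_mib"] > threshold * r["total_mib"]]


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (assumes nvidia-smi is installed)."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_memory(out.decode("utf-8"))
```

Calling `busy_gpus(query_gpu_memory())` before relaunching training lists the GPUs that still hold a lot of memory. If a GPU shows up there while no training job is running, the memory has not been released, which matches the out-of-memory symptom described above.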
