Fixing the distributed-training error "terminate called after throwing an instance of 'std::length_error'"

Popularity: 71   Published: 2023-12-19 13:35:58.0

While running distributed training, the process aborted with the following output:

INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).
I0408 04:01:41.507015 140706188736256 cross_device_ops.py:427] Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Create CheckpointSaverHook.
I0408 04:01:44.424420 140706188736256 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::append
Fatal Python error: Aborted 

After a long, roundabout investigation, reducing the number of GPUs allowed the job to run normally.
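The post does not say how the GPU count was reduced. One common way, sketched below as an assumption rather than the author's actual method, is to set the CUDA environment variable `CUDA_VISIBLE_DEVICES` before TensorFlow is imported. The helper name `limit_visible_gpus` is hypothetical and not part of any library.

```python
import os


def limit_visible_gpus(num_gpus, env=None):
    """Expose only the first `num_gpus` GPUs to CUDA-based libraries.

    Hypothetical helper, not part of TensorFlow. Call it before
    importing TensorFlow, since the variable is read when the
    process first initializes its CUDA devices.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    env = os.environ if env is None else env
    # "0,1" exposes GPUs 0 and 1; an empty string hides every GPU.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_gpus))
    return env["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    # Example: restrict the job to 2 GPUs, then import TensorFlow afterwards.
    print(limit_visible_gpus(2))
```

With the variable set this way, a distribution strategy that defaults to all visible GPUs should only pick up the remaining devices. Exporting `CUDA_VISIBLE_DEVICES` in the shell before launching the training script has the same effect.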
