hans

【Linux】Keras multi-GPU distributed training error


Error message:

  Epoch 1/1
  2020-01-13 21:39:19.392806: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
  2020-01-13 21:39:22.432074: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at tensor_array_ops.cc:122 : Not found: Resource __per_step_10/_tensor_arraysinput_ta_0_4/N10tensorflow11TensorArrayE does not exist.
  Traceback (most recent call last):
    File "/home/hans/WorkSpace/client.py", line 90, in <module>
      hist = client.train(convertDict(dictItem), dictItem.get(f"DirName"), dictItem.get(f"round"), CRNNconfig["ESTIMATION"]["SAVE_CLIENT_MODEL"])
    File "/home/hans/WorkSpace/FLutils/client.py", line 44, in train
      verbose=1)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
      return func(*args, **kwargs)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/keras/engine/training.py", line 1732, in fit_generator
      initial_epoch=initial_epoch)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
      reset_metrics=False)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/keras/engine/training.py", line 1514, in train_on_batch
      outputs = self.train_function(ins)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
      run_metadata=self.run_metadata)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
      run_metadata_ptr)
    File "/home/hans/WorkSpace/venv/Rand2AI/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
      c_api.TF_GetCode(self.status.status))
  tensorflow.python.framework.errors_impl.NotFoundError: Resource __per_step_10/_tensor_arraysinput_ta_0_4/N10tensorflow11TensorArrayE does not exist.
  	 [[{{node training/Adadelta/gradients/replica_0/model_1/rnn1_bgru1/while_1/TensorArrayReadV3_grad/TensorArrayGrad/TensorArrayGradV3}}]]
  	 [[{{node training/Adadelta/gradients/b_count_26}}]]

I spent a whole day on this, reinstalling in turn the drivers, CUDA, the operating system, and the virtual environment, before finally tracing the error to the Keras version. My other two machines run Keras 2.2.4, while the machine that hit the problem had 2.3.1. After downgrading to 2.2.4, the error no longer occurs.
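To avoid losing another day to the same mismatch, a small version check can run before training starts. The following is only a sketch: the helper names (`version_tuple`, `keras_status`, `installed_keras_version`) are made up for illustration, and it assumes Python 3.8+ for `importlib.metadata` (on the Python 3.6 shown in the traceback, `keras.__version__` gives the same information). It only knows about the two versions I actually tried; everything else is reported as untested.

```python
from importlib import metadata

KNOWN_GOOD = "2.2.4"  # Keras version on the two machines that train fine
KNOWN_BAD = "2.3.1"   # Keras version on the machine that raised NotFoundError


def version_tuple(version):
    """Turn '2.3.1' into (2, 3, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def keras_status(installed):
    """Classify a version string against the two versions from this post."""
    current = version_tuple(installed)
    if current == version_tuple(KNOWN_GOOD):
        return "known-good"
    if current == version_tuple(KNOWN_BAD):
        return "known-bad"
    return "untested"


def installed_keras_version():
    """Return the installed Keras version string, or None if not installed."""
    try:
        return metadata.version("keras")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_keras_version()
    if found is None:
        print("keras is not installed in this environment")
    else:
        print("keras", found, "->", keras_status(found))
        if keras_status(found) == "known-bad":
            # The fix that worked for me: pip install keras==2.2.4
            print("consider downgrading to", KNOWN_GOOD)
```

Running this in each virtual environment shows quickly whether the machines have drifted apart. Pinning `keras==2.2.4` in the requirements file would keep them in sync in the first place.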

As far as I can tell, I am the first person to run into this problem; at least, I could not find any related posts online beforehand.
