TrajectoryCNN Debugging Notes

Problem: Running sh TrajectoryCNN_long_term_train.sh fails with: TrajectoryCNN_long_term_train.sh: 3: cd: can't cd to ../..

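Solution: My guess was that the bash script TrajectoryCNN_long_term_train.sh came from an external source and lacked the right permissions. Create a new file with touch, for example touch 1.sh, then run chmod 755 1.sh. Copy the contents of the original script into 1.sh, delete the original file, and rename 1.sh to TrajectoryCNN_long_term_train.sh. Note that sh only needs read permission and the script had already reached line 3, so a more likely cause is Windows-style (CRLF) line endings, which pasting the contents into a fresh file also removes.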
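The re-create-and-chmod steps above can also be scripted. This is a minimal sketch, not part of the original project: the helper name make_runnable is hypothetical, and the CRLF normalization rests on the assumption that line endings (rather than permissions alone) were the problem.

```python
import os
from pathlib import Path


def make_runnable(script_path):
    """Rewrite a shell script in place with LF line endings and mode 755.

    Hypothetical helper. Normalizing CRLF is an assumption about what
    copying the contents into a fresh file fixed; the chmod mirrors the
    manual `chmod 755` step.
    """
    path = Path(script_path)
    data = path.read_bytes()
    # Convert Windows (CRLF) line endings to Unix (LF).
    path.write_bytes(data.replace(b"\r\n", b"\n"))
    # Equivalent of `chmod 755 <script>`.
    os.chmod(path, 0o755)
    return path
```

For example, make_runnable("TrajectoryCNN_long_term_train.sh") would fix the script in place without the copy, delete, and rename steps.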

Problem: tail: cannot open 'logs/h36m/train_h36m.log' for reading: No such file or directory

tail: ./1.sh: line 9: logs/h36m/train_h36m.log: No such file or directory

no files remaining

Solution: Create this log file manually (creating the logs/h36m directory first if it does not exist).
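Creating the log file by hand can be sketched in a few lines of Python. The helper name ensure_log_file is hypothetical; the path in the usage note is the one from the error message and is resolved relative to the current working directory.

```python
from pathlib import Path


def ensure_log_file(log_path):
    """Create the log file and any missing parent directories."""
    path = Path(log_path)
    # Create logs/h36m (or whatever the parent is) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create an empty file; an existing file keeps its content.
    path.touch(exist_ok=True)
    return path
```

Calling ensure_log_file("logs/h36m/train_h36m.log") from the directory the training script runs in should be enough for tail to find the file.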

Problem: File "train_TrajectoryCNN_h36m.py", line 70

print'!!! TrajectoryCNN:', num_hidden

SyntaxError: invalid syntax

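Solution: The original experiments were presumably run under Python 2, whereas I am using Python 3, whose syntax does not allow print without parentheses. Many .py files in this project use print this way, so each time this error appears, search the offending file for every print and add the parentheses.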
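Adding the parentheses by hand gets tedious, so here is a rough sketch of automating the simple cases. It is an illustration only: the names add_print_parens and convert_source are hypothetical, and the regular expression handles just single-line print statements. Forms such as print >>f, x, a trailing comma, or a print spread over several lines are not converted correctly; the 2to3 tool, in Python versions that still ship it, covers those.

```python
import re

# A line that starts with `print` followed by anything other than "(".
_PRINT_STMT = re.compile(r"^(\s*)print\b(?!\s*\()[ \t]*(.*?)\s*$")


def add_print_parens(line):
    """Wrap the arguments of a simple Python 2 print statement in parentheses."""
    match = _PRINT_STMT.match(line)
    if match is None:
        return line  # not a bare print statement; leave untouched
    indent, args = match.groups()
    return f"{indent}print({args})"


def convert_source(text):
    """Apply add_print_parens to every line of a source string."""
    return "\n".join(add_print_parens(line) for line in text.split("\n"))
```

Review the result before running it, since the conversion is purely textual and does not parse the code.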

Problem: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Solution: Do not use TensorFlow 2; switch to TensorFlow 1.

Problem: File "/root/TrajectoryCNN/nets/TrajectoryCNN.py", line 2, in <module>

from layers.TrajBlock import TrajBlock as TB

ImportError: bad magic number in 'layers': b'\x03\xf3\r\n'

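Solution: The magic number 03 f3 0d 0a belongs to Python 2.7 bytecode, which Python 3 cannot import. cd into /root/TrajectoryCNN/nets; ls -a shows several .pyc files in that directory. Delete them all with rm *.pyc.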

Problem: File "train_TrajectoryCNN_h36m.py", line 14, in <module>

FLAGS = tf.app.flags.FLAGS

AttributeError: module 'tensorflow' has no attribute 'app'

Solution: Change the import so that tf refers to the TensorFlow 1 compatibility API:

import tensorflow.compat.v1 as tf

FLAGS = tf.app.flags.FLAGS

Problem: File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2715, in placeholder

raise RuntimeError("tf.placeholder() is not compatible with "

RuntimeError: tf.placeholder() is not compatible with eager execution.

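Solution: The fix suggested online is to call tf.compat.v1.disable_eager_execution() before the offending code. Because my error was raised inside the package itself, I could not find a way to apply it for now. Several errors seen during testing were tied to the TensorFlow version, so TrajectoryCNN was presumably written for TensorFlow 1 and is not fully compatible with TensorFlow 2. Switching to a TensorFlow 1 release resolves it.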

Problem: Installing tensorflow fails with: protobuf requires Python '>=3.7' but the running Python is 3.6.5

Solution: Upgrade pip, then reinstall tensorflow.

Problem: Running sh scripts/h36m/TrajectoryCNN_short_term_train.sh fails with: cannot create ./logs/h36m/train_h36m.log: Directory nonexistent

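Solution: I had taken a shortcut and merged cd scripts/h36m and sh TrajectoryCNN_short_term_train.sh into one command. The script depends on the current working directory, so run the two commands separately, as recommended on the project's GitHub page.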
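If the script needs to be launched from elsewhere (for example from another Python program), the working directory can be set explicitly. A minimal sketch, assuming sh is on the PATH; the helper name run_from_script_dir is hypothetical.

```python
import subprocess
from pathlib import Path


def run_from_script_dir(script_path):
    """Run a shell script with its own directory as the working directory."""
    script = Path(script_path).resolve()
    # cwd=script.parent plays the role of `cd scripts/h36m`;
    # the script is then invoked by its bare file name.
    return subprocess.run(["sh", script.name], cwd=script.parent, check=False)
```

run_from_script_dir("scripts/h36m/TrajectoryCNN_short_term_train.sh") then behaves like the two separate commands, so the relative paths inside the script resolve as its author intended.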

Problem: ImportError: bad magic number in 'layers': b'\x03\xf3\r\n'

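Solution: I had already deleted the .pyc files in that one directory, but the error persisted. It is better to delete every .pyc file in the whole project, using the command find . -name \*.pyc -delete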
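Where find is not available, the same recursive cleanup can be sketched in Python. The helper name delete_pyc_files is hypothetical; it removes every .pyc file under the given root and returns the paths it removed.

```python
from pathlib import Path


def delete_pyc_files(root="."):
    """Delete every .pyc file under root, like `find . -name '*.pyc' -delete`."""
    # Collect the matches first so the tree is not modified while it is walked.
    targets = sorted(Path(root).rglob("*.pyc"))
    for pyc in targets:
        pyc.unlink()
    return targets
```

Run it from the project root, for example delete_pyc_files("/root/TrajectoryCNN"), and Python 3 will regenerate its own bytecode on the next import.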

Problem: Loaded runtime CuDNN library: 7401 (compatibility version 7400) but source was compiled with 7004 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.

Cause: We had CUDA 9, cuDNN 7.4.1, and TensorFlow 1.5.0 installed, but this TensorFlow build requires CUDA 9 to be paired with cuDNN 7.0.x.

Solution: Asked a senior labmate to help install cuDNN; cuDNN 7.4.1 was installed under CUDA 10. Then uninstalled TensorFlow and installed tensorflow-gpu==1.14.0.
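The numbers in the error message are packed version integers: 7401 is cuDNN 7.4.1 and 7004 is cuDNN 7.0.4. The sketch below decodes them, assuming the major*1000 + minor*100 + patch encoding used by cuDNN 7; the function names are hypothetical.

```python
import re


def decode_cudnn_version(code):
    """Split a cuDNN 7-style version integer such as 7401 into (major, minor, patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def parse_cudnn_mismatch(message):
    """Return (loaded, compiled) version tuples from the mismatch message, or None."""
    loaded = re.search(r"Loaded runtime CuDNN library: (\d+)", message)
    built = re.search(r"source was compiled with (\d+)", message)
    if loaded is None or built is None:
        return None
    return (
        decode_cudnn_version(int(loaded.group(1))),
        decode_cudnn_version(int(built.group(1))),
    )
```

Applied to the message above, this makes the mismatch explicit: the installed library is 7.4.1 while the TensorFlow binary was built against 7.0.4.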

Problem: WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting >: AttributeError: module 'gast' has no attribute 'Str'

Solution: The installed gast version is too new. Uninstall it, then install gast==0.2.2.

Problem: Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory

Solution: After logging back in to the server, the CUDA environment variables have to be set again.
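A quick way to tell whether the environment variables were lost is to look for the library on LD_LIBRARY_PATH. This is a rough sketch with a hypothetical helper name. The dynamic loader also searches system locations, so a None result only says the library is not reachable through LD_LIBRARY_PATH.

```python
import os
from pathlib import Path


def find_on_library_path(lib_name, env=None):
    """Return the first LD_LIBRARY_PATH entry containing lib_name, or None."""
    env = os.environ if env is None else env
    for entry in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        # Skip empty entries produced by leading or doubled separators.
        if entry and (Path(entry) / lib_name).exists():
            return Path(entry) / lib_name
    return None
```

If find_on_library_path("libcudart.so.10.0") returns None in a fresh login shell, the export lines were probably not applied; placing them in a shell startup file such as ~/.bashrc is the usual way to make them persist.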

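Problem: If the SSH connection drops because the computer went to sleep or the network went down, how can the previous training run be resumed?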