TrajectoryCNN Debugging Notes

Published on 2022-10-25


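Problem: Running the command sh TrajectoryCNN_long_term_train.sh fails with: TrajectoryCNN_long_term_train.sh: 3: cd: can't cd to ../..

Solution: This happened because the bash script TrajectoryCNN_long_term_train.sh came from an external source and did not have the right permissions. Create a new file with touch, for example touch 1.sh, then run chmod 755 1.sh. Copy the contents of the original bash script into 1.sh, delete the original file, and rename 1.sh to TrajectoryCNN_long_term_train.sh.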
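
The recreate-and-chmod steps above can be sketched in Python as follows. This is only an illustration: the helper name recreate_script is hypothetical, and stripping Windows CRLF line endings is my own assumption (a stray carriage return is another common cause of "can't cd to" messages), not something these notes confirm.

```python
import os
import tempfile
from pathlib import Path


def recreate_script(script_path):
    """Rewrite a shell script as a fresh file with mode 755 (hypothetical helper)."""
    script = Path(script_path)
    content = script.read_bytes()
    # Assumption, not from the notes: normalize CRLF to LF as a precaution.
    content = content.replace(b"\r\n", b"\n")
    # Create a brand-new file next to the original (the "touch 1.sh" step).
    fd, tmp_name = tempfile.mkstemp(dir=str(script.parent), suffix=".sh")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(content)
    os.chmod(tmp_name, 0o755)  # the "chmod 755 1.sh" step
    # Replace the original with the new file (delete + rename in one call).
    os.replace(tmp_name, str(script))
    return script
```

Calling recreate_script("TrajectoryCNN_long_term_train.sh") from the script's directory should leave a file with the same commands, fresh ownership, and execute permission.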

Problem: tail: cannot open 'logs/h36m/train_h36m.log' for reading: No such file or directory

tail: ./1.sh: line 9: logs/h36m/train_h36m.log: No such file or directory

no files remaining

Solution: Create this log file by hand (together with its logs/h36m directory if that is missing too).
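
A minimal sketch of creating the log file ahead of time, assuming the relative path from the error message and that the code runs from the directory the training script runs in. The function name ensure_log_file is made up for this example.

```python
from pathlib import Path


def ensure_log_file(log_path="logs/h36m/train_h36m.log"):
    """Create the log file and any missing parent directories."""
    log_file = Path(log_path)
    # Create logs/h36m if it does not exist yet.
    log_file.parent.mkdir(parents=True, exist_ok=True)
    # Create an empty file; an existing log is left untouched.
    log_file.touch(exist_ok=True)
    return log_file
```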

Problem: File "train_TrajectoryCNN_h36m.py", line 70

print'!!! TrajectoryCNN:', num_hidden

SyntaxError: invalid syntax

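Solution: Presumably the original experiments ran under Python 2, whereas I am using Python 3, whose syntax does not allow print without parentheses. Many print statements in this project's .py files lack parentheses, so each time this error appears, search the offending file for every print and add the parentheses.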
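
To find every offending line in one pass rather than one error at a time, a rough scanner like the one below may help. It is a heuristic sketch: the regular expression and the name find_py2_prints are my own, and it can miss or misreport unusual lines (for example a variable named print).

```python
import re

# Matches "print" used as a statement: the keyword followed by something
# other than an opening parenthesis.
PY2_PRINT = re.compile(r"^\s*print\b(?!\s*\()\s*\S")


def find_py2_prints(source):
    """Return (line_number, line) pairs that look like Python 2 print statements."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if PY2_PRINT.match(line)
    ]


# The line from the traceback above, rewritten for Python 3, would be:
#     print('!!! TrajectoryCNN:', num_hidden)
```

Reading a file with open(path).read() and passing the text to find_py2_prints lists the lines that still need parentheses.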

Problem: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Solution: Do not use TensorFlow 2; switch to TensorFlow 1.

Problem: File "/root/TrajectoryCNN/nets/TrajectoryCNN.py", line 2, in <module>

from layers.TrajBlock import TrajBlock as TB

ImportError: bad magic number in 'layers': b'\x03\xf3\r\n'

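Solution: The magic number b'\x03\xf3\r\n' is the one Python 2.7 writes, so these are stale bytecode files compiled under Python 2 that Python 3 refuses to load. cd into /root/TrajectoryCNN/nets; ls -a shows several .pyc files in that directory. Delete them all with rm *.pyc.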

Problem: File "train_TrajectoryCNN_h36m.py", line 14, in <module>

FLAGS = tf.app.flags.FLAGS

AttributeError: module 'tensorflow' has no attribute 'app'

Solution: Change the code to:

import tensorflow.compat.v1 as tf

FLAGS = tf.app.flags.FLAGS

Problem: File "/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2715, in placeholder

raise RuntimeError("tf.placeholder() is not compatible with "

RuntimeError: tf.placeholder() is not compatible with eager execution.

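Solution: The fix suggested online is to add tf.compat.v1.disable_eager_execution() before this code. However, because my error was raised inside the package, I did not find a solution for the time being. Since several errors during testing turned out to be tied to the TensorFlow version, I suspect TrajectoryCNN was written for TensorFlow 1 and is partly incompatible with TensorFlow 2. Switching TensorFlow to version 1 resolved it.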

Problem: Installing TensorFlow fails with: protobuf requires Python '>=3.7' but the running Python is 3.6.5

Solution: Upgrade pip, then reinstall TensorFlow.

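Problem: Running sh scripts/h36m/TrajectoryCNN_short_term_train.sh fails with: cannot create ./logs/h36m/train_h36m.log: Directory nonexistent

Solution: This happened because I took a shortcut and merged cd scripts/h36m and sh TrajectoryCNN_short_term_train.sh into a single command. The script relies on relative paths, so it has to be started from inside scripts/h36m. Running the two commands separately, as recommended on GitHub, fixes it.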
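
If the script has to be launched from elsewhere (for example from a wrapper), one option is to set the working directory explicitly. The sketch below assumes sh is available and that the directory and script names match the ones above. The wrapper function itself is hypothetical.

```python
import subprocess


def run_training_script(script_dir="scripts/h36m",
                        script_name="TrajectoryCNN_short_term_train.sh"):
    """Run the script as if we had first done 'cd script_dir'."""
    # cwd= plays the role of the separate cd command, so the relative
    # paths inside the script resolve the same way.
    return subprocess.run(["sh", script_name], cwd=script_dir, check=True)
```

With check=True a non-zero exit status raises subprocess.CalledProcessError, so a failed launch does not go unnoticed.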

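Problem: ImportError: bad magic number in 'layers': b'\x03\xf3\r\n'

Solution: I had already deleted the .pyc files in that one directory, but the error persisted. It is better to delete every .pyc file in the whole project, using the command find . -name \*.pyc -delete.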
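
For environments without find, the same cleanup can be sketched with pathlib. The function name delete_pyc_files is illustrative, and the project root is assumed to be the current directory unless given.

```python
from pathlib import Path


def delete_pyc_files(project_root="."):
    """Recursively delete all .pyc files under project_root and return their paths."""
    removed = []
    # Collect the matches first so we do not delete while still walking the tree.
    for pyc in list(Path(project_root).rglob("*.pyc")):
        if pyc.is_file():
            pyc.unlink()
            removed.append(pyc)
    return removed
```

Python recompiles the bytecode on the next import, so removing these files does not lose anything.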

Problem: Loaded runtime CuDNN library: 7401 (compatibility version 7400) but source was compiled with 7004 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.

Cause: We had CUDA 9 with cuDNN 7.4.1 installed, and TensorFlow 1.5.0. With CUDA 9, however, TensorFlow requires cuDNN 7.0.x to match.

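Solution: I asked a senior labmate to help install cuDNN, and cuDNN 7.4.1 was installed under CUDA 10. I then uninstalled TensorFlow and installed tensorflow-gpu==1.14.0.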
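
The numbers in the error message can be decoded with a little arithmetic: for cuDNN 7 the version code is major*1000 + minor*100 + patch. The sketch below shows that decoding. The equality check on compatibility versions is only my reading of the message, not TensorFlow's documented rule, and all function names are made up for illustration.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7-style version code such as 7401 into (major, minor, patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(code):
    """Drop the patch level, as in 'compatibility version 7400' in the message."""
    return code - code % 100


def looks_compatible(loaded, compiled):
    # Assumption: this TensorFlow build wants equal compatibility versions.
    return compatibility_version(loaded) == compatibility_version(compiled)
```

Here decode_cudnn_version(7401) gives (7, 4, 1) and decode_cudnn_version(7004) gives (7, 0, 4), which is the mismatch described above.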

Problem: WARNING:tensorflow:Entity > could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting >: AttributeError: module 'gast' has no attribute 'Str'

Solution: The installed gast version was too new. Uninstall it and install gast==0.2.2.

Problem: Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory

Solution: After logging back in to the server, the CUDA environment variables have to be set again.
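
A quick way to see whether the current session can find the library is to look through LD_LIBRARY_PATH. This is only a partial check, since the dynamic loader also consults system default locations. The helper name find_library is hypothetical.

```python
import os


def find_library(lib_name="libcudart.so.10.0", search_path=None):
    """Return the first match for lib_name on an LD_LIBRARY_PATH-style path, else None."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(os.pathsep):
        candidate = os.path.join(directory, lib_name)
        # Skip empty entries left by leading or trailing separators.
        if directory and os.path.isfile(candidate):
            return candidate
    return None
```

If this returns None after every login, putting the export lines in a shell startup file such as ~/.bashrc is one common way to make them persist.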

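Problem: The SSH connection dropped because the computer went to sleep or the network went down. How can the earlier training be resumed?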

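Solution: Because the script launches training with nohup python -u train_TrajectoryCNN_h36m.py, the process stays in the background and keeps running even after the connection is lost. Reconnect to the server and run ps aux to list all processes of all users. The background process shows up in that list.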

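The process has been running the whole time, so just leave it alone. To check the current progress, use sz train_h36m.log to download the printed log to your local machine and read it there.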
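
As an alternative to downloading the whole log, the last few lines can be read on the server. This is a small sketch, assuming the log is a plain text file. The function name tail is my own choice and simply mimics the Unix tool.

```python
from collections import deque


def tail(path, n=10):
    """Return the last n lines of a text file without loading all lines into memory."""
    with open(path, "r", errors="replace") as log:
        # A deque with maxlen keeps only the most recent n lines.
        return [line.rstrip("\n") for line in deque(log, maxlen=n)]
```

For example, tail("train_h36m.log", 5) returns the five most recent log lines as a list of strings.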