0x00 Objective
Train GoogLeNet on the Food-101 dataset and compare the results with the published baselines.
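0x01 Useful information
(1) The loss is very low, but accuracy stays around 50%. What usually causes this?
If the training loss keeps decreasing while the validation accuracy does not improve, the model is overfitting.
(2) How high is the current FoodAI accuracy, and how much data was used?
Around 60 to 70%. Roughly 50 images per class are held out for testing; the rest are used for training.
(3) The Food-101 baseline
GoogLeNet v1 on Food-101 should currently reach about 80/95 (top-1/top-5); see the paper "Wide-Slice Residual Networks for Food Recognition".
(4) The caffe folder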
0x02 Guide
How to run the experiments using AlexNet/GoogLeNet on Food-101?
- clone this repo from scratch:
git clone https://github.com/deercoder/DeepFood.git
- configure the environment according to the official tutorial. Minor changes have been applied in this repo.
- download the pre-trained models (AlexNet, GoogLeNet) into the ./models folder
- download the ImageNet mean file into the data/ilsvrc12/ folder with the get_ilsvrc_aux.sh script
- run the model from Caffe's root directory with ./models/finetune-food101-alexNet/train_full.sh or ./models/finetune-food101-googlenet/train_full.sh, then check the results!
See: https://github.com/deercoder/DeepFood
0x03 DGX docker
The lab server went down, so I switched to the DGX.
Tutorial:
https://philipskokoh.github.io/blog/nvidia-docker-for-your-GPU-application-development
Create a container:
nvidia-docker run -it --name <username>-tensorflow -v /mnt/StorageArray2_DGX1/<username>/codes:/opt/codes compute.nvidia.com/nvidia/tensorflow bash
In my case:
nvidia-docker run -it --name huwang-ncaffe -v /mnt/StorageArray2_DGX1/huwang:/home/huwang compute.nvidia.com/nvidia/caffe bash
nvidia-docker run -it --name huwang-bcaffe -v /mnt/StorageArray2_DGX1:/home zh-caffe bash
Start and attach to a container:
## run bash shell in tensorflow container
## the container will not be removed after it exits
$ nvidia-docker run -it --name my-tensorflow compute.nvidia.com/nvidia/tensorflow bash
## start and connect back to previously created container my-tensorflow
$ nvidia-docker start my-tensorflow
$ nvidia-docker attach my-tensorflow
## delete the container
$ nvidia-docker rm my-tensorflow
NVIDIA maintains its own Caffe fork, which differs somewhat from BVLC Caffe and occasionally causes problems.
To use BVLC Caffe you have to set up the environment yourself.
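0x04 Fine-tuning procedure
Step1:
Download the pre-trained model (GoogLeNet); the download location is given in Caffe's ./models folder.
Step2:
Download the ImageNet mean file using get_ilsvrc_aux.sh under data/ilsvrc12/.
Another view is that the mean of the new dataset should be used instead. My own take: if the new dataset is small, keep the old (ImageNet) mean; if it is large, the overall mean shifts toward the new data, so the new dataset's mean should be used.
Step3:
Convert the images to lmdb.
Step4:
Prepare train_val.prototxt (update the mean and source fields), solver.prototxt, and train.sh, then start training.
Compute the Food-101 mean manually: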
#!/usr/bin/env sh
# Compute the mean image from the Food-101 training lmdb
# By Bill
DBNAME=.
ListPath=.
TOOL=/usr/bin/caffe_compute_image_mean
$TOOL $DBNAME/train_lmdb \
$ListPath/food101_mean.binaryproto
echo "Done."
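Step3 does not show how the lmdb was built. A minimal sketch, assuming the standard Food-101 layout (meta/classes.txt with one class name per line, meta/train.txt with entries such as apple_pie/1005649) and Caffe's convert_imageset tool. The helper name make_caffe_list, the output file names, and the 256x256 resize are illustrative choices, not taken from the original experiments:

```shell
# Turn a Food-101 split file into the "relative/path.jpg label" list that
# Caffe's convert_imageset expects. Labels are 0-based line indices of
# classes.txt. (make_caffe_list is a hypothetical helper name.)
make_caffe_list() {
    classes_file=$1   # e.g. food-101/meta/classes.txt
    split_file=$2     # e.g. food-101/meta/train.txt
    awk -F/ '
        NR == FNR { label[$1] = FNR - 1; next }   # first file: class -> index
        { print $0 ".jpg", label[$1] }            # second file: path + label
    ' "$classes_file" "$split_file"
}

# Assumed usage (paths and resize values are placeholders):
# make_caffe_list food-101/meta/classes.txt food-101/meta/train.txt > train_list.txt
# convert_imageset --resize_height=256 --resize_width=256 --shuffle \
#     food-101/images/ train_list.txt train_lmdb
```

The resulting train_lmdb is the directory that the mean script above reads from.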
0x05 Hyperparameter tuning
Ran overnight (10 pm to 1 pm):
top1 | top5 |
---|---|
0.3 | 0.5 |
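Suggestions:
(1) My batch size is currently 32. Even on a single GPU, training can be sped up by using 64 or 128.
(2) When fine-tuning the whole network, the initial lr can be larger; anything between 0.005 and 0.01 is fine (I am currently using 0.001).
(3) With a larger lr, gamma does not need to be this large; the smaller gamma is, the faster the lr decays.
(4) max_iter and stepsize should be derived from your own data by working out the number of epochs. Start generously, e.g. set max_iter to 150,000 or 200,000.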
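Suggestions (3) and (4) come down to simple arithmetic. A minimal sketch, assuming the standard Food-101 split of 75,750 training images and Caffe's "step" learning-rate policy (lr = base_lr * gamma^floor(iter / stepsize)). The notes do not state which lr_policy was used, and the stepsize of 20000 below is only a placeholder:

```shell
# Iterations needed for one pass over the training set (rounded up).
iters_per_epoch() {
    num_images=$1
    batch_size=$2
    echo $(( (num_images + batch_size - 1) / batch_size ))
}

# Learning rate at a given iteration under the "step" policy.
# Arguments: base_lr gamma stepsize iteration
step_lr() {
    awk -v base="$1" -v gamma="$2" -v step="$3" -v it="$4" \
        'BEGIN { printf "%g\n", base * gamma ^ int(it / step) }'
}

iters_per_epoch 75750 64          # 1184 iterations per epoch
step_lr 0.005 0.2 20000 40000     # 0.0002
step_lr 0.001 0.96 20000 40000    # 0.0009216
```

Under these assumptions, max_iter of 100000 at batch size 64 is roughly 84 epochs, and after two decay steps gamma 0.2 has cut the rate by a factor of 25 while gamma 0.96 has barely moved it.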
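0x06 Problems encountered
1. A script written on Windows and transferred to Linux via Xftp always fails with "not found"
It had been suggested that this was an end-of-file issue, but trying that fix did not help.
For a service script written on Windows (one without the .sh suffix) and used on Linux, open the file in vim and run :set ff=unix to convert it to the Unix line-ending format that Linux recognizes.
See: http://blog.csdn.net/faryang/article/details/52348029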
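The same conversion can be done without opening vim. A minimal sketch, assuming the "not found" error comes from Windows CRLF line endings (with a CRLF shebang line, the system looks for an interpreter whose name ends in a carriage return). The helper names has_crlf and to_unix are hypothetical:

```shell
# Succeeds if the file contains at least one carriage return (CRLF endings).
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Rewrite the file in place with Unix (LF) line endings.
to_unix() {
    tmp=$(mktemp) &&
    tr -d '\r' < "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&      # write back into the same file, keeping its permissions
    rm -f "$tmp"
}

# Assumed usage:
# has_crlf train.sh && to_unix train.sh
```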
2. Syntax error: redirection unexpected
Running this script kept failing:
##!/usr/bin/env sh
#!/bin/bash
#By Bill
#GPU_ID=3
NET=finetune_googlenet_food101
SOLVER=finetune_googlenet_foodai101_solver.prototxt
TOOLS=caffe
WEIGHTS=./bvlc_googlenet.caffemodel
#set -x
#set -e
#LOG=logs/${NET}.txt.`date +'%Y-%m-%d_%H-%M-%S'`
LOG=logs/${NET}.log
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"
#./build/tools/caffe train --solver=./examples/food_tst/${SOLVER} --weights=./examples/food_tst/bvlc_googlenet.caffemodel --gpu=5
$TOOLS train \
--solver=${SOLVER} \
--gpu=4,5
#--snapshot=examples/food_tst/googlenet_food100_aug_iter_5000.solverstate
--weights=$WEIGHTS \
#| tee $LOG
The cause: this script has to be run with bash, but I ran it with sh.
See: https://stackoverflow.com/questions/2462317/bash-syntax-error-redirection-unexpected
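The failing line is exec &> >(tee -a "$LOG"): both &> and the process substitution >( ... ) are bash features that a plain POSIX sh such as dash rejects. A minimal sketch of a check plus the explicit invocation; shell_supports_procsub is a hypothetical helper name, and train.sh stands in for the training script:

```shell
# Succeeds if the given shell accepts bash-style process substitution.
shell_supports_procsub() {
    "$1" -c 'cat <(echo ok)' >/dev/null 2>&1
}

shell_supports_procsub bash && echo "bash: process substitution works"

# Run the training script with bash explicitly rather than with sh:
# bash train.sh
```

Note that the #!/bin/bash line only takes effect when the script is executed directly (./train.sh); calling sh train.sh ignores it.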
3. Data layer prefetch queue empty
This message appears when raw images are fed in directly, because the data I/O is too slow; I switched to lmdb instead.
https://github.com/BVLC/caffe/issues/3177
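4. Is it better to drive Caffe from Python rather than from the caffe command line? What are the benefits, plotting directly?
Just redirect the caffe command-line output to a log and plot from that.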
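A minimal sketch of pulling the loss curve out of a training log, assuming solver lines of the form "... Iteration 100, loss = 1.5". The exact log format differs between Caffe versions and between BVLC and NVIDIA Caffe, so the pattern may need adjusting; extract_loss is a hypothetical helper name:

```shell
# Read a Caffe training log on stdin and print "iteration loss" pairs,
# one per line, ready for gnuplot or a spreadsheet.
extract_loss() {
    sed -n 's/.*Iteration \([0-9][0-9]*\).*loss = \([0-9.eE+-][0-9.eE+-]*\).*/\1 \2/p'
}

# Assumed usage, with the log name from the training script:
# extract_loss < logs/finetune_googlenet_food101.log > loss.txt
```

BVLC Caffe also ships tools/extra/parse_log.py, which may be a better fit when test accuracy is needed alongside the loss.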
5. The caffemodel was not found; does that mean the 66/86 results were actually trained from scratch?
wh_train: line 24: --weights=./bvlc_googlenet.caffemodel: No such file or directory
The shell script is as follows:
##!/usr/bin/env sh
#!/bin/bash
#By Bill
#GPU_ID=3
NET=finetune_googlenet_food101
SOLVER=/home/huwang/bcaffe/dataset/food-101/finetune_googlenet_foodai101_solver.prototxt
TOOLS=caffe
WEIGHTS=/home/huwang/bcaffe/dataset/food-101/bvlc_googlenet.caffemodel
#set -x
#set -e
LOG=logs/${NET}_`date +'%Y-%m-%d_%H-%M-%S'`.log
#LOG=logs/${NET}.log
exec &> >(tee -a "$LOG")
echo Logging output to "$LOG"
#./build/tools/caffe train --solver=./examples/food_tst/${SOLVER} #--weights=./examples/food_tst/bvlc_googlenet.caffemodel --gpu=5
$TOOLS train \
--solver=${SOLVER} \
--gpu=4,5
#--snapshot=examples/food_tst/googlenet_food100_aug_iter_5000.solverstate
--weights=$WEIGHTS \
#| tee $LOG
The cause is the --gpu=4,5 line: it is missing a trailing \ and is followed by a commented-out line. The caffe command therefore ends at --gpu=4,5, and the shell tries to run --weights=$WEIGHTS as a separate command.
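A minimal sketch of one way to avoid both traps: build the argument list step by step, so optional flags can be commented out without breaking a backslash continuation. A trailing \ on the --gpu=4,5 line alone would not be enough, because the commented-out --snapshot line that follows would still end the command. The function name run_finetune and the TOOLS override are illustrative:

```shell
# Build the caffe arguments one line at a time; comments between the
# lines are harmless because no backslash continuation is involved.
run_finetune() {
    solver=$1
    weights=$2
    set -- train
    set -- "$@" --solver="$solver"
    set -- "$@" --gpu=4,5
    # set -- "$@" --snapshot=path/to/some.solverstate   # placeholder, disabled
    set -- "$@" --weights="$weights"
    "${TOOLS:-caffe}" "$@"
}

# Assumed usage:
# run_finetune finetune_googlenet_foodai101_solver.prototxt ./bvlc_googlenet.caffemodel
```

Setting TOOLS=echo prints the command instead of running it, which is a cheap way to confirm that --weights really is part of the caffe invocation.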
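6. Batch size vs. number of GPUs
The error was: Check failed: batch * solver_count == total Batch size must be divisible by the number of solvers (GPUs). The cause was that I used 3 GPUs, so I switched to 2; see batchsize_divide_gpu.log.
NVIDIA Caffe handles this slightly differently from stock BVLC Caffe: it works with batch * solver_count (GPUs), so the batch size has to be divisible by the number of GPUs.
My guess is that in BVLC Caffe the batch size in the prototxt is the per-GPU batch size, whereas in NVIDIA Caffe it is split evenly across the GPUs.
See: https://github.com/NVIDIA/DIGITS/issues/413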
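A minimal pre-flight check for this constraint, assuming (as guessed above) that NVIDIA Caffe splits the prototxt batch size evenly across the GPUs; check_batch_split is a hypothetical helper name:

```shell
# Succeeds if batch_size can be split evenly across num_gpus and prints
# the resulting per-GPU share; fails with a message otherwise.
check_batch_split() {
    batch_size=$1
    num_gpus=$2
    if [ $(( batch_size % num_gpus )) -ne 0 ]; then
        echo "batch size $batch_size is not divisible by $num_gpus GPUs" >&2
        return 1
    fi
    echo $(( batch_size / num_gpus ))
}

check_batch_split 64 2                                   # prints 32
check_batch_split 64 3 || echo "pick another GPU count"  # 64 % 3 != 0
```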
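7. Are snapshots overwritten when training is run several times?
The answer I was given was no, a new run continues from the existing snapshot. In practice, I tried it and found that the files are overwritten.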
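A minimal sketch for keeping snapshots from different runs apart, assuming the solver file contains a snapshot_prefix line (a standard Caffe solver field). The helper names and paths are hypothetical, and the timestamp format mirrors the one used for the log file above:

```shell
# Print a snapshot prefix that is unique per run,
# e.g. snapshots/googlenet_food101_2017-08-19_10-45-00
unique_snapshot_prefix() {
    printf '%s_%s\n' "$1" "$(date +'%Y-%m-%d_%H-%M-%S')"
}

# Print the solver file with its snapshot_prefix line replaced.
# The prefix must not contain "|" or "&" (they are special to sed here).
with_snapshot_prefix() {
    solver=$1
    prefix=$2
    sed "s|^snapshot_prefix:.*|snapshot_prefix: \"$prefix\"|" "$solver"
}

# Assumed usage (file names are placeholders):
# prefix=$(unique_snapshot_prefix snapshots/googlenet_food101)
# with_snapshot_prefix solver.prototxt "$prefix" > solver_run.prototxt
```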
0x07 Experimental results
Corresponding files:
https://github.com/billhhh/caffe-LARC/tree/master/6%5Bfine%20tune%5Dfood-101
No. | description | top1 | top5 |
---|---|---|---|
1 | mean:imgnet batch_size:32/32 lr:0.001 gamma:0.96 max_iter:100000 scratch server gpu:1 20170816_9pm forgot to add the test mean | 0.268406 | 0.544563 |
2 | mean:input batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000 scratch dgx ncaffe gpu:2 20170817_6pm reading images directly was too slow, shelved | NA | NA |
3 | mean:imgnet batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000 scratch server gpu:3 20170817_9pm recompiled caffe | 0.688359 | 0.894203 |
4 | mean:input batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000 scratch dgx gpu:2 20170817_9pm relative paths, experiment OK | 0.668656 | 0.886813 |
5 | mean:input batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000 scratch dgx gpu:2 20170818_8am same as 4, switched to absolute paths | 0.662859 | 0.881969 |
6 | mean:imgnet batch_size:64/64 lr:0.001 gamma:0.96 max_iter:100000 scratch dgx gpu:2 20170818_11am | 0.641859 | 0.868469 |
7 | mean:imgnet batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000 pre-trained dgx gpu:2 20170818_8.21pm | 0.801063 | 0.945859 |
8 | mean:imgnet batch_size:64/64 lr:0.008 gamma:0.2 max_iter:300000 pre-trained dgx gpu:2 20170818_12pm | 0.798828 | 0.945312 |
9 | mean:food-101 batch_size:64/64 lr:0.005 gamma:0.2 max_iter:100000 pre-trained dgx gpu:2 20170819_10.45am | 0.802125 | 0.946219 |
10 | mean:food-101 batch_size:64/64 lr:0.005 gamma:0.1 max_iter:100000 pre-trained dgx gpu:2 20170819_4pm | 0.802047 | 0.948672 |
11 | mean:food-101 batch_size:128/128 lr:0.005 gamma:0.2 max_iter:100000 pre-trained dgx gpu:2 20170819_8pm | 0.802664 | 0.947563 |
12 | pre-trained googLeNet (ran test directly) dgx gpu:2 20170820_10am | | |
14 | mean:food-101 batch_size:64/100 lr:0.005 gamma:0.2 max_iter:100000 pre-trained dgx gpu:2 20170820_11am | 0.801186 | 0.94668 |
Baselines
(1) from "Deep Learning Based Food Recognition"
(2) from "Wide-Slice Residual Networks for Food Recognition"
Appendix:
DGX Few things to note
Important 1: DO NOT run apt-get upgrade or update or install any packages on the DGX1 host. You should install packages on your own containers only. If you need to install anything on the host system, please kindly contact System Administrator for assistance.
Important 2: You MUST name your containers with your username, failure to comply will result in the removal of containers without prior notice
Use docker container only to run your code
You have to use docker container to run your GPU codes. NVIDIA provides nvidia-docker, a wrapper around docker-cli to run GPU application. You have to use nvidia-docker to run applications which require GPU, otherwise your application will not leverage NVIDIA GPUs.
Use docker image as backup
Please back up your docker containers by using docker images and store them in the Storage Array. In the event of docker container corruption, the backup image can be used to recover and restore the docker containers.
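A minimal sketch of such a backup, assuming the standard docker commit and docker save subcommands and a backup directory on the storage array. The function name, the "-backup" image tag, and the backups/ path are hypothetical:

```shell
# Snapshot a container as an image, then write the image to a tar file.
backup_container() {
    container=$1                     # e.g. <username>-tensorflow
    dest_dir=$2                      # e.g. a folder on the storage array
    image="${container}-backup"
    "${DOCKER:-docker}" commit "$container" "$image" &&
    "${DOCKER:-docker}" save -o "$dest_dir/$image.tar" "$image"
}

# Assumed usage:
# backup_container <username>-tensorflow /mnt/StorageArray2_DGX1/<username>/backups
# Restore later with: docker load -i <file>.tar
```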
GPU Codes (Recommended)
To maximize the computing capabilities of the DGX-1 and also to ensure there is sufficient CPU resources for everyone to perform their experiments, you should execute the codes using GPU mode instead of CPU mode.
For the TensorFlow and CUDA frameworks, refer to the guides below for assigning single or multiple GPUs in your code and for avoiding over-utilising the resources.
Tensorflow - https://www.tensorflow.org/how_tos/using_gpu/
cuda flag, CUDA_VISIBLE_DEVICES - http://acceleware.com/blog/cudavisibledevices-masking-gpus
Storage Array
The DGX-1 hard disk has no redundancy and is optimized only for speed. We suggest you put your important data in the storage array (/mnt/StorageArray2_DGX1/), which has redundancy and more space. You can copy your data temporarily to the DGX-1 hard disk to speed up your experiments, but always keep a backup in the storage array. Use the -v option of nvidia-docker run to mount a host folder into your container, for example:
nvidia-docker run -it --name <username>-tensorflow -v /mnt/StorageArray2_DGX1/<username>/codes:/opt/codes compute.nvidia.com/nvidia/tensorflow bash
That command maps /mnt/StorageArray2_DGX1/<username>/codes to /opt/codes in your container. compute.nvidia.com/nvidia/tensorflow is the official TensorFlow container from NVIDIA.
You can read my blog about basic nvidia-docker usage and commonly used nvidia-docker commands, such as docker ps -s to check containers and nvidia-smi to check GPU usage:
https://philipskokoh.github.io/blog/nvidia-docker-for-your-GPU-application-development
Please also read NVIDIA's introduction to nvidia-docker:
https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
Food100
Hi, I have put the SGFood100 data in the /data5/FOODAI/Food100 folder. There are two folders: Images, which holds the source images, and ImageSets, which holds my original image sets and label mapping as well as the corresponding lmdb data (you had better regenerate it yourself). The absolute path is a bit different now, but only a small modification is required.
I also put the results based on VGG16, AlexNet, Res152 and InceptionV1 in results.md; you can refer to these results, which I obtained with the provided image sets.