torch and torchvision have officially recommended version pairings. How do you determine the right pair? A simple trick: count backwards from the latest versions of both, because each new torch release ships with a matching torchvision release.
https://download.pytorch.org/whl/torch_stable.html
The compatibility table:
Rules of thumb: torch and torchvision versions must match each other; the CUDA version torch was built against must match the local CUDA install; the conda cudatoolkit version matters less; and the Python version must be one that the torch release supports.
torch    torchvision  python               cuda
1.5.1    0.6.1        >=3.6                9.2, 10.1, 10.2
1.5.0    0.6.0        >=3.6                9.2, 10.1, 10.2
1.4.0    0.5.0        ==2.7, >=3.5, <=3.8  9.2, 10.0
1.3.1    0.4.2        ==2.7, >=3.5, <=3.7  9.2, 10.0
1.3.0    0.4.1        ==2.7, >=3.5, <=3.7  9.2, 10.0
1.2.0    0.4.0        ==2.7, >=3.5, <=3.7  9.2, 10.0
1.1.0    0.3.0        ==2.7, >=3.5, <=3.7  9.0, 10.0
<1.0.1   0.2.2        ==2.7, >=3.5, <=3.7  9.0, 10.0
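The pairings above can be captured in a small lookup helper. This is just a sketch that transcribes the table; the dictionary would need updating for any release not listed here.

```python
# torch -> torchvision pairings, transcribed from the table above
TORCH_TO_TORCHVISION = {
    "1.5.1": "0.6.1",
    "1.5.0": "0.6.0",
    "1.4.0": "0.5.0",
    "1.3.1": "0.4.2",
    "1.3.0": "0.4.1",
    "1.2.0": "0.4.0",
    "1.1.0": "0.3.0",
}

def compatible_torchvision(torch_version):
    """Return the matching torchvision version, or None if unknown."""
    return TORCH_TO_TORCHVISION.get(torch_version)

print(compatible_torchvision("1.4.0"))  # -> 0.5.0
```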
torch and torchvision wheel download page:
https://download.pytorch.org/whl/torch_stable.html
PyTorch 0.4.0 migration guide (programming differences from earlier versions): https://blog.csdn.net/sunqiande88/article/details/80172391
Setting up an old environment:
conda create -n env_27 python=2.7.13
conda install pytorch=0.3.0 torchvision cuda80 cudatoolkit=8.0 six=1.12 numpy matplotlib pandas
This failed with all kinds of library incompatibilities, so it's better to set up a reasonably new environment and, once it works, leave it alone.
Installing OpenCV in a conda Python 3.6 environment:
conda install -c https://conda.anaconda.org/menpo opencv3  # install opencv3
Another conda virtual environment:
Python 3.5.4 |Continuum Analytics, Inc.| (default, Aug 14 2017, 13:26:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchvision
>>> print(torch.__version__)
0.4.1
>>> print(torchvision.__version__)
0.2.2
>>> print(torch.version.cuda)
8.0.61

$ cat /usr/local/cuda/version.txt
CUDA Version 8.0.44

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44

$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#include "driver_types.h"
ResNet-50 training example: https://www.jianshu.com/p/b935e108ba7d
ResNet-18 regression project, train.py:
# coding:utf-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import time
import os
import numpy as np
import matplotlib.pyplot as plt

batch_size = 64
num_classes = 50


class MyDataset(Dataset):
    def __init__(self, path, transform=None, target_transform=None):
        fh = open(path, 'r')
        imgs = []
        for line in fh:
            line = line.rstrip()
            words = line.split()
            imgs.append((words[0], int(words[1])))
        self.imgs = imgs
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, index):
        fn, label = self.imgs[index]
        img = Image.open(fn).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, label

    def __len__(self):
        return len(self.imgs)


train_transforms = transforms.Compose([
    transforms.Resize(224),
    # transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.ToTensor()  # convert to tensor
    # transforms.Normalize([0.485, 0.456, 0.406],  # normalize
    #                      [0.229, 0.224, 0.225])
])

test_valid_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor()
    # transforms.Normalize([0.485, 0.456, 0.406],
    #                      [0.229, 0.224, 0.225])
])

train_dataset = MyDataset('/home/zmz/model_09/train.txt', transform=train_transforms)
valid_dataset = MyDataset('/home/zmz/model_09/valid.txt', transform=test_valid_transforms)
train_data_size = len(train_dataset)
valid_data_size = len(valid_dataset)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = True

fc_inputs = model.fc.in_features
model.fc = nn.Sequential(
    nn.Linear(fc_inputs, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, num_classes)
    # nn.LogSoftmax(dim=1)
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# criterion = nn.NLLLoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.8)


def train_and_valid(model, loss_function, optimizer, epochs):
    history = []
    best_acc = 0.0
    best_epoch = 0
    for epoch in range(epochs):
        epoch_start = time.time()
        print("Epoch: {}/{}".format(epoch + 1, epochs))
        model.train()
        train_loss = 0.0
        train_acc = 0.0
        valid_loss = 0.0
        valid_acc = 0.0
        for batch_index, data in enumerate(train_loader, 0):
            inputs, target = data
            inputs = inputs.to(device)
            target = target.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * inputs.size(0)
            ret, prediction = torch.max(outputs.data, 1)
            correct_count = prediction.eq(target.data.view_as(prediction))
            acc = torch.mean(correct_count.type(torch.FloatTensor))
            train_acc += acc.item() * inputs.size(0)
        correct = 0
        total = 0
        with torch.no_grad():  # no gradient computation
            model.eval()
            for batch_index, (inputs, labels) in enumerate(valid_loader, 0):
                inputs = inputs.to(device)
                labels = labels.to(device)
                outputs = model(inputs)
                loss = loss_function(outputs, labels)
                valid_loss += loss.item() * inputs.size(0)
                ret, predictions = torch.max(outputs.data, 1)
                correct_counts = predictions.eq(labels.data.view_as(predictions))
                acc = torch.mean(correct_counts.type(torch.FloatTensor))
                valid_acc += acc.item() * inputs.size(0)
        avg_train_loss = train_loss / train_data_size
        avg_train_acc = train_acc / train_data_size
        avg_valid_loss = valid_loss / valid_data_size
        avg_valid_acc = valid_acc / valid_data_size
        history.append([avg_train_loss, avg_valid_loss, avg_train_acc, avg_valid_acc])
        if best_acc < avg_valid_acc:
            best_acc = avg_valid_acc
            best_epoch = epoch + 1
        epoch_end = time.time()
        print("Epoch: {:03d}, Training: Loss: {:.4f}, Accuracy: {:.4f}%, \n"
              " Validation: Loss: {:.4f}, Accuracy: {:.4f}%, Time: {:.4f}s".format(
                  epoch + 1, avg_train_loss, avg_train_acc * 100,
                  avg_valid_loss, avg_valid_acc * 100, epoch_end - epoch_start))
        print("Best Accuracy for validation : {:.4f} at epoch {:03d}".format(best_acc, best_epoch))
        torch.save(model, '/home/zmz/model_09/models/' + 'model_' + str(epoch + 1) + '.pkl')
    return model, history


num_epochs = 50
trained_model, history = train_and_valid(model, criterion, optimizer, num_epochs)
torch.save(history, '/home/zmz/model_09/models/' + '_history.pkl')

history = np.array(history)
plt.plot(history[:, 0:2])
plt.legend(['Tr Loss', 'Val Loss'])
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
# plt.ylim(0, 1)
plt.savefig('_loss_curve.png')
plt.show()

plt.plot(history[:, 2:4])
plt.legend(['Tr Accuracy', 'Val Accuracy'])
plt.xlabel('Epoch Number')
plt.ylabel('Accuracy')
# plt.ylim(0, 1)
plt.savefig('_accuracy_curve.png')
plt.show()
The train/valid data live in a txt file, one "<image path> <label>" per line, like this:
/home/zmz/model_09/cut2/2720.jpg 3000
/home/zmz/model_09/cut2/3095.jpg 3000
/home/zmz/model_09/cut2/3470.jpg 3000
/home/zmz/model_09/cut2/3845.jpg 3000
/home/zmz/model_09/cut2/4595.jpg 3000
/home/zmz/model_09/cut2/4970.jpg 3000
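A minimal parser for this annotation format, mirroring what MyDataset.__init__ does with the txt file (`read_annotations` is a hypothetical helper name, not part of the scripts above):

```python
def read_annotations(text):
    """Parse '<image path> <label>' lines into (path, int label) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        path, label = line.split()
        samples.append((path, int(label)))
    return samples

txt = "/home/zmz/model_09/cut2/2720.jpg 3000\n/home/zmz/model_09/cut2/3095.jpg 3000"
print(read_annotations(txt))
```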
test.py
# coding:utf-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import time
import os
import numpy as np
import matplotlib.pyplot as plt

batch_size = 1
num_classes = 50


class MyDataset(Dataset):
    def __init__(self, path, transform=None, target_transform=None):
        fh = open(path, 'r')
        imgs = []
        for line in fh:
            line = line.rstrip()
            words = line.split()
            imgs.append((words[0], float(words[1])))
        self.imgs = imgs
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, index):
        fn, label = self.imgs[index]
        img = Image.open(fn).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, label

    def __len__(self):
        return len(self.imgs)


test_valid_transforms = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor()
    # transforms.Normalize([0.485, 0.456, 0.406],
    #                      [0.229, 0.224, 0.225])
])

valid_dataset = MyDataset('/home/zmz/model_09/reg_valid.txt', transform=test_valid_transforms)
valid_data_size = len(valid_dataset)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)

model = torch.load('./fig_reg/model_47.pkl')  # = models.resnet18(pretrained=True)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# criterion = nn.NLLLoss()
criterion = nn.MSELoss(size_average=False)


def train_and_valid(model, loss_function, valid_data_size):
    for epoch in range(valid_data_size):
        correct = 0
        total = 0
        with torch.no_grad():  # no gradient computation
            model.eval()
            for batch_index, (inputs, labels) in enumerate(valid_loader, 0):
                inputs = inputs.to(device)
                labels = labels.float()
                labels = labels.to(device)
                outputs = model(inputs)
                loss = loss_function(outputs.squeeze(1), labels)
                print("Epoch: {:03d}, outputs:{:4f}, label:{:.4f}, Loss: {:.4f}: ".format(
                    epoch + 1, outputs.item(), labels.item(), loss.item()))


train_and_valid(model, criterion, valid_data_size)
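For reference, nn.MSELoss(size_average=False) sums the squared errors over the batch instead of averaging them. A plain-Python equivalent (an illustrative sketch, not the PyTorch implementation):

```python
def sum_squared_error(outputs, labels):
    """What nn.MSELoss(size_average=False) computes: sum of (o - l)^2 over the batch."""
    return sum((o - l) ** 2 for o, l in zip(outputs, labels))

print(sum_squared_error([3.0, 1.0], [1.0, 1.0]))  # -> 4.0
```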
Training with the pretrained weights frozen (not updated) gave poor results: only about 40% accuracy.
With the backbone frozen, GPU memory usage is noticeably lower (about 2300/11000 MB); with it unfrozen, about 10000/11000 MB.
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1532498777990/work/aten/src/THC/generated/../generic/THCTensorMathReduce.cu:18
On a 1080 Ti, batch sizes above 64 report out of memory.
Watch GPU memory in real time:
watch -n 1 nvidia-smi
Things to watch during training:
Data:
Dataset size: with only ~1400 training images, is overfitting likely?
Data augmentation
Preprocessing: color-space conversion; transforms offer random crop / flip / center crop / color jitter / normalization
Check that images and labels actually correspond
Check that train and test sets share the same distribution; sample randomly, e.g. an 8:1:1 split into train / valid / test
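The random 8:1:1 split can be sketched in a few lines of plain Python (`split_811` is a hypothetical helper; in practice the resulting lists would be written out as the train.txt/valid.txt annotation files):

```python
import random

def split_811(samples, seed=0):
    """Shuffle once with a fixed seed, then slice 80/10/10 into train/valid/test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_valid = int(n * 0.8), int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])

train, valid, test = split_811(range(1000))
print(len(train), len(valid), len(test))  # -> 800 100 100
```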
Training:
Network input/output sizes
Whether to use pretrained weights, and whether to update (fine-tune) them
Choose the loss according to classification vs. regression
Optimizer: SGD with momentum, or Adam (works really well)
Batch size: generally larger is better, around 128 is ideal, but it depends on GPU memory; here ResNet-50 at 224x224 with batch 64 nearly fills 10 GB
Learning rate: start from 0.001 and scale by factors of 10; momentum around 0.8 or so; Adam doesn't need this tuning
Save the model every epoch; a good stopping point is where train loss is still falling and valid loss, after declining, is about to turn back up. If train loss itself is poor, suspect underfitting.
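The "save every epoch, pick the best" step can be sketched as a pure-Python pass over the history list that train_and_valid builds (rows are [train_loss, valid_loss, train_acc, valid_acc]; `best_checkpoint` is a hypothetical helper name):

```python
def best_checkpoint(history):
    """Return (1-based epoch, valid_acc) for the epoch with the highest validation accuracy."""
    best_i = max(range(len(history)), key=lambda i: history[i][3])
    return best_i + 1, history[best_i][3]

history = [
    [1.2, 1.3, 0.40, 0.38],
    [0.8, 0.9, 0.60, 0.55],
    [0.5, 0.7, 0.75, 0.62],  # best validation accuracy
    [0.3, 0.8, 0.85, 0.60],  # valid loss rising again: overfitting onset
]
print(best_checkpoint(history))  # -> (3, 0.62)
```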
Handling overfitting: ResNet-50 overfitting treatment https://blog.csdn.net/weixin_43610118/article/details/99561227
How can ResNet validation/test accuracy be improved? https://www.zhihu.com/question/278563008
So I simply applied random flips, crops, and color changes to every image in the training set, doubling it from 3000 to 6000 samples, then trained hard again, and it works!! Validation accuracy climbed step by step, peaking at 96.7%, and the final test accuracy was 96.2%. It may sound like fooling yourself, since all I did was augment the data, but I'd call it a legitimate technique; at least I didn't have to go scraping extra images off the web to raise accuracy. So the model really had been overfitting before: even with regularization and BN, and even with the network cut down to just 8 conv layers, it still overfit, and dropout didn't help either (once you have BN, adding dropout makes little difference). That's why nothing worked earlier no matter how I trained, even though training accuracy quickly pinned at 100%; it also shows that ResNet is a great tool for training networks. Apparently sample size and model capacity have to match, or overfitting comes easily. After augmentation, training was visibly harder: validation and training losses stumbled down toward the valley together, but getting the result is what matters. Author: C-Walk
Link: https://www.zhihu.com/question/278563008/answer/401790505
Source: Zhihu
Copyright belongs to the author. Contact the author for permission before commercial reuse; cite the source for non-commercial reuse.
torchvision.ToTensor: https://blog.csdn.net/WhiffeYF/article/details/104747845
PyTorch in practice (model training, loading, testing): https://blog.csdn.net/public669/article/details/97752226
matplotlib: https://www.cnblogs.com/BackingStar/p/10986955.html
Regression/classification metrics: https://blog.csdn.net/weixin_41012399/article/details/91472569