
Speeding Up Deep Learning Inference Using TensorRT



This post uses a C++ example to walk you through converting a PyTorch model into an ONNX model, importing it into TensorRT, applying optimizations, and generating a high-performance runtime engine for the datacenter environment.
Simple TensorRT example

Following are the four steps for this example application:

  1. Convert the pretrained image segmentation PyTorch model into ONNX.
  2. Import the ONNX model into TensorRT.
  3. Apply optimizations and generate an engine.
  4. Perform inference on the GPU.

Importing the ONNX model includes loading it from a saved file on disk and converting it to a TensorRT network from its native framework or format.

ONNX is a standard for representing deep learning models that enables them to be transferred between frameworks (Caffe2, Chainer, CNTK, PaddlePaddle, PyTorch, and MXNet support the ONNX format).

The last step is to provide input data to the TensorRT engine to perform inference.

The application uses the following components in TensorRT:
ONNX parser: Takes a converted PyTorch trained model into the ONNX format as input and populates a network object in TensorRT.
Builder: Takes a network in TensorRT and generates an engine that is optimized for the target platform.
Engine: Takes input data, performs inferences, and emits inference output.
Logger: Associated with the builder and engine to capture errors, warnings, and other information during the build and inference phases.
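
The gLogger object passed to the builder and parser in the later code snippets is an instance of such a logger. Below is a minimal sketch of one possible implementation; the exact log() signature varies slightly between TensorRT versions, so treat it as an illustration rather than the sample's own code.

#include <NvInfer.h>
#include <iostream>

// Minimal ILogger implementation used as gLogger; prints warnings and errors only.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // Severity values are ordered: kINTERNAL_ERROR < kERROR < kWARNING < kINFO < kVERBOSE.
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;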

Convert the pretrained image segmentation PyTorch model into ONNX

Convert the PyTorch-trained UNet model into ONNX, as shown in the following code example:

import torch
from torch.autograd import Variable
import torch.onnx as torch_onnx
import onnx
def main():
    input_shape = (3, 256, 256)
    model_onnx_path = "unet.onnx"
    dummy_input = Variable(torch.randn(1, *input_shape))
    model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
                           in_channels=3, out_channels=1, init_features=32, pretrained=True)
    model.train(False)
    inputs = ['input.1']
    outputs = ['186']
    dynamic_axes = {'input.1': {0: 'batch'}, '186': {0: 'batch'}}
    out = torch.onnx.export(model, dummy_input, model_onnx_path,
                            input_names=inputs, output_names=outputs,
                            dynamic_axes=dynamic_axes)

if __name__ == '__main__':
    main()
Next, prepare the input data for inference.
Import the ONNX model into TensorRT, generate the engine, and perform inference

The data is provided as an ONNX protobuf file.
The sample application compares output generated from TensorRT with reference values available as ONNX .pb files.

The main function in the following code example starts by declaring a CUDA engine to hold the network definition and trained parameters.

The engine is generated in the createCudaEngine function, which takes the path to the ONNX model as input.

// Declare the CUDA engine
unique_ptr<ICudaEngine, Destroy<ICudaEngine>> engine{nullptr};

// Create the CUDA engine
engine.reset(createCudaEngine(onnxModelPath, batchSize));

The createCudaEngine function parses the ONNX model and holds it in the network object.
To handle the dynamic input dimensions of input images and shape tensors for the U-Net model, you must create an optimization profile from the builder class, as shown in the following code example.

The optimization profile enables you to set the optimum input, minimum, and maximum dimensions to the profile.

The builder selects the kernel that results in the lowest runtime for the optimum input dimensions and that is valid for all input dimensions in the range between the minimum and maximum dimensions.

It also converts the network object into a TensorRT engine.

The setMaxBatchSize function in the following code example is used to specify the maximum batch size that a TensorRT engine expects.

The code for createCudaEngine is as follows:

nvinfer1::ICudaEngine* createCudaEngine(string const& onnxModelPath, int batchSize)
{
    unique_ptr<nvinfer1::IBuilder, Destroy<nvinfer1::IBuilder>> builder{nvinfer1::createInferBuilder(gLogger)};
    const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    unique_ptr<nvinfer1::INetworkDefinition, Destroy<nvinfer1::INetworkDefinition>> network{builder->createNetworkV2(explicitBatch)};
    unique_ptr<nvonnxparser::IParser, Destroy<nvonnxparser::IParser>> parser{nvonnxparser::createParser(*network, gLogger)};
    unique_ptr<nvinfer1::IBuilderConfig, Destroy<nvinfer1::IBuilderConfig>> config{builder->createBuilderConfig()};

    if (!parser->parseFromFile(onnxModelPath.c_str(), static_cast<int>(ILogger::Severity::kINFO)))
    {
        cout << "ERROR: could not parse input engine." << endl;
        return nullptr;
    }

    builder->setMaxBatchSize(batchSize);
    config->setMaxWorkspaceSize((1 << 30));

    auto profile = builder->createOptimizationProfile();
    profile->setDimensions(network->getInput(0)->getName(), OptProfileSelector::kMIN, Dims4{1, 3, 256, 256});
    profile->setDimensions(network->getInput(0)->getName(), OptProfileSelector::kOPT, Dims4{1, 3, 256, 256});
    profile->setDimensions(network->getInput(0)->getName(), OptProfileSelector::kMAX, Dims4{32, 3, 256, 256});
    config->addOptimizationProfile(profile);

    return builder->buildEngineWithConfig(*network, *config);
}

After an engine has been created, create an execution context to hold intermediate activation values generated during inference. The following code shows how to create the execution context.

// Declare the execution context
unique_ptr<IExecutionContext, Destroy<IExecutionContext>> context{nullptr};
...
// Create the execution context
context.reset(engine->createExecutionContext()); 

This application places inference requests on the GPU asynchronously in the function launchInference shown in the following code example.

Inputs are copied from host (CPU) to device (GPU) within launchInference, inference is then performed with the enqueue function, and results are copied back asynchronously.

The example uses CUDA streams to manage asynchronous work on the GPU.

Asynchronous inference execution generally increases performance by overlapping compute as it maximizes GPU utilization.

The enqueue function places inference requests on CUDA streams and takes as input runtime batch size, pointers to input and output, plus the CUDA stream to be used for kernel execution.

Asynchronous data transfers are performed from the host to device and the reverse using cudaMemcpyAsync.

void launchInference(IExecutionContext* context, cudaStream_t stream, vector<float> const& inputTensor, vector<float>& outputTensor, void** bindings, int batchSize)
{
    int inputId = getBindingInputIndex(context);
    cudaMemcpyAsync(bindings[inputId], inputTensor.data(), inputTensor.size() * sizeof(float), cudaMemcpyHostToDevice, stream);
    context->enqueueV2(bindings, stream, nullptr);
    cudaMemcpyAsync(outputTensor.data(), bindings[1 - inputId], outputTensor.size() * sizeof(float), cudaMemcpyDeviceToHost, stream);
}
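
Because launchInference only enqueues work, the caller must synchronize the stream before reading outputTensor. Below is a minimal sketch of the surrounding calls, assuming the device buffers in bindings have already been allocated:

// Create a stream, run one asynchronous inference, and wait for it to finish.
cudaStream_t stream;
cudaStreamCreate(&stream);

launchInference(context.get(), stream, inputTensor, outputTensor, bindings, batchSize);
cudaStreamSynchronize(stream);   // blocks until the copies and the inference kernels complete

cudaStreamDestroy(stream);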

The number of inputs and outputs, as well as the value and dimension of each, can be queried using functions from the ICudaEngine class.
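
As an illustration, here is a minimal sketch of how the bindings array passed to launchInference could be sized and allocated with those query functions. It assumes the pre-8.5 binding API (getNbBindings, getBindingDimensions) that this sample targets, plus std::accumulate and std::multiplies; because the batch dimension is dynamic, the element count is taken from batchSize rather than from the engine:

// Allocate one device buffer per engine binding (input or output).
for (int i = 0; i < engine->getNbBindings(); ++i)
{
    Dims dims{engine->getBindingDimensions(i)};
    // First dimension is the dynamic batch; multiply the remaining dimensions by batchSize.
    size_t size = accumulate(dims.d + 1, dims.d + dims.nbDims,
                             static_cast<size_t>(batchSize), multiplies<size_t>());
    cudaMalloc(&bindings[i], size * sizeof(float));
}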

Batch your inputs
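
With the dynamic-batch optimization profile built earlier (minimum 1, maximum 32), the runtime batch size must be set on the execution context before inference is enqueued. A minimal sketch, where a batch size of 16 is just a hypothetical value inside the profile range:

// Set the actual input shape for this inference; it must lie within the profile's [kMIN, kMAX] range.
int batchSize = 16;   // hypothetical runtime batch size (this profile allows 1 to 32)
context->setBindingDimensions(0, Dims4{batchSize, 3, 256, 256});
launchInference(context.get(), stream, inputTensor, outputTensor, bindings, batchSize);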

In a GPU, the ALUs that perform single-precision arithmetic are generally called FP32 cores (or simply cores), while the ALUs used for double-precision arithmetic are called DP units or FP64 cores. The ratio between the two varies widely across NVIDIA GPU architectures and models.

TensorRT model conversion and deployment: distinguishing FP32, FP16, and INT8 precision.
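
On the TensorRT side, reduced precision is requested through the builder configuration when the engine is created. A minimal sketch, assuming a GPU with fast FP16 support; INT8 additionally requires a calibrator (or explicit per-tensor dynamic ranges), which is not shown here:

// Inside createCudaEngine, before buildEngineWithConfig:
if (builder->platformHasFastFp16())
    config->setFlag(nvinfer1::BuilderFlag::kFP16);   // allow FP16 kernels where they are faster
// For INT8, also provide a calibrator:
// config->setFlag(nvinfer1::BuilderFlag::kINT8);
// config->setInt8Calibrator(calibrator);            // calibrator is a hypothetical IInt8Calibrator*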
