Welcome to PaddleClas!

tutorials

Installation


Introduction

This document introduces how to install PaddleClas and its requirements.

Install PaddlePaddle

Python 3.5 or later, CUDA 9.0, cuDNN 7.0 and NCCL 2.1.2 or later are required. For now, PaddleClas only supports training on GPU devices. Please follow the instructions in Installation if the PaddlePaddle version on the device is lower than v1.7.

Install PaddlePaddle

pip install paddlepaddle-gpu --upgrade

Alternatively, compile from source code; please refer to Installation.

Verify Installation

import paddle.fluid as fluid
fluid.install_check.run_check()

Check PaddlePaddle version:

python -c "import paddle; print(paddle.__version__)"

Note:

  • Make sure the compiled version is later than v1.7
  • Specify WITH_DISTRIBUTE=ON when compiling. Please refer to the Instruction for more details.

Install PaddleClas

Clone PaddleClas:

cd path_to_clone_PaddleClas
git clone https://github.com/PaddlePaddle/PaddleClas.git

Install requirements

pip install --upgrade -r requirements.txt

Trial in 30mins

Based on the flowers102 dataset, it takes only 30 minutes to experience PaddleClas, including training several backbones with pretrained models, SSLD distillation, and multiple data augmentation methods. Please refer to Installation to install PaddleClas first.

Preparation

  • enter the installation dir
cd path_to_PaddleClas
  • enter dataset/flowers102, download and decompress flowers102 dataset.
cd dataset/flowers102
wget https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz
wget https://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat
wget https://www.robots.ox.ac.uk/~vgg/data/flowers/102/setid.mat
tar -xf 102flowers.tgz
  • create train/val/test label files
python generate_flowers102_list.py jpg train > train_list.txt
python generate_flowers102_list.py jpg valid > val_list.txt
python generate_flowers102_list.py jpg test > extra_list.txt
cat train_list.txt extra_list.txt > train_extra_list.txt

Note: In order to offer more data to the SSLD training task, train_list.txt and extra_list.txt are merged into train_extra_list.txt.

  • return to the PaddleClas dir
cd ../../

Environment

Set PYTHONPATH
export PYTHONPATH=./:$PYTHONPATH
Download pretrained model
python tools/download.py -a ResNet50_vd -p ./pretrained -d True
python tools/download.py -a ResNet50_vd_ssld -p ./pretrained -d True
python tools/download.py -a MobileNetV3_large_x1_0 -p ./pretrained -d True

Parameters:

  • architecture (shortname: a): model name.
  • path (shortname: p): download path.
  • decompress (shortname: d): whether to decompress.
  • All experiments are run on a single NVIDIA® Tesla® V100 card.

Training

Train from scratch
  • Train ResNet50_vd
export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
        -c ./configs/quick_start/ResNet50_vd.yaml

The validation Top1 Acc curve is shown below.

_images/r50_vd_acc.png

Finetune - ResNet50_vd pretrained model (Acc 79.12%)
  • finetune the ResNet50_vd model pretrained on the 1000-class ImageNet dataset
export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
        -c ./configs/quick_start/ResNet50_vd_finetune.yaml

The validation Top1 Acc curve is shown below.

_images/r50_vd_pretrained_acc.png

Compared with training from scratch, the accuracy improves by 65% to 94.02%.

SSLD finetune - ResNet50_vd_ssld pretrained model (Acc 82.39%)

Note: when fine-tuning a model that has been trained with SSLD, please use a smaller learning rate for the middle layers of the network.

ARCHITECTURE:
    name: 'ResNet50_vd'
    params:
        lr_mult_list: [0.1, 0.1, 0.2, 0.2, 0.3]
pretrained_model: "./pretrained/ResNet50_vd_ssld_pretrained"

Training script

export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
        -c ./configs/quick_start/ResNet50_vd_ssld_finetune.yaml

Compared with fine-tuning from the 79.12% pretrained model, the accuracy improves by about 1% to 95%.

More architecture - MobileNetV3

Training script

export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
        -c ./configs/quick_start/MobileNetV3_large_x1_0_finetune.yaml

Compared with the ResNet50_vd pretrained model, the accuracy decreases by 5% to 90%. Different architectures give different performance; choosing the best-performing model is a task-oriented decision that should consider inference time, storage, heterogeneous devices, etc.

RandomErasing

Data augmentation works when training data is small.
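
As a rough illustration of the RandomErasing idea, the sketch below blanks out one randomly sized, randomly placed rectangle in an image; the function name, parameter names and default ranges are illustrative, not the exact PaddleClas operator.

import random

import numpy as np


def random_erase(img, area_range=(0.02, 0.4), aspect_range=(0.3, 3.3), fill=0):
    # Minimal RandomErasing sketch: blank out one random rectangle.
    # `img` is an H x W x C uint8 array; parameters are illustrative.
    h, w = img.shape[:2]
    for _ in range(10):  # retry a few times until a rectangle fits
        target_area = random.uniform(*area_range) * h * w
        aspect = random.uniform(*aspect_range)
        erase_h = int(round((target_area * aspect) ** 0.5))
        erase_w = int(round((target_area / aspect) ** 0.5))
        if erase_h < h and erase_w < w:
            top = random.randint(0, h - erase_h)
            left = random.randint(0, w - erase_w)
            img[top:top + erase_h, left:left + erase_w, :] = fill
            break
    return img


dummy = np.full((224, 224, 3), 255, dtype=np.uint8)
print(random_erase(dummy).mean())  # mean drops below 255 once a patch is erased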

Training script

export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
        -c ./configs/quick_start/ResNet50_vd_ssld_random_erasing_finetune.yaml

It improves the accuracy by 1.27% to 96.27%.

  • Save the ResNet50_vd pretrained model for the next chapter.
cp -r output/ResNet50_vd/19/  ./pretrained/flowers102_R50_vd_final/
Distillation
  • Use extra_list.txt as unlabeled data. Note:
    • Samples in extra_list.txt and val_list.txt do not intersect.
    • Because the label information is not used in the source code, this is still unlabeled distillation.
    • The teacher model uses the pretrained model trained on the flowers102 dataset, and the student model uses the MobileNetV3_large_x1_0 pretrained model (Acc 75.32%) trained on the ImageNet1K dataset.
total_images: 7169
ARCHITECTURE:
    name: 'ResNet50_vd_distill_MobileNetV3_large_x1_0'
pretrained_model:
    - "./pretrained/flowers102_R50_vd_final/ppcls"
    - "./pretrained/MobileNetV3_large_x1_0_pretrained/”
TRAIN:
    file_list: "./dataset/flowers102/train_extra_list.txt"

Final training script

export CUDA_VISIBLE_DEVICES=0
python -m paddle.distributed.launch \
    --selected_gpus="0" \
    tools/train.py \
        -c ./configs/quick_start/R50_vd_distill_MV3_large_x1_0.yaml

With more unlabeled data and the teacher model, the accuracy improves significantly by 6.47% to 96.47%.

All accuracy
Configuration Top1 Acc
ResNet50_vd.yaml 0.2735
MobileNetV3_large_x1_0_finetune.yaml 0.9000
ResNet50_vd_finetune.yaml 0.9402
ResNet50_vd_ssld_finetune.yaml 0.9500
ResNet50_vd_ssld_random_erasing_finetune.yaml 0.9627
R50_vd_distill_MV3_large_x1_0.yaml 0.9647

The accuracy curves of all configurations are shown below.

_images/all_acc.png

  • NOTE: As flowers102 is a small dataset, the validation accuracy may fluctuate by around 1%.
  • Please refer to Getting_started for more details.

Data


Introduction

This document introduces the preparation of the ImageNet1k and flowers102 datasets.

Dataset

Dataset train dataset size valid dataset size category
flowers102 1k 6k 102
ImageNet1k 1.2M 50k 1000
  • Data format

Please organize the data as shown below, including train_list.txt and val_list.txt.

# delimiter: "space"

ILSVRC2012_val_00000001.JPEG 65
...
ImageNet1k

After downloading the data, please organize the data dir as below:

PaddleClas/dataset/imagenet/
|_ train/
|  |_ n01440764
|  |  |_ n01440764_10026.JPEG
|  |  |_ ...
|  |_ ...
|  |
|  |_ n15075141
|     |_ ...
|     |_ n15075141_9993.JPEG
|_ val/
|  |_ ILSVRC2012_val_00000001.JPEG
|  |_ ...
|  |_ ILSVRC2012_val_00050000.JPEG
|_ train_list.txt
|_ val_list.txt
Flowers102 Dataset

Download the data, then decompress it:

jpg/
setid.mat
imagelabels.mat

Please put all the files under PaddleClas/dataset/flowers102

Generate train_list.txt and val_list.txt using generate_flowers102_list.py:

python generate_flowers102_list.py jpg train > train_list.txt
python generate_flowers102_list.py jpg valid > val_list.txt

Please organize data dir as below

PaddleClas/dataset/flowers102/
|_ jpg/
|  |_ image_03601.jpg
|  |_ ...
|  |_ image_02355.jpg
|_ train_list.txt
|_ val_list.txt

Getting Started


Please refer to Installation to set up the environment first, and prepare the ImageNet1K data by following the instructions in the Data section.

Setup

Setup PYTHONPATH:

export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH

Training and validating

PaddleClas provides tools/train.py and tools/eval.py to start training and validation.

Training
# PaddleClas uses paddle.distributed.launch to start multi-card, multi-process training.
# Set --selected_gpus to indicate the GPU cards to use.

python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    tools/train.py \
        -c ./configs/ResNet/ResNet50_vd.yaml
  • log:
epoch:0    train    step:13    loss:7.9561    top1:0.0156    top5:0.1094    lr:0.100000    elapse:0.193

Add -o params to update the configuration:

python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    tools/train.py \
        -c ./configs/ResNet/ResNet50_vd.yaml \
        -o use_mix=1 \
    --vdl_dir=./scalar/
  • log:
epoch:0    train    step:522    loss:1.6330    lr:0.100000    elapse:0.210

Alternatively, modify the configuration fields in the config file directly; please refer to config for more details.

Use VisualDL to visualize the training loss in real time:

visualdl --logdir ./scalar --host <host_IP> --port <port_num>
finetune
  • please refer to Trial for more details.
validation
python tools/eval.py \
    -c ./configs/eval.yaml \
    -o ARCHITECTURE.name="ResNet50_vd" \
    -o pretrained_model=path_to_pretrained_models

Modify the `ARCHITECTURE.name` and `pretrained_model` fields in `configs/eval.yaml` to configure the model to validate, or add -o params to update the config directly.


NOTE: when loading the pretrained model, the .pdparams suffix should be omitted.

Predict

PaddlePaddle supports three prediction interfaces.
Use the predictor interface to predict.
First, export the inference model:

python tools/export_model.py \
    --model=model_name \
    --pretrained_model=pretrained_model_dir \
    --output_path=save_inference_dir

Second, start the predictor engine:

python tools/infer/predict.py \
    -m model_path \
    -p params_path \
    -i image_path \
    --use_gpu=1 \
    --use_tensorrt=True

please refer to inference for more details.

Configuration


Introduction

This document introduces the configuration of PaddleClas (the fields in configs/*.yaml).

Basic

name detail default value optional value
mode mode "train" ["train"," valid"]
architecture model name "ResNet50_vd" one of 23 architectures
pretrained_model pretrained model path "" Str
model_save_dir model stored path "" Str
classes_num class number 1000 int
total_images total images 1281167 int
save_interval save interval 1 int
validate whether to validate when training TRUE bool
valid_interval valid interval 1 int
epochs epoch int
topk K value 5 int
image_shape image size [3,224,224] list, shape: (3,)
use_mix whether to use mixup False ['True', 'False']
ls_epsilon label_smoothing epsilon value 0 float

Optimizer & Learning rate

learning rate

name detail default value Optional value
function decay type "Linear" ["Linear", "Cosine", "Piecewise", "CosineWarmup"]
params.lr initial learning rate 0.1 float
params.decay_epochs milestone in piecewisedecay list
params.gamma gamma in piecewisedecay 0.1 float
params.warmup_epoch warmup epoch 5 int
params.steps decay steps in lineardecay 100 int
params.end_lr end lr in lineardecay 0 float

optimizer

name detail default value optional value
function optimizer name "Momentum" ["Momentum", "RmsProp"]
params.momentum momentum value 0.9 float
regularizer.function regularizer method name "L2" ["L1", "L2"]
regularizer.factor regularizer factor 0.0001 float

reader

name detail
batch_size batch size
num_workers worker number
file_list train list path
data_dir train dataset path
shuffle_seed seed

processing

function name attribute name detail
DecodeImage to_rgb decode to RGB
to_np to numpy
channel_first Channel first
RandCropImage size random crop
RandFlipImage random flip
NormalizeImage scale normalize image
mean mean
std std
order order
ToCHWImage to CHW
CropImage size crop size
ResizeImage resize_short resize according to short size

mix preprocessing

name detail
MixupOperator.alpha alpha value in mixup

models

Model Library Overview

Overview

Based on the ImageNet1k classification dataset, the 23 classification network structures supported by PaddleClas and the corresponding 117 image classification pretrained models are shown below. Training trick, a brief introduction to each series of network structures, and performance evaluation will be shown in the corresponding chapters.

Evaluation environment

  • CPU evaluation environment is based on Snapdragon 855 (SD855).
  • The GPU evaluation environment is based on V100 and TensorRT, and the evaluation script is as follows.
#!/usr/bin/env bash

export PYTHONPATH=$PWD:$PYTHONPATH

python tools/infer/predict.py \
    --model_file='pretrained/infer/model' \
    --params_file='pretrained/infer/params' \
    --enable_benchmark=True \
    --model_name=ResNet50_vd \
    --use_tensorrt=True \
    --use_fp16=False \
    --batch_size=1

_images/t4.fp32.bs4.main_fps_top1.png

_images/v100.fp32.bs1.main_fps_top1_s.jpg

_images/mobile_arm_top1.png

If you find this document helpful, welcome to star our project: https://github.com/PaddlePaddle/PaddleClas

Pretrained model list and download address

Note: The pretrained models of EfficientNetB1-B7 in the above list are converted from the PyTorch version of EfficientNet, and the ResNeXt101_wsl series of pretrained models are converted from the official repo; the remaining pretrained models are obtained by training with the PaddlePaddle framework, and the corresponding training hyperparameters are given in configs.

References

[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

[2] He T, Zhang Z, Zhang H, et al. Bag of tricks for image classification with convolutional neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 558-567.

[3] Howard A, Sandler M, Chu G, et al. Searching for mobilenetv3[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 1314-1324.

[4] Sandler M, Howard A, Zhu M, et al. Mobilenetv2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 4510-4520.

[5] Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.

[6] Ma N, Zhang X, Zheng H T, et al. Shufflenet v2: Practical guidelines for efficient cnn architecture design[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 116-131.

[7] Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 1492-1500.

[8] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7132-7141.

[9] Gao S, Cheng M M, Zhao K, et al. Res2net: A new multi-scale backbone architecture[J]. IEEE transactions on pattern analysis and machine intelligence, 2019.

[10] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1-9.

[11] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, inception-resnet and the impact of residual connections on learning[C]//Thirty-first AAAI conference on artificial intelligence. 2017.

[12] Chollet F. Xception: Deep learning with depthwise separable convolutions[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 1251-1258.

[13] Wang J, Sun K, Cheng T, et al. Deep high-resolution representation learning for visual recognition[J]. arXiv preprint arXiv:1908.07919, 2019.

[14] Chen Y, Li J, Xiao H, et al. Dual path networks[C]//Advances in neural information processing systems. 2017: 4467-4475.

[15] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.

[16] Tan M, Le Q V. Efficientnet: Rethinking model scaling for convolutional neural networks[J]. arXiv preprint arXiv:1905.11946, 2019.

[17] Mahajan D, Girshick R, Ramanathan V, et al. Exploring the limits of weakly supervised pretraining[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 181-196.

[18] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.

[19] Iandola F N, Han S, Moskewicz M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size[J]. arXiv preprint arXiv:1602.07360, 2016.

[20] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

[21] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 779-788.

[22] Ding X, Guo Y, Ding G, et al. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 1911-1920.

Tricks for Training

Choice of Optimizers:

Since the development of deep learning, many researchers have worked on optimizers. The purpose of an optimizer is to make the loss function as small as possible, so as to find suitable parameters for a given task. At present, the main optimizers used in model training are SGD, RMSProp, Adam, AdaDelta and so on. The SGD optimizer with momentum is widely used in academia and industry, so most of the models we release are trained with it. The SGD optimizer with momentum has two disadvantages: the convergence speed is slow, and the initial learning rate is difficult to set. However, if the initial learning rate is set properly and the model is trained for enough iterations, a model trained by SGD with momentum can reach higher accuracy than one trained by other optimizers. Optimizers with an adaptive learning rate, such as Adam and RMSProp, tend to converge faster, but their final convergence accuracy is slightly worse. If you want a model that converges faster, we recommend an optimizer with an adaptive learning rate; if you want the highest accuracy, we recommend the SGD optimizer with momentum.
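
For reference, the SGD-with-momentum update rule discussed above can be written in a few lines; this is a generic sketch of the update, not Paddle's optimizer implementation.

def sgd_momentum_step(weights, velocity, grads, lr=0.1, momentum=0.9):
    # One SGD-with-momentum update: v = momentum * v - lr * g; w = w + v.
    for i in range(len(weights)):
        velocity[i] = momentum * velocity[i] - lr * grads[i]
        weights[i] += velocity[i]
    return weights, velocity


# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w, v = [5.0], [0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grads=[2.0 * w[0]])
print(round(w[0], 4))  # close to 0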

Choice of Learning Rate and Learning Rate Declining Strategy:

The choice of learning rate is related to the optimizer, data set and tasks. Here we mainly introduce the learning rate of training ImageNet-1K with momentum + SGD as the optimizer and the choice of learning rate decline.

Concept of Learning Rate:

The learning rate is the hyperparameter that controls the learning speed: the lower the learning rate, the slower the loss value changes. Using a low learning rate ensures that you will not miss any local minimum, but it also means slow convergence, especially when the gradient is trapped in a plateau area.

Learning Rate Decline Strategy:

During training, if we always use the same learning rate, we cannot obtain the model with the highest accuracy, so the learning rate should be adjusted during training. In the early stage of training, the weights are in a randomly initialized state and the loss descends quickly, so we can set a relatively large learning rate for faster convergence. In the late stage of training, the weights are close to the optimal values, and the optimum cannot be reached with a relatively large learning rate, so a relatively small learning rate should be used. During training, many researchers use the piecewise_decay strategy, which decreases the learning rate stepwise. For example, when training ResNet50, the initial learning rate is 0.1, the learning rate drops to 1/10 every 30 epochs, and training runs for 120 epochs in total. Besides piecewise_decay, researchers have also proposed other ways to decrease the learning rate, such as polynomial_decay, exponential_decay and cosine_decay; among them, cosine_decay has become the preferred learning rate reduction method for improving model accuracy because there is no need to adjust hyperparameters and its robustness is relatively high. The learning rate curves of cosine_decay and piecewise_decay are shown in the following figures. It is easy to observe that during the entire training process, cosine_decay keeps a relatively large learning rate, so its convergence is slower, but the final convergence accuracy is better than that of piecewise_decay.

_images/lr_decay.jpeg

In addition, we can also see from the figures that the number of epochs with a small learning rate is smaller in cosine_decay, which affects the final accuracy, so in order to make cosine_decay play its full effect, it is recommended to use cosine_decay with a large number of epochs, such as 200 epochs.
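
The two schedules described above can be sketched in a few lines of Python with the ResNet50 settings mentioned here (base learning rate 0.1, a 10x drop every 30 epochs, 120 epochs in total); the function names are illustrative.

import math


def piecewise_lr(epoch, base_lr=0.1, step=30):
    # Piecewise decay: divide the learning rate by 10 every `step` epochs.
    return base_lr * (0.1 ** (epoch // step))


def cosine_lr(epoch, base_lr=0.1, total_epochs=120):
    # Cosine decay: follow half a cosine period from base_lr down to 0.
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))


for epoch in (0, 30, 60, 90, 119):
    print(epoch, round(piecewise_lr(epoch), 5), round(cosine_lr(epoch), 5))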

Warmup Strategy

If a large batch_size is adopted to train the neural network, we recommend the warmup strategy. As the name suggests, the warmup strategy lets the model warm up first: we do not use the initial learning rate directly at the beginning of training, but instead train with a gradually increasing learning rate; when the increasing learning rate reaches the initial learning rate, the learning rate reduction method mentioned above is then used to decay it. Experiments show that when the batch size is large, the warmup strategy can improve accuracy. For models trained with a large batch_size, such as MobileNetV3, we set the warmup epoch to 5 by default, that is, during the first 5 epochs the learning rate increases from 0 to the initial learning rate, and then learning rate decay begins.
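
A minimal sketch of linear warmup combined with cosine decay, assuming the 5-epoch warmup and 120-epoch schedule mentioned above; the function name is illustrative.

import math


def warmup_cosine_lr(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=120):
    # Linear warmup from ~0 to base_lr, then cosine decay for the remaining epochs.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


print([round(warmup_cosine_lr(e), 4) for e in (0, 2, 4, 5, 60, 119)])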

Choice of Batch_size

Batch_size is an important hyperparameter in training neural networks; it determines how much data is sent to the neural network for training at a time. In the paper [1], the authors found through experiments that when the batch_size and the learning rate are scaled linearly together, the convergence accuracy is hardly affected. When training ImageNet, an initial learning rate of 0.1 with a batch_size of 256 is commonly used, so according to the actual model size and memory, you can set the learning rate to 0.1*k and the batch_size to 256*k.
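
The linear scaling rule above amounts to multiplying both values by the same factor k; a tiny worked example (function name illustrative):

def scaled_hyperparams(k, base_lr=0.1, base_batch_size=256):
    # Linear scaling rule: scale the learning rate and batch_size together.
    return base_lr * k, base_batch_size * k


print(scaled_hyperparams(4))  # (0.4, 1024): 4x the batch size, 4x the initial learning rate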

Choice of Weight_decay

Overfitting is a common term in machine learning. A simple understanding is that the model performs well on the training data but poorly on the test data. Convolutional neural networks also suffer from overfitting. To avoid it, many regularization methods have been proposed; among them, weight_decay is one of the most widely used. Weight_decay adds an L2 regularization term to the final loss function; with its help, the network weights tend toward smaller values, the parameters of the entire network tend toward 0, and the generalization performance of the model improves accordingly. In different deep learning frameworks the coefficient of this L2 regularization has different names; in Paddle it is called L2_decay, so that name is used below. The larger the coefficient, the more the model tends to underfit. For ImageNet training, this parameter is set to 1e-4 in most networks. In some small networks such as the MobileNet series, the value is set to 1e-5 ~ 4e-5 to avoid underfitting. The setting of this value is also related to the specific dataset: when the dataset is large, the network itself tends to underfit and the value can be reduced appropriately; when the dataset is small, the network tends to overfit and the value can be increased appropriately. The following table shows the accuracy of MobileNetV1_x0_25 with different l2_decay values on ImageNet-1k. Since MobileNetV1_x0_25 is a relatively small network, a large l2_decay makes it underfit, so for this network 3e-5 is a better choice than 1e-4.

Model L2_decay Train acc1/acc5 Test acc1/acc5
MobileNetV1_x0_25 1e-4 43.79%/67.61% 50.41%/74.70%
MobileNetV1_x0_25 3e-5 47.38%/70.83% 51.45%/75.45%
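
As a generic sketch of how the L2_decay coefficient enters the objective (not Paddle's exact implementation, which may fold in a factor of 1/2 or apply the decay inside the optimizer):

def loss_with_l2_decay(data_loss, weights, l2_decay=1e-4):
    # Total loss = task loss + l2_decay * sum of squared weights.
    penalty = sum(w * w for layer in weights for w in layer)
    return data_loss + l2_decay * penalty


print(loss_with_l2_decay(0.5, weights=[[0.3, -0.2], [1.0]]))  # 0.5 + 1e-4 * 1.13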

In addition, the setting of L2_decay is also related to whether other regularization is used during training. If the data augmentation during training is more complicated, which means that training becomes more difficult, L2_decay can be reduced appropriately. The following table shows the accuracy of ResNet50 with different l2_decay values on ImageNet-1k. It is easy to observe that once training becomes more difficult, using a smaller l2_decay helps to improve the accuracy of the model.

Model L2_decay Train acc1/acc5 Test acc1/acc5
ResNet50 1e-4 75.13%/90.42% 77.65%/93.79%
ResNet50 7e-5 75.56%/90.55% 78.04%/93.74%

In summary, l2_decay can be adjusted according to the specific task and model. Usually, simple tasks or larger models should use a larger l2_decay, while complex tasks or smaller models should use a smaller l2_decay.

Choice of Label_smoothing

Label_smoothing is a regularization method in deep learning. Its full name is Label Smoothing Regularization (LSR). In the traditional classification task, when calculating the loss function, the real one-hot label and the output of the neural network are combined in the cross-entropy formula; label smoothing turns the real one-hot label into a soft label, so that the neural network no longer learns from hard labels but from soft labels with probability values, where the probability at the position of the true category is the largest and the probabilities at the other positions are very small (the specific calculation can be found in paper [2]). In label smoothing, an epsilon parameter describes the degree of softening of the label: the larger the epsilon, the smaller the peak probability and the smoother the label; conversely, the label tends toward a hard label. When training on ImageNet-1k, this parameter is usually set to 0.1. In experiments training ResNet50, the accuracy with label_smoothing is higher than without it; the following table shows the performance of ResNet50_vd with and without label smoothing.

Model Use_label_smoothing Test acc1
ResNet50_vd 0 77.9%
ResNet50_vd 1 78.4%
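
A minimal sketch of the soft label produced by label smoothing, using the usual formula (1 - epsilon) * one_hot + epsilon / num_classes with the epsilon=0.1 setting mentioned above; the function name is illustrative.

def smooth_label(target, num_classes, epsilon=0.1):
    # Every class gets epsilon / num_classes; the true class additionally gets 1 - epsilon.
    soft = [epsilon / num_classes] * num_classes
    soft[target] += 1.0 - epsilon
    return soft


print(smooth_label(target=2, num_classes=5))  # [0.02, 0.02, 0.92, 0.02, 0.02]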

However, because label smoothing can be regarded as a form of regularization, on relatively small models the accuracy improvement is not obvious or the accuracy even decreases. The following table shows the accuracy of ResNet18 with and without label smoothing on ImageNet-1k; it can be clearly seen that after using label smoothing, the accuracy of ResNet18 decreases.

Model Use_label_smoothing Train acc1/acc5 Test acc1/acc5
ResNet18 0 69.81%/87.70% 70.98%/89.92%
ResNet18 1 68.00%/86.56% 70.81%/89.89%

In summary, using label_smoothing on larger models can effectively improve accuracy, while using it on smaller models may reduce accuracy, so before deciding whether to use label_smoothing you need to evaluate the size of the model and the difficulty of the task.

Change the Crop Area and Stretch Transformation Degree of the Images for Small Models

In the standard preprocessing of ImageNet-1k data, two values, scale and ratio, are defined in the random_crop function. They determine, respectively, the size of the image crop and the degree of stretching of the image. The default range of scale is 0.08-1 (lower_scale-upper_scale), and the default range of ratio is 3/4-4/3 (lower_ratio-upper_ratio). When training small networks, such data augmentation makes the network underfit, resulting in a decrease in accuracy. To improve the accuracy of the network, you can make the data augmentation weaker, that is, increase the crop area of the images or weaken the degree of stretching, by increasing the value of lower_scale or narrowing the gap between lower_ratio and upper_ratio. The following table lists the accuracy of training MobileNetV2_x0_25 with different lower_scale values. It can be seen that both the training accuracy and the validation accuracy improve after increasing the crop area of the images.

Model Scale Range Train_acc1/acc5 Test_acc1/acc5
MobileNetV2_x0_25 [0.08,1] 50.36%/72.98% 52.35%/75.65%
MobileNetV2_x0_25 [0.2,1] 54.39%/77.08% 53.18%/76.14%
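
A minimal sketch of how a scale/ratio-constrained crop box can be sampled; the retry loop and fallback are illustrative, not the exact PaddleClas RandCropImage operator.

import random


def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)):
    # `scale` bounds the crop area relative to the image; `ratio` bounds the
    # aspect-ratio stretch. Raising scale[0] (e.g. to 0.2) weakens the augmentation.
    area = height * width
    for _ in range(10):  # retry until the sampled box fits inside the image
        target_area = random.uniform(*scale) * area
        aspect = random.uniform(*ratio)
        crop_w = int(round((target_area * aspect) ** 0.5))
        crop_h = int(round((target_area / aspect) ** 0.5))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = random.randint(0, height - crop_h)
            left = random.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    return 0, 0, height, width  # fall back to the full image


print(sample_crop(256, 256))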

Use Data Augmentation to Improve Accuracy

In general, the size of the dataset is critical to performance, but the annotation of images is often expensive, so the number of annotated images is often scarce. In this case, data augmentation is particularly important. In the standard data augmentation for training on ImageNet-1k, two methods, random_crop and random_flip, are mainly used. However, in recent years more and more data augmentation methods have been proposed, such as cutout, mixup, cutmix, AutoAugment, etc. Experiments show that these methods can effectively improve the accuracy of the model. The following table lists the performance of ResNet50 with 8 different data augmentation methods. It can be seen that, compared to the baseline, all of them improve the accuracy of ResNet50, and cutmix is currently the most effective one. More data augmentation methods can be found in Data Augmentation.

Model Data Argument Test top-1
ResNet50 Baseline 77.31%
ResNet50 Auto-Augment 77.95%
ResNet50 Mixup 78.28%
ResNet50 Cutmix 78.39%
ResNet50 Cutout 78.01%
ResNet50 Gridmask 77.85%
ResNet50 Random-Augment 77.70%
ResNet50 Random-Erasing 77.91%
ResNet50 Hide-and-Seek 77.43%
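
As an illustration of mixup, one of the augmentations listed above, the sketch below blends each sample with a randomly chosen partner using a Beta-distributed weight; it is a generic sketch, not the PaddleClas MixupOperator.

import numpy as np


def mixup(images, onehot_labels, alpha=0.2):
    # `images` is N x H x W x C, `onehot_labels` is N x num_classes;
    # alpha is the Beta-distribution parameter (MixupOperator.alpha in the config tables).
    lam = np.random.beta(alpha, alpha)
    index = np.random.permutation(len(images))
    mixed_images = lam * images + (1.0 - lam) * images[index]
    mixed_labels = lam * onehot_labels + (1.0 - lam) * onehot_labels[index]
    return mixed_images, mixed_labels


images = np.random.rand(4, 224, 224, 3).astype("float32")
labels = np.eye(10, dtype="float32")[[0, 3, 5, 7]]
print(mixup(images, labels)[1].sum(axis=1))  # each mixed soft label still sums to 1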

Determine the Tuning Strategy by Train_acc and Test_acc

In the process of training the network, the training set accuracy and validation set accuracy of each epoch are usually printed. Generally speaking, it is a good state when the training accuracy is slightly higher than or equal to the validation accuracy. If you find that the training accuracy is much higher than the validation accuracy, overfitting is happening in your task, which needs more regularization, such as increasing L2_decay, using more data augmentation, or using label smoothing. If you find that the training accuracy is lower than the validation accuracy, underfitting is happening, and we recommend decreasing L2_decay, using less data augmentation, increasing the crop area of the images, weakening the stretching transformation, removing label_smoothing, etc.

Improve the Accuracy of Your Own Data Set with Existing Pre-trained Models

In the field of computer vision, it has become common to load pretrained models to train one's own tasks. Compared with training from random initialization, loading a pretrained model can often improve the accuracy on the target task. In general, the pretrained models widely used in industry are obtained from the ImageNet-1k dataset. The fc layer weight of such a pretrained model is a k*1000 matrix, where k is the number of neurons in the preceding layer; the fc layer weights do not need to be loaded because the tasks differ. In terms of learning rate, if your training dataset is particularly small (for example, fewer than 1,000 images), we recommend a smaller initial learning rate, such as 0.001 (batch_size 256, the same below), to avoid a large learning rate destroying the pretrained weights; if your training dataset is relatively large (more than 100,000 images), we recommend trying a larger initial learning rate, such as 0.01 or more.
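
The idea of not loading the fc weights can be sketched as filtering a plain name-to-tensor dictionary by prefix and shape; this is a generic sketch under those assumptions, not a specific Paddle API.

import numpy as np


def filter_pretrained_params(pretrained, model_shapes, skip_prefixes=("fc",)):
    # Keep only pretrained tensors whose name and shape match the new model;
    # the old k x 1000 classification head is skipped so the new task's
    # classifier starts from random initialization.
    kept = {}
    for name, value in pretrained.items():
        if name.startswith(skip_prefixes):
            continue
        if model_shapes.get(name) == value.shape:
            kept[name] = value
    return kept


pretrained = {"conv1.w": np.zeros((64, 3, 7, 7)), "fc.w": np.zeros((2048, 1000))}
model_shapes = {"conv1.w": (64, 3, 7, 7), "fc.w": (2048, 102)}
print(sorted(filter_pretrained_params(pretrained, model_shapes)))  # ['conv1.w']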

If you find this guide helpful, welcome to star our repo: https://github.com/PaddlePaddle/PaddleClas

Reference

[1] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.

ResNet and ResNet_vd series

Overview

The ResNet series model was proposed in 2015 and won the championship in the ILSVRC2015 competition with a top5 error rate of 3.57%. The network innovatively proposed the residual structure, and built the ResNet network by stacking multiple residual structures. Experiments show that using residual blocks can improve the convergence speed and accuracy effectively.

Joyce Xu of Stanford university calls ResNet one of three architectures that “really redefine the way we think about neural networks.” Due to the outstanding performance of ResNet, more and more scholars and engineers from academia and industry have improved its structure. The well-known ones include wide-resnet, resnet-vc, resnet-vd, Res2Net, etc. The number of parameters and FLOPs of resnet-vc and resnet-vd are almost the same as those of ResNet, so we hereby unified them into the ResNet series.

The models of the ResNet series released this time include 14 pretrained models, such as ResNet50, ResNet50_vd, ResNet50_vd_ssld, and ResNet200_vd. At the training level, ResNet adopted the standard ImageNet training process, while the improved models adopted more training strategies, such as cosine decay for the learning rate, label smoothing as regularization, mixup added to the data preprocessing, and an increase of the total number of iterations from 120 epochs to 200 epochs.

Among them, ResNet50_vd_v2 and ResNet50_vd_ssld adopted knowledge distillation, which further improved the accuracy of the model while keeping the structure unchanged. Specifically, the teacher model of ResNet50_vd_v2 is ResNet152_vd (top1 accuracy 80.59%) with ImageNet-1k as the training set, and the teacher model of ResNet50_vd_ssld is ResNeXt101_32x16d_wsl (top1 accuracy 84.2%), with a training set that combines 4 million images mined from ImageNet-22k with ImageNet-1k. The specific methods of knowledge distillation are being continuously updated.

The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.

_images/t4.fp32.bs4.ResNet.flops.png

_images/t4.fp32.bs4.ResNet.params.png

_images/t4.fp32.bs4.ResNet.png

_images/t4.fp16.bs4.ResNet.png

As can be seen from the above curves, the higher the number of layers, the higher the accuracy, but the corresponding number of parameters, calculation and latency will increase. ResNet50_vd_ssld further improves the accuracy of top-1 of the ImageNet-1k validation set by using stronger teachers and more data, reaching 82.39%, refreshing the accuracy of ResNet50 series models.

Accuracy, FLOPS and Parameters

Models Top1 Top5 Reference top1 Reference top5 FLOPS(G) Parameters(M)
ResNet18 0.710 0.899 0.696 0.891 3.660 11.690
ResNet18_vd 0.723 0.908 4.140 11.710
ResNet34 0.746 0.921 0.732 0.913 7.360 21.800
ResNet34_vd 0.760 0.930 7.390 21.820
ResNet50 0.765 0.930 0.760 0.930 8.190 25.560
ResNet50_vc 0.784 0.940 8.670 25.580
ResNet50_vd 0.791 0.944 0.792 0.946 8.670 25.580
ResNet50_vd_v2 0.798 0.949 8.670 25.580
ResNet101 0.776 0.936 0.776 0.938 15.520 44.550
ResNet101_vd 0.802 0.950 16.100 44.570
ResNet152 0.783 0.940 0.778 0.938 23.050 60.190
ResNet152_vd 0.806 0.953 23.530 60.210
ResNet200_vd 0.809 0.953 30.530 74.740
ResNet50_vd_ssld 0.824 0.961 8.670 25.580
ResNet50_vd_ssld_v2 0.830 0.964 8.670 25.580
Fix_ResNet50_vd_ssld_v2 0.840 0.970 17.696 25.580
ResNet101_vd_ssld 0.837 0.967 16.100 44.570
  • Note: ResNet50_vd_ssld_v2 is obtained by adding AutoAugment to the training process on the basis of the ResNet50_vd_ssld training strategy. Fix_ResNet50_vd_ssld_v2 freezes all parameters of ResNet50_vd_ssld_v2 except the FC layer and fine-tunes on the ImageNet1k dataset at a resolution of 320x320.

Inference speed based on V100 GPU

Models Crop Size Resize Short Size FP32 Batch Size=1(ms)
ResNet18 224 256 1.499
ResNet18_vd 224 256 1.603
ResNet34 224 256 2.272
ResNet34_vd 224 256 2.343
ResNet50 224 256 2.939
ResNet50_vc 224 256 3.041
ResNet50_vd 224 256 3.165
ResNet50_vd_v2 224 256 3.165
ResNet101 224 256 5.314
ResNet101_vd 224 256 5.252
ResNet152 224 256 7.205
ResNet152_vd 224 256 7.200
ResNet200_vd 224 256 8.885
ResNet50_vd_ssld 224 256 3.165
ResNet101_vd_ssld 224 256 5.252

Inference speed based on T4 GPU

Models Crop Size Resize Short Size FP16 Batch Size=1(ms) FP16 Batch Size=4(ms) FP16 Batch Size=8(ms) FP32 Batch Size=1(ms) FP32 Batch Size=4(ms) FP32 Batch Size=8(ms)
ResNet18 224 256 1.3568 2.5225 3.61904 1.45606 3.56305 6.28798
ResNet18_vd 224 256 1.39593 2.69063 3.88267 1.54557 3.85363 6.88121
ResNet34 224 256 2.23092 4.10205 5.54904 2.34957 5.89821 10.73451
ResNet34_vd 224 256 2.23992 4.22246 5.79534 2.43427 6.22257 11.44906
ResNet50 224 256 2.63824 4.63802 7.02444 3.47712 7.84421 13.90633
ResNet50_vc 224 256 2.67064 4.72372 7.17204 3.52346 8.10725 14.45577
ResNet50_vd 224 256 2.65164 4.84109 7.46225 3.53131 8.09057 14.45965
ResNet50_vd_v2 224 256 2.65164 4.84109 7.46225 3.53131 8.09057 14.45965
ResNet101 224 256 5.04037 7.73673 10.8936 6.07125 13.40573 24.3597
ResNet101_vd 224 256 5.05972 7.83685 11.34235 6.11704 13.76222 25.11071
ResNet152 224 256 7.28665 10.62001 14.90317 8.50198 19.17073 35.78384
ResNet152_vd 224 256 7.29127 10.86137 15.32444 8.54376 19.52157 36.64445
ResNet200_vd 224 256 9.36026 13.5474 19.0725 10.80619 25.01731 48.81399
ResNet50_vd_ssld 224 256 2.65164 4.84109 7.46225 3.53131 8.09057 14.45965
ResNet50_vd_ssld_v2 224 256 2.65164 4.84109 7.46225 3.53131 8.09057 14.45965
Fix_ResNet50_vd_ssld_v2 320 320 3.42818 7.51534 13.19370 5.07696 14.64218 27.01453
ResNet101_vd_ssld 224 256 5.05972 7.83685 11.34235 6.11704 13.76222 25.11071

Mobile and Embedded Vision Applications Network series

Overview

MobileNetV1 is a network launched by Google in 2017 for use on mobile or embedded devices. The network replaces the traditional convolution operation with depthwise separable convolution, that is, the combination of depthwise convolution and pointwise convolution. Compared with the traditional convolution operation, this combination can greatly reduce the number of parameters and computation. At the same time, MobileNetV1 can also be used for object detection, image segmentation and other visual tasks.

MobileNetV2 is a lightweight network proposed by Google after MobileNetV1. Compared with MobileNetV1, MobileNetV2 proposed linear bottlenecks and the inverted residual block as basic network structures, and builds the MobileNetV2 architecture by stacking these basic modules. In the end, higher classification accuracy was achieved with only half the FLOPS of MobileNetV1.

The ShuffleNet series network is the lightweight network structure proposed by MEGVII. So far, there are two typical structures in this series network, namely, ShuffleNetV1 and ShuffleNetV2. A Channel Shuffle operation in ShuffleNet can exchange information between groups and perform end-to-end training. In the paper of ShuffleNetV2, the author proposes four criteria for designing lightweight networks, and designs the ShuffleNetV2 network according to the four criteria and the shortcomings of ShuffleNetV1.

MobileNetV3 is a new and lightweight network based on NAS proposed by Google in 2019. In order to further improve the effect, the activation functions of relu and sigmoid were replaced with hard_swish and hard_sigmoid activation functions, and some improved strategies were introduced to reduce the amount of network computing.

_images/mobile_arm_top1.png

_images/mobile_arm_storage.png

_images/t4.fp32.bs4.mobile_trt.flops.png

_images/t4.fp32.bs4.mobile_trt.params.png

Currently there are 32 pretrained models of the mobile series open-sourced by PaddleClas, and their indicators are shown in the figure below. As you can see from the picture, newer lightweight models tend to perform better, and MobileNetV3 represents the latest lightweight neural network architecture. In MobileNetV3, the author used a 1x1 convolution after global-avg-pooling in order to obtain higher accuracy; this operation significantly increases the number of parameters but has little impact on the amount of computation, so if the model is evaluated from a storage perspective, MobileNetV3 does not have much of an advantage, but because of its smaller computation it has a faster inference speed. In addition, the SSLD distillation models in our model library perform excellently, refreshing the accuracy of current lightweight models from various perspectives. Due to the complex structure and many branches of the MobileNetV3 model, which is not GPU friendly, its GPU inference speed is not as good as that of MobileNetV1.

Accuracy, FLOPS and Parameters

Models Top1 Top5 Reference top1 Reference top5 FLOPS(G) Parameters(M)
MobileNetV1_x0_25 0.514 0.755 0.506 0.070 0.460
MobileNetV1_x0_5 0.635 0.847 0.637 0.280 1.310
MobileNetV1_x0_75 0.688 0.882 0.684 0.630 2.550
MobileNetV1 0.710 0.897 0.706 1.110 4.190
MobileNetV1_ssld 0.779 0.939 1.110 4.190
MobileNetV2_x0_25 0.532 0.765 0.050 1.500
MobileNetV2_x0_5 0.650 0.857 0.654 0.864 0.170 1.930
MobileNetV2_x0_75 0.698 0.890 0.698 0.896 0.350 2.580
MobileNetV2 0.722 0.907 0.718 0.910 0.600 3.440
MobileNetV2_x1_5 0.741 0.917 1.320 6.760
MobileNetV2_x2_0 0.752 0.926 2.320 11.130
MobileNetV2_ssld 0.7674 0.9339 0.600 3.440
MobileNetV3_large_x1_25 0.764 0.930 0.766 0.714 7.440
MobileNetV3_large_x1_0 0.753 0.923 0.752 0.450 5.470
MobileNetV3_large_x0_75 0.731 0.911 0.733 0.296 3.910
MobileNetV3_large_x0_5 0.692 0.885 0.688 0.138 2.670
MobileNetV3_large_x0_35 0.643 0.855 0.642 0.077 2.100
MobileNetV3_small_x1_25 0.707 0.895 0.704 0.195 3.620
MobileNetV3_small_x1_0 0.682 0.881 0.675 0.123 2.940
MobileNetV3_small_x0_75 0.660 0.863 0.654 0.088 2.370
MobileNetV3_small_x0_5 0.592 0.815 0.580 0.043 1.900
MobileNetV3_small_x0_35 0.530 0.764 0.498 0.026 1.660
MobileNetV3_large_x1_0_ssld 0.790 0.945 0.450 5.470
MobileNetV3_large_x1_0_ssld_int8 0.761
MobileNetV3_small_x1_0_ssld 0.713 0.901 0.123 2.940
ShuffleNetV2 0.688 0.885 0.694 0.280 2.260
ShuffleNetV2_x0_25 0.499 0.738 0.030 0.600
ShuffleNetV2_x0_33 0.537 0.771 0.040 0.640
ShuffleNetV2_x0_5 0.603 0.823 0.603 0.080 1.360
ShuffleNetV2_x1_5 0.716 0.902 0.726 0.580 3.470
ShuffleNetV2_x2_0 0.732 0.912 0.749 1.120 7.320
ShuffleNetV2_swish 0.700 0.892 0.290 2.260

Inference speed and storage size based on SD855

Models Batch Size=1(ms) Storage Size(M)
MobileNetV1_x0_25 3.220 1.900
MobileNetV1_x0_5 9.580 5.200
MobileNetV1_x0_75 19.436 10.000
MobileNetV1 32.523 16.000
MobileNetV1_ssld 32.523 16.000
MobileNetV2_x0_25 3.799 6.100
MobileNetV2_x0_5 8.702 7.800
MobileNetV2_x0_75 15.531 10.000
MobileNetV2 23.318 14.000
MobileNetV2_x1_5 45.624 26.000
MobileNetV2_x2_0 74.292 43.000
MobileNetV2_ssld 23.318 14.000
MobileNetV3_large_x1_25 28.218 29.000
MobileNetV3_large_x1_0 19.308 21.000
MobileNetV3_large_x0_75 13.565 16.000
MobileNetV3_large_x0_5 7.493 11.000
MobileNetV3_large_x0_35 5.137 8.600
MobileNetV3_small_x1_25 9.275 14.000
MobileNetV3_small_x1_0 6.546 12.000
MobileNetV3_small_x0_75 5.284 9.600
MobileNetV3_small_x0_5 3.352 7.800
MobileNetV3_small_x0_35 2.635 6.900
MobileNetV3_large_x1_0_ssld 19.308 21.000
MobileNetV3_large_x1_0_ssld_int8 14.395 10.000
MobileNetV3_small_x1_0_ssld 6.546 12.000
ShuffleNetV2 10.941 9.000
ShuffleNetV2_x0_25 2.329 2.700
ShuffleNetV2_x0_33 2.643 2.800
ShuffleNetV2_x0_5 4.261 5.600
ShuffleNetV2_x1_5 19.352 14.000
ShuffleNetV2_x2_0 34.770 28.000
ShuffleNetV2_swish 16.023 9.100

Inference speed based on T4 GPU

Models FP16 Batch Size=1(ms) FP16 Batch Size=4(ms) FP16 Batch Size=8(ms) FP32 Batch Size=1(ms) FP32 Batch Size=4(ms) FP32 Batch Size=8(ms)
MobileNetV1_x0_25 0.68422 1.13021 1.72095 0.67274 1.226 1.84096
MobileNetV1_x0_5 0.69326 1.09027 1.84746 0.69947 1.43045 2.39353
MobileNetV1_x0_75 0.6793 1.29524 2.15495 0.79844 1.86205 3.064
MobileNetV1 0.71942 1.45018 2.47953 0.91164 2.26871 3.90797
MobileNetV1_ssld 0.71942 1.45018 2.47953 0.91164 2.26871 3.90797
MobileNetV2_x0_25 2.85399 3.62405 4.29952 2.81989 3.52695 4.2432
MobileNetV2_x0_5 2.84258 3.1511 4.10267 2.80264 3.65284 4.31737
MobileNetV2_x0_75 2.82183 3.27622 4.98161 2.86538 3.55198 5.10678
MobileNetV2 2.78603 3.71982 6.27879 2.62398 3.54429 6.41178
MobileNetV2_x1_5 2.81852 4.87434 8.97934 2.79398 5.30149 9.30899
MobileNetV2_x2_0 3.65197 6.32329 11.644 3.29788 7.08644 12.45375
MobileNetV2_ssld 2.78603 3.71982 6.27879 2.62398 3.54429 6.41178
MobileNetV3_large_x1_25 2.34387 3.16103 4.79742 2.35117 3.44903 5.45658
MobileNetV3_large_x1_0 2.20149 3.08423 4.07779 2.04296 2.9322 4.53184
MobileNetV3_large_x0_75 2.1058 2.61426 3.61021 2.0006 2.56987 3.78005
MobileNetV3_large_x0_5 2.06934 2.77341 3.35313 2.11199 2.88172 3.19029
MobileNetV3_large_x0_35 2.14965 2.7868 3.36145 1.9041 2.62951 3.26036
MobileNetV3_small_x1_25 2.06817 2.90193 3.5245 2.02916 2.91866 3.34528
MobileNetV3_small_x1_0 1.73933 2.59478 3.40276 1.74527 2.63565 3.28124
MobileNetV3_small_x0_75 1.80617 2.64646 3.24513 1.93697 2.64285 3.32797
MobileNetV3_small_x0_5 1.95001 2.74014 3.39485 1.88406 2.99601 3.3908
MobileNetV3_small_x0_35 2.10683 2.94267 3.44254 1.94427 2.94116 3.41082
MobileNetV3_large_x1_0_ssld 2.20149 3.08423 4.07779 2.04296 2.9322 4.53184
MobileNetV3_small_x1_0_ssld 1.73933 2.59478 3.40276 1.74527 2.63565 3.28124
ShuffleNetV2 1.95064 2.15928 2.97169 1.89436 2.26339 3.17615
ShuffleNetV2_x0_25 1.43242 2.38172 2.96768 1.48698 2.29085 2.90284
ShuffleNetV2_x0_33 1.69008 2.65706 2.97373 1.75526 2.85557 3.09688
ShuffleNetV2_x0_5 1.48073 2.28174 2.85436 1.59055 2.18708 3.09141
ShuffleNetV2_x1_5 1.51054 2.4565 3.41738 1.45389 2.5203 3.99872
ShuffleNetV2_x2_0 1.95616 2.44751 4.19173 2.15654 3.18247 5.46893
ShuffleNetV2_swish 2.50213 2.92881 3.474 2.5129 2.97422 3.69357

SEResNeXt and Res2Net series

Overview

ResNeXt, one of the typical variants of ResNet, was presented at the CVPR conference in 2017. Prior to this, the methods to improve the model accuracy mainly focused on deepening or widening the network, which increased the number of parameters and calculation, and slowed down the inference speed accordingly. The concept of cardinality was proposed in ResNeXt structure. The author found that increasing the number of channel groups was more effective than increasing the depth and width through experiments. It can improve the accuracy without increasing the parameter complexity and reduce the number of parameters at the same time, so it is a more successful variant of ResNet.

SENet is the winner of the 2017 ImageNet classification competition. It proposes a new SE structure that can be migrated to any other network. It learns per-channel scales to enhance the important features and weaken the unimportant ones, so that the extracted features are more targeted.

Res2Net is a brand-new improvement of ResNet proposed in 2019. The solution can be easily integrated with other excellent modules. Without increasing the amount of calculation, the performance on ImageNet, CIFAR-100 and other data sets exceeds ResNet. Res2Net, with its simple structure and superior performance, further explores the multi-scale representation capability of CNN at a more fine-grained level. Res2Net reveals a new dimension to improve model accuracy, called scale, which is an essential and more effective factor in addition to the existing dimensions of depth, width, and cardinality. The network also performs well in other visual tasks such as object detection and image segmentation.

The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.

_images/t4.fp32.bs4.SeResNeXt.flops.png

_images/t4.fp32.bs4.SeResNeXt.params.png

_images/t4.fp32.bs4.SeResNeXt.png

_images/t4.fp16.bs4.SeResNeXt.png

At present, there are a total of 24 pretrained models of the three categories open sourced by PaddleClas, and the indicators are shown in the figure. It can be seen from the diagram that under the same Flops and Params, the improved model tends to have higher accuracy, but the inference speed is often inferior to the ResNet series. On the other hand, Res2Net performed better. Compared with group operation in ResNeXt and SE structure operation in SEResNet, Res2Net tended to have better accuracy in the same Flops, Params and inference speed.

Accuracy, FLOPS and Parameters

Models Top1 Top5 Reference top1 Reference top5 FLOPS(G) Parameters(M)
Res2Net50_26w_4s 0.793 0.946 0.780 0.936 8.520 25.700
Res2Net50_vd_26w_4s 0.798 0.949 8.370 25.060
Res2Net50_14w_8s 0.795 0.947 0.781 0.939 9.010 25.720
Res2Net101_vd_26w_4s 0.806 0.952 16.670 45.220
Res2Net200_vd_26w_4s 0.812 0.957 31.490 76.210
ResNeXt50_32x4d 0.778 0.938 0.778 8.020 23.640
ResNeXt50_vd_32x4d 0.796 0.946 8.500 23.660
ResNeXt50_64x4d 0.784 0.941 15.060 42.360
ResNeXt50_vd_64x4d 0.801 0.949 15.540 42.380
ResNeXt101_32x4d 0.787 0.942 0.788 15.010 41.540
ResNeXt101_vd_32x4d 0.803 0.951 15.490 41.560
ResNeXt101_64x4d 0.784 0.945 0.796 29.050 78.120
ResNeXt101_vd_64x4d 0.808 0.952 29.530 78.140
ResNeXt152_32x4d 0.790 0.943 22.010 56.280
ResNeXt152_vd_32x4d 0.807 0.952 22.490 56.300
ResNeXt152_64x4d 0.795 0.947 43.030 107.570
ResNeXt152_vd_64x4d 0.811 0.953 43.520 107.590
SE_ResNet18_vd 0.733 0.914 4.140 11.800
SE_ResNet34_vd 0.765 0.932 7.840 21.980
SE_ResNet50_vd 0.795 0.948 8.670 28.090
SE_ResNeXt50_32x4d 0.784 0.940 0.789 0.945 8.020 26.160
SE_ResNeXt50_vd_32x4d 0.802 0.949 10.760 26.280
SE_ResNeXt101_32x4d 0.791 0.942 0.793 0.950 15.020 46.280
SENet154_vd 0.814 0.955 45.830 114.290

Inference speed based on V100 GPU

Models Crop Size Resize Short Size FP32 Batch Size=1(ms)
Res2Net50_26w_4s 224 256 4.148
Res2Net50_vd_26w_4s 224 256 4.172
Res2Net50_14w_8s 224 256 5.113
Res2Net101_vd_26w_4s 224 256 7.327
Res2Net200_vd_26w_4s 224 256 12.806
ResNeXt50_32x4d 224 256 10.964
ResNeXt50_vd_32x4d 224 256 7.566
ResNeXt50_64x4d 224 256 13.905
ResNeXt50_vd_64x4d 224 256 14.321
ResNeXt101_32x4d 224 256 14.915
ResNeXt101_vd_32x4d 224 256 14.885
ResNeXt101_64x4d 224 256 28.716
ResNeXt101_vd_64x4d 224 256 28.398
ResNeXt152_32x4d 224 256 22.996
ResNeXt152_vd_32x4d 224 256 22.729
ResNeXt152_64x4d 224 256 46.705
ResNeXt152_vd_64x4d 224 256 46.395
SE_ResNet18_vd 224 256 1.694
SE_ResNet34_vd 224 256 2.786
SE_ResNet50_vd 224 256 3.749
SE_ResNeXt50_32x4d 224 256 8.924
SE_ResNeXt50_vd_32x4d 224 256 9.011
SE_ResNeXt101_32x4d 224 256 19.204
SENet154_vd 224 256 50.406

Inference speed based on T4 GPU

Models Crop Size Resize Short Size FP16 Batch Size=1(ms) FP16 Batch Size=4(ms) FP16 Batch Size=8(ms) FP32 Batch Size=1(ms) FP32 Batch Size=4(ms) FP32 Batch Size=8(ms)
Res2Net50_26w_4s 224 256 3.56067 6.61827 11.41566 4.47188 9.65722 17.54535
Res2Net50_vd_26w_4s 224 256 3.69221 6.94419 11.92441 4.52712 9.93247 18.16928
Res2Net50_14w_8s 224 256 4.45745 7.69847 12.30935 5.4026 10.60273 18.01234
Res2Net101_vd_26w_4s 224 256 6.53122 10.81895 18.94395 8.08729 17.31208 31.95762
Res2Net200_vd_26w_4s 224 256 11.66671 18.93953 33.19188 14.67806 32.35032 63.65899
ResNeXt50_32x4d 224 256 7.61087 8.88918 12.99674 7.56327 10.6134 18.46915
ResNeXt50_vd_32x4d 224 256 7.69065 8.94014 13.4088 7.62044 11.03385 19.15339
ResNeXt50_64x4d 224 256 13.78688 15.84655 21.79537 13.80962 18.4712 33.49843
ResNeXt50_vd_64x4d 224 256 13.79538 15.22201 22.27045 13.94449 18.88759 34.28889
ResNeXt101_32x4d 224 256 16.59777 17.93153 21.36541 16.21503 19.96568 33.76831
ResNeXt101_vd_32x4d 224 256 16.36909 17.45681 22.10216 16.28103 20.25611 34.37152
ResNeXt101_64x4d 224 256 30.12355 32.46823 38.41901 30.4788 36.29801 68.85559
ResNeXt101_vd_64x4d 224 256 30.34022 32.27869 38.72523 30.40456 36.77324 69.66021
ResNeXt152_32x4d 224 256 25.26417 26.57001 30.67834 24.86299 29.36764 52.09426
ResNeXt152_vd_32x4d 224 256 25.11196 26.70515 31.72636 25.03258 30.08987 52.64429
ResNeXt152_64x4d 224 256 46.58293 48.34563 56.97961 46.7564 56.34108 106.11736
ResNeXt152_vd_64x4d 224 256 47.68447 48.91406 57.29329 47.18638 57.16257 107.26288
SE_ResNet18_vd 224 256 1.61823 3.1391 4.60282 1.7691 4.19877 7.5331
SE_ResNet34_vd 224 256 2.67518 5.04694 7.18946 2.88559 7.03291 12.73502
SE_ResNet50_vd 224 256 3.65394 7.568 12.52793 4.28393 10.38846 18.33154
SE_ResNeXt50_32x4d 224 256 9.06957 11.37898 18.86282 8.74121 13.563 23.01954
SE_ResNeXt50_vd_32x4d 224 256 9.25016 11.85045 25.57004 9.17134 14.76192 19.914
SE_ResNeXt101_32x4d 224 256 19.34455 20.6104 32.20432 18.82604 25.31814 41.97758
SENet154_vd 224 256 49.85733 54.37267 74.70447 53.79794 66.31684 121.59885

Inception series

Overview

GoogLeNet is a new neural network structure designed by Google in 2014, which, together with the VGG network, became the twin champions of the ImageNet challenge that year. GoogLeNet introduced the Inception structure for the first time and stacked it throughout the network, reaching 22 layers, which was also the first time a convolutional network exceeded 20 layers. Since 1x1 convolution is used in the Inception structure to reduce the channel dimension, and global pooling replaces the traditional practice of processing features with multiple fc layers, the final GoogLeNet network has far fewer FLOPS and parameters than the VGG network, which made it a highlight of neural network design at that time.

Xception is another improvement to InceptionV3 that Google proposed after Inception. In Xception, the author used the depthwise separable convolution to replace the traditional convolution operation, which greatly saved the network FLOPS and the number of parameters, but improved the accuracy. In DeeplabV3+, the author further improved the Xception and increased the number of Xception layers, and designed the network of Xception65 and Xception71.

InceptionV4 is a new neural network designed by Google in 2016, when residual structures were all the rage, but the authors believed that high performance could be achieved using only the Inception structure. InceptionV4 uses more Inception structures to achieve even greater precision on ImageNet-1k.

The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.

_images/t4.fp32.bs4.Inception.flops.png

_images/t4.fp32.bs4.Inception.params.png

_images/t4.fp32.bs4.Inception.png

_images/t4.fp16.bs4.Inception.png

The figure above reflects the relationship between the accuracy of Xception series and InceptionV4 and other indicators. Among them, Xception_deeplab is consistent with the structure of the paper, and Xception is an improved model developed by PaddleClas, which improves the accuracy by about 0.6% when the inference speed is basically unchanged. Details of the improved model are being updated, so stay tuned.

Accuracy, FLOPS and Parameters

Models Top1 Top5 Reference top1 Reference top5 FLOPS(G) Parameters(M)
GoogLeNet 0.707 0.897 0.698 2.880 8.460
Xception41 0.793 0.945 0.790 0.945 16.740 22.690
Xception41_deeplab 0.796 0.944 18.160 26.730
Xception65 0.810 0.955 25.950 35.480
Xception65_deeplab 0.803 0.945 27.370 39.520
Xception71 0.811 0.955 31.770 37.280
InceptionV4 0.808 0.953 0.800 0.950 24.570 42.680

Inference speed based on V100 GPU

Models Crop Size Resize Short Size FP32 Batch Size=1(ms)
GoogLeNet 224 256 1.807
Xception41 299 320 3.972
Xception41_deeplab 299 320 4.408
Xception65 299 320 6.174
Xception65_deeplab 299 320 6.464
Xception71 299 320 6.782
InceptionV4 299 320 11.141

Inference speed based on T4 GPU

Models Crop Size Resize Short Size FP16 Batch Size=1(ms) FP16 Batch Size=4(ms) FP16 Batch Size=8(ms) FP32 Batch Size=1(ms) FP32 Batch Size=4(ms) FP32 Batch Size=8(ms)
GoogLeNet 299 320 1.75451 3.39931 4.71909 1.88038 4.48882 6.94035
Xception41 299 320 2.91192 7.86878 15.53685 4.96939 17.01361 32.67831
Xception41_deeplab 299 320 2.85934 7.2075 14.01406 5.33541 17.55938 33.76232
Xception65 299 320 4.30126 11.58371 23.22213 7.26158 25.88778 53.45426
Xception65_deeplab 299 320 4.06803 9.72694 19.477 7.60208 26.03699 54.74724
Xception71 299 320 4.80889 13.5624 27.18822 8.72457 31.55549 69.31018
InceptionV4 299 320 9.50821 13.72104 20.27447 12.99342 25.23416 43.56121

HRNet series

Overview

HRNet is a brand new neural network proposed by Microsoft research Asia in 2019. Different from the previous convolutional neural network, this network can still maintain high resolution in the deep layer of the network, so the heat map of the key points predicted is more accurate, and it is also more accurate in space. In addition, the network performs particularly well in other visual tasks sensitive to resolution, such as detection and segmentation.

The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.

_images/t4.fp32.bs4.HRNet.flops.png

_images/t4.fp32.bs4.HRNet.params.png

_images/t4.fp32.bs4.HRNet.png

_images/t4.fp16.bs4.HRNet.png

At present, PaddleClas has open-sourced 7 pretrained models of this series, and their indicators are shown in the figure. Among them, the slightly abnormal accuracy of HRNet_W48_C may be due to fluctuations in training.

Accuracy, FLOPS and Parameters

| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|---|---|---|---|---|---|---|
| HRNet_W18_C | 0.769 | 0.934 | 0.768 | 0.934 | 4.140 | 21.290 |
| HRNet_W30_C | 0.780 | 0.940 | 0.782 | 0.942 | 16.230 | 37.710 |
| HRNet_W32_C | 0.783 | 0.942 | 0.785 | 0.942 | 17.860 | 41.230 |
| HRNet_W40_C | 0.788 | 0.945 | 0.789 | 0.945 | 25.410 | 57.550 |
| HRNet_W44_C | 0.790 | 0.945 | 0.789 | 0.944 | 29.790 | 67.060 |
| HRNet_W48_C | 0.790 | 0.944 | 0.793 | 0.945 | 34.580 | 77.470 |
| HRNet_W64_C | 0.793 | 0.946 | 0.795 | 0.946 | 57.830 | 128.060 |

Inference speed based on V100 GPU

| Models | Crop Size | Resize Short Size | FP32 Batch Size=1 (ms) |
|---|---|---|---|
| HRNet_W18_C | 224 | 256 | 7.368 |
| HRNet_W30_C | 224 | 256 | 9.402 |
| HRNet_W32_C | 224 | 256 | 9.467 |
| HRNet_W40_C | 224 | 256 | 10.739 |
| HRNet_W44_C | 224 | 256 | 11.497 |
| HRNet_W48_C | 224 | 256 | 12.165 |
| HRNet_W64_C | 224 | 256 | 15.003 |

Inference speed based on T4 GPU

| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|---|---|---|---|---|---|---|---|---|
| HRNet_W18_C | 224 | 256 | 6.79093 | 11.50986 | 17.67244 | 7.40636 | 13.29752 | 23.33445 |
| HRNet_W30_C | 224 | 256 | 8.98077 | 14.08082 | 21.23527 | 9.57594 | 17.35485 | 32.6933 |
| HRNet_W32_C | 224 | 256 | 8.82415 | 14.21462 | 21.19804 | 9.49807 | 17.72921 | 32.96305 |
| HRNet_W40_C | 224 | 256 | 11.4229 | 19.1595 | 30.47984 | 12.12202 | 25.68184 | 48.90623 |
| HRNet_W44_C | 224 | 256 | 12.25778 | 22.75456 | 32.61275 | 13.19858 | 32.25202 | 59.09871 |
| HRNet_W48_C | 224 | 256 | 12.65015 | 23.12886 | 33.37859 | 13.70761 | 34.43572 | 63.01219 |
| HRNet_W64_C | 224 | 256 | 15.10428 | 27.68901 | 40.4198 | 17.57527 | 47.9533 | 97.11228 |

DPN and DenseNet series

Overview

DenseNet is a network structure proposed in 2017, which won the best paper award at CVPR. The network designed a new cross-layer connected block called dense-block. Compared with the bottleneck in ResNet, the dense-block uses a more aggressive dense connection scheme: all layers are connected to each other, and each layer takes all preceding layers as additional input. DenseNet stacks dense-blocks into a densely connected network. The dense connections make backpropagation easier, so the network is easier to train and converge.

The full name of DPN is Dual Path Networks, a network that combines DenseNet and ResNeXt. It shows that DenseNet can extract new features from previous levels, while ResNeXt essentially reuses already extracted features. The authors further found that ResNeXt reuses features at a high rate but with low redundancy, while DenseNet creates new features but with high redundancy. Combining the advantages of the two structures, the authors designed the DPN network. In the end, DPN achieved better results than ResNeXt and DenseNet under the same FLOPS and parameters.

The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.

_images/t4.fp32.bs4.DPN.flops.png

_images/t4.fp32.bs4.DPN.params.png

_images/t4.fp32.bs4.DPN.png

_images/t4.fp16.bs4.DPN.png

PaddleClas currently open-sources 10 pretrained models of these two series; their indicators are shown in the figure above. It is easy to observe that, under the same FLOPS and parameters, DPN is more accurate than DenseNet. However, because DPN has more branches, its inference speed is slower than DenseNet. Since DenseNet264 is the deepest network in the DenseNet series, it has the most parameters, while DenseNet161 is the widest, so it has the largest FLOPS and the highest accuracy in this series. From the perspective of inference speed, DenseNet161, with its large FLOPS and high accuracy, is also faster than DenseNet264, so it has a clear advantage over DenseNet264.

For DPN series networks, the larger the model’s FLOPs and parameters, the higher the model’s accuracy. Among them, since the width of DPN107 is the largest, it has the largest number of parameters and FLOPs in this series of networks.

Accuracy, FLOPS and Parameters

| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|---|---|---|---|---|---|---|
| DenseNet121 | 0.757 | 0.926 | 0.750 | | 5.690 | 7.980 |
| DenseNet161 | 0.786 | 0.941 | 0.778 | | 15.490 | 28.680 |
| DenseNet169 | 0.768 | 0.933 | 0.764 | | 6.740 | 14.150 |
| DenseNet201 | 0.776 | 0.937 | 0.775 | | 8.610 | 20.010 |
| DenseNet264 | 0.780 | 0.939 | 0.779 | | 11.540 | 33.370 |
| DPN68 | 0.768 | 0.934 | 0.764 | 0.931 | 4.030 | 10.780 |
| DPN92 | 0.799 | 0.948 | 0.793 | 0.946 | 12.540 | 36.290 |
| DPN98 | 0.806 | 0.951 | 0.799 | 0.949 | 22.220 | 58.460 |
| DPN107 | 0.809 | 0.953 | 0.802 | 0.951 | 35.060 | 82.970 |
| DPN131 | 0.807 | 0.951 | 0.801 | 0.949 | 30.510 | 75.360 |

Inference speed based on V100 GPU

| Models | Crop Size | Resize Short Size | FP32 Batch Size=1 (ms) |
|---|---|---|---|
| DenseNet121 | 224 | 256 | 4.371 |
| DenseNet161 | 224 | 256 | 8.863 |
| DenseNet169 | 224 | 256 | 6.391 |
| DenseNet201 | 224 | 256 | 8.173 |
| DenseNet264 | 224 | 256 | 11.942 |
| DPN68 | 224 | 256 | 11.805 |
| DPN92 | 224 | 256 | 17.840 |
| DPN98 | 224 | 256 | 21.057 |
| DPN107 | 224 | 256 | 28.685 |
| DPN131 | 224 | 256 | 28.083 |

Inference speed based on T4 GPU

| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|---|---|---|---|---|---|---|---|---|
| DenseNet121 | 224 | 256 | 4.16436 | 7.2126 | 10.50221 | 4.40447 | 9.32623 | 15.25175 |
| DenseNet161 | 224 | 256 | 9.27249 | 14.25326 | 20.19849 | 10.39152 | 22.15555 | 35.78443 |
| DenseNet169 | 224 | 256 | 6.11395 | 10.28747 | 13.68717 | 6.43598 | 12.98832 | 20.41964 |
| DenseNet201 | 224 | 256 | 7.9617 | 13.4171 | 17.41949 | 8.20652 | 17.45838 | 27.06309 |
| DenseNet264 | 224 | 256 | 11.70074 | 19.69375 | 24.79545 | 12.14722 | 26.27707 | 40.01905 |
| DPN68 | 224 | 256 | 11.7827 | 13.12652 | 16.19213 | 11.64915 | 12.82807 | 18.57113 |
| DPN92 | 224 | 256 | 18.56026 | 20.35983 | 29.89544 | 18.15746 | 23.87545 | 38.68821 |
| DPN98 | 224 | 256 | 21.70508 | 24.7755 | 40.93595 | 21.18196 | 33.23925 | 62.77751 |
| DPN107 | 224 | 256 | 27.84462 | 34.83217 | 60.67903 | 27.62046 | 52.65353 | 100.11721 |
| DPN131 | 224 | 256 | 28.58941 | 33.01078 | 55.65146 | 28.33119 | 46.19439 | 89.24904 |

EfficientNet and ResNeXt101_wsl series

Overview

EfficientNet is a lightweight NAS-based network released by Google in 2019. EfficientNetB7 refreshed the ImageNet-1k classification accuracy record at that time. In the paper, the authors point out that traditional methods for improving neural network performance mainly scale the network width, the network depth, or the input image resolution. Through experiments, however, the authors found that balancing these three dimensions is essential for improving both accuracy and efficiency, and they summarized how to balance all three at the same time. Based on this scaling method, the authors built the EfficientNet B1-B7 networks on top of EfficientNetB0, reaching state-of-the-art accuracy under the same FLOPS and parameters.

ResNeXt, an improved version of ResNet, was proposed by Facebook in 2016. In 2019, Facebook researchers studied the accuracy limit of this series on ImageNet through weakly supervised learning. To distinguish these models from the previous ResNeXt networks, the series carries the suffix wsl, short for weakly-supervised-learning. To obtain stronger feature extraction capability, the researchers further enlarged the network width; the largest model, ResNeXt101_32x48d_wsl, has 800 million parameters. It was trained on 940 million weakly labeled images and then finetuned on ImageNet-1k, finally reaching 85.4% Top-1 accuracy on ImageNet-1k, which was the highest accuracy at a 224x224 resolution at the time. In Fix-ResNeXt, the authors used a larger image resolution and applied a special Fix strategy to resolve the inconsistency between training and testing preprocessing, which gives ResNeXt101_32x48d_wsl even higher accuracy. Since the Fix strategy is used, the model is named Fix-ResNeXt101_32x48d_wsl.

The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.

_images/t4.fp32.bs4.EfficientNet.flops.png

_images/t4.fp32.bs4.EfficientNet.params.png

_images/t4.fp32.bs1.EfficientNet.png

_images/t4.fp16.bs1.EfficientNet.png

At present, PaddleClas open-sources a total of 14 pretrained models of these two series. As the figure above shows, the advantage of the EfficientNet series is very obvious. The ResNeXt101_wsl series uses more training data and therefore reaches higher accuracy. EfficientNet_B0_small removes the SE block from EfficientNet_B0, which gives it faster inference.

Accuracy, FLOPS and Parameters

| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|---|---|---|---|---|---|---|
| ResNeXt101_32x8d_wsl | 0.826 | 0.967 | 0.822 | 0.964 | 29.140 | 78.440 |
| ResNeXt101_32x16d_wsl | 0.842 | 0.973 | 0.842 | 0.972 | 57.550 | 152.660 |
| ResNeXt101_32x32d_wsl | 0.850 | 0.976 | 0.851 | 0.975 | 115.170 | 303.110 |
| ResNeXt101_32x48d_wsl | 0.854 | 0.977 | 0.854 | 0.976 | 173.580 | 456.200 |
| Fix_ResNeXt101_32x48d_wsl | 0.863 | 0.980 | 0.864 | 0.980 | 354.230 | 456.200 |
| EfficientNetB0 | 0.774 | 0.933 | 0.773 | 0.935 | 0.720 | 5.100 |
| EfficientNetB1 | 0.792 | 0.944 | 0.792 | 0.945 | 1.270 | 7.520 |
| EfficientNetB2 | 0.799 | 0.947 | 0.803 | 0.950 | 1.850 | 8.810 |
| EfficientNetB3 | 0.812 | 0.954 | 0.817 | 0.956 | 3.430 | 11.840 |
| EfficientNetB4 | 0.829 | 0.962 | 0.830 | 0.963 | 8.290 | 18.760 |
| EfficientNetB5 | 0.836 | 0.967 | 0.837 | 0.967 | 19.510 | 29.610 |
| EfficientNetB6 | 0.840 | 0.969 | 0.842 | 0.968 | 36.270 | 42.000 |
| EfficientNetB7 | 0.843 | 0.969 | 0.844 | 0.971 | 72.350 | 64.920 |
| EfficientNetB0_small | 0.758 | 0.926 | | | 0.720 | 4.650 |

Inference speed based on V100 GPU

| Models | Crop Size | Resize Short Size | FP32 Batch Size=1 (ms) |
|---|---|---|---|
| ResNeXt101_32x8d_wsl | 224 | 256 | 19.127 |
| ResNeXt101_32x16d_wsl | 224 | 256 | 23.629 |
| ResNeXt101_32x32d_wsl | 224 | 256 | 40.214 |
| ResNeXt101_32x48d_wsl | 224 | 256 | 59.714 |
| Fix_ResNeXt101_32x48d_wsl | 320 | 320 | 82.431 |
| EfficientNetB0 | 224 | 256 | 2.449 |
| EfficientNetB1 | 240 | 272 | 3.547 |
| EfficientNetB2 | 260 | 292 | 3.908 |
| EfficientNetB3 | 300 | 332 | 5.145 |
| EfficientNetB4 | 380 | 412 | 7.609 |
| EfficientNetB5 | 456 | 488 | 12.078 |
| EfficientNetB6 | 528 | 560 | 18.381 |
| EfficientNetB7 | 600 | 632 | 27.817 |
| EfficientNetB0_small | 224 | 256 | 1.692 |

Inference speed based on T4 GPU

| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|---|---|---|---|---|---|---|---|---|
| ResNeXt101_32x8d_wsl | 224 | 256 | 18.19374 | 21.93529 | 34.67802 | 18.52528 | 34.25319 | 67.2283 |
| ResNeXt101_32x16d_wsl | 224 | 256 | 18.52609 | 36.8288 | 62.79947 | 25.60395 | 71.88384 | 137.62327 |
| ResNeXt101_32x32d_wsl | 224 | 256 | 33.51391 | 70.09682 | 125.81884 | 54.87396 | 160.04337 | 316.17718 |
| ResNeXt101_32x48d_wsl | 224 | 256 | 50.97681 | 137.60926 | 190.82628 | 99.01698256 | 315.91261 | 551.83695 |
| Fix_ResNeXt101_32x48d_wsl | 320 | 320 | 78.62869 | 191.76039 | 317.15436 | 160.0838242 | 595.99296 | 1151.47384 |
| EfficientNetB0 | 224 | 256 | 3.40122 | 5.95851 | 9.10801 | 3.442 | 6.11476 | 9.3304 |
| EfficientNetB1 | 240 | 272 | 5.25172 | 9.10233 | 14.11319 | 5.3322 | 9.41795 | 14.60388 |
| EfficientNetB2 | 260 | 292 | 5.91052 | 10.5898 | 17.38106 | 6.29351 | 10.95702 | 17.75308 |
| EfficientNetB3 | 300 | 332 | 7.69582 | 16.02548 | 27.4447 | 7.67749 | 16.53288 | 28.5939 |
| EfficientNetB4 | 380 | 412 | 11.55585 | 29.44261 | 53.97363 | 12.15894 | 30.94567 | 57.38511 |
| EfficientNetB5 | 456 | 488 | 19.63083 | 56.52299 | - | 20.48571 | 61.60252 | - |
| EfficientNetB6 | 528 | 560 | 30.05911 | - | - | 32.62402 | - | - |
| EfficientNetB7 | 600 | 632 | 47.86087 | - | - | 53.93823 | - | - |
| EfficientNetB0_small | 224 | 256 | 2.39166 | 4.36748 | 6.96002 | 2.3076 | 4.71886 | 7.21888 |

Other networks

Overview

In 2012, the AlexNet network proposed by Alex Krizhevsky et al. won the ImageNet competition by far surpassing the second place, and convolutional neural networks, and deep learning in general, attracted wide attention. AlexNet used ReLU as the activation function of the CNN to alleviate the vanishing gradient problem of sigmoid in deep networks. During training, Dropout was used to randomly drop a portion of the neurons, avoiding overfitting. In the network, overlapping max pooling replaced the average pooling commonly used in CNNs, which avoids the blurring effect of average pooling and improves feature richness. In a sense, AlexNet ignited the research and application of neural networks.

SqueezeNet achieved the same precision as AlexNet on ImageNet-1k with only 1/50 of the parameters. The core of the network is the Fire module, which uses 1x1 convolutions for channel dimensionality reduction, greatly reducing the number of parameters. The authors built SqueezeNet by stacking a large number of Fire modules.

VGG is a convolutional neural network developed by researchers at Oxford University's Visual Geometry Group and DeepMind. The network explores the relationship between the depth of a convolutional neural network and its performance. By repeatedly stacking 3x3 convolutional kernels and 2x2 max pooling layers, a multi-layer convolutional neural network was successfully constructed and achieved good convergence accuracy. In the end, VGG won the runner-up of the ILSVRC 2014 classification task and the champion of the localization task.

DarkNet53 was designed by the YOLO authors for object detection. The network is basically composed of 1x1 and 3x3 convolutional kernels, with a total of 53 layers, hence the name DarkNet53.

Accuracy, FLOPS and Parameters

| Models | Top1 | Top5 | Reference top1 | Reference top5 | FLOPS (G) | Parameters (M) |
|---|---|---|---|---|---|---|
| AlexNet | 0.567 | 0.792 | 0.5720 | | 1.370 | 61.090 |
| SqueezeNet1_0 | 0.596 | 0.817 | 0.575 | | 1.550 | 1.240 |
| SqueezeNet1_1 | 0.601 | 0.819 | | | 0.690 | 1.230 |
| VGG11 | 0.693 | 0.891 | | | 15.090 | 132.850 |
| VGG13 | 0.700 | 0.894 | | | 22.480 | 133.030 |
| VGG16 | 0.720 | 0.907 | 0.715 | 0.901 | 30.810 | 138.340 |
| VGG19 | 0.726 | 0.909 | | | 39.130 | 143.650 |
| DarkNet53 | 0.780 | 0.941 | 0.772 | 0.938 | 18.580 | 41.600 |
| ResNet50_ACNet | 0.767 | 0.932 | | | 10.730 | 33.110 |
| ResNet50_ACNet_deploy | 0.767 | 0.932 | | | 8.190 | 25.550 |

Inference speed based on V100 GPU

| Models | Crop Size | Resize Short Size | FP32 Batch Size=1 (ms) |
|---|---|---|---|
| AlexNet | 224 | 256 | 1.176 |
| SqueezeNet1_0 | 224 | 256 | 0.860 |
| SqueezeNet1_1 | 224 | 256 | 0.763 |
| VGG11 | 224 | 256 | 1.867 |
| VGG13 | 224 | 256 | 2.148 |
| VGG16 | 224 | 256 | 2.616 |
| VGG19 | 224 | 256 | 3.076 |
| DarkNet53 | 256 | 256 | 3.139 |
| ResNet50_ACNet_deploy | 224 | 256 | 5.626 |

Inference speed based on T4 GPU

| Models | Crop Size | Resize Short Size | FP16 Batch Size=1 (ms) | FP16 Batch Size=4 (ms) | FP16 Batch Size=8 (ms) | FP32 Batch Size=1 (ms) | FP32 Batch Size=4 (ms) | FP32 Batch Size=8 (ms) |
|---|---|---|---|---|---|---|---|---|
| AlexNet | 224 | 256 | 1.06447 | 1.70435 | 2.38402 | 1.44993 | 2.46696 | 3.72085 |
| SqueezeNet1_0 | 224 | 256 | 0.97162 | 2.06719 | 3.67499 | 0.96736 | 2.53221 | 4.54047 |
| SqueezeNet1_1 | 224 | 256 | 0.81378 | 1.62919 | 2.68044 | 0.76032 | 1.877 | 3.15298 |
| VGG11 | 224 | 256 | 2.24408 | 4.67794 | 7.6568 | 3.90412 | 9.51147 | 17.14168 |
| VGG13 | 224 | 256 | 2.58589 | 5.82708 | 10.03591 | 4.64684 | 12.61558 | 23.70015 |
| VGG16 | 224 | 256 | 3.13237 | 7.19257 | 12.50913 | 5.61769 | 16.40064 | 32.03939 |
| VGG19 | 224 | 256 | 3.69987 | 8.59168 | 15.07866 | 6.65221 | 20.4334 | 41.55902 |
| DarkNet53 | 256 | 256 | 3.18101 | 5.88419 | 10.14964 | 4.10829 | 12.1714 | 22.15266 |
| ResNet50_ACNet | 256 | 256 | 3.89002 | 4.58195 | 9.01095 | 5.33395 | 10.96843 | 18.70368 |
| ResNet50_ACNet_deploy | 224 | 256 | 2.6823 | 5.944 | 7.16655 | 3.49161 | 7.78374 | 13.94361 |

advanced_tutorials

image_augmentation

Image Augmentation

Image augmentation is a commonly used regularization method in image classification tasks, often applied in scenarios with insufficient data or large models. In this chapter, we mainly introduce 8 image augmentation methods beyond the standard augmentation pipeline. Users can apply these methods in their own tasks for better model performance. Under the same conditions, the performance of these augmentation methods on the ImageNet1k dataset is shown as follows.

_images/main_image_aug.png

Common image augmentation methods

Unless otherwise specified, all the examples and experiments in this chapter are based on the ImageNet1k dataset with the network input image size set to 224.

The standard data augmentation pipeline in ImageNet classification tasks contains the following steps; a minimal sketch of composing this pipeline with PaddleClas-style operators is shown after the list.

  1. Decode image, abbreviated as ImageDecode.
  2. Randomly crop the image to 224x224, abbreviated as RandCrop.
  3. Randomly flip the image horizontally, abbreviated as RandFlip.
  4. Normalize the image pixel values, abbreviated as Normalize.
  5. Transpose the image from [224, 224, 3](HWC) to [3, 224, 224](CHW), abbreviated as Transpose.
  6. Group the image data([3, 224, 224]) into a batch([N, 3, 224, 224]), where N is the batch size. It is abbreviated as Batch.
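
For reference, the pipeline above can be composed with the operators in ppcls.data.imaug. The sketch below assumes that the Python constructors of RandCropImage, RandFlipImage and NormalizeImage accept the same arguments as the configuration keys shown later in this chapter; treat it as an illustration rather than the canonical API.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import RandCropImage
from ppcls.data.imaug import RandFlipImage
from ppcls.data.imaug import NormalizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform

# Standard ImageNet training pipeline: decode -> random crop -> random flip
# -> normalize -> transpose to CHW. Batching is handled by the data loader.
ops = [
    DecodeImage(),                         # ImageDecode
    RandCropImage(size=224),               # RandCrop
    RandFlipImage(flip_code=1),            # RandFlip (horizontal)
    NormalizeImage(scale=1. / 255.,
                   mean=[0.485, 0.456, 0.406],
                   std=[0.229, 0.224, 0.225],
                   order=''),              # Normalize
    ToCHWImage(),                          # Transpose
]

imgs_dir = "path/to/images"                # replace with your image directory
for fname in os.listdir(imgs_dir):
    with open(os.path.join(imgs_dir, fname), 'rb') as fp:
        data = fp.read()
    img = transform(data, ops)             # CHW float32 array, ready for batching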

Compared with the standard augmentation pipeline above, researchers have proposed many improved image augmentation strategies. These strategies insert certain operations at different stages of the standard pipeline, and based on the stage at which they operate, we divide them into the following three categories.

  1. Transformation. Perform some transformations on the image after RandCrop, such as AutoAugment and RandAugment.
  2. Cropping. Perform some transformations on the image after Transpose, such as CutOut, RandErasing, HideAndSeek and GridMask.
  3. Aliasing. Perform some transformations on the image after Batch, such as Mixup and Cutmix.

The following table shows more detailed information of the transformations.

| Method | Input | Output | AutoAugment[1] | RandAugment[2] | CutOut[3] | RandErasing[4] | HideAndSeek[5] | GridMask[6] | Mixup[7] | Cutmix[8] |
|---|---|---|---|---|---|---|---|---|---|---|
| ImageDecode | Binary | (224, 224, 3) uint8 | Y | Y | Y | Y | Y | Y | Y | Y |
| RandCrop | (:, :, 3) uint8 | (224, 224, 3) uint8 | Y | Y | Y | Y | Y | Y | Y | Y |
| Process | (224, 224, 3) uint8 | (224, 224, 3) uint8 | Y | Y | - | - | - | - | - | - |
| RandFlip | (224, 224, 3) uint8 | (224, 224, 3) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| Normalize | (224, 224, 3) uint8 | (3, 224, 224) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| Transpose | (224, 224, 3) float32 | (3, 224, 224) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| Process | (3, 224, 224) float32 | (3, 224, 224) float32 | - | - | Y | Y | Y | Y | - | - |
| Batch | (3, 224, 224) float32 | (N, 3, 224, 224) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| Process | (N, 3, 224, 224) float32 | (N, 3, 224, 224) float32 | - | - | - | - | - | - | Y | Y |

PaddleClas integrates all the above data augmentation strategies. More details, including the principles and usage of each strategy, are introduced in the following chapters. For better visualization, we use the following figure to show the changes after the transformations, and RandCrop is replaced with Resize for simplification.

_images/test_baseline.jpeg

Image Transformation

Transformation means performing some transformations on the image after RandCrop. It mainly contains AutoAugment and RandAugment.

AutoAugment

Address:https://arxiv.org/abs/1805.09501v1

Github repo:https://github.com/DeepVoltaire/AutoAugment

Unlike conventional, manually designed image augmentation methods, AutoAugment is an augmentation policy for a specific dataset found by a search algorithm over a search space of augmentation sub-policies. For the ImageNet dataset, the final policy contains 25 sub-policy combinations, and each sub-policy contains two transformations. For each image, a sub-policy is randomly selected, and each transformation in the sub-policy is then applied with a certain probability.

In PaddleClas, AutoAugment is used as follows.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ImageNetPolicy
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
autoaugment_op = ImageNetPolicy()

ops = [decode_op, resize_op, autoaugment_op]

imgs_dir = image_path  # directory containing the images to be augmented
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)

The images after AutoAugment are as follows.

_images/test_autoaugment.jpeg

RandAugment

Address: https://arxiv.org/pdf/1909.13719.pdf

Github repo: https://github.com/heartInsert/randaugment

The search method of AutoAugment is relatively brute-force: searching for the optimal policy directly on the target dataset requires a great deal of computation. In RandAugment, the authors found that, on the one hand, for larger models and larger datasets the gains from a policy searched with AutoAugment are smaller; on the other hand, the searched policy is limited to a certain dataset, so it generalizes poorly and is not suitable for other datasets.

In RandAugment, the author proposes a random augmentation method. Instead of using a specific probability to determine whether to use a certain sub-strategy, all sub-strategies are selected with the same probability. The experiments in the paper also show that this method performs well even for large models.

In PaddleClas, RandAugment is used as follows.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import RandAugment
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
randaugment_op = RandAugment()

ops = [decode_op, resize_op, randaugment_op]

imgs_dir = image_path  # directory containing the images to be augmented
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)

The images after RandAugment are as follows.

_images/test_randaugment.jpeg

Image Cropping

Cropping means performing some transformations on the image after Transpose, setting the pixels of the cropped area to a certain constant value. It mainly contains CutOut, RandErasing, HideAndSeek and GridMask.

Image cropping methods can be applied before or after normalization. The difference is that if we crop the image before normalization and fill the cropped areas with 0, those pixel values will no longer be 0 after normalization, which changes the grayscale distribution of the data.
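
As a concrete illustration of this point, a minimal NumPy sketch (not PaddleClas code) shows what happens to a zero-filled pixel under the usual ImageNet mean/std normalization:

import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# A pixel zeroed *before* normalization is shifted away from zero ...
zero_before = (np.zeros(3) - mean) / std
print(zero_before)   # approx. [-2.118, -2.036, -1.804]

# ... while zeroing *after* normalization keeps the masked area at exactly 0.
zero_after = np.zeros(3)
print(zero_after)    # [0. 0. 0.]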

The ideas behind the above cropping transformations are similar: they all aim to solve the problem of poor generalization of the trained model on occluded images; the difference lies in their cropping details.

Cutout

Address: https://arxiv.org/abs/1708.04552

Github repo: https://github.com/uoguelph-mlrg/Cutout

Cutout is a kind of dropout that occludes the input image rather than the feature map, which makes the model more robust to noise and occlusion. Cutout has two advantages: (1) using Cutout, we can simulate the situation where the subject is partially occluded; (2) it encourages the model to make full use of more content in the image for classification and prevents the network from focusing only on the saliency area, thereby reducing overfitting.

In PaddleClas, Cutout is used as follows.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import Cutout
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
cutout_op = Cutout(n_holes=1, length=112)

ops = [decode_op, resize_op, cutout_op]

imgs_dir = image_path  # directory containing the images to be augmented
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)

The images after Cutout are as follows.

_images/test_cutout.jpeg

RandomErasing

Address: https://arxiv.org/pdf/1708.04896.pdf

Github repo: https://github.com/zhunzhong07/Random-Erasing

RandomErasing is similar to Cutout. It also aims to solve the problem of poor generalization of the trained model on images with occlusion. The authors point out in the paper that random erasing is complementary to random horizontal flipping, and they also verified the effectiveness of the method on pedestrian re-identification (ReID). Unlike Cutout, RandomErasing is applied to the image only with a certain probability, and the size and aspect ratio of the generated mask are also randomly generated according to pre-defined hyperparameters.

In PaddleClas, RandomErasing is used as follows.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import RandomErasing
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
randomerasing_op = RandomErasing()

ops = [decode_op, resize_op, tochw_op, randomerasing_op]

imgs_dir = image_path  # directory containing the images to be augmented
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    img = img.transpose((1, 2, 0))

The images after RandomErasing are as follows.

_images/test_randomerassing.jpeg

HideAndSeek

Address: https://arxiv.org/pdf/1811.02545.pdf

Github repo: https://github.com/kkanshul/Hide-and-Seek

For HideAndSeek, the image is divided into patches, and a mask is generated for each patch with a certain probability. The meaning of the masks in different areas is shown in the figure below.

_images/hide-and-seek-visual.png

In PaddleClas, HideAndSeek is used as follows.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import HideAndSeek
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
hide_and_seek_op = HideAndSeek()

ops = [decode_op, resize_op, tochw_op, hide_and_seek_op]

imgs_dir = image_path  # directory containing the images to be augmented
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    img = img.transpose((1, 2, 0))

The images after HideAndSeek are as follows.

_images/test_hideandseek.jpeg

GridMask

Address:https://arxiv.org/abs/2001.04086

Github repo:https://github.com/akuxcw/GridMask

The author points out that the previous method based on image cropping has two problems, as shown in the following figure:

  1. Excessive deletion of the area may cause most or all of the target subject to be removed, or cause the loss of context information, so that the augmented images become noisy data.
  2. Reserving too much area has little effect on the object and context.

_images/gridmask-0.png

Therefore, how to avoid over-deletion or over-retention of the area becomes the core problem to be solved.

GridMask is to generate a mask with the same resolution as the original image and multiply it with the original image. The mask grid and size are adjusted by the hyperparameters.

In the training process, there are two methods to use:

  1. Set a probability p and use GridMask to augment the image with probability p from the beginning of training.
  2. Initially set the augmentation probability to 0 and increase it with the number of iterations from 0 to p.

Experiments show that the second method is better; a small sketch of this schedule is given below.
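
A hypothetical helper (not part of the PaddleClas API) illustrating the second schedule, where the probability grows linearly from 0 to p over a warm-up period:

# Probability of applying GridMask grows linearly from 0 to p over the first
# `warmup_iters` iterations and then stays at p. Names are illustrative only.
def gridmask_prob(cur_iter, warmup_iters, p=0.8):
    return p * min(1.0, cur_iter / float(warmup_iters))

# Example: reach the full probability p=0.8 after 10000 iterations.
for it in (0, 2500, 5000, 10000, 20000):
    print(it, gridmask_prob(it, warmup_iters=10000))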

The usage of GridMask in PaddleClas is shown below.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import GridMask
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
gridmask_op = GridMask(d1=96, d2=224, rotate=1, ratio=0.6, mode=1, prob=0.8)

ops = [decode_op, resize_op, tochw_op, gridmask_op]

imgs_dir = image_path  # directory containing the images to be augmented
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    img = img.transpose((1, 2, 0))

The images after GridMask are as follows.

_images/test_gridmask.jpeg

Image aliasing

Aliasing means performing some transformations on the image after Batch, which contains Mixup and Cutmix.

The data augmentation methods introduced before operate on a single image, while aliasing operates on a batch to generate a new batch.

Mixup

Address: https://arxiv.org/pdf/1710.09412.pdf

Github repo: https://github.com/facebookresearch/mixup-cifar10

Mixup is the first aliasing-based augmentation method. It is easy to implement and performs well not only in image classification but also in object detection. Mixup is usually carried out within a batch for simplicity, and so is Cutmix; a conceptual sketch of the mixing itself is given at the end of this subsection.

The usage of Mixup in PaddleClas is shown below.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform
from ppcls.data.imaug import MixupOperator

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
mixup_op = MixupOperator()

ops = [decode_op, resize_op, tochw_op]

imgs_dir = image_path  # directory containing the images to be augmented

batch = []
fnames = os.listdir(imgs_dir)
for idx, f in enumerate(fnames):
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    batch.append((img, idx))  # fake label

new_batch = mixup_op(batch)

The images after Mixup are as follows.

_images/test_mixup.png
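
For intuition, the mixing that an operator such as MixupOperator performs on a batch can be sketched conceptually as follows. This is a simplified NumPy illustration of the standard mixup formulation, not the actual PaddleClas implementation:

import numpy as np

def mixup_batch(images, labels, alpha=0.2):
    """Conceptual sketch of mixup on a batch of CHW images.

    images: float32 array of shape (N, 3, 224, 224); labels: int array (N,).
    Returns mixed images plus the label pair and mixing weight, which the
    loss combines as lam * CE(y_a) + (1 - lam) * CE(y_b).
    """
    lam = np.random.beta(alpha, alpha)           # mixing coefficient
    perm = np.random.permutation(images.shape[0])
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam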

Cutmix

Address: https://arxiv.org/pdf/1905.04899v2.pdf

Github repo: https://github.com/clovaai/CutMix-PyTorch

Unlike Mixup, which directly blends two whole images, Cutmix randomly cuts an ROI out of one image and pastes it onto the corresponding region of another image. The usage of Cutmix in PaddleClas is shown below.

import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform
from ppcls.data.imaug import CutmixOperator

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
cutmix_op = CutmixOperator()

ops = [decode_op, resize_op, tochw_op]

imgs_dir = image_path  # directory containing the images to be augmented

batch = []
fnames = os.listdir(imgs_dir)
for idx, f in enumerate(fnames):
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    batch.append((img, idx))  # fake label

new_batch = cutmix_op(batch)

The images after Cutmix are as follows.

_images/test_cutmix.png

Experiments

Based on PaddleClas, the metrics of different augmentation methods on the ImageNet1k dataset are as follows.

| Model | Learning strategy | l2 decay | batch size | epoch | Augmentation method | Top1 Acc | Reference |
|---|---|---|---|---|---|---|---|
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | Standard transform | 0.7731 | - |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | AutoAugment | 0.7795 | 0.7763 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | mixup | 0.7828 | 0.7790 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | cutmix | 0.7839 | 0.7860 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | cutout | 0.7801 | - |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | gridmask | 0.7785 | 0.7790 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | random-augment | 0.7770 | 0.7760 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | random erasing | 0.7791 | - |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | hide and seek | 0.7743 | 0.7720 |

note:

  • In the experiments here, for better comparison, we fixed the l2 decay to 1e-4. To achieve higher accuracy, we recommend trying a smaller l2 decay. Combined with data augmentation, we found that reducing l2 decay from 1e-4 to 7e-5 can bring at least a 0.3%~0.5% accuracy improvement.
  • We have not yet combined or verified different strategies, which is our future work.
Data augmentation practice

Experiments on data augmentation are introduced in detail in this section. If you want to quickly experience these methods, please refer to Quick start PaddleClas in 30 minutes.

Configurations

Since the hyperparameters differ between augmentation methods, for better understanding we list 8 augmentation configuration files in configs/DataAugment based on ResNet50. Users can train the model with tools/run.sh. The following are 3 of them.

RandAugment

Configuration of RandAugment is shown as follows. num_layers (default: 2) and magnitude (default: 5) are its two hyperparameters.

    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - RandAugment:
            num_layers: 2
            magnitude: 5
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
Cutout

Configuration of Cutout is shown as follows. n_holes (default: 1) and length (default: 112) are its two hyperparameters.

    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - Cutout:
            n_holes: 1
            length: 112
        - ToCHWImage:
Mixup

Configuration of Mixup is shown as follows. alpha (default: 0.2) is the hyperparameter users need to care about. In addition, use_mix needs to be set to True at the root of the configuration.

    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
    mix:
        - MixupOperator:
            alpha: 0.2
Training command

Users can use the following command to start the training process, which can also be referred to tools/run.sh.

export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH

python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    tools/train.py \
        -c ./configs/DataAugment/ResNet50_Cutout.yaml
Note
  • When using augmentation methods based on image aliasing, users need to set use_mix in the configuration file to True. In addition, because the labels are mixed along with the images, the training accuracy cannot be computed, so it is not printed during the training process.
  • The training data is more difficult with data augmentation, so the training loss may be larger, the training set accuracy is relatively low, but it has better generalization ability, so the validation set accuracy is relatively higher.
  • After the use of data augmentation, the model may tend to be underfitting. It is recommended to reduce l2_decay for better performance on validation set.
  • Hyperparameters exist in almost all augmentation methods. Here we provide hyperparameters for the ImageNet1k dataset. Users may need to finetune the hyperparameters on their own datasets. More training tricks can be found in Tricks.
If this document is helpful to you, welcome to star our project: https://github.com/PaddlePaddle/PaddleClas

Reference

[1] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.

[2] Cubuk E D, Zoph B, Shlens J, et al. Randaugment: Practical automated data augmentation with a reduced search space[J]. arXiv preprint arXiv:1909.13719, 2019.

[3] DeVries T, Taylor G W. Improved regularization of convolutional neural networks with cutout[J]. arXiv preprint arXiv:1708.04552, 2017.

[4] Zhong Z, Zheng L, Kang G, et al. Random erasing data augmentation[J]. arXiv preprint arXiv:1708.04896, 2017.

[5] Singh K K, Lee Y J. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization[C]//2017 IEEE international conference on computer vision (ICCV). IEEE, 2017: 3544-3553.

[6] Chen P. GridMask Data Augmentation[J]. arXiv preprint arXiv:2001.04086, 2020.

[7] Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv:1710.09412, 2017.

[8] Yun S, Han D, Oh S J, et al. Cutmix: Regularization strategy to train strong classifiers with localizable features[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6023-6032.

distillation

Introduction of model compression methods

In recent years, deep neural networks have been proven to be an extremely effective method for solving problems in computer vision and natural language processing. With a suitable network structure and training process, deep learning methods perform better than traditional methods.

With enough training data, increasing the parameters of the neural network by building a reasonable network can significantly improve model performance, but this also increases model complexity, which makes the computation cost too high in real scenarios.

Parameter redundancy exists in deep neural networks. There are several methods to compress models, such as pruning, quantization and knowledge distillation. Knowledge distillation refers to using a teacher model to guide a student model to learn a specific task, ensuring that the small model obtains a relatively large accuracy improvement at unchanged computation cost, and sometimes even accuracy similar to the large model [1]. Combining some existing distillation methods [2,3], PaddleClas provides a simple semi-supervised label knowledge distillation solution (SSLD). Top-1 accuracy on the ImageNet1k dataset improves by more than 3% for the ResNet_vd and MobileNet series, as shown below.

_images/distillation_perform_s.jpg

SSLD

Introduction

The following figure shows the framework of SSLD.

_images/ppcls_distillation.png

First, we select nearly 4 million images from the ImageNet22k dataset and combine them with the ImageNet-1k training set to get a new dataset containing 5 million images. Then, we combine the student model and the teacher model into a new network, which outputs the predictions of the student model and the teacher model respectively. The gradients are not propagated into the teacher model, whose weights are fixed. Finally, we use JS divergence loss as the loss function for the training process. Here we take the MobileNetV3 distillation task as an example and introduce the key points of SSLD.

  • Choice of the teacher model. During knowledge distillation, it may not be optimal if the structures of the teacher model and the student model differ too much. For the same structure, a teacher model with higher accuracy leads to better performance of the student model during distillation. Compared with the 79.12% ResNet50_vd teacher model, using the 82.4% teacher model brings a 0.4% improvement in Top-1 accuracy (75.6% -> 76.0%).
  • Improvement of the loss function. The most commonly used loss function for classification is cross entropy loss. We find that when training with soft labels, KL divergence loss brings almost no improvement over cross entropy loss, but the accuracy improves by 0.2% using JS divergence loss (76.0% -> 76.2%). The loss function in SSLD is therefore JS divergence loss (a minimal sketch of this loss is given after this list).
  • More iterations. The baseline experiment uses only 120 epochs. We can achieve a 0.9% improvement by setting it to 360 (76.2% -> 77.1%).
  • No need for labeled data in SSLD, which makes it convenient to expand the training data. The label is not used when computing the loss function, so unlabeled data can also be used to train the network. The label-free nature of this distillation solution also greatly raises the upper performance limit of student models (77.1% -> 78.5%).
  • ImageNet1k finetuning. The ImageNet1k training set is used for finetuning, which brings a 0.4% accuracy improvement (75.8% -> 78.9%).
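
As a reference for the loss mentioned above, a minimal NumPy sketch of the JS divergence between the student and teacher softmax outputs is given below; it illustrates the formula only and is not the PaddleClas implementation.

import numpy as np

def js_divergence_loss(p_student, p_teacher, eps=1e-10):
    """JS divergence between two probability distributions, averaged per batch.

    p_student, p_teacher: arrays of shape (N, num_classes) that already sum
    to 1 along the last axis (i.e. softmax outputs).
    """
    m = 0.5 * (p_student + p_teacher)
    kl_pm = np.sum(p_student * np.log((p_student + eps) / (m + eps)), axis=1)
    kl_qm = np.sum(p_teacher * np.log((p_teacher + eps) / (m + eps)), axis=1)
    return np.mean(0.5 * kl_pm + 0.5 * kl_qm)
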
Data selection
  • An important feature of the SSLD distillation scheme is that it does not need labeled images, so the dataset can be expanded arbitrarily. Considering the limitation of computing resources, we only expand the training set of the distillation task based on the ImageNet22k dataset. For SSLD, we use the Top-k per class data sampling scheme [3]. The specific steps are as follows; a small illustrative sketch of the Top-k selection step is given after this list.
    • Deduplication of the training set. We first deduplicate the ImageNet22k dataset against the ImageNet1k validation set based on SIFT feature similarity matching, to prevent the added ImageNet22k training images from containing ImageNet1k validation images. Finally we removed 4511 similar images. Some of the filtered similar images are shown below.

    _images/22k_1k_val_compare_w_sift.png

    • Obtaining the soft labels of the ImageNet22k dataset. For the deduplicated ImageNet22k dataset, we use the ResNeXt101_32x16d_wsl model to predict the soft label of each image.
    • Top-k data selection. The ImageNet1k dataset contains 1000 categories. For each category, we find the images with the Top-k highest scores for that category, and finally generate a dataset whose image number does not exceed 1000 * k (for some categories, there may be fewer than k images).
    • The selected images are merged with the ImageNet1k training set to form the dataset used for the final distillation model training, which contains 5 million images in all.
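
The Top-k selection step can be illustrated with the following simplified sketch, which assumes the teacher's soft labels are already available as an (M, 1000) score matrix; it is illustrative only and not the exact PaddleClas pipeline.

import numpy as np

def topk_per_class(soft_labels, image_ids, k=1000, num_classes=1000):
    """Select at most k images per class by their soft-label score.

    soft_labels: (M, num_classes) teacher predictions for M candidate images.
    image_ids: list of M identifiers. Returns a set of selected ids,
    containing at most num_classes * k elements.
    """
    selected = set()
    pred_class = np.argmax(soft_labels, axis=1)
    top_score = np.max(soft_labels, axis=1)
    for c in range(num_classes):
        idx = np.where(pred_class == c)[0]
        # keep the k highest-scoring images predicted as class c
        keep = idx[np.argsort(-top_score[idx])][:k]
        selected.update(image_ids[i] for i in keep)
    return selected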

Experiments

The distillation solution that PaddleClas provides is combining common training with finetuning. Given a suitable teacher model, the large dataset(5 million) is used for common training and the ImageNet1k dataset is used for finetuning.

Choice of teacher model

In order to verify the influence of the model size difference between the teacher model and the student model on the distillation results as well as the teacher model accuracy, we conducted several experiments. The training strategy is unified as follows: cosine_decay_warmup, lr = 1.3, epoch = 120, bs = 2048, and the student models are all trained from scratch.

| Teacher Model | Teacher Top1 | Student Model | Student Top1 |
|---|---|---|---|
| ResNeXt101_32x16d_wsl | 84.2% | MobileNetV3_large_x1_0 | 75.78% |
| ResNet50_vd | 79.12% | MobileNetV3_large_x1_0 | 75.60% |
| ResNet50_vd | 82.35% | MobileNetV3_large_x1_0 | 76.00% |

It can be shown from the table that:

When the teacher model structure is the same, the higher the teacher model accuracy, the better the final student model will be.

The size difference between the teacher model and the student model should not be too large, otherwise it will decrease the accuracy of the distillation results.

Therefore, during distillation, for the ResNet series student models we use ResNeXt101_32x16d_wsl as the teacher model; for the MobileNet series student models, we use ResNet50_vd_SSLD as the teacher model.

Distillation using large-scale dataset

Training process is carried out on the large-scale dataset with 5 million images. Specifically, the following table shows more details of different models.

| Student Model | num_epoch | l2_decay | batch size/gpu cards | base lr | learning rate decay | top1 acc |
|---|---|---|---|---|---|---|
| MobileNetV1 | 360 | 3e-5 | 4096/8 | 1.6 | cosine_decay_warmup | 77.65% |
| MobileNetV2 | 360 | 1e-5 | 3072/8 | 0.54 | cosine_decay_warmup | 76.34% |
| MobileNetV3_large_x1_0 | 360 | 1e-5 | 5760/24 | 3.65625 | cosine_decay_warmup | 78.54% |
| MobileNetV3_small_x1_0 | 360 | 1e-5 | 5760/24 | 3.65625 | cosine_decay_warmup | 70.11% |
| ResNet50_vd | 360 | 7e-5 | 1024/32 | 0.4 | cosine_decay_warmup | 82.07% |
| ResNet101_vd | 360 | 7e-5 | 1024/32 | 0.4 | cosine_decay_warmup | 83.41% |
finetuning using ImageNet1k

Finetuning is carried out on the ImageNet1k dataset to restore the distribution gap between the training set and the test set. The following table shows more details of the finetuning.

| Student Model | num_epoch | l2_decay | batch size/gpu cards | base lr | learning rate decay | top1 acc |
|---|---|---|---|---|---|---|
| MobileNetV1 | 30 | 3e-5 | 4096/8 | 0.016 | cosine_decay_warmup | 77.89% |
| MobileNetV2 | 30 | 1e-5 | 3072/8 | 0.0054 | cosine_decay_warmup | 76.73% |
| MobileNetV3_large_x1_0 | 30 | 1e-5 | 2048/8 | 0.008 | cosine_decay_warmup | 78.96% |
| MobileNetV3_small_x1_0 | 30 | 1e-5 | 6400/32 | 0.025 | cosine_decay_warmup | 71.28% |
| ResNet50_vd | 60 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 82.39% |
| ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% |
Data augmentation and Fix strategy
  • Based on the experiments mentioned above, we add AutoAugment [4] during the training process and reduce l2_decay from 4e-5 to 2e-5. Finally, the Top-1 accuracy on the ImageNet1k dataset reaches 82.99%, a 0.6% improvement over the standard SSLD distillation strategy.
  • For image classification tasks, model accuracy can be further improved when the test scale is 1.15 times the training scale [5]. For the 82.99% ResNet50_vd pretrained model, evaluation at 320x320 yields 83.7%. We use the Fix strategy to finetune the model with the training scale set to 320x320, keeping the pre-processing pipeline the same for training and test and freezing all weights except the fully connected layer. Finally the Top-1 accuracy reaches 84.0%.

Application of the distillation model

Instructions
  • Adjust the learning rate of the middle layers. The middle-layer feature maps of the distilled model are more refined, so when the distilled model is used as a pretrained model in other tasks, keeping the original learning rate can easily destroy these features, while lowering the learning rate of the whole model slows convergence. Therefore, we adjust the learning rate of the middle layers; a minimal sketch of attaching such multipliers is given after this list. Specifically:
    • For ResNet50_vd, we set up a learning rate list. The three conv2d layers before the residual blocks share one learning rate multiplier, and each of the four residual stages has its own multiplier, so 5 values need to be set in the list. Experiments show that for transfer-learning classification tasks, the learning rate list [0.1, 0.1, 0.2, 0.2, 0.3] performs better in most tasks, while in object detection tasks, [0.05, 0.05, 0.05, 0.1, 0.15] brings greater accuracy gains.
    • For MobileNetV3_large_x1_0, which contains 15 blocks, every 3 blocks share one learning rate multiplier, so 5 values are also required. We find that in classification and detection tasks, the learning rate list [0.25, 0.25, 0.5, 0.5, 0.75] performs better in most tasks.
  • Appropriate l2 decay. Different l2 decay values are set for different models during training. To prevent overfitting, l2 decay is often set larger for large models: 1e-4 for ResNet50 and 1e-5 ~ 4e-5 for the MobileNet series. l2 decay also needs to be adjusted when applied to other tasks. Taking Faster_RCNN_MobileNetV3_FPN as an example, we found that modifying l2 decay alone can bring up to 0.5% accuracy (mAP) improvement on the COCO2017 dataset.
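
As an illustration of how such multipliers can be attached in static-graph Paddle, the following sketch scales the learning rate of a convolution's weights via fluid.ParamAttr. The layer and parameter names are hypothetical, and the stage-to-multiplier mapping is only indicative.

import paddle.fluid as fluid

# Per-stage learning-rate multipliers for ResNet50_vd used as a pretrained
# model: the stem convolutions share the first value, and the four residual
# stages use the remaining four values.
LR_MULT_LIST = [0.1, 0.1, 0.2, 0.2, 0.3]

def conv_with_lr_mult(input, num_filters, filter_size, lr_mult, name):
    # Attach the multiplier to the conv weights via ParamAttr; the optimizer's
    # base learning rate is scaled by `lr_mult` for these parameters.
    return fluid.layers.conv2d(
        input=input,
        num_filters=num_filters,
        filter_size=filter_size,
        param_attr=fluid.ParamAttr(name=name + "_weights",
                                   learning_rate=lr_mult),
        bias_attr=False)
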
Transfer learning
  • To verify the effect of the SSLD pretrained model in transfer learning, we carried out experiments on 10 small datasets. To ensure comparability, we use the standard preprocessing pipeline of ImageNet1k training. For the distillation model, we also add a simple search over the middle-layer learning rates of the distilled pretrained model.
  • For ResNet50_vd, the baseline pretrained model's Top-1 accuracy is 79.12%; the other parameters are obtained by grid search. For the distilled pretrained model, we add the middle-layer learning rates to the search space. The following table shows the results.
| Dataset | Model | Baseline Top1 Acc | Distillation Model Finetune |
|---|---|---|---|
| Oxford102 flowers | ResNet50_vd | 97.18% | 97.41% |
| caltech-101 | ResNet50_vd | 92.57% | 93.21% |
| Oxford-IIIT-Pets | ResNet50_vd | 94.30% | 94.76% |
| DTD | ResNet50_vd | 76.48% | 77.71% |
| fgvc-aircraft-2013b | ResNet50_vd | 88.98% | 90.00% |
| Stanford-Cars | ResNet50_vd | 92.65% | 92.76% |
| SUN397 | ResNet50_vd | 64.02% | 68.36% |
| cifar100 | ResNet50_vd | 86.50% | 87.58% |
| cifar10 | ResNet50_vd | 97.72% | 97.94% |
| Food-101 | ResNet50_vd | 89.58% | 89.99% |
  • It can be seen that on the above 10 datasets, combined with the appropriate middle layer learning rate, the distillation pretrained model can bring an average accuracy improvement of more than 1%.
Object detection

Based on the two-stage Faster/Cascade RCNN model, we verify the effect of the pretrained model obtained by distillation.

  • ResNet50_vd

The training scale and test scale are set to 640x640, and some of the ablation studies are shown below.

| Model | train/test scale | pretrain top1 acc | feature map lr | coco mAP |
|---|---|---|---|---|
| Faster RCNN R50_vd FPN | 640/640 | 79.12% | [1.0, 1.0, 1.0, 1.0, 1.0] | 34.8% |
| Faster RCNN R50_vd FPN | 640/640 | 79.12% | [0.05, 0.05, 0.1, 0.1, 0.15] | 34.3% |
| Faster RCNN R50_vd FPN | 640/640 | 82.18% | [0.05, 0.05, 0.1, 0.1, 0.15] | 36.3% |

It can be seen here that for the baseline pretrained model, excessive adjustment of the middle-layer learning rates actually reduces the performance of the detection model. Based on this distillation model, we also provide a practical server-side detection solution. The detailed configuration and training code are open source; more details can be found in PaddleDetection (https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/rcnn_enhance).

Practice

This section introduces the SSLD distillation experiments in detail based on the ImageNet-1k dataset. If you want to experience this method quickly, you can refer to Quick start PaddleClas in 30 minutes (../../tutorials/quick_start.md), whose dataset is Flowers102.

Configuration
Distill ResNet50_vd using ResNeXt101_32x16d_wsl

Configuration of distilling ResNet50_vd using ResNeXt101_32x16d_wsl is as follows.

ARCHITECTURE:
    name: 'ResNeXt101_32x16d_wsl_distill_ResNet50_vd'
pretrained_model: "./pretrained/ResNeXt101_32x16d_wsl_pretrained/"
# pretrained_model:
#     - "./pretrained/ResNeXt101_32x16d_wsl_pretrained/"
#     - "./pretrained/ResNet50_vd_pretrained/"
use_distillation: True
Distill MobileNetV3_large_x1_0 using ResNet50_vd_ssld

The detailed configuration is as follows.

ARCHITECTURE:
    name: 'ResNet50_vd_distill_MobileNetV3_large_x1_0'
pretrained_model: "./pretrained/ResNet50_vd_ssld_pretrained/"
# pretrained_model:
#     - "./pretrained/ResNet50_vd_ssld_pretrained/"
#     - "./pretrained/ResNet50_vd_pretrained/"
use_distillation: True
Begin to train the network

If everything is ready, users can begin to train the network using the following command.

export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH

python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    --log_dir=R50_vd_distill_MV3_large_x1_0 \
    tools/train.py \
        -c ./configs/Distillation/R50_vd_distill_MV3_large_x1_0.yaml
Note
  • Before using SSLD, users need to train a teacher model on the target dataset firstly. The teacher model is used to guide the training of the student model.
  • When using SSLD, users need to set use_distillation in the configuration file to True. In addition, because the student model learns soft labels that carry knowledge information, you need to turn off the label_smoothing option.
  • If the student model is not loaded with a pretrained model, the other hyperparameters of the training can refer to the hyperparameters trained by the student model on ImageNet-1k. If the student model is loaded with the pre-trained model, the learning rate can be adjusted to 1/100~1/10 of the standard learning rate.
  • In the process of SSLD distillation, the student model only learns the soft label, which makes the training process more difficult. It is recommended that the value of l2_decay can be decreased appropriately to obtain higher accuracy of the validation set.
  • If users are going to add unlabeled training data, just the training list textfile needs to be adjusted for more data.
If this document is helpful to you, welcome to star our project: https://github.com/PaddlePaddle/PaddleClas

Reference

[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.

[2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.

[3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.

[4] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.

[5] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy[C]//Advances in Neural Information Processing Systems. 2019: 8250-8260.

application

Transfer learning in image classification

Transfer learning is an important part of machine learning and is widely used in various fields such as text and images. Here we mainly introduce transfer learning in the field of image classification, often called domain transfer, for example migrating an ImageNet classification model to a specific image classification task such as flower classification.

Large-scale image classification

In practical applications, due to the lack of training data, the classification model trained on the ImageNet1k data set is often used as the pretrained model for other image classification tasks. In order to further help solve practical problems, based on ResNet50_vd, Baidu open sourced a self-developed large-scale classification pretrained model, in which the training data contains 100,000 categories and 43 million pictures.

We conducted transfer learning experiments on 6 self-collected datasets, using both a fixed set of parameters and a grid search method, with the number of training epochs set to 20 and the ResNet50_vd model (ImageNet pretraining accuracy 79.12%) selected. The comparison of the dataset statistics and model accuracy is as follows:

Fixed scheme:

lr=0.001,l2 decay=1e-4,label smoothing=False,mixup=False
| Dataset | Statistics | Pretrained model on ImageNet Top-1(fixed)/Top-1(search) | Pretrained model on large-scale dataset Top-1(fixed)/Top-1(search) |
|---|---|---|---|
| Flowers | class: 102, train: 5789, valid: 2396 | 0.7779/0.9883 | 0.9892/0.9954 |
| Hand-painted stick figures | class: 18, train: 1007, valid: 432 | 0.8795/0.9196 | 0.9107/0.9219 |
| Leaves | class: 6, train: 5256, valid: 2278 | 0.8212/0.8482 | 0.8385/0.8659 |
| Container vehicle | class: 115, train: 4879, valid: 2094 | 0.6230/0.9556 | 0.9524/0.9702 |
| Chair | class: 5, train: 169, valid: 78 | 0.8557/0.9688 | 0.9077/0.9792 |
| Geology | class: 4, train: 671, valid: 296 | 0.5719/0.8094 | 0.6781/0.8219 |
  • The above experiments verified that for fixed parameters, compared with the pretrained model on ImageNet, using the large-scale classification model as a pretrained model can help us improve the model performance on a new dataset in most cases. Parameter search can be further helpful to the model performance.

Reference

[1] Kornblith, Simon, Jonathon Shlens, and Quoc V. Le. “Do better imagenet models transfer better?.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2019.

[2] Kolesnikov, Alexander, et al. “Large Scale Learning of General Visual Representations for Transfer.” arXiv preprint arXiv:1912.11370 (2019).

General object detection

Practical Server-side detection method based on RCNN

Introduction
  • In recent years, object detection tasks have attracted widespread attention. PaddleClas open-sourced the ResNet50_vd_SSLD pretrained model based on ImageNet(Top1 Acc 82.4%). And based on the pretrained model, PaddleDetection provided the PSS-DET (Practical Server-side detection) with the help of the rich operators in PaddleDetection. The inference speed can reach 61FPS on single V100 GPU when COCO mAP is 41.6%, and 20FPS when COCO mAP is 47.8%.
  • We take the standard Faster RCNN ResNet50_vd FPN as an example. The following table shows ablation study of PSS-DET.
| Trick | Train scale | Test scale | COCO mAP | Infer speed/FPS |
|---|---|---|---|---|
| baseline | 640x640 | 640x640 | 36.4% | 43.589 |
| +test proposal=pre/post topk 500/300 | 640x640 | 640x640 | 36.2% | 52.512 |
| +fpn channel=64 | 640x640 | 640x640 | 35.1% | 67.450 |
| +ssld pretrain | 640x640 | 640x640 | 36.3% | 67.450 |
| +ciou loss | 640x640 | 640x640 | 37.1% | 67.450 |
| +DCNv2 | 640x640 | 640x640 | 39.4% | 60.345 |
| +3x, multi-scale training | 640x640 | 640x640 | 41.0% | 60.345 |
| +auto augment | 640x640 | 640x640 | 41.4% | 60.345 |
| +libra sampling | 640x640 | 640x640 | 41.6% | 60.345 |

Based on the ablation experiments, Cascade RCNN and larger inference scale(1000x1500) are used for better performance. The final COCO mAP is 47.8% and the following figure shows mAP-Speed curves for some common detectors.

_images/pssdet.png

Note

For fair comparison, inference time for PSS-DET models on V100 GPU is transformed to Titan V GPU by multiplying by 1.2 times.

For more detailed information, you can refer to PaddleDetection.

Practical Mobile-side detection method based on RCNN

  • This part is coming soon!

extension

Prediction Framework

Introduction

Models for Paddle are stored in many different forms, which can be roughly divided into two categories:

  1. persistable model (the models saved by fluid.save_persistables). The weights are saved as a checkpoint and can be loaded to continue training. Each scattered weight file corresponds to one persistable variable in the model, and these variables carry no structure information, so the weights must be used together with the model structure.

    resnet50-vd-persistable/
    ├── bn2a_branch1_mean
    ├── bn2a_branch1_offset
    ├── bn2a_branch1_scale
    ├── bn2a_branch1_variance
    ├── bn2a_branch2a_mean
    ├── bn2a_branch2a_offset
    ├── bn2a_branch2a_scale
    ├── ...
    └── res5c_branch2c_weights
    
  2. inference model (the models saved by fluid.io.save_inference_model). The model saved by this function can be used for inference directly. Compared with the persistable form, the model structure is saved in addition to the weights, so a model with trained weights can be reconstructed without the training code. As shown in the following figure, the structure information is saved in the model file.

    resnet50-vd-persistable/
    ├── bn2a_branch1_mean
    ├── bn2a_branch1_offset
    ├── bn2a_branch1_scale
    ├── bn2a_branch1_variance
    ├── bn2a_branch2a_mean
    ├── bn2a_branch2a_offset
    ├── bn2a_branch2a_scale
    ├── ...
    ├── res5c_branch2c_weights
    └── model
    

    For convenience, all weight files will be saved into a params file when saving the inference model on Paddle, as shown below:

    resnet50-vd
    ├── model
    └── params
    

Both the training engine and the prediction engine in Paddle support model inference, but no back propagation is performed during inference, so the computation can be specifically optimized (such as layer fusion, kernel selection, etc.) to achieve low latency and high throughput. The training engine can load either the persistable model or the inference model, while the prediction engine only supports the inference model, so three different inference modes are derived:

  1. prediction engine + inference model
  2. training engine + persistable model
  3. training engine + inference model

Regardless of the inference method, it basically includes the following main steps:

  • Engine Build
  • Make Data to Be Predicted
  • Perform Predictions
  • Result Analysis

The inference modes differ mainly in two steps: building the engine and performing the prediction. The following sections introduce them in detail.

Model Transformation

During training, we usually save some checkpoints (persistable models). These are just model weight files and cannot be loaded directly by the prediction engine, so after training we usually pick a suitable checkpoint and convert it to an inference model. There are two main steps: 1. build a training engine, 2. save the inference model, as shown below.

import paddle.fluid as fluid

from ppcls.modeling.architectures.resnet_vd import ResNet50_vd

place = fluid.CPUPlace()
exe = fluid.Executor(place)
startup_prog = fluid.Program()
infer_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
    with fluid.unique_name.guard():
        # Define the input and build the network structure.
        image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
        out = ResNet50_vd().net(input=image, class_dim=1000)

infer_prog = infer_prog.clone(for_test=True)
exe.run(startup_prog)
# Load the trained weights from the persistable model directory.
fluid.load(program=infer_prog, model_path="path_to_persistable_model", executor=exe)

# Save the inference model: the structure goes to `model`, all weights are merged into `params`.
fluid.io.save_inference_model(
        dirname='./output/',
        feeded_var_names=[image.name],
        main_program=infer_prog,
        target_vars=out,
        executor=exe,
        model_filename='model',
        params_filename='params')

A complete example is provided in tools/export_model.py; just execute the following command to complete the conversion:

python tools/export_model.py \
    --m=the name of model \
    --p=the path of persistable model \
    --o=the saved path of model and params

Prediction engine + inference model

A complete example is provided in tools/infer/predict.py; just execute the following command to complete the prediction:

python ./predict.py \
    -i=./test.jpeg \
    -m=./resnet50-vd/model \
    -p=./resnet50-vd/params \
    --use_gpu=1 \
    --use_tensorrt=True

Parameter Description:

  • image_file(shortname: i): the path of the image to predict, such as ./test.jpeg.
  • model_file(shortname: m): the path of the model structure file, such as ./resnet50-vd/model.
  • params_file(shortname: p): the path of the weights file, such as ./resnet50-vd/params.
  • batch_size(shortname: b): batch size, such as 1.
  • ir_optim: whether to use IR optimization, default: True.
  • use_tensorrt: whether to use the TensorRT prediction engine, default: True.
  • gpu_mem: initial GPU memory allocation, in MB.
  • use_gpu: whether to use GPU, default: True.
  • enable_benchmark: whether to enable benchmark, default: False.
  • model_name: the name of the model.

NOTE: when benchmark is enabled, TensorRT is used by default for prediction on Paddle.

Building prediction engine:

from paddle.fluid.core import AnalysisConfig
from paddle.fluid.core import create_paddle_predictor

# Build the config from the model structure file and the merged params file.
config = AnalysisConfig("./resnet50-vd/model", "./resnet50-vd/params")
config.enable_use_gpu(8000, 0)
config.disable_glog_info()
config.switch_ir_optim(True)
config.enable_tensorrt_engine(
        precision_mode=AnalysisConfig.Precision.Float32,
        max_batch_size=1)

# The zero-copy mode requires removing the feed/fetch ops.
config.switch_use_feed_fetch_ops(False)

predictor = create_paddle_predictor(config)

Prediction Execution:

import numpy as np

input_names = predictor.get_input_names()
input_tensor = predictor.get_input_tensor(input_names[0])

# Random data is used here as a placeholder for a preprocessed image batch (NCHW, float32).
input_data = np.random.randn(1, 3, 224, 224).astype("float32")
input_tensor.reshape([1, 3, 224, 224])
input_tensor.copy_from_cpu(input_data)
predictor.zero_copy_run()
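After zero_copy_run returns, the result can be copied back from the output tensor. The snippet below is a minimal sketch, assuming the first output of the network is the classification score vector:

output_names = predictor.get_output_names()
output_tensor = predictor.get_output_tensor(output_names[0])
# Copy the scores back to the host and take the top-1 class.
scores = output_tensor.copy_to_cpu()
top1 = np.argmax(scores[0])
print("top-1 class id: {}, score: {:.4f}".format(top1, scores[0][top1]))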

More information about the parameters can be found in the Paddle Python prediction API. If you need to run predictions in a production environment, we recommend the Paddle C++ prediction API; a rich set of pre-compiled prediction libraries is provided on the official website (Paddle C++ prediction library).

By default, Paddle's wheel package does not include the TensorRT prediction engine. If you need to use TensorRT for prediction optimization, you need to compile the corresponding wheel package yourself. For the compilation method, please refer to Paddle's compilation guide (Paddle compilation).

Training engine + persistable model prediction

A complete example is provided in tools/infer/infer.py; just execute the following command to complete the prediction:

python tools/infer/infer.py \
    --i=the path of images which are needed to predict \
    --m=the name of model \
    --p=the path of persistable model \
    --use_gpu=True

Parameter Description:

  • image_file(shortname: i): the path of the image to predict, such as ./test.jpeg
  • model(shortname: m): the name of the model, such as ResNet50_vd
  • pretrained_model(shortname: p): the path of the persistable model
  • use_gpu: whether to use GPU, default: True.

Training Engine Construction:

Since the persistable model does not contain the structural information of the model, the network structure must be constructed first, and then the weights are loaded to build the training engine.

import paddle.fluid as fluid
from ppcls.modeling.architectures.resnet_vd import ResNet50_vd

place = fluid.CPUPlace()
exe = fluid.Executor(place)
startup_prog = fluid.Program()
infer_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
    with fluid.unique_name.guard():
        # Define the input and build the network structure.
        image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
        out = ResNet50_vd().net(input=image, class_dim=1000)
infer_prog = infer_prog.clone(for_test=True)
exe.run(startup_prog)
# Load the trained weights from the persistable model directory.
fluid.load(program=infer_prog, model_path="path_to_persistable_model", executor=exe)

Perform inference:

# `data` is a preprocessed image batch with shape [N, 3, 224, 224] and dtype float32.
outputs = exe.run(infer_prog,
        feed={image.name: data},
        fetch_list=[out.name],
        return_numpy=False)

For the above parameter descriptions, please refer to the official website fluid.Executor
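For reference, the data fed above must be preprocessed in the same way as during training. The following is a minimal sketch of that preprocessing, assuming OpenCV is available and the standard resize-256 / center-crop-224 / ImageNet-normalization pipeline is used:

import cv2
import numpy as np

def preprocess(img_path):
    # Decode, resize the short side to 256 and center-crop to 224x224.
    img = cv2.imread(img_path)[:, :, ::-1].astype("float32")  # BGR -> RGB
    h, w = img.shape[:2]
    scale = 256.0 / min(h, w)
    img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    h, w = img.shape[:2]
    top, left = (h - 224) // 2, (w - 224) // 2
    img = img[top:top + 224, left:left + 224, :]

    # Normalize with ImageNet mean/std and convert HWC -> CHW.
    img = img / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype="float32")
    std = np.array([0.229, 0.224, 0.225], dtype="float32")
    img = (img - mean) / std
    img = img.transpose((2, 0, 1))
    return np.expand_dims(img.astype("float32"), axis=0)  # [1, 3, 224, 224]

data = preprocess("./test.jpeg")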

Training engine + inference model prediction

A complete example is provided in tools/infer/py_infer.py; just execute the following command to complete the prediction:

python tools/infer/py_infer.py \
    --i=the path of images \
    --d=the path of saved model \
    --m=the path of saved model file \
    --p=the path of saved weight file \
    --use_gpu=True
  • image_file(shortname: i): the path of the image to predict, such as ./test.jpeg
  • model_file(shortname: m): the path of the model structure file, such as ./resnet50_vd/model
  • params_file(shortname: p): the path of the weights file, such as ./resnet50_vd/params
  • model_dir(shortname: d): the folder of the model, such as ./resnet50_vd
  • use_gpu: whether to use GPU, default: True

Training engine build

Since the inference model contains the model structure, we do not need to construct the network first; the model file and weights file can be loaded directly to build the training engine.

import paddle.fluid as fluid

place = fluid.CPUPlace()
exe = fluid.Executor(place)
# Load the structure and the merged weights of the inference model.
[program, feed_names, fetch_lists] = fluid.io.load_inference_model(
        "path_to_saved_model",
        exe,
        model_filename='model',
        params_filename='params')
compiled_program = fluid.compiler.CompiledProgram(program)

load_inference_model supports not only a collection of scattered weight files but also a single merged weights file.

Perform inference:

outputs = exe.run(compiled_program,
        feed={feed_names[0]: data},
        fetch_list=fetch_lists,
        return_numpy=False)

For the above parameter descriptions, please refer to the official website fluid.Executor

Paddle-Lite

Introduction

Paddle-Lite is a set of lightweight inference engines that is fully functional, easy to use and performs well. Its light weight comes from using fewer bits to represent the weights and activations of the neural network, which greatly reduces the model size, eases the limited storage space of mobile devices, and delivers inference speed that is, on the whole, better than that of other frameworks.

In PaddleClas, we use Paddle-Lite to evaluate performance on mobile devices. In this section, we take the MobileNetV1 model trained on the ImageNet1k dataset as an example to introduce how to use Paddle-Lite to evaluate model speed on a mobile terminal (evaluated on SD855).

Evaluation Steps

Export the Inference Model
  • First, transform the model saved during training into an inference model that can be used for deployment. The inference model can be exported with tools/export_model.py as follows.
python tools/export_model.py -m MobileNetV1 -p pretrained/MobileNetV1_pretrained/ -o inference/MobileNetV1

Finally, the model and params files will be saved in inference/MobileNetV1.

Download Benchmark Binary File
  • Use the adb (Android Debug Bridge) tool to connect the Android phone and the PC for development and debugging. After installing adb and making sure the PC and the phone are successfully connected, use the following command to view the ARM version of the phone and select the pre-compiled library accordingly.
adb shell getprop ro.product.cpu.abi
  • Download Benchmark_bin File
wget -c https://paddle-inference-dist.bj.bcebos.com/PaddleLite/benchmark_0/benchmark_bin_v8

If the ARM version is v7, the v7 benchmark_bin file should be downloaded with the following command.

wget -c https://paddle-inference-dist.bj.bcebos.com/PaddleLite/benchmark_0/benchmark_bin_v7
Inference benchmark

After the PC and mobile phone are successfully connected, use the following command to start the model evaluation.

sh tools/lite/benchmark.sh ./benchmark_bin_v8 ./inference result_armv8.txt true

Where ./benchmark_bin_v8 is the path of the benchmark binary file, ./inference is the path of all the models that need to be evaluated, result_armv8.txt is the result file, and the final parameter true means that the model will be optimized before evaluation. Eventually, the evaluation result file of result_armv8.txt will be saved in the current folder. The specific performances are as follows.

PaddleLite Benchmark
Threads=1 Warmup=10 Repeats=30
MobileNetV1                           min = 30.89100    max = 30.73600    average = 30.79750

Threads=2 Warmup=10 Repeats=30
MobileNetV1                           min = 18.26600    max = 18.14000    average = 18.21637

Threads=4 Warmup=10 Repeats=30
MobileNetV1                           min = 10.03200    max = 9.94300     average = 9.97627

The above shows the model inference speed under different numbers of threads, and the unit is FPS. Taking one thread as an example, the average speed of MobileNetV1 on SD855 is 30.79750FPS.

Model Optimization and Speed Evaluation
  • As mentioned in the previous section, the model will be optimized before evaluation. You can also optimize the model first and then directly load the optimized model for speed evaluation.
  • Paddle-Lite provides multiple strategies to automatically optimize the original training model, including quantization, subgraph fusion, hybrid scheduling, kernel optimization and so on. To make the optimization easier to use, the opt tool is provided to automatically complete these steps and output a lightweight, optimized and executable Paddle-Lite model; it can be downloaded from the Paddle-Lite Model Optimization Page. Here we take MacOS as the development environment, download the opt_mac model optimization tool and use the following commands to optimize the model.
model_file="../MobileNetV1/model"
param_file="../MobileNetV1/params"
opt_models_dir="./opt_models"
mkdir ${opt_models_dir}
./opt_mac --model_file=${model_file} \
    --param_file=${param_file} \
    --valid_targets=arm \
    --optimize_out_type=naive_buffer \
    --prefer_int8_kernel=false \
    --optimize_out=${opt_models_dir}/MobileNetV1

Where model_file and param_file are the paths of the exported model file and parameter file respectively. After the transformation succeeds, MobileNetV1.nb will be saved in opt_models.

Use the benchmark_bin file to load the optimized model for evaluation. The commands are as follows.

bash benchmark.sh ./benchmark_bin_v8 ./opt_models result_armv8.txt

Finally, the result is saved in result_armv8.txt and shown as follows.

PaddleLite Benchmark
Threads=1 Warmup=10 Repeats=30
MobileNetV1_lite              min = 30.89500    max = 30.78500    average = 30.84173

Threads=2 Warmup=10 Repeats=30
MobileNetV1_lite              min = 18.25300    max = 18.11000    average = 18.18017

Threads=4 Warmup=10 Repeats=30
MobileNetV1_lite              min = 10.00600    max = 9.90000     average = 9.96177

Taking one thread as an example, the average speed of MobileNetV1 on SD855 is 30.84173FPS.

More detailed parameter explanations and Paddle-Lite usage can be found in the Paddle-Lite docs.

Model Quantization

Int8 quantization is one of the key features of PaddleSlim. It supports two kinds of quantization-aware training, the dynamic strategy and the static strategy, as well as layer-wise and channel-wise quantization, and the models generated by PaddleSlim can be deployed with Paddle-Lite.

Using this toolkit, PaddleClas quantized the MobileNetV3_large_x1_0 model, whose accuracy is 78.9% after distillation. After quantization, the prediction speed on SD855 is accelerated from 19.308ms to 14.395ms, the storage size is reduced from 21M to 10M, and the Top-1 accuracy is 75.9%. For specific training methods, please refer to PaddleSlim quant aware.
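For illustration, the following is a minimal sketch of quantization-aware training with PaddleSlim's quant_aware/convert API under the Paddle 1.x static graph. The network is built in the same way as in the Model Transformation section above, and the quantization configuration values are assumptions; the authoritative recipe is the PaddleSlim quant aware documentation referenced above.

import paddle.fluid as fluid
from paddleslim.quant import quant_aware, convert
from ppcls.modeling.architectures.resnet_vd import ResNet50_vd

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)

# Build an ordinary training program: ResNet50_vd with a softmax cross-entropy loss.
train_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
    label = fluid.data(name='label', shape=[None, 1], dtype='int64')
    out = ResNet50_vd().net(input=image, class_dim=1000)
    loss = fluid.layers.mean(fluid.layers.softmax_with_cross_entropy(out, label))

# Keep a test-mode copy before adding the optimizer ops.
val_prog = train_prog.clone(for_test=True)
with fluid.program_guard(train_prog, startup_prog):
    fluid.optimizer.Momentum(learning_rate=0.0001, momentum=0.9).minimize(loss)
exe.run(startup_prog)

# Insert fake-quantization ops for weights and activations (assumed config values).
quant_config = {
    'weight_quantize_type': 'channel_wise_abs_max',
    'activation_quantize_type': 'moving_average_abs_max',
}
quant_train_prog = quant_aware(train_prog, place, quant_config, for_test=False)
quant_val_prog = quant_aware(val_prog, place, quant_config, for_test=True)

# ... finetune quant_train_prog with exe.run(...) as in normal training ...

# Convert the quantization-aware test program into the final int8 inference program.
final_prog = convert(quant_val_prog, place, quant_config)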

Distributed Training

Distributed deep neural network training is highly efficient in PaddlePaddle and is one of its core technical advantages. On image classification tasks, distributed training can achieve an almost linear speedup. Fleet is the high-level API for distributed training in PaddlePaddle. With Fleet, a user can easily move from single-machine PaddlePaddle code to distributed code. PaddleClas uses the Fleet API so that both single-machine and multi-machine training are supported. For more information about distributed training, please refer to the Fleet API documentation. A minimal sketch of the Fleet usage pattern is shown below.
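The following is a minimal sketch of the collective Fleet usage pattern under the Paddle 1.x static graph API; the toy network and optimizer are placeholders, and the actual PaddleClas integration lives in its training tools.

import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.base import role_maker
from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy

# Initialize Fleet in collective (multi-GPU) mode; run with paddle.distributed.launch.
fleet.init(role_maker.PaddleCloudRoleMaker(is_collective=True))

# A toy network standing in for the real classification model and loss.
image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
label = fluid.data(name='label', shape=[None, 1], dtype='int64')
feat = fluid.layers.fc(input=fluid.layers.flatten(image), size=102)
loss = fluid.layers.mean(fluid.layers.softmax_with_cross_entropy(feat, label))

# Wrapping the optimizer with Fleet is the main change needed for distributed training.
optimizer = fluid.optimizer.Momentum(learning_rate=0.1, momentum=0.9)
optimizer = fleet.distributed_optimizer(optimizer, strategy=DistributedStrategy())
optimizer.minimize(loss)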

Paddle Hub

PaddleHub is a pre-trained model application tool for PaddlePaddle. Developers can conveniently use the high-quality pre-trained model combined with Fine-tune API to quickly complete the whole process from model migration to deployment. All the pre-trained models of PaddleClas have been collected by PaddleHub. For further details, please refer to PaddleHub website.

Model Service Deployment

Overview

Paddle Serving aims to help deep-learning researchers easily deploy online inference services, supporting one-click industrial-grade deployment, high concurrency and efficient communication between client and server, and multiple programming languages for client development.

Taking HTTP inference service deployment as an example, this section introduces how to use Paddle Serving to deploy model services in PaddleClas.

Serving Install

It is recommended to use Docker to install and deploy the Serving environment, as described on the Serving official website. First, pull the Docker image and create a Serving-based container.

nvidia-docker pull hub.baidubce.com/paddlepaddle/serving:0.2.0-gpu
nvidia-docker run -p 9292:9292 --name test -dit hub.baidubce.com/paddlepaddle/serving:0.2.0-gpu
nvidia-docker exec -it test bash

In the Docker container, you need to install some packages for Serving.

pip install paddlepaddle-gpu
pip install paddle-serving-client
pip install paddle-serving-server-gpu
  • If the installation is too slow, you can add -i https://pypi.tuna.tsinghua.edu.cn/simple to the pip command to speed it up.
  • If you want to deploy a CPU service, install the CPU version of Serving with the following command.
pip install paddle-serving-server
Export Model

Export the Serving model using tools/export_serving_model.py. Taking ResNet50_vd as an example, the command is as follows.

python tools/export_serving_model.py -m ResNet50_vd -p ./pretrained/ResNet50_vd_pretrained/ -o serving

Finally, the client configuration and the model parameters and structure files will be saved in ppcls_client_conf and ppcls_model respectively.

Service Deployment and Request
  • Use the following command to start the Serving service.
python tools/serving/image_service_gpu.py serving/ppcls_model workdir 9292

serving/ppcls_model is the path of the Serving model just saved, workdir is the working directory, and 9292 is the port of the service.

  • Use the following script to send an identification request to the Serving service and get the result.
python tools/serving/image_http_client.py  9292 ./docs/images/logo.png

9292 is the port for sending the request, which must be consistent with the port used when starting the Serving service, and ./docs/images/logo.png is the test image; the final top-1 label and probability are returned. An illustrative client sketch is shown below.
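For illustration only, a client request might look like the sketch below; the endpoint path and JSON payload format are assumptions, and the authoritative client is tools/serving/image_http_client.py.

import base64
import json

import requests

# Assumed endpoint and payload format; check tools/serving/image_http_client.py
# for the exact request the service expects.
url = "http://127.0.0.1:9292/image/prediction"
with open("./docs/images/logo.png", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(url, data=json.dumps({"image": image}),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # expected to contain the top-1 label and probability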

Competition Support

PaddleClas stems from Baidu's visual business applications and its exploration of frontier visual capabilities. It has helped us achieve leading results in many key competitions, and continues to promote more frontier visual solutions and applications.

  • 1st place in 2018 Kaggle Open Images V4 object detection challenge
  • 2nd place in 2019 Kaggle Open Images V5 object detection challenge
  • 2nd place in Kaggle Landmark Retrieval Challenge 2019
  • 2nd place in Kaggle Landmark Recognition Challenge 2019
  • A-level certificate of three tasks: printed text OCR, face recognition and landmark recognition in the first multimedia information recognition technology competition

Release Notes

  • 2020.06.17
    • Add English documents.
  • 2020.06.12
    • Add support for training and evaluation on Windows or CPU.
  • 2020.05.17
    • Add support for mixed precision training.
  • 2020.05.09
    • Add user guide about Paddle Serving and Paddle-Lite.
    • Add benchmark about FP16/FP32 on T4 GPU.
  • 2020.04.14
    • First commit.

FAQ

  • Q: Why are the metrics different for different cards?
  • A: Fleet is the default option for PaddleClas. Each GPU card is taken as a single trainer and deals with different images, which causes the small difference in the final metrics. Single-card evaluation is suggested to get accurate results if you use tools/eval.py. You can also use tools/eval_multi_platform.py to evaluate the models on multiple GPU cards, which is also supported on Windows and CPU.
  • Q: Why is Mixup or Cutmix not used even though I have already added the data operation in the configuration file?
  • A: When using Mixup or Cutmix, you also need to add use_mix: True in the configuration file to make it work properly.
  • Q: During evaluation and inference, the pretrained model path is assigned, but the weights cannot be imported. Why?
  • A: The prefix of the pretrained model is needed. For example, if the pretrained weights are located in output/ResNet50_vd/19, with the filename output/ResNet50_vd/19/ppcls.pdparams, then pretrained_model in the configuration file needs to be output/ResNet50_vd/19/ppcls.
  • Q: Why are the metrics 0.3% lower than those shown in the model zoo for the EfficientNet series of models?
  • A: The resize method is set to Cubic for EfficientNet (interpolation is set to 2 in OpenCV), while for other models it is Bilinear (interpolation is set to None in OpenCV). Therefore, you need to set the interpolation explicitly in ResizeImage. Specifically, the following configuration is a demo for EfficientNet, and the mapping of the interpolation value to OpenCV flags is illustrated after the configuration.
VALID:
    batch_size: 16
    num_workers: 4
    file_list: "./dataset/ILSVRC2012/val_list.txt"
    data_dir: "./dataset/ILSVRC2012/"
    shuffle_seed: 0
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - ResizeImage:
            resize_short: 256
            interpolation: 2
        - CropImage:
            size: 224
        - NormalizeImage:
            scale: 1.0/255.0
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
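For reference, the interpolation value in the configuration corresponds directly to OpenCV's interpolation flags, which can be checked in Python (assuming OpenCV is installed):

import cv2

# OpenCV interpolation flags: 2 corresponds to bicubic interpolation.
print(cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC)  # 0 1 2
img = cv2.imread("./test.jpeg")
resized = cv2.resize(img, (256, 256), interpolation=cv2.INTER_CUBIC)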
  • Q: What should I do if I want to transform the weights from the pdparams format to the earlier format (before Paddle 1.7.0), which consists of scattered files?
  • A: You can use fluid.load to load the pdparams weights and use fluid.io.save_vars to save the weights as scattered files, as sketched below.
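The following is a minimal sketch of that conversion, assuming infer_prog and exe have been built as in the Training Engine Construction section above and that the checkpoint path is only an example:

import paddle.fluid as fluid

# Load the combined pdparams weights into the program (example path).
fluid.load(program=infer_prog, model_path="output/ResNet50_vd/19/ppcls", executor=exe)

# Save every persistable variable as one scattered file per variable,
# which is the pre-1.7.0 weights layout.
fluid.io.save_vars(
        executor=exe,
        dirname="./scattered_weights",
        main_program=infer_prog,
        predicate=fluid.io.is_persistable)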