Mixed Precision Training on Tesla T4 and P100


(This post is also published on my personal blog.)

tl;dr: the power of Tensor Cores is real. Also, make sure the CPU does not become the bottleneck.

I’ve written about Apex in this previous post: Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch. At that time I only had my GTX 1070 to experiment on, and as we learned in that post, pre-Volta NVIDIA cards do not benefit from half-precision arithmetic in terms of speed; it only saves some GPU memory. Therefore, I wasn’t able to personally evaluate how much of a speed boost we can get from mixed precision with Tensor Cores.

Recently, Google Colab started allocating Tesla T4 GPUs, which have 320 Turing Tensor Cores, to its free GPU runtime. It is a perfect opportunity to do a second run of the previous experiments. (Runtimes with the K80 GPU are still being allocated as well, so make sure you have the correct runtime.)

Kaggle also just replaced the K80 with the P100 in its Kernel offering. We’ve mentioned a source claiming the P100 can benefit from half-precision arithmetic for certain networks, so we’re also going to give it a try.

Setup

  • Dataset: CIFAR-10
  • Batch size: 128
  • Model: Wide ResNet
  • 10 epochs
  • SGD with momentum
  • Linear LR scheduler with warmup

GitHub repo: ceshine/apex_pytorch_cifar_experiment.
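
For reference, here is a minimal sketch of how a training loop like this is wired up with Apex's amp API. The model is a torchvision ResNet-18 standing in for the Wide ResNet used in the experiments, the hyperparameters are illustrative, and train_loader is assumed to be a CIFAR-10 DataLoader defined elsewhere; the actual code lives in the repo above.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18  # stand-in for the Wide ResNet used in the experiments
from apex import amp

model = resnet18(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# opt_level="O1" patches operations to run in mixed precision;
# "O0" is pure FP32 and "O2" casts the model itself to FP16 more aggressively.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for images, labels in train_loader:  # train_loader: a CIFAR-10 DataLoader (assumed defined)
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    # Scale the loss so FP16 gradients do not underflow, then backpropagate.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```

The O0, O1, and O2 labels in the observations below refer to these Apex optimization levels.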

Google Colab

Notebook snapshots are stored in the colab_snapshots subfolder.


Kaggle Kernel used: APEX Experiment — Cifar 10.

  1. Since the model was only trained for 10 epochs to save time, the validation accuracy does not carry any meaning beyond indicating whether the model is converging.
  2. Training with mixed precision on the T4 is almost twice as fast as with single precision, and it consistently consumes less GPU memory.
  3. Training Wide ResNet with mixed precision on the P100 does not have any significant effect on speed. The GPU memory footprints are quite bizarre, though; theoretically, at least the O2 level should use much less memory than that.
  4. Batch size matters. Because both Kaggle and Colab equip their instances with only two weak vCPUs, data preprocessing and loading can quickly become the bottleneck. (With a batch size of 512, training under O0, O1, and O2 took almost the same time, as most of the time was spent waiting for the CPU.) This problem is much more severe when training smaller models; see the data loading sketch after this list.
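
One way to squeeze the most out of the two vCPUs mentioned in point 4 is to make sure the DataLoader uses worker processes and pinned memory. The transforms below are an illustrative CIFAR-10 preprocessing pipeline, not necessarily the one used in the experiments.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Keep CPU-side preprocessing cheap: only the transforms you actually need.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_transform)

# num_workers=2 matches the two vCPUs on Colab/Kaggle instances;
# pin_memory=True speeds up host-to-GPU transfers.
train_loader = DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2, pin_memory=True
)
```

Even with this, a large enough batch size can still leave the GPU idle while it waits for data, which is exactly what the 512-batch runs showed.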