Deep Learning Performance on T4 GPUs with MLPerf Benchmarks (2024)

Article written by Rengan Xu, Frank Han, and Quy Ta of HPC and AI Innovation Lab in March 2019

Abstract

Turing architecture is NVIDIA’s latest GPU architecture after Volta architecture and the new T4 is based on Turing architecture. It was designed for High-Performance Computing (HPC), deep learning training and inference, machine learning, data analytics, and graphics. This blog will quantify the deep learning training performance of T4 GPUs on Dell EMC PowerEdge R740 server with MLPerf benchmark suite. MLPerf performance on T4 will also be compared to V100-PCIe on the same server with the same software.

Overview

The Dell EMC PowerEdge R740 is a 2-socket, 2U rack server. The system features Intel Skylake processors, up to 24 DIMMs, and up to 3 double width V100-PCIe or 4 single width T4 GPUs in x16 PCIe 3.0 slots. T4 is the GPU that uses NVIDIA’s latest Turing architecture. The specification differences of T4 and V100-PCIe GPU are listed in Table 1. MLPerf was chosen to evaluate the performance of T4 in deep learning training. MLPerf is a benchmarking tool that was assembled by a diverse group from academia and industry including Google, Baidu, Intel, AMD, Harvard, and Stanford etc., to measure the speed and performance of machine learning software and hardware. The initial released version is v0.5 and it covers model implementations in different machine learning domains including image classification, object detection and segmentation, machine translation and reinforcement learning. The summary of MLPerf benchmarks used for this evaluation is shown in Table 2. The ResNet-50 TensorFlow implementation from Google’s submission was used, and all other models’ implementations from NVIDIA’s submission were used. All benchmarks were run on bare-metal without a container. Table 3 lists the hardware and software used for the evaluation. The T4 performance with MLPerf benchmarks will be compared to V100-PCIe.

	Tesla V100-PCIe	Tesla T4
Architecture	Volta	Turing
CUDA Cores	5120	2560
Tensor Cores	640	320
Compute Capability	7.0	7.5
GPU Clock	1245 MHz	585 MHz
Boost Clock	1380 MHz	1590 MHz
Memory Type	HBM2	GDDR6
Memory Size	16GB/32GB	16GB
Bandwidth	900GB/s	320GB/s
Slot Width	Dual-Slot	Single-Slot
Single-Precision (FP32)	14 TFLOPS	8.1 TFLOPS
Mixed-Precision (FP16/FP32)	112 TFLOPS	65 TFLOPS
Double-Precision (FP64)	7 TFLOPS	254.4 GFLOPS
TDP	250 W	70 W

Table 1: The comparison between T4 and V100-PCIe

	Image Classification	Object Classification	Object Instance Segmentation	Translation (Recurrent)	Tanslation (Non-Recurrent)	Recommendation
Data	ImageNet	COCO	COCO	WMT E-G	WMT E-G	MovieLens-20M
Data Size	144GB	20GB	20GB	37GB	1.3GB	306MB
Model	ResNet-50 v1.5	Single-Stage Detector (SSD)	Mask-R-CNN	GNMT	Transformer	NCF
Framework	TensorFlow	PyTorch	PyTorch	PyTorch	PyTorch	PyTorch

Platform	PowerEdge R740
CPU	2x Intel Xeon Gold 6136 @3.0GHz (SkyLake)
Memory	384GB DDR4 @ 2666MHz
Storage	782TB Lustre
GPU	T4, V100-PCIe
OS and Firmware
Operating System	Red Hat® Enterprise Linux® 7.5 x86_64
Linux Kernal	3.10.0-693.el7.x86_64
BIOS	1.6.12
Deep Learning Related
CUDA compiler and GPU driver	CUDA 10.0.130 (410.66)
CUDNN	7.4.1
NCCL	2.3.7
TensorFlow	nightly-gpu-dev20190130
PyTorch	1.0.0
MLPerf	V0.5

Performance Evaluation

Figure 1 shows the performance results of MLPerf on T4 and V100-PCIe on PowerEdge R740 server. Six benchmarks from MLPerf are included. For each benchmark, the end-to-end model training was performed to reach the target model accuracy defined by MLPerf committee. The training time in minutes was recorded for each benchmark. The following conclusions can be made based on these results:

Conclusions and Future Work

In this blog, we evaluated the performance of T4 GPUs on Dell EMC PowerEdge R740 server using various MLPerf benchmarks. The T4’s performance was compared to V100-PCIe using the same server and software. Overall, V100-PCIe is 2.2x – 3.6x faster than T4 depending on the characteristics of each benchmark. One observation is that some models are stable no matter what random seed values are used, but other models including GNMT, NCF and Transformer are highly impacted by random seed. In future work, we will finetune the hyper-parameters to make the unstable models converge with less epochs. We will also run MLPerf on more GPUs and more nodes to evaluate the scalability of those models on PowerEdge servers.

*Disclaimer: For purpose of benchmarking, four T4 GPUs in the Dell EMC PowerEdge R740 was evaluated. Currently the PowerEdge R740 officially supports a maximum of three T4 in x16 PCIe slots.

Deep Learning Performance on T4 GPUs with MLPerf Benchmarks (2024)

Table of Contents:

Abstract

Overview

Performance Evaluation

Conclusions and Future Work