
Device performance in OpenCL DES

by MICHAL 2023-01-17 / 2024-08-29

Among the various computing devices I have, one stands out: the NVIDIA Quadro NVS 140M, because it supports only FP32 (float) operations and not FP64 (double). It is simply too old. OpenCL has a pow function, which takes a floating-point base and exponent (float, or double where the device supports it), and a pown function, which takes an integer exponent. I use the first one with double arguments to actually benchmark double-precision computation.
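The post does not show the benchmark kernels, so the following is only a minimal sketch of how the two variants might look and how one could choose between them. The kernel names bench_pow and bench_pown, the helper pick_kernel, and the idea of checking the device extension string for cl_khr_fp64 are assumptions made for illustration, not the code used for the measurements.

```python
# Hypothetical OpenCL C kernel sources, kept as strings; not the author's code.
FP64_KERNEL = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void bench_pow(__global double *data, const double exponent) {
    size_t i = get_global_id(0);
    data[i] = pow(data[i], exponent);   /* floating-point exponent, FP64 */
}
"""

FP32_KERNEL = """
__kernel void bench_pown(__global float *data, const int exponent) {
    size_t i = get_global_id(0);
    data[i] = pown(data[i], exponent);  /* integer exponent, FP32 */
}
"""


def pick_kernel(device_extensions):
    """Return (kernel_name, source) for a space-separated extension string.

    Assumes the string comes from the device's CL_DEVICE_EXTENSIONS query.
    """
    if "cl_khr_fp64" in device_extensions.split():
        return "bench_pow", FP64_KERNEL
    # Devices without double support (such as the NVS 140M) fall back to FP32.
    return "bench_pown", FP32_KERNEL
```

A device that does not report cl_khr_fp64 would get the pown variant, which matches the "(FP32)" marking of the NVS 140M in the table.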

Model	Year	Cores	Compute units	Clock (MHz)	GFLOPS (FP32/FP64)	1k (ms)	10k (ms)	100k (ms)
NVS 4200M	2011	48	1	1480	156/12	13	116	1163
Tesla K20xm	2012	2688	14	732	3935/1312	2	3	24
Intel i7 2640M	2011	2	4	2800	n/a	3	27	281
RTX 3050 Ti Mobile	2021	2560	20	1695	5299/83	2	10	90
Intel UHD 10200H	2020	192	12	1050	422/?	4	19	192
NVS 140M (FP32)	2007	16	2	800	25/-	47	445	4453

The fastest device in this comparison is the Tesla K20xm, which I find a little surprising because it is from 2012 and it beats the RTX 3050 Ti Mobile from 2021. However, once we take into account that the Tesla's theoretical FP64 performance is about 15 times greater than the RTX's (1312 vs 83 GFLOPS), although the difference in actual time is only about 4x, it should be obvious why it wins.
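The two ratios mentioned above can be checked with a few lines of arithmetic. This sketch only reuses figures from the table (theoretical FP64 GFLOPS and the 100k timings); the variable names are made up for the example.

```python
# Figures copied from the table: theoretical FP64 GFLOPS and measured ms at 100k elements.
fp64_gflops = {"Tesla K20xm": 1312, "RTX 3050 Ti Mobile": 83}
time_100k_ms = {"Tesla K20xm": 24, "RTX 3050 Ti Mobile": 90}

# How much faster the Tesla should be on paper.
theoretical_ratio = fp64_gflops["Tesla K20xm"] / fp64_gflops["RTX 3050 Ti Mobile"]
# How much faster it actually was (a lower time is better, so the ratio is inverted).
measured_ratio = time_100k_ms["RTX 3050 Ti Mobile"] / time_100k_ms["Tesla K20xm"]

print(f"theoretical FP64 advantage: {theoretical_ratio:.1f}x")  # 15.8x
print(f"measured speedup at 100k:   {measured_ratio:.2f}x")     # 3.75x
```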

To be honest, I have no need to use double here (integers would be just fine), but it is a great chance to see performance differences between various devices. With FP32 it would be quite difficult to get such a broad range of timings. Using pown(float, integer) changes the above table a little, as we switch to FP32 computations (timings at 100k elements):

Tesla K20xm: 12ms
RTX 3050 Ti Mobile: 3ms
NVS 4200M: 352ms
NVS 140M: 4453ms
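To see how much each device gains from switching to pown, the 100k timings of both runs can be divided. This is a small sketch using only the numbers given above; the NVS 140M is left out because it is listed with the same 4453 ms in both cases.

```python
# 100k-element timings in ms: table values (pow, double) and list values (pown, float).
pow_fp64_ms = {"Tesla K20xm": 24, "RTX 3050 Ti Mobile": 90, "NVS 4200M": 1163}
pown_fp32_ms = {"Tesla K20xm": 12, "RTX 3050 Ti Mobile": 3, "NVS 4200M": 352}

# Speedup factor per device when moving from pow(double) to pown(float).
speedup = {name: pow_fp64_ms[name] / pown_fp32_ms[name] for name in pow_fp64_ms}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x faster with pown")
```

By these numbers the RTX 3050 Ti Mobile gains the most (30x), while the Tesla K20xm gains only 2x.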

Now I look at those timings against theoretical performance measured in GFLOPS. Comparing NVS 4200M and NVS 140M, the theoretical ratio is approx. 6 (156 vs 25), but the timing ratio from the table is only close to 4 (4453 ms vs 1163 ms), so other factors come into play here as well. Note that 1163 ms is the NVS 4200M's double-precision pow timing; with its pown timing of 352 ms the ratio is about 12.7. Comparing RTX 3050 Ti and Tesla K20xm, the theoretical ratio is about 1.35 (5299 vs 3935), but the timing ratio is 4 (12 ms vs 3 ms). So the actual performance gain is much higher than I would expect from comparing GFLOPS figures.
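The comparison can be written down as a tiny helper that returns the theoretical ratio and the measured ratio side by side. The function name ratio_gap is made up for this sketch; all inputs are the FP32 GFLOPS figures and 100k timings quoted in this post.

```python
def ratio_gap(gflops_fast, gflops_slow, ms_fast, ms_slow):
    """Return (theoretical ratio, measured ratio) for a pair of devices."""
    return gflops_fast / gflops_slow, ms_slow / ms_fast


# RTX 3050 Ti Mobile vs Tesla K20xm: FP32 GFLOPS and pown timings at 100k.
rtx_vs_tesla = ratio_gap(5299, 3935, 3, 12)
# NVS 4200M vs NVS 140M: FP32 GFLOPS and the table timings at 100k.
nvs_pair = ratio_gap(156, 25, 1163, 4453)

for label, (theory, measured) in [("RTX vs Tesla", rtx_vs_tesla),
                                  ("NVS 4200M vs NVS 140M", nvs_pair)]:
    print(f"{label}: theoretical {theory:.2f}x, measured {measured:.2f}x")
```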

Getting a Tesla K20xm is a steal for FP64 computations, as its FP64 performance is on a similar level to that of the RTX 4090.

MICHAŁ SOBCZAK
