Device performance in OpenCL DES
Among the various computing devices I have, one stands out: the NVIDIA Quadro NVS 140M, because it supports only FP32 (float) operations, not FP64 (double). It is simply too old. In OpenCL we have both the pow function, which is overloaded for double and float arguments, and a variant that takes an integer exponent, called pown. I use the former to actually benchmark double-precision computation.
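To make the distinction concrete, here is a minimal kernel sketch (kernel and buffer names are mine, not taken from the actual benchmark): pow takes a floating-point exponent, while pown takes an integer one, so the second kernel works even on FP32-only devices.

```c
// FP64 path: pow(double, double) -- requires the cl_khr_fp64 extension,
// which the NVS 140M does not expose.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void bench_fp64(__global double *data) {
    size_t i = get_global_id(0);
    data[i] = pow(data[i], 2.5);   // double base, double exponent
}

// FP32 path: pown(float, int) -- runs on FP32-only hardware.
__kernel void bench_fp32(__global float *data) {
    size_t i = get_global_id(0);
    data[i] = pown(data[i], 3);    // float base, integer exponent
}
```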
| Device | Year | Cores/Shaders | Threads/CUs | Clock (MHz) | GFLOPS FP32/FP64 | 1k (ms) | 10k (ms) | 100k (ms) |
|---|---|---|---|---|---|---|---|---|
| Intel i7 2640M | 2011 | 2 | 4 | 2800 | n/a | 3 | 27 | 281 |
| RTX 3050 Ti Mobile | 2021 | 2560 | 20 | 1695 | 5299/83 | 2 | 10 | 90 |
| Intel UHD 10200H | 2020 | 192 | 12 | 1050 | 422/? | 4 | 19 | 192 |
| NVS 140M (FP32) | 2007 | 16 | 2 | 800 | 25/- | 47 | 445 | 4453 |
The fastest device in this comparison is the Tesla K20xm, which I find a little surprising: it is from 2012, yet it beats the RTX 3050 Ti Mobile from 2021. However, once we take into account that the Tesla's FP64 throughput is about 15 times higher than the RTX's (though it is only 4x faster in actual time), it becomes obvious why it wins.
To be honest, I have no need for double precision here (integer would be just fine), but it is a great chance to see the performance differences between various devices. With FP32 alone it would be difficult to get such a broad range of timings. Using pown(float, int) changes the table above a little, as we switch to FP32 computation (timings at 100k elements):
- Tesla K20xm: 12ms
- RTX 3050 Ti Mobile: 3ms
- NVS 4200M: 352ms
- NVS 140M: 4453ms
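For reference, per-size timings like the ones above can be collected host-side with OpenCL event profiling. This is only a sketch of one way to do it, not the benchmark's actual harness: it assumes a working OpenCL runtime and GPU, the kernel and variable names are mine, and error handling is mostly omitted.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform; cl_device_id device; cl_int err;
    // Grab the first GPU on the first platform.
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    // Profiling must be enabled on the queue to read event timestamps.
    cl_command_queue queue =
        clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, &err);

    // FP32 benchmark kernel using pown(float, int).
    const char *src =
        "__kernel void bench(__global float *d) {"
        "    size_t i = get_global_id(0);"
        "    d[i] = pown(d[i], 3);"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "bench", &err);

    size_t n = 100000;  // 100k elements, as in the table above
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    // Device-side start/end timestamps are reported in nanoseconds.
    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(t1), &t1, NULL);
    printf("kernel time: %.3f ms\n", (t1 - t0) / 1e6);
    return 0;
}
```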
Now let me look at those timings in terms of theoretical performance measured in GFLOPS. Comparing the NVS 4200M and the NVS 140M, the GFLOPS ratio is roughly 6 (156 vs 25), but the timing ratio is only about 4, so other factors clearly come into play here as well. Comparing the RTX 3050 Ti and the Tesla K20xm, the GFLOPS ratio is about 1.35 (5299 vs 3935), but the timing ratio is 4 (12ms vs 3ms). So the actual performance gain is much higher than I would expect from the GFLOPS figures alone.
Getting a Tesla K20xm is a steal in terms of FP64 computation, as it performs at a similar level to an RTX 4090.