Device performance in OpenCL DES

Among the computing devices I have, one stands out: the NVIDIA Quadro NVS 140M, which supports only FP32 (float) operations and not FP64 (double). It is simply too old. OpenCL offers pow, which accepts both float and double arguments, and pown, which takes an integer exponent. I use the former with double arguments to benchmark double-precision computation.
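As a rough illustration of the two variants, here is a minimal sketch that keeps two OpenCL C kernel sources as Python strings and picks one based on the device's extension string. The kernel names power_fp64 and power_fp32, the exponent 3 and the helper pick_kernel are hypothetical and are not taken from the actual benchmark. The sources are not compiled here; they would still have to be handed to an OpenCL host API.

```python
# Hypothetical kernel sources for illustration only; the names and the
# exponent 3 are assumptions, not the benchmark's real code.
KERNEL_FP64 = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void power_fp64(__global const double *in, __global double *out) {
    size_t i = get_global_id(0);
    out[i] = pow(in[i], 3.0);   /* double base, double exponent */
}
"""

KERNEL_FP32 = """
__kernel void power_fp32(__global const float *in, __global float *out) {
    size_t i = get_global_id(0);
    out[i] = pown(in[i], 3);    /* float base, integer exponent */
}
"""


def pick_kernel(device_extensions: str) -> str:
    """Return the FP64 kernel only if the device advertises cl_khr_fp64.

    device_extensions is the space-separated string a device reports
    for CL_DEVICE_EXTENSIONS.
    """
    if "cl_khr_fp64" in device_extensions.split():
        return KERNEL_FP64
    return KERNEL_FP32
```

A device like the NVS 140M, which does not list cl_khr_fp64, would fall back to the FP32 kernel in this sketch.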

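Device             | Year | Cores | Compute units | Clock (MHz) | GFLOPS FP32/FP64 | Time in ms at three input sizes (last: 100k elements)
NVS 4200M          | 2011 | 48    | 1             | 1480        | 156/12           | 13 | 116 | 1163
Tesla K20xm        | 2012 | 2688  | 14            | 732         | 3935/1312        | 2  | 3   | 24
Intel i7 2640M     | 2011 | 2     | 4             | 2800        | n/a              | 3  | 27  | 281
RTX 3050 Ti Mobile | 2021 | 2560  | 20            | 1695        | 5299/83          | 2  | 10  | 90
Intel UHD 10200H   | 2020 | 192   | 12            | 1050        | 422/?            | 4  | 19  | 192
NVS 140M (FP32)    | 2007 | 16    | 2             | 800         | 25/-             | 47 | 445 | 4453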

The fastest device in this comparison is the Tesla K20xm, which I find a little surprising: it dates from 2012, yet it beats the RTX 3050 Ti Mobile from 2021. However, the Tesla's theoretical FP64 performance is about 15 times that of the RTX (although the measured run times differ only by a factor of about 4), so it should be clear why it wins.
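To make the gap between theory and measurement concrete, the following sketch redoes the arithmetic. It assumes the FP64 GFLOPS figures (1312 and 83) and the largest-input timings (24 ms and 90 ms) are read correctly from the table above; the variable names are mine.

```python
# Figures as read from the table (assumed: FP64 GFLOPS and the
# run time in ms for the largest input).
tesla_fp64_gflops, tesla_ms = 1312, 24
rtx_fp64_gflops, rtx_ms = 83, 90

theoretical_ratio = tesla_fp64_gflops / rtx_fp64_gflops  # about 15.8
measured_ratio = rtx_ms / tesla_ms                       # 3.75

# If run time scaled linearly with FP64 throughput,
# the RTX would need roughly this long:
predicted_rtx_ms = tesla_ms * theoretical_ratio          # about 379 ms

print(f"theoretical FP64 ratio: {theoretical_ratio:.1f}x")
print(f"measured time ratio:    {measured_ratio:.2f}x")
print(f"predicted RTX time:     {predicted_rtx_ms:.0f} ms (measured: {rtx_ms} ms)")
```

Under these assumptions the RTX finishes in roughly a quarter of the time that pure FP64 throughput would predict, which suggests the benchmark is not bound by FP64 arithmetic alone.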

To be honest, I have no need for double here (integers would do just fine), but this is a great chance to see the performance differences between various devices. With FP32 it would be quite difficult to get such a broad range of timings. Switching to pown(float, integer) changes the table above a little, as the computation now runs in FP32 (at 100k elements):

  • Tesla K20xm: 12 ms
  • RTX 3050 Ti Mobile: 3 ms
  • NVS 4200M: 352 ms
  • NVS 140M: 4453 ms

Now let me compare those timings with theoretical performance measured in GFLOPS. For the NVS 4200M and the NVS 140M the theoretical ratio is roughly 6 (156 vs 25), but the timing ratio is only close to 4, so other factors come into play as well. For the RTX 3050 Ti and the Tesla K20xm the theoretical ratio is about 1.35 (5299 vs 3935), but the timing ratio is 4, so the actual performance gain is much higher than the GFLOPS figures alone would suggest.
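The RTX versus Tesla comparison can be checked with a few lines of arithmetic. This sketch uses only the FP32 GFLOPS figures and the pown timings quoted in the text; the helper name ratio and the "unexplained" factor are my own framing.

```python
def ratio(a: float, b: float) -> float:
    """Return how many times larger a is than b."""
    return a / b


# FP32 GFLOPS and pown timings at 100k elements, as quoted in the text.
rtx_gflops, tesla_gflops = 5299, 3935
rtx_ms, tesla_ms = 3, 12

theoretical = ratio(rtx_gflops, tesla_gflops)  # about 1.35
measured = ratio(tesla_ms, rtx_ms)             # 4.0

# Speedup that the GFLOPS figures alone do not account for.
unexplained = measured / theoretical           # about 2.97

print(f"theoretical: {theoretical:.2f}x, measured: {measured:.1f}x, "
      f"unexplained: {unexplained:.2f}x")
```

The remaining factor of roughly 3 would have to come from something other than peak FP32 throughput.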

Getting a Tesla K20xm is a steal for FP64 computation, as its FP64 performance is on a similar level to that of the RTX 4090.