Hello
The performance of f16 2D matmul for small inputs seems pretty bad (a reproduction sketch follows the table):
```
ArrayFire v3.7.0 (CUDA, 64-bit Linux, build c589451)
Platform: CUDA Toolkit 10.0, Driver: 418.40.04
[0] TITAN V, 12037 MB, CUDA Compute 7.0
```
| M   | K=N | f32     | f16       |
|-----|-----|---------|-----------|
| 1   | 512 | 0.01383 | 0.0341257 |
| 2   | 512 | 0.01384 | 0.0341409 |
| 4   | 512 | 0.01386 | 0.0341539 |
| 8   | 512 | 0.01385 | 0.0107975 |
| 16  | 512 | 0.01416 | 0.0108004 |
| 32  | 512 | 0.01394 | 0.0109949 |
| 64  | 512 | 0.01510 | 0.0115619 |
| 128 | 512 | 0.01784 | 0.0110581 |
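For reference, here is a minimal sketch of the kind of harness that produces these numbers, using `af::randu`, `af::matmul`, and `af::timeit`. The loop structure and reporting are my reconstruction, not the exact benchmark; also note `af::timeit` returns seconds, so the units may differ from the table above.

```cpp
#include <arrayfire.h>
#include <cstdio>

// Globals so the no-argument callback required by af::timeit can see them.
static af::array A, B;

static void mulOnce() {
    af::array C = af::matmul(A, B);
    C.eval(); // force the computation to actually run
}

int main() {
    af::info();
    const dim_t K = 512;
    for (dim_t M : {1, 2, 4, 8, 16, 32, 64, 128}) {
        for (af::dtype ty : {f32, f16}) {
            A = af::randu(M, K, ty);
            B = af::randu(K, K, ty);
            double t = af::timeit(mulOnce); // time in seconds
            std::printf("M=%4lld  %s  %.7f\n", (long long)M,
                        ty == f32 ? "f32" : "f16", t);
        }
    }
    return 0;
}
```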
I guess this is related to M not being a multiple of 8.
Is there anything that could be done to at least reach the performance of f32?
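In the meantime, one workaround sketch (my own idea, assuming the slowdown really comes from cuBLAS needing dimensions that are multiples of 8 for its f16 Tensor Core kernels) would be to zero-pad M up to the next multiple of 8 before the multiply and slice the extra rows back off; `af::pad` is available as of 3.7.0:

```cpp
#include <arrayfire.h>

// Hypothetical helper: pad M up to the next multiple of 8 with zeros,
// multiply, then drop the padded rows. Zero-padded rows don't change
// the result for the original rows.
af::array matmulPadded(const af::array& A, const af::array& B) {
    const dim_t M   = A.dims(0);
    const dim_t pad = (8 - M % 8) % 8;
    if (pad == 0) return af::matmul(A, B);
    af::array Ap = af::pad(A, af::dim4(0, 0, 0, 0),
                           af::dim4(pad, 0, 0, 0), AF_PAD_ZERO);
    // Keep only the first M rows of the padded result.
    return af::matmul(Ap, B)(af::seq(M), af::span);
}
```

I haven't verified that this hits the fast path, but the M=8..128 rows in the table suggest it might.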