Hello
The performance of f16 2D matmul for small inputs seems pretty bad (a reproduction sketch follows the table):
```
ArrayFire v3.7.0 (CUDA, 64-bit Linux, build c589451)
Platform: CUDA Toolkit 10.0, Driver: 418.40.04
[0] TITAN V, 12037 MB, CUDA Compute 7.0
```
| M   | K=N | f32     | f16       |
|-----|-----|---------|-----------|
| 1   | 512 | 0.01383 | 0.0341257 |
| 2   | 512 | 0.01384 | 0.0341409 |
| 4   | 512 | 0.01386 | 0.0341539 |
| 8   | 512 | 0.01385 | 0.0107975 |
| 16  | 512 | 0.01416 | 0.0108004 |
| 32  | 512 | 0.01394 | 0.0109949 |
| 64  | 512 | 0.01510 | 0.0115619 |
| 128 | 512 | 0.01784 | 0.0110581 |
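For reference, here is a minimal sketch of the kind of harness that produces these numbers, using `af::randu`, `af::matmul`, and `af::timeit`. The loop structure and reporting are my reconstruction, not the exact benchmark; also note `af::timeit` returns seconds, so the units may differ from the table above.

```cpp
#include <arrayfire.h>
#include <cstdio>

// Globals so the no-argument callback required by af::timeit can see them.
static af::array A, B;

static void mulOnce() {
    af::array C = af::matmul(A, B);
    C.eval(); // force the computation to actually run
}

int main() {
    af::info();
    const dim_t K = 512;
    for (dim_t M : {1, 2, 4, 8, 16, 32, 64, 128}) {
        for (af::dtype ty : {f32, f16}) {
            A = af::randu(M, K, ty);
            B = af::randu(K, K, ty);
            double t = af::timeit(mulOnce); // time in seconds
            std::printf("M=%4lld  %s  %.7f\n", (long long)M,
                        ty == f32 ? "f32" : "f16", t);
        }
    }
    return 0;
}
```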
I guess this is related to M not being a multiple of 8.
Is there anything that could be done to at least reach the performance of f32?
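In the meantime, one workaround sketch (my own idea, assuming the slowdown really comes from cuBLAS needing dimensions that are multiples of 8 for its f16 Tensor Core kernels) would be to zero-pad M up to the next multiple of 8 before the multiply and slice the extra rows back off; `af::pad` is available as of 3.7.0:

```cpp
#include <arrayfire.h>

// Hypothetical helper: pad M up to the next multiple of 8 with zeros,
// multiply, then drop the padded rows. Zero-padded rows don't change
// the result for the original rows.
af::array matmulPadded(const af::array& A, const af::array& B) {
    const dim_t M   = A.dims(0);
    const dim_t pad = (8 - M % 8) % 8;
    if (pad == 0) return af::matmul(A, B);
    af::array Ap = af::pad(A, af::dim4(0, 0, 0, 0),
                           af::dim4(pad, 0, 0, 0), AF_PAD_ZERO);
    // Keep only the first M rows of the padded result.
    return af::matmul(Ap, B)(af::seq(M), af::span);
}
```

I haven't verified that this hits the fast path, but the M=8..128 rows in the table suggest it might.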