Description
In a multi-threaded application, one thread does the processing and another fetches the results back to the host in the background, without interfering with the ArrayFire CUDA stream used for processing.
The expected host-transfer timings are in the range of microseconds and should be as constant as possible.
I have a dedicated CUDA stream for the transfer back to host.
af::host isn't an option, so the only option left is array::device<>.
After reading the code, I noticed that array::device implicitly makes a copy of the original array with cudaMemcpyAsync, enqueues that copy on the ArrayFire CUDA stream, locks the memory, and returns without synchronizing.
For multi-threaded memory copies this is a problem. It causes an expensive double synchronization per array and forces partial serialization with the ArrayFire CUDA stream:
First cudaMemcpyAsync + sync:
The cudaMemcpyAsync for the implicit array copy ends up queued on the ArrayFire CUDA stream behind newly submitted processing kernels. This causes unpredictable delays at synchronization, depending on how big/slow the already-queued processing tasks are.
Second cudaMemcpyAsync + sync:
After getting the device pointer, a second cudaMemcpyAsync is required to read the copied array back into a pinned memory buffer, plus an event synchronization to wait for completion.
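For reference, this is the path I end up with today, as a minimal sketch (fetchToHost, transferStream, pinnedBuf and done are my own placeholder names; error checking omitted):

```cpp
#include <arrayfire.h>
#include <cuda_runtime.h>

// Current two-copy, two-sync path: transferStream was created with
// cudaStreamCreate, pinnedBuf was allocated with cudaHostAlloc.
void fetchToHost(const af::array& a, float* pinnedBuf,
                 cudaStream_t transferStream, cudaEvent_t done) {
    // 1st copy + sync: device() enqueues the implicit cudaMemcpyAsync
    // on the ArrayFire stream, behind any processing kernels already
    // queued there; the pointer is only safe once that stream drains.
    float* devPtr = a.device<float>();
    af::sync();

    // 2nd copy + sync: read the locked device buffer back into pinned
    // host memory on the dedicated stream and wait for completion.
    cudaMemcpyAsync(pinnedBuf, devPtr, a.bytes(),
                    cudaMemcpyDeviceToHost, transferStream);
    cudaEventRecord(done, transferStream);
    cudaEventSynchronize(done);

    a.unlock();  // release the lock taken by device()
}
```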
I wish there were:
- a way to specify the CUDA stream used by array::device, so that no new work is queued on the ArrayFire CUDA stream
- a way to specify the pinned memory buffer to copy the data into, avoiding the unnecessary copy and locking on the device (a rough API sketch follows this list)
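Something like the following hypothetical signature would cover both points (af::read is an invented name, not an existing ArrayFire function):

```cpp
// Hypothetical API sketch -- does not exist in ArrayFire today.
// Enqueue the device-to-host copy of `in` directly on a caller-supplied
// CUDA stream into a caller-supplied pinned host buffer, without adding
// any work to the ArrayFire stream or making an intermediate device copy.
namespace af {
    void read(const array& in,       // source array
              void* pinnedDst,       // caller-owned pinned host buffer
              cudaStream_t stream);  // caller-owned transfer stream
}
```

The only ordering such a call would need is against work already queued on the ArrayFire stream, which can be expressed with an event instead of a blocking sync.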
I'll probably try adding an af::read function to implement that behavior, or a way to retrieve the raw CUDA array pointer.
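Since af/cuda.h already exposes afcu::getStream(), a first experiment could look like the sketch below. It still goes through array::device (so the implicit copy remains), but it replaces the blocking af::sync() with event-based ordering between the two streams. readAsync is my own placeholder name; error checking is omitted:

```cpp
#include <arrayfire.h>
#include <af/cuda.h>       // afcu::getStream
#include <cuda_runtime.h>

// Order the transfer after the work already queued on the ArrayFire
// stream with an event, instead of blocking the host with af::sync().
void readAsync(const af::array& a, void* pinnedDst,
               cudaStream_t transferStream) {
    // Still triggers the implicit copy + lock on the ArrayFire stream.
    float* devPtr = a.device<float>();

    cudaStream_t afStream = afcu::getStream(af::getDevice());

    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
    cudaEventRecord(ready, afStream);               // after kernels + implicit copy
    cudaStreamWaitEvent(transferStream, ready, 0);  // no host-side blocking

    cudaMemcpyAsync(pinnedDst, devPtr, a.bytes(),
                    cudaMemcpyDeviceToHost, transferStream);
    cudaEventDestroy(ready);  // deferred until the pending wait completes

    // The caller waits on transferStream when the data is needed and
    // calls a.unlock() only after the copy has completed.
}
```

A proper af::read would additionally skip the implicit device copy, which is the part that cannot be avoided from the public API today.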