array::device<> performance issues using CUDA backend. #2780

Open
@lfdmn

Description


In a multi-threaded application, one thread does the processing and another is supposed to fetch the results back to the host in the background, without interfering with the ArrayFire CUDA stream used for processing.

The expected host transfer timings are in the range of microseconds and should be as constant as possible.

I have a dedicated CUDA stream for the transfer back to host.

af::host isn't an option, so the only alternative is array::device<>.

After reading the code, I noticed that array::device implicitly makes a copy of the original array using cudaMemcpyAsync on the ArrayFire CUDA stream, locks the memory, and returns without synchronization.

This is a problem for multithreaded memory copies. It causes an expensive double synchronization per array and forces partial serialization with the ArrayFire CUDA stream:

First cudaMemcpyAsync + sync:
The cudaMemcpyAsync for the implicit array copy ends up queued on the ArrayFire CUDA stream behind any newly submitted processing kernels. This causes random delays at synchronization, depending on how large or slow the queued processing tasks are.

Second cudaMemcpyAsync + sync:
After getting the device pointer, a second cudaMemcpyAsync is required to read the copied array back into a pinned memory buffer, followed by an event synchronization to wait for completion.

I wish there were:

  • a way to specify the CUDA stream used by array::device, so that no new work is queued on the ArrayFire CUDA stream
  • a way to specify the pinned memory buffer to copy the data into, avoiding an unnecessary copy and lock on the device.

I'll probably try adding an af::read function to implement that behavior, or a way to retrieve the raw CUDA array pointer.

vineelpratap
