OPT: Improved memcopy, JIT & join #3144


Merged · 4 commits merged into arrayfire:master · Sep 22, 2022

Conversation


@willyborn willyborn commented Jun 9, 2021

  • Workload per thread is increased to optimize for memcopy and JIT.
  • memcopy can now increase its unit size.
  • JIT can now write directly into the final buffer, instead of first evaluating into a temporary buffer.
  • JIT now performs multiple copies in parallel (when they have the same size).
  • join now makes an optimal selection between memcopy and JIT.

Description

  • No feature changes for the end user, only a stable performance improvement.
    Improvements are extreme for:
    • vectors in the 2nd, 3rd or 4th dimension
    • arrays that are not yet evaluated (JIT nodes)
    • arrays fully contained in the L2 cache

Fixes: #2417

Changes to Users

Performance improvement only; performance is now less dependent on the array dimensions.

Checklist

  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • Functions added to unified API
  • Functions documented

@willyborn (Contributor Author)

Some extra comments on the realized performance impact:

  • copy linear array (MAX throughput) -- should be highest possible throughput
    is performed by the enclosed copy functions (OCL & CUDA). The small improvements come from C++ optimizations in copy.cpp
  • A[+10,,][,,].copy (Non linear) -- improvement: (CU)1x..84x / (CL)3x..120x
    An evaluated array is copied to a new array whose 1st dimension is 10 elements larger. When the 1st dim is 1, this means that only 1 byte in 10 of the cached memory will be written, explaining the slow performance at the end. (Still 120x faster than with master.)
  • JIT.copy (Linear) -- improvement: (CU)3.7x / (CL)3.5x
    The array to be copied is not yet evaluated. In this case we no longer evaluate it into an intermediate buffer and then join the result into the final buffer. The evaluation writes directly into the final buffer, eliminating a write to and read from the intermediate buffer; see the sketch after this list. This explains why the throughput is higher than the HW specification.
    Memory consumption is also lower, since the intermediate buffer is no longer allocated.
  • join(0,A,B) -- improvement: (CU)2%..71% / (CL)4%..3x
    The common (optimized) memcopy function is now used. Only evaluated arrays are measured here. Performance is now more stable and less dependent on the array dimensions. In case we hit the HW limit, no further improvement is possible.
  • join(1,A,B) -- improvement: (CU)5%..47x / (CL)6%..50x
    Performance for vectors in the 2nd, 3rd or 4th dimension was very poor. Now they perform similarly to the rest.
  • join(3,A,B) -- improvement: (CU)6%..360x / (CL)8%..711x
    Same remark as join(1,A,B)
  • join(0,A,B,C,D) -- improvement: (CU)7%..70% / (CL)8%..3x
    Same remark as join(0,A,B)
  • join(1,A,B,C,D) -- improvement: (CU)5%..46x / (CL)7%..50x
    same remark as join(1,A,B)
  • join(0,4xJIT) -- improvement: (CU)3x..5x / (CL)3x..7x
    Improvements of JIT generation and direct writing into final buffer
  • join(1,4xJIT) -- improvement: (CU)3x..101x / (CL)3x..108x
    Improvements of JIT generation and direct writing into the final buffer. The JIT operation itself is also further optimized, resulting in the exceptional improvement of 108x.
  • Linear.as(f32|c32) only char -- improvement: (CU)20%..27% / (CL)18%..20%
    The improvements to the JIT engine are the only contributor to this result.
  • B[+1,,][,,f32|c32] -- improvement: (CU)0%..211x / (CL)0%.. 297x
    This is the only place where the copy (reshape & scale) function is measured.
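
For illustration, a minimal sketch of the JIT-direct-write idea mentioned above (the helper name is hypothetical; Param<T>, evalNodes and Array<T> are the ArrayFire internals quoted further down in this review, and outOffset is assumed to be the element offset of this input inside the join output):

// Sketch only: evaluate a not-yet-evaluated (JIT) input straight into its
// slice of the join output, so no temporary buffer is allocated, written and
// read back.
template<typename T>
void evalJitInputIntoJoinOutput(Array<T> &out, const Array<T> &in,
                                dim_t outOffset) {
    std::vector<Param<T>> outputs{
        {out.get() + outOffset, in.dims().dims, out.strides().dims}};
    std::vector<common::Node *> nodes{in.getNode().get()};
    evalNodes<T>(outputs, nodes);  // one pass: read the node, write to out
}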

The CUDA scheduler optimizes much more than the OCL scheduler, even on the same HW.
In the machine learning example, the needed time for 250 epochs reduced by 10%.

More details, with the specific measurements, can be found in the file below:
Join.xlsx


umar456 commented Jun 11, 2021

Hi @willyborn,

This is a tremendous change. Thank you for your contribution. I am looking at the benchmark numbers and they look fantastic. I am running my own benchmarks and I am seeing some good results as well. I did notice that the kernels fail if the older kernel cache is not removed. I suppose that would only be an issue in development branches; once the version number is updated, the errors will not appear for the end user. I want to understand the kernels that are being generated and will give my feedback in a couple of days.

@9prady9 9prady9 (Member) left a comment

I was reviewing this on and off and it took some time, sorry about the delay.

I have reviewed the CUDA backend; the OpenCL changes seem quite similar, so I will browse through them once the feedback is addressed.
Thank you for your contribution!

Note: Also, kindly rebase from latest master. There is a minor conflict in cuda's jit.cpp file.

On another note, I haven't figured out why it fails if we keep the cache of kernels, which it shouldn't; I will report back on that once I figure out why. It is important, and preferable, that users don't need to deal with this intervention.

@@ -8,16 +8,14 @@
********************************************************/

typedef struct {
dim_t dim[4];
int dims[4];
Member

This type change may likely cause type compatibility issues with the dims_t struct defined elsewhere.

Contributor Author

In this case, I change the struct name to dims_type.

Member

is there a reason we can't use KParam like the other kernels ?

Contributor Author

I already had all the dims available in int format and no longer in dim_t
format, because they are updated inside the memcopy.hpp calls.
On top of that, most of the OpenCL kernels convert all dims / idx into int
inside their kernels.

Finally, I wanted to speed things up further, avoiding conversions and
reducing the bytes sent to the GPU.
If I have to send the dim_t version, I have to convert on the CPU and later
convert again on the GPU.

Continuing calculations in dim_t format is really much slower than int on
OpenCL. I did not have the same impression on CUDA, even when running on
the same HW.

Member

I understand the performance implications of using dim_t vs int. In fact, we originally had dim_t all over but later refactored to the current state, where we keep dim_t outside of kernels but convert to int inside kernels, for the very same reason. This is the case even with the CUDA backend, not just OpenCL. The type conversion overhead for a few scalars shouldn't be that noticeable.

I don't understand why you would have to convert from int to dim_t on the host. Can you please point me to the source location where this is happening? As far as I can remember, all non-kernel code almost always deals in terms of dim_t, and dim_t is converted to int inside kernels alone.

Contributor Author

I updated dims_t to dims_type

Contributor Author

I was referring to memcopy.hpp, since the dims and strides are updated in the array dimension optimizations. This is all done in int, since the kernel continues in int.

So the latest values are only available in int in opencl/memcopy.hpp

const unsigned d2 = static_cast<unsigned>(dims[2]);
const unsigned OCC = 3;
const unsigned elements = d0 * d1;
const unsigned minThreads = warp / 4; // quarter wave
Member

What is the significance of the number 4 here? Could it be based on the max word transfer limit, 128 bytes, such as 128 / 32(warp) = 4?

If it is indeed based on the max word transfer limit (a 128-byte line), doesn't this change w.r.t. the data type?

Contributor Author

NVIDIA & AMD GPUs are built on top of processors using SIMD. For all NVIDIA GPUs: 1 instruction & 8 datasets; for AMD: 1 instruction & 16 datasets.
I continue with NVIDIA, although the same holds for AMD.

  • The processor now executes as follows:
    Wave 1
    • 1st instruction / 0...7 datasets (= 8 threads = quarter wave)
    • 1st instruction / 8...15 datasets (= 8 threads = quarter wave)
    • 1st instruction / 16...23 datasets (= 8 threads = quarter wave)
    • 1st instruction / 24...31 datasets (= 8 threads = quarter wave)
      Wave 2
    • 2nd instruction / 0...7 datasets (= 8 threads)
    • ...
  • By repeating the same instruction 4x, we get the impression that a full warp is executed in parallel.
  • Suppose only thread 24 exits, then the processor will service datasets 25...31 each with 1 instruction and 1 dataset --> 7 instructions for 7 datasets, instead of 1 instruction for 8 datasets.
  • Suppose datasets 24...31 exit together, then everything will continue, although now only with a repetition of 3x. Each instruction (SIMD) keeps servicing 8 datasets. Losing quarter waves will keep the processor at full efficiency, on condition that caching of global memory can still follow. From the literature, I learned that the caching system can keep up with down to 50% of active datasets (16 in the case of NVIDIA). With fewer datasets (threads) active per processor, the processor will have to wait.

Since both architectures decided to split their respective waves into 4 instructions, we can use the same formula to determine the minThreads for both NVIDIA and for AMD.

The cache line size is 128 bytes for NVIDIA and 64 bytes for AMD (A10).
128 bytes = 32 datasets (a full warp) * 4 bytes (size of float), which means that the GPU will read a full warp (linear) at once. This is only useful when we can read this data sequentially. On top of that, when 1 cache line is read, the next 3 cache lines are already requested in the hope that we will continue reading.
Because of its larger cache line, NVIDIA will be faster at sequential reading but slower at random access (reading multiple buffers in parallel). The cache lines are also used to write to global memory, although this is fully asynchronous, so there are no waiting times (even for random access) as long as the total cache is not occupied.

How much of the above do you want into the comments?

Member

Based on what you explained about the constant, cache info is irrelevant for this comment.

I think something like the below should be good enough

<return> bestBlockSize(<params>) {
    // Currently both NVIDIA and AMD hardware splits each warp/wavefront into
    // four groups of 8/16 threads, respectively.
    // NVIDIA - 32(warp) / 8       = 4
    // AMD    - 64(wavefront) / 16 = 4
    constexpr unsigned WAVE_COUNT = 4;
    const unsigned minThreads = warp / WAVE_COUNT;
    ....
}

Member

My bad, I wasn't sure if such warp sub division happens. I don't think that's how CUDA warps work, at least from what I know. @umar456 Am I missing something from what @willyborn is referring to here ?


Member

I don't think NVIDIA has split warps since the Kepler architecture (2012). NVIDIA did this on Fermi and earlier cards. AMD still does this to some extent for their older generations. The new RDNA architecture moved to 32-thread wavefronts and each thread is executed in parallel.

https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf

Suppose datasets 24...31 exit together, then everything will continue, although now only with a repetition of 3x. Each instruction (SIMD) keeps servicing 8 datasets. Losing quarter waves will keep the processor at full efficiency, on condition that caching of global memory can still follow. From the literature, I learned that the caching system can keep up with down to 50% of active datasets (16 in the case of NVIDIA). With fewer datasets (threads) active per processor, the processor will have to wait.

Ignoring the 4-clocks-per-wavefront point, the inactive threads in a wavefront will still execute a NOP instruction. This is even the case if only one thread is active. See pages 41-42 here: http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

Member

Yes, I only read about such division from the days of Fermi.

Member

I am thinking through this again and I think I am not entirely correct here. Yes, for floating point operations the entire warp typically executes the same instruction (except on AMD), but for instructions that are more expensive (like doubles and special functions) this sort of behavior will occur. I think @willyborn's change will help for those types of operations.

// Remark: The bestBlockSize is only best for independent element operations, as
// are: copying, scaling, math on independent elements, ...
// Since vector dimensions can be returned, it is NOT USABLE FOR BLOCK
// OPERATIONS, as are: matmul, etc.
Member

👍🏾 A good heads up for users of this function.

I think a few comments/docs are missing from the function implementation itself with respect to the hard-coded constants. I have left additional comments at relevant lines, please have a look.

const unsigned d0 = static_cast<unsigned>(dims[0]);
const unsigned d1 = static_cast<unsigned>(dims[1]);
const unsigned d2 = static_cast<unsigned>(dims[2]);
const unsigned OCC = 3;
Member

It is good to have 3 as a function-local constant but what is the significance of this 3 ? Does it depend on the GPU in use ? If it varies per hardware, is it usually populated in device properties ?

Contributor Author

In the copy kernels, we only have address calculations in the threads. This means that calculation time is very low compared to the waiting time, resulting from memory latency.

The total occupation = occupation(threads0) * occupation(threads1) * occupation(threads2).

  • Since array dim[2] is mostly very small, we push occupation(threads2) to 100%, by only providing exactly dividable block dims.
  • For threads0 & threads1 the occupation always drops because the last block is incomplete, although still launched. The full occupation is in brackets (assuming both equal). The occupation per dim is calculated as follows:
    0 full blocks + 1 thread / 1 block allocated ==> 1/32, 1/64, ...
    1 full block + 1 thread / 2 blocks allocated ==> >50% (25%)
    2 full blocks + 1 thread / 3 blocks allocated ==> >66% (44%)
    3 full blocks + 1 thread / 4 blocks allocated ==> >75% (56%)
    From blogs, I read that around 50% thread occupation is sufficient, so somewhere between 3 & 4 allocated blocks. The performance difference between the two is smaller than my measuring precision.

As a result, I expect it to be mostly independent of the GPU version. On top of that,

  • only a very small set of dimensions is possibly affected.
  • I prefer to keep it inside the function, because it is not intended to be changed outside specific performance profiling.
  • There are many uncertain assumptions, so this is not an exact science. In practice, you will mostly encounter powers of 2.
  • On NVIDIA, I noticed that the driver optimizes much more so that multiple variations give the same result.

I can also remove the constant and put directly the final number in each expression.
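
As a standalone illustration of the rule described above (hypothetical helper, not the PR's code): a candidate block shape only qualifies when dims 0 and 1 each span at least OCC full blocks, so the partially filled last block stays a small fraction of the launch.

// Hypothetical sketch: with at least 3 full blocks per dimension the trailing
// partial block keeps per-dimension occupation above 75% (roughly 56% combined),
// which the profiling described above found to be sufficient.
constexpr unsigned OCCUPANCY_FACTOR = 3;

bool blockShapeIsEfficient(unsigned d0, unsigned d1,
                           unsigned threads0, unsigned threads1) {
    return d0 >= OCCUPANCY_FACTOR * threads0 &&
           d1 >= OCCUPANCY_FACTOR * threads1;
}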

Member

Maybe I wasn't clear; I am not suggesting making this constant global or hard coded.

On the contrary, making it a constexpr (shown below) instead of const will let the compiler optimize it away in all the arithmetic operations OCC is involved in, especially given that it is being used in conjunction with other hard-coded constants (powers of 2).

constexpr unsigned OCCUPANCY_FACTOR = 3; // I couldn't think of vendor agnostic name, perhaps you can

It (OCC) may be based on profiling and heuristics, but it should be explained in comments why that particular constant was chosen, so that any future reader of the code can understand why 3 and not anything else, more so since the type of GPU wasn't influencing the choice of this constant.

As far as how much of an explanation is needed in comments: I think a couple of sentences about occupancy and how this factor affects it, plus references to all, if not most, of the blogs you referred to, should do.

Contributor Author

Updated with comments.
const unsigned OCC changed to constexpr unsigned OCCUPANCY_FACTOR

const unsigned elements = d0 * d1;
const unsigned minThreads = warp / 4; // quarter wave
const unsigned maxThreads =
std::min(warp * 4, divup(elements * warp, 16384U) * minThreads);
Member

A couple of comments w.r.t. the number 16384:

  • It would be nice to have a comment on why this value specifically. Is it a hardware-agnostic constant?
  • Instead of a large literal, something like 1<<14 is perhaps more indicative that this constant is a power of 2, I think.

Member

Looks like this number is based on your benchmarks? Because I see this number again as 8192 * 2 on lines 301 and 303 of the src/backend/cuda/jit.cpp file.

Contributor Author

The idea here is that for very small array dimensions, the opposite is needed (lower occupation).

Suppose a GPU can handle 1024 threads (32 warps of 32 threads) in parallel. Suppose our array is a vector with dim0=256 elements.

  • When we go for maximum occupation, we set a block size of 32, resulting in the activation of 8 parallel warps (=8 CPUs). The other 24 CPUs will be idle. We also know that for each warp only 8 threads really run in parallel, so 64 truly parallel threads for the whole system, with 4 loops each.
  • Would it not be better to use all the available CPUs (32) with only 1 loop? This is achieved by launching with a block size of 8 (=minThreads). The occupation here is only 25%, although it is 4x faster than with 100%.

As long as the number of elements of an array is smaller than the number of available parallel threads, we limit the maxThreads for 1 block to create the occupation inefficiency described above.
The number 16384U is an estimated average (=512*32; 512 stream processors) over all the GPUs I looked up. Again, this number does not have to be precise, since the impact decreases the closer we get to it. On top of that, an error margin of 50% is allowed.

Increasing or lowering it did not have much of an impact. Removing it generated visible slowdowns.
I reused the same number later, because I did not have anything better. The reasoning is the same: why increase the looping inside a thread when not all threads are occupied?
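
As a worked example of the quoted expression (self-contained sketch; divup is the usual ceiling division used throughout the PR):

#include <algorithm>

// Ceiling division, as used in the PR.
static unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

// The quoted cap: small element counts shrink the block size so that more
// SMs / compute units receive a block.
unsigned maxThreadsFor(unsigned elements, unsigned warp) {
    const unsigned minThreads = warp / 4;  // quarter wave
    return std::min(warp * 4, divup(elements * warp, 16384u) * minThreads);
}

// Example: elements = 256, warp = 32 gives divup(256 * 32, 16384) = 1, so
// maxThreads = min(128, 1 * 8) = 8 -> many small blocks instead of a few big
// ones; for sufficiently large arrays the cap grows back to warp * 4 = 128.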

Do you want me to include all the above comments inside the code?

Member

Sorry, but I am confused by the terms here: CPUs, parallel warps. Let's stick to NVIDIA architecture for ease of explanation. Can you please use their jargon: SM, Warp and thread to explain the logic here.

@willyborn willyborn (Contributor Author) Aug 5, 2021

I'll try to stick to NVIDIA nomenclature.

Suppose the GPU has 4 cores (SPs), so 32*4 = 128 threads can run in parallel.
Suppose our array is a vector of 32 floats, all in dim0.

General rule:
each SP always receives a full warp. This is achieved by extending threads up to a multiple of the warp. Empty threads do not consume cycles and are skipped.
When 8 threads execute the same instruction, the SIMD will execute in parallel (1 cycle).
When 7 threads execute the same instruction, the ALU will execute in serial (7 cycles).

  • We set threads(128,1,1) and blocks(1,1,1), resulting in
    ----------------------------------------------blockIdx0----------------------------------------------------
    cycle1: SP0:threadIdx 0.. 7; SP1:threadIdx 32..39; SP2:threadIdx 64..71; SP3:threadIdx 96..103
    cycle2: SP0:threadIdx 8..15; SP1:threadIdx 40..47; SP2:threadIdx 72..79; SP3:threadIdx 104..111
    cycle3: SP0:threadIdx 16..23; SP1:threadIdx 48..55; SP2:threadIdx 80..87; SP3:threadIdx 112..119
    cycle4: SP0:threadIdx 24..31; SP1:threadIdx 56..63; SP2:threadIdx 88..95; SP3:threadIdx 120..127
    --> 25% of the available SPs does useful work (threads 32..127 exit immediately) and 4 cycles are needed.
    --> Occupation = 100%

  • We set threads(32,1,1) and blocks(1,1,1), resulting in
    ------blockIdx0-------- -------------IDLE--------------
    cycle1: SP0:threadIdx 0.. 7; SP1:idle; SP2:idle; SP3:idle
    cycle2: SP0:threadIdx 8..15; SP1:idle; SP2:idle; SP3:idle
    cycle3: SP0:threadIdx 16..23; SP1:idle; SP2:idle; SP3:idle
    cycle4: SP0:threadIdx 24..31; SP1:idle; SP2:idle; SP3:idle
    --> 25% of the available SPs is used and 4 cycles are needed.
    --> Occupation = 25%

  • We set threads(8,1,1) and blocks(4,1,1), resulting in
    ------blockIdx0----- ------blockIdx1------ -------blockIdx2------- -------blockIdx3-------
    cycle1: SP0:threadIdx 0..7; SP1:threadIdx 8..15; SP2:threadIdx 16..23; SP3:threadIdx 24..31
    cycle2, cycle3, cycle4 are skipped since there are no threads left to launch
    --> 100% of the available SPs is used and 1 cycle is needed.
    --> Occupation = 100%

Assume that we only have 16 floats, all in dim0

  • We set threads(4,1,1) and blocks(4,1,1), resulting in
    ----blockIdx0---- ----blockIdx1---- ----blockIdx2---- ----blockIdx3----
    cycle1: SP0:threadIdx 0; SP1:threadIdx 4; SP2:threadIdx 8; SP3:threadIdx12
    cycle2: SP0:threadIdx 1; SP1:threadIdx 5; SP2:threadIdx 9; SP3:threadIdx13
    cycle3: SP0:threadIdx 2; SP1:threadIdx 6; SP2:threadIdx10; SP3:threadIdx14
    cycle4: SP0:threadIdx 3; SP1:threadIdx 7; SP2:threadIdx11; SP3:threadIdx15
    --> 100% of available SP is used and 4 cycles needed.
    --> Occupation = 100%

  • We set threads(8,1,1) and blocks(1,1,1), resulting in
    -----blockIdx0------
    cycle 1: SP0:threadIdx 0..7; SP1:Idle; SP2:Idle; SP3:Idle
    --> 25% of available SP is used and 1 cycle needed.
    --> Occupation = 25%

Conclusion:

  • Occupation rate is useless for very small arrays; the number of cycles is what counts!!
  • Minimum is warp/4 (the number of parallel SIMD lanes), because then 1 cycle == 1 individual instruction.

I looked up multiple GPUs and they vary between 512 cores (Radeon R7, Vega 8) and 10496 cores (RTX 3090). As an average, I wanted to take 1024 cores (16384 threads on AMD). I also assumed that NVIDIA users would have bigger cards, so there I took 2048 cores.
In case the user has fewer cores: all cores will be used and cycles increase (scheduling overhead).
In case the user has more cores: 4x faster compared to doing nothing (not all cores are used; no impact when memory latency is covered by 16384 threads).
AMD:    (elements * 64 / 16384) * 16 = (elements / (16 * 16)) * 16 --> 16 cores used
NVIDIA: (elements * 32 / 16384) * 8  = (elements / (64 * 8)) * 8   --> 64 cores used

I WILL RERUN THE PROFILING, because I think it should be
AMD:    (elements / (1024 * 16)) * 16 = (elements / 16384) * 16 --> divup(d0*d1, 1<<14) * minThreads
NVIDIA: (elements / (2048 * 8)) * 8   = (elements / 16384) * 8  --> divup(d0*d1, 1<<14) * minThreads

Member

First of all, I think this is fantastic work. Your logic here is valid. I think this can be improved if you take the number of compute units into account in your calculations. You can pass it into the function using getDeviceProp(id).multiProcessorCount on CUDA and getDevice().getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() on OpenCL. This will give you a general idea of the number of compute resources available on the device.

You will need to figure out the number of ALUs available on each compute unit. I think a safe minimum is 64. This number matches the minimum compute unit for AMD and NVIDIA GPUs. Intel GPUs have 8 shader units per compute unit, but let's ignore them for now.

I think that moving below 64 threads per work-group will give you diminishing returns. I think your model of the GPU scheduling is incorrect with respect to the minimum number of clocks required to do an operation. I don't think you can minimize the impact of an inactive thread in the wavefront like you mention in your response. If you have different results then I would love to see it. I love to be proven wrong :).

Optimizing for very small data sizes will probably not yield favorable results. For these data sizes the bottleneck is going to be either driver overhead, memory-load overhead or other processing overhead internal to ArrayFire. From an engineering point of view, it might be better to simplify the logic to increase maintainability (if you don't see significant gains).

When 8 threads execute the same instruction, the SIMD will execute in parallel (1 cycle).
When 7 threads execute the same instruction, the ALU will execute in serial (7 cycles).

I am not sure I understand you here. Can you please clarify.

Contributor Author

As mentioned, I will rerun multiple performance tests with only this parameter changed.
It is possible that, due to multiple earlier improvements on all parameters, I drew wrong conclusions.
More to come.

From what I understand of SIMD, all parallel lanes have to be occupied or none at all. As a consequence, when only 7 data lanes are possible, the SP can no longer use the SIMD (1x 8-wide) and is forced to use the ALU (7x 1) instead. Both, however, use the same number of cycles.

Member

I think that moving below 64 threads per work-group will give you diminishing returns. I think your model of the GPU scheduling is incorrect with respect to the minimum number of clocks required to do an operation. I don't think you can minimize the impact of an inactive thread in the wavefront like you mention in your response. If you have different results then I would love to see it. I love to be proven wrong :)

As mentioned in my last comment, I think I am incorrect here. I can see situations where smaller block sizes (<64 threads) can be beneficial for more expensive operations such as doubles and special functions. Please disregard that comment.

From what I understand of SIMD, all parallel lanes have to be occupied or none at all. As a consequence, when only 7 data lanes are possible, the SP can no longer use the SIMD (1x 8-wide) and is forced to use the ALU (7x 1) instead. Both, however, use the same number of cycles.

I don't think I have heard of that. All the documentation I have read treats the inactive threads as masked off, and the operation still takes one cycle even if only 7 are active. On AMD hardware I understand that they have a scalar execution unit which comes into play when all threads perform the same operation on the same data. This is typically used for control flow and address operations.

blocks.z = divup(blocks.y, maxBlocksY);
blocks.y = divup(blocks.y, blocks.z);
template<typename T>
void memcopyN(Param<T> out, std::vector<CParamPlus<T>> &Ins) {
Member

I don't think we need the additional function memcopyN and the new object CParamPlus. We can avoid both of these if we create memcopyN using the new memcopy version I suggested in my earlier comments.

Contributor Author

Calling memcopy for each array separately means that the overhead has to be paid for each copy individually.
This was the model I had in the beginning.

After moving to memcopyN, I noticed a measurable difference (especially for small arrays, which are typical for addBias() functions).

In case you give priority to this complexity reduction, please let me know.
I know an alternative for addBias() that performs similarly to the current join implementation.

af::array addBias(const af::array &in) {
    af::dim4 dimsR(in.dims());
    ++dimsR[0];
    af::array R = af::constant(1.0, dimsR);                    // yes, full array, since very fast
    R(af::seq(1, af::end), af::span, af::span, af::span) = in; // overwrite with the values of in
    return R;
}

if (elements > 0) {
const int *maxGridSize =
cuda::getDeviceProp(cuda::getActiveDeviceId()).maxGridSize;
serializeArray<true>(dst.dims, dst.strides, ondims, src.strides,
Member

I think we could minimize and move non-wrapper related code to a function defined in copy.cpp similar to what I suggested about memcopy wrapper.

Contributor Author

I agree for the elements>0.
Moved to the upper layer.

// ALWAYS looping!!
int ioffset = g2 * istrides2;
int ooffset = g2 * ostrides2;
const int ioffsetEnd = ioffset + idims3 * istrides3;
Member

Similar to the change in the jit code: since the end can go beyond the actual number of elements when ioffset > 0, I think this would pose a problem in the last blocks along any given kernel launch axis.

Contributor Author

This is not an index but an iterator. In case istrides3 is negative, a > comparison would give wrong results.
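
A minimal sketch of that iterator-style loop shape (assumed, not the PR's exact kernel body; ostrides3 is assumed to be the matching output stride):

// The end value ioffsetEnd is an iterator-style bound, so the loop compares
// with != and also works when istrides3 is negative, where a < or > test
// would terminate at the wrong place.
template<typename T>
void copyAlongDim3(T *out, const T *in, int g2, int idims3, int istrides2,
                   int istrides3, int ostrides2, int ostrides3) {
    int ioffset          = g2 * istrides2;
    int ooffset          = g2 * ostrides2;
    const int ioffsetEnd = ioffset + idims3 * istrides3;
    for (; ioffset != ioffsetEnd; ioffset += istrides3, ooffset += ostrides3) {
        out[ooffset] = in[ioffset];
    }
}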

in.ptr += iinc1;
out.ptr += oinc1;
} while (g1 < idims1);
#endif
Member

The occupancy calculator shows both the PR kernel and the master kernel are near max. I am wondering if the master branch's kernel can be slightly modified to handle per-thread workload rather than introducing two additional do-while loops conditionally. That would also remove the need for the two additional template parameters or compile-time definitions. Occupancy shouldn't be affected by this - I have checked this for both versions of the kernel.

Contributor Author

CUDA has limitations on the dimensions it accepts for the blocks parameter in the launch call. For my GTX 750 Ti:

  • dim0 : <= 4,294,967,295 (assumed never hit, due to limit of GPU memory)
  • dim1: <= 65,536 (rarely hit)
  • dim2: <= 65,536 (very rarely hit)
    In case we do hit a limit, LOOP1 & LOOP2 are introduced, on top of the workload loop (dim3)
    To avoid this overhead for all arrays, they are conditional.

For OPENCL, we have different limits for the 'corresponding' global parameter (also on NVIDIA GPUs):

  • dim0: <= 18,446,744,073,709,551,615 (assumed never hit, due to limit on GPU memory)
  • dim1: <= 18,446,744,073,709,551,615 (assumed never hit, due to limit on GPU memory)
  • dim2: <= 18,446,744,073,709,551,615 (assumed never hit, due to limit on GPU memory)
    Here we only have the workload loop (dim3) remaining.

I have no idea how I can combine the dim1, dim2 and dim3 (workload) loops together into 1 loop, without exceptions.
Please give me an example.

By the way, the same issue exists in cuda/jit.cpp.
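
For reference, a standalone restatement of the folding quoted earlier (blocks.z = divup(blocks.y, maxBlocksY); blocks.y = divup(blocks.y, blocks.z);), which is what makes the conditional LOOP1/LOOP2 paths necessary; the function name is illustrative:

// When blocksY exceeds the hardware grid limit, the excess is folded into a
// second factor and the kernel then iterates over that factor (LOOP1/LOOP2).
static unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

unsigned foldExcessBlocks(unsigned &blocksY, unsigned maxBlocksY) {
    const unsigned blocksZ = divup(blocksY, maxBlocksY);
    blocksY                = divup(blocksY, blocksZ);
    return blocksZ;
}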

Comment on lines 137 to 141
{
TemplateTypename<T>(),
TemplateArg(loop1),
TemplateArg(loop2),
});
Member

Suggested change
{
TemplateTypename<T>(),
TemplateArg(loop1),
TemplateArg(loop2),
});
{
TemplateTypename<T>(),
},
{
DefineKeyValue(LOOP1, loop1),
DefineKeyValue(LOOP2, loop2),
}
);

Doing so can avoid the additional template parameters - loop1 and loop2 can also be macro definitions and they are included while generating the unique key for kernel. Just suggesting this alternative as these values aren't used in any calculations anyway inside the kernel.

@willyborn willyborn (Contributor Author) Aug 11, 2021

I especially wanted a different kernel for each of the combinations: <T, NO LOOPS>, <T, LOOP1>, <T, LOOP2>, <T, LOOP1 & LOOP2>.
By passing the LOOP values by value, won't I end up with only the combination <T, LOOP1 & LOOP2>?
This would mean that all cases get the full overhead, even when not needed.
This would impact the small arrays.
In reality, I expect that not all combinations will be used, since they depend on the slicing done by the application. In most cases only <T, NO LOOPS> will be compiled.

In case I'm wrong, please explain.
Is there an advantage by using macro definitions over template variables?

@willyborn (Contributor Author)

@9prady9
The reason the old kernels are no longer valid is a result of the hash calculation. For JIT kernels, we only take the function name (backend/cuda/jit.cpp:207) into account but not the real code. The function name remained the same, although the code behind (and corresponding parameters to call it) changed.
A solution could be that we include a 'fixed version number' into the hash calculation, so that we can force the generation of a new set of hash codes.

@willyborn willyborn closed this Jul 31, 2021

9prady9 commented Aug 1, 2021

@willyborn I can't recollect why I avoided hashing the JIT code, perhaps to avoid extra computation time.

The fixed-version solution might not work in a couple of corner cases, such as when a particular Node's logic changed but the name of the JIT tree comes out the same as earlier even with a version suffix (a JIT version of sorts). I think it is better we hash the JIT source as well to prevent problems such as these, unless it adds a performance hit to the JIT engine.


willyborn commented Aug 1, 2021 via email


9prady9 commented Aug 1, 2021

This will have a serious performance impact, since the code generation takes more time than the execution of the resulting kernel. You will also have to generate the kernel code even when it is cached, because the hash key is used to search for the cached kernel. The problem is however not as big as it seems, because the version nr of ArrayFire (3.9) is included in the filename. As a result, end users will never notice it.


I haven't timed the hashing mechanism. But yes, it would need to be hashed at each lookup, which would be inefficient.

I think we can solve this by re-using AF_REVISION defined in version.hpp, which is generated during the build. It has the commit hash and that should do the trick - adding AF_REVISION to the hashed inputs for the JIT workflow.
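
A hedged sketch of that idea (hypothetical helper; AF_REVISION is the macro from version.hpp mentioned above, passed in here as a plain string):

#include <functional>
#include <string>

// Fold the build revision into the cache key so kernels cached by an older
// build can never be looked up by a newer one, without having to hash the
// full generated JIT source on every lookup.
std::size_t jitCacheKey(const std::string &kernelName,
                        const std::string &buildRevision /* e.g. AF_REVISION */) {
    return std::hash<std::string>{}(buildRevision + '\0' + kernelName);
}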


willyborn commented Aug 3, 2021 via email

@willyborn (Contributor Author)

I notice that this PR is already closed.

Does it still make sense to update the code with the comments?
Can it still be merged into master, or am I too late for code updates?

I will independently continue trying to answer all your questions.


9prady9 commented Aug 4, 2021

@willyborn I was under the impression you closed it intentionally to avoid repetitive ci builds or something similar.

@9prady9 9prady9 reopened this Aug 4, 2021

9prady9 commented Aug 4, 2021

@willyborn Also, can you please remove the merge commit from this branch?

Whether against your master or against arrayfire's master branch, git rebase master should remove the merge commit. Also, feel free to squash the trace statement fix into your commit. I did these locally for easy review/test. If you want me to push, I can do that as well. Just let me know.

9prady9 previously requested changes Aug 4, 2021
src/backend/common/dispatch.hpp (outdated, resolved)
Comment on lines 108 to 112
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
: (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
Member

Even here (for CUDA backend):

Range of threads0 is [32, 64, 128]
Range of threads1 is [1, 2, 4, 8, 16, 32, 128]

Min value of threads01 being 32, first four conditions are not needed.

Suggested change
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
: (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
#ifdef AF_OPENCL
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
:
#endif
(threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4

Contributor Author

included


#ifdef AF_OPENCL
(d0 < warp) ? d0 :
#endif
(d1 == 1) ? warp * 4
Member

It might in CUDA backend but I doubt that in OpenCL backend since that isn't a hard-coded value like 32 in CUDA that is being passed down to bestBlockSize.



@umar456 umar456 (Member) left a comment

I really like the call to JIT in join. I am super excited about this change. Its pretty late for me now. I will try to add more comments tomorrow.

Comment on lines 42 to 43
Param<T> info(out.get(), src.dims().dims, src.strides().dims);
evalNodes(info, src.getNode().get());
Member

Nice! This is really clever.

Suggested change
Param<T> info(out.get(), src.dims().dims, src.strides().dims);
evalNodes(info, src.getNode().get());
evalNodes(out, src.getNode().get());

Array<T> objects implicitly convert to Param<T>.

Contributor Author

Yep, the src parameters are copied by the out = CreateEmptyArray<T>(src.dims()) into out.

Updated.

Contributor Author

I did NOT succeed in doing something similar in OPENCL.
The compiler would not convert and was confused between Param and vector...

Comment on lines 71 to 72
Param<T> info(dst.get(), src.dims().dims, dst.strides().dims);
evalNodes(info, src.getNode().get());
Member

Suggested change
Param<T> info(dst.get(), src.dims().dims, dst.strides().dims);
evalNodes(info, src.getNode().get());
evalNodes(dst, src.getNode().get());

Array objects implicitly convert to Param.

Contributor Author

src.dims() can be different from dst.dims(), so I have to build my own version of Param.

Example: test_assign_cuda.exe

for (int dim = 0, elements = 1; dim < ndims; ++dim) {
is_linear &= (elements == (int)outputs[0].strides[dim]);
elements *= (int)outDims[dim];
}
Member

Yes, outputs are currently linear for all the cases I recall.

vector<Param<T>> outputs{{out.get() + outOffset, iArray.dims().dims,
out.strides().dims}};
vector<common::Node *> nodes{iArray.getNode().get()};
evalNodes<T>(outputs, nodes);
Member

Again. I really like this. I think you can use evalMultiple here right?

I can see cases where the kernels may get too large but I am working on a way around that.

Contributor Author

I thought that for evalMultiple(), all the arrays needed to be the same size, which is not guaranteed here.

// optional dims: odims, ostrides
// ALL parameters will be updated!!
template<bool RESHAPE = false>
void serializeArray(dim_t mdims[4], dim_t mstrides[4], dim_t &mndims,
Member

I like the name collapseShape.

@willyborn (Contributor Author)

All remarks are included now, except one on 'src/backend/cuda/kernel/memcopy.cuh' from @9prady9, where I need some help.

I will start the testing before releasing.
After the discussions, I got some ideas to improve further on collapseShape & increasePerThreadWorkload.
This can take some time, because test runs typically take a few hours each.


9prady9 commented Oct 11, 2021

@willyborn Sorry about the delay on my end. I have been caught up with some work and other things. I will take a stab at this this week. If it isn't hard, I will try to rebase this. Whether I do it or not, there is another PR touching the jit code, #3177, which will have some conflicts with it. We might have to rebase this one again at that point. Just giving you a heads up.


willyborn commented Oct 11, 2021 via email


willyborn commented Oct 11, 2021 via email


9prady9 commented Oct 13, 2021

@willyborn Any idea on the ETA when this should be ready ? rough estimate is good.


willyborn commented Oct 13, 2021 via email


9prady9 commented Oct 13, 2021

@willyborn Sure, please do. We can have the JIT caching issue fix in 3.8.1. I am curious, if this fix is relevant to master or is it on top of this branch ? If it isn't associated with master, then we don't need to rush that one either.

@9prady9 9prady9 modified the milestones: 3.8.1, 3.8.2 Oct 14, 2021

umar456 commented Apr 5, 2022

@willyborn Just following up on this PR. Were you able to make progress here? I noticed there are a couple of conflicts on this branch. I can help you rebase your changes on top of the current master and get this merged.

@willyborn willyborn changed the title OPT: Improved memcopy, join & JIT (Needs cache cleanup) OPT: Improved memcopy, join & JIT Aug 2, 2022
@willyborn willyborn changed the title OPT: Improved memcopy, join & JIT OPT: Improved memcopy, JIT & join Aug 2, 2022
@willyborn (Contributor Author)

I attached a document with all my findings during this journey. I was surprised by some of the results.
You will also find the final performance results for many join cases. It is important to look at the full curve, instead of only at large arrays. Variations in speed mostly result from optimizations being possible or not (like vectorization).

Findings Copy&Join.docx
Join-Final.xlsx

@willyborn willyborn closed this Aug 2, 2022
@willyborn willyborn reopened this Aug 2, 2022

umar456 commented Aug 2, 2022

Hey @willyborn, This is amazing work. I am still going through the document and will go over the code later today. These are excellent findings. I'd often assumed that this function could be implemented in a better way. There is a lot of useful information here and I will try to turn this around asap. Do you believe this PR is ready to be merged from your perspective? Do you still have some unfinished work before I begin? I see there are a couple of conflicts. They seem pretty minor. I can help you with them if you have any issues resolving them.


willyborn commented Aug 2, 2022

I have the impression that those conflicts come from yet unpublished changes.
I verified whether the remote master was updated in the meantime, but it wasn't. (I have the impression that, internally, it was.)

Looking at the conflicts, the changes are minor, although I have no idea how to solve this without throwing away my changes.


umar456 commented Aug 2, 2022

Do you mind if I resolve them? I can push once I make the relevant changes to your branch.

@willyborn (Contributor Author)

No problem. At the end it is a team result.

@willyborn (Contributor Author)

I found the missing updates. GitHub will remain a mystery for me, or perhaps it is the MS Visual Studio integration which frequently puts me on the wrong foot.


umar456 commented Aug 2, 2022

Ahh okay. Go on ahead and push it with the update. Maybe the conflicts will be resolved with the rebase.

@willyborn willyborn force-pushed the Join+Copy branch 3 times, most recently from 79e6d57 to 9bc8f00 Compare August 3, 2022 10:56
@willyborn willyborn requested a review from 9prady9 August 3, 2022 10:57
@willyborn (Contributor Author)

Rebased and reran all tests.

General Threads/Blocks (Local/Global) calculations when all available dimensions are used., including optimized number of active parallel GPU threads.
@willyborn (Contributor Author)

@umar456 I missed one of your questions.
I have no outstanding work.


umar456 commented Sep 9, 2022

Hey @willyborn,

Sorry for the delay. I have not forgotten about this and I have been going through this over the last week or so. This is an invasive change so I am trying to be careful with the review. I really appreciate the effort and thought you have put into this and I want to make sure that I review every aspect of this before I merge it in.

@umar456 umar456 removed the request for review from 9prady9 September 22, 2022 20:22
@umar456 umar456 dismissed 9prady9’s stale review September 22, 2022 20:23

addressed changes

@umar456 umar456 merged commit bef4f10 into arrayfire:master Sep 22, 2022
@syurkevi syurkevi mentioned this pull request Apr 19, 2023
@willyborn willyborn deleted the Join+Copy branch May 11, 2023 21:34