OPT: Improved memcopy, JIT & join #3144
Conversation
Some extra comments on the realized performance impact:
The CUDA scheduler optimizes much more than the OpenCL scheduler, even on the same hardware. More details, with the specific measurements, can be found in the file below.
Hi @willyborn, This is a tremendous change. Thank you for your contribution. I am looking at the benchmark numbers and they look fantastic. I am running my own benchmarks and I am seeing some good results as well. I did notice that the kernels fail if the older kernel cache is not removed. I suppose that would only be an issue in development branches. Once the version number is updated, the errors will not appear for the end user. I want to understand the kernels that are being generated and I will give my feedback in a couple of days.
I was reviewing this on and off and it took some time; sorry about the delay.
I have reviewed the CUDA backend, and the OpenCL changes seem quite similar, so I will browse through them once the feedback is addressed.
Thank you for your contribution!
Note: Also, kindly rebase from latest master. There is a minor conflict in cuda's jit.cpp file.
On another note, I haven't figured out why it fails if we keep the kernel cache, which it shouldn't; I will report back once I figure out why. It is important, and preferable, that users don't need to deal with this intervention.
@@ -8,16 +8,14 @@
 ********************************************************/

 typedef struct {
-    dim_t dim[4];
+    int dims[4];
This type change may cause type-compatibility issues with the dims_t struct defined elsewhere.
In this case, I changed the struct name to dims_type.
Is there a reason we can't use KParam like the other kernels?
I already had all the dims available in int format and no longer in dim_t format, because they are updated inside the memcopy.hpp calls.
On top of that, most of the OpenCL kernels convert all dims/indices into int inside their kernels.
Finally, I wanted to speed things up further by avoiding conversions and reducing the bytes sent to the GPU. If I have to send the dim_t version, I have to convert on the CPU and later convert again on the GPU.
Continuing calculations in dim_t format is much slower than in int on OpenCL. I did not have the same impression on CUDA, even when running on the same hardware.
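For illustration only, a minimal host-side sketch (names hypothetical, not the actual memcopy.hpp code) of the packing being described:

// Hypothetical sketch: after the dimension collapsing in memcopy.hpp the
// values already live in int, so they can go straight into the kernel
// parameter struct; routing them through a dim_t-based KParam would mean an
// int -> dim_t conversion on the host and a dim_t -> int conversion on the GPU.
typedef struct {
    int dims[4];    // mirrors the int-based struct in copy.cl
} dims_type;

static dims_type packDims(const int collapsed[4]) {
    dims_type d;
    for (int i = 0; i < 4; ++i) d.dims[i] = collapsed[i];
    return d;
}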
I understand the performance implications of using dim_t vs int. In fact, we originally had dim_t all over, but we later refactored to the current state where we keep dim_t outside of kernels and convert to int inside kernels, for the very same reason. This is the case even with the CUDA backend, not just OpenCL. The type-conversion overhead for a few scalars shouldn't be that noticeable.
I don't understand why you would have to convert from int to dim_t on the host. Can you please point me to the source location where that happens? As far as I can remember, all non-kernel code almost always deals in terms of dim_t, and the dim_t to int conversion happens inside kernels alone.
I updated dims_t to dims_type
I was referring to memcopy.hpp, where the dims and strides are updated during the array-dimension optimizations. This is all done in int, since the kernel continues in int.
So the latest values are only available in int in opencl/memcopy.hpp.
src/backend/common/dispatch.hpp
Outdated
const unsigned d2 = static_cast<unsigned>(dims[2]);
const unsigned OCC = 3;
const unsigned elements = d0 * d1;
const unsigned minThreads = warp / 4;  // quarter wave
What is the significance of the number 4 here? Could it be based on the max word-transfer limit of 128 bytes, i.e. 128 / 32 (warp) = 4?
If it is indeed based on the 128-byte transfer limit, doesn't this change with respect to the data type?
NVIDIA & AMD GPUs are built on top of processors using SIMD: on all NVIDIA GPUs, 1 instruction serves 8 datasets; on AMD, 1 instruction serves 16 datasets.
I continue for NVIDIA, although the same holds for AMD.
- The processor executes as follows:
Wave 1 - 1st instruction / datasets 0...7 (= 8 threads = quarter wave)
- 1st instruction / datasets 8...15 (= 8 threads = quarter wave)
- 1st instruction / datasets 16...23 (= 8 threads = quarter wave)
- 1st instruction / datasets 24...31 (= 8 threads = quarter wave)
Wave 2 - 2nd instruction / datasets 0...7 (= 8 threads)
- ...
- By repeating the same instruction 4x, we get the impression that a full warp is executed in parallel.
- Suppose only thread 24 exits; then the processor will service datasets 25...31 each with 1 instruction and 1 dataset --> 7 instructions for 7 datasets, instead of 1 instruction for 8 datasets.
- Suppose datasets 24...31 exit together; then everything continues, although now with a repetition of only 3x. Each (SIMD) instruction still services 8 datasets. Losing quarter waves keeps the processor at full efficiency, on condition that the caching of global memory can still keep up. From reading, I found that the caching system can keep up down to 50% of active datasets (16 in the case of NVIDIA). With fewer datasets (threads) active per processor, the processor will have to wait.
Since both architectures decided to split their respective waves into 4 instructions, we can use the same formula to determine minThreads for both NVIDIA and AMD.
The cache line size is 128 bytes for NVIDIA and 64 bytes for AMD (A10).
128 bytes = 32 datasets (full warp) * 4 bytes (size of float), which means that the GPU reads a full warp (linearly) at once. This is only useful when we can read this data sequentially. On top of that, when 1 cache line is read, the next 3 cache lines are already ordered in the hope that we will continue reading.
As a result of the larger cache line, NVIDIA will be faster in sequential reading but slower with random access (reading multiple buffers in parallel). The cache lines are also used to write to global memory, although that is fully asynchronous, so there are no waiting times (even for random access) as long as the total cache is not occupied.
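As a concrete illustration of the sequential-reading point, a hedged CUDA sketch (kernel name hypothetical):

// Hypothetical sketch: adjacent threads read adjacent floats, so the 32 loads
// of one warp land in a single 128-byte cache line on NVIDIA (two 64-byte
// lines on AMD); a strided pattern would instead touch one line per thread.
__global__ void coalescedCopy(float *out, const float *in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];  // consecutive threads -> consecutive addresses
}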
How much of the above do you want into the comments?
Based on what you explained about the constant, cache info is irrelevant for this comment.
I think something like the below should be good enough
<return> bestBlockSize(<params>) {
// Currently both NVIDIA and AMD hardware splits each warp/wavefront into
// four groups of 8/16 threads, respectively.
// NVIDIA - 32(warp) / 8 = 4
// AMD - 64(wavefront) / 16 = 4
constexpr unsigned WAVE_COUNT = 4;
const unsigned minThreads = warp / WAVE_COUNT;
....
}
My bad, I wasn't sure if such warp sub division happens. I don't think that's how CUDA warps work, at least from what I know. @umar456 Am I missing something from what @willyborn is referring to here ?
Updated as proposed.
Added references:
https://www.cvg.ethz.ch/teaching/2011spring/gpgpu/GPU-Optimization.pdf
https://en.wikipedia.org/wiki/Graphics_Core_Next#SIMD_Vector_Unit
I don't think NVIDIA has split warps since the Kepler architecture (2012). NVIDIA did this on Fermi and earlier cards. AMD still does this to some extent for their older generations. The new RDNA architecture moved to 32-thread wavefronts, and each thread is executed in parallel.
https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf
Suppose datasets 24...31 exit together; then everything continues, although now with a repetition of only 3x. Each (SIMD) instruction still services 8 datasets. Losing quarter waves keeps the processor at full efficiency, on condition that the caching of global memory can still keep up. From reading, I found that the caching system can keep up down to 50% of active datasets (16 in the case of NVIDIA). With fewer datasets (threads) active per processor, the processor will have to wait.
Ignoring the 4-clocks-per-wavefront point, the inactive threads in a wavefront will still execute a NOP instruction. This is the case even if only one thread is active. See pages 41-42 here: http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
Yes, I only read about such division from the days of Fermi.
I am thinking through this again and I think I am not entirely correct here. Yes, for floating-point operations the entire warp typically executes the same instruction (except on AMD), but for instructions that are more expensive (like doubles and special functions) this sort of behavior will occur. I think @willyborn's change will help for those types of operations.
src/backend/common/dispatch.hpp
Outdated
// Remark: The bestBlockSize is only best for independent element operations, as
// are: copying, scaling, math on independent elements, ...
// Since vector dimensions can be returned, it is NOT USABLE FOR BLOCK
// OPERATIONS, as are: matmul, etc.
👍🏾 A good heads up for users of this function.
I think a few comments/docs are missing from the function implementation itself with respect to the hard-coded constants. I have left additional comments at relevant lines, please have a look.
src/backend/common/dispatch.hpp
Outdated
const unsigned d0 = static_cast<unsigned>(dims[0]);
const unsigned d1 = static_cast<unsigned>(dims[1]);
const unsigned d2 = static_cast<unsigned>(dims[2]);
const unsigned OCC = 3;
It is good to have 3 as a function-local constant, but what is the significance of this 3? Does it depend on the GPU in use? If it varies per hardware, is it usually populated in the device properties?
In the copy kernels, we only have address calculations in the threads. This means that the calculation time is very low compared to the waiting time resulting from memory latency.
The total occupancy = occupancy(threads0) * occupancy(threads1) * occupancy(threads2).
- Since array dim[2] is mostly very small, we push occupancy(threads2) to 100% by only providing block dims that divide it exactly.
- For threads0 & threads1, occupancy always drops because the last block is incomplete, although still launched. The full occupancy is in brackets (assuming both are equal). The occupancy per dim is calculated as follows:
0 full blocks + 1 thread / 1 block allocated ==> 1/32, 1/64, ...
1 full block + 1 thread / 2 blocks allocated ==> >50% (25%)
2 full blocks + 1 thread / 3 blocks allocated ==> >66% (44%)
3 full blocks + 1 thread / 4 blocks allocated ==> >75% (56%)
From blogs, I read that around 50% of the threads is sufficient, so somewhere between 3 & 4 allocated blocks. The performance difference between the two is smaller than my measuring precision.
As a result, I expect it to be mostly independent of the GPU version. On top of that:
- Only a very small set of dimensions is possibly affected.
- I prefer to keep it inside the function, because it is not intended to be changed outside specific performance profiling.
- There are many uncertain assumptions, so this is not an exact science. In practice, you will mostly encounter powers of 2.
- On NVIDIA, I noticed that the driver optimizes much more, so multiple variations give the same result.
I can also remove the constant and put the final number directly in each expression.
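To make the per-dimension estimate concrete, a small stand-alone sketch (hypothetical helper, not the actual dispatch.hpp code):

// Hypothetical sketch: occupancy along one dimension is the fraction of
// launched threads that map to real elements; with the factor set to 3 the
// worst accepted case is "3 full blocks + 1 element in a 4th block", i.e.
// just above 75% for that dimension.
#include <cstdio>

static double dimOccupancy(unsigned elements, unsigned blockSize) {
    unsigned blocks = (elements + blockSize - 1) / blockSize;  // divup
    return (double)elements / (double)(blocks * blockSize);
}

int main() {
    std::printf("%.3f\n", dimOccupancy(97, 32));  // 97 elements, block 32 -> 97/128 ~ 0.758
    return 0;
}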
Maybe I wasn't clear; I am not suggesting making this constant global or hard-coded.
On the contrary, making it a constexpr (shown below) instead of a const will let the compiler optimize it away in all the arithmetic operations OCC is involved in, especially given that it is being used in conjunction with other hard-coded constants (powers of 2).
constexpr unsigned OCCUPANCY_FACTOR = 3; // I couldn't think of a vendor-agnostic name, perhaps you can
It (OCC) may be based on profiling and heuristics, but it should be explained in comments why that constant was chosen, so that any future reader of the code can understand why 3 and not anything else, more so since the type of GPU wasn't influencing the choice of this constant.
As for how much explanation is needed in the comments: I think a couple of sentences about occupancy and how this factor affects it, plus references to all, if not most, of the blogs you referred to, should do.
Updated with comments.
const unsigned OCC changed to constexpr unsigned OCCUPANCY_FACTOR
src/backend/common/dispatch.hpp
Outdated
const unsigned elements = d0 * d1;
const unsigned minThreads = warp / 4;  // quarter wave
const unsigned maxThreads =
    std::min(warp * 4, divup(elements * warp, 16384U) * minThreads);
A couple of comments w.r.t. the number 16384:
- It would be nice to have a comment on why this value specifically. Is it a hardware-agnostic constant?
- Instead of a large number, something like 1<<14 is perhaps more indicative that this constant is a power of 2, I think.
Looks like this number is based on your benchmarks? I see this number again, as 8192 * 2, on lines 301 and 303 of the src/backend/cuda/jit.cpp file.
The idea here is that for very small array dimensions the opposite is needed (lower occupancy).
Suppose a GPU can handle 1024 threads (32*32 warps) in parallel, and suppose our array is a vector with dim0=256 elements.
- When we go for maximum occupancy, we set a block size of 32, resulting in the activation of 8 parallel warps (= 8 CPUs). The other 24 CPUs will be idle. We also know that for each warp only 8 threads really run in parallel, so 64 truly parallel threads for the whole system, with 4 loops each.
- Would it not be better to use all the available CPUs (32) with only 1 loop? This is achieved by launching with a block size of 8 (= minThreads). The occupancy here is only 25%, although it is 4x faster than with 100%.
As long as the number of elements of an array is smaller than the number of available parallel threads, we limit maxThreads per block to create the occupancy inefficiency described above.
The number 16384U is an estimated average (= 512*32; 512 stream processors) over all the GPUs I looked up. Again, this number does not have to be precise, since the impact decreases the closer we get to it; on top of that, an error margin of 50% is allowed.
Increasing or lowering it did not have much of an impact. Removing it generated visible slowdowns.
I reused the same number later because I did not have anything better. The reasoning is the same: why increase the looping inside a thread when not all threads are occupied?
Do you want me to include all the above comments inside the code?
Sorry, but I am confused by the terms here: CPUs, parallel warps. Let's stick to NVIDIA architecture for ease of explanation. Can you please use their jargon: SM, Warp and thread to explain the logic here.
I'll try to stick to NVIDIA nomenclature.
Suppose the GPU has 4 cores (SPs), so 32*4 = 128 threads can run in parallel.
Suppose our array is a vector of 32 floats, all in dim0.
General rules:
- Each SP always receives a full warp. This is achieved by extending the threads up to a multiple of the warp size. Empty threads do not consume cycles and are skipped.
- When 8 threads execute the same instruction, the SIMD unit executes them in parallel (1 cycle).
- When 7 threads execute the same instruction, the ALU executes them serially (7 cycles).
- We set threads(128,1,1) and blocks(1,1,1), resulting in:
----------------------------------------------blockIdx0----------------------------------------------------
cycle1: SP0:threadIdx 0.. 7; SP1:threadIdx 33..40; SP2:threadIdx 65..72; SP3:threadIdx 97..104
cycle2: SP0:threadIdx 8..15; SP1:threadIdx 41..48; SP2:threadIdx 73..80; SP3:threadIdx 105..112
cycle3: SP0:threadIdx 16..23; SP1:threadIdx 49..56; SP2:threadIdx 81..88; SP3:threadIdx 113..120
cycle4: SP0:threadIdx 24..32; SP1:threadIdx 57..64; SP2:threadIdx 89..96; SP3:threadIdx 121..128
--> 25% of the available SPs are used (threads 33..128 exit immediately) and 4 cycles are needed.
--> Occupancy = 100%
- We set threads(32,1,1) and blocks(1,1,1), resulting in:
------blockIdx0-------- -------------IDLE--------------
cycle1: SP0:threadIdx 0.. 7; SP1:idle; SP2:idle; SP3:idle
cycle2: SP0:threadIdx 8..15; SP1:idle; SP2:idle; SP3:idle
cycle3: SP0:threadIdx 16..23; SP1:idle; SP2:idle; SP3:idle
cycle4: SP0:threadIdx 24..32; SP1:idle; SP2:idle; SP3:idle
--> 25% of the available SPs are used and 4 cycles are needed.
--> Occupancy = 25%
- We set threads(8,1,1) and blocks(4,1,1), resulting in:
------blockIdx0----- ------blockIdx1------ -------blockIdx2------- -------blockIdx3-------
cycle1: SP0:threadIdx 0..7; SP1:threadIdx 8..15; SP2:threadIdx 16..23; SP3:threadIdx 24..32
cycle2, cycle3, cycle4 are skipped since there are no threads left to launch
--> 100% of the available SPs are used and 1 cycle is needed.
--> Occupancy = 100%
Assume now that we only have 16 floats, all in dim0:
- We set threads(4,1,1) and blocks(4,1,1), resulting in:
----blockIdx0---- ----blockIdx1---- ----blockIdx2---- ----blockIdx3----
cycle1: SP0:threadIdx 0; SP1:threadIdx 4; SP2:threadIdx 8; SP3:threadIdx12
cycle2: SP0:threadIdx 1; SP1:threadIdx 5; SP2:threadIdx 9; SP3:threadIdx13
cycle3: SP0:threadIdx 2; SP1:threadIdx 6; SP2:threadIdx10; SP3:threadIdx14
cycle4: SP0:threadIdx 3; SP1:threadIdx 7; SP2:threadIdx11; SP3:threadIdx15
--> 100% of the available SPs are used and 4 cycles are needed.
--> Occupancy = 100%
- We set threads(8,1,1) and blocks(1,1,1), resulting in:
-----blockIdx0------
cycle1: SP0:threadIdx 0..7; SP1:idle; SP2:idle; SP3:idle
--> 25% of the available SPs are used and 1 cycle is needed.
--> Occupancy = 25%
Conclusion:
- The occupancy rate is useless for very small arrays; the number of cycles is what counts!
- The minimum is warp/4 (= number of parallel SIMD lanes), because then 1 cycle == 1 individual instruction.
I looked up multiple GPUs; they vary between 512 cores (Radeon R7, Vega 8) and 10496 cores (RTX 3090). As an average, I wanted to take 1024 cores (16384 threads on AMD). I also assumed that NVIDIA users would have bigger cards, so there I took 2048 cores.
In case the user has fewer cores: all cores will be used and the number of cycles increases (scheduling overhead).
In case the user has more cores: 4x faster compared to doing nothing (not all cores are used; no impact as long as memory latency is covered by 16384 threads).
AMD:    (elements * 64 / 16384) * 16 = elements / (16 * 16) * 16 --> 16 cores used
NVIDIA: (elements * 32 / 16384) * 8 = elements / (64 * 8) * 8 --> 64 cores used
I WILL RERUN THE PROFILING, because I think it should be:
AMD:    (elements / (1024 * 16)) * 16 = (elements / 16384) * 16 --> divup(d0*d1, 1<<14) * minThreads
NVIDIA: (elements / (2048 * 8)) * 8 = (elements / 16384) * 8 --> divup(d0*d1, 1<<14) * minThreads
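A stand-alone sketch of the resulting clamp (mirroring the expression in the diff above; function names illustrative, not the final dispatch.hpp code):

// Hypothetical sketch: for small arrays (d0*d1 well below ~16384 elements) the
// per-block thread count is scaled down so the blocks spread over more SPs/CUs;
// for large arrays the cap settles back at 4 warps per block.
#include <algorithm>

static unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

static unsigned maxThreadsPerBlock(unsigned elements, unsigned warp) {
    const unsigned minThreads = warp / 4;  // quarter warp/wavefront
    return std::min(warp * 4, divup(elements * warp, 16384U) * minThreads);
}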
First of all, I think this is fantastic work. Your logic here is valid. I think this can be improved if you take the number of compute units into account in your calculations. You can pass it into the function using getDeviceProp(id).multiProcessorCount on CUDA and getDevice().getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() on OpenCL. This will give you a general idea of the number of compute resources available on the device.
You will need to figure out the number of ALUs available on each compute unit. I think a safe minimum is 64. This number matches the minimum compute unit for AMD and NVIDIA GPUs. Intel GPUs have 8 shader units per compute unit, but let's ignore them for now.
I think that moving below 64 threads per work-group will give you diminishing returns. I think your model of the GPU scheduling is incorrect with respect to the minimum number of clocks required to do an operation. I don't think you can minimize the impact of an inactive thread in the wavefront like you mention in your response. If you have different results, then I would love to see them. I love to be proven wrong :).
Optimizing for very small data sizes will probably not yield favorable results. For these data sizes the bottleneck is going to be either driver overhead, memory-load overhead, or other processing overhead internal to ArrayFire. From an engineering point of view, it might be better to simplify the logic to increase maintainability (if you don't see significant gains).
When 8 threads execute the same instruction, the SIMD unit executes them in parallel (1 cycle). When 7 threads execute the same instruction, the ALU executes them serially (7 cycles).
I am not sure I understand you here. Can you please clarify?
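For reference, a hedged sketch of the raw device query mentioned above (plain CUDA runtime call, not ArrayFire's internal wrapper):

// Hypothetical sketch: query the number of SMs with the CUDA runtime API;
// ArrayFire's getDeviceProp() wrapper exposes the same multiProcessorCount
// field, and OpenCL offers getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>().
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    std::printf("SMs available: %d\n", prop.multiProcessorCount);
    return 0;
}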
As mentioned, I will rerun multiple performance tests with only this parameter changed.
It is possible that, due to the multiple earlier improvements on all parameters, I drew wrong conclusions.
More to come.
What I understand of SIMD is that all parallel lanes have to be occupied or none at all. As a consequence, when only 7 data lanes can be filled, the SP can no longer use the SIMD unit (1x 8 lanes in parallel) and is forced to use the ALU (7x 1) instead. Both, however, use the same number of cycles.
I think that moving below 64 threads per work-group will give you diminishing returns. I think your model of the GPU scheduling is incorrect with respect to the minimum number of clocks required to do an operation. I don't think you can minimize the impact of an inactive thread in the wavefront like you mention in your response. If you have different results, then I would love to see them. I love to be proven wrong :)
As mentioned in my last comment, I think I am incorrect here. I can see situations where smaller block sizes (<64 threads) can be beneficial for more expensive operations such as doubles and special functions. Please disregard that comment.
What I understand of SIMD is that all parallel lanes have to be occupied or none at all. As a consequence, when only 7 data lanes can be filled, the SP can no longer use the SIMD unit (1x 8 lanes in parallel) and is forced to use the ALU (7x 1) instead. Both, however, use the same number of cycles.
I don't think I have heard of that. All the documentation I have read treats the inactive threads as masked off, and the operation still takes one cycle even though only 7 threads are active. On AMD hardware, I understand that they have a scalar execution unit which comes into play when all threads perform the same operation on the same data. This is typically used for control flow and address operations.
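A tiny CUDA sketch of that masking behaviour (hypothetical kernel, for illustration only):

// Hypothetical sketch: threads failing the predicate are masked off, but the
// warp still issues the instruction once; having only a few active threads
// does not turn the add into one instruction per thread.
__global__ void maskedIncrement(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && idx < 7) data[idx] += 1;  // threads 7..31 of the warp are masked, not serialized
}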
src/backend/cuda/kernel/memcopy.hpp
Outdated
blocks.z = divup(blocks.y, maxBlocksY);
blocks.y = divup(blocks.y, blocks.z);

template<typename T>
void memcopyN(Param<T> out, std::vector<CParamPlus<T>> &Ins) {
I don't think we need an additional function memcopyN and the new object CParamPlus. We avoid both of these if we create memcopyN using the new memcopy version I suggested in my earlier comments.
Calling memcopy for each array separately means that the overhead has to be processed for each copy individually.
This was the model I had in the beginning.
After moving to memcopyN, I noticed a measurable difference (especially for small arrays, which are typical for addBias() functions).
In case you give priority to this complexity reduction, please let me know.
I know an alternative for addBias() that performs similarly to the current join implementation:
af::array addBias(const af::array &in) {
    af::dim4 dimsR(in.dims());
    ++dimsR[0];
    af::array R = af::constant(1.0, dimsR);  // yes, the full array, since this is very fast
    R(af::seq(1, af::end), af::span, af::span, af::span) = in;  // overwrite with the values of in
    return R;
}
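For instance, an illustrative use of the addBias() helper above (assuming the usual ArrayFire host API):

// Hypothetical usage: prepend a bias row of ones to a 4x8 batch.
#include <arrayfire.h>

int main() {
    af::array in  = af::randu(4, 8);
    af::array out = addBias(in);  // dims become 5x8; row 0 is all ones
    af_print(out);
    return 0;
}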
src/backend/cuda/kernel/memcopy.hpp
Outdated
if (elements > 0) {
    const int *maxGridSize =
        cuda::getDeviceProp(cuda::getActiveDeviceId()).maxGridSize;
    serializeArray<true>(dst.dims, dst.strides, ondims, src.strides,
I think we could minimize this and move the non-wrapper-related code to a function defined in copy.cpp, similar to what I suggested about the memcopy wrapper.
I agree for the elements > 0 check.
Moved to the upper layer.
src/backend/cuda/kernel/memcopy.cuh
Outdated
// ALWAYS looping!!
int ioffset = g2 * istrides2;
int ooffset = g2 * ostrides2;
const int ioffsetEnd = ioffset + idims3 * istrides3;
Similar to the change in the JIT code: since end can go beyond the actual number of elements when ioffset > 0, I think this would pose a problem in the last blocks along any given kernel launch axis.
This is not an index but an iterator. In case istrides3 is negative, the > comparison would give wrong results.
src/backend/cuda/kernel/memcopy.cuh
Outdated
in.ptr += iinc1;
out.ptr += oinc1;
} while (g1 < idims1);
#endif
The occupancy calculator shows both the PR kernel and the master kernel are near max. I am wondering if the master branch's kernel can be slightly modified to handle per-thread workload rather than conditionally introducing two additional do-while loops. That would also remove the need for the two additional template parameters or compile-time definitions. Occupancy shouldn't be affected by this - I have checked this for both versions of the kernel.
CUDA has limitations on the dimensions it accepts for the blocks parameter in the launch call. For my GTX 750 Ti:
- dim0: <= 4,294,967,295 (assumed never hit, due to the limit on GPU memory)
- dim1: <= 65,536 (rarely hit)
- dim2: <= 65,536 (very rarely hit)
In case we do hit a limit, LOOP1 & LOOP2 are introduced, on top of the workload loop (dim3).
To avoid this overhead for all arrays, they are conditional.
For OpenCL, we have different limits for the 'corresponding' global parameter (also on NVIDIA GPUs):
- dim0: <= 18,446,744,073,709,551,615 (assumed never hit, due to the limit on GPU memory)
- dim1: <= 18,446,744,073,709,551,615 (assumed never hit, due to the limit on GPU memory)
- dim2: <= 18,446,744,073,709,551,615 (assumed never hit, due to the limit on GPU memory)
Here only the workload loop (dim3) remains.
I have no idea how I can combine the dim1, dim2 and dim3 (workload) loops into 1 loop without exceptions.
Please give me an example.
By the way, the same issue exists in cuda/jit.cpp.
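To make the conditional decomposition concrete, a hedged host-side sketch (names hypothetical, not the actual launch code):

// Hypothetical sketch: when the required block count along y or z exceeds the
// CUDA grid limit, the launch is clamped to the limit and the kernel covers
// the remainder with an extra loop (LOOP1/LOOP2); otherwise those loops are
// compiled out so small arrays pay no extra overhead.
#include <algorithm>

struct LaunchPlan {
    unsigned gridY, gridZ;  // grid dimensions actually launched
    bool loop1, loop2;      // whether the kernel needs the extra loops
};

static LaunchPlan planLaunch(unsigned blocksY, unsigned blocksZ,
                             unsigned maxGridY, unsigned maxGridZ) {
    return {std::min(blocksY, maxGridY), std::min(blocksZ, maxGridZ),
            blocksY > maxGridY, blocksZ > maxGridZ};
}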
src/backend/cuda/kernel/memcopy.hpp
Outdated
{
    TemplateTypename<T>(),
    TemplateArg(loop1),
    TemplateArg(loop2),
});
Suggested change:
-    {
-        TemplateTypename<T>(),
-        TemplateArg(loop1),
-        TemplateArg(loop2),
-    });
+    {
+        TemplateTypename<T>(),
+    },
+    {
+        DefineKeyValue(LOOP1, loop1),
+        DefineKeyValue(LOOP2, loop2),
+    }
+    );
Doing so can avoid the additional template parameters: loop1 and loop2 can also be macro definitions, and they are included while generating the unique key for the kernel. Just suggesting this alternative, as these values aren't used in any calculations inside the kernel anyway.
I especially wanted a different kernel for each of the combinations: <T, NO LOOPS>, <T, LOOP1>, <T, LOOP2>, <T, LOOP1 & LOOP2>.
By passing the LOOP values by value, would I not end up with only the combination <T, LOOP1 & LOOP2>?
That would mean that all cases get the full overhead, even when it is not needed, and this would impact the small arrays.
In reality, I expect that not all combinations will be used, since they depend on the slicing done by the application. In most cases only <T, NO LOOPS> will be compiled.
In case I'm wrong, please explain.
Is there an advantage to using macro definitions over template parameters?
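For illustration, a hedged sketch of why compile-time booleans give four distinct kernels (simplified and hypothetical; the real kernel adds its grid loops where the comment indicates):

// Hypothetical sketch: each <T, LOOP1, LOOP2> combination is its own
// instantiation, so the <T, false, false> kernel carries no loop bookkeeping;
// a runtime check on loop1/loop2 inside a single kernel could not be stripped.
template<typename T, bool LOOP1, bool LOOP2>
__global__ void copySketch(T *out, const T *in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
    // the real kernel would add its LOOP1/LOOP2 grid loops here
}

template<typename T>
void launchCopySketch(T *out, const T *in, int n, bool loop1, bool loop2,
                      dim3 blocks, dim3 threads) {
    if (loop1 && loop2) copySketch<T, true, true><<<blocks, threads>>>(out, in, n);
    else if (loop1)     copySketch<T, true, false><<<blocks, threads>>>(out, in, n);
    else if (loop2)     copySketch<T, false, true><<<blocks, threads>>>(out, in, n);
    else                copySketch<T, false, false><<<blocks, threads>>>(out, in, n);
}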
@9prady9
@willyborn I can't recollect why I avoided hashing the JIT code; perhaps to avoid extra computation time. The fixed-version solution might not work in a couple of corner cases, such as when a particular Node's logic changes but the name of the JIT tree comes out the same as before, even with a version suffix (a sort of JIT version). I think it is better we hash the JIT source as well to prevent problems such as these, unless it adds a performance hit to the JIT engine.
This will have a serious performance impact, since the code generation takes more time than the execution of the resulting kernel.
You would also have to generate the kernel code even when it is cached, because the hash key is used to search for the cached kernel.
The problem is, however, not as big as it seems, because the version number of ArrayFire (3.9) is included in the filename. As a result, end users will never notice it.
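A minimal sketch of the trade-off being discussed (hypothetical helper, not the actual JIT cache code):

// Hypothetical sketch: a key derived only from the tree name plus the library
// version is cheap but misses node-logic changes within one version, while a
// key that also hashes the generated source is robust but forces the source
// to be generated before every cache lookup.
#include <functional>
#include <string>

static std::size_t cacheKey(const std::string &treeName,
                            const std::string &afVersion,
                            const std::string &generatedSource) {
    return std::hash<std::string>{}(treeName + '|' + afVersion + '|' + generatedSource);
}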
I haven't timed the hashing mechanism. But yes, it would need to be hashed at each lookup, which would be inefficient. I think we can do away with re-using
I notice that this PR is already closed. Does it still make sense to update the code with the comments? I will independently continue trying to answer all your questions.
@willyborn I was under the impression you closed it intentionally to avoid repetitive CI builds or something similar.
@willyborn Also, can you please remove the merge commit from this branch (the merge with your master, or even with arrayfire's master branch)?
src/backend/common/dispatch.hpp
Outdated
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
: (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
Even here (for CUDA backend):
Range of threads0 is [32, 64, 128]
Range of threads1 is [1, 2, 4, 8, 16, 32, 128]
Min value of threads01 being 32, first four conditions are not needed.
Suggested change:
-    (threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
-    : (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
-    : (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
-    : (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
-    : (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
+#ifdef AF_OPENCL
+    (threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
+    : (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
+    : (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
+    : (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
+    :
+#endif
+    (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
included
src/backend/common/dispatch.hpp
Outdated
#ifdef AF_OPENCL
    (d0 < warp) ? d0 :
#endif
    (d1 == 1) ? warp * 4
It might in the CUDA backend, but I doubt it in the OpenCL backend, since there it isn't a hard-coded value like the 32 in CUDA that is being passed down to bestBlockSize.
I really like the call to JIT in join. I am super excited about this change. It's pretty late for me now; I will try to add more comments tomorrow.
src/backend/cuda/copy.cpp
Outdated
Param<T> info(out.get(), src.dims().dims, src.strides().dims);
evalNodes(info, src.getNode().get());
Nice! This is really clever.
Suggested change:
-    Param<T> info(out.get(), src.dims().dims, src.strides().dims);
-    evalNodes(info, src.getNode().get());
+    evalNodes(out, src.getNode().get());
Array<T> objects implicitly convert to Param<T>.
Yep, the src parameters are copied into out by out = CreateEmptyArray<T>(src.dims()).
Updated.
I did NOT succeed in doing something similar in OpenCL.
The compiler would not convert and got confused between Param and vector...
src/backend/cuda/copy.cpp
Outdated
Param<T> info(dst.get(), src.dims().dims, dst.strides().dims);
evalNodes(info, src.getNode().get());
Suggested change:
-    Param<T> info(dst.get(), src.dims().dims, dst.strides().dims);
-    evalNodes(info, src.getNode().get());
+    evalNodes(dst, src.getNode().get());
Array objects implicitly convert to Param.
src.dims() can be different from dst.dims(), so I have to build my own version of Param.
Example: test_assign_cuda.exe
src/backend/cuda/jit.cpp
Outdated
for (int dim = 0, elements = 1; dim < ndims; ++dim) {
    is_linear &= (elements == (int)outputs[0].strides[dim]);
    elements *= (int)outDims[dim];
}
Yes, outputs are currently linear for all the cases I recall.
src/backend/cuda/join.cpp
Outdated
vector<Param<T>> outputs{{out.get() + outOffset, iArray.dims().dims,
                          out.strides().dims}};
vector<common::Node *> nodes{iArray.getNode().get()};
evalNodes<T>(outputs, nodes);
Again, I really like this. I think you can use evalMultiple here, right?
I can see cases where the kernels may get too large, but I am working on a way around that.
I thought that for evalMultiple() all the arrays needed to be the same size, which is not guaranteed here.
src/backend/cuda/kernel/memcopy.hpp
Outdated
// optional dims: odims, ostrides
// ALL parameters will be updated!!
template<bool RESHAPE = false>
void serializeArray(dim_t mdims[4], dim_t mstrides[4], dim_t &mndims,
I like the name collapseShape.
All remarks are included now, except one on src/backend/cuda/kernel/memcopy.cuh from @9prady9 where I need some help. I will start the testing before releasing.
@willyborn Sorry about the delay on my end. I have been caught up with some work and other things. I will take a stab at it this week. If it isn't hard, I will try to rebase this. Whether I do it or not, there is another PR touching the JIT code, #3177, which will have some conflicts with it. We might have to rebase this one again at that point. Just giving you a heads up.
No problem, I am continuing with the info provided last time. It made many things clear, resulting in much simplification.
I will provide a separate PR for the JIT disk caching. I found a way so that it now behaves identically to the rest, without extra performance impact. For this PR the JIT code is not even impacted. I will prepare this before continuing with the copy bench tests.
Willy
Pradeep,
Do not rebase, since the code is still in progress.
I know it is taking a long time, but I'm using this opportunity to better understand which factors are impacting performance. For the moment, I am around 40% better for u8 and 15% for f32, so still progressing (although I am approaching the HW limit).
I will rebase before pushing new code.
@willyborn Any idea on the ETA for when this should be ready? A rough estimate is good.
For the JIT disk caching bug, you will have the PR today.
All tests ran successfully yesterday evening.
For the copy, this will take more time and I do not have an ETA.
Willy
@willyborn Sure, please do. We can have the JIT caching fix in 3.8.1. I am curious whether this fix is relevant to master or only applies on top of this branch. If it isn't associated with master, then we don't need to rush that one either.
@willyborn Just following up on this PR. Were you able to make progress here? I noticed there are a couple of conflicts on this branch. I can help you rebase your changes on top of the current master and get this merged.
I attached a document with all my findings during this journey. I was surprised by some of the results.
Hey @willyborn, This is amazing work. I am still going through the document and will go over the code later today. These are excellent findings. I'd often assumed that this function could be implemented in a better way. There is a lot of useful information here and I will try to turn this around asap. Do you believe this PR is ready to be merged from your perspective? Do you still have some unfinished work before I begin? I see there are a couple of conflicts. They seem pretty minor. I can help you with them if you have any issues resolving them.
I have the impression that those conflicts are from yet unpublished changes. Looking at the conflicts, the changes are minor, although I have no idea how to solve this without throwing away my changes.
Do you mind if I resolve them? I can push once I make the relevant changes to your branch.
No problem. In the end it is a team result.
I found the missing updates. GitHub will remain a mystery to me, or perhaps it is the MS Visual Studio integration that frequently puts me on the wrong foot.
Ahh okay. Go ahead and push it with the update. Maybe the conflicts will be resolved with the rebase.
Force-pushed from 79e6d57 to 9bc8f00.
Rebased and reran all tests.
General threads/blocks (local/global) calculations when all available dimensions are used, including an optimized number of active parallel GPU threads.
@umar456 I missed one of your questions.
Hey @willyborn, Sorry for the delay. I have not forgotten about this and I have been going through this over the last week or so. This is an invasive change, so I am trying to be careful with the review. I really appreciate the effort and thought you have put into this, and I want to make sure that I review every aspect of this before I merge it in.
Description
Improvements are extreme for:
Fixes: #2417
Changes to Users
Performance improvement, less dependent on dimension.
Checklist