OPT: Improved memcopy, JIT & join #3144


Merged · 4 commits merged into arrayfire:master · Sep 22, 2022

Conversation


@willyborn willyborn commented Jun 9, 2021

  • Workload per thread is increased to optimize for memcopy and JIT.
  • memcopy can now increase its unit size.
  • JIT can now write directly into the final buffer, instead of first evaluating into a temporary buffer.
  • JIT now performs multiple copies in parallel (when they have the same size).
  • join now makes an optimal selection between memcopy and JIT.

Description

  • No feature changes for the end user, only a stable performance improvement.
    Improvements are extreme for:
    • vectors in the 2nd, 3rd or 4th dimension
    • arrays that are not yet evaluated (JIT nodes)
    • arrays fully contained in the L2 cache

Fixes: #2417

Changes to Users

Performance improvement only; performance is now less dependent on the array dimensions.

Checklist

  • [x] Rebased on latest master
  • [x] Code compiles
  • [x] Tests pass
  • Functions added to unified API
  • Functions documented

@willyborn (Contributor Author)

Some extra comments on the realized performance impact:

  • copy linear array (MAX throughput) -- should be highest possible throughput
    is performed by the enclosed copy functions (OCL & CUDA). The small improvements come from C++ optimizations in copy.cpp
  • A[+10,,][,,].copy (Non linear) -- improvement: (CU)1x..84x / (CL)3x..120x
    An evaluated array is copied to a new array whose 1st dimension is 10 elements larger. When the 1st dim is 1, this means that only 1 byte in 10 of the cached memory will be written, explaining the slow performance at the end. (Still 120x faster than with master.)
  • JIT.copy (Linear) -- improvement: (CU)3.7x / (CL)3.5x
    The array to be copied is not yet evaluated. In this case we no longer evaluate it into an intermediate buffer and then join the result into the final buffer. The evaluation writes directly into the final buffer, eliminating a write to and read from the intermediate buffer; see the sketch after this list. This explains why the throughput is higher than the HW specification.
    Memory consumption is also lower, since the intermediate buffer is no longer allocated.
  • join(0,A,B) -- improvement: (CU)2%..71% / (CL)4%..3x
    The common (optimized) memcopy function is now used. Only evaluated arrays are measured here. Performance is now more stable and less dependent on the array dimensions. In case we hit the HW limit, no further improvement is possible.
  • join(1,A,B) -- improvement: (CU)5%..47x / (CL)6%..50x
    Performance for vectors in the 2nd, 3rd or 4th dimension was very poor. Now they perform similarly to the rest.
  • join(3,A,B) -- improvement: (CU)6%..360x / (CL)8%..711x
    Same remark as join(1,A,B)
  • join(0,A,B,C,D) -- improvement: (CU)7%..70% / (CL)8%..3x
    Same remark as join(0,A,B)
  • join(1,A,B,C,D) -- improvement: (CU)5%..46x / (CL)7%..50x
    same remark as join(1,A,B)
  • join(0,4xJIT) -- improvement: (CU)3x..5x / (CL)3x..7x
    Improvements of JIT generation and direct writing into final buffer
  • join(1,4xJIT) -- improvement: (CU)3x..101x / (CL)3x..108x
    Improvements of JIT generation and direct writing into the final buffer. The JIT operation itself is also further optimized, resulting in the exceptional improvement of 108x.
  • Linear.as(f32|c32) only char -- improvement: (CU)20%..27% / (CL)18%..20%
    The improvements to the JIT engine are the only contributor to this result.
  • B[+1,,][,,f32|c32] -- improvement: (CU)0%..211x / (CL)0%.. 297x
    This is the only place where the copy (reshape & scale) function is measured.
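
For illustration, a minimal sketch of the JIT-direct-write idea mentioned above (the helper name is hypothetical; Param<T>, evalNodes and Array<T> are the ArrayFire internals quoted further down in this review, and outOffset is assumed to be the element offset of this input inside the join output):

// Sketch only: evaluate a not-yet-evaluated (JIT) input straight into its
// slice of the join output, so no temporary buffer is allocated, written and
// read back.
template<typename T>
void evalJitInputIntoJoinOutput(Array<T> &out, const Array<T> &in,
                                dim_t outOffset) {
    std::vector<Param<T>> outputs{
        {out.get() + outOffset, in.dims().dims, out.strides().dims}};
    std::vector<common::Node *> nodes{in.getNode().get()};
    evalNodes<T>(outputs, nodes);  // one pass: read the node, write to out
}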

The CUDA scheduler optimizes much more than the OCL scheduler, even on the same HW.
In the machine learning example, the needed time for 250 epochs reduced by 10%.

More details, with the specific measurements, can be found in the file below:
Join.xlsx


umar456 commented Jun 11, 2021

Hi @willyborn,

This is a tremendous change. Thank you for your contribution. I am looking at the benchmark numbers and they look fantastic. I am running my own benchmarks and I am seeing some good results as well. I did notice that the kernels fail if the older kernel cache is not removed. I suppose that would only be an issue in development branches; once the version number is updated, the errors will not appear for the end user. I want to understand the kernels that are being generated and will give my feedback in a couple of days.

@9prady9 9prady9 (Member) left a comment

I was reviewing this on and off and it took some time, sorry about the delay.

I have reviewed the CUDA backend; the OpenCL changes seem quite similar, so I will browse through them once the feedback is addressed.
Thank you for your contribution!

Note: Also, kindly rebase from latest master. There is a minor conflict in cuda's jit.cpp file.

On another note, I haven't figured out why it fails if we keep the cache of kernels, which it shouldn't; I will report back on that once I figure out why. It is important, and preferable, that users don't need to deal with this intervention.

@@ -8,16 +8,14 @@
********************************************************/

typedef struct {
dim_t dim[4];
int dims[4];
Member

This type change may likely cause type compatibility issues with the dims_t struct defined elsewhere.

Contributor Author

In this case, I change the struct name to dims_type.

Member

is there a reason we can't use KParam like the other kernels ?

Contributor Author

I already had all the dims available in int format and no longer in dim_t
format, because they are updated inside the memcopy.hpp calls.
On top of that, most of the OpenCL kernels convert all dims / idx into int
inside their kernels.

Finally, I wanted to speed things up further, avoiding conversions and
reducing the bytes sent to the GPU.
If I have to send the dim_t version, I have to convert on the CPU and later
convert again on the GPU.

Continuing calculations in dim_t format is really much slower than int on
OpenCL. I did not have the same impression on CUDA, even when running on
the same HW.

Member

I understand the performance implications of using dim_t vs int. In fact, we originally had dim_t all over but later refactored to the current state, where we keep dim_t outside of kernels but convert to int inside kernels, for the very same reason. This is the case even with the CUDA backend, not just OpenCL. The type conversion overhead for a few scalars shouldn't be that noticeable.

I don't understand why you would have to convert from int to dim_t on the host. Can you please point me to the source location where this is happening? As far as I can remember, all non-kernel code almost always deals in terms of dim_t, and dim_t is converted to int inside kernels alone.

Contributor Author

I updated dims_t to dims_type

Contributor Author

I was referring to memcopy.hpp, since the dims and strides are updated in the array dimension optimizations. This is all done in int, since the kernel continues in int.

So the latest values are only available in int in opencl/memcopy.hpp

const unsigned d2 = static_cast<unsigned>(dims[2]);
const unsigned OCC = 3;
const unsigned elements = d0 * d1;
const unsigned minThreads = warp / 4; // quarter wave
Member

What is the significance of the number 4 here? Could it be based on the max word transfer limit, 128 bytes, such as 128 / 32(warp) = 4?

If it is indeed based on the max word transfer limit (a 128-byte line), doesn't this change w.r.t. the data type?

Contributor Author

NVIDIA & AMD GPUs are built on top of processors using SIMD. For all NVIDIA GPUs: 1 instruction & 8 datasets; for AMD: 1 instruction & 16 datasets.
I continue with NVIDIA, although the same holds for AMD.

  • The processor now executes as follows:
    Wave 1
    • 1st instruction / 0...7 datasets (= 8 threads = quarter wave)
    • 1st instruction / 8...15 datasets (= 8 threads = quarter wave)
    • 1st instruction / 16...23 datasets (= 8 threads = quarter wave)
    • 1st instruction / 24...31 datasets (= 8 threads = quarter wave)
      Wave 2
    • 2nd instruction / 0...7 datasets (= 8 threads)
    • ...
  • By repeating the same instruction 4x, we get the impression that a full warp is executed in parallel.
  • Suppose only thread 24 exits, then the processor will service datasets 25...31 each with 1 instruction and 1 dataset --> 7 instructions for 7 datasets, instead of 1 instruction for 8 datasets.
  • Suppose datasets 24...31 exit together, then everything will continue, although now only with a repetition of 3x. Each instruction (SIMD) keeps servicing 8 datasets. Losing quarter waves will keep the processor at full efficiency, on condition that caching of global memory can still follow. From the literature, I learned that the caching system can keep up with down to 50% of active datasets (16 in the case of NVIDIA). With fewer datasets (threads) active per processor, the processor will have to wait.

Since both architectures decided to split their respective waves into 4 instructions, we can use the same formula to determine the minThreads for both NVIDIA and for AMD.

The cache line size is 128 bytes for NVIDIA and 64 bytes for AMD (A10).
128 bytes = 32 datasets (a full warp) * 4 bytes (size of float), which means that the GPU will read a full warp (linear) at once. This is only useful when we can read this data sequentially. On top of that, when 1 cache line is read, the next 3 cache lines are already requested in the hope that we will continue reading.
Because of its larger cache line, NVIDIA will be faster at sequential reading but slower at random access (reading multiple buffers in parallel). The cache lines are also used to write to global memory, although this is fully asynchronous, so there are no waiting times (even for random access) as long as the total cache is not occupied.

How much of the above do you want into the comments?

Member

Based on what you explained about the constant, cache info is irrelevant for this comment.

I think something like the below should be good enough

<return> bestBlockSize(<params>) {
    // Currently both NVIDIA and AMD hardware splits each warp/wavefront into
    // four groups of 8/16 threads, respectively.
    // NVIDIA - 32(warp) / 8       = 4
    // AMD    - 64(wavefront) / 16 = 4
    constexpr unsigned WAVE_COUNT = 4;
    const unsigned minThreads = warp / WAVE_COUNT;
    ....
}

Member

My bad, I wasn't sure if such warp sub division happens. I don't think that's how CUDA warps work, at least from what I know. @umar456 Am I missing something from what @willyborn is referring to here ?


Member

I don't think NVIDIA has split warps since the Kepler architecture (2012). NVIDIA did this on Fermi and earlier cards. AMD still does this to some extent for their older generations. The new RDNA architecture moved to 32-thread wavefronts and each thread is executed in parallel.

https://developer.amd.com/wp-content/resources/RDNA_Shader_ISA.pdf

Suppose datasets 24...31 exit together, then everything will continue, although now only with a repetition of 3x. Each instruction (SIMD) keeps servicing 8 datasets. Losing quarter waves will keep the processor at full efficiency, on condition that caching of global memory can still follow. From the literature, I learned that the caching system can keep up with down to 50% of active datasets (16 in the case of NVIDIA). With fewer datasets (threads) active per processor, the processor will have to wait.

Ignoring the 4-clocks-per-wavefront point, the inactive threads in a wavefront will still execute a NOP instruction. This is even the case if only one thread is active. See pages 41-42 here: http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

Member

Yes, I only read about such division from the days of Fermi.

Member

I am thinking through this again and I think I am not entirely correct here. Yes, for floating point operations the entire warp typically executes the same instruction (except on AMD), but for instructions that are more expensive (like doubles and special functions) this sort of behavior will occur. I think @willyborn's change will help for those types of operations.

// Remark: The bestBlockSize is only best for independent element operations, as
// are: copying, scaling, math on independent elements, ...
// Since vector dimensions can be returned, it is NOT USABLE FOR BLOCK
// OPERATIONS, as are: matmul, etc.
Member

👍🏾 A good heads up for users of this function.

I think a few comments/docs are missing from the function implementation itself with respect to the hard-coded constants. I have left additional comments at relevant lines, please have a look.

const unsigned d0 = static_cast<unsigned>(dims[0]);
const unsigned d1 = static_cast<unsigned>(dims[1]);
const unsigned d2 = static_cast<unsigned>(dims[2]);
const unsigned OCC = 3;
Member

It is good to have 3 as a function-local constant but what is the significance of this 3 ? Does it depend on the GPU in use ? If it varies per hardware, is it usually populated in device properties ?

Contributor Author

In the copy kernels, we only have address calculations in the threads. This means that calculation time is very low compared to the waiting time, resulting from memory latency.

The total occupation = occupation(threads0) * occupation(threads1) * occupation(threads2).

  • Since array dim[2] is mostly very small, we push occupation(threads2) to 100%, by only providing exactly dividable block dims.
  • For threads0 & threads1 the occupation always drops because the last block is incomplete, although still launched. The full occupation is in brackets (assuming both equal). The occupation per dim is calculated as follows:
    0 full blocks + 1 thread / 1 block allocated ==> 1/32, 1/64, ...
    1 full block + 1 thread / 2 blocks allocated ==> >50% (25%)
    2 full blocks + 1 thread / 3 blocks allocated ==> >66% (44%)
    3 full blocks + 1 thread / 4 blocks allocated ==> >75% (56%)
    From blogs, I read that around 50% thread occupation is sufficient, so somewhere between 3 & 4 allocated blocks. The performance difference between the two is smaller than my measuring precision.

As a result, I expect it to be mostly independent of the GPU version. On top of that,

  • only a very small set of dimensions is possibly affected.
  • I prefer to keep it inside the function, because it is not intended to be changed outside specific performance profiling.
  • There are many uncertain assumptions, so this is not an exact science. In practice, you will mostly encounter powers of 2.
  • On NVIDIA, I noticed that the driver optimizes much more so that multiple variations give the same result.

I can also remove the constant and put directly the final number in each expression.
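
As a standalone illustration of the rule described above (hypothetical helper, not the PR's code): a candidate block shape only qualifies when dims 0 and 1 each span at least OCC full blocks, so the partially filled last block stays a small fraction of the launch.

// Hypothetical sketch: with at least 3 full blocks per dimension the trailing
// partial block keeps per-dimension occupation above 75% (roughly 56% combined),
// which the profiling described above found to be sufficient.
constexpr unsigned OCCUPANCY_FACTOR = 3;

bool blockShapeIsEfficient(unsigned d0, unsigned d1,
                           unsigned threads0, unsigned threads1) {
    return d0 >= OCCUPANCY_FACTOR * threads0 &&
           d1 >= OCCUPANCY_FACTOR * threads1;
}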

Member

Maybe I wasn't clear; I am not suggesting making this constant global or hard coded.

On the contrary, making it a constexpr (shown below) instead of const will let the compiler optimize it away in all the arithmetic operations OCC is involved in, especially given that it is being used in conjunction with other hard-coded constants (powers of 2).

constexpr unsigned OCCUPANCY_FACTOR = 3; // I couldn't think of vendor agnostic name, perhaps you can

It (OCC) may be based on profiling and heuristics, but it should be explained in comments why that particular constant was chosen, so that any future reader of the code can understand why 3 and not anything else, more so since the type of GPU wasn't influencing the choice of this constant.

As far as how much of an explanation is needed in comments: I think a couple of sentences about occupancy and how this factor affects it, plus references to all, if not most, of the blogs you referred to, should do.

Contributor Author

Updated with comments.
const unsigned OCC changed to constexpr unsigned OCCUPANCY_FACTOR

const unsigned elements = d0 * d1;
const unsigned minThreads = warp / 4; // quarter wave
const unsigned maxThreads =
std::min(warp * 4, divup(elements * warp, 16384U) * minThreads);
Member

A couple of comments w.r.t. the number 16384:

  • It would be nice to have a comment on why this value specifically. Is it a hardware-agnostic constant?
  • Instead of a large literal, something like 1<<14 is perhaps more indicative that this constant is a power of 2, I think.

Member

Looks like this number is based on your benchmarks? Because I see this number again as 8192 * 2 on lines 301 and 303 of the src/backend/cuda/jit.cpp file.

Contributor Author

The idea here is that for very small array dimensions, the opposite is needed (lower occupation).

Suppose a GPU can handle 1024 threads (32 warps of 32 threads) in parallel. Suppose our array is a vector with dim0=256 elements.

  • When we go for maximum occupation, we set a block size of 32, resulting in the activation of 8 parallel warps (=8 CPUs). The other 24 CPUs will be idle. We also know that for each warp only 8 threads really run in parallel, so 64 truly parallel threads for the whole system, with 4 loops each.
  • Would it not be better to use all the available CPUs (32) with only 1 loop? This is achieved by launching with a block size of 8 (=minThreads). The occupation here is only 25%, although it is 4x faster than with 100%.

As long as the number of elements of an array is smaller than the number of available parallel threads, we limit the maxThreads for 1 block to create the occupation inefficiency described above.
The number 16384U is an estimated average (=512*32; 512 stream processors) over all the GPUs I looked up. Again, this number does not have to be precise, since the impact decreases the closer we get to it. On top of that, an error margin of 50% is allowed.

Increasing or lowering it did not have much of an impact. Removing it generated visible slowdowns.
I reused the same number later, because I did not have anything better. The reasoning is the same: why increase the looping inside a thread when not all threads are occupied?
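
As a worked example of the quoted expression (self-contained sketch; divup is the usual ceiling division used throughout the PR):

#include <algorithm>

// Ceiling division, as used in the PR.
static unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

// The quoted cap: small element counts shrink the block size so that more
// SMs / compute units receive a block.
unsigned maxThreadsFor(unsigned elements, unsigned warp) {
    const unsigned minThreads = warp / 4;  // quarter wave
    return std::min(warp * 4, divup(elements * warp, 16384u) * minThreads);
}

// Example: elements = 256, warp = 32 gives divup(256 * 32, 16384) = 1, so
// maxThreads = min(128, 1 * 8) = 8 -> many small blocks instead of a few big
// ones; for sufficiently large arrays the cap grows back to warp * 4 = 128.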

Do you want me to include all the above comments inside the code?

Member

Sorry, but I am confused by the terms here: CPUs, parallel warps. Let's stick to NVIDIA architecture for ease of explanation. Can you please use their jargon: SM, Warp and thread to explain the logic here.

@willyborn willyborn (Contributor Author) Aug 5, 2021

I'll try to stick to NVIDIA nomenclature.

Suppose the GPU has 4 cores (SPs), so 32*4 = 128 threads can run in parallel.
Suppose our array is a vector of 32 floats, all in dim0.

General rule:
each SP always receives a full warp. This is achieved by extending threads up to a multiple of the warp. Empty threads do not consume cycles and are skipped.
When 8 threads execute the same instruction, the SIMD will execute in parallel (1 cycle).
When 7 threads execute the same instruction, the ALU will execute in serial (7 cycles).

  • We set threads(128,1,1) and blocks(1,1,1), resulting in
    ----------------------------------------------blockIdx0----------------------------------------------------
    cycle1: SP0:threadIdx 0.. 7; SP1:threadIdx 32..39; SP2:threadIdx 64..71; SP3:threadIdx 96..103
    cycle2: SP0:threadIdx 8..15; SP1:threadIdx 40..47; SP2:threadIdx 72..79; SP3:threadIdx 104..111
    cycle3: SP0:threadIdx 16..23; SP1:threadIdx 48..55; SP2:threadIdx 80..87; SP3:threadIdx 112..119
    cycle4: SP0:threadIdx 24..31; SP1:threadIdx 56..63; SP2:threadIdx 88..95; SP3:threadIdx 120..127
    --> 25% of the available SPs does useful work (threads 32..127 exit immediately) and 4 cycles are needed.
    --> Occupation = 100%

  • We set threads(32,1,1) and blocks(1,1,1), resulting in
    ------blockIdx0-------- -------------IDLE--------------
    cycle1: SP0:threadIdx 0.. 7; SP1:idle; SP2:idle; SP3:idle
    cycle2: SP0:threadIdx 8..15; SP1:idle; SP2:idle; SP3:idle
    cycle3: SP0:threadIdx 16..23; SP1:idle; SP2:idle; SP3:idle
    cycle4: SP0:threadIdx 24..31; SP1:idle; SP2:idle; SP3:idle
    --> 25% of the available SPs is used and 4 cycles are needed.
    --> Occupation = 25%

  • We set threads(8,1,1) and blocks(4,1,1), resulting in
    ------blockIdx0----- ------blockIdx1------ -------blockIdx2------- -------blockIdx3-------
    cycle1: SP0:threadIdx 0..7; SP1:threadIdx 8..15; SP2:threadIdx 16..23; SP3:threadIdx 24..31
    cycle2, cycle3, cycle4 are skipped since there are no threads left to launch
    --> 100% of the available SPs is used and 1 cycle is needed.
    --> Occupation = 100%

Assume that we only have 16 floats, all in dim0

  • We set threads(4,1,1) and blocks(4,1,1), resulting in
    ----blockIdx0---- ----blockIdx1---- ----blockIdx2---- ----blockIdx3----
    cycle1: SP0:threadIdx 0; SP1:threadIdx 4; SP2:threadIdx 8; SP3:threadIdx12
    cycle2: SP0:threadIdx 1; SP1:threadIdx 5; SP2:threadIdx 9; SP3:threadIdx13
    cycle3: SP0:threadIdx 2; SP1:threadIdx 6; SP2:threadIdx10; SP3:threadIdx14
    cycle4: SP0:threadIdx 3; SP1:threadIdx 7; SP2:threadIdx11; SP3:threadIdx15
    --> 100% of available SP is used and 4 cycles needed.
    --> Occupation = 100%

  • We set threads(8,1,1) and blocks(1,1,1), resulting in
    -----blockIdx0------
    cycle 1: SP0:threadIdx 0..7; SP1:Idle; SP2:Idle; SP3:Idle
    --> 25% of available SP is used and 1 cycle needed.
    --> Occupation = 25%

Conclusion:

  • Occupation rate is useless for very small arrays; the number of cycles is what counts!!
  • Minimum is warp/4 (the number of parallel SIMD lanes), because then 1 cycle == 1 individual instruction.

I looked up multiple GPUs and they vary between 512 cores (Radeon R7, Vega 8) and 10496 cores (RTX 3090). As an average, I wanted to take 1024 cores (16384 threads on AMD). I also assumed that NVIDIA users would have bigger cards, so there I took 2048 cores.
In case the user has fewer cores: all cores will be used and cycles increase (scheduling overhead).
In case the user has more cores: 4x faster compared to doing nothing (not all cores are used; no impact when memory latency is covered by 16384 threads).
AMD:    (elements * 64 / 16384) * 16 = (elements / (16 * 16)) * 16 --> 16 cores used
NVIDIA: (elements * 32 / 16384) * 8  = (elements / (64 * 8)) * 8   --> 64 cores used

I WILL RERUN THE PROFILING, because I think it should be
AMD:    (elements / (1024 * 16)) * 16 = (elements / 16384) * 16 --> divup(d0*d1, 1<<14) * minThreads
NVIDIA: (elements / (2048 * 8)) * 8   = (elements / 16384) * 8  --> divup(d0*d1, 1<<14) * minThreads

Member

First of all, I think this is fantastic work. Your logic here is valid. I think this can be improved if you take the number of compute units into account in your calculations. You can pass it into the function using getDeviceProp(id).multiProcessorCount on CUDA and getDevice().getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() on OpenCL. This will give you a general idea of the number of compute resources available on the device.

You will need to figure out the number of ALUs available on each compute unit. I think a safe minimum is 64. This number matches the minimum compute unit for AMD and NVIDIA GPUs. Intel GPUs have 8 shader units per compute unit, but let's ignore them for now.

I think that moving below 64 threads per work-group will give you diminishing returns. I think your model of the GPU scheduling is incorrect with respect to the minimum number of clocks required to do an operation. I don't think you can minimize the impact of an inactive thread in the wavefront like you mention in your response. If you have different results then I would love to see it. I love to be proven wrong :).

Optimizing for very small data sizes will probably not yield favorable results. For these data sizes the bottleneck is going to be either driver overhead, memory-load overhead or other processing overhead internal to ArrayFire. From an engineering point of view, it might be better to simplify the logic to increase maintainability (if you don't see significant gains).

When 8 threads execute the same instruction, the SIMD will execute in parallel (1 cycle).
When 7 threads execute the same instruction, the ALU will execute in serial (7 cycles).

I am not sure I understand you here. Can you please clarify.

Contributor Author

As mentioned, I will rerun multiple performance tests with only this parameter changed.
It is possible that, due to multiple earlier improvements on all parameters, I drew wrong conclusions.
More to come.

From what I understand of SIMD, all parallel lanes have to be occupied or none at all. As a consequence, when only 7 data lanes are possible, the SP can no longer use the SIMD (1x 8-wide) and is forced to use the ALU (7x 1) instead. Both, however, use the same number of cycles.

Member

I think that moving below 64 threads per work-group will give you diminishing returns. I think your model of the GPU scheduling is incorrect with respect to the minimum number of clocks required to do an operation. I don't think you can minimize the impact of an inactive thread in the wavefront like you mention in your response. If you have different results then I would love to see it. I love to be proven wrong :)

As mentioned in my last comment, I think I am incorrect here. I can see situations where smaller block sizes (<64 threads) can be beneficial for more expensive operations such as doubles and special functions. Please disregard that comment.

From what I understand of SIMD, all parallel lanes have to be occupied or none at all. As a consequence, when only 7 data lanes are possible, the SP can no longer use the SIMD (1x 8-wide) and is forced to use the ALU (7x 1) instead. Both, however, use the same number of cycles.

I don't think I have heard of that. All the documentation I have read treats the inactive threads as masked off, and the operation still takes one cycle even if only 7 are active. On AMD hardware I understand that they have a scalar execution unit which comes into play when all threads perform the same operation on the same data. This is typically used for control flow and address operations.

blocks.z = divup(blocks.y, maxBlocksY);
blocks.y = divup(blocks.y, blocks.z);
template<typename T>
void memcopyN(Param<T> out, std::vector<CParamPlus<T>> &Ins) {
Member

I don't think we need the additional function memcopyN and the new object CParamPlus. We can avoid both of these if we create memcopyN using the new memcopy version I suggested in my earlier comments.

Contributor Author

Calling memcopy for each array separately means that the overhead has to be paid for each copy individually.
This was the model I had in the beginning.

After moving to memcopyN, I noticed a measurable difference (especially for small arrays, which are typical for addBias() functions).

In case you give priority to this complexity reduction, please let me know.
I know an alternative for addBias() that performs similarly to the current join implementation.

af::array addBias(const af::array &in) {
    af::dim4 dimsR(in.dims());
    ++dimsR[0];
    af::array R = af::constant(1.0, dimsR);                    // yes, full array, since very fast
    R(af::seq(1, af::end), af::span, af::span, af::span) = in; // overwrite with the values of in
    return R;
}

if (elements > 0) {
const int *maxGridSize =
cuda::getDeviceProp(cuda::getActiveDeviceId()).maxGridSize;
serializeArray<true>(dst.dims, dst.strides, ondims, src.strides,
Member

I think we could minimize and move non-wrapper related code to a function defined in copy.cpp similar to what I suggested about memcopy wrapper.

Contributor Author

I agree for the elements>0.
Moved to the upper layer.

// ALWAYS looping!!
int ioffset = g2 * istrides2;
int ooffset = g2 * ostrides2;
const int ioffsetEnd = ioffset + idims3 * istrides3;
Member

Similar to the change in the jit code: since the end can go beyond the actual number of elements when ioffset > 0, I think this would pose a problem in the last blocks along any given kernel launch axis.

Contributor Author

This is not an index but an iterator. In case istrides3 is negative, a > comparison would give wrong results.
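
A minimal sketch of that iterator-style loop shape (assumed, not the PR's exact kernel body; ostrides3 is assumed to be the matching output stride):

// The end value ioffsetEnd is an iterator-style bound, so the loop compares
// with != and also works when istrides3 is negative, where a < or > test
// would terminate at the wrong place.
template<typename T>
void copyAlongDim3(T *out, const T *in, int g2, int idims3, int istrides2,
                   int istrides3, int ostrides2, int ostrides3) {
    int ioffset          = g2 * istrides2;
    int ooffset          = g2 * ostrides2;
    const int ioffsetEnd = ioffset + idims3 * istrides3;
    for (; ioffset != ioffsetEnd; ioffset += istrides3, ooffset += ostrides3) {
        out[ooffset] = in[ioffset];
    }
}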

in.ptr += iinc1;
out.ptr += oinc1;
} while (g1 < idims1);
#endif
Member

The occupancy calculator shows both the PR kernel and the master kernel are near max. I am wondering if the master branch's kernel can be slightly modified to handle per-thread workload rather than introducing two additional do-while loops conditionally. That would also remove the need for the two additional template parameters or compile-time definitions. Occupancy shouldn't be affected by this - I have checked this for both versions of the kernel.

Contributor Author

CUDA has limitations on the dimensions it accepts for the blocks parameter in the launch call. For my GTX 750 Ti:

  • dim0 : <= 4,294,967,295 (assumed never hit, due to limit of GPU memory)
  • dim1: <= 65,536 (rarely hit)
  • dim2: <= 65,536 (very rarely hit)
    In case we do hit a limit, LOOP1 & LOOP2 are introduced, on top of the workload loop (dim3)
    To avoid this overhead for all arrays, they are conditional.

For OPENCL, we have different limits for the 'corresponding' global parameter (also on NVIDIA GPUs):

  • dim0: <= 18,446,744,073,709,551,615 (assumed never hit, due to limit on GPU memory)
  • dim1: <= 18,446,744,073,709,551,615 (assumed never hit, due to limit on GPU memory)
  • dim2: <= 18,446,744,073,709,551,615 (assumed never hit, due to limit on GPU memory)
    Here we only have the workload loop (dim3) remaining.

I have no idea how I can combine the dim1, dim2 and dim3 (workload) loops together into 1 loop, without exceptions.
Please give me an example.

By the way, the same issue exists in cuda/jit.cpp.
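
For reference, a standalone restatement of the folding quoted earlier (blocks.z = divup(blocks.y, maxBlocksY); blocks.y = divup(blocks.y, blocks.z);), which is what makes the conditional LOOP1/LOOP2 paths necessary; the function name is illustrative:

// When blocksY exceeds the hardware grid limit, the excess is folded into a
// second factor and the kernel then iterates over that factor (LOOP1/LOOP2).
static unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

unsigned foldExcessBlocks(unsigned &blocksY, unsigned maxBlocksY) {
    const unsigned blocksZ = divup(blocksY, maxBlocksY);
    blocksY                = divup(blocksY, blocksZ);
    return blocksZ;
}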

Comment on lines 137 to 141
{
TemplateTypename<T>(),
TemplateArg(loop1),
TemplateArg(loop2),
});
Member

Suggested change
{
TemplateTypename<T>(),
TemplateArg(loop1),
TemplateArg(loop2),
});
{
TemplateTypename<T>(),
},
{
DefineKeyValue(LOOP1, loop1),
DefineKeyValue(LOOP2, loop2),
}
);

Doing so can avoid the additional template parameters - loop1 and loop2 can also be macro definitions and they are included while generating the unique key for kernel. Just suggesting this alternative as these values aren't used in any calculations anyway inside the kernel.

@willyborn willyborn (Contributor Author) Aug 11, 2021

I especially wanted a different kernel for each of the combinations: <T, NO LOOPS>, <T, LOOP1>, <T, LOOP2>, <T, LOOP1 & LOOP2>.
By passing the LOOP values by value, won't I end up with only the combination <T, LOOP1 & LOOP2>?
This would mean that all cases get the full overhead, even when not needed.
This would impact the small arrays.
In reality, I expect that not all combinations will be used, since they depend on the slicing done by the application. In most cases only <T, NO LOOPS> will be compiled.

In case I'm wrong, please explain.
Is there an advantage by using macro definitions over template variables?

@willyborn (Contributor Author)

@9prady9
The reason the old kernels are no longer valid is a result of the hash calculation. For JIT kernels, we only take the function name (backend/cuda/jit.cpp:207) into account but not the real code. The function name remained the same, although the code behind (and corresponding parameters to call it) changed.
A solution could be that we include a 'fixed version number' into the hash calculation, so that we can force the generation of a new set of hash codes.

@willyborn willyborn closed this Jul 31, 2021

9prady9 commented Aug 1, 2021

@willyborn I can't recollect why I avoided hashing the JIT code, perhaps to avoid extra computation time.

The fixed-version solution might not work in a couple of corner cases, such as when a particular Node's logic changed but the name of the JIT tree comes out the same as earlier even with a version suffix (a JIT version of sorts). I think it is better we hash the JIT source as well to prevent problems such as these, unless it adds a performance hit to the JIT engine.


willyborn commented Aug 1, 2021 via email


9prady9 commented Aug 1, 2021

This will have a serious performance impact, since the code generation takes more time than the execution of the resulting kernel. You will also have to generate the kernel code even when it is cached, because the hash key is used to search for the cached kernel. The problem is however not as big as it seems, because the version nr of ArrayFire (3.9) is included in the filename. As a result, end users will never notice it.


I haven't timed the hashing mechanism. But yes, it would need to be hashed at each lookup, which would be inefficient.

I think we can solve this by re-using AF_REVISION defined in version.hpp, which is generated during the build. It has the commit hash and that should do the trick - adding AF_REVISION to the hashed inputs for the JIT workflow.
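
A hedged sketch of that idea (hypothetical helper; AF_REVISION is the macro from version.hpp mentioned above, passed in here as a plain string):

#include <functional>
#include <string>

// Fold the build revision into the cache key so kernels cached by an older
// build can never be looked up by a newer one, without having to hash the
// full generated JIT source on every lookup.
std::size_t jitCacheKey(const std::string &kernelName,
                        const std::string &buildRevision /* e.g. AF_REVISION */) {
    return std::hash<std::string>{}(buildRevision + '\0' + kernelName);
}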


willyborn commented Aug 3, 2021 via email

@willyborn (Contributor Author)

I notice that this PR is already closed.

Does it still make sense to update the code with the comments?
Can it still be merged into master, or am I too late for code updates?

I will independently continue trying to answer all your questions.


9prady9 commented Aug 4, 2021

@willyborn I was under the impression you closed it intentionally to avoid repetitive ci builds or something similar.

@9prady9 9prady9 reopened this Aug 4, 2021

9prady9 commented Aug 4, 2021

@willyborn Also, can you please remove the merge commit from this branch?

Whether against your master or against arrayfire's master branch, git rebase master should remove the merge commit. Also, feel free to squash the trace statement fix into your commit. I did these locally for easy review/test. If you want me to push, I can do that as well. Just let me know.

9prady9 previously requested changes Aug 4, 2021
src/backend/common/dispatch.hpp (outdated, resolved)
Comment on lines 108 to 112
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
: (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
Member

Even here (for CUDA backend):

Range of threads0 is [32, 64, 128]
Range of threads1 is [1, 2, 4, 8, 16, 32, 128]

Min value of threads01 being 32, first four conditions are not needed.

Suggested change
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
: (threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4
#ifdef AF_OPENCL
(threads01 <= maxThreads / 64) && !(d2 & (64 - 1)) ? 64
: (threads01 <= maxThreads / 32) && !(d2 & (32 - 1)) ? 32
: (threads01 <= maxThreads / 16) && !(d2 & (16 - 1)) ? 16
: (threads01 <= maxThreads / 8) && !(d2 & (8 - 1)) ? 8
:
#endif
(threads01 <= maxThreads / 4) && !(d2 & (4 - 1)) ? 4

Contributor Author

included


#ifdef AF_OPENCL
(d0 < warp) ? d0 :
#endif
(d1 == 1) ? warp * 4
Member

It might in CUDA backend but I doubt that in OpenCL backend since that isn't a hard-coded value like 32 in CUDA that is being passed down to bestBlockSize.



@umar456 umar456 (Member) left a comment

I really like the call to JIT in join. I am super excited about this change. Its pretty late for me now. I will try to add more comments tomorrow.

Comment on lines 42 to 43
Param<T> info(out.get(), src.dims().dims, src.strides().dims);
evalNodes(info, src.getNode().get());
Member

Nice! This is really clever.

Suggested change
Param<T> info(out.get(), src.dims().dims, src.strides().dims);
evalNodes(info, src.getNode().get());
evalNodes(out, src.getNode().get());

Array<T> objects implicitly convert to Param<T>.

Contributor Author

Yep, the src parameters are copied by the out = CreateEmptyArray<T>(src.dims()) into out.

Updated.

Contributor Author

I did NOT succeed in doing something similar in OPENCL.
The compiler would not convert and was confused between Param and vector...

Comment on lines 71 to 72
Param<T> info(dst.get(), src.dims().dims, dst.strides().dims);
evalNodes(info, src.getNode().get());
Member

Suggested change
Param<T> info(dst.get(), src.dims().dims, dst.strides().dims);
evalNodes(info, src.getNode().get());
evalNodes(dst, src.getNode().get());

Array objects implicitly convert to Param.

Contributor Author

src.dims() can be different from dst.dims(), so I have to build my own version of Param.

Example: test_assign_cuda.exe

for (int dim = 0, elements = 1; dim < ndims; ++dim) {
is_linear &= (elements == (int)outputs[0].strides[dim]);
elements *= (int)outDims[dim];
}
Member

Yes, outputs are currently linear for all the cases I recall.

vector<Param<T>> outputs{{out.get() + outOffset, iArray.dims().dims,
out.strides().dims}};
vector<common::Node *> nodes{iArray.getNode().get()};
evalNodes<T>(outputs, nodes);
Member

Again. I really like this. I think you can use evalMultiple here right?

I can see cases where the kernels may get too large but I am working on a way around that.

Contributor Author

I thought that for evalMultiple(), all the arrays needed to be the same size, which is not guaranteed here.

// optional dims: odims, ostrides
// ALL parameters will be updated!!
template<bool RESHAPE = false>
void serializeArray(dim_t mdims[4], dim_t mstrides[4], dim_t &mndims,
Member

I like the name collapseShape.

@willyborn (Contributor Author)

All remarks are included now, except one on 'src/backend/cuda/kernel/memcopy.cuh' from @9prady9, where I need some help.

I will start the testing before releasing.
After the discussions, I got some ideas to improve further on collapseShape & increasePerThreadWorkload.
This can take some time, because test runs typically take a few hours each.


9prady9 commented Oct 11, 2021

@willyborn Sorry about the delay on my end. I have been caught up with some work and other things. I will take a stab at this this week. If it isn't hard, I will try to rebase this. Whether I do it or not, there is another PR touching the jit code, #3177, which will have some conflicts with it. We might have to rebase this one again at that point. Just giving you a heads up.


willyborn commented Oct 11, 2021 via email


willyborn commented Oct 11, 2021 via email


9prady9 commented Oct 13, 2021

@willyborn Any idea on the ETA when this should be ready ? rough estimate is good.


willyborn commented Oct 13, 2021 via email


9prady9 commented Oct 13, 2021

@willyborn Sure, please do. We can have the JIT caching issue fix in 3.8.1. I am curious, if this fix is relevant to master or is it on top of this branch ? If it isn't associated with master, then we don't need to rush that one either.

@9prady9 9prady9 modified the milestones: 3.8.1, 3.8.2 Oct 14, 2021

umar456 commented Apr 5, 2022

@willyborn Just following up on this PR. Were you able to make progress here? I noticed there are a couple of conflicts on this branch. I can help you rebase your changes on top of the current master and get this merged.

@willyborn willyborn changed the title OPT: Improved memcopy, join & JIT (Needs cache cleanup) OPT: Improved memcopy, join & JIT Aug 2, 2022
@willyborn willyborn changed the title OPT: Improved memcopy, join & JIT OPT: Improved memcopy, JIT & join Aug 2, 2022
@willyborn (Contributor Author)

I attached a document with all my findings during this journey. I was surprised by some of the results.
You will also find the final performance results for many join cases. It is important to look at the full curve, instead of only at large arrays. Variations in speed mostly result from optimizations being possible or not (like vectorization).

Findings Copy&Join.docx
Join-Final.xlsx

@willyborn willyborn closed this Aug 2, 2022
@willyborn willyborn reopened this Aug 2, 2022

umar456 commented Aug 2, 2022

Hey @willyborn, This is amazing work. I am still going through the document and will go over the code later today. These are excellent findings. I'd often assumed that this function could be implemented in a better way. There is a lot of useful information here and I will try to turn this around asap. Do you believe this PR is ready to be merged from your perspective? Do you still have some unfinished work before I begin? I see there are a couple of conflicts. They seem pretty minor. I can help you with them if you have any issues resolving them.


willyborn commented Aug 2, 2022

I have the impression that those conflicts come from yet unpublished changes.
I verified whether the remote master was updated in the meantime, but it wasn't. (I have the impression that, internally, it was.)

Looking at the conflicts, the changes are minor, although I have no idea how to solve this without throwing away my changes.


umar456 commented Aug 2, 2022

Do you mind if I resolve them? I can push once I make the relevant changes to your branch.

@willyborn (Contributor Author)

No problem. At the end it is a team result.

@willyborn (Contributor Author)

I found the missing updates. GitHub will remain a mystery for me, or perhaps it is the MS Visual Studio integration which frequently puts me on the wrong foot.


umar456 commented Aug 2, 2022

Ahh okay. Go on ahead and push it with the update. Maybe the conflicts will be resolved with the rebase.

@willyborn willyborn force-pushed the Join+Copy branch 3 times, most recently from 79e6d57 to 9bc8f00 Compare August 3, 2022 10:56
@willyborn willyborn requested a review from 9prady9 August 3, 2022 10:57
@willyborn (Contributor Author)

Rebased and reran all tests.

General Threads/Blocks (Local/Global) calculations when all available dimensions are used., including optimized number of active parallel GPU threads.
@willyborn (Contributor Author)

@umar456 I missed one of your questions.
I have no outstanding work.


umar456 commented Sep 9, 2022

Hey @willyborn,

Sorry for the delay. I have not forgotten about this and I have been going through this over the last week or so. This is an invasive change so I am trying to be careful with the review. I really appreciate the effort and thought you have put into this and I want to make sure that I review every aspect of this before I merge it in.

@umar456 umar456 removed the request for review from 9prady9 September 22, 2022 20:22
@umar456 umar456 dismissed 9prady9’s stale review September 22, 2022 20:23

addressed changes

@umar456 umar456 merged commit bef4f10 into arrayfire:master Sep 22, 2022
@syurkevi syurkevi mentioned this pull request Apr 19, 2023
@willyborn willyborn deleted the Join+Copy branch May 11, 2023 21:34