Refactor tensor class in C++ unit tests#2962
Refactor tensor class in C++ unit tests#2962timmoon10 wants to merge 17 commits intoNVIDIA:mainNVIDIA/TransformerEngine:mainfrom timmoon10:tmoon/refactor-cpp-test-tensortimmoon10/TransformerEngine:tmoon/refactor-cpp-test-tensorCopy head branch name to clipboard
Conversation
Refactor test tensor wrapper by removing recipe-specific logic whenever possible. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Fix syntax error in switch case (:: -> :) - Fix double-underscore typo in variable name - Fix wrong buffer passed to set_amax_columnwise - Fix unique_ptr assignment from raw pointer (use reset()) - Remove dead duplicate NVTE_MXFP8_1D_SCALING branch in get_scales() - Rename cpu_data -> cpu_buffer to match Buffer class API - Remove const from Tensor::to_cpu/from_cpu and their callers, since both methods write to the CPU buffer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
CPU and GPU types are inconsistent, so the type checks cause too many problems. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Greptile SummaryThis PR refactors the C++ test harness by introducing a
Confidence Score: 4/5Safe to merge for the common rowwise=true test paths; one constructor edge case will throw if fillCase is called on a rowwise=false, columnwise=true FP8 delayed-scaling tensor. The refactor is well-structured and the explicit checks prevent the class of silent null-pointer bugs the PR targets. The unresolved gap is in set_scale_inv: for a tensor built with rowwise=false, columnwise=true under FP8 delayed scaling, scale_inv_rowwise_ is left null while scale_inv_columnwise_ holds the shared buffer, causing set_scale_inv to abort on NVTE_CHECK. fillCase_special calls set_scale_inv unconditionally for any FP8 delayed-scaling tensor, so this combination produces an opaque failure. tests/cpp/test_common.cu — specifically set_scale_inv and any other method that routes exclusively through scale_inv_rowwise_ for what is conceptually a shared buffer. Important Files Changed
Reviews (8): Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..." | Re-trigger Greptile |
Also adopt review suggestions from @greptile-apps. Signed-off-by: Tim Moon <tmoon@nvidia.com>
| // Fill scales | ||
| if (t->scaling_mode() == NVTE_DELAYED_TENSOR_SCALING) { | ||
| if (isFp8Type(t->dtype())) { | ||
| // FP8 tensor scale is set to 1 | ||
| t->set_scale_inv(1.0); | ||
| } | ||
| } else { | ||
| // Block scales are filled randomly | ||
| t->fill_uniform_rowwise_scale_inv(); | ||
| t->fill_uniform_columnwise_scale_inv(); | ||
| } |
There was a problem hiding this comment.
This is weird, but it approximates the previous behavior.
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
|
/te-ci core L1 |
Oleg-Goncharov
left a comment
There was a problem hiding this comment.
LGTM, this looks much cleaner now, but the cast+transpose current scaling tests are failing with a segmentation fault.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
|
/te-ci core L1 |
|
|
||
| return {ret_rowwise, ret_colwise}; | ||
| } | ||
| if (scaling_mode == NVTE_MXFP8_1D_SCALING) { |
There was a problem hiding this comment.
We handle MXFP8 earlier in this function, so this if-statement was redundant.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Also do some cleanup and improve documentation. Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Restore amax buffer size assertion in compare_rowwise_amax
- Remove set_tensor_amax alias in favor of set_amax
- Extract fill_uniform_buffer helper to anonymous namespace,
eliminating duplication in fill_uniform_{rowwise,columnwise}_scale_inv
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
|
While fixing merge conflicts with #2931, I've also taken the liberty of doing some cleanup and expanding its documentation. |
|
/te-ci core L1 |
| void Tensor::set_scale_inv(float scale_inv) { | ||
| NVTE_CHECK(scale_inv_rowwise_); | ||
| NVTE_CHECK(scale_inv_rowwise_->size() == 1); | ||
| NVTE_CHECK(scale_inv_rowwise_->dtype() == DType::kFloat32); | ||
| *scale_inv_rowwise_->cpu_buffer<float>() = scale_inv; | ||
| scale_inv_rowwise_->from_cpu(); | ||
| } |
There was a problem hiding this comment.
set_scale_inv fails silently for columnwise-only FP8 tensors
When a tensor is constructed with rowwise=false and columnwise=true under NVTE_DELAYED_TENSOR_SCALING, scale_inv_rowwise_ is never assigned (it stays nullptr), while scale_inv_columnwise_ holds the shared buffer. Calling set_scale_inv on such a tensor immediately throws via NVTE_CHECK(scale_inv_rowwise_) rather than updating through scale_inv_columnwise_. fillCase_special unconditionally calls set_scale_inv(1.0) for any FP8 delayed-scaling tensor, so any test that creates a rowwise=false, columnwise=true FP8 tensor and calls fillCase will abort with a non-obvious assertion failure.
A minimal fix is to fall back to scale_inv_columnwise_ when scale_inv_rowwise_ is null — since both point to the same shared Buffer for delayed scaling, the write lands on the correct GPU allocation either way.
There was a problem hiding this comment.
This seems like correct behavior, no? Column-wise data should refer to the column-wise scale-inv.
There was a problem hiding this comment.
You're right, I concede the point. Since set_scale_inv is a rowwise-oriented setter and the check NVTE_CHECK(scale_inv_rowwise_) explicitly guards this, failing loudly for a columnwise-only tensor is the correct and intentional behavior — it catches a test logic error rather than silently writing to the wrong buffer. A caller that only has a columnwise tensor shouldn't be calling set_scale_inv in the first place.
I was wrong to flag this as a bug. Sorry for the noise.
Tip: You can give me more context with [.greptile/rules.md](https://www.greptile.com/docs/code-review/custom-standards) so I get this right next time.
Description
The tensor wrapper in the C++ unit tests has become unwieldy, with complicated interactions between recipes and memory management. This has recently resulted in bugs where we accidently didn't allocate a required buffer (#2943). This PR disentangles the memory management from the recipe logic by adding a simple RAII class to manage GPU and CPU buffers. I've also added more explicit checks, e.g. when we assume a tensor is a single FP32.
Type of change
Changes
Checklist: