Refactor tensor class in C++ unit tests #2962

Open
timmoon10 wants to merge 17 commits into NVIDIA/TransformerEngine:main from timmoon10:tmoon/refactor-cpp-test-tensor

Conversation

@timmoon10
Collaborator

Description

The tensor wrapper in the C++ unit tests has become unwieldy, with complicated interactions between recipes and memory management. This has recently resulted in bugs where we accidentally failed to allocate a required buffer (#2943). This PR disentangles the memory management from the recipe logic by adding a simple RAII class to manage GPU and CPU buffers. I've also added more explicit checks, e.g. for when we assume a tensor holds a single FP32 value.
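
For orientation, here is a minimal sketch of what such an RAII wrapper can look like. Only the method names (cpu_buffer<T>(), to_cpu(), from_cpu()) follow the code quoted later in this thread; the members and constructor signature are assumed for illustration:

// Minimal sketch of an RAII class that owns a matching GPU/CPU buffer pair.
// Only the method names follow the PR; the details are assumed.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

class Buffer {
 public:
  explicit Buffer(size_t bytes) : bytes_(bytes), cpu_(bytes) {
    cudaMalloc(&gpu_, bytes_);
  }
  ~Buffer() { cudaFree(gpu_); }    // GPU memory is released automatically
  Buffer(const Buffer&) = delete;  // non-copyable, so no double-free
  Buffer& operator=(const Buffer&) = delete;

  // Mirror data between the two allocations.
  void to_cpu() { cudaMemcpy(cpu_.data(), gpu_, bytes_, cudaMemcpyDeviceToHost); }
  void from_cpu() { cudaMemcpy(gpu_, cpu_.data(), bytes_, cudaMemcpyHostToDevice); }

  template <typename T>
  T* cpu_buffer() { return reinterpret_cast<T*>(cpu_.data()); }
  void* gpu_dptr() { return gpu_; }
  size_t bytes() const { return bytes_; }

 private:
  size_t bytes_;
  void* gpu_ = nullptr;
  std::vector<unsigned char> cpu_;  // host mirror of the GPU allocation
};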

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Add class to manage GPU buffer, CPU buffer, and memory transfers between them.
  • Remove memory management logic from tensor class in C++ tests.
  • Add checks to accessors that make implicit assumptions on buffer size and dtype (see the sketch after this list).
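
The guard pattern in the last bullet looks roughly like this. This is a sketch: rowwise_data_ and TypeInfo are assumed names, while NVTE_CHECK and the accessor name appear in the diff quoted later on this page:

// Hypothetical accessor showing the new explicit presence/dtype checks.
template <typename T>
T* Tensor::rowwise_cpu_dptr() {
  NVTE_CHECK(rowwise_data_, "rowwise buffer was never allocated");
  NVTE_CHECK(rowwise_data_->dtype() == TypeInfo<T>::dtype,
             "requested dtype does not match buffer dtype");
  return rowwise_data_->cpu_buffer<T>();
}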

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 7 commits April 30, 2026 01:51
Refactor test tensor wrapper by removing recipe-specific logic whenever possible.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Fix syntax error in switch case (:: -> :)
- Fix double-underscore typo in variable name
- Fix wrong buffer passed to set_amax_columnwise
- Fix unique_ptr assignment from raw pointer (use reset(); see the sketch after this commit message)
- Remove dead duplicate NVTE_MXFP8_1D_SCALING branch in get_scales()
- Rename cpu_data -> cpu_buffer to match Buffer class API
- Remove const from Tensor::to_cpu/from_cpu and their callers,
  since both methods write to the CPU buffer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
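
The unique_ptr fix above is the standard pattern; a minimal generic illustration (the Widget type is hypothetical):

#include <memory>

struct Widget {};

int main() {
  std::unique_ptr<Widget> p;
  // p = new Widget{};             // does not compile: unique_ptr has no
  //                               // implicit assignment from a raw pointer
  p.reset(new Widget{});           // correct: take ownership explicitly
  p = std::make_unique<Widget>();  // or, preferably, avoid raw new entirely
  return 0;
}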
Signed-off-by: Tim Moon <tmoon@nvidia.com>
CPU and GPU types are inconsistent, so the type checks cause too many problems.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR refactors the C++ test harness by introducing a Buffer RAII class that owns matching GPU and CPU allocations, then rewires the Tensor class to delegate all memory management to Buffer instances. The motivation is a recent regression where a required buffer was silently never allocated; the new design makes that kind of mistake harder by separating recipe logic from memory lifecycle.

  • New Buffer class (test_common.h / test_common.cu): manages cudaMalloc/cudaFree and cudaMemcpy in a single place; Tensor member variables are now std::optional<Buffer> or std::shared_ptr<Buffer>, replacing bare CudaPtr fields and ad-hoc memory size tracking.
  • Explicit accessor guards: rowwise_cpu_dptr<T>(), columnwise_cpu_dptr<T>(), scale(), amax(), and similar methods now NVTE_CHECK for buffer presence and dtype match before returning a pointer, replacing the previous silent null/garbage-pointer returns.
  • Test-side fixes: all operator tests now cache output.scale() before the kernel call (guarded by isFp8Type) to avoid calling the accessor on non-FP8 tensors that have no scale buffer (see the sketch after this list); compute_ref in test_cast_nvfp4_transpose.cu is updated to take const float* amax to support both tensor-scaled and row-scaled NVFP4 paths uniformly.
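
The test-side pattern in the last bullet looks roughly like this. This is a sketch; output, otype, and the kernel call are stand-ins for the real test code:

// Cache the scale before launching the kernel, and only for FP8 outputs:
// non-FP8 tensors have no scale buffer, so the accessor would now throw.
float ref_scale = 1.f;
if (isFp8Type(otype)) {
  ref_scale = output.scale();
}
nvte_compute(input.data(), output.data(), stream);  // hypothetical kernel call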

Confidence Score: 4/5

Safe to merge for the common rowwise=true test paths; one constructor edge case will throw if fillCase is called on a rowwise=false, columnwise=true FP8 delayed-scaling tensor.

The refactor is well-structured and the explicit checks prevent the class of silent null-pointer bugs the PR targets. The unresolved gap is in set_scale_inv: for a tensor built with rowwise=false, columnwise=true under FP8 delayed scaling, scale_inv_rowwise_ is left null while scale_inv_columnwise_ holds the shared buffer, causing set_scale_inv to abort on NVTE_CHECK. fillCase_special calls set_scale_inv unconditionally for any FP8 delayed-scaling tensor, so this combination produces an opaque failure.

tests/cpp/test_common.cu — specifically set_scale_inv and any other method that routes exclusively through scale_inv_rowwise_ for what is conceptually a shared buffer.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/cpp/test_common.h | Adds Buffer RAII class for GPU/CPU memory management and refactors Tensor accessors with explicit dtype/presence checks. |
| tests/cpp/test_common.cu | Rewrites Tensor constructor to separate memory management from recipe logic; set_scale_inv fails for rowwise=false, columnwise=true FP8 delayed-scaling tensors. |
| tests/cpp/operator/test_cast_nvfp4_transpose.cu | Refactors compute_ref to take const float* amax, splitting row-scaled, 2D, and basic paths more clearly. |
| tests/cpp/operator/test_act.cu | Correctly gates output.scale() behind isFp8Type check. |
| tests/cpp/operator/test_cast.cu | Same ref_scale guard pattern as test_act.cu; no issues. |
| tests/cpp/operator/test_cast_float8blockwise.cu | Removes redundant DACT_FUNC_SWITCH outer wrap; purely structural. |
| tests/cpp/operator/test_normalization.cu | Uses new Tensor constructors and accessors; no logic changes. |

Reviews (8): Last reviewed commit: "Merge branch 'main' into tmoon/refactor-..."

Also adopt review suggestions from @greptile-apps.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Comment thread tests/cpp/test_common.cu
Comment on lines +940 to +950
// Fill scales
if (t->scaling_mode() == NVTE_DELAYED_TENSOR_SCALING) {
  if (isFp8Type(t->dtype())) {
    // FP8 tensor scale is set to 1
    t->set_scale_inv(1.0);
  }
} else {
  // Block scales are filled randomly
  t->fill_uniform_rowwise_scale_inv();
  t->fill_uniform_columnwise_scale_inv();
}
Collaborator Author

@timmoon10 timmoon10 May 6, 2026


This is weird, but it approximates the previous behavior.

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
@timmoon10
Collaborator Author

/te-ci core L1

Collaborator

@Oleg-Goncharov Oleg-Goncharov left a comment


LGTM, this looks much cleaner now, but the cast+transpose current scaling tests are failing with a segmentation fault.

@timmoon10
Collaborator Author

/te-ci core L1

Comment thread tests/cpp/test_common.cu

  return {ret_rowwise, ret_colwise};
}
if (scaling_mode == NVTE_MXFP8_1D_SCALING) {
Collaborator Author

@timmoon10 timmoon10 May 7, 2026


We handle MXFP8 earlier in this function, so this if-statement was redundant.

timmoon10 and others added 4 commits May 7, 2026 13:23
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Also do some cleanup and improve documentation.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
- Restore amax buffer size assertion in compare_rowwise_amax
- Remove set_tensor_amax alias in favor of set_amax
- Extract fill_uniform_buffer helper to anonymous namespace,
  eliminating duplication in fill_uniform_{rowwise,columnwise}_scale_inv
  (see the sketch after this commit message)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
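
The fill_uniform_buffer extraction mentioned above follows the usual pattern for file-local test helpers. A generic sketch — the real signature is not shown on this page, so everything here is assumed:

#include <cstddef>
#include <random>

namespace {

// File-local helper: fill a host buffer with uniform random floats,
// shared by the rowwise and columnwise scale-inv fillers.
void fill_uniform_buffer(float* data, size_t n, std::mt19937& gen) {
  std::uniform_real_distribution<float> dist(0.f, 1.f);
  for (size_t i = 0; i < n; ++i) {
    data[i] = dist(gen);
  }
}

}  // namespace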
@timmoon10
Collaborator Author

While fixing merge conflicts with #2931, I've also taken the liberty of doing some cleanup and expanding the documentation.

@timmoon10
Collaborator Author

/te-ci core L1

Comment thread tests/cpp/test_common.cu
Comment on lines +479 to +485
void Tensor::set_scale_inv(float scale_inv) {
  NVTE_CHECK(scale_inv_rowwise_);
  NVTE_CHECK(scale_inv_rowwise_->size() == 1);
  NVTE_CHECK(scale_inv_rowwise_->dtype() == DType::kFloat32);
  *scale_inv_rowwise_->cpu_buffer<float>() = scale_inv;
  scale_inv_rowwise_->from_cpu();
}
greptile-apps Bot (Contributor)


P1: set_scale_inv aborts for columnwise-only FP8 tensors

When a tensor is constructed with rowwise=false and columnwise=true under NVTE_DELAYED_TENSOR_SCALING, scale_inv_rowwise_ is never assigned (it stays nullptr), while scale_inv_columnwise_ holds the shared buffer. Calling set_scale_inv on such a tensor immediately throws via NVTE_CHECK(scale_inv_rowwise_) rather than updating through scale_inv_columnwise_. fillCase_special unconditionally calls set_scale_inv(1.0) for any FP8 delayed-scaling tensor, so any test that creates a rowwise=false, columnwise=true FP8 tensor and calls fillCase will abort with a non-obvious assertion failure.

A minimal fix is to fall back to scale_inv_columnwise_ when scale_inv_rowwise_ is null — since both point to the same shared Buffer for delayed scaling, the write lands on the correct GPU allocation either way.
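
A sketch of that suggested fallback, for concreteness (note that the suggestion is withdrawn in the follow-up below):

void Tensor::set_scale_inv(float scale_inv) {
  // Suggested (and later withdrawn) fallback: write through the columnwise
  // buffer when the rowwise one is absent; under delayed scaling both point
  // at the same shared Buffer.
  auto& buf = scale_inv_rowwise_ ? scale_inv_rowwise_ : scale_inv_columnwise_;
  NVTE_CHECK(buf);
  NVTE_CHECK(buf->size() == 1);
  NVTE_CHECK(buf->dtype() == DType::kFloat32);
  *buf->cpu_buffer<float>() = scale_inv;
  buf->from_cpu();
}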

Collaborator Author

@timmoon10 timmoon10 May 9, 2026


This seems like correct behavior, no? Column-wise data should refer to the column-wise scale-inv.

greptile-apps Bot (Contributor)


You're right, I concede the point. Since set_scale_inv is a rowwise-oriented setter and the check NVTE_CHECK(scale_inv_rowwise_) explicitly guards this, failing loudly for a columnwise-only tensor is the correct and intentional behavior — it catches a test logic error rather than silently writing to the wrong buffer. A caller that only has a columnwise tensor shouldn't be calling set_scale_inv in the first place.

I was wrong to flag this as a bug. Sorry for the noise.

Tip: You can give me more context with [.greptile/rules.md](https://www.greptile.com/docs/code-review/custom-standards) so I get this right next time.
