ENH: lib: Allow usecols to be a callable in loadtxt(). #21800

WarrenWeckesser · Jun 20, 2022

This is the C version of #15995.

Actually, it is a subset of that PR. The new feature added here is to allow usecols to be callable. In this PR, I haven't included the option for usecols to be a slice, and I haven't included the convenience function skip. Any column selection that could be done with a slice can also be done with a callable, and there are selections that a callable can make that cannot be done with a slice--i.e. the callable option is the most flexible. The smaller scope of this PR should make it easier to review; handling a slice can be added in a follow-up if there is interest.
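To make the slice-versus-callable comparison concrete, here is a small plain-Python sketch. It does not call loadtxt at all. It assumes, as proposed in this PR, that the callable receives the number of columns n and returns a sequence of column indices. The names slice_to_callable and first_last_and_middle are hypothetical and exist only for illustration.

```python
def slice_to_callable(s):
    """Wrap a slice as a callable of the column count n (hypothetical helper)."""
    return lambda n: list(range(n)[s])


def first_last_and_middle(n):
    """A selection no single slice can express: first, last, then middle column."""
    return [0, n - 1, n // 2]


every_other = slice_to_callable(slice(1, None, 2))
last_two = slice_to_callable(slice(-2, None))

print(every_other(6))            # [1, 3, 5]
print(last_two(5))               # [3, 4]
print(first_last_and_middle(5))  # [0, 4, 2]
```

Any slice maps to a callable this way, while the last selection is not an arithmetic progression and so has no slice equivalent.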

The skip function in #15995 is just a Python function, so it could simply be copied over from that PR to here if folks think it is important. Of course, it too could be added in a follow-up PR.

(seberg: xref gh-13878)

seberg

I don't have a strong opinion on the API addition, but it seems good to me and does allow further additions pretty nicely (by doing them in Python).

A few small comments on the code; the only bigger one is that I think we need to reject (or add logic for?) changes in the number of columns n, since that would affect the result of the function.

seberg · Jun 20, 2022

numpy/core/src/multiarray/textreading/rows.c

+                // create the usecols array.
+                PyObject *seq = PyObject_CallFunction(usecols_obj, "n",
+                                                      current_num_fields);
+                if (PyErr_Occurred()) {


Suggested change

-                if (PyErr_Occurred()) {
+                if (seq == NULL) {

seberg · Jun 20, 2022

numpy/core/src/multiarray/textreading/rows.c

+                        "sequence of ints, but it returned an instance of "
+                        "type '%s'", Py_TYPE(seq)->tp_name);
+                    Py_DECREF(seq);
+                    goto error;


I am wondering if the "length check" can't be just part of the helper? We use the same error in any case?
Could even use PySequence_Fast in the helper, but that doesn't really matter.

That would mean that the helper should probably return num_usecols and fill in usecols (passed by pointer).

seberg · Jun 20, 2022

numpy/core/src/multiarray/textreading/seq_to_ssize_c_array.c

+        Py_DECREF(tmp);
+    }
+    return arr;
+}


Can we move this to conversion_utils.c? There is similar functionality there already, and now that it is split out, maybe it is time to just put it there.

seberg · Jun 20, 2022

numpy/core/src/multiarray/textreading/rows.c

+        // This function owns usecols if a callable usecols_obj was given.
+        PyMem_Free(usecols);
+        usecols = NULL;  // An overabundance of caution...
+    }


Would it be ugly to move all usecols handling here? It's fine as is, but conditional ownership tends to be an anti-pattern.

Another way to spell it would be to have usecol_from_func = ... and usecol = usecol_from_func, so that the owner is unconditionally usecol_from_func and not usecol itself.

> Would it be ugly to move all usecols handling here?

No, I think that would be fine. In fact, with a bit more of the argument processing code moved into read_rows(), I think the function _readtext_from_stream() can be eliminated, and _load_from_filelike() can call read_rows() directly. I'll give that a shot, and push to the PR if it looks reasonable.

seberg · Jun 20, 2022

numpy/lib/tests/test_loadtxt.py

+    with pytest.raises(RuntimeError,
+                       match='the number of fields in the given dtype'):
+        np.loadtxt(txt, usecols=lambda n: [0, 2, 3], dtype=dt)
+


We are missing a test (and I think also the logic) to reject a change in the number of columns when a callable is given. This is a bit of a tricky case!

The problem is that if n changes, the return value of the usecols function might also change! Normally, when usecols is given as a fixed sequence, we do not have to worry about that at all.

The example would be something like:

np.loadtxt(["1 2", "1 2 3"], usecols=lambda n: range(n)[-2:])
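To spell out why that example is tricky, the following plain-Python check (no loadtxt involved) evaluates the same callable for each row's field count. It only illustrates what the callable returns; it makes no claim about what loadtxt itself would do with the input.

```python
def usecols(n):
    # Same selection as the lambda above: the last two columns.
    return range(n)[-2:]


rows = ["1 2", "1 2 3"]
selected = []
for row in rows:
    fields = row.split()
    cols = list(usecols(len(fields)))
    selected.append((len(fields), cols, [fields[c] for c in cols]))

for entry in selected:
    print(entry)
# (2, [0, 1], ['1', '2'])
# (3, [1, 2], ['2', '3'])
```

With n = 2 the callable picks columns 0 and 1, but with n = 3 it picks columns 1 and 2, so the meaning of "the selected columns" shifts from row to row.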

As you can see in the code, I've been working under the assumption that once the first row has been read, the value of n is determined once and for all. If given, the user-defined callable is called just once, after the first row is read. If a structured dtype is given, n is the number of fields in the dtype; otherwise n is the number of columns found in the first row of the file. That's because I didn't think we had such strong support for ragged text files. I wish we didn't, and I don't recall the history of how we got here, but if that's where we are now, so be it. (There is no formal specification for the format of the text files that we support, so the scope of loadtxt tends to creep in fits and starts.)
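The "call once" behavior described above can be modeled roughly as follows. This is a simplified Python sketch, not the C code in rows.c; resolve_usecols and its parameters are hypothetical names, and a structured dtype is represented only by its field count.

```python
def resolve_usecols(usecols, first_row, dtype_num_fields=None):
    """Illustrative model: decide the column indices once, after the first row.

    usecols: None, a sequence of ints, or a callable taking n.
    first_row: list of fields parsed from the first row.
    dtype_num_fields: field count of a structured dtype, or None.
    """
    if usecols is None:
        return list(range(len(first_row)))
    if callable(usecols):
        # n is the dtype's field count if given, else the first row's width.
        if dtype_num_fields is not None:
            n = dtype_num_fields
        else:
            n = len(first_row)
        return [int(c) for c in usecols(n)]
    return [int(c) for c in usecols]


print(resolve_usecols(lambda n: range(n)[-2:], ["1", "2", "3"]))  # [1, 2]
```

Under this model, later rows never influence the selection, which is exactly the assumption that ragged input breaks.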

It seems that a simple way to handle this is to cache the results of the callable usecols in a dictionary, and look up the result after each line is parsed. We only actually call the function when a new number of columns is encountered. There will have to be a check added to ensure that the length of the sequence returned by subsequent calls is the same as that of the first call. What do you think?
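A minimal Python sketch of that caching idea, assuming whitespace-delimited rows. CachedUsecols and select are hypothetical names, and raising ValueError on a length mismatch is an assumption; the PR's C implementation may differ in both structure and error type.

```python
class CachedUsecols:
    """Cache usecols(n) per distinct n and enforce a fixed result length."""

    def __init__(self, func):
        self._func = func
        self._cache = {}
        self._length = None  # length of the first result

    def __call__(self, n):
        if n not in self._cache:
            # Only call the user function for a column count not seen before.
            cols = [int(c) for c in self._func(n)]
            if self._length is None:
                self._length = len(cols)
            elif len(cols) != self._length:
                raise ValueError(
                    f"usecols({n}) returned {len(cols)} columns, "
                    f"expected {self._length}"
                )
            self._cache[n] = cols
        return self._cache[n]


def select(rows, usecols):
    """Apply a cached callable usecols to each whitespace-delimited row."""
    cached = CachedUsecols(usecols)
    out = []
    for row in rows:
        fields = row.split()
        out.append([fields[c] for c in cached(len(fields))])
    return out


print(select(["1 2", "1 2 3", "4 5 6"], lambda n: range(n)[-2:]))
# [['1', '2'], ['2', '3'], ['5', '6']]
```

The user function runs once per distinct column count (here for n = 2 and n = 3), and a callable such as lambda n: range(n) would be rejected on the second row because its result length changes.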

I don't mind raising an error. We already do if usecols is not given. We do have oddly strong support for a changing number of columns, simply because loadtxt used to use list indexing: for col in usecol: line.split(delimiter)[col].

Yes, you could cache usecols(n) with changing n (in the unlikely event that it changed). Or... you just raise an error :).

I would be OK with even deprecating that whole thing. Just dropping support for negative usecols together with ragged columns might also simplify things.
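For context on the negative-usecols-with-ragged-rows case, here is a simplified model of the per-row list indexing mentioned above. legacy_select is a hypothetical name; this is not the historical loadtxt source, only the indexing pattern it relied on.

```python
def legacy_select(lines, usecols, delimiter=None):
    """Select fields by plain list indexing, one row at a time."""
    rows = []
    for line in lines:
        fields = line.split(delimiter)
        # Negative indices count from the end of *this* row.
        rows.append([fields[col] for col in usecols])
    return rows


print(legacy_select(["1 2", "1 2 3"], usecols=[0, -1]))
# [['1', '2'], ['1', '3']]
```

Because the index -1 is resolved against each row separately, rows of different widths are accepted and "the last column" refers to a different position in each.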

charris · Jun 26, 2022

Needs a release note.

seberg

Just a comment in case you are looking for what's failing (since I don't think the sizes of intp and Py_ssize_t should ever differ; if they do, we probably have more to fix).

I do think some use of NPY_UNLIKELY is probably worthwhile here (think of a single-column CSV file). Maybe enough to show up in the benchmarks, but I did not try.

seberg · Jun 28, 2022

numpy/core/src/multiarray/textreading/rows.c

+            }
+        }
+        else {
+            if (usecols_iscallable && current_num_fields != prev_num_fields) {


It probably makes sense to reorganize the conditions so that the outermost can be if (NPY_UNLIKELY(current_num_fields != prev_num_fields)) {.
There is a lot of code here now in a relatively hot path, so let's help the compiler shovel it to the side somewhere.

(Could move most of this into a helper as well probably).

seberg · Jun 29, 2022

numpy/core/src/multiarray/textreading/rows.c

        assert(homogeneous || num_field_types == num_usecols);
        actual_num_fields = num_usecols;
    }
    else if (!homogeneous) {
-        assert(usecols == NULL || num_field_types == num_usecols);
+        assert(num_field_types == num_usecols);


This assert is failing (if you are looking for that). num_usecols is not necessarily defined here, is it?

Ah, the result of an incomplete clean up. I just removed the assert().

bmwoodruff · Jun 11, 2025

Based on a discussion at the triage meeting, we decided to close inactive draft PRs that are more than 1 year old. Feel free to open a new PR to continue working on this.

WarrenWeckesser added 01 - Enhancement component: numpy.lib labels Jun 20, 2022

WarrenWeckesser force-pushed the usecols-callable branch from 5b4165b to 8360279 Compare June 20, 2022 06:41

ENH: lib: Allow usecols to be a callable in loadtxt().

5ffae65

WarrenWeckesser force-pushed the usecols-callable branch from 8360279 to 5ffae65 Compare June 20, 2022 14:23

seberg reviewed Jun 20, 2022

seberg added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label Jun 20, 2022

Fix check of the result of PyObject_CallFunction()

10f1d08

WarrenWeckesser mentioned this pull request Jun 23, 2022

BUG: lib: loadtxt treats usecols='' the same as usecols=[]. #21839

Closed

charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jun 26, 2022

WarrenWeckesser added 7 commits June 28, 2022 13:48

Lots of updates. (Squash this commit!)

0d8d3c5

BUG: Fix free_conv_funcs() declaration and a few compiler warnings.

571a5e6

BUG: remove unnecessary (and incorrect) clause from an assert()

6d69407

Initialize a var to avoid a compiler warning.

3e30cd8

Ensure PY_SSIZE_T_CLEAN is set before including Python.h

231321e

TMP: Check sizes of integers

d48333c

TMP: Move code that checks sizes of integers

b283709

seberg reviewed Jun 29, 2022

WarrenWeckesser added 6 commits June 28, 2022 21:54

Revert previous TMP commit.

700dcb7

Remove invalid assert()

9c7bf60

get_usecols_arr_from_callable now handles the initial and subsequent calls.

69ab6c3

Expand some code comments.

b7e02dd

Split a test in test_loadtxt.py

28c32ca

Merge branch 'main' into usecols-callable

f3a344a

WarrenWeckesser marked this pull request as draft February 19, 2023 21:32

bmwoodruff closed this Jun 11, 2025
