Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

ENH: lib: Allow usecols to be a callable in loadtxt(). #21800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 15 commits into
base: main
Choose a base branch
Loading
from

Conversation

WarrenWeckesser
Copy link
Member

@WarrenWeckesser WarrenWeckesser commented Jun 20, 2022

This is the C version of #15995.

Actually, it is a subset of that PR. The new feature added here is to allow usecols to be callable. In this PR, I haven't included the option for usecols to be a slice, and I haven't included the convenience function skip. Any column selection that could be done with a slice can also be done with a callable, and there are selections that a callable can make that cannot be done with a slice--i.e. the callable option is the most flexible. The smaller scope of this PR should make it easier to review; handling a slice can be added in a follow-up if there is interest.

The skip function in #15995 is just a Python function, so it could simply be copied over from that PR to here if folks think it is important. Of course, it too could be added in a follow-up PR.

(seberg: xref gh-13878)

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on the API addition, but it seems good to me and does allow further additions pretty nicely (by doing them in Python).

A few small comments on the code, the only bigger one is that I think we need to reject (or add logic?) for changes in the number of columns n, since that would affect the result of the function.

// create the usecols array.
PyObject *seq = PyObject_CallFunction(usecols_obj, "n",
current_num_fields);
if (PyErr_Occurred()) {
Copy link
Member

@seberg seberg Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (PyErr_Occurred()) {
if (seq == NULL) {

"sequence of ints, but it returned an instance of "
"type '%s'", Py_TYPE(seq)->tp_name);
Py_DECREF(seq);
goto error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if the "length check" can't be just part of the helper? We use the same error in any case?
Could even use PySequence_Fast in the helper, but that doesn't really matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would mean that the helper should probably return num_usecols and fill in usecols (passed by pointer).

Py_DECREF(tmp);
}
return arr;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to conversion_utils.c? There is similar functionality there already, and now that it is split out, maybe it is time to just put it there.

// This function owns usecols if a callable usecols_obj was given.
PyMem_Free(usecols);
usecols = NULL; // An overabundance of caution...
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be ugly to move all usecols handling here? Its fine, but conditional ownership tens to be an anti-pattern.

Another way to spell would be to have usecol_from_func = ... and usecol = usecol_from_func, so the owner is unconditionally usecol_from_func and not "usecol" itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be ugly to move all usecols handling here?

No, I think that would be fine. In fact, with a bit more of the argument processing code moved into read_rows(), I think the function _readtext_from_stream() can be eliminated, and _load_from_filelike() can call read_rows() directly. I'll give that shot, and push to the PR if it looks reasonable.

with pytest.raises(RuntimeError,
match='the number of fields in the given dtype'):
np.loadtxt(txt, usecols=lambda n: [0, 2, 3], dtype=dt)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing a test (and I think also the logic) to reject a change in the number of columns when a callable is given. This is a bit of a tricky case!

The problem is that if n changes, the return value of the usecol function might also change! While normally, when usecols are given we do not have to worry about that at all.

The example would be something like:

np.loadtxt(["1 2", "1 2 3"], usecols=lambda n: range(n)[-2:])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see in the code, I've been working under the assumption that once the first row has been read, the value of n is determined once and for all. If given, the user-defined callable is called just once, after the first row is read. If a structured dtype is given, n is the number of fields in the dtype; otherwise n is the number of columns found in the first row of the file. That's because I didn't think we had such strong support for ragged text files. I wish we didn't, and I don't recall the history of how we got here, but if that's where we are now, so be it. (There is no formal specification for the format of the text files that we support, so the scope of loadtxt tends to creep in fits and starts.)

It seems that a simple way to handle this is to cache the results of the callable usecols in a dictionary, and look up the result after each line is parsed. We only actually call the function when a new number of columns is encountered. There will have to be a check added to ensure that the length of the sequence returned by subsequent calls is the same as that of the first call. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind raising an error. We already do if usecols is not given. We do have oddly strong support for changing number of columns, simply because loadtxt used to use list indexing for col in usecol: line.split(delimiter)[col].

Yes, you could cache usecols(n) with changing n (in the unlikely event that it chagned). Or... you just raise an error :).

I would be OK even deprecating that whole thing. Just dropping support for negative usecols together with ragged columns, might simplify things also.

@seberg seberg added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label Jun 20, 2022
@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jun 26, 2022
@charris
Copy link
Member

charris commented Jun 26, 2022

Needs a release note.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment in case you are looking for whats failing (since I don't think the size of intp and py_ssize_t should ever differ, if they do we probably have more to fix).

I do think some use of NPY_UNLIKELY is probably worthwhile here (think of a single column CSV file). Maybe enough so to show up in the benchmarks, but did not try.

}
}
else {
if (usecols_iscallable && current_num_fields != prev_num_fields) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably makes sense to reorganize the conditions so that the outermost can be if (NPY_UNLIKELY(current_num_fields != prev_num_fields)) {.
There is a lot of code here now in a relatively hot path, so lets help the compiler shovel it to the side somewhere.

(Could move most of this into a helper as well probably).

assert(homogeneous || num_field_types == num_usecols);
actual_num_fields = num_usecols;
}
else if (!homogeneous) {
assert(usecols == NULL || num_field_types == num_usecols);
assert(num_field_types == num_usecols);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert is failing (if you are looking for that). num_usecols is not necessarily defined here, is it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the result of an incomplete clean up. I just removed the assert().

@WarrenWeckesser WarrenWeckesser marked this pull request as draft February 19, 2023 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. component: numpy.lib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.