ENH: ndarray.__format__ implementation for numeric dtypes #19550


Open
wants to merge 26 commits into
base: main

Conversation

scratchmex
Contributor

This is an initial proof of concept for #5543. Suggestions are welcome.

Basically what we want is:

>>> a = np.array([1.234, 5.678])
>>> print(f"{a:.2f}")
[1.23 5.68]

@scratchmex
Contributor Author

@mattip do you mind having a look at this?

@scratchmex scratchmex changed the title from "ndarray.__format__ implementation for numeric dtypes" to "ENH: ndarray.__format__ implementation for numeric dtypes" Jul 23, 2021
@mattip
Member

mattip commented Jul 23, 2021

I don't remember if this idea hit the mailing list.

This comment in the issue goes through expected behaviour for python objects. How does your solution handle this?

Will you raise an error if a formatting string for int, float, string is passed to a non-compliant array?

Can the formatting of int128 or longdouble handle format strings without losing precision?

How will this interact with the options of np.array2string? At first glance it seems there should be a translation step between the two.

@scratchmex
Contributor Author

At first, since we are only considering formatting numeric types, floating-point numbers specifically, we are only interested in being able to change the precision, the sign, and possibly the rounding or truncation. Since the array2string function already does everything we need, we only need to implement the __format__ method of the ndarray class, which parses a predefined format (similar to the one Python already uses for built-in data types) to set the parameters mentioned above.

I propose a mini format specification inspired by the Format Specification Mini-Language.

format_spec ::=  [sign][.precision][type]
sign        ::=  "+" | "-" | " "
precision   ::=  [0-9]+
type        ::=  "f" | "e"

We are going to consider only 3 arguments of the array2string function: precision, suppress_small, sign. In particular, the type token sets the suppress_small argument to True when the type is f and False when it is e. This is in order to mimic Python's behavior in truncating decimals when using the fixed-point notation.
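
As a rough illustration of the grammar above, the spec could be parsed with a regular expression and translated into keyword arguments for np.array2string. This is only a sketch under stated assumptions: the name `parse_array_format_spec` and the regex are hypothetical and are not the code in this PR.

```python
import re

# Hypothetical pattern for: format_spec ::= [sign][.precision][type]
_SPEC_RE = re.compile(
    r"(?P<sign>[+\- ])?(?:\.(?P<precision>[0-9]+))?(?P<type>[fe])?"
)


def parse_array_format_spec(format_spec):
    """Translate a mini format spec into keyword arguments for np.array2string."""
    match = _SPEC_RE.fullmatch(format_spec)
    if match is None:
        raise ValueError(f"Invalid format specifier {format_spec!r}")

    kwargs = {}
    if match["sign"] is not None:
        kwargs["sign"] = match["sign"]
    if match["precision"] is not None:
        kwargs["precision"] = int(match["precision"])
    if match["type"] is not None:
        # 'f' turns suppress_small on, 'e' turns it off (as described above)
        kwargs["suppress_small"] = match["type"] == "f"
    return kwargs
```

The resulting dictionary would then be forwarded, for example as `np.array2string(a, **parse_array_format_spec("+.2f"))`.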

As @brandon-rhodes said in #5543, when you try to format an array containing Python objects, the behavior should be the same as what Python implements by default in the object class: format(a, "") should be equivalent to str(a), and format(a, "not empty") should raise an exception.
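
For reference, the default behavior of the object class can be checked in plain Python. The class `Plain` below is just an illustrative name for a type that does not define its own __format__.

```python
class Plain:
    """A class that inherits object.__format__ unchanged."""

    def __str__(self):
        return "plain"


obj = Plain()

# An empty format spec falls back to str()
empty_spec_result = format(obj, "")

# A non-empty format spec is rejected by object.__format__
try:
    format(obj, ".2f")
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None
```

Here `empty_spec_result` equals `str(obj)`, and `error_message` holds the TypeError text rather than None.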

What remains to be defined is the behavior when trying to format an array with a non-numeric data type (np.numeric) other than np.object_. Should we raise an exception? In my opinion yes: if formatting is extended in the future (for example, to dates), people will already be aware that it was not implemented before.

If you consider it necessary, I can send the previous explanation to the mailing list.

@rossbar
Contributor

rossbar commented Jul 26, 2021

Just a couple points I thought were relevant to the discussion:

As mentioned here, Python doesn't support this type of format string for sequences; for example:

>>> l = [1, 2, 3, 4]
>>> print(f"{l:.2f}")
Traceback (most recent call last):
   ...
TypeError: unsupported format string passed to list.__format__

It'd be nice to do some research into related discussions to provide context for the proposed change.

Also, it's worth noting that numpy already has a (quite Pythonic) way to customize array printing --- using np.printoptions as a context manager:

>>> a = np.array([-np.pi, np.pi])
>>> with np.printoptions(precision=2, sign="+"):
...     print(a)
[-3.14 +3.14]

@brandon-rhodes

Python doesn't support this type of format string for sequences

I suspect that's because Python sequences are heterogeneous — and it would make sense for an ndarray-of-objects to similarly refuse to perform string formatting if we wanted strict symmetry. But the case of an ndarray-of-floats doesn't have a strict analogy among the builtin Python data types of list, dict, and set.

Also, it's worth noting that numpy already has a (quite Pythonic) way to customize array printing…

I would suggest that the with statement is not a Pythonic approach.

  1. Modern Python programs avoid mutating global state.
  2. The approach breaks if two different threads are both trying to format simultaneously.
  3. The with approach can't apply two different formats to two different arrays in the same format string.
  4. You will note that the most modern form of Python formatting, the f-string, does not make the user specify formatting in a with statement outside of the format string itself — if it had, then, yes, that would have provided a strong argument that NumPy is being Pythonic in your example code. But, in fact, no native form of Python formatting shunts the format specification out into a with statement (unless there's a corner of the Standard Library that I've missed?).
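
To make point 3 concrete, here is a small pure-Python sketch. `FormattableList` is a hypothetical stand-in for an array type, not anything in NumPy; it only shows that when the spec lives inside the format string, two containers can get two different formats in one expression.

```python
class FormattableList:
    """Hypothetical wrapper that applies a format spec to every element."""

    def __init__(self, values):
        self.values = list(values)

    def __format__(self, format_spec):
        return "[" + " ".join(format(v, format_spec) for v in self.values) + "]"


a = FormattableList([1.234, 5.678])
b = FormattableList([0.5])

# Each replacement field carries its own spec; no shared state is involved
line = f"{a:.1f} {b:.2e}"
print(line)  # [1.2 5.7] [5.00e-01]
```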

@eric-wieser
Member

I suspect that's because Python sequences are heterogeneous — and it would make sense for an ndarray-of-objects to similarly refuse to perform string formatting if we wanted strict symmetry. But the case of an ndarray-of-floats doesn't have a strict analogy among the builtin Python data types of list, dict, and set.

Things get messy when you consider arrays with dtype object where every entry still contains a float - ideally the formatter would behave similarly for this case.

I would suggest that the with statement is not a Pythonic approach.

  1. Modern Python programs avoid mutating global state.
  2. The approach breaks if two different threads are both trying to format simultaneously.

Both of these concerns are addressed by changing our context managers to use ContextVars, which is absolutely something we need to do anyway.
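
A minimal sketch of what a ContextVar-backed context manager could look like. The names `_precision`, `printoptions`, and `format_value` below are illustrative assumptions and do not reflect how NumPy stores its print options.

```python
import contextlib
import contextvars

# Hypothetical stand-in for a print option; not NumPy's actual state
_precision = contextvars.ContextVar("precision", default=8)


@contextlib.contextmanager
def printoptions(precision):
    """Set the precision for the current context and restore it on exit."""
    token = _precision.set(precision)
    try:
        yield
    finally:
        _precision.reset(token)


def format_value(x):
    """Format a float using the precision of the current context."""
    return format(x, f".{_precision.get()}f")
```

Because the value is held in a context variable rather than a module-level global, a setting applied inside one thread's with block should not alter what other concurrently running threads see, and nested blocks restore the previous value on exit.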

@scratchmex
Contributor Author

Things get messy when you consider arrays with dtype object where every entry still contains a float - ideally the formatter would behave similarly for this case.

I think the discussion about the formatter is not related to my implementation but to the already used -- in str and repr-- array2string function:

>>> a = np.array([-np.pi, np.pi], dtype=np.object_)
>>> np.array2string(a, precision=2)
'[-3.141592653589793 3.141592653589793]'

Both of these concerns are addressed by changing our context managers to use ContextVars, which is absolutely something we need to do anyway.

I still think that does not settle the discussion, because using a format spec is still more convenient than using np.printoptions.

@scratchmex scratchmex marked this pull request as ready for review July 29, 2021 03:41
@scratchmex
Contributor Author

I think this is ready to review.

@seberg seberg added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label Aug 10, 2021
@mwtoews
Contributor

mwtoews commented Aug 11, 2021

How would f"{np.float32(101.1)}" be formatted? (I just posted #19641 and then just discovered this PR)

@scratchmex
Contributor Author

>>> import numpy as np
>>> np.__version__
'1.22.0.dev0+660.g8ad01d7ed'
>>> f"{np.float32(101.1)}"
'101.0999984741211'

The "conversion" for scalar types (0-dim arrays) is done at PyArray_ToScalar. Following through the code, I get as far as scalar_value in scalarapi.c, where it seems that it doesn't do any rounding. Perhaps that is related to #9941?

@mwtoews
Contributor

mwtoews commented Aug 11, 2021

I think it's related to #10645 as the float32 is converted to double precision for native python.

Are scalars not covered in this PR? Update: no. But see #10645 (comment) for specific advice to implement.
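
The effect of the cast can be reproduced without NumPy. The sketch below uses the standard struct module to round 101.1 to a 4-byte IEEE 754 float and read it back as a Python float (a double), which is an approximation of what the conversion amounts to rather than the code path NumPy actually takes.

```python
import struct

# Round-trip 101.1 through single precision, then hold it as a double
as_float32 = struct.unpack("f", struct.pack("f", 101.1))[0]

# Python formats the double, so the float32 rounding error becomes visible
shortest_repr = format(as_float32, "")
fixed_six = format(as_float32, ".6f")

print(shortest_repr)  # 101.0999984741211
print(fixed_six)      # 101.099998
```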

numpy/core/arrayprint.py
Comment on lines 1678 to 1735
@array_function_dispatch(_array_format_dispatcher, module='numpy')
def array_format(a, format_spec):
    print("[DEBUG] array_format")

    return _array_format_implementation(a, format_spec)
Member

Why do we need to expose a new public function here?
Won't builtins.format suffice for public usage?

I agree — "there should only be one way to do it", and leaving this name private will encourage folks to encounter this through the ecosystem of f'', .format(), and builtins.format(), rather than through a bespoke mechanism.
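
As a quick illustration of why a single __format__ hook is enough, all three entry points mentioned above end up calling it with the same spec. `Recorder` is a hypothetical class used only for this demonstration.

```python
class Recorder:
    """Hypothetical class whose __format__ just reports the spec it received."""

    def __format__(self, format_spec):
        return f"<spec={format_spec!r}>"


r = Recorder()

via_fstring = f"{r:.2f}"
via_str_format = "{:.2f}".format(r)
via_builtin = format(r, ".2f")
```

All three variables hold the same string, so a type that implements __format__ is reachable through every standard formatting mechanism without an extra public function.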

Contributor Author

Actually, you have a very good point. I really didn't think about this and I tried to do it the same way as it's done with array2string. Do you know if there is any other use for the array_function_dispatch decorator apart from exposing the function with another name?

Member

I don't see why we need array_function_dispatch here in the first place. The relevant format output can already be fully customized via custom implementations of __format__.

Contributor Author

@scratchmex scratchmex Aug 12, 2021

In the code, there are several examples in which they use it and I can't seem to understand them: array_str and array_repr instead of using plain str() and repr(). Is there any use for that? From what I found, it is related to NEP 18.

Member

In the code, there are several examples in which they use it and I can't seem to understand them: array_str and array_repr instead of using plain str() and repr(). Is there any use for that?

Considering all these functions are at least 17 years old, I suspect it is mostly due to "historical reasons". NEP 18 came quite a bit later, and by that time support was added to array_str and array_repr as they were already well established.

Member

This function is still exposed in the namespace. Either remove it or start a process to expose it: add a proper docstring and post the justification to the mailing list.

@scratchmex
Contributor Author

scratchmex commented Aug 12, 2021

I think it's related to #10645 as the float32 is converted to double precision for native python.

Are scalars not covered in this PR? Update: no. But see #10645 (comment) for specific advice to implement.

Thanks for the reference @mwtoews. I will try to implement it the same way as I did with arrays: using array2string as it uses the Dragon4 implementation and behaves well:

>>> np.array2string(np.float32(101.1))
'101.1'

The only question I have is whether I should use the same function for scalar types and NumPy arrays, i.e., use the _array_format_implementation function of my PR. That will depend on whether we want different behaviors for each of them, but a priori I don't know about that.

@brandon-rhodes

The only question I have is whether I should use the same function for scalar types and NumPy arrays, i.e., use the _array_format_implementation function of my PR. That will depend on whether we want different behaviors for each of them, but a priori I don't know about that.

Speaking merely as a user to provide some data for the real NumPy folks here to work off of: I think I would expect scalar formatting to work like normal Python int and float formatting. I often, in fact, can't predict ahead of time when I'm going to wind up with a plain float and when I'm going to wind up with a NumPy scalar, so it would be a bit disruptive for them to behave differently when given something like %4.2f.

On the other hand, folks with more NumPy experience might think differently?

@BvB93
Member

BvB93 commented Aug 12, 2021

format/__format__ already seems to work just fine with the numpy scalar types, so I'm not sure if they're all that relevant in the context of this PR.

In [1]: import numpy as np

In [2]: f4 = np.float32(1)

In [3]: format(f4, ".2f")
Out[3]: '1.00'

@mwtoews
Contributor

mwtoews commented Aug 12, 2021

Scalar formatting is still broken, as it's not currently handled in this PR. Internally, the float32 gets cast to a double precision Python float.

v = np.float32(101.1)
format(v, "")  # 101.0999984741211
format(v, ".6f")  # 101.099998
# these are good (via dragon4)
np.format_float_positional(v, 6) # 101.1
np.format_float_positional(v, 16) # 101.1

@scratchmex
Contributor Author

scratchmex commented Aug 13, 2021

My new commits solve the issue partially:

>>> np.__version__
'1.22.0.dev0+729.g5f3dadd94'
>>> v = np.float32(101.1)
>>> format(v, "")  # 101.0999984741211
'101.1'
>>> format(v, ".6f")  # 101.099998
'101.099998'

The thing is that there is a lot of code that relies on scalar types behaving like Python built-ins. For example, the 'g' specifier is used a lot and would require substantial work to make it behave the same as Python does:

>>> np.__version__
'1.19.2'
>>> a = 1.0
>>> f"{a:.3g}"
'1'
>>> a = np.float_(1.0)
>>> np.array2string(a, precision=3)
'1.'

We would need to use the trim="-" in the np.format_float_positional function.

I think what Python does is fine because it reminds you that the number you are working with is not exact if you specify 'g':

>>> np.__version__
'1.19.2'
>>> a = np.float32(101.1)
>>> f"{a:.5}"
'101.1'
>>> f"{a:.5g}"
'101.1'
>>> f"{a:.6}"
'101.1'
>>> f"{a:.6g}"
'101.1'
>>> f"{a:.6f}"
'101.099998'

but it is fine to give some "syntactic sugar" when using bare str()

>>> np.__version__
'1.22.0.dev0+729.g5f3dadd94'
>>> a = np.float32(101.1)
>>> str(a)
'101.1'

That is why I only bound format(a, "") to str(a) when a is a scalar type, and forwarded __format__ of 0-dim ndarrays to the scalar type's format function.

Sorry for the force-push, I rebased upstream/main by accident :s

@h-vetinari
Contributor

Last comment from #5543 was:

@charris: I suspect this issue needs an NEP explaining how formatting will be designed to behave for numpy datatypes, especially for ndarray.

Did something along those lines ever materialize?

One thing that surprised me in the example is that the formatted string representation of arrays had no commas between elements. I'm assuming that discussing these sorts of choices for default formatting was one of the reasons @charris suggested a NEP?

@scratchmex
Contributor Author

scratchmex commented Dec 1, 2021

I discussed with @seberg whether the default behavior of __format__ for object arrays should also be to apply format() to each element and re-raise the exception if formatting an element fails.
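
A minimal sketch of that idea in plain Python, using a list in place of an object array. The helper name `format_elements` and the error message are assumptions made for illustration, not part of this PR.

```python
def format_elements(values, format_spec):
    """Hypothetical helper: format each element, re-raising with context on failure."""
    parts = []
    for index, value in enumerate(values):
        try:
            parts.append(format(value, format_spec))
        except (TypeError, ValueError) as exc:
            # Keep the original exception type and chain the original error
            raise type(exc)(
                f"element {index} ({value!r}) does not support "
                f"format spec {format_spec!r}"
            ) from exc
    return "[" + " ".join(parts) + "]"
```

With this, `format_elements([1.5, 2], ".1f")` succeeds because both elements accept the spec, while a list containing a string or None fails with an error that names the offending element.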

@seberg
Member

seberg commented Dec 1, 2021

@seberg
Member

seberg commented Dec 3, 2021

@scratchmex before floating this to the list, this code uses just the normal np.printoptions, right, without any custom formatting function? (So the normal array printing is used, just with special precision settings?)

I am asking because I think the scalar formatting for np.float32.__format__ and np.longdouble.__format__ may be using float(scalar).__format__ (i.e. Python's formatting), which is incorrect due to the different precisions.

Assuming this is the case (and I think this is), the next thing is indeed to just make a call on whether we want to allow f"{array:f}" at all (because it is not a scalar object) or not. And I am not aware of any relevant prior art, nor do I have much of an opinion, so I would be OK with pressing on, unless anyone voices concerns.
(Meaning, I will announce it on the mailing list.)

Assuming the scalars are really broken, it would be nice to fix that up also.

@scratchmex
Contributor Author

scratchmex commented Dec 3, 2021

If the dim is 0 (NumPy scalar), it converts the value to a Python scalar and then calls float.__format__, as you said. I did not change that behavior in this PR, but maybe we could extract the NumPy scalar, forward the call to its __format__ method, and fix it there (the current implementation for format in the scalar types is here).
If the dim is positive, I parse the format string and pass it to np.array2string. The default behavior of str(array) is to call np.core.arrayprint._default_array_str, which is just a dispatcher for np.array2string. Maybe I should also call _default_array_str?

@scratchmex
Contributor Author

@seberg ping

doc/release/upcoming_changes/19550.new_feature.rst
@charris
Member

charris commented Sep 20, 2022

Close/reopen

@charris charris closed this Sep 20, 2022
@charris charris reopened this Sep 20, 2022
@charris
Member

charris commented Sep 20, 2022

What still needs to be done here?

Member

@seberg seberg left a comment

Just two small comments. I like the start here, although unfortunately, I think there is a bigger overhaul necessary for formatting (not just this path though!).
That is also related to my push to change the representation of scalars probably...

For now, there is the slight oddity that the scalar may behave differently from the array code for float16, float32, or longdouble. (and the complex codes I guess).
I am fine with the current oddities though and I don't think adding this here makes revising the machinery much harder.

numpy/core/src/multiarray/strfuncs.c
# TODO: implement code in `FloatingFormat` such that
# the tests in test_multiarray.py:test_general_format pass
# note: remember to add the tests to `test_arrayprint.py:test_type`
pass
Member

Can we either implement this or just make it an error here? Ignoring the format code in this path seems not helpful?

Co-authored-by: Matti Picus <matti.picus@gmail.com>
):
format(v, "+.2f")

# TODO: this behaviour should not occur
Member

Just a note: This is not behavior that was changed and the array code path rejects the format spec for non 0-D arrays (we drop through for 0-D).

(Also auto-reformatted some whitespace...)
@seberg
Member

seberg commented Sep 22, 2022

OK, I have now added a commit to simply reject g with a NotImplementedError. Not ideal, but I prefer it to simply passing...

@seberg
Member

seberg commented Sep 22, 2022

One thing I noticed is that Python rejects the "{:.10}".format(123). That is because the precision does not make sense for integers.

OTOH, the approach here simply ignores it for integers. That seems OK, or should this be discussed?
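
For comparison, the built-in behavior can be checked directly. The helper `try_format` is a hypothetical convenience for this demonstration; it returns either the formatted string or the name of the exception Python raises.

```python
def try_format(value, format_spec):
    """Return the formatted string, or the exception type name if Python rejects it."""
    try:
        return format(value, format_spec)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


int_with_precision = try_format(123, ".10")      # precision without a type code on an int
float_with_precision = try_format(123.0, ".10")  # the same spec on a float
int_with_float_code = try_format(123, ".2f")     # explicit float type code on an int
```

Only the first call is rejected (with a ValueError); the float accepts the bare precision, and the int is formatted as a float once an explicit 'f' code is given.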

@scratchmex
Contributor Author

One thing I noticed is that Python rejects the "{:.10}".format(123). That is because the precision does not make sense for integers.

OTOH, the approach here simply ignores it for integers. That seems OK, or should this be discussed?

I think it should be an error

@InessaPawson
Member

@scratchmex We have discussed your PR at today's triage meeting. Please proceed with adding an error.

@scratchmex
Contributor Author

@InessaPawson @seberg done. One last ping, hoping this can be merged with these changes. Thanks for the feedback.

@seberg
Member

seberg commented Oct 20, 2022

How happy are others with merging something that may still be a bit in flux? I am probing things a bit more, and a few things I noticed which I had not noticed previously:

  • Python's f formatting never uses exponential format, this is what g is used for. This code currently uses the typical code-path which does switch.
    What we have actually seems close to the g format code, except that Python seems to decide the threshold more dynamically than we do (we use 1e8 anywhere in the array)
  • For integers, the float format codes convert to float first in Python. Maybe we should only allow float and complex for now?!
  • The precision pattern does not enforce the ., so a bare number (which would normally be a width) is accepted and then ignored. Changing the regex to: r"(\.[0-9]+)?" will work (the later code then always has to ignore the .).
  • A width for left padding could be supported easily for all numerical formatting, though.
  • exp_format=None would make more sense to me for the current "auto" mode. That way exp_format=False can force no exponential formatting. (distinguish None from False)
  • Exposed trim option is never used (seems OK to expose though).
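
Two of the points above can be illustrated in plain Python: the difference between the f and g codes for large values, and the effect of requiring the dot in the precision pattern. The patterns `LOOSE` and `STRICT` are hypothetical simplifications written for this sketch; they are not the regex used in the PR.

```python
import re

# 'f' never switches to exponent notation, whereas 'g' does for large exponents
fixed = format(1e20, ".2f")
general = format(1e20, ".2g")

# Hypothetical patterns: LOOSE lets the dot be omitted, STRICT requires it
LOOSE = re.compile(r"(?P<precision>\.?[0-9]+)?(?P<type>[fe])?")
STRICT = re.compile(r"(?P<precision>\.[0-9]+)?(?P<type>[fe])?")

loose_match = LOOSE.fullmatch("10f")    # "10" is silently taken as a precision
strict_match = STRICT.fullmatch("10f")  # no match: "10" would be a width
strict_ok = STRICT.fullmatch(".10f")    # matches: the dot marks a precision
```

With the strict pattern, a spec such as "10f" no longer parses, which leaves room to add real width support later without changing the meaning of existing format strings.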

I guess the question is largely how close we want to stick to the Python formatting here. And if we want to stick closely to it, maybe we really need something like a Hypothesis test to ensure we actually match up.

(I would like to push the formatting further down, but that is not important for the PR as such.)

@mattip
Member

mattip commented Oct 20, 2022

The original use case did refer to float, so maybe restricting the scope to float formatting is OK. Other than that, I think we should just go ahead, call this "experimental" and wait for people to try it out and suggest improvements.

@charris
Member

charris commented Oct 20, 2022

Might fix the two long lines.

@shoyer
Member

shoyer commented Oct 22, 2022

I would much rather reduce scope (e.g., by issuing errors for some dtypes) than introduce behavior that we will almost certainly want to change later. Users have strong expectations of API stability for NumPy.

If we want to merge this in an experimental state, we could keep it off by default and only enable it when users explicitly set an experimental flag or context manager.

__all__ = ["array2string", "array_str", "array_repr", "array_format",
"set_string_function", "set_printoptions", "get_printoptions",
"printoptions", "format_float_positional",
"format_float_scientific"]
Member

Why is it necessary to expose array_format here? You can still import it directly without exposing it in __all__.

@@ -57,13 +59,18 @@
'infstr': 'inf',
'sign': '-',
'formatter': None,
'exp_format': False, # force exp formatting as Python do when ".2e"
'trim': '.', # Controls post-processing trimming of trailing digits
# see Dragon4 arguments at `format_float_scientific` below
Member

Why is trim necessary? What was wrong with the previous way it was set?

@@ -198,6 +212,11 @@ def set_printoptions(precision=None, threshold=None, edgeitems=None,
but if every element in the array can be uniquely
represented with an equal number of fewer digits, use that
many digits for all elements.
exp_format : bool, optional
Prints in scientific notation (1.1e+01).
Member

Suggested change
Prints in scientific notation (1.1e+01).
-- versionadded:: 1.24.0
Prints in scientific notation (1.1e+01).

exp_format : bool, optional
Prints in scientific notation (1.1e+01).
trim : str, optional
Controls post-processing trimming of trailing digits.
Member

Suggested change
Controls post-processing trimming of trailing digits.
-- versionadded:: 1.24.0
Controls post-processing trimming of trailing digits.

@@ -198,6 +212,11 @@ def set_printoptions(precision=None, threshold=None, edgeitems=None,
but if every element in the array can be uniquely
represented with an equal number of fewer digits, use that
many digits for all elements.
exp_format : bool, optional
Prints in scientific notation (1.1e+01).
trim : str, optional
Member

Suggested change
trim : str, optional
trim : one of 'k.0-', optional

Prints in scientific notation (1.1e+01).
trim : str, optional
Controls post-processing trimming of trailing digits.
See Dragon4 arguments at ``format_float_scientific``.
Member

Suggested change
See Dragon4 arguments at ``format_float_scientific``.
See ``trim`` argument to `numpy.format_float_scientific`.

@mattip
Member

mattip commented Dec 10, 2022

Ping @scratchmex

@scratchmex
Contributor Author

Ping @scratchmex

Are the review comments the only thing still missing? I will complete them next week.

Labels
01 - Enhancement 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. triaged Issue/PR that was discussed in a triage meeting
Projects
Status: Pending authors' response