ENH: ndarray.__format__ implementation for numeric dtypes #19550

scratchmex · Jul 23, 2021

This is an initial proof of concept for #5543. Suggestions are welcome.

Basically what we want is:

>>> a = np.array([1.234, 5.678])
>>> print(f"{a:.2f}")
[1.23 5.68]

scratchmex · Jul 23, 2021

@mattip do you mind having a look at this?

mattip · Jul 23, 2021

I don't remember if this idea hit the mailing list.

This comment in the issue goes through expected behaviour for python objects. How does your solution handle this?

Will you raise an error if a formatting string for int, float, string is passed to a non-compliant array?

Can the formatting of int128 or longdouble handle format strings without losing precision?

How will this interact with the options of np.array2string, at first glance it seems there should be a translating step between the two.

scratchmex · Jul 26, 2021

At first, since we are only considering formatting numeric types, floating-point numbers specifically, we are only interested in being able to change the precision, the sign, and possibly the rounding or truncation. Since the array2string function already does everything we need, we only need to implement the __format__ method of the ndarray class, which parses a predefined format (similar to the one Python already uses for built-in data types) to specify the parameters mentioned above.

I propose a mini format specification inspired by the Format Specification Mini-Language.

format_spec ::=  [sign][.precision][type]
sign        ::=  "+" | "-" | " "
precision   ::=  [0-9]+
type        ::=  "f" | "e"

We are going to consider only 3 arguments of the array2string function: precision, suppress_small, sign. In particular, the type token sets the suppress_small argument to True when the type is f and False when it is e. This is in order to mimic Python's behavior in truncating decimals when using the fixed-point notation.
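A minimal sketch of how the grammar above could be turned into array2string keyword arguments. The function name parse_array_format_spec and the regular expression are assumptions made for illustration; this is not the code in the PR.

```python
import re

# Grammar from the proposal: [sign][.precision][type]
_SPEC_RE = re.compile(
    r"(?P<sign>[+\- ])?(?:\.(?P<precision>[0-9]+))?(?P<type>[fe])?"
)


def parse_array_format_spec(spec):
    """Translate a mini format spec into array2string-style keyword arguments.

    Hypothetical helper: only the three arguments named in the proposal
    (sign, precision, suppress_small) are produced.
    """
    match = _SPEC_RE.fullmatch(spec)
    if match is None:
        raise ValueError(f"invalid format spec for array: {spec!r}")
    kwargs = {}
    if match["sign"] is not None:
        kwargs["sign"] = match["sign"]
    if match["precision"] is not None:
        kwargs["precision"] = int(match["precision"])
    if match["type"] is not None:
        # Per the proposal: 'f' sets suppress_small=True, 'e' sets it to False.
        kwargs["suppress_small"] = match["type"] == "f"
    return kwargs


print(parse_array_format_spec("+.2f"))
```

The resulting dictionary could then be forwarded as np.array2string(a, **kwargs); an empty spec yields an empty dictionary, so the defaults apply.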

As @brandon-rhodes said in #5543, when you try to format an array containing Python objects, the behavior should be the same as what Python implements by default in the object class: format(a, "") should be equivalent to str(a) and format(a, "not empty") should raise an exception.
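For reference, the default behavior of object.__format__ can be checked with plain Python. The class Plain below is a made-up example and has nothing to do with NumPy.

```python
class Plain:
    """A class that does not define __format__ and so inherits object's."""

    def __str__(self):
        return "plain"


obj = Plain()

# An empty format spec falls back to str().
empty_spec_result = format(obj, "")

# Any non-empty spec is rejected by object.__format__ with a TypeError.
message = None
try:
    format(obj, ".2f")
except TypeError as exc:
    message = str(exc)

print(empty_spec_result)
print(message)
```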

What remains to be defined is the behavior when trying to format an array with a non-numeric data type (np.numeric) other than np.object_. Should we raise an exception? In my opinion yes: if formatting is extended in the future (for example, to dates), people will know that it was not implemented before.

If you consider it necessary, I can send the previous explanation to the mailing list.

rossbar · Jul 26, 2021

Just a couple points I thought were relevant to the discussion:

As mentioned here, Python doesn't support this type of format string for sequences; for example:

>>> l = [1, 2, 3, 4]
>>> print(f"{l:.2f}")
Traceback (most recent call last):
   ...
TypeError: unsupported format string passed to list.__format__

It'd be nice to do some research into related discussions to provide context for the proposed change.

Also, it's worth noting that numpy already has a (quite Pythonic) way to customize array printing --- using np.printoptions as a context manager:

>>> a = np.array([-np.pi, np.pi])
>>> with np.printoptions(precision=2, sign="+"):
...     print(a)
[-3.14 +3.14]

brandon-rhodes · Jul 26, 2021

Python doesn't support this type of format string for sequences

I suspect that's because Python sequences are heterogeneous — and it would make sense for an ndarray-of-objects to similarly refuse to perform string formatting if we wanted strict symmetry. But the case of an ndarray-of-floats doesn't have a strict analogy among the builtin Python data types of list, dict, and set.

Also, it's worth noting that numpy already has a (quite Pythonic) way to customize array printing…

I would suggest that the with statement is not a Pythonic approach.

Modern Python programs avoid mutating global state.
The approach breaks if two different threads are both trying to format simultaneously.
The with approach can't apply two different formats to two different arrays in the same format string.
You will note that the most modern form of Python formatting, the f-string, does not make the user specify formatting in a with statement outside of the format string itself — if it had, then, yes, that would have provided a strong argument that NumPy is being Pythonic in your example code. But, in fact, no native form of Python formatting shunts the format specification out into a with statement (unless there's a corner of the Standard Library that I've missed?).
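To illustrate the last point about per-value format specs, here is a small sketch using a hypothetical wrapper class Each around a plain list. It is not an ndarray and not the PR's implementation; it only shows two different specs applied inside one f-string.

```python
class Each:
    """Hypothetical wrapper: apply a format spec to every element of a sequence."""

    def __init__(self, values):
        self.values = list(values)

    def __format__(self, spec):
        # Delegate the spec to each element's own __format__.
        return "[" + " ".join(format(v, spec) for v in self.values) + "]"


a = Each([1.234, 5.678])
b = Each([0.5, 1234.5])

# Two sequences, two different specs, one format string.
line = f"{a:.2f} and {b:.1e}"
print(line)  # [1.23 5.68] and [5.0e-01 1.2e+03]
```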

eric-wieser · Jul 26, 2021

I suspect that's because Python sequences are heterogeneous — and it would make sense for an ndarray-of-objects to similarly refuse to perform string formatting if we wanted strict symmetry. But the case of an ndarray-of-floats doesn't have a strict analogy among the builtin Python data types of list, dict, and set.

Things get messy when you consider arrays with dtype object where every entry still contains a float - ideally the formatter would behave similarly for this case.

I would suggest that the with statement is not a Pythonic approach.

Modern Python programs avoid mutating global state.

The approach breaks if two different threads are both trying to format simultaneously.

Both of these concerns are addressed by changing our context managers to use ContextVars, which is absolutely something we need to do anyway.
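A rough sketch of what a ContextVar-backed option could look like. The names _precision, printoptions, and show are stand-ins chosen for this example; this is not NumPy's implementation, only an illustration of the standard-library mechanism.

```python
import contextlib
import contextvars

# Hypothetical stand-in for a print-options store; not NumPy's implementation.
_precision = contextvars.ContextVar("precision", default=8)


@contextlib.contextmanager
def printoptions(precision):
    """Temporarily override the precision in the current context only."""
    token = _precision.set(precision)
    try:
        yield
    finally:
        # Restore whatever value was active before entering the block.
        _precision.reset(token)


def show(x):
    """Format a float using the precision of the current context."""
    return f"{x:.{_precision.get()}f}"


with printoptions(precision=2):
    inside = show(3.14159)
outside = show(3.14159)

print(inside, outside)
```

Because the value lives in a context variable rather than a module-level dictionary, a change made inside one thread or asyncio task does not leak into code running in another.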

scratchmex · Jul 26, 2021

Things get messy when you consider arrays with dtype object where every entry still contains a float - ideally the formatter would behave similarly for this case.

I think the discussion about the formatter is not related to my implementation but to the array2string function, which is already used by str and repr:

>>> a = np.array([-np.pi, np.pi], dtype=np.object_)
>>> np.array2string(a, precision=2)
'[-3.141592653589793 3.141592653589793]'

Both of these concerns are addressed by changing our context managers to use ContextVars, which is absolutely something we need to do anyway.

I still think that does not settle the discussion, because using a formatting spec is still more convenient than using np.printoptions.

scratchmex · Aug 10, 2021

I think this is ready to review.

mwtoews · Aug 11, 2021

How would f"{np.float32(101.1)}" be formatted? (I just posted #19641 and then just discovered this PR)

scratchmex · Aug 11, 2021

>>> import numpy as np
>>> np.__version__
'1.22.0.dev0+660.g8ad01d7ed'
>>> f"{np.float32(101.1)}"
'101.0999984741211'

The "conversion" for scalar types (0-dim arrays) is done in PyArray_ToScalar. Following the code, I got as far as scalar_value in scalarapi.c, where it seems that no rounding is done. Perhaps that is related to #9941?

mwtoews · Aug 11, 2021

I think it's related to #10645 as the float32 is converted to double precision for native python.

Are scalars not covered in this PR? Update: no. But see #10645 (comment) for specific advice to implement.

numpy/core/arrayprint.py

BvB93 · Aug 12, 2021

numpy/core/arrayprint.py

+@array_function_dispatch(_array_format_dispatcher, module='numpy')
+def array_format(a, format_spec):
+    print("[DEBUG] array_format")
+
+    return _array_format_implementation(a, format_spec)


Why do we need to expose a new public function here?
Won't builtins.format suffice for public usage?

I agree — "there should only be one way to do it", and leaving this name private will encourage folks to encounter this through the ecosystem of f'', .format(), and builtins.format(), rather than through a bespoke mechanism.

Actually, you have a very good point. I really didn't think about this and I tried to do it the same way as it's done with array2string. Do you know if there is any other use for the array_function_dispatch decorator apart from exposing the function with another name?

I don't see why we need array_function_dispatch here in the first place. The relevant format output can already be fully customized via custom implementations of __format__.

In the code, there are several examples in which they use it and I can't seem to understand them: array_str and array_repr instead of using plain str() and repr(). Is there any use for that? From what I found, it is related to NEP 18.

In the code, there are several examples in which they use it and I can't seem to understand them: array_str and array_repr instead of using plain str() and repr(). Is there any use for that?

Considering all these functions are at least 17 years old, I suspect it is mostly due to "historical reasons". NEP 18 came quite a bit later, and by that time support was added to array_str and array_repr as they were already well established.

This function is still exposed in the namespace. Either remove it or start a process to expose it: add a proper docstring and post the justification to the mailing list.

scratchmex · Aug 12, 2021

I think it's related to #10645 as the float32 is converted to double precision for native python.

Are scalars not covered in this PR? Update: no. But see #10645 (comment) for specific advice to implement.

Thanks for the reference @mwtoews. I will try to implement it the same way as I did with arrays: using array2string as it uses the Dragon4 implementation and behaves well:

>>> np.array2string(np.float32(101.1))
'101.1'

The only question I have is whether I should use the same function for scalar types and NumPy arrays, i.e., use the _array_format_implementation function of my PR. That will depend on whether we want different behaviors for each of them, but a priori I don't know about that.

brandon-rhodes · Aug 12, 2021

The only question I have is whether I should use the same function for scalar types and NumPy arrays, i.e., use the _array_format_implementation function of my PR. That will depend on whether we want different behaviors for each of them, but a priori I don't know about that.

Speaking merely as a user to provide some data for the real NumPy folks here to work off of: I think I would expect scalar formatting to work like normal Python int and float formatting. I often, in fact, can't predict ahead of time when I'm going to wind up with a plain float and when I'm going to wind up with a NumPy scalar, so it would be a bit disruptive for them to behave differently when given something like %4.2f.

On the other hand, folks with more NumPy experience might think differently?

BvB93 · Aug 12, 2021

format/__format__ already seems to work just fine with the numpy scalar types, so I'm not sure if they're all that relevant in the context of this PR.

In [1]: import numpy as np

In [2]: f4 = np.float32(1)

In [3]: format(f4, ".2f")
Out[3]: '1.00'

mwtoews · Aug 12, 2021

Scalar formatting is still broken, as it's not currently handled in this PR. Internally, the float32 gets cast to a double precision Python float.

v = np.float32(101.1)
format(v, "")  # 101.0999984741211
format(v, ".6f")  # 101.099998
# these are good (via dragon4)
np.format_float_positional(v, 6) # 101.1
np.format_float_positional(v, 16) # 101.1
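The cast to double described above can be reproduced without NumPy by round-tripping a value through a 4-byte float with the struct module. The helper names as_float32 and shortest_float32_repr are made up for this sketch, and the brute-force search only illustrates the idea of a shortest round-tripping representation; it is not the Dragon4 algorithm.

```python
import struct


def as_float32(x):
    """Round a Python float to the nearest float32, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def shortest_float32_repr(x):
    """Brute-force search for the fewest significant digits that round-trip."""
    for digits in range(1, 10):  # 9 significant digits suffice for float32
        text = format(x, f".{digits}g")
        if as_float32(float(text)) == x:
            return text
    return repr(x)


v = as_float32(101.1)
print(repr(v))                   # 101.0999984741211
print(format(v, ".6f"))          # 101.099998
print(shortest_float32_repr(v))  # 101.1
```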

scratchmex · Aug 13, 2021

My new commits solve the issue partially:

>>> np.__version__
'1.22.0.dev0+729.g5f3dadd94'
>>> v = np.float32(101.1)
>>> format(v, "")  # 101.0999984741211
'101.1'
>>> format(v, ".6f")  # 101.099998
'101.099998'

The thing is that there is a lot of code that relies on scalar types behaving like Python built-ins. For example, the 'g' specifier is used a lot and would require substantial work to make it behave the same as Python does:

>>> np.__version__
'1.19.2'
>>> a = 1.0
>>> f"{a:.3g}"
'1'
>>> a = np.float_(1.0)
>>> np.array2string(a, precision=3)
'1.'

We would need to use the trim="-" in the np.format_float_positional function.

I think what Python does is fine because it reminds you that the number you are working with is not exact if you specify 'g':

>>> np.__version__
'1.19.2'
>>> a = np.float32(101.1)
>>> f"{a:.5}"
'101.1'
>>> f"{a:.5g}"
'101.1'
>>> f"{a:.6}"
'101.1'
>>> f"{a:.6g}"
'101.1'
>>> f"{a:.6f}"
'101.099998'

but it is fine to give some "syntactic sugar" when using bare str()

>>> np.__version__
'1.22.0.dev0+729.g5f3dadd94'
>>> a = np.float32(101.1)
>>> str(a)
'101.1'

That is why I only bound format(a, "") to str(a) when a is a scalar type, and bound __format__ of 0-dim ndarrays to the scalar type's format function.

Sorry for the force-push, I rebased upstream/main by accident :s

h-vetinari · Aug 17, 2021

Last comment from #5543 was:

@charris: I suspect this issue needs an NEP explaining how formatting will be designed to behave for numpy datatypes, especially for ndarray.

Did something along those lines ever materialize?

One thing that surprised me in the example is that the formatted string representation of arrays had no commas between elements. I'm assuming that the discussion of this sort of choices with default formatting was one of the reasons @charris suggested a NEP?

scratchmex · Dec 1, 2021

I discussed with @seberg whether the default behavior of __format__ for object arrays could be to apply format() to each element and re-raise an exception if something goes wrong with an element.
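One possible reading of that idea, sketched on a plain Python list. The function format_each and the wording of the error message are assumptions made for this example; nothing here is taken from the PR.

```python
def format_each(items, spec):
    """Apply format() to every element; re-raise with the failing index."""
    parts = []
    for index, item in enumerate(items):
        try:
            parts.append(format(item, spec))
        except (TypeError, ValueError) as exc:
            # Keep the original exception type and chain the cause.
            raise type(exc)(
                f"element {index} ({item!r}) rejected format spec {spec!r}"
            ) from exc
    return "[" + " ".join(parts) + "]"


print(format_each([1.5, 2], ".1f"))  # [1.5 2.0]
```

With a list such as [1.5, "x"] and the spec ".1f", the string element rejects the float format code and the error is re-raised with the position of the offending element.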

seberg · Dec 1, 2021

Also to float it: There was an older proposal here: https://gist.github.com/gustavla/2783543be1204d2b5d368f6a1fb4d069 and a brief discussion about it: https://mail.python.org/archives/list/numpy-discussion@python.org/message/4RYDHI3Y7D5OYKJODZQ57FYVXF3LBCTQ/

seberg · Dec 3, 2021

@scratchmex before floating this to the list, this code uses just the normal np.printoptions, right, without any custom formatting function? (So the normal array printing is used, just with special precision settings?)

I am asking because I think the scalar formatting for np.float32.__format__ and np.longdouble.__format__ may be using float(scalar).__format__ (i.e. Python's formatting), which is incorrect due to the different precisions.

Assuming this is the case (and I think this is), the next thing is indeed to just make a call on whether we want to allow f"{array:f}" at all (because it is not a scalar object) or not. And I am not aware of any relevant prior art, nor do I have much of an opinion, so I would be OK with pressing on, unless anyone voices concerns.
(Meaning, I will announce it on the mailing list.)

Assuming the scalars are really broken, it would be nice to fix that up also.

scratchmex · Dec 3, 2021

If the dim is 0 (NumPy scalar), it converts the type to a Python scalar and then calls float.__format__, as you said. I did not change that behavior in this PR, but maybe we could extract the NumPy scalar, forward the call to its __format__ method, and fix it there (the current implementation for format in the scalar types is here)
If the dim is positive, I parse the format string and pass it to np.array2string. The default behavior of str(array) is to call np.core.arrayprint._default_array_str which is just a dispatcher for np.array2string. Maybe I should also call _default_array_str?

scratchmex · Jan 18, 2022

@seberg ping

doc/release/upcoming_changes/19550.new_feature.rst

charris · Sep 20, 2022

Close/reopen

charris · Sep 20, 2022

What still needs to be done here?

seberg

Just two small comments. I like the start here, although unfortunately, I think there is a bigger overhaul necessary for formatting (not just this path though!).
That is also related to my push to change the representation of scalars probably...

For now, there is the slight oddity that the scalar may behave differently from the array code for float16, float32, or longdouble. (and the complex codes I guess).
I am fine with the current oddities though and I don't think adding this here makes revising the machinery much harder.

numpy/core/src/multiarray/strfuncs.c

seberg · Sep 21, 2022

numpy/core/arrayprint.py

+            # TODO: implement code in `FloatingFormat` such that
+            # the tests in test_multiarray.py:test_general_format pass
+            # note: remember to add the tests to `test_arrayprint.py:test_type`
+            pass


Can we either implement this or just make it an error here? Ignoring the format code in this path seems not helpful?

Co-authored-by: Matti Picus <matti.picus@gmail.com>

seberg · Sep 22, 2022

numpy/core/tests/test_datetime.py

+        ):
+            format(v, "+.2f")
+
+        # TODO: this behaviour should not occur


Just a note: This is not behavior that was changed and the array code path rejects the format spec for non 0-D arrays (we drop through for 0-D).

(Also auto-reformatted some whitespace...)

seberg · Sep 22, 2022

OK, I have now added a commit to simply reject g with a NotImplementedError. Not ideal, but I prefer it to simply passing...

seberg · Sep 22, 2022

One thing I noticed is that Python rejects the "{:.10}".format(123). That is because the precision does not make sense for integers.

OTOH, the approach here simply ignores it for integers. That seems OK, or should this be discussed?
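The built-in behavior referred to here can be checked directly in plain Python, independent of NumPy:

```python
# A precision without a float type code is rejected for integers.
rejected = False
try:
    "{:.10}".format(123)
except ValueError as exc:
    rejected = True
    print(exc)

# With an explicit float type code, the int is converted to float first.
as_fixed = format(123, ".2f")
print(as_fixed)  # 123.00
```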

scratchmex · Oct 5, 2022

One thing I noticed is that Python rejects the "{:.10}".format(123). That is because the precision does not make sense for integers.

OTOH, the approach here simply ignores it for integers. That seems OK, or should this be discussed?

I think it should be an error

InessaPawson · Oct 5, 2022

@scratchmex We have discussed your PR at today's triage meeting. Please proceed with adding an error.

scratchmex · Oct 17, 2022

@InessaPawson @seberg done. Last ping, in the hope that this can now be merged. Thanks for the feedback.

seberg · Oct 20, 2022

How happy are others with merging something that may still be a bit in flux? I am probing things a bit more, and a few things I noticed which I had not noticed previously:

Python's f formatting never uses exponential format, this is what g is used for. This code currently uses the typical code-path which does switch.
What we have seems actually close to the g format code, except that Python seems to decide the threshold more dynamically than we do (we use 1e8 anywhere in the array)
For integers, the float format codes convert to float first in Python. Maybe we should only allow float and complex for now?!
The precision formatting does not enforce the ., so it can end up as a width that is ignored. Changing the regex to: r"(\.[0-9]+)?" will work (later in the code always ignore the . then).
A width for left padding could be supported easily for all numerical formatting, though.
exp_format=None would make more sense to me for the current "auto" mode. That way exp_format=False can force no exponential formatting. (distinguish None from False)
Exposed trim option is never used (seems OK to expose though).
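Two of the points above can be checked with plain Python: the difference between the built-in f and g codes for Python floats, and the effect of requiring the leading dot in the precision pattern. The variable names are chosen only for this illustration.

```python
import re

big = 1e20
small = 1e-5

# 'f' never switches to exponential notation, however large or small the value.
fixed_big = format(big, ".2f")        # '100000000000000000000.00'
fixed_small = format(small, ".2f")    # '0.00'

# 'g' switches to exponential notation depending on the exponent and precision.
general_big = format(big, ".2g")      # '1e+20'
general_small = format(small, ".2g")  # '1e-05'

# The pattern suggested above only accepts a precision with its leading dot.
PRECISION_RE = re.compile(r"(\.[0-9]+)?")
with_dot = PRECISION_RE.fullmatch(".10") is not None    # True
without_dot = PRECISION_RE.fullmatch("10") is not None  # False

print(fixed_big, general_big, fixed_small, general_small)
```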

I guess the question is largely how close we want to stick to the Python formatting here. And if we want to stick closely to it, maybe we really need something like a hypothesis test to ensure we actually match up.

(I would like to push the formatting further down, but that is not important for the PR as such.)

mattip · Oct 20, 2022

The original use case did refer to float, so maybe restricting the scope to float formatting is OK. Other than that, I think we should just go ahead, call this "experimental" and wait for people to try it out and suggest improvements.

charris · Oct 20, 2022

Might fix the two long lines.

shoyer · Oct 22, 2022

I would much rather reduce scope (e.g., by issuing errors for some dtypes) than introduce behavior that we will almost certainly want to change later. Users have strong expectations of API stability for NumPy.

If we want to merge this in an experimental state, we could keep it off by default and only enable it when users explicitly set an experimental flag or context manager.

mattip · Oct 23, 2022

numpy/core/arrayprint.py

+__all__ = ["array2string", "array_str", "array_repr", "array_format",
+           "set_string_function", "set_printoptions", "get_printoptions",
+           "printoptions", "format_float_positional",
+           "format_float_scientific"]


Why is it necessary to expose array_format here? You can still import it directly without exposing it in __all__.

mattip · Oct 23, 2022

numpy/core/arrayprint.py

@@ -57,13 +59,18 @@
    'infstr': 'inf',
    'sign': '-',
    'formatter': None,
+    'exp_format': False,  # force exp formatting as Python do when ".2e"
+    'trim': '.',  # Controls post-processing trimming of trailing digits
+                  # see Dragon4 arguments at `format_float_scientific` below


Why is trim necessary? What was wrong with the previous way it was set?

mattip · Oct 23, 2022

numpy/core/arrayprint.py

@@ -198,6 +212,11 @@ def set_printoptions(precision=None, threshold=None, edgeitems=None,
                but if every element in the array can be uniquely
                represented with an equal number of fewer digits, use that
                many digits for all elements.
+    exp_format : bool, optional
+        Prints in scientific notation (1.1e+01).


Suggested change

-        Prints in scientific notation (1.1e+01).
+        Prints in scientific notation (1.1e+01).
+
+        .. versionadded:: 1.24.0

mattip · Oct 23, 2022

numpy/core/arrayprint.py

+    exp_format : bool, optional
+        Prints in scientific notation (1.1e+01).
+    trim : str, optional
+        Controls post-processing trimming of trailing digits.


Suggested change

-        Controls post-processing trimming of trailing digits.
+        Controls post-processing trimming of trailing digits.
+
+        .. versionadded:: 1.24.0

mattip · Oct 23, 2022

numpy/core/arrayprint.py

@@ -198,6 +212,11 @@ def set_printoptions(precision=None, threshold=None, edgeitems=None,
                but if every element in the array can be uniquely
                represented with an equal number of fewer digits, use that
                many digits for all elements.
+    exp_format : bool, optional
+        Prints in scientific notation (1.1e+01).
+    trim : str, optional


Suggested change

-    trim : str, optional
+    trim : one of 'k.0-', optional

mattip · Oct 23, 2022

numpy/core/arrayprint.py

+        Prints in scientific notation (1.1e+01).
+    trim : str, optional
+        Controls post-processing trimming of trailing digits.
+        See Dragon4 arguments at ``format_float_scientific``.


Suggested change

-        See Dragon4 arguments at ``format_float_scientific``.
+        See ``trim`` argument to `numpy.format_float_scientific`.

mattip · Dec 10, 2022

Ping @scratchmex

scratchmex · Dec 10, 2022

Ping @scratchmex

Are the review comments the only thing missing? I will complete them next week.

scratchmex force-pushed the array_format branch from 38fcd74 to 3b804d9 Compare July 23, 2021 08:23

scratchmex changed the title ~~ndarray.__format__ implementation for numeric dtypes~~ ENH: ndarray.__format__ implementation for numeric dtypes Jul 23, 2021

github-actions bot added the 01 - Enhancement label Jul 23, 2021

scratchmex marked this pull request as ready for review July 29, 2021 03:41

seberg added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label Aug 10, 2021

BvB93 reviewed Aug 12, 2021

scratchmex force-pushed the array_format branch from fb7ba29 to 5f3dadd Compare August 13, 2021 10:29

scratchmex force-pushed the array_format branch from a2bbce1 to 13181ea Compare August 13, 2021 17:57

mattip reviewed Sep 7, 2022

mattip mentioned this pull request Sep 8, 2022

MAINT: Remove the deprecated style kwargs from array2string #22229

Closed

charris closed this Sep 20, 2022

charris reopened this Sep 20, 2022

seberg reviewed Sep 21, 2022

Apply suggestions from code review

367de93

Co-authored-by: Matti Picus <matti.picus@gmail.com>

seberg reviewed Sep 22, 2022

MAINT: Fully reject format code 'g' for arrays

9fdb396

(Also auto-reformatted some whitespace...)

seberg force-pushed the array_format branch from 6bfc36c to 9fdb396 Compare September 22, 2022 09:45

fix: integers should not be able to have precision

c8448d5

mattip reviewed Oct 23, 2022

mattip mentioned this pull request Dec 10, 2022

ndarray should offer __format__ that can adjust precision #5543

Open
