ENH: Use Dragon4 algorithm to print floating values #9941

ahaldane · Oct 29, 2017

This is a C port of Ryan Juckett's C++ implementation (link) of the Dragon4 algorithm for printing floating-point values to numpy. See earlier discussion in #9919 and #9932, which are both replaced by this PR. Please check the license at the top of the dragon4.c file to confirm it is suitable.

This is a WIP and I am still deciding some of the behavior, but it is in "working" state. I am putting it up now to run tests, and if anyone wants to try it out. Here's a quick summary:

Adds a python interface to the Dragon4 algorithms in np.core.multiarray.dragon4_positional and np.core.multiarray.dragon4_scientific. Documentation of arguments is in the C code, but you can call it with a single python floating scalar argument to try it out.
Numpy scalars use the dragon4 algorithm in "unique" mode (see below) for str/repr, in a way that tries to match python float output. Note: one possible bugfix is that the scientific mode cutoff for all floats is now 1e16, instead of 1e10 for float32 and 1e17 for float64 like before. CPython's cutoff is 1e16. This also fixes complex reprs like in ENH: fix 0d array printing using str or formatter. #9332.
updates arrayprint.py code to use dragon4. I also added a new option, that you can set np.printoptions(precision='unique'), which will print the minimum number of digits to uniquely specify the value. I left the default at precision=8 for now.
In arrayprint.py I've also stolen/reworked the FormatFloat generalizations from WIP: MAINT: Rewrite LongFloatFormat, trim zeros in scientific notation output #9919. Now all floating types just use a single FloatingFormat formatter, which calls the dragon4 routines.

Some examples that you can compare to the old behavior:

print(complex(0,np.inf), np.complex128(complex(0,np.inf)), str(np.array([np.complex128(complex(0,np.inf))])))
np.set_printoptions(precision='unique')
print(np.arange(10, dtype='f4')/10.)
a = np.random.rand(4)
print(repr(a))
print(repr(a.astype('f4')))
print(repr(a.astype('f2')))
print(repr(10.**np.arange(20)))

I still need to do some cleanups and tests. Also, there may be a bug in the original implementation because np.float64(1e23) isn't nicely rounded, unlike the other powers of 10.

ahaldane · Oct 29, 2017

Output of the code in my last comment, with this PR:

(infj, infj, '[ 0.+infj]')
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
array([0.41337804458542926, 0.9634011951702381 , 0.38924457984371086,
       0.6889281918519422 ])
array([0.41337803, 0.9634012 , 0.3892446 , 0.6889282 ], dtype=float32)
array([0.4133, 0.9634, 0.3892, 0.689 ], dtype=float16)
array([1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06, 1.e+07,
       1.e+08, 1.e+09, 1.e+10, 1.e+11, 1.e+12, 1.e+13, 1.e+14, 1.e+15,
       1.e+16, 1.e+17, 1.e+18, 1.e+19])

Output with master:

(infj, inf*j, '[ 0.+infj]')
[ 0.          0.1         0.2         0.30000001  0.40000001  0.5
  0.60000002  0.69999999  0.80000001  0.89999998]
array([ 0.5864939 ,  0.15714977,  0.26403041,  0.7634838 ])
array([ 0.58649391,  0.15714976,  0.2640304 ,  0.76348382], dtype=float32)
array([ 0.58642578,  0.15710449,  0.26391602,  0.76367188], dtype=float16)
array([  1.00000000e+00,   1.00000000e+01,   1.00000000e+02,
         1.00000000e+03,   1.00000000e+04,   1.00000000e+05,
         1.00000000e+06,   1.00000000e+07,   1.00000000e+08,
         1.00000000e+09,   1.00000000e+10,   1.00000000e+11,
         1.00000000e+12,   1.00000000e+13,   1.00000000e+14,
         1.00000000e+15,   1.00000000e+16,   1.00000000e+17,
         1.00000000e+18,   1.00000000e+19])

eric-wieser · Oct 30, 2017

Out of curiosity - what keeps us from directly using the C++ version?

eric-wieser · Oct 30, 2017

The spaces in

array([0.41337803, 0.9634012 , 0.3892446 , 0.6889282 ], dtype=float32)

look pretty bad to me, but I understand the rationale. I'd perhaps argue for zero padding to make them all the same length. Otherwise it causes confusion for users comparing the last digits.

What does this give?

a = 10.**np.arange(20)
a[0] = 123456

ahaldane · Oct 30, 2017

Re: the spacing in the float32 array: That is behavior is still to be decided, now that we have more choices due to the existence of a dragon4 interface. Note that example only gets output for precision='unique', but we don't need to make that the default. (In fact, it is not the default in this PR... note I manually set it in my example).

Also, here is what this PR currently outputs by default for your code:

[1]: a = 10.**np.arange(20)
[2]: a[0] = 123456
[3]: a
array([1.23456e+05, 1.00000e+01, 1.00000e+02, 1.00000e+03, 1.00000e+04,
       1.00000e+05, 1.00000e+06, 1.00000e+07, 1.00000e+08, 1.00000e+09,
       1.00000e+10, 1.00000e+11, 1.00000e+12, 1.00000e+13, 1.00000e+14,
       1.00000e+15, 1.00000e+16, 1.00000e+17, 1.00000e+18, 1.00000e+19])

ahaldane · Oct 30, 2017

Also, another comment on the spacing in unique mode:

We already get funny spacing like that in current numpy master:

[31]: array([0.32254803, 0.8584671, 0.07836795, 0.2622], dtype=float32)
array([ 0.32254803,  0.8584671 ,  0.07836795,  0.2622    ], dtype=float32)

(in fact you can also see this in the "master" output I pasted above)

ahaldane · Oct 30, 2017

As for leaving it in C++, I honestly didn't consider that carefully since I assumed it would be harder/more confusing to try to interface to it than to just port to C. Also, while I didn't modify the dragon4 code itself much, I did pretty heavily rewrite the "wrapper" part of the code that converts dragon4 output to a string.

ahaldane · Oct 30, 2017

More thoughts on spacing: I don't think padding with 0s is correct, since the "unnecessary" digits which were dropped are not necessarily 0.

To avoid the trailing spaces, probably we could do as follows: In the first pass in FloatingFormat, figure out the maximum fractional length of the data, when output with all the trimming. Then, set the precision to that value in the second pass, and don't do any trimming in the second pass (trim-mode 'k'), thus keeping the extra digits in the values that were a bit shorter in the first pass.

mdickinson · Oct 30, 2017

Presumably this would fix #9360? (And also close #2643 and #6136.)

charris · Oct 30, 2017

The test failures are due to ieee extended precision on 32 linux being aligned on 4 byte boundaries, hence float96 instead of float128.

ahaldane · Oct 30, 2017

Yes, I thought I accounted for that with the #ifdef NPY_FLOAT96, but apparently that macro-constant isn't defined...

ahaldane · Oct 30, 2017

Also I figured out the 1e23 strangeness: It is because the implementation currently doesn't use IEEE unbiased rounding. It should be easy to modify it to do so.

Also, in principle it is possible to change the rounding mode, see GNU libc's discussion (link). However, I think CPython has forced/assumed the rounding mode to be IEEE unbiased, so I think it is safe to assume the same for numpy.

charris · Oct 30, 2017

numpy/core/src/multiarray/dragon4.h

+
+typedef enum TrimMode
+{
+    TrimMode_None,         // don't trim zeros, always leave a decimal point


Should convert these to C style comments.

charris · Oct 30, 2017

Yes, I thought I accounted for that with the #ifdef NPY_FLOAT96, but apparently that macro-constant isn't defined...

Missing include statement. Need #include <numpy/npy_common.h>, maybe something in setup also.

mdickinson · Oct 30, 2017

CPython has forced/assumed the rounding mode to be IEEE unbiased

Assumed, rather than forced. The dtoa.c code is a bit of a mix: some parts (mainly the integer arithmetic parts) explicitly construct the round-ties-to-even results, without help from the FPU. Others (the fast paths that use floating-point), assume that the FPU control word remains at its default of round-ties-to-even. If the FPU rounding mode were set to something else, it's probable that Python's float repr would give wrong results in some corner cases, and conceivable that it would do something even more horrible, like entering an infinite loop. On the one hand, that's probably bad, and CPython should (but doesn't) make an effort at least to check that the FPU control word is doing the right thing. On the other hand, it doesn't seem to have caused any error reports to date, and the code has been in use for a good number of years now.

So yes, while in principle assuming that the x87/SSE2 rounding mode settings will be the right ones makes me a bit nervous, in practice it appears to be a fairly safe assumption.

Also, the dtoa.c code is written in such a way that infinite loops could happen (at one point I started refactoring to eliminate the potential for infinite loops, but that project never got completed); I'd hope that the Dragon4 code is a bit better behaved in that respect.

[Full disclosure: I was responsible, with Eric Smith, for implementing the CPython shortest repr implementation, though of course the original core dtoa.c code comes from David Gay.]

ahaldane · Oct 30, 2017

@mdickinson thanks for the links to the issues.

Assuming IEEE unbiased rounding in this PR is "safe" in the sense that it won't crash. I think the worst case scenario is that we spit out floating point strings that don't covert back to their original values, eg in the case of 1e23. (By the way, I am discussing the rounding mode issues with Ryan Juckett over on his website, linked above).

I don't really want to go through the effort of accounting for all the rounding modes here. But at least maybe I could add a check to fegetround during numpy startup that gives a serious warning if the rounding mode isn't unbiased.

ahaldane · Oct 31, 2017

Updated

Implemented IEEE unbiased rounding
fixed a possible bug in the Dragon4 implementation
did a pass at reformatting everything to match numpy style (including C comments)

When I come back to this I'm going to work on an arrayprint.py interface and on choosing the right defaults. I'm going to try adding afloatmode printing option, which controls controls the interpretation of the precision printing option. The doc would be:

    floatmode : str, optional
        Controls the interpretation of the `precision` option for
        floating-point types. Can take the following values:
            - 'fixed' : Always print exactly `precision` fractional digits,
                    even if this would print more or fewer digits than
                    necessary to specify the value uniquely.
            - 'unique : Print the minimum number of fractional digits necessary
                    to represent each value uniquely. Different elements may
                    have a different number of digits.  The value of the
                    `precision` option is ignored.
            - 'maxprec' : Print at most `precision` fractional digits, but if
                    an element can be uniquely represented with fewer digits
                    only print it with that many.
            - 'maxprec_equal' : Print at most `precision` fractional digits,
                    but if every element in the array can be uniquely
                    represented with an equal number of fewer digits (or less),
                    use that many digits for all elements.
        Default is 'maxprec'

charris · Nov 5, 2017

I think you need a separate closes for each issue.

charris · Nov 5, 2017

The rounding still worries me and might be a problem. I suppose we need to rely on the tests to check if that is a problem. Also, out of curiosity, what were the python (hence compiler) versions that had trouble with 1e23?

eric-wieser · Nov 5, 2017

numpy/core/arrayprint.py

@@ -463,6 +483,23 @@ def array2string(a, max_line_width=None, precision=None,
        (whitespace character) in the sign position of positive values.  If
        '-', omit the sign character of positive values. If 'legacy', print a
        space for positive values except in 0d arrays.
+    floatmode : str, optional


Are we sure we don't want float_mode for consistency with suppress_small?

eric-wieser · Nov 5, 2017

numpy/core/arrayprint.py

-            return format_longfloat(x, self.precision)
-
-
-class LongComplexFormat(object):


I think we might need to keep this for compatibility, as a subclass that does nothing but throw a warning in it's constructor.

eric-wieser · Nov 5, 2017

numpy/core/arrayprint.py

@@ -597,8 +634,8 @@ def _formatArray(a, format_function, rank, max_line_len,
                          summary_insert).rstrip()+']\n'
    return s

-class FloatFormat(object):
-    def __init__(self, data, precision, suppress_small, sign=False):
+class FloatingFormat(object):


Same comment about leaving a public FloatFormat object visible, assuming we're considering this public API.

Docstring saying For np.floating objects or something would be nice too

eric-wieser · Nov 5, 2017

numpy/core/arrayprint.py

-    # XXX Have to add something to determine the width to use a la FloatFormat
-    # Right now, things won't line up properly
-    def __init__(self, precision, sign=False):
+class ComplexFormat(object):


Maybe ComplexFloatingFormat to match np.complexfloating? (#9962)

eric-wieser

Really pleased to see all the arrayprint formatters merge together, but a couple nits about names and compatibility in arrayprint, which we should probably look at before 1.14

eric-wieser · Nov 5, 2017

numpy/core/src/multiarray/scalartypes.c.src

 */

-#define PREC @NAME@PREC_@KIND@
+static PyObject *
+@name@type_@kind@_either(npy_@name@ val, TrimMode trim_pos, TrimMode trim_sci,


Could do with a comment explaining what this function is for - "either" isn't super clear

eric-wieser · Nov 5, 2017

numpy/core/src/multiarray/scalartypes.c.src

-static int
-c@name@type_print(PyObject *v, FILE *fp, int flags)
+static PyObject *
+halftype_@kind@(PyObject *self)


Is this special compared to the other ones? Why?

eric-wieser · Nov 5, 2017

numpy/core/arrayprint.py

    elif issubclass(dtypeobj, _nt.complexfloating):
-        if issubclass(dtypeobj, _nt.clongfloat):
-            return formatdict['longcomplexfloat']()


This means that users setting a longcomplexfloat formatter will now see it being ignored silently

Introduced in numpy#9941

eric-wieser · Nov 5, 2017

Deleted my comments about whitespace, and just added a STY commit to #9962, to reduce noise here

ahaldane force-pushed the dragon4 branch 2 times, most recently from dd3b272 to a96bc04 Compare October 30, 2017 01:17

ahaldane force-pushed the dragon4 branch from a96bc04 to bb7e458 Compare October 30, 2017 01:52

ahaldane force-pushed the dragon4 branch from bb7e458 to 00282f3 Compare October 30, 2017 02:12

ahaldane force-pushed the dragon4 branch 3 times, most recently from ef7cf07 to b133177 Compare October 30, 2017 04:10

ahaldane force-pushed the dragon4 branch 2 times, most recently from 642677b to bc5141c Compare October 30, 2017 14:51

charris added 01 - Enhancement component: numpy._core labels Oct 30, 2017

charris reviewed Oct 30, 2017

View reviewed changes

ahaldane force-pushed the dragon4 branch from bc5141c to 3d78b81 Compare October 30, 2017 21:30

ahaldane force-pushed the dragon4 branch 2 times, most recently from 0901003 to eae2a80 Compare October 31, 2017 04:12

eric-wieser reviewed Nov 5, 2017

View reviewed changes

eric-wieser mentioned this pull request Nov 5, 2017

MAINT: Rename formatters to match scalar type names #9962

Merged

eric-wieser added a commit to eric-wieser/numpy that referenced this pull request Nov 5, 2017

STY: Improve indentation (bad levels and tabs/spaces)

f9368d8

Introduced in numpy#9941

charris mentioned this pull request Nov 5, 2017

TST: compat with numpy 1.14 pandas-dev/pandas#18123

Closed

effigies mentioned this pull request Nov 9, 2017

More numpy pre-release formatting issues nipy/nibabel#577

Closed

This was referenced Nov 12, 2017

MAINT: Backcompat fixes for dragon4 changes #10008

Merged

Floating-point pow gives different results on different platforms #10010

Open

qinhanmin2014 mentioned this pull request Nov 14, 2017

CRON Build breaks scikit-learn/scikit-learn#10074

Closed

anntzer mentioned this pull request Nov 19, 2017

Improve reprs of transforms. matplotlib/matplotlib#9421

Merged

6 tasks

jjhelmus mentioned this pull request Jan 18, 2018

Float128 repr is wrong on ppc64 #10429

Closed

rgommers mentioned this pull request May 1, 2019

License not detected on repo, and some issues #13447

Closed

rgommers mentioned this pull request May 29, 2019

dragon4 code has a custom non-OSI approved license #13674

Closed

This was referenced Apr 21, 2020

Bump minimum dependencies to those in Ubuntu 18.04 LTS and Python to 2.7 and 3.5-3.8. chrisdembia/yeadon#129

Closed

strip_zeros kwarg no longer available in FloatFormat #16036

Closed

StrikerRUS mentioned this pull request May 20, 2020

Flaky e2e test for the XGBoost model with the 'gblinear' booster BayesWitnesses/m2cgen#205

Open

rossbar mentioned this pull request Jul 14, 2020

API: Review public functionality in arrayprint #16869

Closed

StrikerRUS mentioned this pull request Jul 30, 2020

[fix] Locale independent model load/save microsoft/LightGBM#3267

Closed

scratchmex mentioned this pull request Aug 11, 2021

ENH: ndarray.__format__ implementation for numeric dtypes #19550

Open

		return format_longfloat(x, self.precision)


		class LongComplexFormat(object):

Search code, repositories, users, issues, pull requests...

Uh oh!

ENH: Use Dragon4 algorithm to print floating values #9941

ENH: Use Dragon4 algorithm to print floating values #9941

Uh oh!

Conversation

ahaldane commented Oct 29, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Oct 29, 2017

Uh oh!

eric-wieser commented Oct 30, 2017

Uh oh!

eric-wieser commented Oct 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Oct 30, 2017

Uh oh!

ahaldane commented Oct 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Oct 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Oct 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mdickinson commented Oct 30, 2017

Uh oh!

charris commented Oct 30, 2017

Uh oh!

ahaldane commented Oct 30, 2017

Uh oh!

ahaldane commented Oct 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

charris Oct 30, 2017

Choose a reason for hiding this comment

Uh oh!

charris commented Oct 30, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mdickinson commented Oct 30, 2017

Uh oh!

ahaldane commented Oct 30, 2017

Uh oh!

ahaldane commented Oct 31, 2017

Uh oh!

charris commented Nov 5, 2017

Uh oh!

charris commented Nov 5, 2017

Uh oh!

eric-wieser Nov 5, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser Nov 5, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser Nov 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser Nov 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser left a comment

Choose a reason for hiding this comment

Uh oh!

eric-wieser Nov 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser Nov 5, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser Nov 5, 2017

Choose a reason for hiding this comment

Uh oh!

eric-wieser commented Nov 5, 2017

ahaldane commented Oct 29, 2017 •

edited

Loading

eric-wieser commented Oct 30, 2017 •

edited

Loading

ahaldane commented Oct 30, 2017 •

edited

Loading

ahaldane commented Oct 30, 2017 •

edited

Loading

ahaldane commented Oct 30, 2017 •

edited

Loading

ahaldane commented Oct 30, 2017 •

edited

Loading

charris commented Oct 30, 2017 •

edited

Loading

eric-wieser Nov 5, 2017 •

edited

Loading

eric-wieser Nov 5, 2017 •

edited

Loading

eric-wieser Nov 5, 2017 •

edited

Loading