Rewrite complex log1p to improve precision. #26798
Conversation
I seem to recall some literature about computing log1p; there were some tricks, but that was some time ago and I don't know if they would still apply.
It would be nice if we could push this problem off to the math libraries, but I guess that isn't to be...

Overall, it seems to me like we should probably go ahead here. @WarrenWeckesser I wonder if you can ask someone from SciPy to have a look at the code as well? It looks fine, but I didn't try to follow the math fully and would love a bit more `mpmath` tries/scans to be on the safe side (although I am sure you already did that).

I did make a few comments; the biggest one is that I don't trust that `NO_OPT` stuff, and I think `volatile` may do the trick.
```c
 * when the input is close to the unit circle centered at -1+0j.
 */
NPY_NO_EXPORT void
clog1p@c@(@np_cplx_type@ *x, @np_cplx_type@ *r)
```
Would have a mild preference for giving this a different name. `clog1p` isn't clear that it is only relevant in parts of the domain.

Might be nice to avoid the `.src` style templating. I think that would mean making this a static (inline) templated function and then creating three explicit stubs that call it below. Since each stub has a single line, I would hope it isn't significantly longer.
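A minimal sketch of that restructuring; the names and the placeholder body are illustrative, not the PR's actual code:

```cpp
#include <complex>
#include <iostream>

// Shared logic in one static templated function instead of a .src
// template expanded per type.  (Placeholder body: the real algorithm
// switches to the high-precision path near |z + 1| == 1.)
template <typename T>
static inline std::complex<T>
log1p_complex(std::complex<T> z)
{
    return std::log(std::complex<T>(1) + z);
}

// Three explicit one-line stubs replace the @c@ expansion, so the file
// should not end up significantly longer.
std::complex<float>       clog1pf(std::complex<float> z)       { return log1p_complex(z); }
std::complex<double>      clog1p (std::complex<double> z)      { return log1p_complex(z); }
std::complex<long double> clog1pl(std::complex<long double> z) { return log1p_complex(z); }

int main()
{
    std::cout << clog1p({1e-3, 1e-3}) << "\n";  // exercises the double stub
}
```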
```c
@@ -0,0 +1,260 @@
#ifndef LOG1P_COMPLEX_H
```
This is C++ only, I think, so let's rename it to `.hpp`?

Also please adjust the include guard to be the google-style full file path (see other files).
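For example, assuming the header ends up at `numpy/_core/src/umath/log1p_complex.hpp` (the exact path here is a guess):

```cpp
#ifndef NUMPY_CORE_SRC_UMATH_LOG1P_COMPLEX_HPP_
#define NUMPY_CORE_SRC_UMATH_LOG1P_COMPLEX_HPP_

// ... C++-only declarations ...

#endif  // NUMPY_CORE_SRC_UMATH_LOG1P_COMPLEX_HPP_
```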
```c
CPP_WRAP2(atan2)
CPP_WRAP2(hypot)

}
```
I guess we don't have a nicer C++ pattern/common spot for these yet, so this looks fine.
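For context, one plausible shape for such a wrapper macro: it turns a C `name##f`/`name`/`name##l` family into a single overloaded C++ name that templated code can call for any floating-point type. This is a guess at the pattern, not the PR's actual macro; the sketch uses the `<math.h>` functions so it compiles standalone, where the PR presumably wraps NumPy's `npy_*` equivalents.

```cpp
#include <math.h>

// Guess at the pattern: one overload per floating-point type, each
// dispatching to the matching C library function.
#define CPP_WRAP2(name)                                       \
    inline float name(float x, float y)                       \
        { return ::name##f(x, y); }                           \
    inline double name(double x, double y)                    \
        { return ::name(x, y); }                              \
    inline long double name(long double x, long double y)     \
        { return ::name##l(x, y); }

namespace npy_wrap {
CPP_WRAP2(atan2)
CPP_WRAP2(hypot)
}

// Templated code can now call npy_wrap::atan2(y, x) for any
// floating-point type T.
```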
```c
// __i386 are defined also turns off the use of the floating-point
// unit '387'. It is important that when the type is, for example,
// `double`, these functions compute their results with 64 bit
// precision, and not with 80 bit extended precision.
```
This makes me worried that this may break e.g. on MSVC (depending on optimization levels)!? Are there tests that would fail if it does?

I suspect you should rather use something like (may require 2 volatiles):

```c
volatile T tmp = t - x;
t = t - tmp;
```

I suppose that may fail with `fast-math` enabled. If that failure is "graceful", as in: you get bad precision, I would ignore it. If you compile NumPy with fast-math you should get test failures.
I was trying to find a failure mode for either (also the SSE forcing), but beyond the Intel compiler effectively defaulting to fast-math, I guess, I couldn't.

The `volatile` always adds some store/load instructions, which maybe avoids the use of the 80-bit FPU registers as well (but maybe is bad for speed?).
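For what it's worth, the trick being suggested amounts to the classic Fast2Sum error-free transformation, with `volatile` forcing intermediates out to memory so that neither algebraic simplification nor 80-bit x87 registers can interfere. A sketch under those assumptions, not the PR's code:

```cpp
#include <cstdio>

// Fast2Sum: for |a| >= |b|, hi + lo == a + b exactly.  The volatile
// stores keep the compiler from simplifying (s - a) to b and from
// carrying extra precision in x87 registers.  (fast-math may still
// break this, matching the caveat above.)
template <typename T>
static void fast_two_sum(T a, T b, T &hi, T &lo)
{
    volatile T s = a + b;   // rounded sum, forced through memory
    volatile T t = s - a;   // the part of b that survived the rounding
    hi = s;
    lo = b - t;             // the rounding error of a + b
}

int main()
{
    double hi, lo;
    fast_two_sum(1.0, 1e-18, hi, lo);
    std::printf("hi = %g, lo = %g\n", hi, lo);  // hi = 1, lo = 1e-18
}
```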
```python
# print(w.imag)
#
@pytest.mark.skipif(np.finfo(np.longdouble).machep != -112,
                    reason='reference values are for IEEE float128')
```
Happy to keep the test; I am a bit surprised, since if I understand right, this test is for true quad precision. I am not sure that we have CI that runs with true quad precision.
```c
@@ -30,6 +30,7 @@
#include "string_ufuncs.h"
#include "stringdtype_ufuncs.h"
#include "special_integer_comparisons.h"
#include "clog1p_wrappers.h"
```
Maybe nicer to include this in the `funcs.inc.src` code, since it is used here only indirectly? OTOH, I guess a few other includes should maybe also be moved by that logic.
Closes gh-22609

TL;DR: To implement `log1p(z)` for complex input `z = x + y*i`, use `log(1+z)` where that calculation doesn't lose precision. Otherwise, compute the real part of `log1p(z)` as `0.5*log1p(x**2 + 2*x + y**2)`, and use variables with doubled precision to avoid loss of precision in the expression `x**2 + 2*x + y**2`.
The formula for the complex logarithm of $z = x + iy$ is $\log z = \log |z| + i\arg z$, where $|z| = \left(x^2 + y^2\right)^{\frac{1}{2}}$, so for $\textrm{log1p}(z)$ we have

$$\textrm{log1p}(z) = \log(1 + z) = \frac{1}{2}\log\left((x+1)^2 + y^2\right) + i\,\textrm{atan2}(y, x+1)$$

I'll call this the naive formula.

The computation of $\textrm{atan2}(y, x+1)$ is stable, so we do not have to do anything special to compute it accurately. The computation of the real part, $\frac{1}{2}\log((x+1)^2 + y^2)$, can lose precision in some regions of the complex plane. The example from gh-22609 is `z = 1e-18 + 1e-18j`, where `log(1+z)` returns `1e-18j` but the correct value is `1e-18 + 1e-18j`. (Reference values are computed with mpmath.) The reason for the incorrect result is clear from the expression $\log((x+1)^2 + y^2) = \log(1 + 2x + x^2 + y^2)$. When $x$ and $y$ are sufficiently small, the terms $2x$, $x^2$ and $y^2$ are smaller than the machine $\varepsilon$, so the result is $\log(1) = 0$. So clearly there is a problem with the naive formula when $x$ and $y$ are small. We should use $\textrm{log1p}(2x + x^2 + y^2)$, but that expression turns out to still have a problem with precision loss.

Another example is `z = (-0.2892646601647662 + 0.7034589443707598j)`, where the correct value of `log1p(z)` is `-3.951471916126033e-07 + 0.7802529500871193j`, but the naive formula `log(1 + z)` (on a platform compiled with gcc 11.4.0) gives `-3.9514719157314957e-07+0.7802529500871193j`. The relative error of the real part is about `1e-10`, and a similar relative error is observed in NumPy built on a Mac with Apple clang 14.0.3. The $x$ and $y$ values are not particularly small, but there is still loss of precision. Notice that the real part of the result is small; that occurs when $(x+1)^2 + y^2 = 1 + 2x + x^2 + y^2 \approx 1$. The Taylor expansion of the real log function $\log(u)$ at $u=1$ is $(u - 1) - (u - 1)^2/2 + \ldots$, so it is not surprising that there can be precision issues near 1. Moreover, for $1 + 2x + x^2 + y^2$ to be close to $1$, $2x + x^2 + y^2$ must be small. And that implies that there must be cancellation between $2x$ and $x^2 + y^2$ (so the use of $\textrm{log1p}(2x + x^2 + y^2)$ doesn't eliminate the precision problem). The region in the complex plane where this happens is near the unit circle centered at $z_0 = -1 + 0j$; that is, where $|z + 1| = 1$.

The severity of this loss of precision seems to depend on the details of the implementation of the complex log function in the standard library. On my Linux system, where I used gcc 11.4.0 to build NumPy, there is loss of precision near the unit circle centered at -1+0j only in the arc corresponding to x > -0.5. On a Mac where I built NumPy with Apple clang version 14.0.3, there is loss of precision near the entire circle. In this PR, the loss of precision is handled by implementing a higher precision version of the calculation that is only used in a region near that unit circle.
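Both examples above can be reproduced directly; the following is a small standalone sketch of the naive formula, checked against the mpmath reference values quoted in this description (exact digits of the naive results vary by platform and libm):

```cpp
#include <complex>
#include <cstdio>

int main()
{
    // Tiny input: 1.0 + 1e-18 rounds to 1.0, so the 2x term is lost
    // before log is even called and the naive real part is ~0.
    std::complex<double> z1(1e-18, 1e-18);
    std::printf("naive:     %.17g\n", std::log(1.0 + z1).real());
    std::printf("reference: %.17g\n", 1e-18);  // mpmath value quoted above

    // Near the unit circle |z + 1| == 1: 2x cancels against x^2 + y^2,
    // so the naive real part loses several digits.
    std::complex<double> z2(-0.2892646601647662, 0.7034589443707598);
    std::printf("naive:     %.17g\n", std::log(1.0 + z2).real());
    std::printf("reference: %.17g\n", -3.951471916126033e-07);  // mpmath
}
```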
The high precision calculation is implemented in C++ in `log1p_complex.h`. The essential idea is simple: compute the real part of the log with the formula $\textrm{log1p}(x^2 + 2x + y^2)$, and compute $x^2 + 2x + y^2$ with higher precision than provided by the input types. Single precision `float` is upgraded to `double`, and `double` is upgraded to `long double` if `long double` has at least 106 bits of precision. Lower precision `double` types, and the `long double` type, are upgraded by using a doubled type (usually called double-double, but in the case of `long double`, the doubled type could be called double-long double). Only the minimal set of functions needed to compute $x^2 + 2x + y^2$ with doubled variables is included.
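The doubled-variable idea can be illustrated in a few lines. This is a sketch of the general technique (error-free transformations), not the actual contents of `log1p_complex.h`:

```cpp
#include <cmath>
#include <cstdio>

// A value represented as the unevaluated sum hi + lo of two doubles.
struct dd { double hi, lo; };

// Exact product via FMA: hi + lo == a * b exactly.
static dd two_prod(double a, double b)
{
    double hi = a * b;
    return {hi, std::fma(a, b, -hi)};
}

// Knuth's TwoSum: hi + lo == a + b exactly, no ordering requirement.
static dd two_sum(double a, double b)
{
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}

// x^2 + 2x + y^2, folding the low-order parts back in after the
// cancellation between 2x and x^2 + y^2.
static double x2_plus_2x_plus_y2(double x, double y)
{
    dd x2 = two_prod(x, x);
    dd y2 = two_prod(y, y);
    dd s  = two_sum(x2.hi, y2.hi);   // x^2 + y^2, exactly split
    dd t  = two_sum(s.hi, 2.0 * x);  // the cancellation happens here
    return t.hi + (t.lo + s.lo + x2.lo + y2.lo);
}

int main()
{
    // Point near the unit circle centered at -1+0j; the naive
    // expression loses several digits to cancellation.
    double x = -0.2892646601647662, y = 0.7034589443707598;
    std::printf("naive:   %.17g\n", x * x + 2.0 * x + y * y);
    std::printf("doubled: %.17g\n", x2_plus_2x_plus_y2(x, y));
}
```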
The files `clog1p_wrappers.cpp.src` and `clog1p_wrappers.h` provide the shims for calling C++ code from C in `umathmodule.c`. The functions `nc_log1p@c@(...)` in `funcs.inc.src` check whether the input is close enough to the unit circle centered at -1+0j to use the high precision calculation; otherwise they simply call `nc_log@c@(...)`
with input `1+z`.

Tests are included for `np.complex64`, `np.complex128`, and the various flavors of platform-dependent `np.clongdouble`.

Because the new code uses `log(1+z)` for points away from the circle, it benefits from whatever behavior is included in the standard library's implementation (or NumPy's implementation if the standard library's version is blocklisted). In particular, this PR fixes an obscure bug in the main branch that occurs when the input $z = x + iy$ is so large that $\textrm{hypot}(x, y)$ is larger than the input type's maximum value. With the main development branch, we got an overflow and the real part of the result is `inf`; with the change in this pull request, we get the correct result (at least on the platform where I ran this code).
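A standalone sketch of that failure mode, with an input chosen here (illustratively) so that $\textrm{hypot}(x, y)$ exceeds the largest double:

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    // |1 + z| ~ 2.1e308 > DBL_MAX, so computing hypot directly overflows
    // and the real part of the naive log1p comes out as inf.
    double x = 1.5e308, y = 1.5e308;
    std::printf("naive:  %g\n", std::log(std::hypot(x + 1.0, y)));

    // Rescaling before the log avoids the overflow:
    //   log|1 + z| = log(m) + log(hypot((x + 1)/m, y/m)),
    //   with m = max(|x + 1|, |y|)
    double m = std::fmax(std::fabs(x + 1.0), std::fabs(y));
    double scaled = std::log(m) + std::log(std::hypot((x + 1.0) / m, y / m));
    std::printf("scaled: %.17g\n", scaled);  // ~709.948, perfectly finite
}
```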
Performance changes
The new version is slower, but that is not because of the use of the doubled types. It uses the standard library to compute `log(1+z)` whenever the high precision calculation is not needed, and that function (at least on my Linux box) is slower than the implementation of complex `log1p` in main. But the implementation in the main branch doesn't have any additional checks that might be in the standard library (see the "obscure bug" mentioned above).

Timings were taken for computing `log1p(z)` on samples from a normal distribution around 0+0j with standard deviation 2, first with the main branch and then with this pull request.
The especially poor performance with `np.complex64` also occurs with `np.log(z)`, so it has nothing to do with this pull request, other than this change relying on `log(z)` for the "safe" case (see #26807). The slowdown for `np.complex128` and `np.complex256` is the result of deferring to `log(1+z)` most of the time; the performance of `log1p(z)` is basically the same as `log(1+z)`.