Fix #217 double NaN regression #2 #306

glenebob · Aug 2, 2014

This is a continuation of PR #257,.@franciscojunior's fix for bug #217.

I want to limit the areas that are effected by the rather substantial complexity of properly quoting string values for all the combinations of PG backend, query mechanism, and whether or not the value is an array element. So, I moved QuoteASCIIString() over right next to the massive raw string formatting/quoting function, spiffed it up a little, and marked it internal. Then I changed the float converters to use it, so that they don't have to know anything about how to quote the value.

Note that we don't strictly need the double-quoted NaN value because floating point uses the binary encoding for extended queries. However, if a test is ever put in to test NaN array elements with binary encoding suppressed, they would fail, causing some pain.

Thoughts?

-Glen

After commit 854862a Npgsql wasn't adding quotes around float4 and float8 values. But according to documentation, NaN values need to be quoted. Separated handling of float4 and float8 conversion routines.

…heads up. Add tests for Float array.

…nciscojunior/Npgsql into quote_FP_NaN_values

Add non-quoted and double quoted NaN ASCII string values. Move QuoteASCIIString() from NpgsqlNativeTypeInfo to BasicNativeToBackendTypeConverter. Add support in QuoteASCIIString() for predetermined quoted values. Use QuoteASCIIString() in the two native float to backend float text converters.

roji · Aug 2, 2014

Haven't had the time to look at the work, but right off the bat I think we abandoned non-binary encoding since we dropped support for protocol 2, right? So should we still be worried about this?

glenebob · Aug 2, 2014

Since the code knows how to do non-binary, I think it should ALL know how to do non-binary. Part of the reason is that we could decide at any time to drop binary support for some particular data type. in which case the text encoder would need to be working, or we get the "pain" I mentioned. The encoder logic is too complicated to be trifling with. It's the kind of code you work really hard to get perfect, and then hope nobody ever has to debug it again.

Why would we drop support for binary float encoding? Start supporting a platform (either us, or PG itself) that uses "some other" floating point binary format, and we pretty much have to go back to text only.

roji · Aug 2, 2014

I'm out of my depth here - I don't know this part of Npgsql and Postgresql very well. I was under the impression that for several Postgresql versions now, since the introduction of protocol 3, the best practice is always to encode in binary if possible, with text encoding being kept mostly for backwards compatibility. Can you confirm if this isn't right?

If my understanding is right, then since we no longer need to support pre-9.0 Postgresql versions, isn't it better to gradually drop text encoding altogether, yielding tighter and less complicated codebase? And even if the Postgresql backend (hypothetically) introduces a new floating point binary format, does that mean we have to maintain support for text encoding now?

Again, I really don't know much about this so I'd really appreciate being corrected.

glenebob · Aug 2, 2014

Hey @roji,

On Sat, Aug 2, 2014 at 2:16 PM, Shay Rojansky notifications@github.com
wrote:

I'm out of my depth here - I don't know this part of Npgsql and Postgresql
very well. I was under the impression that for several Postgresql versions
now, since the introduction of protocol 3, the best practice is always to
encode in binary if possible, with text encoding being kept mostly for
backwards compatibility. Can you confirm if this isn't right?

Actually Tom Lane commented at some point in the past that binary encoding
should actually be avoided for most data types to prevent portability
problems. Things like integers and byte arrays are pretty set in stone as
far as binary format goes, but you cannot say the same for most other
types. If the PG team decides to change a binary format, it could be
painful to snap to the new format, while handling yet another version
specific decision somewhere in Npgsql.

I took a run at a codec for decimal types, and I was never able to achieve
higher performance that the stock text format. If binary encoding offers
no performance gain, then it's not really worth doing. Without a direct
translation (all the types we support binary on now are the same exact
format on the backend as on the frontend), a performance improvement could
be hard to come by. The glaring exception to that is arrays; arrays of
binary encoded data are really fast, and Npgsql's handling of array text
encoding looks pretty inefficient to me.

So, don't get too cozy the binary encodings :)

Oh, and did you forget that in version 3 simple query mode, there is no
such thing as binary encoding?

If my understanding is right, then since we no longer need to support
pre-9.0 Postgresql versions, isn't it better to gradually drop text
encoding altogether, yielding tighter and less complicated codebase? And
even if the Postgresql backend (hypothetically) introduces a new floating
point binary format, does that mean we have to maintain support for text
encoding now?

Well, it obviously means we have to do something. We could go with text,
or we could support the new binary format. But, like integers, floating
point numbers are the same in memory between frontend and backend, so the
"encoding" is currently dirt cheap (just a copy and usually a byte order
swap). If we have to do a real encoding, it won't be so cheap anymore, so,
as I said above, text may be our best option.

-Glen

Again, I really don't know much about this so I'd really appreciate being

corrected.

—
Reply to this email directly or view it on GitHub
#306 (comment).

franciscojunior · Aug 3, 2014

If I recall correctly, binary encoding is only done when on extended query and if there is a binary converter for the type, right?
https://github.com/npgsql/Npgsql/blob/master/Npgsql/NpgsqlTypes/NpgsqlTypeInfoNative.cs#L210
https://github.com/npgsql/Npgsql/blob/master/Npgsql/NpgsqlTypes/NpgsqlTypeInfoNative.cs#L295
https://github.com/npgsql/Npgsql/blob/master/Npgsql/NpgsqlTypes/NpgsqlTypeInfoBackend.cs#L189

And this is done only when NpgsqlCommand.Prepare() method is called. As we can set the format each column of the result set will be returned in:
https://github.com/npgsql/Npgsql/blob/master/Npgsql/Npgsql/NpgsqlCommand.PrepareExecute.cs#L499

So, I think text encoding is needed as we have a lot of types which doesn't have binary encodings handlers.

So, I moved QuoteASCIIString() over right next to the massive raw string formatting/quoting function, spiffed it up a little, and marked it internal. Then I changed the float converters to use it, so that they don't have to know anything about how to quote the value.

I think I need more clarification on this part. Because it seems the QuoteASCIIString change makes it a little odd to me. It seems to have some short circuits (the single and double quoted values). The problem is that in order to use those short circuits, you would need to pass the source value two or three times (the original, single quoted and double quoted). Note that you did that for the NaN values which are very short. But what if I use this method for larger values? I think we would end up using the version without the single and double quote values, as you did in the other places.

I think that even with this change, the float coverters still need to know about the single quoted and double quoted values (as they need to pass them to the method). Also, I think the converters should be responsible to know when to quote and how.
When I designed this idea of the coverters, my goal was to make then hold all the information needed to serialize the value, hence the forExtendedQuery and array flags as well as some other typing information.

On a side note, we also have a little memory usage problem with the WrapASCIIString. It creates a new byte array 2 bytes longer than the original array. This means that for each value being translated which needs quoting, we are doubling the memory pressure. I think we could fix that by instead of using raw byte[] objects, maybe we could use some NpgsqlSerializableValue type or some other value, where it would keep this quoting information and would allow us to send this quote only when writing to the network stream. This way, instead of creating a new byte[] to hold the quotes and then copying the data to this new array, we would simply write the quote, then the byte[], then the quote. What do you all think?

Sorry if I said a lot of nonsense things. And please, correct me if I'm wrong.

glenebob · Aug 3, 2014

On Sat, Aug 2, 2014 at 10:10 PM, Francisco Figueiredo Jr. <
notifications@github.com> wrote:

If I recall correctly, binary encoding is only done when on extended query
and if there is a binary converter for the type, right?

Yes, exactly.

https://github.com/npgsql/Npgsql/blob/master/Npgsql/NpgsqlTypes/NpgsqlTypeInfoNative.cs#L210

https://github.com/npgsql/Npgsql/blob/master/Npgsql/NpgsqlTypes/NpgsqlTypeInfoNative.cs#L295

https://github.com/npgsql/Npgsql/blob/master/Npgsql/NpgsqlTypes/NpgsqlTypeInfoBackend.cs#L189

And this is done only when NpgsqlCommand.Prepare() method is called. As we
can set the format each column of the result set will be returned in:

https://github.com/npgsql/Npgsql/blob/master/Npgsql/Npgsql/NpgsqlCommand.PrepareExecute.cs#L499

So, I think text encoding is needed as we have a lot of types which
doesn't have binary encodings handlers.

So, I moved QuoteASCIIString() over right next to the massive raw string
formatting/quoting function, spiffed it up a little, and marked it
internal. Then I changed the float converters to use it, so that they don't
have to know anything about how to quote the value.

I think I need more clarification on this part. Because it seems the
QuoteASCIIString change makes it a little odd to me. It seems to have
some short circuits (the single and double quoted values). The problem is
that in order to use those short circuits, you would need to pass the
source value two or three times (the original, single quoted and double
quoted). Note that you did that for the NaN values which are very short.
But what if I use this method for larger values? I think we would end up
using the version without the single and double quote values, as you did in
the other places.

Arrays are reference types. It makes no difference how large the array
value is that you use as a parameter, only the pointer is copied to the
stack. And yes, normally you would just use the null value, and let the
quoter construct the return value the expensive way, but I didn't want to
kill the efficiency you had created by having the static quoted NaN value
on hand.

I think that even with this change, the float coverters still need to know
about the single quoted and double quoted values (as they need to pass them
to the method). Also, I think the converters should be responsible to know
when to quote and how.
When I designed this idea of the coverters, my goal was to make then hold
all the information needed to serialize the value, hence the
forExtendedQuery and array flags as well as some other typing information.

The rules for when to quote, and how to quote, are much more complicated
than the earlier versions of Npgsql were able to deal with. With that
complexity, and with the complexity being uniform across many types, it's
clearly beneficial to encapsulate it as much as possible,

It's OK to understand the difference between no quotes, single quotes, and
doubles quotes, but to understand exactly when each is applicable is quite
messy. It took a lot of work and hair pulling for me to get all those
rules straight and coded properly and efficiently. Let's not allow
ourselves to get caught up in that nightmare (again) any more than
absolutely necessarily.

On a side note, we also have a little memory usage problem with the
WrapASCIIString. It creates a new byte array 2 bytes longer than the
original array. This means that for each value being translated which needs
quoting, we are doubling the memory pressure. I think we could fix that by
instead of using raw byte[] objects, maybe we could use some
NpgsqlSerializableValue type or some other value, where it would keep
this quoting information and would allow us to send this quote only when
writing to the network stream. This way, instead of creating a new byte[]
to hold the quotes and then copying the data to this new array, we would
simply write the quote, then the byte[], then the quote. What do you all
think?

I have thought about that some. There is an opportunity to tighten the
memory footprint here, for sure. But I have not come up with anything
worth implementing.

Doing the quoting as we write to the stream would be a great way to
eliminate that memory overhead, for sure, but it will require a rather
substantial refactor, with the possibility regressions that goes with it.
And it's not the only way. What if there was a way to always allocate an
additional two bytes of data, and always write the data such that the first
and last actual bytes of the buffer could be populated with a quote byte if
needed, or if those two bytes could just be ignored when quoting is not
needed? More than one way to skin the cat :) But, like I said, nothing
very brilliant has come to mind yet.

Sorry if I said a lot of nonsense things. And please, correct me if I'm
wrong.

Not at all! I think we're on the same page here. But, what we have works,
and is pretty darn efficient already, so I don't see value in rushing ahead
with something that gives us limited improvement without also giving us
better overall code organization and readability.

-Glen

—
Reply to this email directly or view it on GitHub
#306 (comment).

franciscojunior · Aug 3, 2014

Arrays are reference types. It makes no difference how large the array value is that you use as a parameter, only the pointer is copied to the stack. And yes, normally you would just use the null value, and let the quoter construct the return value the expensive way, but I didn't want to kill the efficiency you had created by having the static quoted NaN value on hand.

Ah, ok. Thanks for keeping my idea :)

Regarding the array case, I was thinking about the case where you would get a big src array and then you would need to supply this same big array with single quotes and/or with double quotes to the method call. You would need to have copies of the array: one without quotes as the src and another one or two as single and/or double quoted.
I don't know if I'm explaining it correctly.

The rules for when to quote, and how to quote, are much more complicated than the earlier versions of Npgsql were able to deal with. With that complexity, and with the complexity being uniform across many types, it's clearly beneficial to encapsulate it as much as possible, It's OK to understand the difference between no quotes, single quotes, and doubles quotes, but to understand exactly when each is applicable is quite messy. It took a lot of work and hair pulling for me to get all those rules straight and coded properly and efficiently. Let's not allow ourselves to get caught up in that nightmare (again) any more than absolutely necessarily.

Agreed! In the past it was way more simple indeed. Now, there are even some exceptions like NaN which complicate things.
As @roji said in another comment about your work with parameter substitution, I think your work with quotes handling is very impressive! :)

As a comment regarding the double quote of the NaN, I think it is not needed. From the tests I did, only the single quoted value would be needed. Maybe we could drop this parameter from the method or at least from the method call? This would simplify the code.

Sorry, Glen, for being annoying with the idea of simplification (or less complexity) of the code.

Doing the quoting as we write to the stream would be a great way to eliminate that memory overhead, for sure, but it will require a rather substantial refactor, with the possibility regressions that goes with it. And it's not the only way. What if there was a way to always allocate an additional two bytes of data, and always write the data such that the first and last actual bytes of the buffer could be populated with a quote byte if needed, or if those two bytes could just be ignored when quoting is not needed? More than one way to skin the cat :) But, like I said, nothing very brilliant has come to mind yet.

Agreed! And I thought about that too! :)
We could serialize the values with two extra placeholder bytes for quoting.

Not at all! I think we're on the same page here. But, what we have works, and is pretty darn efficient already, so I don't see value in rushing ahead with something that gives us limited improvement without also giving us better overall code organization and readability.

Great! I'm glad to know we are on the same page. And I agree with you that we don't need to rush about that quote handling.

On the other side, I think that optimizing the byte[]'s we use throughout Npgsql is something we need to have a look. #308 is open so we can discuss about it there.

…ire it.

glenebob · Aug 3, 2014

On Sun, Aug 3, 2014 at 1:09 PM, Francisco Figueiredo Jr. <
notifications@github.com> wrote:

Arrays are reference types. It makes no difference how large the array
value is that you use as a parameter, only the pointer is copied to the
stack. And yes, normally you would just use the null value, and let the
quoter construct the return value the expensive way, but I didn't want to
kill the efficiency you had created by having the static quoted NaN value
on hand.

Ah, ok. Thanks for keeping my idea :)

Regarding the array case, I was thinking about the case where you would
get a big src array and then you would need to supply this same big array
with single quotes and/or with double quotes to the method call. You would
need to have copies of the array: one without quotes as the src and another
one or two as single and/or double quoted.
I don't know if I'm explaining it correctly.

The rules for when to quote, and how to quote, are much more complicated
than the earlier versions of Npgsql were able to deal with. With that
complexity, and with the complexity being uniform across many types, it's
clearly beneficial to encapsulate it as much as possible, It's OK to
understand the difference between no quotes, single quotes, and doubles
quotes, but to understand exactly when each is applicable is quite messy.
It took a lot of work and hair pulling for me to get all those rules
straight and coded properly and efficiently. Let's not allow ourselves to
get caught up in that nightmare (again) any more than absolutely
necessarily.

Agreed! In the past it was way more simple indeed. Now, there are even
some exceptions like NaN which complicate things.
As @roji https://github.com/roji said in another comment about your
work with parameter substitution, I think your work with quotes handling is
very impressive! :)

As a comment regarding the double quote of the NaN, I think it is not
needed. From the tests I did, only the single quoted value would be needed.
Maybe we could drop this parameter from the method or at least from the
method call? This would simplify the code.

Actually, I just realized that the simple query path should double quote
the NaN values for arrays, and I found I had left in a little bug that
caused that not to work. I wondered why the test somehow passes for
non-prepared queries.

In the process of verifying the bug, I discovered that for arrays, PG
doesn't actually require NaN to be quoted. It works with or without the
double quotes. It also formats arrays of NaN without the quotes. That
seems odd to me, but if PG doesn't do the quotes, then we don't need to
either. So I'm removing the double quoted value now :)

-Glen

Sorry, Glen, for being annoying with the idea of simplification (or less
complexity) of the code.

Doing the quoting as we write to the stream would be a great way to
eliminate that memory overhead, for sure, but it will require a rather
substantial refactor, with the possibility regressions that goes with it.
And it's not the only way. What if there was a way to always allocate an
additional two bytes of data, and always write the data such that the first
and last actual bytes of the buffer could be populated with a quote byte if
needed, or if those two bytes could just be ignored when quoting is not
needed? More than one way to skin the cat :) But, like I said, nothing very
brilliant has come to mind yet.

Agreed! And I thought about that too! :)
We could serialize the values with two extra placeholder bytes for
quoting.

Not at all! I think we're on the same page here. But, what we have works,
and is pretty darn efficient already, so I don't see value in rushing ahead
with something that gives us limited improvement without also giving us
better overall code organization and readability.

Great! I'm glad to know we are on the same page. And I agree with you that
we don't need to rush about that quote handling.

On the other side, I think that optimizing the byte[]'s we use throughout
Npgsql is something we need to have a look. #308
#308 is open so we can discuss
about it there.

—
Reply to this email directly or view it on GitHub
#306 (comment).

glenebob · Aug 3, 2014

And with that last commit, we're full circle. This PR no longer has value. Closing, and will merge @franciscojunior's PR.

franciscojunior and others added 5 commits July 12, 2014 20:27

Add tests for float4 and float8 NaN inserting.

e5b1b7c

Fix #217 when handling NaN for float4 and float8 datatypes.

3f01e79

After commit 854862a Npgsql wasn't adding quotes around float4 and float8 values. But according to documentation, NaN values need to be quoted. Separated handling of float4 and float8 conversion routines.

Add support to handle NaN values inside arrays. Thanks @glenebob for …

f6a7fed

…heads up. Add tests for Float array.

Merge branch '217-fix-doubleNaN-regression' of https://github.com/fra…

1c2d65c

…nciscojunior/Npgsql into quote_FP_NaN_values

glenebob mentioned this pull request Aug 2, 2014

Fix #217 double NaN regression #257

Merged

glenebob self-assigned this Aug 2, 2014

glenebob added the bug label Aug 2, 2014

glenebob added this to the 2.2 milestone Aug 2, 2014

roji mentioned this pull request Aug 3, 2014

Optimization regarding memory allocation #308

Closed

Remove double quoting from NaN value array elements; PG does not requ…

6bef247

…ire it.

glenebob closed this Aug 3, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #217 double NaN regression #2 #306

Fix #217 double NaN regression #2 #306

Uh oh!

glenebob commented Aug 2, 2014

Uh oh!

roji commented Aug 2, 2014

Uh oh!

glenebob commented Aug 2, 2014

Uh oh!

roji commented Aug 2, 2014

Uh oh!

glenebob commented Aug 2, 2014

Uh oh!

franciscojunior commented Aug 3, 2014

Uh oh!

glenebob commented Aug 3, 2014

Uh oh!

franciscojunior commented Aug 3, 2014

Uh oh!

glenebob commented Aug 3, 2014

Uh oh!

glenebob commented Aug 3, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Search code, repositories, users, issues, pull requests...

Fix #217 double NaN regression #2 #306

Fix #217 double NaN regression #2 #306

Uh oh!

Conversation

glenebob commented Aug 2, 2014

Uh oh!

roji commented Aug 2, 2014

Uh oh!

glenebob commented Aug 2, 2014

Uh oh!

roji commented Aug 2, 2014

Uh oh!

glenebob commented Aug 2, 2014

Uh oh!

franciscojunior commented Aug 3, 2014

Uh oh!

glenebob commented Aug 3, 2014

Uh oh!

franciscojunior commented Aug 3, 2014

Uh oh!

glenebob commented Aug 3, 2014

Uh oh!

glenebob commented Aug 3, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants