Arrow: Infer the types when reading #1669
When reading a Parquet file using PyArrow, metadata stored in the Parquet file determines whether a column is read as a large type (e.g. `large_string`) or a normal type (`string`). The difference is that the large types use 64-bit offsets to encode their arrays. This is not always needed; we could first check the types with which the data was stored and let PyArrow decide here: https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/io/pyarrow.py#L1579

In PyIceberg today we just bump everything to a large type, which might lead to additional memory consumption because it allocates an int64 array for the offsets instead of an int32 one.

I thought we would be good to go for this now with the new lower bound of PyArrow 17. But it looks like we still have to wait for Arrow 18 to fix the issue with the `date` types: apache/arrow#43183

Fixes: apache#1049
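To make the offset cost concrete, here is a minimal plain-Python sketch of how an Arrow-style offsets buffer grows when it switches from 32-bit to 64-bit entries. It does not call PyArrow; `build_offsets` and `needs_large_offsets` are hypothetical helpers written only for this illustration.

```python
import struct

INT32_MAX = 2**31 - 1


def build_offsets(values, fmt):
    """Pack the offsets of variable-length UTF-8 values.

    fmt is "i" for 32-bit offsets (as in `string`) or "q" for 64-bit
    offsets (as in `large_string`). There are len(values) + 1 offsets.
    """
    offsets = [0]
    for value in values:
        offsets.append(offsets[-1] + len(value.encode("utf-8")))
    # "<" forces standard sizes: 4 bytes for "i", 8 bytes for "q".
    return struct.pack(f"<{len(offsets)}{fmt}", *offsets)


def needs_large_offsets(total_data_bytes):
    """64-bit offsets are only required once the data exceeds the int32 range."""
    return total_data_bytes > INT32_MAX


values = ["iceberg", "arrow", "parquet"]
small = build_offsets(values, "i")  # 4 offsets * 4 bytes = 16 bytes
large = build_offsets(values, "q")  # 4 offsets * 8 bytes = 32 bytes
print(len(small), len(large))
```

For small columns the offsets buffer simply doubles in size without any benefit, which is the extra memory the description refers to.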
target_schema,
batches,
)
).cast(target_schema)
This will still return large types if you stream the batches because we don't want to fetch all the schemas upfront.
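A toy model of why streaming cannot narrow the type: a full table read sees every file before it picks a result type, while a stream has to commit to a type before later files are seen. Everything here is a hypothetical stand-in (type names are plain strings, `unify` is an assumed rule), not PyIceberg or PyArrow code.

```python
from typing import Iterable, Iterator, List, Tuple

# Each "file" is (string type recorded in the file, rows).
File = Tuple[str, List[str]]


def unify(types: Iterable[str]) -> str:
    # Assumed rule: a single large file forces the large type for the result.
    return "large_string" if "large_string" in set(types) else "string"


def read_table(files: List[File]) -> Tuple[str, List[str]]:
    # All files are available, so the narrowest common type can be chosen.
    result_type = unify(file_type for file_type, _ in files)
    rows = [row for _, data in files for row in data]
    return result_type, rows


def stream_batches(files: Iterable[File]) -> Iterator[Tuple[str, List[str]]]:
    # Later files are unseen when the first batch is yielded,
    # so every batch is emitted with the wide type.
    for _, data in files:
        yield "large_string", data


files = [("string", ["a"]), ("string", ["b", "c"])]
table_type, table_rows = read_table(files)
streamed = list(stream_batches(files))
```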
if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, False):
    result = result.cast(arrow_schema)
I left this in, but I would be leaning toward deprecating this, since I don't think we want to trouble the user. I think it should be an implementation detail based on how large the buffers are.
@sungwy Thoughts? :D
Hi @Fokko - thank you for pinging me for review! The change looks good to me, but I have a reservation about introducing this change without a deprecation warning.
Firstly, without the PyIceberg code base having a properly defined list of public classes, we assume all our classes to be public facing unless they start with an underscore. I'd argue that removing an input parameter to the ArrowProjectionVisitor __init__ method is an API change.
Secondly, changing the default value of PYARROW_USE_LARGE_TYPES_ON_READ to False for the to_table method also seems like a breaking change for users reading Iceberg tables through PyIceberg: their large_string columns will change to string columns on upgrade without a warning.
Would it make sense to introduce this change in two stages:
- First, by introducing a new config variable like `PYICEBERG_INFER_LARGE_TYPES_ON_READ`, setting it to `False` by default, and raising a deprecation warning when the flag is set to `False`?
- Then remove `PYICEBERG_INFER_LARGE_TYPES_ON_READ` and `PYARROW_USE_LARGE_TYPES_ON_READ` in the next major version?
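The first stage of that proposal could look roughly like the sketch below. It is only an illustration: `infer_large_types` is a hypothetical helper, the property key is a placeholder named after the proposed constant, and the warning text is made up.

```python
import warnings

# Placeholder key named after the proposed constant; the real property
# string is not given in this thread.
INFER_KEY = "PYICEBERG_INFER_LARGE_TYPES_ON_READ"


def infer_large_types(properties: dict) -> bool:
    """Return True when the reader should infer types from the files.

    Stage one: the flag defaults to False (old behaviour, always large
    types) and a DeprecationWarning is raised whenever it is False.
    """
    value = str(properties.get(INFER_KEY, "false")).strip().lower()
    infer = value in ("true", "1")
    if not infer:
        warnings.warn(
            f"{INFER_KEY}=False is deprecated; types will be inferred "
            "from the files in a future major version.",
            DeprecationWarning,
            stacklevel=2,
        )
    return infer
```

In the second stage both flags would be removed and inference would become the only behaviour.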
sungwy
left a comment
LGTM!
kevinjqliu
left a comment
LGTM!
> But, it looks like we still have to wait for Arrow 18 to fix the issue with the `date` types:

Should we first bump the minimum version to Arrow 18?
pyiceberg/io/pyarrow.py
Outdated
# https://github.com/apache/arrow/issues/41884
# https://github.com/apache/arrow/issues/43183
# Would be good to remove this later on
should we remove this comment? seems related to the schema type inference
1b9b884
tests/integration/test_reads.py
Outdated
- pa.field("binary", pa.large_binary()),
- pa.field("list", pa.large_list(pa.large_string())),
+ pa.field("binary", pa.binary()),
+ pa.field("list", pa.list_(pa.string())),
nit: to show complex type also stays the same
- pa.field("list", pa.list_(pa.string())),
+ pa.field("list", pa.list_(pa.large_string())),
Good one!
While @kevinjqliu did an amazing job summarizing the new stuff in 0.9.0 in the GitHub release (https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.9.0), I think it would be good to formalize this a bit. This also came up in #1669 where we introduced a behavioral change. cc @sungwy I think it would be good to allow users to populate the changelog section to ensure they know about any relevant changes. The template is pretty minimal now to avoid being a big barrier to opening a PR.
If you don't use date types, then everything works fine :) I'm a bit hesitant to bump it very aggressively, see #1822.
  [
      pa.field("string", pa.string()),
-     pa.field("string-to-binary", pa.binary()),
+     pa.field("string-to-binary", pa.large_binary()),
@Fokko is this right? type promotion for string->binary results in a large_binary type
iceberg-python/pyiceberg/io/pyarrow.py
Lines 687 to 688 in 7a56ddb
def visit_binary(self, _: BinaryType) -> pa.DataType:
    return pa.large_binary()
i found these 3 places where _ConvertToArrowSchema converts to large type by default
Are these changes tested?
Yes, existing tests :)
Are there any user-facing changes?
Before, PyIceberg would always return the large Arrow types (e.g. `large_string` instead of `string`). After this change, it will return the type it was written with.