Arrow: Infer the types when reading #1669

Fokko · Feb 16, 2025

Rationale for this change

Time to give this another go 😆

When reading a Parquet file using PyArrow, metadata stored in the Parquet file determines whether a column is read as a large type (e.g. large_string) or a normal type (string). The difference is that the large types use a 64-bit offset to encode their arrays. This is not always needed; we could first check the types with which the data was stored, and let PyArrow decide here:

iceberg-python/pyiceberg/io/pyarrow.py

Line 1579 in 300b840

result = pa.concat_tables(tables, promote_options="permissive")

In PyIceberg today we just bump everything to a large type, which might lead to additional memory consumption because it allocates an int64 array to store the offsets instead of an int32 array.
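
To make the memory argument concrete, here is a rough back-of-the-envelope sketch in plain Python. It does not call PyArrow; it only assumes the Arrow layout for variable-length arrays, where an array of n values carries n + 1 offsets of 4 bytes (string, binary) or 8 bytes (large_string, large_binary). The helper names are made up for this illustration.

```python
INT32_MAX = 2**31 - 1  # largest byte position an int32 offset can address


def offsets_buffer_bytes(num_values: int, large: bool) -> int:
    """Size in bytes of the offsets buffer of a variable-length array.

    Assumes num_values + 1 offsets, each 8 bytes wide for the large
    variants and 4 bytes wide otherwise.
    """
    width = 8 if large else 4
    return (num_values + 1) * width


def needs_large_offsets(total_data_bytes: int) -> bool:
    """True when the value data of one array no longer fits int32 offsets."""
    return total_data_bytes > INT32_MAX


rows = 1_000_000
small = offsets_buffer_bytes(rows, large=False)
large = offsets_buffer_bytes(rows, large=True)
print(f"int32 offsets: {small} bytes, int64 offsets: {large} bytes")
print(f"extra bytes per column when always using large types: {large - small}")
```

The overhead is per variable-length column and grows linearly with the row count, while the large variant is only required once a single array holds more than about 2 GiB of value data.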

I thought we would be good to go for this now that the lower bound for PyArrow is 17. But it looks like we still have to wait for Arrow 18 to fix the issue with the date types:

apache/arrow#43183

Fixes: #1049

Are these changes tested?

Yes, existing tests :)

Are there any user-facing changes?

Before, PyIceberg would always return the large Arrow types (eg, large_string instead of string). After this change, it will return the type it was written with.
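
The behavior change can be pictured with a toy model. This is only a sketch of the idea, not the PyIceberg implementation: type names are plain strings, `resolve_read_type` and `use_large_types_on_read` are hypothetical names, and it assumes that permissive promotion widens a column to the large variant only when at least one file was written with it.

```python
from typing import Sequence

# Toy mapping from the normal variable-length types to their large variants.
SMALL_TO_LARGE = {"string": "large_string", "binary": "large_binary"}
LARGE_TYPES = set(SMALL_TO_LARGE.values())


def widen(type_name: str) -> str:
    """Return the large variant of a type, or the type itself if it has none."""
    return SMALL_TO_LARGE.get(type_name, type_name)


def resolve_read_type(written_types: Sequence[str], use_large_types_on_read: bool = False) -> str:
    """Pick the type a column is returned with, given the type in each file."""
    widened = {widen(t) for t in written_types}
    if len(widened) != 1:
        raise ValueError("toy model only unifies small/large variants of one type")
    if use_large_types_on_read or any(t in LARGE_TYPES for t in written_types):
        return widened.pop()
    return written_types[0]


print(resolve_read_type(["string", "string"]))        # stays as written
print(resolve_read_type(["string", "large_string"]))  # one file forces widening
print(resolve_read_type(["string"], use_large_types_on_read=True))  # old behavior
```

In this model the old behavior corresponds to always passing `use_large_types_on_read=True`, and the new default leaves the decision to what the files contain.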

Fokko · Feb 18, 2025

pyiceberg/table/__init__.py

            target_schema,
            batches,
-        )
+        ).cast(target_schema)


This will still return large types if you stream the batches because we don't want to fetch all the schemas upfront.

Fokko · Feb 18, 2025

pyiceberg/io/pyarrow.py

+        if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, False):
+            result = result.cast(arrow_schema)


I left this in, but I would be leaning toward deprecating this, since I don't think we want to trouble the user. I think it should be an implementation detail based on how large the buffers are.

Fokko · Feb 18, 2025

@sungwy Thoughts? :D

sungwy

Hi @Fokko - thank you for pinging me for review! The change looks good to me, but I have a reservation about introducing this change without a deprecation warning.

Firstly, without the PyIceberg code base having a properly defined list of public classes, we assume all our classes to be public facing unless they start with an underscore. I'd argue that removing an input parameter to the ArrowProjectionVisitor __init__ method is an API change.

Secondly, changing the default value of PYARROW_USE_LARGE_TYPES_ON_READ to True for the to_table method also seems like a breaking change for users reading Iceberg tables through PyIceberg. Their large_string columns will change to string columns on upgrade without a warning.

Would it make sense to introduce this change in two stages:

First, introduce a new config variable such as PYICEBERG_INFER_LARGE_TYPES_ON_READ, set it to False by default, and raise a deprecation warning when the flag is set to False?
Then remove PYICEBERG_INFER_LARGE_TYPES_ON_READ and PYARROW_USE_LARGE_TYPES_ON_READ in the next major version?
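
A minimal sketch of the first stage, using only the standard library. The property name is the one proposed above; the helper `should_infer_types`, the accepted truthy strings, and the warning text are assumptions for illustration and do not reflect what was merged. The sketch warns whenever the effective value is False, including when the property is unset.

```python
import warnings
from typing import Mapping

# Name proposed in the review comment above; treated as hypothetical here.
INFER_FLAG = "PYICEBERG_INFER_LARGE_TYPES_ON_READ"


def should_infer_types(properties: Mapping[str, str]) -> bool:
    """Read the opt-in flag and warn while the old behavior is still in effect."""
    raw = properties.get(INFER_FLAG, "false")
    infer = raw.strip().lower() in ("true", "1", "yes")
    if not infer:
        warnings.warn(
            f"{INFER_FLAG} is False: large Arrow types are still returned on read. "
            "This default is deprecated and is expected to change in a future release.",
            DeprecationWarning,
            stacklevel=2,
        )
    return infer


# Opting in is silent; leaving the flag unset keeps the old behavior and warns.
print(should_infer_types({INFER_FLAG: "true"}))
print(should_infer_types({}))
```

The second stage would then delete the flag and the warning together, so callers that already opted in see no further change.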

sungwy

LGTM!

kevinjqliu

LGTM!

But, it looks like we still have to wait for Arrow 18 to fix the issue with the date types:

should we first bump min version to Arrow 18?

kevinjqliu · Mar 12, 2025

pyiceberg/io/pyarrow.py

            # https://github.com/apache/arrow/issues/41884
            # https://github.com/apache/arrow/issues/43183
            # Would be good to remove this later on


should we remove this comment? seems related to the schema type inference
1b9b884

kevinjqliu · Mar 12, 2025

tests/integration/test_reads.py

-            pa.field("binary", pa.large_binary()),
-            pa.field("list", pa.large_list(pa.large_string())),
+            pa.field("binary", pa.binary()),
+            pa.field("list", pa.list_(pa.string())),


nit: to show complex type also stays the same

Suggested change

-            pa.field("list", pa.list_(pa.string())),
+            pa.field("list", pa.list_(pa.large_string())),

While @kevinjqliu did an amazing job summarizing the new stuff in 0.9.0 in the GitHub release (https://github.com/apache/iceberg-python/releases/tag/pyiceberg-0.9.0), I think it would be good to formalize this a bit. This also came up in #1669 where we introduced a behavioral change. cc @sungwy I think it would be good to allow users to populate the changelog section to ensure they know about any relevant changes. The template is pretty minimal now to avoid being a big barrier to opening a PR.

Fokko · Mar 26, 2025

should we first bump min version to Arrow 18?

If you don't use date types, then everything works fine :) I'm a bit hesitant to bump it very aggressively, see #1822.

kevinjqliu · Mar 27, 2025

tests/integration/test_reads.py

        [
            pa.field("string", pa.string()),
-            pa.field("string-to-binary", pa.binary()),
+            pa.field("string-to-binary", pa.large_binary()),


@Fokko is this right? type promotion for string->binary results in a large_binary type

iceberg-python/pyiceberg/io/pyarrow.py

Lines 687 to 688 in 7a56ddb

    def visit_binary(self, _: BinaryType) -> pa.DataType:
        return pa.large_binary()

I found these 3 places where _ConvertToArrowSchema converts to a large type by default:

list

string

binary, as seen above

Fokko added 2 commits February 16, 2025 23:43

Less is more 😍

0384b4e

Fokko commented Feb 18, 2025

Fokko modified the milestones: PyIceberg 1.0.0, PyIceberg 0.10.0 Feb 18, 2025

Reinstate the table property

6dd9308

Fokko commented Feb 18, 2025

Cleanup

2817c61

sungwy reviewed Feb 18, 2025

Fokko mentioned this pull request Feb 26, 2025

fix(table/scanner): Fix nested field scan apache/iceberg-go#311

Merged

Fokko added 4 commits March 4, 2025 09:53

Merge branch 'main' of github.com:apache/iceberg-python into fd-infer…

d6fbca9

…-types

Fix import

fff7414

Add warning

0d19987

MOAR deprecation

7382112

sungwy approved these changes Mar 8, 2025

Fokko mentioned this pull request Mar 8, 2025

Add pull-request template #1777

Merged

kevinjqliu approved these changes Mar 12, 2025

Fokko added the changelog Indicates that the PR introduces changes that require an entry in the changelog. label Mar 12, 2025

Merge branch 'main' of github.com:apache/iceberg-python into fd-infer…

6526cc2

…-types

Fokko added 2 commits March 26, 2025 22:50

Thanks Kevin!

d9d4fda

Fix

dd1c5d4

Fokko merged commit 7a56ddb into apache:main Mar 26, 2025
7 checks passed

kevinjqliu reviewed Mar 27, 2025

kevinjqliu mentioned this pull request Mar 27, 2025

Pyarrow data type, default to small type and fix large type override #1859

Open

koenvo added a commit to koenvo/iceberg-python that referenced this pull request Jun 14, 2025

Make sure the 'infer the types when reading (apache#1669)' works again

2cd2137

enkidulan mentioned this pull request Jul 25, 2025

to_arrow_batch_reader returns a different schema than to_arrow #2250

Open

3 tasks
