Inconsistent PyArrow Schema Field Metadata on project_table: Parquet Field ID #788

@sungwy

Description


Apache Iceberg version

None

Please describe the bug 🐞

While refactoring project_table (#786), I ran into some issues with the tests because the existing behavior of the project_table function isn't consistent about whether it returns the Parquet field ID in its pyarrow schema field metadata.

There are cases where the Parquet field ID is attached to the field metadata, and cases where it isn't: https://github.com/apache/iceberg-python/blob/main/tests/io/test_pyarrow.py#L1062-L1080

I think this is because we use schema_to_pyarrow to build a fallback schema, and it attaches the Parquet field ID attribute to the field metadata: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1133

I think we should correct this behavior so that it is consistent for all table scans.

  • Do we want to attach the Parquet field ID attribute to every pyarrow schema returned by project_table?
  • Or should we remove the Parquet field ID from the field metadata of the pyarrow schema? The idea here is that schema_to_pyarrow would have two modes, with or without the Parquet field ID (write versus read use cases).

I think the schema will be cleaner for users if it does not carry metadata intended for a different use case. The Parquet field ID was added to schema_to_pyarrow so that we could persist the field ID into the Parquet files on write, but we do not want it when we are reading the table. Hence, I am leaning towards the second option.

Looking for some thoughts and direction on this issue so we can complete the refactoring to support Iterator[RecordBatch] output scans! @Fokko @HonahX

