Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

@yigal-rozenberg
Copy link
Contributor

Polars (https://pola.rs) is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.

this chnage is a simple 'to_polars' addiotn to the table api.

iceberg_table = catalog.load_table('data.data_points')
pdf = iceberg_table.scan().to_polars()
print(pdf)

@amitgilad3
Copy link
Contributor

Really nice - i really like working with polars

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks @yigal-rozenberg for working on this. I'm open to adding this. Could you add a section on Polars to the docs as well?

See https://py.iceberg.apache.org/api/#query-the-data for examples. You can find the file here: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md

pyproject.toml Outdated Show resolved Hide resolved
@yigal-rozenberg
Copy link
Contributor Author

I will provide with the relevant documentation.

@corleyma
Copy link

corleyma commented Feb 5, 2025

At first glance, this approach seems fairly inferior to using the Polars scan_iceberg functionality, which:

  • returns a LazyFrame
  • on evaluation of the LazyFrame, will push down filters to pyiceberg to limit the data returned.

I think it might be better to document the existing polars functionality vs adding and documenting this pattern.

@yigal-rozenberg
Copy link
Contributor Author

Thanks for the comment!
I can add this to the documentation as an alternative.
The motivations was to extend existing PyIceberg table API to align with 'to_pandas' as an example.
Polars 'scan_iceberg' uses PyIceberg to create the LazyFrame:
https://github.com/pola-rs/polars/blob/9359ed576d972dce257346fcd62c8857f3d23277/py-polars/polars/io/iceberg.py#L139
The filtering can be done in PyIceberg, so aren't the2 approaches similar?

pyiceberg/table/__init__.py Show resolved Hide resolved
…he Table class with a to_polars method whihc returns a polars LazyFrame
@yigal-rozenberg yigal-rozenberg changed the title Added support for Polars DataFrame Added support for Polars DataFrame and LazyFarame Feb 6, 2025
@corleyma
Copy link

corleyma commented Feb 6, 2025

Polars 'scan_iceberg' uses PyIceberg to create the LazyFrame:
https://github.com/pola-rs/polars/blob/9359ed576d972dce257346fcd62c8857f3d23277/py-polars/polars/io/iceberg.py#L139
The filtering can be done in PyIceberg, so aren't the2 approaches similar?

The difference is the approach as documented is encouraging folks to write their own filter predicates for pyiceberg before materializing a dataframe with polars, whereas the "polars way" (as a lazy dataframe API) would be to just create the lazyframe, construct your compute graph with whatever polars predicates/etc make sense for you, and rely on polars to push that down at .collect() time to appropriately filter data before load where possible.

@corleyma
Copy link

corleyma commented Feb 6, 2025

Separately, rather than adding more library-specific conversion code, it might make sense for pyiceberg to start leveraging the PyCapsule protocol to allow any third party library (dataframe or otherwise) that supports Arrow data to seamlessly consume pyiceberg constructs.

Polars already supports the PyCapsule interface. See https://docs.pola.rs/user-guide/misc/arrow/#using-the-arrow-pycapsule-interface for details.

Implementing the interface on e.g. pyiceberg tables would allow them to be passed directly to dataframe init in polars, just like you can do a pyarrow table today. It also doesn't assume anything about polars support/doesn't add a dependency on polars.

@yigal-rozenberg
Copy link
Contributor Author

love the idea!
Will research in how to implement.
In the mean time, I believe this specific change request is straight forward, and allow both DataFrame, and LazyFrames to be utilized by Polars.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! I've added a few comments

pyproject.toml Outdated Show resolved Hide resolved
pyiceberg/table/__init__.py Show resolved Hide resolved
pyiceberg/table/__init__.py Show resolved Hide resolved
mkdocs/docs/api.md Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
mkdocs/docs/api.md Show resolved Hide resolved
mkdocs/docs/api.md Show resolved Hide resolved
pyiceberg/table/__init__.py Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
mkdocs/docs/api.md Outdated Show resolved Hide resolved
@kevinjqliu
Copy link
Contributor

can you rebase off main? looks like theres a conflict

@kevinjqliu
Copy link
Contributor

theres still conflict with main. could you also remove .vscode/settings.json?

yigalrozenberg and others added 3 commits February 11, 2025 11:48
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
@Fokko Fokko changed the title Added support for Polars DataFrame and LazyFarame Added support for Polars DataFrame and LazyFrame Feb 11, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I fixed the merge conflict and some linter issues.

I'll let Fokko chime in before proceeding

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share some of @corleyma his thoughts, but I like this PR because:

  • Sometimes when you do complex predicates through Polars, they are not pushed down the PyIceberg because Polars creates a string of the expression, and there are cases where we cannot fully parse the expression, and then we fall back to true.
  • Having Polars in here makes it also part of the CI, and therefore we check that we don't break any integration (this saved us a couple of times with Daft before).

@kevinjqliu
Copy link
Contributor

@yigal-rozenberg could you rebase main to resolve the conflict?

@kevinjqliu
Copy link
Contributor

I created #1655 to investigate possible integrations using PyCapsule protocol

@kevinjqliu
Copy link
Contributor

LGTM! Thanks for the contribution @yigal-rozenberg
I merged main to fix the conflicts

@kevinjqliu kevinjqliu merged commit 38d57ea into apache:main Feb 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.