feat: Added DataFrameWriteOptions option when writing as csv, json, p… by allinux · Pull Request #857 · apache/datafusion-python

allinux · Sep 6, 2024

…arquet.

Which issue does this PR close?

N/A

Rationale for this change

Added DataFrameWriteOptions when using write_csv, write_json, write_parquet functions.

Are there any user-facing changes?

No

timsaucer

This is a very nice addition! Thank you!

timsaucer · Sep 6, 2024

python/datafusion/dataframe.py

+        write_options_overwrite: bool = False,
+        write_options_single_file_output: bool = False,
+        write_options_partition_by: List = [],


I think it's okay to remove the write_options_ prefixes here.

with_header: bool = False, overwrite: bool = False, single_file_output: bool = False,

Also, for partition_by, I took a very quick look at the code and it looks like it takes a list of strings, which I think would surprise our users because all other uses of partition_by take a list of expressions. So I think we want to add a short note to the documentation about how to use it.

My understanding is that it's bad form in Python to pass in a [] as a default, but I'm no expert. I bet we could change the type hint to partition_by: Optional[List[str]] = None and make the appropriate change on the call in the lines below.
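The concern about `[]` as a default can be shown with a small sketch. The function names below are hypothetical and do not touch datafusion; they only illustrate why a `None` default with a fallback inside the body is the usual pattern.

```python
from typing import List, Optional


def collect_shared(column: str, partition_by: List[str] = []) -> List[str]:
    # The default list is created once, when the function is defined, so every
    # call that omits partition_by mutates the same object.
    partition_by.append(column)
    return partition_by


def collect_fresh(column: str, partition_by: Optional[List[str]] = None) -> List[str]:
    # None is immutable; a new list is built on each call that omits the argument.
    if partition_by is None:
        partition_by = []
    partition_by.append(column)
    return partition_by


# Copies are taken so each snapshot reflects the state right after its call.
first_shared = list(collect_shared("year"))
second_shared = list(collect_shared("month"))  # still carries "year"

first_fresh = collect_fresh("year")
second_fresh = collect_fresh("month")
```

In the wrappers under review the list appears to be forwarded rather than mutated, so the shared default might never cause a visible bug, but the `None` default avoids depending on that.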

Edit has been completed. Thank you for the review.

timsaucer · Sep 6, 2024

python/datafusion/dataframe.py

+        write_options_overwrite: bool = False,
+        write_options_single_file_output: bool = False,
+        write_options_partition_by: List = [],


Same recommendation on parameter names and partition_by as above

Edit has been completed. Thank you for the review.

timsaucer · Sep 6, 2024

python/datafusion/dataframe.py

            with_header: If true, output the CSV header row.
+            write_options_overwrite: Controls if existing data should be overwritten
+            write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
+            write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.


empty vec![] is mixing rust and python terminology

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.

timsaucer · Sep 6, 2024

python/datafusion/dataframe.py

            compression_level: Compression level to use.
+            write_options_overwrite: Controls if existing data should be overwritten
+            write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
+            write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.


vec![] is rust not python

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.
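For readers unfamiliar with the term in the docstring, hive-style partitioning encodes partition column values in directory names of the form `column=value`. The sketch below is plain Python with a hypothetical helper, `hive_partition_dirs`; it is not how DataFusion implements partitioned writes, and it assumes the common convention that partition columns are dropped from the rows stored under each directory.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def hive_partition_dirs(
    rows: List[Dict[str, object]],
    partition_by: Optional[List[str]] = None,
) -> Dict[str, List[Dict[str, object]]]:
    """Group rows by the directory a hive-style layout would place them in."""
    columns = partition_by or []
    groups: Dict[str, List[Dict[str, object]]] = defaultdict(list)
    for row in rows:
        # Each partition column contributes one "name=value" path segment.
        directory = "/".join(f"{name}={row[name]}" for name in columns)
        # The values live in the path, so they are dropped from the stored row
        # (an assumption of this sketch).
        remaining = {k: v for k, v in row.items() if k not in columns}
        groups[directory].append(remaining)
    return dict(groups)


rows = [
    {"year": 2023, "city": "Seoul", "n": 1},
    {"year": 2024, "city": "Seoul", "n": 2},
    {"year": 2024, "city": "Busan", "n": 3},
]
by_year = hive_partition_dirs(rows, ["year"])
unpartitioned = hive_partition_dirs(rows)
```

Passing no partition columns puts every row under the empty directory key, which corresponds to a non-partitioned write.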

timsaucer · Sep 6, 2024

python/datafusion/dataframe.py

+        write_options_overwrite: bool = False,
+        write_options_single_file_output: bool = False,
+        write_options_partition_by: List = [],
+    ) -> None:


Same comment as above on naming and partition_by

timsaucer · Sep 6, 2024

src/dataframe.rs

+    #[pyo3(signature = (
+        path,
+        with_header=false,
+
+        write_options_overwrite=false,
+        write_options_single_file_output=false,
+        write_options_partition_by=vec![],
+    ))]


Since we're setting all the type hints in the wrappers, you don't have to include this here. It's up to you, but it can lead to duplicated effort and long-term maintainability problems.

As a note to myself, we need to include our best practice in the developer documentation (and also decide as a group whether we want these signatures in the Rust code at all).

timsaucer · Sep 6, 2024

src/dataframe.rs

        path: &str,
        compression: &str,
        compression_level: Option<u32>,
+


formatting: extra blank line

Edit has been completed. Thank you for the review.

timsaucer · Sep 6, 2024

src/dataframe.rs

+    fn write_json(
+        &self, 
+        path: &str, 
+


formatting: extra blank line

Edit has been completed. Thank you for the review.

timsaucer · Sep 7, 2024

python/datafusion/dataframe.py

+            write_options_overwrite: Controls if existing data should be overwritten
+            write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
+            write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name.
        """
-        self.df.write_csv(str(path), with_header)
+        self.df.write_csv(str(path), with_header, write_options_overwrite, write_options_single_file_output, write_options_partition_by)


Since we've updated the argument names we need to update the documentation and the function call. We should add in unit tests so we can catch these errors in CI.
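A rough sketch of the kind of unit test that would catch this mismatch follows. `RecordingInner` and `WrapperSketch` are hypothetical stand-ins written for this illustration; they are not datafusion classes, and a real test would presumably exercise the actual DataFrame wrapper and inspect the written files.

```python
from typing import List, Optional, Tuple


class RecordingInner:
    """Hypothetical stand-in for the Rust-backed object; records each call."""

    def __init__(self) -> None:
        self.calls: List[Tuple] = []

    def write_csv(self, path, with_header, overwrite, single_file_output, partition_by):
        self.calls.append((path, with_header, overwrite, single_file_output, partition_by))


class WrapperSketch:
    """Hypothetical wrapper using the renamed parameters from the review."""

    def __init__(self, inner) -> None:
        self.df = inner

    def write_csv(
        self,
        path,
        with_header: bool = False,
        overwrite: bool = False,
        single_file_output: bool = False,
        partition_by: Optional[List[str]] = None,
    ) -> None:
        # A stale name here (for example write_options_overwrite) would raise
        # NameError the first time a test calls this method.
        self.df.write_csv(
            str(path),
            with_header,
            overwrite,
            single_file_output,
            partition_by or [],
        )


def test_write_csv_forwards_options() -> None:
    inner = RecordingInner()
    WrapperSketch(inner).write_csv("out", with_header=True, partition_by=["year"])
    assert inner.calls == [("out", True, False, False, ["year"])]


test_write_csv_forwards_options()
```

Because Python only resolves names inside a function body when the body runs, a signature rename that leaves the call line untouched goes unnoticed until something actually invokes the method, which is why even a minimal test like this one is useful in CI.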

timsaucer · Sep 21, 2024

I was hoping we could add this in to DF42. Would you be willing to add unit tests?

timsaucer requested changes Sep 6, 2024

allinux force-pushed the main branch from 7b47717 to 610b11d Compare September 7, 2024 10:28

timsaucer reviewed Sep 7, 2024

allinux force-pushed the main branch 2 times, most recently from 1278955 to ff45187 Compare September 7, 2024 15:11

allinux closed this Oct 10, 2024

allinux force-pushed the main branch from ff45187 to cdec202 Compare October 10, 2024 03:00
