feat: Added DataFrameWriteOptions option when writing as csv, json, p…#857
feat: Added DataFrameWriteOptions option when writing as csv, json, p…#857allinux wants to merge 0 commit intoapache:mainapache/datafusion-python:mainfrom
Conversation
timsaucer
left a comment
There was a problem hiding this comment.
This is a very nice addition! Thank you!
python/datafusion/dataframe.py
Outdated
| write_options_overwrite: bool = False, | ||
| write_options_single_file_output: bool = False, | ||
| write_options_partition_by: List = [], |
There was a problem hiding this comment.
I think it's okay to remove the write_options_ prefixes here.
with_header: bool = False,
overwrite: bool = False,
single_file_output: bool = False,
Also for the partition by, I took a very quick look at the code and it looks like partition_by takes a list of strings, which I think our users would be surprised because all other uses of partition_by takes a list of expressions. So I think we want to add to the documentation a tiny bit about how to use that.
My understanding is that it's bad form in python to pass in a [] as a default, but I'm no expert. I bet we could change the type hint to partition_by: Optional[List[str]] = None and make the appropriate change on the call in the lines below.
There was a problem hiding this comment.
Edit has been completed. Thank you for the review.
python/datafusion/dataframe.py
Outdated
| write_options_overwrite: bool = False, | ||
| write_options_single_file_output: bool = False, | ||
| write_options_partition_by: List = [], |
There was a problem hiding this comment.
Same recommendation on parameter names and partition_by as above
There was a problem hiding this comment.
Edit has been completed. Thank you for the review.
python/datafusion/dataframe.py
Outdated
| with_header: If true, output the CSV header row. | ||
| write_options_overwrite: Controls if existing data should be overwritten | ||
| write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true. | ||
| write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes. |
There was a problem hiding this comment.
empty vec![] is mixing rust and python terminology
There was a problem hiding this comment.
comment는 rust 의 comment 를 복사한 것 입니다. vec![] 이 포함한 라인은 제거했습니다.
There was a problem hiding this comment.
The comment is a copy of Rust's comment. Lines containing vec![] have been removed.
python/datafusion/dataframe.py
Outdated
| compression_level: Compression level to use. | ||
| write_options_overwrite: Controls if existing data should be overwritten | ||
| write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true. | ||
| write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes. |
There was a problem hiding this comment.
The comment is a copy of Rust's comment. Lines containing vec![] have been removed.
python/datafusion/dataframe.py
Outdated
| write_options_overwrite: bool = False, | ||
| write_options_single_file_output: bool = False, | ||
| write_options_partition_by: List = [], | ||
| ) -> None: |
There was a problem hiding this comment.
Same comment as above on naming and partition_by
src/dataframe.rs
Outdated
| #[pyo3(signature = ( | ||
| path, | ||
| with_header=false, | ||
|
|
||
| write_options_overwrite=false, | ||
| write_options_single_file_output=false, | ||
| write_options_partition_by=vec![], | ||
| ))] |
There was a problem hiding this comment.
Since we're setting all the type hints in the wrappers, you don't have to include this here. It's up to you but can lead to duplicate effort and long term maintainability.
There was a problem hiding this comment.
As a note to myself, we need to include in our developer's documentation our best practice (and also decide as a group if we want these signatures in the rust code at all)
src/dataframe.rs
Outdated
| path: &str, | ||
| compression: &str, | ||
| compression_level: Option<u32>, | ||
|
|
There was a problem hiding this comment.
Edit has been completed. Thank you for the review.
src/dataframe.rs
Outdated
| fn write_json( | ||
| &self, | ||
| path: &str, | ||
|
|
There was a problem hiding this comment.
Edit has been completed. Thank you for the review.
python/datafusion/dataframe.py
Outdated
| write_options_overwrite: Controls if existing data should be overwritten | ||
| write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true. | ||
| write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. | ||
| """ | ||
| self.df.write_csv(str(path), with_header) | ||
| self.df.write_csv(str(path), with_header, write_options_overwrite, write_options_single_file_output, write_options_partition_by) |
There was a problem hiding this comment.
Since we've updated the argument names we need to update the documentation and the function call. We should add in unit tests so we can catch these errors in CI.
1278955 to
ff45187
Compare
|
I was hoping we could add this in to DF42. Would you be willing to add unit tests? |
…arquet.
Which issue does this PR close?
N/A
Rationale for this change
Added DataFrameWriteOptions when using write_csv, write_json, write_parquet functions.
Are there any user-facing changes?
No