feat(rust/sedona-geoparquet): Add GeoParquet 2.0/Parquet-native geometry and geography support to Parquet writer (#805)
paleolimbot merged 11 commits into apache/sedona-db:main from paleolimbot/sedona-db:write-geography-in-parquet
Conversation
Co-authored-by: Copilot <copilot@github.com>
```rust
// Due to a bug in the parquet type conversion, we need to serialize invalid metadata for geography
// fields. The conversion logic expects "algorithm" but the valid GeoArrow metadata we serialize
// by default is "edges".
// https://github.com/apache/arrow-rs/blob/f725bc9b955f23772a6a6d8a38c99a8b3f359116/parquet-geospatial/src/types.rs#L64-L66
fn serialize_edges_and_crs_with_parquet_bug(
    original_field: &FieldRef,
    crs: &Crs,
    edges: Edges,
) -> FieldRef {
    let crs_component = crs
        .as_ref()
        .map(|crs| format!(r#""crs":{}"#, crs.to_json()));

    let edges_component = match edges {
        Edges::Planar => None,
        // This is where we apply the workaround relative to our usual
        // serialize_edges_and_crs().
        Edges::Spherical => Some(r#""algorithm":"spherical""#),
    };
```
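To make the workaround concrete, here is a small stdlib-only Python sketch (illustrative, not the SedonaDB code) of the two metadata shapes: valid GeoArrow field metadata carries an `"edges"` key, while the workaround writes `"algorithm"` so that arrow-rs's parquet type conversion, which looks for `"algorithm"`, round-trips spherical edges:

```python
import json

def geoarrow_field_metadata(crs=None, spherical=False, parquet_bug_workaround=False):
    # Build the JSON metadata string for a geometry/geography field.
    # Valid GeoArrow uses the "edges" key; the workaround emits "algorithm".
    meta = {}
    if crs is not None:
        meta["crs"] = crs  # a PROJJSON-like dict in this sketch
    if spherical:
        key = "algorithm" if parquet_bug_workaround else "edges"
        meta[key] = "spherical"
    return json.dumps(meta)

print(geoarrow_field_metadata(spherical=True))
# → {"edges": "spherical"}
print(geoarrow_field_metadata(spherical=True, parquet_bug_workaround=True))
# → {"algorithm": "spherical"}
```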
```diff
-let mut output_geometry_column_indices = conf.output_schema().geometry_column_indices()?;
-if output_geometry_column_indices.is_empty() {
+let input_geometry_column_indices = conf.output_schema().geometry_column_indices()?;
+if input_geometry_column_indices.is_empty() {
```
There is a lot of churn in this function, which prepares a plan before writing. Basically, we now always do a "projection" (one that just modifies metadata), because we have to do some combination of the following: strip the geoarrow.wkb extension type for GeoParquet 1.0 and 1.1 (Parquet with the geospatial feature enabled would otherwise write that column as Geometry, and we want a plain byte array with no logical type); canonicalize the CRS so that PROJJSON is definitely what gets written in the logical type (GeoParquet 2.0/no GeoParquet); and/or create some invalid GeoArrow metadata to work around a parquet Rust bug (see above). Finally, sometimes we don't write the key/value metadata at all (no GeoParquet).
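The metadata-only "projection" can be sketched in stdlib-only Python. Here a field is modeled as a plain dict; the `ARROW:extension:*` keys are the standard Arrow extension-metadata keys, but the function name and version handling are illustrative, not the writer's actual code:

```python
def normalize_field_for_write(field, geoparquet_version):
    # Copy the field and rewrite only its metadata; data buffers are untouched.
    out = {**field, "metadata": dict(field.get("metadata", {}))}
    if geoparquet_version in ("1.0", "1.1"):
        # Strip the geoarrow.wkb extension type so the column is written as a
        # plain byte array with no Parquet Geometry/Geography logical type.
        out["metadata"].pop("ARROW:extension:name", None)
        out["metadata"].pop("ARROW:extension:metadata", None)
    # For "2.0"/"none" the real writer instead canonicalizes the CRS to
    # PROJJSON (and may apply the "algorithm" workaround noted earlier).
    return out

wkb_field = {
    "name": "geometry",
    "type": "binary",
    "metadata": {"ARROW:extension:name": "geoarrow.wkb"},
}
stripped = normalize_field_for_write(wkb_field, "1.1")
```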
```rust
impl SedonaGeoStatsAccumulatorFactory {
    pub fn try_init() -> Result<()> {
        init_geo_stats_accumulator_factory(Arc::new(Self))?;
        Ok(())
    }
}

impl GeoStatsAccumulatorFactory for SedonaGeoStatsAccumulatorFactory {
    fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> {
        if let Some(LogicalType::Geometry { .. }) = descr.logical_type_ref() {
            return Box::new(ParquetGeoStatsAccumulator::default());
        }
```
To get Geography statistics written we have to do some plumbing to get s2geography's rectangle bounder called from deep within the depths of the Parquet crate. This is the mechanism I added in the PR that enabled Geometry statistics to be written: you can initialize an "accumulator factory" on startup that dishes out dynamic stats accumulators. It's a little unfortunate we need an s2geography dependency here, but the fact that we can write stats at all is very cool.
```python
assert file_kv_metadata is None or b"geo" not in file_kv_metadata

assert (
    file.metadata.schema.column(2).logical_type.to_json() == '{"Type": "Geography"}'
)

# We should only have stats if s2geography is enabled
geo_stats = file.metadata.row_group(0).column(2).geo_statistics
if "s2geography" not in sedonadb.__features__:
    assert geo_stats is None
else:
    assert geo_stats is not None
```
Geography statistics + Geography logical type!
```python
# Check for metadata and logical type
file = parquet.ParquetFile(tmp_parquet)
file_kv_metadata = file.metadata.metadata
assert b"geo" in file_kv_metadata
geo_metadata = json.loads(file_kv_metadata[b"geo"])
assert geo_metadata["version"] == "2.0.0"

assert (
    file.metadata.schema.column(2).logical_type.to_json() == '{"Type": "Geometry"}'
)
```
Pull request overview
This PR extends SedonaDB’s GeoParquet writer to support GeoParquet 2.0 (Parquet-native GEOMETRY/GEOGRAPHY) and adds Parquet geospatial statistics support, including geography bounds via sedona-s2geography when enabled.
Changes:
- Add GeoParquet 2.0 + "none" (omit GeoParquet KV metadata) writer support while still emitting Parquet-native logical types/statistics.
- Introduce a custom Parquet geospatial statistics accumulator factory (Geometry via the Parquet implementation; Geography via S2 when `s2geography` is enabled) and initialize it from `SedonaContext`.
- Update Rust/Python/R APIs and tests to accept `geoparquet_version="2.0"` and `"none"`.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| rust/sedona/src/context.rs | Registers GeoParquet + attempts to initialize the global geospatial stats accumulator factory. |
| rust/sedona/Cargo.toml | Extends s2geography feature to enable sedona-geoparquet/s2geography. |
| rust/sedona-geoparquet/src/writer.rs | Implements GeoParquet 2.0/omitted metadata paths, normalization projection, and Parquet-native logical type handling; expands tests. |
| rust/sedona-geoparquet/src/statistics_accumulator.rs | Adds Sedona custom GeoStatsAccumulatorFactory and S2-based geography stats accumulator behind feature flag. |
| rust/sedona-geoparquet/src/options.rs | Makes GeoParquetVersion hashable/equatable (needed by new UDF usage patterns). |
| rust/sedona-geoparquet/src/lib.rs | Exposes the new statistics_accumulator module. |
| rust/sedona-geoparquet/Cargo.toml | Adds parquet-geospatial dep and s2geography feature wiring. |
| Cargo.toml | Enables parquet geospatial feature and adds parquet-geospatial workspace dependency. |
| Cargo.lock | Records resolved parquet-geospatial dependency. |
| c/sedona-s2geography/src/rect_bounder.rs | Adds Debug impl for RectBounder. |
| c/sedona-s2geography/src/geography.rs | Marks GeographyFactory as Send/Sync and derives Debug. |
| python/sedonadb/python/sedonadb/dataframe.py | Expands accepted geoparquet_version values + docs for 2.0/none. |
| python/sedonadb/tests/io/test_parquet.py | Adds tests for GeoParquet 2.0 and “none” metadata behavior. |
| r/sedonadb/tests/testthat/test-dataframe.R | Adds R coverage for 2.0 and “none” versions + invalid string case. |
| r/sedonadb/tests/testthat/_snaps/dataframe.md | Updates snapshot error message for invalid version parsing. |
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.
I am looking at this code sample. Is there any reason we cannot just do the following?

```python
df.write.option("geoparquet_version", "2.0").parquet("path/to/destination")
```

In general, close to what the Spark SQL syntax is doing...
Feel free to open an issue about adding a standalone Python package implementing a Spark compatibility layer like DuckDB's to see if there is interest!
Sure, I will create a ticket on that.
Stacked on #797 because this needs Arrow 57.1.
This PR adds support for writing Parquet GEOMETRY and GEOGRAPHY, including statistics for both. We use the escape hatch added in the arrow-rs repo to override the statistics accumulator, which in turn lets us accumulate geography statistics using s2geography.