feat(rust/sedona-geoparquet): Add GeoParquet 2.0/Parquet-native geometry and geography support to Parquet writer#805

Merged
paleolimbot merged 11 commits into apache:main from paleolimbot:write-geography-in-parquet
May 8, 2026

@paleolimbot
Member

@paleolimbot paleolimbot commented May 1, 2026

Stacked on #797 because this needs Arrow 57.1.

This PR adds support for writing Parquet GEOMETRY and GEOGRAPHY, including statistics for both. We use the escape hatch added in the arrow-rs repo that lets us override the statistics accumulator, which lets us accumulate geography statistics using s2geography.

from pyarrow import parquet
import sedona.db

sd = sedona.db.connect()

sd.funcs.table.sd_random_geometry("Point", bounds=[170, 10, 190, 30]).to_view(
    "pts", overwrite=True
)

sd.sql(
    "SELECT ST_SetSRID(ST_GeogFromWKB(ST_AsBinary(geometry)), 4326) as geog FROM pts"
).to_parquet("geog.parquet", geoparquet_version="2.0")

# Wraparound statistics!
f = parquet.ParquetFile("geog.parquet")
f.metadata.row_group(0).column(0).geo_statistics
# <pyarrow._parquet.GeoStatistics object at 0x12b6ace00>
#   geospatial_types: [1]
#   xmin: 170.02695197909054, xmax: -170.0081463408811
#   ymin: 10.002287497049661, ymax: 29.998259204130765
#   zmin: None, zmax: None
#   mmin: None, mmax: None
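
The x range above wraps across the antimeridian, which is why xmin (about 170) is greater than xmax (about -170). Below is a minimal sketch of how a reader of these statistics might test whether a longitude falls inside such a range. `lon_in_x_range` is a hypothetical helper written for illustration; it is not part of pyarrow or SedonaDB.

```python
def lon_in_x_range(lon: float, xmin: float, xmax: float) -> bool:
    """Return True if lon lies within [xmin, xmax], allowing wraparound."""
    if xmin <= xmax:
        # Ordinary interval that does not cross the antimeridian.
        return xmin <= lon <= xmax
    # Wraparound interval: it covers [xmin, 180] and [-180, xmax].
    return lon >= xmin or lon <= xmax


print(lon_in_x_range(179.5, 170.0, -170.0))
print(lon_in_x_range(-175.0, 170.0, -170.0))
print(lon_in_x_range(0.0, 170.0, -170.0))
```

With the range from the output above, longitudes 179.5 and -175 are inside and longitude 0 is outside.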

Comment thread rust/sedona-geoparquet/src/writer.rs Outdated
Comment on lines +680 to +698
// Due to a bug in the parquet type conversion, we need to serialize invalid metadata for geography
// fields. The conversion logic expects "algorithm" but the valid GeoArrow metadata we serialize
// by default is "edges".
// https://github.com/apache/arrow-rs/blob/f725bc9b955f23772a6a6d8a38c99a8b3f359116/parquet-geospatial/src/types.rs#L64-L66
fn serialize_edges_and_crs_with_parquet_bug(
    original_field: &FieldRef,
    crs: &Crs,
    edges: Edges,
) -> FieldRef {
    let crs_component = crs
        .as_ref()
        .map(|crs| format!(r#""crs":{}"#, crs.to_json()));

    let edges_component = match edges {
        Edges::Planar => None,
        // This is where we apply the workaround relative to our usual
        // serialize_edges_and_crs().
        Edges::Spherical => Some(r#""algorithm":"spherical""#),
    };
Member Author

I filed apache/arrow-rs#9929 for this one
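
To make the workaround concrete, here is a small sketch of the difference between the two serializations: the GeoArrow extension metadata normally uses the key "edges", while the workaround emits "algorithm" for spherical edges. The function name and the `parquet_workaround` flag are hypothetical and only illustrate the idea; the actual implementation is the Rust function excerpted above.

```python
import json
from typing import Any, Optional


def extension_metadata(
    crs: Optional[Any], edges: str, parquet_workaround: bool = False
) -> str:
    """Serialize GeoArrow-style extension metadata as a JSON object string."""
    parts = {}
    if crs is not None:
        parts["crs"] = crs
    if edges == "spherical":
        # Assumption for illustration: only the key name differs.
        key = "algorithm" if parquet_workaround else "edges"
        parts[key] = "spherical"
    elif edges != "planar":
        raise ValueError(f"unsupported edges: {edges!r}")
    # Planar edges are the default and are omitted from the metadata.
    return json.dumps(parts)


print(extension_metadata(None, "spherical"))
print(extension_metadata(None, "spherical", parquet_workaround=True))
print(extension_metadata(None, "planar"))
```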

Comment on lines -80 to +83
let mut output_geometry_column_indices = conf.output_schema().geometry_column_indices()?;
if output_geometry_column_indices.is_empty() {
let input_geometry_column_indices = conf.output_schema().geometry_column_indices()?;
if input_geometry_column_indices.is_empty() {
Member Author

There is a lot of churn in this function, which prepares a plan before writing. Basically, we now always do a "projection" (one that only modifies metadata) because we have to do at least one of the following: strip the geoarrow.wkb type for GeoParquet 1.0 and 1.1 (Parquet with the geospatial feature would write that as Geometry, and we want a plain byte array with no logical type); canonicalize the CRS so that the logical type always contains PROJJSON (GeoParquet 2.0/no GeoParquet); and/or create some invalid GeoArrow metadata to work around a Parquet Rust bug (see above). Finally, sometimes we don't write the key/value metadata (no GeoParquet).
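
The per-version behavior described in this comment can be summarized as a small decision table. This is only a sketch of the comment's wording: `WritePlan` and `plan_for_version` are hypothetical names, and the real code operates on a query plan and Arrow field metadata rather than on flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePlan:
    strip_geoarrow_type: bool  # write a plain byte array with no logical type
    canonicalize_crs: bool  # ensure PROJJSON ends up in the logical type
    write_geo_kv_metadata: bool  # write the "geo" key/value metadata


def plan_for_version(geoparquet_version: str) -> WritePlan:
    """Map a geoparquet_version value to the metadata steps it implies."""
    if geoparquet_version in ("1.0", "1.1"):
        return WritePlan(True, False, True)
    if geoparquet_version == "2.0":
        return WritePlan(False, True, True)
    if geoparquet_version == "none":
        return WritePlan(False, True, False)
    raise ValueError(f"unknown geoparquet_version: {geoparquet_version!r}")


for version in ("1.0", "1.1", "2.0", "none"):
    print(version, plan_for_version(version))
```

The geography metadata workaround is left out of the table because it depends on the column's edge type rather than on the version.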

Comment on lines +32 to +43
impl SedonaGeoStatsAccumulatorFactory {
    pub fn try_init() -> Result<()> {
        init_geo_stats_accumulator_factory(Arc::new(Self))?;
        Ok(())
    }
}

impl GeoStatsAccumulatorFactory for SedonaGeoStatsAccumulatorFactory {
    fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> {
        if let Some(LogicalType::Geometry { .. }) = descr.logical_type_ref() {
            return Box::new(ParquetGeoStatsAccumulator::default());
        }
Member Author

To get Geography statistics written we have to do some plumbing to get s2geography's rectangle bounder called from deep within the depths of the Parquet crate. This is the mechanism I added in the PR that enabled Geometry statistics to be written: you can initialize an "accumulator factory" on startup that dishes out dynamic stats accumulators. It's a little unfortunate we need an s2geography dependency here but the fact that we can write stats at all is very cool.
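
The factory pattern described here can be sketched in a few lines. Everything below is a simplified, hypothetical illustration in Python (class names, the string logical types, and the `spherical_accumulator` parameter are assumptions); the real mechanism is the Rust trait excerpted above, and the geography bounder is provided by s2geography.

```python
import math
from typing import Optional


class PlanarBoundsAccumulator:
    """Tracks a plain min/max bounding box, as one might for Geometry."""

    def __init__(self) -> None:
        self.xmin = self.ymin = math.inf
        self.xmax = self.ymax = -math.inf

    def update(self, x: float, y: float) -> None:
        self.xmin, self.xmax = min(self.xmin, x), max(self.xmax, x)
        self.ymin, self.ymax = min(self.ymin, y), max(self.ymax, y)

    def finish(self) -> Optional[tuple]:
        if self.xmin > self.xmax:
            return None  # nothing was accumulated
        return (self.xmin, self.ymin, self.xmax, self.ymax)


class NoStatsAccumulator:
    """Accepts values but never reports statistics."""

    def update(self, x: float, y: float) -> None:
        pass

    def finish(self) -> Optional[tuple]:
        return None


class StatsAccumulatorFactory:
    def __init__(self, spherical_accumulator=None) -> None:
        # Stand-in for an s2geography-backed accumulator class; None mimics
        # a build without that feature enabled.
        self._spherical_accumulator = spherical_accumulator

    def new_accumulator(self, logical_type: str):
        if logical_type == "Geometry":
            return PlanarBoundsAccumulator()
        if logical_type == "Geography" and self._spherical_accumulator is not None:
            return self._spherical_accumulator()
        return NoStatsAccumulator()


factory = StatsAccumulatorFactory()
acc = factory.new_accumulator("Geometry")
for x, y in [(1.0, 2.0), (-3.0, 5.0)]:
    acc.update(x, y)
print(acc.finish())
print(factory.new_accumulator("Geography").finish())
```

In this sketch a Geography column with no spherical accumulator available simply produces no statistics, which mirrors what the Python test in this PR expects when s2geography is not enabled.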

Comment on lines +569 to +578
assert file_kv_metadata is None or b"geo" not in file_kv_metadata

file.metadata.schema.column(2).logical_type.to_json() == '{"Type": "Geography"}'

# We should only have stats if s2geography is enabled
geo_stats = file.metadata.row_group(0).column(2).geo_statistics
if "s2geography" not in sedonadb.__features__:
    assert geo_stats is None
else:
    assert geo_stats is not None
Member Author

Geography statistics + Geography logical type!

Comment on lines +505 to +512
# Check for metadata and logical type
file = parquet.ParquetFile(tmp_parquet)
file_kv_metadata = file.metadata.metadata
assert b"geo" in file_kv_metadata
geo_metadata = json.loads(file_kv_metadata[b"geo"])
assert geo_metadata["version"] == "2.0.0"

file.metadata.schema.column(2).logical_type.to_json() == '{"Type": "Geometry"}'
Member Author

GeoParquet 2.0!

Contributor

Copilot AI left a comment

Pull request overview

This PR extends SedonaDB’s GeoParquet writer to support GeoParquet 2.0 (Parquet-native GEOMETRY/GEOGRAPHY) and adds Parquet geospatial statistics support, including geography bounds via sedona-s2geography when enabled.

Changes:

  • Add GeoParquet 2.0 + “none” (omit GeoParquet KV metadata) writer support while still emitting Parquet-native logical types/statistics.
  • Introduce a custom Parquet geospatial statistics accumulator factory (Geometry via Parquet implementation; Geography via S2 when s2geography is enabled) and initialize it from SedonaContext.
  • Update Rust/Python/R APIs and tests to accept geoparquet_version="2.0" and "none".

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| rust/sedona/src/context.rs | Registers GeoParquet + attempts to initialize the global geospatial stats accumulator factory. |
| rust/sedona/Cargo.toml | Extends s2geography feature to enable sedona-geoparquet/s2geography. |
| rust/sedona-geoparquet/src/writer.rs | Implements GeoParquet 2.0/omitted metadata paths, normalization projection, and Parquet-native logical type handling; expands tests. |
| rust/sedona-geoparquet/src/statistics_accumulator.rs | Adds Sedona custom GeoStatsAccumulatorFactory and S2-based geography stats accumulator behind feature flag. |
| rust/sedona-geoparquet/src/options.rs | Makes GeoParquetVersion hashable/equatable (needed by new UDF usage patterns). |
| rust/sedona-geoparquet/src/lib.rs | Exposes the new statistics_accumulator module. |
| rust/sedona-geoparquet/Cargo.toml | Adds parquet-geospatial dep and s2geography feature wiring. |
| Cargo.toml | Enables parquet geospatial feature and adds parquet-geospatial workspace dependency. |
| Cargo.lock | Records resolved parquet-geospatial dependency. |
| c/sedona-s2geography/src/rect_bounder.rs | Adds Debug impl for RectBounder. |
| c/sedona-s2geography/src/geography.rs | Marks GeographyFactory as Send/Sync and derives Debug. |
| python/sedonadb/python/sedonadb/dataframe.py | Expands accepted geoparquet_version values + docs for 2.0/none. |
| python/sedonadb/tests/io/test_parquet.py | Adds tests for GeoParquet 2.0 and “none” metadata behavior. |
| r/sedonadb/tests/testthat/test-dataframe.R | Adds R coverage for 2.0 and “none” versions + invalid string case. |
| r/sedonadb/tests/testthat/_snaps/dataframe.md | Updates snapshot error message for invalid version parsing. |

Comment thread rust/sedona/src/context.rs
Comment thread rust/sedona-geoparquet/src/writer.rs Outdated
Comment thread rust/sedona-geoparquet/src/writer.rs Outdated
Comment thread rust/sedona-geoparquet/src/writer.rs Outdated
Comment thread python/sedonadb/tests/io/test_parquet.py
Comment thread python/sedonadb/tests/io/test_parquet.py
Comment thread python/sedonadb/tests/io/test_parquet.py Outdated
Comment thread c/sedona-s2geography/src/rect_bounder.rs Outdated
Comment thread c/sedona-s2geography/src/geography.rs Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.


Comment thread rust/sedona-geoparquet/src/writer.rs Outdated
Comment thread rust/sedona/src/context.rs
Comment thread rust/sedona-geoparquet/src/writer.rs
@paleolimbot paleolimbot changed the title feat(rust/sedona-geoparquet): Add GeoParquet 2.0/Parquet-native geometry and geography support to parque writer feat(rust/sedona-geoparquet): Add GeoParquet 2.0/Parquet-native geometry and geography support to Parquet writer May 7, 2026
@paleolimbot paleolimbot marked this pull request as ready for review May 7, 2026 04:06
@zhangfengcdt
Member

zhangfengcdt commented May 7, 2026

I am looking at this code sample

sd.sql(
    "SELECT ST_SetSRID(ST_GeogFromWKB(ST_AsBinary(geometry)), 4326) as geog FROM pts"
).to_parquet("geog.parquet", geoparquet_version="2.0")

Is there any reason we cannot just do the following?

df.write.option("geoparquet_version", "2.0").parquet("path/to/destination")

In general, close to what spark sql syntax is doing ...

@paleolimbot
Member Author

> Is there any reason we cannot just do the following?

.to_parquet() is the idiom in Pandas, GeoPandas, DuckDB, and Ibis. Personally I like it because autocomplete works when working interactively (parameter documentation is highlighted as you type).

Feel free to open an issue about adding a standalone Python package implementing a Spark compatibility layer like DuckDB's to see if there is interest!
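
For illustration only, a Spark-style writer of the kind discussed here could be a thin wrapper that collects options and forwards them to `.to_parquet()`. `SparkStyleWriter` and `FakeDataFrame` are hypothetical names made up for this sketch; nothing like this exists in SedonaDB as part of this PR.

```python
class SparkStyleWriter:
    """Collects options, then forwards them to df.to_parquet() as keywords."""

    def __init__(self, df):
        self._df = df
        self._options = {}

    def option(self, key, value):
        self._options[key] = value
        return self  # allow chaining, as in df.write.option(...).parquet(...)

    def parquet(self, path):
        return self._df.to_parquet(path, **self._options)


class FakeDataFrame:
    """Stand-in that records calls instead of writing a file."""

    def __init__(self):
        self.calls = []

    def to_parquet(self, path, **kwargs):
        self.calls.append((path, kwargs))


df = FakeDataFrame()
SparkStyleWriter(df).option("geoparquet_version", "2.0").parquet("path/to/destination")
print(df.calls)
```

One design consequence of this shape is that options become free-form strings, so the interactive autocomplete and parameter documentation mentioned above would not apply to them.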

@zhangfengcdt
Member

> > Is there any reason we cannot just do the following?
>
> .to_parquet() is the idiom in Pandas, GeoPandas, DuckDB, and Ibis. Personally I like it because autocomplete works when working interactively (parameter documentation is highlighted as you type).
>
> Feel free to open an issue about adding a standalone Python package implementing a Spark compatibility layer like DuckDB's to see if there is interest!

Sure, I will create a ticket on that.

@paleolimbot paleolimbot merged commit 77df1bf into apache:main May 8, 2026
18 checks passed
@paleolimbot paleolimbot deleted the write-geography-in-parquet branch May 8, 2026 21:06
3 participants