Replace Spark/DuckDB APIs with Ibis#414
daniel-thom wants to merge 16 commits into dsgrid/dsgrid:main
Conversation
Pull request overview
This PR migrates dsgrid’s table-processing layer (and the surrounding query, registry, and test code) away from Spark/DuckDB-specific APIs toward a unified Ibis-based runtime interface.
Changes:
- Introduces a new `dsgrid.ibis` compatibility layer (backend/session/io/ops/types) and refactors core codepaths to use it.
- Reworks unit conversion, dataset mapping/expression evaluation, and reporting to operate on Ibis tables/SQL rather than Spark DataFrame APIs.
- Updates a large portion of tests + docs to reflect the new runtime abstractions and API model renames (e.g., query submission models).
Reviewed changes
Copilot reviewed 128 out of 129 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_utilities.py | Minor test cleanup (whitespace). |
| tests/test_unit_conversions.py | Switch unit conversion tests from Spark DF ops to Ibis table ops/helpers. |
| tests/test_representative_time.py | Update representative-time tests to run with runtime/Ibis paths and DuckDB/Spark branches. |
| tests/test_registry_management.py | Update registry tests to expect Ibis tables + pandas conversion helpers. |
| tests/test_project.py | Make project tests backend-agnostic via small Ibis/Spark collection helpers. |
| tests/test_localize_timestamps_if_necessary.py | Update chronify localization tests to use runtime session + table-to-pandas helpers. |
| tests/test_ibis_session.py | Add/adjust runtime session tests (read/write/restart/custom conf). |
| tests/test_ibis_functions.py | Update function tests to use Ibis helpers (filter_sql, order_by, _collect). |
| tests/test_find_minimal_patterns.py | Minor assertion formatting cleanup. |
| tests/test_filter_registry.py | Update filtered-registry tests to join/drop via Ibis ops + _collect. |
| tests/test_filesystem_data_store.py | Use dsgrid.ibis.session.create_dataframe_from_dicts. |
| tests/test_duckdb_data_store.py | Use pandas input + Ibis counts for DuckDB store tests. |
| tests/test_derived_datasets.py | Make derived-dataset tests collect/order in Ibis or Spark. |
| tests/test_datasets.py | Update expected error text from “DataFrame” to “Ibis table”. |
| tests/test_dataset_utils.py | Convert dataset utility tests to runtime/Ibis equivalents. |
| tests/test_dataset_expression_handler.py | Convert expression-handler tests to Ibis tables + filter_sql. |
| tests/test_create_time_dimensions.py | Use runtime session instead of spark session for time-dimension test fixtures. |
| tests/test_convert_time_format_if_necessary.py | Update time-format conversion tests to runtime session + Ibis collection. |
| tests/test_api.py | Update API request/response model names + handle 413 inline-result size limit. |
| tests/test_annual_time.py | Update annual-time tests to Ibis tables + backend-agnostic counting/collection. |
| tests/test_aeo_dataset_registration.py | Switch get_unique_values import to Ibis table utils. |
| tests/simple_standard_scenarios_datasets.py | Refactor dataset builders/mappers to Ibis operations (drop/join/rename). |
| tests/conftest.py | Initialize runtime session fixture instead of Spark. |
| tests/cli/test_registry.py | Switch use_duckdb import to Ibis types. |
| tests/cli/test_config.py | Remove unused Path import. |
| scripts/enumerate_load_table_lookup.py | Switch script to Ibis read/join + TablePartition. |
| scripts/enumerate_load_table_lookup.ipynb | Update notebook to runtime session + TablePartition. |
| pyproject.toml | Add ibis-framework[duckdb] dependency; add ty dev dep; add uv source override for chronify. |
| dsgrid/utils/utilities.py | Make IPython imports lazy via importlib and tighten typing for list_enum_values. |
| dsgrid/utils/timing.py | Cast _start for type safety in timer exit. |
| dsgrid/utils/files.py | Write LDJSON with utf-8 (no BOM). |
| dsgrid/units/power.py | Replace Spark column expressions with SQL CASE expression builders. |
| dsgrid/units/convert.py | Rewrite unit conversion for unpivoted tables using Ibis + SQL. |
| dsgrid/tests/utils.py | Use runtime session + Ibis join/drop helpers when reading test parquet. |
| dsgrid/tests/register_derived_datasets.py | Use DatasetConfig.config_filename() instead of DatasetRegistry. |
| dsgrid/tests/make_us_data_registry.py | Minor formatting simplification. |
| dsgrid/tests/common.py | Improve typing/casts around config_class().config_filename(). |
| dsgrid/spark/types.py | Remove Spark/DuckDB dual-mode types shim (deleted). |
| dsgrid/rust_ext/find_minimal_patterns.py | Lazy-load Rust extension via importlib for cleaner failure mode. |
| dsgrid/rust_ext/__init__.py | Same lazy-load pattern for Rust types. |
| dsgrid/registry/versioning.py | Fix json5 loading by opening file before json5.load. |
| dsgrid/registry/registry_manager.py | Initialize runtime session instead of Spark session. |
| dsgrid/registry/registry_manager_base.py | Type _db as Any. |
| dsgrid/registry/registry_interface.py | Add asserts/casts around ids/versions for stronger invariants. |
| dsgrid/registry/registry_auto_updater.py | Add casts for model typing in updater loops. |
| dsgrid/registry/registration_context.py | Switch temp-table cleanup import to dsgrid.ibis.temp. |
| dsgrid/registry/filter_registry_manager.py | Convert filtering logic to Ibis table record helpers + is_table_empty. |
| dsgrid/registry/filesystem_data_store.py | Switch store IO types to ibis.Table and runtime read/write helpers. |
| dsgrid/registry/dimension_registry_manager.py | Add asserts/casts; rename remove arg to config_id. |
| dsgrid/registry/data_store_interface.py | Update abstract interface types from Spark DF to ibis.Table. |
| dsgrid/registry/common.py | Rename from_file arg to filename. |
| dsgrid/query/report_peak_load.py | Replace Spark agg with SQL-based max-by-group + Ibis IO. |
| dsgrid/query/query_context.py | Store/restore record ids via Ibis table utils instead of Spark collects. |
| dsgrid/query/models.py | Replace Spark function binding with FunctionReference; rename spark conf model -> runtime conf model. |
| dsgrid/query/derived_dataset.py | Convert derived dataset validation + record writing to Ibis helpers (CSV via backend). |
| dsgrid/project.py | Switch query processing/mapping to runtime session + Ibis join/filter ops. |
| dsgrid/loggers.py | Cast handlers list to satisfy typing when appending. |
| dsgrid/ibis/types.py | New: runtime engine helpers (duckdb vs spark), type helpers, is_table_empty. |
| dsgrid/ibis/temp.py | New: temp view/table naming + cleanup for Spark backend. |
| dsgrid/ibis/table_utils.py | New: execute/collect helpers for “known-small” tables + distinct value helpers. |
| dsgrid/ibis/partition.py | Rename partition helper to TablePartition and adapt to Ibis counting/aggregation. |
| dsgrid/ibis/operations.py | New: cross-backend table ops (filter/join/pivot/unpivot/etc.) implemented via Ibis/SQL. |
| dsgrid/ibis/io.py | New: Ibis read helpers for csv/json/parquet. |
| dsgrid/ibis/functions.py | New: compatibility helpers mirroring prior Spark helpers using Ibis + runtime session. |
| dsgrid/ibis/backend.py | New: construct/cache runtime Ibis backend + chronify store creation. |
| dsgrid/ibis/__init__.py | Package marker. |
| dsgrid/filesystem/s3_filesystem.py | Lazy-load boto3/s3path; minor rglob hidden filtering tweak. |
| dsgrid/dsgrid_rc.py | Allow loading runtime config from a specified filename. |
| dsgrid/dimension/time_utils.py | Fix optional typing for args. |
| dsgrid/dimension/dimension_filters.py | Rewrite filters to SQL-string predicates executed via Ibis SQL. |
| dsgrid/dataset/unpivoted_table.py | Switch aggregation and unit conversion to Ibis tables/SQL. |
| dsgrid/dataset/table_format_handler_base.py | Convert core handler plumbing to Ibis + SQL projections/joins. |
| dsgrid/dataset/growth_rates.py | Convert growth-rate math to Ibis SQL operations. |
| dsgrid/dataset/dataset.py | Switch handler interfaces to Ibis table types; update mapping reference model type. |
| dsgrid/dataset/dataset_schema_handler_one_table.py | Convert schema checks/filtering to Ibis + runtime null checks. |
| dsgrid/dataset/dataset_mapping_manager.py | Use Ibis tables for checkpointed persistence. |
| dsgrid/dataset/dataset_expression_handler.py | Implement dataset arithmetic via SQL over temp views + Ibis backend. |
| dsgrid/data_models.py | Allow from_file to accept str; safer enum description formatting. |
| dsgrid/config/time_dimension_base_config.py | Switch time mapping to Ibis rename helper + updated typing. |
| dsgrid/config/representative_period_time_dimension_config.py | Fix return types to Type[...] for model_class helpers. |
| dsgrid/config/noop_time_dimension_config.py | Fix model_class typing + signature compatibility. |
| dsgrid/config/mapping_tables.py | Build MappingTableModel via model_validate dict. |
| dsgrid/config/input_dataset_requirements.py | Replace conlist with Annotated[..., Field(min_length=1)]. |
| dsgrid/config/index_time_dimension_config.py | Fix model_class typing to Type[...]. |
| dsgrid/config/file_schema.py | Switch file readers/writers to Ibis IO + rename/drop helpers. |
| dsgrid/config/dimensions.py | Add cast for offset_column access on union model. |
| dsgrid/config/dimension_mapping_base.py | Use enum default value instead of string literal. |
| dsgrid/config/dimension_config.py | Ensure get_unique_ids() returns set[str]. |
| dsgrid/config/date_time_dimension_config.py | Fix model_class typing + casts for chronify model creation. |
| dsgrid/config/dataset_config.py | Convert table scanning and trivial dims logic to Ibis SQL/joins. |
| dsgrid/config/config_base.py | Write records via Ibis CSV writer; cast model_class() for typing. |
| dsgrid/config/common.py | Add explicit type annotation for time dimension template dict. |
| dsgrid/config/annual_time_dimension_config.py | Convert annual time operations to runtime session + Ibis SQL. |
| dsgrid/cli/dsgrid.py | Format long call across multiple lines. |
| dsgrid/cli/common.py | Add asserts for exception info; adjust click Option signature typing. |
| dsgrid/chronify.py | Use Ibis-backed chronify store factory; remove explicit store disposal. |
| dsgrid/apps/registration_gui.py | Lazy-load IPython/widgets; runtime session init; fix conn creation. |
| dsgrid/apps/project_viewer/app.py | Cast conditional styling for Dash typing. |
| dsgrid/api/response_models.py | Rename submit-project-query response model. |
| dsgrid/api/models.py | Rename submit-project-query request model. |
| dsgrid/api/app.py | Use runtime session; add inline JSON size limit (413) + pandas conversion helper. |
| dsgrid/__init__.py | Minor formatting change. |
| docs/source/user_guide/tutorials/map_dataset.md | Update narrative from Spark-specific to Ibis runtime/backends. |
| docs/source/user_guide/project_queries/query_concepts.md | Update query lifecycle docs to Ibis runtime terminology. |
| docs/source/user_guide/apache_spark/overview.md | Clarify Spark is an Ibis backend and update wording. |
| docs/source/user_guide/apache_spark/index.md | Update Spark docs intro wording. |
| docs/source/software_reference/python_api.md | Replace .show() examples with to_pyarrow() printing. |
| docs/source/software_reference/architecture.md | Update architecture to Ibis + backend model. |
| docs/source/index.md | Update overview copy to reflect Ibis backend usage. |
| docs/source/getting_started/installation.md | Clarify Java only required for Spark backend. |
| docs/source/developer_guide/how_tos/how_to_convert_time_zone.md | Update terminology/examples from DataFrame to Ibis table + runtime variants. |
| .gitmodules | Point test-data submodule to feat/use-ibis branch. |
    def rename_columns(df: ibis.Table, mapping: dict[str, str]) -> ibis.Table:
        """Rename columns using a mapping of old name to new name."""
        return df.rename({new: old for old, new in mapping.items()})
rename_columns() inverts the provided mapping: df.rename({new: old for old, new in mapping.items()}). Ibis expects a mapping of existing column names to new column names, so this currently renames in the wrong direction and will fail for callers like rename_columns(df, {"unit": "from_unit"}). Update this to pass the mapping in the correct direction (old -> new) and add a small unit test for a simple rename.
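A minimal test along the lines suggested could look like the sketch below; the import path dsgrid.ibis.operations is an assumption, and the assertion only checks the documented old-to-new contract of the helper.

```python
import ibis
import pandas as pd

# Hypothetical import path; point this at wherever rename_columns actually lives.
from dsgrid.ibis.operations import rename_columns


def test_rename_columns_simple_rename():
    df = ibis.memtable(pd.DataFrame({"unit": ["kWh", "MWh"], "value": [1.0, 2.0]}))
    renamed = rename_columns(df, {"unit": "from_unit"})
    # The docstring's contract is old name -> new name.
    assert set(renamed.columns) == {"from_unit", "value"}
```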
    def coalesce(df: ibis.Table, num_partitions: int) -> ibis.Table:
        return df
coalesce() is currently a no-op (returns df unchanged). Several call sites rely on coalesce(..., 1) to produce a single output file/partition (e.g., dsgrid/utils/dataset.py:348-363 asserts only one parquet part file). This will break on the Spark backend and can also cause excessive small files. Implement backend-appropriate coalesce/repartition behavior (or remove the helper and update call sites to avoid assumptions about single output files).
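If the helper is removed, call sites that truly need a single output file could instead materialize known-small results and write one parquet file themselves; write_single_parquet below is a hypothetical sketch of that approach, not code from this PR.

```python
from pathlib import Path

import ibis
import pyarrow.parquet as pq


def write_single_parquet(table: ibis.Table, path: Path) -> None:
    """Write an Ibis table to exactly one parquet file.

    Only suitable for results small enough to collect in memory; large tables
    should keep using the backend's native (possibly multi-file) writer.
    """
    arrow_table = table.to_pyarrow()  # materializes the expression
    pq.write_table(arrow_table, path)
```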
    except Exception:
        tmp_file = NamedTemporaryFile(suffix=".parquet", delete=False)
        tmp_file.close()
        df.to_parquet(tmp_file.name)
        escaped_path = tmp_file.name.replace("'", "''")
        conn = cast(Any, make_runtime_backend().connection)
        conn.raw_sql(f"CREATE TEMP VIEW {view} AS SELECT * FROM read_parquet('{escaped_path}')")
        return view
create_temp_view() fallback writes an on-disk temporary parquet file with delete=False and never deletes it. On repeated calls this can leak many temp files and fill disks. Consider (a) restricting this fallback to DuckDB only, (b) registering the temp file path for cleanup (atexit / ScratchDirContext / drop_temp_tables_and_views), or (c) using an in-memory registration approach if supported by the backend.
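Option (b) could be as simple as tracking the fallback files and deleting them at exit; the sketch below (register_temp_parquet is a hypothetical helper) shows the idea, though hooking into ScratchDirContext or drop_temp_tables_and_views would fit the existing cleanup paths better.

```python
import atexit
import os
from tempfile import NamedTemporaryFile

_TEMP_PARQUET_FILES: list[str] = []


def register_temp_parquet(suffix: str = ".parquet") -> str:
    """Create a temp file whose path is deleted at interpreter exit."""
    tmp_file = NamedTemporaryFile(suffix=suffix, delete=False)
    tmp_file.close()
    _TEMP_PARQUET_FILES.append(tmp_file.name)
    return tmp_file.name


@atexit.register
def _cleanup_temp_parquet_files() -> None:
    for path in _TEMP_PARQUET_FILES:
        if os.path.exists(path):
            os.unlink(path)
```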
| case "contains": | ||
| expr = f"{column} LIKE '%{_escape_like_value(value)}%'" | ||
| case "endswith": | ||
| expr = f"{column} LIKE '%{_escape_like_value(value)}'" | ||
| case "isNotNull": | ||
| expr = f"{column} IS NOT NULL" | ||
| case "isNull": | ||
| expr = f"{column} IS NULL" | ||
| case "isin": | ||
| if not isinstance(value, list | tuple | set): | ||
| msg = f"value must be a list, tuple, or set for {operator =}" | ||
| raise DSGInvalidField(msg) | ||
| vals = ", ".join(_make_sql_value(x) for x in value) | ||
| expr = f"{column} IN ({vals})" | ||
| case "like": | ||
| expr = f"{column} LIKE {_make_sql_value(value)}" | ||
| case "rlike": | ||
| expr = f"{column} RLIKE {_make_sql_value(value)}" | ||
| case "startswith": | ||
| expr = f"{column} LIKE '{_escape_like_value(value)}%'" | ||
| case _: |
There was a problem hiding this comment.
The LIKE-based implementations for contains/startswith/endswith only escape single quotes. In Spark, col.contains()/startswith()/endswith() treat the input literally, but SQL LIKE will treat '%' and '_' as wildcards, so values containing those characters will match incorrectly. Consider escaping '%' and '_' (and adding an ESCAPE clause) or using backend string functions that preserve literal semantics.
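A sketch of that escaping, assuming a revised _escape_like_value and that the target backends accept a standard SQL ESCAPE clause (DuckDB and Spark SQL both do):

```python
def _escape_like_value(value: str) -> str:
    """Escape LIKE wildcards using '!' and double any single quotes."""
    return (
        value.replace("!", "!!")
        .replace("%", "!%")
        .replace("_", "!_")
        .replace("'", "''")
    )


def _contains_expr(column: str, value: str) -> str:
    # ESCAPE '!' tells the backend to treat !% and !_ as literal characters.
    return f"{column} LIKE '%{_escape_like_value(value)}%' ESCAPE '!'"
```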
Force-pushed from 993c64e to f10e133
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main     #414      +/-   ##
    ==========================================
    + Coverage   82.09%   85.22%   +3.12%
    ==========================================
      Files         124      128       +4
      Lines       14441    14800     +359
    ==========================================
    + Hits        11856    12613     +757
    + Misses       2585     2187     -398

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Converts set ops, joins, filters, unit conversion, and growth rates to native Ibis expressions. Adds cross-backend bridging in create_temp_view and type-alignment helpers so native set ops and joins tolerate the SQL semantics the prior string-SQL paths had (int/float coercion, schema widening). Restores a Spark-specific coalesce. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spark's default shuffle.partitions=200 combined with dsgrid's cascaded joins produced stages with 248k+ tiny tasks on small local jobs, dominating runtime. Apply small local-mode defaults (shuffle.partitions=4, default.parallelism=4, AQE + coalescePartitions on) when Spark is running locally. SPARK_CONF_DIR and spark_conf= overrides still win. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
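For reference, the local-mode defaults described above correspond roughly to the following standard Spark settings (a sketch, not the PR's exact implementation):

```python
from pyspark.sql import SparkSession

# Small local-mode defaults; SPARK_CONF_DIR and explicit spark_conf= overrides
# are expected to take precedence over these.
spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.default.parallelism", "4")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```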
Replace the Spark smoke subset with the same pytest invocation the DuckDB job uses so every test runs in both backends. Add Java 21 setup since Spark 4 requires Java 17+. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 368e2b5 to d52fdc5
elainethale left a comment
I have a general question regarding the documentation: To what extent do users need to know about Ibis, and to what extent is it just an implementation detail? To me, it would be best for users to just think about "DataFrames" and whether they need to do processing with "Spark" (and thus go through a bit more set up) or not. Certainly it should be documented that DuckDB and Ibis are key technologies, but they might not need to be mentioned as much as they are in this PR's documentation changes.
I was also expecting a larger audit of the existing documentation that covers installation and spark use. I'm not sure we have the right set of high-level pages/sections, and suspect that there's a lot of out-of-date information hanging around and not as much on configuring spark with sparkctl as would be helpful.
    project = manager.project_manager.load_project("dsgrid_conus_2022")
    geo_dim = project.config.get_base_dimension(DimensionType.GEOGRAPHY)
    geo_dim.get_records_dataframe().show()
    print(geo_dim.get_records_dataframe().limit(5).to_pyarrow())
This is a clunkier interface for inspecting the head of a DataFrame. Is there a better option?
print(geo_dim.get_records_dataframe().head().execute())
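Ibis also has an interactive mode that makes bare expressions print a preview, which reads closer to Spark's .show(); a sketch reusing the snippet above:

```python
import ibis

# Interactive mode makes bare expressions print a small preview table.
ibis.options.interactive = True
geo_dim.get_records_dataframe().head()

# Equivalent explicit forms:
print(geo_dim.get_records_dataframe().head().execute())       # pandas preview
print(geo_dim.get_records_dataframe().limit(5).to_pyarrow())  # pyarrow preview
```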
    3. **Combine datasets** - Combine the datasets as specified by the `expression` in the dataset data model of the query. The default is to take a union of all datasets.
    4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Spark query from the previous step by writing the dataframe to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
    4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Ibis query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.

Suggested change (replace "the Ibis query" with "the query"):

    4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
    8. **Sort columns** - If the field `sort_columns` in the `result` data model is `true`, sort the table by those columns.
    9. **Write output** - Evaluate the Spark job for the previous steps by writing the dataframe to the filesystem in the directory `query_output/<query-name>`.
    9. **Write output** - Evaluate the Ibis query for the previous steps by writing the table to the filesystem in the directory `query_output/<query-name>`.
Per the suggestion above, can these "Spark job" to "Ibis query" replacements all be changed to something like "query"? (And are these descriptions only applicable with Spark, or do they also apply to DuckDB?)
    ```
    By default, this will attempt to map all dimensions by performing three Spark queries:
    By default, this will attempt to map all dimensions in three Ibis query phases:

Suggested change:

    By default, this will attempt to map all dimensions in three phases:
    - **Flexible Mappings**: Map datasets between different dimensional systems with explicit, documented transformations
    - **Powerful Queries**: Aggregate, filter, and transform data across dimensions to create custom views
    - **Big Data Support**: Process terabyte-scale datasets using Apache Spark or gigabyte-scale datasets using DuckDB
    - **Big Data Support**: Use the Ibis table API with DuckDB for local work or Apache Spark for distributed workloads
The original version seems better for end-users. I would revert this change.
| "devtools", | ||
| "flake8", | ||
| "mypy", | ||
| "ty", |
There was a problem hiding this comment.
It looks like you've made a bunch of changes to enforce typing, but I'm not seeing pre-commit or ci checks for it. Did I miss those somewhere?
    def _with_sql_column(df, column, expr):
        view = create_temp_view(df)
        cols = ", ".join(x for x in df.columns if x != column)
        return get_runtime_session().sql(f"SELECT {cols}, {expr} AS {column} FROM {view}")


    def _collect(df):
        if isinstance(df, ibis.Table):
            return list(df.execute().itertuples(index=False, name="Row"))
        return df.collect()


    def _first_value(df, column):
        return getattr(_collect(df.limit(1))[0], column)


    def _row_value(row, column):
        try:
            return row[column]
        except TypeError:
            return getattr(row, column)
This seems like a funny place for these fairly generic helper functions.
    @pytest.fixture(scope="module")
    def pivoted_dataframes(records_dataframe_energy):
    def pivoted_tables(records_dataframe_energy):
Worthwhile to change these test names?
They are fixtures, not tests.
    def _collect(df):
        if isinstance(df, ibis.Table):
            return list(df.execute().itertuples(index=False, name="Row"))
        return df.collect()


    def _count(df) -> int:
        if isinstance(df, ibis.Table):
            return df.count().execute()
        return df.count()
These generic helper functions seem misplaced.
    def _count(df):
        if isinstance(df, ibis.Table):
            return df.count().execute()
        return df.count()


    def _collect(df):
        if isinstance(df, ibis.Table):
            return list(df.execute().itertuples(index=False, name="Row"))
        return df.collect()
I'm going to stop commenting on this now, but lots of simple helper functions are repeated throughout the test files.
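One way to address the duplication would be a single shared module that the tests import; tests/helpers.py below is a hypothetical location, with the function bodies taken from the repeated helpers above.

```python
# tests/helpers.py (hypothetical shared location)
import ibis


def collect(df):
    """Rows as named tuples, for an Ibis table or a Spark DataFrame."""
    if isinstance(df, ibis.Table):
        return list(df.execute().itertuples(index=False, name="Row"))
    return df.collect()


def count(df) -> int:
    """Row count, for an Ibis table or a Spark DataFrame."""
    if isinstance(df, ibis.Table):
        return int(df.count().execute())
    return df.count()
```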
    @pytest.fixture(scope="module")
    def dataframes():
    def tables():
    assert unpivoted.columns == expected_columns
    null_data = unpivoted.filter("county = 'Boulder' and end_use = 'heating'").collect()
    assert list(unpivoted.columns) == expected_columns
    null_data = _collect(filter_sql(unpivoted, "county = 'Boulder' and end_use = 'heating'"))
This new syntax is a bit harder to read (right to left instead of left to right).
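If left-to-right readability is the goal, Ibis's native filter API could be used directly instead of the filter_sql wrapper; a sketch against the same unpivoted table:

```python
# Left-to-right alternative using Ibis's native filter on the same table.
null_data = list(
    unpivoted.filter(
        (unpivoted.county == "Boulder") & (unpivoted.end_use == "heating")
    )
    .execute()
    .itertuples(index=False, name="Row")
)
```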
elainethale left a comment
Additional comments on tests.
    monkeypatch.setattr(
        session, "is_runtime_session_active", lambda: (_ for _ in ()).throw(AssertionError)
    )
    drop_temp_tables_and_views()
Should there be a test for expected state after this call?
    monkeypatch.setattr(
        session, "get_spark_session", lambda: (_ for _ in ()).throw(AssertionError)
    )
    drop_temp_tables_and_views()
Should there be a test for expected state after this call?
    - DF.show() (and probably all arithmetics) use spark.sql.session.timeZone
    - DF.toPandas() likely goes through spark.sql.session.timeZone
    - DF.to_pandas() likely goes through spark.sql.session.timeZone
    - DF.collect() converts timestamps to system time_zone (different from spark.sql.session.timeZone!)
Are these still the right method calls?
        return get_runtime_session().sql(f"SELECT {cols_str}, {expr} AS {nc} FROM {view}")


    def from_utc_timestamp(
This is now a light wrapper around a local helper?
    .aggregate(**{VALUE_COLUMN: df[VALUE_COLUMN].mean()})
    )
    expected = filter_sql(expected_df, "hour == 16").execute()[VALUE_COLUMN].iloc[0]
    else:
Why is there ambiguity about the data type?
    def validate_electricity_use_by_state(op, results_path: DataFrame | Path, raw_stats, datasets):
        if isinstance(results_path, DataFrame):
    def validate_electricity_use_by_state(op, results_path: ibis.Table | Path, raw_stats, datasets):
        if isinstance(results_path, ibis.Table | ibis.Table):

Suggested change:

        if isinstance(results_path, ibis.Table):
Reposition Spark as an optional distributed backend with DuckDB as the default. Remove placeholder Spark pages, swap pyspark examples for SQL or DuckDB equivalents in tutorials and how-tos, and update the HPC workflow to reflect backend choice. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
    @@ -85,10 +86,7 @@ allow-direct-references = true
    [project.optional-dependencies]
    spark = [
Should sparkctl[pyspark] be in this list?
    return filename


    # Use ValueError because this gets called in Pydantic model validation.
    # Use ValueError because this gets called in Pydantic model validation.
nit: Awkward addition of spaces in this line and on some comment lines that follow.
    if type_upper not in SUPPORTED_TYPES:
        supported_data_types = sorted(SUPPORTED_TYPES)
        msg = f"{data_type=} is not one of {supported_data_types=}"
        msg = f"{data_type =} is not one of {supported_data_types =}"
nit: I prefer the version on the left (for no great reason). Is there a reason to prefer the version on the right?
From Claude:
No good reason I can think of — it's almost certainly unintentional or a stylistic miss.
In f-string debug syntax (=), whitespace around the = is preserved literally in the output:
f"{data_type=}" → data_type='foo'
f"{data_type =}" → data_type ='foo' (note the stray space before =, none after)
So the new version produces slightly uglier output (data_type ='foo' is not one of supported_data_types =[...]) with no functional benefit. If they wanted symmetric spacing they'd need {data_type = } (space on both sides) → data_type = 'foo', which is at least a defensible style choice. The asymmetric {data_type =} form is just worse than the original.
Worth flagging on the PR.
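A quick demonstration of the behavior described (standard CPython f-string debug specifiers):

```python
data_type = "foo"
print(f"{data_type=}")     # data_type='foo'
print(f"{data_type =}")    # data_type ='foo'
print(f"{data_type = }")   # data_type = 'foo'
```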
    check_uniqueness((x.name for x in self.columns), "column names")


    # Check that ignore_columns don't overlap with columns
    # Check that ignore_columns don't overlap with columns
Another misaligned comment
    columns = sorted(selector.column_values.keys())
    if columns != first:
        msg = f"All selectors must define the same columns: {first=} {columns=}"
        msg = f"All selectors must define the same columns: {first =} {columns =}"
More stray spaces in format strings in this file (here and below).
    def _table_count(table: ibis.Table) -> int:
        count = cast(Any, table.count().execute())
        return int(count)
This helper looks like many scattered through the codebase.
        TimestampType,
    )


    def make_type(dtype: str):
Why isn't this in types.py?
        return StructType(fields)


    def _ibis_type_from_spark_type(data_type: Any) -> str:
Since there is only one function here, move to functions.py or operations.py? Do we expect to add more?
        Use this only for metadata, capped diagnostics, API payloads, or other known-small
        tables. Large dataset tables should stay as lazy Ibis expressions.
        """
        return table.execute()
Calling .execute on a table of type Any produces a pd.DataFrame? Apologies if this is a dumb question.
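Not a dumb question: with the default Ibis configuration, .execute() on a table expression returns a pandas DataFrame (the Any in the signature just reflects loose typing upstream). A quick check:

```python
import ibis
import pandas as pd

table = ibis.memtable(pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
result = table.execute()
assert isinstance(result, pd.DataFrame)  # execute() returns pandas by default
```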
Depends on NatLabRockies/chronify#66
Closes GitHub Issue #
Description
Checklist