Replace Spark/DuckDB APIs with Ibis #414

Draft

daniel-thom wants to merge 16 commits into dsgrid/dsgrid:main from dsgrid/dsgrid:feat/use-ibis

Conversation

@daniel-thom (Contributor) commented Apr 11, 2026

Depends on NatLabRockies/chronify#66

Closes GitHub Issue #

Description

Checklist

  • Tests exercising the new feature or bug fix
  • All tests pass
  • At least one code review approval
  • Consider transferring TODOs to GitHub

Copilot AI left a comment

Pull request overview

This PR migrates dsgrid’s table-processing layer (and the surrounding query, registry, and test code) away from Spark/DuckDB-specific APIs toward a unified Ibis-based runtime interface.

Changes:

  • Introduces a new dsgrid.ibis compatibility layer (backend/session/io/ops/types) and refactors core codepaths to use it.
  • Reworks unit conversion, dataset mapping/expression evaluation, and reporting to operate on Ibis tables/SQL rather than Spark DataFrame APIs.
  • Updates a large portion of tests + docs to reflect the new runtime abstractions and API model renames (e.g., query submission models).
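For readers unfamiliar with Ibis, here is a minimal illustration of the deferred-execution style the new runtime interface builds on. This is generic Ibis usage, not dsgrid's actual wrappers; the file path and column names are placeholders.

```python
import ibis

# DuckDB is the default backend; Spark stays available via ibis.pyspark.connect(session=...)
con = ibis.duckdb.connect()
t = con.read_parquet("load_data.parquet")  # placeholder path

# Expressions stay lazy until explicitly executed.
result = (
    t.filter(t.county == "Boulder")
    .group_by("end_use")
    .aggregate(total=t.value.sum())
)
print(result.to_pyarrow())  # the pattern the updated docs use instead of .show()
```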

Reviewed changes

Copilot reviewed 128 out of 129 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_utilities.py Minor test cleanup (whitespace).
tests/test_unit_conversions.py Switch unit conversion tests from Spark DF ops to Ibis table ops/helpers.
tests/test_representative_time.py Update representative-time tests to run with runtime/Ibis paths and DuckDB/Spark branches.
tests/test_registry_management.py Update registry tests to expect Ibis tables + pandas conversion helpers.
tests/test_project.py Make project tests backend-agnostic via small Ibis/Spark collection helpers.
tests/test_localize_timestamps_if_necessary.py Update chronify localization tests to use runtime session + table-to-pandas helpers.
tests/test_ibis_session.py Add/adjust runtime session tests (read/write/restart/custom conf).
tests/test_ibis_functions.py Update function tests to use Ibis helpers (filter_sql, order_by, _collect).
tests/test_find_minimal_patterns.py Minor assertion formatting cleanup.
tests/test_filter_registry.py Update filtered-registry tests to join/drop via Ibis ops + _collect.
tests/test_filesystem_data_store.py Use dsgrid.ibis.session.create_dataframe_from_dicts.
tests/test_duckdb_data_store.py Use pandas input + Ibis counts for DuckDB store tests.
tests/test_derived_datasets.py Make derived-dataset tests collect/order in Ibis or Spark.
tests/test_datasets.py Update expected error text from “DataFrame” to “Ibis table”.
tests/test_dataset_utils.py Convert dataset utility tests to runtime/Ibis equivalents.
tests/test_dataset_expression_handler.py Convert expression-handler tests to Ibis tables + filter_sql.
tests/test_create_time_dimensions.py Use runtime session instead of spark session for time-dimension test fixtures.
tests/test_convert_time_format_if_necessary.py Update time-format conversion tests to runtime session + Ibis collection.
tests/test_api.py Update API request/response model names + handle 413 inline-result size limit.
tests/test_annual_time.py Update annual-time tests to Ibis tables + backend-agnostic counting/collection.
tests/test_aeo_dataset_registration.py Switch get_unique_values import to Ibis table utils.
tests/simple_standard_scenarios_datasets.py Refactor dataset builders/mappers to Ibis operations (drop/join/rename).
tests/conftest.py Initialize runtime session fixture instead of Spark.
tests/cli/test_registry.py Switch use_duckdb import to Ibis types.
tests/cli/test_config.py Remove unused Path import.
scripts/enumerate_load_table_lookup.py Switch script to Ibis read/join + TablePartition.
scripts/enumerate_load_table_lookup.ipynb Update notebook to runtime session + TablePartition.
pyproject.toml Add ibis-framework[duckdb] dependency; add ty dev dep; add uv source override for chronify.
dsgrid/utils/utilities.py Make IPython imports lazy via importlib and tighten typing for list_enum_values.
dsgrid/utils/timing.py Cast _start for type safety in timer exit.
dsgrid/utils/files.py Write LDJSON with utf-8 (no BOM).
dsgrid/units/power.py Replace Spark column expressions with SQL CASE expression builders.
dsgrid/units/convert.py Rewrite unit conversion for unpivoted tables using Ibis + SQL.
dsgrid/tests/utils.py Use runtime session + Ibis join/drop helpers when reading test parquet.
dsgrid/tests/register_derived_datasets.py Use DatasetConfig.config_filename() instead of DatasetRegistry.
dsgrid/tests/make_us_data_registry.py Minor formatting simplification.
dsgrid/tests/common.py Improve typing/casts around config_class().config_filename().
dsgrid/spark/types.py Remove Spark/DuckDB dual-mode types shim (deleted).
dsgrid/rust_ext/find_minimal_patterns.py Lazy-load Rust extension via importlib for cleaner failure mode.
dsgrid/rust_ext/__init__.py Same lazy-load pattern for Rust types.
dsgrid/registry/versioning.py Fix json5 loading by opening file before json5.load.
dsgrid/registry/registry_manager.py Initialize runtime session instead of Spark session.
dsgrid/registry/registry_manager_base.py Type _db as Any.
dsgrid/registry/registry_interface.py Add asserts/casts around ids/versions for stronger invariants.
dsgrid/registry/registry_auto_updater.py Add casts for model typing in updater loops.
dsgrid/registry/registration_context.py Switch temp-table cleanup import to dsgrid.ibis.temp.
dsgrid/registry/filter_registry_manager.py Convert filtering logic to Ibis table record helpers + is_table_empty.
dsgrid/registry/filesystem_data_store.py Switch store IO types to ibis.Table and runtime read/write helpers.
dsgrid/registry/dimension_registry_manager.py Add asserts/casts; rename remove arg to config_id.
dsgrid/registry/data_store_interface.py Update abstract interface types from Spark DF to ibis.Table.
dsgrid/registry/common.py Rename from_file arg to filename.
dsgrid/query/report_peak_load.py Replace Spark agg with SQL-based max-by-group + Ibis IO.
dsgrid/query/query_context.py Store/restore record ids via Ibis table utils instead of Spark collects.
dsgrid/query/models.py Replace Spark function binding with FunctionReference; rename spark conf model -> runtime conf model.
dsgrid/query/derived_dataset.py Convert derived dataset validation + record writing to Ibis helpers (CSV via backend).
dsgrid/project.py Switch query processing/mapping to runtime session + Ibis join/filter ops.
dsgrid/loggers.py Cast handlers list to satisfy typing when appending.
dsgrid/ibis/types.py New: runtime engine helpers (duckdb vs spark), type helpers, is_table_empty.
dsgrid/ibis/temp.py New: temp view/table naming + cleanup for Spark backend.
dsgrid/ibis/table_utils.py New: execute/collect helpers for “known-small” tables + distinct value helpers.
dsgrid/ibis/partition.py Rename partition helper to TablePartition and adapt to Ibis counting/aggregation.
dsgrid/ibis/operations.py New: cross-backend table ops (filter/join/pivot/unpivot/etc.) implemented via Ibis/SQL.
dsgrid/ibis/io.py New: Ibis read helpers for csv/json/parquet.
dsgrid/ibis/functions.py New: compatibility helpers mirroring prior Spark helpers using Ibis + runtime session.
dsgrid/ibis/backend.py New: construct/cache runtime Ibis backend + chronify store creation.
dsgrid/ibis/__init__.py Package marker.
dsgrid/filesystem/s3_filesystem.py Lazy-load boto3/s3path; minor rglob hidden filtering tweak.
dsgrid/dsgrid_rc.py Allow loading runtime config from a specified filename.
dsgrid/dimension/time_utils.py Fix optional typing for args.
dsgrid/dimension/dimension_filters.py Rewrite filters to SQL-string predicates executed via Ibis SQL.
dsgrid/dataset/unpivoted_table.py Switch aggregation and unit conversion to Ibis tables/SQL.
dsgrid/dataset/table_format_handler_base.py Convert core handler plumbing to Ibis + SQL projections/joins.
dsgrid/dataset/growth_rates.py Convert growth-rate math to Ibis SQL operations.
dsgrid/dataset/dataset.py Switch handler interfaces to Ibis table types; update mapping reference model type.
dsgrid/dataset/dataset_schema_handler_one_table.py Convert schema checks/filtering to Ibis + runtime null checks.
dsgrid/dataset/dataset_mapping_manager.py Use Ibis tables for checkpointed persistence.
dsgrid/dataset/dataset_expression_handler.py Implement dataset arithmetic via SQL over temp views + Ibis backend.
dsgrid/data_models.py Allow from_file to accept str; safer enum description formatting.
dsgrid/config/time_dimension_base_config.py Switch time mapping to Ibis rename helper + updated typing.
dsgrid/config/representative_period_time_dimension_config.py Fix return types to Type[...] for model_class helpers.
dsgrid/config/noop_time_dimension_config.py Fix model_class typing + signature compatibility.
dsgrid/config/mapping_tables.py Build MappingTableModel via model_validate dict.
dsgrid/config/input_dataset_requirements.py Replace conlist with Annotated[..., Field(min_length=1)].
dsgrid/config/index_time_dimension_config.py Fix model_class typing to Type[...].
dsgrid/config/file_schema.py Switch file readers/writers to Ibis IO + rename/drop helpers.
dsgrid/config/dimensions.py Add cast for offset_column access on union model.
dsgrid/config/dimension_mapping_base.py Use enum default value instead of string literal.
dsgrid/config/dimension_config.py Ensure get_unique_ids() returns set[str].
dsgrid/config/date_time_dimension_config.py Fix model_class typing + casts for chronify model creation.
dsgrid/config/dataset_config.py Convert table scanning and trivial dims logic to Ibis SQL/joins.
dsgrid/config/config_base.py Write records via Ibis CSV writer; cast model_class() for typing.
dsgrid/config/common.py Add explicit type annotation for time dimension template dict.
dsgrid/config/annual_time_dimension_config.py Convert annual time operations to runtime session + Ibis SQL.
dsgrid/cli/dsgrid.py Format long call across multiple lines.
dsgrid/cli/common.py Add asserts for exception info; adjust click Option signature typing.
dsgrid/chronify.py Use Ibis-backed chronify store factory; remove explicit store disposal.
dsgrid/apps/registration_gui.py Lazy-load IPython/widgets; runtime session init; fix conn creation.
dsgrid/apps/project_viewer/app.py Cast conditional styling for Dash typing.
dsgrid/api/response_models.py Rename submit-project-query response model.
dsgrid/api/models.py Rename submit-project-query request model.
dsgrid/api/app.py Use runtime session; add inline JSON size limit (413) + pandas conversion helper.
dsgrid/__init__.py Minor formatting change.
docs/source/user_guide/tutorials/map_dataset.md Update narrative from Spark-specific to Ibis runtime/backends.
docs/source/user_guide/project_queries/query_concepts.md Update query lifecycle docs to Ibis runtime terminology.
docs/source/user_guide/apache_spark/overview.md Clarify Spark is an Ibis backend and update wording.
docs/source/user_guide/apache_spark/index.md Update Spark docs intro wording.
docs/source/software_reference/python_api.md Replace .show() examples with to_pyarrow() printing.
docs/source/software_reference/architecture.md Update architecture to Ibis + backend model.
docs/source/index.md Update overview copy to reflect Ibis backend usage.
docs/source/getting_started/installation.md Clarify Java only required for Spark backend.
docs/source/developer_guide/how_tos/how_to_convert_time_zone.md Update terminology/examples from DataFrame to Ibis table + runtime variants.
.gitmodules Point test-data submodule to feat/use-ibis branch.


Comment thread dsgrid/ibis/operations.py
Comment on lines +43 to +46
def rename_columns(df: ibis.Table, mapping: dict[str, str]) -> ibis.Table:
"""Rename columns using a mapping of old name to new name."""
return df.rename({new: old for old, new in mapping.items()})

Copilot AI Apr 11, 2026

rename_columns() inverts the provided mapping: df.rename({new: old for old, new in mapping.items()}). Ibis expects a mapping of existing column names to new column names, so this currently renames in the wrong direction and will fail for callers like rename_columns(df, {"unit": "from_unit"}). Update this to pass the mapping in the correct direction (old -> new) and add a small unit test for a simple rename.
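A sketch of the small unit test the comment asks for, exercising the old-name to new-name semantics that callers expect. The module path and helper name come from this PR's diff; the test itself is illustrative.

```python
import ibis

from dsgrid.ibis.operations import rename_columns  # path as added in this PR


def test_rename_columns_simple():
    t = ibis.memtable({"unit": ["kWh"], "value": [1.0]})
    renamed = rename_columns(t, {"unit": "from_unit"})
    assert "from_unit" in renamed.columns
    assert "unit" not in renamed.columns
```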

Comment thread dsgrid/ibis/operations.py Outdated
Comment on lines +33 to +34
def coalesce(df: ibis.Table, num_partitions: int) -> ibis.Table:
return df
Copilot AI Apr 11, 2026

coalesce() is currently a no-op (returns df unchanged). Several call sites rely on coalesce(..., 1) to produce a single output file/partition (e.g., dsgrid/utils/dataset.py:348-363 asserts only one parquet part file). This will break on the Spark backend and can also cause excessive small files. Implement backend-appropriate coalesce/repartition behavior (or remove the helper and update call sites to avoid assumptions about single output files).
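One possible backend-aware shape, sketched under the assumption that dsgrid.ibis.session exposes get_spark_session() (it is monkeypatched in tests elsewhere in this PR) and that a round trip through a Spark temp view is acceptable. The later commit message says a Spark-specific coalesce was restored, so treat this as illustrative only.

```python
import ibis


def coalesce(df: ibis.Table, num_partitions: int) -> ibis.Table:
    backend = ibis.get_backend(df)
    if backend.name != "pyspark":
        # DuckDB writes a single file per table write, so there is nothing to do.
        return df
    from dsgrid.ibis.session import get_spark_session  # assumed helper from this PR

    spark = get_spark_session()
    sdf = spark.sql(ibis.to_sql(df)).coalesce(num_partitions)
    sdf.createOrReplaceTempView("_dsgrid_coalesced")  # illustrative view name
    return backend.table("_dsgrid_coalesced")
```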

Comment thread dsgrid/ibis/operations.py
Comment on lines +15 to +22
except Exception:
tmp_file = NamedTemporaryFile(suffix=".parquet", delete=False)
tmp_file.close()
df.to_parquet(tmp_file.name)
escaped_path = tmp_file.name.replace("'", "''")
conn = cast(Any, make_runtime_backend().connection)
conn.raw_sql(f"CREATE TEMP VIEW {view} AS SELECT * FROM read_parquet('{escaped_path}')")
return view
Copilot AI Apr 11, 2026

create_temp_view() fallback writes an on-disk temporary parquet file with delete=False and never deletes it. On repeated calls this can leak many temp files and fill disks. Consider (a) restricting this fallback to DuckDB only, (b) registering the temp file path for cleanup (atexit / ScratchDirContext / drop_temp_tables_and_views), or (c) using an in-memory registration approach if supported by the backend.
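A minimal sketch of option (b), registering the fallback parquet path for cleanup. atexit is just the simplest hook here; ScratchDirContext (mentioned above) would be the more dsgrid-native choice.

```python
import atexit
import os
from tempfile import NamedTemporaryFile


def _materialize_to_temp_parquet(df) -> str:
    """Write an Ibis table to a temp parquet file and remove it at interpreter exit."""
    tmp_file = NamedTemporaryFile(suffix=".parquet", delete=False)
    tmp_file.close()
    df.to_parquet(tmp_file.name)
    atexit.register(lambda p=tmp_file.name: os.path.exists(p) and os.unlink(p))
    return tmp_file.name
```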

Comment on lines +162 to +182
case "contains":
expr = f"{column} LIKE '%{_escape_like_value(value)}%'"
case "endswith":
expr = f"{column} LIKE '%{_escape_like_value(value)}'"
case "isNotNull":
expr = f"{column} IS NOT NULL"
case "isNull":
expr = f"{column} IS NULL"
case "isin":
if not isinstance(value, list | tuple | set):
msg = f"value must be a list, tuple, or set for {operator =}"
raise DSGInvalidField(msg)
vals = ", ".join(_make_sql_value(x) for x in value)
expr = f"{column} IN ({vals})"
case "like":
expr = f"{column} LIKE {_make_sql_value(value)}"
case "rlike":
expr = f"{column} RLIKE {_make_sql_value(value)}"
case "startswith":
expr = f"{column} LIKE '{_escape_like_value(value)}%'"
case _:
Copilot AI Apr 11, 2026

The LIKE-based implementations for contains/startswith/endswith only escape single quotes. In Spark, col.contains()/startswith()/endswith() treat the input literally, but SQL LIKE will treat '%' and '_' as wildcards, so values containing those characters will match incorrectly. Consider escaping '%' and '_' (and adding an ESCAPE clause) or using backend string functions that preserve literal semantics.
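A hedged sketch of literal-preserving LIKE escaping along the lines the comment suggests; the helper name mirrors the one quoted above, and the choice of backslash as the escape character is an assumption.

```python
def _escape_like_value(value: str) -> str:
    """Escape LIKE wildcards and quotes so the pattern matches the value literally."""
    return (
        value.replace("\\", "\\\\")  # escape the escape character first
        .replace("%", "\\%")
        .replace("_", "\\_")
        .replace("'", "''")
    )


# e.g. for the "contains" operator, pair the escaped value with an ESCAPE clause:
# expr = f"{column} LIKE '%{_escape_like_value(value)}%' ESCAPE '\\'"
```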

@codecov

codecov Bot commented Apr 11, 2026

Codecov Report

❌ Patch coverage is 90.40558% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.22%. Comparing base (6d8c0bd) to head (9682fa2).

Files with missing lines Patch % Lines
dsgrid/ibis/session.py 91.72% 69 Missing ⚠️
dsgrid/utils/dataset.py 81.81% 34 Missing ⚠️
dsgrid/registry/duckdb_data_store.py 84.76% 16 Missing ⚠️
dsgrid/ibis/functions.py 83.07% 11 Missing ⚠️
dsgrid/dataset/dataset_schema_handler_base.py 89.53% 9 Missing ⚠️
dsgrid/query/query_submitter.py 80.95% 8 Missing ⚠️
dsgrid/registry/project_registry_manager.py 80.00% 8 Missing ⚠️
dsgrid/config/project_config.py 61.11% 7 Missing ⚠️
...rid/registry/dimension_mapping_registry_manager.py 85.71% 5 Missing ⚠️
dsgrid/ibis/operations.py 96.92% 4 Missing ⚠️
... and 23 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #414      +/-   ##
==========================================
+ Coverage   82.09%   85.22%   +3.12%     
==========================================
  Files         124      128       +4     
  Lines       14441    14800     +359     
==========================================
+ Hits        11856    12613     +757     
+ Misses       2585     2187     -398     
Flag Coverage Δ
Linux 83.47% <81.46%> (+1.46%) ⬆️
Windows 83.56% <81.46%> (+3.06%) ⬆️
spark 82.56% <75.97%> (?)

Flags with carried forward coverage won't be shown.

daniel-thom and others added 13 commits April 11, 2026 17:06
Converts set ops, joins, filters, unit conversion, and growth rates to
native Ibis expressions. Adds cross-backend bridging in create_temp_view
and type-alignment helpers so native set ops and joins tolerate the SQL
semantics the prior string-SQL paths had (int/float coercion, schema
widening). Restores a Spark-specific coalesce.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spark's default shuffle.partitions=200 combined with dsgrid's cascaded joins
produced stages with 248k+ tiny tasks on small local jobs, dominating runtime.
Apply small local-mode defaults (shuffle.partitions=4, default.parallelism=4,
AQE + coalescePartitions on) when Spark is running locally. SPARK_CONF_DIR and
spark_conf= overrides still win.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
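The defaults described in the commit above correspond to standard Spark configuration keys; a rough illustration of applying them when building a local session follows (the builder code is an assumption, not the PR's actual implementation).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.default.parallelism", "4")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```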
Replace the Spark smoke subset with the same pytest invocation the DuckDB
job uses so every test runs in both backends. Add Java 21 setup since
Spark 4 requires Java 17+.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@elainethale (Contributor) left a comment

I have a general question regarding the documentation: To what extent do users need to know about Ibis, and to what extent is it just an implementation detail? To me, it would be best for users to just think about "DataFrames" and whether they need to do processing with "Spark" (and thus go through a bit more set up) or not. Certainly it should be documented that DuckDB and Ibis are key technologies, but they might not need to be mentioned as much as they are in this PR's documentation changes.

I was also expecting a larger audit of the existing documentation that covers installation and spark use. I'm not sure we have the right set of high-level pages/sections, and suspect that there's a lot of out-of-date information hanging around and not as much on configuring spark with sparkctl as would be helpful.

Comment thread docs/source/developer_guide/how_tos/how_to_convert_time_zone.md
Comment thread docs/source/software_reference/architecture.md Outdated
project = manager.project_manager.load_project("dsgrid_conus_2022")
geo_dim = project.config.get_base_dimension(DimensionType.GEOGRAPHY)
geo_dim.get_records_dataframe().show()
print(geo_dim.get_records_dataframe().limit(5).to_pyarrow())
Contributor

This is a clunkier interface for inspecting the head of a DataFrame. Is there a better option?

Contributor Author

print(geo_dim.get_records_dataframe().head().execute())

Comment thread docs/source/user_guide/apache_spark/index.md Outdated
Comment thread docs/source/user_guide/project_queries/query_concepts.md
Comment thread docs/source/user_guide/project_queries/query_concepts.md
3. **Combine datasets** - Combine the datasets as specified by the `expression` in the dataset data model of the query. The default is to take a union of all datasets.

4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Spark query from the previous step by writing the dataframe to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Ibis query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
Contributor

Suggested change
4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Ibis query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.

8. **Sort columns** - If the field `sort_columns` in the `result` data model is `true`, sort the table by those columns.

9. **Write output** - Evaluate the Spark job for the previous steps by writing the dataframe to the filesystem in the directory `query_output/<query-name>`.
9. **Write output** - Evaluate the Ibis query for the previous steps by writing the table to the filesystem in the directory `query_output/<query-name>`.
Contributor

Per the suggestion above, can these "Spark job"-to-"Ibis query" replacements all be changed to something like "query"? (And are these descriptions only applicable with Spark, or do they also apply to DuckDB?)


By default, this will attempt to map all dimensions by performing three Spark queries:
By default, this will attempt to map all dimensions in three Ibis query phases:
Contributor

Suggested change
By default, this will attempt to map all dimensions in three Ibis query phases:
By default, this will attempt to map all dimensions in three phases:

Comment thread docs/source/index.md
- **Flexible Mappings**: Map datasets between different dimensional systems with explicit, documented transformations
- **Powerful Queries**: Aggregate, filter, and transform data across dimensions to create custom views
- **Big Data Support**: Process terabyte-scale datasets using Apache Spark or gigabyte-scale datasets using DuckDB
- **Big Data Support**: Use the Ibis table API with DuckDB for local work or Apache Spark for distributed workloads
Contributor

The original version seems better for end-users. I would revert this change.

Comment thread pyproject.toml
"devtools",
"flake8",
"mypy",
"ty",
Contributor

It looks like you've made a bunch of changes to enforce typing, but I'm not seeing pre-commit or ci checks for it. Did I miss those somewhere?

Comment on lines +345 to +365
def _with_sql_column(df, column, expr):
view = create_temp_view(df)
cols = ", ".join(x for x in df.columns if x != column)
return get_runtime_session().sql(f"SELECT {cols}, {expr} AS {column} FROM {view}")


def _collect(df):
if isinstance(df, ibis.Table):
return list(df.execute().itertuples(index=False, name="Row"))
return df.collect()


def _first_value(df, column):
return getattr(_collect(df.limit(1))[0], column)


def _row_value(row, column):
try:
return row[column]
except TypeError:
return getattr(row, column)
Contributor

This seems like a funny place for these fairly generic helper functions.


@pytest.fixture(scope="module")
def pivoted_dataframes(records_dataframe_energy):
def pivoted_tables(records_dataframe_energy):
Contributor

Worthwhile to change these test names?

Contributor Author

They are fixtures, not tests.

Comment thread tests/test_annual_time.py
Comment on lines +25 to +34
def _collect(df):
if isinstance(df, ibis.Table):
return list(df.execute().itertuples(index=False, name="Row"))
return df.collect()


def _count(df) -> int:
if isinstance(df, ibis.Table):
return df.count().execute()
return df.count()
Contributor

These generic helper functions seem misplaced.

Comment on lines +124 to +134
def _count(df):
if isinstance(df, ibis.Table):
return df.count().execute()
return df.count()


def _collect(df):
if isinstance(df, ibis.Table):
return list(df.execute().itertuples(index=False, name="Row"))
return df.collect()

Contributor

I'm going to stop commenting on this now, but lots of simple helper functions are repeated throughout the test files.
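One way to address this duplication, sketched as a single shared module that the test files could import. The location and names are suggestions, not part of the PR.

```python
# tests/helpers.py (hypothetical)
import ibis


def collect(df):
    """Return rows from either an Ibis table or a Spark DataFrame."""
    if isinstance(df, ibis.Table):
        return list(df.execute().itertuples(index=False, name="Row"))
    return df.collect()


def count(df) -> int:
    """Count rows for either an Ibis table or a Spark DataFrame."""
    if isinstance(df, ibis.Table):
        return int(df.count().execute())
    return df.count()
```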


@pytest.fixture(scope="module")
def dataframes():
def tables():
Contributor

Worth changing?

assert unpivoted.columns == expected_columns
null_data = unpivoted.filter("county = 'Boulder' and end_use = 'heating'").collect()
assert list(unpivoted.columns) == expected_columns
null_data = _collect(filter_sql(unpivoted, "county = 'Boulder' and end_use = 'heating'"))
Contributor

This new syntax is a bit harder to read (right to left instead of left to right).

@elainethale (Contributor) left a comment

Additional comments on tests.

Comment thread tests/test_ibis_temp.py
monkeypatch.setattr(
session, "is_runtime_session_active", lambda: (_ for _ in ()).throw(AssertionError)
)
drop_temp_tables_and_views()
Contributor

Should there be a test for expected state after this call?

Comment thread tests/test_ibis_temp.py
monkeypatch.setattr(
session, "get_spark_session", lambda: (_ for _ in ()).throw(AssertionError)
)
drop_temp_tables_and_views()
Contributor

Should there be a test for expected state after this call?

Comment on lines 345 to 347
- DF.show() (and probably all arithmetics) use spark.sql.session.timeZone
- DF.toPandas() likely goes through spark.sql.session.timeZone
- DF.to_pandas() likely goes through spark.sql.session.timeZone
- DF.collect() converts timestamps to system time_zone (different from spark.sql.session.timeZone!)
Contributor

Are these still the right method calls?

return get_runtime_session().sql(f"SELECT {cols_str}, {expr} AS {nc} FROM {view}")


def from_utc_timestamp(
Contributor

This is now a light wrapper around a local helper?

Comment thread tests/test_queries.py
.aggregate(**{VALUE_COLUMN: df[VALUE_COLUMN].mean()})
)
expected = filter_sql(expected_df, "hour == 16").execute()[VALUE_COLUMN].iloc[0]
else:
Contributor

Why is there ambiguity about the data type?

Comment thread tests/test_queries.py
def validate_electricity_use_by_state(op, results_path: DataFrame | Path, raw_stats, datasets):
if isinstance(results_path, DataFrame):
def validate_electricity_use_by_state(op, results_path: ibis.Table | Path, raw_stats, datasets):
if isinstance(results_path, ibis.Table | ibis.Table):
Contributor

Suggested change
if isinstance(results_path, ibis.Table | ibis.Table):
if isinstance(results_path, ibis.Table):

Reposition Spark as an optional distributed backend with DuckDB as the
default. Remove placeholder Spark pages, swap pyspark examples for SQL
or DuckDB equivalents in tutorials and how-tos, and update the HPC
workflow to reflect backend choice.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread pyproject.toml
@@ -85,10 +86,7 @@ allow-direct-references = true
[project.optional-dependencies]
spark = [
Contributor

Should sparkctl[pyspark] be in this list?

return filename

# Use ValueError because this gets called in Pydantic model validation.
# Use ValueError because this gets called in Pydantic model validation.
Contributor

nit: Awkward addition of spaces in this line and on some comment lines that follow.

if type_upper not in SUPPORTED_TYPES:
supported_data_types = sorted(SUPPORTED_TYPES)
msg = f"{data_type=} is not one of {supported_data_types=}"
msg = f"{data_type =} is not one of {supported_data_types =}"
Contributor

nit: I prefer the version on the left (for no great reason). Is there a reason to prefer the version on the right?

Contributor

From Claude:

No good reason I can think of — it's almost certainly unintentional or a stylistic miss.

In f-string debug syntax (=), whitespace around the = is preserved literally in the output:

f"{data_type=}" → data_type='foo'
f"{data_type =}" → data_type ='foo' (note the stray space before =, none after)
So the new version produces slightly uglier output (data_type ='foo' is not one of supported_data_types =[...]) with no functional benefit. If they wanted symmetric spacing they'd need {data_type = } (space on both sides) → data_type = 'foo', which is at least a defensible style choice. The asymmetric {data_type =} form is just worse than the original.

Worth flagging on the PR.

check_uniqueness((x.name for x in self.columns), "column names")

# Check that ignore_columns don't overlap with columns
# Check that ignore_columns don't overlap with columns
Contributor

Another misaligned comment

columns = sorted(selector.column_values.keys())
if columns != first:
msg = f"All selectors must define the same columns: {first=} {columns=}"
msg = f"All selectors must define the same columns: {first =} {columns =}"
Contributor

More stray spaces in format strings in this file (here and below).

Comment thread dsgrid/ibis/session.py
Comment on lines +1193 to +1195
def _table_count(table: ibis.Table) -> int:
count = cast(Any, table.count().execute())
return int(count)
Contributor

This helper looks like many scattered through the codebase.

Comment thread dsgrid/ibis/session.py
TimestampType,
)

def make_type(dtype: str):
Contributor

Why isn't this in types.py?

Comment thread dsgrid/ibis/session.py
return StructType(fields)


def _ibis_type_from_spark_type(data_type: Any) -> str:
Contributor

Again--types.py?

Comment thread dsgrid/ibis/spark_only.py
Contributor

Since there is only one function here, move to functions.py or operations.py? Do we expect to add more?

Use this only for metadata, capped diagnostics, API payloads, or other known-small
tables. Large dataset tables should stay as lazy Ibis expressions.
"""
return table.execute()
Contributor

Calling .execute on a table of type Any produces a pd.DataFrame? Apologies if this is a dumb question.
