Replace Spark/DuckDB APIs with Ibis#414
daniel-thom wants to merge 16 commits into dsgrid/dsgrid:main
Conversation
Pull request overview
This PR migrates dsgrid’s table-processing layer (and the surrounding query, registry, and test code) away from Spark/DuckDB-specific APIs toward a unified Ibis-based runtime interface.
Changes:
- Introduces a new `dsgrid.ibis` compatibility layer (backend/session/io/ops/types) and refactors core codepaths to use it.
- Reworks unit conversion, dataset mapping/expression evaluation, and reporting to operate on Ibis tables/SQL rather than Spark DataFrame APIs.
- Updates a large portion of tests + docs to reflect the new runtime abstractions and API model renames (e.g., query submission models).
Reviewed changes
Copilot reviewed 128 out of 129 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_utilities.py | Minor test cleanup (whitespace). |
| tests/test_unit_conversions.py | Switch unit conversion tests from Spark DF ops to Ibis table ops/helpers. |
| tests/test_representative_time.py | Update representative-time tests to run with runtime/Ibis paths and DuckDB/Spark branches. |
| tests/test_registry_management.py | Update registry tests to expect Ibis tables + pandas conversion helpers. |
| tests/test_project.py | Make project tests backend-agnostic via small Ibis/Spark collection helpers. |
| tests/test_localize_timestamps_if_necessary.py | Update chronify localization tests to use runtime session + table-to-pandas helpers. |
| tests/test_ibis_session.py | Add/adjust runtime session tests (read/write/restart/custom conf). |
| tests/test_ibis_functions.py | Update function tests to use Ibis helpers (filter_sql, order_by, _collect). |
| tests/test_find_minimal_patterns.py | Minor assertion formatting cleanup. |
| tests/test_filter_registry.py | Update filtered-registry tests to join/drop via Ibis ops + _collect. |
| tests/test_filesystem_data_store.py | Use dsgrid.ibis.session.create_dataframe_from_dicts. |
| tests/test_duckdb_data_store.py | Use pandas input + Ibis counts for DuckDB store tests. |
| tests/test_derived_datasets.py | Make derived-dataset tests collect/order in Ibis or Spark. |
| tests/test_datasets.py | Update expected error text from “DataFrame” to “Ibis table”. |
| tests/test_dataset_utils.py | Convert dataset utility tests to runtime/Ibis equivalents. |
| tests/test_dataset_expression_handler.py | Convert expression-handler tests to Ibis tables + filter_sql. |
| tests/test_create_time_dimensions.py | Use runtime session instead of spark session for time-dimension test fixtures. |
| tests/test_convert_time_format_if_necessary.py | Update time-format conversion tests to runtime session + Ibis collection. |
| tests/test_api.py | Update API request/response model names + handle 413 inline-result size limit. |
| tests/test_annual_time.py | Update annual-time tests to Ibis tables + backend-agnostic counting/collection. |
| tests/test_aeo_dataset_registration.py | Switch get_unique_values import to Ibis table utils. |
| tests/simple_standard_scenarios_datasets.py | Refactor dataset builders/mappers to Ibis operations (drop/join/rename). |
| tests/conftest.py | Initialize runtime session fixture instead of Spark. |
| tests/cli/test_registry.py | Switch use_duckdb import to Ibis types. |
| tests/cli/test_config.py | Remove unused Path import. |
| scripts/enumerate_load_table_lookup.py | Switch script to Ibis read/join + TablePartition. |
| scripts/enumerate_load_table_lookup.ipynb | Update notebook to runtime session + TablePartition. |
| pyproject.toml | Add ibis-framework[duckdb] dependency; add ty dev dep; add uv source override for chronify. |
| dsgrid/utils/utilities.py | Make IPython imports lazy via importlib and tighten typing for list_enum_values. |
| dsgrid/utils/timing.py | Cast _start for type safety in timer exit. |
| dsgrid/utils/files.py | Write LDJSON with utf-8 (no BOM). |
| dsgrid/units/power.py | Replace Spark column expressions with SQL CASE expression builders. |
| dsgrid/units/convert.py | Rewrite unit conversion for unpivoted tables using Ibis + SQL. |
| dsgrid/tests/utils.py | Use runtime session + Ibis join/drop helpers when reading test parquet. |
| dsgrid/tests/register_derived_datasets.py | Use DatasetConfig.config_filename() instead of DatasetRegistry. |
| dsgrid/tests/make_us_data_registry.py | Minor formatting simplification. |
| dsgrid/tests/common.py | Improve typing/casts around config_class().config_filename(). |
| dsgrid/spark/types.py | Remove Spark/DuckDB dual-mode types shim (deleted). |
| dsgrid/rust_ext/find_minimal_patterns.py | Lazy-load Rust extension via importlib for cleaner failure mode. |
| dsgrid/rust_ext/__init__.py | Same lazy-load pattern for Rust types. |
| dsgrid/registry/versioning.py | Fix json5 loading by opening file before json5.load. |
| dsgrid/registry/registry_manager.py | Initialize runtime session instead of Spark session. |
| dsgrid/registry/registry_manager_base.py | Type _db as Any. |
| dsgrid/registry/registry_interface.py | Add asserts/casts around ids/versions for stronger invariants. |
| dsgrid/registry/registry_auto_updater.py | Add casts for model typing in updater loops. |
| dsgrid/registry/registration_context.py | Switch temp-table cleanup import to dsgrid.ibis.temp. |
| dsgrid/registry/filter_registry_manager.py | Convert filtering logic to Ibis table record helpers + is_table_empty. |
| dsgrid/registry/filesystem_data_store.py | Switch store IO types to ibis.Table and runtime read/write helpers. |
| dsgrid/registry/dimension_registry_manager.py | Add asserts/casts; rename remove arg to config_id. |
| dsgrid/registry/data_store_interface.py | Update abstract interface types from Spark DF to ibis.Table. |
| dsgrid/registry/common.py | Rename from_file arg to filename. |
| dsgrid/query/report_peak_load.py | Replace Spark agg with SQL-based max-by-group + Ibis IO. |
| dsgrid/query/query_context.py | Store/restore record ids via Ibis table utils instead of Spark collects. |
| dsgrid/query/models.py | Replace Spark function binding with FunctionReference; rename spark conf model -> runtime conf model. |
| dsgrid/query/derived_dataset.py | Convert derived dataset validation + record writing to Ibis helpers (CSV via backend). |
| dsgrid/project.py | Switch query processing/mapping to runtime session + Ibis join/filter ops. |
| dsgrid/loggers.py | Cast handlers list to satisfy typing when appending. |
| dsgrid/ibis/types.py | New: runtime engine helpers (duckdb vs spark), type helpers, is_table_empty. |
| dsgrid/ibis/temp.py | New: temp view/table naming + cleanup for Spark backend. |
| dsgrid/ibis/table_utils.py | New: execute/collect helpers for “known-small” tables + distinct value helpers. |
| dsgrid/ibis/partition.py | Rename partition helper to TablePartition and adapt to Ibis counting/aggregation. |
| dsgrid/ibis/operations.py | New: cross-backend table ops (filter/join/pivot/unpivot/etc.) implemented via Ibis/SQL. |
| dsgrid/ibis/io.py | New: Ibis read helpers for csv/json/parquet. |
| dsgrid/ibis/functions.py | New: compatibility helpers mirroring prior Spark helpers using Ibis + runtime session. |
| dsgrid/ibis/backend.py | New: construct/cache runtime Ibis backend + chronify store creation. |
| dsgrid/ibis/__init__.py | Package marker. |
| dsgrid/filesystem/s3_filesystem.py | Lazy-load boto3/s3path; minor rglob hidden filtering tweak. |
| dsgrid/dsgrid_rc.py | Allow loading runtime config from a specified filename. |
| dsgrid/dimension/time_utils.py | Fix optional typing for args. |
| dsgrid/dimension/dimension_filters.py | Rewrite filters to SQL-string predicates executed via Ibis SQL. |
| dsgrid/dataset/unpivoted_table.py | Switch aggregation and unit conversion to Ibis tables/SQL. |
| dsgrid/dataset/table_format_handler_base.py | Convert core handler plumbing to Ibis + SQL projections/joins. |
| dsgrid/dataset/growth_rates.py | Convert growth-rate math to Ibis SQL operations. |
| dsgrid/dataset/dataset.py | Switch handler interfaces to Ibis table types; update mapping reference model type. |
| dsgrid/dataset/dataset_schema_handler_one_table.py | Convert schema checks/filtering to Ibis + runtime null checks. |
| dsgrid/dataset/dataset_mapping_manager.py | Use Ibis tables for checkpointed persistence. |
| dsgrid/dataset/dataset_expression_handler.py | Implement dataset arithmetic via SQL over temp views + Ibis backend. |
| dsgrid/data_models.py | Allow from_file to accept str; safer enum description formatting. |
| dsgrid/config/time_dimension_base_config.py | Switch time mapping to Ibis rename helper + updated typing. |
| dsgrid/config/representative_period_time_dimension_config.py | Fix return types to Type[...] for model_class helpers. |
| dsgrid/config/noop_time_dimension_config.py | Fix model_class typing + signature compatibility. |
| dsgrid/config/mapping_tables.py | Build MappingTableModel via model_validate dict. |
| dsgrid/config/input_dataset_requirements.py | Replace conlist with Annotated[..., Field(min_length=1)]. |
| dsgrid/config/index_time_dimension_config.py | Fix model_class typing to Type[...]. |
| dsgrid/config/file_schema.py | Switch file readers/writers to Ibis IO + rename/drop helpers. |
| dsgrid/config/dimensions.py | Add cast for offset_column access on union model. |
| dsgrid/config/dimension_mapping_base.py | Use enum default value instead of string literal. |
| dsgrid/config/dimension_config.py | Ensure get_unique_ids() returns set[str]. |
| dsgrid/config/date_time_dimension_config.py | Fix model_class typing + casts for chronify model creation. |
| dsgrid/config/dataset_config.py | Convert table scanning and trivial dims logic to Ibis SQL/joins. |
| dsgrid/config/config_base.py | Write records via Ibis CSV writer; cast model_class() for typing. |
| dsgrid/config/common.py | Add explicit type annotation for time dimension template dict. |
| dsgrid/config/annual_time_dimension_config.py | Convert annual time operations to runtime session + Ibis SQL. |
| dsgrid/cli/dsgrid.py | Format long call across multiple lines. |
| dsgrid/cli/common.py | Add asserts for exception info; adjust click Option signature typing. |
| dsgrid/chronify.py | Use Ibis-backed chronify store factory; remove explicit store disposal. |
| dsgrid/apps/registration_gui.py | Lazy-load IPython/widgets; runtime session init; fix conn creation. |
| dsgrid/apps/project_viewer/app.py | Cast conditional styling for Dash typing. |
| dsgrid/api/response_models.py | Rename submit-project-query response model. |
| dsgrid/api/models.py | Rename submit-project-query request model. |
| dsgrid/api/app.py | Use runtime session; add inline JSON size limit (413) + pandas conversion helper. |
| dsgrid/__init__.py | Minor formatting change. |
| docs/source/user_guide/tutorials/map_dataset.md | Update narrative from Spark-specific to Ibis runtime/backends. |
| docs/source/user_guide/project_queries/query_concepts.md | Update query lifecycle docs to Ibis runtime terminology. |
| docs/source/user_guide/apache_spark/overview.md | Clarify Spark is an Ibis backend and update wording. |
| docs/source/user_guide/apache_spark/index.md | Update Spark docs intro wording. |
| docs/source/software_reference/python_api.md | Replace .show() examples with to_pyarrow() printing. |
| docs/source/software_reference/architecture.md | Update architecture to Ibis + backend model. |
| docs/source/index.md | Update overview copy to reflect Ibis backend usage. |
| docs/source/getting_started/installation.md | Clarify Java only required for Spark backend. |
| docs/source/developer_guide/how_tos/how_to_convert_time_zone.md | Update terminology/examples from DataFrame to Ibis table + runtime variants. |
| .gitmodules | Point test-data submodule to feat/use-ibis branch. |
    def rename_columns(df: ibis.Table, mapping: dict[str, str]) -> ibis.Table:
        """Rename columns using a mapping of old name to new name."""
        return df.rename({new: old for old, new in mapping.items()})
rename_columns() inverts the provided mapping: df.rename({new: old for old, new in mapping.items()}). Ibis expects a mapping of existing column names to new column names, so this currently renames in the wrong direction and will fail for callers like rename_columns(df, {"unit": "from_unit"}). Update this to pass the mapping in the correct direction (old -> new) and add a small unit test for a simple rename.
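A minimal test along the lines suggested could look like the sketch below; the import path dsgrid.ibis.operations is an assumption, and the assertion only checks the documented old-to-new contract of the helper.

```python
import ibis
import pandas as pd

# Hypothetical import path; point this at wherever rename_columns actually lives.
from dsgrid.ibis.operations import rename_columns


def test_rename_columns_simple_rename():
    df = ibis.memtable(pd.DataFrame({"unit": ["kWh", "MWh"], "value": [1.0, 2.0]}))
    renamed = rename_columns(df, {"unit": "from_unit"})
    # The docstring's contract is old name -> new name.
    assert set(renamed.columns) == {"from_unit", "value"}
```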
    def coalesce(df: ibis.Table, num_partitions: int) -> ibis.Table:
        return df
coalesce() is currently a no-op (returns df unchanged). Several call sites rely on coalesce(..., 1) to produce a single output file/partition (e.g., dsgrid/utils/dataset.py:348-363 asserts only one parquet part file). This will break on the Spark backend and can also cause excessive small files. Implement backend-appropriate coalesce/repartition behavior (or remove the helper and update call sites to avoid assumptions about single output files).
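If the helper is removed, call sites that truly need a single output file could instead materialize known-small results and write one parquet file themselves; write_single_parquet below is a hypothetical sketch of that approach, not code from this PR.

```python
from pathlib import Path

import ibis
import pyarrow.parquet as pq


def write_single_parquet(table: ibis.Table, path: Path) -> None:
    """Write an Ibis table to exactly one parquet file.

    Only suitable for results small enough to collect in memory; large tables
    should keep using the backend's native (possibly multi-file) writer.
    """
    arrow_table = table.to_pyarrow()  # materializes the expression
    pq.write_table(arrow_table, path)
```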
    except Exception:
        tmp_file = NamedTemporaryFile(suffix=".parquet", delete=False)
        tmp_file.close()
        df.to_parquet(tmp_file.name)
        escaped_path = tmp_file.name.replace("'", "''")
        conn = cast(Any, make_runtime_backend().connection)
        conn.raw_sql(f"CREATE TEMP VIEW {view} AS SELECT * FROM read_parquet('{escaped_path}')")
        return view
create_temp_view() fallback writes an on-disk temporary parquet file with delete=False and never deletes it. On repeated calls this can leak many temp files and fill disks. Consider (a) restricting this fallback to DuckDB only, (b) registering the temp file path for cleanup (atexit / ScratchDirContext / drop_temp_tables_and_views), or (c) using an in-memory registration approach if supported by the backend.
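Option (b) could be as simple as tracking the fallback files and deleting them at exit; the sketch below (register_temp_parquet is a hypothetical helper) shows the idea, though hooking into ScratchDirContext or drop_temp_tables_and_views would fit the existing cleanup paths better.

```python
import atexit
import os
from tempfile import NamedTemporaryFile

_TEMP_PARQUET_FILES: list[str] = []


def register_temp_parquet(suffix: str = ".parquet") -> str:
    """Create a temp file whose path is deleted at interpreter exit."""
    tmp_file = NamedTemporaryFile(suffix=suffix, delete=False)
    tmp_file.close()
    _TEMP_PARQUET_FILES.append(tmp_file.name)
    return tmp_file.name


@atexit.register
def _cleanup_temp_parquet_files() -> None:
    for path in _TEMP_PARQUET_FILES:
        if os.path.exists(path):
            os.unlink(path)
```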
| case "contains": | ||
| expr = f"{column} LIKE '%{_escape_like_value(value)}%'" | ||
| case "endswith": | ||
| expr = f"{column} LIKE '%{_escape_like_value(value)}'" | ||
| case "isNotNull": | ||
| expr = f"{column} IS NOT NULL" | ||
| case "isNull": | ||
| expr = f"{column} IS NULL" | ||
| case "isin": | ||
| if not isinstance(value, list | tuple | set): | ||
| msg = f"value must be a list, tuple, or set for {operator =}" | ||
| raise DSGInvalidField(msg) | ||
| vals = ", ".join(_make_sql_value(x) for x in value) | ||
| expr = f"{column} IN ({vals})" | ||
| case "like": | ||
| expr = f"{column} LIKE {_make_sql_value(value)}" | ||
| case "rlike": | ||
| expr = f"{column} RLIKE {_make_sql_value(value)}" | ||
| case "startswith": | ||
| expr = f"{column} LIKE '{_escape_like_value(value)}%'" | ||
| case _: |
There was a problem hiding this comment.
The LIKE-based implementations for contains/startswith/endswith only escape single quotes. In Spark, col.contains()/startswith()/endswith() treat the input literally, but SQL LIKE will treat '%' and '_' as wildcards, so values containing those characters will match incorrectly. Consider escaping '%' and '_' (and adding an ESCAPE clause) or using backend string functions that preserve literal semantics.
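A sketch of that escaping, assuming a revised _escape_like_value and that the target backends accept a standard SQL ESCAPE clause (DuckDB and Spark SQL both do):

```python
def _escape_like_value(value: str) -> str:
    """Escape LIKE wildcards using '!' and double any single quotes."""
    return (
        value.replace("!", "!!")
        .replace("%", "!%")
        .replace("_", "!_")
        .replace("'", "''")
    )


def _contains_expr(column: str, value: str) -> str:
    # ESCAPE '!' tells the backend to treat !% and !_ as literal characters.
    return f"{column} LIKE '%{_escape_like_value(value)}%' ESCAPE '!'"
```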
Force-pushed from 993c64e to f10e133
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main     #414      +/-   ##
    ==========================================
    + Coverage   82.09%   85.22%   +3.12%
    ==========================================
      Files         124      128       +4
      Lines       14441    14800     +359
    ==========================================
    + Hits        11856    12613     +757
    + Misses       2585     2187     -398

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Converts set ops, joins, filters, unit conversion, and growth rates to native Ibis expressions. Adds cross-backend bridging in create_temp_view and type-alignment helpers so native set ops and joins tolerate the SQL semantics the prior string-SQL paths had (int/float coercion, schema widening). Restores a Spark-specific coalesce. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Spark's default shuffle.partitions=200 combined with dsgrid's cascaded joins produced stages with 248k+ tiny tasks on small local jobs, dominating runtime. Apply small local-mode defaults (shuffle.partitions=4, default.parallelism=4, AQE + coalescePartitions on) when Spark is running locally. SPARK_CONF_DIR and spark_conf= overrides still win. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
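For reference, the local-mode defaults described above correspond roughly to the following standard Spark settings (a sketch, not the PR's exact implementation):

```python
from pyspark.sql import SparkSession

# Small local-mode defaults; SPARK_CONF_DIR and explicit spark_conf= overrides
# are expected to take precedence over these.
spark = (
    SparkSession.builder.master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")
    .config("spark.default.parallelism", "4")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```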
Replace the Spark smoke subset with the same pytest invocation the DuckDB job uses so every test runs in both backends. Add Java 21 setup since Spark 4 requires Java 17+. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 368e2b5 to d52fdc5
elainethale left a comment
I have a general question regarding the documentation: To what extent do users need to know about Ibis, and to what extent is it just an implementation detail? To me, it would be best for users to just think about "DataFrames" and whether they need to do processing with "Spark" (and thus go through a bit more set up) or not. Certainly it should be documented that DuckDB and Ibis are key technologies, but they might not need to be mentioned as much as they are in this PR's documentation changes.
I was also expecting a larger audit of the existing documentation that covers installation and spark use. I'm not sure we have the right set of high-level pages/sections, and suspect that there's a lot of out-of-date information hanging around and not as much on configuring spark with sparkctl as would be helpful.
    project = manager.project_manager.load_project("dsgrid_conus_2022")
    geo_dim = project.config.get_base_dimension(DimensionType.GEOGRAPHY)
    geo_dim.get_records_dataframe().show()
    print(geo_dim.get_records_dataframe().limit(5).to_pyarrow())
This is a clunkier interface for inspecting the head of a DataFrame. Is there a better option?
print(geo_dim.get_records_dataframe().head().execute())
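Ibis also has an interactive mode that makes bare expressions print a preview, which reads closer to Spark's .show(); a sketch reusing the snippet above:

```python
import ibis

# Interactive mode makes bare expressions print a small preview table.
ibis.options.interactive = True
geo_dim.get_records_dataframe().head()

# Equivalent explicit forms:
print(geo_dim.get_records_dataframe().head().execute())       # pandas preview
print(geo_dim.get_records_dataframe().limit(5).to_pyarrow())  # pyarrow preview
```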
    3. **Combine datasets** - Combine the datasets as specified by the `expression` in the dataset data model of the query. The default is to take a union of all datasets.
    4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Spark query from the previous step by writing the dataframe to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
    4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the Ibis query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.

Suggested change (replace "the Ibis query" with "the query"):

    4. **Persist intermediate table** - If the option `--persist-intermediate-table` is `true` (which is the default) then dsgrid will evaluate the query from the previous step by writing the table to the filesystem in the directory `query_output/cached_tables`. This can be disabled by setting `--no-persist-intermediate-table`.
    8. **Sort columns** - If the field `sort_columns` in the `result` data model is `true`, sort the table by those columns.
    9. **Write output** - Evaluate the Spark job for the previous steps by writing the dataframe to the filesystem in the directory `query_output/<query-name>`.
    9. **Write output** - Evaluate the Ibis query for the previous steps by writing the table to the filesystem in the directory `query_output/<query-name>`.
Per the suggestion above, can these "Spark job" to "Ibis query" replacements all be changed to something like "query"? (And are these descriptions only applicable with Spark, or do they also apply to DuckDB?)
    ```
    By default, this will attempt to map all dimensions by performing three Spark queries:
    By default, this will attempt to map all dimensions in three Ibis query phases:

Suggested change:

    By default, this will attempt to map all dimensions in three phases:
    - **Flexible Mappings**: Map datasets between different dimensional systems with explicit, documented transformations
    - **Powerful Queries**: Aggregate, filter, and transform data across dimensions to create custom views
    - **Big Data Support**: Process terabyte-scale datasets using Apache Spark or gigabyte-scale datasets using DuckDB
    - **Big Data Support**: Use the Ibis table API with DuckDB for local work or Apache Spark for distributed workloads
The original version seems better for end-users. I would revert this change.
| "devtools", | ||
| "flake8", | ||
| "mypy", | ||
| "ty", |
There was a problem hiding this comment.
It looks like you've made a bunch of changes to enforce typing, but I'm not seeing pre-commit or ci checks for it. Did I miss those somewhere?
    def _with_sql_column(df, column, expr):
        view = create_temp_view(df)
        cols = ", ".join(x for x in df.columns if x != column)
        return get_runtime_session().sql(f"SELECT {cols}, {expr} AS {column} FROM {view}")


    def _collect(df):
        if isinstance(df, ibis.Table):
            return list(df.execute().itertuples(index=False, name="Row"))
        return df.collect()


    def _first_value(df, column):
        return getattr(_collect(df.limit(1))[0], column)


    def _row_value(row, column):
        try:
            return row[column]
        except TypeError:
            return getattr(row, column)
This seems like a funny place for these fairly generic helper functions.
    @pytest.fixture(scope="module")
    def pivoted_dataframes(records_dataframe_energy):
    def pivoted_tables(records_dataframe_energy):
Worthwhile to change these test names?
They are fixtures, not tests.
    def _collect(df):
        if isinstance(df, ibis.Table):
            return list(df.execute().itertuples(index=False, name="Row"))
        return df.collect()


    def _count(df) -> int:
        if isinstance(df, ibis.Table):
            return df.count().execute()
        return df.count()
These generic helper functions seem misplaced.
    def _count(df):
        if isinstance(df, ibis.Table):
            return df.count().execute()
        return df.count()


    def _collect(df):
        if isinstance(df, ibis.Table):
            return list(df.execute().itertuples(index=False, name="Row"))
        return df.collect()
I'm going to stop commenting on this now, but lots of simple helper functions are repeated throughout the test files.
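One way to address the duplication would be a single shared module that the tests import; tests/helpers.py below is a hypothetical location, with the function bodies taken from the repeated helpers above.

```python
# tests/helpers.py (hypothetical shared location)
import ibis


def collect(df):
    """Rows as named tuples, for an Ibis table or a Spark DataFrame."""
    if isinstance(df, ibis.Table):
        return list(df.execute().itertuples(index=False, name="Row"))
    return df.collect()


def count(df) -> int:
    """Row count, for an Ibis table or a Spark DataFrame."""
    if isinstance(df, ibis.Table):
        return int(df.count().execute())
    return df.count()
```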
    @pytest.fixture(scope="module")
    def dataframes():
    def tables():
    assert unpivoted.columns == expected_columns
    null_data = unpivoted.filter("county = 'Boulder' and end_use = 'heating'").collect()
    assert list(unpivoted.columns) == expected_columns
    null_data = _collect(filter_sql(unpivoted, "county = 'Boulder' and end_use = 'heating'"))
This new syntax is a bit harder to read (right to left instead of left to right).
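If left-to-right readability is the goal, Ibis's native filter API could be used directly instead of the filter_sql wrapper; a sketch against the same unpivoted table:

```python
# Left-to-right alternative using Ibis's native filter on the same table.
null_data = list(
    unpivoted.filter(
        (unpivoted.county == "Boulder") & (unpivoted.end_use == "heating")
    )
    .execute()
    .itertuples(index=False, name="Row")
)
```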
elainethale left a comment
Additional comments on tests.
    monkeypatch.setattr(
        session, "is_runtime_session_active", lambda: (_ for _ in ()).throw(AssertionError)
    )
    drop_temp_tables_and_views()
Should there be a test for expected state after this call?
    monkeypatch.setattr(
        session, "get_spark_session", lambda: (_ for _ in ()).throw(AssertionError)
    )
    drop_temp_tables_and_views()
Should there be a test for expected state after this call?
    - DF.show() (and probably all arithmetics) use spark.sql.session.timeZone
    - DF.toPandas() likely goes through spark.sql.session.timeZone
    - DF.to_pandas() likely goes through spark.sql.session.timeZone
    - DF.collect() converts timestamps to system time_zone (different from spark.sql.session.timeZone!)
Are these still the right method calls?
        return get_runtime_session().sql(f"SELECT {cols_str}, {expr} AS {nc} FROM {view}")


    def from_utc_timestamp(
This is now a light wrapper around a local helper?
    .aggregate(**{VALUE_COLUMN: df[VALUE_COLUMN].mean()})
    )
    expected = filter_sql(expected_df, "hour == 16").execute()[VALUE_COLUMN].iloc[0]
    else:
Why is there ambiguity about the data type?
    def validate_electricity_use_by_state(op, results_path: DataFrame | Path, raw_stats, datasets):
        if isinstance(results_path, DataFrame):
    def validate_electricity_use_by_state(op, results_path: ibis.Table | Path, raw_stats, datasets):
        if isinstance(results_path, ibis.Table | ibis.Table):

Suggested change:

        if isinstance(results_path, ibis.Table):
Reposition Spark as an optional distributed backend with DuckDB as the default. Remove placeholder Spark pages, swap pyspark examples for SQL or DuckDB equivalents in tutorials and how-tos, and update the HPC workflow to reflect backend choice. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
    @@ -85,10 +86,7 @@ allow-direct-references = true
    [project.optional-dependencies]
    spark = [
Should sparkctl[pyspark] be in this list?
    return filename


    # Use ValueError because this gets called in Pydantic model validation.
    # Use ValueError because this gets called in Pydantic model validation.
nit: Awkward addition of spaces in this line and on some comment lines that follow.
    if type_upper not in SUPPORTED_TYPES:
        supported_data_types = sorted(SUPPORTED_TYPES)
        msg = f"{data_type=} is not one of {supported_data_types=}"
        msg = f"{data_type =} is not one of {supported_data_types =}"
nit: I prefer the version on the left (for no great reason). Is there a reason to prefer the version on the right?
From Claude:
No good reason I can think of — it's almost certainly unintentional or a stylistic miss.
In f-string debug syntax (=), whitespace around the = is preserved literally in the output:
f"{data_type=}" → data_type='foo'
f"{data_type =}" → data_type ='foo' (note the stray space before =, none after)
So the new version produces slightly uglier output (data_type ='foo' is not one of supported_data_types =[...]) with no functional benefit. If they wanted symmetric spacing they'd need {data_type = } (space on both sides) → data_type = 'foo', which is at least a defensible style choice. The asymmetric {data_type =} form is just worse than the original.
Worth flagging on the PR.
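A quick demonstration of the behavior described (standard CPython f-string debug specifiers):

```python
data_type = "foo"
print(f"{data_type=}")     # data_type='foo'
print(f"{data_type =}")    # data_type ='foo'
print(f"{data_type = }")   # data_type = 'foo'
```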
    check_uniqueness((x.name for x in self.columns), "column names")


    # Check that ignore_columns don't overlap with columns
    # Check that ignore_columns don't overlap with columns
Another misaligned comment
    columns = sorted(selector.column_values.keys())
    if columns != first:
        msg = f"All selectors must define the same columns: {first=} {columns=}"
        msg = f"All selectors must define the same columns: {first =} {columns =}"
More stray spaces in format strings in this file (here and below).
    def _table_count(table: ibis.Table) -> int:
        count = cast(Any, table.count().execute())
        return int(count)
This helper looks like many scattered through the codebase.
        TimestampType,
    )


    def make_type(dtype: str):
Why isn't this in types.py?
        return StructType(fields)


    def _ibis_type_from_spark_type(data_type: Any) -> str:
Since there is only one function here, move to functions.py or operations.py? Do we expect to add more?
        Use this only for metadata, capped diagnostics, API payloads, or other known-small
        tables. Large dataset tables should stay as lazy Ibis expressions.
        """
        return table.execute()
Calling .execute on a table of type Any produces a pd.DataFrame? Apologies if this is a dumb question.
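Not a dumb question: with the default Ibis configuration, .execute() on a table expression returns a pandas DataFrame (the Any in the signature just reflects loose typing upstream). A quick check:

```python
import ibis
import pandas as pd

table = ibis.memtable(pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}))
result = table.execute()
assert isinstance(result, pd.DataFrame)  # execute() returns pandas by default
```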
Depends on NatLabRockies/chronify#66
Closes GitHub Issue #
Description
Checklist