feat: support array output in remote_function #1057


Merged 54 commits on Jan 15, 2025

@shobsi shobsi commented Oct 7, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
    • remote_function: screen/5rMtCZVaUYKdqxP
    • Series.apply: screen/9HkKMuWxMvbbPgf
    • DataFrame.apply: screen/BoXH9A7d4hGpETu

Fixes internal issue 298876217 🦕

This is a feature request to support use cases like creating custom
feature vectors, embeddings, etc.
@shobsi shobsi requested review from a team as code owners October 7, 2024 17:28
@shobsi shobsi requested a review from GarrettWu October 7, 2024 17:28
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Oct 7, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Oct 7, 2024
@shobsi shobsi marked this pull request as draft October 8, 2024 00:42
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Oct 10, 2024

# if the output is an array, reconstruct it from the json serialized
# string form
if bigframes.dtypes.is_array_like(func.output_dtype):
Contributor

Do we actually handle any array-like dtype?

Contributor Author

Um, in this PR we are looking to support types like list[int] on the output side. Or did I misunderstand the question?
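
For readers following the thread: the code comment under review says array outputs are rebuilt from a JSON-serialized string. Below is a minimal sketch of that round trip in plain Python, using only the standard `json` module. The helper names `serialize_output` and `reconstruct_array` are hypothetical and are not part of bigframes; the PR itself does the conversion on the BigQuery side.

```python
import json
from typing import List


def serialize_output(values: List[int]) -> str:
    """Encode a list as a JSON string (hypothetical helper, for illustration)."""
    return json.dumps(values)


def reconstruct_array(serialized: str) -> List[int]:
    """Parse a JSON array string back into a list of ints (hypothetical helper)."""
    parsed = json.loads(serialized)
    # Reject anything that is not a JSON array, e.g. a bare number or object.
    if not isinstance(parsed, list):
        raise ValueError(f"expected a JSON array, got {type(parsed).__name__}")
    return [int(v) for v in parsed]
```

The point of the sketch is only that the function's declared output type (for example `list[int]`) decides how the string is interpreted after it comes back.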

@@ -1513,6 +1513,18 @@ def apply(
ops.RemoteFunctionOp(func=func, apply_on_null=True)
)

# if the output is an array, reconstruct it from the json serialized
# string form
if bigframes.dtypes.is_array_like(func.output_dtype):
Contributor

It seems the code within this block assumes not just array_like, but specifically that it is a pyarrow list_ type.

Contributor Author

That's exactly what the array_like implementation checks?

def is_array_like(type_: ExpressionType) -> bool:
    return isinstance(type_, pd.ArrowDtype) and isinstance(
        type_.pyarrow_dtype, pa.ListType
    )

Contributor

eh, probably fine then, I don't really see array_like definition expanding anytime soon

return None

try:
    python_output_type = eval(output_type)
Contributor

eval always makes me a bit uncomfortable - can we do this in a more constrained way?

Contributor Author

removed eval in the latest patch, PTAL
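
One constrained alternative to `eval` for this kind of deserialization is an explicit allow-list lookup. The sketch below is an illustration only and is not taken from the patch: the table `_SUPPORTED_OUTPUT_TYPES` and the function `parse_output_type` are hypothetical names, and the set of accepted strings is an assumption.

```python
from typing import Dict, List

# Hypothetical allow-list: only these serialized names are accepted.
# The exact set of supported types is an assumption made for this sketch.
_SUPPORTED_OUTPUT_TYPES: Dict[str, object] = {
    "bool": bool,
    "int": int,
    "float": float,
    "str": str,
    "list[bool]": List[bool],
    "list[int]": List[int],
    "list[float]": List[float],
    "list[str]": List[str],
}


def parse_output_type(serialized: str) -> object:
    """Map a serialized type name to a Python type without evaluating it."""
    try:
        return _SUPPORTED_OUTPUT_TYPES[serialized.strip()]
    except KeyError:
        raise ValueError(f"unsupported output type: {serialized!r}") from None
```

Because the input is only ever used as a dictionary key, a string such as `__import__('os')` is rejected with a `ValueError` instead of being executed.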

if typing.get_origin(python_output_type) is list:
    python_output_type_ser = repr(python_output_type)
else:
    python_output_type_ser = python_output_type.__name__
Contributor

Should we bother with non-list types right now?

Contributor Author

Throwing an error for non-array and unsupported array types in the latest patch, PTAL
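
As a rough illustration of the behavior described in this reply, the sketch below serializes only supported `list[...]` types and raises for everything else. It uses the standard `typing.get_origin` and `typing.get_args` helpers; the function name `serialize_array_output_type` and the tuple `_SUPPORTED_ELEMENT_TYPES` are hypothetical, and the supported element types are an assumption rather than the patch's actual list.

```python
import typing
from typing import List

# Assumed set of element types; the real patch may support a different set.
_SUPPORTED_ELEMENT_TYPES = (bool, int, float, str)


def serialize_array_output_type(output_type: object) -> str:
    """Return a string form such as 'list[int]' or raise for unsupported types."""
    # Non-array types (e.g. plain int) have no `list` origin.
    if typing.get_origin(output_type) is not list:
        raise TypeError(f"not an array type: {output_type!r}")
    args = typing.get_args(output_type)
    # Require exactly one supported element type, e.g. List[int].
    if len(args) != 1 or args[0] not in _SUPPORTED_ELEMENT_TYPES:
        raise TypeError(f"unsupported array type: {output_type!r}")
    return f"list[{args[0].__name__}]"
```

Building the string from `__name__` rather than `repr` keeps the serialized form stable across Python versions and pairs naturally with a lookup-based parser on the reading side.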

@shobsi shobsi merged commit bdee173 into main Jan 15, 2025
22 checks passed
@shobsi shobsi deleted the shobs-rf-array-out-1 branch January 15, 2025 23:15
shuoweil pushed a commit that referenced this pull request Jan 20, 2025
* feat: support array output in `remote_function`

This is a feature request to support use cases like creating custom
feature vectors, embeddings, etc.

* add multiindex test

* move array type conversion to bigquery module, test multiindex

* add `bigframes.bigquery.json_extract_string_array`, support int and str array outputs

* increase cleanup rate

* update input and output types doc

* support array output in DataFrame.apply

* support read_gbq_function on a remote function created for array output

* fix the json_set after variable renaming

* add tests for output_type in read_gbq_function

* temporarily exclude system 3.9 tests and include 3.10 and 3.11

* Revert "temporarily exclude system 3.9 tests and include 3.10 and 3.11"

This reverts commit 2485aa3.

* add more info in the unexpected exception

* more debug info

* use unique routine name across tests

* Revert "more debug info"

This reverts commit 86fe316.

* Revert "add more info in the unexpected exception"

This reverts commit fe010cb.

* support array output in binary remote function operations

* support array output in nary remote function operations

* preserve array output type in function description to avoid explicit output_type in read_gbq_function

* fix one failing read_gbq_function test

* make test parameterization order deterministic

* fix sorting of types for mypy

* remove test parameterization with sorting inside

* include partial ordering mode testing for read_gbq_function

* add remote function array out test in partial ordering mode

* avoid repr-eval for output type serialization/deserialization

* remove unsupported scenarios system tests, use common exception for unsupported
shuoweil pushed a commit that referenced this pull request Jan 24, 2025
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.