Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

ResultSet fetchmany_arrow/fetchall_arrow methods fail during concat_tables #418

Copy link
Copy link
@ksofeikov

Description

@ksofeikov
Issue body actions

Hi there,

I'm using this client library to fetch much data from our DBX environment. The version I'm using is 3.3.0.

The library keeps crashing when attempts to concatenate two current and partial results. I can not attach to full trace, because it contains some of the internal schemas, but here is the gist of it:

creation_date
First Schema: creation_date: timestamp[us, tz=Etc/UTC]
Second Schema: creation_date: timestamp[us, tz=Etc/UTC] not null

status
First Schema: status: string
Second Schema: status: string not null

sender_id
First Schema: sender_id: string
Second Schema: sender_id: string not null

.
.
.
and a few other fields with the exact same discrepancy

The exact stack trace is

Traceback (most recent call last):
  File "/Users/X/work/scripts/raw_order.py", line 34, in <module>
    for r in tqdm(cursor, total=max_items):
  File "/Users/X/work/.venv/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 422, in __iter__
    for row in self.active_result_set:
  File "/Users/X/work//.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1112, in __iter__
    row = self.fetchone()
  File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1217, in fetchone
    res = self._convert_arrow_table(self.fetchmany_arrow(1))
  File "/Users/X/work/.venv/lib/python3.10/site-packages/databricks/sql/client.py", line 1193, in fetchmany_arrow
    results = pyarrow.concat_tables([results, partial_results])
  File "pyarrow/table.pxi", line 5962, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 1 was different: 
GOES INTO DETAILS SPECIFIED ABOVE AS TO WHAT IS DIFFERNT

The data is coming through a cursor like this

connection = sql.connect(
    server_hostname="X",
    http_path="B",
    access_token=app_settings.dbx_access_token,
)

cursor = connection.cursor()

max_items = 100000
batch_size = 10000

cursor.execute(
    f"SELECT * from X where  creation_date between '2024-06-01' and '2024-09-01' limit {max_items}"
)

The source table is created through a CTAS statement, so all fields are nullable by default. I have found two ways to resolve the issue:

  • either set results = pyarrow.concat_tables([results, partial_results], promote_options="permissive") promote options to permissive to so pyarrow can marry two schemas, or
  • Downgrade to the latest previous major version 2.9.6

I checked the 2.9.6 source code and it does not seem to be using a permissive schema casting, so seems like a regression in this case.

I'm not sure if I can add anything else beyond that, but do let me know.

And to be clear, I request like 100k records at a time there, and can iterate through like 95k of them, and then it fails. So I'm not really sure if there is a reliable way to reproduce that

If the cluster runtime matters, 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)

Thanks!

Reactions are currently unavailable

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      Morty Proxy This is a proxified and sanitized view of the page, visit original site.