@metadaddy

Similarly to #218, we see occasional timeout errors when writing data to S3-compatible object storage:

When uploading part for key 'drivestats/data/date_month=2014-08/00000-0-9c7baab5-af18-4558-ae10-1678aa90b6a5.parquet' in bucket 'drivestats-iceberg': AWS Error NETWORK_CONNECTION during UploadPart operation: curlCode: 28, Timeout was reached

[I don't believe the issue is specific to the fact that I'm using Backblaze B2 rather than Amazon S3 - I saw references to similar error messages with the latter as I was researching this issue.]

The issue happens when the underlying PUT operation takes longer than the request timeout, which is set to a default of 3 seconds in the AWS C++ SDK used by Arrow via PyArrow.

The changes in this PR allow configuration of s3.request-timeout when working directly or indirectly with pyiceberg.io.pyarrow.PyArrowFileIO, just as #218 allowed configuration of s3.connect-timeout.

For example, when creating a catalog:

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.request-timeout": 5.0,
        "s3.connect-timeout": 20.0,
    }
)
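
To illustrate how such properties can reach the underlying client, here is a minimal sketch modeled on the pattern in the diff under review. The helper name `timeout_kwargs` is hypothetical and is not part of PyIceberg. The constant values are taken from the property keys in the example above, and the keyword names `connect_timeout` and `request_timeout` are those shown in the reviewed snippet.

```python
from typing import Any, Dict, Mapping

# Property keys as they appear in the catalog configuration above.
S3_CONNECT_TIMEOUT = "s3.connect-timeout"
S3_REQUEST_TIMEOUT = "s3.request-timeout"


def timeout_kwargs(properties: Mapping[str, Any]) -> Dict[str, float]:
    """Hypothetical helper: map timeout properties to client keyword arguments.

    Values may be given as numbers or as strings (an assumption about how
    configuration is supplied), so each one is normalised with float().
    """
    client_kwargs: Dict[str, float] = {}

    # Only forward a timeout when the property is actually set.
    if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
        client_kwargs["connect_timeout"] = float(connect_timeout)

    if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
        client_kwargs["request_timeout"] = float(request_timeout)

    return client_kwargs


kwargs = timeout_kwargs({"s3.request-timeout": "5.0", "s3.connect-timeout": 20.0})
unset = timeout_kwargs({})
```

When neither property is set, the helper returns an empty dict, so the client's own defaults would stay in effect, such as the 3-second request timeout described above.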

@Fokko Fokko left a comment

Thanks @metadaddy for adding this, I left one comment regarding S3FS, apart from that it looks good to me 👍

pyiceberg/io/fsspec.py

    client_kwargs["connect_timeout"] = float(connect_timeout)

if request_timeout := self.properties.get(S3_REQUEST_TIMEOUT):
    client_kwargs["request_timeout"] = float(request_timeout)

Comment on lines +445 to +446
if request_timeout := self.properties.get(S3_REQUEST_TIMEOUT):
    client_kwargs["request_timeout"] = float(request_timeout)

| s3.region | us-west-2 | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region for each S3 bucket, falling back to this value if resolution fails. |
| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
| s3.request-timeout | 60.0 | Configure socket read timeouts on Windows and macOS, in seconds. |

I couldn't find a Java equivalent, so I'm fine with introducing this one 👍

I found connect-timeout, which I think is different from request-timeout:
https://github.com/apache/iceberg-go/blob/4b645d698fffaa99c235f54bf33f4340a4414bc5/io/s3.go#L47-L53

@metadaddy

metadaddy commented Jan 24, 2025

Hi @Fokko - I implemented and pushed your suggested correction. Thanks!

@kevinjqliu

Looks like there's a lint issue. Can you run `make lint` locally? @metadaddy

@metadaddy

metadaddy commented Jan 24, 2025

@kevinjqliu Ah - it wanted imports in alphabetical order - I'd just inserted S3_REQUEST_TIMEOUT immediately after S3_CONNECT_TIMEOUT. All fixed now!

@kevinjqliu kevinjqliu left a comment

LGTM

@Fokko Fokko merged commit 7624ed3 into apache:main Jan 28, 2025
8 checks passed
@Fokko

Fokko commented Jan 28, 2025

Thanks for working on this @metadaddy, and thanks @kevinjqliu for the review 🙌

@metadaddy

@Fokko / @kevinjqliu Any plans for a release in the near future? It's been a while since 0.8.1, and I'd like to be able to use a mainline version of PyIceberg in my app, rather than my patch. Thanks!

@kevinjqliu

@metadaddy we're getting ready for 0.9.0 as we speak :)
Here's the thread about it on Slack: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1738367232843719

We also recently added a nightly build on TestPyPI if you want to give that a try: https://test.pypi.org/project/pyiceberg/
