fix: Fix RQ usage in Scrapy scheduler #385


Merged
vdusek merged 4 commits into master from fixing-scrapy
Jan 29, 2025

Conversation

vdusek
Contributor

@vdusek vdusek commented Jan 29, 2025

@vdusek vdusek added this to the 107th sprint - Tooling team milestone Jan 29, 2025
@vdusek vdusek requested a review from janbuchar January 29, 2025 08:54
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 29, 2025
@@ -78,6 +78,11 @@ async def process_request(self, request: Request, spider: Spider) -> None:
        Raises:
            ValueError: If username and password are not provided in the proxy URL.
        """
        # Do not use proxy for robots.txt, as it causes 403 Forbidden.
Contributor

Like... universally, everywhere? I don't mind it, it just seems weird.

Contributor Author

It may be a problem with Apify proxies, I'm not sure, but it results in the following:

[scrapy.downloadermiddlewares.robotstxt] ERROR Error downloading <GET https://console.apify.com/robots.txt>: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}] ({"spider": "<TitleSpider 'title_spider' at 0x7f2bc3aee660>"})
      Traceback (most recent call last):
        File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
          result = context.run(
              cast(Failure, result).throwExceptionIntoGenerator, gen
          )
        File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
          return g.throw(self.value.with_traceback(self.tb))
                 ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/scrapy/core/downloader/middleware.py", line 68, in process_request
          return (yield download_func(request, spider))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}]

Contributor

Humph. But the connect call should happen way before the path part of the URL matters, right?
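
To illustrate the point about CONNECT: when an HTTPS request goes through a proxy, the tunnel is requested in authority form (host and port only), so the path of the target URL is not part of what the proxy sees at that stage. The sketch below is only an illustration of that request line; `connect_request_line` is a hypothetical helper, not a function from Scrapy or the Apify SDK, and it does not explain the 403 observed above.

```python
from urllib.parse import urlsplit


def connect_request_line(url: str) -> str:
    """Build the CONNECT request line a client would send to a proxy for `url`.

    Hypothetical helper for illustration only. The request target is the
    authority (host:port); the URL path is not included.
    """
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL does not specify one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return f"CONNECT {parts.hostname}:{port} HTTP/1.1"


if __name__ == "__main__":
    # Both URLs share the same host, so the tunnel request is identical.
    print(connect_request_line("https://console.apify.com/robots.txt"))
    print(connect_request_line("https://console.apify.com/"))
```

Since both request lines come out the same, a proxy that rejects only the robots.txt request would have to be reacting to something other than the path, which matches the doubt raised in this thread.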

Contributor Author

Yep, that's strange. I'm not sure why we can't connect when it comes to robots.txt, while other URLs work. I've reverted the changes and kept only the storage client fix.

# Use the ApifyStorageClient if the Actor is running on the Apify platform,
# otherwise use the MemoryStorageClient.
storage_client = (
    ApifyStorageClient.from_config(config) if config.is_at_home else MemoryStorageClient.from_config(config)
Contributor

This is supposed to happen in Actor.init, right? Why duplicate it here?

Contributor Author

@vdusek vdusek Jan 29, 2025

Because of the nested event loop. Otherwise, using the Apify client results in:

RuntimeError: <asyncio.locks.Event object at 0x7c2d640c8fc0 [unset]> is bound to a different event loop

@vdusek vdusek merged commit 3363478 into master Jan 29, 2025
27 checks passed
@vdusek vdusek deleted the fixing-scrapy branch January 29, 2025 14:37
@honzajavorek
Contributor

Thanks!

Labels
t-tooling Issues with this label are in the ownership of the tooling team.
3 participants