fix: Fix RQ usage in Scrapy scheduler #385
Conversation
@@ -78,6 +78,11 @@ async def process_request(self, request: Request, spider: Spider) -> None:
        Raises:
            ValueError: If username and password are not provided in the proxy URL.
        """
        # Do not use proxy for robots.txt, as it causes 403 Forbidden.
Like... universally, everywhere? I don't mind it, it just seems weird.
Maybe it's a problem with Apify proxies, I don't know, but it results in the following:
[scrapy.downloadermiddlewares.robotstxt] ERROR Error downloading <GET https://console.apify.com/robots.txt>: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}] ({"spider": "<TitleSpider 'title_spider' at 0x7f2bc3aee660>"})
Traceback (most recent call last):
File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
result = context.run(
cast(Failure, result).throwExceptionIntoGenerator, gen
)
File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
return g.throw(self.value.with_traceback(self.tb))
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/scrapy/core/downloader/middleware.py", line 68, in process_request
return (yield download_func(request, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}]
Humph. But the connect call should happen way before the path part of the URL matters, right?
Yep, that's strange. I'm not sure why we can't connect when it comes to robots.txt, while other URLs work. I've reverted the changes and kept only the storage client fix.
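For context, the (since reverted) proxy bypass discussed above amounts to a path check before attaching the proxy to a request. A minimal sketch of that check; the helper name `should_bypass_proxy` is hypothetical and not part of the original diff:

```python
from urllib.parse import urlparse


def should_bypass_proxy(url: str) -> bool:
    """Return True for robots.txt requests, which were observed to fail
    with 403 Forbidden when routed through the proxy."""
    return urlparse(url).path == '/robots.txt'


# In a process_request-style hook, the proxy would only be attached
# when this check returns False:
print(should_bypass_proxy('https://console.apify.com/robots.txt'))  # True
print(should_bypass_proxy('https://console.apify.com/some/page'))   # False
```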
# Use the ApifyStorageClient if the Actor is running on the Apify platform,
# otherwise use the MemoryStorageClient.
storage_client = (
    ApifyStorageClient.from_config(config) if config.is_at_home else MemoryStorageClient.from_config(config)
)
This is supposed to happen in Actor.init, right? Why duplicate it here?
Because of the nested event loop. Otherwise, it results in:
RuntimeError: <asyncio.locks.Event object at 0x7c2d640c8fc0 [unset]> is bound to a different event loop
when using the Apify client.
Thanks!
Relates: apify/actor-templates#303