fix: Fix RQ usage in Scrapy scheduler #385
Conversation
@@ -78,6 +78,11 @@ async def process_request(self, request: Request, spider: Spider) -> None:
        Raises:
            ValueError: If username and password are not provided in the proxy URL.
        """
        # Do not use proxy for robots.txt, as it causes 403 Forbidden.
Like... universally, everywhere? I don't mind it, it just seems weird.
Maybe it's a problem with Apify proxies, I don't know, but it results in the following:
[scrapy.downloadermiddlewares.robotstxt] ERROR Error downloading <GET https://console.apify.com/robots.txt>: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}] ({"spider": "<TitleSpider 'title_spider' at 0x7f2bc3aee660>"})
Traceback (most recent call last):
File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
result = context.run(
cast(Failure, result).throwExceptionIntoGenerator, gen
)
File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
return g.throw(self.value.with_traceback(self.tb))
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/scrapy/core/downloader/middleware.py", line 68, in process_request
return (yield download_func(request, spider))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}]
Humph. But the connect call should happen way before the path part of the URL matters, right?
Yep, that's strange. I'm not sure why we can't connect when it comes to robots.txt, while other URLs work. I've reverted the changes and kept only the storage client fix.
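For context, the (since reverted) proxy bypass discussed above amounts to a path check before attaching the proxy to a request. A minimal sketch of that check; the helper name `should_bypass_proxy` is hypothetical and not part of the original diff:

```python
from urllib.parse import urlparse


def should_bypass_proxy(url: str) -> bool:
    """Return True for robots.txt requests, which were observed to fail
    with 403 Forbidden when routed through the proxy."""
    return urlparse(url).path == '/robots.txt'


# In a process_request-style hook, the proxy would only be attached
# when this check returns False:
print(should_bypass_proxy('https://console.apify.com/robots.txt'))  # True
print(should_bypass_proxy('https://console.apify.com/some/page'))   # False
```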
# Use the ApifyStorageClient if the Actor is running on the Apify platform,
# otherwise use the MemoryStorageClient.
storage_client = (
    ApifyStorageClient.from_config(config) if config.is_at_home else MemoryStorageClient.from_config(config)
)
This is supposed to happen in Actor.init, right? Why duplicate it here?
Because of the nested event loop. Otherwise, it results in:
RuntimeError: <asyncio.locks.Event object at 0x7c2d640c8fc0 [unset]> is bound to a different event loop
when using the Apify client.
Thanks!
Relates: apify/actor-templates#303