Closed
Description
When crawling apify.com using the Apify-Scrapy integration, some specific pages cause request queue inconsistencies, leading to an infinite run loop where requests cannot be removed from the queue.
This issue can be reproduced using the code from the Scrapy guide or the Scrapy template.
Problematic URLs:
- https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping
- https://blog.apify.com/how-web-scraping-ai-and-the-eu-have-come-together-to-sweep-away-fake-discounts-in-europe
- https://blog.apify.com/groupon-reaches-new-merchants-with-web-data-collection
Observed behavior
Warnings about mismatched request IDs and unique keys during Spider processing:
[title_spider] INFO TitleSpider is parsing <200 https://apify.com/professional-services>... [({"spider": "<TitleSpider 'title_spider' at 0x7f564d5a52b0>"})]
...
[crawlee.storage_clients._memory._request_queue_client] WARN The request ID does not match the ID from the unique_key (request.id=xTOtjOm0Pqt40fn, id=U6aWc4CgOSWQl13).
...
[crawlee.storage_clients._memory._request_queue_client] WARN The request ID does not match the ID from the unique_key (request.id=QGWv6iVEdK2bHHD, id=DmSWl7p6JKFfUYx).
...
[crawlee.storage_clients._memory._request_queue_client] WARN The request ID does not match the ID from the unique_key (request.id=RWm45nonkuuAh9r, id=5lI8IiodqKAguLk).
...
[title_spider] INFO TitleSpider is parsing <200 https://apify.com/templates>... [({"spider": "<TitleSpider 'title_spider' at 0x7f564d5a52b0>"})]
...
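The warning above compares a request's stored ID with an ID recomputed from its unique key. The following is a minimal sketch of such a check, assuming the ID is a truncated, base64-encoded SHA-256 digest of the unique key. That derivation is an assumption about Crawlee's internals, and `derive_request_id` and `id_matches_unique_key` are hypothetical local helpers, not library calls.

```python
import base64
import hashlib
import re


def derive_request_id(unique_key: str, length: int = 15) -> str:
    """Hypothetical helper: derive a short alphanumeric ID from a unique key.

    Assumed scheme: SHA-256 digest, base64-encoded, with '+', '/' and '='
    removed, truncated to `length` characters.
    """
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return re.sub(r"[+/=]", "", encoded)[:length]


def id_matches_unique_key(request_id: str, unique_key: str) -> bool:
    """Return True when the stored ID equals the ID derived from the unique key."""
    return request_id == derive_request_id(unique_key)
```

Under this assumption, a mismatch means the request was stored with an ID that was computed from a different unique key, or by a different scheme, than the one the queue client uses when it looks the request up again.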
Requests remain stuck in the request queue and are retried repeatedly, resulting in an endless loop of the following logs:
...
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee.storages._request_queue] DEBUG Skipping request from queue head, already in progress or recently handled ({"id": "xTOtjOm0Pqt40fn", "unique_key": "GET|e3b0c442|e3b0c442|https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping", "in_progress": false, "recently_handled": true})
[crawlee.storages._request_queue] DEBUG Skipping request from queue head, already in progress or recently handled ({"id": "QGWv6iVEdK2bHHD", "unique_key": "GET|e3b0c442|e3b0c442|https://blog.apify.com/how-web-scraping-ai-and-the-eu-have-come-together-to-sweep-away-fake-discounts-in-europe", "in_progress": false, "recently_handled": true})
[crawlee.storages._request_queue] DEBUG Skipping request from queue head, already in progress or recently handled ({"id": "RWm45nonkuuAh9r", "unique_key": "GET|e3b0c442|e3b0c442|https://blog.apify.com/groupon-reaches-new-merchants-with-web-data-collection", "in_progress": false, "recently_handled": true})
[crawlee.storages._request_queue] DEBUG Queue head still returned requests that need to be processed (or that are locked by other clients)
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
[crawlee._utils.system] DEBUG Calling get_memory_info()...
[crawlee._utils.system] DEBUG Calling get_cpu_info()...
...
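To see which requests are stuck, the "Skipping request from queue head" lines can be parsed and counted. The sketch below assumes the log line layout shown above, with a JSON object in trailing parentheses. The function names and the field names given to the parts of the unique key (method, headers hash, payload hash, URL) are hypothetical. One thing that can be stated with certainty is that `e3b0c442` is the first eight hex characters of the SHA-256 digest of empty input, which is consistent with these requests having no payload.

```python
import hashlib
import json
import re
from collections import Counter
from typing import Iterable, Optional

# Matches the JSON object inside the trailing parentheses of a skip message.
SKIP_PATTERN = re.compile(r"Skipping request from queue head.*?\((\{.*\})\)\s*$")

# First 8 hex characters of SHA-256 over empty input ("e3b0c442").
EMPTY_SHA256_PREFIX = hashlib.sha256(b"").hexdigest()[:8]


def parse_skipped_request(line: str) -> Optional[dict]:
    """Return the JSON payload of a skip message, or None for other lines."""
    match = SKIP_PATTERN.search(line)
    if match is None:
        return None
    return json.loads(match.group(1))


def split_unique_key(unique_key: str) -> dict:
    """Split a 'A|B|C|URL' unique key; the field names are assumptions."""
    method, headers_hash, payload_hash, url = unique_key.split("|", 3)
    return {
        "method": method,
        "headers_hash": headers_hash,
        "payload_hash": payload_hash,
        "url": url,
    }


def count_skips(lines: Iterable[str]) -> Counter:
    """Count how many times each request ID is skipped at the queue head."""
    counts: Counter = Counter()
    for line in lines:
        info = parse_skipped_request(line)
        if info is not None:
            counts[info["id"]] += 1
    return counts
```

Running `count_skips` over a captured log gives a per-ID tally. IDs whose count keeps growing while `in_progress` is false and `recently_handled` is true are the ones the queue never manages to remove.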
honzajavorek
Metadata
Assignees
Labels
Something isn't working. Issues with this label are in the ownership of the tooling team.