Redirect URLs #845

Answered by jkumz
jkumz asked this question in Q&A
Dec 28, 2024 · 2 comments

Hi,

Is there a way to prevent the crawler from following links on another domain when a URL redirects there?

Right now, when the crawler hits a URL that redirects to a different domain, it goes on to crawl that domain as well, even though I pass a same-origin strategy to enqueue_links.

For example, www.somelink.com/github redirects to their GitHub profile. The crawler then enqueues every URL on that page, which leads to endless crawling of GitHub.

My code:

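import asyncio

# Import paths are an assumption; they differ between Crawlee for Python versions.
from crawlee import EnqueueStrategy
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    urls_found = []

    # Define a request handler and attach it to the crawler using the decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Record every URL the crawler visits.
        url = context.request.url
        context.log.info(f"On URL: {url}")
        urls_found.append(url)

        # Enqueue links found on the page, restricted to the same origin.
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_ORIGIN)

    await crawler.run(["some url here"])

    print(f"Found {len(urls_found)} URLs")


if __name__ == "__main__":
    asyncio.run(main())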

Thanks in advance

Replies: 2 comments

I worked around this by wrapping the enqueue_links call in a conditional block, so links are only enqueued when the current page is on the same domain as the start URL.
The crawler still follows redirects to URLs outside the original domain, but it no longer crawls any URLs under those domains, which is sufficient for me.
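
A minimal sketch of that kind of check, not the exact code from my project. The helper names `same_host` and `enqueue_if_same_host` are made up for illustration, and it assumes the handler receives a Playwright-based crawling context whose `page.url` holds the URL after any redirect:

```python
from urllib.parse import urlsplit


def same_host(url_a: str, url_b: str) -> bool:
    """Return True when both URLs have the same hostname.

    urlsplit() lowercases the hostname, so the comparison is
    case-insensitive. URLs without a hostname never match.
    """
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b


async def enqueue_if_same_host(context, start_url: str) -> None:
    """Hypothetical helper: enqueue links only if the page stayed on the start host.

    `context` is assumed to be a Playwright crawling context; `context.page.url`
    is the URL the browser ended up on after following redirects.
    """
    if same_host(start_url, context.page.url):
        # Pass the same strategy argument here as in the original handler.
        await context.enqueue_links()
    else:
        context.log.info(f"Skipping links on redirected page: {context.page.url}")
```

Inside the request handler, the `enqueue_links` call would be replaced by `await enqueue_if_same_host(context, "some url here")`. This compares full hostnames, so `www.somelink.com` and `somelink.com` count as different hosts; that may or may not be what you want.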

Answer selected by jkumz

Hello @jkumz, I added a test that should reproduce the bug you're reporting (#873), but it passes without any changes to the code. Could you share the URL that triggers the behavior you're observing?
