Broken link checker for internal links #758

Unanswered
tleyden asked this question in Q&A

I'm planning to build a broken link checker on top of crawlee-python, similar in scope to lychee and htmltest. For now I just need to detect broken internal links, so the crawl should contain all of the state needed. By "internal links" I mean links within the page or to other pages within the same domain, including anchor fragments.
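To make that definition concrete, here is a minimal sketch using only the standard library's `urllib.parse`. The function name `split_internal_link` and the example URLs are hypothetical, and it assumes "same domain" means an exact match on the host part of the URL; it is not part of crawlee.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def split_internal_link(page_url, href):
    """Resolve href against page_url.

    Returns (target_url_without_fragment, fragment) for internal links,
    or None for external or non-HTTP links (hypothetical helper).
    """
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    # Skip mailto:, javascript:, tel: and similar schemes.
    if parts.scheme not in ("http", "https"):
        return None
    # Assumption: "internal" means the host matches exactly.
    if parts.netloc != urlsplit(page_url).netloc:
        return None
    target, fragment = urldefrag(absolute)
    return target, fragment


print(split_internal_link("https://example.com/docs/", "intro#setup"))
print(split_internal_link("https://example.com/docs/", "https://other.org/x"))
```

A link such as `#top` resolves to the current page with the fragment `top`, so in-page anchors and cross-page anchors can be checked the same way.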

From reading through the docs:

  • The BeautifulSoupCrawler seems like a great starting point since it already parses the HTML and makes it easy to extract links.
  • The KeyValueStore seems like the right tool for storing crawl state: the paths that were visited (to detect broken internal links) and the element IDs found on each page (to detect broken anchor fragments).
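The bookkeeping described in these two points can be sketched independently of crawlee. The example below is a rough illustration, not crawlee's API: it uses the standard library's `html.parser` in place of BeautifulSoup and a plain dict in place of the KeyValueStore, and the names `PageScanner` and `find_broken_links` are hypothetical. It also assumes every reachable page is already in the `pages` mapping, so a page that is absent is treated as broken; a real crawl would look at the HTTP status instead.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class PageScanner(HTMLParser):
    """Collects anchor targets (id / <a name>) and link hrefs from one page."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id"):
            self.ids.add(attr_map["id"])
        if tag == "a" and attr_map.get("name"):
            self.ids.add(attr_map["name"])
        if tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_broken_links(pages):
    """pages maps URL -> HTML. Returns a list of (source_url, href, reason)."""
    ids_by_page = {}
    links = []
    # Pass 1: record the IDs on every page and every link that was seen.
    for url, html in pages.items():
        scanner = PageScanner()
        scanner.feed(html)
        ids_by_page[url] = scanner.ids
        links.extend((url, href) for href in scanner.hrefs)

    # Pass 2: check each internal link against the recorded state.
    broken = []
    for source, href in links:
        absolute = urljoin(source, href)
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # mailto:, javascript:, ...
        if parts.netloc != urlsplit(source).netloc:
            continue  # external link, out of scope here
        target, fragment = urldefrag(absolute)
        if target not in ids_by_page:
            broken.append((source, href, "missing page"))
        elif fragment and fragment not in ids_by_page[target]:
            broken.append((source, href, "missing anchor"))
    return broken


pages = {
    "https://example.com/": (
        '<h1 id="top">Home</h1>'
        '<a href="/about#team">About</a>'
        '<a href="/missing">Gone</a>'
        '<a href="#top">Back to top</a>'
        '<a href="https://other.org/">External</a>'
    ),
    "https://example.com/about": '<h2 id="history">History</h2><a href="/">Home</a>',
}
for source, href, reason in find_broken_links(pages):
    print(source, href, reason)
```

The check runs in two passes because an anchor can only be validated once its target page has been parsed. This sketch does no URL normalization, so `/about` and `/about/` count as different pages.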

Does crawlee already have a mechanism to detect internal broken links, or even broken anchor fragments? I'm still ramping up on crawlee, so I don't want to reinvent the wheel here.

Are there any other components or prior art I should be aware of?

PS: Thank you for releasing this as open source! From my experience so far, crawlee-python feels like a really powerful tool and it was extremely easy to get started with it.

Replies: 1 comment

Hi,

Glad to hear that, and thanks for the kind words 🙂.

Regarding your question:

I hope this helps and good luck with your project.
