This repository was archived by the owner on Sep 17, 2020. It is now read-only.

Is there a means by which one can crawl a self-hosted Warc file?  #103

@deltabravozulu

Description


So, my general use case for this is that I have a personal website I recorded at one point but no longer have the original files for. I'd like to rehost my site, but without the old source code I can't. I'm trying to figure out a way to get all my links and everything put back in order from the warc file in my backups, but thus far this has been in vain.

I've found that webrecorder (not the player) puts things together in such a way that other programs built over the years can't take them apart (e.g. warc to zip, warcat, or warc-extractor); each runs into errors when trying to figure out the indexing of the warc.
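For context, the sort of extraction I'm after looks roughly like the sketch below, using the warcio library (pip install warcio) rather than the tools above. The filename is a placeholder, and whether warcio copes with webrecorder's indexing any better is exactly what I don't know.

```python
import os
from urllib.parse import urlsplit

from warcio.archiveiterator import ArchiveIterator

# "mysite.warc.gz" is a placeholder for the warc file exported from webrecorder.
with open('mysite.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Only 'response' records carry the actual page and asset payloads.
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        payload = record.content_stream().read()

        # Map the URL onto a local path, e.g. deltabravozu.lu/about/index.html
        parts = urlsplit(url)
        path = parts.path.lstrip('/')
        if not path or path.endswith('/'):
            path += 'index.html'
        out = os.path.join(parts.netloc, path)
        os.makedirs(os.path.dirname(out) or '.', exist_ok=True)
        with open(out, 'wb') as f:
            f.write(payload)
```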

As such, I ran sudo netstat -tulpn | grep -i webrecord, which gave me a host:port of 127.0.0.1:35535. I found that, instead of going through webrecorder-player, I could open the whole site in Chrome by going to http://127.0.0.1:35535/local/collection/http://deltabravozu.lu. Because I can access it in the browser with all links working as they would in webrecorder-player, I figured I should be able to crawl the site and pull down the intact site structure with, say, wget or httrack. Thus far, though, I've been able to crawl nothing more than the first page and random offsite links encoded in the webrecorder-player server (e.g. https://www.w3.org).

For wget, I used wget --force-directories --timestamping --level=inf --no-remove-listing --debug --page-requisites --adjust-extension --convert-links --retry-connrefused --span-hosts --follow-ftp --retry-on-host-error --execute robots=off http://127.0.0.1:35535/local/collection/http://deltabravozu.lu
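Looking at that invocation again, I notice it never passes --recursive, and wget's --level only matters for a recursive download, so it may simply be stopping after the start page and its requisites. For reference, a recursive variant (which I haven't actually tested against the replay server) would look something like this; I've dropped --span-hosts here so it can't wander off to the external links:

```sh
wget --recursive --level=inf --page-requisites --adjust-extension \
     --convert-links --force-directories --timestamping \
     --retry-connrefused --execute robots=off \
     "http://127.0.0.1:35535/local/collection/http://deltabravozu.lu"
```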

Does anyone have any idea as to how I might more effectively go about my little task?
