How to download a copy of a website using Wget (monolune.com)
80 points by rsfgsdfgsdfg on June 7, 2024 | hide | past | favorite | 12 comments


I have my own set of flags that I recorded because I always forget them when I want to do this. Interesting to compare the differences.

https://kevincox.ca/2022/12/21/wget-mirror/
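For comparison, a typical offline-mirror invocation looks something like the following sketch (example.com is a placeholder; this is not necessarily the exact flag set from the linked post):

```shell
# Mirror a site for offline viewing:
#   --convert-links     rewrite links so they work locally
#   --adjust-extension  append .html where the server omits it
#   --page-requisites   fetch the CSS, images, and scripts needed to render
#   --no-parent         never ascend above the starting directory
#   --wait=1            pause between requests to be polite to the server
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent --wait=1 \
     https://example.com/
```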


I would recommend grab-site instead if you’re archiving, but wget is easier if you’re mirroring content.

https://github.com/ArchiveTeam/grab-site
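For reference, basic grab-site usage is a single command; output goes into a timestamped directory of WARC files. The flags below are as I recall them from the project README, so double-check them there:

```shell
# Start a recursive crawl of a site (example.com is a placeholder).
grab-site https://example.com/

# Crawl only the seed page and its requisites (the --1 flag,
# per the grab-site README).
grab-site --1 https://example.com/page.html
```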


I want to archive a site, but it requires a login. Can I do something like grab a cookie from another browser and drop it in the folder for this, or is it more complicated? Unfortunately, it's not the clean/easy old user:login style login, but a form you have to fill out.


In most browsers you can copy a request to a site in curl format from the Network tab of the Developer Tools panel. That should include all the cookies/user data required to make the same request from the command line.
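Concretely, the cookie header copied from the Network tab can be replayed with wget. The cookie names and values here are made up for illustration:

```shell
# Replay the browser's session cookie (value is a placeholder).
wget --mirror --convert-links --page-requisites \
     --header='Cookie: sessionid=abc123' \
     https://example.com/members/

# Alternatively, export cookies to a Netscape-format cookies.txt
# (browser extensions can do this) and load them:
wget --mirror --load-cookies=cookies.txt https://example.com/members/
```

Note that session cookies expire, so a long crawl may need the cookie refreshed partway through.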


grab-site or HTTrack is a better option for the modern documentation sites most people will probably try to download.

The wget suggestions advanced here and in the blog will only work for the most basic static sites, and will quickly fail on, for example, the Juniper, Cisco, AWS, or Azure documentation.


Excellent point. Some experimentation may be required to find the optimal tool for site ripping, depending on the presentation layer.

If there is room for improvement in grab-site, please open a GitHub issue.


HTTrack is another option: https://www.httrack.com/
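HTTrack's basic invocation, per its documentation, is an output directory plus URL filters; the URL and directory here are placeholders:

```shell
# Mirror a site into ./example-mirror; the "+" filter keeps the
# crawl on the same host.
httrack "https://example.com/" -O ./example-mirror "+*.example.com/*"
```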


The feature missing from most such tools: recurse, but only within a path.


That's pretty easy with Wget. You can use `--accept-regex` to define the paths your recursion should follow.
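For example, to recurse only under a single path (domain and path are placeholders; `--accept-regex` requires Wget 1.14 or later), something like:

```shell
# Follow only links whose URL matches the regex; everything
# outside /docs/ is skipped during recursion.
wget --recursive --level=inf \
     --accept-regex='https://example\.com/docs/.*' \
     --convert-links \
     https://example.com/docs/
```

For the simpler case of just staying below the starting directory, `--no-parent` alone also works.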


Huh, must be new. <checks changelog>. It was only added in 2012.

Or wait, was the problem that that breaks `--page-requisites`?


It should not break `--page-requisites`. If it does, please file a bug report with a reproducer.


wget --mirror has been in my set of commonly used command line tools for I don't know how long.

Probably one of the reasons why wget is my standard "fetch a URL" tool rather than the more obvious curl.



