How to download a copy of a website using Wget (monolune.com)
80 points by rsfgsdfgsdfg on June 7, 2024 | hide | past | favorite | 12 comments


I have my own set of flags that I recorded because I always forget them when I want to do this. Interesting to compare the differences.

https://kevincox.ca/2022/12/21/wget-mirror/
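For comparison, a typical offline-mirror invocation looks something like the following sketch (example.com is a placeholder; this is not necessarily the exact flag set from the linked post):

```shell
# Mirror a site for offline viewing:
#   --convert-links     rewrite links so they work locally
#   --adjust-extension  append .html where the server omits it
#   --page-requisites   fetch the CSS, images, and scripts needed to render
#   --no-parent         never ascend above the starting directory
#   --wait=1            pause between requests to be polite to the server
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent --wait=1 \
     https://example.com/
```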


I would recommend grab-site instead if you’re archiving, but wget is easier if you’re mirroring content.

https://github.com/ArchiveTeam/grab-site
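For reference, basic grab-site usage is a single command; output goes into a timestamped directory of WARC files. The flags below are as I recall them from the project README, so double-check them there:

```shell
# Start a recursive crawl of a site (example.com is a placeholder).
grab-site https://example.com/

# Crawl only the seed page and its requisites (the --1 flag,
# per the grab-site README).
grab-site --1 https://example.com/page.html
```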


I want to archive a site, but it requires a login. Can I do something like grab a cookie from another browser and drop it in the folder for this, or is it more complicated? Unfortunately, it's not the clean/easy old user:login style login, but a form you have to fill out.


In most browsers you can copy a request to a site in curl format from the Network tab of the Developer Tools panel. That should include all the cookies/user data required to make the same request from the command line.
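Concretely, the cookie header copied from the Network tab can be replayed with wget. The cookie names and values here are made up for illustration:

```shell
# Replay the browser's session cookie (value is a placeholder).
wget --mirror --convert-links --page-requisites \
     --header='Cookie: sessionid=abc123' \
     https://example.com/members/

# Alternatively, export cookies to a Netscape-format cookies.txt
# (browser extensions can do this) and load them:
wget --mirror --load-cookies=cookies.txt https://example.com/members/
```

Note that session cookies expire, so a long crawl may need the cookie refreshed partway through.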


grab-site or HTTrack is a better option for the modern documentation sites most people will probably try to download.

The wget suggestions advanced here and in the blog will only work for the most basic static sites, and will quickly fail on, for example, the Juniper, Cisco, AWS, or Azure documentation.


Excellent point. Some experimentation may be required to find the optimal tool for site ripping, depending on the presentation layer.

If there is room for improvement in grab-site, please open a GitHub issue.


HTTrack is another option: https://www.httrack.com/
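HTTrack's basic invocation, per its documentation, is an output directory plus URL filters; the URL and directory here are placeholders:

```shell
# Mirror a site into ./example-mirror; the "+" filter keeps the
# crawl on the same host.
httrack "https://example.com/" -O ./example-mirror "+*.example.com/*"
```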


The feature missing from most such tools: recurse, but only within a path.


That's pretty easy with Wget. You can use `--accept-regex` to define the paths your recursion should follow.
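For example, to recurse only under a single path (domain and path are placeholders; `--accept-regex` requires Wget 1.14 or later), something like:

```shell
# Follow only links whose URL matches the regex; everything
# outside /docs/ is skipped during recursion.
wget --recursive --level=inf \
     --accept-regex='https://example\.com/docs/.*' \
     --convert-links \
     https://example.com/docs/
```

For the simpler case of just staying below the starting directory, `--no-parent` alone also works.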


Huh, must be new. <checks changelog>. It was only added in 2012.

Or wait, was the problem that that breaks `--page-requisites`?


It should not break `--page-requisites`. If it does, please file a bug report with a reproducer.


wget --mirror has been in my set of commonly used command line tools for I don't know how long.

Probably one of the reasons why wget is my standard "fetch a URL" tool rather than the more obvious curl.



