April 28, 2019 · Don't Forget linux mac

Lightweight offline mirror of a site

Sometimes you want an offline copy of a site that you can take with you and browse without internet access. wget can do this in one command:

wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent http://example.org

Where:
--mirror – Makes (among other things) the download recursive.

--convert-links – Converts all the links (including those to assets like CSS stylesheets) to relative ones, so the copy is suitable for offline viewing.

--adjust-extension – Adds suitable extensions to filenames (html or css) depending on their content-type.

--page-requisites – Downloads things like CSS stylesheets and images required to display the page properly offline.

--no-parent – When recursing, do not ascend to the parent directory. It is useful for restricting the download to only a portion of the site.
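If you mirror sites often, it can help to wrap the long form in a small shell function. The sketch below is only one way to do it: `mirror_site` and the `DRY_RUN` variable are names made up for this example, not part of wget. With `DRY_RUN=1` it prints the command instead of running it, which is handy for checking the options before starting a long download.

```shell
#!/bin/sh
# Hypothetical helper: mirror_site and DRY_RUN are illustrative names,
# not something wget provides.
mirror_site() {
    if [ -z "${1:-}" ]; then
        echo "usage: mirror_site URL" >&2
        return 1
    fi
    url="$1"
    # Rebuild the positional parameters as the full wget command,
    # using the same five options described above.
    set -- wget \
        --mirror \
        --convert-links \
        --adjust-extension \
        --page-requisites \
        --no-parent \
        "$url"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the command without touching the network.
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 mirror_site http://example.org` prints the command, and `mirror_site http://example.org` runs it.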

Alternatively, the command above may be shortened:

wget -mkEpnp http://example.org

However, wget can be flaky, and certain paths may not be fully followed. If you need to filter the links down to specific files, you can also try:

lynx -dump http://example.org | awk '/keyword/{print $2}' > links.txt

Then feed the list back to wget:

wget -i links.txt
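One caveat with the awk filter above: it prints the second field of any line that contains the keyword, so a matching line from the page body can end up in links.txt too. A slightly stricter variant, sketched below, only accepts lines that look like lynx's numbered reference entries ("1. URL") and removes duplicates. `filter_links` is a hypothetical name, and the sketch assumes the reference list has the usual "number, dot, URL" layout.

```shell
#!/bin/sh
# Hypothetical helper: reads `lynx -dump` output on stdin and prints the
# URLs from the numbered reference list that contain the given keyword.
filter_links() {
    keyword="$1"
    # $1 must look like "12." and $2 (the URL) must contain the keyword
    # as a plain substring; sort -u drops repeated URLs.
    awk -v kw="$keyword" '$1 ~ /^[0-9]+\.$/ && index($2, kw) { print $2 }' | sort -u
}
```

Usage would then be `lynx -dump http://example.org | filter_links keyword > links.txt`, followed by the same `wget -i links.txt`.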

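If you need to download just a sub-section of the site, you can also recurse (-r) to a limited depth (-l 5 means at most five levels) without ascending to the parent directory (-np):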

wget -r -l 5 -np "full URL w/path"
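One thing to watch with -np: wget treats everything up to the last slash in the URL as the starting directory, so a trailing slash changes which part of the site is in scope. The sketch below mimics that rule to preview the scope before downloading. `np_scope` is a hypothetical helper, not a wget feature, and it is simplified (it ignores query strings, for instance).

```shell
#!/bin/sh
# Hypothetical helper: prints the directory prefix that a -np crawl
# starting at the given URL would be confined to.
np_scope() {
    url="$1"
    rest="${url#*://}"              # strip the scheme: "example.org/docs/guide"
    case "$rest" in
        */*) echo "${url%/*}/" ;;   # keep everything up to the last slash
        *)   echo "$url/" ;;        # bare host: the whole site is in scope
    esac
}
```

For instance, `np_scope http://example.org/docs/guide/` gives `http://example.org/docs/guide/`, while `np_scope http://example.org/docs/guide` (no trailing slash) gives `http://example.org/docs/`, so the second form would also pull in the siblings of `guide`.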