Lightweight offline mirror of a site
Sometimes you want to create an offline copy of a site that you can take and view even without internet access.
wget --mirror --convert-links --adjust-extension --page-requisites \
     --no-parent http://example.org
Where:
--mirror – Makes the download recursive (it is shorthand for -r -N -l inf --no-remove-listing).
--convert-links – Converts all links (including those to assets such as CSS stylesheets) into relative ones, so the copy is suitable for offline viewing.
--adjust-extension – Adds suitable extensions to filenames (.html or .css) depending on their content type.
--page-requisites – Downloads things like CSS style sheets and images required to properly display the page offline.
--no-parent – When recursing, does not ascend to the parent directory. This is useful for restricting the download to only a portion of the site.
Alternatively, the command above may be shortened:
wget -mkEpnp http://example.org
However, wget is notoriously flaky on some sites, and certain paths may not be fully followed. If you need to extract and filter the list of links yourself, you can also try:
lynx -dump http://example.org | awk '/keyword/{print $2}' > links.txt
And then:
wget -i links.txt
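The awk filter works because lynx -dump ends with a numbered References section, one link per line, so the URL is the second whitespace-separated field. Here is a quick offline check of the extraction on a fabricated sample of that output (the URLs and the keyword "docs" are just illustrative):

```shell
# Fabricated sample of the link list that `lynx -dump` appends;
# real output is numbered the same way, one URL per line.
cat > dump.txt <<'EOF'
References

   1. http://example.org/docs/intro.html
   2. http://example.org/blog/post.html
   3. http://example.org/docs/guide.html
EOF

# Keep lines matching the keyword (here "docs") and print the URL field.
awk '/docs/{print $2}' dump.txt > links.txt
cat links.txt
```

The resulting links.txt is exactly what wget -i expects: one URL per line.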
If you need to download just a sub-section of the site, then you can also try:
wget -r -l 5 -np "full URL w/path"
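Spelled out with the long-form options, a sketch of what those short flags mean (the URL is a placeholder; the command is only assembled and printed here, not run):

```shell
# -r   → --recursive   follow links recursively
# -l 5 → --level=5     descend at most 5 links from the start page
# -np  → --no-parent   never ascend above the starting directory
cmd='wget --recursive --level=5 --no-parent "http://example.org/docs/"'
echo "$cmd"
```

With --no-parent, a start URL of http://example.org/docs/ keeps the crawl inside /docs/ and away from the rest of the site.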