Using the Wget Linux command, it is possible to download an entire website, including all assets and scripts. It is occasionally necessary to download and archive a large site for local viewing, and Wget makes this easy. Below is an example of the options I use to download a complete copy of a site.
wget --mirror \
  --convert-links \
  --span-hosts \
  --adjust-extension \
  --page-requisites \
  --execute robots=off \
  --restrict-file-names=windows \
  --output-file=wget.log \
  --input-file=domains_list.txt \
  --domains=josephmsexton.com,webtipblog.com
There are some additional options, such as --level and --exclude-directories, that I also use for testing and debugging to limit the pages retrieved. The Wget manual is extremely thorough and discusses all of the available options. The options I commonly use are described below.
Wget Options
--mirror
The mirror option allows you to completely mirror a site. This is actually just a shortcut for using the following options (an equivalent command is shown after the list):
--recursive
--timestamping
--level=inf
--no-remove-listing
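For example, assuming example.com as a placeholder URL, the following two commands should behave identically:
wget --mirror http://example.com/
wget --recursive --timestamping --level=inf --no-remove-listing http://example.com/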
--convert-links
This option sets Wget to convert links so the site can be viewed locally. Links to files that have been downloaded are converted to relative links pointing at the new location. Relative links to files that have not been downloaded are converted to absolute links.
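A minimal example, again using the placeholder example.com:
wget --recursive --convert-links http://example.com/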
--adjust-extension
This option sets Wget to append the .html extension to any file of type “application/xhtml+xml” or “text/html” so the files can be viewed without a web server, e.g. about.php becomes about.php.html.
--page-requisites
This option sets Wget to download all assets needed to properly display the page, such as CSS, JavaScript, and images.
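This option is also handy on its own for saving a single page along with everything needed to display it, for example:
wget --page-requisites http://example.com/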
--execute command
This option executes a command, just as if it were in the Wget startup file.
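In the command at the top of this post, it is used to turn off robots.txt processing, which would otherwise prevent parts of many sites from being archived:
wget --recursive --execute robots=off http://example.com/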
--level=depth
This option sets the recursion depth. This is great for testing, since it lets you avoid downloading the entire internet.
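For a quick test run against the placeholder example.com, something like this limits the crawl to two levels of links:
wget --recursive --level=2 http://example.com/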
--restrict-file-names=modes
This option restricts which characters from remote URLs may appear in local file names, escaping anything the chosen mode does not allow. This will mostly not be required and will default to the correct mode for the operating system being used. Available modes are “unix”, “windows”, “nocontrol”, “ascii”, “lowercase”, and “uppercase.”
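The command at the top of this post uses the windows mode, which produces file names that are safe to browse on Windows; a minimal version looks like this:
wget --recursive --restrict-file-names=windows http://example.com/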
--wait=seconds
This option sets the interval between retrievals. This can be used to throttle the requests being made and can be useful when downloading a large site.
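For example, to pause two seconds between requests while mirroring the placeholder example.com:
wget --mirror --wait=2 http://example.com/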
--output-file=logfile
This will cause Wget to output all messages to the logfile instead of the console.
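For a long-running mirror, the log can then be followed from another terminal; wget.log here matches the log file used in the command above:
wget --mirror --output-file=wget.log http://example.com/
tail -f wget.log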
--input-file=file
This specifies a file from which Wget reads the seed URLs, allowing you to download multiple URLs in one run.
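The file is plain text with one URL per line. For instance, if the domains_list.txt used above contained:
http://josephmsexton.com/
http://webtipblog.com/
then the whole set can be downloaded with:
wget --mirror --input-file=domains_list.txt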
--exclude-directories=list
Specify a comma-separated list of directories that should not be downloaded. I find this useful for testing to limit the number of files retrieved.
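For example, assuming the site has /archive and /tag directories that are not needed for a test run (both names are placeholders here):
wget --recursive --exclude-directories=/archive,/tag http://example.com/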
--span-hosts
This option sets Wget to span multiple hosts. It is only necessary when assets for the site are located across multiple domains, for instance when images or scripts are served from a separate domain.
--domains=list
This option specifies a whitelist of domains to retrieve files from. It is necessary when using --span-hosts to prevent Wget from downloading the whole internet.
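Putting the two options together, assuming a site whose assets live on a hypothetical cdn.example.com:
wget --mirror --page-requisites --span-hosts --domains=example.com,cdn.example.com http://example.com/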
There are many more options that can be used; these are just the ones I found useful for archiving a site locally.