web archiving

Published on 19/01/2023

Web archiving is the act of creating archives of web content.

Wget and WARC

--span-hosts lets a recursive crawl follow links to other hosts

--no-clobber skips files that have already been downloaded (not sure it works here: Wget warns that WARC output does not work with --no-clobber and disables the option)

--wait=0.1 with --random-wait pauses between requests, as an alternative to capping bandwidth with --limit-rate=1M
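A minimal sketch of how these options might be combined, assuming a POSIX shell. The `polite_crawl` function, the `DRY_RUN` switch, the `demo` WARC name and the example.org URL are all hypothetical, and `--level=2` is my own choice here, since `--span-hosts` with an unlimited depth has no natural stopping point.

```shell
#!/bin/sh
# polite_crawl URL [WARC_NAME]
# Hypothetical wrapper: crawls across hosts and throttles by waiting
# between requests (--wait, --random-wait) rather than by --limit-rate.
# With DRY_RUN=1 it only prints the command instead of running it.
polite_crawl() {
  url=$1
  warc_name=${2:-crawl}
  # Rebuild the positional parameters as the full wget command line.
  set -- wget \
    --recursive \
    --level=2 \
    --span-hosts \
    --wait=0.1 \
    --random-wait \
    --page-requisites \
    --warc-file="$warc_name" \
    --warc-cdx \
    "$url"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the command first; unset DRY_RUN to actually start the crawl.
DRY_RUN=1
polite_crawl "https://example.org/" demo
```

With `--random-wait`, Wget varies the pause between 0.5 and 1.5 times the `--wait` value, so requests do not arrive at a fixed rhythm.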

wget --recursive \
      --level=inf \
      --no-clobber \
      --warc-file=test \
      --warc-cdx \
      --execute robots=off \
      --page-requisites \
      --adjust-extension \
      --directory-prefix=. \
      --user-agent=Mozilla \
      --limit-rate=1M \
      --continue \
      URL
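To check what ended up in the archive, something like the following could list the captured URLs. This is a sketch under assumptions: `warc_urls` is a hypothetical helper, it expects the gzip-compressed `test.warc.gz` that `--warc-file=test` writes by default, and it relies on a `grep` that supports `-a` (GNU and BSD grep do). Angle brackets are stripped because some tools wrap the URI in them.

```shell
#!/bin/sh
# warc_urls FILE
# Hypothetical helper: prints the unique WARC-Target-URI values found in
# a gzip-compressed WARC file, one URL per line.
warc_urls() {
  gzip -dc "$1" \
    | grep -a '^WARC-Target-URI:' \
    | tr -d '\r<>' \
    | sed 's/^WARC-Target-URI: *//' \
    | sort -u
}

# Only runs when the archive from the command above is present.
if [ -f test.warc.gz ]; then
  warc_urls test.warc.gz
fi
```

The `test.cdx` index written by `--warc-cdx` is plain text and can be read directly with a pager.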

