I thought it would be simple enough, just httrack or wget it, but it turned out to be much trickier :(
Problem #1: how to get httrack to authenticate
MediaWiki uses a form login and a login token that it stores in a cookie before opening the form :( which is good for security etc., but made it really tricky for me. In the end I could not get it working with httrack and moved on to wget, but not before figuring out how to intercept a cookie from my browser, which was really helpful with debugging later (there's a rough sketch of that right after these links). Some helpful links if you wanna try this:
http://www.dzone.com/snippets/using-wget-download-content
http://httrack.kauler.com/help/CatchURL_tutorial
http://www.httrack.com/html/fcguide.html
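For debugging, the quickest thing is to lift the session cookie straight out of the browser's dev tools and replay it by hand. A minimal sketch of that idea, where the cookie name and value are made-up placeholders (copy the real name=value pair from your browser):

# Replay a session cookie copied from the browser by hand.
# -S prints the server response headers so you can see whether the wiki treats you as logged in.
wget -S -O /dev/null --no-cookies \
     --header 'Cookie: my_wiki_session=0123456789abcdef0123456789abcdef' \
     "http://intranet.example.com/mediawiki/index.php/Main_Page"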
Problem #2: how to get wget to authenticate
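The short version, distilled from the full script at the bottom of this post (host, username and password are the same placeholders used there; the full script also sends wpDomain, wpRemember and wpLoginattempt): fetch Special:UserLogin once to pick up the session cookie and the wpLoginToken hidden field, then POST the credentials together with that token.

# Step 1: grab the login form, keep the session cookie, and fish out the 32-char token.
WP_LOGIN_TOKEN=$(wget -q -O - --save-cookies cookies.txt --keep-session-cookies \
    "http://intranet.example.com/mediawiki/index.php/Special:UserLogin" \
  | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')

# Step 2: post the credentials plus the token, reusing the same cookie jar.
wget -q --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
    --post-data "wpName=someone&wpPassword=setupusthebomb&wpLoginToken=${WP_LOGIN_TOKEN}" \
    "http://intranet.example.com/mediawiki/index.php?title=Special:UserLogin&action=submitlogin&type=login" \
    -O /dev/null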
Problem #3: how to not 'click' on the logout link?!
Simple, right? Just add --reject '*=*' to the configuration..
It turns out that wget (1.14 on Ubuntu) for some reason first downloads all 'rejected' links (logout link included) and then deletes them?! !@#$!@#$
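Something like this naive attempt (same placeholder host as in the script below) shows the problem: wget still fetches the UserLogout page before deleting the file, which logs the session out mid-mirror:

# Naive mirror attempt: the --reject globs don't stop wget from requesting the logout link.
wget --mirror --convert-links --page-requisites --no-parent \
     --load-cookies cookies.txt --keep-session-cookies \
     --reject='*=*,*UserLogout*' \
     "http://intranet.example.com/mediawiki/"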
Thanks Tomàs Reverter for finally pointing me in the right direction with easy instructions for hacking wget.
His patch didn't do the trick for me so I hacked it a bit:
Download the wget source code.
Apply the patch below to src/recur.c, then ./configure && make && sudo make install
In src/recur.c, around line 406, I injected an if statement so that we don't even enqueue 'rejected' URLs:

/* Only enqueue the child URL if it passes the accept/reject rules,
   instead of downloading it first and deleting it afterwards. */
if (acceptable (child->url->url))
  {
    url_enqueue (queue, ci, xstrdup (child->url->url),
                 xstrdup (referer_url), depth + 1,
                 child->link_expect_html, child->link_expect_css);
  }
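A quick sanity check that you're actually running the patched build and not the distro one (make install defaults to /usr/local, which is also why the mirror step in the script below calls /usr/local/bin/wget by full path):

/usr/local/bin/wget --version | head -1   # should report the version you just built
which -a wget                             # lists both the distro and the patched binary, if both are installed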
Unfortunately this still didn't solve my problem. Digging a bit further, I noticed that the acceptable() function only checks the regex rejects, and I had assumed the glob version of --reject would get converted to a regex or something.. NO! I had to specify
--reject-regex='.*=.*,.*UserLogout.*' before it finally started to work.
(The man page doesn't mention --reject-regex, but wget --help does.)
Here is my script that downloads our internal wiki. It is very dirty and probably does loads of unnecessary stuff, but it works, and it is 01:15 and I'm sick and I want to go to bed, but I want to get this out while I still have all the browser tabs open:
#!/bin/bash

WIKI_USERNAME='someone'
WIKI_PASSWORD='setupusthebomb'
TARGETHOST="http://intranet.example.com/mediawiki"
MW_DOMAIN="intranet.example.com"
MW_URL="$TARGETHOST"
INDEX_URL="$TARGETHOST/index.php"
MW_PAGE="Main_Page"
TARGET_DIR="/tmp/wget/"

# MediaWiki uses a login token, and we must have it for this to work.
WP_LOGIN_TOKEN=$(wget -q --no-check-certificate -O - --save-cookies cookies.txt --keep-session-cookies "$MW_URL/index.php/Special:UserLogin" | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')
sleep 1

wget -q --no-check-certificate --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies --post-data "wpName=${WIKI_USERNAME}&wpPassword=${WIKI_PASSWORD}&wpDomain=${MW_DOMAIN}&wpRemember=1&wpLoginattempt=Log%20in&wpLoginToken=${WP_LOGIN_TOKEN}" "$INDEX_URL?title=Special:UserLogin&action=submitlogin&type=login" -O ${TARGET_DIR}/tmp
sleep 1

# Not sure why this is needed, but it is (probably because MediaWiki doesn't like it if you hit a 404 right after logging in).
wget -q --no-check-certificate --load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt "$MW_URL/index.php/${MW_PAGE}" -O ${TARGET_DIR}/tmp.html
sleep 1

/usr/local/bin/wget -t0 -T900 --limit-rate=30k --random-wait -e robots=off --no-check-certificate --load-cookies cookies.txt --keep-session-cookies --mirror --convert-links --backup-converted --html-extension --page-requisites --reject='*=*,*UserLogout*' --exclude-directories='*=*,*UserLogout*' --reject-regex='.*=.*,.*UserLogout.*' --no-parent -P ${TARGET_DIR} -o ${TARGET_DIR}/wget-log "$MW_URL/"

# Problem #4: the following doesn't actually work, because wget decides out of its own
# cleverness to go into background mode and I can't see a way to disable that, I smell another hack :(

# Remove my username from all the pages
cd $TARGET_DIR
grep -ilr ${WIKI_USERNAME} . | xargs sed -i s/${WIKI_USERNAME}/User/g

# Delete pages that include the user name
find -iname "*${WIKI_USERNAME}*" -delete
Problem #4: I don't know how to force wget not to go into background mode :(
Please help me in the comments..