Thanks to Richard, who insists on documenting stuff on MediaWiki, I had to figure out how to make a static mirror of the HTML so that I a) have offline access and b) am 100% sure anything I put there is backed up properly and there is no chance of losing any information.
I thought it would be simple enough, just httrack or wget it, but it turned out to be much trickier :(
Problem #1: how to get httrack to authenticate
MediaWiki uses a form login, with a login token that it stores in a cookie before even opening the form :( which is good for security etc. but made it real tricky for me. In the end I could not get it working with httrack and moved on to wget, but not before figuring out how to intercept a cookie from my browser, which was really helpful with debugging later (see the snippet after these links). Some helpful links if you wanna try this:
http://www.dzone.com/snippets/using-wget-download-content
http://httrack.kauler.com/help/CatchURL_tutorial
http://www.httrack.com/html/fcguide.html
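If all you want is to reuse a session cookie fished out of the browser's dev tools, wget can take it straight on the command line. This is just a sketch; the cookie name and value are placeholders for whatever your wiki actually sets:
# Reuse a session cookie copied from the browser (cookie name/value are placeholders).
wget --no-check-certificate \
     --header='Cookie: wiki_session=PASTE_VALUE_FROM_BROWSER' \
     -O Main_Page.html \
     'http://intranet.example.com/mediawiki/index.php/Main_Page'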
Problem #2: how to get wget to authenticate
It took some time to figure out the complicated login process mentioned above.
This post from amadeus helped me get on the right track.
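The gist of it is a two-step dance: fetch Special:UserLogin once to get a fresh wpLoginToken (plus the session cookie it is tied to), then POST the credentials together with that token. Roughly like this; the full, cookie-juggling version is in the script at the bottom:
# 1. Grab the 32-char wpLoginToken from the login form, keeping the session cookie.
TOKEN=$(wget -q -O - --save-cookies cookies.txt --keep-session-cookies \
    "http://intranet.example.com/mediawiki/index.php/Special:UserLogin" \
    | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')
# 2. POST username, password and the token back to the login endpoint.
wget -q --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
    --post-data "wpName=someone&wpPassword=secret&wpLoginToken=${TOKEN}" \
    -O /dev/null \
    "http://intranet.example.com/mediawiki/index.php?title=Special:UserLogin&action=submitlogin&type=login"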
Problem #3: how to not 'click' on the logout link?!
It turns out that wget (1.14 on Ubuntu) first downloads all 'rejected' links and only deletes them afterwards (apparently so it can still scan rejected HTML pages for further links)?! !@#$!@#$
Thanks to Tomàs Reverter for finally pointing me in the right direction with easy instructions for hacking wget.
His patch didn't do the trick for me, so I hacked it a bit:
Download the wget source code.
In src/recur.c, around line 406, I injected an if statement so that 'rejected' URLs don't even get enqueued:
/* Only enqueue the child URL if it passes wget's accept/reject filtering. */
if (acceptable (child->url->url)) {
    url_enqueue (queue, ci, xstrdup (child->url->url),
                 xstrdup (referer_url), depth + 1,
                 child->link_expect_html,
                 child->link_expect_css);
}
Then rebuild and install:
./configure && make && sudo make install
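make install drops the patched binary into /usr/local/bin by default (which is why the script below calls /usr/local/bin/wget explicitly), so a quick sanity check doesn't hurt:
# Make sure the freshly built wget is the one you are about to run,
# or just call it by its full path like the script below does.
which wget
/usr/local/bin/wget --version | head -n1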
Unfortunately, this still did not solve my problem. Digging a bit further, I noticed that the acceptable() method actually only checks the regex rejects; I had assumed the glob version of reject gets converted or something.. NO! I had to specify
--reject-regex='.*=.*,.*UserLogout.*' before it finally started to work.
(The man page doesn't mention --reject-regex, but wget --help does.)
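For what it's worth, --reject-regex takes a single POSIX regex rather than a comma-separated list like --reject does, so if the comma-separated pattern above doesn't bite on your build, an alternation might be the safer spelling. Untested variant:
# --reject-regex is one regex, not a list, so use an alternation to match either pattern.
/usr/local/bin/wget --mirror --no-parent --page-requisites \
    --reject-regex='(=|UserLogout)' \
    "http://intranet.example.com/mediawiki/"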
Here is my script that downloads our internal wiki. It is very dirty and probably does loads of unnecessary stuff, but it works. It is 01:15, I'm sick and I want to go to bed, but I want to get this out while I still have all the browser tabs open:
WIKI_USERNAME='someone'
WIKI_PASSWORD='setupusthebomb'
TARGETHOST="http://intranet.example.com/mediawiki"
MW_DOMAIN="intranet.example.com"
MW_URL="$TARGETHOST"
INDEX_URL="$TARGETHOST/index.php"
MW_PAGE="Main_Page"
TARGET_DIR="/tmp/wget/"
# Mediawiki uses a login token, and we must have it for this to work.
WP_LOGIN_TOKEN=$(wget -q --no-check-certificate -O - \
    --save-cookies cookies.txt --keep-session-cookies \
    "$MW_URL/index.php/Special:UserLogin" \
    | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')
sleep 1
wget -q --no-check-certificate --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
    --post-data "wpName=${WIKI_USERNAME}&wpPassword=${WIKI_PASSWORD}&wpDomain=${MW_DOMAIN}&wpRemember=1&wpLoginattempt=Log%20in&wpLoginToken=${WP_LOGIN_TOKEN}" \
    "$INDEX_URL?title=Special:UserLogin&action=submitlogin&type=login" -O ${TARGET_DIR}/tmp
sleep 1
# Not sure why this is needed, but it is (probably because MediaWiki doesn't like it if you hit a 404 right after logging in).
wget -q --no-check-certificate --load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt "$MW_URL/index.php/${MW_PAGE}" -O ${TARGET_DIR}/tmp.html
sleep 1
/usr/local/bin/wget -t0 -T900 --limit-rate=30k --random-wait -e robots=off \
    --no-check-certificate --load-cookies cookies.txt --keep-session-cookies \
    --mirror --convert-links --backup-converted --html-extension --page-requisites \
    --reject='*=*,*UserLogout*' --exclude-directories='*=*,*UserLogout*' \
    --reject-regex='.*=.*,.*UserLogout.*' --no-parent \
    -P ${TARGET_DIR} -o ${TARGET_DIR}/wget-log "$MW_URL/"
# Problem #4: the following doesn't actually work, because wget decides
# out of its own cleverness to go into background mode and I can't see
# a way to disable that. I smell another hack :(
# Remove my username from all the pages
cd "$TARGET_DIR"
grep -ilr "${WIKI_USERNAME}" . | xargs sed -i "s/${WIKI_USERNAME}/User/g"
# Delete pages whose filename includes the user name
find . -iname "*${WIKI_USERNAME}*" -delete
Problem #4: I don't know how to force wget to not enter background mode :(
Please help me in the comments..