Wednesday, 17 July 2013

the day I had to hack wget

Thanks to Richard, who insists on documenting stuff on MediaWiki, I had to figure out how to make a static mirror of the HTML so that I a) have offline access and b) am 100% sure anything I put there is backed up properly and there is no chance of losing any information.

I thought it would be simple enough, just httrack or wget it, but it turned out to be much trickier :(

Problem #1: how to get httrack to authenticate

MediaWiki uses a form login with a login token that it stores in a cookie before it even shows the form :( which is good for security etc. but made things really tricky for me. In the end I could not get it working with httrack and moved on to wget, but not before figuring out how to intercept a cookie from my browser, which was really helpful with debugging later (see the sketch after these links). Some helpful links if you wanna try this:
http://www.dzone.com/snippets/using-wget-download-content
http://httrack.kauler.com/help/CatchURL_tutorial
http://www.httrack.com/html/fcguide.html
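
For what it's worth, once you have copied a session cookie out of the browser's dev tools you can hand it straight to wget to check whether a logged-in page downloads at all. A minimal debugging sketch; the cookie name and value here are made up and will differ on your wiki:

# fetch one page using a session cookie copied from the browser (values below are fake)
wget --no-check-certificate \
     --header='Cookie: my_wiki_session=0123456789abcdef0123456789abcdef' \
     -O /tmp/Main_Page.html \
     'http://intranet.example.com/mediawiki/index.php/Main_Page'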

Problem #2: how to get wget to authenticate

It took some time to figure out the complicated login process mentioned above. This post from amadeus helped me get on the right track.
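
In case it helps to see the flow in isolation, here is a rough sketch of the same two-step login done with curl, using the same form fields (wpName, wpPassword, wpLoginToken) and URLs as the wget script further down. Not guaranteed to match every MediaWiki version, but it shows the idea: fetch the form to get the session cookie and the 32-character token, then post the form with both.

# step 1: fetch the login form, keep the session cookie, pull out wpLoginToken
TOKEN=$(curl -sk -c cookies.txt 'http://intranet.example.com/mediawiki/index.php/Special:UserLogin' \
        | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')

# step 2: post the credentials together with the session cookie and the token
curl -sk -b cookies.txt -c cookies.txt \
     --data-urlencode 'wpName=someone' \
     --data-urlencode 'wpPassword=setupusthebomb' \
     --data-urlencode "wpLoginToken=$TOKEN" \
     --data-urlencode 'wpLoginattempt=Log in' \
     -o /dev/null \
     'http://intranet.example.com/mediawiki/index.php?title=Special:UserLogin&action=submitlogin&type=login'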

Problem #3: how to not 'click' on the logout link?!

Simple, right? Just add --reject '*=*' to the wget options..
It turns out that wget (1.14 on Ubuntu) for some reason first downloads all 'rejected' links and only then deletes them?! !@#$!@#$
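
You can actually watch this happening: wget fetches the page and then logs a message when it removes it again. Grepping the mirror log (path as in the script below; the exact wording of the message may differ between wget versions) shows the casualties:

grep -i 'rejected' /tmp/wget/wget-log
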
Thanks to Tomàs Reverter for finally pointing me in the right direction with easy instructions for hacking wget.
His patch didn't do the trick for me, so I hacked it a bit:
Download the wget source code.
In src/recur.c on line 406 I injected an if statement so that we don't even enqueue 'rejected' URLs:

/* only enqueue child URLs that pass the accept/reject rules */
if (acceptable (child->url->url)) {
  url_enqueue (queue, ci, xstrdup (child->url->url),
               xstrdup (referer_url), depth + 1,
               child->link_expect_html,
               child->link_expect_css);
}
Then ./configure && make && sudo make install 
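
The self-built binary normally lands in /usr/local/bin next to the distro one in /usr/bin, hence the explicit /usr/local/bin/wget in the script below. A quick check that you are actually running the patched build:

which -a wget
/usr/local/bin/wget --version | head -1
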
Unfortunately this still did not solve my problem. Digging a bit further, I noticed that the acceptable() method actually only checks regex rejects; I had assumed the glob version of --reject gets converted into a regex or something.. NO! I had to specify
--reject-regex='.*=.*,.*UserLogout.*' before it finally started to work.
(The man page doesn't mention --reject-regex, but I noticed that wget --help does.)
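
If you want to sanity-check a reject pattern before kicking off a multi-hour mirror, you can run the same regex over a list of URLs with grep and see which ones it would skip (urls.txt here is just a hypothetical file with one URL per line):

grep '.*=.*,.*UserLogout.*' urls.txt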

Here is my script that downloads our internal wiki. It is very dirty and probably does loads of unnecessary stuff, but it works. It is 01:15, I'm sick and I want to go to bed, but I want to get this out while I still have all the browser tabs open:

WIKI_USERNAME='someone'
WIKI_PASSWORD='setupusthebomb'
TARGETHOST="http://intranet.example.com/mediawiki"
MW_DOMAIN="intranet.example.com"
MW_URL="$TARGETHOST"
INDEX_URL="$TARGETHOST/index.php"
MW_PAGE="Main_Page"
TARGET_DIR="/tmp/wget/"

# MediaWiki uses a login token, and we must have it for this to work.
WP_LOGIN_TOKEN=$(wget -q --no-check-certificate -O - --save-cookies cookies.txt --keep-session-cookies "$MW_URL/index.php/Special:UserLogin" | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')

sleep 1

wget -q --no-check-certificate --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies --post-data "wpName=${WIKI_USERNAME}&wpPassword=${WIKI_PASSWORD}&wpDomain=${MW_DOMAIN}&wpRemember=1&wpLoginattempt=Log%20in&wpLoginToken=${WP_LOGIN_TOKEN}" "$INDEX_URL?title=Special:UserLogin&action=submitlogin&type=login" -O ${TARGET_DIR}/tmp

sleep 1

# not sure why this is needed, but it is (probably because MediaWiki doesn't like it if you hit a 404 right after logging in)
wget -q --no-check-certificate --load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt "$MW_URL/index.php/${MW_PAGE}" -O ${TARGET_DIR}/tmp.html

sleep 1

/usr/local/bin/wget -t0 -T900 --limit-rate=30k --random-wait -e robots=off \
    --no-check-certificate --load-cookies cookies.txt --keep-session-cookies \
    --mirror --convert-links --backup-converted --html-extension --page-requisites \
    --reject='*=*,*UserLogout*' --exclude-directories='*=*,*UserLogout*' \
    --reject-regex='.*=.*,.*UserLogout.*' \
    --no-parent -P "${TARGET_DIR}" -o "${TARGET_DIR}/wget-log" "$MW_URL/"

# Problem #4: the following doesn't actually work,
# because wget decides out of its own cleverness to go into
# background mode and I can't see a way to disable that, I smell another hack :(
# remove my username from all the pages
cd "$TARGET_DIR"
grep -ilr "${WIKI_USERNAME}" . | xargs sed -i "s/${WIKI_USERNAME}/User/g"

# delete pages that include the username
find . -iname "*${WIKI_USERNAME}*" -delete
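
Running it is nothing fancy (mirror-wiki.sh is just whatever you saved the script as; the mkdir makes sure the target directory exists before the early wget -O calls try to write into it):

mkdir -p /tmp/wget
bash mirror-wiki.sh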


Problem #4: don't know how to force wget not to enter background mode :(

Please help me in the comments..