I thought it would be simple enough, just httrack or wget it, but it turned out to be much trickier :(
Problem #1: how to get httrack to authenticate
MediaWiki uses a form login with a login token that it stores in a cookie before it even shows you the form :( which is good for security etc. but made it really tricky for me. In the end I could not get it working with httrack and moved on to wget, but not before figuring out how to intercept a cookie from my browser, which was really helpful with debugging later. Some helpful links if you wanna try this:
http://www.dzone.com/snippets/using-wget-download-content
http://httrack.kauler.com/help/CatchURL_tutorial
http://www.httrack.com/html/fcguide.html
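For reference (not something those links spell out): wget's --load-cookies expects the old Netscape cookies.txt format, so a session cookie copied out of your browser's dev tools has to end up looking roughly like the sketch below. The fields must be tab-separated in the real file (padded with spaces here for readability), and the cookie name and value are just placeholders:

# Netscape HTTP Cookie File
# domain               subdomains  path  secure  expiry  name             value
intranet.example.com   FALSE       /     FALSE   0       my_wiki_session  0123456789abcdef0123456789abcdef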
Problem #2: how to get wget to authenticate
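Skipping ahead a bit: MediaWiki wants the hidden wpLoginToken field from the login form plus the session cookie it set while serving that form, and then a POST of the credentials together with that token. Here is a trimmed-down sketch of what the full script at the end of this post does (placeholder credentials, most options left out):

# 1. Fetch the login form, keeping the session cookie, and pull out the 32-char token.
TOKEN=$(wget -q -O - --save-cookies cookies.txt --keep-session-cookies \
  "http://intranet.example.com/mediawiki/index.php/Special:UserLogin" \
  | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')

# 2. POST the credentials together with the token, reusing the same cookie jar.
wget -q --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies \
  --post-data "wpName=someone&wpPassword=secret&wpLoginToken=${TOKEN}" \
  -O /dev/null \
  "http://intranet.example.com/mediawiki/index.php?title=Special:UserLogin&action=submitlogin&type=login"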
Problem #3: how to not 'click' on the logout link?!
Simple, right? Just add --reject '*=*' to the options..
It turns out that wget (1.14 on Ubuntu) for some reason first downloads all 'rejected' links and then deletes them?! !@#$!@#$
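(A quick way to check whether the logout link is actually being hit, and later whether the fix worked, is to grep the mirror log; this assumes the wget-log location used in the script further down:)

grep -n 'UserLogout' /tmp/wget/wget-log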
Thanks to Tomàs Reverter for finally pointing me in the right direction with easy instructions for hacking wget.
His patch didn't do the trick for me, so I hacked it a bit:
Download the wget source code.
Then, after applying the patch below, ./configure && make && sudo make install
In src/recur.c, on line 406, I injected an if statement so that we don't even enqueue 'rejected' URLs:

  /* Only queue the child URL if it passes the accept/reject rules. */
  if (acceptable (child->url->url))
    {
      url_enqueue (queue, ci, xstrdup (child->url->url),
                   xstrdup (referer_url), depth + 1,
                   child->link_expect_html, child->link_expect_css);
    }
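One gotcha worth checking after the make install step (a general note, not from Tomàs' instructions): the freshly built wget lands in /usr/local/bin while Ubuntu's packaged one stays in /usr/bin, which is why the mirror command in the script below calls /usr/local/bin/wget by its full path. A quick way to see which one you are actually running:

which -a wget
/usr/local/bin/wget --version | head -n 1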
This unfortunately still did not solve my problem. Digging a bit further, I noticed that the acceptable() method actually only checks regex rejects, and I had assumed the glob version of --reject would get converted or something.. NO! I had to specify
--reject-regex='.*=.*,.*UserLogout.*'
before it finally started to work.
(The man page doesn't mention --reject-regex, but I realized wget --help does.)
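A quick sanity check (not from the original post): ask the binary itself which version you are running and which regex options it knows about:

wget --version | head -n 1
wget --help | grep -i regex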
Here is my script that downloads our internal wiki. It is very dirty and probably does loads of unnecessary stuff, but it works, and it is 01:15 and I'm sick and I want to go to bed, but I want to get this out while I still have all the browser tabs open:
#!/bin/bash
WIKI_USERNAME='someone'
WIKI_PASSWORD='setupusthebomb'
TARGETHOST="http://intranet.example.com/mediawiki"
MW_DOMAIN="intranet.example.com"
MW_URL="$TARGETHOST"
INDEX_URL="$TARGETHOST/index.php"
MW_PAGE="Main_Page"
TARGET_DIR="/tmp/wget/"
# Make sure the target directory exists before we write into it.
mkdir -p ${TARGET_DIR}

# Mediawiki uses a login token, and we must have it for this to work.
WP_LOGIN_TOKEN=$(wget -q --no-check-certificate -O - --save-cookies cookies.txt --keep-session-cookies "$MW_URL/index.php/Special:UserLogin" | grep wpLoginToken | grep -o '[a-z0-9]\{32\}')
sleep 1

# Log in by posting the credentials together with the login token.
wget -q --no-check-certificate --load-cookies cookies.txt --save-cookies cookies.txt --keep-session-cookies --post-data "wpName=${WIKI_USERNAME}&wpPassword=${WIKI_PASSWORD}&wpDomain=${MW_DOMAIN}&wpRemember=1&wpLoginattempt=Log%20in&wpLoginToken=${WP_LOGIN_TOKEN}" "$INDEX_URL?title=Special:UserLogin&action=submitlogin&type=login" -O ${TARGET_DIR}/tmp
sleep 1

# Not sure why this is needed, but it is (probably because mediawiki doesn't
# like it if you hit a 404 right after logging in).
wget -q --no-check-certificate --load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt "$MW_URL/index.php/${MW_PAGE}" -O ${TARGET_DIR}/tmp.html
sleep 1

# Mirror the whole wiki using the patched wget in /usr/local/bin.
/usr/local/bin/wget -t0 -T900 --limit-rate=30k --random-wait -e robots=off \
  --no-check-certificate --load-cookies cookies.txt --keep-session-cookies \
  --mirror --convert-links --backup-converted --html-extension --page-requisites \
  --reject='*=*,*UserLogout*' --exclude-directories='*=*,*UserLogout*' \
  --reject-regex='.*=.*,.*UserLogout.*' --no-parent \
  -P ${TARGET_DIR} -o ${TARGET_DIR}/wget-log "$MW_URL/"

# Problem #4: the following doesn't actually work, because wget decides out of
# its own cleverness to go into background mode and I can't see a way to
# disable that. I smell another hack :(

# Remove my username from all the pages.
cd $TARGET_DIR
grep -ilr ${WIKI_USERNAME} . | xargs sed -i "s/${WIKI_USERNAME}/User/g"
# Delete pages that include the user name.
find -iname "*${WIKI_USERNAME}*" -delete
Problem #4: don't know how to force wget to not enter background mode :(
Please help me in the comments..
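(One possible workaround, sketched here as an assumption rather than something from the script above: since the problem only bites because the scrubbing runs in the same script as the mirror, the scrub could live in a tiny second script run by hand once wget has finished, so it no longer matters whether wget backgrounded itself. Reusing the placeholder names from above:)

#!/bin/sh
# Hypothetical post-processing step: run manually after the mirror has finished.
WIKI_USERNAME='someone'
TARGET_DIR='/tmp/wget/'

cd "$TARGET_DIR" || exit 1
# Replace the username with 'User' in every mirrored page that mentions it.
grep -ilr "$WIKI_USERNAME" . | xargs -r sed -i "s/${WIKI_USERNAME}/User/g"
# Delete pages whose filename contains the username.
find . -iname "*${WIKI_USERNAME}*" -delete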
Andrew: Execute an external command (i.e. on the command line) to bring it to the foreground somewhere in the code.
Not sure what you mean, Andrew.
But in any case, I decided not to run this every time, because I think it will cause all the modified pages to get re-downloaded. I'll just run it when I need to give it to somebody.. thanks