Spider Your DokuWiki Using Wget

Michael Klier recently decided to shut down his blog. Luckily he provides a tarball of all his posts and used a liberal license for his content. With his permission I will repost a few of his old blog posts that I think should remain online for their valuable information.

This post was originally published on April 20th, 2009 at chimeric.de and is licensed under the Creative Commons BY-NC-SA License.

Some of you might have been in that situation already and know that it is sometimes necessary to spider your DokuWiki, for example when you need to rebuild the search index, or when you use the tag plugin and don't want to visit each page yourself to trigger the (re)generation of the needed metadata1).

Here's a short bash snippet using wget that I want to share with you. You have to run it inside your <dokuwiki>/data/pages folder or it won't work.

# run this inside <dokuwiki>/data/pages
for file in $(find . -type f -name '*.txt'); do
    # turn the file path into a DokuWiki page ID:
    # strip the leading ./, replace / with : and drop the .txt suffix
    file=${file#./}
    file=${file//\//:}
    file=$(basename "$file" .txt)
    url="http://yourdomain.org/doku.php?id=$file"
    # fetch the full page (output discarded) to trigger the indexer
    wget -nv "$url" -O /dev/null || echo "ERROR fetching $url"
    # give the indexer time to finish and avoid lock conflicts
    sleep 1
done
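
For example, assuming your wiki lives in /var/www/dokuwiki and the snippet above is saved as spider.sh (both paths are just placeholders), running it would look like this:

cd /var/www/dokuwiki/data/pages
bash ~/spider.sh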

There are probably a million other ways to do this in bash. The reason I walk the pages directory instead of using the <dokuwiki>/data/index/page.idx file is that pages added by a script could be missing from the global index.
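
For comparison, a loop that does trust the index could read the page IDs straight from that file instead. This is just a sketch, assuming page.idx holds one page ID per line, and it will of course miss exactly those pages the index doesn't know about:

# alternative sketch: spider only the pages already listed in the index
while read -r id; do
    wget -nv "http://yourdomain.org/doku.php?id=$id" -O /dev/null
    sleep 1
done < /path/to/dokuwiki/data/index/page.idx   # adjust to your <dokuwiki> path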

Note: I would set the sleep interval to at least one second (if not more) to give the indexer enough time to finish its job and to avoid lock conflicts.

And yeah, it's intentional that I don't use the --spider switch of wget, because it only checks the header response instead of downloading the file, which might not be enough to trigger the indexer.
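
If you'd rather use curl, a roughly equivalent call (just a sketch) also downloads the full response body instead of only the headers:

# -s/-S: quiet but show errors, -f: fail on HTTP errors, body discarded
curl -sSf -o /dev/null "http://yourdomain.org/doku.php?id=$file" \
    || echo "ERROR fetching $file"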

Tags:
guestpost, chimeric.de, dokuwiki, wget, spider
1) DokuWiki uses a webbug to do everything that's needed in the background