splitbrain.org

electronic brain surgery since 2001

Paper Backup (2) Automation Scripts

This is the second part of a three-part series describing my automatic paper backup system. In this part I explain the different automation scripts used to create a searchable PDF from a scan and upload it to the backup locations.

  1. Automation Scripts ← you are here

The setup is multi-user capable. E.g. in my setup, files will be stored in either Kaddi's or my home directory on the NAS and on Google Drive.


Note: all scripts below are available in a GitHub repository

Script 1: Scan

The first step simply acquires the raw image data from the previously set up scanner. The script requires a “job identifier” as its first parameter, which will be the same for all following scripts. It's simply a unique name identifying the working folder, and it will also be the name of the final output PDF.

01-scan.sh
#!/bin/bash
 
BASE="/tmp"
 
if [ -z "$1" ]; then
    echo "Usage: $0 <jobid>"
    echo
    echo "Please provide unique jobid name as first parameter"
    exit 1
fi
 
OUTPUT="$BASE/$1"
mkdir -p "$OUTPUT"
 
echo 'scanning...'
scanimage --resolution 300 \
          --batch="$OUTPUT/scan_%03d.pnm" \
          --format=pnm \
          --mode Gray \
          --source 'ADF Duplex'
echo "Output in $OUTPUT/scan*.pnm"

The script automatically scans all pages in the document feeder as grayscale PNM images.
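A timestamp makes a convenient job identifier, which is exactly the convention the controller script at the end of this article uses. A quick sketch (the actual scan invocation is commented out, since it needs the scanner attached):

```shell
# Derive a unique job id from the current time; the scan script will
# then write its images to /tmp/<jobid>/scan_001.pnm, scan_002.pnm, ...
JOBID=$(date '+%Y-%m-%d_%H%M%S')
echo "$JOBID"
# ./01-scan.sh "$JOBID"   # uncomment to actually run the scan
```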

Script 2: Cleanup, OCR, PDF Generation

The next step is the more complicated one. First, the input images are cropped. Then blank pages are detected and removed. The remaining images are cleaned up, and finally OCR is applied and a “sandwich” PDF 1) is created. This requires a few more utilities to be installed:

$> sudo apt-get install imagemagick bc exactimage pdftk \
   tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng

The following script uses these tools to do all the work, based on the previously scanned images (identified by job ID).

02-createpdf.sh
#!/bin/bash
 
LANGUAGE="deu" # the tesseract language
BASE="/tmp"
 
if [ -z "$1" ]; then
    echo "Usage: $0 <jobid>"
    echo
    echo "Please provide existing jobid as first parameter"
    exit 1
fi
 
OUTPUT="$BASE/$1"
 
if [ ! -d "$OUTPUT" ]; then
    echo "jobid does not exist"
    exit 1
fi
 
cd "$OUTPUT"
 
# cut borders
echo 'cutting borders...'
for i in scan_*.pnm; do
    mogrify -shave 50x5 "${i}"
done
 
# check if the page is blank
# http://philipp.knechtges.com/?p=190
echo 'checking for blank pages...'
for i in scan_*.pnm; do
    echo "${i}"
    histogram=`convert "${i}" -threshold 50% -format %c histogram:info:-`
    white=`echo "${histogram}" | grep "#FFFFFF" | sed -n 's/^ *\(.*\):.*$/\1/p'`
    black=`echo "${histogram}" | grep "#000000" | sed -n 's/^ *\(.*\):.*$/\1/p'`
    blank=`echo "scale=4; ${black}/${white} < 0.005" | bc`
 
    if [ ${blank} -eq "1" ]; then
        echo "${i} seems to be blank - removing it..."
        rm "${i}"
    fi
done
 
# apply text cleaning and convert to tif
echo 'cleaning pages...'
for i in scan_*.pnm; do
    echo "${i}"
    convert "${i}" -contrast-stretch 1% -level 29%,76% "${i}.tif"
done
 
# do OCR
echo 'doing OCR...'
for i in scan_*.pnm.tif; do
    echo "${i}"
    tesseract "$i" "$i" -l $LANGUAGE hocr
    hocr2pdf -i "$i" -s -o "$i.pdf" < "$i.hocr"
done
 
# create PDF
echo 'creating PDF...'
pdftk *.tif.pdf cat output "$1.pdf"
 
echo "created $OUTPUT/$1.pdf"

The language for OCR processing with Tesseract is configured at the top of the script. You might want to change that if your documents aren't in German; remember to install the matching language package then.

Script 3: Copy to Synology NAS

Our Synology DiskStation NAS is our primary storage and backup server, so the prepared documents are stored there.

SFTP is used for the transfer 2). To allow passwordless access, SSH has to be enabled on the NAS and some keys have to be exchanged.

First log into your DiskStation web interface, then enable the needed services:

  • Configuration Manager
    • Terminal & SNMP → Terminal → Enable SSH service
    • File Services → FTP → SFTP → Enable SFTP Service

Now SSH into the NAS, log in as root 3), and edit /etc/passwd. Change the shell of all users you want to give access from /sbin/nologin to /bin/sh 4).

Now log back in on the Raspberry Pi and create an SSH key for the pi user. Don't use a passphrase, and make sure you create a DSA key or you will have problems with curl later!

$> ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/pi/.ssh/id_dsa): 
Created directory '/home/pi/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/pi/.ssh/id_dsa.
Your public key has been saved in /home/pi/.ssh/id_dsa.pub.

Next copy this key to all the users that will use the scanner:

$> ssh-copy-id -i ~/.ssh/id_dsa.pub andi@diskstation
andi@diskstation's password: 
Now try logging into the machine, with "ssh 'andi@diskstation'", and check in:

  ~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Now our script can copy over the created PDF files:

03-nascopy.sh
#!/bin/bash
 
BASE="/tmp"
HOST="diskstation"
FOLDER="documents"
YEAR=`date '+%Y'`
 
if [ -z "$1" ]; then
    echo "Usage: $0 <jobid> <user> [<keyword>]"
    echo
    echo "Please provide existing jobid as first parameter"
    exit 1
fi
 
if [ -z "$2" ]; then
    echo "Usage: $0 <jobid> <user> [<keyword>]"
    echo
    echo "Please provide user as second parameter"
    exit 1
fi
 
OUTPUT="$BASE/$1"
REMOTE="sftp://$2@$HOST/home/$FOLDER/$YEAR/$3/$1.pdf"
LOCAL="$OUTPUT/$1.pdf"
 
if [ ! -f "$LOCAL" ]; then
    echo "jobid does not exist"
    exit 1
fi
 
 
echo copying to $REMOTE
curl --ftp-create-dirs --insecure -T "$LOCAL" "$REMOTE"

This time the script expects two more parameters after the job ID: a user name (one of the users whose access we just set up) and an optional keyword. The keyword is used as a sub folder inside the documents folder configured at the top of the script. This will be our main way to categorize scans later on: a menu will allow picking a keyword, and the scan will automatically be executed and placed in the right folder. Additionally, a sub folder for the current year is created.
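To illustrate how the target path is assembled (user, keyword and job ID are made-up example values):

```shell
# Fixed configuration from the top of the script:
HOST="diskstation"; FOLDER="documents"; YEAR="2015"

# Per-invocation parameters (hypothetical examples):
USER="andi"; KEYWORD="invoices"; JOBID="2015-08-30_101500"

REMOTE="sftp://$USER@$HOST/home/$FOLDER/$YEAR/$KEYWORD/$JOBID.pdf"
echo "$REMOTE"
# sftp://andi@diskstation/home/documents/2015/invoices/2015-08-30_101500.pdf
```

On the first upload for a new year or keyword, curl's --ftp-create-dirs flag creates the missing folders on the NAS automatically.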

Script 4: Copy to Google Drive

A backup on the NAS is good, but a second off-site backup is better. Having an excellent search on top of that is even better. That's why I want a second copy at Google Drive.

For that we make use of the excellent rclone utility. It can copy files from and to different cloud storage services, one of them being Google Drive.

First install the Linux ARM binary:

$> wget http://downloads.rclone.org/rclone-v1.05-linux-arm.zip
$> unzip rclone-v1.05-linux-arm.zip
$> sudo cp rclone-v1.05-linux-arm/rclone /usr/local/bin/
$> sudo chmod 755 /usr/local/bin/rclone
$> sudo mkdir -p /usr/local/man/man1
$> sudo cp rclone-v1.05-linux-arm/rclone.1 /usr/local/man/man1/

Next, create a profile for every user that will use the service later. Be sure to open the displayed URL and authenticate as the correct Google user! Name the remote profile exactly like the user.

$> rclone --config=$HOME/.rclone.conf config

Just follow the interactive dialog to create a “remote” of type drive, named after your user.
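After completing the dialog, ~/.rclone.conf contains a section named after the user. For a user andi it should look roughly like this (the exact layout is an assumption, and the token line is a placeholder for the JSON blob Google issues):

```
[andi]
type = drive
client_id =
client_secret =
token = {"access_token":"...","refresh_token":"...","expiry":"..."}
```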

Then the following script will take care of copying the finished PDF to your Google Drive. Since I occasionally got errors from the Google API, it will retry the upload up to three times before giving up.

04-gdrivecopy.sh
#!/bin/bash
 
BASE="/tmp"
FOLDER="documents"
YEAR=`date '+%Y'`
 
if [ -z "$1" ]; then
    echo "Usage: $0 <jobid> <user> [<keyword>]"
    echo
    echo "Please provide existing jobid as first parameter"
    exit 1
fi
 
if [ -z "$2" ]; then
    echo "Usage: $0 <jobid> <user> [<keyword>]"
    echo
    echo "Please provide user as second parameter"
    exit 1
fi
 
OUTPUT="$BASE/$1"
REMOTE="$2://$FOLDER/$YEAR/$3/"
LOCAL="$OUTPUT/$1.pdf"
 
if [ ! -f "$LOCAL" ]; then
    echo "jobid does not exist"
    exit 1
fi
 
for X in 1 2 3; do
   echo "uploading to Google Drive (try $X)"
   if rclone --config=$HOME/.rclone.conf copy "$LOCAL" "$REMOTE"; then
       exit 0
   fi
   sleep 15 # wait 15 seconds before retrying
done
exit 1

It takes exactly the same parameters as the NAS copy script above.
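The retry loop is a pattern worth keeping around. Here is a minimal sketch of the same idea as a reusable function; flaky is a stand-in for an unreliable upload that fails twice before succeeding:

```shell
# Run a command up to three times, stopping at the first success.
retry3() {
    for try in 1 2 3; do
        "$@" && return 0
        sleep 0   # the real script sleeps 15 seconds between tries
    done
    return 1
}

# Stand-in for a flaky upload: fails on the first two calls.
attempts=0
flaky() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

retry3 flaky && echo "succeeded on attempt $attempts"
```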

Script 5: Cleanup

Nothing to see here. Just delete the directory containing all the temporary files:

05-cleanup.sh
#!/bin/bash
 
BASE="/tmp"
 
if [ -z "$1" ]; then
    echo "Usage: $0 <jobid>"
    echo
    echo "Please provide existing jobid as first parameter"
    exit 1
fi
 
OUTPUT="$BASE/$1"
 
if [ ! -d "$OUTPUT" ]; then
    echo "jobid does not exist"
    exit 1
fi
 
rm -rf "$OUTPUT"

Executing the whole chain

Finally we need a way to execute each of the steps above in one go. That's where the controller script comes into play.

scan.sh
#!/bin/bash
 
DIR=$( cd $( dirname "${BASH_SOURCE[0]}" ) && pwd )
JOBID=`date '+%Y-%m-%d_%H%M%S'`
USER=$1
KEYWORD=$2
 
if [ -z "$USER" ]; then
    echo "Usage: $0 <user> [<keyword>]"
    echo "please give a user"
    exit 1
fi
 
 
# run the scanning in foreground
$DIR/01-scan.sh "$JOBID"
 
# execute processing in background
(
    # lock processing to make sure only one is running at a time
    (
        flock -x 200 # wait for lock
 
        $DIR/02-createpdf.sh "$JOBID"
        $DIR/03-nascopy.sh "$JOBID" "$USER" "$KEYWORD"
        $DIR/04-gdrivecopy.sh "$JOBID" "$USER" "$KEYWORD"
        $DIR/05-cleanup.sh "$JOBID"
 
    ) 200>/tmp/scan.lock
) &

What I do here is run the scanning process in the foreground and then start the time-consuming PDF creation in the background. The background process is locked to make sure only one of them is ever running at a time, even when multiple processes have been started.

This allows me to quickly scan a couple of documents without needing to wait for anything but the scanner. The processing then can take all the time it needs.
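The effect of the flock construct can be demonstrated in isolation: two background jobs grab the same exclusive lock, so their work never interleaves even though both are started “in parallel” (the file names here are arbitrary):

```shell
LOCK=/tmp/flockdemo.lock
OUT=/tmp/flockdemo.out
: > "$OUT"

for job in A B; do
    (
        flock -x 200               # wait for the exclusive lock
        echo "$job start" >> "$OUT"
        sleep 0.2                  # simulate the slow PDF processing
        echo "$job end"   >> "$OUT"
    ) 200>"$LOCK" &
done
wait
cat "$OUT"   # "start" and "end" of one job always appear as a pair
```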

Tags:
scanner, SANE, PDF, OCR, tesseract, linux, howto
1)
a PDF containing images with invisible, but selectable and searchable text on top
2)
of course there are other ways, but SFTP with curl makes it very easy to transfer a file and create all needed directories in one go
3)
password is the same as the one for admin on the web interface
4)
This step is not really necessary, it just eases the transfer of the keys. You could also just copy them via sftp