My magazine scanning workflow

2025-04-22 by madcap

As you may have noticed, I've been scanning my complete collection of Cyber.net magazines. This post serves as a mental note of the steps I use to scan and publish each issue.

Step 1 - Scan

First I scan the magazine on my flatbed scanner (a CanoScan 9000F) at 300 DPI, using the following software:

XSane for Linux

Let's say the output is a file named magazine_original.pdf.

Step 2 - Resize file

Since the generated file is too big, I usually shrink it using Ghostscript:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=magazine.pdf magazine_original.pdf

Using /printer instead of /prepress produces a smaller file, but I find the image quality too low.
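
A quick way to judge whether a preset is worth it is to compare the file sizes before and after. This is only a sketch: size_report is a name I made up for illustration and is not part of Ghostscript.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Ghostscript): report the size of the
# resized PDF as a percentage of the original.
size_report() {
    local before after
    # Arithmetic expansion strips any padding that wc may add.
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    if [ "$before" -eq 0 ]; then
        echo "error: $1 is empty" 1>&2
        return 1
    fi
    echo "$1: ${before} bytes, $2: ${after} bytes ($(( after * 100 / before ))% of original)"
}
```

After sourcing it, run it as: size_report magazine_original.pdf magazine.pdf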

Step 3 - Set metadata info

Now set the metadata of the PDF to whatever you like, using exiftool. For example:

exiftool -Title="Cyber.net #25 (1997)" -Producer="Your name" -Keywords="revista internet portugal cyber net" -Subject="Revista portuguesa dedicada à Internet (1995-1998)" cybernet_25.pdf
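
When tagging many issues, typing the same command over and over gets error-prone. Here is a minimal sketch that derives the title and file name from the issue number and year. The function names (issue_title, issue_file, tag_issue) and the cybernet_NN.pdf naming scheme are my own assumptions for illustration; the metadata values are the ones from the example above.

```shell
#!/usr/bin/env bash

# Hypothetical helpers: derive the title and file name from issue number and year.
issue_title() { printf 'Cyber.net #%s (%s)\n' "$1" "$2"; }
issue_file()  { printf 'cybernet_%s.pdf\n' "$1"; }

# Tag one issue; Producer, Keywords and Subject are the values from the example.
tag_issue() {
    local number="$1" year="$2"
    exiftool \
        -Title="$(issue_title "$number" "$year")" \
        -Producer="Your name" \
        -Keywords="revista internet portugal cyber net" \
        -Subject="Revista portuguesa dedicada à Internet (1995-1998)" \
        "$(issue_file "$number")"
}
```

For example, tag_issue 25 1997 tags cybernet_25.pdf. To read the values back, run exiftool -Title -Subject -Keywords cybernet_25.pdf. By default exiftool keeps the untouched file as cybernet_25.pdf_original; pass -overwrite_original if you don't want that backup.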

Step 4 - Upload to Internet Archive

There are probably good tools for OCRing the PDF locally, but my approach is to let the Internet Archive do it:

1. Upload file to Internet Archive

2. Let Internet Archive generate a copy with OCR

You have to wait a few minutes or hours between steps 1 and 2.

After that long wait, go to the page of your uploaded file in the Internet Archive, and you should find a file named PDF WITH TEXT in the "DOWNLOAD OPTIONS" section.

3. Download that file

The file is usually much smaller than the file you uploaded.
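
The upload and download can probably be done from the terminal too. This is a rough sketch, assuming the ia command-line client (from the internetarchive Python package) is installed and set up with ia configure. The function names are hypothetical, and I'm also assuming the OCR copy is named after the upload with a _text suffix, like the magazine.pdf / magazine_text.pdf pair used in this post. Check the item's file list if the name differs.

```shell
#!/usr/bin/env bash

# Assumption: the OCR derivative is the uploaded name plus a _text suffix.
text_pdf_name() { printf '%s_text.pdf\n' "${1%.pdf}"; }

# Hypothetical wrapper: upload a PDF to an item you name yourself.
archive_upload() {
    local identifier="$1" pdf="$2"
    ia upload "$identifier" "$pdf" --metadata="mediatype:texts"
}

# Hypothetical wrapper: fetch only the OCR copy once the Archive has derived it.
archive_fetch_ocr() {
    local identifier="$1" pdf="$2"
    ia download "$identifier" --glob="$(text_pdf_name "$(basename "$pdf")")"
}
```

Usage would be archive_upload my-item-identifier magazine.pdf, then (after the wait) archive_fetch_ocr my-item-identifier magazine.pdf, where my-item-identifier is a placeholder for whatever identifier you choose.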

Step 5 - Merge the OCR PDF and the hi-res PDF

Now you should have 2 files:

magazine.pdf - the file you uploaded to the Archive
magazine_text.pdf - the file you downloaded from the Archive (with OCR).

In this step we want to merge both files, resulting in a file that has the quality of the first file and the OCR text of the second.

1. Create a bash script named pdf-merge-text.sh with the following content:

#!/usr/bin/env bash

set -eu

pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        trap "rm -f -- '${txtonlypdf//'/'\\''}'" EXIT
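        # Drop the images from the OCR PDF, keeping only its text layer.
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
        # Stamp each page of the image PDF onto the matching text-only page.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"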
    )
}

pdf_merge_text "$@"

2. Make it executable:

chmod u+x pdf-merge-text.sh

3. Finally, run the script to merge both files. Example, assuming the script is saved in your home directory:

~/pdf-merge-text.sh magazine_text.pdf magazine.pdf magazine_final.pdf

4. A file named magazine_final.pdf is created. Publish it, store it, do whatever you want with it.
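
Since pdftk's multistamp overlays each page of one file onto the corresponding page of the other, it can be worth checking that both PDFs have the same number of pages before merging. A sketch, assuming pdfinfo from poppler-utils is installed (it is not otherwise part of this workflow); parse_pages, page_count and check_same_pages are hypothetical names:

```shell
#!/usr/bin/env bash

# Extract the page count from pdfinfo output read on stdin.
parse_pages() { awk '/^Pages:/ { print $2 }'; }

# Hypothetical helper: number of pages in a PDF, via pdfinfo.
page_count() { pdfinfo "$1" | parse_pages; }

# Fail with a message when the two PDFs have different page counts.
check_same_pages() {
    local a b
    a="$(page_count "$1")"
    b="$(page_count "$2")"
    if [ "$a" != "$b" ]; then
        echo "error: page counts differ: $1 has ${a}, $2 has ${b}" 1>&2
        return 1
    fi
}
```

Run check_same_pages magazine_text.pdf magazine.pdf before the merge. Afterwards, pdftotext magazine_final.pdf - | head (pdftotext is also from poppler-utils) should print some recognisable text if the OCR layer made it into the final file.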