My magazine scanning workflow
2025-04-22 by madcap
As you may have noticed, I've been scanning my complete collection of Cyber.net magazines. This post serves as a mental note of the steps I use to scan and publish each issue.
Step 1 - Scan
First I scan the magazine on my flatbed scanner (a CanoScan 9000F) at 300 DPI, using the following software:
Let's say the output is a file named magazine_original.pdf.
Step 2 - Resize file
The generated file is too big, so I usually shrink it with Ghostscript:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -dNOPAUSE -dQUIET -dBATCH -sOutputFile=magazine.pdf magazine_original.pdf
Replacing /prepress with /printer gives a smaller file, but I find the resulting image quality too low.
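To compare presets on a given scan, a small helper can run the same command once per preset so the resulting sizes can be listed side by side. This is only a sketch: the function names is_valid_preset and compress_pdf and the output file names are made up for illustration, while the preset names are the standard ones Ghostscript accepts for -dPDFSETTINGS.

```shell
# Hypothetical helpers, not part of the original workflow.

# is_valid_preset NAME
# Succeeds only for the -dPDFSETTINGS presets Ghostscript defines.
is_valid_preset() {
    case "$1" in
        screen|ebook|printer|prepress|default) return 0 ;;
        *) return 1 ;;
    esac
}

# compress_pdf PRESET INPUT OUTPUT
# Runs the same gs command as in this step, with the chosen preset.
compress_pdf() {
    local preset="$1" input="$2" output="$3"
    if ! is_valid_preset "$preset"; then
        echo "error: unknown preset: $preset" 1>&2
        return 1
    fi
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
        "-dPDFSETTINGS=/$preset" \
        -dNOPAUSE -dQUIET -dBATCH \
        "-sOutputFile=$output" "$input"
}

# Example: produce one file per preset, then compare sizes.
#   for p in printer prepress; do
#       compress_pdf "$p" magazine_original.pdf "magazine_$p.pdf"
#   done
#   ls -lh magazine_*.pdf
```

Opening the two outputs next to each other on a page with fine print or halftone photos is the quickest way to judge whether the smaller file is acceptable.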
Step 3 - Set metadata info
Now set the PDF metadata to whatever you want, using exiftool. For example:
exiftool -Title="Cyber.net #25 (1997)" -Producer="Your name" -Keywords="revista internet portugal cyber net" -Subject="Revista portuguesa dedicada à Internet (1995-1998)" cybernet_25.pdf
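When tagging many issues, only the issue number and year change, so the command can be wrapped in a function. A minimal sketch, assuming the same tag values as the example above; issue_title and tag_issue are hypothetical names, not something exiftool provides.

```shell
# Hypothetical helpers for tagging one issue at a time.

# issue_title NUMBER YEAR
# Prints the title in the same form as the example above.
issue_title() {
    printf 'Cyber.net #%s (%s)' "$1" "$2"
}

# tag_issue NUMBER YEAR FILE
# Writes the four tags to FILE with exiftool.
tag_issue() {
    local number="$1" year="$2" file="$3"
    exiftool "-Title=$(issue_title "$number" "$year")" \
        -Producer="Your name" \
        -Keywords="revista internet portugal cyber net" \
        -Subject="Revista portuguesa dedicada à Internet (1995-1998)" \
        "$file"
}

# Example:
#   tag_issue 25 1997 cybernet_25.pdf
# Read the tags back to confirm they were written:
#   exiftool -Title -Producer -Keywords -Subject cybernet_25.pdf
```

By default exiftool keeps the untouched file next to the edited one with _original appended to its name (here cybernet_25.pdf_original); adding -overwrite_original skips that backup.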
Step 4 - Upload to Internet Archive
There are likely effective tools available for OCRing the PDF, but my approach is:
1. Upload file to Internet Archive
2. Let Internet Archive generate a copy with OCR
You have to wait a few minutes, sometimes hours, between steps 1 and 2.
After that long wait, go to the page of your uploaded file in the Internet Archive, and you should find a file named PDF WITH TEXT in the "DOWNLOAD OPTIONS" section.
3. Download that file
The file is usually much smaller than the file you uploaded.
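The wait between upload and download can also be scripted instead of reloading the page. The sketch below assumes the ia command-line tool from the internetarchive Python package is installed and configured, which is not what the steps above use (they go through the website). The identifier cybernet-25 is a placeholder, and the _text.pdf suffix is assumed from the file name used in Step 5.

```shell
# Hypothetical sketch; assumes the `ia` CLI (internetarchive package).

# has_text_pdf IDENTIFIER
# Succeeds once the item lists a derived file ending in _text.pdf.
has_text_pdf() {
    ia list "$1" | grep -q '_text\.pdf$'
}

# wait_for_text_pdf IDENTIFIER [SECONDS_BETWEEN_CHECKS]
# Polls the item's file list until the OCR copy shows up.
wait_for_text_pdf() {
    local identifier="$1" interval="${2:-300}"
    until has_text_pdf "$identifier"; do
        sleep "$interval"
    done
}

# Example (the identifier is a placeholder):
#   ia upload cybernet-25 magazine.pdf --metadata="mediatype:texts"
#   wait_for_text_pdf cybernet-25
#   ia download cybernet-25 --glob="*_text.pdf"
```

Polling every five minutes is an arbitrary default; since the OCR can take hours, a longer interval is just as reasonable.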
Step 5 - Merge the OCR PDF and the hi-res PDF
Now you should have 2 files:
- magazine.pdf - the file you uploaded to the Archive
- magazine_text.pdf - the file you downloaded from the Archive (with OCR).
In this step we want to merge both files, resulting in a file that has the quality of the first file and the OCR text of the second.
1. Create a bash script named pdf-merge-text.sh with the following content:
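#!/usr/bin/env bash
set -eu

# Usage: pdf-merge-text.sh TEXT_PDF IMAGE_PDF [OUTPUT_PDF]
# "-" means stdin for an input and stdout for the output (the default).
pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"

    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then
        echo "error: text PDF does not exist: ${txtpdf}" 1>&2
        return 1
    fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then
        echo "error: image PDF does not exist: ${imgpdf}" 1>&2
        return 1
    fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then
        echo "error: not overwriting existing output file: ${outpdf}" 1>&2
        return 1
    fi

    (
        # Temporary file in the current directory, removed when the subshell exits.
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        # Single quotes defer expansion until the trap runs, so the path needs no escaping.
        trap 'rm -f -- "${txtonlypdf}"' EXIT
        # Drop the images from the OCR PDF, keeping only its text layer.
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
        # Stamp each hi-res page onto the matching text-only page.
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"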
2. Make it executable:
chmod u+x pdf-merge-text.sh
3. Finally, execute the script to merge both files. Example:
~/pdf-merge-text.sh magazine_text.pdf magazine.pdf magazine_final.pdf
4. A file named magazine_final.pdf is created. Publish it, store it, do whatever you want with it.
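Before publishing, it can be worth checking that the merged file kept every page and that the text layer is really there. A sketch using pdfinfo and pdftotext from poppler-utils, which the steps above do not otherwise require; page_count and check_merge are hypothetical names.

```shell
# Hypothetical checks, not part of the original workflow.

# page_count FILE
# Prints the number of pages reported by pdfinfo.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# check_merge IMAGE_PDF FINAL_PDF
# Fails if the two files do not have the same number of pages.
check_merge() {
    local imgpdf="$1" finalpdf="$2" a b
    a="$(page_count "$imgpdf")"
    b="$(page_count "$finalpdf")"
    if [ -z "$a" ] || [ "$a" != "$b" ]; then
        echo "error: page counts differ: ${a:-?} vs ${b:-?}" 1>&2
        return 1
    fi
    echo "ok: $a pages"
}

# Example:
#   check_merge magazine.pdf magazine_final.pdf
# Print the first lines of the text layer to confirm the OCR text survived:
#   pdftotext magazine_final.pdf - | head
```

A page-count mismatch usually means the two input PDFs do not line up page for page, in which case the stamped images would land on the wrong text.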