OCR pipeline in Linux (jpg to pdf/txt)
A shell script pipeline to convert a directory of images into a searchable pdf (with a text layer) or plain text.
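I did this on Ubuntu. Install the necessary software first; ocrmypdf acts as a wrapper around Tesseract OCR. To find the packages for languages other than English, run apt-cache search tesseract-ocr. The pipeline also uses convert from ImageMagick and pdftotext from poppler-utils, so install the imagemagick and poppler-utils packages if they are missing. I needed Russian, so I ran: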
sudo apt install tesseract-ocr ghostscript tesseract-ocr-script-cyrl tesseract-ocr-rus ocrmypdf
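Before running anything, it can help to confirm that every tool the pipeline calls is actually on the PATH. The following is a minimal sketch; the helper name missing_commands is hypothetical and not part of any of the packages above.

```shell
# Print every command from the argument list that is not found on PATH.
# (missing_commands is a hypothetical helper name used for illustration.)
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage (prints nothing when everything is installed):
#   missing_commands tesseract ocrmypdf gs convert pdftotext
# To see which OCR languages Tesseract can use (should include "rus" here):
#   tesseract --list-langs
```

An empty result means all of the listed commands were found; any name that is printed still needs its package installed.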
The pipeline uses ImageMagick to enhance the jpg files before OCR and to convert them to pdf. Its default memory limit is too low for large images and must be raised:
sudo sed -i '/<policy domain="resource" name="memory"/ s/value="[^"]*"/value="30GiB"/' /etc/ImageMagick-6/policy.xml
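It may be worth confirming that the edit took effect, since the stock memory value differs between ImageMagick packages. A small sketch, assuming the policy file path used above; the helper name show_memory_limit is hypothetical.

```shell
# Print the memory resource line(s) found in an ImageMagick policy file.
# (show_memory_limit is a hypothetical helper name used for illustration.)
show_memory_limit() {
    grep 'domain="resource" name="memory"' "$1"
}

# Usage:
#   show_memory_limit /etc/ImageMagick-6/policy.xml
# The limits as ImageMagick itself reports them:
#   identify -list resource
```

If the printed line still shows the old value, the substitution did not match and the file needs to be edited by hand.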
Run the pipeline on the jpg images in the current directory:
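mkdir -p ./bw
# normalize upper-case extensions so one glob covers every image (skipped if there are none)
for f in *.JPG; do [ -e "$f" ] && mv -- "$f" "${f%.JPG}.jpg"; done
# convert to grayscale, upscale 2x and sharpen each image to help OCR
for f in *.jpg; do convert "$f" -colorspace gray -fill white -resize 200% -sharpen 0x1 "bw/$f"; echo -n "."; done
# combine the pages into one pdf in natural sort order (assumes file names without spaces)
convert $(ls -v ./bw/*.jpg) in.pdf
# add a Russian text layer, then extract the text
ocrmypdf -l rus --output-type pdfa in.pdf out.pdf
pdftotext out.pdf out.txt
# remove the intermediate images
find ./bw/ -type f -name "*.jpg" -delete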
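The step that combines the pages relies on word splitting of the ls output, so it breaks if a file name contains spaces. One possible alternative is sketched below, assuming GNU sort and xargs (for the -z, -V and -0 options); the helper name list_pages is hypothetical.

```shell
# Print the jpg files of a directory, NUL-separated, in natural (version) order.
# (list_pages is a hypothetical helper name used for illustration.)
list_pages() {
    printf '%s\0' "$1"/*.jpg | sort -z -V
}

# Usage (needs ImageMagick's convert); each file name is passed as one argument:
#   list_pages ./bw | xargs -0 sh -c 'convert "$@" in.pdf' sh
```

With a very large number of pages xargs may split the list into several convert calls, each overwriting in.pdf, so this variant is best suited to directories of moderate size.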
The results are in out.pdf and out.txt.
References:
- https://github.com/ocrmypdf/OCRmyPDF
- https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
- https://unix.stackexchange.com/questions/377359/how-to-use-ocr-from-the-command-line-in-linux
This post is licensed under CC BY 4.0 by the author.