OCR pipeline in Linux (jpg to pdf/txt)

Posted Feb 15, 2026 Updated Feb 15, 2026

By Aleksandr Mitkov

A shell script pipeline to convert a directory of images into a searchable PDF or plain text.

I did this on Ubuntu. Install the necessary software first. To find the additional packages for languages other than English, run apt-cache search tesseract-ocr. ocrmypdf is a wrapper around Tesseract OCR. I needed Russian, so I ran:

sudo apt install tesseract-ocr ghostscript tesseract-ocr-script-cyrl tesseract-ocr-rus ocrmypdf imagemagick poppler-utils
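
Before going further, it can help to confirm that every tool the pipeline calls is on the PATH. The sketch below is one way to do that. missing_tools is a helper name made up for this example, and the tool list is taken from the commands used in this post (gs is the Ghostscript binary).

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools called by the pipeline in this post.
missing=$(missing_tools convert ocrmypdf tesseract pdftotext gs)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "missing tools:" $missing
else
    echo "all tools found"
    # List the language packs Tesseract can see; rus should be among them.
    tesseract --list-langs
fi
```

If anything is reported missing, install the matching package before running the pipeline.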

The pipeline uses ImageMagick to enhance the jpg files before OCR and to convert them to PDF. The default memory limit is too low for large images and must be raised:

  
sudo sed -i '/<policy domain="resource" name="memory"/ s/value="1024MiB"/value="30GiB"/' /etc/ImageMagick-6/policy.xml
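
The sed command above only matches when the current limit is exactly 1024MiB. If your policy.xml ships with a different value, a pattern that accepts any current value may work better. This is a minimal sketch, tried on a scratch file rather than the real policy. set_memory_limit and the sample file contents are made up for illustration.

```shell
#!/bin/sh
# Set the ImageMagick memory limit in a policy file, whatever its current value.
# $1 = policy file, $2 = new limit (for example 30GiB). Uses GNU sed -i.
set_memory_limit() {
    sed -i '/<policy domain="resource" name="memory"/ s/value="[^"]*"/value="'"$2"'"/' "$1"
}

# Build a small sample policy file in a scratch directory (hypothetical contents).
tmp=$(mktemp -d)
printf '%s\n' \
    '<policymap>' \
    '  <policy domain="resource" name="memory" value="256MiB"/>' \
    '  <policy domain="resource" name="map" value="512MiB"/>' \
    '</policymap>' > "$tmp/sample.xml"

# Only the memory line should change; the map line stays as it was.
set_memory_limit "$tmp/sample.xml" 30GiB
grep 'domain="resource"' "$tmp/sample.xml"
```

Running the same sed expression with sudo against /etc/ImageMagick-6/policy.xml applies it for real. Afterwards, identify -list resource should report the new memory limit.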

Run the pipeline on the jpg images in the current directory:

  
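# work directory for the preprocessed images
mkdir -p ./bw
# normalize upper-case extensions; the -e test skips the pattern when nothing matches
for f in *.JPG; do [ -e "$f" ] && mv "$f" "${f%.JPG}.jpg"; done
# grayscale, upscale 2x and sharpen each image before OCR
for f in *.jpg; do convert "$f" -colorspace gray -fill white -resize 200% -sharpen 0x1 "bw/$f"; echo -n "."; done
# combine the pages in natural sort order; the array keeps file names with spaces intact
mapfile -t pages < <(ls -v ./bw/*.jpg)
convert "${pages[@]}" in.pdf
# add a Russian text layer, then extract it as plain text
ocrmypdf -l rus --output-type pdfa in.pdf out.pdf
pdftotext out.pdf out.txt
# remove the intermediate images
find ./bw/ -type f -name "*.jpg" -delete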
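
The ls -v in the pipeline matters once there are ten or more pages: a plain glob sorts names as text, so page10 comes before page2, while ls -v sorts embedded numbers numerically. Here is a small demonstration with empty placeholder files (the file names are invented for this example):

```shell
#!/bin/sh
# Scratch directory with empty placeholder files named like scanned pages.
demo=$(mktemp -d)
touch "$demo/page1.jpg" "$demo/page2.jpg" "$demo/page10.jpg"

# Plain glob expansion sorts as text, so page10 lands before page2.
glob_order=$(cd "$demo" && echo *.jpg)

# ls -v (GNU coreutils) sorts embedded numbers numerically.
natural_order=$(cd "$demo" && ls -v *.jpg | paste -sd ' ' -)

echo "glob:  $glob_order"
echo "ls -v: $natural_order"
rm -r "$demo"
```

With the glob order, the PDF pages would come out shuffled for any scan longer than nine pages.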

The results are in out.pdf and out.txt.
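
A quick sanity check on the text output can catch a run where OCR recognized nothing, for example because the wrong language pack was used. This is a rough sketch. text_stats is a helper name made up here, and line and word counts are only a crude signal, not a measure of OCR quality.

```shell
#!/bin/sh
# Print "<lines> <words>" for a text file; "0 0" suggests OCR found nothing.
text_stats() {
    printf '%s %s\n' "$(wc -l < "$1" | tr -d ' ')" "$(wc -w < "$1" | tr -d ' ')"
}

if [ -f out.txt ]; then
    text_stats out.txt
else
    echo "out.txt not found; run the pipeline first"
fi
```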

References:

https://github.com/ocrmypdf/OCRmyPDF
https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
https://unix.stackexchange.com/questions/377359/how-to-use-ocr-from-the-command-line-in-linux

This post is licensed under CC BY 4.0 by the author.
