How to OCR PDFs for Free
How to OCR PDFs on Windows (for free)
I recently needed to run OCR on a 538-page scanned Cyrillic PDF. No budget, no subscriptions, just a Windows machine and some patience. Here's what worked.
The tool stack
The winning combination is Tesseract (an open-source OCR engine) wrapped by OCRmyPDF, a Python tool that handles the full pipeline: it takes a scanned PDF in, runs Tesseract page by page, and spits out a searchable PDF. Together they handle deskewing, preprocessing, and text layer embedding without you needing to split pages into images manually.
There are fancier options. ABBYY FineReader is the commercial gold standard for Cyrillic and will usually beat Tesseract on accuracy, especially at low DPI. Google Cloud Vision has a free tier of 1,000 pages/month and, in my experience, copes with degraded scans better than the alternatives. But if you want something fully local and free with no page limits, Tesseract + OCRmyPDF is the move.
Setup
1. Install Python
Download from python.org/downloads. During installation, check "Add python.exe to PATH" at the bottom. This is the step everyone skips and then spends an hour debugging. Don't skip it.
Open a fresh Command Prompt and verify:
python --version
pip --version
2. Install Tesseract
Download the Windows installer from github.com/UB-Mannheim/tesseract/wiki.
During installation, expand the language data options and check Serbian (Cyrillic) (srp). If you also need French, Latin, Albanian, or other languages, check those too. You can always add languages later by downloading .traineddata files from github.com/tesseract-ocr/tessdata and dropping them into C:\Program Files\Tesseract-OCR\tessdata\.
Optional but recommended: add C:\Program Files\Tesseract-OCR to your system PATH. Search "Environment Variables" in Windows, find Path under System variables, click Edit, click New, paste the path.
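If you want to double-check the setup so far, here is a minimal Python sketch. It assumes the default install folder mentioned above; the names missing_tools, installed_languages, and TESSDATA_DIR are made up for this example and are not part of Tesseract or OCRmyPDF. It only looks at PATH and at the files in the tessdata folder.

```python
import shutil
from pathlib import Path

# Assumption: the default install location mentioned in this post.
# Change it if you installed Tesseract somewhere else.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def missing_tools(tools=("python", "tesseract"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def installed_languages(tessdata_dir=TESSDATA_DIR):
    """Return sorted codes, one per *.traineddata file in the folder.

    Returns an empty list if the folder does not exist.
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    print("Missing from PATH:", missing_tools() or "nothing")
    print("Language data found:", installed_languages() or "none")
```

If tesseract shows up as missing, the PATH step above did not take effect (remember to open a fresh Command Prompt). If srp is not in the language list, the Serbian (Cyrillic) data was not installed.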
3. Install OCRmyPDF
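pip install ocrmypdf
OCRmyPDF's documentation also lists Ghostscript as a requirement on Windows. If the OCR command fails complaining that it cannot find Ghostscript, install it and open a fresh Command Prompt.
Run the OCR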
Navigate to the folder containing your PDF:
cd %USERPROFILE%\Downloads
Then run:
ocrmypdf --language srp --deskew input.pdf output.pdf
That's it. The --deskew flag straightens crooked scans, and --language srp selects the Serbian (Cyrillic) data installed earlier. To see which language codes are available on your machine, run tesseract --list-langs.
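If you have more than one PDF to process, the same command can be generated per file. Below is a rough sketch, not part of OCRmyPDF itself: build_command and ocr_folder are hypothetical helper names, the _ocr.pdf output suffix is my own convention, and it assumes ocrmypdf is on PATH. Multiple languages are joined with +, which is the form OCRmyPDF accepts (for example srp+fra).

```python
import subprocess
from pathlib import Path


def build_command(input_pdf, output_pdf, languages=("srp",), deskew=True):
    """Return the ocrmypdf argument list for one file."""
    cmd = ["ocrmypdf", "--language", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def ocr_folder(folder, languages=("srp",), dry_run=True):
    """Build one command per PDF in `folder`; run them unless dry_run is True.

    Output files get an "_ocr" suffix (an assumption of this sketch), and
    files that already carry that suffix are skipped.
    """
    commands = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.stem.endswith("_ocr"):
            continue  # result of an earlier run
        out = pdf.with_name(pdf.stem + "_ocr.pdf")
        cmd = build_command(pdf, out, languages)
        commands.append(cmd)
        if not dry_run:
            # check=True raises if ocrmypdf exits with an error
            subprocess.run(cmd, check=True)
    return commands
```

Calling ocr_folder on your Downloads folder with the default dry_run=True only returns the commands, so you can print and inspect them first. Pass dry_run=False to actually run them, which for a several-hundred-page scan can take a while.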
Hopefully this helps someone.