Extracting text from PDF

balanga · Dec 13, 2025

What do people recommend for extracting text from a PDF?

I've tried numerous converters including Google Docs without success.

Maybe what I have is not an actual PDF but an image. How do I tell?

bakul · Dec 13, 2025

textproc/py-ocrmypdf -- adds an OCR text layer to scanned PDF files
textproc/py-pdftotext -- Simple PDF text extraction
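
For the OCR route, here is a minimal sketch of how ocrmypdf might be run so that a plain-text file comes out alongside the searchable PDF. The wrapper name ocr_to_text and the output file names are made up for illustration. --skip-text, -l and --sidecar are ocrmypdf options, but it is worth checking ocrmypdf --help on the installed version.

```shell
#!/bin/sh
# Hypothetical wrapper: the function name and output naming are assumptions,
# not something from this thread.
# Usage: ocr_to_text scan.pdf [tesseract-language-code]
ocr_to_text() {
    in=$1
    lang=${2:-eng}        # Tesseract language code, e.g. eng, deu, guj
    base=${in%.pdf}       # strip a trailing .pdf, if present
    # --skip-text: leave pages that already carry text alone
    # --sidecar:   also write the recognized text to a plain-text file
    ocrmypdf --skip-text -l "$lang" --sidecar "${base}.txt" "$in" "${base}-ocr.pdf"
}
```

Calling ocr_to_text my.pdf would then be expected to leave my-ocr.pdf and my.txt next to the original, provided the matching Tesseract language data is installed.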

Kai Burghardt · Dec 14, 2025

balanga said:
What do people recommend for extracting text from a PDF? […]

I don’t know what “people” recommend, but I for one use pdftotext(1) from graphics/poppler-utils.

Bash:

pdftotext my.pdf # after invocation there is a my.txt

balanga said:
[…] Maybe what I have is not an actual PDF but an image. How do I tell?

Well, it probably is a PDF file; compare the output of

Bash:

file my.pdf

but, as bakul already suggested, you need to perform OCR (optical character recognition) because the PDF file contains images (you can get them with pdfimages(1), also from graphics/poppler-utils).
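
One rough way to check whether there is any text to extract at all, sketched under the assumption that pdftotext from graphics/poppler-utils is installed: send the extracted text to stdout and count what is left after stripping whitespace. The function names here are made up for illustration.

```shell
#!/bin/sh
# Count non-whitespace bytes on stdin. pdftotext separates pages with
# form feeds, which [:space:] also covers.
nonblank_bytes() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Hypothetical helper: report whether pdftotext finds any text in the file.
has_text() {
    n=$(pdftotext "$1" - | nonblank_bytes)   # "-" sends the text to stdout
    if [ "$n" -gt 0 ]; then
        echo "text found ($n bytes)"
    else
        echo "no text found, probably scanned images"
    fi
}
```

Running has_text my.pdf and getting "no text found" would suggest the file is a scan and needs OCR. A document that mixes text pages and scanned pages would still report text, so this is only a first check.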

rootbert · Dec 14, 2025

Apache Tika

Criosphinx · Dec 14, 2025

It's not easy because a PDF file can include both: one page can be text and the next a full image.

graphics/poppler-utils also includes pdffonts, which lists the fonts a document uses, and pdfimages, which does the same for the images. Combine both and you have a good idea of the contents.
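
Combining the two could look roughly like the sketch below. It assumes that both pdffonts and pdfimages -list print two header lines (column names and a row of dashes) before the actual entries. The function names are made up for illustration.

```shell
#!/bin/sh
# Count non-empty rows after the two header lines of poppler's table output.
count_rows() {
    awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Turn the two counts into a rough verdict.
classify() {
    if [ "$1" -gt 0 ] && [ "$2" -gt 0 ]; then
        echo "fonts and images, text may be overlaid on scans"
    elif [ "$1" -gt 0 ]; then
        echo "fonts only, plain text extraction should work"
    elif [ "$2" -gt 0 ]; then
        echo "images only, OCR needed"
    else
        echo "neither fonts nor images found"
    fi
}

# Hypothetical helper tying it together for one file.
inspect_pdf() {
    fonts=$(pdffonts "$1" | count_rows)
    images=$(pdfimages -list "$1" | count_rows)
    classify "$fonts" "$images"
}
```

inspect_pdf my.pdf then prints one of the four verdicts. The counts are for the whole document, so for a per-page view the -f and -l (first/last page) options of both tools would have to be used in a loop.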

If the document is not too big and not encrypted you can open it with LibreOffice Draw.

Beastie · Dec 14, 2025

^ ... or Inkscape.

ralphbsz · Dec 14, 2025

Criosphinx said:
It's not easy because a PDF file can include both: one page can be text and the next a full image.

Or both. It's not uncommon for PDF files to contain an image (typically a scanned or photographed document) with text, often invisible, overlaid on it. That way you can cut and paste from the PDF viewer and get the text version. The various PDF content viewer utilities mentioned above are capable of drilling down and showing you whether you have only an image, or an image with text.

For OCR, my favorite has always been Tesseract. Not because I have done a scientific performance and accuracy analysis, but because some friends and colleagues worked on it, and they are nice people (and in some cases fine amateur musicians). I don't actually know which FreeBSD packages wrap it in a usable fashion, since for the last 8 or 10 years I've just been using cloud-based services. They typically have a "free tier" (something like up to 1000 documents or pages per month or per day), so it's easy to do some experimentation. I've been using Google's (Cloud) Vision API, but there are many competitors, which are probably just as good.

bakul · Dec 14, 2025

ocrmypdf uses Tesseract under the hood.

[Edit:] About 10-12 years ago I was helping a friend with OCRing old Gujarati texts. Tesseract was not very good at it. We did use Google tools for a while, but back then they had their own (non-technical) issues. I just asked him and he says Tesseract is ancient! Gemini 3 is good. Google Docs is easier for larger PDF documents. Google Vision is better for complex layouts. This probably holds true for at least Indic languages; the situation with Roman, Cyrillic, etc. alphabets is likely different. Of course, you still need to proofread, since old scans are often not very good. Sanskrit is likely to be more challenging due to "sandhi rules" and compound characters (not to mention really ancient texts!).