What do people recommend for extracting text from a PDF? […]
pdftotext my.pdf # after invocation there is a my.txt
[…] Maybe what I have is not an actual PDF but an image. How do I tell?
file my.pdf
Or both. It's not uncommon for a PDF file to contain an image (typically a scanned or photographed document) with text overlaid on it, often invisibly. That way you can cut and paste from the PDF viewer and get the text version. The various PDF content viewer utilities mentioned above can drill down and show you whether you have only an image, or an image with text. It's not easy, because a PDF file can include both: one page can be text and the next a full image.
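Since the answer can differ from page to page, one way to check is to ask page by page. Here is a rough sketch, assuming the poppler tools (pdfinfo, pdftotext, pdfimages) are installed; the function names classify_page and scan_pdf are made up for this example. It only counts extractable characters and embedded images, so vector drawings are not detected, and a stray page number is enough to make a scanned page look like it has text.

```shell
#!/bin/sh
# classify_page CHARS IMAGES: label a page from two counts (hypothetical helper).
classify_page() {
    chars=$1
    images=$2
    if [ "$chars" -gt 0 ] && [ "$images" -gt 0 ]; then
        echo "image+text"      # e.g. a scan with an (often invisible) text layer
    elif [ "$chars" -gt 0 ]; then
        echo "text"
    elif [ "$images" -gt 0 ]; then
        echo "image"           # scan without a text layer; pdftotext gives nothing
    else
        echo "empty"
    fi
}

# scan_pdf FILE: print one label per page (hypothetical helper).
scan_pdf() {
    pdf=$1
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ {print $2}')
    p=1
    while [ "$p" -le "$pages" ]; do
        # Non-whitespace characters extracted from this page only.
        chars=$(pdftotext -f "$p" -l "$p" "$pdf" - | tr -d '[:space:]' | wc -c | tr -d ' ')
        # pdfimages -list prints two header lines, then one line per image.
        images=$(pdfimages -list -f "$p" -l "$p" "$pdf" | tail -n +3 | wc -l | tr -d ' ')
        echo "page $p: $(classify_page "$chars" "$images")"
        p=$((p + 1))
    done
}
```

After sourcing this, `scan_pdf my.pdf` prints one line per page. Pages that come back as "image" are the ones where pdftotext will give you nothing, and you would need OCR to get text out of them.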