I am trying to read a PDF from a URL. I followed many Stack Overflow suggestions and used PyPDF2's PdfFileReader to extract text from the PDF.
My code looks like this:
url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()
fileInput = StringIO(f)
pdf = PyPDF2.PdfFileReader(fileInput)
print pdf.getNumPages()
print pdf.getDocumentInfo()
print pdf.getPage(1).extractText()
I am able to successfully extract text from the first link, but if I use the same program on the second PDF I do not get any text. The page count and document info still show up.
I tried extracting text with PDFMiner from the terminal and was able to extract text from the second PDF.
Any idea what is wrong with the PDF, or is there a limitation in the library I am using?
If you read the comments in the PyPDF2 documentation, you'll see that it's written right there that this functionality will not work well for some PDF files; in other words, you're looking at a limitation of the library.
Looking at the two PDF files, I can't see anything wrong with the files themselves. But...
The first file contains fully embedded fonts
The second file contains subsetted fonts
This means the second file is harder to extract text from, and the library probably doesn't handle it properly. Just for reference, I did a text extraction with callas pdfToolbox (caution: I'm affiliated with this tool), which uses the Acrobat text extraction engine, and the text is extracted properly from both files, confirming that the PDF files themselves are not the problem.
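If you want to verify the font situation yourself, here is a rough sketch (not part of the original answer) that walks a page's font resources with PyPDF2 1.x and checks each font descriptor for a FontFile entry. It assumes the page carries its own /Resources dictionary and won't handle Type0/CID fonts, whose descriptors sit under /DescendantFonts; subset fonts usually also show a BaseFont name with a six-letter prefix such as ABCDEF+SomeFont.
import PyPDF2

reader = PyPDF2.PdfFileReader('C_06052015_ch1_l1.pdf')  # second file from the question
page = reader.getPage(0)
fonts = page['/Resources']['/Font']
for name in fonts:
    font = fonts[name].getObject()
    descriptor = font.get('/FontDescriptor')
    if descriptor is not None:
        descriptor = descriptor.getObject()
    # An embedded font carries a FontFile/FontFile2/FontFile3 stream in its descriptor
    embedded = descriptor is not None and any(
        key in descriptor for key in ('/FontFile', '/FontFile2', '/FontFile3'))
    print(name, font.get('/BaseFont'), 'embedded' if embedded else 'not embedded')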
Related
I found the following code that allows one to extract text from a PDF file. However, this only works for PDFs where you can highlight and copy the text directly. Is there some way, in Python, to extract text from a document where you can't select the text, like a photocopy or scanned document saved as a PDF?
Here is the code I use to pull text from a non-photocopied PDF file:
import PyPDF2

pdfFileObject = open('C:\\filepath\\file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages

# Concatenate the extracted text of every page
output = ''
for i in range(count):
    page = pdfReader.getPage(i)
    output += page.extractText()

print(output)
It works like a charm. However, I am curious about a way to extract text from a photocopied document saved as a PDF.
The code above doesn't extract any text from a photocopied document saved as a PDF.
PyPDF2 is a PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well.
To retrieve text from a scanned document, you need to run OCR. Optical Character Recognition (OCR) is the process that converts an image of text into a machine-readable text format; for example, if you scan a form or a receipt, your computer saves the scan as an image file. There are various OCR tools, and pytesseract is one of the most widely used. Read more about those OCR tools here.
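A minimal sketch of that OCR route, assuming the pdf2image and pytesseract packages are installed (along with the poppler and Tesseract binaries they wrap); 'scanned.pdf' is just a placeholder file name:
import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image, then OCR every image
pages = convert_from_path('scanned.pdf', dpi=300)
text = ''
for page_image in pages:
    text += pytesseract.image_to_string(page_image)

print(text)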
I'm quite new to Python and I'm doing an ML project to extract disclosures from PDFs (published annual reports). PyPDF2 extracts the disclosures I need for my project, but the output also includes the footer text, which I want to remove.
I browsed through Stack Overflow and found a solution to crop out the footer with PyPDF2 and save the file as a new PDF. But when I run the cropped PDF through my original code, the footer text is still included in the extracted text. Is there any way I can overcome this?
I'm not sure why, after extracting the desired text, you want to save it as a new PDF and then load it again... anyway, try the suggestion below.
After cropping the footer from the original PDF, instead of saving the result as a new PDF, save it as a Word document. The idea is that when we load a Word document in Python using the "docx2python" library, it separates the header, footer, and body into its properties (see the sketch after this list).
My guess is:
1.) The newly saved Word document shouldn't have any header/footer, just the body text.
2.) In case the loaded Word document still has a footer, you can get rid of it using the same library.
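A rough sketch of that idea, assuming the cropped PDF has already been converted to a .docx by some other tool ('report.docx' is a placeholder name) and relying on the separate header/footer/body attributes that docx2python exposes:
from docx2python import docx2python

doc = docx2python('report.docx')

# docx2python keeps the header, body and footer apart; each is a nested
# list of paragraphs shaped [table][row][cell][paragraph]
print(doc.header)
print(doc.footer)

# Flatten the body into plain text, ignoring the header/footer entirely
body_text = '\n'.join(
    paragraph
    for table in doc.body
    for row in table
    for cell in row
    for paragraph in cell
)
print(body_text)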
I'm trying to extract text from PDF files using Python (v 3.8.2) and the PyPDF2 module (v 1.26.0). Everything works fine except for particular PDF files generated with Chrome's print option.
Over time I have generated/downloaded these files using Chrome's print option, which lets you save a page/document as a PDF. I am not able to extract text from these PDF files, as the code only returns '' (empty); there is no problem with other PDF files. If you would like to test this yourself, you can save any web page as a PDF using Chrome's print option and use that PDF. Chrome version: 81.0.4044.138.
I found that Chrome uses Skia to save pages as PDF, but that didn't help me solve the problem. (PDF Producer: Skia/PDF m80)
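A quick sketch of how that Producer field can be read with PyPDF2 (the file name here is just a placeholder):
import PyPDF2

reader = PyPDF2.PdfFileReader(open('chrome_saved_page.pdf', 'rb'))
info = reader.getDocumentInfo()
print(info.producer)  # prints something like "Skia/PDF m80" for Chrome-generated PDFs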
I found the following similar question on Stack Overflow, but nobody has answered it yet, and as a new user I can't comment or add anything, hence this new question:
Extract text from pdf converted from webpage using Pypdf2
Following is the code:
import PyPDF2

# Open the Chrome-generated PDF and try to extract text from its first page
pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
pdfFileObj.close()
I am a new user and this is my first time posting a question, so please correct me if I have done anything incorrectly (not sure if I have). I assure you I have searched on Google and either found no solution or lacked the knowledge to understand the problem/solution. Thank you.
PyPDF2 is highly unreliable for extracting text from PDFs, as pointed out here too, which says:
While PyPDF2 has .extractText(), which can be used on its page objects
(not shown in this example), it does not work very well. Some PDFs
will return text and some will return an empty string. When you want
to extract text from a PDF, you should check out the PDFMiner project
instead. PDFMiner is much more robust and was specifically designed
for extracting text from PDFs.
Look at my answer to a similar question here.
I found there are some libraries for extracting images from PDF or Word files, like docx2txt and pdfimages. But how can I get the content around the images (e.g. there may be a title below the image)? Or get the page number of each image?
Some other tools like PyPDF2 and minecart can extract images page by page. However, I cannot run that code successfully.
Is there a good way to get some information about the images (from the images obtained via docx2txt or pdfimages, or another way to extract images along with their context)?
I looked at the code of docx2txt and it simply parses the XML of the docx file, so it's actually a very easy task.
Ref: docx2txt
docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.
A few months ago, I modified docx2python to produce a structured (leveled) XML-format file from a docx file, which works out pretty well on many files.
As far as I know, a paragraph contains several runs, and each run contains only one piece of text and sometimes images. You can read this document for details: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 .
docx2python supports extracting images together with the text around them. You use docx2python to read the paragraphs, and ----media/imagen---- placeholders appear in the text wherever an image was. You can then reach the image itself if you set extract_image=True. That way you get the image's name in the paragraph text plus a list of image files, and can match them as you like.
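A small sketch of that matching idea, assuming docx2python 1.x (where the extract_image flag and the image_folder argument mentioned above exist) and placeholder file/folder names:
from docx2python import docx2python

# Extract the text and write the embedded images into a folder; the text keeps
# ----media/image1.png---- style placeholders where each image appeared
doc = docx2python('report.docx', 'extracted_images', extract_image=True)

for line in doc.text.splitlines():
    if '----media/' in line:
        # The placeholder tells you which image file sits at this spot, and the
        # surrounding lines give you the nearby text (for example a title)
        print(line)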
I am trying to get the text out of a downloaded PDF file with PyPDF2.
Here is my code:
import PyPDF2
reader = PyPDF2.PdfFileReader('download.pdf')
if not reader.isEncrypted:
    print(reader.getPage(0).extractText())
This is the output:
'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$#<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'
When I open the file, its content is fine. Also, when I use another program to convert the PDF to txt, it works fine. It is a JavaScript-rendered PDF on a web page; I don't know if that makes any difference.
Under Windows 7 and Python 3.6, I had the problem that PyPDF2 did not properly decode some PDF files. My solution was to use pdfminer.six.
pip install pdfminer.six
To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424
It worked perfectly for me.
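If you just need the text, here is a minimal sketch using pdfminer.six's high-level API (the linked answer wires up the lower-level classes by hand); the file name is the one from the question above:
from pdfminer.high_level import extract_text

# Pull the text of the whole document in one call
text = extract_text('download.pdf')
print(text)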
The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)
extractText() Locate all text drawing commands, in the order they are
provided in the content stream, and extract the text. This works well
for some PDF files, but poorly for others, depending on the generator
used. This will be refined in the future. Do not rely on the order of
text coming out of this function, as it will change if this function
is made more sophisticated. Returns: a unicode string object.
So, it seems that the performance of this function depends on the PDF itself.