Here is a simple very easily reproducible example of the bug I am having in PyPDF2. Please help me figure out what is going on.
I want to generate this pdf:
As you can see, I have generated it so the code is working correctly. However, it looks like this only when viewed in mac viewer or safari. When I view THE SAME FILE in google chrome or brave or adobe I see this:
So page 2 just shows page 1 again. If I do n pages then all n pages are identical and they all say "page 1".
I put all the versions of the software I am using in the comments below. I want this to be viewable correctly in chrome, brave, adobe, and it's probably wrong in many others as well.
#find original pdf page here and save it in "path": https://www.irs.gov/pub/irs-pdf/f8949.pdf
from PyPDF2 import PdfReader, PdfWriter #, PdfFileMerger, PdfFileWriter, PdfFileReader
writer = PdfWriter()
reader = PdfReader(path)
page = reader.pages[0]
writer.add_page(page)
writer.update_page_form_field_values(writer.pages[0], {"f1_3[0]":"page1"})
reader = PdfReader(path)
page = reader.pages[0]
writer.add_page(page)
writer.update_page_form_field_values(writer.pages[1], {"f1_3[0]":"page2"})
with open("output.pdf", "wb") as output_stream:
writer.write(output_stream)
output_stream.close()
#PyPDF2==2.11.2
#PyPDF4==1.27.0
#Python 3.9.7
#Mac 12.1
#Brave 1.45.127
#Adobe 2022.003.20258
#safari good
#preview good
#google bad
#adobe bad
#brave bad
Yes, this is a bug in PyPDF2. We have issues with filling out forms for a long time, see https://github.com/py-pdf/PyPDF2/issues/355
I want to extract the text of a PDF and use some regular expressions to filter for information.
I am coding in Python 3.7.4 using fitz for parsing the pdf. The PDF is written in German. My code looks as follows:
doc = fitz.open(pdfpath)
pagecount = doc.pageCount
page = 0
content = ""
while (page < pagecount):
p = doc.loadPage(page)
page += 1
content = content + p.getText()
Printing the content, I realized that the first (and important) half of the document is decoded as a strange mix of Japanese (?) signs and others, like this: ョ。オウキ・ゥエオョァ@ュ.
I tried to solve it with different decodings (latin-1, iso-8859-1), encoding is definitely in utf-8.
content= content+p.getText().encode("utf-8").decode("utf-8")
I also have tried to get the text using minecart:
import minecart
file = open(pdfpath, 'rb')
document = minecart.Document(file)
for page in document.iter_pages():
for lettering in page.letterings :
print(lettering)
which results in the same problem.
Using textract, the first half is an empty string:
import textract
text = textract.process(pdfpath)
print(text.decode('utf-8'))
Same thing with PyPDF2:
import PyPDF2
pdfFileObj = open(pdfpath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
for index in range(0, pdfReader.numPages) :
pageObj = pdfReader.getPage(index)
print(pageObj.extractText())
I don't understand the problem as it's looking like a normal PDF with normal text. Also some of the PDFs don't have this problem.
I am trying to read some text from a pdf file. I am using the code below however when I try to get the text (ptext) all that is return is a string variable of size 1 & its empty.
Why is no text being returned? I have tried other pages and another pdf book but the same thing, I can't seem to read any text.
import PyPDF2
file = open(r'C:/Users/pdfs/test_file.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)
pageObj = fileReader.getPage(445)
ptext = pageObj.extractText()
I also had the same issue, I thought something was wrong with my code or whatnot. After some intense researching, debugging and investigation, it seems that PyPDF2, PyPDF3, PyPDF4 packages cant handle large files... Yes, I tried with a 20 page PDF, ran seamlessly, but put in a 50+ page PDF, and PyPDF crashes.
My only suggestion would be to use a different package altogether. pdftotext is a good recommendation. Use pip install pdftotext.
I have faced a similar issue while reading my pdf files. Hope the below solution helps.
The reason why I faced this issue : The pdf I was selecting was actually a scanned image. I created my resume using a third party site which returned me a pdf. On parsing this type of file, I was not able to extract text directly.
Below is the testes working code
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os
def readPdfFile(filePath):
pages = convert_from_path(filePath, 500)
image_counter = 1
#Part #1 : Converting PDF to images
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(filename, 'JPEG')
image_counter = image_counter + 1
#Part #2 - Recognizing text from the images using OCR
filelimit = image_counter-1 # Variable to get count of total number of pages
for i in range(1, filelimit + 1):
filename = "page_"+str(i)+".jpg"
text = str(((pytesseract.image_to_string(Image.open(filename)))))
text = text.replace('-\n', '')
#Part 3 - Remove those temp files
image_counter = 1
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
os.remove(filename)
image_counter = image_counter + 1
return text
This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 10 months ago.
I am trying to extract text from a PDF file using Python. My main goal is I am trying to create a program that reads a bank statement and extracts its text to update an excel file to easily record monthly spendings. Right now I am focusing just extracting the text from the pdf file but I don't know how to do so.
What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?
I have tried using PyPDF2 but everytime I try to extract text from any page using extractText(), it returns empty strings. I have tried installing textract but I get errors because I need more libraries I think.
from PyPDF2 import PdfReader
reader = PdfReader("January2019.pdf")
page = reader.pages[0]
print(page.extract_text())
This prints empty strings when it should be printing the contents of the page
edit: This question was asked for a very old PyPDF2 version. New versions of PyPDF2 have improved text extraction a lot
I have tried many methods but failed, include PyPDF2 and Tika. I finally found the module pdfplumber that is work for me, you also can try it.
Hope this will be helpful to you.
import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
Using tika worked for me!
from tika import parser
rawText = parser.from_file('January2019.pdf')
rawList = rawText['content'].splitlines()
This made it really easy to extract separate each line in the bank statement into a list.
If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf and use it like this:
import fitz
def get_text(filepath: str) -> str:
with fitz.open(filepath) as doc:
text = ""
for page in doc:
text += page.getText().strip()
return text
PyPDF2 is highly unreliable for extracting text from pdf . as pointed out here too.
it says :
While PyPDF2 has .extractText(), which can be used on its page objects
(not shown in this example), it does not work very well. Some PDFs
will return text and some will return an empty string. When you want
to extract text from a PDF, you should check out the PDFMiner project
instead. PDFMiner is much more robust and was specifically designed
for extracting text from PDFs.
You could instead install and use pdfminer using
pip install pdfminer
or you can use another open source utility named pdftotext by xpdfreader. instructions to use the utility is given on the page.
you can download the command line tools from here
and could use the pdftotext.exe utility using subprocess .detailed explanation for using subprocess is given here
PyPDF2 does not read whole pdf correctly. You must use this code.
import pdftotext
pdfFileObj = open("January2019.pdf", 'rb')
pdf = pdftotext.PDF(pdfFileObj)
# Iterate over all the pages
for page in pdf:
print(page)
Here is an alternative solution in Windows 10, Python 3.8
Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
import pdftables_api
import os
c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"
for file in os.listdir(file_path):
if file.endswith(".pdf"):
c.xlsx(os.path.join(file_path,file), file+'.xlsx')
Go to https://pdftables.com to get an API key.
CSV, format=csv
XML, format=xml
HTML, format=html
XLSX, format=xlsx-single, format=xlsx-multiple
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
while True:
viewer.render()
pdf_markdown += viewer.canvas.text_content
plain_text += "".join(viewer.canvas.strings)
viewer.next()
except PageDoesNotExist:
pass
I think this code will be exactly what you are looking for:
import requests, time, datetime, os, threading, sys, configparser
import glob
import pdfplumber
for filename in glob.glob("*.pdf"):
pdf = pdfplumber.open(filename)
OutputFile = filename.replace('.pdf','.txt')
fx2=open(OutputFile, "a+")
for i in range(0,10000,1):
try:
page = pdf.pages[i]
text = page.extract_text()
print(text)
fx2.write(text)
except Exception as e:
print(e)
fx2.close()
pdf.close()
Try this:
in terminal execute command: pip install PyPDF2
import PyPDF2
reader = PyPDF2.PdfReader("mypdf.pdf")
for page in reader.pages:
print(page.extract_text())
I would like to add metadata to PDF in python.
I have used PyPDF2 and it works fine with metadata (tags) except of "BaseUrl"
Code:
dict_1['/BaseUrl'] = 'http://bud-arch.pollub.pl/wp-content/uploads/Bud-arch_171_2018_005-012_Boruci%C5%84ska%E2%80%93Bie%C5%84kowska_Maciejko.pdf'
writer = PyPDF2.PdfFileWriter()
writer.addMetadata(dict_1)
The result is metadata "BaseUrl" in the section for creator free tags. not in section "advanced". How to solve this issue?
Łukasz
(source: pokazywarka.pl)