I am trying to extract the contents of a table within a PDF using PyPDF2. However, I am encountering this error when trying to open the PDF, and I am not sure why. How can I fix this? Here is the code:
#PDF Table testing
pdf_file = open(r"PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.pdf")
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(50)
page_content = page.extractText()
print(page_content.encode('utf-8'))
table_list = page_content.split('\n')
l = numpy.array_split(table_list, len(table_list)/7)
for i in range(0, 5):
    print(l[i])
This is the error:
PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
Traceback (most recent call last):
File "C:/Users/benjh/Desktop/project/testing_regex.py", line 103, in <module>
read_pdf = PyPDF2.PdfFileReader(pdf_file)
File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
self.read(stream)
File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
stream.seek(-1, 2)
io.UnsupportedOperation: can't do nonzero end-relative seeks
What does "nonzero end-relative seeks" mean?
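Opening the PDF in binary mode ('rb') fixes the error. A file opened in text mode only supports seeking relative to the end of the file with an offset of zero, and PyPDF2 calls stream.seek(-1, 2), a nonzero offset from the end (whence=2), which is what the message refers to.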
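As a minimal sketch of why the mode matters (the temporary file and its contents are placeholders, not the original PDF): the same seek that PyPDF2 performs is rejected on a text-mode file and accepted on a file opened with 'rb'.

```python
import io
import os
import tempfile

# Placeholder file standing in for the PDF; the contents are made up.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4 placeholder bytes")
    path = tmp.name

# Text mode: seeking relative to the end (whence=2) only works with offset 0.
text_mode_error = None
with open(path) as text_file:
    try:
        text_file.seek(-1, 2)
    except io.UnsupportedOperation as exc:
        text_mode_error = str(exc)

# Binary mode: the same seek succeeds and lands on the last byte.
with open(path, "rb") as binary_file:
    binary_file.seek(-1, 2)
    last_byte = binary_file.read(1)

os.remove(path)
print(text_mode_error)
print(last_byte)
```

So for the question's code, the only change needed should be passing 'rb' as the second argument to open().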
I'm fairly new to file handling, especially PDFs. I currently have pdfminer.six installed and have tested several functions that extract text from PDF files. I also have another function that takes a list of PDF files and calls the first extraction function to extract all the text from each file.
The problem is that I have a lot of PDF files, and the script breaks every time it encounters a new error. It's difficult to have to go and search for which PDF file caused the error, whether it was an unrecognized character, a different encoding, encryption, etc.
How can I make the script continue to run regardless of the type of error? Could I set the PDF extraction function to ignore any type of error? Or am I missing something in my code that would help me address this issue?
p = Path("C:/Users/Hugo Caldeira/Desktop")
inp = r"((?<=|^)[0-9]{3}-[0-9]{2}-[0-9]{4}(?=|$))"
file_dict = {
    "name" : [],
    "created" : [],
    "modified" : [],
    'path' : [],
    'content' : [],
    'keyword' : []
}
files = list(p.rglob('*pdf'))
def pdfparser(file):
    fp = open(file, 'rb')
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    #Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    #Process each page contained in the document.
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    data = retstr.getvalue()
    return(data)
def pdfs(files):
    for name in files:
        #print(name)
        IP_list = (pdfparser(name))
        #print(IP_list)
        keyword = re.findall(inp,IP_list)
        #print(ip_test)
        file_dict['keyword'].append(keyword)
        file_dict['name'].append(name.name[0:])
        file_dict['created'].append(time.ctime(name.stat().st_ctime))
        file_dict['modified'].append(time.ctime(name.stat().st_mtime))
        file_dict['path'].append(name)
        file_dict["content"].append(IP_list)
    #print(file_dict)
    return(file_dict)
pdfs(files)
def to_xlsx():
    df = pd.DataFrame.from_dict(file_dict)
    df.head()
    df.to_excel("pdftest.xlsx")
if __name__ == "__main__":
    to_xlsx()
The current error I am getting is:
Traceback (most recent call last):
File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 67, in <module>
print(pdfparser(p))
File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 32, in pdfparser
fp = open(file, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Hugo Caldeira\\Desktop\\test_folder\\Desktop'
(base) C:\Users\Hugo Caldeira\Desktop\Scripts>"C:/Users/Hugo Caldeira/Anaconda3/python.exe" "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py"
Traceback (most recent call last):
File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 64, in <module>
pdfs(files)
File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 52, in pdfs
IP_list = (pdfparser(name))
File "c:/Users/Hugo Caldeira/Desktop/Scripts/pdf.py", line 42, in pdfparser
for page in PDFPage.get_pages(fp):
File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfpage.py", line 129, in get_pages
doc = PDFDocument(parser, password=password, caching=caching)
File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 577, in __init__
self._initialize_password(password)
File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 603, in _initialize_password
handler = factory(docid, param, password)
File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 303, in __init__
self.init()
File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 310, in init
self.init_key()
File "C:\Users\Hugo Caldeira\Anaconda3\lib\site-packages\pdfminer\pdfdocument.py", line 325, in init_key
raise PDFPasswordIncorrect
pdfminer.pdfdocument.PDFPasswordIncorrect
The other errors I encountered before were:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Use try and except.
https://docs.python.org/3.7/tutorial/errors.html#handling-exceptions
In your except clause make sure you output the filename and the exception.
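A minimal sketch of that advice, assuming a hypothetical helper extract_all() that wraps whatever extraction function is used. The fake_extract() function and the file names below are stand-ins for pdfparser() and real paths, so the example runs without pdfminer installed.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pdf_batch")


def extract_all(files, extract):
    """Run extract on each file and record failures instead of stopping."""
    results = {}
    failures = {}
    for name in files:
        try:
            results[name] = extract(name)
        except Exception as exc:  # broad on purpose: any parser error is skipped
            # Output the filename and the exception so the bad file is easy to find.
            log.warning("Skipping %s: %s: %s", name, type(exc).__name__, exc)
            failures[name] = exc
    return results, failures


# Hypothetical extractor standing in for pdfparser(); it fails on one file.
def fake_extract(name):
    if name.endswith("locked.pdf"):
        raise ValueError("password incorrect")
    return "text of " + name


results, failures = extract_all(["a.pdf", "locked.pdf", "b.pdf"], fake_extract)
print(sorted(results))
print(sorted(failures))
```

Catching Exception broadly suits a batch job where any failure should be skipped; once the recurring error types are known (for example PDFPasswordIncorrect or PDFSyntaxError), narrower except clauses are usually preferable.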
I was trying to write code to generate a combined PDF from a number of small PDF files when I found the script failing with a UnicodeEncodeError.
I also tried passing an encoding parameter:
with open("Combined.pdf", "w",encoding='utf-8-sig') as outputStream:
but Python said the file needs to be opened in binary 'wb' mode, so this isn't working.
Below is the code:
writer = PdfFileWriter()
input_stream = []
for f2 in f_re:
    inputf_file = str(mypath+'\\'+f2[2])
    input_stream.append(open(inputf_file,'rb'))
for reader in map(PdfFileReader, input_stream):
    for n in range(reader.getNumPages()):
        writer.addPage(reader.getPage(n))
with open("Combined.pdf", "wb") as outputStream:
    writer.write(outputStream)
writer.save()
for f in input_stream:
    f.close()
Below is error message:
Traceback (most recent call last):
File "\Workspace\Python\py_CombinPDF\py_combinePDF.py", line 89, in <module>
writer.write(outputStream)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\pdf.py", line 501, in write
obj.writeToStream(stream, key)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 549, in writeToStream
value.writeToStream(stream, encryption_key)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 472, in writeToStream
stream.write(b_(self))
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\utils.py", line 238, in b_
r = s.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)
Upgrading PyPDF2 solved this issue.
Now, four years later, people should use pypdf instead; it contains the latest code. (I'm the maintainer of PyPDF2 and pypdf.)
I extracted the text content from a multi-page CV in PDF format using PyPDF2 and am trying to write that content to a text file, but I'm getting the following error message when writing the content.
Here is my code:
import PyPDF2
newFile = open('details.txt', 'w')
file = open("cv3.pdf", 'rb')
pdfreader = PyPDF2.PdfFileReader(file)
numPages = pdfreader.getNumPages()
print(numPages)
page_content = ""
for page_number in range(numPages):
    page = pdfreader.getPage(page_number)
    page_content += page.extractText()
newFile.write(page_content)
print(page_content)
file.close()
newFile.close()
The error message:
Traceback (most recent call last):
File "C:/Users/HP/PycharmProjects/CVParser/pdf.py", line 16, in <module>
newFile.write(page_content)
File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0141' in position 827: character maps to <undefined>
Process finished with exit code 1
This code succeeded with another multi-page PDF file (a docx file that had been converted to PDF).
Please help me if anyone knows the solution.
This should solve your problem in Python 3. Pass an explicit encoding, because on Windows the default text encoding is often cp1252, which cannot represent '\u0141':
with open("Output.txt", "w", encoding="utf-8") as text_file:
    print(page_content, file=text_file)
If the above does not work for you, try the following:
with open("Output1.txt", "wb") as text_file:
    text_file.write(page_content.encode("UTF-8"))
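A small sketch of what is going wrong, under the assumption that the output file was being opened with the cp1252 default shown in the traceback. The sample string and the temporary path are made up for illustration; only the character '\u0141' comes from the error message.

```python
import os
import tempfile

# Made-up text containing the character named in the traceback (U+0141).
page_content = "Name: \u0141ukasz"

# cp1252 has no mapping for U+0141, so encoding with it fails.
try:
    page_content.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With an explicit UTF-8 encoding the write succeeds on any platform.
path = os.path.join(tempfile.mkdtemp(), "details.txt")
with open(path, "w", encoding="utf-8") as text_file:
    text_file.write(page_content)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as text_file:
    round_trip = text_file.read()

print(cp1252_ok)
print(round_trip == page_content)
```

The same reasoning explains why the converted docx file worked: its text presumably contained only characters that cp1252 can represent.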
Trying to write a really basic scraper for youtube video titles, using a csv of video links and beautiful soup. The script as it currently stands is:
#!/usr/bin/python
from bs4 import BeautifulSoup
import urllib
import csv
with open('url-titles-list.csv', 'wb') as csv_out:
    fieldnames = ['url', 'title']
    writer = csv.DictWriter(csv_out, fieldnames = fieldnames)
    with open('url-nohttps-list.csv', 'rb') as csv_in:
        reader = csv.DictReader(csv_in, fieldnames=['linkurls'])
        writer.writeheader()
        for row in reader:
            link = row['linkurls']
            with urllib.urlopen(link) as response:
                html = response.read()
            soup = BeautifulSoup(html, "html.parser")
            name = soup.title.string
            writer.writerow({'url': row['linkurls'], 'title': name})
This breaks at urllib.urlopen(link). The traceback makes it look as though the URL type is not being recognised correctly and urllib is trying to open the links as local files:
Traceback (most recent call last):
File "/Users/clarapouletty/Desktop/operation_find_yuzusho/fetcher.py", line 15, in <module>
with urllib.urlopen(link) as response:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
return getattr(self, name)(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
return self.open_local_file(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'linkurls'
Process finished with exit code 1
Any assistance much appreciated!
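One possible reading of the traceback, sketched under assumptions: the file name in the error is the literal string 'linkurls', which suggests the CSV has a header line and that passing fieldnames= made DictReader return that header as the first data row. The sample data, the video id abc123, and the with_scheme() helper are hypothetical, and the sketch uses Python 3 rather than the Python 2.7 in the question.

```python
import csv
import io

# Hypothetical stand-in for url-nohttps-list.csv: a header line plus one link.
sample = "linkurls\nwww.youtube.com/watch?v=abc123\n"

# Passing fieldnames= makes DictReader treat the first line as data.
with_names = list(csv.DictReader(io.StringIO(sample), fieldnames=["linkurls"]))

# Omitting fieldnames= makes DictReader consume the first line as the header.
without_names = list(csv.DictReader(io.StringIO(sample)))


def with_scheme(link, scheme="https"):
    """Prefix a scheme when the link has none (hypothetical helper)."""
    return link if "://" in link else scheme + "://" + link


print(with_names[0]["linkurls"])
print(without_names[0]["linkurls"])
print(with_scheme(without_names[0]["linkurls"]))
```

If the links in the file really do lack a scheme, as the name url-nohttps-list.csv hints, urllib would also treat them as local paths, so adding the scheme before calling urlopen is probably needed as well.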
I'm having a hard time reading a PDF from the internet into a Python PdfFileReader object.
My code works for the first URL, but it doesn't for the second, and I don't know how to fix it.
I can see that in the first example the URL refers to a .pdf file, while in the second the PDF is returned as 'application data' in the HTML body.
So I think this might be the issue. Does anybody know how to fix it so the code also works for the second URL?
from pyPdf import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests
def test(url,filename):
    response=requests.get(url)
    pdf_file = BytesIO(response.content)
    existing_pdf = PdfFileReader(pdf_file)
    page = existing_pdf.getPage(0)
    output = PdfFileWriter()
    output.addPage(page)
    outputStream = file(filename, "wb")
    output.write(outputStream)
    outputStream.close()
test('https://s21.q4cdn.com/374334112/files/doc_downloads/test.pdf','works.pdf')
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
This is the stacktrace I have with the second call of the test function:
D:\scripts>test.py
Traceback (most recent call last):
File "D:\scripts\test.py", line 21, in <module>
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
File "D:\scripts\test.py", line 10, in test
page = existing_pdf.getPage(0)
File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
self._flatten()
File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
catalog = self.trailer["/Root"].getObject()
File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
return dict.__getitem__(self, key).getObject()
File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 165, in getObject
return self.pdf.getObject(self).getObject()
File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
I found a solution. I imported PyPDF2 instead of pyPdf, so it was probably a bug.