Python pyPdf issue downloading pdf

I'm having a hard time reading a PDF from the internet into a Python PdfFileReader object.
My code works for the first URL, but not for the second, and I don't know how to fix it.
I can see that the first URL refers to a .pdf file directly, while for the second URL the PDF is returned as application data in the HTML body.
So I think this might be the issue. Does anybody know how to fix the code so it also works for the second URL?
from pyPdf import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests

def test(url, filename):
    response = requests.get(url)
    pdf_file = BytesIO(response.content)
    existing_pdf = PdfFileReader(pdf_file)
    page = existing_pdf.getPage(0)
    output = PdfFileWriter()
    output.addPage(page)
    outputStream = file(filename, "wb")
    output.write(outputStream)
    outputStream.close()

test('https://s21.q4cdn.com/374334112/files/doc_downloads/test.pdf', 'works.pdf')
test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057', 'crashes.pdf')
This is the stacktrace I have with the second call of the test function:
D:\scripts>test.py
Traceback (most recent call last):
  File "D:\scripts\test.py", line 21, in <module>
    test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057','crashes.pdf')
  File "D:\scripts\test.py", line 10, in test
    page = existing_pdf.getPage(0)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
    self._flatten()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted

I found a solution: importing PyPDF2 instead of pyPdf made the code work for both URLs, so it was probably a bug in pyPdf.
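For reference, a minimal sketch of the working version, assuming PyPDF2 is installed. Only the import really changes; the rest mirrors the code from the question, with open() in place of the Python-2-only file():

from PyPDF2 import PdfFileWriter, PdfFileReader
from io import BytesIO
import requests

def test(url, filename):
    # Download the PDF bytes and wrap them in a seekable in-memory stream
    response = requests.get(url)
    existing_pdf = PdfFileReader(BytesIO(response.content))
    output = PdfFileWriter()
    output.addPage(existing_pdf.getPage(0))
    with open(filename, "wb") as outputStream:
        output.write(outputStream)

test('https://eservices.minfin.fgov.be/mym-api-rest/finform/pdf/2057', 'crashes.pdf')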

Related

'Document' object has no attribute 'is_closed' error & ValueError: bad filename error in fitz (PyMuPDF)

I am trying to extract text from PDF documents and am using S3 buckets and a Lambda function to run my code on AWS. The way I have set it up, uploading a PDF document to the S3 bucket triggers the Lambda function.
The code I have written is as follows:
import json
import s3fs
import urllib

def lambda_handler(event, context):
    s3 = s3fs.S3FileSystem(anon=False)
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    report_path = 'reports/' + key
    test_instance = ReportEval(report_path)
    evaluation_dictionary = test_instance.main_function()
This is the Python script I have written to open the file and extract text using fitz (PyMuPDF), and to count the number of pages in the PDF report using PyPDF2:
import PyPDF2
import fitz
import s3fs

class ReportEval:
    def __init__(self, report_path):
        self.report_path = report_path
        s3 = s3fs.S3FileSystem(anon=False)
        file = s3.open(self.report_path)
        self.PDF_report = PyPDF2.PdfFileReader(file)
        self.whole_report = fitz.open(file)

    def main_function(self):
        if self.whole_report.isEncrypted:
            return 'encrypted'
        page_stop = self.PDF_report.numPages
        for page_no in range(page_stop):
            page_obj = self.whole_report[page_no].getText("text")
            print(page_obj)
When I run this, an error occurs. Does anyone know what is causing this error in my Lambda function? Thanks in advance.
This is the error log I have gotten:
Traceback (most recent call last):
  File "/opt/python/fitz/fitz.py", line 5500, in __del__
    self._cleanup()
  File "/opt/python/fitz/fitz.py", line 5475, in _cleanup
    self._reset_page_refs()
  File "/opt/python/fitz/fitz.py", line 5463, in _reset_page_refs
    if self.is_closed:
AttributeError: 'Document' object has no attribute 'is_closed'
and I have gotten this error as well:
[ERROR] ValueError: bad filename
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 9, in lambda_handler
    test_instance = ReportEval(report_path)
  File "/var/task/main.py", line 10, in __init__
    self.whole_report = fitz.open(file)
  File "/opt/python/fitz/fitz.py", line 3810, in __init__
    raise ValueError("bad filename")
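One likely culprit (my reading, not part of the original thread): fitz.open() expects a filename string, or raw PDF bytes passed via its stream keyword, not an arbitrary file-like object, which is why it raises "bad filename"; the half-constructed Document then fails again in __del__. A minimal sketch of opening the same S3 object with both libraries by reading the bytes once (the key 'reports/example.pdf' is a hypothetical placeholder):

import io
import PyPDF2
import fitz
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
with s3.open('reports/example.pdf', 'rb') as f:  # hypothetical key
    data = f.read()

pdf_report = PyPDF2.PdfFileReader(io.BytesIO(data))
whole_report = fitz.open(stream=data, filetype="pdf")  # open from bytes, not a file object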

Issue with "open" in Python

I am trying to extract the contents of a table within a PDF using PyPDF2; however, I am encountering this error when trying to open the PDF and I am not sure why. How can I fix this? Here is the code:
# PDF Table testing
import PyPDF2
import numpy

pdf_file = open(r"PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.pdf")
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(50)
page_content = page.extractText()
print(page_content.encode('utf-8'))
table_list = page_content.split('\n')
l = numpy.array_split(table_list, len(table_list)/7)
for i in range(0, 5):
    print(l[i])
This is the error:
PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
Traceback (most recent call last):
  File "C:/Users/benjh/Desktop/project/testing_regex.py", line 103, in <module>
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
  File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
    stream.seek(-1, 2)
io.UnsupportedOperation: can't do nonzero end-relative seeks
What does nonzero end-relative seeks mean?
Opening the PDF in binary mode with 'rb' fixes the error. A file opened in text mode only supports seeking relative to the end of the file with a zero offset, but PyPDF2 calls stream.seek(-1, 2) to start reading the PDF from its end, which is the "nonzero end-relative seek" the exception complains about.
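A one-line sketch of the fix, using the same path as in the question:

pdf_file = open(r"PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.pdf", "rb")
read_pdf = PyPDF2.PdfFileReader(pdf_file)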

urllib.urlopen treating http links as local file addresses

Trying to write a really basic scraper for YouTube video titles, using a CSV of video links and Beautiful Soup. The script as it currently stands is:
#!/usr/bin/python
from bs4 import BeautifulSoup
import urllib
import csv

with open('url-titles-list.csv', 'wb') as csv_out:
    fieldnames = ['url', 'title']
    writer = csv.DictWriter(csv_out, fieldnames=fieldnames)
    with open('url-nohttps-list.csv', 'rb') as csv_in:
        reader = csv.DictReader(csv_in, fieldnames=['linkurls'])
        writer.writeheader()
        for row in reader:
            link = row['linkurls']
            with urllib.urlopen(link) as response:
                html = response.read()
                soup = BeautifulSoup(html, "html.parser")
                name = soup.title.string
                writer.writerow({'url': row['linkurls'], 'title': name})
This breaks at urllib.urlopen(link), with the following traceback making it look like the URL type is not being recognised correctly and the links are being opened as local files:
Traceback (most recent call last):
  File "/Users/clarapouletty/Desktop/operation_find_yuzusho/fetcher.py", line 15, in <module>
    with urllib.urlopen(link) as response:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
    return self.open_local_file(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'linkurls'
Process finished with exit code 1
Any assistance much appreciated!
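A detail worth noting (my reading, not part of the original thread): the failing "filename" is the literal string 'linkurls', i.e. the CSV header. Passing fieldnames to csv.DictReader makes it treat the header line as a data row, and urllib.urlopen in Python 2 falls back to local-file handling for a string with no URL scheme. A minimal sketch of skipping the header, under that assumption:

import csv
import urllib

with open('url-nohttps-list.csv', 'rb') as csv_in:
    reader = csv.DictReader(csv_in, fieldnames=['linkurls'])
    next(reader)  # skip the 'linkurls' header row, which is otherwise read as data
    for row in reader:
        response = urllib.urlopen(row['linkurls'])  # Python 2: urlopen has no context-manager support
        html = response.read()
        response.close()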

File Create/Write Issue In Python

I'm trying to create and write to a file. I have the following code:
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    crawled = []
    while to_crawl:
        page = to_crawl.pop()
        page_source = urlopen(page)
        s = page_source.read()
        with open(str(page)+".txt", "a+") as f:
            f.write(s)
            f.close()
    return crawled

if __name__ == "__main__":
    crawler('http://www.yelp.com/')
However, it returns the error:
Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/NLP-HW1/scrape-test.py", line 29, in <module>
    crawler('http://www.yelp.com/')
  File "/Users/adamg/PycharmProjects/NLP-HW1/scrape-test.py", line 14, in crawler
    with open("./"+str(page)+".txt","a+") as f:
IOError: [Errno 2] No such file or directory: 'http://www.yelp.com/.txt'
I thought that open(file,"a+") is supposed to create and write. What am I doing wrong?
If you want to use the URL as the basis for the filename, you should encode the URL. That way, slashes (among other characters) will be converted to character sequences which won't interfere with the file system/shell.
The urllib library can help with this.
So, for example:
>>> import urllib
>>> urllib.quote_plus('http://www.yelp.com/')
'http%3A%2F%2Fwww.yelp.com%2F'
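A sketch of how that might look in the original crawler; the quoting is the point, the rest mirrors the question's code:

import urllib
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    crawled = []
    while to_crawl:
        page = to_crawl.pop()
        s = urlopen(page).read()
        # Encode the URL so it is safe to use as a filename
        with open(urllib.quote_plus(page) + ".txt", "a+") as f:
            f.write(s)
        crawled.append(page)
    return crawled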

wget loop to all the lines (url) in textfile and download Windows

I have a simple task but cannot make my code work. I want to loop over the URLs listed in my text file and download each one using the wget module in Python. Each URL is placed on a separate line in the text file.
Basically, this is the structure of the list in my text file:
http://e4ftl01.cr.usgs.gov//MODIS_Composites/MOLT/MOD11C3.005/2000.03.01/MOD11C3.A2000061.005.2007177231646.hdf
http://e4ftl01.cr.usgs.gov//MODIS_Composites/MOLT/MOD11C3.005/2014.12.01/MOD11C3.A2014335.005.2015005235231.hdf
There are about 178 URLs in all, and the files should be saved in the current working directory.
Below is the initial code that I am working:
import os, fileinput, urllib2 as url, wget

os.chdir("E:/Test/dwnld")
for line in fileinput.FileInput("E:/Test/dwnld/data.txt"):
    print line
    openurl = wget.download(line)
The error message is:
Traceback (most recent call last):
  File "E:\Python_scripts\General_purpose\download_url_from_textfile.py", line 5, in <module>
    openurl = wget.download(line)
  File "C:\Python278\lib\site-packages\wget.py", line 297, in download
    (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
  File "C:\Python278\lib\tempfile.py", line 308, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags)
  File "C:\Python278\lib\tempfile.py", line 239, in _mkstemp_inner
    fd = _os.open(file, flags, 0600)
OSError: [Errno 22] Invalid argument: ".\\MOD11C3.A2000061.005.2007177231646.hdf'\n.frbfrp.tmp"
Try urllib.urlretrieve instead. Check the documentation here: https://docs.python.org/2/library/urllib.html#urllib.urlretrieve
Also note the trailing '\n' inside the invalid filename in the traceback: each line read from the text file still ends with a newline, so strip it before downloading.
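A minimal sketch under those assumptions, stripping each line and deriving the output name from the URL:

import os
import urllib

os.chdir("E:/Test/dwnld")
with open("E:/Test/dwnld/data.txt") as url_list:
    for line in url_list:
        link = line.strip()  # drop the trailing newline that produced the invalid temp-file name
        if not link:
            continue
        filename = link.rsplit("/", 1)[-1]  # e.g. MOD11C3.A2000061.005.2007177231646.hdf
        urllib.urlretrieve(link, filename)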
