IOError: [Errno 22] Invalid argument - python

I am trying to concatenate several PDFs into a single PDF using the PyPDF2 library.
I am using Python 2.7.
My error is:
>>>
RESTART: C:\Users\Yash gupta\Desktop\first projectt\concatenate\test\New folder\test.py
['Invoice.pdf', 'Invoice_2.pdf', 'invoice_3.pdf', 'last.pdf']
Traceback (most recent call last):
File "C:\Users\Yash gupta\Desktop\first projectt\concatenate\test\New folder\test.py", line 17, in <module>
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
self.read(stream)
File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
stream.seek(-1, 2)
IOError: [Errno 22] Invalid argument
My code is :
import PyPDF2, os
# Get all the PDF filenames.
pdfFiles = []
for filename in os.listdir('.'):
if filename.endswith('.pdf'):
pdfFiles.append(filename)
pdfFiles.sort(key=str.lower)
pdfWriter = PyPDF2.PdfFileWriter()
print ( pdfFiles)
# Loop through all the PDF files.
for filename in pdfFiles:
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print ( pdfFileObj )
# Loop through all the pages
for pageNum in range(0, pdfReader.numPages):
pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
# Save the resulting PDF to a file.
pdfOutput = open('last.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
My PDF has some non-ASCII characters, so I am using 'r' rather than 'rb'.
PS: I am new to Python and to working with libraries like this.

I believe you are looping through collected files incorrectly (Python is indentation-sensitive).
# Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Loop through all the pages
    for pageNum in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)
# Save the resulting PDF to a file.
pdfOutput = open('last.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
Also, try to use PdfFileMerger if you want to merge PDF files:
merger = PdfFileMerger(strict=False)
Check out the example code here.
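Separately from the indentation, the traceback hints at another possible cause: the printed file list already contains last.pdf, the script's own output file, and PyPDF2 fails at stream.seek(-1, 2). Seeking one byte back from the end of an empty file raises [Errno 22], so a zero-byte last.pdf left over from an earlier run could produce this error. The sketch below uses only the standard library; collect_input_pdfs and seek_error_on_empty_file are hypothetical helper names, not part of PyPDF2, and the demo files are made up.

```python
import errno
import os
import tempfile


def collect_input_pdfs(directory, output_name):
    """Return sorted PDF names in `directory`, skipping the output file and empty files."""
    names = []
    for name in os.listdir(directory):
        if not name.lower().endswith('.pdf'):
            continue
        if name == output_name:
            continue  # never feed the merged result back in as an input
        if os.path.getsize(os.path.join(directory, name)) == 0:
            continue  # an empty file cannot be parsed as a PDF
        names.append(name)
    return sorted(names, key=str.lower)


def seek_error_on_empty_file(path):
    """Return the errno raised by seek(-1, 2) on `path`, or None if it succeeds."""
    with open(path, 'rb') as stream:
        try:
            stream.seek(-1, 2)  # the same call shown in the traceback
        except (IOError, OSError) as exc:
            return exc.errno
    return None


# Build a throwaway directory with one real-looking PDF and two empty ones.
demo_dir = tempfile.mkdtemp()
for name, data in [('Invoice.pdf', b'%PDF-1.4 demo'), ('empty.pdf', b''), ('last.pdf', b'')]:
    with open(os.path.join(demo_dir, name), 'wb') as handle:
        handle.write(data)

print(collect_input_pdfs(demo_dir, 'last.pdf'))
print(seek_error_on_empty_file(os.path.join(demo_dir, 'last.pdf')) == errno.EINVAL)
```

Filtering the input list this way keeps the loop from ever opening the file it is about to write.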

Related

pyPDF2 PdfFileWriter output returns a corrupted file

I am very new to Python. I have the following code, which takes user input from a GUI for the "x" and "a" variables. The goal is to open each .pdf in the directory, perform the modifications, and save it over itself. Each PDF in the directory is a single page. It seems to work; however, the newly saved file is corrupted and cannot be opened.
Seal_pdf = PdfFileReader(open(state, "rb"), strict=False)
input_pdf = glob.glob(os.path.join(x, '*.pdf'))
output_pdf = PdfFileWriter()
page_count = len(fnmatch.filter(os.listdir(x), '*.pdf'))
i = 0
if a == "11x17":
    for file in input_pdf:
        sg.OneLineProgressMeter('My Meter', i, page_count, 'And now we Wait.....')
        PageObj = PyPDF2.PdfFileReader(open(file, "rb"), strict=False).getPage(0)
        PageObj.scaleTo(11*72, 17*72)
        PageObj.mergePage(Seal_pdf.getPage(0))
        output_filename = f"{file}"
        f = open(output_filename, "wb+")
        output_pdf.write(f)
        i = i + 1
Adding output_pdf.addPage(PageObj) to the loop produces an uncorrupted file; however, it causes each successive .pdf to be appended to the previous ones (e.g. "pdf1" contains only "pdf1", "pdf2" is now two pages, "pdf1" and "pdf2" merged, and so on). I also attempted to change the two lines before the final counter increment to
with open(output_filename, "wb+") as f:
    output_pdf.write(f)
with no luck. I can't figure out what I am missing to make PdfFileWriter return a single-page, uncorrupted file for each individual PDF in the directory.
if a == "11x17":
    for file in input_pdf:
        sg.OneLineProgressMeter('My Meter', i, page_count, 'And now we Wait.....')
        PageObj = PyPDF2.PdfFileReader(open(file, "rb"), strict=False).getPage(0)
        PageObj.scaleTo(11*72, 17*72)
        PageObj.mergePage(Seal_pdf.getPage(0))
        output_pdf.addPage(PageObj)
        output_filename = f"{file}"
        f = open(output_filename, "wb+")
        output_pdf.write(f)
        i = i + 1
I was finally able to solve this by simply putting output_pdf = PdfFileWriter() inside the loop. I stumbled across that as the solution to another loop issue ("PdfFileWriter() inside loop") and thought I would try it.
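Why moving the writer inside the loop helps can be shown without PyPDF2 at all. In the sketch below, PageCollector is a hypothetical stand-in that only remembers the pages added to it, which is the one behaviour of PdfFileWriter that matters here; the file names are made up.

```python
class PageCollector:
    """Hypothetical stand-in for PdfFileWriter: it only remembers added pages."""

    def __init__(self):
        self.pages = []

    def addPage(self, page):
        self.pages.append(page)


files = ['pdf1', 'pdf2', 'pdf3']

# One shared writer created before the loop: pages pile up across iterations.
shared = PageCollector()
shared_counts = []
for name in files:
    shared.addPage(name)
    shared_counts.append(len(shared.pages))

# A fresh writer created inside the loop: each output holds exactly one page.
fresh_counts = []
for name in files:
    writer = PageCollector()
    writer.addPage(name)
    fresh_counts.append(len(writer.pages))

print(shared_counts)  # [1, 2, 3]
print(fresh_counts)   # [1, 1, 1]
```

The shared writer is what made "pdf2" come out with two pages; a new writer per file starts from an empty page list each time.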

I am getting the error: "UnsupportedOperation: read"

import pickle
#writing into the file
f = open("essay1.txt","ab+")
list1 = ["Aditya","Arvind","Kunal","Naman","Samantha"]
list2 = ["17","23","12","14","34"]
zipfile = zip(list1,list2)
print(zipfile)
pickle.dump(zipfile,f)
f.close()
#opening the file to read it
f = open("essay1","ab")
zipfile = pickle.load(f)
f.close()
and output was :
runfile('E:/Aditya Singh/Aditya Singh/untitled3.py', wdir='E:/Aditya Singh/Aditya Singh')
<zip object at 0x0000000008293BC8>
Traceback (most recent call last):
File "E:\Aditya Singh\Aditya Singh\untitled3.py", line 21, in <module>
zipfile = pickle.load(f)
UnsupportedOperation: read
You forgot the file extension .txt in the line where you tried to open the file, and you also opened it in append mode ("ab"), which is write-only, so the returned object cannot be read from (pickle.load needs to call its read and readline methods). I also suggest using the with statement instead of manually closing the file.
import pickle
#writing into the file
with open("essay1.txt","ab+") as f:
    list1 = ["Aditya","Arvind","Kunal","Naman","Samantha"]
    list2 = ["17","23","12","14","34"]
    zipfile = zip(list1,list2)
    print(zipfile)
    pickle.dump(zipfile,f)
#opening the file to read it
with open("essay1.txt", "rb") as f:
    zipfile = pickle.load(f)
for item in zipfile:
    print(item)
Output:
<zip object at 0x7fa6cb30e3c0>
('Aditya', '17')
('Arvind', '23')
('Kunal', '12')
('Naman', '14')
('Samantha', '34')
Do you have a file named essay1, or essay1.txt?
This line tries to open the file without the extension:
f = open("essay1","ab")
so the read fails.
There are two issues with your code:
You're opening the file to write and not to read.
You're using different filenames for reading and for writing.
Here's a version that works:
import pickle
#writing into the file
f = open("essay1.txt","wb")
list1 = ["Aditya","Arvind","Kunal","Naman","Samantha"]
list2 = ["17","23","12","14","34"]
zipfile = zip(list1,list2)
print(zipfile)
pickle.dump(zipfile,f)
f.close()
#opening the file to read it
f = open("essay1.txt","rb")
zipfile = pickle.load(f)
print(zipfile)
f.close()
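Both points can be checked in isolation. The sketch below writes to a temporary directory rather than the original location, and it pickles list(zip(...)) instead of the zip object itself so that the stored value is plain data; both choices are assumptions made for the demonstration, not part of the answers above.

```python
import io
import os
import pickle
import tempfile

# Temporary location so the demo does not touch any real essay1.txt.
path = os.path.join(tempfile.mkdtemp(), 'essay1.txt')

names = ["Aditya", "Arvind", "Kunal", "Naman", "Samantha"]
numbers = ["17", "23", "12", "14", "34"]
pairs = list(zip(names, numbers))  # a list of tuples pickles as plain data

with open(path, 'wb') as f:
    pickle.dump(pairs, f)

# 'ab' is write-only, so loading from it fails just like in the question.
with open(path, 'ab') as f:
    try:
        pickle.load(f)
        append_mode_error = None
    except io.UnsupportedOperation as exc:
        append_mode_error = exc

# Same file name, opened for binary reading: the round trip succeeds.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(type(append_mode_error).__name__)
print(loaded[0])
```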

Why the 'FileNotFoundError' when the error specifically names a file when I have only given it a directory

I am attempting to use a for-loop to iterate over csv files in a directory, processing them with a pre-defined function.
# Define a function to collect form page data and save as a new csv file
def writeFormPage(file, path):
    '''
    Input:
        Index CSV
    Output:
        Page CSV
    '''
    with open(file, 'r') as rf:
        reader = csv.reader(rf)
        base_name = os.path.basename(file)
        file_path = os.path.join(path, base_name)
        with open(file_path, 'w') as wf:
            writer = csv.writer(wf, delimiter = ',')
            for line in reader:
                url = line[-1]
                page_data = (parseFormPage(url))
                writer.writerow(page_data)
                time.sleep(3 + random.random() * 3)
# Create a new directory to save new CSV files
page_dir = './page'
if not os.path.isdir(page_dir):
    os.makedirs(page_dir)
os.path.isdir(page_dir)
for filename in os.listdir(indx_dir):
    if filename.endswith('.csv'):
        writeFormPage(filename, page_dir)
        time.sleep(3 + random.random() * 3)
FileNotFoundError Traceback (most recent call last)
<ipython-input-23-3a8a501dd2e9> in <module>
1 for filename in os.listdir(indx_dir):
2 if filename.endswith('.csv'):
----> 3 writeFormPage(filename, page_dir)
4 time.sleep(3 + random.random() * 3)
<ipython-input-22-0fc6fceffe13> in writeFormPage(file, path)
7 Page CSV
8 '''
----> 9 with open(file, 'r') as rf:
10 reader = csv.reader(rf)
11
FileNotFoundError: [Errno 2] No such file or directory: '2007Q2.csv'
The fact that the error names a file that I haven't specifically named makes me wonder why it says the file is not found. The file mentioned is the first file in the directory that I would like to iterate over.
The pre-defined function in the first block of code is sound; I have tested it with a single CSV file. I am just struggling with the loop. I am a beginner, and these loops seem to be my nemesis, to be honest; they've caused me hours of headaches!
If anyone could help, I would be very grateful.
Thank you to furas for the help. The solution to this problem is to edit the last block of code to read as
for filename in os.listdir(indx_dir):
    if filename.endswith('.csv'):
        fullpath = os.path.join(indx_dir, filename)
        writeFormPage(fullpath, page_dir)
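The reason this works is that os.listdir returns bare file names, not paths, so open('2007Q2.csv') is resolved against the current working directory rather than indx_dir. A small sketch, using a temporary directory as a stand-in for indx_dir and made-up file names:

```python
import os
import tempfile

indx_dir = tempfile.mkdtemp()  # stand-in for the real index directory
for name in ('2007Q2.csv', '2007Q3.csv', 'notes.txt'):
    with open(os.path.join(indx_dir, name), 'w') as handle:
        handle.write('demo\n')

# os.listdir yields names only; the directory part has to be joined back on.
bare_names = sorted(n for n in os.listdir(indx_dir) if n.endswith('.csv'))
full_paths = [os.path.join(indx_dir, n) for n in bare_names]

print(bare_names)
print(all(os.path.isfile(p) for p in full_paths))
```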

How to search all file types in directory for regular expression

So, I want to search my whole directory for files whose contents match any of a list of regular expressions. That includes directories, PDFs, and CSV files. I can successfully do this when searching only text files, but searching all file types is the struggle. Below is my work so far:
import glob
import re
import PyPDF2
#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)
#Search for Emails
regex1= re.compile(r'\S+#\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
#Search for Locations
regex3 =re.compile("([A-Z]\w+), ([A-Z]{2})")
for file in folder_contents:
    read_file = open(file, 'rt').read()
    if readile_file == pdf:
        pdfFileObj = open('pdf.pdf', 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        content= pageObj.extractText())
    if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
        print ("YES, This file containts PHI")
        print(file)
    else:
        print("No, This file DOES NOT contain PHI")
        print(file)
When I run this I get this error:
YES, This file containts PHI
/home/e136320/sample.txt
No, This file DOES NOT contain PHI
/home/e136320/medicalSample.txt
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-129-be0b68229c20> in <module>()
19
20 for file in folder_contents:
---> 21 read_file = open(file, 'rt').read()
22 if readile_file == pdf:
23 # creating a pdf file object
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-128-1537605cf636> in <module>()
18
19 for file in folder_contents:
---> 20 read_file = open(file, 'rt').read()
21 if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
22 print ("YES, This file containts PHI")
/jupyterhub_env/lib/python3.5/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte
Any suggestions?
You can't open a PDF file like that; open(file, 'rt') expects a plain text file. You could use something like this:
fn, ext = os.path.splitext(file)
if ext == '.pdf':
    # PDFs are binary, so let PyPDF2 parse them
    with open(file, 'rb') as pdf_file:
        read_file = PyPDF2.PdfFileReader(pdf_file).getPage(0).extractText()
else: # plain text
    with open(file, 'rt') as text_file:
        read_file = text_file.read()
This snippet checks the file extension and then reads the file accordingly. This is a bit naive and could be done better with a method similar to the one shown in this answer.
Your issue occurs because you open different file types in the same way. You have to handle them separately: a CSV can be read directly, but a PDF cannot. I used re.search(r".*(?=pdf$)",file) so that a file such as 2.pdf.csv is not treated as a PDF.
import glob
import re
import PyPDF2
#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/e136320/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)
#Search for Emails
regex1= re.compile(r'\S+#\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
#Search for Locations
regex3 = re.compile(r"([A-Z]\w+), ([A-Z]{2})")
for file in folder_contents:
    if re.search(r".*(?=pdf$)",file):
        #this is pdf
        with open(file, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            pageObj = pdfReader.getPage(0)
            content = pageObj.extractText()
        #search the text extracted from the first page
        read_file = content
    elif re.search(r".*(?=csv$)",file):
        #this is csv
        with open(file,"r+",encoding="utf-8") as csv:
            read_file = csv.read()
    else:
        #print("{}".format(file))
        continue
    if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
        print ("YES, This file containts PHI")
        print(file)
    else:
        print("No, This file DOES NOT contain PHI")
        print(file)
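Relying on the extension alone can misclassify files, and text files in another encoding can still raise UnicodeDecodeError. A possible complement, sketched here with the standard library only, is to check for the PDF signature bytes and to decode text with errors='replace'. The helper names, the sample files, and the simplified email and phone patterns (the email one uses @) are assumptions for illustration; PDFs are only detected here, and extracting their text is left to a parser such as PyPDF2.

```python
import os
import re
import tempfile

EMAIL = re.compile(r'\S+@\S+')
PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')


def looks_like_pdf(path):
    """True if the file starts with the PDF signature bytes."""
    with open(path, 'rb') as handle:
        return handle.read(5) == b'%PDF-'


def read_plain_text(path):
    """Decode as UTF-8, substituting undecodable bytes instead of raising."""
    with open(path, 'r', encoding='utf-8', errors='replace') as handle:
        return handle.read()


def scan(path):
    """Return None for PDFs (left to a PDF parser), else True/False for a regex hit."""
    if looks_like_pdf(path):
        return None
    text = read_plain_text(path)
    return bool(EMAIL.search(text) or PHONE.search(text))


# Made-up sample files: a match, a non-UTF-8 file without a match, and a PDF.
demo_dir = tempfile.mkdtemp()
samples = {
    'sample.txt': b'call 555-123-4567',
    'latin1.txt': b'gar\xc7on has no match',
    'report.pdf': b'%PDF-1.4 binary data',
}
for name, data in samples.items():
    with open(os.path.join(demo_dir, name), 'wb') as handle:
        handle.write(data)

results = {name: scan(os.path.join(demo_dir, name)) for name in samples}
print(results)
```

The file containing the 0xc7 byte is scanned without an exception, at the cost of one replacement character in the decoded text.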

Creating files on the fly and converting to pdf

I'm trying to create an HTML file and then convert it to a PDF file using wkhtmltopdf http://wkhtmltopdf.org/
inputfilename = "/tmp/inputfile.html"
outputfilename = "/tmp/outputfile.pdf"
f = open(inputfilename, 'w')
f.write(html)
f.close()
f1 = open(outputfilename, 'w')
ret = convert2pdf(f,outputfilename)
f1.close()
In convert2pdf I'm doing:
def convert2pdf(htmlfilename,outputpdf):
    import subprocess
    commands_to_run = ['/wkhtmltopdf-amd64','htmlfilename', 'outputpdf']
    subprocess.call(commands_to_run)
Both the input and output files are created on the fly. The input file is perfect, but the output PDF created by wkhtmltopdf is empty. Can you suggest what I am doing wrong?
I think you just have to change
commands_to_run = ['/wkhtmltopdf-amd64','htmlfilename', 'outputpdf']
to
commands_to_run = ['/wkhtmltopdf-amd64', htmlfilename, outputpdf]
and instead of
ret = convert2pdf(f,outputfilename)
do
ret = convert2pdf(inputfilename, outputfilename)
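The difference between the two command lists is easy to see by building them without running anything. In this sketch build_command is a hypothetical helper, and the binary path and file names are simply the ones from the question; the resulting list is what would be passed to subprocess.call.

```python
def build_command(binary, html_path, pdf_path):
    """Return the argument list to hand to subprocess.call."""
    return [binary, html_path, pdf_path]


# Quoted names are literal strings: the tool is asked to convert a file
# literally called 'htmlfilename'.
wrong = ['/wkhtmltopdf-amd64', 'htmlfilename', 'outputpdf']

# Unquoted names are variables: the real paths end up in the command.
right = build_command('/wkhtmltopdf-amd64', '/tmp/inputfile.html', '/tmp/outputfile.pdf')

print(wrong[1:])
print(right[1:])
```

Note that the function takes path strings, which is why the call has to pass inputfilename rather than the file object f.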
