pypdf Merging multiple pdf files into one pdf - python

I have 1000+ PDF files that need to be merged into one PDF:
from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()
for i in range(1000):
    filepath = f"my/pdfs/{i}.pdf"
    reader = PdfReader(open(filepath, "rb"))
    for page in reader.pages:
        writer.add_page(page)

with open("document-output.pdf", "wb") as fh:
    writer.write(fh)
When I execute the above code, at the line reader = PdfReader(open(filepath, "rb")), I get an error message:
IOError: [Errno 24] Too many open files
I think this is a bug. If not, what should I do?

I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.
Note: I am assuming that filename is a well-formed file path string; assume the same for all of my code.
The Short Answer
Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to keep the following as close to your original code as I could:
from PyPDF2 import PdfFileMerger, PdfFileReader

[...]

merger = PdfFileMerger()

for filename in filenames:
    merger.append(PdfFileReader(open(filename, 'rb')))

merger.write("document-output.pdf")
The Long Answer
The way you're using PdfFileReader and PdfFileWriter is keeping each file open, eventually causing Python to raise IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding references to the page in the open PdfFileReader (hence the noted IOError if you close the file). Python detects that the file is still referenced and doesn't do any garbage collection / automatic file closing despite re-using the file handle. The files remain open until PdfFileWriter no longer needs access to them, which is at writer.write(fh) in your code.
To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it instead. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only created copies in certain conditions.
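The copy-into-memory idea itself needs nothing from PyPDF2; here is a minimal sketch in plain Python (the helper name load_into_memory is mine, not the library's):

```python
import io

def load_into_memory(path):
    # Read the whole file into an in-memory buffer so the OS-level
    # handle is closed immediately; the returned BytesIO is seekable
    # and can be handed to any reader that expects a binary file.
    with open(path, "rb") as f:
        return io.BytesIO(f.read())
```

Each call consumes one file descriptor only for the duration of the read, so a loop over thousands of files never accumulates open handles.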
My initial attempts looked like the following, and were resulting in the same IO Problems:
merger = PdfFileMerger()

for filename in filenames:
    merger.append(filename)

merger.write(output_file_path)
Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (this is older, Python 2-era source) before opening it with PdfFileReader(fileobj):
if type(fileobj) in (str, unicode):
    fileobj = file(fileobj, 'rb')
    my_file = True
elif type(fileobj) == file:
    fileobj.seek(0)
    filecontent = fileobj.read()
    fileobj = StringIO(filecontent)
    my_file = True
elif type(fileobj) == PdfFileReader:
    orig_tell = fileobj.stream.tell()
    fileobj.stream.seek(0)
    filecontent = StringIO(fileobj.stream.read())
    fileobj.stream.seek(orig_tell)
    fileobj = filecontent
    my_file = True
We can see that the append() option does accept a string, and when doing so, assumes it's a file path and creates a file object at that location. The end result is the exact same thing we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!
However, if we either make a file object of the file path string or a PdfFileReader(see Edit 2) object of the path string before it gets passed into append(), it will automatically create a copy for us as a StringIO object, allowing Python to close the file.
I would recommend the simpler merger.append(open(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().
Hope this helped!
EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained with the author giving his official blessings to Phaseit in developing PyPDF2.
If for some reason you cannot swap to PyPDF2 (licensing, system restrictions, etc.) then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
EDIT 2: Previous recommendation of using merger.append(PdfFileReader(open(filename, 'rb'))) changed based on comments (thanks @Agostino).

The pdfrw package reads each file all in one go, so will not suffer from the problem of too many open files. Here is an example concatenation script.
The relevant part -- assumes inputs is a list of input filenames, and outfn is an output file name:
from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
Disclaimer: I am the primary pdfrw author.

The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need that here.
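For reference, on Unix that per-process limit can be inspected with the stdlib resource module; a minimal sketch (raising the soft limit is shown only as a comment, since it changes process state):

```python
import resource

# Query the per-process limit on open file descriptors (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# The soft limit can be raised, but never above the hard limit:
# resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```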
What you could try is closing each file inside the for loop:
output = PdfFileWriter()

for filename in filenames:
    f = open(filename, 'rb')
    input = PdfFileReader(f)
    # ... copy the pages you need into output ...
    f.close()

I have written this code to help with the answer:
import sys
import os
import PyPDF2

merger = PyPDF2.PdfFileMerger()

# get the path and the PDF files
path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)

# iterate over the documents
for pdf in pdfs:
    try:
        # if the doc exists, then merge
        if os.path.exists(pdf):
            input = PyPDF2.PdfFileReader(open(pdf, 'rb'))
            merger.append(input)
        else:
            print(f"problem with file {pdf}")
    except Exception:
        print("can't merge !! sorry")
    else:
        print(f" {pdf} Merged !!! ")

merger.write("Merged_doc.pdf")
In this, I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader, instead of explicitly converting the file name to a file object.

It may be just what it says: you are opening too many files.
You may explicitly use f = open(filename) ... f.close() in the loop, or use the with statement, so that each opened file is properly closed.
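A small sketch of that suggestion; read_all is a hypothetical helper, the point is only that the with-block releases each handle before the next one opens:

```python
def read_all(filenames):
    # Read each file inside a with-block: the handle is released as soon
    # as the block exits, so at most one file is open at any moment.
    contents = []
    for name in filenames:
        with open(name, "rb") as f:
            contents.append(f.read())
        assert f.closed  # the handle is already closed here
    return contents
```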

Python PdfReader: Getting error when sequentially reading PDFs in a folder: Errno 2 (No such file or directory): 'filename.pdf'

I'm trying to put together a code that will procedurally read through a folder of PDFs to scrape relevant information such as part names, numbers, materials, and final treatments. The (presumably) problematic part of the code is:
for fp in os.listdir(path):
    pdfFileObj = open(fp, 'rb')
    reader = PdfReader(pdfFileObj)
    number_of_pages = len(reader.pages)
    page = reader.pages[0]
    text = page.extract_text()
    Title, part_number, material, f_treatments = extractText(text)
    printAll(Title, part_number, material, f_treatments)
    pdfFileObj.close()
where path = r'C:\Users\myname\Documents\TargetFile'
It reads the first file (1.pdf) in TargetFile successfully, but will return this upon reading the second file, (2.pdf):
[Errno 2] No such file or directory: '2.pdf'
which is peculiar, given that it needs to know that 2.pdf is in the folder in order to report this error message. I suspect that fp in os.listdir() is detecting it, but that the pdfFileObj = open(fp, 'rb') command isn't finding it, as the error is reported from that line.
Do you know what the issue might be based on the information I've provided?
I thought that closing the document at the end of the loop code would help but this doesn't seem to make a difference. I've never worked with 'rb' or read-binary code before, but if it seems to work for the first file I don't expect this would be an issue.
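For what it's worth, the suspicion above matches how os.listdir behaves: it yields bare filenames, while open() resolves them against the current working directory rather than against path. A sketch of joining each name back onto the directory (the helper name pdf_paths is illustrative):

```python
import os

def pdf_paths(path):
    # os.listdir returns bare names like '2.pdf'; join each one back
    # onto the directory so open() receives a path that actually exists.
    return [os.path.join(path, fp) for fp in os.listdir(path)
            if fp.lower().endswith(".pdf")]
```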

How do I iterate through files in my directory so they can be opened/read using PyPDF2?

I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all the invoices stored in my directory. There could be anywhere from 1 to 250+ files depending on which project I am using this for.
I thought I would be able to use "*.pdf" in place of the pdf name, but it does not work for me. I am relatively new to Python and have not used that many loops before, so any guidance would be appreciated!
import re
import PyPDF2

pdfFileObj = open(r'C:\Users\notylerhere\Desktop\Test Invoices\SampleInvoice.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)

# Print all text on page
#print(pageObj.extractText())

# Grab Account Number, Meter Number
accountNumber = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
meterNumber = re.compile(r'(\d\d\d\d\d\d\d\d)')
moAccountNumber = accountNumber.search(pageObj.extractText())
moMeterNumber = meterNumber.search(pageObj.extractText())
print('Account Number: ' + moAccountNumber.group())
print('Meter Number: ' + moMeterNumber.group(1))
Thanks very much!
Another option is glob:
import glob

files = glob.glob("c:/mydirectory/*.pdf")
for file in files:
    # do your processing of file here
You need to ensure everything past the colon is properly indented.
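The same selection can also be written with pathlib, which joins the directory onto each match for you; a sketch (the helper name list_pdfs is mine):

```python
from pathlib import Path

def list_pdfs(directory):
    # Path.glob yields full Path objects, so no manual joining is needed;
    # sorted() makes the processing order deterministic.
    return sorted(Path(directory).glob("*.pdf"))
```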
You want to iterate over your directory and deal with every file independently.
There are many functions depending on your use case. os.walk is a good place to start.
Example:
import os

for root, directories, files in os.walk('.'):
    for file in files:
        if file.endswith('.pdf'):
            openAndDoStuff(os.path.join(root, file))
import os
import PyPDF2

for el in os.listdir(os.getcwd()):
    if el.endswith("pdf"):
        pdf_reader = PyPDF2.PdfFileReader(open(os.path.join(os.getcwd(), el), "rb"))

Is it possible to open PDF files one after other, their names are saved in a text file using Python?

I wanted to open PDF files one after the other, to take screenshots with a delay of n seconds. I have made a "1.txt" file listing them, to open them through Python, and I have read these names into a list. But is there a way to read through this list to open the files with a delay?
I am stuck here: getting the file names from the list and opening them through a loop with a delay.
import PyPDF2

linelist = [line.rstrip('\n') for line in open('1.txt')]
print(linelist)
pdf_file = open('1.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
This is the place I am stuck: looping over the file names in the list to open them. I used the PyPDF2 and webbrowser modules.
wb.open_new(r'C:\test\1.pdf')
Any help is highly appreciated.
To iterate over a list in python you can do for element in list.
Also, to generate the images from the pdf files, you can use python's pdf2image module, as in this link.
The complete solution would look like:
import os
import tempfile
from pdf2image import convert_from_path

with open('1.txt', 'r') as f:
    line_list = f.read().splitlines()
print(line_list)

for line in line_list:
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(line, output_folder=path,
                                             last_page=1, first_page=0)
        base_filename = os.path.splitext(os.path.basename(line))[0] + '.jpg'
        save_dir = './'
        for page in images_from_path:
            page.save(os.path.join(save_dir, base_filename), 'JPEG')
This uses the command line to open the file, which opens it in the default viewer. Tested on Windows 10.
import subprocess
subprocess.Popen([filename], shell=True)
To use your own code:
import subprocess
import time

sleepytime = 5
linelist = [line.rstrip('\n') for line in open('1.txt')]
print(linelist)

for filename in linelist:
    subprocess.Popen([filename], shell=True)
    time.sleep(sleepytime)
Of course, I would advise you to look at a way to also automate the screenshot part, making your life so much more fun. For example, using the pdf2image library:
from pdf2image import convert_from_path

images = convert_from_path('/home/belval/example.pdf')
for image in images:
    image.save('image.jpg', 'JPEG')  # <- change this

Can't access external link with python + h5py

Recently I have started working with .hdf5 files and still can't figure out how to properly use external links.
I have got a few .hdf5 files. Each file has got the same structure e.g. same keys and data types. I want to merge them into one file but keep them separate with a different key for each file.
Here's what I do:
myfile = h5py.File("/path_to_the_directory/merged_images.hdf5", 'w')
myfile['0.000'] = h5py.ExternalLink("img_000.hdf5", "/path_to_the_directory/images")
myfile['0.001'] = h5py.ExternalLink("img_001.hdf5", "/path_to_the_directory/images")
myfile.flush()
Then I try to read it with:
myfile = h5py.File("/path_to_the_directory/merged_images.hdf5", 'r')
keys = list(myfile.keys())
print(keys)
print(list(myfile[keys[0]]))
The line print(keys) gives me ['0.000', '0.001']. So, I believe the file's structure is okay.
And the next line gives me an exception:
KeyError: "Unable to open object (unable to open external file, external link file name = 'img_000.hdf5')"
Am I doing something wrong? The documentation is pretty sparse and I haven't found a relevant use-case there.
The problem is that you are mixing up paths. It is important to distinguish between two types of paths:
File path (the location on your hard drive).
Dataset path: this path is internal to the HDF5-file, and does not depend on where you store the file.
The syntax of h5py.ExternalLink, as mentioned in the documentation, is:
myfile['/path/of/link'] = h5py.ExternalLink('/path/to/file.hdf5', '/path/to/dataset')
I would therefore encourage you to use a relative file path for the ExternalLink. If you do that, everything will continue to work even if you move the collection of files to a new location on your hard drive (or give them to somebody else).
With the correct paths, your example works, as shown below.
Note that, to illustrate my remark about relative file paths, I have made all paths of the datasets absolute (these are only internal to the file, and do not depend on where the file is stored on the hard drive) while I kept the file paths relative.
import h5py
import numpy as np

myfile = h5py.File('test_a.hdf5', 'w')
myfile['/path/to/data'] = np.array([0, 1, 2])
myfile.close()

myfile = h5py.File('test_b.hdf5', 'w')
myfile['/path/to/data'] = np.array([3, 4, 5])
myfile.close()

myfile = h5py.File('test.hdf5', 'w')
myfile['/a'] = h5py.ExternalLink('test_a.hdf5', '/path/to/data')
myfile['/b'] = h5py.ExternalLink('test_b.hdf5', '/path/to/data')
myfile.close()

myfile = h5py.File('test.hdf5', 'r')
keys = list(myfile.keys())
print(keys)
print(list(myfile[keys[0]]))
print(list(myfile[keys[1]]))
myfile.close()
Prints (as expected):
['a', 'b']
[0, 1, 2]
[3, 4, 5]

Python: Input and output filename matching

I'm trying to come up with a way for the files that I'm writing to have the same filename as the files I'm reading. The code currently reads the images and does some processing; my output will extract the data from that process into a CSV file. I want both filenames to match. I've come across fname for matching, but that's for existing files.
So if your input file name is in_file = 'myfile.jpg', do this:
my_outfile = '.'.join(in_file.split('.')[:-1]) + '.csv'
This splits in_file into a list of parts that are separated by '.'. It then puts them back together minus the last part, and appends '.csv'.
Your my_outfile will be myfile.csv
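An alternative that sidesteps the split/join bookkeeping is os.path.splitext, which treats only the final extension specially; a sketch (the function name csv_name is mine):

```python
import os

def csv_name(in_file):
    # splitext splits off only the last extension, so names containing
    # dots (e.g. 'report.v2.jpg') keep their stem intact.
    base, _ext = os.path.splitext(in_file)
    return base + ".csv"
```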
Well, in Python it's possible to do that, but the original file might be corrupted if we use the same exact file name, i.e. writing BibleKJV.pdf back to the path BibleKJV.pdf will corrupt the first file. Take a look at this script to verify that I'm on the right track (if I'm totally off, disregard my answer):
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

path = "C:/Users/Catrell Washington/Pride"

input_file_name = os.path.join(path, "BibleKJV.pdf")
input_file = PdfFileReader(open(input_file_name, "rb"))
output_PDF = PdfFileWriter()

total_pages = input_file.getNumPages()
for page_num in range(1, total_pages):
    output_PDF.addPage(input_file.getPage(page_num))

output_file_name = os.path.join(path, "BibleKJV.pdf")
output_file = open(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
When I ran the above script, I lost all data from the original path "BibleKJV.pdf", thus proving that if the file name and the file extension (.pdf, .csv, .docx, etc.) are both the same, then the data, unless changed very minimally, will be corrupted.
If this doesn't help, please edit your question with a script of what you're trying to achieve.
