I have a document library consisting of several thousand PDF documents. I am trying to extract the first page from each document. Each extracted page should then be stored individually in a folder called "First Page".
I have written the script below as a means of printing the first page from each document. I have been able to extract the PDF files from some of the documents in my library. However, the vast majority have not been exported. Examining the terminal, I note that a lot of errors are thrown with the comment "Superfluous whitespace found in object header b'21' b'0'". I have searched online but have been unable to find anything relevant.
I have three questions:
Would anyone have any idea how I can address the Superfluous whitespace issue?
My documents seem to be exporting as unreadable or damaged files. Is there something missing from my code?
My documents are also not exporting to my required output directory. I am unsure how I point the extracts to this directory. Would anyone be able to help with this also?
import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
# get the file names in the directory
input_directory = 'Fund Docs'
entries = os.listdir(input_directory)
output_directory = 'First Pages'
outputs = os.listdir(output_directory)
for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(input_directory + '/' + entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()
    outputFileName = 'First_Page' + entry + '.pdf'
    with open(outputFileName, 'wb') as out:
        pdf_writer = PyPDF2.PdfFileWriter(out)
    print('created ', outputFileName)
Several points:
Use pypdf (PyPDF2 is deprecated)
Use PdfReader (PdfFileReader is deprecated - it now has strict=False by default)
The Superfluous whitespace found message is only a warning with strict=False. You see that message because the PDF is not completely standard compliant. You can silence the warning: https://pypdf.readthedocs.io/en/latest/user/suppress-warnings.html
When you write a question, you should also mention which version of the critical libraries you're using.
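To make the points above concrete, here is a minimal sketch of the whole task with pypdf. It assumes a recent pypdf (3.x or later) is installed. The function names and the `First_Page_<name>.pdf` naming scheme are my own choices for illustration, not something from the question. The logging line follows the approach described on the suppress-warnings page linked above. The key differences from the original script are that the first page is actually added to a writer before saving, and that the output path is joined with the output directory.

```python
import logging
import os

# Silence pypdf warnings such as "Superfluous whitespace found in object header".
logging.getLogger("pypdf").setLevel(logging.ERROR)


def first_page_output_path(output_directory, entry):
    """Build e.g. 'First Pages/First_Page_report.pdf' from 'report.pdf'."""
    stem, _ext = os.path.splitext(entry)
    return os.path.join(output_directory, f"First_Page_{stem}.pdf")


def extract_first_pages(input_directory, output_directory):
    """Write page 1 of every PDF in input_directory to output_directory."""
    # Imported here so the path helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    os.makedirs(output_directory, exist_ok=True)
    written, failed = [], []
    for entry in sorted(os.listdir(input_directory)):
        if not entry.lower().endswith(".pdf"):
            continue
        try:
            reader = PdfReader(os.path.join(input_directory, entry))
            writer = PdfWriter()
            writer.add_page(reader.pages[0])  # the step missing in the question
            out_path = first_page_output_path(output_directory, entry)
            with open(out_path, "wb") as out:
                writer.write(out)
            written.append(out_path)
        except Exception as exc:  # damaged or empty PDFs are reported, not fatal
            failed.append((entry, str(exc)))
    return written, failed
```

Calling `extract_first_pages('Fund Docs', 'First Pages')` would then return the list of files written plus a list of the inputs that could not be processed, which is handy with several thousand documents.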
I'm trying to put together a code that will procedurally read through a file of PDFs to scrape relevant information such as part names, numbers, materials, and final treatments. The (presumably) problematic part of the code is written:
for fp in os.listdir(path):
    pdfFileObj = open(fp, 'rb')
    reader = PdfReader(pdfFileObj)
    number_of_pages = len(reader.pages)
    page = reader.pages[0]
    text = page.extract_text()
    Title, part_number, material, f_treatments = extractText(text)
    printAll(Title, part_number, material, f_treatments)
    pdfFileObj.close()
where path = r'C:\Users\myname\Documents\TargetFile'
It reads the first file (1.pdf) in TargetFile successfully, but will return this upon reading the second file, (2.pdf):
[Errno 2] No such file or directory: '2.pdf'
which is peculiar, given that it needs to know that 2.pdf is in the folder in order to report this error message. I suspect that fp in os.listdir() is detecting this, but that the pdfFileObj = open(fp, 'rb') command isn't finding it, as the error is reported from that line.
Do you know what the issue might be based on the information I've provided?
I thought that closing the document at the end of the loop code would help but this doesn't seem to make a difference. I've never worked with 'rb' or read-binary code before, but if it seems to work for the first file I don't expect this would be an issue.
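One plausible explanation, assuming the script is run from a working directory other than TargetFile: os.listdir returns bare file names such as '2.pdf', not full paths, so open(fp, 'rb') looks for the file relative to the current working directory. A file would then only open if a copy with the same name happens to exist there. The sketch below (the helper name `list_pdf_paths` is made up for illustration) joins each name with the directory before it is opened.

```python
import os


def list_pdf_paths(directory):
    """Return full paths of the PDF files in directory, sorted by name."""
    return [
        os.path.join(directory, name)  # os.listdir yields names only, so join them
        for name in sorted(os.listdir(directory))
        if name.lower().endswith(".pdf")
    ]
```

In the loop from the question, this would mean iterating over `list_pdf_paths(path)` instead of `os.listdir(path)`, so that every `open(fp, 'rb')` receives a path that is valid regardless of the working directory.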
I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all the invoices stored in my directory. There could be anywhere from 1 to 250+ files depending on which project I am using this for.
I thought I would be able to use "*.pdf" in place of the pdf name, but it does not work for me. I am relatively new to Python and have not used that many loops before, so any guidance would be appreciated!
import re
import PyPDF2
pdfFileObj = open(r'C:\Users\notylerhere\Desktop\Test Invoices\SampleInvoice.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
# Print all text on page
# print(pageObj.extractText())
# Grab Account Number and Meter Number
accountNumber = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
meterNumber = re.compile(r'(\d\d\d\d\d\d\d\d)')
moAccountNumber = accountNumber.search(pageObj.extractText())
moMeterNumber = meterNumber.search(pageObj.extractText())
print('Account Number: '+moAccountNumber.group())
print('Meter Number: '+moMeterNumber.group(1))
Thanks very much!
Another option is glob:
import glob
files = glob.glob("c:/mydirectory/*.pdf")
for file in files:
    print(file)  # do your processing of file here
You need to ensure everything past the colon is properly indented.
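As a rough sketch of how the glob loop and the regular expressions from the question could fit together: the function names below are hypothetical, the patterns are the ones from the question written with `{5}` and `{8}` repeat counts, and the PDF reading assumes the pypdf package (the successor to PyPDF2) is installed. Missing matches are returned as None instead of raising, since not every invoice is guaranteed to contain both numbers.

```python
import glob
import re

ACCOUNT_RE = re.compile(r"\d{5}-\d{5}")  # same pattern as in the question
METER_RE = re.compile(r"\d{8}")


def parse_invoice_text(text):
    """Return (account_number, meter_number); either may be None if not found."""
    account = ACCOUNT_RE.search(text)
    meter = METER_RE.search(text)
    return (
        account.group() if account else None,
        meter.group() if meter else None,
    )


def scrape_invoices(pattern):
    """Apply parse_invoice_text to page 1 of every file matching pattern."""
    # Imported here so the parsing helper can be tried without pypdf installed.
    from pypdf import PdfReader

    results = {}
    for path in sorted(glob.glob(pattern)):
        text = PdfReader(path).pages[0].extract_text() or ""
        results[path] = parse_invoice_text(text)
    return results
```

A call such as `scrape_invoices(r'C:\Users\notylerhere\Desktop\Test Invoices\*.pdf')` would return a dictionary mapping each invoice path to its two numbers.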
You want to iterate over your directory and deal with every file independently.
There are many functions depending on your use case. os.walk is a good place to start.
Example:
import os
for root, directories, files in os.walk('.'):
    for file in files:
        if file.endswith('.pdf'):
            # os.walk yields bare file names, so join them with root
            openAndDoStuff(os.path.join(root, file))
import os
import PyPDF2
for el in os.listdir(os.getcwd()):
    if el.endswith(".pdf"):
        # PDFs must be opened in binary mode
        pdf_reader = PyPDF2.PdfFileReader(open(os.path.join(os.getcwd(), el), "rb"))
I have a folder where I want to read all the text files from and put them into a corpus, however I am only able to do it with .txt files. How can I expand the code below to read in .pdf, .htm and .txt files?
corpus_raw = u""
for file_name in file_names:
    with codecs.open(file_name, "r", "utf-8") as file_name:
        corpus_raw += file_name.read()
    print("Document is {0} characters long".format(len(corpus_raw)))
    print()
For example:
with open('/data/text_file.txt', "r", encoding="utf-8") as f:
    print(f.read())
reads in data that I can view in a notebook, whereas
with open('/data/text_file.pdf', "r", encoding="utf-8") as f:
    print(f.read())
reads nothing.
There are two types of files: binary files and plain-text files. A file can be one or the other, or sometimes a mix of both.
HTML files are plain-text, human-readable files that you can edit by hand, but PDF files mix binary and text content, so you need special programs to edit them.
Reading from PDF or HTML is possible. I wasn't sure if you meant to extract the text or the source code, so I'll explain both.
Extracting Text
Extracting text from html files is fairly easy: read the file as text and strip the markup with an HTML parser, such as the standard library's html.parser module. For more info, refer to the answers here: Extracting text from HTML file using Python
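For the HTML case, here is a minimal sketch that uses only the standard library's html.parser module. The class and function names are my own, and the choice to skip script and style content is an assumption about what "text" should mean for a corpus.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, ignoring script/style."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The result of `html_to_text(...)` on the contents of an .htm file could then be appended to corpus_raw in the same way as the contents of a .txt file.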
For pdf files, you can use a python module called PyPDF2. Download it using pip:
$ pip install PyPDF2
and get started.
Here is an example of a simple program I found on the internet:
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Extracting Source Code
Extracting source code is best done using python's open function as you did above.
For html files, you can do what you did with text files:
file = open("c:\\path\\to\\file")
print(file.read())
file.close()
For pdf files, you do pretty much the same, but pass the binary read mode "rb" as the second argument to open, since a PDF is not plain text. You get raw bytes back rather than readable text. For more info, visit the sites in the More Info section.
file = open("c:\\path\\to\\file.extension", "rb")  # "rb" opens the file for reading in binary mode
print(file.readable())  # True: the file can be read, but only as bytes
data = file.read()
print(data[:100])  # the first 100 raw bytes, starting with b'%PDF-' for a PDF
file.close()
More Info
https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
https://www.programiz.com/python-programming/methods/built-in/open
This should work for htm/html files with no problem - they are basically just text files. Above, I only see that reading in .pdf has failed - was there a problem with .htm?
Also, reading in a .pdf may be much more difficult/involved than you think. A pdf contains a lot more information than just plaintext, and cannot be meaningfully edited in, say, notepad. As an example of what I mean, here's a small sample of what I got when I opened a .pdf in notepad:
%PDF-1.7
%âãÏÓ
1758 0 obj
<</Filter/FlateDecode/First 401/Length 908/N 51/Type/ObjStm>>stream
hÞ”ØQk\7à¿2ÍK,i4
Cã(Á”¾•–öâ.Ýn‚w]òó3rm˜Ÿ =ÄÜÝèΑ®?ÉÍ…e¦ê?Å/2e¥ÂJÙˆ+SÉT«ù7$"T„ZËT”´ù2£®L~©¯fÊ©±É–iÌ(¦ÄF¹&OðÑ’Œ|hnžU}Žñ¾®ûDOÉæCÄç'¿IF¸/±Å¿”±/ÿ!¾›Ú˜Æ>¤ùeiêóuÚ3õ®äU̘Է’Ìhì´$!_Êœ3©oúaÇÖÅÏç·rGòuê‡Gé¾é>Žà›ì¾õä›ò£Õì›ðѵx¨ùQXÇ3ð'åC=ªJÃ6óç:¯Öý—ZòóúI¹ù…Ÿ3—ñ$<Éw‘èÍ›«›/dz/¸z¿¿?Ço'ÑoW¿îÆõX矮¯}Ý»ítþ#?~ö¥ç_ü”×éÓÕÇíÛyü6Ç÷·»û͇åòøé÷ýù°ýôöá´?n§}8ž·Ãa·ÿÜ>ßÞo‡ý¿§Wat£õ…Ñ~ûÏ[ýQÌÍß»¯çížRŽI
$L’ù¤“úËI%Ã$OâTHb˜dóI5&$(éé´SI“€ˆE”-&Š("4&E”=$1ÁPDYa1 ˆ`(‚çEä“€†"x^DŽÁ#C</"ÇŽ` ¢B</"ÇŽ¨#D…"x^DŽQˆ
EÔ±#*Q¡ˆº "vD"*QDÄŽ¨#„#uADì"Š¨"bG!P„Ì‹(±#ˆ(BæE”ØD!ó"Jì"!ó"JìˆD4(BæE”Ø
ˆhPD[;¢
Šh"bG4 ¢AmADìˆD(ÑDÄŽP B¡ˆ¶ "v„
E輎¡#„B:/‚cG(¡P„΋àØ
Dt(BçEpìˆDt(BçEpìˆDt(¢/ˆˆÑˆEô±#:Ñ¡ˆ¾ "vD"Šè"bGaPD_;€ƒ"l^Da#„A6/¢ÆŽ0  ›QcG1Þ¡¨y5–DN eA6¢Ö‹¬‚² ‹ç#O…ÉEzQ•ð›ª´#£]„¡wU ¿¬J:ô"ñPüŸÑçSÿ(íÃñ¯íÛÿA?û°§7¿8ìBÀawü‡nww›ßû]€ %“xw
endstream
endobj
1759 0 obj
<</Filter/FlateDecode/First 1907/Length 3450/N 200/Type/ObjStm>>stream
There are, however, options. I would suggest reading the page at https://www.geeksforgeeks.org/working-with-pdf-files-in-python/ as a starting point.
Currently I get all my reports delivered to me via email, attached as PDFs. I have set Outlook to automatically download those files to a certain directory every day. Sometimes those PDFs don't have any data in them and only contain the line "There is no data to present that matches the selection criteria". I would like to create a Python program that iterates through every PDF file in that directory, opens it and looks for those words. If a file contains that phrase, it should delete that particular PDF; if not, it should do nothing. With help from Reddit I have pieced together the code below:
import PyPDF2
import os
directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open("{}/{}".format(directory,file), 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)
I have tested with 3 files, one containing the matching phrase. No matter how the files are named or what order they are in, it fails. I have also tested it with a single file in the directory, named 3.pdf. Below is the error I get.
FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'
This would reduce my workload dramatically and be a great learning example for me, the newbie. All help/criticism welcome.
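The error most likely comes from os.remove(file): os.listdir returns bare file names, so the full path has to be built with os.path.join and used for both open and os.remove. The file is also removed after the with block, because Windows will not delete a file that is still open. See below:
import PyPDF2
import os
directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    path = os.path.join(directory, file)  # Changes here
    with open(path, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        has_no_data = "There is no data to present that matches the selection criteria" in pageObj.extractText()
    if has_no_data:  # the file is closed at this point
        print("{} was removed.".format(file))
        os.remove(path)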
I have 1000+ PDF files that need to be merged into one PDF:
from PyPDF2 import PdfReader, PdfWriter
writer = PdfWriter()
for i in range(1000):
    filepath = f"my/pdfs/{i}.pdf"
    reader = PdfReader(open(filepath, "rb"))
    for page in reader.pages:
        writer.add_page(page)
with open("document-output.pdf", "wb") as fh:
    writer.write(fh)
When I execute the above code, it fails at reader = PdfReader(open(filepath, "rb")) with the error message:
IOError: [Errno 24] Too many open files:
I think this is a bug. If not, what should I do?
I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.
Note: I am assuming that filename is a well-formed file path string. Assume the same for all of my code
The Short Answer
Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:
from PyPDF2 import PdfFileMerger, PdfFileReader
[...]
merger = PdfFileMerger()
for filename in filenames:
    merger.append(PdfFileReader(file(filename, 'rb')))
merger.write("document-output.pdf")
The Long Answer
The way you're using PdfFileReader and PdfFileWriter is keeping each file open, and eventually causing Python to generate IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding references to the page in the open PdfFileReader (hence the noted IO Error if you close the file). Python detects the file to still be referenced and doesn't do any garbage collection / automatic file closing despite re-using the file handle. They remain open until PdfFileWriter no longer needs access to them, which is at output.write(outputStream) in your code.
To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it instead. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only created copies in certain conditions.
My initial attempts looked like the following, and were resulting in the same IO Problems:
merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)
merger.write(output_file_path)
Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):
if type(fileobj) in (str, unicode):
    fileobj = file(fileobj, 'rb')
    my_file = True
elif type(fileobj) == file:
    fileobj.seek(0)
    filecontent = fileobj.read()
    fileobj = StringIO(filecontent)
    my_file = True
elif type(fileobj) == PdfFileReader:
    orig_tell = fileobj.stream.tell()
    fileobj.stream.seek(0)
    filecontent = StringIO(fileobj.stream.read())
    fileobj.stream.seek(orig_tell)
    fileobj = filecontent
    my_file = True
We can see that the append() option does accept a string, and when doing so, assumes it's a file path and creates a file object at that location. The end result is the exact same thing we're trying to avoid. A PdfFileReader() object holding open a file until the file is eventually written!
However, if we either make a file object of the file path string or a PdfFileReader(see Edit 2) object of the path string before it gets passed into append(), it will automatically create a copy for us as a StringIO object, allowing Python to close the file.
I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().
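The same idea, copying each file's content into memory so the on-disk handle can be closed right away, can be sketched for current Python 3 roughly as follows. This is not the PyPDF2 code discussed above: it assumes the pypdf package (the successor to PyPDF2) is installed, uses io.BytesIO where the old source used StringIO, and the function names are made up for illustration.

```python
import io


def read_into_memory(path):
    """Return the file's bytes wrapped in BytesIO; the file itself is closed on return."""
    with open(path, "rb") as fh:
        return io.BytesIO(fh.read())


def merge_pdfs(paths, output_path):
    """Concatenate the PDFs in paths into output_path, one open file at a time."""
    # Imported here so read_into_memory can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in paths:
        # The reader works on the in-memory copy, so no file handle stays open.
        reader = PdfReader(read_into_memory(path))
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, "wb") as out:
        writer.write(out)
```

The trade-off is memory: every input stays in RAM until the output is written, which is usually acceptable for a thousand ordinary documents but worth keeping in mind for very large ones.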
Hope this helped!
EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained with the author giving his official blessings to Phaseit in developing PyPDF2.
If for some reason you cannot swap to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks #Agostino).
The pdfrw package reads each file all in one go, so will not suffer from the problem of too many open files. Here is an example concatenation script.
The relevant part -- assumes inputs is a list of input filenames, and outfn is an output file name:
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
Disclaimer: I am the primary pdfrw author.
The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need this.
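For reference, the limit mentioned above can be inspected with the resource module linked there. This is a small sketch for Unix-like systems only (the resource module is not available on Windows), and the function name is my own.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The soft limit is the one that triggers "Too many open files". A value of resource.RLIM_INFINITY means no limit is set.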
What you could try is closing the files in the for loop:
output = PdfFileWriter()
for file in filenames:
    f = open(file, 'rb')
    reader = PdfFileReader(f)
    # Some code
    f.close()
I have written this code to help with the answer:-
import sys
import os
import PyPDF2
merger = PyPDF2.PdfFileMerger()
#get PDFs files and path
path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)
#iterate among the documents
for pdf in pdfs:
    try:
        # if the doc exists then merge it
        if os.path.exists(pdf):
            reader = PyPDF2.PdfFileReader(open(pdf, 'rb'))
            merger.append(reader)
            print(f" {pdf} Merged !!! ")
        else:
            print(f"problem with file {pdf}")
    except Exception as exc:
        print(f"cant merge !! sorry ({exc})")
merger.write("Merged_doc.pdf")
In this, I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader, instead of explicitly converting the file name to file object
It may be just what it says: you are opening too many files.
You can explicitly use f = open(filename) ... f.close() in the loop, or use the with statement, so that each opened file is properly closed.
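A small sketch of the with-statement variant, using only the standard library: each file is closed as soon as its block ends, so at most one file is open at any time however long the list is. The check on the leading `%PDF-` bytes and the function names are just an illustrative stand-in for the real per-file work.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as fh:  # closed automatically when the block ends
        return fh.read(5) == b"%PDF-"


def count_pdfs(paths):
    """Count how many of the given files look like PDFs."""
    return sum(1 for path in paths if looks_like_pdf(path))
```

Note that with PyPDF2's writer this only helps if the reader no longer needs the file after the block, which is the issue the longer answer above works around by copying the content into memory.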