def EncryptPDFFiles(password, directory):
pdfFiles = []
success = 0
# Get all PDF files from a directory
for folderName, subFolders, fileNames in os.walk(directory):
for fileName in fileNames:
if (fileName.endswith(".pdf")):
pdfFiles.append(os.path.join(folderName, fileName))
print("%s PDF documents found." % str(len(pdfFiles)))
# Create an encrypted version for each document
for pdf in pdfFiles:
# Copy old PDF into a new PDF object
pdfFile = open(pdf,"rb")
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()
for pageNum in range(pdfReader.numPages):
pdfWriter.addPage(pdfReader.getPage(pageNum))
pdfFile.close()
# Encrypt the new PDF and save it
saveName = pdf.replace(".pdf",ENCRYPTION_TAG)
pdfWriter.encrypt(password)
newFile = open(saveName, "wb")
pdfWriter.write(newFile)
newFile.close()
print("%s saved to: %s" % (pdf, saveName))
# Verify the the encrypted PDF encrypted properly
encryptedPdfFile = open(saveName,"rb")
encryptedPdfReader = PyPDF2.PdfFileReader(encryptedPdfFile)
canDecrypt = encryptedPdfReader.decrypt(password)
encryptedPdfFile.close()
if (canDecrypt):
print("%s successfully encrypted." % (pdf))
send2trash.send2trash(pdf)
success += 1
print("%s of %s successfully encrypted." % (str(success),str(len(pdfFiles))))
I am following along with Pythons Automate the Boring Stuff section. I've had off and on issues when doing the copy for a PDF document but as of right now everytime I run the program my copied PDF is all blank pages. There are the correct amount of pages of my newly encrypted PDF but they are all blank (no content on the pages). I've had this happen before but was not able to recreate. I've tried throwing in a sleep before closing my files. I'm not sure what the best practice for opening and closing files are in Python. For reference I'm using Python3.
Try moving the pdfFile.close to the very end of your for loop.
for pdf in pdfFiles:
#
# {stuff}
#
if (canDecrypt):
print("%s successfully encrypted." % (pdf))
send2trash.send2trash(pdf)
success += 1
pdfFile.close()
The thought is that the pdfFile needs to be available and open when the pdfWriter finally writes out, otherwise it cannot access the pages to write the new file.
The issue with getting a blank page even after adding a page to your pdf with writer.addPage(your_page_name) is the context manager.
You have to make sure that you're not closing the pdf from which you're reading the page.
For Example:
with open(str(_pdf), "rb") as in_f:
reader = PdfFileReader(in_f)
_page = reader.getPage(0)
writer = PdfFileWriter()
writer.addPage(_page)
with open(_filename, "wb+") as out_f:
writer.write(out_f)
This will NOT WORK since the file handle is being closed by the context manager. The file has to be open So we would have to indent it. Like the following:
with open(str(_pdf), "rb") as in_f:
reader = PdfFileReader(in_f)
_page = reader.getPage(0)
writer = PdfFileWriter()
writer.addPage(_page)
with open(_filename, "wb+") as out_f:
writer.write(out_f)
I know it's not a big deal but this literally made me pull out my hair, indentation wasted my 6 hours. That's why I thought I should write an answer for others
Related
I have a single PDF that I would like to create different PDFs for each of its pages. How would I be able to so without downloading anything locally? I know that Document AI has a file splitting module (which would actually identify different files.. that would be most ideal) but that is not available publicly.
I am using PyPDF2 to do this curretly
list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))
individual_files = []
stream = io.StringIO()
for i in range(inputpdf.numPages):
output = PdfFileWriter()
output.addPage(inputpdf.getPage(i))
individual_files.append(output)
with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
outputStream.write(stream.getvalue())
#print(outputStream.read())
with open(outputStream.name, 'rb') as f:
data = f.seek(85)
data = f.read()
individual_files.append(data)
bucket.blob('processed/' + "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
In the output, I see different PyPDF2 objects such as
<PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0> but I have no idea how I should proceed next. I am also open to using other libraries if those work better.
There were two reasons why my program was not working:
I was trying to read a file in append mode (I fixed this by moving the second with(open) block outside of the first one,
I should have been writing bytes (I fixed this by changing the open mode to 'wb' instead of 'a')
Below is the corrected code:
if inputpdf.numPages > 2:
for i in range(inputpdf.numPages):
output = PdfFileWriter()
output.addPage(inputpdf.getPage(i))
with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
output.write(outputStream)
with open(outputStream.name, 'rb') as f:
data = f.seek(0)
data = f.read()
#print(data)
bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
stream.truncate(0)
To split a PDF file in several small file (page), you need to download the data for that. You can materialize the data in a file (in the writable directory /tmp) or simply keep them in memory in a python variable.
In both cases:
The data will reside in memory
You need to get the data to perform the PDF split.
If you absolutely want to read the data in streaming (I don't know if it's possible with PDF format!!), you can use the streaming feature of GCS. But, because there isn't CRC on the downloaded data, I won't recommend you this solution, except if you are ready to handle corrupted data, retries and all related stuff.
I tried to print pages of a pdf document:
import PyPDF2
FILE_PATH = 'my.pdf'
with open(FILE_PATH, mode='rb') as f:
reader = PyPDF2.PdfFileReader(f)
page = reader.getPage(0) # I tried also other pages e.g 1,2,..
print(page.extractText())
But I only get a lot of blank space and no error message. Could it be that this pdf version (my.pdf) is not supported by PyPDF2?
This solved it (prints all pages of the document). Thanks
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
for i in range(1,16): # need range from 1 - max number of pages +1
viewer.navigate(i)
viewer.render()
page_1_content=viewer.canvas.text_content
page_1_text = "".join(viewer.canvas.strings)
print (page_1_text)
Try pdfreader
from pdfreader import SimplePDFViewer
fd = open("my.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.render()
page_0_content=viewer.canvas.text_content
page_0_text = "".join(viewer.canvas.strings)
If it's blank, either the PDF is being read and it's format can't be read by pypdf so it just outputs blank. Maybe put in the absolute filepath instead of relative filepath. If all else fails, try with different PDFs , and if there is a version that does work and yours doesn't, you might need to convert yours to that working type.
The following code adds a watermark on every page and also adds metadata (or better should do).
The watermarking works perfectly fine, but there is no metadata in the output document and no error
from PyPDF2 import PdfFileReader as PdfReader, PdfFileWriter as PdfWriter
nf = []
sources = ["example.pdf", "example2.pdf"]
for i in sources:
# new pdf file name
new_file_name = i + " " + row["name"] + ".pdf"
nf.append(new_file_name)
print(f"Registering {i} watermarked version for {row['name']}.")
reader = PdfReader(i)
writer = PdfWriter()
# adding watermark to each page
for page in reader.pages:
# creating watermarked page object
wmpageObj = add_watermark(packet, page)
# adding watermarked page object to pdf writer
writer.addPage(wmpageObj)
# Write Metadate
writer.addMetadata({"/Registered to": row["name"]})
writer.addMetadata({"/ATC": "ACME Inc."})
# writing watermarked pages to new file
with open(new_file_name, "wb") as newFile:
writer.write(newFile)
Adding print(pdfWriter._info) before and after adding the metadata gives me only:
IndirectObject(2, 0)
IndirectObject(2, 0)
Also interesting: I tried Adobe Acrobat Reader DC on Mac and Windows and it's not possible to show the metadata of the output file (the window just won't open), but works fine with the source file, i.e. before adding watermark and metadata.
I have a large directory with PDF files (images), how can I extract efficiently the text from all the files inside the directory?. So far I tried to:
import multiprocessing
import textract
def extract_txt(file_path):
text = textract.process(file_path, method='tesseract')
p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working... it takes a lot of time (I have some documents that have 600 pages). Additionally: a) I do not know how to handle efficiently the directory transformation part. b) I would like to add a page separator, let's say: <start/age = 1> ... page content ... <end/page = 1>, but I have no idea of how to do this.
Thus, how can I apply the extract_txt function to all the elements of a directory that end with .pdf and return the same files in another directory but in a .txt format, and add a page separator with OCR text extraction?.
Also, I was curios about using google docs to make this task, is it possible to programmatically use google docs to solve the aforementioned text extracting problem?.
UPDATE
Regarding the "adding a page separator" issue (<start/age = 1> ... page content ... <end/page = 1>) after reading Roland Smith's answer I tried to:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
inputpdf = PdfFileReader(open(pdf_file, "rb"))
for i in range(inputpdf.numPages):
w = PdfFileWriter()
w.addPage(inputpdf.getPage(i))
outfname = 'page{:03d}.pdf'.format(i)
with open(outfname, 'wb') as outfile: # I presume you need `wb`.
w.write(outfile)
print('\n<begin page pos =' , i, '>\n')
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
print(str(text, 'utf8'))
print('\n<end page pos =' , i, '>\n')
extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part, since instead of printing, it would be more useful to save into a file all the output. Thus, I tried to redirect the output to a a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea of how to make the page extraction/separator trick and saving everything into a file?...
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
text = textract.process(file_path, method='tesseract')
outfn = file_path[:-4] + '.txt' # assuming filenames end with '.pdf'
with open(outfn, 'wb') as output_file:
output_file.write(text)
return file_path
This writes the text to file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
Edit 2:
If you need a file, write a file:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
inputpdf = PdfFileReader(open(pdf_file, "rb"))
outfname = pdf_file[:-4] + '.txt' # Assuming PDF file name ends with ".pdf"
with open(outfname, 'w') as textfile:
for i in range(inputpdf.numPages):
w = PdfFileWriter()
w.addPage(inputpdf.getPage(i))
outfname = 'page{:03d}.pdf'.format(i)
with open(outfname, 'wb') as outfile: # I presume you need `wb`.
w.write(outfile)
print('page', i)
text = textract.process(outfname, method='tesseract')
# Add header and footer.
text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
# Write the OCR-ed text to the output file.
textfile.write(text)
os.remove(outfname) # clean up.
print(text)
So I've created a form that includes the following item
<input type="file" name="form_file" multiple/>
This tells the browser to allow the user to select multiple files while browsing. The problem I am having is is that when reading / writing the files that are being uploaded, I can only see the last of the files, not all of them. I was pretty sure I've seen this done before, but had no luck searching. Here's generally what my read looks like
if request.FILES:
filename = parent_id + str(random.randrange(0,100))
output_file = open(settings.PROJECT_PATH + "static/img/inventory/" + filename + ".jpg", "w")
output_file.write(request.FILES["form_file"].read())
output_file.close()
Now, as you can see I'm not looping through each file, because I've tried a few different ways and can't seem to find the other files (in objects and such)
I added in this print(request.FILES["form_file"]) and was only getting the last filename, as expected. Is there some trick to get to the other files? Am I stuck with a single file upload? Thanks!
Based on your file element form_file, the value in request.FILES['form_file'] should be a list of files. So you can do something like:
for upfile in request.FILES.getlist('form_file'):
filename = upfile.name
# instead of "filename" specify the full path and filename of your choice here
fd = open(filename, 'w')
fd.write(upfile['content'])
fd.close()
Using chunks:
for upfile in request.FILES.getlist('form_file'):
filename = upfile.name
fd = open(filename, 'w+') # or 'wb+' for binary file
for chunk in upfile.chunks():
fd.write(chunk)
fd.close()