How to read this pdf form using PyPDF2 in python

How to read this pdf form using PyPDF2 in python - python

https://www.fda.gov/downloads/AboutFDA/ReportsManualsForms/Forms/UCM074728.pdf
I'm trying to read this pdf using PyPDF2 or Pdfminer, but it is saying that the File has not been decrypted in Pypdf2 and in pdfminer, it is saying that it can decompress that pdf. Somebody let me know how to do this in a python3 windows environment. I can't use poppler as I cant install poppler in this windows.

This is a restricted PDF file. In most cases you can decrypt a file that doesn't prompt you for a password using PyPDF2 with an empty string:
from PyPDF2 import PdfFileReader
reader = PdfFileReader('sample.pdf')
reader.decrypt('')
Unfortunately, it's not the case of your file or any other with 128-bit AES encryption level which is unsupported for the PyPDF2 decrypt() method that will return a NotImplementedError.
As a simple workaround you can save this file as a new file in Adobe Reader or similar and the new file should work for your code.
Also, you can do it programmatically using qpdfas discussed in this GitHub issue:
import os, shutil, tempdir
from subprocess import check_call
try:
tempdir = tempfile.mkdtemp(dir=os.path.dirname(filename))
temp_out = os.path.join(tempdir, 'qpdf_out.pdf')
check_call(['qpdf', "--password=", '--decrypt', filename, temp_out])
shutil.move(temp_out, filename)
print 'File Decrypted'
finally:
shutil.rmtree(tempdir)

Related

How to use PyPDF2 in a script?

import PyPDF2
from PyDF2 import PdfFileReader, PdfFileWriter
file_path="sample.pdf"
pdf = PdfFileReader(file_path)
with open("sample.pdf", "w") as f:'
for page_num in range(pdf.numPages):
pageObj = pdf.getPage(page_num)
try:
txt = pageObj.extractText()
txt = DocumentInformation.author
except:
pass
else:
f.write(txt)
f.close()
Error Received:
ModuleNotFoundError: No module named 'PyPDF2'
Writing my first ever script where I want to scan in a PDF then extract the text and write it to a txt file. I was trying to use pyPDF2 but I'm not sure how to use it in a script like this.
EDIT: I had success importing the os & sys like so.
import os
import sys

There are multiple issues:
from PyDF2 import ...: A typo. You meant PyPDF2 instead of PyDF2
PdfFileWriter was imported, but never used (side-note: It's PdfReader and PdfWriter in the latest version of PyPDF2)
with open("sample.pdf", "w") as f:': A syntax error
Lacking indentation of the next lines
Side-note: Did you know that you can simply write for page in pdf.pages?
DocumentInformation.author is wrong. I guess you meant pdf.metadata.author
You overwrite the txt variable - I don't understand why you don't use it before you re-assign it.
Maybe this is what you want:
from PyPDF2 import PdfReader
def get_text(pdf_file_path: str) -> str:
text = ""
reader = PdfReader(pdf_file_path)
for page in reader.pages:
text += page.extract_text()
return text
text = get_text("example.pdf")
with open("example.txt", "w") as f:
f.write(text)
Installation issues
In case you have installation issues, maybe the docs on installing PyPDF2 can help you?
If you execute your script in the console as python your_script_name.py you might want to check the output of
python -c "import PyPDF2; print(PyPDF2.__version__)"
That should show your PyPDF2 version. If it doesn't, it the Python environment you're using doesn't have PyPDF2 installed. Please note that your system might have arbitrary many Python environments.

Creating view in browser functionality with python

I have been struggling with this problem for a while but can't seem to find a solution for it. The situation is that I need to open a file in browser and after the user closes the file the file is removed from their machine. All I have is the binary data for that file. If it matters, the binary data comes from Google Storage using the download_as_string method.
After doing some research I found that the tempfile module would suit my needs, but I can't get the tempfile to open in browser because the file only exists in memory and not on the disk. Any suggestions on how to solve this?
This is my code so far:
import tempfile
import webbrowser
# grabbing binary data earlier on
temp = tempfile.NamedTemporaryFile()
temp.name = "example.pdf"
temp.write(binary_data_obj)
temp.close()
webbrowser.open('file://' + os.path.realpath(temp.name))
When this is run, my computer gives me an error that says that the file cannot be opened since it is empty. I am on a Mac and am using Chrome if that is relevant.

You could try using a temporary directory instead:
import os
import tempfile
import webbrowser
# I used an existing pdf I had laying around as sample data
with open('c.pdf', 'rb') as fh:
data = fh.read()
# Gives a temporary directory you have write permissions to.
# The directory and files within will be deleted when the with context exits.
with tempfile.TemporaryDirectory() as temp_dir:
temp_file_path = os.path.join(temp_dir, 'example.pdf')
# write a normal file within the temp directory
with open(temp_file_path, 'wb+') as fh:
fh.write(data)
webbrowser.open('file://' + temp_file_path)
This worked for me on Mac OS.

How do I know my file is attached in my PDF using PyPDF2?

I am trying to attach an .exe file into a PDF using PyPDF2.
I ran the code below, but my PDF file is still the same size.
I don't know if my file was attached or not.
from PyPDF2 import PdfFileWriter, PdfFileReader
writer = PdfFileWriter()
reader = PdfFileReader("doc1.pdf")
# check it's whether work or not
print("doc1 has %d pages" % reader.getNumPages())
writer.addAttachment("doc1.pdf", "client.exe")
What am I doing wrong?

First of all, you have to use the PdfFileWriter class properly.
You can use appendPagesFromReader to copy pages from the source PDF ("doc1.pdf") to the output PDF (ex. "out.pdf"). Then, for addAttachment, the 1st parameter is the filename of the file to attach and the 2nd parameter is the attachment data (it's not clear from the docs, but it has to be a bytes-like sequence). To get the attachment data, you can open the .exe file in binary mode, then read() it. Finally, you need to use write to actually save the PdfFileWriter object to an actual PDF file.
Here is a more working example:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
writer.write(f)
Next, to check if attaching was successful, you can use os.stat.st_size to compare the file size (in bytes) before and after attaching the .exe file.
Here is the same example with checking for file sizes:
(I'm using Python 3.6+ for f-strings)
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("doc1.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
with open("client.exe", "rb") as exe:
writer.addAttachment("client.exe", exe.read())
with open("out.pdf", "wb") as f:
writer.write(f)
# Check result
print(f"size of SOURCE: {os.stat('doc1.pdf').st_size}")
print(f"size of EXE: {os.stat('client.exe').st_size}")
print(f"size of OUTPUT: {os.stat('out.pdf').st_size}")
The above code prints out
size of SOURCE: 42942
size of EXE: 989744
size of OUTPUT: 1031773
...which sort of shows that the .exe file was added to the PDF.
Of course, you can manually check it by opening the PDF in Adobe Reader:
As a side note, I am not sure what you want to do with attaching exe files to PDF, but it seems you can attach them but Adobe treats them as security risks and may not be possible to be opened. You can use the same code above to attach another PDF file (or other documents) instead of an executable file, and it should still work.

Python: Crossplatform code to download a valid .zip file

I have a requirement to download and unzip a file from a website. Here is the code I'm using:
#!/usr/bin/python
#geoipFolder = r'/my/folder/path/ ' #Mac/Linux folder path
geoipFolder = r'D:\my\folder\path\ ' #Windows folder path
geoipFolder = geoipFolder[:-1] #workaround for Windows escaping trailing quote
geoipName = 'GeoIPCountryWhois'
geoipURL = 'http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip'
import urllib2
response = urllib2.urlopen(geoipURL)
f = open('%s.zip' % (geoipFolder+geoipName),"w")
f.write(repr(response.read()))
f.close()
import zipfile
zip = zipfile.ZipFile(r'%s.zip' % (geoipFolder+geoipName))
zip.extractall(r'%s' % geoipFolder)
This code works on Mac and Linux boxes, but not on Windows. There, the .zip file is written, but the script throws this error:
zipfile.BadZipfile: File is not a zip file
I can't unzip the file using Windows Explorer either. It says that:
The compressed (zipped) folder is empty.
However the file on disk is 6MB large.
Thoughts on what I'm doing wrong on Windows?
Thanks

Your zipfile is corrupt on windows because you're opening the file in write/text mode (line-terminator conversion trashes binary data):
f = open('%s.zip' % (geoipFolder+geoipName),"w")
You have to open in write/binary mode like this:
f = open('%s.zip' % (geoipFolder+geoipName),"wb")
(will still work on Linux of course)
To sum it up, a more pythonic way of doing it, using a with block (and remove repr):
with open('{}{}.zip'.format(geoipFolder,geoipName),"wb") as f:
f.write(response.read())
EDIT: no need to write a file to disk, you can use io.BytesIO, since the ZipFile object accepts a file handle as first parameter.
import io
import zipfile
with open('{}{}.zip'.format(geoipFolder,geoipName),"wb") as f:
outbuf = io.BytesIO(f.read())
zip = zipfile.ZipFile(outbuf) # pass the fake-file handle: no disk write, no temp file
zip.extractall(r'%s' % geoipFolder)

How to open a password protected excel file using python?

I looked at the previous threads regarding this topic, but they have not helped solve the problem.
how to read password protected excel in python
How to open write reserved excel file in python with win32com?
I'm trying to open a password protected file in excel without any user interaction. I searched online, and found this code which uses win32com.client
When I run this, I still get the prompt to enter the password...
from xlrd import *
import win32com.client
import csv
import sys
xlApp = win32com.client.Dispatch("Excel.Application")
print "Excel library version:", xlApp.Version
filename,password = r"\\HRA\Myfile.xlsx", 'caa team'
xlwb = xlApp.Workbooks.Open(filename, Password=password)

I don't think that named parameters work in this case. So you'd have to do something like:
xlwb = xlApp.Workbooks.Open(filename, False, True, None, password)
See http://msdn.microsoft.com/en-us/library/office/ff194819.aspx for details on the Workbooks.Open method.

I recently discovered a Python library that makes this task simple.
It does not require Excel to be installed and, because it's pure Python, it's cross-platform too!
msoffcrypto-tool supports password-protected (encrypted) Microsoft Office documents, including the older XLS binary file format.
Install msoffcrypto-tool:
pip install msoffcrypto-tool
You could create an unencrypted version of the workbook from the command line:
msoffcrypto-tool Myfile.xlsx Myfile-decrypted.xlsx -p "caa team"
Or, you could use msoffcrypto-tool as a library. While you could write an unencrypted version to disk like above, you may prefer to create an decrypted in-memory file and pass this to your Python Excel library (openpyxl, xlrd, etc.).
import io
import msoffcrypto
import openpyxl
decrypted_workbook = io.BytesIO()
with open('Myfile.xlsx', 'rb') as file:
office_file = msoffcrypto.OfficeFile(file)
office_file.load_key(password='caa team')
office_file.decrypt(decrypted_workbook)
# `filename` can also be a file-like object.
workbook = openpyxl.load_workbook(filename=decrypted_workbook)

If your file size is small, you can probably save that as ".csv".
and then read
It worked for me :)

Openpyxl Package works if you are using linux system. You can use secure the file by setting up a password and open the file using the same password.
For more info:
https://www.quora.com/How-do-I-open-read-password-protected-xls-or-xlsx-Excel-file-using-python-in-Linux

Thank you so much for the great answers on this topic. Trying to collate all of it. My requirement was to open a bunch of password protected excel files ( all had same password ) so that I could do some more processing on those. Please find the code below.
import pandas as pd
import os
from xlrd import *
import win32com.client as w3c
import csv
import sys
from tempfile import NamedTemporaryFile
df_list=[]
# print(len(files))
for f in files:
# print(f)
if('.xlsx' in f):
xlwb = xlapp.Workbooks.Open('C:\\users\\files\\'+f, False, True, None, 'password')
temp_f = NamedTemporaryFile(delete=False, suffix='.csv')
temp_f.close()
os.unlink(temp_f.name)
xlwb.SaveAs(Filename=temp_f.name, FileFormat=xlCSVWindows)
df = pd.read_csv(temp_f.name,encoding='Latin-1') # Read that CSV from Pandas
df.to_excel('C:\\users\\files\\password_removed\\'+f)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to read this pdf form using PyPDF2 in python - python

Related

How to use PyPDF2 in a script?

Creating view in browser functionality with python

How do I know my file is attached in my PDF using PyPDF2?

Python: Crossplatform code to download a valid .zip file

How to open a password protected excel file using python?

Categories

Resources