how to combine pdf files into one from a tempfile in Python

I have Python code that creates temporary PDF files from Tableau views and sends each one to a Slack channel separately.
I want to combine them into one file, but I can't figure out how to do it.
I am fairly new to Python and would really appreciate some help with how to use PdfFileMerger in the code below.
I've tried calling
merger.append(f)
after the f variable is assigned, but it fails with the error AttributeError: 'dict' object has no attribute 'seek'.
What should I pass to append()?
for view_item in all_views:
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=True) as temp_file:
        #server.views.populate_image(view_item, req_options=image_req_option)
        server.views.populate_pdf(view_item, req_options=pdf_req_option)
        print('got the image')
        temp_file.write(view_item.pdf)
        temp_file.file.seek(0)
        print('in the beginning again')
        f = {'file': (temp_file.name, temp_file, 'pdf')}
        merger.append(f)
        response = requests.post(
            url='https://slack.com/api/files.upload',
            data={'token': bot_token, 'channels': slack_channels[0], 'media': f,
                  'title': '{} {}'.format(view_item.name, yesterday), 'initial_comment': ''},
            headers={'Accept': 'application/json'}, files=f)
        print('the image is in the channel')

You'll need to feed PdfFileMerger a file object, like so, not a dict. The dict is only the structure that requests expects for an upload.
Since PdfFileMerger does its work in memory anyway, there's no need to write temp files to disk; an in-memory io.BytesIO will do fine.
import io
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
for view_item in all_views:
    server.views.populate_pdf(view_item, req_options=pdf_req_option)
    # Write retrieved data into memory file
    tf = io.BytesIO()
    tf.write(view_item.pdf)
    tf.seek(0)
    # Add it to the merger
    merger.append(tf)
# Write merged data into memory file
temp_file = io.BytesIO()
merger.write(temp_file)
temp_file.seek(0)
f = {'file': ('merged.pdf', temp_file, 'pdf')}
# Slack stuff here...
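
The write-then-seek pattern in this answer can be tried without PyPDF2 or Tableau. The following is a minimal sketch using only the standard library; the helper name to_memory_file and the placeholder bytes are made up for illustration and are not part of the original code.

```python
import io

def to_memory_file(payload: bytes) -> io.BytesIO:
    """Hypothetical helper: wrap raw bytes in a file-like object rewound to the start."""
    buf = io.BytesIO()
    buf.write(payload)   # the position is now at the end of the data
    buf.seek(0)          # rewind so the next reader starts at the first byte
    return buf

# Stand-in for view_item.pdf; these bytes are not a real PDF document.
fake_pdf = b"%PDF-1.4 placeholder bytes"

# Without seek(0), a read starts at the current position (the end).
unrewound = io.BytesIO()
unrewound.write(fake_pdf)
leftover = unrewound.read()

# With seek(0), the whole payload is available to the reader.
rewound = to_memory_file(fake_pdf)
content = rewound.read()

# A dict has no seek(), which is what the AttributeError in the question reports.
dict_has_seek = hasattr({'file': rewound}, 'seek')
```

Here leftover is empty while content holds the full payload, which is why each buffer is rewound before it is handed to a consumer that reads from the current position.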

Related

Download a base64 Image data and save into memory

For a few days now I have been trying to work out how to use BytesIO for what I want to do.
Basically, I want to download a PDF that is base64 encoded. After downloading it, I want to save the file in memory rather than on disk, and then I need to upload the path of that file to an API. Does anyone know how I could do that?
This is what I have so far.
def string_io_save(document_string: str, attach_id: str, file_name: str, file_type: str):
    img_data = b64decode(document_string)
    path = attach_id
    out_stream = BytesIO()
    with open('imageToSave.pdf', 'wb') as f:
        f.write(img_data)
        bytesio = BytesIO(f.read())
    return bytesio

You can create a file-like BytesIO directly from the decoded bytes: BytesIO(img_data)
That makes your function much simpler:
from base64 import b64decode
from io import BytesIO
def string_io_save(document_string: str):
    img_data = b64decode(document_string)
    return BytesIO(img_data)
Now you can work with the return value of string_io_save as if it were a regular file (which is also very useful for testing):
fp = string_io_save("base64-encoded-string....")
foo = fp.read()
fp.close()
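
For a version that runs end to end, here is a small sketch that builds its own base64 payload instead of downloading one. The sample bytes are illustrative only, not a real PDF.

```python
import base64
import io

def string_io_save(document_string: str) -> io.BytesIO:
    """Decode a base64 string and return it as an in-memory binary file."""
    return io.BytesIO(base64.b64decode(document_string))

# Build a sample payload locally (made-up bytes standing in for a downloaded PDF).
original = b"%PDF-1.4 sample payload"
encoded = base64.b64encode(original).decode("ascii")

fp = string_io_save(encoded)
first_five = fp.read(5)      # file-style partial read
rest = fp.read()             # continues from where the last read stopped
whole = fp.getvalue()        # full contents, independent of the position
fp.close()
```

Note that a BytesIO has no path on disk, so this helps only when the receiving API accepts a file object or raw bytes rather than a filename.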

Merge 2 pdf files giving me an empty pdf

I am using the following standard code:
# importing required modules
import PyPDF2
def PDFmerge(pdfs, output):
    # creating pdf file merger object
    pdfMerger = PyPDF2.PdfFileMerger()
    # appending pdfs one by one
    for pdf in pdfs:
        with open(pdf, 'rb') as f:
            pdfMerger.append(f)
    # writing combined pdf to output pdf file
    with open(output, 'wb') as f:
        pdfMerger.write(f)
def main():
    # pdf files to merge
    pdfs = ['example.pdf', 'rotated_example.pdf']
    # output pdf file name
    output = 'combined_example.pdf'
    # calling pdf merge function
    PDFmerge(pdfs = pdfs, output = output)
if __name__ == "__main__":
    # calling the main function
    main()
But when I call this with my two PDF files (which just contain some text), it produces an empty PDF file. What could be causing this?

The problem is that you're closing the files before the write.
When you call pdfMerger.append, it doesn't actually read and process the whole file then; it only does so later, when you call pdfMerger.write. Since the files you've appended are closed, it reads no data from each of them, and therefore outputs an empty PDF.
This should actually raise an exception, which would have made the problem and the fix obvious. Apparently this is a bug introduced in version 1.26, and a fix is slated for the next version. Unfortunately, although the fix was implemented in July 2016, there has been no new release since May 2016. (See this issue.)
You could install directly off the github master (and hope there aren't any new bugs), or you could continue to wait for 1.27, or you could work around the bug. How? Simple: just keep the files open until the write is done:
import contextlib
import PyPDF2
with contextlib.ExitStack() as stack:
    pdfMerger = PyPDF2.PdfFileMerger()
    # ExitStack keeps every input file open until the block ends
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdfs]
    for f in files:
        pdfMerger.append(f)
    with open(output, 'wb') as f:
        pdfMerger.write(f)
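
The ExitStack pattern itself can be checked without PyPDF2. The sketch below is an illustration under stated assumptions: plain byte concatenation stands in for the PDF merge, and the function name concatenate, the file names, and their contents are all made up.

```python
import contextlib
import os
import tempfile

def concatenate(paths, output):
    """Join several files into one, keeping every input open until the write is done."""
    with contextlib.ExitStack() as stack:
        # enter_context registers each file so all are closed together on exit
        files = [stack.enter_context(open(p, 'rb')) for p in paths]
        all_open_during_write = not any(f.closed for f in files)
        with open(output, 'wb') as out:
            for f in files:
                out.write(f.read())
    all_closed_after = all(f.closed for f in files)
    return all_open_during_write, all_closed_after

# Illustrative input files with made-up contents.
workdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate([b"first ", b"second"]):
    path = os.path.join(workdir, "part{}.bin".format(i))
    with open(path, 'wb') as fh:
        fh.write(text)
    paths.append(path)

output = os.path.join(workdir, "combined.bin")
open_during, closed_after = concatenate(paths, output)
with open(output, 'rb') as fh:
    combined = fh.read()
```

The point of the design is that the number of inputs is not known in advance, so nested with statements are not an option, while ExitStack still guarantees every file is closed once the write has finished.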

The workaround that worked for me is to append a PdfFileReader instance instead of a file object.
from PyPDF2 import PdfFileMerger
from PyPDF2 import PdfFileReader
merger = PdfFileMerger()
for f in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
    # PdfFileReader opens the path itself; append()'s second positional
    # argument is a bookmark name, not a file mode, so no 'rb' here
    merger.append(PdfFileReader(f))
with open('finished_copy.pdf', 'wb') as new_file:
    merger.write(new_file)
Hope that helps!

Python: generate xlsx in memory and stream file download?

The following code creates the xlsx file first and then streams it as a download, but I'm wondering whether it is possible to send the xlsx data while it is being created. Imagine that a very large xlsx file needs to be generated: the user has to wait until it is finished before the download starts. What I'd like is to start the download in the user's browser right away and send the data over as it is generated. This seems trivial with a .csv file, but not with an xlsx file.
try:
    import cStringIO as StringIO
except ImportError:
    import StringIO
from django.http import HttpResponse
from xlsxwriter.workbook import Workbook
def your_view(request):
    # your view logic here
    # create a workbook in memory
    output = StringIO.StringIO()
    book = Workbook(output)
    sheet = book.add_worksheet('test')
    sheet.write(0, 0, 'Hello, world!')
    book.close()
    # construct response
    output.seek(0)
    response = HttpResponse(output.read(), mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
    response['Content-Disposition'] = "attachment; filename=test.xlsx"
    return response

Are you able to write tempfiles to disk while generating the XLSX?
If you are able to use tempfile you won't be memory bound, which is nice, but the download will still only start once the XLSX writer has finished assembling the document.
If you can't write tempfiles, you'll have to follow the example at http://xlsxwriter.readthedocs.org/en/latest/example_http_server.html and your code will unfortunately be completely memory bound.
Streaming CSV is very easy, on the other hand. Here is code we use to stream any iterator of rows in a CSV response:
import csv
import io
def csv_generator(data_generator):
    # csv.writer needs a text buffer on Python 3 (io.BytesIO only works here on Python 2)
    csvfile = io.StringIO()
    csvwriter = csv.writer(csvfile)
    def read_and_flush():
        # Return what has been written so far, then empty the buffer for the next row
        csvfile.seek(0)
        data = csvfile.read()
        csvfile.seek(0)
        csvfile.truncate()
        return data.encode('utf-8')
    for row in data_generator:
        csvwriter.writerow(row)
        yield read_and_flush()
def csv_stream_response(response, iterator, file_name="xxxx.csv"):
    response.content_type = 'text/csv'
    response.content_disposition = 'attachment;filename="' + file_name + '"'
    response.charset = 'utf8'
    response.content_encoding = 'utf8'
    response.app_iter = csv_generator(iterator)
    return response
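
To see what such a generator emits, here is a standalone sketch of the same row-by-row idea with no web framework involved. The function name csv_chunks and the sample rows are made up for illustration.

```python
import csv
import io

def csv_chunks(rows):
    """Yield one encoded CSV line per input row, reusing a single text buffer."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
        chunk = buf.getvalue()   # everything written since the last reset
        buf.seek(0)
        buf.truncate()           # empty the buffer so memory use stays flat
        yield chunk.encode("utf-8")

def sample_rows():
    # Illustrative data; in practice this could be a database cursor or similar.
    yield ["id", "name"]
    yield [1, "Ada"]
    yield [2, "Grace, Jr."]

chunks = list(csv_chunks(sample_rows()))
body = b"".join(chunks)
```

Each row becomes its own chunk as soon as it is produced, which is what lets a response start sending before the full data set exists.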

The xlsx format is a zip archive that contains several individual files, so you can't create it on the fly and send it out as it is being created.

Sending multiple .CSV files to .ZIP without storing to disk in Python

I'm working on a reporting application for my Django powered website. I want to run several reports and have each report generate a .csv file in memory that can be downloaded in batch as a .zip. I would like to do this without storing any files to disk. So far, to generate a single .csv file, I am following the common operation:
mem_file = StringIO.StringIO()
writer = csv.writer(mem_file)
writer.writerow(["My content", my_value])
mem_file.seek(0)
response = HttpResponse(mem_file, content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=my_file.csv'
This works fine, but only for a single, unzipped .csv. If I had, for example, a list of .csv files created with a StringIO stream:
firstFile = StringIO.StringIO()
# write some data to the file
secondFile = StringIO.StringIO()
# write some data to the file
thirdFile = StringIO.StringIO()
# write some data to the file
myFiles = [firstFile, secondFile, thirdFile]
How could I return a compressed file that contains all objects in myFiles and can be properly unzipped to reveal three .csv files?

zipfile is a standard library module that does exactly what you're looking for. For your use-case, the meat and potatoes is a method called "writestr" that takes a name of a file and the data contained within it that you'd like to zip.
In the code below, I've used a sequential naming scheme for the files when they're unzipped, but this can be switched to whatever you'd like.
import zipfile
import StringIO
zipped_file = StringIO.StringIO()
with zipfile.ZipFile(zipped_file, 'w') as zf:
    for i, csv_file in enumerate(myFiles):
        csv_file.seek(0)
        zf.writestr("{}.csv".format(i), csv_file.read())
zipped_file.seek(0)
If you want to future-proof your code (hint hint Python 3 hint hint), you might want to switch over to io.BytesIO instead of StringIO, since a zip archive is binary data. Another bonus is that with getvalue() no explicit seek is needed before reading each file, because it returns the whole buffer regardless of the current position (I haven't tested how Django's HttpResponse reads the stream, so I've left the final seek in there just in case).
import io
import zipfile
zipped_file = io.BytesIO()
with zipfile.ZipFile(zipped_file, 'w') as zf:
    for i, csv_file in enumerate(myFiles):
        zf.writestr("{}.csv".format(i), csv_file.getvalue())
zipped_file.seek(0)
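
As a rough end-to-end check of this approach on Python 3, the sketch below builds three small in-memory CSV files, zips them, and reads the archive back. The helper make_csv and the sample rows are assumptions for illustration, and no Django response is involved.

```python
import csv
import io
import zipfile

def make_csv(rows):
    """Write rows to an in-memory text file and return it."""
    mem_file = io.StringIO()
    csv.writer(mem_file).writerows(rows)
    return mem_file

# Made-up report data standing in for the question's myFiles list.
my_files = [
    make_csv([["report", "value"], ["a", 1]]),
    make_csv([["report", "value"], ["b", 2]]),
    make_csv([["report", "value"], ["c", 3]]),
]

zipped_file = io.BytesIO()
with zipfile.ZipFile(zipped_file, "w") as zf:
    for i, csv_file in enumerate(my_files):
        # getvalue() returns the whole buffer, so no seek is needed first
        zf.writestr("{}.csv".format(i), csv_file.getvalue())
zipped_file.seek(0)

# Read the archive back to check that it unzips into three .csv members.
with zipfile.ZipFile(zipped_file) as check:
    names = check.namelist()
    first = check.read("0.csv").decode("utf-8")
```

The archive itself must live in a binary buffer, while the individual CSV files can stay as text buffers because writestr accepts either str or bytes.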

The stdlib comes with the module zipfile, and the main class, ZipFile, accepts a file or file-like object:
import StringIO
from zipfile import ZipFile
temp_file = StringIO.StringIO()
zipped = ZipFile(temp_file, 'w')
# create temp csv_files = [(name1, data1), (name2, data2), ... ]
for name, data in csv_files:
    data.seek(0)
    zipped.writestr(name, data.read())
zipped.close()
temp_file.seek(0)
# etc. etc.
I'm not a user of StringIO so I may have the seek and read out of place, but hopefully you get the idea.

import zipfile
from StringIO import StringIO  # use io.BytesIO on Python 3
def zip_files(files):
    outfile = StringIO()
    with zipfile.ZipFile(outfile, 'w') as zf:
        for n, f in enumerate(files):
            zf.writestr("{}.csv".format(n), f.getvalue())
    return outfile.getvalue()
zipped_file = zip_files(myFiles)
response = HttpResponse(zipped_file, content_type='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename=my_file.zip'
StringIO has a getvalue method that returns the entire contents. To compress the archive, pass zipfile.ZIP_DEFLATED: zipfile.ZipFile(outfile, 'w', zipfile.ZIP_DEFLATED). The default compression is ZIP_STORED, which creates the zip file without compressing.
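
The difference between the two compression settings can be observed with a short sketch. The helper zip_size and the repetitive sample data are made up for illustration; actual savings depend entirely on the content being zipped.

```python
import io
import zipfile

def zip_size(data: bytes, compression: int) -> int:
    """Return the size in bytes of a one-member archive built with the given method."""
    outfile = io.BytesIO()
    with zipfile.ZipFile(outfile, "w", compression) as zf:
        zf.writestr("report.csv", data)
    return len(outfile.getvalue())

# Highly repetitive sample data (made up), which deflate compresses well.
data = b"col_a,col_b,col_c\r\n" * 1000

stored_size = zip_size(data, zipfile.ZIP_STORED)      # default: no compression
deflated_size = zip_size(data, zipfile.ZIP_DEFLATED)  # requires the zlib module
```

With ZIP_STORED the archive is slightly larger than the input because of the zip headers, while ZIP_DEFLATED shrinks repetitive CSV data considerably.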

Python - how to open a file that is not yet written to disk?

I am using a script to strip exif data from uploaded JPGs in Python, before writing them to disk. I'm using Flask, and the file is brought in through requests
file = request.files['file']
strip the exif data, and then save it
f = open(file)
image = f.read()
f.close()
outputimage = stripExif(image)
f = ('output.jpg', 'w')
f.write(outputimage)
f.close()
f.save(os.path.join(app.config['IMAGE_FOLDER'], filename))
Open isn't working because it only takes a string as an argument, and if I try to just set f=file, it throws an error about tuple objects not having a write attribute. How can I pass the current file into this function before it is read?

file is a FileStorage, described in http://werkzeug.pocoo.org/docs/datastructures/#werkzeug.datastructures.FileStorage
As the docs say, stream represents the stream of data for this file, usually in the form of a pointer to a temporary file, and most functions are proxied.
You can probably do something like:
file = request.files['file']
image = file.read()
outputimage = stripExif(image)
# Open in binary mode, since the image data is bytes
f = open(os.path.join(app.config['IMAGE_FOLDER'], 'output.jpg'), 'wb')
f.write(outputimage)
f.close()

Try the io package, which has a BufferedReader(), à la:
import io
f = io.BufferedReader(request.files['file'])
...

file = request.files['file']
image = stripExif(file.read())
file.close()
filename = 'whatever' # maybe you want to use request.files['file'].filename
dest_path = os.path.join(app.config['IMAGE_FOLDER'], filename)
with open(dest_path, 'wb') as f:
    f.write(image)
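
The same read, process, write flow can be exercised outside Flask. In the sketch below an io.BytesIO stands in for request.files['file'], strip_exif is a placeholder that returns its input unchanged, and save_upload is a hypothetical helper name; none of these come from the original code.

```python
import io
import os
import tempfile

def strip_exif(image_bytes: bytes) -> bytes:
    """Placeholder for the question's stripExif; returns the bytes unchanged."""
    return image_bytes

def save_upload(upload, folder: str, filename: str) -> str:
    """Read a file-like upload, process it in memory, and write the result to disk."""
    image = strip_exif(upload.read())
    dest_path = os.path.join(folder, filename)
    with open(dest_path, "wb") as f:   # binary mode, since image data is bytes
        f.write(image)
    return dest_path

# io.BytesIO stands in for the uploaded file; the bytes are not a real JPEG.
fake_upload = io.BytesIO(b"\xff\xd8 fake jpeg bytes \xff\xd9")
image_folder = tempfile.mkdtemp()
saved_path = save_upload(fake_upload, image_folder, "output.jpg")
with open(saved_path, "rb") as fh:
    saved_bytes = fh.read()
```

Because the upload is only ever treated as something with a read() method, the function never needs to call open() on it, which is the step that failed in the question.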
