Opening a .docx file in an S3 bucket in Python (Boto3)

In one of our S3 buckets, we have a .docx file with Mail Merge fields in it.
What I'm trying to do is read it directly from the bucket, without first downloading it locally!
Typically, I can open a file and see the mail merge fields within it through the use of this code:
from mailmerge import MailMerge
document = MailMerge(r'C:\Users\User\Desktop\MailMergeFile.docx') # Trying to get a variable to pass in here
print(document.get_merge_fields())
As seen above, what I'm trying to do is get the object in a form I can pass straight to MailMerge, as though I were passing a path on my local machine.
The approaches I've found for doing this haven't worked.
import boto3

s3 = boto3.client('s3')

fileobj = s3.get_object(
    Bucket='bucketname',
    Key='folder/mailmergefile.docx'
)
word_file = fileobj['Body'].read()
contents = word_file.decode('ISO-8859-1')  # can't use utf-8 as that gives an encoding error
contents
But when I try to pass the contents variable to MailMerge, I get another error:
document = MailMerge(contents)
print(document.get_merge_fields())
The error I get is:
ValueError: embedded null character

I presume you are using docx-mailmerge · PyPI.
The documentation is quite sparse, but it shows MailMerge('input.docx'), which suggests that it is expecting the name of a file, not the contents of a file.
Looking at the code, it seems to be calling a library to open a zip file.
Bottom line: As written, it wants the name of a file, not the contents of the file.
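That said, the zip-opening call it delegates to (zipfile.ZipFile) also accepts a file-like object, so wrapping the raw S3 bytes in io.BytesIO may work without touching disk. A minimal sketch, assuming MailMerge hands its argument straight to zipfile and using placeholder bucket/key names:
import io
import boto3
from mailmerge import MailMerge

s3 = boto3.client('s3')
fileobj = s3.get_object(Bucket='bucketname', Key='folder/mailmergefile.docx')

# Wrap the raw bytes in an in-memory file object; do not decode them to text
stream = io.BytesIO(fileobj['Body'].read())

document = MailMerge(stream)
print(document.get_merge_fields())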

Related

Archive files directly from memory in Python

I'm writing a program where I get a number of files, then zip them with encryption using pyzipper; I'm writing them to io.BytesIO() objects so I keep them in memory. Now, after some other additions, I want to take all of these in-memory files and zip them together into a single encrypted zip file using the same pyzipper.
The code looks something like this:
# Create the in-memory file object
in_memory = BytesIO()

# Create the zip file and open in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA,
                         encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" with file_name
    zip_file.writestr(file_name, data)

# Go to the beginning
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess, the files list is an attribute of a class, and the code above is a function that runs a number of times to build up the full list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA,
                         encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this purpose, but I don't think it can help me, and if it can, I have no idea how to make it work.
I also tried using fs.memoryfs.MemoryFS to temporarily create a virtual filesystem in memory, store all the files there, get them back out to zip everything together, and then save the result to disk. Again, it failed. I got tons of different errors in my tests, and TBH there's very little information out there on this fs method; even if what I'm trying to do is possible, I couldn't figure it out.
P.S.: I don't know if pyzipper (almost 1:1 with zipfile, plus encryption) supports nested zip files at all. This could be the problem I'm facing, but if it doesn't, I'm open to suggestions for a new approach. Also, I don't want to rely on third-party software, even if it is open source! (I'm talking about using 7zip to do all the archiving and encryption, even though it shouldn't even be possible to use it without saving the files to disk first, which is the main thing I'm trying to avoid.)
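For what it's worth, the traceback itself points at the mismatch: zfile.write() expects a filesystem path (hence the os.stat call choking on the raw zip bytes), whereas writestr(), inherited from zipfile, accepts in-memory data directly. A minimal sketch, with the archive member names invented for illustration:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA,
                         encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for i, data in enumerate(files):
        # store each in-memory zip as a named member of the outer archive
        zfile.writestr("inner_{}.zip".format(i), data)
Nested zips are just bytes to the outer archive, so this sidesteps the question of "support" entirely.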

How do I read and write an excel sheet from and to S3 using python?

I have an Excel file in S3. My aim is to read that file, process it, and write it back. I have been using openpyxl for the read and write parts, and it works locally. However, the same doesn't work when the file is located in S3.
The current architecture is as follows: a call is made to my Flask app, with the URL to the file in S3 passed as a parameter. The parameter is read as follows.
url = request.args.get('url')
In the case of a csv file, the following worked:
pandas.read_csv(url)
But with xlsx files, the following (with openpyxl):
file = load_workbook(filename = url)
corpus = file['Sheet']
is giving me the following error:
FileNotFoundError: [Errno 2] No such file or directory: 's3.amazonaws.com/data-file/prod/projects/Methane__-oil_and_gas-_-_Sheet1.xlsx'
How do I resolve this and read the file from S3? Also, after I am done processing, how do I write it back to S3?
You can pass a URL to pandas.read_csv, since it automatically recognizes URLs, but from your error it looks like the URL doesn't include the protocol.
The url should be https://s3.amazonaws.com/data-file/prod/projects/Methane__-oil_and_gas-_-_Sheet1.xlsx
Try prepending https:// to the URL and see what happens.
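If the URL route doesn't pan out, openpyxl's load_workbook also accepts a file-like object, so you can go through boto3 and an in-memory buffer in both directions. A rough sketch, with placeholder bucket and key names, assuming your openpyxl version can save to a buffer:
import io
import boto3
from openpyxl import load_workbook

s3 = boto3.client('s3')

# Read: fetch the object and open the workbook from an in-memory buffer
obj = s3.get_object(Bucket='bucketname', Key='folder/workbook.xlsx')
file = load_workbook(filename=io.BytesIO(obj['Body'].read()))
corpus = file['Sheet']

# ... process the sheet ...

# Write: save the workbook into a buffer and upload it back
out = io.BytesIO()
file.save(out)
out.seek(0)
s3.put_object(Bucket='bucketname', Key='folder/workbook.xlsx', Body=out)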

How to obtain a file object from a variable or from http URL without actually creating a file?

I want to manipulate a downloaded PDF using PyPDF and for that, I need a file object.
I use GAE to host my Python app, so I cannot actually write the file to disk.
Is there any way to obtain the file object from URL or from a variable that contains the file contents?
TIA.
Most tools (including urllib) already give you a file-like object, but if you need true random access then you'll need to create a StringIO.StringIO and read the data into it.
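A minimal sketch of that route (Python 2, as in the question's era; url is a placeholder):
import urllib2
import StringIO

data = urllib2.urlopen(url).read()    # download the PDF contents
pdf_stream = StringIO.StringIO(data)  # in-memory file object with random access
pyPdf's PdfFileReader, for instance, will accept such a stream anywhere it expects a file object.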
In GAE you can use the blobstore to read and write file data and to upload and download files, using the Files API:
Example:
_file = files.blobstore.create(mime_type=mimetype, _blobinfo_uploaded_filename='test')
with files.open(_file, 'a') as f:
    f.write(somedata)
files.finalize(_file)

Extract files from zip folder and store these files in blobstore

I want to upload a zip file from a file input in a form, then extract the contents of this uploaded zip and store the contained files in the blobstore, so that they can be downloaded afterwards. But the problem is that I can't deal with the zip file directly (to read it). I tried this:
form = cgi.FieldStorage()
file_upload = form['file']
zip1 = file_upload.filename
zipstream = StringIO.StringIO(zip1.read())
But the problem is still that I can't read the zip, as before. I also tried to read the zip file directly, like this:
z1 = zipfile.ZipFile(zip1, "r")
But that gave an error as well. Can anyone help me? Thanks in advance.
Based on your comment, it sounds like you need to take a closer look at the cgi module documentation, which includes the following:
If a field represents an uploaded file, accessing the value via the value attribute or the getvalue() method reads the entire file in memory as a string. This may not be what you want. You can test for an uploaded file by testing either the filename attribute or the file attribute. You can then read the data at leisure from the file attribute...
This suggests that you need to modify your code to look something like:
form = cgi.FieldStorage()
file_upload = form['file']
z1 = zipfile.ZipFile(file_upload.file, 'r')
There are additional examples in the documentation.
You don't have to extract files from the zip in order to make them available for download - see this post for an example of serving directly from a zip. You can adapt that code if you want to extract the files and store them individually in the blobstore.
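If you do want to extract and store each file individually, a rough sketch using the blobstore Files API (the mime type and per-member naming here are assumptions):
z1 = zipfile.ZipFile(file_upload.file, 'r')
for name in z1.namelist():
    data = z1.read(name)  # bytes of one archive member
    blob = files.blobstore.create(mime_type='application/octet-stream',
                                  _blobinfo_uploaded_filename=name)
    with files.open(blob, 'a') as f:
        f.write(data)
    files.finalize(blob)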

Returning a file in uploads directory with web2py - strings issue

I have users upload files into a fake directory structure using a database. I have fields for the parent path, the filename, and the file (the file field is of type "upload"), which I set in my controller. I can see that files are properly being stored in the uploads directory, so that is working. Just for reference, I store the files using:
db.allfiles.insert(filename=filename,
                   parentpath=parentpath,
                   file=db.allfiles.file.store(file.file, filename),
                   datecreated=now, user=me)
I am trying to set up a function for downloading files as well so a user can download files using something like app/controller/function/myfiles/image.jpg. I find the file using this code:
file = db((db.allfiles.parentpath == parentpath) &
          (db.allfiles.filename == filename) &
          (db.allfiles.user == me)).select()[0]
and I tried returning file.file, but the jpg files I was getting back were just strings like:
allfiles.file.89fe64038f1de7be.6d6f6e6b65792d372e6a7067.jpg
which is the filename in the database. I tried this code:
os.path.join(request.folder,('uploads/'),'/'.join(file.file))
but I'm getting this path:
/home/charles/web2py/applications/chips/uploads/a/l/l/f/i/l/e/s/./f/i/l/e/./8/9/f/e/6/4/0/3/8/f/1/d/e/7/b/e/./6/d/6/f/6/e/6/b/6/5/7/9/2/d/3/7/2/e/6/a/7/0/6/7/./j/p/g
I think this is a special type of string, or maybe file.file isn't exactly a string. Is there some way I can return the file to the user through my function?
You're almost right. Try:
os.path.join(request.folder,'uploads',file.file)
Python strings are sequence types, and therefore iterable. When you submit a single string as an argument to the join method, it iterates over each character in the string. So, for example:
>>> '/'.join('hello')
'h/e/l/l/o'
Also, note that os.path.join will automatically separate its arguments by the appropriate path separator for your OS (i.e., os.path.sep), so no need to insert slashes manually.
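With the corrected path, one way to send the file back to the user is web2py's response.stream, sketched below (the lookup producing file is the one from the question):
import os

filepath = os.path.join(request.folder, 'uploads', file.file)
return response.stream(open(filepath, 'rb'))
web2py's scaffolding also ships a generic download action (return response.download(request, db)) that handles upload fields for you, which may save the manual path handling altogether.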
