I've been pickling objects to the filesystem and reading them back when needed to work with those objects. Currently I have this code for that purpose:
def pickle(self, directory, filename):
    if not os.path.exists(directory):
        os.makedirs(directory)
    with open(directory + '/' + filename, 'wb') as handle:
        pickle.dump(self, handle)

@staticmethod
def load(filename):
    with open(filename, 'rb') as handle:
        element = pickle.load(handle)
    return element
Now I'm moving my application (Django) to Google App Engine, and I discovered that App Engine does not allow me to write to the file system. Google Cloud Storage seemed like my only choice, but I could not understand how to pickle my objects as Cloud Storage objects and read them back to recreate the original Python objects.
For Python 3 users, you can use the gcsfs library from the Dask creator to solve your issue.
Example reading:
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')
fs.ls('my-bucket')
>>> ['my-file.txt']
with fs.open('my-bucket/my-file.txt', 'rb') as f:
    print(f.read())
Pickling is basically identical, though:

with fs.open(directory + '/' + filename, 'wb') as handle:
    pickle.dump(self, handle)
Reading is similar; just replace wb with rb and dump with load:

with fs.open(directory + '/' + filename, 'rb') as handle:
    element = pickle.load(handle)
You can use the Cloud Storage client library.
Instead of open() use cloudstorage.open() (or gcs.open() if importing cloudstorage as gcs, as in the above-mentioned doc), and note that the full file path starts with the GCS bucket name (as a directory).
More details in the cloudstorage.open() documentation.
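For the pickling use case above, a minimal sketch of how that could look (the bucket and object names here are made up, and cloudstorage.open() takes 'r'/'w' rather than 'rb'/'wb'):

import cloudstorage as gcs
import pickle

# '/my-bucket/my-dir/my-object.pkl' is a hypothetical path; it starts with the bucket
with gcs.open('/my-bucket/my-dir/my-object.pkl', 'w') as handle:
    pickle.dump(obj, handle)

with gcs.open('/my-bucket/my-dir/my-object.pkl', 'r') as handle:
    obj = pickle.load(handle)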
Another option, which also works with Python 3 (I tested it with TensorFlow 2.2.0):
import pickle
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://....', mode='rb') as f:
    obj = pickle.load(f)
This is very useful if you already use TensorFlow, for example.
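Writing should work the same way; a minimal sketch (the gs:// path here is made up):

import pickle
from tensorflow.python.lib.io import file_io

# 'gs://my-bucket/my-object.pkl' is a hypothetical path
with file_io.FileIO('gs://my-bucket/my-object.pkl', mode='wb') as f:
    pickle.dump(obj, f)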
Related
I have been struggling with this problem for a while but can't seem to find a solution. I need to open a file in the browser, and after the user closes it, the file should be removed from their machine. All I have is the binary data for that file. If it matters, the binary data comes from Google Storage via the download_as_string method.
After doing some research I found that the tempfile module would suit my needs, but I can't get the tempfile to open in the browser because the file only exists in memory and not on the disk. Any suggestions on how to solve this?
This is my code so far:
import os
import tempfile
import webbrowser

# grabbing binary data earlier on
temp = tempfile.NamedTemporaryFile()
temp.name = "example.pdf"
temp.write(binary_data_obj)
temp.close()
webbrowser.open('file://' + os.path.realpath(temp.name))
When this is run, my computer gives me an error that says that the file cannot be opened since it is empty. I am on a Mac and am using Chrome if that is relevant.
You could try using a temporary directory instead:
import os
import tempfile
import webbrowser
# I used an existing pdf I had laying around as sample data
with open('c.pdf', 'rb') as fh:
    data = fh.read()

# Gives a temporary directory you have write permissions to.
# The directory and files within will be deleted when the with context exits.
with tempfile.TemporaryDirectory() as temp_dir:
    temp_file_path = os.path.join(temp_dir, 'example.pdf')
    # write a normal file within the temp directory
    with open(temp_file_path, 'wb+') as fh:
        fh.write(data)
    webbrowser.open('file://' + temp_file_path)
This worked for me on Mac OS.
Is there a way to encrypt PDF files in Python?
One possibility is to zip the PDFs, but is there another?
Thanks for your help.
Regards,
Felix
You can use PyPDF2:
from PyPDF2 import PdfFileReader, PdfFileWriter
with open("input.pdf", "rb") as in_file:
input_pdf = PdfFileReader(in_file)
output_pdf = PdfFileWriter()
output_pdf.appendPagesFromReader(input_pdf)
output_pdf.encrypt("password")
with open("output.pdf", "wb") as out_file:
output_pdf.write(out_file)
For more information, check out the PdfFileWriter docs.
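To sanity-check the result, you can open the encrypted file again and decrypt it with the same password (a minimal sketch reusing the filenames from the example above):

from PyPDF2 import PdfFileReader

with open("output.pdf", "rb") as f:
    reader = PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt("password")
    print(reader.getNumPages())  # should print the original page count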
pikepdf, Python's adaptation of QPDF, is by far the better option. It is especially helpful if you have a file that contains text in languages other than English.

import pikepdf

pdf = pikepdf.Pdf.open('path/to/file.pdf')
# you can change R from 4 to 6 for AES-256 encryption
pdf.save('output_filename.pdf',
         encryption=pikepdf.Encryption(owner='password', user='password', R=4))
pdf.close()
You can use PyPDF2:
import PyPDF2
pdfFile = open('input.pdf', 'rb')
# Create reader and writer object
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()
# Add all pages to the writer (the accepted answer results in blank pages)
for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))
# Encrypt with your password
pdfWriter.encrypt('password')
# Write it to an output file. (you can delete unencrypted version now)
resultPdf = open('encrypted_output.pdf', 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()
Another option is Aspose.PDF Cloud SDK for Python, a REST API solution. You can use the cloud storage of your choice: Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage, FTP Storage, or Aspose Cloud Storage.
The cryptoAlgorithm parameter takes the following possible values:
RC4x40: RC4 with key length 40
RC4x128: RC4 with key length 128
AESx128: AES with key length 128
AESx256: AES with key length 256
import os
import base64
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
from shutil import copyfile
# Get Client key and Client ID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
temp_folder = "Temp"

# Upload the PDF file to storage
data_file = "C:/Temp/02_pages.pdf"
remote_name = "02_pages.pdf"
pdf_api.upload_file(remote_name, data_file)

out_path = "EncryptedPDF.pdf"
user_password_encoded = base64.b64encode(b'user $^Password!&')
owner_password_encoded = base64.b64encode(b'owner\//? $12^Password!&')

# Encrypt the PDF document
response = pdf_api.put_encrypt_document(temp_folder + '/' + out_path, user_password_encoded,
                                        owner_password_encoded,
                                        asposepdfcloud.models.CryptoAlgorithm.AESX128,
                                        file=remote_name)

# Download the PDF file from storage
response_download = pdf_api.download_file(temp_folder + '/' + out_path)
copyfile(response_download, 'C:/Temp/' + out_path)
print(response)
I would highly recommend the pyAesCrypt module. It is based on the cryptography module, which is written partly in C, and it is quite fast, especially on high-spec computers: you can expect roughly a 12-second encryption of a 3 GB file on higher-end machines, so it really is fast, though not the fastest available.
The one-liners for encryption and decryption are:
import pyAesCrypt
Encrypting:
pyAesCrypt.encryptFile(inputfile, outputfile, password, bufferSize)
Decrypting:
pyAesCrypt.decryptFile(inputfile, outputfile, password, bufferSize)
Since this is not the full explanation, I would recommend reading the documentation in full; it is not really long.
You can find it here: https://pypi.org/project/pyAesCrypt/
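A concrete sketch of those one-liners (the filenames are made up; older pyAesCrypt versions require the bufferSize argument):

import pyAesCrypt

bufferSize = 64 * 1024  # process the file in 64 KB chunks
pyAesCrypt.encryptFile('document.pdf', 'document.pdf.aes', 'p4ssw0rd', bufferSize)
pyAesCrypt.decryptFile('document.pdf.aes', 'document_decrypted.pdf', 'p4ssw0rd', bufferSize)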
You can also use PyPDF2 with this project.
For example, put the PDF_Lock.py file into your project folder.
Then you can use:
import PDF_Lock
and when you want to protect a PDF file, use:
PDF_Lock.lock(YourPDFFilePath, YourProtectedPDFFilePath, Password)
Is there a way to create a password-protected archive with gzip or zipfile?
Here is example code illustrating how to archive a file with no password protection:

import gzip, shutil

filepath = r'c:\my.log'
with gzip.GzipFile(filepath + ".gz", "w") as gz:
    with open(filepath, "rb") as with_open_file:
        shutil.copyfileobj(with_open_file, gz)

import zipfile

zf = zipfile.ZipFile(filepath + '.zip', 'w')
zf.write(filepath)
zf.close()
Python supports extracting password-protected zips (note that pwd must be bytes):

import zipfile
zipfile.ZipFile('myarchive.zip').extractall(pwd=b'P4$$W0rd')
Sadly, it does not support creating them. You can either call an external tool like 7zip or use a third-party library like this zlib wrapper.
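For the external-tool route, a minimal sketch that shells out to 7-Zip (this assumes the 7z binary is installed and on your PATH; the archive and file names are made up):

import subprocess

# 'a' adds files to an archive; -p sets the password
subprocess.run(['7z', 'a', '-pP4$$W0rd', 'myarchive.zip', 'my.log'], check=True)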
Use gpg for encryption. That would be a separate wrapper around the archived and compressed data.
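A minimal sketch of that gpg approach, shelling out to the gpg binary to symmetrically encrypt an already-compressed file (the filenames are made up; gpg must be installed and will prompt for a passphrase):

import subprocess

subprocess.run(
    ['gpg', '--symmetric', '--cipher-algo', 'AES256',
     '--output', 'my.log.gz.gpg', 'my.log.gz'],
    check=True,
)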
I am using Python 2.7 and Google Cloud. I want to process a gzip file uploaded to Google Cloud Storage. In plain Python, this would be:
import gzip
with gzip.open('myfile.gz', 'r') as f:
    f.read()
Since this is not allowed on GCS, the only option I found in the Google Cloud Storage Client Library Functions is:

import cloudstorage

with cloudstorage.open('myfile.gz', 'r') as f:
    f.read()
which does not open gzip files. Any tips on how I can do this?
You can use gzip.GzipFile()'s alternate file-object access, passing it the file object provided by the GCS client lib:
import cloudstorage
import gzip
with cloudstorage.open('myfile.gz', 'r') as f:
    content = gzip.GzipFile(fileobj=f).read()
I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to S3 bucket as it is being created rather than writing the whole file locally, and then uploading the file at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]
f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())
for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())
f.close()
This writes 3 lines to the file; however, I'm unable to release memory so that I can write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts in a multipart upload, since you can't stream to S3. There is also a package that turns your streaming file into a multipart upload, which is what I used: Smart Open.
import smart_open
import io
import csv
testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]
fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())
    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())
f.close()
Here is a complete example using boto3:
import boto3
import io
session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")

buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())

# bucket and keypath are placeholders for your bucket name and object key
s3.Object(bucket, keypath).put(Body=buff.getvalue())
We were trying to upload file contents to S3 when it came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
According to the docs, it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
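A minimal sketch of that StringIO variant (the bucket and key names are made up):

import io
import boto3

s3 = boto3.resource('s3')
buff = io.StringIO()
buff.write('Hello there')
s3.Object('mybucket', 'hello.txt').put(Body=buff.getvalue())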
Update: the smart_open lib from @inquiring minds' answer is a better solution.
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
import csv
import gzip
import io
import boto3

# Adapted for Python 3: csv needs a text buffer, gzip needs bytes.
# my_data, bucket_name and key are assumed to be defined elsewhere.
csv_data = io.StringIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode())
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
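For the uncompressed case, a minimal sketch of that general approach (the bucket and key names are made up):

import io
import boto3

s3 = boto3.client('s3')
stream = io.BytesIO(b'col1,col2\n1,2\n')  # any seekable file-like object works
s3.upload_fileobj(stream, 'my-bucket', 'path/to/file.csv')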
There's a well-supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2 * 2**20 * b'a')
    f.write(2 * 2**20 * b'a')
Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.
This isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time.
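A minimal sketch of that low-level multipart flow (the bucket, key, and chunks iterable are made up; note that each part except the last must be at least 5 MB):

import boto3

s3 = boto3.client('s3')
mpu = s3.create_multipart_upload(Bucket='my-bucket', Key='big-file.csv')
parts = []
for part_number, chunk in enumerate(chunks, start=1):  # chunks: an iterable of bytes
    resp = s3.upload_part(Bucket='my-bucket', Key='big-file.csv',
                          PartNumber=part_number, UploadId=mpu['UploadId'],
                          Body=chunk)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
s3.complete_multipart_upload(Bucket='my-bucket', Key='big-file.csv',
                             UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})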
To write a string to an S3 object, use:

s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to string and you're there.