How to open a gzip file on GAE? - python

I am using Python 2.7 on Google Cloud. I want to process a gzip file uploaded to Google Cloud Storage (GCS). In plain Python, this would be:
import gzip
with gzip.open('myfile.gz', 'r') as f:
    f.read()
Since this is not allowed on GCS, the only option I found in the Google Cloud Storage Client Library Functions is:
import cloudstorage
with cloudstorage.open('myfile.gz', 'r') as f:
    f.read()
which does not open gzip files. Any tips on how I can do this?

You can use gzip.GzipFile() with a file object instead, passing it the file object provided by the GCS client library:
import cloudstorage
import gzip
with cloudstorage.open('myfile.gz', 'r') as f:
    content = gzip.GzipFile(fileobj=f).read()
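If the decompressed content is too large to read at once, the same GzipFile object can also be iterated line by line; a minimal sketch based on the snippet above:
import cloudstorage
import gzip

# stream the decompressed content line by line instead of reading it all at once
with cloudstorage.open('myfile.gz', 'r') as f:
    gz = gzip.GzipFile(fileobj=f)
    for line in gz:
        print(line)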

Related

How to stream a large gzipped .tsv file from s3, process it, and write back to a new file on s3?

I have a large file s3://my-bucket/in.tsv.gz that I would like to load and process, then write its processed version back to an S3 output file s3://my-bucket/out.tsv.gz.
How do I stream in.tsv.gz directly from S3 without loading the whole file into memory (it cannot fit in memory)?
How do I write the processed gzipped stream directly to S3?
In the following code, I show how I was thinking of loading the input gzipped dataframe from S3, and how I would write the .tsv if it were located locally (bucket_dir_local = './').
import pandas as pd
import s3fs
import os
import gzip
import csv
import io
bucket_dir = 's3://my-bucket/annotations/'
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")
bucket_dir_local='./'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'], extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            my_dict = {"test": index, "testing": row[6]}
            w.writerow(my_dict)
Edit: smart_open looks like the way to go.
Here is a dummy example that reads a file from S3 and writes it back to S3 using smart_open:
from smart_open import open
import os
bucket_dir = "s3://my-bucket/annotations/"
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(os.path.join(bucket_dir, "out.tsv.gz"), "wb") as fout:
        for line in fin:
            l = [i.strip() for i in line.decode().split("\t")]
            string = "\t".join(l) + "\n"
            fout.write(string.encode())
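If you want to do the actual TSV processing while streaming, smart_open also opens .gz files transparently in text mode; a sketch, assuming the same 'test' and 'testing' columns as in the question:
from smart_open import open
import csv

bucket_dir = "s3://my-bucket/annotations/"
with open(bucket_dir + "in.tsv.gz", "rt") as fin, \
        open(bucket_dir + "out.tsv.gz", "wt") as fout:
    reader = csv.DictReader(fin, delimiter="\t")
    writer = csv.DictWriter(fout, fieldnames=["test", "testing"],
                            delimiter="\t", extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        # process each row here before writing it back out
        writer.writerow(row)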
For downloading, you can stream the S3 object directly in Python. I'd recommend reading that entire post, but here are some key lines from it:
import boto3
s3 = boto3.client('s3', aws_access_key_id='mykey', aws_secret_access_key='mysecret') # your authentication may vary
obj = s3.get_object(Bucket='my-bucket', Key='my/precious/object')
import gzip
body = obj['Body']
with gzip.open(body, 'rt') as gf:
    for ln in gf:
        process(ln)
Unfortunately S3 doesn't support true streaming input, but this SO answer has an implementation that chunks out the file and sends each chunk up to S3. While not a "true stream", it will let you upload large files without needing to keep the entire thing in memory.
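For reference, a minimal sketch of that chunked-upload idea using boto3's multipart API (bucket and key names are placeholders; every part except the last must be at least 5 MB):
import boto3

s3 = boto3.client('s3')
mpu = s3.create_multipart_upload(Bucket='my-bucket', Key='out.tsv.gz')
parts = []

def upload_chunk(chunk, part_number):
    # upload one chunk and remember its ETag for the final completion call
    resp = s3.upload_part(Bucket='my-bucket', Key='out.tsv.gz',
                          PartNumber=part_number, UploadId=mpu['UploadId'],
                          Body=chunk)
    parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})

# ... call upload_chunk() for each chunk of the gzipped stream ...

s3.complete_multipart_upload(Bucket='my-bucket', Key='out.tsv.gz',
                             UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})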

Extract particular file from zip blob stored in azure container with python using Jupyter notebook

I have uploaded a zip file to my Azure account as a blob in an Azure container.
The zip file contains .csv, .ascii files and many other formats.
I need to read a specific file, let's say the ASCII file data contained in the zip file. I am using Python for this.
How can I read a particular file's data from this zip file without downloading it locally? I would like to handle this process in memory only.
I am also trying this with the Jupyter notebook provided by Azure for ML functionality.
I am using the ZipFile Python package for this.
Please assist me with reading the file.
Please find the following code snippet:
blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
blob_list = blob_service.list_blobs(CONTAINER_NAME)
allBlobs = []
for blob in blob_list:
    allBlobs.append(blob.name)
sampleZipFile = allBlobs[0]
print(sampleZipFile)
The below code should work. This example accesses an Azure Container using an Account URL and Key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile
key = r'my_key'
service = BlobServiceClient(account_url="my_account_url", credential=key)
container_client = service.get_container_client('container_name')
zipfilename = 'myzipfile.zip'
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.content_as_bytes()
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)
otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (e.g. pandas.read_csv()).
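For example, if the contained file is a CSV, a short usage sketch (assuming pandas is available):
import pandas as pd

df = pd.read_csv(filetoread)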
You could use the code below to read a file inside a .zip file without extracting it in Python:
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details, you can refer to the ZipFile docs here.
Alternatively, you can do something like this:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 1 11:14:56 2019
@author: moverm
"""
import zipfile

zfile = zipfile.ZipFile('C:\\LAB\\Pyt\\sample.zip')
for finfo in zfile.infolist():
    ifile = zfile.open(finfo)
    line_list = ifile.readlines()
    print(line_list)
Hope it helps.

Encrypt PDFs in python

Is there a way to encrypt PDF files in Python?
One possibility is to zip the PDFs, but is there another?
Thanks for your help
regards
Felix
You can use PyPDF2:
from PyPDF2 import PdfFileReader, PdfFileWriter
with open("input.pdf", "rb") as in_file:
input_pdf = PdfFileReader(in_file)
output_pdf = PdfFileWriter()
output_pdf.appendPagesFromReader(input_pdf)
output_pdf.encrypt("password")
with open("output.pdf", "wb") as out_file:
output_pdf.write(out_file)
For more information, check out the PdfFileWriter docs.
pikepdf, which is Python's adaptation of QPDF, is by far the better option. This is especially helpful if you have a file with text in languages other than English.
import pikepdf
from pikepdf import Pdf

password = 'my-password'
pdf = Pdf.open('path/to/file.pdf')
# you can change R from 4 to 6 for AES-256 encryption
pdf.save('output_filename.pdf',
         encryption=pikepdf.Encryption(owner=password, user=password, R=4))
pdf.close()
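To open the encrypted file again, you can pass the same password back to Pdf.open; a short sketch using the placeholder names above:
from pikepdf import Pdf

pdf = Pdf.open('output_filename.pdf', password=password)
# ... work with the decrypted document ...
pdf.close()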
You can use PyPDF2
import PyPDF2
pdfFile = open('input.pdf', 'rb')
# Create reader and writer object
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()
# Add all pages to writer (accepted answer results into blank pages)
for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))
# Encrypt with your password
pdfWriter.encrypt('password')
# Write it to an output file. (you can delete unencrypted version now)
resultPdf = open('encrypted_output.pdf', 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()
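To verify the result, you can reopen the encrypted output and unlock it with the same password; a minimal sketch using the names above:
import PyPDF2

with open('encrypted_output.pdf', 'rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('password')
    print(reader.numPages)  # pages are accessible again after decrypting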
Another option is Aspose.PDF Cloud SDK for Python; it is a REST API solution. You can use the cloud storage of your choice from Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage, FTP Storage and Aspose Cloud Storage.
The cryptoAlgorithm parameter takes the following possible values:
RC4x40: RC4 with key length 40
RC4x128: RC4 with key length 128
AESx128: AES with key length 128
AESx256: AES with key length 256
import os
import base64
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
from shutil import copyfile
# Get Client key and Client ID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
temp_folder="Temp"
#upload PDF file to storage
data_file = "C:/Temp/02_pages.pdf"
remote_name= "02_pages.pdf"
pdf_api.upload_file(remote_name,data_file)
out_path = "EncryptedPDF.pdf"
user_password_encoded = base64.b64encode(b'user $^Password!&')
owner_password_encoded = base64.b64encode(b'owner\//? $12^Password!&')
# Encrypt PDF document
response = pdf_api.put_encrypt_document(temp_folder + '/' + out_path, user_password_encoded, owner_password_encoded, asposepdfcloud.models.CryptoAlgorithm.AESX128, file = remote_name)
#download PDF file from storage
response_download = pdf_api.download_file(temp_folder + '/' + out_path)
copyfile(response_download, 'C:/Temp/' + out_path)
print(response)
I would highly recommend the pyAesCrypt module.
It is based on the cryptography module, which is written partly in C.
The module is quite fast, especially on high-spec computers.
You can expect a 3 GB file to be encrypted in about 12 seconds on higher-end computers, so it really is fast, though not the fastest.
The one-liners for encryption and decryption are:
import pyAesCrypt
Encrypting:
pyAesCrypt.encryptFile(inputfile, outputfile, password, bufferSize)
Decrypting:
pyAesCrypt.decryptFile(inputfile, outputfile, password, bufferSize)
Since this is not the full explanation, I would recommend reading the documentation in full, as it is not very long.
You can find it here: https://pypi.org/project/pyAesCrypt/
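Note that pyAesCrypt encrypts the PDF into an opaque .aes container rather than producing a password-protected PDF. A minimal round-trip sketch (file names are placeholders; 64 KB is the buffer size suggested in the docs):
import pyAesCrypt

bufferSize = 64 * 1024
password = "use-a-long-random-password"

# encrypt the PDF into an .aes container, then decrypt it back
pyAesCrypt.encryptFile("document.pdf", "document.pdf.aes", password, bufferSize)
pyAesCrypt.decryptFile("document.pdf.aes", "document_decrypted.pdf", password, bufferSize)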
You can also use PyPDF2 with this project.
For example, put the PDF_Lock.py file into your project folder.
Then you can use:
import PDF_Lock
and when you want to protect a PDF file, use:
PDF_Lock.lock(YourPDFFilePath, YourProtectedPDFFilePath, Password)

pickling python objects to google cloud storage

I've been pickling objects to the filesystem and reading them back when I need to work with them. Currently I have this code for that purpose:
def pickle(self, directory, filename):
    if not os.path.exists(directory):
        os.makedirs(directory)
    with open(directory + '/' + filename, 'wb') as handle:
        pickle.dump(self, handle)

@staticmethod
def load(filename):
    with open(filename, 'rb') as handle:
        element = pickle.load(handle)
    return element
Now I'm moving my application (Django) to Google App Engine and have found that App Engine does not allow me to write to the file system. Google Cloud Storage seems to be my only choice, but I can't understand how I could pickle my objects as Cloud Storage objects and read them back to recreate the original Python objects.
For Python 3 users, you can use the gcsfs library from the Dask creator to solve your issue.
Example reading:
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-google-project')
fs.ls('my-bucket')
>>> ['my-file.txt']
with fs.open('my-bucket/my-file.txt', 'rb') as f:
    print(f.read())
It is basically identical to pickle, though:
with fs.open(directory + '/' + filename, 'wb') as handle:
    pickle.dump(self, handle)
To read, this is similar, but replace wb by rb and dump with load:
with fs.open(directory + '/' + filename, 'rb') as handle:
    pickle.load(handle)
You can use the Cloud Storage client library.
Instead of open() use cloudstorage.open() (or gcs.open() if importing cloudstorage as gcs, as in the above-mentioned doc) and note that the full filepath starts with the GCS bucket name (as a dir).
More details in the cloudstorage.open() documentation.
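A minimal sketch adapting the question's helpers to the GCS client library (the '/my-bucket/objects/...' path is a placeholder):
import pickle
import cloudstorage

def gcs_pickle(obj, path):
    # path starts with the bucket name, e.g. '/my-bucket/objects/element.pkl'
    with cloudstorage.open(path, 'w') as handle:
        pickle.dump(obj, handle)

def gcs_load(path):
    with cloudstorage.open(path, 'r') as handle:
        return pickle.load(handle)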
One other option (I tested it with Tensorflow 2.2.0) which also works with Python 3:
from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://....', mode='rb') as f:
    pickle.load(f)
This is very useful if you already use TensorFlow, for example.
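Writing works the same way; a hedged sketch, assuming obj is the object you want to save and the gs:// path is a placeholder:
import pickle
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://my-bucket/element.pkl', mode='wb') as f:
    pickle.dump(obj, f)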

Is it possible to generate and return a ZIP file with App Engine?

I have a small project that would be perfect for Google App Engine. Implementing it hinges on the ability to generate a ZIP file and return it.
Due to the distributed nature of App Engine, from what I can tell, the ZIP file couldn't be created "in-memory" in the traditional sense. It would basically have to be generated and sent in a single request/response cycle.
Does the Python zip module even exist in the App Engine environment?
zipfile is available on App Engine, and a reworked example follows:
from contextlib import closing
from zipfile import ZipFile, ZIP_DEFLATED
from google.appengine.ext import webapp
from google.appengine.api import urlfetch
def addResource(zfile, url, fname):
    # get the contents
    contents = urlfetch.fetch(url).content
    # write the contents to the zip file
    zfile.writestr(fname, contents)

class OutZipfile(webapp.RequestHandler):
    def get(self):
        # Set up headers for browser to correctly recognize ZIP file
        self.response.headers['Content-Type'] = 'application/zip'
        self.response.headers['Content-Disposition'] = \
            'attachment; filename="outfile.zip"'
        # compress files and emit them directly to HTTP response stream
        with closing(ZipFile(self.response.out, "w", ZIP_DEFLATED)) as outfile:
            # repeat this for every URL that should be added to the zipfile
            addResource(outfile,
                        'https://www.google.com/intl/en/policies/privacy/',
                        'privacy.html')
            addResource(outfile,
                        'https://www.google.com/intl/en/policies/terms/',
                        'terms.html')
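To wire the handler up, the usual webapp routing should work; a short sketch with a placeholder URL path:
application = webapp.WSGIApplication([('/outzip', OutZipfile)], debug=True)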
# inside a webapp.RequestHandler's get() method:
import zipfile
import StringIO

text = u"ABCDEFGHIJKLMNOPQRSTUVWXYVabcdefghijklmnopqqstuvweyxáéöüï东 廣 広 广 國 国 国 界"

zipstream = StringIO.StringIO()
zfile = zipfile.ZipFile(file=zipstream, compression=zipfile.ZIP_DEFLATED, mode="w")
zfile.writestr("data.txt.zip", text.encode("utf-8"))
zfile.close()
zipstream.seek(0)

self.response.headers['Content-Type'] = 'application/zip'
self.response.headers['Content-Disposition'] = 'attachment; filename="data.txt.zip"'
self.response.out.write(zipstream.getvalue())
From What is Google App Engine:
You can upload other third-party libraries with your application, as long as they are implemented in pure Python and do not require any unsupported standard library modules.
So, even if it doesn't exist by default, you can (potentially) include it yourself. (I say potentially because I don't know whether the Python zip library requires any "unsupported standard library modules".)
