read google bucket files using python - python

I have to read Google bucket files which are in xlsx format.
The file structure in the bucket looks like:
bucket_name
    folder_name_1
        file_name_1
    folder_name_2
    folder_name_3
        file_name_3
The python snippet looks like
import pandas as pd
from google.cloud import storage

def main():
    storage_client = storage.Client.from_service_account_json(
        Constants.GCP_CRENDENTIALS)
    bucket = storage_client.bucket(Constants.GCP_BUCKET_NAME)
    blob = bucket.blob(folder_name_2 + '/' + Constants.GCP_FILE_NAME)
    data_bytes = blob.download_as_bytes()
    df = pd.read_excel(data_bytes, engine='openpyxl')
    print(df)

def function1():
    print("no file in the folder")  # sample error
In the above snippet, I'm trying to open folder_name_2, but it returns an error because there is no file to read.
Instead of throwing an error, I need to use function1 to print a message whenever there is no file in a folder.
Any ideas how to do this?

I'm not familiar with the GCP API, but you're going to want to do something along the lines of this:
try:
    blob = bucket.blob(folder_name_2 + '/' + Constants.GCP_FILE_NAME)
    data_bytes = blob.download_as_bytes()
except Exception as e:
    print(e)
https://docs.python.org/3/tutorial/errors.html#handling-exceptions
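For the specific case in the question, a minimal sketch (assuming the google-cloud-storage client and the same Constants as above) could catch the client's NotFound error and hand it to function1 instead of letting it propagate:

from google.cloud import storage
from google.cloud.exceptions import NotFound
import pandas as pd

def main():
    storage_client = storage.Client.from_service_account_json(
        Constants.GCP_CRENDENTIALS)
    bucket = storage_client.bucket(Constants.GCP_BUCKET_NAME)
    blob = bucket.blob(folder_name_2 + '/' + Constants.GCP_FILE_NAME)
    try:
        # download_as_bytes() raises NotFound when the object does not exist
        data_bytes = blob.download_as_bytes()
    except NotFound:
        function1()  # prints "no file in the folder"
        return
    df = pd.read_excel(data_bytes, engine='openpyxl')
    print(df)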

I'm not sure I understand your final goal, but another approach is to list the available resources in the bucket and process them.
First, let's define a function that lists the available resources in a bucket. You can pass a prefix if you want to limit the search to a sub-folder inside the bucket.
def list_resource(client, bucket_name, prefix=''):
    path_files = []
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        path_files.append(blob.name)
    return path_files
Now you can process your xlsx files:
for resource in list_resource(storage_client, Constants.GCP_BUCKET_NAME):
    if '.xlsx' in resource:
        print(resource)
        # Load blob and process your xlsx file
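Putting the two pieces together, here is a possible sketch (an assumption, not part of the original answer) that loads each listed xlsx file with the same storage_client and Constants as in the question, and falls back to function1() when nothing is found:

import pandas as pd

bucket = storage_client.bucket(Constants.GCP_BUCKET_NAME)
resources = list_resource(storage_client, Constants.GCP_BUCKET_NAME)
if not resources:
    function1()  # nothing to process
for resource in resources:
    if resource.endswith('.xlsx'):
        data_bytes = bucket.blob(resource).download_as_bytes()
        df = pd.read_excel(data_bytes, engine='openpyxl')
        print(resource, df.shape)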

Related

Unzip file content hosted in s3 to multiple cloudfront url through a single lambda function

Is there any specific way to unzip a single file's contents from s3 to multiple cloudfront urls by triggering lambda once?
Let's say there is a zip file containing multiple jpg/png files already uploaded to s3. The intention is to run the lambda function only once to unzip all of its contents and make them available at multiple cloudfront urls.
In the s3 bucket:
archive.zip
    a.jpg
    b.jpg
    c.jpg
Through cloudfront:
https://1232.cloudfront.net/a.jpg
https://1232.cloudfront.net/b.jpg
https://1232.cloudfront.net/c.jpg
I am looking for a solution such that the lambda function is triggered whenever an s3 upload happens and makes all files in the zip available through multiple cloudfront urls.
Hello Prathap Parameswar,
I think you can resolve your problem like this:
First you need to extract your zip file.
Second, you upload the extracted files back to S3.
This is the lambda python function:
import json
import boto3
from io import BytesIO
import zipfile

def lambda_handler(event, context):
    # TODO implement
    s3_resource = boto3.resource('s3')
    source_bucket = 'upload-zip-folder'
    target_bucket = 'upload-extracted-folder'

    my_bucket = s3_resource.Bucket(source_bucket)
    for file in my_bucket.objects.all():
        if str(file.key).endswith('.zip'):
            zip_obj = s3_resource.Object(bucket_name=source_bucket, key=file.key)
            buffer = BytesIO(zip_obj.get()["Body"].read())
            z = zipfile.ZipFile(buffer)
            for filename in z.namelist():
                file_info = z.getinfo(filename)
                try:
                    response = s3_resource.meta.client.upload_fileobj(
                        z.open(filename),
                        Bucket=target_bucket,
                        Key=f'{filename}'
                    )
                except Exception as e:
                    print(e)
        else:
            print(file.key + ' is not a zip file.')
Hope this can help you
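As a hedged variation (not part of the original answer): if the Lambda is attached to an S3 ObjectCreated trigger, the uploaded key can be taken from the event record instead of scanning the whole source bucket. The bucket name below is a placeholder.

import zipfile
from io import BytesIO

import boto3

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    target_bucket = 'upload-extracted-folder'  # placeholder: bucket served by CloudFront
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        if not key.endswith('.zip'):
            print(key + ' is not a zip file.')
            continue
        # Read the uploaded archive into memory and re-upload each member
        buffer = BytesIO(s3_resource.Object(source_bucket, key).get()['Body'].read())
        with zipfile.ZipFile(buffer) as z:
            for filename in z.namelist():
                s3_resource.meta.client.upload_fileobj(
                    z.open(filename), Bucket=target_bucket, Key=filename)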

How do you write a .feather file into GCS?

I previously worked with .csv files, which were straightforward to upload to GCS.
For csv I would do the following, which works:
blob = bucket.blob(path)
blob.upload_from_string(dataframe.to_csv(), 'text/csv')
I am trying to do the same, i.e. write the dataframe as a .feather file to the bucket:
blob = bucket.blob(path)
blob.upload_from_string(dataframe.reset_index().to_feather(), 'text/feather')
However, this fails saying to_feather() requires a fname. Any suggestions/guidance on where I went wrong would be helpful.
upload_from_string works with the to_csv() method because its 'path' parameter is optional; when no path is provided, the result is returned as a string. The to_feather() method, on the other hand, requires a path to be specified. So you should write the feather file locally first and then upload it to GCS.
Refer to the code below:
dataFrame.reset_index().to_feather(FILE PATH)
bucket_name = "BUCKET-NAME"
source_file_name = "FILE PATH"
destination_blob_name = "GCS Object Name"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_name)
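As an alternative sketch (an assumption, not part of the answer above): recent pandas versions let to_feather() write to a file-like object, so the upload can also be done in memory with upload_from_file, skipping the temp file.

from io import BytesIO

from google.cloud import storage

buffer = BytesIO()
dataframe.reset_index().to_feather(buffer)  # assumes a pandas version that accepts file-like objects
buffer.seek(0)

storage_client = storage.Client()
bucket = storage_client.bucket("BUCKET-NAME")
bucket.blob("GCS Object Name").upload_from_file(buffer, content_type="application/octet-stream")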

Elaborate and store inside an azure blob multiple files from a form-data request by using azure functions

I am developing an azure function that receives as input several files of different formats (e.g. xlsx, csv, txt, pdf, png) through form-data. The idea is to develop a function that can take the files and store them one by one inside a blob. At the moment, my code is as follows:
def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    filename, contents = False, False
    try:
        files = req.files.values()
        for file in files:
            filename = str(file.filename)
            logging.info(type(file.stream.read()))
            contents = file.stream.read().decode('utf-8')
    except Exception as ex:
        logging.error(str(type(ex)) + ': ' + str(ex))
        return func.HttpResponse(body=str(ex), status_code=400)
Then I write the contents variable into the blob, but the files inside the blob have a size of 0, and if I try to download them the files are empty. How can I manage this operation to store files of different formats inside a blob? Thanks a lot for your support!
Below is the code to upload multiple files of different formats:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess
import os

def UploadFiles():
    CONNECTION_STRING = "ENTER_STORAGE_CONNECTION_STRING"
    Container_name = "uploadcontainer"
    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)

    ReplacePath = "C:\\"
    local_path = "C:\\Testupload"  # the local folder
    for r, d, f in os.walk(local_path):
        if f:
            for file in f:
                AzurePath = os.path.join(r, file).replace(ReplacePath, "")
                LocalPath = os.path.join(r, file)
                blob_client = container_client.get_blob_client(AzurePath)
                with open(LocalPath, 'rb') as data:
                    blob_client.upload_blob(data)

if __name__ == '__main__':
    UploadFiles()
    print("Files Copied")
From your question I am not able to tell how your function is being triggered or where you are uploading your files.
So, following your logic, you can use the above piece of code to upload all types of files.
Currently the above code uploads all the files in a local folder.
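For the HTTP-trigger scenario in the question, a hedged sketch (the connection string and container name are placeholders) that reads each uploaded file's bytes once and writes them straight to Blob Storage might look like this:

import logging

import azure.functions as func
from azure.storage.blob import BlobServiceClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    service_client = BlobServiceClient.from_connection_string("ENTER_STORAGE_CONNECTION_STRING")
    container_client = service_client.get_container_client("uploadcontainer")
    try:
        for file in req.files.values():
            contents = file.stream.read()  # read the raw bytes only once
            container_client.upload_blob(name=str(file.filename), data=contents)
    except Exception as ex:
        logging.error(str(type(ex)) + ': ' + str(ex))
        return func.HttpResponse(body=str(ex), status_code=400)
    return func.HttpResponse("Files stored", status_code=200)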

App Engine - download files from Cloud Storage

I am using Python 2.7 and Reportlab to create .pdf files for display/print in my app engine system. I am using ndb.Model to store the data if that matters.
I am able to produce the equivalent of a bank statement for a single client on-line. That is; the user clicks the on-screen 'pdf' button and the .pdf statement appears on screen in a new tab, exactly as it should.
I am using the following code to save .pdf files to Google Cloud Storage successfully
buffer = StringIO.StringIO()
self.p = canvas.Canvas(buffer, pagesize=portrait(A4))
self.p.setLineWidth(0.5)
try:
    # create .pdf of .csv data here
finally:
    self.p.save()

pdfout = buffer.getvalue()
buffer.close()

filename = getgcsbucket() + '/InvestorStatement.pdf'
write_retry_params = gcs.RetryParams(backoff_factor=1.1)
try:
    gcs_file = gcs.open(filename,
                        'w',
                        content_type='application/pdf',
                        retry_params=write_retry_params)
    gcs_file.write(pdfout)
except:
    logging.error(traceback.format_exc())
finally:
    gcs_file.close()
I am using the following code to create a list of all files for display on-screen; it shows all the files stored above.
allfiles = []
bucket_name = getgcsbucket()
rfiles = gcs.listbucket(bucket_name)
for rfile in rfiles:
    allfiles.append(rfile.filename)
return allfiles
My screen (html) shows rows of ([Delete] and Filename). When the user clicks the [Delete] button, the following delete code snippet works (filename is the complete /bucket/filename):
filename = self.request.get('filename')
try:
    gcs.delete(filename)
except gcs.NotFoundError:
    pass
My question: given I have a list of files on-screen, I want the user to click on the filename and for that file to be downloaded to the user's computer. In Google's Chrome browser, this would result in the file being downloaded, with its name displayed at the bottom left of the screen.
One other point, the above example is for .pdf files. I will also have to show .csv files in the list and would like them to be downloaded as well. I only want the files to be downloaded, no display is required.
So, I would like a snippet like ...
filename = self.request.get('filename')
try:
    gcs.downloadtousercomputer(filename)  # ???
except gcs.NotFoundError:
    pass
I think I have tried everything I can find both here and elsewhere. Sorry I have been so long-winded. Any hints for me?
To download a file instead of showing it in the browser, you need to add a header to your response:
self.response.headers["Content-Disposition"] = 'attachment; filename="%s"' % filename
You can specify the filename as shown above and it works for any file type.
One solution you can try is to read the file from the bucket and print the content as the response with the correct header:
import cloudstorage
...

def read_file(self, filename):
    bucket_name = "/your_bucket_name"
    file = bucket_name + '/' + filename
    with cloudstorage.open(file) as cloudstorage_file:
        self.response.headers["Content-Disposition"] = str('attachment;filename=' + filename)
        contents = cloudstorage_file.read()
        cloudstorage_file.close()
        self.response.write(contents)
Here filename could be something you send as a GET parameter, and it needs to be a file that exists in your bucket or you will raise an exception.
[1] Here you will find a sample: https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage
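One optional refinement (an assumption, not part of the answer above): since both .pdf and .csv files are listed, the Content-Type header can be guessed from the filename so the browser labels the download correctly. A small self-contained helper:

import mimetypes

def download_headers(filename):
    # Headers that force a download; content type guessed from the file name.
    content_type, _ = mimetypes.guess_type(filename)
    return {
        "Content-Disposition": 'attachment; filename="%s"' % filename,
        "Content-Type": content_type or "application/octet-stream",
    }

# download_headers("InvestorStatement.pdf") -> Content-Type "application/pdf"
# download_headers("statement.csv")         -> Content-Type "text/csv"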

How to get data from s3 and do some work on it? python and boto

I have a project task to use, in an EMR task, some output data I have already produced on s3. Previously I ran an EMR job that produced output in one of my s3 buckets in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on s3 and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
1. Download the file to your local system and parse it (kinda simple, quick and easy).
2. Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, filenames are stored as keys. If you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the code below to download the file into a temp folder.
AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')
source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if name.name in fileName:
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may run into memory errors; to avoid them you will have to write a lazy function which reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
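Applied to the reducer_init from the question, a hedged sketch of the Step 2 approach, reusing the boto (boto2) objects already defined above (awskey, awssecret, bucketname, destination), could look like this:

from boto.s3.connection import S3Connection

def reducer_init(self):
    self.idfs = {}
    conn = S3Connection(awskey, awssecret)
    b = conn.get_bucket(bucketname)
    for key in b.list(prefix=destination):  # iterate over the part-xxxx objects
        for line in key.get_contents_as_string().splitlines():
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']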
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any query you have.
