import boto3
import io
import pandas as pd
# The entry point function can contain up to two input arguments:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1=None, dataframe2=None):
    s3 = boto3.client('s3',
                      aws_access_key_id='REMOVED',
                      aws_secret_access_key='REMOVED')
    obj = s3.get_object(Bucket='bucket', Key='data.csv000')
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))
    return df,
I'm trying to read data from S3 using the Execute Python Script module. I have downloaded the boto3 package and converted it to a zip. I have then uploaded and connected that .zip to the third input of the module. When I run this code, I receive an error stating that botocore is not installed. Has anyone been able to read directly from S3 into Azure ML Studio? I've tried using the R script module, which also fails, so now I'm trying Python.
Since the boto3 package has dependencies, including some that are cloned from git, I don't think Azure ML Studio can use it. According to the note in their documentation, it would be easier to switch to Azure ML Workbench, since it handles Python packages much more easily.
Another option, if you need to use Azure ML Studio, is to copy from S3 into Azure Blob Storage, which ML Studio has great support for.
Not much of an answer, but I'm afraid you've hit a limitation of Azure ML Studio.
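If you do go the Blob Storage route, a minimal sketch of the copy step might look like the following. This is only illustrative: it assumes the azure-storage-blob 12.x client, it runs outside ML Studio (e.g. on a VM or in a scheduled job), and the bucket, container, and credential values are placeholders.
import boto3
from azure.storage.blob import BlobServiceClient  # azure-storage-blob 12.x

# Placeholder credentials and names; replace with your own
s3 = boto3.client('s3',
                  aws_access_key_id='<aws key>',
                  aws_secret_access_key='<aws secret>')
obj = s3.get_object(Bucket='my-s3-bucket', Key='data.csv000')

blob_service = BlobServiceClient.from_connection_string('<azure storage connection string>')
blob_client = blob_service.get_blob_client(container='ml-input', blob='data.csv')

# Copy the S3 object body into the blob (read into memory here for simplicity)
blob_client.upload_blob(obj['Body'].read(), overwrite=True)
ML Studio can then pick the file up from Blob Storage via the Import Data module.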
Related
I've run into a problem after upgrading my pip packages: my BigQuery connector that returns query results suddenly stopped working, with the following error message.
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file',
    scopes=[
        'https://www.googleapis.com/auth/cloud-platform',
        'https://www.googleapis.com/auth/drive',
        'https://www.googleapis.com/auth/bigquery',
    ],
)
client = bigquery.Client(credentials=credentials)
data = client.query('select * from dataset.table').to_dataframe()
PermissionDenied: 403 request failed: the user does not have
'bigquery.readsessions.create' permission
But! If you switch the code to
data = client.query('select * from dataset.table').result()
(to_dataframe -> result), you receive the data as a RowIterator and can read it properly.
The same script, using to_dataframe with the same credentials, was working on the server, so I pinned my bigquery package to the same version, 2.28.0, which still did not help.
I could not find any advice on this error/topic anywhere, so I just want to share it in case any of you face the same thing.
Just set
create_bqstorage_client=False
from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(query)  # 'query' is your SQL string
# Skip the BigQuery Storage API and fall back to the REST-based download
df = query_job.result().to_dataframe(create_bqstorage_client=False)
There are different ways of receiving data from BigQuery. Using the BQ Storage API is considered more efficient for larger result sets compared to the other options:
The BigQuery Storage Read API provides a third option that represents an improvement over prior options. When you use the Storage Read API, structured data is sent over the wire in a binary serialization format. This allows for additional parallelism among multiple consumers for a set of results.
The Python BQ library internally determines whether it can use the BQ Storage API or not. For the result method, it uses the traditional tabledata.list method internally, whereas the to_dataframe method uses the BQ Storage API if the corresponding package is installed.
However, using the BQ Storage API requires you to have the bigquery.readSessionUser role, i.e. the bigquery.readsessions.create permission, which in your case seems to be lacking.
By uninstalling google-cloud-bigquery-storage, you make the google-cloud-bigquery package fall back to the tabledata.list method, so removing that package works around the missing permission.
See the BQ Python Library Documentation for details.
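For illustration, here is a small sketch of the two retrieval paths side by side (the table name is a placeholder; the second call only goes through the Storage Read API when google-cloud-bigquery-storage is installed and the permission is granted):
from google.cloud import bigquery

client = bigquery.Client()
job = client.query('select * from dataset.table')

# Path 1: tabledata.list under the hood; works without bigquery.readsessions.create
rows = job.result()  # RowIterator
df_via_list = rows.to_dataframe(create_bqstorage_client=False)

# Path 2: BigQuery Storage Read API; requires bigquery.readsessions.create
df_via_storage = job.to_dataframe()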
Resolution
Along with the google-cloud-bigquery package, I also had the google-cloud-bigquery-storage package installed. Once I uninstalled that one using
pip uninstall google-cloud-bigquery-storage
everything started working again! Unfortunately, the error message was not very straightforward, so it took some time to figure out :)
I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use 'gsutil'. The scope of the libraries I'm working with is gcloud, boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to properly provide.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials

GSC_Token_File = 'path/to/GSC_token'

s3 = boto3.client('s3', region_name='MyRegion')  # I'm running from AWS Lambda, no authentication required

# The variable holds a path, so from_json_keyfile_name is the matching loader
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')

s3_file_to_load = s3.get_object(Bucket='MyS3_bucket',
                                Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8')

blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently, the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite; an HMAC key can easily be created using these instructions.
import boto3

# I'm using a GCP Service Account, so my HMAC key was created accordingly.
# An HMAC key for a User Account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'

# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3",  # yes, the 's3' client - the GCS endpoint speaks the S3 API
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)

file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. There is one thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3, you are indeed reading it and writing it somewhere, be it into an in-memory string or to a temporary file. In that sense, it actually doesn't matter which of blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file and the second doesn't, because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway, the file approach should be possible by doing something along the lines of the code below (untested, as I have no S3 to check):
# From S3 boto docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
# upload_from_file() expects a file object, so pass the local path to upload_from_filename() instead
blob.upload_from_filename('FILE_NAME')
Finally, it is worth mentioning the Storage Transfer tool, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may take a look at the code samples for Python.
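As a rough, untested sketch of what kicking off such a transfer can look like from Python with the google-api-python-client discovery client (project, bucket, and credential values are placeholders, and the job body fields should be double-checked against the linked samples):
import googleapiclient.discovery
from datetime import date

storagetransfer = googleapiclient.discovery.build('storagetransfer', 'v1')

today = date.today()
transfer_job = {
    'description': 'One-off copy from S3 to GCS',
    'status': 'ENABLED',
    'projectId': '<my-gcp-project>',
    'schedule': {
        # Same start and end date = run the job once
        'scheduleStartDate': {'day': today.day, 'month': today.month, 'year': today.year},
        'scheduleEndDate': {'day': today.day, 'month': today.month, 'year': today.year},
    },
    'transferSpec': {
        'awsS3DataSource': {
            'bucketName': '<MyS3_bucket>',
            'awsAccessKey': {
                'accessKeyId': '<aws access key>',
                'secretAccessKey': '<aws secret key>',
            },
        },
        'gcsDataSink': {'bucketName': '<MyGCS_bucket>'},
    },
}

result = storagetransfer.transferJobs().create(body=transfer_job).execute()
print(result['name'])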
I am trying to upload files to Azure using only the SAS URI. I found ways to do it using C#, but I didn't find a solution using Python. The only solution I found using Python is to pass the account name and account key as parameters to BlockBlobService. Here is an example: Upload image to azure blob storage using python, but I am trying to avoid that solution. Is there a specific way to upload CSV files to Azure using only the SAS URI? Thanks for your help :)
If you're using the latest Python blob SDK, azure-storage-blob 12.4.0, you can use code like the following (please feel free to modify it as needed):
from azure.storage.blob import BlobClient

upload_file_path = "d:\\a11.csv"
sas_url = "https://xxx.blob.core.windows.net/test5/a11.csv?sastoken"

client = BlobClient.from_blob_url(sas_url)
with open(upload_file_path, 'rb') as data:
    client.upload_blob(data)
print("**file uploaded**")
This might help:
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#upload-blobs-to-a-container
The example is shown using the Python SDK for Azure Storage.
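For reference, the quickstart approach boils down to something like this sketch; note it uses the storage account connection string rather than a SAS URI, so it only fits if you have access to the account keys (container, blob, and path values are placeholders):
from azure.storage.blob import BlobServiceClient

connect_str = '<your storage account connection string>'
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(container='test5', blob='a11.csv')

with open('d:\\a11.csv', 'rb') as data:
    blob_client.upload_blob(data, overwrite=True)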
I have several txt and csv datasets in one S3 bucket, my_bucket, and a deep learning Ubuntu EC2 instance. I am using a Jupyter notebook on this instance. I need to read data from S3 into Jupyter.
I searched almost everywhere in the AWS documentation and their forum, together with other blogs. This is the best I could do. However, after getting both keys and restarting the instance (and AWS too), I still get an error for aws_key.
I'm wondering if anyone has run into this or has a better idea of how to get the data from there. I'm open to anything as long as it's not over HTTP (which requires the data to be public). Thank you.
import pandas as pd
from smart_open import smart_open
import os

aws_key = os.environ['aws_key']
aws_secret = os.environ['aws_secret']

bucket_name = 'my_bucket'
object_key = 'data.csv'

# smart_open expects the form s3://key:secret@bucket/object
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)

df = pd.read_csv(smart_open(path))
Your code sample would work if you first export aws_key and aws_secret as environment variables. Something like this would work (assuming bash is your shell):
export aws_key=<your key>
export aws_secret=<your aws secret>
python yourscript.py
It is best practice to export things like keys and secrets so that you are not storing confidential/secret values in your source code. If you were to hard-code those values into your script and accidentally commit them to a public repo, it would be easy for someone to take over your AWS account.
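If you want the script to fail with a clearer message when the variables are missing, a small sketch along these lines (using os.environ.get instead of direct indexing) can help:
import os

aws_key = os.environ.get('aws_key')
aws_secret = os.environ.get('aws_secret')

# Fail early with a readable message instead of a bare KeyError
if not aws_key or not aws_secret:
    raise RuntimeError('Please export aws_key and aws_secret before running this script')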
I am answering my own question here and would like to hear from the community too on different solutions: directly access the S3 data from the Ubuntu Deep Learning instance by
cd ~/.aws
aws configure
Then update the AWS key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
In the above code, "aws_key" and "aws_secret" are not set as environment variables on the Ubuntu instance, so the built-in os.environ lookup cannot be used; assign the values directly instead:
aws_key = 'aws_key'        # replace with your actual access key
aws_secret = 'aws_secret'  # replace with your actual secret key
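Note that once aws configure has stored the credentials on the instance, you don't need to embed the keys in the path at all; smart_open (and boto3 underneath it) will pick them up from the default credential chain, roughly like this:
import pandas as pd
from smart_open import smart_open

# Credentials come from ~/.aws/credentials (written by aws configure),
# so the URI only needs the bucket and object key
df = pd.read_csv(smart_open('s3://my_bucket/data.csv'))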
I have a Python 3.6 application that uses scikit-learn, deployed to IBM Cloud (Cloud Foundry). It works fine. My local development environment is Mac OS High Sierra.
Recently, I added IBM Cloud Object Storage functionality (ibm_boto3) to the app. The COS functionality itself works fine. I'm able to upload, download, list, and delete objects just fine using the ibm_boto3 library.
Strangely, the part of the app that uses scikit-learn now freezes up.
If I comment out the ibm_boto3 import statements (and corresponding code), the scikit-learn code works fine.
More perplexingly, the issue only happens on the local development machine running OS X. When the app is deployed to IBM Cloud, it works fine -- both scikit-learn and ibm_boto3 work well side-by-side.
Our only hypothesis at this point is that somehow the ibm_boto3 library surfaces a known issue in scikit-learn (see this -- parallel version of the K-means algorithm is broken when numpy uses Accelerator on OS X).
Note that we only face this issue once we add ibm_boto3 to the project.
However, we need to be able to test on localhost before deploying to IBM Cloud. Are there any known compatibility issues between ibm_boto3 and scikit-learn on Mac OS?
Any suggestions on how we can avoid this on the dev machine?
Cheers.
Up until now, there weren't any known compatibility issues. :)
At some point there were some issues with the vanilla SSL libraries that come with OSX, but if you're able to read and write data that isn't the problem.
Are you using HMAC credentials? If so, I'm curious if the behavior continues if you use the original boto3 library instead of the IBM fork.
Here's a simple example that shows how you might use pandas with the original boto3:
import boto3 # package used to connect to IBM COS using the S3 API
import io # python package used to stream data
import pandas as pd # lightweight data analysis package
access_key = '<access key>'
secret_key = '<secret key>'
pub_endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
pvt_endpoint = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'
bucket = 'demo' # the bucket holding the objects being worked on.
object_key = 'demo-data' # the name of the data object being analyzed.
result_key = 'demo-data-results' # the name of the output data object.
# First, we need to open a session and create a client that can connect to IBM COS.
# This client needs to know where to connect, the credentials to use,
# and what signature protocol to use for authentication. The endpoint
# can be specified to be public or private.
cos = boto3.client('s3', endpoint_url=pub_endpoint,
                   aws_access_key_id=access_key,
                   aws_secret_access_key=secret_key,
                   region_name='us',
                   config=boto3.session.Config(signature_version='s3v4'))
# Since we've already uploaded the dataset to be worked on into cloud storage,
# now we just need to identify which object we want to use. This creates a JSON
# representation of request's response headers.
obj = cos.get_object(Bucket=bucket, Key=object_key)
# Now, because this is all REST API based, the actual contents of the file are
# transported in the request body, so we need to identify where to find the
# data stream containing the actual CSV file we want to analyze.
data = obj['Body'].read()
# Now we can read that data stream into a pandas dataframe.
df = pd.read_csv(io.BytesIO(data))
# This is just a trivial example, but we'll take that dataframe and just
# create a JSON document that contains the mean values for each column.
output = df.mean(axis=0, numeric_only=True).to_json()
# Now we can write that JSON file to COS as a new object in the same bucket.
cos.put_object(Bucket=bucket, Key=result_key, Body=output)