ibm_boto3 compatibility issue with scikit-learn on Mac OS - python

I have a Python 3.6 application that uses scikit-learn, deployed to IBM Cloud (Cloud Foundry). It works fine. My local development environment is Mac OS High Sierra.
Recently, I added IBM Cloud Object Storage functionality (ibm_boto3) to the app. The COS functionality itself works fine. I'm able to upload, download, list, and delete objects just fine using the ibm_boto3 library.
Strangely, the part of the app that uses scikit-learn now freezes up.
If I comment out the ibm_boto3 import statements (and corresponding code), the scikit-learn code works fine.
More perplexingly, the issue only happens on the local development machine running OS X. When the app is deployed to IBM Cloud, it works fine -- both scikit-learn and ibm_boto3 work well side-by-side.
Our only hypothesis at this point is that somehow the ibm_boto3 library surfaces a known issue in scikit-learn (see this -- the parallel version of the K-means algorithm is broken when numpy uses Accelerate on OS X).
Note that we only face this issue once we add ibm_boto3 to the project.
However, we need to be able to test on localhost before deploying to IBM Cloud. Are there any known compatibility issues between ibm_boto3 and scikit-learn on Mac OS?
Any suggestions on how we can avoid this on the dev machine?
Cheers.

Up until now, there weren't any known compatibility issues. :)
At some point there were some issues with the vanilla SSL libraries that come with OSX, but if you're able to read and write data that isn't the problem.
Are you using HMAC credentials? If so, I'm curious if the behavior continues if you use the original boto3 library instead of the IBM fork.
Here's a simple example that shows how you might use pandas with the original boto3:
import boto3 # package used to connect to IBM COS using the S3 API
import io # python package used to stream data
import pandas as pd # lightweight data analysis package
access_key = '<access key>'
secret_key = '<secret key>'
pub_endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
pvt_endpoint = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'
bucket = 'demo' # the bucket holding the objects being worked on.
object_key = 'demo-data' # the name of the data object being analyzed.
result_key = 'demo-data-results' # the name of the output data object.
# First, we need to open a session and create a client that can connect to IBM COS.
# This client needs to know where to connect, the credentials to use,
# and what signature protocol to use for authentication. The endpoint
# can be specified to be public or private.
cos = boto3.client('s3', endpoint_url=pub_endpoint,
                   aws_access_key_id=access_key,
                   aws_secret_access_key=secret_key,
                   region_name='us',
                   config=boto3.session.Config(signature_version='s3v4'))
# Since we've already uploaded the dataset to be worked on into cloud storage,
# now we just need to identify which object we want to use. This returns a
# dictionary of response metadata along with a streaming body.
obj = cos.get_object(Bucket=bucket, Key=object_key)
# Now, because this is all REST API based, the actual contents of the file are
# transported in the response body, so we need to identify where to find the
# data stream containing the actual CSV file we want to analyze.
data = obj['Body'].read()
# Now we can read that data stream into a pandas dataframe.
df = pd.read_csv(io.BytesIO(data))
# This is just a trivial example, but we'll take that dataframe and just
# create a JSON document that contains the mean values for each column.
output = df.mean(axis=0, numeric_only=True).to_json()
# Now we can write that JSON file to COS as a new object in the same bucket.
cos.put_object(Bucket=bucket, Key=result_key, Body=output)
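Separately, since the question links the Accelerate/K-means issue: a quick local check (a sketch only, not a confirmed fix for your freeze) is to look at which BLAS your NumPy is linked against and temporarily disable scikit-learn's process-based parallelism; the n_jobs parameter below exists in the KMeans versions current at the time of the question:
import numpy as np
np.show_config()  # 'accelerate' / 'veclib' in the BLAS section points at Apple's Accelerate

from sklearn.cluster import KMeans
X = np.random.rand(100, 4)           # stand-in for your real data
km = KMeans(n_clusters=3, n_jobs=1)  # n_jobs=1 avoids the fork-based parallelism that can hang
km.fit(X)
print(km.cluster_centers_)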

Related

How can I be sure that a library like Pandas is not sending my API key secrets outside my local machine?

Let's say:
I have my Python code in main.py and I am using Pandas.
I am storing my API key (to some Azure service) in a Windows environment variable (variable name = "AZURE_KEY" and variable value = "abc123abc").
I will import this API key in main.py using azure_key = os.environ.get("AZURE_KEY").
Question:
How can I be sure that the Pandas library hasn't sent azure_key's value somewhere outside my local system?
Possible Approach:
I know one way is to go through the entire Pandas source code to see if anything fishy is happening, but such an approach is not feasible.
Note:
Pandas is just an example for the question. I want to use an API key within a Streamlit app.
Hence, please treat this question as agnostic to the library.
For a production system (on a server), you could use a firewall to filter outgoing connections
For a development system (your machine), you could add restrictions to the "API Key" account (e.g. only access test data, only access systems you really need, etc.)
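If you also want a rough, code-level check on the development machine, one option is to log every outbound connection your own process opens before importing the library. This is only a minimal sketch, not a complete audit (it only sees connections made through Python's socket module):
import socket

_real_connect = socket.socket.connect

def logging_connect(self, address):
    # Print every destination the current Python process connects to.
    print("outbound connection to:", address)
    return _real_connect(self, address)

socket.socket.connect = logging_connect

# Import and exercise the library under suspicion afterwards, e.g.:
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.mean())  # purely local work should print no connections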

Google BigQuery query in Python works when using result(), but Permission issue when using to_dataframe()

I've run into a problem after upgrading my pip packages: my BigQuery connector that returns query results suddenly stopped working with the error message shown below the code.
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file',
    scopes=['https://www.googleapis.com/auth/cloud-platform',
            'https://www.googleapis.com/auth/drive',
            'https://www.googleapis.com/auth/bigquery'])
client = bigquery.Client(credentials=credentials)
data = client.query('select * from dataset.table').to_dataframe()
PermissionDenied: 403 request failed: the user does not have
'bigquery.readsessions.create' permission
But! If you switched the code to
data = client.query('select * from dataset.table').result()
(to_dataframe -> result), you received the data as a RowIterator and were able to read it properly.
The same script using to_dataframe with the same credentials was working on the server. Therefore I set my bigquery package to the same version 2.28.0, which still did not help.
I could not find any advice on this error / topic anywhere, so I just want to share in case any of you face the same thing.
Just set
create_bqstorage_client=False
from google.cloud import bigquery

client = bigquery.Client()
query = 'select * from dataset.table'  # the same query as in the question
query_job = client.query(query)
df = query_job.result().to_dataframe(create_bqstorage_client=False)
There are different ways of retrieving data from BigQuery. Using the BQ Storage API is considered more efficient for larger result sets compared to the other options:
The BigQuery Storage Read API provides a third option that represents an improvement over prior options. When you use the Storage Read API, structured data is sent over the wire in a binary serialization format. This allows for additional parallelism among multiple consumers for a set of results
The Python BQ library internally determines whether it can use the BQ Storage API or not. For the result method, it uses the traditional tabledata.list method internally, whereas the to_dataframe method uses the BQ Storage API if the corresponding package is installed.
However, using the BQ Storage API requires the bigquery.readSessionUser role, i.e. the bigquery.readsessions.create permission, which in your case seems to be missing.
After you uninstalled google-cloud-bigquery-storage, the google-cloud-bigquery package fell back to the list method. Hence, by uninstalling that package, you worked around the missing permission.
See the BQ Python Library Documentation for details.
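As a quick way to check which path your environment will take (a small sketch; google.cloud.bigquery_storage is the import name I would expect the storage package to provide):
import importlib.util

# If this is None, google-cloud-bigquery falls back to tabledata.list,
# which does not need the readsessions.create permission.
has_bq_storage = importlib.util.find_spec("google.cloud.bigquery_storage") is not None
print("BigQuery Storage Read API client installed:", has_bq_storage)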
Resolution
Along with the google-cloud-bigquery package, I also had the google-cloud-bigquery-storage package installed. Once I uninstalled that one using
pip uninstall google-cloud-bigquery-storage
everything started working again! Unfortunately, the error message was not very straightforward, so it took some time to figure out :)

reading data from s3 by jupyter notebook in ubuntu ec2 deep learning instance

I have several txt and csv datasets in one s3 bucket, my_bucket, and a deep learning ubuntu ec2 instance. I am using Jupyter notebook on this instance. I need to read data from s3 to Jupyter.
I searched almost everywhere in the AWS documentation and their forum, together with other blogs. This is the best I could do. However, after getting both keys and restarting the instance (and AWS too), I still get an error for aws_key.
I'm wondering if anyone ran into this, or if you have a better idea of how to get the data from there. I'm open to anything as long as it's not using HTTP (which requires the data to be public). Thank you.
import pandas as pd
from smart_open import smart_open
import os
aws_key = os.environ['aws_key']
aws_secret = os.environ['aws_secret']
bucket_name = 'my_bucket'
object_key = 'data.csv'
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
Your code sample would work if you export the aws_key and aws_secret first. Something like this would work (assuming bash is your shell):
export aws_key=<your key>
export aws_secret=<your aws secret>
python yourscript.py
It is best practice to export things like keys and secrets so that you are not storing confidential/secret things in your source code. If you were to hard code those values into your script and accidentally commit them to a public repo, it would be easy for someone to take over your aws account.
I am answering my own question here and would like to hear from the community too on different solutions: directly access the S3 data from the Ubuntu Deep Learning instance by
cd ~/.aws
aws configure
Then update the AWS key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
In the above code, "aws_key" and "aws_secret" are not set as environment variables on the Ubuntu instance, hence os.environ cannot be used to look them up; the values have to be assigned directly instead:
aws_key = '<your aws key>'
aws_secret = '<your aws secret>'
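Once aws configure has written the credentials, a minimal sketch of the read itself (reusing the bucket and object names from the question) could look like this; boto3 resolves the keys from ~/.aws automatically, so nothing sensitive needs to live in the notebook:
import io

import boto3
import pandas as pd

# Credentials are picked up from ~/.aws/credentials (or an attached IAM role),
# so no keys appear in the notebook itself.
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my_bucket', Key='data.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
print(df.head())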

How to read data from S3 using python in Azure ML

import boto3
import io
import pandas as pd

# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1=None, dataframe2=None):
    s3 = boto3.client('s3',
                      aws_access_key_id='REMOVED',
                      aws_secret_access_key='REMOVED')
    obj = s3.get_object(Bucket='bucket', Key='data.csv000')
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))
    return df,
I'm trying to read data from S3 using the Execute Python Script module. I have downloaded the boto3 package and converted it to a zip. I have then uploaded and connected that .zip to the third input option of the module. When I run this code, I receive an error stating botocore is not installed. Has anyone been able to read directly from S3 into Azure ML Studio? I've tried using the R script module, which also fails, so now I'm trying Python.
Since the boto3 package has dependencies, including some that are cloned from git, I don't think Azure ML Studio can use it. According to the note in their documentation, it would be easier to switch to Azure ML Workbench, since it can handle Python packages much more easily.
Another option, if you need to use Azure ML Studio, is to copy from S3 into Azure Blob Storage, which ML Studio has great support for.
Not much of an answer, but I'm afraid you've hit a limitation of Azure ML Studio.
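If you go the copy route, a minimal sketch of the transfer step, run outside ML Studio (for example locally or from a scheduled job) and assuming the azure-storage-blob v12 package, could look like this; the bucket, key, container, and connection string are placeholders:
import boto3
from azure.storage.blob import BlobServiceClient

# Placeholders -- substitute your own names and credentials.
s3 = boto3.client('s3')  # AWS credentials resolved from the environment or ~/.aws
blob_service = BlobServiceClient.from_connection_string('<azure storage connection string>')

# Download the object from S3 ...
body = s3.get_object(Bucket='my-s3-bucket', Key='data.csv000')['Body'].read()

# ... and upload it to Azure Blob Storage, where ML Studio can read it.
blob_client = blob_service.get_blob_client(container='mlstudio-input', blob='data.csv')
blob_client.upload_blob(body, overwrite=True)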

Authentication to Google Cloud Python API Library stopped working

I have problems with authentication in the Google Cloud API Python library.
At first it worked for some days without problems, but suddenly the API calls are not showing up in the API overview of the Google Cloud Platform.
I created a service account and stored the json file locally. Then I set the environment variable GCLOUD_PROJECT to the project ID and GOOGLE_APPLICATION_CREDENTIALS to the path of the json file.
from google.cloud import speech
client = speech.Client()
print(client._credentials.service_account_email)
prints the correct service account email.
The following code transcribes the audio_file successfully, but the Dashboard for my Google Cloud project doesn't show anything for the activated Speech API Graph.
import io
with io.open(audio_file, 'rb') as f:
    audio = client.sample(f.read(), source_uri=None, sample_rate=48000,
                          encoding=speech.encoding.Encoding.FLAC)
alternatives = audio.sync_recognize(language_code='de-DE')
At some point the code also ran into some errors regarding the usage limit. I guess that, due to the unsuccessful authentication, the free/limited option is somehow being used.
I also tried the alternative authentication option of installing the Google Cloud SDK and running gcloud auth application-default login, but without success.
I have no idea where to start troubleshooting the problem.
Any help is appreciated!
(My system is running Windows 7 with Anaconda)
EDIT:
The error count (Fehler) is increasing with calls to the API. How can I get detailed information about the error?!
Make sure you are using an absolute path when setting the GOOGLE_APPLICATION_CREDENTIALS environment variable. Also, you might want to try inspecting the access token using OAuth2 tokeninfo and make sure it has "scope": "https://www.googleapis.com/auth/cloud-platform" in its response.
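A small sanity-check sketch for the environment-variable part (nothing here is specific to the Speech API):
import os

cred_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
print("variable set:", bool(cred_path))
print("absolute path:", os.path.isabs(cred_path))
print("file exists:", os.path.exists(cred_path))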
Sometimes you will get different error information if you initialize the client with GRPC enabled:
0.24.0:
speech_client = speech.Client(_use_grpc=True)
0.23.0:
speech_client = speech.Client(use_gax=True)
Usually it's an encoding issue. Can you try with the sample audio, or try generating LINEAR16 samples using something like the Unix rec tool:
rec --channels=1 --bits=16 --rate=44100 audio.wav trim 0 5
...
with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()
    audio_sample = speech_client.sample(
        content,
        source_uri=None,
        encoding='LINEAR16',
        sample_rate=44100)
Other notes:
Sync Recognize is limited to 60 seconds of audio; you must use async recognition for longer audio.
If you haven't already, set up billing for your account
With regard to the usage problem, the issue is in fact that when you use the new google-cloud library to access ML APIs, it seems everyone authenticates against a project shared by everyone (hence it says you've used up your limit even though you haven't used anything). To check and confirm this, you can call an ML API that you have not enabled using the Python client library, which will give you a result even though it shouldn't. This problem persists across other language client libraries and operating systems, so I suspect it's an issue with their gRPC layer.
Because of this, to ensure consistency I always use the older googleapiclient that uses my API key. Here is an example to use the translate API:
from googleapiclient import discovery
service = discovery.build('translate', 'v2', developerKey='')
service_request = service.translations().list(q='hello world', target='zh')
result = service_request.execute()
print(result)
For the speech API, it's something along the lines of:
from googleapiclient import discovery
service = discovery.build('speech', 'v1beta1', developerKey='')
service_request = service.speech().syncrecognize()
result = service_request.execute()
print(result)
You can get the list of the discovery APIs at https://developers.google.com/api-client-library/python/apis/ with the speech one located in https://developers.google.com/resources/api-libraries/documentation/speech/v1beta1/python/latest/.
One of the other benefits of using the discovery library is that you get a lot more options compared to the current library, although oftentimes it's a bit more of a pain to implement.
