Azure service principal and storage.blob with Python

I'm trying to authenticate with a service principal through Python and then access azure.storage.blob.
I used to do it with:
NAME = '****'
KEY = '****'
block_blob_service = BlockBlobService(account_name=NAME, account_key=KEY, protocol='https')
But I can't make it work with the service principal:
TENANT_ID = '****'
CLIENT = '****'
KEY_SERVICE = '****'
credentials = ServicePrincipalCredentials(
    client_id=CLIENT,
    secret=KEY_SERVICE,
    tenant=TENANT_ID
)
I'm a little confused about how to pair those two, and whatever I try just gives me a timeout when I try to upload a blob.

I don't think the Azure Storage service supports service principal credentials. It currently accepts only two kinds of credentials: shared keys and Shared Access Signatures (SAS).

Azure Storage works specifically with account name + key (whether primary or secondary). There is no notion of service principal / AD-based access.
Your first example (setting up the blob service endpoint) with account name + key is the correct way to operate.
Note: As Zhaoxing mentioned, you can also use SAS. But from a programmatic standpoint, assuming you are the storage account owner, that doesn't really buy you much.
The only place Service Principals (and AD in general) come into play is in managing the resource itself (e.g. the storage account, from a deployment/management/deletion standpoint).
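For reference, here is a minimal sketch of the SAS option mentioned above, using the same legacy BlockBlobService as in the question (the SAS token value and container/blob names are placeholders, not tested against your account):
from azure.storage.blob import BlockBlobService

SAS_TOKEN = '****'  # placeholder: a SAS token you generate for the account or container
block_blob_service = BlockBlobService(account_name=NAME, sas_token=SAS_TOKEN, protocol='https')
block_blob_service.create_blob_from_path('mycontainer', 'myblob.csv', 'local_file.csv')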

This is highly confusing.
Once a service principal (SP) is registered in the Azure portal with a new secret created, and assigned the Contributor role at the resource group level, any storage accounts and containers in that group inherit this role.
In your application, you can create a ServicePrincipalCredentials() and a ClientSecretCredential() from the registered SP:
service_credential = ServicePrincipalCredentials(
    tenant='<yourTenantID>',
    client_id='<yourClientID>',
    secret='<yourClientSecret>'
)
client_credential = ClientSecretCredential(
    '<yourTenantID>',
    '<yourClientID>',
    '<yourClientSecret>'
)
From here, create a ResourceManagementClient()...
resource_client = ResourceManagementClient(service_credential, subscription_id)
...which you can use to list resource groups, resources, and storage accounts:
for item in resource_client.resource_groups.list():
    print(item.name)

for item in resource_client.resources.list():
    print(item.name + " " + item.type)

for item in resource_client.resources.list_by_resource_group('azureStorage'):
    print(item.name)
BUT... from my research, you cannot list blob containers or blobs within a given container using ResourceManagementClient(). So we move to a BlobServiceClient():
blob_service_client = BlobServiceClient(account_url = url, credential=client_credential)
From here, you can list blob containers...
blob_list = blob_service_client.list_containers()
for blob in blob_list:
    print(blob.name + " " + str(blob.last_modified))
BUT... from my research, you cannot list blobs within a container:
container_client = blob_service_client.get_container_client('testcontainer')
blob_list = container_client.list_blobs()
for blob in blob_list:
    print("\t" + blob.name)
---------------------------------------------------------------------------
StorageErrorException Traceback (most recent call last)
~/anaconda3_501/lib/python3.6/site-packages/azure/storage/blob/_models.py in _get_next_cb(self, continuation_token)
599 cls=return_context_and_deserialized,
--> 600 use_location=self.location_mode)
601 except StorageErrorException as error:
~/anaconda3_501/lib/python3.6/site-packages/azure/storage/blob/_generated/operations/_container_operations.py in list_blob_flat_segment(self, prefix, marker, maxresults, include, timeout, request_id, cls, **kwargs)
1142 map_error(status_code=response.status_code, response=response, error_map=error_map)
-> 1143 raise models.StorageErrorException(response, self._deserialize)
1144
StorageErrorException: Operation returned an invalid status 'This request is not authorized to perform this operation using this permission.'
During handling of the above exception, another exception occurred:
HttpResponseError Traceback (most recent call last)
<ipython-input-104-7517e7a6a19f> in <module>
1 container_client = blob_service_client.get_container_client('testcontainer')
2 blob_list = container_client.list_blobs()
----> 3 for blob in blob_list:
4 print("\t" + blob.name)
~/anaconda3_501/lib/python3.6/site-packages/azure/core/paging.py in __next__(self)
120 if self._page_iterator is None:
121 self._page_iterator = itertools.chain.from_iterable(self.by_page())
--> 122 return next(self._page_iterator)
123
124 next = __next__ # Python 2 compatibility.
~/anaconda3_501/lib/python3.6/site-packages/azure/core/paging.py in __next__(self)
72 raise StopIteration("End of paging")
73
---> 74 self._response = self._get_next(self.continuation_token)
75 self._did_a_call_already = True
76
~/anaconda3_501/lib/python3.6/site-packages/azure/storage/blob/_models.py in _get_next_cb(self, continuation_token)
600 use_location=self.location_mode)
601 except StorageErrorException as error:
--> 602 process_storage_error(error)
603
604 def _extract_data_cb(self, get_next_return):
~/anaconda3_501/lib/python3.6/site-packages/azure/storage/blob/_shared/response_handlers.py in process_storage_error(storage_error)
145 error.error_code = error_code
146 error.additional_info = additional_data
--> 147 raise error
148
149
HttpResponseError: This request is not authorized to perform this operation using this permission.
RequestId:e056fe39-b01e-0007-425c-20a63f000000
Time:2020-05-02T08:32:02.2204809Z
ErrorCode:AuthorizationPermissionMismatch
Error:None
The only way I've found to list blobs within a container (and to do other things like copy blobs, etc.) is to create the BlobServiceClient using a connection string rather than the tenant ID, client ID, and client secret.
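For example, a minimal sketch of that connection-string approach (account name and key are placeholders; 'testcontainer' matches the example above):
from azure.storage.blob import BlobServiceClient

conn_str = (
    'DefaultEndpointsProtocol=https;'
    'AccountName=<yourAccountName>;'
    'AccountKey=<yourAccountKey>;'
    'EndpointSuffix=core.windows.net'
)
blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client('testcontainer')
for blob in container_client.list_blobs():
    print("\t" + blob.name)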
There is some more info about using a token to access blob resources here, here, and here, but I haven't been able to test it yet.

Related

Azure: create storage account with container and upload blob to it in Python

I'm trying to create a storage account in Azure and upload a blob into it using their Python SDK.
I managed to create an account like this:
client = get_client_from_auth_file(StorageManagementClient)
storage_account = client.storage_accounts.create(
    resourceGroup,
    name,
    StorageAccountCreateParameters(
        sku=Sku(name=SkuName.standard_ragrs),
        enable_https_traffic_only=True,
        kind=Kind.storage,
        location=region)).result()
The problem is that later, when I try to build a container, I don't know what to insert as "account_url".
I have tried doing:
client = get_client_from_auth_file(BlobServiceClient, account_url=storage_account.primary_endpoints.blob)
return client.create_container(name)
But I'm getting:
azure.core.exceptions.ResourceNotFoundError: The specified resource does not exist
I did manage to create a container using:
client = get_client_from_auth_file(StorageManagementClient)
return client.blob_containers.create(
    resourceGroup,
    storage_account.name,
    name,
    BlobContainer(),
    public_access=PublicAccess.Container
)
But later, when I try to upload a blob using BlobServiceClient or BlobClient, I still need the "account_url", so I'm still getting an error:
azure.core.exceptions.ResourceNotFoundError: The specified resource does not exist
Can anyone help me understand how to get the account_url for a storage account I created with the SDK?
EDIT:
I managed to find a workaround to the problem by creating the connection string from the storage keys.
storage_client = get_client_from_auth_file(StorageManagementClient)
storage_keys = storage_client.storage_accounts.list_keys(resource_group, account_name)
storage_key = next(v.value for v in storage_keys.keys)
return BlobServiceClient.from_connection_string(
    'DefaultEndpointsProtocol=https;' +
    f'AccountName={account_name};' +
    f'AccountKey={storage_key};' +
    'EndpointSuffix=core.windows.net')
This works, but I think George Chen's answer is more elegant.
I could reproduce this problem. I found that get_client_from_auth_file does not pass the credential through to the BlobServiceClient; if you just create a BlobServiceClient with an account_url and no credential, it can still print the account name.
So if you want to use a credential to get a BlobServiceClient, you could use the code below and then do other operations.
credentials = ClientSecretCredential(
    'tenant_id',
    'application_id',
    'application_secret'
)
blobserviceclient = BlobServiceClient(account_url=storage_account.primary_endpoints.blob, credential=credentials)
If you don't want to do it this way, you could create the BlobServiceClient with the account key:
client = get_client_from_auth_file(StorageManagementClient, auth_path='auth')
storage_account = client.storage_accounts.create(
    'group name',
    'account name',
    StorageAccountCreateParameters(
        sku=Sku(name=SkuName.standard_ragrs),
        enable_https_traffic_only=True,
        kind=Kind.storage,
        location='eastus')).result()
storage_keys = client.storage_accounts.list_keys(resource_group_name='group name', account_name='account name')
storage_keys = {v.key_name: v.value for v in storage_keys.keys}
blobserviceclient = BlobServiceClient(account_url=storage_account.primary_endpoints.blob, credential=storage_keys['key1'])
blobserviceclient.create_container(name='container name')

Troubles querying public BigQuery data using local workstation

I am trying to query public data through the BigQuery API (the Ethereum dataset) in my Colab notebook.
I have tried this:
from google.colab import auth
auth.authenticate_user()
from google.cloud import bigquery
eth_project_id = 'crypto_ethereum_classic'
client = bigquery.Client(project=eth_project_id)
and receive this error message:
WARNING:google.auth._default:No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
I have also tried using the BigQueryHelper library and received a similar error message:
from bq_helper import BigQueryHelper
eth_dataset = BigQueryHelper(active_project="bigquery-public-data",dataset_name="crypto_ethereum_classic")
Error:
WARNING:google.auth._default:No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-21-53ac8b2901e1> in <module>()
1 from bq_helper import BigQueryHelper
----> 2 eth_dataset = BigQueryHelper(active_project="bigquery-public-data",dataset_name="crypto_ethereum_classic")
/content/src/bq-helper/bq_helper.py in __init__(self, active_project, dataset_name, max_wait_seconds)
23 self.dataset_name = dataset_name
24 self.max_wait_seconds = max_wait_seconds
---> 25 self.client = bigquery.Client()
26 self.__dataset_ref = self.client.dataset(self.dataset_name, project=self.project_name)
27 self.dataset = None
/usr/local/lib/python3.6/dist-packages/google/cloud/bigquery/client.py in __init__(self, project, credentials, _http, location, default_query_job_config)
140 ):
141 super(Client, self).__init__(
--> 142 project=project, credentials=credentials, _http=_http
143 )
144 self._connection = Connection(self)
/usr/local/lib/python3.6/dist-packages/google/cloud/client.py in __init__(self, project, credentials, _http)
221
222 def __init__(self, project=None, credentials=None, _http=None):
--> 223 _ClientProjectMixin.__init__(self, project=project)
224 Client.__init__(self, credentials=credentials, _http=_http)
/usr/local/lib/python3.6/dist-packages/google/cloud/client.py in __init__(self, project)
176 if project is None:
177 raise EnvironmentError(
--> 178 "Project was not passed and could not be "
179 "determined from the environment."
180 )
OSError: Project was not passed and could not be determined from the environment.
Just to reiterate, I am using Colab. I know how to query the data on Kaggle, but I need to do it in my Colab notebook.
In Colab, you need to authenticate first:
from google.colab import auth
auth.authenticate_user()
That will authenticate your user account to a project.
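As a minimal sketch of what the warning is asking for (the project id below is a placeholder for your own billing project; 'crypto_ethereum_classic' is a dataset name, not a project id):
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()
# Pass YOUR own GCP project (the one that gets billed), not the public dataset's project.
client = bigquery.Client(project='your-gcp-project-id')  # placeholder project id
# The public dataset is then referenced inside the query itself, e.g.
# `bigquery-public-data.crypto_ethereum_classic.<table>`.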

ValueError: Invalid endpoint: s3-api.xxxx.objectstorage.service.networklayer.com

I'm trying to access a CSV file in my Watson Data Platform catalog. I used the code generation functionality from my DSX notebook: Insert to code > Insert StreamingBody object.
The generated code was:
import os
import types
import pandas as pd
import boto3
def __iter__(self): return 0
# #hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
os.environ['AWS_ACCESS_KEY_ID'] = '******'
os.environ['AWS_SECRET_ACCESS_KEY'] = '******'
endpoint = 's3-api.us-geo.objectstorage.softlayer.net'
bucket = 'catalog-test'
cos_12345 = boto3.resource('s3', endpoint_url=endpoint)
body = cos_12345.Object(bucket,'my.csv').get()['Body']
# add missing __iter__ method so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType(__iter__, body)
df_data_2 = pd.read_csv(body)
df_data_2.head()
When I try to run this code, I get:
/usr/local/src/conda3_runtime.v27/4.1.1/lib/python3.5/site-packages/botocore/endpoint.py in create_endpoint(self, service_model, region_name, endpoint_url, verify, response_parser_factory, timeout, max_pool_connections)
270 if not is_valid_endpoint_url(endpoint_url):
271
--> 272 raise ValueError("Invalid endpoint: %s" % endpoint_url)
273 return Endpoint(
274 endpoint_url,
ValueError: Invalid endpoint: s3-api.us-geo.objectstorage.service.networklayer.com
What is strange is that if I generate the code for SparkSession setup instead, the same endpoint is used, but the Spark code runs OK.
How can I fix this issue?
I presume the same issue will be encountered with the other SoftLayer endpoints, so I'm listing them here as well to ensure this question also applies to the other SoftLayer locations:
s3-api.us-geo.objectstorage.softlayer.net
s3-api.dal-us-geo.objectstorage.softlayer.net
s3-api.sjc-us-geo.objectstorage.softlayer.net
s3-api.wdc-us-geo.objectstorage.softlayer.net
s3.us-south.objectstorage.softlayer.net
s3.us-east.objectstorage.softlayer.net
s3.eu-geo.objectstorage.softlayer.net
s3.ams-eu-geo.objectstorage.softlayer.net
s3.fra-eu-geo.objectstorage.softlayer.net
s3.mil-eu-geo.objectstorage.softlayer.net
s3.eu-gb.objectstorage.softlayer.net
The solution was to prefix the endpoint with https://, changing this:
endpoint = 's3-api.us-geo.objectstorage.softlayer.net'
to
endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
For IBM Cloud Object Storage, it should be import ibm_boto3 rather than import boto3. The original boto3 is for accessing AWS, which uses different authentication. Maybe those two have a different interpretation of the endpoint value.
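As a rough sketch of the ibm_boto3 variant (assuming the ibm-cos-sdk package and IAM credentials, i.e. an API key and service instance id, which differ from the HMAC keys in the generated snippet; all values are placeholders):
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.resource(
    's3',
    ibm_api_key_id='<your-iam-api-key>',
    ibm_service_instance_id='<your-cos-resource-instance-id>',
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.softlayer.net')
body = cos.Object('catalog-test', 'my.csv').get()['Body']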

OAuth2 service account credentials for Google Drive upload script

Apologies in advance if anyone has asked this before, but I've got some basic questions about a server-side Python script I'm writing that does a nightly upload of CSV files to a folder in our Google Drive account. The folder owner has created a Google API project, enabled the Drive API for this and created a credentials object which I've downloaded as a JSON file. The folder has been shared with the service account email, so I assume that the script will now have access to this folder, once it is authorized.
The JSON file contains the following fields: private_key_id, private_key, client_email, client_id, auth_uri, token_uri, auth_provider_x509_cert_url, client_x509_cert_url.
I'm guessing that my script will not need all of these. Which are the essential or compulsory fields for OAuth2 authorization?
The example Python script given here
https://developers.google.com/drive/web/quickstart/python
seems to assume that the credentials are retrieved directly from a JSON file:
...
home_dir = os.path.expanduser('~')
credential_dir = os.path.join(home_dir, '.credentials')
if not os.path.exists(credential_dir):
    os.makedirs(credential_dir)
credential_path = os.path.join(credential_dir,
                               'drive-python-quickstart.json')
store = oauth2client.file.Storage(credential_path)
credentials = store.get()
...
but in our setup we are storing them in our own DB and the script accesses them via a dict. How would the authorization be done if the credentials were in a dict?
Thanks in advance.
After browsing the source code, it seems that it was designed to accept only JSON files. They use simplejson to encode and decode. Take a look at the source code:
This is where it gets the content from the file you provide:
62 def locked_get(self):
63 """Retrieve Credential from file.
64
65 Returns:
66 oauth2client.client.Credentials
67
68 Raises:
69 CredentialsFileSymbolicLinkError if the file is a symbolic link.
70 """
71 credentials = None
72 self._validate_file()
73 try:
74 f = open(self._filename, 'rb')
75 content = f.read()
76 f.close()
77 except IOError:
78 return credentials
79
80 try:
81 credentials = Credentials.new_from_json(content) #<----!!!
82 credentials.set_store(self)
83 except ValueError:
84 pass
85
86 return credentials
In new_from_json they attempt to decode the content using simplejson.
205 def new_from_json(cls, s):
206 """Utility class method to instantiate a Credentials subclass from a JSON
207 representation produced by to_json().
208
209 Args:
210 s: string, JSON from to_json().
211
212 Returns:
213 An instance of the subclass of Credentials that was serialized with
214 to_json().
215 """
216 data = simplejson.loads(s)
Long story short, it seems you'll have to construct a JSON file from your dict. See this. Basically, you'll need to json.dumps(your_dictionary) and create a file.
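As a literal illustration of that suggestion (the path is arbitrary; credentials_dict is the dict loaded from your DB):
import json

# Write the dict out so the file-based Storage class has something to read.
# Note: per the locked_get() source quoted above, Storage expects JSON produced
# by Credentials.to_json(), so this only helps if that is what you stored.
with open('/tmp/drive_credentials.json', 'w') as f:
    json.dump(credentials_dict, f)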
Actually, it looks like SignedJwtAssertionCredentials may be one answer. Another possibility is:
from oauth2client import service_account
...
credentials = service_account._ServiceAccountCredentials(
    service_account_id=client.clid,
    service_account_email=client.email,
    private_key_id=client.private_key_id,
    private_key_pkcs8_text=client.private_key,
    scopes=client.auth_scopes,
)
....
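For the SignedJwtAssertionCredentials route mentioned above, here is a rough sketch assuming oauth2client < 2.0 (where that class still exists) and that creds is the dict loaded from your DB:
import httplib2
from googleapiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials

credentials = SignedJwtAssertionCredentials(
    creds['client_email'],
    creds['private_key'],
    scope='https://www.googleapis.com/auth/drive')
http = credentials.authorize(httplib2.Http())
drive_service = build('drive', 'v2', http=http)  # adjust the API version to whatever you target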

Obtaining AWS credentials using Cognito in Python boto

I'm trying to emulate the flow of my server application creating a temporary access/secret key pair for a mobile device using my own authentication. The mobile device talks to my server, and the end result is that it gets AWS credentials.
I'm using Cognito with a custom developer backend; see the documentation here.
To this end, I've made the script below, but my secret/access key credentials don't work:
import time
import traceback
from boto.cognito.identity.layer1 import CognitoIdentityConnection
from boto.sts import STSConnection
from boto.s3.connection import S3Connection
from boto.s3.key import Key
AWS_ACCESS_KEY_ID = "XXXXX"
AWS_SECRET_ACCESS_KEY = "XXXXXX"
# get token
iden_pool_id = "us-east-1:xxx-xxx-xxx-xxxx-xxxx"
role_arn = "arn:aws:iam::xxxx:role/xxxxxxx"
user_id = "xxxx"
role_session_name = "my_session_name_here"
bucket_name = 'xxxxxxxxxx'
connection = CognitoIdentityConnection(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
web_identity_token = connection.get_open_id_token_for_developer_identity(
    identity_pool_id=iden_pool_id,
    logins={"xxxxxxxxx": user_id},
    identity_id=None,
    token_duration=3600)
# use token to get credentials
sts_conn = STSConnection(aws_access_key_id=AWS_ACCESS_KEY_ID, aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
result = sts_conn.assume_role_with_web_identity(
    role_arn,
    role_session_name,
    web_identity_token['Token'],
    provider_id=None,
    policy=None,
    duration_seconds=3600)
print "The user now has an access ID (%s) and a secret access key (%s) and a session/security token (%s)!" % (
    result.credentials.access_key, result.credentials.secret_key, result.credentials.session_token)
# just use any call that tests if these credentials work
from boto.ec2.connection import EC2Connection
ec2 = EC2Connection(result.credentials.access_key, result.credentials.secret_key, security_token=result.credentials.session_token)
wait = 1
cumulative_wait_time = 0
while True:
    try:
        print ec2.get_all_regions()
        break
    except Exception as e:
        print e, traceback.format_exc()
        time.sleep(2**wait)
        cumulative_wait_time += 2**wait
        print "Waited for:", cumulative_wait_time
        wait += 1
My thought with the exponential backoff was that perhaps Cognito takes a while to propagate the new access/secret key pair, and thus I might have to wait (pretty unacceptable if so!).
However, this script runs for 10 minutes and doesn't succeed, which leads me to believe the problem is something else.
Console print out:
The user now has an access ID (xxxxxxxx) and a secret access key (xxxxxxxxxx) and a session/security token (XX...XX)!
EC2ResponseError: 401 Unauthorized
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>AuthFailure</Code><Message>AWS was not able to validate the provided access credentials</Message></Error></Errors><RequestID>xxxxxxxxxx</RequestID></Response> Traceback (most recent call last):
File "/home/me/script.py", line 50, in <module>
print ec2.get_all_regions()
File "/home/me/.virtualenvs/venv/local/lib/python2.7/site-packages/boto/ec2/connection.py", line 3477, in get_all_regions
[('item', RegionInfo)], verb='POST')
File "/home/me/.virtualenvs/venv/local/lib/python2.7/site-packages/boto/connection.py", line 1186, in get_list
raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 401 Unauthorized
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>AuthFailure</Code><Message>AWS was not able to validate the provided access credentials</Message></Error></Errors><RequestID>xxxxxxxxxxxxx</RequestID></Response>
Waited for: 2
...
...
Any thoughts?
You are correctly extracting the access key and secret key from the result of the assume_role_with_web_identity call. However, when using the temporary credentials, you also need to use the security token from the result.
Here is pseudocode describing what you need to do:
http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html#using-temp-creds-sdk
Also note the security_token parameter for EC2Connection:
http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection
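In boto 2 terms, that means passing all three parts of the temporary credentials, roughly like this (sketch, not tested):
from boto.ec2.connection import EC2Connection

ec2 = EC2Connection(
    aws_access_key_id=result.credentials.access_key,
    aws_secret_access_key=result.credentials.secret_key,
    security_token=result.credentials.session_token)
regions = ec2.get_all_regions()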
Hopefully this solves the problem
-Mark
