Tensorflow - S3 object does not exist - python

How do I set up direct private bucket access for Tensorflow?
After running
from tensorflow.python.lib.io import file_io
and then running print file_io.stat('s3://my/private/bucket/file.json'), I end up with an error:
NotFoundError: Object s3://my/private/bucket/file.json does not exist
However, the same line on a public object works without an error:
print file_io.stat('s3://ryft-public-sample-data/wikipedia-20150518.bin')
There appears to be an article on S3 support here: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md
However, I end up with the same error after exporting the variables shown.
I have awscli set up with all credentials, and boto3 can view and download the file in question. I am wondering how I can get Tensorflow to have S3 access directly when the bucket is private.
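For reference, exporting the variables shown amounts to something like the following from Python; this is only a sketch, and the key values and region below are placeholders:
import os

# Placeholder credentials and region; set these before touching any s3:// path
os.environ['AWS_ACCESS_KEY_ID'] = '...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_REGION'] = 'us-east-1'

from tensorflow.python.lib.io import file_io
print(file_io.stat('s3://my/private/bucket/file.json'))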

I had the same problem when trying to access files in a private S3 bucket from a SageMaker notebook. The mistake I made was trying to use credentials I obtained from boto3, which do not seem to be valid outside that session.
The solution was not to specify credentials at all (in that case the role attached to the machine is used), but instead to just specify the region name (for some reason it wasn't read from the ~/.aws/config file), as follows:
import boto3
import os

# Let boto3 resolve the region from the instance profile / metadata,
# then expose it to TensorFlow via the AWS_REGION environment variable
session = boto3.Session()
os.environ['AWS_REGION'] = session.region_name
NOTE: when debugging this error it was useful to look at the CloudWatch logs, as the S3 client's logs were printed only there and not in the Jupyter notebook.
There I first saw that:
when I did specify the credentials from boto3, the error was: The AWS Access Key Id you provided does not exist in our records.
when I accessed the bucket without the AWS_REGION environment variable set, the error was: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. This is apparently common when you don't specify the bucket's region (see 301 Moved Permanently after S3 uploading).

Related

List S3 bucket objects with access point using Boto3

I am trying to use the list_objects_v2 function of the Python3 Boto3 S3 API client to list objects from an S3 access point.
Sample Code:
import boto3
import botocore
access_point_arn = "arn:aws:s3:region:account-id:accesspoint/resource"
client = boto3.client('s3')
response = client.list_objects_v2(Bucket=access_point_arn)
Somehow I am getting the error below:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid bucket name "arn:aws:s3:region:account-id:accesspoint/resource": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"
Based on the documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/using-access-points.html, I should be able to pass an access point ARN to the list_objects_v2 function as the Bucket name. The odd thing is, this function works locally on my Windows 10 laptop. The same Python 3.6 code with the same boto3 and botocore package versions throws this error in an AWS Glue Python Shell job. I also made sure the Glue role has the S3 Full Access and Glue Service policies attached.
I would appreciate it if someone could shed some light on this.
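One thing worth checking inside the Glue job itself is which boto3/botocore versions the runtime actually loads, since ARN-style Bucket values are only understood by sufficiently recent botocore releases; a small sketch of that check:
import boto3
import botocore

# The versions the Glue runtime imports may differ from the ones installed locally,
# even if the declared package versions look the same.
print(boto3.__version__, botocore.__version__)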

Unable to read large CSV file from S3 bucket in Python

So I am trying to load a CSV file from an S3 bucket. The following is the code:
import pandas as pd
import boto3
import io

s3_file_key = 'iris.csv'
bucket = 'data'

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=s3_file_key)
# Read the streamed body into memory and hand it to pandas
initial_df = pd.read_csv(io.BytesIO(obj['Body'].read()))
It works fine. iris.csv is only 3 KB in size.
Now, instead of iris.csv, I try to read 'mydata.csv', which is 6 GB in size.
I get the following error :
ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
I am unable to comprehend how access can be an issue since I put the data there in the first place. Also I am able to read 'iris.csv' from the same location. Any ideas?
Here are a few things that you can check (a small diagnostic sketch follows the list):
Make sure the region of the S3 bucket matches your AWS configuration; otherwise it won't work. The S3 service is global, but every bucket is created in a specific region, and the same region should be used by the AWS client.
Make sure the access keys for the resource have the right set of permissions.
Make sure the file is actually uploaded.
Make sure there is no bucket policy applied that revokes access.
You can enable logging on your S3 bucket to see errors.
Make sure the bucket is not versioned. If versioned, specify the object version.
Make sure the object has the correct set of ACLs defined.
If the object is encrypted, make sure you have permission to use that KMS key to decrypt the object.
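A small diagnostic sketch covering the first few items in the list; the bucket and key names are the ones from the question, and head_object raises the same AccessDenied error that get_object does if permissions or KMS access are the problem:
import boto3

s3 = boto3.client('s3')
bucket = 'data'
key = 'mydata.csv'

# Which region was the bucket created in? (None means us-east-1)
print(s3.get_bucket_location(Bucket=bucket)['LocationConstraint'])

# Can we read the object's metadata? This also surfaces encryption details.
head = s3.head_object(Bucket=bucket, Key=key)
print(head.get('ContentLength'), head.get('ServerSideEncryption'), head.get('SSEKMSKeyId'))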

Downloading files from AWS S3 Bucket with boto3 results in ClientError: An error occurred (403): Forbidden

I am trying to download files from an S3 bucket by using the Access Key ID and Secret Access Key provided by https://db.humanconnectome.org. However, even though I am able to navigate the database and find the files (as I have configured my credentials via the AWS CLI), attempting to download them results in the following error:
"botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden"
With the same credentials, I can browse the same database and download the files manually via a cloud storage browser such as Cyberduck, so whatever Cyberduck does to access the data does not trigger the 403 Forbidden error.
I have also verified that boto3 is able to access my AWS credentials, and I have also tried hardcoding them.
How I am attempting to download the data is very straightforward, and replicates the boto3 docs example: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
import boto3

s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_ID,
                  aws_secret_access_key=ACCESS_KEY)
s3.download_file(Bucket=BUCKET_NAME, Key=FILE_KEY, Filename=FILE_NAME)
This should download the file to the location and file given by FILE_NAME, but instead invokes the 403 Forbidden error.
You'll need to pass the bucket's region as well when downloading the file. Try configuring the region using the AWS CLI, or pass region_name when creating the client:
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_ID,
                  aws_secret_access_key=ACCESS_KEY,
                  region_name=AWS_REGION)
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
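If the region is set and the error persists, it can also help to confirm which credentials boto3 actually resolved; a small check using STS (this is only a debugging aid, not part of the fix):
import boto3

# Prints the account ID and ARN of the identity behind the resolved credentials
print(boto3.client('sts').get_caller_identity())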
I know this might sound ridiculous, but make sure you don't have a typo in your bucket name or anything like that.
I worked so long trying to fix this, only to realize I added an extra letter in the env variable I had set for my s3 bucket.
It's weird that they give you a forbidden error as opposed to a "not found" error, but they do.
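A quick way to catch exactly that kind of typo is to ask S3 whether the bucket exists at all; a small sketch (BUCKET_NAME is the placeholder from the question):
import boto3

# head_bucket raises a 404 for a bucket that doesn't exist and a 403 for one you can't access
boto3.client('s3').head_bucket(Bucket=BUCKET_NAME)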

Upload Large files to S3 without authentication in Python

I'm trying to upload large files to Amazon S3 without using credentials. I'm creating a plugin for OctoPrint with this, and I can't put any sort of credentials into the code because it is public. Currently my code for uploads looks like this:
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create an S3 client that sends unsigned (anonymous) requests
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

filename = 'file.txt'
bucket_name = 'BUCKET_HERE'
s3.upload_file(filename, bucket_name, filename)
However, it gives me the following error:
S3UploadFailedError: Failed to upload largefiletest.mp4 to BUCKETNAMEHERE/largefiletest.mp4: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Anonymous users cannot initiate multipart uploads. Please authenticate.
Is there any way to work around this, or are there any suggestions for alternative libraries? Anything is appreciated.
Do you mean that the repository is public but the runtime environment is private? If so, the standard practice is to set environment variables like this:
# first: pip install django-environ (the package is imported as environ)
import environ

env = environ.Env()
SOME_KEY = env('SOME_KEY', default='')
This way, you can easily update your credentials without changing your code or compromising security.
Edit:
Then on the machine this code will be run, you can set the environment variables as such:
macOS: https://natelandau.com/my-mac-osx-bash_profile/
Linux: https://www.cyberciti.biz/faq/set-environment-variable-linux/
Windows: http://www.dowdandassociates.com/blog/content/howto-set-an-environment-variable-in-windows-command-line-and-registry/
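To tie this back to the upload code: boto3 can read the credentials from the environment itself, so nothing sensitive has to appear in the public plugin source. A minimal sketch, assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set on the machine (the file and bucket names are the placeholders from the question):
import os
import boto3

# Credentials are read from the environment; they are never hard-coded
s3 = boto3.client(
    's3',
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID', ''),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY', ''),
)
s3.upload_file('file.txt', 'BUCKET_HERE', 'file.txt')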

How can I download my data from Google Cloud Platform using Python?

I have my data on Google Cloud Platform, and I want to be able to download it locally. This is my first time trying that, and eventually I'll use the downloaded data with my Python code.
I have checked the docs, like https://cloud.google.com/genomics/downloading-credentials-for-api-access and https://cloud.google.com/storage/docs/cloud-console. I have successfully got the JSON file for the first link; the second one is where I'm struggling. I'm using Python 3.5 and, assuming my JSON file's name is data.json, I have added the following code:
os.environ["file"] = "data.json"
urllib.request.urlopen('https://storage.googleapis.com/[bucket_name]/[filename]')
First of all, I don't even know what I should call the value passed to environ, so I just called it file, and I'm not sure how I'm supposed to fill it. I also got access denied on the second line. Obviously this is not how to download my file, as there is no local destination or anything in that call. Any guidance will be appreciated.
Edit:
from google.cloud.storage import Blob
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"
storage_client = storage.Client.from_service_account_json('service_account.json')
client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = Blob('path/to/my-object', bucket)
download_to_filename('local/path/to/my-file')
I'm getting unresolved references for storage and download_to_filename, and should I replace service_account.json with credentials/client_secret.json? Also, I tried to print the content of os.environ["GOOGLE_APPLICATION_CREDENTIALS"]['installed'] like I would with any JSON, but it just said I should give numbers, meaning it read the value as plain text (the path) rather than the parsed JSON.
You should use the idiomatic Google Cloud client library to run operations in GCS.
With the example there, and knowing that the client library will pick up the application default credentials, we first have to set the application default credentials with
gcloud auth application-default login
===EDIT===
That was the old way. Now you should use the instructions in this link.
This means downloading a service account key file from the console, and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON.
Also, make sure that this service account has the proper permissions on the project of the bucket.
Or you can create the client with explicit credentials. You'll need to download the key file all the same, but when creating the client, use:
storage_client = storage.Client.from_service_account_json('service_account.json')
==========
And then, following the example code:
from google.cloud import storage
client = storage.Client(project='project-id')
bucket = client.get_bucket('bucket-id')
blob = storage.Blob('bucket/file/path', bucket)
blob.download_to_filename('/path/to/local/save')
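The same download also works with the explicitly created client from the edit above; a brief sketch (the file names and paths are placeholders):
from google.cloud import storage

client = storage.Client.from_service_account_json('service_account.json')
bucket = client.get_bucket('bucket-id')
bucket.blob('file/path/in/bucket').download_to_filename('/path/to/local/save')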
Or, if this is a one-off download, just install the SDK and use gsutil to download:
gsutil cp gs://bucket/file .
