So I am trying to load a CSV file from an S3 bucket. The following is the code:
import pandas as pd
import boto3
import io
s3_file_key = 'iris.csv'
bucket = 'data'
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=s3_file_key)
initial_df = pd.read_csv(io.BytesIO(obj['Body'].read()))
It works fine. iris.csv is only 3 KB in size.
Now instead of iris.csv, I try to read 'mydata.csv', which is 6 GB in size.
I get the following error:
ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
I am unable to comprehend how access can be an issue since I put the data there in the first place. Also I am able to read 'iris.csv' from the same location. Any ideas?
Here are a few things that you can check (a quick way to verify some of them from code is sketched after this list):
Make sure the region of the S3 bucket is the same as the one in your AWS configuration. Otherwise, it won't work. The S3 service is global, but every bucket is created in a specific region, and AWS clients should use that same region.
Make sure the access keys for the resource have the right set of permissions.
Make sure the file is actually uploaded.
Make sure there is no bucket policy applied that revokes access.
You can enable logging on your S3 bucket to see errors.
Make sure the bucket is not versioned. If versioned, specify the object version.
Make sure the object has the correct set of ACLs defined.
If the object is encrypted, make sure you have permission to use that KMS key to decrypt the object.
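A quick way to verify the first few points from the same environment is a short boto3 sketch; the bucket and key names below are the ones from the question, so adjust them as needed:
import boto3

bucket = 'data'        # bucket name from the question
key = 'mydata.csv'     # the object that returns AccessDenied

s3 = boto3.client('s3')

# Confirms the object exists and that your credentials can read its metadata;
# a 403 or 404 here narrows the problem down before you try to download 6 GB.
print(s3.head_object(Bucket=bucket, Key=key))

# Shows which region the bucket lives in (None means us-east-1) so you can
# compare it with the region your client session is configured for.
print(s3.get_bucket_location(Bucket=bucket).get('LocationConstraint'))
print(boto3.Session().region_name)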
I am trying to set up a pipeline in GCP/Vertex AI and am having a lot of trouble. The pipeline is being written using Kubeflow Pipelines and has many different components, one thing in particular is giving me trouble however. Eventually I want to launch this from a Cloud Function with the help of the Cloud Scheduler.
The part that is giving me issues is fairly simple and I believe I just need some form of introduction to how I should be thinking about this setup. I simply want to read and write from files (might be .csv, .txt or similar). I imagine that the analog to the filesystem on my local machine in GCP is the Cloud Storage so this is where I have been trying to read from for the time being (please correct me if I'm wrong). The component I've built is a blatant rip-off of this post and looks like this.
from kfp.v2.dsl import component

@component(
    packages_to_install=["google-cloud"],
    base_image="python:3.9"
)
def main():
    import csv
    from io import StringIO
    from google.cloud import storage

    BUCKET_NAME = "gs://my_bucket"
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob('test/test.txt')
    blob = blob.download_as_string()
    blob = blob.decode('utf-8')
    blob = StringIO(blob)  # transform bytes to string here
    names = csv.reader(blob)  # then use csv library to read the content
    for name in names:
        print(f"First Name: {name[0]}")
The error I'm getting looks like the following:
google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/gs://pipeline_dev?projection=noAcl&prettyPrint=false: Not Found
What's going wrong in my brain? I get the feeling that it shouldn't be this difficult to read and write files. I must be missing something fundamental? Any help is highly appreciated.
Try specifying the bucket name without the gs:// prefix. This should fix the issue. One more Stack Overflow post that says the same thing: Cloud Storage python client fails to retrieve bucket
Any storage bucket you try to access in GCP has a unique address. That address always starts with gs://, which marks it as a Cloud Storage URL. The GCS APIs, however, are designed to work with just the bucket name, so that is all you pass. If you were accessing the bucket via a browser, you would need the complete address, and hence the gs:// prefix as well.
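For reference, a minimal version of the read with the prefix removed might look like the sketch below; it assumes the bucket is literally named my_bucket, as in the question, and that the component installs google-cloud-storage:
import csv
from io import StringIO
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')   # bucket name only, no gs:// prefix
blob = bucket.blob('test/test.txt')
content = blob.download_as_string().decode('utf-8')

for row in csv.reader(StringIO(content)):
    print(f"First Name: {row[0]}")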
I am trying to use the list_objects_v2 function of the Python3 Boto3 S3 API client to list objects from an S3 access point.
Sample Code:
import boto3
import botocore
access_point_arn = "arn:aws:s3:region:account-id:accesspoint/resource"
client = boto3.client('s3')
response = client.list_objects_v2(Bucket=access_point_arn)
Somehow I am getting the error below:
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid bucket name "arn:aws:s3:region:account-id:accesspoint/resource": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"
Based on the documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/using-access-points.html, I should be able to pass an access point ARN to the list_objects_v2 function as the Bucket name. The odd thing is, this function works locally on my Windows 10 laptop. The same Python 3.6 code with the same Boto3 and Botocore package versions throws this error in an AWS Glue Python Shell job. I also made sure the Glue role has the S3 Full Access and Glue Service policies attached.
I would appreciate it if someone could shed some light on this.
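For reference, the documented usage looks like the sketch below (the ARN is the placeholder from the question). Older botocore releases reject access point ARNs as bucket names with exactly this validation error, so printing the library versions from inside the Glue job itself, rather than relying on what the job is supposed to ship, is a reasonable first check:
import boto3
import botocore

# Print the versions the job is actually running with; botocore only accepts
# access point ARNs as the Bucket parameter in relatively recent releases.
print(boto3.__version__, botocore.__version__)

access_point_arn = "arn:aws:s3:region:account-id:accesspoint/resource"  # placeholder from the question

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=access_point_arn):
    for obj in page.get('Contents', []):
        print(obj['Key'])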
I am trying to upload the image to S3 and then have AWS Rekognition fetch it from S3 for face detection, but Rekognition cannot do that.
Here is my code - uploading and then detecting:
import boto3
s3 = boto3.client('s3')
s3.put_object(
    ACL='public-read',
    Body=open('/Users/1111/Desktop/kitten800300/kitten.jpeg', 'rb'),
    Bucket='mobo2apps',
    Key='kitten_img.jpeg'
)
rekognition = boto3.client('rekognition')
response = rekognition.detect_faces(
    Image={
        'S3Object': {
            'Bucket': 'mobo2apps',
            'Name': 'kitten_img.jpeg',
        }
    }
)
This produces an error:
Unable to get object metadata from S3. Check object key, region and/or access permissions.
Why is that?
About the permissions: I am authorized with AWS root access keys, so I have full access to all resources.
Here are a few things that you can check:
Make sure the region of the S3 bucket is the same as the one Rekognition is being called in. Otherwise, it won't work. The S3 service is global, but every bucket is created in a specific region, and AWS clients should use that same region.
Make sure the access keys of the user or role have the right set of permissions for the resource.
Make sure the file is actually uploaded.
Make sure there is no bucket policy applied that revokes access.
You can enable logging on your S3 bucket to see errors.
Make sure the bucket is not versioned. If versioned, specify the object version.
Make sure the object has the correct set of ACLs defined.
If the object is encrypted, make sure you have permission to use that KMS key to decrypt the object.
You have to wait a while until the image upload is done.
The code runs through without errors, so your JPEG starts to upload, and Rekognition starts trying to detect the face even before the upload is finished. Since the upload is not complete when the detection call runs, Rekognition cannot find the object in your S3 bucket. Add a short wait, as sketched below.
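If you want to make that wait explicit rather than sleeping for an arbitrary time, boto3 ships an object_exists waiter that polls HeadObject until the object is retrievable. A minimal sketch reusing the bucket and key from the question:
import boto3

s3 = boto3.client('s3')
s3.put_object(
    ACL='public-read',
    Body=open('/Users/1111/Desktop/kitten800300/kitten.jpeg', 'rb'),
    Bucket='mobo2apps',
    Key='kitten_img.jpeg'
)

# Block until S3 reports the object as retrievable before calling Rekognition.
waiter = s3.get_waiter('object_exists')
waiter.wait(Bucket='mobo2apps', Key='kitten_img.jpeg')

rekognition = boto3.client('rekognition')
response = rekognition.detect_faces(
    Image={'S3Object': {'Bucket': 'mobo2apps', 'Name': 'kitten_img.jpeg'}}
)
print(response)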
How do I set up direct private bucket access for Tensorflow?
After running
from tensorflow.python.lib.io import file_io
and then running print file_io.stat('s3://my/private/bucket/file.json'), I end up with an error:
NotFoundError: Object s3://my/private/bucket/file.json does not exist
However, the same line on a public object works without an error:
print file_io.stat('s3://ryft-public-sample-data/wikipedia-20150518.bin')
There appears to be an article on support here: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md
However, I end up with the same error after exporting the variables shown.
I have awscli set up with all credentials, and boto3 can view and download the file in question. I am wondering how I can get Tensorflow to have S3 access directly when the bucket is private.
I had the same problem when trying to access files in a private S3 bucket from a SageMaker notebook. The mistake I made was trying to use credentials I obtained from boto3, which seem not to be valid outside of it.
The solution was not to specify credentials at all (in that case it uses the role attached to the machine), but instead to just specify the region name (for some reason it didn't read it from the ~/.aws/config file), as follows:
import boto3
import os
session = boto3.Session()
os.environ['AWS_REGION']=session.region_name
NOTE: when debugging this error it was useful to look at the CloudWatch logs, as the S3 client's logs were printed only there and not in the Jupyter notebook.
There I first saw that:
When I did specify the credentials from boto3, the error was: The AWS Access Key Id you provided does not exist in our records.
When accessing without the AWS_REGION env variable set, I got: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint. This is apparently common when you don't specify the bucket's region (see 301 Moved Permanently after S3 uploading).
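Putting the two pieces together, a minimal sketch (using the placeholder bucket path from the question) would export the region before TensorFlow's S3 filesystem is first used:
import os
import boto3

# Export the region before TensorFlow's S3 filesystem is initialized;
# it reads AWS_REGION from the environment rather than from ~/.aws/config.
session = boto3.Session()
os.environ['AWS_REGION'] = session.region_name

from tensorflow.python.lib.io import file_io

# Placeholder path from the question.
print(file_io.stat('s3://my/private/bucket/file.json'))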
I am trying to download files from an S3 bucket by using the Access Key ID and Secret Access Key provided by https://db.humanconnectome.org. However, even though I am able to navigate the database and find the files (as I have configured my credentials via the AWS CLI), attempting to download them results in the following error:
"botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden"
With the same credentials, I can browse the same database and download the files manually via a cloud storage browser such as Cyberduck, so whatever Cyberduck does to access the data does not trigger a 403 Forbidden error.
I have also verified that boto3 is able to access my AWS credentials, and I have also tried hardcoding them.
How I am attempting to download the data is very straightforward, and replicates the boto3 docs example: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_ID,
                  aws_secret_access_key=ACCESS_KEY)
s3.download_file(Bucket=BUCKET_NAME, Key=FILE_KEY, Filename=FILE_NAME)
This should download the file to the location and file given by FILE_NAME, but instead invokes the 403 Forbidden error.
You'll need to pass the bucket region as well when downloading the file. Try configuring the region using the CLI, or pass region_name when creating the client:
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_ID,
                  aws_secret_access_key=ACCESS_KEY,
                  region_name=AWS_REGION)
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
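If you are not sure which region the bucket is in, one option is to ask S3 for it first and then build the client with that value. A rough sketch, reusing the credential variables from the question and assuming your keys are allowed to call GetBucketLocation:
import boto3

# Ask S3 which region the bucket lives in; a None response means us-east-1.
probe = boto3.client('s3',
                     aws_access_key_id=ACCESS_KEY_ID,
                     aws_secret_access_key=ACCESS_KEY)
region = probe.get_bucket_location(Bucket=BUCKET_NAME).get('LocationConstraint') or 'us-east-1'

s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_ID,
                  aws_secret_access_key=ACCESS_KEY,
                  region_name=region)
s3.download_file(Bucket=BUCKET_NAME, Key=FILE_KEY, Filename=FILE_NAME)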
I know this might sound ridiculous, but make sure you don't have a typo in your bucket name or anything like that.
I worked so long trying to fix this, only to realize I had added an extra letter in the env variable I had set for my S3 bucket.
It's weird that they give you a forbidden error as opposed to a "not found" error, but they do.