Airflow S3 ClientError - Forbidden: Wrong s3 connection settings using UI - python

I'm using S3Hook in my task to download files from an S3 bucket on DigitalOcean Spaces. Here is an example of credentials which work perfectly with boto3 but cause errors when used in S3Hook:
[s3_bucket]
default_region = fra1
default_endpoint=https://fra1.digitaloceanspaces.com
default_bucket=storage-data
bucket_access_key=F7QTVFMWJF73U75IB26D
bucket_secret_key=mysecret
This is how I filled the connection form in Admin->Connections:
Here is what I see in task's .log file:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
So, I guess, the connection form is wrong. What is the proper way to fill all S3 params properly? (i.e. key, secret, bucket, host, region, etc.)

Moving the host variable to Extra did the trick for me.
For some reason, Airflow is unable to establish the connection to a custom S3 host (one different from AWS, like DigitalOcean) unless the host is set in the Extra field.
Also, region_name can be removed from Extra in a case like mine.
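For illustration, here is a minimal sketch of how the hook could then be used inside a task, assuming Airflow 2.x with the Amazon provider installed; the connection id do_spaces and the object key below are hypothetical, and the connection's Extra field is assumed to hold {"host": "https://fra1.digitaloceanspaces.com"}:
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# 'do_spaces' is a hypothetical connection id; Login/Password hold the key/secret
# and Extra holds {"host": "https://fra1.digitaloceanspaces.com"}
hook = S3Hook(aws_conn_id='do_spaces')
obj = hook.get_key(key='some/file.csv', bucket_name='storage-data')  # returns a boto3 S3.Object
obj.download_file('/tmp/file.csv')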

To get this working with Airflow 2.1.0 on Digital Ocean Spaces, I had to add the aws_conn_id here:
s3_client = S3Hook(aws_conn_id='123.ams3.digitaloceanspaces.com')
Fill in the Schema field with the bucket name, Login with the key, and Password with the secret; the Extra field in the UI then contains the region and host:
{"host": "https://ams3.digitaloceanspaces.com","region_name": "ams3"}

Related

com.amazonaws.AmazonClientException: Unable to execute HTTP request: No such host is known (spark-tunes.s3a.ap-south-1.amazonaws.com)

I am trying to read a JSON file stored in an S3 bucket from Spark in local mode via PyCharm. But I'm getting the below error message:
"py4j.protocol.Py4JJavaError: An error occurred while calling o37.json.
: com.amazonaws.AmazonClientException: Unable to execute HTTP request: No such host is known (spark-tunes.s3a.ap-south-1.amazonaws.com)"
(spark-tunes is my S3 bucket name).
Below is the code I executed. Please help me to know if I'm missing something.
spark = SparkSession.builder.appName('DF Read').config('spark.master', 'local').getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access_key")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret_key")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3a.ap-south-1.amazonaws.com")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3a.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
df = spark.read.json("s3a://bucket-name/folder_name/*.json")
df.show(5)
Try setting fs.s3a.path.style.access to true; then, instead of prefixing the bucket name to the host, the AWS S3 client will use paths under the endpoint.
Also: drop the fs.s3a.impl line. That is superstition passed down across Stack Overflow examples. It's not needed, really.
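Putting both suggestions together, a sketch of a corrected setup could look like this (keys and paths are placeholders; note the endpoint is s3.ap-south-1.amazonaws.com, not the s3a.* host from the question, which does not resolve):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DF Read').config('spark.master', 'local').getOrCreate()
conf = spark._jsc.hadoopConfiguration()
conf.set("fs.s3a.access.key", "access_key")
conf.set("fs.s3a.secret.key", "secret_key")
# standard regional endpoint
conf.set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com")
# path-style access keeps the bucket out of the hostname and puts it in the request path
conf.set("fs.s3a.path.style.access", "true")
# no fs.s3a.impl setting needed

df = spark.read.json("s3a://spark-tunes/folder_name/*.json")
df.show(5)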

Uploading large files to Google Storage GCE from a Kubernetes pod

We get this error when uploading a large file (more than 10Mb but less than 100Mb):
403 POST https://www.googleapis.com/upload/storage/v1/b/dm-scrapes/o?uploadType=resumable: ('Response headers must contain header', 'location')
Or this error when the file is more than 5Mb
403 POST https://www.googleapis.com/upload/storage/v1/b/dm-scrapes/o?uploadType=multipart: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>)
It seems that this API is looking at the file size and trying to upload it via the multipart or resumable method. I can't imagine that this is something I should be concerned with as a caller of this API. Is the problem somehow related to permissions? Does the bucket need special permissions so that it can accept multipart or resumable uploads?
from google.cloud import storage

try:
    client = storage.Client()
    bucket = client.get_bucket('my-bucket')
    blob = bucket.blob('blob-name')
    blob.upload_from_filename(zip_path, content_type='application/gzip')
except Exception as e:
    print(f'Error in uploading {zip_path}')
    print(e)
We run this inside a Kubernetes pod so the permissions get picked up by storage.Client() call automatically.
We already tried these:
Can't upload with gsutil because the container is Python 3 and gsutil does not run in Python 3.
Tried this example, but it runs into the same error: ('Response headers must contain header', 'location')
There is also this library, but it is basically alpha quality, with little activity and no commits for a year.
Upgraded to google-cloud-storage==1.13.0
Thanks in advance
The problem was indeed the credentials. Somehow the error message was very misleading. When we loaded the credentials explicitly, the problem went away.
# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json('service_account.json')
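Combined with the upload from the question, a minimal sketch (the key file name, bucket, and paths are placeholders):
from google.cloud import storage

# load credentials explicitly from a service account key file instead of
# relying on the pod's default scopes
storage_client = storage.Client.from_service_account_json('service_account.json')

zip_path = '/tmp/archive.gz'  # placeholder path
bucket = storage_client.get_bucket('my-bucket')
blob = bucket.blob('blob-name')
blob.upload_from_filename(zip_path, content_type='application/gzip')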
I found my node pools had been spec'd with
oauthScopes:
- https://www.googleapis.com/auth/devstorage.read_only
and changing it to
oauthScopes:
- https://www.googleapis.com/auth/devstorage.full_control
fixed the error. As described in this issue the problem is an uninformative error message.

Problems Enabling S3 Bucket Transfer Acceleration Using boto3

I am attempting to pull information about an S3 bucket using boto3. Here is the setup (bucketname is set to a valid S3 bucket name):
import boto3
s3 = boto3.client('s3')
result = s3.get_bucket_acl(Bucket=bucketname)
When I try, I get this error:
ClientError: An error occurred (InvalidRequest) when calling the
GetBucketAcl operation: S3 Transfer Acceleration is not configured on
this bucket
So, I attempt to enable transfer acceleration:
s3.put_bucket_accelerate_configuration(Bucket=bucketname, AccelerateConfiguration={'Status': 'Enabled'})
But, I get this error, which seems silly, since the line above is attempting to configure the bucket. I do have IAM rights (Allow: *) to modify the bucket too:
ClientError: An error occurred (InvalidRequest) when calling the
PutBucketAccelerateConfiguration operation: S3 Transfer Acceleration
is not configured on this bucket
Does anyone have any ideas on what I'm missing here?
Although I borrowed the code in the original question from the boto3 documentation, this construct is not complete and did not provide the connectivity that I expected:
s3 = boto3.client('s3')
What is really needed are fully-initialized session and client handlers, like this (assuming that the profile variable is set correctly in the ~/.aws/config file and bucketname is a valid S3 bucket):
from boto3 import Session
session = Session(profile_name=profile)
client = session.client('s3')
result = client.get_bucket_acl(Bucket=bucketname)
After doing this (duh), I was able to connect with or without transfer acceleration.
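For completeness, enabling and then verifying transfer acceleration with the session-based client can be sketched like this, under the same assumptions about profile and bucketname:
from boto3 import Session

session = Session(profile_name=profile)
client = session.client('s3')

# enable transfer acceleration on the bucket, then read the setting back
client.put_bucket_accelerate_configuration(
    Bucket=bucketname,
    AccelerateConfiguration={'Status': 'Enabled'})
status = client.get_bucket_accelerate_configuration(Bucket=bucketname)
print(status.get('Status'))  # expected: 'Enabled'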
Thanks to the commenters, since those comments led me to the solution.

Boto3 - how to connect to S3 via proxy?

I'm using a container that simulates an S3 server, running on http://127.0.0.1:4569 (with no authorization or credentials needed),
and I'm trying to simply connect and print a list of all the bucket names using Python and boto3.
here's my docker-compose:
s3:
  image: andrewgaul/s3proxy
  environment:
    S3PROXY_AUTHORIZATION: none
  hostname: s3
  ports:
    - 4569:80
  volumes:
    - ./data/s3:/data
here's my code:
s3 = boto3.resource('s3', endpoint_url='http://127.0.0.1:4569')
for bucket in s3.buckets.all():
    print(bucket.name)
here's the error message that I received:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I tried this solution => How do you use an HTTP/HTTPS proxy with boto3?
but it's still not working, and I don't understand what I'm doing wrong.
First, boto3 always tries to handshake with the S3 server using an AWS access key. Even if your simulation server doesn't need a password, you still need to specify credentials, either in your .aws/credentials file or inside your program, e.g.
[default]
aws_access_key_id = x
aws_secret_access_key = x
Hardcoded dummy access key example:
import boto3

session = boto3.session.Session(
    aws_access_key_id='x',
    aws_secret_access_key='x')
s3 = session.resource('s3', endpoint_url='http://127.0.0.1:4569')
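With those dummy keys in place, the listing from the question should then work against the local endpoint (continuing from the snippet above):
# iterate over all buckets exposed by the local s3proxy endpoint
for bucket in s3.buckets.all():
    print(bucket.name)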
Second, I don't know how reliable your "S3 simulation container" is or what kind of protocol it implements. To make life easier, I always suggest that anyone who wants to simulate S3 for load testing or anything else use fake-s3.

"TypeError: expected string, tuple found" when passing aws credentials to amazon client constructor

I have a Python script that calls the Amazon SES API using boto3. It works when I create the client like this, client = boto3.client('ses'), and allow the AWS credentials to come from ~/.aws/credentials, but I wanted to pass the aws_access_key_id and aws_secret_access_key into the constructor somehow.
I thought I had found somewhere that said it was acceptable to do something like this:
client = boto3.client(
    'ses',
    aws_access_key_id=kwargs['aws_access_key_id'],
    aws_secret_access_key=kwargs['aws_secret_access_key'],
    region_name=kwargs['region_name']
)
but then when I try to send an email, it tells me that there is a TypeError: sequence item 0: expected string, tuple found when it tries to return '/'.join(scope) in botocore/auth.py (line 276).
I know it's a bit of a long shot, but I was hoping someone had an idea of how I can pass these credentials to the client from somewhere other than the aws credentials file. I also have the full stack trace from the error, if that's helpful I can post it as well. I just didn't want to clutter up the question initially.
You need to configure your connection info elsewhere and then connect using:
client = boto3.client('ses', AWS_REGION)
An alternative way, using Session, can be done like this:
from boto3.session import Session

# create boto session
session = Session(
    aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    region_name=settings.AWS_REGION
)

# make connection
client = session.client('ses')
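With that client, a hypothetical send could look like this (addresses and message text are placeholders, not from the original question):
response = client.send_email(
    Source='sender@example.com',
    Destination={'ToAddresses': ['recipient@example.com']},
    Message={
        'Subject': {'Data': 'Test'},
        'Body': {'Text': {'Data': 'Hello from SES'}}
    }
)
print(response['MessageId'])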
