Uploading large files to Google Cloud Storage (GCS) from a Kubernetes pod - Python

We get this error when uploading a large file (more than 10 MB but less than 100 MB):
403 POST https://www.googleapis.com/upload/storage/v1/b/dm-scrapes/o?uploadType=resumable: ('Response headers must contain header', 'location')
Or this error when the file is more than 5 MB:
403 POST https://www.googleapis.com/upload/storage/v1/b/dm-scrapes/o?uploadType=multipart: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>)
It seems that this API looks at the file size and tries to upload it via the multipart or resumable method. I can't imagine that is something that I, as a caller of this API, should be concerned with. Is the problem somehow related to permissions? Does the bucket need special permission so it can accept multipart or resumable uploads?
from google.cloud import storage

try:
    client = storage.Client()
    bucket = client.get_bucket('my-bucket')
    blob = bucket.blob('blob-name')
    blob.upload_from_filename(zip_path, content_type='application/gzip')
except Exception as e:
    print(f'Error in uploading {zip_path}')
    print(e)
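(For context: the client library chooses between multipart and resumable uploads on its own based on file size; setting a chunk size on the blob forces the resumable path. The sketch below uses the same placeholder bucket and blob names and does not, by itself, address the 403:)

from google.cloud import storage

# Sketch only: chunk_size must be a multiple of 256 KB; setting it forces a
# chunked, resumable upload regardless of file size. Paths and names are placeholders.
zip_path = '/tmp/scrape.tar.gz'
client = storage.Client()
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('blob-name', chunk_size=5 * 1024 * 1024)
blob.upload_from_filename(zip_path, content_type='application/gzip')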
We run this inside a Kubernetes pod, so the permissions get picked up by the storage.Client() call automatically.
We already tried these:
Can't upload with gsutil because the container is Python 3 and gsutil does not run on Python 3.
Tried this example, but it runs into the same error: ('Response headers must contain header', 'location')
There is also this library, but it is basically alpha quality, with little activity and no commits for a year.
Upgraded to google-cloud-storage==1.13.0
Thanks in advance

The problem was indeed the credentials. Somehow the error message was very misleading. When we loaded the credentials explicitly, the problem went away.
# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json(
    'service_account.json')
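For reference, a minimal sketch of the original upload using the explicitly loaded service-account credentials; the key file path and names are placeholders (e.g. a key mounted into the pod from a Kubernetes secret):

from google.cloud import storage

# 'service_account.json' is a placeholder path to a key file available inside the pod.
storage_client = storage.Client.from_service_account_json('service_account.json')
bucket = storage_client.get_bucket('my-bucket')
blob = bucket.blob('blob-name')
blob.upload_from_filename('/tmp/scrape.tar.gz', content_type='application/gzip')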

I found my node pools had been spec'd with
oauthScopes:
- https://www.googleapis.com/auth/devstorage.read_only
and changing it to
oauthScopes:
- https://www.googleapis.com/auth/devstorage.full_control
fixed the error. As described in this issue, the problem is an uninformative error message.
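A quick way to confirm which scopes a pod actually inherits from its node (assuming the GCE metadata server is reachable from the pod, which it is by default on GKE) is to query the metadata endpoint:

import requests

# Lists the OAuth scopes granted to the node's default service account.
# Seeing only .../auth/devstorage.read_only here would explain the 403 on uploads.
resp = requests.get(
    'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes',
    headers={'Metadata-Flavor': 'Google'},
)
print(resp.text)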

Related

503 when trying to access boto3 from kubernetes

I wrote a Python script to automatically download a file from a DB and upload it to an S3 account we own. The script works from my PC, and I can successfully ping Amazon S3 from within the Kubernetes cluster we are working on, but I'm getting a 503 when the script tries to upload/download a file from S3. I'm using the following installation: 'python3.6 -m pip install boto3'
and getting the following error: "botocore.exceptions.ClientError: An error occurred (503) when calling the GetObject operation (reached max retries: 15): Service Unavailable"
I tried adding/removing SSL, changing the timeout and max retries, and nothing seems to help. I also tried different boto3 objects (client, session, etc.).
The code that crashes is the following: (the line that crashes is the one marked with **)
import os
import boto3
from botocore.config import Config

def write_to_s3():
    s3 = get_s3()
    object1 = s3.Object(BUCKET_NAME, FILENAME)
    print(object1)
    **test = object1.get()**
    latest_num = int(str(object1.get()['Body'].read())[2:-1])
    print(str(latest_num))
    ...

def get_s3():
    my_config = Config(
        region_name=REGION,
        connect_timeout=25,
        retries={
            'max_attempts': 15,
            'mode': 'standard'
        }
    )
    return boto3.resource('s3', use_ssl=False, config=my_config,
                          aws_access_key_id=os.environ.get("ACCESS_KEY_ID"),
                          aws_secret_access_key=os.environ.get("SECRET_ACCESS_KEY"))
I really do not understand why this happens and found no answers or similar errors on the web. Please help!
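One way to narrow this down (a sketch, not a fix) is to issue a plain HeadBucket call from inside the pod with boto3's default SSL and retry settings, to see whether the 503 also occurs outside the application code path; the bucket name and environment variables below are placeholders matching the question:

import os
import boto3

# Minimal reachability/permissions check against the same bucket.
s3 = boto3.client(
    's3',
    region_name=os.environ.get('REGION', 'us-east-1'),
    aws_access_key_id=os.environ.get('ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('SECRET_ACCESS_KEY'),
)
s3.head_bucket(Bucket='my-bucket')  # placeholder bucket name
print('HeadBucket succeeded')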

Download Blob From Blob Storage Using Python

I am trying to download an Excel file stored in a blob. However, it keeps generating the error "The specified blob does not exist". This error happens at blob_client.download_blob(), although I can get the blob_client. Any idea why, or other ways I can connect using managed identity?
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
import pandas as pd

default_credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient('url', credential=default_credential)
container_client = blob_service_client.get_container_client('xx-xx-data')
blob_client = container_client.get_blob_client('TEST.xlsx')
downloaded_blob = blob_client.download_blob()
df = pd.read_excel(downloaded_blob.content_as_bytes(), sheet_name='Test', skiprows=2)
Turns out that I also had to grant 'Reader' access on top of 'Storage Blob Data Contributor' to be able to identify the blob. There was no need for a SAS URL.
The reason you're getting this error is that each request to Azure Blob Storage must be an authenticated request. The only exception is when you're reading (downloading) a blob from a public blob container. In all likelihood, the blob container holding this blob has a private ACL, and since you're sending an unauthenticated request, you're getting this error.
I would recommend using a Shared Access Signature (SAS) URL for the blob with Read permission instead of a simple blob URL. Since a SAS URL has the authorization information embedded in the URL itself (the sig portion), you should be able to download the blob provided the SAS is valid and has not expired.
Please see this for more information on Shared Access Signature: https://learn.microsoft.com/en-us/rest/api/storageservices/delegate-access-with-shared-access-signature.
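If the SAS route is taken, a minimal sketch using azure-storage-blob might look like the following; the storage account name and key are placeholders, and the container/blob names are taken from the question:

from datetime import datetime, timedelta
from azure.storage.blob import BlobClient, BlobSasPermissions, generate_blob_sas

# Placeholder account name and key; in practice the key would come from configuration.
account_name = 'mystorageacct'
account_key = '<account-key>'

sas_token = generate_blob_sas(
    account_name=account_name,
    container_name='xx-xx-data',
    blob_name='TEST.xlsx',
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
blob_client = BlobClient.from_blob_url(
    f'https://{account_name}.blob.core.windows.net/xx-xx-data/TEST.xlsx?{sas_token}'
)
data = blob_client.download_blob().readall()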

Python O365 Outlook Connection Issues

I am trying to write a script in Python to grab new emails from a specific folder and save the attachments to a shared drive to upload to a database. Power Automate would work, but the file size limit to save the attachment is a meager 20 MB. I am able to authenticate the token, but am getting the following error when trying to grab the emails:
Unauthorized for url.
The token contains no permissions, or permissions can not be understood.
I have included the code I am using to connect to Microsoft Graph.
(credentials and tenant_id are correct in my code; I took them out for obvious reasons.)
from O365 import Account, MSOffice365Protocol, MSGraphProtocol

credentials = ('xxxxxx', 'xxxxxx')

protocol = MSGraphProtocol(default_resource='reporting.triometric#xxxx.com')
scopes_graph = protocol.get_scopes_for('message_all_shared')
scopes = ['https://graph.microsoft.com/.default']

account = Account(credentials, auth_flow_type='credentials', tenant_id="**", scopes=scopes)

if account.authenticate():
    print('Authenticated')

mailbox = account.mailbox(resource='reporting.triometric#xxxx.com')
inbox = mailbox.inbox_folder()

for message in inbox.get_messages():
    print(message)
I have already configured the permissions through Azure to include all the necessary 'mail' delegations.
The rest of my script works perfectly fine for uploading files to the database. Currently, the attachments must be manually saved on a shared drive multiple times per day, then the script is run to upload. Are there any steps I am missing? Any insights would be greatly appreciated!
Here are the permissions:
auth_flow_type='credentials' means you are using client credentials flow.
In this case you should add Application permissions rather than Delegated permissions.
Don't forget to click on "Grant admin consent for {your tenant}".
UPDATE:
If you set auth_flow_type to 'authorization', it will use the auth code flow, which requires the delegated permissions.
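For illustration, a sketch of the delegated (auth code) flow with the same library; the client id/secret and tenant id are placeholders, and on first run this flow prints a consent URL and asks for the redirected URL back:

from O365 import Account

# 'authorization' selects the auth code flow, which uses Delegated permissions.
credentials = ('client-id', 'client-secret')  # placeholder app registration values
account = Account(credentials, auth_flow_type='authorization',
                  tenant_id='your-tenant-id',
                  scopes=['basic', 'message_all_shared'])
if account.authenticate():
    print('Authenticated with delegated permissions')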

Airflow S3 ClientError - Forbidden: Wrong s3 connection settings using UI

I'm using S3Hook in my task to download files from an S3 bucket on DigitalOcean Spaces. Here is an example of credentials which work perfectly with boto3, but cause errors when used in S3Hook:
[s3_bucket]
default_region = fra1
default_endpoint=https://fra1.digitaloceanspaces.com
default_bucket=storage-data
bucket_access_key=F7QTVFMWJF73U75IB26D
bucket_secret_key=mysecret
This is how I filled in the connection form in Admin -> Connections:
Here is what I see in the task's .log file:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
So I guess the connection form is wrong. What is the proper way to fill in all the S3 params (i.e. key, secret, bucket, host, region, etc.)?
Moving the host variable to Extra did the trick for me.
For some reason, Airflow is unable to establish the connection with a custom S3 host (different from AWS, like DigitalOcean) if it's not in the Extra vars.
Also, region_name can be removed from Extra in a case like mine.
To get this working with Airflow 2.1.0 on Digital Ocean Spaces, I had to add the aws_conn_id here:
s3_client = S3Hook(aws_conn_id='123.ams3.digitaloceanspaces.com')
Fill in the Schema as the bucket name, the Login (key) and the Password (secret); the Extra field in the UI then contains the region and host:
{"host": "https://ams3.digitaloceanspaces.com","region_name": "ams3"}

Google Drive API: Can't upload certain filetypes

I created a form and a simple server on Google App Engine with which to upload arbitrary file types to my Google Drive. The form fails to work for certain file types and just gives this error instead:
HttpError: <HttpError 400 when requesting https://www.googleapis.com/upload/drive/v1/files?alt=json returned "Unsupported content with type: application/pdf">
Aren't PDF files supported?
The App Engine code that does the upload goes somewhat like this:
def upload_to_drive(self, filestruct):
    resource = {
        'title': filestruct.filename,
        'mimeType': filestruct.type,
    }
    resource = self.service.files().insert(
        body=resource,
        media_body=MediaInMemoryUpload(filestruct.value,
                                       filestruct.type),
    ).execute()

def post(self):
    creds = StorageByKeyName(Credentials, my_user_id, 'credentials').get()
    self.service = CreateService('drive', 'v1', creds)
    post_dict = self.request.POST
    for key in post_dict.keys():
        if isinstance(post_dict[key], FieldStorage):  # might need to import from cgi
            # upload to drive and return link
            self.upload_to_drive(post_dict[key])  # TODO: there should be error handling here
I've successfully used it for MS Office documents and images. It doesn't work for text files either and gives this error:
HttpError: <HttpError 400 when requesting https://www.googleapis.com/upload/drive/v1/files?alt=json returned "Multipart content has too many non-media parts">
I've tried unsetting the 'mimeType' value in the resource dict to let Google Drive set it automatically. I also tried unsetting the mime type value in the MediaInMemoryUpload constructor. Sadly, neither worked.
It seems to me that you are using an old version of the Python client library and referring to Drive API v1, while Drive API v2 has been available since the end of June.
Please try updating your library and check the complete Python sample at https://developers.google.com/drive/examples/python.
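For illustration, the insert call from the question ported to Drive API v2 might look like this sketch; the service object is assumed to be built for 'v2' (e.g. CreateService('drive', 'v2', creds) with the question's helper), and filestruct is the same FieldStorage object:

from apiclient.http import MediaInMemoryUpload  # googleapiclient.http in newer releases

def upload_to_drive_v2(service, filestruct):
    # Same title/mimeType body and in-memory media upload as in the question,
    # issued against a Drive API v2 service object.
    body = {'title': filestruct.filename, 'mimeType': filestruct.type}
    return service.files().insert(
        body=body,
        media_body=MediaInMemoryUpload(filestruct.value, filestruct.type),
    ).execute()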
