Upload large files to S3 without authentication in Python

I'm trying to upload large files to Amazon S3 without using credentials. I'm creating a plugin for Octoprint with this, and I can't put any sort of credentials into the code due to it being public. Currently my code for uploads looks like this:
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create an S3 client that sends unsigned (anonymous) requests
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

filename = 'file.txt'
bucket_name = 'BUCKET_HERE'
s3.upload_file(filename, bucket_name, filename)
However, it gives me the following error:
S3UploadFailedError: Failed to upload largefiletest.mp4 to BUCKETNAMEHERE/largefiletest.mp4: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Anonymous users cannot initiate multipart uploads. Please authenticate.
Is there any way to work around this, or are there any suggestions for alternative libraries? Anything is appreciated.

Do you mean that the repository is public but the runtime environment is private? If so, the standard practice is to set environment variables like this:
# first: pip install django-environ (it provides the `environ` module)
import environ

env = environ.Env()
SOME_KEY = env('SOME_KEY', default='')
This way, you can easily update your credentials without changing your code or compromising security.
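Since boto3 also reads the standard AWS environment variables on its own, the upload itself can then stay credential-free in source. A minimal sketch, assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set on the machine running the plugin:
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment,
# so no secrets have to live in the (public) source code
s3 = boto3.client('s3')
s3.upload_file('file.txt', 'BUCKET_HERE', 'file.txt')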
Edit:
Then, on the machine where this code will run, you can set the environment variables as described here:
macOS: https://natelandau.com/my-mac-osx-bash_profile/
Linux: https://www.cyberciti.biz/faq/set-environment-variable-linux/
Windows: http://www.dowdandassociates.com/blog/content/howto-set-an-environment-variable-in-windows-command-line-and-registry/

Related

S3 URL - Download with Python

I need to download a file from this URL https://desafio-rkd.s3.amazonaws.com/disney_plus_titles.csv with Python. I tried to do it with requests.get(), but it returns access denied. I understand that I have to authenticate. I have the key and the secret key, but I do not know how to do it.
Help me please?
The preferred way would be to use the boto3 library for Amazon S3. It has a download_file() method, which you would use like this:
import boto3
s3_client = boto3.client('s3')
s3_client.download_file('desafio-rkd', 'disney_plus_titles.csv', 'disney_plus_titles.csv')
The parameters are: Bucket, Key, and the local filename to use when saving the file.
Also, you will need to provide an Access Key and Secret Key. The preferred way to do this is to store them in a credentials file, which can be created with the AWS Command-Line Interface (CLI) aws configure command.
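If you only need a quick local test, the keys can also be passed explicitly when the client is created (placeholder values below), although the credentials file remains the safer option:
import boto3

# placeholder credentials -- never commit real keys to source control
s3_client = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
)
s3_client.download_file('desafio-rkd', 'disney_plus_titles.csv', 'disney_plus_titles.csv')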
See: Credentials — Boto3 documentation

Tensorflow - S3 object does not exist

How do I set up direct private bucket access for Tensorflow?
After running
from tensorflow.python.lib.io import file_io
and then print file_io.stat('s3://my/private/bucket/file.json'), I end up with an error:
NotFoundError: Object s3://my/private/bucket/file.json does not exist
However, the same line on a public object works without an error:
print file_io.stat('s3://ryft-public-sample-data/wikipedia-20150518.bin')
There appears to be an article on S3 support here: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/s3.md
However, I end up with the same error after exporting the variables shown there.
I have awscli set up with all credentials, and boto3 can view and download the file in question. I am wondering how I can get Tensorflow to have S3 access directly when the bucket is private.
I had the same problem when trying to access files in a private S3 bucket from a SageMaker notebook. The mistake I made was trying to use credentials obtained from boto3, which do not seem to be valid outside of it.
The solution was not to specify credentials at all (in that case the role attached to the machine is used) and instead just to specify the region name (for some reason it was not read from the ~/.aws/config file), as follows:
import boto3
import os

# let boto3 discover the region from the attached role/session,
# then expose it to TensorFlow's S3 filesystem via AWS_REGION
session = boto3.Session()
os.environ['AWS_REGION'] = session.region_name
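With AWS_REGION exported, the stat call from the question should then find the private object, assuming the attached role has read access to the bucket:
from tensorflow.python.lib.io import file_io

# same call as in the question, now using the region set above
print(file_io.stat('s3://my/private/bucket/file.json'))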
NOTE: when debugging this error it was useful to look at the CloudWatch logs, as the S3 client's logs were printed only there and not in the Jupyter notebook.
There I first saw that:
When I did specify the credentials from boto3, the error was: The AWS Access Key Id you provided does not exist in our records.
When accessing without the AWS_REGION env variable set, I got The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint., which apparently is common when the correct regional endpoint is not specified (see 301 Moved Permanently after S3 uploading).

Reading data from S3 in a Jupyter notebook on an Ubuntu EC2 deep learning instance

I have several txt and csv datasets in one S3 bucket, my_bucket, and a deep learning Ubuntu EC2 instance. I am using a Jupyter notebook on this instance. I need to read data from S3 into Jupyter.
I have searched almost everywhere in the AWS documentation and forums, together with other blogs. The code below is the best I could come up with. However, after getting both keys and restarting the instance (and AWS too), I still get an error for aws_key.
I'm wondering if anyone has run into this, or has a better idea for getting the data from there. I'm open to anything as long as it does not use HTTP (which would require the data to be public). Thank you.
import os

import pandas as pd
from smart_open import smart_open

aws_key = os.environ['aws_key']
aws_secret = os.environ['aws_secret']
bucket_name = 'my_bucket'
object_key = 'data.csv'

# smart_open accepts credentials embedded in the S3 URI (key:secret@bucket/key)
path = 's3://{}:{}@{}/{}'.format(aws_key, aws_secret, bucket_name, object_key)
df = pd.read_csv(smart_open(path))
Your code sample would work if you first export aws_key and aws_secret. Something like this would work (assuming bash is your shell):
export aws_key=<your key>
export aws_secret=<your aws secret>
python yourscript.py
It is best practice to export things like keys and secrets so that you are not storing confidential/secret things in your source code. If you were to hard code those values into your script and accidentally commit them to a public repo, it would be easy for someone to take over your aws account.
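As a small hardening step, the script can also fail fast with a readable message when the variables are missing, rather than letting os.environ raise a bare KeyError (a sketch, using the same variable names as the question):
import os

# fail fast with a clear message if the environment is not set up
aws_key = os.environ.get('aws_key')
aws_secret = os.environ.get('aws_secret')
if not aws_key or not aws_secret:
    raise RuntimeError('Set the aws_key and aws_secret environment variables before running this script')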
I am answering my own question here and would like to hear from the community on other solutions too. Directly access S3 data from the Ubuntu Deep Learning instance by running:
cd ~/.aws
aws configure
Then update the AWS key and secret key for the instance, just to make sure. Check the awscli version using the command:
aws --version
Read more on configuration
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
In the code above, "aws_key" and "aws_secret" are not set as environment variables on the Ubuntu instance, so the built-in os.environ lookup cannot be used. Instead, assign the values directly (keeping the security caveat above in mind):
aws_key = 'aws_key'        # replace with the actual access key
aws_secret = 'aws_secret'  # replace with the actual secret key
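Alternatively, once the keys are stored via aws configure, boto3 can read the object without embedding credentials in the URL at all. A minimal sketch, assuming the bucket and key names from the question:
import io

import boto3
import pandas as pd

# boto3 picks up the keys stored by `aws configure` (~/.aws/credentials)
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my_bucket', Key='data.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))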

How can I download my data from Google Cloud Platform using Python?

I have my data on Google Cloud Platform and I want to be able to download it locally; this is my first time trying that, and eventually I'll use the downloaded data with my Python code.
I have checked the docs, like https://cloud.google.com/genomics/downloading-credentials-for-api-access and https://cloud.google.com/storage/docs/cloud-console. I have successfully got the JSON file from the first link; the second one is where I'm struggling. I'm using Python 3.5, and assuming my JSON file's name is data.json, I have added the following code:
os.environ["file"] = "data.json"
urllib.request.urlopen('https://storage.googleapis.com/[bucket_name]/[filename]')
First of all, I don't even know what I should call the value passed to environ, so I just called it file, and I'm not sure how I'm supposed to fill it. I also got access denied on the second line. Obviously this is not how to download my file, as there is no local destination or anything in that call. Any guidance will be appreciated.
Edit:
from google.cloud.storage import Blob
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"
storage_client = storage.Client.from_service_account_json('service_account.json')
client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = Blob('path/to/my-object', bucket)
download_to_filename('local/path/to/my-file')
I'm getting an unresolved reference for storage and for download_to_filename, and should I replace service_account.json with credentials/client_secret.json? Plus, I tried to print the content of os.environ["GOOGLE_APPLICATION_CREDENTIALS"]['installed'] like I would with any JSON, but it just said I should give numbers, meaning it read the input path as regular text only.
You should use the idiomatic Google Cloud library to run operations in GCS.
With the example there, and knowing that the client library will pick up the application default credentials, first we have to set the application default credentials with
gcloud auth application-default login
===EDIT===
That was the old way. Now you should use the instructions in this link.
This means downloading a service account key file from the console, and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON.
Also, make sure that this service account has the proper permissions on the project of the bucket.
Or you can create the client with explicit credentials. You'll need to download the key file all the same, but when creating the client, use:
storage_client = storage.Client.from_service_account_json('service_account.json')
==========
And then, following the example code:
from google.cloud import storage
client = storage.Client(project='project-id')
bucket = client.get_bucket('bucket-id')
blob = storage.Blob('path/to/object', bucket)  # the blob name is the object's path inside the bucket
blob.download_to_filename('/path/to/local/save')
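For reference, here is roughly how the code from the question's edit could be made to run, assuming the key file path used there (credentials/client_secret.json) and the question's placeholder names:
import os

from google.cloud import storage

# either export GOOGLE_APPLICATION_CREDENTIALS in the shell, or set it before creating the client
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"

client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('path/to/my-object')              # the object's path inside the bucket
blob.download_to_filename('local/path/to/my-file')   # local destination file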
Or, if this is a one-off download, just install the SDK and use gsutil to download:
gsutil cp gs://bucket/file .

Azure storage container API request authentication failing with Django app

I am trying to sync the static files of my django application to Azure storage. I am getting an error when I try to write static files to the storage container when running the manage.py collectstatic command.
I am getting the error: The MAC signature found in the HTTP request is not the same as any computed signature.
The common cause for this error is mismatched clocks on the two servers, but this is not the problem in my case.
I am using the django packages django-azure-storage and azure-sdk-for-python to format the request.
Here is a gist of the http request and responses generated when trying to connect to the azure storage container.
Is there anything that seems wrong from these outputs?
I have downloaded the Django packages and the Azure SDK following your description. I coded a sample to reproduce this issue, but it works fine on my side. Below are the steps I took:
Set up the environment: Python 2.7 and Azure SDK (0.10.0).
1. Trying to use django-azure-storage
Frustratingly, I didn't manage to import it into my project, since this is the first time I have used it. Usually, I use the Azure Python SDK directly. This time I copied storage.py into my project as the AzureStorage class.
# need to import Django's ContentFile type
from django.core.files.base import ContentFile
# import the AzureStorage class from my project
from DjangoWP.AzureStorage import AzureStorage

# my local image path
file_path = "local.png"

# upload the local file to an Azure storage blob
# (myaccount / mykey are the storage account name and key)
def djangorplugin():
    azurestorage = AzureStorage(myaccount, mykey, "mycontainer")
    stream = open(file_path, 'rb')
    data = stream.read()
    # the file needs to be wrapped in a ContentFile
    azurestorage.save("Testfile1.png", ContentFile(data))
2. You may want to know how to use the Azure SDK for Python directly; below is a code snippet for your reference:
from azure.storage.blobservice import BlobService

# my local image path
file_path = "local.png"

def upload():
    blob_service = BlobService(account_name=myaccount, account_key=mykey)
    stream = open(file_path, 'rb')
    data = stream.read()
    blob_service.put_blob("mycontainer", "local.png", data, "BlockBlob")
If you have any further concerns, please feel free to let us know.
I was incorrectly using the setting DEFAULT_FILE_STORAGE instead of STATICFILES_STORAGE to override the storage backend used while syncing static files. Changing this setting solved this problem.
I was also encountering problems when trying to use django-storages, which specifies to use the DEFAULT_FILE_STORAGE setting in its documentation. However, using STATICFILES_STORAGE with this package also fixed the issue I was having.
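For reference, a rough sketch of that settings change with django-storages (the dotted backend path and setting names below assume a recent django-storages release with the Azure backend installed; account values are placeholders):
# settings.py -- point collectstatic at Azure instead of overriding DEFAULT_FILE_STORAGE
STATICFILES_STORAGE = 'storages.backends.azure_storage.AzureStorage'

AZURE_ACCOUNT_NAME = 'myaccount'   # placeholder storage account name
AZURE_ACCOUNT_KEY = 'mykey'        # placeholder storage account key
AZURE_CONTAINER = 'static'         # placeholder container for the static files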
