unable to download image from S3 bucket - python

I have an S3 bucket URL of an image. I am trying to download this image using urllib or wget; in both cases the code executes successfully, but a corrupt image is downloaded. When I say corrupt, I mean that for a 2 MB image only about 200 KB gets downloaded.
urllib.urlretrieve(str(sys.argv[1]), "data/img"+str(randomword(10))+".jpg")
In the latter part of the line, I am just appending a random string as the name of the image to be downloaded.
Please help.

You can grab the file by authenticating first and then downloading. I'd recommend using the Python boto library to deal with Amazon Web Services. If you did that, the code would look something like this:
import boto
import os
# Set your AWS creds in your environment, or hardcode them here
AWS_ACCESS_KEY_ID = os.getenv("AWS_KEY_ID")
AWS_ACCESS_SECRET_KEY = os.getenv("AWS_ACCESS_KEY")
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_ACCESS_SECRET_KEY)
bucket = conn.get_bucket("my_bucket_name")
key = bucket.get_key('file_on_s3.txt')
key.get_contents_to_filename('where_file_goes_locally.txt')
If you really don't want to use boto, you can piece it together manually and essentially do what boto does: build up the right request headers from your AWS creds. I do this with a bash script on a server of mine; it should point you in the right direction (https://gist.github.com/davidejones/d05f51df75e659111227). If you want to rewrite it with Python requests or urllib, that should work too.
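For a newer stack, boto's successor boto3 does the same job. Below is a minimal, untested sketch: the parse_s3_url() helper and the example URL formats are my own illustration, not part of the original answer, and it assumes your AWS credentials are already configured in the environment.

```python
from urllib.parse import urlparse

def parse_s3_url(url):
    """Split an S3 URL into (bucket, key).

    Handles both path-style (https://s3.amazonaws.com/bucket/key)
    and virtual-hosted-style (https://bucket.s3.amazonaws.com/key) URLs.
    """
    parsed = urlparse(url)
    path = parsed.path.lstrip("/")
    if parsed.netloc.startswith("s3.") or parsed.netloc.startswith("s3-"):
        bucket, _, key = path.partition("/")   # path-style URL
    else:
        bucket = parsed.netloc.split(".")[0]   # virtual-hosted-style URL
        key = path
    return bucket, key

def download_image(url, dest):
    import boto3  # imported here so the parser above stays dependency-free
    bucket, key = parse_s3_url(url)
    boto3.client("s3").download_file(bucket, key, dest)
```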

Related

How to send/copy/upload file from AWS S3 to Google GCS using Python

I'm looking for a Pythonic way to copy a file from AWS S3 to GCS.
I do not want to open/read the file and then use the blob.upload_from_string() method. I want to transfer it 'as-is'.
I cannot use 'gsutil'. The libraries I'm working with are gcloud and boto3 (I also experimented with s3fs).
Here is a simple example (that seems to work) using the blob.upload_from_string() method, which I'm trying to avoid because I don't want to open/read the file. I fail to make it work using the blob.upload_from_file() method because the GCS API requires an accessible, readable, file-like object, which I fail to properly provide.
What am I missing? Suggestions?
import boto3
from gcloud import storage
from oauth2client.service_account import ServiceAccountCredentials
GSC_Token_File = 'path/to/GSC_token'
s3 = boto3.client('s3', region_name='MyRegion')  # I'm running from AWS Lambda, no explicit credentials required
gcs_credentials = ServiceAccountCredentials.from_json_keyfile_name(GSC_Token_File)  # a file path needs from_json_keyfile_name(), not from_json_keyfile_dict()
gcs_storage_client = storage.Client(credentials=gcs_credentials, project='MyGCP_project')
gcs_bucket = gcs_storage_client.get_bucket('MyGCS_bucket')
s3_file_to_load = str(s3.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')['Body'].read().decode('utf-8'))
blob = gcs_bucket.blob('file_to_copy.txt')
blob.upload_from_string(s3_file_to_load)
So I poked around a bit more and came across this article, which eventually led me to this solution. Apparently the GCS API can be called using the AWS boto3 SDK.
Please mind the HMAC key prerequisite, which can be easily created using these instructions.
import boto3
# I'm using a GCP Service Account, so my HMAC key was created accordingly.
# An HMAC key for a User Account can be created just as well.
service_Access_key = 'YourAccessKey'
service_Secret = 'YourSecretKey'
# Reminder: I am copying from S3 to GCS
s3_client = boto3.client('s3', region_name='MyRegion')
gcs_client = boto3.client(
    "s3",  # yes, literally "s3": GCS's XML API speaks the S3 protocol
    region_name="auto",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id=service_Access_key,
    aws_secret_access_key=service_Secret,
)
file_to_transfer = s3_client.get_object(Bucket='MyS3_bucket', Key='path/to/file_to_copy.txt')
gcs_client.upload_fileobj(file_to_transfer['Body'], 'MyGCS_bucket', 'file_to_copy.txt')
I understand you're trying to move files from S3 to GCS using Python in an AWS Lambda function. One thing I'd like to clarify about the statement "I don't want to open/read the file": when the file is downloaded from S3, you are indeed reading it and writing it somewhere, be it into an in-memory string or a temporary file. In that sense it doesn't actually matter which of blob.upload_from_file() or blob.upload_from_string() is used, as they're equivalent; the first reads from a file and the second doesn't, because the data has already been read into memory. Therefore my suggestion would be to keep the code as it is; I don't see a benefit in changing it.
Anyway the file approach should be possible doing something along the lines below (untested, I have no S3 to check):
# From S3 boto docs: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
blob.upload_from_filename('FILE_NAME')  # upload_from_file() expects a file object; for a path, use upload_from_filename()
Finally, it is worth mentioning the Storage Transfer tool, which is intended for moving huge amounts of data from S3 to GCS. If that sounds like your use case, you may take a look at the code samples for Python.

How can I use AWS Textract with Python

I have tested almost every example code I can find on the Internet for Amazon Textract and I can't get it to work. I can upload and download a file to S3 from my Python client, so the credentials should be OK. Many of the errors point to some region failure, but I have tried every possible combination.
Here is one of the last test calls:
def test_parse_3():
    # Document
    s3BucketName = "xx-xxxx-xx"
    documentName = "xxxx.jpg"
    # Amazon Textract client
    textract = boto3.client('textract')
    # Call Amazon Textract
    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        })
    print(response)
It seems pretty easy, but it generates this error:
botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the DetectDocumentText operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
Any ideas what's wrong, and does someone have a working example (I know the tabs are not correct in the example code)?
I have also tested a lot of permission settings in AWS. The credentials are in a hidden file created by the AWS SDK.
I am sure you already know, but the bucket name is case sensitive. If you have verified that both the bucket and the object name are correct, just make sure to add the appropriate region to your credentials.
I tested just reading from S3 without including the region in the credentials, and I was able to list the objects in the bucket with no issues. I am thinking this worked because S3 is supposed to be region agnostic. However, since Textract is region specific, you must define the region in your credentials when using Textract to get the data from the S3 bucket.
I realize this was asked a few months ago, but I am hoping this sheds some light for others who face this issue in the future.
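To make the region explicit in code rather than in the credentials file, something like the sketch below should work (untested; the default region and the extract_lines() helper are my own illustration). The helper just pulls the detected LINE text out of the response for convenience.

```python
def detect_document_text(bucket, name, region="eu-west-1"):
    import boto3  # region_name must match the region of the S3 bucket
    textract = boto3.client("textract", region_name=region)
    return textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": name}}
    )

def extract_lines(response):
    """Collect the text of all LINE blocks in a Textract response."""
    return [block["Text"]
            for block in response.get("Blocks", [])
            if block["BlockType"] == "LINE"]
```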

gcs client library stopped working with dev_appserver

The Google Cloud Storage client library is returning a 500 error when I attempt to upload via the development server.
ServerError: Expect status [200] from Google Storage. But got status 500.
I haven't changed anything with the project and the code still works correctly in production.
I've attempted gcloud components update to get the latest dev_server and I've updated to the latest google cloud storage client library.
I've run gcloud init again to make sure credentials are loaded and I've made sure I'm using the correct bucket.
The project is running on Windows 10.
Python version 2.7
Any idea why this is happening?
Thanks
Turns out this has been a problem for a while.
It has to do with how blobstore filenames are generated.
https://issuetracker.google.com/issues/35900575
The fix is to monkeypatch this file:
google-cloud-sdk\platform\google_appengine\google\appengine\api\blobstore\file_blob_storage.py
def _FileForBlob(self, blob_key):
    """Calculate full filename to store blob contents in.
    This method does not check to see if the file actually exists.
    Args:
      blob_key: Blob key of blob to calculate file for.
    Returns:
      Complete path for file used for storing blob.
    """
    blob_key = self._BlobKey(blob_key)
    # Remove bad characters.
    import re
    blob_fname = re.sub(r"[^\w\./\\]", "_", str(blob_key))
    # Make sure it's a relative directory.
    if blob_fname and blob_fname[0] in "/\\":
        blob_fname = blob_fname[1:]
    return os.path.join(self._DirectoryForBlob(blob_key), blob_fname)

How can I download my data from google-cloud-platform using python?

I have my data on Google Cloud Platform and I want to be able to download it locally. This is my first time trying that, and eventually I'll use the downloaded data with my Python code.
I have checked the docs, like https://cloud.google.com/genomics/downloading-credentials-for-api-access and https://cloud.google.com/storage/docs/cloud-console. I have successfully got the JSON file for the first link; the second one is where I'm struggling. I'm using Python 3.5, and assuming my JSON file's name is data.json, I have added the following code:
os.environ["file"] = "data.json"
urllib.request.urlopen('https://storage.googleapis.com/[bucket_name]/[filename]')
First of all, I don't even know what I should call the value near environ, so I just called it file; I'm not sure how I'm supposed to fill it. And I got access denied on the second line. Obviously that's not how to download my file, as there is no local destination directory or anything in that command. Any guidance will be appreciated.
Edit:
from google.cloud.storage import Blob
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials/client_secret.json"
storage_client = storage.Client.from_service_account_json('service_account.json')
client = storage.Client(project='my-project')
bucket = client.get_bucket('my-bucket')
blob = Blob('path/to/my-object', bucket)
download_to_filename('local/path/to/my-file')
I'm getting an unresolved reference for storage and download_to_filename. And should I replace service_account.json with credentials/client_secret.json? Plus, I tried to print the content of os.environ["GOOGLE_APPLICATION_CREDENTIALS"]['installed'] like I would with any JSON, but it just said I should give numbers, meaning it read the input path as regular text only.
You should use the idiomatic Google Cloud library to run operations in GCS.
With the example there, and knowing that the client library will pick up the application default credentials, we first have to set the application default credentials with:
gcloud auth application-default login
===EDIT===
That was the old way. Now you should use the instructions in this link.
This means downloading a service account key file from the console, and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the downloaded JSON.
Also, make sure that this service account has the proper permissions on the project of the bucket.
Or you can create the client with explicit credentials. You'll need to download the key file all the same, but when creating the client, use:
storage_client = storage.Client.from_service_account_json('service_account.json')
==========
And then, following the example code:
from google.cloud import storage
client = storage.Client(project='project-id')
bucket = client.get_bucket('bucket-id')
blob = storage.Blob('file/path/in/bucket', bucket)
blob.download_to_filename('/path/to/local/save')
Or, if this is a one-off download, just install the SDK and use gsutil to download:
gsutil cp gs://bucket/file .

Storing user images on AWS

I'm implementing a simple app using ionic2, which calls an API built using Flask. When setting up their profile, I give users the option to upload their own images.
I thought of storing them in an S3 bucket and serving them through CloudFront.
After some research I can only find information about:
Uploading images from local storage using Python.
Uploading images from an HTML file selector using JavaScript.
I can't find anything about how to deal with blobs/files when you have a front end interacting with an API. When I started researching, the options I had thought of were:
1. Post the file to Amazon on the client side and return the CloudFront URL directly to the back end. I am not too keen on this one because it would involve having some kind of secret on the client side (maybe it's not that dangerous, but I would rather have it on the back end).
2. Upload the image to the server and somehow tell the back end which file we want it to choose. I am not too keen on this approach either, because the client would need to have knowledge about the server itself (not only the API).
3. Encode the image (I have thought of base64, but with the lack of examples I think that it is plain wrong) and post it to the back end, which will handle the S3 upload and store the CloudFront URL.
I feel like all these approaches are plain wrong, but I can't think (or find) what is the right way of doing it.
How should I approach it?
Have the server generate a pre-signed URL for the client to upload the image to. That means the server is in control of what the URLs will look like and it doesn't expose any secrets, yet the client can upload the image directly to S3.
Generating a pre-signed URL in Python using boto3 looks something like this:
s3 = boto3.client('s3', aws_access_key_id=..., aws_secret_access_key=...)
params = dict(Bucket='my-bucket', Key='myfile.jpg', ContentType='image/jpeg')
url = s3.generate_presigned_url('put_object', Params=params, ExpiresIn=600)
The ContentType is optional, and the client will have to send the same Content-Type HTTP header during its upload to url; I find it handy for limiting the allowable file types when they are known.
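On the client side, the upload is then a plain HTTP PUT of the raw bytes to that URL. A rough sketch in Python (imagining the client were Python rather than ionic2; build_put_request() and the function names are my own illustration):

```python
import urllib.request

def build_put_request(url, data, content_type="image/jpeg"):
    # The Content-Type header must match the ContentType the server used
    # when generating the pre-signed URL, or S3 rejects the upload.
    return urllib.request.Request(
        url, data=data, method="PUT",
        headers={"Content-Type": content_type})

def upload(url, data, content_type="image/jpeg"):
    # Send the PUT; S3 answers 200 on a successful upload.
    with urllib.request.urlopen(build_put_request(url, data, content_type)) as resp:
        return resp.status
```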
