I see a lot of code that uses an S3 bucket URL to open a file. I would like to use smart_open to open a compressed file:
session = boto3.Session(KEY_ID, SECRET_KEY)
file = open("s3://bucket/file.txt.gz", transport_params=dict(session=session), encoding="utf-8")
However, all the examples I see of smart_open and other boto3-based reads from a URL only show how to use the session when pushing data to a new bucket, never when pulling data from the URL. Is there a way to use the URL and the session without needing to create a client from my session and access the bucket and key?
As mentioned, you just need to replace "wb" with "rb". I was mistaken in thinking this didn't work:
from smart_open import open
import boto3
url = 's3://bucket/your/keyz'
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key,
                        region_name=region_name)

with open(url, 'rb', transport_params={'client': session.client('s3')}) as fin:
    file = fin.read()
    print(file)
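A follow-up note on the compressed file from the question: smart_open infers gzip compression from the ".gz" suffix and decompresses it transparently, so text mode with an encoding works. A minimal sketch, assuming the same placeholder credentials and key:

from smart_open import open
import boto3

session = boto3.Session(KEY_ID, SECRET_KEY)
url = 's3://bucket/file.txt.gz'

# smart_open decompresses the gzip stream on the fly and decodes it as UTF-8.
with open(url, 'r', encoding='utf-8',
          transport_params={'client': session.client('s3')}) as fin:
    for line in fin:
        print(line.rstrip())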
I'm trying to upload a file to an existing AWS S3 bucket, generate a public URL, and use that URL (somewhere else) to download the file.
I'm closely following the example here:
import os
import boto3
import requests
import tempfile
s3 = boto3.client('s3')
with tempfile.NamedTemporaryFile(mode="w", delete=False) as outfile:
    outfile.write("dummycontent")
    file_name = outfile.name

with open(file_name, mode="r") as outfile:
    s3.upload_file(outfile.name, "twistimages", "filekey")

os.unlink(file_name)

url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': 'twistimages',
        'Key': 'filekey'
    }
)
response = requests.get(url)
print(response)
I would expect to see a success return code (200) from the requests library.
Instead, stdout is: <Response [400]>
Also, if I navigate to the corresponding URL with a web browser, I get an XML file with the error code InvalidRequest and the error message:
The authorization mechanism you have provided is not supported. Please
use AWS4-HMAC-SHA256.
How can I use boto3 to generate a public URL, which can easily be downloaded by any user by just navigating to the corresponding URL, without generating complex headers?
Why does the example code from the official documentation not work in my case?
AWS S3 still supports the legacy v2 signature in the older US regions (launched prior to 2014), but newer AWS regions only allow AWS4-HMAC-SHA256 (s3v4).
To use s3v4, you must specify it explicitly, either in the ~/.aws/config file or when instantiating the boto3 S3 resource/client, e.g.:
# add this entry under ~/.aws/config
[default]
s3.signature_version = s3v4
[profile otherprofile]
s3.signature_version = s3v4
Or declare it explicitly when creating the client or resource:
s3client = boto3.client('s3', config=boto3.session.Config(signature_version='s3v4'))
s3resource = boto3.resource('s3', config=boto3.session.Config(signature_version='s3v4'))
I solved the issue. I'm using an S3 bucket in the eu-central-1 region and after specifying the region in the config file, everything worked as expected and the script stdout was <Response [200]>.
The configuration file (~/.aws/config) now looks like:
[default]
region=eu-central-1
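Equivalently, the region and signature version can be passed in code instead of via the config file. A minimal sketch, reusing the bucket and key from the example above:

import boto3
from botocore.client import Config

# Region and signature version declared at client construction time.
s3 = boto3.client('s3',
                  region_name='eu-central-1',
                  config=Config(signature_version='s3v4'))

url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={'Bucket': 'twistimages', 'Key': 'filekey'},
    ExpiresIn=3600,  # how long the link stays valid, in seconds
)
print(url)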
I am working on a process to dump files from a Redshift database, and would prefer not to have to download the files locally in order to process the data. I saw that Java has a StreamingObject class that does what I want, but I haven't seen anything similar in boto3.
If you have a mybucket S3 bucket, which contains a beer key, here is how to download and fetch the value without storing it in a local file:
import boto3
s3 = boto3.resource('s3')
print(s3.Object('mybucket', 'beer').get()['Body'].read())
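If the object is too large to hold in memory at once, the Body is a streaming object that can be consumed in chunks instead of read in one go. A minimal sketch, keeping the same bucket and key and using a hypothetical process() handler:

import boto3

s3 = boto3.resource('s3')
body = s3.Object('mybucket', 'beer').get()['Body']

# Iterate over the response stream in 1 MiB chunks instead of loading it all.
for chunk in body.iter_chunks(chunk_size=1024 * 1024):
    process(chunk)  # hypothetical handler for each chunk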
smart_open is a Python 3 library for efficient streaming of very large files from/to storage such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem.
https://pypi.org/project/smart-open/
import json
import boto3
import smart_open

client = boto3.client(service_name='s3',
                      aws_access_key_id=AWS_ACCESS_KEY_ID,
                      aws_secret_access_key=AWS_SECRET_KEY)
url = 's3://.............'

fin = smart_open.open(url, 'r', transport_params={'client': client})
for line in fin:
    data = json.loads(line)
    print(data)
fin.close()
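The same transport_params pattern works for writing as well; a minimal sketch with a placeholder key:

import boto3
import smart_open

client = boto3.client('s3')

# 'w' streams the upload to S3 as you write.
with smart_open.open('s3://bucket/output.txt', 'w',
                     transport_params={'client': client}) as fout:
    fout.write('hello from smart_open\n')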
This may or may not be relevant to what you want to do, but for my situation one thing that worked well was using tempfile:
import tempfile
import boto3
bucket_name = '[BUCKET_NAME]'
key_name = '[OBJECT_KEY_NAME]'
s3 = boto3.resource('s3')
temp = tempfile.NamedTemporaryFile()
s3.Bucket(bucket_name).download_file(key_name, temp.name)
# do what you will with your file...
temp.close()
I actually use this solution:
import boto3
s3_client = boto3.client('s3')
def get_content_from_s3(bucket: str, key: str) -> str:
    """Return the content of an S3 object as text.

    param: bucket, s3 bucket
    param: key, path to the file, f.i. folder/subfolder/file.txt
    """
    s3_file = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()
    return s3_file.decode('utf-8').strip()
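Calling it is then straightforward; the bucket and key below are placeholders:

# Hypothetical bucket and key, purely for illustration.
content = get_content_from_s3('my-bucket', 'folder/subfolder/file.txt')
print(content)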
I am successfully authenticating with AWS and using the 'put_object' method on the Bucket object to upload a file. Now I want to use the multipart API to accomplish this for large files. I found the accepted answer in this question:
How to save S3 object to a file using boto3
But when I try to implement it, I get "unknown method" errors. What am I doing wrong? My code is below. Thanks!
## Get an AWS Session
self.awsSession = Session(aws_access_key_id=accessKey,
                          aws_secret_access_key=secretKey,
                          aws_session_token=session_token,
                          region_name=region_type)

...

# Upload the file to S3
s3 = self.awsSession.resource('s3')
s3.Bucket('prodbucket').put_object(Key=fileToUpload, Body=data)  # WORKS
# s3.Bucket('prodbucket').upload_file(dataFileName, 'prodbucket', fileToUpload)  # DOESNT WORK
# s3.upload_file(dataFileName, 'prodbucket', fileToUpload)  # DOESNT WORK
The upload_file method has not been ported over to the bucket resource yet. For now you'll need to use the client object directly to do this:
client = self.awsSession.client('s3')
client.upload_file(...)
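Note that client.upload_file already performs a multipart upload for large files under the hood; the thresholds and concurrency can be tuned with a TransferConfig. A minimal sketch, reusing the names from the question:

from boto3.s3.transfer import TransferConfig

client = self.awsSession.client('s3')

# Force multipart for anything over 8 MiB and upload parts concurrently.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        multipart_chunksize=8 * 1024 * 1024,
                        max_concurrency=4)

client.upload_file(dataFileName, 'prodbucket', fileToUpload, Config=config)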
The Libcloud S3 wrapper transparently handles all the splitting and uploading of the parts for you.
Use the upload_object_via_stream method to do so:
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver
# Path to a very large file you want to upload
FILE_PATH = '/home/user/myfile.tar.gz'
cls = get_driver(Provider.S3)
driver = cls('api key', 'api secret key')
container = driver.get_container(container_name='my-backups-12345')
# This method blocks until all the parts have been uploaded.
extra = {'content_type': 'application/octet-stream'}
with open(FILE_PATH, 'rb') as iterator:
    obj = driver.upload_object_via_stream(iterator=iterator,
                                          container=container,
                                          object_name='backup.tar.gz',
                                          extra=extra)
For official documentation on the S3 multipart feature, refer to the AWS official blog.
I am getting strange HTTP errors after I load a file from GCS in my Python web app:
suspended generator urlfetch(context.py:1214) raised DeadlineExceededError(Deadline exceeded while waiting for HTTP response from URL: https://storage.googleapis.com/[bucketname]/dailyData_2014-01-11.zip)
However, based on what the app is logging below, it has already loaded the file (and based on memory usage, appears to be in memory).
bucket = '/[bucketname]'
filename = bucket + '/dailyData'+datetime.datetime.today().strftime('%Y-%m-%d')+'.zip'
gcs_file = gcs.open(filename,'r')
gcs_stats = gcs.stat(filename)
logging.info(gcs_stats)
zip_file = zipfile.ZipFile(gcs_file, 'r')
logging.info("zip file loaded")
Is there a way I should be closing the HTTP request, or is it not actually loading the zip_file into memory and instead trying to pull it from GCS every time? Thanks!
You should make sure you close the files you're opening. You can use a with statement, which will automatically close the file when the block exits:
bucket = '/[bucketname]'
filename = bucket + '/dailyData'+datetime.datetime.today().strftime('%Y-%m-%d')+'.zip'
gcs_stats = gcs.stat(filename)
logging.info(gcs_stats)
with gcs.open(filename, 'r') as gcs_file:
    with zipfile.ZipFile(gcs_file, 'r') as zip_file:
        logging.info("zip file loaded")
I'm trying to upload files to the blobstore in my Google App Engine app without using a form. But I'm stuck on how to get the app to read my local file. I'm pretty new to Python and App Engine, but after some cutting and pasting I've ended up with this:
import webapp2
import urllib
import os
from google.appengine.api import files
from poster.encode import multipart_encode
class Upload(webapp2.RequestHandler):
    def get(self):
        # Create the file in blobstore
        file_name = files.blobstore.create(mime_type='application/octet-stream')

        # Get the local file path from a URL param
        file_path = self.request.get('file')

        # Read the file
        file = open(file_path, "rb")
        datagen, headers = multipart_encode({"file": file})
        data = str().join(datagen)  # this is supposedly memory intense and slow

        # Open the blobstore file and write to it
        with files.open(file_name, 'a') as f:
            f.write(data)

        # Finalize the file. Do this before attempting to read it.
        files.finalize(file_name)

        # Get the file's blob key
        blob_key = files.blobstore.get_blob_key(file_name)
The problem now is that I don't really know how to get hold of the local file.
You can't read from the local file system from within the application itself; you will need to use an HTTP POST to send the file to the app.
You can certainly do this from within another application: you just need to create the MIME multipart message with the file content and POST it to your app. The sending application will have to build that HTTP request manually; have a read on how to create a MIME multipart message using C#.
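If the sending application happened to be Python rather than C#, a minimal sketch of that multipart POST using the requests library could look like this (the endpoint URL and file path are placeholders):

import requests

# Hypothetical upload endpoint exposed by the App Engine app.
upload_url = 'https://your-app.appspot.com/upload'

with open('/path/to/local/file.bin', 'rb') as f:
    # requests builds the multipart/form-data body for us.
    response = requests.post(upload_url, files={'file': f})

print(response.status_code)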