How can I access s3 files in Python using urls?

How can I access s3 files in Python using urls? - python

I want to write a Python script that will read and write files from s3 using their url's, eg:'s3:/mybucket/file'. It would need to run locally and in the cloud without any code changes. Is there a way to do this?
Edit: There are some good suggestions here but what I really want is something that allows me to do this:
myfile = open("s3://mybucket/file", "r")
and then use that file object like any other file object. That would be really cool. I might just write something like this for myself if it doesn't exist. I could build that abstraction layer on simples3 or boto.

For opening, it should be as simple as:
import urllib
opener = urllib.URLopener()
myurl = "https://s3.amazonaws.com/skyl/fake.xyz"
myfile = opener.open(myurl)
This will work with s3 if the file is public.
To write a file using boto, it goes a little something like this:
from boto.s3.connection import S3Connection
conn = S3Connection(AWS_KEY, AWS_SECRET)
bucket = conn.get_bucket(BUCKET)
destination = bucket.new_key()
destination.name = filename
destination.set_contents_from_file(myfile)
destination.make_public()
lemme know if this works for you :)

Here's how they do it in awscli :
def find_bucket_key(s3_path):
"""
This is a helper function that given an s3 path such that the path is of
the form: bucket/key
It will return the bucket and the key represented by the s3 path
"""
s3_components = s3_path.split('/')
bucket = s3_components[0]
s3_key = ""
if len(s3_components) > 1:
s3_key = '/'.join(s3_components[1:])
return bucket, s3_key
def split_s3_bucket_key(s3_path):
"""Split s3 path into bucket and key prefix.
This will also handle the s3:// prefix.
:return: Tuple of ('bucketname', 'keyname')
"""
if s3_path.startswith('s3://'):
s3_path = s3_path[5:]
return find_bucket_key(s3_path)
Which you could just use with code like this
from awscli.customizations.s3.utils import split_s3_bucket_key
import boto3
client = boto3.client('s3')
bucket_name, key_name = split_s3_bucket_key(
's3://example-bucket-name/path/to/example.txt')
response = client.get_object(Bucket=bucket_name, Key=key_name)
This doesn't address the goal of interacting with an s3 key as a file like object but it's a step in that direction.

I haven't seen something that would work directly with S3 urls, but you could use an S3 access library (simples3 looks decent) and some simple string manipulation:
>>> url = "s3:/bucket/path/"
>>> _, path = url.split(":", 1)
>>> path = path.lstrip("/")
>>> bucket, path = path.split("/", 1)
>>> print bucket
'bucket'
>>> print path
'path/'

Try s3fs
First example on the docs:
>>> import s3fs
>>> fs = s3fs.S3FileSystem(anon=True)
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
... print(f.read())
b'Hello, world'

You can use Boto Python API for accessing S3 by python. Its a good library. After you do the installation of Boto, following sample programe will work for you
>>> k = Key(b)
>>> k.key = 'yourfile'
>>> k.set_contents_from_filename('yourfile.txt')
You can find more information here http://boto.cloudhackers.com/s3_tut.html#storing-data

http://s3tools.org/s3cmd works pretty well and support the s3:// form of the URL structure you want. It does the business on Linux and Windows. If you need a native API to call from within a python program then http://code.google.com/p/boto/ is a better choice.

Related

How to filter Boto3 s3 objects?

I want to change this line
files = os.listdir('/Users/milenko/mario/Json_gzips')
in my code,to read .gz files from my bucket straight into list.
I tried
>>> import boto3
>>> s3 = boto3.resource('s3')
>>> s3
s3.ServiceResource()
>>> my_bucket = s3.Bucket('cw-dushpica-tests')
>>> for object_summary in my_bucket.objects.filter(Prefix='*.gz'):
... print(object_summary)
There is no output,it does print nothing.
for object_summary in my_bucket.objects.filter(Prefix='/'):
... print(object_summary)
The same,got nothing.
How should my prefix look like?

The prefix parameter of the filter method means that
Prefix (string) -- Limits the response to keys that begin with the specified prefix.
So, you can limit the path to the specific folder and then filter by yourself for the file extension.
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('your_bucket')
keys = []
for obj in bucket.objects.filter(Prefix='path/to/files/'):
if obj.key.endswith('gz'):
keys.append(obj.key)
print(keys)

Python - Cannot Create A Variable That Also Creates A New Line

I Seem To Cannot Create A Variable That Also Creates A New Line
def S3ListAllObjects(event, context):
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('amazons3bucket')
returnoutput = ""
for obj in bucket.objects.all():
returnoutput += obj.key
return returnoutput
For whatever reason I cannot seem to make the returnme variable output a newline. I have tried many different variations of the code but have had no luck. What am I missing?
Thanks!

Are you sure you want key? Perhaps you want name?
>>> import boto3
>>> s3 = boto3.resource('s3')
>>> for bucket in s3.buckets.all():
print(bucket.name)

Listing contents of a bucket with boto3

How can I see what's inside a bucket in S3 with boto3? (i.e. do an "ls")?
Doing the following:
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('some/path/')
returns:
s3.Bucket(name='some/path/')
How do I see its contents?

One way to see the contents would be:
for my_bucket_object in my_bucket.objects.all():
print(my_bucket_object)

This is similar to an 'ls' but it does not take into account the prefix folder convention and will list the objects in the bucket. It's left up to the reader to filter out prefixes which are part of the Key name.
In Python 2:
from boto.s3.connection import S3Connection
conn = S3Connection() # assumes boto.cfg setup
bucket = conn.get_bucket('bucket_name')
for obj in bucket.get_all_keys():
print(obj.key)
In Python 3:
from boto3 import client
conn = client('s3') # again assumes boto.cfg setup, assume AWS S3
for key in conn.list_objects(Bucket='bucket_name')['Contents']:
print(key['Key'])

I'm assuming you have configured authentication separately.
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucket_name')
for file in my_bucket.objects.all():
print(file.key)

My s3 keys utility function is essentially an optimized version of #Hephaestus's answer:
import boto3
s3_paginator = boto3.client('s3').get_paginator('list_objects_v2')
def keys(bucket_name, prefix='/', delimiter='/', start_after=''):
prefix = prefix.lstrip(delimiter)
start_after = (start_after or prefix) if prefix.endswith(delimiter) else start_after
for page in s3_paginator.paginate(Bucket=bucket_name, Prefix=prefix, StartAfter=start_after):
for content in page.get('Contents', ()):
yield content['Key']
In my tests (boto3 1.9.84), it's significantly faster than the equivalent (but simpler) code:
import boto3
def keys(bucket_name, prefix='/', delimiter='/'):
prefix = prefix.lstrip(delimiter)
bucket = boto3.resource('s3').Bucket(bucket_name)
return (_.key for _ in bucket.objects.filter(Prefix=prefix))
As S3 guarantees UTF-8 binary sorted results, a start_after optimization has been added to the first function.

In order to handle large key listings (i.e. when the directory list is greater than 1000 items), I used the following code to accumulate key values (i.e. filenames) with multiple listings (thanks to Amelio above for the first lines). Code is for python3:
from boto3 import client
bucket_name = "my_bucket"
prefix = "my_key/sub_key/lots_o_files"
s3_conn = client('s3') # type: BaseClient ## again assumes boto.cfg setup, assume AWS S3
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter = "/")
if 'Contents' not in s3_result:
#print(s3_result)
return []
file_list = []
for key in s3_result['Contents']:
file_list.append(key['Key'])
print(f"List count = {len(file_list)}")
while s3_result['IsTruncated']:
continuation_key = s3_result['NextContinuationToken']
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter="/", ContinuationToken=continuation_key)
for key in s3_result['Contents']:
file_list.append(key['Key'])
print(f"List count = {len(file_list)}")
return file_list

If you want to pass the ACCESS and SECRET keys (which you should not do, because it is not secure):
from boto3.session import Session
ACCESS_KEY='your_access_key'
SECRET_KEY='your_secret_key'
session = Session(aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')
your_bucket = s3.Bucket('your_bucket')
for s3_file in your_bucket.objects.all():
print(s3_file.key)

#To print all filenames in a bucket
import boto3
s3 = boto3.client('s3')
def get_s3_keys(bucket):
"""Get a list of keys in an S3 bucket."""
resp = s3.list_objects_v2(Bucket=bucket)
for obj in resp['Contents']:
files = obj['Key']
return files
filename = get_s3_keys('your_bucket_name')
print(filename)
#To print all filenames in a certain directory in a bucket
import boto3
s3 = boto3.client('s3')
def get_s3_keys(bucket, prefix):
"""Get a list of keys in an S3 bucket."""
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp['Contents']:
files = obj['Key']
print(files)
return files
filename = get_s3_keys('your_bucket_name', 'folder_name/sub_folder_name/')
print(filename)
Update:
The most easiest way is to use awswrangler
import awswrangler as wr
wr.s3.list_objects('s3://bucket_name')

import boto3
s3 = boto3.resource('s3')
## Bucket to use
my_bucket = s3.Bucket('city-bucket')
## List objects within a given prefix
for obj in my_bucket.objects.filter(Delimiter='/', Prefix='city/'):
print obj.key
Output:
city/pune.csv
city/goa.csv

A more parsimonious way, rather than iterating through via a for loop you could also just print the original object containing all files inside your S3 bucket:
session = Session(aws_access_key_id=aws_access_key_id,aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
bucket = s3.Bucket('bucket_name')
files_in_s3 = bucket.objects.all()
#you can print this iterable with print(list(files_in_s3))

So you're asking for the equivalent of aws s3 ls in boto3. This would be listing all the top level folders and files. This is the closest I could get; it only lists all the top level folders. Surprising how difficult such a simple operation is.
import boto3
def s3_ls():
s3 = boto3.resource('s3')
bucket = s3.Bucket('example-bucket')
result = bucket.meta.client.list_objects(Bucket=bucket.name,
Delimiter='/')
for o in result.get('CommonPrefixes'):
print(o.get('Prefix'))

ObjectSummary:
There are two identifiers that are attached to the ObjectSummary:
bucket_name
key
boto3 S3: ObjectSummary
More on Object Keys from AWS S3 Documentation:
Object Keys:
When you create an object, you specify the key name, which uniquely identifies the object in the bucket. For example, in the Amazon S3 console (see AWS Management Console), when you highlight a bucket, a list of objects in your bucket appears. These names are the object keys. The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.
The Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using key name prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders. Suppose that your bucket (admin-created) has four objects with the following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
Reference:
AWS S3: Object Keys
Here is some example code that demonstrates how to get the bucket name and the object key.
Example:
import boto3
from pprint import pprint
def main():
def enumerate_s3():
s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
print("Name: {}".format(bucket.name))
print("Creation Date: {}".format(bucket.creation_date))
for object in bucket.objects.all():
print("Object: {}".format(object))
print("Object bucket_name: {}".format(object.bucket_name))
print("Object key: {}".format(object.key))
enumerate_s3()
if __name__ == '__main__':
main()

Here is a simple function that returns you the filenames of all files or files with certain types such as 'json', 'jpg'.
def get_file_list_s3(bucket, prefix="", file_extension=None):
"""Return the list of all file paths (prefix + file name) with certain type or all
Parameters
----------
bucket: str
The name of the bucket. For example, if your bucket is "s3://my_bucket" then it should be "my_bucket"
prefix: str
The full path to the the 'folder' of the files (objects). For example, if your files are in
s3://my_bucket/recipes/deserts then it should be "recipes/deserts". Default : ""
file_extension: str
The type of the files. If you want all, just leave it None. If you only want "json" files then it
should be "json". Default: None
Return
------
file_names: list
The list of file names including the prefix
"""
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucket)
file_objs = my_bucket.objects.filter(Prefix=prefix).all()
file_names = [file_obj.key for file_obj in file_objs if file_extension is not None and file_obj.key.split(".")[-1] == file_extension]
return file_names

One way that I used to do this:
import boto3
s3 = boto3.resource('s3')
bucket=s3.Bucket("bucket_name")
contents = [_.key for _ in bucket.objects.all() if "subfolders/ifany/" in _.key]

Here is the solution
import boto3
s3=boto3.resource('s3')
BUCKET_NAME = 'Your S3 Bucket Name'
allFiles = s3.Bucket(BUCKET_NAME).objects.all()
for file in allFiles:
print(file.key)

I just did it like this, including the authentication method:
s3_client = boto3.client(
's3',
aws_access_key_id='access_key',
aws_secret_access_key='access_key_secret',
config=boto3.session.Config(signature_version='s3v4'),
region_name='region'
)
response = s3_client.list_objects(Bucket='bucket_name', Prefix=key)
if ('Contents' in response):
# Object / key exists!
return True
else:
# Object / key DOES NOT exist!
return False

With little modification to #Hephaeastus 's code in one of the above comments, wrote the below method to list down folders and objects (files) in a given path. Works similar to s3 ls command.
from boto3 import session
def s3_ls(profile=None, bucket_name=None, folder_path=None):
folders=[]
files=[]
result=dict()
bucket_name = bucket_name
prefix= folder_path
session = boto3.Session(profile_name=profile)
s3_conn = session.client('s3')
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Delimiter = "/", Prefix=prefix)
if 'Contents' not in s3_result and 'CommonPrefixes' not in s3_result:
return []
if s3_result.get('CommonPrefixes'):
for folder in s3_result['CommonPrefixes']:
folders.append(folder.get('Prefix'))
if s3_result.get('Contents'):
for key in s3_result['Contents']:
files.append(key['Key'])
while s3_result['IsTruncated']:
continuation_key = s3_result['NextContinuationToken']
s3_result = s3_conn.list_objects_v2(Bucket=bucket_name, Delimiter="/", ContinuationToken=continuation_key, Prefix=prefix)
if s3_result.get('CommonPrefixes'):
for folder in s3_result['CommonPrefixes']:
folders.append(folder.get('Prefix'))
if s3_result.get('Contents'):
for key in s3_result['Contents']:
files.append(key['Key'])
if folders:
result['folders']=sorted(folders)
if files:
result['files']=sorted(files)
return result
This lists down all objects / folders in a given path. Folder_path can be left as None by default and method will list the immediate contents of the root of the bucket.

It can also be done as follows:
csv_files = s3.list_objects_v2(s3_bucket_path)
for obj in csv_files['Contents']:
key = obj['Key']

Using cloudpathlib
cloudpathlib provides a convenience wrapper so that you can use the simple pathlib API to interact with AWS S3 (and Azure blob storage, GCS, etc.). You can install with pip install "cloudpathlib[s3]".
Like with pathlib you can use glob or iterdir to list the contents of a directory.
Here's an example with a public AWS S3 bucket that you can copy and past to run.
from cloudpathlib import CloudPath
s3_path = CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")
# list items with glob
list(
s3_path.glob("*")
)[:3]
#> [ S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg')]
# list items with iterdir
list(
s3_path.iterdir()
)[:3]
#> [ S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg'),
#> S3Path('s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg')]
Created at 2021-05-21 20:38:47 PDT by reprexlite v0.4.2

A good option may also be to run aws cli command from lambda functions
import subprocess
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def run_command(command):
command_list = command.split(' ')
try:
logger.info("Running shell command: \"{}\"".format(command))
result = subprocess.run(command_list, stdout=subprocess.PIPE);
logger.info("Command output:\n---\n{}\n---".format(result.stdout.decode('UTF-8')))
except Exception as e:
logger.error("Exception: {}".format(e))
return False
return True
def lambda_handler(event, context):
run_command('/opt/aws s3 ls s3://bucket-name')

I was stuck on this for an entire night because I just wanted to get the number of files under a subfolder but it was also returning one extra file in the content that was the subfolder itself,
After researching about it I found that this is how s3 works but I had
a scenario where I unloaded the data from redshift in the following directory
s3://bucket_name/subfolder/<10 number of files>
and when I used
paginator.paginate(Bucket=price_signal_bucket_name,Prefix=new_files_folder_path+"/")
it would only return the 10 files, but when I created the folder on the s3 bucket itself then it would also return the subfolder
Conclusion
If the whole folder is uploaded to s3 then listing the only returns the files under prefix
But if the fodler was created on the s3 bucket itself then listing it using boto3 client will also return the subfolder and the files

First, create an s3 client object:
s3_client = boto3.client('s3')
Next, create a variable to hold the bucket name and folder. Pay attention to the slash "/" ending the folder name:
bucket_name = 'my-bucket'
folder = 'some-folder/'
Next, call s3_client.list_objects_v2 to get the folder's content object's metadata:
response = s3_client.list_objects_v2(
Bucket=bucket_name,
Prefix=folder
)
Finally, with the object's metadata, you can obtain the S3 object by calling the s3_client.get_object function:
for object_metadata in response['Contents']:
object_key = object_metadata['Key']
response = s3_client.get_object(
Bucket=bucket_name,
Key=object_key
)
object_body = response['Body'].read()
print(object_body)
As you can see, the object content in the string format is available by calling response['Body'].read()

How to save S3 object to a file using boto3

I'm trying to do a "hello world" with new boto3 client for AWS.
The use-case I have is fairly simple: get object from S3 and save it to the file.
In boto 2.X I would do it like this:
import boto
key = boto.connect_s3().get_bucket('foo').get_key('foo')
key.get_contents_to_filename('/tmp/foo')
In boto 3 . I can't find a clean way to do the same thing, so I'm manually iterating over the "Streaming" object:
import boto3
key = boto3.resource('s3').Object('fooo', 'docker/my-image.tar.gz').get()
with open('/tmp/my-image.tar.gz', 'w') as f:
chunk = key['Body'].read(1024*8)
while chunk:
f.write(chunk)
chunk = key['Body'].read(1024*8)
or
import boto3
key = boto3.resource('s3').Object('fooo', 'docker/my-image.tar.gz').get()
with open('/tmp/my-image.tar.gz', 'w') as f:
for chunk in iter(lambda: key['Body'].read(4096), b''):
f.write(chunk)
And it works fine. I was wondering is there any "native" boto3 function that will do the same task?

There is a customization that went into Boto3 recently which helps with this (among other things). It is currently exposed on the low-level S3 client, and can be used like this:
s3_client = boto3.client('s3')
open('hello.txt').write('Hello, world!')
# Upload the file to S3
s3_client.upload_file('hello.txt', 'MyBucket', 'hello-remote.txt')
# Download the file from S3
s3_client.download_file('MyBucket', 'hello-remote.txt', 'hello2.txt')
print(open('hello2.txt').read())
These functions will automatically handle reading/writing files as well as doing multipart uploads in parallel for large files.
Note that s3_client.download_file won't create a directory. It can be created as pathlib.Path('/path/to/file.txt').parent.mkdir(parents=True, exist_ok=True).

boto3 now has a nicer interface than the client:
resource = boto3.resource('s3')
my_bucket = resource.Bucket('MyBucket')
my_bucket.download_file(key, local_filename)
This by itself isn't tremendously better than the client in the accepted answer (although the docs say that it does a better job retrying uploads and downloads on failure) but considering that resources are generally more ergonomic (for example, the s3 bucket and object resources are nicer than the client methods) this does allow you to stay at the resource layer without having to drop down.
Resources generally can be created in the same way as clients, and they take all or most of the same arguments and just forward them to their internal clients.

For those of you who would like to simulate the set_contents_from_string like boto2 methods, you can try
import boto3
from cStringIO import StringIO
s3c = boto3.client('s3')
contents = 'My string to save to S3 object'
target_bucket = 'hello-world.by.vor'
target_file = 'data/hello.txt'
fake_handle = StringIO(contents)
# notice if you do fake_handle.read() it reads like a file handle
s3c.put_object(Bucket=target_bucket, Key=target_file, Body=fake_handle.read())
For Python3:
In python3 both StringIO and cStringIO are gone. Use the StringIO import like:
from io import StringIO
To support both version:
try:
from StringIO import StringIO
except ImportError:
from io import StringIO

# Preface: File is json with contents: {'name': 'Android', 'status': 'ERROR'}
import boto3
import io
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'key-to-file.json')
data = io.BytesIO()
obj.download_fileobj(data)
# object is now a bytes string, Converting it to a dict:
new_dict = json.loads(data.getvalue().decode("utf-8"))
print(new_dict['status'])
# Should print "Error"

Note: I'm assuming you have configured authentication separately. Below code is to download the single object from the S3 bucket.
import boto3
#initiate s3 client
s3 = boto3.resource('s3')
#Download object to the file
s3.Bucket('mybucket').download_file('hello.txt', '/tmp/hello.txt')

If you wish to download a version of a file, you need to use get_object.
import boto3
bucket = 'bucketName'
prefix = 'path/to/file/'
filename = 'fileName.ext'
s3c = boto3.client('s3')
s3r = boto3.resource('s3')
if __name__ == '__main__':
for version in s3r.Bucket(bucket).object_versions.filter(Prefix=prefix + filename):
file = version.get()
version_id = file.get('VersionId')
obj = s3c.get_object(
Bucket=bucket,
Key=prefix + filename,
VersionId=version_id,
)
with open(f"{filename}.{version_id}", 'wb') as f:
for chunk in obj['Body'].iter_chunks(chunk_size=4096):
f.write(chunk)
Ref: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html

When you want to read a file with a different configuration than the default one, feel free to use either mpu.aws.s3_download(s3path, destination) directly or the copy-pasted code:
def s3_download(source, destination,
exists_strategy='raise',
profile_name=None):
"""
Copy a file from an S3 source to a local destination.
Parameters
----------
source : str
Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
destination : str
exists_strategy : {'raise', 'replace', 'abort'}
What is done when the destination already exists?
profile_name : str, optional
AWS profile
Raises
------
botocore.exceptions.NoCredentialsError
Botocore is not able to find your credentials. Either specify
profile_name or add the environment variables AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
See https://boto3.readthedocs.io/en/latest/guide/configuration.html
"""
exists_strategies = ['raise', 'replace', 'abort']
if exists_strategy not in exists_strategies:
raise ValueError('exists_strategy \'{}\' is not in {}'
.format(exists_strategy, exists_strategies))
session = boto3.Session(profile_name=profile_name)
s3 = session.resource('s3')
bucket_name, key = _s3_path_split(source)
if os.path.isfile(destination):
if exists_strategy is 'raise':
raise RuntimeError('File \'{}\' already exists.'
.format(destination))
elif exists_strategy is 'abort':
return
s3.Bucket(bucket_name).download_file(key, destination)
from collections import namedtuple
S3Path = namedtuple("S3Path", ["bucket_name", "key"])
def _s3_path_split(s3_path):
"""
Split an S3 path into bucket and key.
Parameters
----------
s3_path : str
Returns
-------
splitted : (str, str)
(bucket, key)
Examples
--------
>>> _s3_path_split('s3://my-bucket/foo/bar.jpg')
S3Path(bucket_name='my-bucket', key='foo/bar.jpg')
"""
if not s3_path.startswith("s3://"):
raise ValueError(
"s3_path is expected to start with 's3://', " "but was {}"
.format(s3_path)
)
bucket_key = s3_path[len("s3://"):]
bucket_name, key = bucket_key.split("/", 1)
return S3Path(bucket_name, key)

How can I upload a file to S3 without creating a temporary local file?

Is there any feasible way to upload a file which is generated dynamically to amazon s3 directly without first create a local file and then upload to the s3 server? I use Python.

Here is an example downloading an image (using requests library) and uploading it to s3, without writing to a local file:
import boto
from boto.s3.key import Key
import requests
#setup the bucket
c = boto.connect_s3(your_s3_key, your_s3_key_secret)
b = c.get_bucket(bucket, validate=False)
#download the file
url = "http://en.wikipedia.org/static/images/project-logos/enwiki.png"
r = requests.get(url)
if r.status_code == 200:
#upload the file
k = Key(b)
k.key = "image1.png"
k.content_type = r.headers['content-type']
k.set_contents_from_string(r.content)

You could use BytesIO from the Python standard library.
from io import BytesIO
bytesIO = BytesIO()
bytesIO.write('whee')
bytesIO.seek(0)
s3_file.set_contents_from_file(bytesIO)

The boto library's Key object has several methods you might be interested in:
send_file
set_contents_from_file
set_contents_from_string
set_contents_from_stream
For an example of using set_contents_from_string, see Storing Data section of the boto documentation, pasted here for completeness:
>>> from boto.s3.key import Key
>>> k = Key(bucket)
>>> k.key = 'foobar'
>>> k.set_contents_from_string('This is a test of S3')

I assume you're using boto. boto's Bucket.set_contents_from_file() will accept a StringIO object, and any code you have written to write data to a file should be easily adaptable to write to a StringIO object. Or if you generate a string, you can use set_contents_from_string().

def upload_to_s3(url, **kwargs):
'''
:param url: url of image which have to upload or resize to upload
:return: url of image stored on aws s3 bucket
'''
r = requests.get(url)
if r.status_code == 200:
# credentials stored in settings AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, host=AWS_HOST)
# Connect to bucket and create key
b = conn.get_bucket(AWS_Bucket_Name)
k = b.new_key("{folder_name}/{filename}".format(**kwargs))
k.set_contents_from_string(r.content, replace=True,
headers={'Content-Type': 'application/%s' % (FILE_FORMAT)},
policy='authenticated-read',
reduced_redundancy=True)
# TODO Change AWS_EXPIRY
return k.generate_url(expires_in=AWS_EXPIRY, force_http=True)

I had a dict object which I wanted to store as a json file on S3, without creating a local file. The below code worked for me:
from smart_open import smart_open
with smart_open('s3://access-key:secret-key#bucket-name/file.json', 'wb') as fout:
fout.write(json.dumps(dict_object).encode('utf8'))

In boto3, there is a simple way to upload a file content, without creating a local file using following code. I have modified JimJty example code for boto3
import boto3
from botocore.retries import bucket
import requests
from io import BytesIO
# set the values
aws_access_key_id=""
aws_secret_access_key=""
region_name=""
bucket=""
key=""
session = boto3.session.Session(aws_access_key_id=aws_access_key_id,aws_secret_access_key=aws_secret_access_key, region_name=region_name)
s3_client = session.client('s3')
#download the file
url = "http://en.wikipedia.org/static/images/project-logos/enwiki.png"
r = requests.get(url)
if r.status_code == 200:
#convert content to bytes, since upload_fileobj requires file like obj
bytesIO = BytesIO(bytes(r.content))
with bytesIO as data:
s3_client.upload_fileobj(data, bucket, key)

You can try using smart_open (https://pypi.org/project/smart_open/). I used it exactly for that: writing files directly in S3.

Given that encryption at rest is a much desired data standard now, smart_open does not support this afaik

This implementation is an example of uploading a list of images (NumPy list, OpenCV image objects) directly to S3
Note: you need to convert image objects to bytes or buffer to bytes while uploading the file that's how you can upload files without corruption error
#Consider you have images in the form of a list i.e. img_array
import boto3
s3 = boto3.client('s3')
res_url = []
for i,img in enumerate(img_array):
s3_key = "fileName_on_s3.png"
response = s3.put_object(Body=img.tobytes(), Bucket='bucket_name',Key=s3_key,ACL='public-read',ContentType= 'image/png')
s3_url = 'https://bucket_name.s3.ap-south-1.amazonaws.com/'+s3_key
res_url.append(s3_url)
#res_url is the list of URLs returned from S3 Upload

Update for boto3:
aws_session = boto3.Session('my_access_key_id', 'my_secret_access_key')
s3 = aws_session.resource('s3')
s3.Bucket('my_bucket').put_object(Key='file_name.txt', Body=my_file)

I am having a similar issue, was wondering if there was a final answer, because with my code below , the "starwars.json" keeps on saving locally but I just want to push through each looped .json file into S3 and have no file stored locally.
for key, value in star_wars_actors.items():
response = requests.get('http:starwarsapi/' + value)
data = response.json()
with open("starwars.json", "w+") as d:
json.dump(data, d, ensure_ascii=False, indent=4)
s3.upload_file('starwars.json', 'test-bucket',
'%s/%s' % ('test', str(key) + '.json'))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I access s3 files in Python using urls? - python

Try s3fs First example on the docs: >>> import s3fs >>> fs = s3fs.S3FileSystem(anon=True) >>> fs.ls('my-bucket') ['my-file.txt'] >>> with fs.open('my-bucket/my-file.txt', 'rb') as f: ... print(f.read()) b'Hello, world'

http://s3tools.org/s3cmd works pretty well and support the s3:// form of the URL structure you want. It does the business on Linux and Windows. If you need a native API to call from within a python program then http://code.google.com/p/boto/ is a better choice.

Related

How to filter Boto3 s3 objects?

Python - Cannot Create A Variable That Also Creates A New Line

Listing contents of a bucket with boto3

How to save S3 object to a file using boto3

How can I upload a file to S3 without creating a temporary local file?

Categories

Resources