How to write pyarrow table as csv to s3 directly? [duplicate]

In boto 2, you can write to an S3 object using these methods:
Key.set_contents_from_string()
Key.set_contents_from_file()
Key.set_contents_from_filename()
Key.set_contents_from_stream()
Is there a boto 3 equivalent? What is the boto3 method for saving data to an object stored on S3?

In boto 3, the 'Key.set_contents_from_' methods were replaced by
Object.put()
Client.put_object()
For example:
import boto3
some_binary_data = b'Here we have some data'
more_binary_data = b'Here we have some more data'
# Method 1: Object.put()
s3 = boto3.resource('s3')
object = s3.Object('my_bucket_name', 'my/key/including/filename.txt')
object.put(Body=some_binary_data)
# Method 2: Client.put_object()
client = boto3.client('s3')
client.put_object(Body=more_binary_data, Bucket='my_bucket_name', Key='my/key/including/anotherfilename.txt')
Alternatively, the binary data can come from reading a file, as described in the official docs comparing boto 2 and boto 3:
Storing Data
Storing data from a file, stream, or string is easy:
# Boto 2.x
from boto.s3.key import Key
key = Key('hello.txt')
key.set_contents_from_file('/tmp/hello.txt')
# Boto 3
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))

boto3 also has a method for uploading a file directly:
s3 = boto3.resource('s3')
s3.Bucket('bucketname').upload_file('/local/file/here.txt','folder/sub/path/to/s3key')
http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.upload_file

You no longer have to convert the contents to binary before writing to the file in S3. The following example creates a new text file (called newfile.txt) in an S3 bucket with string contents:
import boto3
s3 = boto3.resource(
's3',
region_name='us-east-1',
aws_access_key_id=KEY_ID,
aws_secret_access_key=ACCESS_KEY
)
content="String content to write to a new S3 file"
s3.Object('my-bucket-name', 'newfile.txt').put(Body=content)

Here's a nice trick to read JSON from s3:
import json, boto3
s3 = boto3.resource("s3").Bucket("bucket")
json.load_s3 = lambda f: json.load(s3.Object(key=f).get()["Body"])
json.dump_s3 = lambda obj, f: s3.Object(key=f).put(Body=json.dumps(obj))
Now you can use json.load_s3 and json.dump_s3 with the same API as load and dump
data = {"test":0}
json.dump_s3(data, "key") # saves json to s3://bucket/key
data = json.load_s3("key") # read json from s3://bucket/key
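Attaching new attributes to the json module works, but plain helper functions are easier to trace. A minimal sketch, assuming `bucket` is a boto3 Bucket resource (for example `boto3.resource("s3").Bucket("bucket")`); the names `dump_json_s3` and `load_json_s3` are hypothetical:

```python
import json


def dump_json_s3(bucket, obj, key):
    """Serialize obj to JSON and store it under key in the given bucket."""
    body = json.dumps(obj).encode("utf-8")
    bucket.Object(key).put(Body=body, ContentType="application/json")


def load_json_s3(bucket, key):
    """Fetch the object stored under key and parse it as JSON."""
    body = bucket.Object(key).get()["Body"].read()
    return json.loads(body.decode("utf-8"))
```

Passing the bucket explicitly also makes it possible to use more than one bucket in the same program.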

A cleaner, more concise version that I use to upload files on the fly to a given S3 bucket and sub-folder:
import boto3
BUCKET_NAME = 'sample_bucket_name'
PREFIX = 'sub-folder/'
s3 = boto3.resource('s3')
# Creating an empty file called "_DONE" and putting it in the S3 bucket
s3.Object(BUCKET_NAME, PREFIX + '_DONE').put(Body="")
Note: You should ALWAYS keep your AWS credentials (aws_access_key_id and aws_secret_access_key) in a separate file, for example ~/.aws/credentials

After some research, I found that this can be done with a simple csv writer. The example below writes a list of dictionaries as CSV directly to an S3 bucket,
e.g. data_dict = [{"Key1": "value1", "Key2": "value2"}, {"Key1": "value4", "Key2": "value3"}]
assuming that all the dictionaries have the same keys.
import csv
from io import StringIO
import boto3
# Sample input: a list of dictionaries
data_dict = [{"Key1": "value1", "Key2": "value2"}, {"Key1": "value4", "Key2": "value3"}]
data_dict_keys = data_dict[0].keys()
# creating a file buffer
file_buff = StringIO()
# writing csv data to file buffer
writer = csv.DictWriter(file_buff, fieldnames=data_dict_keys)
writer.writeheader()
for data in data_dict:
    writer.writerow(data)
# creating s3 client connection
client = boto3.client('s3')
# placing file to S3, file_buff.getvalue() is the CSV body for the file
client.put_object(Body=file_buff.getvalue(), Bucket='my_bucket_name', Key='my/key/including/anotherfilename.txt')
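The example above assumes every dictionary has the same keys. If that does not hold, one option is to collect the column names from all rows first and let `csv.DictWriter` fill the gaps. A sketch under that assumption; the function name `dicts_to_csv_bytes` is hypothetical:

```python
import csv
from io import StringIO


def dicts_to_csv_bytes(rows):
    """Render a list of dicts as UTF-8 CSV bytes, leaving missing keys empty."""
    # Collect column names in first-seen order across all rows.
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    buffer = StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")
```

The returned bytes can be passed as `Body` to `client.put_object(...)` in the same way as `file_buff.getvalue()`.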

It is worth mentioning smart-open, which uses boto3 as a back-end.
smart-open is a drop-in replacement for Python's open that can open files from S3, as well as FTP, HTTP and many other protocols.
For example:
from smart_open import open
import json
with open("s3://your_bucket/your_key.json", 'r') as f:
    data = json.load(f)
The AWS credentials are loaded via boto3 credentials, usually a file in the ~/.aws/ dir or an environment variable.

You may use the code below to write, for example, an image to S3 (as of 2019). To be able to connect to S3 you will have to install the AWS CLI with pip install awscli, then enter your credentials with aws configure:
import boto3
import urllib3
import uuid
from pathlib import Path
from io import BytesIO
from errors import custom_exceptions as cex  # project-specific exception classes
BUCKET_NAME = "xxx.yyy.zzz"
POSTERS_BASE_PATH = "assets/wallcontent"
CLOUDFRONT_BASE_URL = "https://xxx.cloudfront.net/"
class S3(object):
    def __init__(self):
        self.client = boto3.client('s3')
        self.bucket_name = BUCKET_NAME
        self.posters_base_path = POSTERS_BASE_PATH
    def __download_image(self, url):
        manager = urllib3.PoolManager()
        try:
            res = manager.request('GET', url)
        except Exception:
            print("Could not download the image from URL: ", url)
            raise cex.ImageDownloadFailed
        return BytesIO(res.data)  # any file-like object that implements read()
    def upload_image(self, url):
        try:
            image_file = self.__download_image(url)
        except cex.ImageDownloadFailed:
            raise cex.ImageUploadFailed
        extension = Path(url).suffix
        id = uuid.uuid1().hex + extension
        final_path = self.posters_base_path + "/" + id
        try:
            self.client.upload_fileobj(image_file,
                                       self.bucket_name,
                                       final_path
                                       )
        except Exception:
            print("Image Upload Error for URL: ", url)
            raise cex.ImageUploadFailed
        return CLOUDFRONT_BASE_URL + id

Related

upload a dataframe as a zipped csv directly to s3 without saving it on the local machine

How can I upload a data frame as a zipped csv into S3 bucket without saving it on my local machine first?
I have the connection to that bucket already running using:
self.s3_output = S3(bucket_name='test-bucket', bucket_subfolder='')

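We can make a file-like object with BytesIO and zipfile from the standard library.
# 3.7
from io import BytesIO
import zipfile
# .to_csv returns a string when called with no args
s = df.to_csv()
# keep a reference to the buffer so it can be uploaded once the archive is closed
buffer = BytesIO()
with zipfile.ZipFile(buffer, mode="w") as z:
    z.writestr("df.csv", s)
buffer.seek(0)  # rewind so the upload starts at the beginning
# upload file here
You'll want to refer to upload_fileobj in order to customize how the upload behaves. Upload the buffer, not the ZipFile object:
yourclass.s3_output.upload_fileobj(buffer, ...)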

The same in-memory approach works for gzip as well as zip. With gzip:
import boto3
import gzip
import pandas as pd
from io import BytesIO, TextIOWrapper
s3_client = boto3.client(
    service_name = "s3",
    endpoint_url = your_endpoint_url,
    aws_access_key_id = your_access_key,
    aws_secret_access_key = your_secret_key
)
# File name recorded inside the gzip archive
your_filename = "test.csv"
s3_path = "path/to/your/s3/compressed/file/test.csv.gz"
bucket = "your_bucket"
df = your_df
gz_buffer = BytesIO()
with gzip.GzipFile(
    filename = your_filename,
    mode = 'w',
    fileobj = gz_buffer ) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)
s3_client.put_object(
    Bucket=bucket, Key=s3_path, Body=gz_buffer.getvalue()
)
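When the CSV text is already in memory (for example the string returned by `df.to_csv(index=False)`), the standard library's `gzip.compress` is a shorter route than the GzipFile and TextIOWrapper pair. A minimal sketch; the function name `gzip_text` is hypothetical:

```python
import gzip


def gzip_text(text, encoding="utf-8"):
    """Compress a string into gzip-formatted bytes suitable for an S3 Body."""
    return gzip.compress(text.encode(encoding))
```

The result can be passed directly as `Body` to `put_object`. Unlike the GzipFile version, this does not record a file name inside the archive.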

How to load a pickle file from S3 to use in AWS Lambda?

I am currently trying to load a pickled file from S3 into AWS lambda and store it to a list (the pickle is a list).
Here is my code:
import pickle
import boto3
s3 = boto3.resource('s3')
with open('oldscreenurls.pkl', 'rb') as data:
    old_list = s3.Bucket("pythonpickles").download_fileobj("oldscreenurls.pkl", data)
I get the following error even though the file exists:
FileNotFoundError: [Errno 2] No such file or directory: 'oldscreenurls.pkl'
Any ideas?

Super simple solution
import pickle
import boto3
s3 = boto3.resource('s3')
my_pickle = pickle.loads(s3.Bucket("bucket_name").Object("key_to_pickle.pickle").get()['Body'].read())

As shown in the documentation for download_fileobj, you need to open the file in binary write mode and save to the file first. Once the file is downloaded, you can open it for reading and unpickle.
import pickle
import boto3
s3 = boto3.resource('s3')
with open('oldscreenurls.pkl', 'wb') as data:
    s3.Bucket("pythonpickles").download_fileobj("oldscreenurls.pkl", data)
with open('oldscreenurls.pkl', 'rb') as data:
    old_list = pickle.load(data)
download_fileobj takes the name of an object in S3 plus a handle to a local file, and saves the contents of that object to the file. There is also a version of this function called download_file that takes a filename instead of an open file handle and handles opening it for you.
In this case it would probably be better to use S3Client.get_object though, to avoid having to write and then immediately read a file. You could also write to an in-memory BytesIO object, which acts like a file but doesn't actually touch a disk. That would look something like this:
import pickle
import boto3
from io import BytesIO
s3 = boto3.resource('s3')
with BytesIO() as data:
    s3.Bucket("pythonpickles").download_fileobj("oldscreenurls.pkl", data)
    data.seek(0)  # move back to the beginning after writing
    old_list = pickle.load(data)

This is the easiest solution. You can load the data without downloading the file locally by using S3FileSystem from the s3fs package (bucket_name and file_path are your own values):
import pickle
from s3fs.core import S3FileSystem
s3_file = S3FileSystem()
data = pickle.load(s3_file.open('{}/{}'.format(bucket_name, file_path)))

In my implementation, the object is fetched by its S3 key and unpickled from the response body. img_url, bucket_name and the two credential variables are assumed to be defined elsewhere.
import pickle
import boto3
name = img_url.split('/')[-1]
folder = 'media'
file_name = f'{folder}/{name}'
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
response = s3.get_object(Bucket=bucket_name, Key=file_name)
body = response['Body'].read()
data = pickle.loads(body)

How do you perform a credentialed download from s3 using boto3 without saving a file?

It is simple to perform a credentialed download from s3 to a file using
import boto3
s3 = boto3.resource('s3')
def save_file_from_s3(bucket_name, key_name, file_name):
    b = s3.Bucket(bucket_name)
    b.download_file(key_name, file_name)
It is easy to download from a URL to a file-like object using
from StringIO import StringIO
import urllib
file_like_object = StringIO(urllib.urlopen(url).read())
(see How do I read image data from a URL in Python?)
But how do you perform a credentialed download from s3 to a file-like object?

You just need to make a call to s3.Bucket.Object.get:
import boto3
s3 = boto3.resource('s3')
# obj = s3.Object(bucket_name, key_name).get()
bucket = s3.Bucket(bucket_name)
obj = bucket.Object(key_name).get()
body = obj.get('Body')
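The `Body` returned by `get()` is a stream that is consumed as it is read. Where a seekable file-like object is needed, one option is to copy it into a `BytesIO`. A sketch, assuming `bucket` is a boto3 Bucket resource and the object fits in memory; `fetch_as_fileobj` is a hypothetical name:

```python
from io import BytesIO


def fetch_as_fileobj(bucket, key_name):
    """Read the whole object into memory and return it as a seekable BytesIO."""
    response = bucket.Object(key_name).get()
    return BytesIO(response["Body"].read())
```

For large objects, reading the `Body` stream in chunks avoids holding everything in memory at once.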

How to save S3 object to a file using boto3

I'm trying to do a "hello world" with new boto3 client for AWS.
The use-case I have is fairly simple: get object from S3 and save it to the file.
In boto 2.X I would do it like this:
import boto
key = boto.connect_s3().get_bucket('foo').get_key('foo')
key.get_contents_to_filename('/tmp/foo')
In boto 3, I can't find a clean way to do the same thing, so I'm manually iterating over the "Streaming" object:
import boto3
key = boto3.resource('s3').Object('fooo', 'docker/my-image.tar.gz').get()
with open('/tmp/my-image.tar.gz', 'wb') as f:
    chunk = key['Body'].read(1024*8)
    while chunk:
        f.write(chunk)
        chunk = key['Body'].read(1024*8)
or
import boto3
key = boto3.resource('s3').Object('fooo', 'docker/my-image.tar.gz').get()
with open('/tmp/my-image.tar.gz', 'wb') as f:
    for chunk in iter(lambda: key['Body'].read(4096), b''):
        f.write(chunk)
And it works fine. I was wondering is there any "native" boto3 function that will do the same task?

There is a customization that went into Boto3 recently which helps with this (among other things). It is currently exposed on the low-level S3 client, and can be used like this:
import boto3
s3_client = boto3.client('s3')
with open('hello.txt', 'w') as f:
    f.write('Hello, world!')
# Upload the file to S3
s3_client.upload_file('hello.txt', 'MyBucket', 'hello-remote.txt')
# Download the file from S3
s3_client.download_file('MyBucket', 'hello-remote.txt', 'hello2.txt')
print(open('hello2.txt').read())
These functions will automatically handle reading/writing files as well as doing multipart uploads in parallel for large files.
Note that s3_client.download_file won't create a directory. It can be created as pathlib.Path('/path/to/file.txt').parent.mkdir(parents=True, exist_ok=True).
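The directory step can be wrapped in a small helper that is called just before the download. A minimal sketch; the function name `prepare_download_path` is hypothetical:

```python
from pathlib import Path


def prepare_download_path(local_path):
    """Create the parent directory of local_path if needed and return it as a Path."""
    path = Path(local_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Usage would then look like `s3_client.download_file('MyBucket', 'hello-remote.txt', str(prepare_download_path('/path/to/file.txt')))`.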

boto3 now has a nicer interface than the client:
resource = boto3.resource('s3')
my_bucket = resource.Bucket('MyBucket')
my_bucket.download_file(key, local_filename)
This by itself isn't tremendously better than the client in the accepted answer (although the docs say that it does a better job retrying uploads and downloads on failure) but considering that resources are generally more ergonomic (for example, the s3 bucket and object resources are nicer than the client methods) this does allow you to stay at the resource layer without having to drop down.
Resources generally can be created in the same way as clients, and they take all or most of the same arguments and just forward them to their internal clients.

For those of you who would like to simulate the set_contents_from_string like boto2 methods, you can try
import boto3
from cStringIO import StringIO
s3c = boto3.client('s3')
contents = 'My string to save to S3 object'
target_bucket = 'hello-world.by.vor'
target_file = 'data/hello.txt'
fake_handle = StringIO(contents)
# notice if you do fake_handle.read() it reads like a file handle
s3c.put_object(Bucket=target_bucket, Key=target_file, Body=fake_handle.read())
For Python3:
In Python 3 both the StringIO and cStringIO modules are gone. Import StringIO from io instead:
from io import StringIO
To support both versions:
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
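On Python 3, `upload_fileobj` expects a file-like object opened in binary mode, so for that call the text has to be encoded and wrapped in `io.BytesIO` rather than `io.StringIO`. A minimal sketch; the function name `text_to_fileobj` is hypothetical:

```python
from io import BytesIO


def text_to_fileobj(text, encoding="utf-8"):
    """Encode text and wrap it in a binary file-like object positioned at the start."""
    return BytesIO(text.encode(encoding))
```

With `put_object`, the wrapper is optional: the encoded bytes can be passed directly as `Body`.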

# Preface: the file is JSON with contents: {"name": "Android", "status": "ERROR"}
import json
import boto3
import io
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'key-to-file.json')
data = io.BytesIO()
obj.download_fileobj(data)
# data now holds the object's bytes; decode and parse them into a dict:
new_dict = json.loads(data.getvalue().decode("utf-8"))
print(new_dict['status'])
# Should print "ERROR"

Note: I'm assuming you have configured authentication separately. Below code is to download the single object from the S3 bucket.
import boto3
# create the S3 resource
s3 = boto3.resource('s3')
# download the object to a local file
s3.Bucket('mybucket').download_file('hello.txt', '/tmp/hello.txt')

If you wish to download a version of a file, you need to use get_object.
import boto3
bucket = 'bucketName'
prefix = 'path/to/file/'
filename = 'fileName.ext'
s3c = boto3.client('s3')
s3r = boto3.resource('s3')
if __name__ == '__main__':
    for version in s3r.Bucket(bucket).object_versions.filter(Prefix=prefix + filename):
        file = version.get()
        version_id = file.get('VersionId')
        obj = s3c.get_object(
            Bucket=bucket,
            Key=prefix + filename,
            VersionId=version_id,
        )
        with open(f"{filename}.{version_id}", 'wb') as f:
            for chunk in obj['Body'].iter_chunks(chunk_size=4096):
                f.write(chunk)
Ref: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html
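Chunked copy loops like the ones in this thread can also be written with the standard library's `shutil.copyfileobj`, which only needs a `read(size)` method on the source. A sketch, assuming `body` is the `Body` of a `get_object` response (or any readable binary stream); `save_stream` is a hypothetical name:

```python
import shutil


def save_stream(body, local_path, chunk_size=8192):
    """Copy a readable binary stream to local_path in fixed-size chunks."""
    with open(local_path, "wb") as f:
        shutil.copyfileobj(body, f, chunk_size)
```

For example, `save_stream(obj['Body'], f"{filename}.{version_id}")` would replace the inner `with` block above.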

When you want to read a file with a different configuration than the default one, feel free to use either mpu.aws.s3_download(s3path, destination) directly or the copy-pasted code:
import os
from collections import namedtuple
import boto3
def s3_download(source, destination,
                exists_strategy='raise',
                profile_name=None):
    """
    Copy a file from an S3 source to a local destination.
    Parameters
    ----------
    source : str
        Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
    destination : str
    exists_strategy : {'raise', 'replace', 'abort'}
        What is done when the destination already exists?
    profile_name : str, optional
        AWS profile
    Raises
    ------
    botocore.exceptions.NoCredentialsError
        Botocore is not able to find your credentials. Either specify
        profile_name or add the environment variables AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
        See https://boto3.readthedocs.io/en/latest/guide/configuration.html
    """
    exists_strategies = ['raise', 'replace', 'abort']
    if exists_strategy not in exists_strategies:
        raise ValueError('exists_strategy \'{}\' is not in {}'
                         .format(exists_strategy, exists_strategies))
    session = boto3.Session(profile_name=profile_name)
    s3 = session.resource('s3')
    bucket_name, key = _s3_path_split(source)
    if os.path.isfile(destination):
        # compare strings with ==, not 'is'
        if exists_strategy == 'raise':
            raise RuntimeError('File \'{}\' already exists.'
                               .format(destination))
        elif exists_strategy == 'abort':
            return
    s3.Bucket(bucket_name).download_file(key, destination)
S3Path = namedtuple("S3Path", ["bucket_name", "key"])
def _s3_path_split(s3_path):
    """
    Split an S3 path into bucket and key.
    Parameters
    ----------
    s3_path : str
    Returns
    -------
    splitted : (str, str)
        (bucket, key)
    Examples
    --------
    >>> _s3_path_split('s3://my-bucket/foo/bar.jpg')
    S3Path(bucket_name='my-bucket', key='foo/bar.jpg')
    """
    if not s3_path.startswith("s3://"):
        raise ValueError(
            "s3_path is expected to start with 's3://', but was {}"
            .format(s3_path)
        )
    bucket_key = s3_path[len("s3://"):]
    bucket_name, key = bucket_key.split("/", 1)
    return S3Path(bucket_name, key)
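The same split can be done with `urllib.parse.urlparse` instead of manual slicing. This is an alternative sketch, not part of mpu; the function name `split_s3_url` is hypothetical:

```python
from urllib.parse import urlparse


def split_s3_url(s3_url):
    """Split 's3://bucket/key' into a (bucket, key) tuple using urlparse."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3:// URL, got {!r}".format(s3_url))
    # parsed.path starts with "/", which is not part of the key
    return parsed.netloc, parsed.path[1:]
```

One caveat of this approach: keys that contain `?` or `#` are split into the query or fragment part by `urlparse`, so the manual version above is safer for arbitrary key names.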

How can I upload a file to S3 without creating a temporary local file?

Is there any feasible way to upload a file which is generated dynamically to amazon s3 directly without first create a local file and then upload to the s3 server? I use Python.

Here is an example downloading an image (using requests library) and uploading it to s3, without writing to a local file:
import boto
from boto.s3.key import Key
import requests
#setup the bucket
c = boto.connect_s3(your_s3_key, your_s3_key_secret)
b = c.get_bucket(bucket, validate=False)
#download the file
url = "http://en.wikipedia.org/static/images/project-logos/enwiki.png"
r = requests.get(url)
if r.status_code == 200:
    #upload the file
    k = Key(b)
    k.key = "image1.png"
    k.content_type = r.headers['content-type']
    k.set_contents_from_string(r.content)

You could use BytesIO from the Python standard library.
from io import BytesIO
bytesIO = BytesIO()
bytesIO.write(b'whee')  # BytesIO accepts bytes, not str
bytesIO.seek(0)
# s3_file is a boto Key object
s3_file.set_contents_from_file(bytesIO)

The boto library's Key object has several methods you might be interested in:
send_file
set_contents_from_file
set_contents_from_string
set_contents_from_stream
For an example of using set_contents_from_string, see Storing Data section of the boto documentation, pasted here for completeness:
>>> from boto.s3.key import Key
>>> k = Key(bucket)
>>> k.key = 'foobar'
>>> k.set_contents_from_string('This is a test of S3')

I assume you're using boto. boto's Key.set_contents_from_file() will accept a StringIO object, and any code you have written to write data to a file should be easily adaptable to write to a StringIO object. Or if you generate a string, you can use set_contents_from_string().

import boto
import requests
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_HOST, AWS_Bucket_Name,
# AWS_EXPIRY and FILE_FORMAT are expected to come from your settings
def upload_to_s3(url, **kwargs):
    '''
    :param url: URL of the image to upload
    :return: URL of the image stored in the AWS S3 bucket
    '''
    r = requests.get(url)
    if r.status_code == 200:
        # credentials stored in settings AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
        conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, host=AWS_HOST)
        # Connect to bucket and create key
        b = conn.get_bucket(AWS_Bucket_Name)
        k = b.new_key("{folder_name}/{filename}".format(**kwargs))
        k.set_contents_from_string(r.content, replace=True,
                                   headers={'Content-Type': 'application/%s' % (FILE_FORMAT)},
                                   policy='authenticated-read',
                                   reduced_redundancy=True)
        # TODO Change AWS_EXPIRY
        return k.generate_url(expires_in=AWS_EXPIRY, force_http=True)

I had a dict object which I wanted to store as a JSON file on S3, without creating a local file. The code below worked for me:
import json
from smart_open import smart_open
with smart_open('s3://access-key:secret-key@bucket-name/file.json', 'wb') as fout:
    fout.write(json.dumps(dict_object).encode('utf8'))

In boto3, there is a simple way to upload file content without creating a local file. I have modified JimJty's example code for boto3:
import boto3
import requests
from io import BytesIO
# set the values
aws_access_key_id=""
aws_secret_access_key=""
region_name=""
bucket=""
key=""
session = boto3.session.Session(aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key, region_name=region_name)
s3_client = session.client('s3')
#download the file
url = "http://en.wikipedia.org/static/images/project-logos/enwiki.png"
r = requests.get(url)
if r.status_code == 200:
    # wrap the content in BytesIO, since upload_fileobj requires a file-like object
    bytesIO = BytesIO(bytes(r.content))
    with bytesIO as data:
        s3_client.upload_fileobj(data, bucket, key)

You can try using smart_open (https://pypi.org/project/smart_open/). I used it exactly for that: writing files directly in S3.

Given that encryption at rest is a much desired data standard now, smart_open does not support this afaik

This example uploads a list of images (NumPy arrays, such as OpenCV image objects) directly to S3.
Note: convert each image object (or buffer) to bytes before uploading; otherwise the uploaded file can end up corrupted.
# Assume the images are in a list called img_array
import boto3
s3 = boto3.client('s3')
res_url = []
for i, img in enumerate(img_array):
    # one key per image, so that later uploads do not overwrite earlier ones
    s3_key = "fileName_on_s3_%d.png" % i
    response = s3.put_object(Body=img.tobytes(), Bucket='bucket_name', Key=s3_key, ACL='public-read', ContentType='image/png')
    s3_url = 'https://bucket_name.s3.ap-south-1.amazonaws.com/' + s3_key
    res_url.append(s3_url)
# res_url is the list of URLs of the uploaded objects

Update for boto3:
aws_session = boto3.Session('my_access_key_id', 'my_secret_access_key')
s3 = aws_session.resource('s3')
s3.Bucket('my_bucket').put_object(Key='file_name.txt', Body=my_file)

I am having a similar issue and was wondering if there was a final answer. With my code below, "starwars.json" keeps being saved locally, but I just want to push each looped .json file to S3 and have no file stored locally.
for key, value in star_wars_actors.items():
    response = requests.get('http:starwarsapi/' + value)
    data = response.json()
    with open("starwars.json", "w+") as d:
        json.dump(data, d, ensure_ascii=False, indent=4)
    s3.upload_file('starwars.json', 'test-bucket',
                   '%s/%s' % ('test', str(key) + '.json'))
