Import CSV from AWS S3 instance to Numpy

I've been trying to read a CSV file directly from AWS S3 into NumPy. I've used:
import boto3
s3 = boto3.client(service_name='s3')
def s3_read(filename):
    s3_obj = s3.get_object(Bucket='bucket-name', Key=filename)
    body = s3_obj['Body']
    return body.read()
as an attempt to pull the data, but I'm running into a formatting issue from AWS that I don't know how to handle.
When I print the data returned by that function, there is a weird header before the data:
b'{\n "name":"filename",\n "data":{\n "type":"Buffer",\n "data":[\n 114,\n 97,...]}}'
So there are a bunch of \n's and the weird header. Does this have something to do with the way I uploaded the file to AWS, or am I messing something up when reading the file?

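body.read() returns bytes, and judging by the output you printed, those bytes are a JSON document rather than raw CSV.
import json
j = json.loads(s3_obj['Body'].read().decode('utf-8'))
decode() turns the bytes into a string, and json.loads() parses that string into a dictionary.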
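To get from the parsed dictionary to a NumPy array, the byte values stored under data -> data can be reassembled into the original CSV text. This is a minimal sketch, assuming the payload has exactly the shape printed in the question (a JSON document wrapping a list of byte values) and that the CSV body is numeric and comma-separated. The name buffer_json_to_array and the skiprows parameter are hypothetical choices for illustration.

```python
import io
import json

import numpy as np


def buffer_json_to_array(raw: bytes, skiprows: int = 0) -> np.ndarray:
    """Turn a JSON-wrapped byte buffer into a 2-D float array."""
    payload = json.loads(raw.decode("utf-8"))
    # Assumption: payload["data"]["data"] is a list of ints in 0..255,
    # one per byte of the original CSV file.
    csv_text = bytes(payload["data"]["data"]).decode("utf-8")
    # skiprows lets the caller drop a header line; ndmin=2 keeps the
    # result two-dimensional even when only one row remains.
    return np.loadtxt(
        io.StringIO(csv_text), delimiter=",", skiprows=skiprows, ndmin=2
    )
```

It could be called as buffer_json_to_array(s3_read('file.csv')). If the wrapper was introduced when the file was uploaded, re-uploading the raw CSV would likely make this step unnecessary.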

Related

python - read csv from s3 and identify its encoding info for pandas dataframe

I am building a service that downloads CSV files from an S3 bucket.
The bucket contains CSV files in various encodings, which I may not know beforehand, since users upload these files.
This is what I am trying:
...
obj = s3c.get_object(Bucket= BUCKET_NAME , Key = KEY)
content = io.BytesIO(obj['Body'].read())
df_s3_file = pd.read_csv(content)
...
This works fine for UTF-8, but it fails for other encodings (obviously!).
I have found a standalone snippet that can identify the encoding of a CSV file on a network drive.
It looks like this:
...
def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc
my_encoding = find_encoding(content)
print('detected csv encoding: ', my_encoding)
df_s3_file = pd.read_csv(content, encoding=my_encoding)
...
This snippet works fine for a file on a local drive, but how do I do this for a file in an S3 bucket, given that I am reading the S3 file as an io.BytesIO object?
I think that if I wrote the file to a drive and then ran find_encoding, it would work, since that function takes a CSV file path as input as opposed to a BytesIO object.
Is there a way to do this in memory, without having to download the file to a drive?
Note: the files are not very big (<10 MB).

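According to the boto3 docs, s3c.get_object(Bucket=BUCKET_NAME, Key=KEY) returns a dict that can include a ContentEncoding key, so I would try the snippet below. Note that the key is only present if Content-Encoding was set when the object was uploaded, and that it describes content codings such as gzip rather than the character set, so it may not answer the question on its own:
obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
print(obj.get("ContentEncoding"))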
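Since the object is already in memory as bytes, the encoding can be worked out without touching the disk. chardet.detect() accepts a bytes object directly, so the bytes from obj['Body'].read() can be passed to it instead of a filename. The sketch below shows a different, standard-library-only technique: trying a short list of candidate encodings in order. The function name decode_with_fallback and the candidate list are assumptions chosen for illustration, not a universal answer.

```python
def decode_with_fallback(raw, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

The returned text could then be handed to pd.read_csv(io.StringIO(text)). A clean decode does not prove the guess is right (latin-1 accepts any byte sequence), so the candidate order matters.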

Writing json to file in s3 bucket

The code below writes JSON to a file in S3.
Instead of opening a local data.json file and uploading it as sample.json,
how do I pass the JSON directly and write it to a file in S3?
import boto3
s3 = boto3.resource('s3', aws_access_key_id='aws_key', aws_secret_access_key='aws_sec_key')
s3.Object('mybucket', 'sample.json').put(Body=open('data.json', 'rb'))

I'm not sure I understand the question correctly. You just want to write JSON data to a file using Boto3? The following code writes a Python dictionary to a JSON file.
import json
import boto3
s3 = boto3.resource('s3')
s3object = s3.Object('your-bucket-name', 'your_file.json')
# json_data is the Python dictionary you want to store
s3object.put(
    Body=json.dumps(json_data).encode('UTF-8')
)

I don't know if anyone is still following this thread, but I tried to upload JSON to S3 using the method above, and it didn't quite work for me. Boto and S3 might have changed since 2018, but this achieved the result for me:
import json
import boto3
s3 = boto3.client('s3')
json_object = 'your_json_object here'
s3.put_object(
    Body=json.dumps(json_object),
    Bucket='your_bucket_name',
    Key='your_key_here'
)
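Both answers come down to serializing the object to bytes and handing those bytes to S3. Here is a small sketch of that step as a reusable helper. The name put_json is hypothetical, and s3_client is assumed to be a boto3 S3 client (for example boto3.client('s3')) supplied by the caller.

```python
import json


def put_json(s3_client, bucket, key, obj):
    """Serialize obj to UTF-8 JSON and store it at bucket/key; return byte count."""
    body = json.dumps(obj).encode("utf-8")
    # ContentType is optional; setting it lets browsers and tools
    # recognize the object as JSON.
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return len(body)
```

Passing the client in, rather than creating it inside the function, keeps credentials handling out of the helper and makes it easy to exercise with a stand-in object.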

Amazon S3 is an object store (in practice, a file store). The primary operations are PUT and GET. You cannot append data to an existing object in S3; you can only replace the entire object.
For a list of the operations you can perform on S3 objects, see this link:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html
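Because an object can only be replaced as a whole, changing a stored JSON document means reading it, modifying it in memory, and writing it back. This is a rough sketch of that cycle. update_json_object is a hypothetical name, s3_client is assumed to be a boto3 S3 client, and the stored document is assumed to be a JSON object (a dict at the top level).

```python
import json


def update_json_object(s3_client, bucket, key, updates):
    """Read a JSON object from S3, merge updates into it, and write it back."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    document = json.loads(response["Body"].read().decode("utf-8"))
    # Merge in memory; S3 itself offers no partial update of an object.
    document.update(updates)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(document).encode("utf-8"),
    )
    return document
```

The cycle is not atomic: if two writers run it at the same time, the later put_object replaces the earlier one.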

An alternative to Joseph McCombs's answer can be achieved using s3fs.
import json
from s3fs import S3FileSystem
json_object = {'test': 3.14}
path_to_s3_object = 's3://your-s3-bucket/your_json_filename.json'
s3 = S3FileSystem()
with s3.open(path_to_s3_object, 'w') as file:
    json.dump(json_object, file)

Python - How to read CSV file retrieved from S3 bucket?

There's a CSV file in an S3 bucket that I want to parse and turn into a dictionary in Python. Using Boto3, I called the s3.get_object(<bucket_name>, <key>) function, which returns a dictionary that includes a "Body" : StreamingBody() key-value pair that apparently contains the data I want.
In my Python file, I've added import csv, and in the examples I see online on how to read a CSV file, you pass the file name, such as:
with open(<csv_file_name>, mode='r') as file:
    reader = csv.reader(file)
However, I'm not sure how to retrieve the CSV file name from StreamingBody, if that's even possible. If not, is there a better way for me to read the CSV file in Python? Thanks!
Edit: Wanted to add that I'm doing this in AWS Lambda and there are documented issues with using pandas in Lambda, so this is why I wanted to use the csv library and not pandas.

csv.reader does not require a file. It can use anything that iterates over lines of text, including files and lists.
So you don't need a filename. Just pass the lines from response['Body'] into the reader, decoding the bytes first because csv.reader expects strings in Python 3. One way to do that (assuming the file is UTF-8) is
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
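Reading the whole body into a list works for small files. For larger ones, the body can be decoded lazily and fed to the csv module line by line, which suits a memory-limited Lambda. This is a sketch under the assumption that body is a binary file-like object with a read() method, such as response['Body'], and that the file is UTF-8 with a header row. iter_csv_rows is a hypothetical name.

```python
import codecs
import csv


def iter_csv_rows(body, encoding="utf-8"):
    """Yield one dict per CSV row, decoding the byte stream incrementally."""
    # codecs.getreader wraps a binary stream in a text reader that the
    # csv module can iterate over line by line.
    text_stream = codecs.getreader(encoding)(body)
    yield from csv.DictReader(text_stream)
```

Each yielded row maps the column names from the header line to that row's values, so no filename and no pandas are involved.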

To retrieve and read a CSV file from an S3 bucket, you can use the following code:
import csv
import boto3
from django.conf import settings
bucket_name = "your-bucket-name"
file_name = "your-file-name-exists-in-that-bucket.csv"
s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(bucket_name)
obj = bucket.Object(key=file_name)
response = obj.get()
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.DictReader(lines)
for row in reader:
    # csv_header_key1 and csv_header_key2 are column names from your CSV header row
    print(row['csv_header_key1'], row['csv_header_key2'])

Infinite loop when streaming a .gz file from S3 using boto

I'm attempting to stream a .gz file from S3 using boto and iterate over the lines of the unzipped text file. Mysteriously, the loop never terminates; when the entire file has been read, the iteration restarts at the beginning of the file.
Let's say I create and upload an input file like the following:
> echo '{"key": "value"}' > foo.json
> gzip -9 foo.json
> aws s3 cp foo.json.gz s3://my-bucket/my-location/
and I run the following Python script:
import boto
import gzip
connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = bucket.get_key('my-location/foo.json.gz')
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
The result is:
b'{"key": "value"}\n'
b'{"key": "value"}\n'
b'{"key": "value"}\n'
...forever...
Why is this happening? I think there must be something very basic that I am missing.

Ah, boto. The problem is that the read method redownloads the key if you call it after the key has been completely read once (compare the read and next methods to see the difference).
This isn't the cleanest way to do it, but it solves the problem:
import boto
import gzip
class ReadOnce(object):
    def __init__(self, k):
        self.key = k
        self.has_read_once = False
    def read(self, size=0):
        # Once the key has been exhausted, keep returning EOF
        # instead of letting boto download it again.
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data
connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = ReadOnce(bucket.get_key('my-location/foo.json.gz'))
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
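The wrapper is only needed because the stream starts over after it has been exhausted. For any binary stream whose read() keeps returning b'' at the end, gzip can consume it directly. This is a sketch under that assumption: body here stands for such a stream (for instance the 'Body' of a boto3 get_object response, which is an assumption and not something shown in this thread), and iter_gzip_lines is a hypothetical name.

```python
import gzip
import io


def iter_gzip_lines(body, encoding="utf-8"):
    """Yield decoded text lines from a gzip-compressed binary stream."""
    with gzip.GzipFile(fileobj=body, mode="rb") as gz:
        # TextIOWrapper decodes the decompressed bytes and splits them
        # into lines without loading the whole file into memory.
        for line in io.TextIOWrapper(gz, encoding=encoding):
            yield line.rstrip("\n")
```

For the foo.json.gz example above, each yielded line could be passed straight to json.loads.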

Thanks zweiterlinde for the wonderful insight and excellent answer provided.
I was looking for a solution to read a compressed S3 object directly into a Pandas DataFrame, and using his wrapper, it can be expressed in two lines:
with gzip.GzipFile(fileobj=ReadOnce(bucket.get_key('my/obj.tsv.gz')), mode='rb') as f:
    df = pd.read_csv(f, sep='\t')

How can I upload a file to S3 without creating a temporary local file?

Is there a feasible way to upload a dynamically generated file directly to Amazon S3, without first creating a local file and then uploading it to the S3 server? I use Python.

Here is an example downloading an image (using requests library) and uploading it to s3, without writing to a local file:
import boto
from boto.s3.key import Key
import requests
# set up the bucket
c = boto.connect_s3(your_s3_key, your_s3_key_secret)
b = c.get_bucket(bucket, validate=False)
# download the file
url = "http://en.wikipedia.org/static/images/project-logos/enwiki.png"
r = requests.get(url)
if r.status_code == 200:
    # upload the file
    k = Key(b)
    k.key = "image1.png"
    k.content_type = r.headers['content-type']
    k.set_contents_from_string(r.content)

You could use BytesIO from the Python standard library.
from io import BytesIO
bytesIO = BytesIO()
bytesIO.write(b'whee')  # BytesIO accepts bytes, not str
bytesIO.seek(0)
s3_file.set_contents_from_file(bytesIO)  # s3_file is assumed to be a boto Key

The boto library's Key object has several methods you might be interested in:
send_file
set_contents_from_file
set_contents_from_string
set_contents_from_stream
For an example of using set_contents_from_string, see Storing Data section of the boto documentation, pasted here for completeness:
>>> from boto.s3.key import Key
>>> k = Key(bucket)
>>> k.key = 'foobar'
>>> k.set_contents_from_string('This is a test of S3')

I assume you're using boto. boto's Key.set_contents_from_file() will accept a StringIO object, and any code you have written to write data to a file should be easily adaptable to write to a StringIO object. Or, if you generate a string, you can use set_contents_from_string().
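As an illustration of adapting file-writing code to an in-memory buffer, the sketch below builds a CSV file in a StringIO and returns it as bytes ready for upload. The function name rows_to_csv_bytes and the sample columns are hypothetical. Only the standard library is used.

```python
import csv
import io


def rows_to_csv_bytes(header, rows, encoding="utf-8"):
    """Write header and rows as CSV into memory and return the encoded bytes."""
    # newline="" stops StringIO from translating the line endings
    # that the csv writer emits.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)
```

The resulting bytes can be passed to set_contents_from_string() in boto, or as the Body argument of put_object() in boto3, so no temporary file is created.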

import boto
import requests
def upload_to_s3(url, **kwargs):
    '''
    :param url: URL of the image to upload
    :return: URL of the image stored in the AWS S3 bucket
    '''
    r = requests.get(url)
    if r.status_code == 200:
        # credentials stored in settings AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
        conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, host=AWS_HOST)
        # Connect to bucket and create key
        b = conn.get_bucket(AWS_Bucket_Name)
        k = b.new_key("{folder_name}/{filename}".format(**kwargs))
        k.set_contents_from_string(r.content, replace=True,
                                   headers={'Content-Type': 'application/%s' % (FILE_FORMAT)},
                                   policy='authenticated-read',
                                   reduced_redundancy=True)
        # TODO Change AWS_EXPIRY
        return k.generate_url(expires_in=AWS_EXPIRY, force_http=True)

I had a dict object that I wanted to store as a JSON file on S3 without creating a local file. The code below worked for me:
import json
from smart_open import smart_open
with smart_open('s3://access-key:secret-key@bucket-name/file.json', 'wb') as fout:
    fout.write(json.dumps(dict_object).encode('utf8'))

In boto3, there is a simple way to upload file content without creating a local file, using the following code. I have modified JimJty's example code for boto3.
import boto3
import requests
from io import BytesIO
# set the values
aws_access_key_id = ""
aws_secret_access_key = ""
region_name = ""
bucket = ""
key = ""
session = boto3.session.Session(aws_access_key_id=aws_access_key_id,
                                aws_secret_access_key=aws_secret_access_key,
                                region_name=region_name)
s3_client = session.client('s3')
# download the file
url = "http://en.wikipedia.org/static/images/project-logos/enwiki.png"
r = requests.get(url)
if r.status_code == 200:
    # wrap the content in BytesIO, since upload_fileobj requires a file-like object
    with BytesIO(r.content) as data:
        s3_client.upload_fileobj(data, bucket, key)

You can try using smart_open (https://pypi.org/project/smart_open/). I used it exactly for that: writing files directly in S3.

Given that encryption at rest is a much desired data standard now, smart_open does not support this afaik

This implementation is an example of uploading a list of images (NumPy list, OpenCV image objects) directly to S3.
Note: you need to convert the image objects (or their encoded buffers) to bytes when uploading; that is how you avoid file corruption errors.
# Assume you have images in the form of a list, i.e. img_array
import boto3
s3 = boto3.client('s3')
res_url = []
for i, img in enumerate(img_array):
    # use the index so that each image gets its own key instead of overwriting the previous one
    s3_key = "fileName_on_s3_%d.png" % i
    # assumes img is already a PNG-encoded buffer; raw pixel arrays would not form a valid PNG
    response = s3.put_object(Body=img.tobytes(), Bucket='bucket_name', Key=s3_key, ACL='public-read', ContentType='image/png')
    s3_url = 'https://bucket_name.s3.ap-south-1.amazonaws.com/' + s3_key
    res_url.append(s3_url)
# res_url is the list of URLs for the uploaded objects

Update for boto3:
aws_session = boto3.Session('my_access_key_id', 'my_secret_access_key')
s3 = aws_session.resource('s3')
s3.Bucket('my_bucket').put_object(Key='file_name.txt', Body=my_file)

I am having a similar issue and was wondering if there was a final answer. With my code below, "starwars.json" keeps being saved locally, but I just want to push each looped .json file to S3 without storing any file locally.
for key, value in star_wars_actors.items():
    response = requests.get('http:starwarsapi/' + value)
    data = response.json()
    with open("starwars.json", "w+") as d:
        json.dump(data, d, ensure_ascii=False, indent=4)
    s3.upload_file('starwars.json', 'test-bucket',
                   '%s/%s' % ('test', str(key) + '.json'))
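One way to avoid the local file in a loop like this is to serialize each record into an in-memory buffer and pass that buffer to upload_fileobj. This is a sketch rather than a tested drop-in. upload_json_records is a hypothetical name, s3_client is assumed to be a boto3 S3 client, and records is assumed to map a name to already-fetched JSON data.

```python
import io
import json


def upload_json_records(s3_client, bucket, prefix, records):
    """Upload each record as <prefix>/<name>.json without touching the disk."""
    uploaded_keys = []
    for name, data in records.items():
        payload = json.dumps(data, ensure_ascii=False, indent=4).encode("utf-8")
        key = "%s/%s.json" % (prefix, name)
        # upload_fileobj expects a binary file-like object, so the bytes
        # are wrapped in BytesIO instead of being written to a file.
        s3_client.upload_fileobj(io.BytesIO(payload), bucket, key)
        uploaded_keys.append(key)
    return uploaded_keys
```

In the loop above, the dictionary passed as records would be built from response.json() for each actor, keyed by str(key).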
