I'm attempting to stream a .gz file from S3 using boto and iterate over the lines of the unzipped text file. Mysteriously, the loop never terminates; when the entire file has been read, the iteration restarts at the beginning of the file.
Let's say I create and upload an input file like the following:
> echo '{"key": "value"}' > foo.json
> gzip -9 foo.json
> aws s3 cp foo.json.gz s3://my-bucket/my-location/
and I run the following Python script:
import boto
import gzip
connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = bucket.get_key('my-location/foo.json.gz')
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
The result is:
b'{"key": "value"}\n'
b'{"key": "value"}\n'
b'{"key": "value"}\n'
...forever...
Why is this happening? I think there must be something very basic that I am missing.
Ah, boto. The problem is that the read method redownloads the key if you call it after the key has been completely read once (compare the read and next methods to see the difference).
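To see the effect in isolation, here is a minimal sketch reusing the key object from the question; with boto 2, a read() issued after the key has been exhausted re-downloads the object instead of returning an empty byte string:
data = key.read()   # reads the whole object
data = key.read()   # re-downloads and starts over rather than returning b''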
This isn't the cleanest way to do it, but it solves the problem:
import boto
import gzip
class ReadOnce(object):
    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data
connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = ReadOnce(bucket.get_key('my-location/foo.json.gz'))
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
Thanks to zweiterlinde for the wonderful insight and excellent answer.
I was looking for a solution to read a compressed S3 object directly into a Pandas DataFrame, and using his wrapper, it can be expressed in two lines:
with gzip.GzipFile(fileobj=ReadOnce(bucket.get_key('my/obj.tsv.gz')), mode='rb') as f:
    df = pd.read_csv(f, sep='\t')
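For comparison, the wrapper can be avoided entirely by pulling the object into memory with boto3 first. This is only a sketch, assuming boto3, a file small enough to buffer, and placeholder bucket/key names:
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='my/obj.tsv.gz')

# Buffer the compressed bytes and let pandas handle the gzip decompression.
buf = io.BytesIO(obj['Body'].read())
df = pd.read_csv(buf, sep='\t', compression='gzip')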
I have a large file s3://my-bucket/in.tsv.gz that I would like to load and process, then write its processed version back to an S3 output file s3://my-bucket/out.tsv.gz.
How do I stream in.tsv.gz directly from S3 without loading the whole file into memory (it cannot fit in memory)?
How do I write the processed gzipped stream directly to s3?
In the following code, I show how I was thinking of loading the input gzipped DataFrame from S3, and how I would write the .tsv if it were located locally (bucket_dir_local = './').
import pandas as pd
import s3fs
import os
import gzip
import csv
import io
bucket_dir = 's3://my-bucket/annotations/'
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")
bucket_dir_local='./'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'], extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            my_dict = {"test": index, "testing": row[6]}
            w.writerow(my_dict)
Edit: smart_open looks like the way to go.
Here is a dummy example to read a file from S3 and write it back to S3 using smart_open:
from smart_open import open
import os
bucket_dir = "s3://my-bucket/annotations/"
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(os.path.join(bucket_dir, "out.tsv.gz"), "wb") as fout:
        for line in fin:
            l = [i.strip() for i in line.decode().split("\t")]
            string = "\t".join(l) + "\n"
            fout.write(string.encode())
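If the end goal is still a DataFrame, a chunked variant keeps memory bounded. This is only a sketch: it relies on smart_open's transparent .gz handling, and the chunksize of 100000 rows is an arbitrary assumption to tune:
import os
import pandas as pd
from smart_open import open

bucket_dir = "s3://my-bucket/annotations/"
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(os.path.join(bucket_dir, "out.tsv.gz"), "w") as fout:
        for i, chunk in enumerate(pd.read_csv(fin, sep="\t", chunksize=100000)):
            # ... process the chunk here ...
            chunk.to_csv(fout, sep="\t", index=False, header=(i == 0))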
For downloading the file, you can stream the S3 object directly in Python. I'd recommend reading that entire post, but here are some key lines from it:
import boto3
s3 = boto3.client('s3', aws_access_key_id='mykey', aws_secret_access_key='mysecret') # your authentication may vary
obj = s3.get_object(Bucket='my-bucket', Key='my/precious/object')
import gzip
body = obj['Body']
with gzip.open(body, 'rt') as gf:
    for ln in gf:
        process(ln)
Unfortunately S3 doesn't support true streaming input, but this SO answer has an implementation that chunks out the file and sends each chunk up to S3. While not a "true stream", it will let you upload large files without needing to keep the entire thing in memory.
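For reference, the chunked approach that answer describes maps onto boto3's multipart upload API roughly as follows. This is only a sketch: generate_chunks() is a hypothetical placeholder that yields byte chunks of at least 5 MB each (the S3 minimum part size for all but the last part), and the bucket/key names are placeholders:
import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'my/large/output.gz'

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(generate_chunks(), start=1):  # hypothetical chunk source
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                          PartNumber=part_number, Body=chunk)
    parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})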
I am making a service to download csv files from s3 bucket.
The bucket contains CSVs with various encodings (which I may not know beforehand), since users are uploading these files.
This is what I am trying:
...
obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
content = io.BytesIO(obj['Body'].read())
df_s3_file = pd.read_csv(content)
...
This works fine for utf-8; however, for other formats it fails (obviously!).
I have found some independent code which can help me identify the encoding of a csv file on a network drive.
It looks like this:
...
def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

my_encoding = find_encoding(content)
print('detected csv encoding: ', my_encoding)
df_s3_file = pd.read_csv(content, encoding=my_encoding)
...
This snippet works absolutely fine for a file on a (local) drive, but how do I do this for a file in an S3 bucket, since I am reading the S3 file as an io.BytesIO object?
I think if I write the file to a drive and then execute the function find_encoding, it's going to work, since that function takes a csv file path as input as opposed to a BytesIO object.
Is there a way to do this without having to download the file on a drive, within memory?
Note: the files size is not very big (<10 mb).
According to their docs, s3c.get_object(Bucket=BUCKET_NAME, Key=KEY) will return a dict where one of the keys is ContentEncoding, so I would try:
obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
print(obj["ContentEncoding"])
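If ContentEncoding isn't set on the object, the chardet approach from the question also works entirely in memory, since chardet.detect accepts raw bytes. A minimal sketch, reusing the names from the question:
import io
import chardet
import pandas as pd

obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
raw = obj['Body'].read()                       # the whole file as bytes (fine for files < 10 MB)
my_encoding = chardet.detect(raw)['encoding']  # detect straight from the in-memory bytes
df_s3_file = pd.read_csv(io.BytesIO(raw), encoding=my_encoding)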
I've been trying to directly read a csv file from AWS S3 to numpy. I've used:
s3 = boto3.client(service_name = 's3')
def s3_read(filename):
    s3_obj = s3.get_object(Bucket='bucket-name', Key=filename)
    body = s3_obj['Body']
    return body.read()
as an attempt to pull the data, but I'm running into a formatting issue from AWS that I don't know how to handle.
When I print out the data that is returned, there is a weird header before the data:
b'{\n "name":"filename",\n "data":{\n "type":"Buffer",\n "data":[\n 114,\n 97,...]}}'
So there's a bunch of \n's and the weird header. Would this have something to do with the way I uploaded the file to AWS or is there something I'm messing up with the reading of the file?
body.read() returns bytes.
import json
j = json.loads(s3_obj['Body'].read().decode('utf-8'))
decode() turns the bytes into a string, and json.loads() parses the string into a dictionary.
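If the object really is a Node.js Buffer serialized to JSON, which is what the pasted output suggests, the original file content is the integer list under data -> data. A hedged sketch for recovering it:
import json

j = json.loads(s3_obj['Body'].read().decode('utf-8'))
raw = bytes(j['data']['data'])   # turn the integer list back into bytes
text = raw.decode('utf-8')       # assuming the wrapped csv was utf-8 text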
There's a CSV file in an S3 bucket that I want to parse and turn into a dictionary in Python. Using Boto3, I called the s3.get_object(<bucket_name>, <key>) function and that returns a dictionary which includes a "Body" : StreamingBody() key-value pair that apparently contains the data I want.
In my Python file, I've added import csv, and in the examples I see online on how to read a csv file, you pass the file name, such as:
with open(<csv_file_name>, mode='r') as file:
    reader = csv.reader(file)
However, I'm not sure how to retrieve the csv file name from the StreamingBody, if that's even possible. If not, is there a better way for me to read the csv file in Python? Thanks!
Edit: Wanted to add that I'm doing this in AWS Lambda and there are documented issues with using pandas in Lambda, so this is why I wanted to use the csv library and not pandas.
csv.reader does not require a file. It can use anything that iterates through lines, including files and lists.
So you don't need a filename. Just pass the lines from response['Body'] directly into the reader. One way to do that is:
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
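For larger files, you can skip the full read() and feed the StreamingBody to csv.reader through an incremental decoder instead. A sketch, assuming UTF-8 content:
import codecs
import csv

reader = csv.reader(codecs.getreader('utf-8')(response['Body']))  # decodes the stream as it is read
for row in reader:
    print(row)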
To retrieve and read a CSV file from an S3 bucket, you can use the following code:
import csv
import boto3
from django.conf import settings
bucket_name = "your-bucket-name"
file_name = "your-file-name-exists-in-that-bucket.csv"
s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(bucket_name)
obj = bucket.Object(key=file_name)
response = obj.get()
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.DictReader(lines)
for row in reader:
    # csv_header_key1/csv_header_key2 are the header names defined in your csv header row
    print(row['csv_header_key1'], row['csv_header_key2'])
I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to S3 bucket as it is being created rather than writing the whole file locally, and then uploading the file at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())
for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())
f.close()
This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts in a multipart upload; you can't stream to S3 directly. There is also a package available that changes your streaming file over to a multipart upload, which I used: Smart Open.
import smart_open
import io
import csv
testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue().encode('utf-8'))  # fout is a binary stream, so encode the buffered text
    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue().encode('utf-8'))
f.close()
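The StringIO shuffle can arguably be avoided by opening the key in text mode and pointing the DictWriter at it directly. A sketch, assuming a smart_open version that supports text-mode writes, and reusing fieldnames and testDict from above:
import csv
from smart_open import open

with open('s3://dev-test/bar/foo.csv', 'w') as fout:
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    for row in testDict:
        writer.writerow(row)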
Here is a complete example using boto3:
import boto3
import io
session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")
buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())
# bucket and keypath are assumed to be defined, e.g.:
bucket = "my-bucket"
keypath = "path/to/output.txt"
s3.Object(bucket, keypath).put(Body=buff.getvalue())
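If the buffer can get large, a variant is to rewind it and hand the file object to upload_fileobj, which switches to a multipart upload behind the scenes when needed (a sketch reusing buff, bucket and keypath from above):
buff.seek(0)  # rewind, since we just wrote into the buffer
s3.Object(bucket, keypath).upload_fileobj(buff)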
We were trying to upload file contents to S3 when they came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
According to the docs, it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
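To make the "use StringIO in the ordinary way" part concrete, a minimal sketch might look like this (bucket and key names are placeholders):
import io
import boto3

s3 = boto3.resource('s3')
csv_buffer = io.StringIO()
csv_buffer.write('first_name,last_name\n')
csv_buffer.write('Ada,Lovelace\n')

# put() accepts bytes, so encode the accumulated text before uploading.
s3.Object('mybucket', 'hello.csv').put(Body=csv_buffer.getvalue().encode('utf-8'))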
Update: the smart_open lib from @inquiring minds' answer is a better solution.
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
import csv
import gzip
import io
import boto3

# my_data, bucket_name and key are assumed to be defined elsewhere
csv_data = io.StringIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode('utf-8'))
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
There's a well supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally, there's also something built into boto3 (backed by the AWS API) called MultipartUpload.
It isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time.
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to string and you're there.