Python Write Temp File to S3

I am currently trying to write a dataframe to a temp file and then upload that temp file to an S3 bucket. When I run my code, nothing appears to happen. Any help would be greatly appreciated. The following is my code:
import csv
import pandas as pd
import boto3
import tempfile
import os
s3 = boto3.client('s3', aws_access_key_id = access_key, aws_secret_access_key = secret_key, region_name = region)
temp = tempfile.TemporaryFile()
largedf.to_csv(temp, sep = '|')
s3.put_object(temp, Bucket = '[BUCKET NAME]', Key = 'test.txt')
temp.close()

The file handle you pass to s3.put_object is at its final position after the write, so when it is read, it returns an empty string.
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(10, 50, (5, 5)))
>>> temp = tempfile.TemporaryFile(mode='w+')
>>> df.to_csv(temp)
>>> temp.read()
''
A quick fix is to .seek back to the beginning...
>>> temp.seek(0)
0
>>> print(temp.read())
,0,1,2,3,4
0,11,42,40,45,11
1,36,18,45,24,25
2,28,20,12,33,44
3,45,39,14,16,20
4,40,16,22,30,37
Note that writing to disk is unnecessary; you could keep everything in memory using a buffer, something like:
from io import StringIO # on python 2, use from cStringIO import StringIO
buffer = StringIO()
# Saving df to memory as a temporary file
df.to_csv(buffer)
buffer.seek(0)
s3.put_object(Body=buffer.getvalue(), Bucket='[BUCKET NAME]', Key='test.txt')
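If you do want to keep the temp-file approach, here is a hedged sketch with both fixes applied (a text-mode temp file plus the seek back to the start, and the Body keyword); the bucket name stays the question's placeholder and a small random DataFrame stands in for largedf:
import tempfile

import boto3
import numpy as np
import pandas as pd

# credentials/region as in the question; the bucket name is a placeholder
s3 = boto3.client('s3')
largedf = pd.DataFrame(np.random.randint(10, 50, (5, 5)))

# open in text mode so to_csv can write strings, then rewind before uploading
with tempfile.TemporaryFile(mode='w+') as temp:
    largedf.to_csv(temp, sep='|')
    temp.seek(0)
    s3.put_object(Body=temp.read(), Bucket='[BUCKET NAME]', Key='test.txt')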

Related

Create CSV with variables as column using AWS lambda

I've created a Lambda that scans my S3 bucket and collects some metadata for each object it finds. However, I am hitting a roadblock when exporting a CSV with that data: my CSV only contains one record. How can I get my CSV to include all objects?
Please see my Lambda code below:
import re
import datetime
from datetime import date
import os
import math
import csv
import logging
import boto3

s3 = boto3.client('s3')
logger = logging.getLogger()
logger.setLevel(logging.INFO)
time = date.today().strftime("%d/%m/%Y")

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    result = []
    bucket = s3_resource.Bucket('dev-bucket')
    key = 'csv_file.csv'
    for object in bucket.objects.all():
        name = object.key
        size = object.size
        si = list(name)
        dates = object.last_modified.strftime("%d/%m/%Y")
        owner = object.owner['DisplayName']
        days_since_creation = datetime.datetime.strptime(time, "%d/%m/%Y") - datetime.datetime.strptime(dates, "%d/%m/%Y")
        days_since_creation = days_since_creation.days
        to_delete = []
        if days_since_creation >= 30:
            to_delete = 'Y'
        else:
            to_delete = 'N'
        myfile = open("/tmp/csv_file.csv", "w+")
        writer = csv.writer(myfile, delimiter='|')
        rows = name, size, dates, days_since_creation
        rows = list(rows)
        writer.writerow(rows)
        myfile.close()
        # upload the data into s3
        s3.upload_file('/tmp/csv_file.csv', 'dev-bucket', 'cleanuptest.csv')
        print(rows)
My current output is below:
09ff0687-a644-4d5e-9de8-277594b194a6.csv.metadata|280|29/11/2021|78
The preferred output would be:
0944ee8b-1e17-496a-9196-0caed1e1de11.csv.metadata|152|08/12/2021|69
0954d7e5-dcc6-4cb6-8c07-70cbf37a73ef.csv|8776432|16/11/2021|91
0954d7e5-dcc6-4cb6-8c07-70cbf37a73ef.csv.metadata|336|16/11/2021|91
0959edc4-fa02-493f-9c05-9040964f4756.csv|6338|29/11/2021|78
0959edc4-fa02-493f-9c05-9040964f4756.csv.metadata|225|29/11/2021|78
0965cf32-fc31-4acc-9c32-a983d8ea720d.txt|844|10/12/2021|67
0965cf32-fc31-4acc-9c32-a983d8ea720d.txt.metadata|312|10/12/2021|67
096ed35c-e2a7-4ec4-8dae-f87b42bfe97c.csv|1761|09/12/2021|68
Unfortunately, I cannot get it right and I'm not sure what I'm doing wrong. Help would be appreciated.
I think that in your current setup you open, rewrite, and close the file for each object, so at the end the file only contains the last row.
What you probably want is this:
myfile = open("/tmp/csv_file.csv", "w+")
for object in bucket.objects.all():
    <the looping logic>
myfile.close()
s3.upload_file('/tmp/csv_file.csv', 'dev-bucket', 'cleanuptest.csv')
You can prove that opening & closing the file each time rewrites the file by running the below minimal version of your script:
import csv
myfile1 = open("csv_file.csv", "w+")
writer1 = csv.writer(myfile1,delimiter='|')
row1 = "a", "b", "c"
rows1 = list(row1)
writer1.writerow(rows1)
myfile1.close()
print(rows1)
myfile2 = open("csv_file.csv", "w+")
writer2 = csv.writer(myfile2,delimiter='|')
row2 = "x", "y", "z"
rows2 = list(row2)
writer2.writerow(rows2)
myfile2.close()
print(rows2)
Output in file:
x|y|z
FYI, you can also open the file in append mode ('a') to ensure the rows are not overwritten.
myfile = open("/tmp/csv_file.csv", "a")
Using w has the below caveat as mentioned in the docs:
'w' for only writing (an existing file with the same name will be erased)
'a' opens the file for appending; any data written to the file is automatically added to the end.
...
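Putting it all together, here is a minimal sketch of the corrected handler, assuming the same bucket, file, and field names as the question (open the file once, write one row per object, upload once at the end):
import csv
import datetime
from datetime import date

import boto3

s3 = boto3.client('s3')
time = date.today().strftime("%d/%m/%Y")

def lambda_handler(event, context):
    bucket = boto3.resource('s3').Bucket('dev-bucket')
    with open("/tmp/csv_file.csv", "w") as myfile:
        writer = csv.writer(myfile, delimiter='|')
        for obj in bucket.objects.all():
            dates = obj.last_modified.strftime("%d/%m/%Y")
            days_since_creation = (datetime.datetime.strptime(time, "%d/%m/%Y")
                                   - datetime.datetime.strptime(dates, "%d/%m/%Y")).days
            # one row per object, all written before the file is closed
            writer.writerow([obj.key, obj.size, dates, days_since_creation])
    # upload once, after every row has been written
    s3.upload_file('/tmp/csv_file.csv', 'dev-bucket', 'cleanuptest.csv')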

import multiple files and combine into one large parquet file in s3

I have multiple files in s3 in one folder:
apc18840407-20191231-01.csv
apc18840407-20191231-02.csv
apc18840407-20191231-03.csv
...apc18840407-20191231-65.csv
Because of a multi-header issue, each file needs a bit of cleaning before it can be combined with the others. The cleaning code:
tm1 = pd.read_csv('s3:file path')
tm1 = tm1.reset_index()
tm1 = tm1.rename(columns=tm1.iloc[0])
tm1 = tm1[6:]
tm1
I am trying to import them all at once and combine the files into one large parquet file.
import boto3
import pandas as pd
import io
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucketname')
prefix_objs = bucket.objects.filter(Prefix="s3://folder path")
prefix_df = []
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8')
    prefix_df.append(temp)
However, this code does not work and returns an empty list. Is there any way to fix this?
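One likely culprit is the Prefix argument: objects.filter expects a key prefix relative to the bucket (e.g. "folder/"), not an "s3://..." URL, so the filter matches nothing. Here is a hedged sketch combining that fix with the per-file cleaning above; the bucket and prefix names are placeholders, and writing parquet assumes pyarrow or fastparquet is installed:
import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucketname')

frames = []
# key prefix inside the bucket, not an s3:// URL
for obj in bucket.objects.filter(Prefix='folder path/'):
    body = obj.get()['Body'].read()
    tm = pd.read_csv(io.BytesIO(body), encoding='utf8')
    # the cleaning step from the question, applied per file
    tm = tm.reset_index()
    tm = tm.rename(columns=tm.iloc[0])
    tm = tm[6:]
    frames.append(tm)

combined = pd.concat(frames, ignore_index=True)
combined.to_parquet('combined.parquet')  # requires pyarrow or fastparquet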

Saving a DF to azure blob

I am trying to save a df which is returned from a function (return df) and push it to my Azure Blob Storage account.
I am having some trouble, as all the solutions I have found require a file path, whereas I just want to run some code on a dataframe and have it saved to the blob automatically.
As per requests, a snippet of my code :)
As stated above, I am looking to save the df (a pandas DataFrame) as a .csv in the blob; I am not looking for other information.
import pandas as pd
import numpy as np
import datetime
import os, uuid
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__

def function(df):
    df = df.rename(columns=df.iloc[1]).drop(df.index[0])
    df = df.iloc[1:]
    indexNames = df[df['Customer'].isin(['Stock', 'Sales', 'Over', '2021 Under'])].index
    df = df.drop(indexNames)
    df.columns = df.columns.fillna('ItemNo')
    for col in df:
        df['ItemNo'] = df['ItemNo'].ffill()
    return df

CONNECTION_STRING = ""
CONTAINERNAME = ""
BLOBNAME = ""
LOCALFILENAME = ""

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)  # instantiate new BlobServiceClient with connection string
#container_client = blob_service_client.get_container_client(CONTAINERNAME)  # instantiate new ContainerClient
blob_client = blob_service_client.get_blob_client(container=CONTAINERNAME, blob=BLOBNAME)

# READ PRODUCTS FILE
f = open(LOCALFILENAME, "wb")
f.write(blob_client.download_blob().content_as_bytes())
f.close()
df = pd.read_excel(r'' + LOCALFILENAME)
Maybe you can try the following code:
import os
import tempfile

temp_path = tempfile.gettempdir()
file_path = os.path.join(temp_path, 'dataframe.csv')
df.to_csv(file_path)

with open(file_path, "rb") as data:
    blob_client.upload_blob(data)
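Alternatively, since upload_blob accepts in-memory data, here is a hedged sketch that skips the temporary file entirely, assuming the same df and blob_client as above:
# serialise the DataFrame to CSV text and upload it straight from memory
csv_data = df.to_csv(index=False)
blob_client.upload_blob(csv_data, overwrite=True)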

EmptyDataError: No columns to parse from file when reading multiple csv files from S3 bucket to pandas Dataframe

I have a source S3 bucket which has around 500 csv files. I want to move those files to another S3 bucket, and before moving I want to clean up the data, so I am reading each file into a pandas dataframe. My code works fine and returns dataframes for a few files, then it suddenly breaks with the error "EmptyDataError: No columns to parse from file".
from io import BytesIO

import boto3
import pandas as pd

sts_client = boto3.client('sts', region_name='us-east-1')
client = boto3.client('s3')
bucket = 'source bucket'
folder_path = 'mypath'

def get_keys(bucket, folder_path):
    keys = []
    resp = client.list_objects(Bucket=bucket, Prefix=folder_path)
    for obj in resp['Contents']:
        keys.append(obj['Key'])
    return keys

files = get_keys(bucket, folder_path)
print(files)

for file in files:
    f = BytesIO()
    client.download_fileobj(bucket, file, f)
    f.seek(0)
    obj = f.getvalue()
    my_df = pd.read_csv(f, header=None, escapechar='\\', encoding='utf-8', engine='python')
    # files don't have column names, providing column names
    my_df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
    print(my_df.head())
Thanks in advance!
Some of your files have a size of zero. Rather than checking sizes locally (e.g. with os.path.getsize), use a paginator to filter out empty objects on the S3 side:
import boto3
client = boto3.client('s3', region_name='us-west-2')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='my-bucket')
filtered_iterator = page_iterator.search("Contents[?Size > `0`][]")
for key_data in filtered_iterator:
    print(key_data)
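Here is a hedged sketch of how that filter could slot into the original download loop, reusing the bucket, prefix, and column names from the question, so that zero-byte objects never reach read_csv:
from io import BytesIO

import boto3
import pandas as pd

client = boto3.client('s3')
bucket = 'source bucket'
folder_path = 'mypath'

paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=folder_path)

# only objects with Size > 0 are yielded
for key_data in page_iterator.search("Contents[?Size > `0`][]"):
    f = BytesIO()
    client.download_fileobj(bucket, key_data['Key'], f)
    f.seek(0)
    my_df = pd.read_csv(f, header=None, escapechar='\\', encoding='utf-8', engine='python')
    my_df.columns = ['col1', 'col2', 'col3', 'col4', 'col5']
    print(my_df.head())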

Python ungzipping stream of bytes?

Here is the situation:
I get gzipped xml documents from Amazon S3
import boto
from boto.s3.connection import S3Connection
from boto.s3.key import Key
conn = S3Connection('access Id', 'secret access key')
b = conn.get_bucket('mydev.myorg')
k = Key(b)
k.key = 'documents/document.xml.gz'
I read them into a file like this:
import gzip
f = open('/tmp/p', 'w')
k.get_file(f)
f.close()
r = gzip.open('/tmp/p', 'rb')
file_content = r.read()
r.close()
Question
How can I ungzip the streams directly and read the contents?
I do not want to create temp files; they don't look good.
Yes, you can use the zlib module to decompress byte streams:
import zlib
def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
Adding 32 to zlib.MAX_WBITS tells zlib to expect a gzip header and skip it while decompressing.
The S3 key object is an iterator, so you can do:
for data in stream_gzip_decompress(k):
    # do something with the decompressed data
    ...
I had to do the same thing and this is how I did it:
import gzip
import StringIO  # Python 2

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # This is crucial
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
For Python 3.x and boto3:
I used BytesIO to read the compressed file into a buffer object, then used zipfile to open the stream as uncompressed data, and I was able to read it line by line.
import io
import zipfile
import boto3
import sys
s3 = boto3.resource('s3', 'us-east-1')
def stream_zip_file():
    count = 0
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
You can also try a pipe and read the contents without downloading the file:
import subprocess

c = subprocess.Popen('zcat -c <gzip file name>', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for row in c.stdout:
    print(row)
In addition, "/dev/fd/" + str(c.stdout.fileno()) will give you a FIFO filename (named pipe) which can be passed to another program.
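For a boto3-era take on the same problem, here is a hedged sketch (the bucket and key reuse the question's names) that wraps the streaming response body in gzip.GzipFile, avoiding both temp files and subprocesses:
import gzip

import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='mydev.myorg', Key='documents/document.xml.gz')

# GzipFile decompresses as it reads from the response stream
with gzip.GzipFile(fileobj=obj['Body']) as gz:
    file_content = gz.read()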
