Python - How to read CSV file retrieved from S3 bucket?

There's a CSV file in an S3 bucket that I want to parse and turn into a dictionary in Python. Using Boto3, I called the s3.get_object(<bucket_name>, <key>) function, which returns a dictionary that includes a "Body": StreamingBody() key-value pair that apparently contains the data I want.
In my Python file I've added import csv, and in the examples I see online for reading a CSV file you pass a file name, such as:
with open(<csv_file_name>, mode='r') as file:
    reader = csv.reader(file)
However, I'm not sure how to retrieve the CSV file name from the StreamingBody, if that's even possible. If not, is there a better way for me to read the CSV file in Python? Thanks!
Edit: Wanted to add that I'm doing this in AWS Lambda, and there are documented issues with using pandas in Lambda, which is why I want to use the csv library and not pandas.

csv.reader does not require a file. It can use anything that iterates through lines, including files and lists.
So you don't need a filename. Just pass the lines from response['Body'] directly into the reader. One way to do that is
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
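For context, a minimal end-to-end sketch of that approach; the bucket and key names here are placeholders, not anything from the question:

import csv
import boto3

s3 = boto3.client('s3')

# Placeholder bucket/key -- substitute your own values.
response = s3.get_object(Bucket='my-bucket', Key='data.csv')

# StreamingBody.read() returns bytes, so decode before handing the lines to csv.
lines = response['Body'].read().decode('utf-8').splitlines(True)

reader = csv.reader(lines)
for row in reader:
    print(row)  # each row is a list of column values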

To retrieve and read a CSV file from an S3 bucket, you can use the following code:
import csv
import boto3
from django.conf import settings

bucket_name = "your-bucket-name"
file_name = "your-file-name-exists-in-that-bucket.csv"

s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(bucket_name)
obj = bucket.Object(key=file_name)
response = obj.get()

lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.DictReader(lines)
for row in reader:
    # csv_header_key1, csv_header_key2 are the column names defined in your CSV header
    print(row['csv_header_key1'], row['csv_header_key2'])
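Since the question's edit mentions AWS Lambda, where credentials normally come from the function's execution role rather than Django settings, here is a hedged handler-style sketch of the same idea; the bucket and key names are placeholders:

import csv
import boto3

s3 = boto3.client('s3')  # in Lambda, credentials come from the execution role

def lambda_handler(event, context):
    # Placeholder bucket/key -- replace with your own, or pull them from the event.
    response = s3.get_object(Bucket='my-bucket', Key='data.csv')
    lines = response['Body'].read().decode('utf-8').splitlines()

    # DictReader maps each row onto a dict keyed by the CSV header.
    rows = [row for row in csv.DictReader(lines)]
    return {'row_count': len(rows)}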

Related

How to use S3 Select with tab separated csv files

I'm using this script to query data from a CSV file saved in an AWS S3 bucket. It works well with CSV files that were originally saved in comma-separated format, but I have a lot of data saved with a tab delimiter (sep='\t'), which makes the code fail.
The original data is massive, which makes it difficult to rewrite. Is there a way to query data where we specify the delimiter/separator for the CSV file?
I adapted it from this post: https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428 ... I'd like to thank the writer for the tutorial, which saved me a lot of time.
Here's the code:
import boto3
import os
import pandas as pd

S3_KEY = r'source/df.csv'
S3_BUCKET = 'my_bucket'
TARGET_FILE = 'dataset.csv'

aws_access_key_id = 'my_key'
aws_secret_access_key = 'my_secret'

s3_client = boto3.client(service_name='s3',
                         region_name='us-east-1',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

query = """SELECT column1
           FROM S3Object
           WHERE column1 = '4223740573'"""

result = s3_client.select_object_content(Bucket=S3_BUCKET,
                                         Key=S3_KEY,
                                         ExpressionType='SQL',
                                         Expression=query,
                                         InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
                                         OutputSerialization={'CSV': {}})

# remove the file if it exists, since we append filtered rows line by line
if os.path.exists(TARGET_FILE):
    os.remove(TARGET_FILE)

with open(TARGET_FILE, 'a+') as filtered_file:
    # write the header as the first line, then append each row from S3 Select
    filtered_file.write('Column1\n')
    for record in result['Payload']:
        if 'Records' in record:
            res = record['Records']['Payload'].decode('utf-8')
            filtered_file.write(res)

result = pd.read_csv(TARGET_FILE)
The InputSerialization option also allows you to specify:
RecordDelimiter - A single character used to separate individual records in the input. Instead of the default value, you can specify an arbitrary delimiter.
So you could try:
result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use', 'RecordDelimiter': '\t'}},
    OutputSerialization={'CSV': {}})
Actually, I had a TSV file, and I used this InputSerialization:
InputSerialization={'CSV': {'FileHeaderInfo': 'None', 'RecordDelimiter': '\n', 'FieldDelimiter': '\t'}}
It works for files that have newlines between records (not tabs) and tabs between fields.
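Putting that together, a hedged sketch of a complete S3 Select call for a tab-delimited file with a header row; the bucket, key, and column names are placeholders:

import boto3

s3_client = boto3.client('s3')

# Placeholder bucket/key -- replace with your own values.
result = s3_client.select_object_content(
    Bucket='my_bucket',
    Key='source/df.tsv',
    ExpressionType='SQL',
    Expression="SELECT s.column1 FROM S3Object s",
    # Tab between fields, newline between records; 'Use' lets the query reference header names.
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use',
                                'FieldDelimiter': '\t',
                                'RecordDelimiter': '\n'}},
    OutputSerialization={'CSV': {}})

for event in result['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))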

How to stream a large gzipped .tsv file from s3, process it, and write back to a new file on s3?

I have a large file s3://my-bucket/in.tsv.gz that I would like to load and process, then write its processed version back to an S3 output file s3://my-bucket/out.tsv.gz.
How do I stream in.tsv.gz directly from S3 without loading the whole file into memory (it cannot fit in memory)?
How do I write the processed gzipped stream directly to S3?
In the following code, I show how I was thinking of loading the gzipped input dataframe from S3, and how I would write the .tsv.gz output if it were located locally (bucket_dir_local = './').
import pandas as pd
import s3fs
import os
import gzip
import csv
import io

bucket_dir = 's3://my-bucket/annotations/'
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")

bucket_dir_local = './'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'], extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            my_dict = {"test": index, "testing": row[6]}
            w.writerow(my_dict)
Edit: smart_open looks like the way to go.
Here is a dummy example that reads a file from S3 and writes it back to S3 using smart_open:
from smart_open import open
import os

bucket_dir = "s3://my-bucket/annotations/"

with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(os.path.join(bucket_dir, "out.tsv.gz"), "wb") as fout:
        for line in fin:
            l = [i.strip() for i in line.decode().split("\t")]
            string = "\t".join(l) + "\n"
            fout.write(string.encode())
For downloading the file, you can stream the S3 object directly in Python. The key lines:
import boto3
import gzip

s3 = boto3.client('s3', aws_access_key_id='mykey', aws_secret_access_key='mysecret')  # your authentication may vary
obj = s3.get_object(Bucket='my-bucket', Key='my/precious/object')

body = obj['Body']
with gzip.open(body, 'rt') as gf:
    for ln in gf:
        process(ln)
Unfortunately, S3 doesn't support true streaming input, but this SO answer has an implementation that chunks out the file and sends each chunk up to S3. While not a "true stream", it will let you upload large files without needing to keep the entire thing in memory.
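For the upload side, a hedged sketch of that chunked approach using boto3's multipart-upload calls; the bucket, key, and the generate_compressed_chunks() generator are all placeholders for your own code:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'out.tsv.gz'   # placeholders

# Start a multipart upload and send the data in parts so it never sits in
# memory all at once (each part except the last must be at least 5 MB).
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
try:
    # generate_compressed_chunks() is a hypothetical generator yielding bytes chunks.
    for part_number, chunk in enumerate(generate_compressed_chunks(), start=1):
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                              PartNumber=part_number, Body=chunk)
        parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})
except Exception:
    # Abort so the partial upload does not keep accruing storage charges.
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'])
    raise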

python - read csv from s3 and identify its encoding info for pandas dataframe

I am making a service to download CSV files from an S3 bucket.
The bucket contains CSVs with various encodings (which I may not know beforehand), since users are uploading these files.
This is what I am trying:
...
obj = s3c.get_object(Bucket= BUCKET_NAME , Key = KEY)
content = io.BytesIO(obj['Body'].read())
df_s3_file = pd.read_csv(content)
...
This works fine for UTF-8; however, for other formats it fails (obviously!).
I have found an independent piece of code which can help me identify the encoding of a CSV file on a network drive.
It looks like this:
...
def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

my_encoding = find_encoding(content)
print('detected csv encoding: ', my_encoding)
df_s3_file = pd.read_csv(content, encoding=my_encoding)
...
This snippet works absolutely fine for a file on a (local) drive, but how do I do this for a file in an S3 bucket, given that I am reading the S3 file as an io.BytesIO object?
I think if I write the file to a drive and then execute the function find_encoding, it's going to work, since that function takes a CSV file path as input as opposed to a BytesIO object.
Is there a way to do this without having to download the file on a drive, within memory?
Note: the files size is not very big (<10 mb).
According to the docs, s3c.get_object(Bucket=BUCKET_NAME, Key=KEY) returns a dict where one of the keys is ContentEncoding, so I would try:
obj = s3c.get_object(Bucket=BUCKET_NAME, Key=KEY)
print(obj["ContentEncoding"])
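If ContentEncoding is not set on the object (it often isn't), a hedged in-memory alternative is to run chardet directly on the downloaded bytes, since chardet.detect accepts a bytes object rather than a filename; the bucket and key here are placeholders:

import io
import boto3
import chardet
import pandas as pd

s3c = boto3.client('s3')

# Placeholder bucket/key -- substitute your own values.
raw = s3c.get_object(Bucket='my-bucket', Key='data.csv')['Body'].read()

# chardet works on bytes, so no temporary file on disk is needed.
detected = chardet.detect(raw)
print('detected csv encoding:', detected['encoding'])

df_s3_file = pd.read_csv(io.BytesIO(raw), encoding=detected['encoding'])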

How to combine same files in multiple folders into one file in S3

If I have the same file in multiple folders in S3, how do I combine them using boto3 in Python?
Say in a bucket I have
bucket_a
  ts
    ts_folder
      a_date.csv
      b_date.csv
      c_date.csv
      d_date.csv
    ts_folder2
      a_date.csv
      b_date.csv
      c_date.csv
      d_date.csv
I need to combine these two files into one file, also ignoring the header in the second file.
I am trying to figure out how to achieve this using boto3 in Python or AWS.
Try something like this. I assume you have your AWS credentials set up properly on your system. My suggestion would be to first add the lines of the CSV to a new variable. For the second CSV you skip the first line. After collecting all the lines, you join them into a single string so they can be written to an S3 object.
import boto3

# Output will contain the CSV lines
output = []

with open("first.csv", "r") as fh:
    output.extend(fh.readlines())

with open("second.csv", "r") as fh:
    # Skip the header line
    output.extend(fh.readlines()[1:])

# Combine the lines into a single string
body = "".join(output)

# Create the S3 client (assuming credentials are set up)
s3_client = boto3.client("s3")

# Write the object
s3_client.put_object(Bucket="my-bucket",
                     Key="combined.csv",
                     Body=body)
Update
This should help you with the S3 setup:
import boto3

session = boto3.session.Session(profile_name='dev')
s3_client = session.client("s3")

bucket = "my-bucket"

files = []
for item in s3_client.list_objects_v2(Bucket=bucket, Prefix="ts/")['Contents']:
    if item['Key'].endswith(".csv"):
        files.append(item['Key'])

output = []
for file in files:
    # read() returns bytes, so decode before joining the pieces as strings
    body = s3_client.get_object(Bucket=bucket,
                                Key=file)["Body"].read().decode("utf-8")
    output.append(body)

# Combine the contents into a single string
outputbody = "".join(output)

# Write the object
s3_client.put_object(Bucket=bucket,
                     Key="combined.csv",
                     Body=outputbody)
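The update concatenates the files verbatim; if you also want to drop the header row of every file after the first, as the question asks, a hedged variation looks like this (the bucket name and "ts/" prefix are placeholders):

import boto3

s3_client = boto3.client("s3")
bucket = "my-bucket"   # placeholder

keys = [item['Key']
        for item in s3_client.list_objects_v2(Bucket=bucket, Prefix="ts/")['Contents']
        if item['Key'].endswith(".csv")]

output = []
for i, key in enumerate(sorted(keys)):
    text = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    lines = text.splitlines(True)
    # Keep the header only from the first file; skip it for the rest.
    output.extend(lines if i == 0 else lines[1:])

s3_client.put_object(Bucket=bucket, Key="combined.csv", Body="".join(output))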

How to download a CSV file from the World Bank's dataset

I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will download the zip, open it, and give you a csv reader for whichever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

# We create a StringIO object so that we can work on the result of the request
# (a string) as though it were a file.
sio = StringIO.StringIO()
sio.write(remoteCSV.read())

# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
z = ZipFile(sio, 'r')

# A list with the names of all the files in the zip you just downloaded
print z.namelist()

# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # Opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
import os
import urllib
import zipfile
from StringIO import StringIO

package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')

pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
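The two answers above are written for Python 2; a hedged Python 3 equivalent of the same download-and-unzip step looks roughly like this (picking the second entry in the archive follows the first answer and is an assumption, so listing the names first is safer):

import csv
import io
import urllib.request
import zipfile

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"

# Download the zip into memory and open it as a file-like object.
with urllib.request.urlopen(baseUrl) as resp:
    z = zipfile.ZipFile(io.BytesIO(resp.read()))

print(z.namelist())  # inspect the archive to find the data file you want

# Assumption: the data file is the second entry, as in the answer above.
with z.open(z.namelist()[1]) as f:
    reader = csv.reader(io.TextIOWrapper(f, encoding='utf-8'))
    for row in reader:
        print(row)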
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata data
Extracting metadata and data
Converting to a Data Package
The script is Python-based and uses Python 3. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You can also read our analysis of World Bank data:
https://datahub.io/awesome/world-bank
More of a suggestion than a solution: you can use pd.read_csv to read any CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')
