I'm using this script to query data from a CSV file that's stored in an AWS S3 bucket. It works well with CSV files that were originally saved in comma-separated format, but I have a lot of data saved with a tab delimiter (sep='\t'), which makes the code fail.
The original data is massive, which makes it difficult to rewrite. Is there a way to query the data while specifying the delimiter/separator for the CSV file?
I took the script from this post: https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428 ... I'd like to thank the writer for the tutorial, which helped me save a lot of time.
Here's the code:
import boto3
import os
import pandas as pd
S3_KEY = r'source/df.csv'
S3_BUCKET = 'my_bucket'
TARGET_FILE = 'dataset.csv'
aws_access_key_id= 'my_key'
aws_secret_access_key= 'my_secret'
s3_client = boto3.client(service_name='s3',
                         region_name='us-east-1',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

query = """SELECT column1
           FROM S3Object
           WHERE column1 = '4223740573'"""

result = s3_client.select_object_content(Bucket=S3_BUCKET,
                                          Key=S3_KEY,
                                          ExpressionType='SQL',
                                          Expression=query,
                                          InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
                                          OutputSerialization={'CSV': {}})

# remove the file if exists, since we append filtered rows line by line
if os.path.exists(TARGET_FILE):
    os.remove(TARGET_FILE)

with open(TARGET_FILE, 'a+') as filtered_file:
    # write header as a first line, then append each row from S3 select
    filtered_file.write('Column1\n')
    for record in result['Payload']:
        if 'Records' in record:
            res = record['Records']['Payload'].decode('utf-8')
            filtered_file.write(res)

result = pd.read_csv(TARGET_FILE)
The InputSerialization option also allows you to specify:
FieldDelimiter - A single character used to separate individual fields in a record. Instead of the default value (a comma), you can specify an arbitrary delimiter.
RecordDelimiter - A single character used to separate individual records in the input. Instead of the default value ('\n'), you can specify an arbitrary delimiter.
Since your files use tabs between fields, FieldDelimiter is the one to override. So you could try:
result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use', 'FieldDelimiter': '\t'}},
    OutputSerialization={'CSV': {}})
For what it's worth, I had a TSV file and used this InputSerialization:
InputSerialization={'CSV': {'FileHeaderInfo': 'None', 'RecordDelimiter': '\n', 'FieldDelimiter': '\t'}}
It works for files that have newlines between records and tabs between fields.
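If you'd rather skip the temporary file entirely, here is a minimal sketch of streaming the selected records straight into pandas (an alternative to the TARGET_FILE approach above, not something from the original post; it reuses the s3_client, S3_BUCKET, S3_KEY and query names from the question):

import io
import pandas as pd

result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    # FieldDelimiter handles the tabs between fields; RecordDelimiter stays '\n'
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use',
                                'FieldDelimiter': '\t',
                                'RecordDelimiter': '\n'}},
    OutputSerialization={'CSV': {}})

# Collect the streamed record chunks and parse them in memory
chunks = [event['Records']['Payload'].decode('utf-8')
          for event in result['Payload'] if 'Records' in event]
df = pd.read_csv(io.StringIO(''.join(chunks)), names=['Column1'])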
Related
I am trying to load data (CSV files) from S3 into MySQL RDS through Lambda. I have written code in Lambda so that whenever a CSV file is uploaded to the S3 bucket, the data is imported into the database.
But if the CSV files contain spaces, the data is not imported into the database correctly. See the images below.
CODE:
import json
import boto3
import csv
import mysql.connector
from mysql.connector import Error
from mysql.connector import errorcode
s3_client = boto3.client('s3')
# Read CSV file content from S3 bucket
def lambda_handler(event, context):
    # TODO implement
    # print(event)
    bucket = event['Records'][0]['s3']['bucket']['name']
    csv_file = event['Records'][0]['s3']['object']['key']
    csv_file_obj = s3_client.get_object(Bucket=bucket, Key=csv_file)
    lines = csv_file_obj['Body'].read().decode('utf-8').split()

    results = []
    for row in csv.DictReader(lines, skipinitialspace=True, delimiter=',', quotechar='"', doublequote=True):
        results.append(row.values())
    print(results)

    connection = mysql.connector.connect(host='xxxxxxxxxxxxxxx.ap-south-1.rds.amazonaws.com',
                                         database='xxxxxxxdb',
                                         user='xxxxxx',
                                         password='xxxxxx')

    tables_dict = {
        'sketching': 'INSERT INTO table1 (empid, empname, empaddress) VALUES (%s, %s, %s)'
    }

    if csv_file in tables_dict:
        mysql_empsql_insert_query = tables_dict[csv_file]
        cursor = connection.cursor()
        cursor.executemany(mysql_empsql_insert_query, results)
        connection.commit()
        print(cursor.rowcount, f"Record inserted successfully from {csv_file} file")

    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
CSV FILE
The Result in DataBase
So if there is a space inside a name or any other word or number, the data is not uploaded correctly, and if there are decimals like 9.2 or 8.7 they are not uploaded exactly either.
How can I solve this problem?
I think the issue is with this line:
lines = csv_file_obj['Body'].read().decode('utf-8').split()
When no parameter is specified, split() breaks the string at any whitespace, including the spaces inside your field values.
You should probably use split('\n') instead.
Alternatively:
Download the file to disk (instead of using read())
Use the default behaviour of the CSV Reader (which knows how to break on lines)
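As a rough sketch of that fix applied to the handler above (splitlines() is an equivalent alternative that also copes with \r\n line endings):

# Split on line boundaries only, not on every whitespace character
lines = csv_file_obj['Body'].read().decode('utf-8').splitlines()

results = []
for row in csv.DictReader(lines, skipinitialspace=True, delimiter=',', quotechar='"', doublequote=True):
    # Convert each row's values into a plain list for executemany()
    results.append(list(row.values()))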
Writing a file to S3 with Spark usually creates a directory containing a _SUCCESS marker plus part-* files that hold the actual data. How do I load that data into a pandas DataFrame, given that the part file names vary on every run?
For example the file path at the time of writing :
df.coalesce(10).write.parquet("s3://testfolder.csv")
The files stored in the directory are:
- _SUCCESS
- part-00-*.parquet
I have a Python job which reads the files into a pandas DataFrame:
pd.read(s3\\..........what is the path to specify here.................)
When writing files with Spark, you cannot choose the name of the output file (you can pass a path, but you end up with the directory layout you described above). If you want a single file to later load into pandas, you would do something like this:
df.repartition(1).write.parquet(path="s3://testfolder/", mode='append')
The end result will be a single file in "s3://testfolder/" whose name starts with part-00- and ends with .parquet. You can simply read that file in, or rename it to something specific before reading it with pandas.
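If you'd rather not rename anything, here is a minimal sketch of locating that single part file and handing it to pandas (assuming "testfolder" is the bucket name from the path above, and that s3fs or pyarrow's S3 support is installed so pd.read_parquet can read an s3:// path directly):

import boto3
import pandas as pd

s3 = boto3.client('s3')

# Find the part-* file that Spark wrote into the bucket
response = s3.list_objects_v2(Bucket='testfolder')
part_keys = [obj['Key'] for obj in response['Contents']
             if obj['Key'].split('/')[-1].startswith('part-')]

# With repartition(1) there should be exactly one data file
df = pd.read_parquet('s3://testfolder/' + part_keys[0])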
Option 1: (Recommended)
You can use awswrangler. It's a lightweight library that aids the integration between pandas, S3, and Parquet, and it lets you read multiple files from a directory.
pip install awswrangler
import awswrangler as wr
df = wr.s3.read_parquet(path='s3://testfolder/')
Option 2:
############################## RETRIEVE KEYS FROM THE BUCKET ##################################
import boto3
import pandas as pd
s3 = boto3.client('s3')
s3_bucket_name = 'your bucket name'
prefix = 'path where the files are located'
response = s3.list_objects_v2(
    Bucket=s3_bucket_name,
    Prefix=prefix
)

keys = []
for obj in response['Contents']:
    keys.append(obj['Key'])

##################################### READ IN THE FILES #######################################

df = []
for key in keys:
    df.append(pd.read_parquet(path='s3://' + s3_bucket_name + '/' + key, engine='pyarrow'))
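Note that the loop above leaves df as a list of DataFrames, one per key. If you want a single DataFrame, a concatenation step along these lines (my addition, not part of the original answer) finishes the job:

# Combine the per-file DataFrames into one; ignore_index renumbers the rows
df = pd.concat(df, ignore_index=True)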
If I have a file in multiple folders in S3, how do I combine them together using boto3 python
Say in a bucket I have
bucket_a
ts
ts_folder
a_date.csv
b_date.csv
c_date.csv
d_date.csv
ts_folder2
a_date.csv
b_date.csv
c_date.csv
d_date.csv
I need to combine these files into one file, also ignoring the header in the second file.
I am trying to figure out how to achieve this using boto3 in Python, or with another AWS tool.
Try something like this. I assume you have your AWS credentials set up properly on your system. My suggestion would be to first add the lines of each CSV to a new variable; for the second CSV you skip the first line. After collecting all the lines, you join them into a single string so it can be written to an S3 object.
import boto3

# Output will contain the CSV lines
output = []

with open("first.csv", "r") as fh:
    output.extend(fh.readlines())

with open("second.csv", "r") as fh:
    # Skip header
    output.extend(fh.readlines()[1:])

# Combine the lines as string
body = "".join(output)

# Create the S3 client (assuming credentials are setup)
s3_client = boto3.client("s3")

# Write the object
s3_client.put_object(Bucket="my-bucket",
                     Key="combined.csv",
                     Body=body)
Update
This should help you with the S3 setup
import boto3

session = boto3.session.Session(profile_name='dev')
s3_client = session.client("s3")

bucket = "my-bucket"

files = []
for item in s3_client.list_objects_v2(Bucket=bucket, Prefix="ts/")['Contents']:
    if item['Key'].endswith(".csv"):
        files.append(item['Key'])

output = []
for file in files:
    # read() returns bytes, so decode before joining the pieces as strings
    body = s3_client.get_object(Bucket=bucket,
                                Key=file)["Body"].read().decode("utf-8")
    output.append(body)

# Combine the lines as string
outputbody = "".join(output)

# Write the object
s3_client.put_object(Bucket=bucket,
                     Key="combined.csv",
                     Body=outputbody)
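One thing the update above doesn't do is drop the header row from the second and later files, which the question asked for. A rough sketch of that, assuming every file starts with the same single header line:

output = []
for i, file in enumerate(files):
    body = s3_client.get_object(Bucket=bucket,
                                Key=file)["Body"].read().decode("utf-8")
    lines = body.splitlines(True)
    # Keep the header only for the first file
    output.extend(lines if i == 0 else lines[1:])

outputbody = "".join(output)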
There's a CSV file in a S3 bucket that I want to parse and turn into a dictionary in Python. Using Boto3, I called the s3.get_object(<bucket_name>, <key>) function and that returns a dictionary which includes a "Body" : StreamingBody() key-value pair that apparently contains the data I want.
In my python file, I've added import csv and the examples I see online on how to read a csv file, you pass the file name such as:
with open(<csv_file_name>, mode='r') as file:
reader = csv.reader(file)
However, I'm not sure how to retrieve the CSV file name from the StreamingBody, if that's even possible. If not, is there a better way for me to read the CSV file in Python? Thanks!
Edit: Wanted to add that I'm doing this in AWS Lambda and there are documented issues with using pandas in Lambda, so this is why I wanted to use the csv library and not pandas.
csv.reader does not require a file. It can use anything that iterates through lines, including files and lists.
So you don't need a filename. Just pass the lines from response['Body'] directly into the reader. One way to do that is
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
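For completeness, here is a minimal sketch of that inside a Lambda handler (the bucket and key below are placeholders; in practice you would pull them from the S3 event record):

import csv
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Placeholder bucket/key, not your real names
    response = s3.get_object(Bucket='my-bucket', Key='data.csv')
    lines = response['Body'].read().decode('utf-8').splitlines(True)

    # DictReader turns each line into a dict keyed by the CSV header row
    for row in csv.DictReader(lines):
        print(row)

    return {'statusCode': 200}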
To retrieve and read CSV file from s3 bucket, you can use the following code:
import csv
import boto3
from django.conf import settings
bucket_name = "your-bucket-name"
file_name = "your-file-name-exists-in-that-bucket.csv"
s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)

bucket = s3.Bucket(bucket_name)
obj = bucket.Object(key=file_name)
response = obj.get()
lines = response['Body'].read().decode('utf-8').splitlines(True)

reader = csv.DictReader(lines)
for row in reader:
    # csv_header_key1, csv_header_key2 are the header names defined in your CSV header
    print(row['csv_header_key1'], row['csv_header_key2'])
I am using Google App Engine (Python), and I want my users to be able to download a CSV file generated from some data in the datastore (but I don't want them to download the whole thing, as I re-order the columns and so on).
I have to use the csv module because there can be cells containing commas. But the problem is that doing so means writing a file, which is not allowed on Google App Engine.
What I currently have is something like this:
tmp = open("tmp.csv", 'w')
writer = csv.writer(tmp)
writer.writerow(["foo", "foo,bar", "bar"])
So I guess what I want is either a way to handle cells with commas myself, or a way to use the csv module without writing a file, since that is not possible on GAE.
I found a way to use the CSV module on GAE! Here it is:
self.response.headers['Content-Type'] = 'application/csv'
writer = csv.writer(self.response.out)
writer.writerow(["foo", "foo,bar", "bar"])
This way you don't need to write any files.
Here is a complete example of using the Python CSV module in GAE. I typically use it for creating a csv file from a gql query and prompting the user to save or open it.
import csv
import webapp2

class MyDownloadHandler(webapp2.RequestHandler):
    def get(self):
        q = ModelName.gql("WHERE foo = 'bar' ORDER BY date ASC")
        reqs = q.fetch(1000)

        self.response.headers['Content-Type'] = 'text/csv'
        self.response.headers['Content-Disposition'] = 'attachment; filename=studenttransreqs.csv'
        writer = csv.writer(self.response.out)

        # create row labels
        writer.writerow(['Date', 'Time', 'User'])

        # iterate through the query, returning each instance as a row
        for req in reqs:
            writer.writerow([req.date, req.time, req.user])
Add the appropriate mapping so that when a link is clicked, the file dialog opens
('/mydownloadhandler',MyDownloadHandler),
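In context, that mapping sits inside the webapp2 application definition, roughly like this (add it alongside whatever routes you already have):

app = webapp2.WSGIApplication([
    ('/mydownloadhandler', MyDownloadHandler),
])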
import StringIO
tmp = StringIO.StringIO()
writer = csv.writer(tmp)
writer.writerow(["foo", "foo,bar", "bar"])
contents = tmp.getvalue()
tmp.close()
print contents
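If you're on Python 3 (the snippet above is Python 2), the same idea uses io.StringIO and print() as a function:

import csv
import io

tmp = io.StringIO()
writer = csv.writer(tmp)
writer.writerow(["foo", "foo,bar", "bar"])

contents = tmp.getvalue()
tmp.close()
print(contents)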