If I have a file in multiple folders in S3, how do I combine them using boto3 and Python?
Say in a bucket I have
bucket_a
  ts
    ts_folder
      a_date.csv
      b_date.csv
      c_date.csv
      d_date.csv
    ts_folder2
      a_date.csv
      b_date.csv
      c_date.csv
      d_date.csv
I need to combine these two files into one file, and ignore the header in the second file.
I am trying to figure out how to achieve this using boto3 in Python or AWS.
Try something like this. I assume your AWS credentials are set up properly on your system. My suggestion is to first collect the lines of the first CSV in a list, then append the lines of the second CSV while skipping its first line (the header). Once you have all the lines, join them into a single string so they can be written to an S3 object.
import boto3

# Output will contain the CSV lines
output = []

with open("first.csv", "r") as fh:
    output.extend(fh.readlines())

with open("second.csv", "r") as fh:
    # Skip header
    output.extend(fh.readlines()[1:])

# Combine the lines as string
body = "".join(output)

# Create the S3 client (assuming credentials are setup)
s3_client = boto3.client("s3")

# Write the object
s3_client.put_object(Bucket="my-bucket",
                     Key="combined.csv",
                     Body=body)
Update
This should help you with the S3 setup
import boto3

session = boto3.session.Session(profile_name='dev')
s3_client = session.client("s3")

bucket = "my-bucket"
files = []

# Collect the keys of the CSV objects under the "ts/" prefix
for item in s3_client.list_objects_v2(Bucket=bucket, Prefix="ts/")['Contents']:
    if item['Key'].endswith(".csv"):
        files.append(item['Key'])

output = []
for file in files:
    # get_object returns bytes, so decode before joining into a string
    body = s3_client.get_object(Bucket=bucket,
                                Key=file)["Body"].read().decode("utf-8")
    output.append(body)

# Combine the contents as one string
outputbody = "".join(output)

# Write the combined object
s3_client.put_object(Bucket=bucket,
                     Key="combined.csv",
                     Body=outputbody)
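Note that the snippet above simply concatenates the files, so the header of every file after the first would be repeated. A minimal sketch of how you might drop those extra headers, assuming each CSV has a single header line:

output = []
for i, file in enumerate(files):
    body = s3_client.get_object(Bucket=bucket,
                                Key=file)["Body"].read().decode("utf-8")
    lines = body.splitlines(keepends=True)
    if i > 0:
        # Assumes the first line of every file is a header; keep it only once
        lines = lines[1:]
    output.extend(lines)

outputbody = "".join(output)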
Related
I have a CSV file containing numerous UUIDs.
I'd like to write a Python script using boto3 which:
Connects to an AWS S3 bucket
Uses each UUID contained in the CSV to copy the file stored under it
Files are all stored under a path like this: BUCKET/ORG/FOLDER1/UUID/DATA/FILE.PNG
However, the file inside DATA/ can be of different file types.
Puts the copied file in a new S3 bucket
So far, I have successfully connected to the S3 bucket and listed its contents in Python using boto3, but I need help implementing the rest.
import boto3

# Create Session
session = boto3.Session(
    aws_access_key_id='ACCESS_KEY_ID',
    aws_secret_access_key='SECRET_ACCESS_KEY',
)

# Initiate S3 Resource
s3 = session.resource('s3')

your_bucket = s3.Bucket('BUCKET-NAME')

for s3_file in your_bucket.objects.all():
    print(s3_file.key)  # prints the contents of the bucket
To read the CSV file you can use the csv library (see: https://docs.python.org/fr/3.6/library/csv.html)
Example:
import csv

with open('file.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
To push files to the new bucket, you can use the copy method (see: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.copy)
Example:
import boto3

s3 = boto3.resource('s3')

source = {
    'Bucket': 'BUCKET-NAME',
    'Key': 'mykey'
}

bucket = s3.Bucket('SECOND_BUCKET-NAME')
# The second argument is the destination key, not the bucket name
bucket.copy(source, 'mykey')
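Putting the pieces together for your use case, here is a minimal sketch, assuming the CSV has one UUID per row in the first column and that ORG and FOLDER1 are fixed prefixes; the file name uuids.csv and the bucket names are placeholders to adjust to your actual layout:

import csv
import boto3

s3 = boto3.resource('s3')
source_bucket_name = 'BUCKET-NAME'                     # placeholder source bucket
source_bucket = s3.Bucket(source_bucket_name)
destination_bucket = s3.Bucket('SECOND_BUCKET-NAME')   # placeholder target bucket

# Read the UUIDs from the CSV (assumes one UUID per row, first column)
with open('uuids.csv', 'r') as f:
    uuids = [row[0] for row in csv.reader(f) if row]

for uuid in uuids:
    prefix = f'ORG/FOLDER1/{uuid}/DATA/'
    # Copy every object under the DATA/ prefix, whatever its file type
    for obj in source_bucket.objects.filter(Prefix=prefix):
        destination_bucket.copy({'Bucket': source_bucket_name, 'Key': obj.key}, obj.key)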
I'm using this script to query data from a CSV file that's saved on an AWS S3 bucket. It works well with CSV files that were originally saved in comma-separated format, but I have a lot of data saved with a tab delimiter (sep='\t'), which makes the code fail.
The original data is very large, which makes it difficult to rewrite. Is there a way to query data where we specify the delimiter/separator for the CSV file?
I took it from this post: https://towardsdatascience.com/how-i-improved-performance-retrieving-big-data-with-s3-select-2bd2850bc428 ... I'd like to thank the writer for the tutorial, which helped me save a lot of time.
Here's the code:
import boto3
import os
import pandas as pd

S3_KEY = r'source/df.csv'
S3_BUCKET = 'my_bucket'
TARGET_FILE = 'dataset.csv'

aws_access_key_id = 'my_key'
aws_secret_access_key = 'my_secret'

s3_client = boto3.client(service_name='s3',
                         region_name='us-east-1',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

query = """SELECT column1
           FROM S3Object
           WHERE column1 = '4223740573'"""

result = s3_client.select_object_content(Bucket=S3_BUCKET,
                                         Key=S3_KEY,
                                         ExpressionType='SQL',
                                         Expression=query,
                                         InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}},
                                         OutputSerialization={'CSV': {}})

# remove the file if exists, since we append filtered rows line by line
if os.path.exists(TARGET_FILE):
    os.remove(TARGET_FILE)

with open(TARGET_FILE, 'a+') as filtered_file:
    # write header as a first line, then append each row from S3 select
    filtered_file.write('Column1\n')
    for record in result['Payload']:
        if 'Records' in record:
            res = record['Records']['Payload'].decode('utf-8')
            filtered_file.write(res)

result = pd.read_csv(TARGET_FILE)
The InputSerialization option also allows you to specify:
RecordDelimiter - A single character used to separate individual records in the input. Instead of the default value, you can specify an arbitrary delimiter.
So you could try:
result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use', 'RecordDelimiter': '\t'}},
    OutputSerialization={'CSV': {}})
Actually, I had a TSV file, and I used this InputSerialization:
InputSerialization={'CSV': {'FileHeaderInfo': 'None', 'RecordDelimiter': '\n', 'FieldDelimiter': '\t'}}
This works for files that have newlines between records (not tabs) and tabs between the fields.
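For completeness, a full call for a tab-separated file might look like the sketch below. It uses FileHeaderInfo 'Use' (rather than the 'None' above) so the asker's existing query can keep referencing columns by name; with 'None' you would refer to columns positionally (_1, _2, ...):

result = s3_client.select_object_content(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    ExpressionType='SQL',
    Expression=query,
    # Records separated by newlines, fields separated by tabs
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use',
                                'RecordDelimiter': '\n',
                                'FieldDelimiter': '\t'}},
    OutputSerialization={'CSV': {}})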
I need to read multiple CSV files from an S3 bucket with boto3 in Python and finally combine them into a single DataFrame in pandas.
I am able to read a single file with the following script in Python:
s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
Following is my path
files/splittedfiles/Code-345678
In Code-345678 I have multiple CSV files which I have to read and combine into a single DataFrame in pandas.
Also, how do I pass a list of selected codes, so that it reads only those folders? e.g.
files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682
From the above, I need to read files under the following codes only:
345678,345679,345682
How can I do it in python?
The boto3 API does not support reading multiple objects at once. What you can do is retrieve all objects with a specified prefix and load each of the returned objects with a loop. To do this you can use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a simple change to your code that retrieves all the objects with the prefix "files/splittedfiles/Code-345678"; you can then loop through those objects and load each file into a DataFrame:
s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")

for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
If you have multiple prefixes you want to evaluate, you can take the above and turn it into a function where the prefix is a parameter, then combine the results together. The function could look something like this:
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_df = []
    for obj in prefix_objs:
        key = obj.key
        body = obj.get()['Body'].read()
        df = pd.DataFrame(body)
        prefix_df.append(df)
    return pd.concat(prefix_df)
Then you can iteratively apply this function to each prefix and combine the results in the end.
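For the selected codes in your example, a minimal sketch of how you might drive that function, assuming read_prefix_to_df is defined as above:

codes = ['345678', '345679', '345682']
dfs = [read_prefix_to_df(f"files/splittedfiles/Code-{code}") for code in codes]
combined = pd.concat(dfs, ignore_index=True)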
Modifying Answer 1 to overcome the error "DataFrame constructor not properly called!":
Code:
import boto3
import pandas as pd
import io

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
prefix_objs = bucket.objects.filter(Prefix="folder_path/prefix")

prefix_df = []
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    # Parse the raw bytes with pandas instead of passing them to the DataFrame constructor
    temp = pd.read_csv(io.BytesIO(body), encoding='utf8')
    prefix_df.append(temp)
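As in the original answer, you can then combine the collected frames into one DataFrame:

combined_df = pd.concat(prefix_df, ignore_index=True)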
You can do it like this, using "filter" instead of "all":
for obj in bucket.objects.filter(Prefix='files/splittedfiles/'):
    key = obj.key
    body = obj.get()['Body'].read()
There's a CSV file in an S3 bucket that I want to parse and turn into a dictionary in Python. Using boto3, I called the s3.get_object(<bucket_name>, <key>) function, which returns a dictionary that includes a "Body": StreamingBody() key-value pair that apparently contains the data I want.
In my Python file, I've added import csv, and in the examples I see online for reading a CSV file, you pass the file name, such as:
with open(<csv_file_name>, mode='r') as file:
    reader = csv.reader(file)
However, I'm not sure how to retrieve the CSV file name from the StreamingBody, if that's even possible. If not, is there a better way for me to read the CSV file in Python? Thanks!
Edit: Wanted to add that I'm doing this in AWS Lambda and there are documented issues with using pandas in Lambda, so this is why I wanted to use the csv library and not pandas.
csv.reader does not require a file. It can use anything that iterates through lines, including files and lists.
So you don't need a filename. Just pass the lines from response['Body'] directly into the reader. One way to do that is
lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
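In a Lambda context, a minimal sketch putting this together might look like the following; the bucket and key names are placeholders, and csv.DictReader is used here so each row comes back as a dictionary:

import csv
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Placeholder bucket and key; replace with your own
    response = s3.get_object(Bucket='my-bucket', Key='data.csv')
    lines = response['Body'].read().decode('utf-8').splitlines(True)
    rows = list(csv.DictReader(lines))  # list of dicts keyed by the CSV header
    return {'row_count': len(rows)}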
To retrieve and read a CSV file from an S3 bucket, you can use the following code:
import csv
import boto3
from django.conf import settings

bucket_name = "your-bucket-name"
file_name = "your-file-name-exists-in-that-bucket.csv"

s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(bucket_name)
obj = bucket.Object(key=file_name)
response = obj.get()
lines = response['Body'].read().decode('utf-8').splitlines(True)

reader = csv.DictReader(lines)
for row in reader:
    # csv_header_key1/csv_header_key2 are the header keys defined in your CSV header
    print(row['csv_header_key1'], row['csv_header_key2'])
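If you instead need the whole file as a single dictionary keyed by one of its columns (csv_header_key1 here is just the example header name from above), something like this works:

data = {row['csv_header_key1']: row for row in csv.DictReader(lines)}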
I have a project task to use, in an EMR task, some output data I have already produced on S3. Previously I ran an EMR job that produced output in one of my S3 buckets, in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
Download the file to your local system and parse it (simple, quick and easy).
Read the data stored on S3 into memory and parse it (a bit more complex for huge files).
Step 1:
On S3, filenames are stored as keys; if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the code below to download the file into a temp folder.
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'
tempPath = '/tmp/Demo'  # local path where the downloaded file will be saved

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')

source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if fileName in name.name:
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In case of huge files (> 1 GB) you may run into memory errors; to avoid them, you will have to write a lazy function that reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
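A minimal sketch of such a lazy reader, assuming the same boto key object (name) as above and a hypothetical chunk size and process function:

def read_in_chunks(key, chunk_size=100 * 1024 * 1024):
    # Yields the object's contents chunk by chunk using HTTP Range requests
    start = 0
    total_size = key.size
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield key.get_contents_as_string(headers={'Range': 'bytes=%d-%d' % (start, end)})
        start = end + 1

for chunk in read_in_chunks(name):
    process(chunk)  # hypothetical function that handles each chunk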
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any queries you have.