How to access AWS S3 data using boto3 - python

I am fairly new to both S3 as well as boto3. I am trying to read in some data in the following format:
https://blahblah.s3.amazonaws.com/data1.csv
https://blahblah.s3.amazonaws.com/data2.csv
https://blahblah.s3.amazonaws.com/data3.csv
I am importing boto3, and it seems like I would need to do something like:
import boto3
s3 = boto3.client('s3')
However, what should I do after creating this client if I want to read all the files separately into memory? (I am not supposed to download the data locally.) Ideally, I would like to read each CSV file into a separate Pandas DataFrame (which I know how to do once I know how to access the S3 data).
Please understand I'm fairly new to both boto3 as well as S3, so I don't even know where to begin.

You have two options, both of which you've already mentioned:
Downloading the file locally using download_file
s3.download_file(
    "<bucket-name>",
    "<key-of-file>",
    "<local-path-where-file-will-be-downloaded>"
)
See download_file
Loading the file contents into memory using get_object
response = s3.get_object(Bucket="<bucket-name>", Key="<key-of-file>")
contentBody = response.get("Body")
# You need to read the content as it is a Stream
content = contentBody.read()
See get_object
Either approach is fine; just choose whichever one fits your scenario better.
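For the original goal (one DataFrame per CSV without touching the local disk), here is a minimal sketch built on get_object; the bucket and key names are assumptions read off the URLs in the question:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
bucket = 'blahblah'  # assumed from the example URLs
keys = ['data1.csv', 'data2.csv', 'data3.csv']

frames = {}
for key in keys:
    response = s3.get_object(Bucket=bucket, Key=key)
    # Body is a stream; read it once and hand the bytes to pandas
    frames[key] = pd.read_csv(io.BytesIO(response['Body'].read()))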

Try this:
import boto3
s3 = boto3.resource('s3')
obj = s3.Object('<bucketname>', '<itemname>')
body = obj.get()['Body'].read()

Related

Uploading excel file the second time producing an empty excel file in S3 using python

import pathlib

import boto3

def upload(excel, bucket, file_name):
    S3_CLIENT = boto3.client("s3")
    fp = pathlib.Path(f"/tmp/{file_name}")
    try:
        fp.write_bytes(excel.get_workbook_data())
        S3_CLIENT.upload_file(
            str(fp.absolute()),
            bucket,
            'file_key_in_s3',
            ExtraArgs={"ContentType": 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'}
        )
    finally:
        fp.unlink()
This is my upload-to-S3 method. It is a Lambda attached to API Gateway, so when I hit it the first time it generates a proper Excel file, but when I hit it the second time it generates an empty Excel spreadsheet and uploads that to S3.
get_workbook_data is a method that gets the information in the Excel workbook.
PS: If I hit the same method again after 5 minutes, I see a proper Excel file.
Did anyone encounter the same issue?
Could someone please help me out?
I have tried using different headers in Postman, but in vain.

How can I read from a CSV file from an S3 bucket, apply certain if-statements to it, and write a new updated CSV file and place it in the S3 bucket?

I'm having trouble writing a new CSV file into an S3 bucket. I want to be able to read a CSV file that I have in an S3 bucket, and if one of the values in the CSV fits a certain requirement, I want to change it to a different value. I've read that it's not possible to edit an S3 object, so I need to create a new one every time. In short, I want to create a new, updated CSV file from another CSV file in an S3 bucket, with changes applied.
I'm trying to use DictWriter and DictReader, but I always run into issues with DictWriter. I can read the CSV file properly, but when I try to update it, I run into a variety of issues with DictWriter. Here is my current code:
# Function to be pasted into AWS Lambda.
# Accesses the S3 bucket, opens the CSV file, and receives the response line by line.

# To be able to access S3 buckets and the objects within the bucket
import boto3
# To be able to read the CSV by using DictReader
import csv

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key = 'Insurance.csv')
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    reader = csv.DictReader(lines)

    with open("s3://testing-bucket-1042/Insurance.csv", newline = '') as csvfile:
        reader = csv.DictReader(csvfile)
        fieldnames = ['county', 'eq_site_limit']
        writer = csv.DictWriter(lines, fieldnames=fieldnames)
        for row in reader:
            writer.writeheader()
            if row['county'] == "CLAY":  # if the row is under the column 'county', and contains the string "CLAY"
                writer.writerow({'county': 'CHANGED'})
            if row['eq_site_limit'] == "0":  # if the row is under the column 'eq_site_limit', and contains the string "0"
                writer.writerow({'eq_site_limit': '9000'})
Right now, the error that I am getting is that the path I use when attempting to open the CSV, "s3://testing-bucket-1042/Insurance.csv", is said to not exist.
The error says
"errorMessage": "[Errno 2] No such file or directory: 's3://testing-bucket-1042/Insurance.csv'",
"errorType": "FileNotFoundError"
What would be the correct way to use DictWriter, if at all?
First of all, s3:// is not a protocol that Python's built-in open() understands, which is why you get that error message. It is good that you stated your intentions.
Okay, I refactored your code:
import codecs
import boto3
# To be able to read the CSV by using DictReader
import csv
from io import StringIO

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key = 'Insurance.csv')
    stream = codecs.getreader('utf-8')(obj.get()['Body'])
    lines = list(csv.DictReader(stream))
    ### now you have your object there

    csv_buffer = StringIO()
    out = csv.DictWriter(csv_buffer, fieldnames=['county', 'eq_site_limit'])
    for row in lines:
        if row['county'] == "CLAY":
            out.writerow({'county': 'CHANGED'})
        if row['eq_site_limit'] == "0":
            out.writerow({'eq_site_limit': '9000'})

    ### now write the content into some different bucket/key
    s3client = boto3.client('s3')
    s3client.put_object(Body=csv_buffer.getvalue().encode('utf-8'),
                        Bucket='<target-bucket>', Key='<target-key>')
I hope that this works. Basically, there are a few tricks:
use codecs to stream the CSV data directly from the S3 bucket
use StringIO to create an in-memory stream that csv.DictWriter can write to
when you are finished, one way to "upload" your content is through the S3 client's put_object method (as documented in AWS)
To logically separate AWS code from business logic, I normally recommend this approach:
Download the object from Amazon S3 to the /tmp directory
Perform desired business logic (read file, write file)
Upload the resulting file to Amazon S3
Using download_file() and upload_file() avoids having to worry about in-memory streams. It means you can take logic that normally operates on files (eg on your own computer) and apply it to files obtained from S3.
It comes down to personal preference.
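A minimal sketch of that download/transform/upload pattern, reusing the bucket and key from the question (the output key is only an example):

import csv
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # 1. Download the object to /tmp
    s3.download_file('testing-bucket-1042', 'Insurance.csv', '/tmp/Insurance.csv')

    # 2. Plain file-based business logic
    with open('/tmp/Insurance.csv', newline='') as src, \
         open('/tmp/Insurance_updated.csv', 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row['county'] == 'CLAY':
                row['county'] = 'CHANGED'
            if row['eq_site_limit'] == '0':
                row['eq_site_limit'] = '9000'
            writer.writerow(row)

    # 3. Upload the result as a new object
    s3.upload_file('/tmp/Insurance_updated.csv', 'testing-bucket-1042', 'Insurance_updated.csv')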
You can use the streaming functionality of the AWS CLI's s3 cp command to make changes on the fly. It is well suited to text manipulation tools such as awk and sed.
Example:
aws s3 cp s3://bucketname/file.csv - | sed 's/foo/bar/g' | aws s3 cp - s3://bucketname/new-file.csv
AWS Docs: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html

AWS Lambda and S3: passing s3 object path to image process function

My intention is to have a large image stored on my S3 server and then get a lambda function to read/process the file and save the resulting output(s). I'm using a package called python-bioformats to work with a proprietary image file (which is basically a whole bunch of tiffs stacked together). When I use
def lambda_handler(event, context):
    import boto3
    import bioformats

    key = event['Records'][0]['s3']['object']['key'].encode("utf-8")
    bucket = 'bucketname'
    s3 = boto3.resource('s3')
    imageobj = s3.Object(bucket, key).get()['Body'].read()
    bioformats.get_omexml_metadata(imageobj)
I have a feeling that the lambda function tries to download the entire file (5GB) when making imageobj. Is there a way I can just get the second function (which takes a filepath as argument) to refer to the s3 object in a filepath-like manner? I'd also like to not expose the s3 bucket/object publicly, so doing this server-side would be ideal.
If your bioformats.get_omexml_metadata() function requires a filepath as an argument, then you will need to have the object downloaded before calling the function.
This could be a problem in an AWS Lambda function because ephemeral disk space is limited to 512 MB (and only in /tmp/).
If the data can instead be processed as a stream, you could read the data as it is required without saving to disk first. However, the python-bioformats documentation does not show this as an option. In fact, I would be surprised if your above code works, given that it is expecting a path while imageobj is the contents of the file.
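For objects that do fit within /tmp, a hedged sketch of the download-then-call pattern (the bucket name is a placeholder and the metadata call mirrors the question; treat this as an outline rather than a tested integration):

import boto3
import bioformats  # assumes python-bioformats is packaged with the function

s3 = boto3.client('s3')

def lambda_handler(event, context):
    key = event['Records'][0]['s3']['object']['key']
    bucket = 'bucketname'  # placeholder, as in the question
    local_path = '/tmp/' + key.split('/')[-1]

    # Download the object to /tmp -- only viable when it fits in ephemeral storage
    s3.download_file(bucket, key, local_path)

    # Hand the local path to the metadata call, which expects a file path
    metadata = bioformats.get_omexml_metadata(local_path)
    return metadata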

Reading Data From Cloud Storage Via Cloud Functions

I am trying to do a quick proof of concept for building a data processing pipeline in Python. To do this, I want to build a Google Cloud Function that will be triggered when certain .csv files are dropped into Cloud Storage.
I followed along with this Google Cloud Functions Python tutorial, and while the sample code does trigger the Function and create some simple logs when a file is dropped, I am really stuck on what call I have to make to actually read the contents of the data. I tried to search for an SDK/API guidance document but have not been able to find it.
In case this is relevant, once I process the .csv, I want to be able to add some data that I extract from it into GCP's Pub/Sub.
The function does not actually receive the contents of the file, just some metadata about it.
You'll want to use the google-cloud-storage client. See the "Downloading Objects" guide for more details.
Putting that together with the tutorial you're using, you get a function like:
from google.cloud import storage

storage_client = storage.Client()

def hello_gcs_generic(data, context):
    bucket = storage_client.get_bucket(data['bucket'])
    blob = bucket.blob(data['name'])
    contents = blob.download_as_string()
    # Process the file contents, etc...
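As a hedged continuation of that snippet, the downloaded bytes could then be parsed as CSV before doing anything else with them (the helper name is just for illustration):

import csv
import io

def process_contents(contents):
    # contents is the bytes value returned by download_as_string()
    reader = csv.DictReader(io.StringIO(contents.decode('utf-8')))
    for row in reader:
        print(row)  # replace with your extraction logic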
This is an alternative solution using pandas:
Cloud Function Code:
import pandas as pd

def GCSDataRead(event, context):
    bucketName = event['bucket']
    blobName = event['name']
    fileName = "gs://" + bucketName + "/" + blobName
    dataFrame = pd.read_csv(fileName, sep=",")
    print(dataFrame)
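Since the question also mentions pushing extracted data into Pub/Sub, here is a hedged sketch using the google-cloud-pubsub client; the project and topic names are placeholders that would need to exist in your setup:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic -- adjust to your environment
topic_path = publisher.topic_path('my-project', 'csv-rows')

def publish_rows(dataFrame):
    for record in dataFrame.to_dict(orient='records'):
        # Pub/Sub message payloads must be bytes
        publisher.publish(topic_path, data=str(record).encode('utf-8'))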

JSON format validation without loading file

How can I validate JSON format without loading the file? I am copying files from one S3 bucket to another S3 bucket. After the JSONL files are copied, I want to check that the file format is correct, in the sense that the curly braces and commas are fine.
I don't want to use json.load() because the files are large and numerous and it would slow down the process; besides, the files are already copied, so there is no need to parse them, just to validate them.
There is no capability within Amazon S3 itself to validate the content of objects.
You could configure S3 to trigger an AWS Lambda function whenever a file is created in the S3 bucket. The Lambda function could then parse the file and perform some action (eg send a notification or move the object to another location) if the validation fails.
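A hedged sketch of such a Lambda function, validating a JSONL object line by line without loading the whole file (the follow-up action is left as a comment; iter_lines is available on recent botocore versions):

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    body = s3.get_object(Bucket=bucket, Key=key)['Body']

    for line_number, line in enumerate(body.iter_lines(), start=1):
        if not line:
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            print(f"Invalid JSON on line {line_number} of s3://{bucket}/{key}")
            # e.g. send a notification or move the object elsewhere
            break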
Streaming the file seems to be the way to go about it: put it into a generator and yield line by line to check whether the JSON is valid. The requests library supports streaming a file.
The solution would look something like this:
import requests

def get_data():
    r = requests.get('s3_file_url', stream=True)
    yield from r.iter_lines()

def parse_data():
    # initialize the generator
    gd_gen = get_data()
    while True:
        try:
            line = gd_gen.__next__()
        except StopIteration:
            break
        # put your validation code here
Let me know if you need any further clarification.
