Reading Data from AWS S3 - python

I have some data in a very particular format (e.g., tdms files generated by NI systems) and I stored it in an S3 bucket. If the data were stored on my local computer, I would typically read it in Python with the npTDMS package. But how should I read these tdms files when they are stored in an S3 bucket? One solution is to download the data, for instance to an EC2 instance, and then use the npTDMS package to read it into Python. But that does not seem like a perfect solution. Is there any way I can read the data the same way I read CSV files from S3?

Some Python packages (such as Pandas) support reading data directly from S3, since it is one of the most popular locations for data. See this question for an example of how to do that with Pandas.
If the package (npTDMS) doesn't support reading directly from S3, you should copy the data to the local disk of the notebook instance.
The simplest way to copy is to run the AWS CLI in a cell of your notebook:
!aws s3 cp s3://bucket_name/path_to_your_data/ data/ --recursive
This command recursively copies all the files under that "folder" (prefix) in S3 to the local folder data.
You can do a more fine-grained copy, filtering files or meeting other specific requirements, using boto3's richer capabilities. For example:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
# iterate over every object whose key starts with the given prefix
for obj in bucket.objects.filter(Prefix='myprefix'):
    # download each object to a local file named after its key
    # (keys containing '/' require the matching local directories to exist)
    bucket.download_file(obj.key, obj.key)

import boto3

s3 = boto3.resource('s3')
bucketname = "your-bucket-name"
filename = "key/of/the/file/you/want/to/read"
obj = s3.Object(bucketname, filename)
# read the whole object into memory as bytes
body = obj.get()['Body'].read()
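To come back to the original question about tdms files: once you have the object body as bytes, you can try handing npTDMS a file-like object instead of a path. This is a minimal sketch, assuming your npTDMS version's TdmsFile.read() accepts an already opened file object (the bucket name and key below are placeholders):

import boto3
from io import BytesIO
from nptdms import TdmsFile

s3 = boto3.resource('s3')
obj = s3.Object("your-bucket-name", "path/to/your/file.tdms")

# pull the whole object into an in-memory buffer instead of a local file
buffer = BytesIO(obj.get()['Body'].read())

# TdmsFile.read accepts a path or an opened file-like object (assumption noted above)
tdms_file = TdmsFile.read(buffer)
for group in tdms_file.groups():
    print(group.name)

Note that this reads the whole file into memory, so for very large tdms files the download-to-disk approach above may still be preferable.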

boto3 is the default option; as an alternative, awswrangler provides some nice wrappers.
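For instance, a brief sketch with awswrangler (the bucket and paths are placeholders, and this assumes an awswrangler 2.x+ release where wr.s3.read_csv and wr.s3.download are available):

import awswrangler as wr

# read a CSV stored in S3 straight into a pandas DataFrame
df = wr.s3.read_csv("s3://your-bucket-name/path/to/data.csv")

# for formats awswrangler cannot parse (e.g. tdms), download the object
# to local disk first and then open it with the specialised library
wr.s3.download(path="s3://your-bucket-name/path/to/data.tdms",
               local_file="/tmp/data.tdms")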

Related

How can I access the created folder in s3 to write csv file into it?

I have the code that creates the folder, but how can I access that folder to write a CSV file into it?
# Creating folder on S3 for unmatched data
import boto3

client = boto3.client('s3')

# Variables
target_bucket = obj['source_and_destination_details']['s3_bucket_name']
subfolder = obj['source_and_destination_details']['s3_bucket_uri-new_folder_path'] + obj['source_and_destination_details']['folder_name_for_unmatched_column_data']

# Create subfolder (a zero-byte object whose key acts as the "folder")
client.put_object(Bucket=target_bucket, Key=subfolder)
The folder is created successfully by the above code, but how do I write a CSV file into it?
Below is the code I tried, but it isn't working:
# Writing csv on AWS S3
df.reindex(idx).to_csv(obj['source_and_destination_details']['s3_bucket_uri-write'] + obj['source_and_destination_details']['folder_name_for_unmatched_column_data'] + obj['source_and_destination_details']['file_name_for_unmatched_column_data'], index=False)
An S3 bucket is not a file system.
I assume the to_csv() method is supposed to write to some sort of file system, but that is not how S3 works. While there are solutions to mount S3 buckets as file systems, that is not the preferred way.
Usually, you would interact with S3 via the AWS REST APIs, the AWS CLI or a client library such as Boto, which you’re already using.
So in order to store your content on S3, you first create the file locally, e.g. in the system's /tmp folder. Then use Boto's put_object() method to upload the file. Remove it from your local storage afterwards.
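A minimal sketch of that approach (bucket and key names are placeholders, and df/idx stand in for the DataFrame and index from the question):

import os
import boto3
import pandas as pd

# stand-ins for the DataFrame and index from the question
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4]})
idx = df.index

client = boto3.client('s3')

# 1. write the CSV to local, temporary storage first
local_path = "/tmp/unmatched_data.csv"
df.reindex(idx).to_csv(local_path, index=False)

# 2. upload it under the "folder" (really just a key prefix) in S3
with open(local_path, "rb") as f:
    client.put_object(Bucket="your-target-bucket",
                      Key="your/subfolder/unmatched_data.csv",
                      Body=f)

# 3. remove the local copy afterwards
os.remove(local_path)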

Automatically Upload New Files in SharePoint to S3 with Python

I'm very new to AWS and relatively new to Python. Please go easy on me.
I want to upload files from a SharePoint location to an S3 bucket. From there, I'll be able to perform analysis on those files.
The code below uploads a file in a local directory to an example S3 bucket. I'd like to modify this so that it uploads only new files from a SharePoint location (and does not re-upload files that are already in the bucket).
import boto3

BUCKET_NAME = "test_bucket"
s3 = boto3.client("s3")

with open("./burger.jpg", "rb") as f:
    s3.upload_fileobj(f, BUCKET_NAME, "burger_new_upload.jpg", ExtraArgs={"ACL": "public-read"})
Would AWS Lambda (with Python code) be useful here? Thank you for sharing your knowledge.
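One building block for uploading only new files is to check whether each key already exists in the bucket before uploading it. Below is a rough sketch of that idea using boto3's head_object; the bucket name and file list are placeholders, and fetching the files from SharePoint itself is not covered here:

import os
import boto3
from botocore.exceptions import ClientError

BUCKET_NAME = "test_bucket"
s3 = boto3.client("s3")

def object_exists(bucket, key):
    """Return True if the key is already present in the bucket."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

# placeholder list of local files previously pulled down from SharePoint
local_files = ["./burger.jpg", "./fries.jpg"]

for path in local_files:
    key = os.path.basename(path)
    if object_exists(BUCKET_NAME, key):
        continue  # skip files that were already uploaded
    with open(path, "rb") as f:
        s3.upload_fileobj(f, BUCKET_NAME, key)

Whether this runs in AWS Lambda or on a schedule elsewhere is mostly an operational choice; the same check works in either place.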

Convert Pandas Dataframe to Image and Upload Directly to S3 Bucket Without Saving Locally Using Python

I'm writing a Python script that I will be running in an AWS Lambda function. The script will generate a pandas dataframe that I would like converted into either a .jpeg or .png. I want this image file written directly to an S3 bucket without having to save it to a folder on my local machine. I've done research on boto3 and have my access keys for AWS to read and write from S3, but I need help with converting a pandas dataframe to an image file that is then directly uploaded to an S3 bucket.
I created an issue on the dataframe_image repository asking this question to the library author, and below is my implementation of his suggestion:
import boto3
import dataframe_image as dfi
from io import BytesIO

# create the buffer in which to store the generated image
png_io = BytesIO()

# assuming that your dataframe is stored as the `df` variable
df.dfi.export(png_io)

# rewind the buffer so boto3 reads it from the beginning
png_io.seek(0)

# create the boto3 client and upload the BytesIO directly;
# create your client in whatever way works for you
session = boto3.session.Session(aws_access_key_id=aws_access_key_id,
                                aws_secret_access_key=aws_secret_access_key,
                                region_name=region_name)
s3_client = session.client('s3')

# upload_fileobj expects the file object, the bucket name and the destination key
s3_client.upload_fileobj(png_io, 'your-bucket-name', 'your_destination_folder/your_destination_file')
For more information, check this TechOverflow post and this StackOverflow post.

How to mount S3 bucket as local FileSystem?

I have a Python app running in a Jupyter notebook on AWS. I loaded a C library into my Python code which expects a path to a file.
I would like to access this file from the S3 bucket.
I tried to use s3fs:
s3 = s3fs.S3FileSystem(anon=False)
Using s3.ls('..') lists all my bucket files... this is OK so far. But the library I am using would have to use the s3 variable internally, where I have no access; I can only pass a path to the C library.
Is there a way to mount the S3 bucket so that I don't have to call
s3.open(), and can just call open(/path/to/s3), with the S3 bucket really mounted somewhere behind the scenes as a local filesystem?
I think it should work like this without using s3, because I can't change the library I am using internally to use the s3 variable...
# this works, but requires going through the s3 object:
with s3.open("path/to/s3/file", 'w') as f:
    df.to_csv(f)

# what I would like to be able to do instead:
with open("path/to/s3/file", 'w') as f:
    df.to_csv(f)
Or am I doing it completely wrong?
The C library I am using is loaded as a DLL in Python, and I call a function:
lib.OpenFile("path/to/s3/file")
I have to pass the S3 path into the library's OpenFile function.
If you're looking to mount the S3 bucket as part of the file system, then use s3fs-fuse
https://github.com/s3fs-fuse/s3fs-fuse
That will make it part of the file system, and the regular file system functions will work as you would expect.
If you are targeting Windows, it is possible to use rclone along with WinFsp to mount an S3 bucket as a local filesystem.
The simplified steps are:
rclone config to create a remote
rclone mount remote:bucket * to mount
https://github.com/rclone/rclone
https://rclone.org/
https://github.com/billziss-gh/winfsp
http://www.secfs.net/winfsp/
This might not be completely relevant to this question, but I am certain it will be to a lot of users coming here.

Transferring data from GCS to S3 with google-cloud-storage

I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble figuring out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me and, for reasons outside the scope of my question, we want it to run in Python). In Kotlin the Google SDK allows me to get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading to a file or into a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution: download_to_file can write the data into a file-like object (a BytesIO) that the boto3 S3 client can then handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
import boto3
from io import BytesIO
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    # download the GCS blob into an in-memory buffer
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # upload the buffer to S3
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone knows of a way to handle the data other than BytesIO (e.g. so I can stream the data directly into S3 without having to buffer it all in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file from GCS in chunks, and smart_open to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
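A rough sketch of that chunked idea is below. Instead of calling google-resumable-media directly, it uses Blob.open('rb') from newer google-cloud-storage releases (which wraps the chunked/resumable download) together with smart_open on the S3 side; the project, bucket, and object names are placeholders, and the transport_params usage assumes smart_open 5.x+:

import boto3
from google.cloud import storage
from smart_open import open as smart_open_file

def stream_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename,
                     chunk_size=8 * 1024 * 1024):
    gcs_client = storage.Client(project="my-project")
    blob = gcs_client.bucket(gcs_bucket).blob(gcs_filename)

    s3_uri = f"s3://{s3_bucket}/{s3_filename}"
    transport_params = {"client": boto3.client("s3")}

    # read the blob in chunks and hand them to smart_open, which performs
    # a multipart upload to S3 under the hood
    with blob.open("rb") as source, \
         smart_open_file(s3_uri, "wb", transport_params=transport_params) as target:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            target.write(chunk)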
