Read file from S3 into Python memory

I have a large csv file stored in S3. I would like to download, edit, and re-upload this file without it ever touching my hard drive, i.e. read it straight into memory from S3. I am using the python library boto3; is this possible?

You should look into the io module.
Depending on how you want to read the file, you can create a StringIO() or BytesIO() object and download your file into that stream.
You should check out these answers:
How to read image file from S3 bucket directly into memory?
How to read a csv file from an s3 bucket using Pandas in Python
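For example, here is a minimal sketch with boto3 (the bucket and key names are hypothetical, and credentials are assumed to be configured): the object is downloaded into a BytesIO buffer, edited in memory, and uploaded back without ever touching disk.
# Minimal sketch; 'my-bucket' and the keys are placeholders.
import io
import boto3

s3 = boto3.client('s3')

# Download straight into memory
buffer = io.BytesIO()
s3.download_fileobj('my-bucket', 'data/input.csv', buffer)

# Edit the contents in memory (trivial example: uppercase everything)
edited = buffer.getvalue().decode('utf-8').upper()

# Upload the edited bytes back to S3 without writing a local file
s3.upload_fileobj(io.BytesIO(edited.encode('utf-8')), 'my-bucket', 'data/output.csv')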

Related

convert docx to pdf using pypandoc with BytesIO file path

I want to get a docx file from Azure blob storage, convert it into a pdf, and save it back to Azure blob storage. I want to use pypandoc to convert docx to pdf.
pypandoc.convert_file('abc.docx', format='docx', to='pdf', outputfile='abc.pdf')
But I want to run this code in an Azure Function, where I will not have enough space to save files, so I am downloading the file from Azure blob storage into a BytesIO stream as follows.
blob_service_client = BlobServiceClient.from_connection_string(cs)
container_client=blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader=blob_client.download_blob()
stream = BytesIO()
streamdownloader.download_to_stream(stream)
Now I want to convert the docx file, which is accessible through this stream, to pdf. The converted pdf should also be writable to a BytesIO stream so I can upload it to blob storage without saving it to disk, but pypandoc raises the error RuntimeError: source_file is not a valid path.
If you could suggest some other way to convert docx to pdf that can handle a BytesIO file object, please do; I should mention that I will be working in a Linux environment, where libraries like doc2pdf are not supported.
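One possible workaround (not from the original post, and only a sketch): since pypandoc.convert_file works on real paths, you could spill the in-memory docx to a temporary file, convert it, and read the resulting pdf back into a BytesIO. This assumes the function sandbox exposes a writable temp directory and that pandoc has a pdf engine (e.g. pdflatex) installed.
# Hedged workaround sketch: pypandoc needs file paths, so use a temp directory.
# Assumes a writable temp dir and a pandoc pdf engine are available.
import os
import tempfile
from io import BytesIO

import pypandoc

with tempfile.TemporaryDirectory() as tmpdir:
    docx_path = os.path.join(tmpdir, 'input.docx')
    pdf_path = os.path.join(tmpdir, 'output.pdf')
    with open(docx_path, 'wb') as f:
        f.write(stream.getvalue())  # 'stream' is the BytesIO downloaded above
    pypandoc.convert_file(docx_path, to='pdf', format='docx', outputfile=pdf_path)
    with open(pdf_path, 'rb') as f:
        pdf_stream = BytesIO(f.read())  # ready to upload back to blob storage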

Convert s3 video files resolution and store it back to s3

I want to convert the resolution of video files in an S3 bucket to other resolutions and store them back to S3.
I know we can use ffmpeg locally to change the resolution.
Can we use ffmpeg to convert S3 files and store them back using Django?
I am puzzled about where to start and what the process should be.
Should I fetch the video file from the S3 bucket into a buffer, convert it using ffmpeg, and then upload it back to S3?
Or is there any way to do it directly without having to keep it in a buffer?
You can break the problem into 3 parts (a minimal sketch follows below):
1. Get the file from S3: use boto3 & the AWS APIs to download the file.
2. Convert the file locally: ffmpeg has got you covered here. It'll generate a new file.
3. Upload the file back to S3: use boto3 again.
This is fairly simple and robust for a script.
Now if you want to optimize, you can try using something like s3fs to mount your S3 bucket locally and do the conversion there.
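Here is a minimal sketch of those three steps (the bucket, keys, and target resolution are hypothetical, and ffmpeg is assumed to be on the PATH):
# Hedged sketch of download -> ffmpeg -> upload; names are illustrative only.
import os
import subprocess
import tempfile

import boto3

s3 = boto3.client('s3')

with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, 'input.mp4')
    dst = os.path.join(tmpdir, 'output_720p.mp4')

    # 1. Get the file from S3
    s3.download_file('my-bucket', 'videos/input.mp4', src)

    # 2. Convert locally with ffmpeg (scale to 720p here)
    subprocess.run(['ffmpeg', '-i', src, '-vf', 'scale=-2:720', dst], check=True)

    # 3. Upload the converted file back to S3
    s3.upload_file(dst, 'my-bucket', 'videos/output_720p.mp4')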

python upload data, not file, to s3 bucket

I know how to upload a file to S3 buckets in Python. I am looking for a way to upload data directly to an object in an S3 bucket, so that I do not need to save my data to a local file and then upload the file. Any suggestions? Thanks!
AFAIK standard Object.put() supports this.
resp = s3.Object('bucket_name', 'key/key.txt').put(Body=b'data')
Edit: it was pointed out that you might want the client method, which is just put_object with the kwargs organized differently:
client.put_object(Body=b'data', Bucket='bucket_name', Key='key/key.txt')
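For completeness, a short sketch with the setup lines filled in (the bucket and key names are just the placeholders from above); both the resource-style and client-style calls work:
# Sketch of both styles with imports; 'bucket_name' and 'key/key.txt' are placeholders.
import boto3

# resource-style
s3 = boto3.resource('s3')
s3.Object('bucket_name', 'key/key.txt').put(Body=b'data')

# client-style
client = boto3.client('s3')
client.put_object(Body=b'data', Bucket='bucket_name', Key='key/key.txt')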

File upload and download using python

I am looking for suggestions for my program:
One part of my program generates a .csv file that I need to upload to the cloud. Essentially, the program should upload the .csv file to the cloud and return the URL for that location (csv_url).
Another part of my program has to use that csv_url with wget to download the file.
How can I tackle this problem? Will uploading the file to an S3 bucket work for me? How do I return a consolidated URL in that case? Apart from an S3 bucket, is there any other medium where I can try to upload my file? Any suggestion would be very helpful.
Try the boto3 library from Amazon; it has all the functions you would need for
S3: GET/POST/PUT/DELETE/LIST.
PUT example:
# Upload a new file
import boto3
s3 = boto3.resource('s3')
data = open('test.jpg', 'rb')
s3.Bucket('my-bucket').put_object(Key='test.jpg', Body=data)
Yes, uploading the file to AWS S3 will definitely work for you, and you need nothing else. If you want to do it with Python, it's quite easy:
import boto3
s3 = boto3.client('s3')
s3.upload_file('images/4.jpeg', 'mausamrest', 'test/jkl.jpeg', ExtraArgs={'ACL': 'public-read'})
where mausamrest is the bucket and test/jkl.jpeg is the key name (you could say the filename in S3),
and this is the URL you will get:
https://s3.amazonaws.com/mausamrest/test/jkl.jpeg
s3.amazonaws.com/bucketname/keyname is the format of your object URL.
In my case the image opens in the browser because of how I uploaded it; in your case the csv will get downloaded.
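Putting it together for the csv case, a hedged sketch (the key name is made up, and the object must be publicly readable for a plain wget to work):
# Sketch: upload the generated csv and build the public URL for it.
# 'exports/report.csv' is a hypothetical key; adjust bucket/key to your setup.
import boto3

bucket = 'mausamrest'
key = 'exports/report.csv'

s3 = boto3.client('s3')
s3.upload_file('report.csv', bucket, key, ExtraArgs={'ACL': 'public-read'})

csv_url = f'https://s3.amazonaws.com/{bucket}/{key}'
print(csv_url)  # hand this to the part of the program that runs wget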

Transfering data from gcs to s3 with google-cloud-storage

I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me and, for reasons outside the scope of my question, we want it to run in Python). In Kotlin the Google SDK allows me to get an InputStream from the Blob object, which I can then inject into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading the blob to a file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, since download_to_file can write the data into a file-like object that the boto3 S3 client can handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO

import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information/knowledge about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3 without having to buffer it in memory on the host machine), it would be very much appreciated.
Google-resumable-media can be used to download the file in chunks from GCS, and smart_open can be used to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
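As a hedged sketch of that streaming idea (assuming a recent google-cloud-storage that provides the file-like Blob.open, and smart_open picking up default boto3 credentials), something along these lines copies the blob in chunks without holding the whole file in memory:
# Sketch only: stream a GCS blob into S3 chunk by chunk via smart_open.
# Assumes google-cloud-storage with Blob.open and smart_open with boto3 creds.
import shutil

from google.cloud import storage
from smart_open import open as s3_open

def stream_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    blob = storage.Client().bucket(gcs_bucket).blob(gcs_filename)
    with blob.open("rb") as src, s3_open(f"s3://{s3_bucket}/{s3_filename}", "wb") as dst:
        shutil.copyfileobj(src, dst, length=8 * 1024 * 1024)  # 8 MiB chunks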
