I want to convert the resolution of video files stored in an S3 bucket to other resolutions and store them back in S3.
I know we can use ffmpeg locally to change the resolution.
Can we use ffmpeg to convert S3 files and store them back using Django?
I am puzzled as to where to start and what the process should be.
Should I fetch the video file from the S3 bucket into a buffer, convert it with ffmpeg, and then upload it back to S3?
Or is there any way to do it directly, without having to keep it in a buffer?
You can break the problem into 3 parts:
1. Get the file from S3: use boto3 and the AWS APIs to download the file.
2. Convert the file locally: ffmpeg has got you covered here; it'll generate a new file.
3. Upload the file back to S3: use boto3 again.
This is fairly simple and robust for a script.
Now, if you want to optimize, you can try using something like s3fs to mount your S3 bucket locally and do the conversion there.
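A minimal sketch of that three-step flow with boto3 and ffmpeg (the bucket, keys, temp paths and the 720p target below are placeholders); in Django you would typically run this from a background task rather than inside a request handler:

import subprocess
import boto3

def convert_s3_video(bucket, src_key, dst_key, scale="1280:720"):
    # Download a video from S3, rescale it with ffmpeg, and upload the result
    s3 = boto3.client("s3")
    local_in = "/tmp/input.mp4"
    local_out = "/tmp/output.mp4"
    # 1. Get the file from S3
    s3.download_file(bucket, src_key, local_in)
    # 2. Convert it locally; ffmpeg writes a new file
    subprocess.run(
        ["ffmpeg", "-y", "-i", local_in, "-vf", f"scale={scale}", local_out],
        check=True,
    )
    # 3. Upload the converted file back to S3
    s3.upload_file(local_out, bucket, dst_key)

convert_s3_video("my-videos", "raw/clip.mp4", "720p/clip.mp4")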
Related
I've written a python script that resamples and renames a ton of audio data and moves it to a new location on disk. I'd like to use this script to move the data I'm resampling to a google storage bucket.
Question: Is there a way to connect/mount your GCP VM instance to a bucket in such a way that reading and writing can be done as if the bucket is just another directory?
For example, this is somewhere in my script:
# load audio from old location
audio, _ = librosa.load(old_path)
# Do some stuff to the audio
# ...
# write audio to new location
with sf.SoundFile(new_path, 'w', sr, channels=1, format='WAV') as f:
    f.write(audio)
I'd like to have a way to get the path to my bucket because my script takes an old_path where the original data is, resamples and moves it to a new_path.
My script would not be as simple to modify as the snippet above makes it seem, because I'm doing a lot of multiprocessing. Plus I'd like to make the script generic so I can re-use it for local files, etc. Basically, altering the script is off the table.
You could use the Cloud Storage FUSE adapter to mount your GCS bucket onto the local filesystem:
https://cloud.google.com/storage/docs/gcs-fuse
For Linux:
sudo apt-get update
sudo apt-get install gcsfuse
gcsfuse mybucket /my/path
Alternatively you could use the GCS Client for Python to upload your content directly:
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-usage-python
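For the second option, a minimal sketch with the google-cloud-storage client (the bucket name, object name and local path are placeholders):

from google.cloud import storage

# Upload a local file to a GCS bucket without mounting anything
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("audio/resampled.wav")
blob.upload_from_filename("/tmp/resampled.wav")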
Yes, you can use Cloud Storage FUSE. More info and some examples here.
To mount a bucket using gcsfuse over an existing directory /path/to/mount, invoke it like this:
gcsfuse my-bucket /path/to/mount
I recommend having a bucket that is exclusively accessed through gcsfuse to keep things simple.
Important note: gcsfuse is distributed as-is, without warranties of any kind.
I have a large csv file stored in S3. I would like to download, edit and re-upload this file without it ever touching my hard drive, i.e. read it straight into memory from S3. I am using the python library boto3; is this possible?
You should look into the io module
Depending on how you want to read the file, you can create a StringIO() or BytesIO() object and download your file to this stream.
You should check out these answers:
How to read image file from S3 bucket directly into memory?
How to read a csv file from an s3 bucket using Pandas in Python
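As a rough sketch of the in-memory round trip with boto3, BytesIO and pandas (the bucket, keys and added column are placeholders):

from io import BytesIO
import boto3
import pandas as pd

s3 = boto3.client("s3")
buf = BytesIO()
# Download the CSV straight into memory
s3.download_fileobj("my-bucket", "data/input.csv", buf)
buf.seek(0)
# Edit it with pandas
df = pd.read_csv(buf)
df["new_column"] = 1
# Re-upload without touching the disk
out = BytesIO(df.to_csv(index=False).encode("utf-8"))
s3.upload_fileobj(out, "my-bucket", "data/output.csv")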
I have a python app running in a Jupyter notebook on AWS. I loaded a C library into my python code which expects a path to a file.
I would like to access this file from the S3 bucket.
I tried to use s3fs:
s3 = s3fs.S3FileSystem(anon=False)
Using s3.ls('..') lists all my bucket files... this is OK so far. But the library I am using would have to use the s3 variable internally, where I have no access; I can only pass a path to the C library.
Is there a way to mount the S3 bucket so that I don't have to call
s3.open(), and can just call open(/path/to/s3), where the S3 bucket is really mounted as a local filesystem somewhere underneath?
I think it should work like the second snippet below, without using s3, because I can't change the library I am using internally to use the s3 variable. This works:
with s3.open("path/to/s3/file", 'w') as f:
    df.to_csv(f)
but I would need this to work instead:
with open("path/to/s3/file", 'w') as f:
    df.to_csv(f)
Or am I doing it completely wrong?
The C library I am using is loaded as a DLL in Python, and I call a function:
lib.OpenFile("path/to/s3/file")
I have to pass the path to the S3 file into the library's OpenFile function.
If you're looking to mount the S3 bucket as part of the file system, then use s3fs-fuse
https://github.com/s3fs-fuse/s3fs-fuse
That will make it part of the file system, and the regular file system functions will work as you would expect.
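Once the bucket is mounted, ordinary paths work, so (as a sketch, with a hypothetical mount point /mnt/mybucket and a hypothetical library name) the path can be passed straight to the DLL:

import ctypes

# Assumes the bucket has already been mounted with s3fs-fuse at /mnt/mybucket
lib = ctypes.CDLL("mylib.so")  # hypothetical library name
lib.OpenFile(b"/mnt/mybucket/path/to/file.dat")  # a regular filesystem path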
If you are targeting Windows, it is possible to use rclone along with WinFsp to mount an S3 bucket as a local filesystem.
The simplified steps are:
rclone config to create a remote
rclone mount remote:bucket * to mount
https://github.com/rclone/rclone
https://rclone.org/
https://github.com/billziss-gh/winfsp
http://www.secfs.net/winfsp/
This might not be completely relevant to this question, but I am certain it will be to a lot of users coming here.
I know how to upload a file to S3 buckets in Python. I am looking for a way to upload data to a file in an S3 bucket directly. That way, I do not need to save my data to a local file and then upload the file. Any suggestions? Thanks!
AFAIK standard Object.put() supports this.
resp = s3.Object('bucket_name', 'key/key.txt').put(Body=b'data')
Edit: it was pointed out that you might want the client method, which is just put_object with the kwargs organized differently:
client.put_object(Body=b'data', Bucket='bucket_name', Key='key/key.txt')
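So, as a sketch, an in-memory pandas DataFrame can be written as CSV to S3 without ever creating a local file (the bucket and key are placeholders):

from io import StringIO
import boto3
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# Serialize the DataFrame to an in-memory buffer instead of a local file
csv_buf = StringIO()
df.to_csv(csv_buf, index=False)
s3 = boto3.client("s3")
s3.put_object(Body=csv_buf.getvalue().encode("utf-8"), Bucket="bucket_name", Key="key/data.csv")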
I'm making a small app to export data from BigQuery to google-cloud-storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me and, for reasons outside the scope of my question, we want it to run in Python). In Kotlin, the Google SDK allows me to get an InputStream from the Blob object, which I can then inject into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options of downloading the blob to a local file or as a string.
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution, as apparently download_to_file downloads data into a file-like object that the boto3 S3 client can handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO

import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)
    # Download the blob into an in-memory buffer
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)
    # Upload the buffer to S3
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
If anyone has information about something other than BytesIO for handling the data (e.g. so I can stream the data directly into S3 without having to buffer it all in memory on the host machine), it would be very much appreciated.
google-resumable-media can be used to download the file in chunks from GCS, and smart_open can be used to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
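For reference, a hedged sketch of the streaming approach with smart_open, which can read gs:// and write s3:// URLs in chunks so the whole file never sits in memory (assuming smart_open is installed with its GCS and S3 extras, credentials for both clouds are configured, and the bucket/object names below are placeholders):

from smart_open import open  # deliberately shadows the builtin open

CHUNK = 8 * 1024 * 1024  # 8 MiB per read

# Stream the object from GCS and write it to S3 chunk by chunk
with open("gs://my-gcs-bucket/export.csv", "rb") as src:
    with open("s3://my-s3-bucket/export.csv", "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)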