How to mount S3 bucket as local FileSystem? - python

I have a Python app running in a Jupyter notebook on AWS. I loaded a C library into my Python code which expects a path to a file.
I would like to access this file from the S3 bucket.
I tried to use s3fs:
s3 = s3fs.S3FileSystem(anon=False)
Using s3.ls('..') lists all my bucket files... this is OK so far. But the library I am using would have to use the s3 variable internally, where I have no access; I can only pass a path to the C library.
Is there a way to mount the S3 bucket so that I don't have to call
s3.open(), and can just call open("/path/to/s3/file"), with the S3 bucket mounted somewhere behind the scenes as a local filesystem?
I think it should work without using s3, because I can't change the library I am using to work with the s3 variable internally... Instead of this:
with s3.open("path/to/s3/file", 'w') as f:
    df.to_csv(f)
I would like to simply write:
with open("path/to/s3/file", 'w') as f:
    df.to_csv(f)
Or am I doing it completely wrong?
The C library I am using is loaded as a DLL in Python and I call a function:
lib.OpenFile("path/to/s3/file")
I have to pass the S3 path into the library's OpenFile function.

If you're looking to mount the S3 bucket as part of the file system, then use s3fs-fuse
https://github.com/s3fs-fuse/s3fs-fuse
That will make it part of the file system, and the regular file system functions will work as you would expect.
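For example, once the bucket is mounted (say with s3fs mybucket /mnt/s3, where the bucket name and mount point are only placeholders), the C library can be handed an ordinary path. A minimal sketch, reusing the lib handle from the question:
# Hypothetical mount point /mnt/s3 created with s3fs-fuse; after mounting,
# objects look like ordinary files, so the built-in open() works:
with open("/mnt/s3/path/to/s3/file") as f:
    data = f.read()
# and the DLL can receive the same path (argument encoding depends on the
# OpenFile signature declared by the library):
lib.OpenFile(b"/mnt/s3/path/to/s3/file")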

If you are targeting Windows, it is possible to use rclone along with WinFsp to mount an S3 bucket as a local filesystem.
The simplified steps are:
rclone config to create a remote
rclone mount remote:bucket * to mount
https://github.com/rclone/rclone
https://rclone.org/
https://github.com/billziss-gh/winfsp
http://www.secfs.net/winfsp/
This might not be completely relevant to this question, but I am certain it will be to a lot of users coming here.
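For instance, if rclone mount assigns the drive letter X: (the letter is only an example; rclone picks a free one when * is given), the bucket can then be read with the normal Python file API:
# Hypothetical drive letter X: assigned by rclone mount on Windows:
with open(r"X:\path\to\file.csv") as f:
    print(f.read())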

Related

How can I access the created folder in S3 to write a CSV file into it?

I have created the folder with the code below, but how can I access that folder to write a CSV file into it?
import boto3

# Creating folder on S3 for unmatched data
client = boto3.client('s3')
# Variables
target_bucket = obj['source_and_destination_details']['s3_bucket_name']
subfolder = obj['source_and_destination_details']['s3_bucket_uri-new_folder_path'] + obj['source_and_destination_details']['folder_name_for_unmatched_column_data']
# Create subfolder (objects)
client.put_object(Bucket=target_bucket, Key=subfolder)
The folder is getting created successfully by the above code, but how do I write a CSV file into it?
Below is the code I have tried, but it's not working:
# Writing csv on AWS S3
df.reindex(idx).to_csv(
    obj['source_and_destination_details']['s3_bucket_uri-write']
    + obj['source_and_destination_details']['folder_name_for_unmatched_column_data']
    + obj['source_and_destination_details']['file_name_for_unmatched_column_data'],
    index=False,
)
An S3 bucket is not a file system.
I assume that the to_csv() method is supposed to write to some sort of file system, but this is not the way it works with S3. While there are solutions to mount S3 buckets as file systems, this is not the preferred way.
Usually, you would interact with S3 via the AWS REST APIs, the AWS CLI or a client library such as Boto, which you’re already using.
So in order to store your content on S3, you first create the file locally, e.g. in the system’s /tmp folder. Then use Boto’s put_object() method to upload the file. Remove it from local storage afterwards.
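A minimal sketch of that approach, reusing df, idx, target_bucket and subfolder from the question (the local file name is hypothetical, and subfolder is assumed to end with '/'):
import os
import boto3

# 1. write the CSV to local temporary storage first
local_path = '/tmp/unmatched_data.csv'
df.reindex(idx).to_csv(local_path, index=False)

# 2. upload it to the target bucket under the created "folder" (key prefix)
client = boto3.client('s3')
client.upload_file(local_path, target_bucket, subfolder + 'unmatched_data.csv')

# 3. clean up local storage
os.remove(local_path)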

Python: how to download a file from S3 and then reuse it

How can I download a file from S3 and then reuse it, instead of downloading it again every time the endpoint is called?
@app.route("/get", methods=['GET'])
def get():
    s3 = boto3.resource('s3')
    ids = pickle.loads(s3.Bucket('bucket').Object('file').get()['Body'].read())
    ...
Based on the comments.
Since the files are large (16GB) and need to be read and updated often, instead of S3, an EFS filesystem could be used for their storage:
Amazon Elastic File System (Amazon EFS) provides a simple, serverless, set-and-forget elastic file system for use with AWS Cloud services and on-premises resources.
EFS provides NFS filesystems that you can mount to your instance, or even to multiple instances at the same time. You can also mount the same filesystem to ECS containers and Lambda functions.
Since EFS provides a regular filesystem, you can write and read the files directly on it. There is no need to copy them first, as with S3, which is object storage (not a filesystem).
It's worth pointing out that the convenience of EFS costs more than using S3. However, if cost is a concern, you can now reduce it by using the recently released Amazon EFS One Zone storage class.
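A minimal sketch of the endpoint under that setup, assuming the EFS filesystem is mounted at /mnt/efs (the mount point and file name are hypothetical) and reusing the Flask app from the question:
import pickle

@app.route("/get", methods=['GET'])
def get():
    # read the pickle straight from the mounted EFS path; no S3 download needed
    with open('/mnt/efs/file', 'rb') as f:
        ids = pickle.load(f)
    ...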

Write to Google Cloud storage bucket as if it were a directory?

I've written a Python script that resamples and renames a ton of audio data and moves it to a new location on disk. I'd like to use this script to move the data I'm resampling to a Google Storage bucket.
Question: Is there a way to connect/mount your GCP VM instance to a bucket in such a way that reading and writing can be done as if the bucket is just another directory?
For example, this is somewhere in my script:
# load audio from old location
audio, _ = librosa.load(old_path)
# Do some stuff to the audio
# ...
# write audio to new location
with sf.SoundFile(new_path, 'w', sr, channels=1, format='WAV') as f:
f.write(audio)
I'd like to have a way to get the path to my bucket because my script takes an old_path where the original data is, resamples and moves it to a new_path.
My script would not be as simple to modify as the snippet above makes it seem, because I'm doing a lot of multiprocessing. Plus I'd like to make the script generic so I can re-use it for local files, etc. Basically, altering the script is off the table.
You could use the FUSE adapter to mount your GCS bucket onto the local filesystem
https://cloud.google.com/storage/docs/gcs-fuse
For Linux:
sudo apt-get update
sudo apt-get install gcsfuse
gcsfuse mybucket /my/path
Alternatively you could use the GCS Client for Python to upload your content directly:
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-usage-python
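A minimal sketch with the google-cloud-storage client (the bucket and object names are placeholders; new_path is the local file written by the script):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("mybucket")
# destination object name inside the bucket
blob = bucket.blob("resampled/audio_001.wav")
blob.upload_from_filename(new_path)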
Yes, you can use Cloud Storage FUSE. More info and some examples here.
To mount a bucket using gcsfuse over an existing directory /path/to/mount, invoke it like this:
gcsfuse my-bucket /path/to/mount
I recommend having a bucket that is exclusively accessed through gcsfuse to keep things simple.
Important note: gcsfuse is distributed as-is, without warranties of any kind.
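Once mounted, the script's new_path can simply point inside the mount, so the writing code from the question stays unchanged (the path below is a placeholder; sf, sr and audio come from the question):
# new_path points inside the gcsfuse mount
new_path = "/path/to/mount/resampled/audio_001.wav"
with sf.SoundFile(new_path, 'w', sr, channels=1, format='WAV') as f:
    f.write(audio)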

Python/AWS Lambda Function: How to view /tmp storage?

Lambda functions have access to disk space in their own /tmp directories. My question is, where can I visually view the /tmp directory?
I’m attempting to download the files into the /tmp directory to read them, and write a new file to it as well. I actually want to see that the files I’m working with are getting stored properly in /tmp during execution.
Thank you
You can't 'view' the /tmp directory after the Lambda execution has ended.
Lambda runs on a distributed architecture, and after the execution all resources used (including all files stored in /tmp) are disposed of.
So if you want to check your files afterwards, you might want to consider using EC2 or S3.
If you just want to check, during the execution, whether the S3 download was successful, you can try:
import os
os.path.isfile('/tmp/' + filename)
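To see everything currently sitting in /tmp during execution (a hypothetical debugging snippet; the output lands in the CloudWatch logs):
import os

for name in os.listdir('/tmp'):
    print(name, os.path.getsize('/tmp/' + name))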
As previous answers suggested, you might want to create a /tmp/ prefix (a "folder") in your S3 bucket and download/upload your temporary processing files to that prefix before the final clean-up.
You can do the following (I'm not showing the detailed process here):
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket=Your_bucket_name, Key='tmp/' + Your_file_name)
You download your file from that /tmp prefix with:
s3.download_file(Your_bucket_name, Your_key_name, Your_file_name)
After you download and process the files, you can upload them again to /tmp with:
s3.upload_file(Your_file_name, Your_bucket_name, Your_key_name)
You can include the /tmp/ prefix in Your_key_name.
Then you should be able to list the bucket contents with something like:
# list the objects under the tmp/ prefix with the boto3 client from above
response = s3.list_objects_v2(Bucket=Your_bucket_name, Prefix='tmp/')
for obj in response.get('Contents', []):
    print("{name}\t{size}\t{modified}".format(
        name=obj['Key'],
        size=obj['Size'],
        modified=obj['LastModified'],
    ))
Make sure your downloads and uploads run asynchronously, for example with an async boto package.
Try to use an S3 bucket to store the file and read it from the AWS Lambda function; you should ensure the AWS Lambda role has access to the S3 bucket.
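A minimal sketch of that approach (the bucket and key names are placeholders, and the execution role needs s3:GetObject on the bucket):
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # download the object into Lambda's writable /tmp storage, then read it
    s3.download_file('my-bucket', 'input/data.csv', '/tmp/data.csv')
    with open('/tmp/data.csv') as f:
        body = f.read()
    return {'statusCode': 200, 'body': body[:100]}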

Local access to Amazon S3 Bucket from EC2 instance

I have an EC2 instance and an S3 bucket in the same region. The bucket contains reasonably large (5-20mb) files that are used regularly by my EC2 instance.
I want to programmatically open the file on my EC2 instance (using Python). Like so:
file_from_s3 = open('http://s3.amazonaws.com/my-bucket-name/my-file-name')
But using an HTTP URL to access the file remotely seems grossly inefficient; surely this would mean downloading the file to the server every time I want to use it.
What I want to know is, is there a way I can access S3 files locally from my EC2 instance, for example:
file_from_s3 = open('s3://my-bucket-name/my-file-name')
I can't find a solution myself, any help would be appreciated, thank you.
Whatever you do, the object will be downloaded behind the scenes from S3 into your EC2 instance. That cannot be avoided.
If you want to treat the files in the bucket as local files, you need to install one of the several S3 filesystem plugins for FUSE (for example s3fs-fuse). Alternatively, you can use boto for easy access to S3 objects via Python code.
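A minimal sketch of the boto approach using boto3 (bucket and key names are taken from the question; the local cache path is hypothetical):
import boto3

s3 = boto3.client('s3')
local_path = '/tmp/my-file-name'
# download the object once; reuse local_path on subsequent reads
s3.download_file('my-bucket-name', 'my-file-name', local_path)

with open(local_path) as file_from_s3:
    data = file_from_s3.read()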
