Uploading files from SFTP to an S3 bucket - python

I am using Paramiko to access a remote SFTP folder, and I'm trying to write code that transfers files from an SFTP path (with simple logic that uses the file metadata to check its last modified date) to an AWS S3 bucket.
I have set up the connection to S3 using Boto3, but I still can't seem to write working code that transfers the files without downloading them to a local directory first. Here is some code I tried using Paramiko's getfo() method, but it doesn't work.
for f in files:
    # get last modified date from file metadata
    last_modified = sftp.stat(remote_path + f).st_mtime
    last_modified_date = datetime.fromtimestamp(last_modified).date()
    if last_modified_date > date_limit:  # check limit
        print('getting ' + f)
        full_path = f"{folder_path}{f}"
        fo = sftp.getfo(remote_path + f, f)
        s3_conn.put_object(Body=fo, Bucket=s3_bucket, Key=full_path)
Thank you!

Use Paramiko SFTPClient.open to get a file-like object that you can pass to Boto3 Client.put_object:
with sftp.open(remote_path + f, "r") as f:
    f.prefetch()
    s3_conn.put_object(Body=f, Bucket=s3_bucket, Key=full_path)
For the purpose of the f.prefetch(), see Reading file opened with Python Paramiko SFTPClient.open method is slow.
For the opposite direction, see:
Transfer file from AWS S3 to SFTP using Boto 3
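Putting it together, a minimal sketch of the whole loop, assuming sftp, s3_conn, s3_bucket, remote_path, folder_path, files, and date_limit are defined as in the question:
from datetime import datetime

for name in files:
    # keep only files modified after date_limit, per the question's check
    last_modified = sftp.stat(remote_path + name).st_mtime
    if datetime.fromtimestamp(last_modified).date() > date_limit:
        key = f"{folder_path}{name}"
        with sftp.open(remote_path + name, "r") as fl:
            fl.prefetch()  # speeds up the sequential read over SFTP
            s3_conn.put_object(Body=fl, Bucket=s3_bucket, Key=key)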

Related

Image error, not loading for S3 image retrieval

I have written code on my backend (hosted on Elastic Beanstalk) to retrieve a file from an S3 bucket and save it back to the bucket under a different name. I am using boto3 and have created an S3 client called 's3'.
bucketname is the name of the bucket and keyname is the name of the key. I am also using the tempfile module:
tmp = tempfile.NamedTemporaryFile()
with open(tmp.name, 'wb') as f:
    s3.download_fileobj(bucketname, keyname, f)
s3.upload_file(tmp, bucketname, 'fake.jpg')
I was wondering if my understanding was off (still debugging why there is an error) - I created a tempfile and opened and saved within it the contents of the object with the keyname and bucketname. Then I uploaded that temp file to the bucket under a different name. Is my reasoning correct?
The upload_file() command is expecting a filename (as a string) in the first parameter, not a file object.
Instead, you should use upload_fileobj().
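For example, a minimal sketch of the same download-then-reupload flow using upload_fileobj(), assuming s3, bucketname, and keyname are defined as in the question:
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    # write the original object into the temp file, then rewind it
    s3.download_fileobj(bucketname, keyname, tmp)
    tmp.seek(0)
    # upload_fileobj accepts the file object itself, not a filename
    s3.upload_fileobj(tmp, bucketname, 'fake.jpg')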
However, I would recommend something different...
If you simply wish to make a copy of an object, you can use copy_object:
response = client.copy_object(
    Bucket='destinationbucket',
    CopySource='/sourcebucket/HappyFace.jpg',
    Key='HappyFaceCopy.jpg',
)

How can I access the emrfs filesystem from python in pyspark code?

I am using PySpark on Amazon EMR and need to access files stored on EMRFS in S3. Everywhere I look I can only find examples of accessing EMRFS via the Spark API, but I need to access it in the executors, using Python code. How can I do that?
The code below can help you list the contents of a bucket in AWS using boto3.
from boto3.session import Session

ACCESS_KEY = 'your_access_key'
SECRET_KEY = 'your_secret_key'

session = Session(aws_access_key_id=ACCESS_KEY,
                  aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')
your_bucket = s3.Bucket('your_bucket')

for s3_file in your_bucket.objects.all():
    print(s3_file.key)
One solution is to use the Hadoop FS API. From PySpark you can access it via the JVM.
Here is an example that lists the files in an S3 bucket folder and prints their paths.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

conf = sc._jsc.hadoopConfiguration()
s3_folder = Path("s3://bucket_name/folder")
gs = s3_folder.getFileSystem(conf).globStatus(s3_folder)

for f in gs:
    print(f.getPath().toString())
Not sure why you want to read files this way, as you can do that using Spark, but here is a way using the Hadoop FS open method:
fs = s3_folder.getFileSystem(conf)
fs_data_input_stream = fs.open(s3_folder)

# readLine() returns None at the end of the stream, so test against None
# rather than truthiness to avoid stopping early at an empty line
line = fs_data_input_stream.readLine()
while line is not None:
    print(line)
    line = fs_data_input_stream.readLine()
However, if you're using an EMR cluster, I recommend copying the files from S3 to the local filesystem and using them from there.
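For example, a minimal sketch of that approach with boto3, where the bucket, key, and local path are placeholders:
import boto3

s3 = boto3.client('s3')
# copy the object to the node's local filesystem, then read it with plain Python I/O
s3.download_file('bucket_name', 'folder/myfile.txt', '/tmp/myfile.txt')

with open('/tmp/myfile.txt') as fh:
    for line in fh:
        print(line.rstrip())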

Open and save an Excel file in S3 using Python

I have a problem with an Excel (xlsx) file. I just want to perform an open-and-save operation on it using Python code. I have tried with Python but couldn't find a way.
cursor = context.cursor()
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket')
objects = bucket.objects.all()
for obj in objects:
    if obj.key.startswith('path/filename'):
        filename = obj.key
        openok = open(obj)
        readok = openok.readlines()
        readok.close()
        print('file open and close sucessfully')
You can't read/interact with files directly on s3 as far as I know.
I'd recommend downloading it locally, and then opening it. You can use the builtin tempfile module if you want to save it to a temporary path.
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    local_file_path = os.path.join(tmpdir, "tmpfile")
    bucket.download_file(obj.key, local_file_path)
    openok = open(local_file_path)
    readok = openok.readlines()
    openok.close()

Read file data from aws s3 pyspark

I have a JSON file placed in S3. The S3 URL is similar to the one below:
https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json
But when I pass the same URL in PySpark, it does not read the file.
path = "https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json"
df=spark.read.json(path)
But I am able to download it through the browser.
Assuming that dir-resources is the name of your bucket, you should be able to access the file with the following URI:
path = "s3://dir-resources/sample.json"
In some cases, you may have to use the s3n protocol instead:
path = "s3n://dir-resources/sample.json"

Save uploaded image to S3 with Django

I'm attempting to save an image to S3 using boto. It does save a file, but it doesn't appear to save it correctly. If I try to open the file in S3, it just shows a broken image icon. Here's the code I'm using:
# Get and verify the file
file = request.FILES['file']
try:
    img = Image.open(file)
except:
    return api.error(400)

# Determine a filename
filename = file.name

# Upload to AWS and register
s3 = boto.connect_s3(aws_access_key_id=settings.AWS_KEY_ID,
                     aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.get_bucket(settings.AWS_BUCKET)
f = bucket.new_key(filename)
f.set_contents_from_file(file)
I've also tried replacing the last line with:
f.set_contents_from_string(file.read())
But that didn't work either. Is there something obvious that I'm missing here? I'm aware django-storages has a boto backend, but because of complexity with this model, I do not want to use forms with django-storages.
In case you don't want to go for django-storages and just want to upload a few files to S3 rather than all of them, below is the code:
import boto3
file = request.FILES['upload']
s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY, aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket('bucket-name')
bucket.put_object(Key=filename, Body=file)
You should use django-storages, which uses boto internally.
You can either swap the default FileSystemStorage, or create a new storage instance and manually save files. Based on your code example, I guess you really want to go with the first option.
Please consider using Django's Form instead of accessing the request directly.
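A minimal sketch of the first option, swapping the default storage in settings.py so that uploads go to S3 (this assumes a recent django-storages with the s3boto3 backend; the values are placeholders):
# settings.py
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_ACCESS_KEY_ID = 'your-access-key-id'
AWS_SECRET_ACCESS_KEY = 'your-secret-access-key'
AWS_STORAGE_BUCKET_NAME = 'your-bucket-name'

# view code: with the storage swapped, default_storage writes straight to S3
from django.core.files.storage import default_storage

file = request.FILES['file']
saved_name = default_storage.save(file.name, file)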
