Read file data from aws s3 pyspark


I have a JSON file placed in S3. The S3 URL is similar to the one below:
https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json
But when I pass the same URL in PySpark, it does not read the file:
path = "https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json"
df = spark.read.json(path)
However, I am able to download the file through a browser.

Assuming that dir-resources is the name of your bucket, you should be able to access the file with the following URI:
path = "s3://dir-resources/sample.json"
In some cases, you may have to use the s3n protocol instead (or s3a, which replaces s3n in newer Hadoop versions):
path = "s3n://dir-resources/sample.json"
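If you only have the browser URL at hand, the conversion described above can be sketched in plain Python. This is a minimal sketch that assumes a path-style URL (https://<endpoint>/<bucket>/<key>), as in the question; the helper name https_to_s3_uri is hypothetical and not part of PySpark or boto3.

```python
from urllib.parse import urlparse


def https_to_s3_uri(url, scheme="s3"):
    """Convert a path-style S3 HTTPS URL to a <scheme>://<bucket>/<key> URI."""
    parsed = urlparse(url)
    # Path-style URLs look like /<bucket>/<key>; split off the bucket name.
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError("expected https://<endpoint>/<bucket>/<key>")
    return f"{scheme}://{bucket}/{key}"


url = "https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json"
path = https_to_s3_uri(url)
print(path)  # s3://dir-resources/sample.json
```

The resulting path can then be passed to spark.read.json(path). Pass scheme="s3n" or scheme="s3a" if your cluster needs one of those connectors.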

Related

Image error, not loading for S3 image retrieval

I have written code on my backend (hosted on Elastic Beanstalk) to retrieve a file from an S3 bucket and save it back to the bucket under a different name. I am using boto3 and have created an s3 client called 's3'.
bucketname is the name of the bucket and keyname is the name of the key. I am also using the tempfile module:
tmp = tempfile.NamedTemporaryFile()
with open(tmp.name, 'wb') as f:
    s3.download_fileobj(bucketname, keyname, f)
s3.upload_file(tmp, bucketname, 'fake.jpg')
I was wondering if my understanding was off (still debugging why there is an error) - I created a tempfile and opened and saved within it the contents of the object with the keyname and bucketname. Then I uploaded that temp file to the bucket under a different name. Is my reasoning correct?

The upload_file() command is expecting a filename (as a string) in the first parameter, not a file object.
Instead, you should use upload_fileobj().
However, I would recommend something different...
If you simply wish to make a copy of an object, you can use copy_object:
response = client.copy_object(
    Bucket='destinationbucket',
    CopySource='/sourcebucket/HappyFace.jpg',
    Key='HappyFaceCopy.jpg',
)
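If you still want the download-then-upload route with upload_fileobj(), a minimal sketch could look like the following. It assumes s3 is a boto3 S3 client (for example boto3.client('s3')) passed in by the caller; the function name copy_via_tempfile is hypothetical.

```python
import tempfile


def copy_via_tempfile(s3, bucketname, keyname, new_keyname):
    """Download an object into a temporary file and upload it under a new key.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    with tempfile.TemporaryFile() as tmp:
        # Write the object's bytes into the temporary file.
        s3.download_fileobj(bucketname, keyname, tmp)
        # Rewind, otherwise the upload would start at the end of the file.
        tmp.seek(0)
        # upload_fileobj() takes a file object, unlike upload_file().
        s3.upload_fileobj(tmp, bucketname, new_keyname)
```

Usage would be along the lines of copy_via_tempfile(s3, bucketname, keyname, 'fake.jpg'). The seek(0) call matters: without it, the upload reads from the current position and sends an empty body.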

Create a Directory in AWS S3

I am using the boto3 library to create an S3 folder with Python. I want to create a directory 'c' inside an already existing directory structure such as '/a/b'.
s3_client=boto3.client('s3')
s3_client =put_object(Bucket=bucket_name, Key='a/b/c/')
I am not getting any error, but the directory is not getting created either. I can't figure out the reason. Any suggestions?

Not sure if it is a typo in the code you show but it should give you an error. I think what you are trying to do is:
s3_client = boto3.client('s3')
response = s3_client.put_object(Bucket=bucket_name, Key='a/b/c/')
print("Response: {}".format(response)) # See result of request.

There are no such things as folders or directories in S3; an object key such as 'a/b/c/' is just a name that happens to contain slashes.
To upload a file to your bucket you could use:
s3_client = boto3.client('s3')
# You have to provide my_binary_data.
response = s3_client.put_object(Body=my_binary_data, Bucket=bucket_name, Key='a/b/c/')
where Key represents the name of your file.
You can read more in the boto3 documentation for Client.put_object.
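To illustrate why S3 "folders" are only key prefixes, here is a small pure-Python sketch that groups a flat list of keys by a delimiter, roughly the way a delimiter-based listing presents them. It does not call S3; the function name common_prefixes and the sample keys are made up for the illustration.

```python
def common_prefixes(keys, prefix="", delimiter="/"):
    """Return the 'folders' directly under `prefix` in a flat list of keys."""
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Anything before the next delimiter behaves like a sub-folder name.
        if delimiter in rest:
            found.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(found)


# A flat key space: 'a/b/c/' is an empty marker object, not a real directory.
keys = ["a/b/c/", "a/b/report.csv", "a/notes.txt"]
print(common_prefixes(keys, prefix="a/"))    # ['a/b/']
print(common_prefixes(keys, prefix="a/b/"))  # ['a/b/c/']
```

The "directory" a/b/ shows up simply because some keys start with that prefix; nothing had to be created for it.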

How can I access the emrfs filesystem from python in pyspark code?

I am using PySpark on Amazon EMR and need to access files stored on EMRFS in S3. Everywhere I look, I can only find examples of how to access EMRFS via the Spark API, but I need to access it in the executors, using Python code. How can I do that?

The code below can help you list the contents of a bucket in AWS using boto3.
from boto3.session import Session
ACCESS_KEY = 'your_access_key'
SECRET_KEY = 'your_secret_key'
session = Session(aws_access_key_id=ACCESS_KEY,
                  aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')
your_bucket = s3.Bucket('your_bucket')
for s3_file in your_bucket.objects.all():
    print(s3_file.key)

One solution is to use the Hadoop FS API. From PySpark you can access it via the JVM.
Here is an example that lists the files in an S3 bucket folder and prints their paths:
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()
s3_folder = Path("s3://bucket_name/folder")
# listStatus returns one entry per file or sub-folder inside the folder.
gs = s3_folder.getFileSystem(conf).listStatus(s3_folder)
for f in gs:
    print(f.getPath().toString())
Not sure why you want to read files this way, since you can do that using Spark, but here is a way using the Hadoop FS open method. The path passed to open must point to a file, not a folder:
s3_file = Path("s3://bucket_name/folder/file_name")  # hypothetical file path
fs = s3_file.getFileSystem(conf)
fs_data_input_stream = fs.open(s3_file)
line = fs_data_input_stream.readLine()
while line is not None:  # readLine returns None at end of file
    print(line)
    line = fs_data_input_stream.readLine()
fs_data_input_stream.close()
However, if you're using an EMR cluster, I recommend copying the files from S3 to the local file system and using them there.
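Since the question asks about plain Python code running in the executors, another option is to call boto3 inside a function applied per partition. The sketch below is only an illustration under stated assumptions: the function name fetch_objects is hypothetical, make_client is assumed to be a zero-argument callable returning a boto3 S3 client, and the executors are assumed to have boto3 installed and credentials available.

```python
def fetch_objects(uris, make_client):
    """Yield (uri, bytes) for every s3://bucket/key URI in one partition."""
    # Create the client on the executor, once per partition.
    s3 = make_client()
    for uri in uris:
        # Split "s3://bucket/key" into its bucket and key parts.
        bucket, _, key = uri[len("s3://"):].partition("/")
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        yield uri, body


# Possible use from PySpark (not executed here; names are placeholders):
#   import boto3
#   rdd = sc.parallelize(["s3://bucket_name/folder/file_name"])
#   data = rdd.mapPartitions(
#       lambda part: fetch_objects(part, lambda: boto3.client("s3")))
```

Creating the client inside the function avoids trying to serialize a client object from the driver to the executors.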

Extract particular file from zip blob stored in azure container with python using Jupyter notebook

I have uploaded a zip file to my Azure account as a blob in an Azure container.
The zip file contains .csv and .ascii files and many other formats.
I need to read a specific file from the zip file, say an .ascii file. I am using Python for this.
How can I read a particular file's data from this zip file without downloading it locally? I would like to handle this process in memory only.
I am also trying this with the Jupyter notebook provided by Azure for ML functionality.
I am using the zipfile Python package for this.
Please help me read the file.
Here is my code snippet:
blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
blob_list = blob_service.list_blobs(CONTAINER_NAME)
allBlobs = []
for blob in blob_list:
    allBlobs.append(blob.name)
sampleZipFile = allBlobs[0]
print(sampleZipFile)

The code below should work. This example accesses an Azure container using an account URL and key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile
key = r'my_key'
service = BlobServiceClient(account_url="my_account_url",
                            credential=key)
container_client = service.get_container_client('container_name')
zipfilename = 'myzipfile.zip'
# Download the whole blob into memory.
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.readall()
# Open the archive from memory and read one member.
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)
otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (e.g. pandas.read_csv()).
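The in-memory part of this approach can be tried without an Azure account. The sketch below builds a small archive in memory to stand in for the downloaded blob bytes, then reads one member from it; the helper name read_member and the file names are made up for the example.

```python
from io import BytesIO
from zipfile import ZipFile


def read_member(zip_bytes, member_name):
    """Return the bytes of one file inside a zip archive held in memory."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        return archive.read(member_name)


# Build a small archive in memory; this stands in for the downloaded blob.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "id,value\n1,10\n")
    archive.writestr("notes.ascii", "plain text\n")
zip_bytes = buffer.getvalue()

print(read_member(zip_bytes, "notes.ascii").decode("ascii"))
```

Nothing is written to disk at any point; the archive and the extracted member both live in BytesIO buffers.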

You could use the code below to read a file inside a .zip file without extracting it:
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details, you can refer to the ZipFile documentation.
Alternatively, you can do something like this:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 1 11:14:56 2019
@author: moverm
"""
import zipfile
# A raw string keeps the backslashes in the Windows path literal.
zfile = zipfile.ZipFile(r'C:\LAB\Pyt\sample.zip')
for finfo in zfile.infolist():
    ifile = zfile.open(finfo)
    line_list = ifile.readlines()
    print(line_list)
Hope it helps.

Save uploaded image to S3 with Django

I'm attempting to save an image to S3 using boto. It does save a file, but it doesn't appear to save it correctly. If I try to open the file in S3, it just shows a broken image icon. Here's the code I'm using:
# Get and verify the file
file = request.FILES['file']
try:
    img = Image.open(file)
except:
    return api.error(400)
# Determine a filename
filename = file.name
# Upload to AWS and register
s3 = boto.connect_s3(aws_access_key_id=settings.AWS_KEY_ID,
                     aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.get_bucket(settings.AWS_BUCKET)
f = bucket.new_key(filename)
f.set_contents_from_file(file)
I've also tried replacing the last line with:
f.set_contents_from_string(file.read())
But that didn't work either. Is there something obvious that I'm missing here? I'm aware django-storages has a boto backend, but because of complexity with this model, I do not want to use forms with django-storages.

In case you don't want to use django-storages and just want to upload a few files to S3 rather than all of them, the code below does that:
import boto3
file = request.FILES['upload']
filename = file.name  # key under which the object is stored
s3 = boto3.resource('s3', aws_access_key_id=settings.AWS_ACCESS_KEY, aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket('bucket-name')
bucket.put_object(Key=filename, Body=file)
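One thing worth checking with this kind of upload is the position of the file pointer. This is an assumption about the cause and is not confirmed in the thread: if the uploaded file has already been read (for example while validating it with Image.open), the upload may start partway through the file. A minimal sketch that rewinds first, assuming bucket is a boto3 Bucket resource and using the hypothetical helper name upload_from_start:

```python
def upload_from_start(bucket, uploaded_file, key):
    """Upload a file-like object to S3, always starting at its first byte.

    `bucket` is assumed to be a boto3 Bucket resource, e.g.
    boto3.resource('s3').Bucket('bucket-name').
    """
    # An earlier read may have moved the position, so rewind before uploading.
    uploaded_file.seek(0)
    return bucket.put_object(Key=key, Body=uploaded_file)
```

In the view, this would be called as upload_from_start(bucket, file, filename) in place of the direct put_object call.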

You should use django-storages which uses boto internally.
You can either swap the default FileSystemStorage, or create a new storage instance and manually save files. Based on your code example I guess you really want to go with the first option.
Please consider using a Django Form instead of accessing the request directly.
