I am using PySpark on Amazon EMR and need to access files stored on EMRFS in S3. Everywhere I look I can only find examples of how to access EMRFS via the Spark API, but I need to access it in the executors, using plain Python code. How can I do that?
The code below lists the contents of an S3 bucket using boto3.
from boto3.session import Session

ACCESS_KEY = 'your_access_key'
SECRET_KEY = 'your_secret_key'

# Create a session with explicit credentials
session = Session(aws_access_key_id=ACCESS_KEY,
                  aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')

# List every object key in the bucket
your_bucket = s3.Bucket('your_bucket')
for s3_file in your_bucket.objects.all():
    print(s3_file.key)
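If you also need to read an object's contents rather than just list keys, boto3 can fetch the body directly; note that each executor generally needs to build its own session, since boto3 sessions and clients cannot be shipped from the driver. A minimal sketch using the same resource as above and a hypothetical key:
# Read the full contents of one object (the key is a hypothetical example)
obj = s3.Object('your_bucket', 'folder/your_file.txt')
body = obj.get()['Body'].read()  # bytes of the whole object
print(body.decode('utf-8'))
On EMR the instances typically have an instance profile attached, so creating the session without explicit access keys usually works as well.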
One solution is to use the Hadoop FileSystem API. From PySpark you can access it via the JVM gateway.
Here is an example that lists the files in an S3 bucket folder and prints their paths.
# Access the Hadoop FileSystem API through the JVM gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

s3_folder = Path("s3://bucket_name/folder")
gs = s3_folder.getFileSystem(conf).globStatus(s3_folder)

for f in gs:
    print(f.getPath().toString())
I am not sure why you want to read files this way, since you can do that with Spark itself, but here is a way using the Hadoop FS open method:
fs = s3_folder.getFileSystem(conf)
fs_data_input_stream = fs.open(s3_folder)

# Read the stream line by line until readLine returns nothing at end of file
line = fs_data_input_stream.readLine()
while line:
    print(line)
    line = fs_data_input_stream.readLine()
However, if you're using an EMR cluster, I recommend copying the files from S3 to the local file system and using them from there, as sketched below.
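For that copy step, a minimal sketch with boto3 (the bucket, key, and local path are placeholders for illustration):
import boto3

# On EMR this can pick up credentials from the instance profile
s3 = boto3.resource('s3')

# Download one object to the local file system (names are hypothetical)
s3.Bucket('bucket_name').download_file('folder/your_file.txt', '/tmp/your_file.txt')

with open('/tmp/your_file.txt') as f:
    print(f.read())
On EMR you could equally do the copy from the shell with aws s3 cp or s3-dist-cp.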
I am facing a little issue here that I can't explain.
On some occasions, I am able to open files from my Cloud Storage buckets using a gsutil URI. For instance, this one works fine:
df = pd.read_csv('gs://poker030120203/ouptut_test.csv')
But on other occasions this method does not work and returns FileNotFoundError: [Errno 2] No such file or directory. This happens, for instance, with the following code:
rank_table_filename = 'gs://poker030120203/rank_table.bin'
rank_table_file = open(rank_table_filename, "r")
preflop_table_filename = 'gs://poker030120203/preflop_table.npy'
self.preflop_table = np.load(preflop_table_filename)
I am not sure if this is related to the open or load method, or maybe to the file type, but I can't figure out why this returns an error. I don't know if it has an impact on the matter, but I'm running everything from Vertex AI (i.e., the module that automatically sets up a storage bucket, a VM, and a Jupyter notebook).
Thanks a lot for the help.
To read and write files in Google Cloud Storage, use the Google-recommended methods: the Google client libraries make it easy to read and write anything in Cloud Storage.
Example from the documentation:
from google.cloud import storage


def write_read(bucket_name, blob_name):
    """Write and read a blob from GCS using file-like IO"""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"

    # The ID of your new GCS object
    # blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(blob_name)

    # Mode can be specified as wb/rb for bytes mode.
    # See: https://docs.python.org/3/library/io.html
    with blob.open("w") as f:
        f.write("Hello world")

    with blob.open("r") as f:
        print(f.read())
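As for why the question's code fails: the built-in open() and np.load() given a path string only understand local paths, so a gs:// URL raises FileNotFoundError, while pd.read_csv works because pandas routes gs:// URLs through gcsfs. A minimal sketch of reading the two binary files through the client library instead, reusing the bucket name from the question:
import io
import numpy as np
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("poker030120203")  # bucket name taken from the question

# Binary blob: download the bytes into memory and work with them directly
rank_table_bytes = bucket.blob("rank_table.bin").download_as_bytes()

# .npy blob: np.load accepts a file-like object, so wrap the bytes in BytesIO
preflop_blob = bucket.blob("preflop_table.npy")
preflop_table = np.load(io.BytesIO(preflop_blob.download_as_bytes()))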
I am trying to upload a file to an S3 bucket through a Streamlit web UI. The next step is to read or delete files present in the S3 bucket through the Streamlit app in an interactive way. I have written code to upload a file in the Streamlit app, but I am failing to load it into the S3 bucket.
We have a pretty detailed doc on this. You can use s3fs to make the connection, e.g.:
# streamlit_app.py
import streamlit as st
import s3fs
import os

# Create connection object.
# `anon=False` means not anonymous, i.e. it uses access keys to pull data.
fs = s3fs.S3FileSystem(anon=False)

# Retrieve file contents.
# Uses st.experimental_memo to only rerun when the query changes or after 10 min.
@st.experimental_memo(ttl=600)
def read_file(filename):
    with fs.open(filename) as f:
        return f.read().decode("utf-8")

content = read_file("testbucket-jrieke/myfile.csv")

# Print results.
for line in content.strip().split("\n"):
    name, pet = line.split(",")
    st.write(f"{name} has a :{pet}:")
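The question also asks about uploading and deleting files from the app. A rough sketch of both operations with the same s3fs connection (the bucket name is reused from the example above, and the widget layout is only an illustration):
# Upload: let the user pick a file and write its bytes to S3
uploaded = st.file_uploader("Choose a file to upload to S3")
if uploaded is not None:
    with fs.open(f"testbucket-jrieke/{uploaded.name}", "wb") as f:
        f.write(uploaded.getvalue())
    st.success(f"Uploaded {uploaded.name}")

# Delete: list the objects in the bucket and remove the selected one
files = fs.ls("testbucket-jrieke")
to_delete = st.selectbox("File to delete", files)
if st.button("Delete"):
    fs.rm(to_delete)
    st.write(f"Deleted {to_delete}")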
I am using Paramiko to access a remote SFTP folder, and I'm trying to write code that transfers files from a path in SFTP (with simple logic that uses the file metadata to check its last-modified date) to an AWS S3 bucket.
I have set up the connection to S3 using Boto3, but I still can't seem to write working code that transfers the files without downloading them to a local directory first. Here is some code I tried using Paramiko's getfo() method, but it doesn't work.
for f in files:
    # get last modified from file metadata
    last_modified = sftp.stat(remote_path + f).st_mtime
    last_modified_date = datetime.fromtimestamp(last_modified).date()
    if last_modified_date > date_limit:  # check limit
        print('getting ' + f)
        full_path = f"{folder_path}{f}"
        fo = sftp.getfo(remote_path + f, f)
        s3_conn.put_object(Body=fo, Bucket=s3_bucket, Key=full_path)
Thank you!
Use Paramiko SFTPClient.open to get a file-like object that you can pass to Boto3 Client.put_object:
with sftp.open(remote_path + f, "r") as f:
    f.prefetch()
    s3_conn.put_object(Body=f, Bucket=s3_bucket, Key=full_path)
For the purpose of the f.prefetch(), see Reading file opened with Python Paramiko SFTPClient.open method is slow.
For the opposite direction, see:
Transfer file from AWS S3 to SFTP using Boto 3
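For completeness, a rough sketch of that opposite direction (S3 to SFTP), assuming the same s3_conn client and sftp session as in the question:
# Stream an object from S3 straight into a remote SFTP file (names are assumptions)
with sftp.open(remote_path + f, "wb") as remote_file:
    s3_conn.download_fileobj(Bucket=s3_bucket, Key=full_path, Fileobj=remote_file)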
I need to open and work on data coming in a text file with Python.
The file will be stored in Azure Blob Storage or Azure Files.
However, my question is: can I use the same modules and functions, like os.chdir() and read_fwf(), that I was using on Windows? The code I want to run:
import pandas as pd
import os

os.chdir(file_path)
df = pd.read_fwf(filename)
I want to be able to run this code, where file_path would be a directory in Azure Blob Storage.
Please let me know if that's possible. If you have a better idea of where the file can be stored, please share.
Thanks,
As far as I know, os.chdir(path) only works with the local file system. If you want to download a blob from storage to the local file system, you can refer to the following code:
from azure.storage.blob import BlobServiceClient

connect_str = "<your-connection-string>"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)

container_name = "<container-name>"
file_name = "<blob-name>"
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(file_name)

# Download the blob to a local file
download_file_path = "<local-path>"
with open(download_file_path, "wb") as download_file:
    download_file.write(blob_client.download_blob().readall())
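After the download, the original pandas code from the question works against the local copy, for example:
import pandas as pd

# Read the downloaded local file with fixed-width parsing, as in the question
df = pd.read_fwf(download_file_path)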
Alternatively, pandas.read_fwf can read the blob directly from storage using a URL, for example:
url = "https://<your-account>.blob.core.windows.net/test/test.txt?<sas-token>"
df = pd.read_fwf(url)
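As another option (my addition, not from the original answer), pandas can also reach Blob Storage through the fsspec abfs:// protocol when the adlfs package is installed; a sketch reusing the container and blob names from the example above:
import pandas as pd

# Hypothetical container "test" and blob "test.txt", matching the SAS example
df = pd.read_fwf(
    "abfs://test/test.txt",
    storage_options={"connection_string": "<your-connection-string>"},
)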
I have a JSON file placed in S3. The S3 URL is similar to the one below:
https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json
But when I pass the same URL in PySpark, it does not read the file:
path = "https://s3-eu-region-1.amazonaws.com/dir-resources/sample.json"
df = spark.read.json(path)
But I am able to download it through the browser.
Assuming that dir-resources is the name of your bucket, you should be able to access the file with the following URI:
path = "s3://dir-resources/sample.json"
In some cases, you may have to use the s3n protocol instead:
path = "s3n://dir-resources/sample.json"
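Putting it together, a minimal sketch of the read with the corrected URI (on EMR the s3:// scheme is served by EMRFS):
# Read the JSON file through the s3:// scheme rather than the https:// URL
path = "s3://dir-resources/sample.json"
df = spark.read.json(path)
df.printSchema()
df.show()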