Writing a file to S3 with Spark usually creates a directory containing 11 files: a _SUCCESS marker and 10 part files that hold the actual data. How can I load that output into a pandas DataFrame, given that the part file names vary on every run and I cannot hard-code the path?
For example, the write call looks like this:
df.coalesce(10).write.parquet("s3://testfolder/")
The files stored in the directory are:
- _SUCCESS
- part-00-*.parquet
I have a Python job that reads the file into a pandas DataFrame:
pd.read_parquet("s3://..........what is the path to specify here.................")
When writing files with Spark, you cannot choose the name of the file (you can, but you end up with what you described above). If you want a single file to load into pandas later, you would do something like this:
df.repartition(1).write.parquet(path="s3://testfolder/", mode='append')
The end result will be a single file in "s3://testfolder/" whose name starts with part-00 and ends with .parquet. You can simply read that file in, or rename it to something specific before reading it with pandas.
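If you prefer not to rename anything, you can discover the part file at read time by listing the prefix; a minimal sketch with boto3 (bucket and prefix names are placeholders, and pandas needs s3fs installed to read s3:// paths):
import boto3
import pandas as pd

s3 = boto3.client('s3')

# Placeholder bucket/prefix; adjust to match the path used in the write above.
resp = s3.list_objects_v2(Bucket='testfolder-bucket', Prefix='output/')
part_keys = [o['Key'] for o in resp['Contents'] if 'part-' in o['Key']]

# With repartition(1) there is exactly one part file to read.
df = pd.read_parquet('s3://testfolder-bucket/' + part_keys[0])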
Option 1: (Recommended)
You can use awswrangler. It's a lightweight tool that helps integrate pandas, S3, and Parquet, and it lets you read all of the files in the directory at once.
pip install awswrangler
import awswrangler as wr
df = wr.s3.read_parquet(path='s3://testfolder/')
Option 2:
############################## RETRIEVE KEYS FROM THE BUCKET ##################################
import boto3
import pandas as pd
s3 = boto3.client('s3')
s3_bucket_name = 'your bucket name'
prefix = 'path where the files are located'
response = s3.list_objects_v2(
    Bucket=s3_bucket_name,
    Prefix=prefix
)
keys = []
for obj in response['Contents']:
    keys.append(obj['Key'])
##################################### READ IN THE FILES #######################################
dfs = []  # one pandas DataFrame per part file
for key in keys:
    dfs.append(pd.read_parquet(path='s3://' + s3_bucket_name + '/' + key, engine='pyarrow'))
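If you want a single DataFrame rather than a list of them, you can concatenate the pieces afterwards; a small follow-up to the loop above, assuming all part files share the same schema:
combined_df = pd.concat(dfs, ignore_index=True)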
Hello, in my GCP Jupyter notebook I am listing the files of a dataset that I loaded into GCS:
from google.cloud import storage
client = storage.Client()
BUCKET_NAME = 'sleep-accel'
bucket = client.get_bucket(BUCKET_NAME)
blobs_all = list(bucket.list_blobs())
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
for doc in blobs_specific:
    print(doc)
and for some reason it prints entries like
<Blob: sleep-accel, physionet.org/files/sleep-accel/1.0.0/motion/1455390_acceleration.txt, 1656705245042882>
How can I access the .txt files? My main/end goal is to convert the content of the .txt files into a single .csv format.
Converting the .txt files to .csv format can be achieved with the pandas module.
Below is my sample code that converts the .txt files from the bucket to .csv format:
from google.cloud import storage
import pandas as pd
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)
blobs_specific = list(bucket.list_blobs(prefix='physionet.org/files/sleep-accel/1.0.0/motion/'))
# List all the objects inside the physionet.org/files/sleep-accel/1.0.0/motion/ folder
for doc in list(blobs_specific)[1:]:
    # read the txt file using pandas, with no header row and space as the separator
    df = pd.read_csv("gs://your_bucket_file_path/" + doc.name, header=None, sep=' ')
    # change the doc.name value from .txt to .csv
    to_csv = doc.name.replace('.txt', '.csv')
    print(to_csv)
    # convert the txt file to csv using pandas and save it under physionet.org/files/sleep-accel/1.0.0/motion/ in your notebook
    df.to_csv(to_csv, index=False, sep=',')
The CSV files will be saved to the local file system of your notebook server.
Note: you need to create a directory tree like this: physionet.org/files/sleep-accel/1.0.0/motion/ in your notebook, because this is where the CSV files will be saved.
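To avoid a FileNotFoundError from df.to_csv, the matching directory tree can be created up front; a minimal sketch, assuming the same prefix as above:
import os

# Create the local directory tree that mirrors the blob prefix.
os.makedirs('physionet.org/files/sleep-accel/1.0.0/motion/', exist_ok=True)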
I'm trying to open a series of different cracked documents/texts that we've stored in Azure Blob Storage, ideally pushing them all into a pandas DataFrame. I do not want to download them to disk (I'm going to be opening them from a Docker container); I just want to hold the contents in memory.
The file structure looks like: Azure Blob Storage -> MyContainer -> UUIDFolderNames (many) -> 1 "knowledge.json" file in each Folder.
What I've got working:
container = ContainerClient.from_connection_string( <my connection str>, <MyContainer> )
blob_list = container.list_blobs()
for blob in blob_list:
    blobClient = container.get_blob_client( blob )  # Not sure this is needed
Ideally, for each item in my for loop I'd do something like opening the .json file and then adding its text to a row in my DataFrame. However, I can't actually manage to open any of the JSON files.
What I've tried:
#1
name = blob.name
json.loads( name )
#2
with open(name, 'r') as f:
    data = json.load( f )
Errors:
#1: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
#2: No such file or directory
I've tried other sillier things like json.loads( blob ) or json.loads('knowledge.json') (no folder name in the path), but those are kinda nonsensical things I was just trying to see if they worked; they're not exactly reasonable.
Most methods (including on Azure's documentation) download the file first, but again, I don't want to download the file.
Edit: I realized it's somewhat obvious why the files cannot be found: json.load etc. look in my local directory (where I'm running the Python file from) rather than the blob location. Still, I'm not sure how to load a file without downloading it.
With the help of the block below you will be able to view the JSON blob contents:
for blobs in container_client.list_blobs():
    blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
    content = blob_client.download_blob()
    contentastext = content.readall()
    print(contentastext)
Below is the full code to read the JSON files from the blobs; you can later add this data to your DataFrames (see the sketch after the code):
from azure.storage.blob import BlobServiceClient

def UploadFiles():
    CONNECTION_STRING = "ENTER_CONNECTION_STR"
    Container_name = "gatherblobs"
    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)
    for blobs in container_client.list_blobs():
        blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
        content = blob_client.download_blob()
        contentastext = content.readall()
        print(contentastext)

if __name__ == '__main__':
    UploadFiles()
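Since the original goal was a pandas DataFrame rather than printed text, the print loop above could be replaced by something like the following sketch (it assumes each knowledge.json holds a single JSON object; the variable names are the ones from the code above):
import json
import pandas as pd

rows = []
for blob in container_client.list_blobs():
    blob_client = service_client.get_blob_client(container=Container_name, blob=blob)
    # Parse each blob entirely in memory; nothing is written to disk.
    rows.append(json.loads(blob_client.download_blob().readall()))

# One row per knowledge.json file; nested fields become flattened columns.
df = pd.json_normalize(rows)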
Here is the spark DataFrame I want to save as a csv.
type(MyDataFrame)
--Output: <class 'pyspark.sql.dataframe.DataFrame'>
To save this as a CSV, I have the following code:
MyDataFrame.write.csv(csv_path, mode = 'overwrite', header = 'true')
When I save this, the file name is something like this:
part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv
Is there a way I can give this a custom name while saving it? Like "MyDataFrame.csv"
I have the same requirement. You can write to one path and then rename the resulting file. This is my solution:
def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    :param df: dataframe which you want to save
    :param spark: SparkSession
    :param hdfs_path: target path (should not already exist)
    :param file_name: csv file name
    :return: True if the rename succeeded
    """
    sc = spark.sparkContext
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    df.coalesce(1).write.option("header", True).option("delimiter", "|").option("compression", "none").csv(hdfs_path)
    fs = FileSystem.get(Configuration())
    file = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, file)), Path(full_path))
    return result
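A possible call, with an illustrative path and file name (not from the original post):
# Writes a single, predictably named file at /tmp/my_output/MyDataFrame.csv
write_to_hdfs_specify_path(MyDataFrame, spark, "/tmp/my_output", "MyDataFrame.csv")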
No. That's how Spark works (at least for now). You'd have MyDataFrame.csv as a directory name, and under that directory you'd have multiple files named like part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c001.csv, etc.
It's not recommended, but if your data is small enough (what counts as "small enough" here is arguable), you can always convert it to pandas and save it to a single CSV file with any name you want.
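A minimal sketch of that pandas route (the output name is just an example; writing directly to an s3:// path from pandas would additionally require s3fs):
# Collect the data to the driver and write one CSV with a chosen name.
MyDataFrame.toPandas().to_csv("MyDataFrame.csv", index=False)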
.coalesce(1) will guarantee that there is only one file, but it will not guarantee the file name. Write to a temporary directory, then rename and copy the part file (using dbutils.fs functions if you use Databricks, or FileUtil from the Hadoop API); a sketch for Databricks follows.
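A hedged sketch of the Databricks variant, with placeholder temporary and final paths:
tmp_dir = "dbfs:/tmp/my_output"            # placeholder temporary directory
final_path = "dbfs:/data/MyDataFrame.csv"  # placeholder final file name

MyDataFrame.coalesce(1).write.mode("overwrite").option("header", True).csv(tmp_dir)

# Locate the single part file Spark produced and move it to the stable name.
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, final_path)
dbutils.fs.rm(tmp_dir, True)  # clean up the temporary directory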
I have uploaded a zip file to my Azure account as a blob in an Azure container.
The zip file contains .csv files, .ascii files and many other formats.
I need to read a specific file from it, let's say the ASCII file data contained in the zip. I am using Python for this.
How can I read a particular file's data from this zip file without downloading it locally? I would like to handle this entirely in memory.
I am also trying this in the Jupyter notebook provided by Azure for ML functionality.
I am using the Python zipfile package for this.
Please assist me with reading the file.
Please find the following code snippet:
blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
blob_list = blob_service.list_blobs(CONTAINER_NAME)
allBlobs = []
for blob in blob_list:
    allBlobs.append(blob.name)
sampleZipFile = allBlobs[0]
print(sampleZipFile)
The below code should work. This example accesses an Azure Container using an Account URL and Key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile
key = r'my_key'
service = BlobServiceClient(account_url="my_account_url",
                            credential=key)
container_client = service.get_container_client('container_name')
zipfilename = 'myzipfile.zip'
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.content_as_bytes()
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)
otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (e.g. pandas.read_csv()).
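For instance, continuing the snippet above:
import pandas as pd

# filetoread behaves like an open file object, so pandas can read it directly.
df = pd.read_csv(filetoread)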
You could use the code below to read a file inside a .zip archive without extracting it in Python:
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details, you can refer to the zipfile docs.
Alternatively, you can do something like this:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 1 11:14:56 2019
@author: moverm
"""
import zipfile

zfile = zipfile.ZipFile('C:\\LAB\\Pyt\\sample.zip')

for finfo in zfile.infolist():
    ifile = zfile.open(finfo)
    line_list = ifile.readlines()
    print(line_list)
Running this prints the list of lines read from each file in the archive.
Hope it helps.
I have a project task that requires using, in an EMR job, some output data I have already produced on S3. Previously I ran an EMR job that wrote output to one of my S3 buckets in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in the json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
Download the file into your local system and parse it (kinda simple, quick and easy).
Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, file names are stored as keys; if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the below code to download the file into a temp folder.
# boto 2.x imports assumed here (connect_to_region / Location are not shown in the original snippet)
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'
tempPath = '/tmp/Demo'  # placeholder local path to download to

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')

source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if fileName in name.name:  # match keys whose name contains fileName
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may run into memory errors; to avoid them you will have to write a lazy function that reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
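A hedged sketch of such a lazy reader built on that Range-header trick (the chunk size is illustrative, and key is a boto Key object such as name from the loop above):
def iter_key_chunks(key, chunk_size=64 * 1024 * 1024):
    '''Yield the object's bytes in chunk_size pieces using Range requests.'''
    start = 0
    while start < key.size:  # key.size is the object length reported by boto
        end = min(start + chunk_size, key.size) - 1
        yield key.get_contents_as_string(headers={'Range': 'bytes=%d-%d' % (start, end)})
        start = end + 1

# Example: process the previously located key chunk by chunk.
# for chunk in iter_key_chunks(name):
#     handle_chunk(chunk)  # hypothetical per-chunk handler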
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any query you have.