Is it possible to do file-based processing with a UDF in PySpark? - python

I have a UDF defined which does the following with a dataframe where one column contains the locations of zip files in Azure Blob Storage (I tested the UDF without Spark and it worked):
downloads the given file from the blob and saves it somewhere on the executor/driver
extracts a certain file from the zip and saves it on the executor/driver
With this UDF I see the same speed as if I just looped over the files in plain Python. So is it even possible to do this kind of task in Spark? I wanted to use Spark to parallelize the download and unzipping to speed it up.
I connected via SSH to the executor and the driver (it is a test cluster, so it only has one of each) and found that the data was processed only on the executor and the driver did not do anything at all. Why is that?
The next step would be to read the extracted files (normal CSVs) into a Spark DataFrame. But how can this be done if the files are distributed across the executor and driver? I have not yet found a way to access the executors' storage. Or is it somehow possible to define a common location within the UDF so it can write back to a location on the driver?
I would then like to read the extracted files with:
data_frame = (
    spark
    .read
    .format('csv')
    .option('header', True)
    .option('delimiter', ',')
    .load("/mydriverpath/*.csv")
)
If there is another method to parallelize the download and unzipping of the files, I would be happy to hear about it.

PySpark readers / writers make it easy to read and write files in parallel. When working in Spark, you generally should not loop over files or save data on the driver node.
Suppose you have 100 gzipped CSV files in the my-bucket/my-folder directory. Here's how you can read them into a DataFrame in parallel:
df = spark.read.csv("my-bucket/my-folder")
And here's how you can write them to 50 Snappy compressed Parquet files (in parallel):
df.repartition(50).write.parquet("my-bucket/another-folder")
The readers / writers do all the heavy lifting for you. See here for more info about repartition.
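For example, a minimal sketch of that pattern, reading the gzipped CSVs with explicit options and writing Snappy-compressed Parquet (bucket paths are placeholders; Spark decompresses .csv.gz files transparently, and Snappy is its default Parquet codec):
df = (
    spark.read
    .option('header', True)
    .option('delimiter', ',')
    .csv("my-bucket/my-folder")    # gzipped CSVs are decompressed on the fly
)
# Control the number of output files with repartition, then write Parquet in parallel.
df.repartition(50).write.mode('overwrite').parquet("my-bucket/another-folder")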

Related

Is there a way of getting part of a dataframe from Azure blob storage

So I have a lot of data in Azure blob storage. Each user can upload some cases and the end result can be represented as a series of pandas dataframes. Now I want to be able to display some of this data on our site, but the files are several hundred MB and there is no need to download all of it. What would be the best way to get part of the df?
I can make a folder structure in each blob storage containing the different columns in each df and perhaps a more compact summary of the columns, but I would like to keep it in one file if possible.
I could also set up a database containing the info, but I like the structure as it is: completely separated into cases.
Originally I thought I could do it with HDF5, but it seems that I need to download the entire file from blob storage to my API backend before I can run my Python code on it. I would prefer to keep the HDF5 files and get parts of the columns from blob storage directly, but as far as I can see that is not possible.
I imagine this is something that has been solved a million times before, but it is a bit outside my domain, so I have not been able to find a good solution for it.
Check out the BlobClient in the Azure Python SDK. The download_blob method might suit your needs. Use chunks() to get an iterator which allows you to iterate over the file in chunks. You can also set other parameters to ensure that a chunk doesn't exceed a set size.
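A minimal sketch of that approach (connection string, container and blob names are placeholders, and process() is a hypothetical handler for each chunk of bytes):
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",      # placeholder
    container_name="<container>",        # placeholder
    blob_name="cases/case_123.h5",       # placeholder
)

# download_blob also accepts offset/length if only part of the file is needed.
downloader = blob.download_blob()
for chunk in downloader.chunks():        # iterate over the blob in chunks
    process(chunk)                       # hypothetical handler for each bytes chunk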

Logging a PySpark dataframe into a MLFlow Artifact

I am currently writing an MLFlow artifact to DBFS, but I am using pandas, with the code below...
temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
    df.to_csv(temp_name, index=False)
    mlflow.log_artifact(temp_name, "******")
finally:
    temp.close()  # Delete the temp file
How would I write this if 'df' was a spark dataframe?
You just need to use file path URLs with the proper protocols. "dbfs" is the generic Databricks one. For Azure, "abfss" would be needed. (I cannot recall AWS's S3 scheme.)
filepath = "dbfs:///filepath"
df.write.csv(filepath)                    # df is my Spark DataFrame
mlflow.log_artifact(temp_name, filepath)  # temp_name as in the question's snippet
It looks like in your case the problem has to do with how Spark APIs access the filesystem vs. how Python APIs access it; see here for details. This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your Parquet to the local filesystem and MLflow can log it from there with something like:
with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')
Keep in mind that a Parquet "file" is actually a directory with a whole bunch of files in it, so you need to use log_artifacts, not log_artifact, and if you don't specify artifact_path you'll get all the little files that make up the Parquet file (directory) dumped directly into the root of your MLflow artifacts. Also, MLflow doesn't have any previewing capability for Parquet files, so depending on your use case, logging Parquet artifacts may not be as convenient as it first seems.
HTH

Writing .txt files to GCS from Spark/Dataproc: How to write only one large file instead of it automatically splitting in to multiple?

I use Dataproc to run a PySpark script that writes a dataframe to text files in a Google Cloud Storage bucket. When I run the script with big data, I automatically end up with a large number of text files in my output folder, but I want only one large file.
I read here (Spark saveAsTextFile() writes to multiple files instead of one) that I can use .repartition(1) before .write() to get one file, but I want it to run fast (of course), so I don't want to go back to one partition before performing the .write().
df_plain = df.select('id', 'string_field1').write.mode('append').partitionBy('id').text('gs://evatest/output', compression="gzip")
Don't think of GCS as a filesystem. The content of a GCS bucket is a set of immutable blobs (files). Once written, they can't be changed. My recommendation is to let your job write all the files independently and aggregate them at the end. There are a number of ways to achieve this.
The easiest way to achieve this is through the gsutil compose command.
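If you prefer to stay in Python, the Cloud Storage client library exposes the same operation through Blob.compose; a minimal sketch, with bucket and object names as placeholders (compose accepts up to 32 source objects per call):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("evatest")                        # placeholder bucket

sources = list(bucket.list_blobs(prefix="output/"))      # the sharded part files
composite = bucket.blob("output/combined.txt")           # placeholder target object
composite.compose(sources)                               # concatenate into one object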
References:
How to concatenate sharded files on Google Cloud Storage automatically using Cloud Functions
compose - Concatenate a sequence of objects into a new composite object.
Google Cloud Storage Joining multiple csv files

Read multiple files from s3 faster in python

I have multiple files in an S3 bucket folder. In Python I read the files one by one and used concat to build a single dataframe. However, it is pretty slow. If I have a million files, it will be extremely slow. Is there any other method available (like bash) that can speed up the reading of the S3 files?
response = client.list_objects_v2(
    Bucket='bucket',
    Prefix=f'key'
)

dflist = []
for obj in response.get('Contents', []):
    dflist.append(get_data(obj, col_name))

pd.concat(dflist)

def get_data(obj, col_name):
    data = pd.read_csv(f's3://bucket/{obj.get("Key")}', delimiter='\t', header=None,
                       usecols=col_name.keys(), names=col_name.values(),
                       error_bad_lines=False)
    return data
As S3 is object storage, you need to bring the file onto your computer (i.e., read it into memory), edit it, and then push it back (rewrite the object).
So it takes some time to achieve your task.
Some helpful pointers:
If you process multiple files in multiple threads, that will help (see the sketch after these pointers).
If your data is really heavy, start an instance on AWS in the same region as your bucket, process the data from there, and then terminate it (this saves network cost plus the time to pull and push files across networks).
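A minimal sketch of the threaded approach, reusing response, get_data and col_name from the question's snippet (they are assumed to be defined as above):
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

objects = response.get('Contents', [])

with ThreadPoolExecutor(max_workers=16) as pool:        # tune worker count to taste
    frames = list(pool.map(lambda obj: get_data(obj, col_name), objects))

df = pd.concat(frames, ignore_index=True)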
You can use AWS SDK for pandas, a library that extends pandas to work smoothly with AWS data stores. Its read_csv can also read multiple CSV files from an S3 folder.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/folder/")
It can be installed via pip install awswrangler.

How to load big datasets like million song dataset into BigData HDFS or Hbase or Hive?

I have downloaded a subset of the Million Song Dataset, which is about 2GB. However, the data is broken down into folders and subfolders. Within the subfolders, the files are all in 'H5 file' format. I understand they can be read using Python, but I do not know how to extract them and load them into HDFS so I can run some data analysis in Pig.
Do I extract them as CSV and load them into HBase or Hive? It would help if someone could point me to the right resource.
If it's already in CSV, or any other format on the Linux filesystem that Pig can understand, just do a hadoop fs -copyFromLocal to copy it into HDFS.
If you want to read/process the raw H5 file format using Python on HDFS, look at hadoop-streaming (map/reduce).
Python can handle 2GB on a decent Linux system; not sure if you need Hadoop for it.
Don't load that many small files into HDFS. Hadoop doesn't handle lots of small files well. Each small file incurs overhead because the block size (usually 64MB) is much bigger.
I want to do this myself, so I'm thinking about solutions. The Million Song Dataset files are no more than 1MB each. My approach would be to aggregate the data somehow before importing it into HDFS.
The blog post "The Small Files Problem" from Cloudera may shed some light.
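For example, if the H5 files are first converted to CSV (as the question suggests), a small aggregation pass before hadoop fs -copyFromLocal could look like this sketch (paths and batch size are placeholders, and the CSVs are assumed to have no header row):
import glob
import os

SRC = "/data/millionsong/csv"      # directory of small CSV files (placeholder)
DST = "/data/millionsong/merged"   # directory for aggregated output files (placeholder)
BATCH = 500                        # number of small files per merged file

os.makedirs(DST, exist_ok=True)
files = sorted(glob.glob(os.path.join(SRC, "**", "*.csv"), recursive=True))

for i in range(0, len(files), BATCH):
    out_path = os.path.join(DST, f"part-{i // BATCH:05d}.csv")
    with open(out_path, "w") as out:
        for path in files[i:i + BATCH]:
            with open(path) as f:          # assumes headerless CSVs
                out.write(f.read())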
