Logging a PySpark DataFrame into an MLflow artifact - python

I am currently writing an MLflow artifact to DBFS, but I am doing it with pandas, using the code below:
import tempfile

import mlflow

temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
    df.to_csv(temp_name, index=False)
    mlflow.log_artifact(temp_name, "******")
finally:
    temp.close()  # delete the temp file
How would I write this if df were a Spark DataFrame?

You just need to use file path URLs with the proper protocols. "dbfs" is the generic Databricks one. For Azure, "abfss" would be needed. (I cannot recall AWS's S3 scheme.)
filepath = "dbfs:///filepath"
df.write.csv(filepath)                    # df is my Spark DataFrame
mlflow.log_artifact(temp_name, filepath)  # temp_name as in the question's snippet
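A minimal sketch of how this could look on Databricks, assuming the cluster exposes the DBFS FUSE mount at /dbfs and an MLflow run is active (the paths and artifact name below are placeholders):

import mlflow

spark_path = "dbfs:/tmp/my_table_csv"   # Spark API path (placeholder)
local_path = "/dbfs/tmp/my_table_csv"   # same location through the DBFS FUSE mount

# Spark writes a directory of part files, not a single CSV file.
df.coalesce(1).write.mode("overwrite").option("header", True).csv(spark_path)

# log_artifacts logs the whole directory into the run's artifact store.
mlflow.log_artifacts(local_path, artifact_path="my_table_csv")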

It looks like in your case the problem has to do with how the Spark APIs access the filesystem versus how the Python APIs access it; see here for details. This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your Parquet to the local filesystem, and MLflow can log it from there with something like:
import tempfile

with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')
Keep in mind that a Parquet "file" is actually a directory containing a whole bunch of files, so you need to use log_artifacts, not log_artifact. If you don't specify artifact_path, you'll get all the little files that make up the Parquet file (directory) dumped directly into the root of your MLflow artifacts. Also, MLflow doesn't have any previewing capability for Parquet files, so depending on your use case, logging Parquet artifacts may not be as convenient as it first seems.
HTH

Related

to_csv "No Such File or Directory" But the directory does exist - Databricks on ADLS

I've seen many iterations of this question but cannot seem to understand/fix this behavior.
I am on Azure Databricks (DBR 10.4 LTS, Spark 3.2.1, Scala 2.12) trying to write a single CSV file to blob storage so that it can be dropped to an SFTP server. I could not use spark-sftp because I am on Scala 2.12, unfortunately, and could not get the library to work.
Given this is a small dataframe, I am converting it to pandas and then attempting to_csv.
to_export = df.toPandas()
to_export.to_csv(pathToFile, index = False)
I get the error: [Errno 2] No such file or directory: '/dbfs/mnt/adls/Sandbox/user/project_name/testfile.csv'
Based on the information in other threads, I create the directory with dbutils.fs.mkdirs("/dbfs/mnt/adls/Sandbox/user/project_name/"), which returns Out[40]: True.
The response is true and the directory exists, yet I still get the same error. I'm convinced it is something obvious and I've been staring at it for too long to notice. Does anyone see what my error may be?
Python's pandas library recognizes the path only when it is in File API format (since you are using a mount), whereas dbutils.fs.mkdirs uses the Spark API format, which is different from the File API format.
As you are creating the directory using dbutils.fs.mkdirs with the path /dbfs/mnt/adls/Sandbox/user/project_name/, that path is actually treated as dbfs:/dbfs/mnt/adls/Sandbox/user/project_name/. Hence, the directory gets created inside DBFS under /dbfs/... rather than inside your mounted container.
dbutils.fs.mkdirs('/dbfs/mnt/repro/Sandbox/user/project_name/')
So you have to modify the code that creates the directory to the following:
dbutils.fs.mkdirs('/mnt/repro/Sandbox/user/project_name/')
#OR
#dbutils.fs.mkdirs('dbfs:/mnt/repro/Sandbox/user/project_name/')
Writing to the folder would now work without any issue.
pdf.to_csv('/dbfs/mnt/repro/Sandbox/user/project_name/testfile.csv', index=False)
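To make the two path formats concrete, here is a small sketch using the same hypothetical mount point as above (dbutils and the pandas DataFrame pdf are assumed to be available in the notebook):

import os

spark_path = "dbfs:/mnt/repro/Sandbox/user/project_name/"  # Spark API format (dbutils, Spark)
fuse_path = "/dbfs/mnt/repro/Sandbox/user/project_name/"   # File API format (local FUSE mount)

dbutils.fs.mkdirs(spark_path)     # create the directory via the Spark API
print(os.path.isdir(fuse_path))   # the same directory, seen through the File API

# pandas, being a plain Python library, needs the File API path:
pdf.to_csv(fuse_path + "testfile.csv", index=False)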

How can I save files to ADLS from Azure Databricks python notebook?

Does anyone know how I can save screenshots from a Databricks notebook directly to ADLS?
I have set up the connection, but for some reason I cannot do it directly, so I have to save to DBFS and then move the files to ADLS.
Currently this works:
driver.save_screenshot('/dbfs/test.png')
dbutils.fs.mv('dbfs:/', 'abfss://<container>@<storage-account>.dfs.core.windows.net/', recurse=True)
Ideally, I want to do this in a single step rather than saving to the root storage and then moving, so something like this:
driver.save_screenshot('abfss://<container>@<storage-account>.dfs.core.windows.net/test.png')
or
driver.save_screenshot('/abfss/<storage-account>/<container>/test.png')
Finally, does setting up Unity Catalog and a corresponding metastore help in changing the DBFS root storage location to ADLS, so I can use the container directly without specifying the links every time? What are the best practices in such cases?
Many Thanks!
Databricks treats 'abfss' as an external path, so some functions only accept a local path when saving files.
Instead of using the 'abfss' path, we have to provide a mount ('/mnt') path to save the file directly into ADLS.
driver.save_screenshot('/mnt/Input_path/<foldername>/test.png')
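If the container is not mounted yet, a one-time mount roughly along these lines makes the '/mnt' path available. This is only a sketch assuming service-principal OAuth; all angle-bracket values are placeholders, and the exact extra_configs depend on how you authenticate. Note that for local file APIs the mount may also need the /dbfs prefix, i.e. /dbfs/mnt/Input_path/...:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/Input_path",
    extra_configs=configs,
)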

Is it possible to do file-based processing with a UDF in PySpark?

I have a UDF defined which does the following with a dataframe where a column contains the location of zip files in Azure blob storage (I tested the UDF without Spark and that worked out):
downloads the defined file from the blob and saves it somewhere on the Executor/Driver
extracts a certain file from the blob and saves it on the Executor/Driver
With this UDF I find it is the same speed as if I just looped over the files in Python. So is it even possible to do this kind of task in Spark? I wanted to use Spark to parallelize the download and unzipping to speed it up.
I connected via SSH to the Executor and the Driver (it is a test cluster, so it only has one of each) and found out that the data was only processed on the Executor and the Driver did not do anything at all. Why is that so?
The next step would be to read the extracted files (normal CSVs) into a Spark DataFrame. But how can this be done if the files are distributed over the Executor and Driver? I have not yet found a way to access the storage of the Executors. Or is it somehow possible, within the UDF, to write the results back to a common location on the Driver?
I would then like to read the extracted files with:
data_frame = (
    spark
    .read
    .format('csv')
    .option('header', True)
    .option('delimiter', ',')
    .load(f"/mydriverpath/*.csv")
)
If there is another method to parallelize the download and unzipping of the files I would be happy to hear about it.
PySpark readers / writers make it easy to read and write files in parallel. When working in Spark, you generally should not loop over files or save data on the driver node.
Suppose you have 100 gzipped CSV files in the my-bucket/my-folder directory. Here's how you can read them into a DataFrame in parallel:
df = spark.read.csv("my-bucket/my-folder")
And here's how you can write them to 50 Snappy compressed Parquet files (in parallel):
df.repartition(50).write.parquet("my-bucket/another-folder")
The readers / writers do all the heavy lifting for you. See here for more info about repartition.
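As a slightly fuller sketch (the bucket and folder names are placeholders, and I'm assuming the compressed CSVs have a header row; Spark decompresses gzip transparently when reading):

df = (
    spark.read
    .option("header", True)
    .option("delimiter", ",")
    .csv("my-bucket/my-folder/*.csv.gz")
)

(
    df.repartition(50)
    .write
    .mode("overwrite")
    .parquet("my-bucket/another-folder")
)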

Is there a way to log the descriptive stats of a dataset using MLflow?

Is there a way to log the descriptive stats of a dataset using MLflow? If so, could you please share the details?
Generally speaking you can log arbitrary output from your code using the mlflow.log_artifact() function. From the docs:
mlflow.log_artifact(local_path, artifact_path=None)
Log a local file or directory as an artifact of the currently active run.
Parameters:
local_path – Path to the file to write.
artifact_path – If provided, the directory in artifact_uri to write to.
As an example, say you have your statistics in a pandas dataframe, stat_df.
## Write csv from stats dataframe
stat_df.to_csv('dataset_statistics.csv')
## Log CSV to MLflow
mlflow.log_artifact('dataset_statistics.csv')
This will show up under the artifacts section of this MLflow run in the Tracking UI. If you explore the docs further you'll see that you can also log an entire directory and the objects therein. In general, MLflow provides you a lot of flexibility - anything you write to your file system you can track with MLflow. Of course that doesn't mean you should. :)
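The docs also mention logging an entire directory; a minimal sketch of that (with a made-up DataFrame and directory name) could look like this:

import os

import mlflow
import pandas as pd

# Hypothetical raw dataset; replace with your own DataFrame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Write several summary files into one local directory ...
os.makedirs("dataset_stats", exist_ok=True)
df.describe().to_csv("dataset_stats/describe.csv")
df.dtypes.astype(str).to_frame("dtype").to_csv("dataset_stats/dtypes.csv")

# ... and log the whole directory in a single call.
with mlflow.start_run():
    mlflow.log_artifacts("dataset_stats", artifact_path="dataset_stats")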
There is also the possibility to log the artifact as an html file such that it is displayed as an (ugly) table in mlflow.
import seaborn as sns
import mlflow

mlflow.start_run()
df_iris = sns.load_dataset("iris")
df_iris.describe().to_html("iris.html")
mlflow.log_artifact("iris.html", "stat_descriptive")
mlflow.end_run()
As the answers pointed out, MLflow allows for uploading any local files. But good practice is to dump to, and upload from, temporary files.
The advantages over the accepted answer are: no leftover files, and no issues with parallelization.
import tempfile

import numpy as np
import mlflow

with tempfile.TemporaryDirectory() as tmpdir:
    fname = tmpdir + '/' + 'bits_corr_matrix.csv'
    np.savetxt(fname, corr_matrix, delimiter=',')
    mlflow.log_artifact(fname)

Read multiple files from s3 faster in python

I have multiple files in an S3 bucket folder. In Python I read the files one by one and used concat to build a single dataframe. However, it is pretty slow. If I have a million files then it will be extremely slow. Is there any other method available (like bash) that can speed up reading the S3 files?
response = client.list_objects_v2(
    Bucket='bucket',
    Prefix=f'key'
)

dflist = []
for obj in response.get('Contents', []):
    dflist.append(get_data(obj, col_name))
pd.concat(dflist)

def get_data(obj, col_name):
    data = pd.read_csv(f's3://bucket/{obj.get("Key")}', delimiter='\t', header=None,
                       usecols=col_name.keys(), names=col_name.values(),
                       error_bad_lines=False)
    return data
As S3 is object storage, you need to bring the file onto your computer (i.e. read it into memory), edit it, and then push it back (rewrite the object).
So it takes some time to achieve your task.
Some helpful pointers:
If you process multiple files in multiple threads, that will help you (see the sketch after this list).
If your data is really heavy, start an instance on AWS in the same region as your bucket, process the data from there, and terminate it when done (this saves the network cost plus the time to pull and push files across networks).
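To illustrate the threading pointer, a minimal sketch reusing the question's get_data and col_name (both assumed to be defined as above):

from concurrent.futures import ThreadPoolExecutor

import boto3
import pandas as pd

client = boto3.client('s3')

# Note: list_objects_v2 returns at most 1000 keys per call; use a paginator for more.
response = client.list_objects_v2(Bucket='bucket', Prefix='key')
objects = response.get('Contents', [])

# Reading from S3 is I/O bound, so a thread pool overlaps the downloads.
with ThreadPoolExecutor(max_workers=16) as pool:
    dflist = list(pool.map(lambda obj: get_data(obj, col_name), objects))

df = pd.concat(dflist, ignore_index=True)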
You can use AWS SDK for pandas, a library that extends pandas to work smoothly with AWS data stores. Its read_csv can also read multiple CSV files from an S3 folder.
import awswrangler as wr
df = wr.s3.read_csv("s3://bucket/folder/")
It can be installed via pip install awswrangler.
