Does anyone know how I can save screenshots from a Databricks notebook directly to ADLS?
I have set up the connection, but for some reason I cannot write to ADLS directly, so I have to save to DBFS and then move the files to ADLS.
Currently this works:
driver.save_screenshot('/dbfs/test.png')
dbutils.fs.mv('dbfs:/', 'abfss://<container>@<storage-account>.dfs.core.windows.net/', recurse=True)
Ideally, I want to do this in a single step rather than saving to the root storage and then moving, so something like this:
driver.save_screenshot('abfss://<container>@<storage-account>.dfs.core.windows.net/test.png')
or
driver.save_screenshot('/abfss/<storage-account>/<container>/test.png')
Finally, does setting up Unity Catalog and a corresponding metastore help in changing the DBFS root storage location to ADLS, so I can use the container directly without specifying the full URI every time? What are the best practices in such cases?
Many Thanks!
Databricks treats an 'abfss' path as an external path, so functions that expect a local file path (such as save_screenshot) cannot write to it directly.
Instead of the 'abfss' path, mount the container and save the file into ADLS through the mount, using the /dbfs FUSE prefix so the local-file API can reach it:
driver.save_screenshot('/dbfs/mnt/<mount-name>/<foldername>/test.png')
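For reference, a minimal sketch of creating such a mount with a service principal (the client id, tenant id, secret scope/key, and mount name below are placeholders, not values from the original post):
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Mount the container once; afterwards it is visible at dbfs:/mnt/<mount-name>
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)

# Local-file APIs such as save_screenshot reach the mount through the /dbfs FUSE prefix
driver.save_screenshot("/dbfs/mnt/<mount-name>/test.png")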
Related
I am a newbie to Databricks and am trying to write results into an Excel/CSV file using the command below, but I am getting "'DataFrame' object has no attribute 'to_csv'" errors while executing.
I am using a notebook to execute my SQL queries and now want to store the results in a CSV or Excel file.
%python
df = spark.sql("""select * from customer""")
and now I want to store the query results in an Excel/CSV file. I have tried the code below, but it's not working:
df.coalesce(1).write.option("header","true").option("sep",",").mode("overwrite").csv("file:///C:/New folder/mycsv.csv")
AND
df.write.option("header", "true").csv("file:///C:/New folder/mycsv.csv")
There is no direct way to write a DataFrame to your local machine; Databricks does not recognize paths on your local machine.
The df.write.option("header", "true").csv("file:///C:/New folder/mycsv.csv") command runs successfully because file:/ is a valid path inside Databricks (it points to the driver node's filesystem), not to your local machine. You can use display(dbutils.fs.ls("file:/C:/")) to see its contents.
The best ways to download the results to your local machine are the following:
1. Using UI from display():
Use the following code.
%python
display(df)
Your dataframe will be displayed with a few UI options. You can use the download symbol to download the results.
2. Using Filestore:
First, we have to enable the DBFS file browser. Navigate to Settings -> Admin Console -> Workspace Settings. Under the Advanced section there is an option called DBFS File Browser; enable it and reload the page. You will then be able to browse the DBFS FileStore from the Data tab.
Write the dataframe to this location using the following code.
df.coalesce(1).write.option("header","true").option("sep",",").mode("overwrite").csv("dbfs:/FileStore/tables/Output")
# coalesce(1) is required so that Spark writes a single output part file
Now use displayHTML in the following way inside a Python cell:
%python
displayHTML("<a href='/FileStore/tables/Output/opfile.csv' download>Download CSV </a>")
# I renamed my file to opfile.csv.
# You can find your file by navigating to Data -> DBFS -> FileStore -> tables -> Output.
# Right-click the part file to rename it.
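If you prefer to rename the part file programmatically rather than through the UI, a small sketch with dbutils (assuming the same Output directory as above):
output_dir = "dbfs:/FileStore/tables/Output"
# Spark names its output part-00000-<uuid>.csv; find it and rename it to opfile.csv
part_file = [f.path for f in dbutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, output_dir + "/opfile.csv")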
These are some ways to download a DataFrame to your local machine, but there is no way to write it there directly.
UPDATE:
Using a pandas DataFrame also does not work; the file is still saved inside Databricks itself:
pdf = df.toPandas()
pdf.to_csv('C://New folder/myoutput.csv', sep=',', header=True, index=False)
# runs successfully
import os
print(os.listdir("/"))
Listing the root directory shows that the output is written inside Databricks (into a C:/ folder on the driver), not onto your local machine.
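If the goal is simply a downloadable file, the pandas route can be combined with the FileStore link from option 2 - point to_csv at the /dbfs FUSE path and serve the file with a download link (a sketch, reusing the Output folder from above):
pdf = df.toPandas()
dbutils.fs.mkdirs("dbfs:/FileStore/tables/Output")
# /dbfs/... is the FUSE view of DBFS, so plain Python I/O lands in FileStore
pdf.to_csv("/dbfs/FileStore/tables/Output/myoutput.csv", sep=",", header=True, index=False)
displayHTML("<a href='/FileStore/tables/Output/myoutput.csv' download>Download CSV</a>")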
I used the following code to read a shapefile from dbfs:
geopandas.read_file("file:/databricks/folderName/fileName.shp")
Unfortunately, I don't have access to do so and I get the following error:
DriverError: dbfs:/databricks/folderName/fileName.shp: Permission denied
Any idea how to grant the access? The file exists there (I have permission to save a file there using dbutils, and I can read it from there using Spark, but I have no idea how to read it using geopandas).
After adding those lines:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
...from the suggestion below, I get another error:
org.apache.spark.api.python.PythonSecurityException: Path 'file:/tmp/fileName.shp' uses an untrusted filesystem 'org.apache.hadoop.fs.LocalFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)
GeoPandas doesn't know anything about DBFS - it works with local files. So you need either:
to use the DBFS FUSE mount to read the file from DBFS (but there are some limitations):
geopandas.read_file("/dbfs/databricks/folderName/fileName.shp")
or to use the dbutils.fs.cp command to copy the file from DBFS to the local filesystem and read it from there:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
P.S. But if the file is already copied to the driver node, then you just need to remove file: from the name.
Updated after updated question:
There are limitations on what can be done on AAD passthrough clusters, so your administrator needs to change the cluster configuration as described in the troubleshooting documentation if you want to copy files from DBFS to the local file system.
But the /dbfs route should work for passthrough clusters as well, although it requires at least DBR 7.3 (docs).
Okay, the answer is easier than I thought:
geopandas.read_file("/dbfs/databricks/folderName")
(the folder name, since it is a folder containing all the shapefile components)
Why does it work like that? Easy. Enable the DBFS file browser in the admin console ('Advanced' tab), click on the file you need, and you will see two possible paths to it: one dedicated to the Spark API and one for the File API (the latter is what I needed).
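For illustration, the two path flavours for the same shapefile folder look like this:
import geopandas

spark_api_path = "dbfs:/databricks/folderName"  # for Spark readers and dbutils.fs
file_api_path = "/dbfs/databricks/folderName"   # for local-file libraries such as GeoPandas

gdf = geopandas.read_file(file_api_path)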
:)
I am working in ADF and need to export data from a SQL source to an Excel destination. Is there a way to use Excel (.xlsx) as a destination in ADF? A notebook?
Just taking a guess - use CSV as the target and, in the properties, give .xlsx as the file suffix. It might work, but you'll still have to download the file from blob storage using a Logic App etc.
Excel isn't supported as a sink in ADF, so using a notebook is one way to achieve this. Another way in ADF is to create an Azure Function and then invoke it with an Azure Function activity.
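For the notebook route, a rough sketch (the query, the /mnt/output mount point, and the use of openpyxl are assumptions, and openpyxl would need to be installed on the cluster):
df = spark.sql("select * from customer")  # placeholder query against the SQL source
pdf = df.toPandas()

# Write the workbook to local disk first (the /dbfs FUSE mount may reject the random
# writes the xlsx writer needs), then copy the finished file to the mounted container.
local_path = "/tmp/output.xlsx"
pdf.to_excel(local_path, index=False, engine="openpyxl")
dbutils.fs.cp("file:" + local_path, "dbfs:/mnt/output/output.xlsx")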
I am currently writing an MLflow artifact to DBFS using pandas, with the code below...
temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
    df.to_csv(temp_name, index=False)
    mlflow.log_artifact(temp_name, "******")
finally:
    temp.close()  # Delete the temp file
How would I write this if 'df' was a spark dataframe?
You just need to use file path URLs with the proper protocol. "dbfs" is the generic Databricks one; for Azure, "abfss" would be needed. (I can't recall AWS's S3 extension.)
filepath="dbfs:///filepath"
df # My Spark DataFrame
df.write.csv(filepath)
mlflow.log_artifact(temp_name, filepath)
It looks like in your case the problem has to do with how the Spark APIs access the filesystem versus how the Python APIs access it; see here for details. This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your Parquet to the local filesystem and have MLflow log it from there with something like:
with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')
Keep in mind that a Parquet "file" is actually a directory with a whole bunch of files in it, so you need to use log_artifacts, not log_artifact, and if you don't specify artifact_path you'll get all the little files that make up the Parquet directory dumped straight into the root of your MLflow artifacts. Also, MLflow doesn't have any preview capability for Parquet files, so depending on your use case, logging Parquet artifacts may not be as convenient as it first seems.
HTH
I have data on my local drive spread over a lot of files, and I want to access that data from Google Colab. Since it is spread across many folders and is subject to constant change, I don't want to use the upload() option, as that can get tedious and slow.
Uploading to Drive is also something I am trying to avoid, because of the changing data values.
So I was wondering if there is another method to access the local data, something similar to the code below.
import os

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in dirs:
            r.append(os.path.join(root, name))
    return r

train_path = list_files('/home/path/to/folder/containing/data/')
This does not seem to work, since Colab cannot access my local machine, so I always get an empty array (0,) back from the function.
The short answer is: no, you can't. The long answer is: you can skip the uploading phase each time you restart the runtime. You just need to use the google.colab package to get behaviour similar to your local environment. Upload all the files you need to your Google Drive, then mount it:
from google.colab import drive
drive.mount('/content/gdrive')
After the authentication step, you will be able to access all the files stored in your Google Drive. They will appear exactly as you uploaded them, so you just have to modify the last line in this way:
train_path = list_files('gdrive/My Drive/path/to/folder/containing/data/')
or, with the absolute path:
train_path = list_files('/content/gdrive/My Drive/path/to/folder/containing/data/')