Normally I use the URL below to download a file from the Databricks DBFS FileStore to my local computer.
*https://<MY_DATABRICKS_INSTANCE_NAME>/fileStore/?o=<NUMBER_FROM_ORIGINAL_URL>*
However, this time the file is not downloaded and the URL leads me to the Databricks homepage instead.
Does anyone have a suggestion on how I can download a file from DBFS to my local machine, or how I should fix the URL to make it work?
Any suggestions would be greatly appreciated!
PJ
Method 1: Using the Databricks portal GUI, you can download the full results (max 1 million rows).
Method 2: Using the Databricks CLI
To download the full results, first save the file to DBFS and then copy it to the local machine using the Databricks CLI as follows.
dbfs cp "dbfs:/FileStore/tables/my_my.csv" "A:\AzureAnalytics"
You can access DBFS objects using the DBFS CLI, DBFS API, Databricks file system utilities (dbutils.fs), Spark APIs, and local file APIs.
In a Spark cluster you access DBFS objects using Databricks file system utilities, Spark APIs, or local file APIs.
On a local computer you access DBFS objects using the Databricks CLI or DBFS API.
Reference: Azure Databricks – Access DBFS
The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, you can interact with DBFS using commands similar to those you use on a Unix command line. For example:
# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana
Reference: Installing and configuring Azure Databricks CLI
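If you prefer not to install the CLI, the same download can be scripted directly against the DBFS REST API. Below is a minimal sketch using the /api/2.0/dbfs/read endpoint with the requests library; the workspace URL, personal access token, and file paths are placeholders for your own values.
# Minimal sketch: download a DBFS file to the local machine via the DBFS REST API.
# HOST, TOKEN, and the paths below are placeholders.
import base64
import requests

HOST = "https://<MY_DATABRICKS_INSTANCE_NAME>"
TOKEN = "<PERSONAL_ACCESS_TOKEN>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def download_dbfs_file(dbfs_path, local_path, chunk=1024 * 1024):
    # The read endpoint returns base64-encoded data, at most 1 MB per call.
    offset = 0
    with open(local_path, "wb") as out:
        while True:
            resp = requests.get(
                f"{HOST}/api/2.0/dbfs/read",
                headers=HEADERS,
                params={"path": dbfs_path, "offset": offset, "length": chunk},
            )
            resp.raise_for_status()
            body = resp.json()
            out.write(base64.b64decode(body["data"]))
            offset += body["bytes_read"]
            if body["bytes_read"] < chunk:
                break

download_dbfs_file("/FileStore/tables/my_my.csv", "my_my.csv")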
Method 3: Using a third-party tool named DBFS Explorer
DBFS Explorer was created as a quick way to upload and download files to the Databricks filesystem (DBFS). This will work with both AWS and Azure instances of Databricks. You will need to create a bearer token in the web interface in order to connect.
Not able to download an entire directory from an Azure file share in Python
I have tried all the basic approaches available on Google:
from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(connection_string, "filshare")
my_file = share.get_file_client("dir1/sub_idr1")
# print(dir(my_file))
stream_1 = my_file.download_file()
I tried this in my environment and got the results below.
Initially, I tried with Python.
Unfortunately, the ShareServiceClient class, which is the client for interacting with the File Share service at the account level, does not support a download operation in the Azure Python SDK.
The ShareClient class, which interacts with a specific file share in the account, does not support downloading a directory or an entire file share either.
However, the ShareFileClient class does support downloading individual files inside a directory (though not an entire directory), so you can use this class to download files from a directory with the Python SDK.
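For reference, a small sketch of downloading one such file with ShareFileClient; the connection string, share name and file path below are placeholders for your own values.
# Sketch: download a single file from an Azure file share with ShareFileClient.
# The connection string, share name and file path are placeholders.
from azure.storage.fileshare import ShareFileClient

connection_string = "<YOUR_STORAGE_ACCOUNT_CONNECTION_STRING>"
file_client = ShareFileClient.from_connection_string(
    conn_str=connection_string,
    share_name="filshare",
    file_path="dir1/sub_idr1/example.csv",
)

with open("example.csv", "wb") as local_file:
    stream = file_client.download_file()  # returns a StorageStreamDownloader
    stream.readinto(local_file)           # write the downloaded bytes to the local file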
You can also check in the Azure Portal > your storage account > File shares > directory: there is no option in the portal to download a directory either, only an option to download a specific file.
As a workaround, if you need to download a directory from a file share, you can use the AzCopy tool to download the directory to your local machine.
I tried to download the directory with the AzCopy command and it completed successfully!
Command:
azcopy copy 'https://mystorageaccount.file.core.windows.net/myfileshare/myFileShareDirectory?sv=2018-03-28&ss=bjqt&srs=sco&sp=rjklhjup&se=2019-05-10T04:37:48Z&st=2019-05-09T20:37:48Z&spr=https&sig=/SOVEFfsKDqRry4bk3xxxxxxxx' 'C:\myDirectory' --recursive --preserve-smb-permissions=true --preserve-smb-info=true
I'm currently working on moving a Python .whl file that I have generated in DBFS to my repo located at /Workspace/Repos/My_Repo/My_DBFS_File, so that I can commit the file to Azure DevOps.
As Databricks Repos is a read-only location, it does not permit me to programmatically copy the file to the repo location.
However, the UI provides options to create or import files from various locations, except DBFS.
Is there a workaround to actually move DBFS files to Repos and then commit them to Azure DevOps?
The documentation says:
Databricks Runtime 11.2 or above.
In a Databricks Repo, you can programmatically create directories and create and append to files. This is useful for creating or modifying an environment specification file, writing output from notebooks, or writing output from execution of libraries, such as Tensorboard.
Using a Databricks cluster with Runtime 11.2 solved my issue.
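For anyone else, here is a minimal sketch of the kind of copy this enables, run from a notebook attached to a Runtime 11.2+ cluster; the wheel name and repo path below are placeholders for your own files.
# Run in a notebook on a cluster with Databricks Runtime 11.2 or above.
# Both paths are placeholders for your own wheel and repo location.
import shutil

src = "/dbfs/FileStore/wheels/my_package-0.1.0-py3-none-any.whl"  # DBFS file via the local /dbfs mount
dst = "/Workspace/Repos/My_Repo/My_DBFS_File/my_package-0.1.0-py3-none-any.whl"

shutil.copy(src, dst)  # on 11.2+ the repo path accepts programmatic writes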
With the Google Cloud CLI you can specify a local jar with the --jars flag when submitting a job. However, I want to submit a job using the Python API. I have that working, but when I specify the jar with the file: prefix, it looks on the Dataproc master cluster rather than on my local workstation.
There is an easy workaround, which is to just upload the jar using the GCS library first, but I wanted to check whether the Dataproc client libraries already support this convenience feature.
Not at the moment. As you mentioned, the most convenient way to do this strictly using the Python client libraries would be to use the GCS client first and then point to your job file in GCS.
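To illustrate that workaround, here is a rough sketch that stages the local jar in a bucket with the GCS client and then submits the Spark job through the Dataproc Python client; the project, region, cluster, bucket, jar and main class names are all placeholders.
# Sketch of the GCS-first workaround; all names below are placeholders.
from google.cloud import dataproc_v1, storage

project_id = "my-project"
region = "us-central1"
cluster_name = "my-cluster"
bucket_name = "my-staging-bucket"
local_jar = "target/my-job.jar"
gcs_jar = "jars/my-job.jar"

# 1. Upload the local jar to GCS.
storage.Client(project=project_id).bucket(bucket_name).blob(gcs_jar).upload_from_filename(local_jar)

# 2. Submit the Spark job referencing the staged jar.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "com.example.MyJob",
        "jar_file_uris": [f"gs://{bucket_name}/{gcs_jar}"],
    },
}
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
print(operation.result())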
I installed the Databricks CLI (which uses the REST API) and now I want to save a test file to my local desktop. This is the command I have, but it throws a syntax error:
dbfs cp dbfs:/myname/test.pptx /Users/myname/Desktop
SyntaxError: invalid syntax
Note, I am on a Mac, so hopefully the path is correct. What am I doing wrong?
To download files from DBFS to a local machine, you can check out the following similar SO thread, which addresses the same issue:
Not able to copy file from DBFS to local desktop in Databricks
OR
Alternatively, you can use a GUI tool called DBFS Explorer to download the files to your local machine.
DBFS Explorer was created as a quick way to upload and download files to the Databricks filesystem (DBFS). This will work with both AWS and Azure instances of Databricks. You will need to create a bearer token in the web interface in order to connect.
Reference: DBFS Explorer for Databricks
Hope this helps.
From Google Colaboratory, if I want to read/write to a folder in a given bucket created in Google Cloud Storage, how do I achieve this?
I have created a bucket, a folder within the bucket, and uploaded a bunch of images into it. Now from Colaboratory, using a Jupyter notebook, I want to create multiple sub-directories to organise these images into train, validation and test folders.
Subsequently, I want to access the respective folders for training, validating and testing the model.
With Google Drive, we just update the path to point to a specific directory with the following commands, after authentication.
import sys
sys.path.append('drive/xyz')
We do something similar with the desktop version as well:
import os
os.chdir(local_path)
Does something similar exist for Google Cloud Storage?
In the Colaboratory FAQs, there is a procedure for reading and writing a single file, where we need to set the entire path. That would be tedious for re-organising a main directory into sub-directories and accessing them separately.
In general it's not a good idea to try to mount a GCS bucket on the local machine (which would allow you to use it as you mentioned). From Connecting to Cloud Storage buckets:
Note: Cloud Storage is an object storage system that does not have the same write constraints as a POSIX file system. If you write data to a file in Cloud Storage simultaneously from multiple sources, you might unintentionally overwrite critical data.
Assuming you'd like to continue regardless of the warning, if you use a Linux OS you may be able to mount it using the Cloud Storage FUSE adapter. See related How to mount Google Bucket as local disk on Linux instance with full access rights.
The recommended way to access GCS from Python apps is using the Cloud Storage Client Libraries, but accessing files will be different from your snippets. You can find some examples at Python Client for Google Cloud Storage:
from google.cloud import storage
client = storage.Client()
# https://console.cloud.google.com/storage/browser/[bucket-id]/
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())
blob.upload_from_string('New contents!')
blob2 = bucket.blob('remote/path/storage.txt')
blob2.upload_from_filename(filename='/local/path.txt')
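Since GCS has no real directories, the train/validation/test layout can be emulated with object-name prefixes. Continuing with the client and bucket from the example above, here is a small sketch under that assumption (the object names are placeholders):
# List every object under the emulated train/ "folder".
for blob in client.list_blobs('bucket-id-here', prefix='train/'):
    print(blob.name)

# "Move" an image into validation/ by copying it under a new name and deleting the original.
src = bucket.get_blob('images/img_001.jpg')
bucket.copy_blob(src, bucket, new_name='validation/img_001.jpg')
src.delete()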
Update:
The Colaboratory doc recommends another method that I forgot about, based on the Google API Client Library for Python, but note that it also doesn't operate like a regular filesystem; it uses an intermediate file on the local filesystem:
uploading files to GCS
downloading files from GCS: