I used the following code to read a shapefile from DBFS:
geopandas.read_file("file:/databricks/folderName/fileName.shp")
Unfortunately, I don't have access to do so, and I get the following error:
DriverError: dbfs:/databricks/folderName/fileName.shp: Permission denied
Any idea how to grant the access? The file exists there (I have permission to save a file there using dbutils, and I can read a file from there using Spark, but I have no idea how to read it using pyspark).
After adding these lines:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
...as suggested in the answer below, I get another error:
org.apache.spark.api.python.PythonSecurityException: Path 'file:/tmp/fileName.shp' uses an untrusted filesystem 'org.apache.hadoop.fs.LocalFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)
GeoPandas doesn't know anything about DBFS; it works with local files. So you need either:
to use the DBFS FUSE mount to read the file from DBFS (but there are some limitations):
geopandas.read_file("/dbfs/databricks/folderName/fileName.shp")
or to use the dbutils.fs.cp command to copy the file from DBFS to the local filesystem and read it from there:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
P.S. If the file has already been copied to the driver node, then you just need to remove file: from the name.
Update after the question was updated:
There are limitations on what can be done on AAD passthrough clusters, so if you want to copy files from DBFS to the local filesystem, your administrator needs to change the cluster configuration as described in the troubleshooting documentation.
But the /dbfs way should work for passthrough clusters as well, although it requires at least DBR 7.3 (docs).
Okay, the answer is easier than I thought:
geopandas.read_file("/dbfs/databricks/folderName")
(the folder name, since it is a folder containing all the shapefile components)
Why should it be like that? Easy. Enable the option to browse files on DBFS in the admin console ('Advanced' tab), click on the file you need, and you will get two possible paths to the file: one for the Spark API and one for the File API (the latter is what I needed).
:)
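To make the two path styles concrete, here is a small illustrative sketch using the folder from this question; only the File API line is what the accepted fix actually needs:
# Illustrative only: the same DBFS folder, seen through the two path styles
spark_api_path = "dbfs:/databricks/folderName"   # Spark API path (Spark readers, dbutils.fs, ...)
file_api_path = "/dbfs/databricks/folderName"    # File API path (local FUSE mount)

import geopandas
gdf = geopandas.read_file(file_api_path)         # GeoPandas, like any local library, needs the File API path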
Does anyone know how I can save screenshots from a Databricks notebook directly to ADLS?
I have set up the connection, but for some reason I cannot do it directly, so I have to save to DBFS and then move the files to ADLS.
Currently this works:
driver.save_screenshot('/dbfs/test.png')
dbutils.fs.mv('dbfs:/', 'abfss://<container>@<storage-account>.dfs.core.windows.net/', recurse=True)
Ideally, I want to do this in a single step rather than saving to the root storage and then moving, so something like this:
driver.save_screenshot('abfss://<container>@<storage-account>.dfs.core.windows.net/test.png')
or
driver.save_screenshot('/abfss/<storage-account>/<container>/test.png')
Finally, does setting up Unity Catalog and a corresponding metastore help in changing the DBFS root storage location to ADLS, so I can use the container directly without specifying the links every time? What are the best practices in such cases?
Many Thanks!
Databricks treats an 'abfss://' URI as an external path, so some functions can only save files to a local path.
Instead of the 'abfss' path, provide a mount ('/mnt') path to save the file directly into ADLS; for local file APIs such as save_screenshot, the mount is reached through the /dbfs FUSE prefix:
driver.save_screenshot('/dbfs/mnt/Input_path/<foldername>/test.png')
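If the container is not mounted yet, a minimal sketch of the mount-then-save flow might look like the following; the secret scope, key names, tenant ID, and mount point are placeholders and assumptions, not values from the question:
# Sketch only: mount the ADLS Gen2 container once with a service principal,
# then write through the /dbfs FUSE path of the mount.
# <container>, <storage-account>, <tenant-id>, the secret scope/key names and
# the mount point are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/Input_path",
    extra_configs=configs,
)

# Local file APIs (like Selenium's save_screenshot) write through the FUSE path:
driver.save_screenshot("/dbfs/mnt/Input_path/test.png")
Mounting is a one-time setup; after that, every notebook in the workspace can use the /mnt path without re-specifying credentials.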
How does one read a file into a variable directly from a shared link, without authenticating (it's a shared file, so you shouldn't need credentials) and without manually adding the file to your shared folder or downloading it?
Like this one:
https://drive.google.com/file/d/1HA3enF7c26JVm4ouGHBQ7v1_ToQ_pus9/view?usp=sharing
(1) Try adjusting your link to look like the one below. That should allow you to "read a file into a variable", since it begins an auto-download of the file bytes:
https://drive.google.com/uc?authuser=0&id=1HA3enF7c26JVm4ouGHBQ7v1_ToQ_pus9&export=download
(2) Your link is not testable at present (you need to change its view settings to Public).
Maybe that's why the download fails (it is seen as unauthorized access)?
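For completeness, a minimal sketch of reading that download URL into a variable with the requests library, assuming the file is shared publicly and small enough to hold in memory:
# Read the shared Google Drive file into a variable via the direct-download URL.
import requests

file_id = "1HA3enF7c26JVm4ouGHBQ7v1_ToQ_pus9"
url = "https://drive.google.com/uc?authuser=0&id={}&export=download".format(file_id)

response = requests.get(url)
response.raise_for_status()

data = response.content   # raw bytes of the file
text = response.text      # or the decoded text, if it is a text file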
I have written a Datadog Agent check in Python following the instructions on this page: https://docs.datadoghq.com/developers/agent_checks/.
The agent check is supposed to read all files in a specified network folder and then send certain metrics to Datadog.
The folder to be read is specified like this in the YAML file:
init_config:
  taskResultLocation: "Z:/TaskResults"
This is the code used to read the folder; it is Python 2.7 because that is what Datadog requires:
task_result_location = self.init_config.get('taskResultLocation')
# Loop through all the XML files in the specified folder
for file in os.listdir(task_result_location):
If I just run the Python script in my IDE, everything works correctly.
When the check is added to the Datadog Agent Manager on the same machine the IDE is on and the check is run, the following error appears in the Datadog Agent Manager log:
2018-08-14 14:33:26 EEST | ERROR | (runner.go:277 in work) | Error running check TaskResultErrorReader: [{"message": "[Error 3] The system cannot find the path specified: 'Z:/TaskResults/.'", "traceback": "Traceback (most recent call last):\n File \"C:\Program Files\Datadog\Datadog Agent\embedded\lib\site-packages\datadog_checks\checks\base.py\", line 294, in run\n self.check(copy.deepcopy(self.instances[0]))\n File \"c:\programdata\datadog\checks.d\TaskResultErrorReader.py\", line 42, in check\n for file in os.listdir(task_result_location):\nWindowsError: [Error 3] The system cannot find the path specified: 'Z:/TaskResults/.'\n"}]
I have tried specifying the folder location in multiple ways (single and double quotes, forward and back slashes, double slashes), but the same error is thrown.
Would anyone know if this is a YAML syntax error or some sort of issue with Datadog or Python?
Even though Datadog is run from the same machine, it sets up a separate service on your machine. Because of that, it sounds like the Datadog Agent doesn't have access to your Z:/ drive.
Try putting the "TaskResults" folder in your root directory (when running from Datadog, that is where the mycheck.yaml file is) and change the path accordingly.
If this works and you still want a common drive so you can share files between your computer and Datadog's agent, you will have to find a way to mount a drive/folder for the agent. There is probably a way to do that in the documentation.
The solution to this is to create a file share on the network drive and use that path instead of the full network drive path.
This may be obvious to some, but it didn't occur to me right away, since the same Python code worked without any issue outside of Datadog.
So instead of:
init_config:
  taskResultLocation: "Z:/TaskResults"
use:
init_config:
  taskResultLocation: '//FileShareName/d/TaskResults'
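Putting it together, a minimal self-contained version of such a check might look like the sketch below; the class name matches the one in the log above, but the metric name and the .xml filter are illustrative assumptions:
# Sketch of the agent check reading the configured folder (Python 2.7 style).
# The import path follows the traceback above; newer agents also expose datadog_checks.base.
import os
from datadog_checks.checks import AgentCheck


class TaskResultErrorReader(AgentCheck):
    def check(self, instance):
        task_result_location = self.init_config.get('taskResultLocation')

        # Loop through all the XML files in the specified folder
        for file_name in os.listdir(task_result_location):
            if not file_name.endswith('.xml'):
                continue
            file_path = os.path.join(task_result_location, file_name)
            # ...parse the XML here; the metric name below is a placeholder.
            self.gauge('task_results.file_size_bytes', os.path.getsize(file_path))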
I have seen a number of questions about writing files and creating new directories with Python on GAE, and a number of them (not only on SO) conclude by saying that Python cannot write files or create new directories. Yet these commands exist, and plenty of other people seem to be writing files and creating directories with no problem.
I'm trying to write to .txt files and create folders, and I'm getting the following errors:
Case #1:
with open("aardvark.txt", "a") as myfile:
myfile.write("i can't believe its not butter")
produces "IOError: [Errno 30] Read-only file system: 'aardvark.txt'". But i've checked and it's def-o not a read only file.
Case #2:
folder = 'C:\\project\\folder\\' + str(name)
os.makedirs(folder)
produces "OSError: [Errno 38] Function not implemented: 'C:\project\folder'"
What am I missing?
App Engine does not support any write operations to the filesystem (among other restrictions).
The Blobstore does have a file-like API, but you cannot rewrite or append to existing Blobstore entities. The dev server enforces the same restrictions to emulate the production environment.
You should probably have a read of some of the App Engine docs.
The overview doc https://developers.google.com/appengine/docs/python/overview explicitly states that you can't write to the filesystem.
App Engine can now write to local "ephemeral" disk storage when using a Managed VM; this is not supported under the sandbox method, as specified in this documentation:
https://cloud.google.com/appengine/docs/managed-vms/tutorial/step3
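As a rough illustration, on a Managed VM the ordinary file APIs from the question work against the instance's ephemeral disk (the same code still raises the read-only error in the classic sandbox); the directory name below is just an example:
# Sketch only: writing to ephemeral disk on a Managed VM instance.
import os
import tempfile

scratch_dir = os.path.join(tempfile.gettempdir(), "aardvark")
if not os.path.isdir(scratch_dir):
    os.makedirs(scratch_dir)          # allowed on the VM's ephemeral disk

with open(os.path.join(scratch_dir, "aardvark.txt"), "a") as myfile:
    myfile.write("i can't believe its not butter")
Note that this storage is ephemeral: anything written there disappears when the instance is recycled, so it is only suitable for scratch files.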
I need to read in a dictionary file to filter the content specified in the hdfs_input, and I have uploaded it to the cluster using the put command, but I don't know how to access it in my program.
I tried to access it using its path on the cluster like a normal file, but I get the error: IOError: [Errno 2] No such file or directory
Besides, is there any way to keep only one copy of the dictionary for all the machines that run the job?
So what's the correct way to access files other than the specified input in Hadoop jobs?
Problem solved by adding the needed file with the -file option (or the file= option in the conf file).
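For illustration, here is a minimal sketch of how this looks with Hadoop Streaming; the file names, paths, and filtering logic are placeholders, not the original job:
# mapper.py - reads the shipped dictionary from its working directory.
# The job would be launched with something like (names are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input <hdfs_input> -output <hdfs_output> \
#       -mapper mapper.py -file mapper.py -file dictionary.txt
#
# -file ships dictionary.txt into every task's working directory, so it can be
# opened by its bare name, and each node keeps only one distributed copy.
import sys

with open("dictionary.txt") as f:
    keep = set(line.strip() for line in f)

for line in sys.stdin:
    word = line.strip()
    if word in keep:
        sys.stdout.write(word + "\n")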