How to read other files in Hadoop jobs? - python

I need to read in a dictionary file to filter the content specified in hdfs_input, and I have uploaded it to the cluster using the put command, but I don't know how to access it from my program.
I tried to access it using its path on the cluster as if it were a normal file, but I get the error: IOError: [Errno 2] No such file or directory
Besides, is there any way to maintain only one copy of the dictionary for all the machines that run the job?
So what's the correct way to access files other than the specified input in Hadoop jobs?

Problem solved by shipping the needed file with the -file option (or the file= option in the job's conf file).
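For example, here is a minimal sketch of what that might look like with Hadoop Streaming; dict.txt and mapper.py are illustrative names. Files shipped with -file are placed in the task's working directory, so the mapper can open them by bare name rather than by an HDFS path.
# Hypothetical submission command:
#   hadoop jar hadoop-streaming.jar \
#       -input hdfs_input -output hdfs_output \
#       -mapper mapper.py -file mapper.py -file dict.txt

# mapper.py
import sys

# The shipped dictionary lands in the task's current working directory.
with open("dict.txt") as f:
    dictionary = set(line.strip() for line in f)

for line in sys.stdin:
    # Keep only lines whose first field appears in the dictionary.
    key = line.split("\t", 1)[0]
    if key in dictionary:
        sys.stdout.write(line)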

Related

to_csv "No Such File or Directory" But the directory does exist - Databricks on ADLS

I've seen many iterations of this question but cannot seem to understand/fix this behavior.
I am on Azure Databricks (DBR 10.4 LTS, Spark 3.2.1, Scala 2.12) trying to write a single CSV file to blob storage so that it can be dropped to an SFTP server. I could not use spark-sftp because I am on Scala 2.12, unfortunately, and could not get the library to work.
Given this is a small dataframe, I am converting it to pandas and then attempting to_csv.
to_export = df.toPandas()
to_export.to_csv(pathToFile, index = False)
I get the error: [Errno 2] No such file or directory: '/dbfs/mnt/adls/Sandbox/user/project_name/testfile.csv'
Based on the information in other threads, I created the directory with:
dbutils.fs.mkdirs("/dbfs/mnt/adls/Sandbox/user/project_name/")
Out[40]: True
The response is true and the directory exists, yet I still get the same error. I'm convinced it is something obvious and I've been staring at it for too long to notice. Does anyone see what my error may be?
Python's pandas library recognizes the path only when it is in File API format (since you are using a mount), whereas dbutils.fs.mkdirs uses Spark API format, which is different.
Because you are creating the directory using dbutils.fs.mkdirs with the path /dbfs/mnt/adls/Sandbox/user/project_name/, that path is actually interpreted as dbfs:/dbfs/mnt/adls/Sandbox/user/project_name/. Hence, the directory gets created in the wrong location within DBFS.
dbutils.fs.mkdirs('/dbfs/mnt/repro/Sandbox/user/project_name/')
So, you have to create the directory by modifying the code as follows:
dbutils.fs.mkdirs('/mnt/repro/Sandbox/user/project_name/')
#OR
#dbutils.fs.mkdirs('dbfs:/mnt/repro/Sandbox/user/project_name/')
Writing to the folder would now work without any issue.
pdf.to_csv('/dbfs/mnt/repro/Sandbox/user/project_name/testfile.csv', index=False)
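Putting it together, a minimal sketch of the full flow under the same assumptions (the /mnt/repro/... mount point mirrors the example above and is illustrative): dbutils takes the Spark API form of the path, while pandas needs the /dbfs (File API) form of the same location.
# Create the directory via the Spark API form of the path (no /dbfs prefix).
dbutils.fs.mkdirs("/mnt/repro/Sandbox/user/project_name/")

# Write via the File API (FUSE) form of the same location, prefixed with /dbfs.
to_export = df.toPandas()
to_export.to_csv("/dbfs/mnt/repro/Sandbox/user/project_name/testfile.csv", index=False)

# Optional sanity check, again via the Spark API path.
display(dbutils.fs.ls("/mnt/repro/Sandbox/user/project_name/"))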

Reading a file from Databrick filesystem

I used the following code to read a shapefile from dbfs:
geopandas.read_file("file:/databricks/folderName/fileName.shp")
Unfortunately, I don't have access to do so and I get the following error
DriverError: dbfs:/databricks/folderName/fileName.shp: Permission denied
Any idea how to grant the access? The file exists there (I have permission to save a file there using dbutils, and I can also read a file from there using Spark, but I have no idea how to read a file using pyspark).
After adding those lines:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
...from the suggestion below, I get another error:
org.apache.spark.api.python.PythonSecurityException: Path 'file:/tmp/fileName.shp' uses an untrusted filesystem 'org.apache.hadoop.fs.LocalFileSystem', but your administrator has configured Spark to only allow trusted filesystems: (com.databricks.s3a.S3AFileSystem, shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem, com.databricks.adl.AdlFileSystem, shaded.databricks.V2_1_4.com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem, shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem, shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem)
GeoPandas doesn't know anything about DBFS; it works with local files. So you either need:
to use the DBFS FUSE mount to read the file from DBFS (but there are some limitations):
geopandas.read_file("/dbfs/databricks/folderName/fileName.shp")
or to use the dbutils.fs.cp command to copy the file from DBFS to the local filesystem, and read from there:
dbutils.fs.cp("/databricks/folderName/fileName.shp", "file:/tmp/fileName.shp", recurse = True)
geopandas.read_file("/tmp/fileName.shp")
P.S. But if the file is already copied to the driver node, then you just need to remove file: from the name.
Updated after updated question:
There are limitations on what can be done on AAD passthrough clusters, so if you want to copy a file from DBFS to the local filesystem, your administrator needs to change the cluster configuration as described in the troubleshooting documentation.
But the /dbfs way should work for passthrough clusters as well, although it requires at least DBR 7.3 (docs).
Okay, the answer is easier than I thought:
geopandas.read_file("/dbfs/databricks/folderName")
(the folder name, since it is a folder containing all the shapefile components)
Why should it be like that? Easy: enable browsing of DBFS files in the admin control panel ('Advanced' tab), click on the file you need, and you will get two possible paths to the file. One is dedicated to the Spark API and the other to the File API (which is what I needed).
:)
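To illustrate the two forms side by side (using the question's hypothetical /databricks/folderName path): dbutils takes the Spark API path, while geopandas needs the File API (FUSE) path to the same folder.
# Spark API / dbutils form of the path:
display(dbutils.fs.ls("dbfs:/databricks/folderName"))

# File API (FUSE) form of the same folder, which geopandas can read
# (it picks up the .shp together with its .shx/.dbf companions):
import geopandas
gdf = geopandas.read_file("/dbfs/databricks/folderName")
print(gdf.head())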

Datadog Agent check cannot find the path specified

I have written a Datadog Agent check in Python following the instructions on this page: https://docs.datadoghq.com/developers/agent_checks/.
The agent check is supposed to read all files in a specified network folder and then send certain metrics to Datadog.
The folder to be read is specified like this in the YAML file:
init_config:
  taskResultLocation: "Z:/TaskResults"
This is the code used to read the folder; it is Python 2.7 because that is what Datadog requires:
task_result_location = self.init_config.get('taskResultLocation')
# Loop through all the XML files in the specified folder
for file in os.listdir(task_result_location):
If I just run the Python script in my IDE everything works correctly.
When the check is added to the Datadog Agent Manager on the same machine that the IDE is on and then run, the following error is thrown in the Datadog Agent Manager log:
2018-08-14 14:33:26 EEST | ERROR | (runner.go:277 in work) | Error running check TaskResultErrorReader: [{"message": "[Error 3] The system cannot find the path specified: 'Z:/TaskResults/.'", "traceback": "Traceback (most recent call last):\n File \"C:\Program Files\Datadog\Datadog Agent\embedded\lib\site-packages\datadog_checks\checks\base.py\", line 294, in run\n self.check(copy.deepcopy(self.instances[0]))\n File \"c:\programdata\datadog\checks.d\TaskResultErrorReader.py\", line 42, in check\n for file in os.listdir(task_result_location):\nWindowsError: [Error 3] The system cannot find the path specified: 'Z:/TaskResults/.'\n"}]
I have tried specifying the folder location in multiple ways with single and double quotes, forward and back slashes and double slashes but the same error is thrown.
Would anyone know if this is a YAML syntax error or some sort of issue with Datadog or the Python?
Even though Datadog is run from the same machine, it runs as a separate service on your machine. Because of that, it sounds like the Datadog agent doesn't have access to your Z:/ drive.
Try putting the "TaskResults" folder in your root directory (when running from Datadog, where the mycheck.yaml file is) and change the path accordingly.
If this works and you still want a common drive to share files between your computer and Datadog's agent, you have to find a way to mount a drive/folder to the agent. They probably have a way to do that in the documentation.
The solution to this is to create a file share on the network drive and use that path instead of the full network drive path.
This may be obvious to some, but it didn't occur to me right away since the plain Python code worked without any issue outside of Datadog.
So instead of:
init_config:
  taskResultLocation: "Z:/TaskResults"
use
init_config:
  taskResultLocation: '//FileShareName/d/TaskResults'
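For reference, a rough sketch of how the check might consume that share path; the import follows the agent-check docs linked above and may differ by Agent version, and the metric name and XML filter are made up for illustration.
import os

from datadog_checks.checks import AgentCheck  # adjust the import to your Agent version


class TaskResultErrorReader(AgentCheck):
    def check(self, instance):
        # UNC share path from init_config, e.g. //FileShareName/d/TaskResults;
        # the Agent service can reach this even without the Z: drive mapping.
        task_result_location = self.init_config.get('taskResultLocation')

        xml_files = [f for f in os.listdir(task_result_location)
                     if f.lower().endswith('.xml')]

        # Hypothetical metric, just to show the folder is readable.
        self.gauge('task_results.xml_file_count', len(xml_files))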

Error when invoking lambda function as a ZIP file

This is the error I get when I try to invoke my lambda function as a ZIP file.
"The file lambda_function.py could not be found. Make sure your
handler upholds the format: file-name.method."
What am I doing wrong?
Usually the problem is caused by the way the files were zipped. Instead of zipping the root folder, you have to select all the files and zip them, so that lambda_function.py sits at the root of the archive rather than inside a subfolder.
Upload all files and subfolders this way. My example was for Node.js, but you can do the same for Python.
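Here is a minimal sketch of building such an archive with Python's zipfile module; the folder and file names are just for illustration.
import os
import zipfile

# Zip the *contents* of the build folder, not the folder itself, so that
# lambda_function.py ends up at the root of the archive.
build_dir = "my_lambda"  # hypothetical folder containing lambda_function.py and its dependencies
with zipfile.ZipFile("function.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # arcname is the path recorded inside the zip, relative to build_dir.
            zf.write(full_path, arcname=os.path.relpath(full_path, build_dir))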
Just to clarify: if I want to use Keras, all I have to do is download the Keras directories, put my Lambda code and the Keras directories into a zip archive, and upload it directly from my desktop, right?
Just wanted to know if this is the right method to include Keras.
Whenever you get this kind of message, and all files and handlers have the right name, format, location, etc., also check whether other parts of the Lambda configuration are set up properly for what the code is trying to do.
For example, you might receive that unrelated error if your code is trying to execute against an RDS database that is in a private subnet and you are missing the correct VPC configuration that allows connectivity to that database.

makedirs error: can GAE Python create new directories (folders) or not?

I have seen a number of questions about writing files and creating new directories with Python on GAE, but a number of them (not only on SO) conclude by saying that Python cannot write files or create new directories. Yet these commands exist, and plenty of other people seem to write files and create directories without a problem.
I'm trying to write to .txt files and create folders and getting the following errors:
Case #1:
with open("aardvark.txt", "a") as myfile:
    myfile.write("i can't believe its not butter")
produces "IOError: [Errno 30] Read-only file system: 'aardvark.txt'". But i've checked and it's def-o not a read only file.
Case #2:
folder = r'C:\project\folder' + '\\' + str(name)
os.makedirs(folder)
produces "OSError: [Errno 38] Function not implemented: 'C:\project\folder'"
What am I missing?
App Engine does not support any write operations to the filesystem (among other restrictions).
The Blobstore does have a file-like API, but you cannot rewrite or append to existing Blobstore entities. The dev server enforces the same restrictions to emulate the production environment.
You should probably have a read of some of the App Engine docs.
The overview doc https://developers.google.com/appengine/docs/python/overview explicitly states you can't write.
App Engine can now write to local "ephemeral" disk storage when using a Managed VM, which is not supported when using the sandbox method, as specified in this documentation:
https://cloud.google.com/appengine/docs/managed-vms/tutorial/step3
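On a Managed VM, the ordinary filesystem calls work against the instance's ephemeral disk, so a sketch like the following would run there; the path is illustrative, and anything written is lost when the instance is restarted or replaced.
import os
import tempfile

# Ephemeral scratch space on the instance; not durable storage.
scratch_dir = os.path.join(tempfile.gettempdir(), "myapp_scratch")
if not os.path.isdir(scratch_dir):
    os.makedirs(scratch_dir)  # Python 2.7's makedirs has no exist_ok argument

with open(os.path.join(scratch_dir, "aardvark.txt"), "a") as myfile:
    myfile.write("i can't believe it's not butter\n")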
