Using a JAR dependency in a PySpark parallelized execution context - python

This is for a PySpark / Databricks project:
I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions imported from the JAR in an sc.parallelize(..).foreach(..) environment, execution keeps dying with the following error:
TypeError: 'JavaPackage' object is not callable
at this line in the wrapper:
jc = get_spark()._jvm.com.company.package.class.get_udf(function.__name__)
My suspicion is that the JAR library is not available in the parallelized context, since if I replace the library path with gibberish, the error remains exactly the same.
I haven't been able to find the necessary clues in the Spark docs so far. Using sc.addFile("dbfs:/FileStore/path-to-library.jar") didn't help either.

You could try adding the JAR via the PYSPARK_SUBMIT_ARGS environment variable (before Spark 2.3 this was also doable with SPARK_CLASSPATH).
For example with:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <path/to/jar> pyspark-shell'
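Note (not part of the original answer): PYSPARK_SUBMIT_ARGS is only read when the JVM behind the SparkContext is first launched, so it has to be set before any SparkSession/SparkContext exists; on an already-running Databricks cluster a JAR is usually attached as a cluster library instead. A minimal sketch for a standalone script, with a placeholder JAR path:
import os
from pyspark.sql import SparkSession
# Must run before the first SparkSession/SparkContext is created;
# the JAR path below is a placeholder.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /dbfs/FileStore/path-to-library.jar pyspark-shell'
spark = SparkSession.builder.getOrCreate()
# If the JAR made it onto the classpath, the JVM package becomes callable, e.g.
# spark._jvm.com.company.package.SomeClass; otherwise you get the
# "'JavaPackage' object is not callable" error from the question.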

Related

Installing Python package across Spark executors - not finding python package, raising ModuleNotFoundError

I have a question regarding the correct way to install new packages on Spark worker nodes, using Databricks and MLflow.
What I currently have is the following:
a training script (using cv2, i.e. the opencv-python library), which logs the tuned ML model, together with its dependencies, to the MLflow Model Registry
an inference script which reads the logged ML model together with the saved conda environment, as a spark_udf
an installation step which reads the conda environment and installs all packages at the required versions, via pip install (wrapped in a subprocess.call)
and then the actual prediction, where the spark_udf is called on new data.
The last step is what fails with ModuleNotFoundError.
SparkException: Job aborted due to stage failure: Task 8 in stage 95.0
failed 4 times, most recent failure: Lost task 8.3 in stage 95.0 (TID 577
(xxx.xxx.xxx.xxx executor 0): org.apache.spark.api.python.PythonException:
'ModuleNotFoundError: No module named 'cv2''
I have been closely following the content of this article which seems to cover this same problem:
"One major reason for such issues is using udfs. And sometimes the udfs don’t get distributed to the cluster worker nodes."
"The respective dependency modules used in the udfs or in main spark program might be missing or inaccessible from\in the cluster worker nodes."
So it seems that at the moment, despite using spark_udf and the conda environment logged to MLflow, installation of the cv2 module only happened on my driver node, but not on the worker nodes. If this is true, I now need to programmatically make these extra dependencies (namely, the cv2 Python module) available to the executors/worker nodes.
So what I did was import cv2 in my inference script, retrieve the path of cv2's __init__ file, and add it to the Spark context, similarly to how it is done for the arbitrary "A.py" file in the blog post above.
import cv2
spark.sparkContext.addFile(os.path.abspath(cv2.__file__))
This doesn't seem to change anything, though. I assume the reason is partly that I need not just the single __init__.py file but the entire cv2 library to be accessible to the worker nodes, whereas the above only ships the __init__.py. I'm fairly sure that adding every file of every cv2 submodule is not the way to go either, but I haven't been able to figure out how to achieve this easily with a command similar to the addFile() above.
Similarly, I also tried the other option, addPyFile(), by pointing it to the cv2 module's root (parent of __init__):
import cv2
spark.sparkContext.addPyFile(os.path.dirname(cv2.__file__))
but this also didn't help, and I still got stuck with the same error. Furthermore, I would like this process to be automatic, i.e. not to have to manually set module paths in the inference code.
Similar posts I came across:
Running spacy in pyspark, but getting ModuleNotFoundError: No module named 'spacy', here the only answer suggests to "restart spark session"; I'm not sure, though, what that means in my specific case, with an active Databricks notebook and a running cluster.
ModuleNotFoundError in PySpark Worker on rdd.collect(), here the answer states "you're not allowed to access the spark context from executor tasks," which might explain why I failed with both of my approaches above (addFile and addPyFile). But if this is not allowed, what's the correct workaround then?
PySpark: ModuleNotFoundError: No module named 'app', here there is an informative answer stating "Your Python code runs on driver, but you udf runs on executor PVM. (...) use the same environment in both driver and executors.", but it's really not clear how to do that programmatically, inside the notebook.
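(Not from the original post, just a sketch of one commonly suggested direction.) On recent Databricks runtimes, notebook-scoped libraries (%pip install opencv-python in a notebook cell) are installed on the driver and on every worker for that notebook's session, which sidesteps addFile/addPyFile entirely. For plain PySpark, the packaging docs describe shipping a conda-pack'ed environment to the executors; a minimal sketch, assuming Spark 3.1+ and a hypothetical archive cv2_env.tar.gz built with conda-pack from an environment that contains opencv:
import os
from pyspark.sql import SparkSession
# The archive is unpacked on every executor under ./environment, and
# PYSPARK_PYTHON points the worker interpreter at it.
os.environ['PYSPARK_PYTHON'] = './environment/bin/python'
spark = (
    SparkSession.builder
    .config('spark.archives', 'cv2_env.tar.gz#environment')
    .getOrCreate()
)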

Unable to Create UDF using python in snowflake

I'm trying to create a Snowpark UDF in Python as an object. Below is my code:
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType  # needed for return_type below
from pytorch_tabnet.tab_model import TabNetRegressor
session.clearImports()
model = TabNetRegressor()
model.load_model(model_file)
lib_test = udf(lambda: (model.device), return_type=StringType())
I'm getting an error like the one below:
Failed to execute query
CREATE
TEMPORARY FUNCTION "TEST_DB"."TEST".SNOWPARK_TEMP_FUNCTION_GES3G8XHRH()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION=3.8
IMPORTS=('#"TEST_DB"."TEST".SNOWPARK_TEMP_STAGE_CR0E7FBWQ6/cloudpickle/cloudpickle.zip','#"TEST_DB"."TEST".SNOWPARK_TEMP_STAGE_CR0E7FBWQ6/TEST_DBTESTSNOWPARK_TEMP_FUNCTION_GES3G8XHRH_5843981186544791787/udf_py_1085938638.zip')
HANDLER='udf_py_1085938638.compute'
002072 (42601): SQL compilation error:
Unknown function language: PYTHON.
It throws an error because Python is not available.
I checked the packages available in the information schema; it shows only Scala and Java. I'm not sure why Python is not available in the packages. How do I add Python to the packages, and will adding it resolve this issue?
Can anyone help with this? Thanks.
Python UDFs are not in production yet and are only available to selected accounts.
Please reach out to your Snowflake account team to have the functionality enabled.

Monitoring python shell glue jobs in AWS

In the AWS documentation, they specify how to activate monitoring for Spark jobs (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html), but not for Python shell jobs.
Using the code as is gives me this error: ModuleNotFoundError: No module named 'pyspark'
Worse, after commenting out from pyspark.context import SparkContext, I then get ModuleNotFoundError: No module named 'awsglue.context'. It seems the Python shell jobs don't have access to the Glue context? Has anyone solved this?
Python shell jobs run in a purely Python-based environment and do not have access to PySpark (EMR in the backend). You will not be able to access the context attribute here; that is purely a Spark concept, and Glue is essentially a wrapper around PySpark.
I am getting into Glue Python shell jobs more, and resolving some dependencies in code files that are shared between my Spark jobs and Python shell jobs. I was able to resolve the pyspark dependency by including pyspark==2.4.7 in the requirements.txt used to build my .egg/.whl file (that version because another library required it).
You still cannot use the PySpark context, as mentioned above by Emerson, because this is a Python runtime, not a Spark runtime.
So when building a distribution with setuptools, you can have a requirements.txt that looks like the one below, and when the shell is set up, it will install these dependencies (a setup.py sketch follows the list):
elasticsearch
aws_requests_auth
pg8000
pyspark==2.4.7
awsglue-local
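Not from the original answer, but to illustrate the setuptools build it refers to, here is a minimal setup.py sketch that feeds such a requirements.txt into the wheel's install_requires (package name and version are placeholders):
# setup.py - minimal sketch; name/version are placeholders
from setuptools import setup, find_packages
with open('requirements.txt') as f:
    requirements = [line.strip() for line in f if line.strip() and not line.startswith('#')]
setup(
    name='shared_glue_code',
    version='0.1.0',
    packages=find_packages(),
    install_requires=requirements,  # elasticsearch, pg8000, pyspark==2.4.7, ...
)
The resulting wheel (python setup.py bdist_wheel) can then be attached to the job, and the listed dependencies are installed when the shell is set up, as described above.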

Add custom python library path to Pyspark

On my Hadoop cluster, the Anaconda package is installed in a path other than the default Python path. I get the error below when I try to access numpy in PySpark:
ImportError: No module named numpy
I am invoking PySpark using Oozie.
I tried to supply this custom Python library path with the approaches below.
Using Oozie property tags:
<property>
<name>oozie.launcher.mapreduce.map.env</name>
<value>PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7</value>
</property>
Using the spark-opts tag:
<spark-opts>spark.yarn.appMasterEnv.PYSPARK_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.pyspark.python=/var/opt/teradata/anaconda2/bin/python2.7 --conf spark.pyspark.driver.python=/var/opt/teradata/anaconda2/bin/python2.7</spark-opts>
Nothing works.
When I run a plain Python script it works fine; the problem is passing the path to PySpark.
I even put this in the PySpark script header:
#! /usr/bin/env /var/opt/teradata/anaconda2/bin/python2.7
When I print sys.path in my PySpark code, it still gives me the default paths below:
['/usr/lib/python27.zip', '/usr/lib64/python2.7', '/usr/lib64/python2.7/plat-linux2', '/usr/lib64/python2.7/lib-tk', '/usr/lib64/python2.7/lib-old', '/usr/lib64/python2.7/lib-dynload', '/usr/lib64/python2.7/site-packages', '/usr/local/lib64/python2.7/site-packages', '/usr/local/lib/python2.7/site-packages', '/usr/lib/python2.7/site-packages']
I'm having this same issue. In my case it seems like the ML classes (e.g. Vector) are calling numpy behind the scenes but are not looking for it in the standard installation locations. Even though the Python 2 and Python 3 versions of numpy are installed on all nodes of the cluster, PySpark still complains that it can't find it.
I've tried a lot of suggestions that have not worked.
Two things that have been suggested that I haven't tried:
1) Use the bashrc for the user that PySpark runs as (ubuntu for me) to set the preferred python path. Do this on all nodes.
2) Have the PySpark script attempt to install the module in question as part of its functionality (e.g. by shelling out to pip/pip3); a sketch of this follows below.
I'll keep an eye here and if I find an answer I'll post it.
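For what it's worth, here is a minimal sketch of suggestion (2), with numpy as the package in question. Note that module-level code like this runs on the driver; to affect the executors the same call would have to run inside the task itself (for example at the top of a mapPartitions function), which is why fixing PYSPARK_PYTHON on every node as in suggestion (1) is usually the cleaner route:
import subprocess
import sys
def ensure_numpy():
    # Install numpy with whichever interpreter is actually running this code.
    try:
        import numpy  # noqa: F401
    except ImportError:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--user', 'numpy'])
ensure_numpy()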

Issue with using files in distributed cache in Elastic MapReduce

I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job.
However, my script doesn't seem to be able to find the modules in the cache. I archived the files into a tarball called helper_classes.tar and uploaded the tarball to an Amazon S3 bucket. When creating my MapReduce job on the console, I specified the argument as:
cacheArchive s3://folder1/folder2/helper_classes.tar#helper_classes
At the beginning of my Python mapper script, I included the following code to import the library:
import sys
sys.path.append('./helper_classes')
import geoip.database
When I run the MapReduce job, it fails with an ImportError: No module named geoip.database. (geoip is a folder in the top level of helper_classes.tar and database is the module I'm trying to import.)
Any ideas what I could be doing wrong?
This might be late for the topic.
The reason is that the module geoip.database is not installed on all the Hadoop nodes.
You can either try not to use uncommon imports in your map/reduce code, or try to install the needed modules on all Hadoop nodes.
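Not part of the original answer, but a quick way to tell which case you are in is to log, from inside the mapper, what actually arrived in the task's working directory before importing (Hadoop Streaming sends stderr to the task logs, so it is safe to write there):
import os
import sys
# The '#helper_classes' fragment of the cacheArchive argument becomes a symlink
# in the task's working directory pointing at the unpacked tarball.
sys.stderr.write('working dir: %s\n' % os.listdir('.'))
if os.path.isdir('./helper_classes'):
    sys.stderr.write('helper_classes: %s\n' % os.listdir('./helper_classes'))
sys.path.append('./helper_classes')
import geoip.database  # will still fail if geoip needs compiled dependencies missing on the nodes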
