I am trying to distribute some programs to a local cluster I built using Spark. The goal of this project is to pass some data to each worker, hand the data to an external MATLAB function for processing, and collect the results back on the master node. The problem I ran into is how to call the MATLAB function. Is it possible for Spark to call an external function? In other words, can each function parallelized by Spark search the local path of its node and execute an external program there?
Here is a small test code:
run.py
import sys
from operator import add
from pyspark import SparkContext
import callmatlab
def run(a):
    # print '__a'
    callmatlab.sparktest()

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    output = sc.parallelize(range(1, 2)).map(run)
    print output
    sc.stop()
callmatlab.py
import matlab.engine as eng
import numpy as np
eng = eng.start_matlab()
def sparktest():
    print "-----------------------------------------------"
    data = eng.sparktest()
    print "----the return data:\n", type(data), data

if __name__ == "__main__":
    sparktest()
spark-submit script:
#!/bin/bash
path=/home/zzz/ProgramFiles/spark
$path/bin/spark-submit \
--verbose \
--py-files $path/hpc/callmatlab.py $path/hpc/sparktest.m \
--master local[4] \
$path/hpc/run.py \
README.md
It seems Spark requires everything attached with --py-files to be a .py file, and it does not recognize sparktest.m.
I do not know how to continue. Could anyone give me some advice? Does Spark allow this at all? Or can you recommend another distributed Python framework?
Thanks
Thanks for trying to answer my question. I used a different way to solve this problem. I uploaded the MATLAB files and the data they need to load to a path on each node's file system, and the Python code just adds that path and calls the function using the matlab.engine module.
So my callmatlab.py becomes
import matlab.engine as eng
import numpy as np
import os
eng = eng.start_matlab()
def sparktest():
    print "-----------------------------------------------"
    eng.addpath(os.path.join(os.getenv("HOME"), 'zzz/hpc/'), nargout=0)
    data = eng.sparktest([12, 1, 2])
    print data
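For completeness, here is a minimal sketch of the driver side under this workaround (the partition count and the range are just illustrative; note that map is lazy, so an action such as collect() is needed before sparktest actually runs on the workers, and callmatlab.py itself still has to be shipped with --py-files or be present on every node):
from pyspark import SparkContext

def run(a):
    # Imported on the worker, so the MATLAB engine is started there, not on the driver.
    import callmatlab
    callmatlab.sparktest()
    return a

if __name__ == "__main__":
    sc = SparkContext(appName="CallMatlabFromWorkers")
    # collect() is the action that forces the lazy map() to execute on the workers.
    results = sc.parallelize(range(4), 4).map(run).collect()
    print(results)
    sc.stop()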
Firstly, I do not see any reason to pass sparktest.m with --py-files.
Secondly, the recommended way is to put your Python dependencies in a .zip file. From the documentation:
For Python, you can use the --py-files argument of spark-submit to add
.py, .zip or .egg files to be distributed with your application. If
you depend on multiple Python files we recommend packaging them into a
.zip or .egg.
Finally, remember that your function will be executed by an executor JVM on a remote machine, so the Spark framework ships the function, its closure, and any additional files as part of the job. Hope that helps.
Add the --files option before sparktest.m.
That tells Spark to ship the sparktest.m file to all workers.
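A minimal sketch of how the worker-side code could then locate the shipped file (assuming sparktest.m was passed via --files; SparkFiles.get resolves the local copy on each executor):
import os
import matlab.engine
from pyspark import SparkFiles

def sparktest():
    # Files shipped with --files (or sc.addFile) land in the executor's
    # SparkFiles directory; add that directory to the MATLAB path.
    eng = matlab.engine.start_matlab()
    eng.addpath(os.path.dirname(SparkFiles.get("sparktest.m")), nargout=0)
    return eng.sparktest()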
Related
When working on a local project, from local_project.funcs import local_func will fail in the cluster because local_project is not installed.
This forces me to develop everything in a single file.
Solutions? Is there a way to "import" the contents of the module into the working file so that the cluster doesn't need to import it?
Installing the local_project in the cluster is not development friendly because any change in an imported feature requires a cluster redeploy.
import dask
from dask_kubernetes import KubeCluster, make_pod_spec
from local_project.funcs import local_func
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="4G",
    memory_request="4G",
    cpu_limit=1,
    cpu_request=1,
)
cluster = KubeCluster(pod_spec)
df = dask.datasets.timeseries()
df.groupby('id').apply(local_func) #fails if local_project not installed in cluster
Typically the solution to this is to make your own Docker image. If you have only a single file, or an egg or zip file, then you might also look into the Client.upload_file method.
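For reference, a minimal sketch of the upload_file route, assuming the cluster object from the question and that local_project has been zipped into local_project.zip beforehand:
import dask
from dask.distributed import Client

client = Client(cluster)                 # attach to the KubeCluster from the question

# Ship the zipped package to every current (and future) worker;
# the workers put it on their import path.
client.upload_file("local_project.zip")

from local_project.funcs import local_func

df = dask.datasets.timeseries()
df.groupby("id").apply(local_func).compute()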
I would like to initialize config once, and then use it in many modules of my PySpark project.
I see 2 ways to do it.
1) Load it in the entry point and pass it as an argument to each function.
main.py:
import sys
import json

with open(sys.argv[1]) as f:
    config = json.load(f)

df = load_df(config)
df = parse(df, config)
df = validate(df, config, strict=True)
dump(df, config)
But it seems inelegant to pass the same external argument to every function.
2) Load the config in config.py and import this object in each module.
config.py
import sys
import json
with open(sys.argv[1]) as f:
    config = json.load(f)
main.py
from config import config
df = load_df()
df = parse(df)
df = validate(df, strict=True)
dump(df)
and in each module add the line
from config import config
This seems nicer, because config is not, strictly speaking, an argument of the functions; it is the general context in which they execute.
Unfortunately, PySpark pickles config.py and tries to execute it on the executors, but doesn't pass sys.argv to them!
So I see this error when I run it:
File "/PycharmProjects/spark_test/config.py", line 6, in <module>
CONFIG_PATH = sys.argv[1]
IndexError: list index out of range
What is the best practice to work with general config, loaded from file, in PySpark?
Your program starts executing on the master and passes the main bulk of its work to the executors by invoking functions on them. The executors are separate processes that typically run on different physical machines.
Thus anything that the master wants to reference on the executors needs to be either a standard library function (to which the executors have access) or a picklable object that can be sent over.
You typically don't want to load and parse any external resources on the executors, since you would always have to copy them over and make sure you load them properly. Passing a picklable object as an argument of the function (e.g. for a UDF) works much better, since there is only one place in your code where you need to load it.
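A minimal sketch of that idea, assuming the config is loaded once on the driver as a plain dict and then captured by the code that needs it; the input_path and threshold keys, the value column, and the filter logic are made up for illustration:
import sys
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("config-as-argument").getOrCreate()

# Loaded exactly once, on the driver; a plain dict pickles cleanly.
with open(sys.argv[1]) as f:
    config = json.load(f)

# The dict is captured in the UDF's closure and shipped to the executors,
# so they never need to re-read the file or see sys.argv.
keep_row = udf(lambda v: v > config.get("threshold", 0), BooleanType())

df = spark.read.json(config["input_path"])
df = df.filter(keep_row(df["value"]))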
I would suggest creating a config.py file and adding it as an argument to your spark-submit command:
spark-submit --py-files /path/to/config.py main_program.py
Then you can create spark context like this:
spark_context = SparkContext(pyFiles=['/path/to/config.py'])
and simply use import config wherever you need.
You can even include whole Python packages in a tree packaged as a single zip file instead of just a single config.py file, but then be sure to include __init__.py in every folder that needs to be referenced as a Python module.
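A minimal sketch of that variant, assuming a hypothetical package directory mypackage/ that contains __init__.py and config.py (the archive could equally be built once with zip -r):
import zipfile
from pyspark import SparkContext

# Bundle the whole package tree into one archive.
with zipfile.ZipFile("mypackage.zip", "w") as zf:
    zf.write("mypackage/__init__.py")
    zf.write("mypackage/config.py")

sc = SparkContext(appName="zipped-package-example")
sc.addPyFile("mypackage.zip")   # shipped to the executors and added to sys.path

import mypackage.config        # importable on the driver and inside executor code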
I would like to import a .py file that contains some functions. I have saved the files __init__.py and util_func.py under this folder:
/usr/local/lib/python3.4/site-packages/myutil
util_func.py contains all the functions that I would like to use. I also need to create a PySpark UDF so I can use it to transform my dataframe. My code looks like this:
import pyspark.sql.functions
from pyspark.sql.types import StringType
import myutil
from myutil import util_func
myudf = pyspark.sql.functions.udf(util_func.ConvString, StringType())
somewhere down the code, I am using this to convert one of the columns in my dataframe:
df = df.withColumn("newcol", myudf(df["oldcol"]))
Then I try to check whether it converts correctly using:
df.head()
It fails with the error "No module named myutil".
I am able to call the functions from within ipython. Somehow the pyspark engine does not see the module. Any idea how to make sure that the pyspark engine picks up the module?
You must build an egg file of your package using setuptools and add the egg file to your application like below:
sc.addPyFile('<path of the egg file>')
here sc is the spark context variable.
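A minimal sketch of that flow, assuming the package from the question is called myutil and the egg has been built beforehand (the egg file name below is just illustrative):
# Built once outside Spark, e.g.:  python setup.py bdist_egg  ->  dist/myutil-0.1-py3.4.egg
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("myutil-udf").getOrCreate()
sc = spark.sparkContext

# Ship the egg to every executor and put it on their sys.path.
sc.addPyFile("dist/myutil-0.1-py3.4.egg")

from myutil import util_func                   # now resolvable on the executors as well
myudf = F.udf(util_func.ConvString, StringType())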
Sorry for hijacking the thread. I want to reply to #rouge-one's comment, but I don't have enough reputation to do it.
I'm having the same problem as the OP, but this time the module is not a single .py file but Spotify's annoy package: https://github.com/spotify/annoy/tree/master/annoy
I tried sc.addPyFile('venv.zip') and added --archives ./venv.zip#PYTHON \ to the spark-submit file,
but it still threw the same error message.
I can still use from annoy import AnnoyIndex in the spark-submit file, but every time I try to import it in the UDF like this:
schema = ArrayType(StructType([
    StructField("char", IntegerType(), False),
    StructField("count", IntegerType(), False)
]))

f = 128

def return_candidate(x):
    from annoy import AnnoyIndex
    from pyspark import SparkFiles

    annoy = AnnoyIndex(f)
    annoy.load(SparkFiles.get("annoy.ann"))
    neighbor = 5
    annoy_object = annoy.get_nns_by_item(x, n=neighbor, include_distances=True)
    return annoy_object

return_candidate_udf = udf(lambda y: return_candidate(y), schema)
inter4 = inter3.select('*', return_candidate_udf('annoy_id').alias('annoy_candidate_list'))
I found the issue! A Spark UDF runs in another executor process, and when you have a problem like this it is because the environment variables there are different!
In my case, I was developing, debugging and testing on Zeppelin, which has two different interpreters for Python and Spark! When I installed the libs from the terminal, I could use the functions normally, but not inside a UDF!
Solution: just set the same environment for the driver and the executors, via PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON.
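A minimal sketch of that fix, assuming the interpreter with the libraries installed lives at /path/to/venv/bin/python; the variables must be set before the SparkContext/SparkSession is created (they can also go into spark-env.sh or the spark.pyspark.python / spark.pyspark.driver.python configs):
import os
from pyspark.sql import SparkSession

# Point the driver and the executors at the same Python, so UDFs see the
# same installed packages (e.g. annoy) as the interactive session.
os.environ["PYSPARK_DRIVER_PYTHON"] = "/path/to/venv/bin/python"
os.environ["PYSPARK_PYTHON"] = "/path/to/venv/bin/python"

spark = SparkSession.builder.appName("same-python-everywhere").getOrCreate()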
I use pyspark.sql.functions.udf to define a UDF that uses a class imported from a .py module written by me.
from czech_simple_stemmer import CzechSimpleStemmer #this is my class in my module
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
...some code here...
def clean_one_raw_doc(my_raw_doc):
    ... calls something from CzechSimpleStemmer ...
udf_clean_one_raw_doc = udf(clean_one_raw_doc, StringType())
When I call
df = spark.sql("SELECT * FROM mytable").withColumn("output_text", udf_clean_one_raw_doc("input_text"))
I get a typical huge error message where probably this is the relevant part:
File "/data2/hadoop/yarn/local/usercache/ja063930/appcache/application_1472572954011_132777/container_e23_1472572954011_132777_01_000003/pyspark.zip/pyspark/serializers.py", line 431, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'czech_simple_stemmer'
Do I understand it correctly that pyspark distributes udf_clean_one_raw_doc to all the worker nodes but czech_simple_stemmer.py is missing there in the nodes' python installations (being present only on the edge node where I run the spark driver)?
And if yes, is there any way I could tell pyspark to distribute this module too? I guess I could probably copy czech_simple_stemmer.py manually to all the nodes' Python installations, but 1) I don't have admin access to the nodes, and 2) even if I beg the admin to put it there and he does it, then whenever I need to do some tuning to the module itself, he'd probably kill me.
SparkContext.addPyFile("my_module.py") will do it.
from the spark-submit documentation
For Python, you can use the --py-files argument of spark-submit to add
.py, .zip or .egg files to be distributed with your application. If
you depend on multiple Python files we recommend packaging them into a
.zip or .egg.
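A minimal sketch of that approach applied to the module from the question (assuming czech_simple_stemmer.py sits next to the driver script; the body of the cleaning function is elided as in the original):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("stemmer-udf").getOrCreate()

# Ship the module so the executors can unpickle functions that reference it.
spark.sparkContext.addPyFile("czech_simple_stemmer.py")

from czech_simple_stemmer import CzechSimpleStemmer

def clean_one_raw_doc(my_raw_doc):
    # ... calls something from CzechSimpleStemmer ...
    return my_raw_doc

udf_clean_one_raw_doc = udf(clean_one_raw_doc, StringType())
df = spark.sql("SELECT * FROM mytable").withColumn("output_text", udf_clean_one_raw_doc("input_text"))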
My goal is to import a custom .py file into my spark application and call some of the functions included inside that file
Here is what I tried:
I have a test file called Test.py which looks as follows:
def func():
    print "Import is working"
Inside my Spark application I do the following (as described in the docs):
sc = SparkContext(conf=conf, pyFiles=['/[AbsolutePathTo]/Test.py'])
I also tried this instead (after the Spark context is created):
sc.addFile("/[AbsolutePathTo]/Test.py")
I even tried the following when submitting my spark application:
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --py-files /[AbsolutePath]/Test.py ../Main/Code/app.py
However, I always get a name error:
NameError: name 'func' is not defined
when I am calling func() inside my app.py. (same error with 'Test' if I try to call Test.func())
Finally, I also tried importing the file inside the pyspark shell with the same command as above:
sc.addFile("/[AbsolutePathTo]/Test.py")
Strangely, I do not get an error on the import, but still, I cannot call func() without getting the error. Also, not sure if it matters, but I'm using spark locally on one machine.
I really tried everything I could think of, but still cannot get it to work. Probably I am missing something very simple. Any help would be appreciated.
Alright, actually my question is rather stupid. After doing:
sc.addFile("/[AbsolutePathTo]/Test.py")
I still have to import the Test.py file like I would import a regular python file with:
import Test
then I can call
Test.func()
and it works. I thought that the "import Test" is not necessary since I add the file to the spark context, but apparently that does not have the same effect.
Thanks mark91 for pointing me into the right direction.
UPDATE 28.10.2017:
as asked in the comments, here more details on the app.py
from pyspark import SparkContext
from pyspark.conf import SparkConf
conf = SparkConf()
conf.setMaster("local[4]")
conf.setAppName("Spark Stream")
sc = SparkContext(conf=conf)
sc.addFile("Test.py")
import Test
Test.func()