When working on a local project, from local_project.funcs import local_func will fail on the cluster because local_project is not installed there.
This forces me to develop everything in the same file.
Solutions? Is there a way to "import" the contents of the module into the working file so that the cluster doesn't need to import it?
Installing local_project on the cluster is not development friendly, because any change to an imported feature requires a cluster redeploy.
import dask
from dask_kubernetes import KubeCluster, make_pod_spec
from local_project.funcs import local_func

pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="4G",
    memory_request="4G",
    cpu_limit=1,
    cpu_request=1,
)
cluster = KubeCluster(pod_spec)

df = dask.datasets.timeseries()
df.groupby('id').apply(local_func)  # fails if local_project is not installed in the cluster
Typically the solution to this is to build your own Docker image. If you have only a single file, or an egg or zip file, then you might also look into the Client.upload_file method.
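For reference, a minimal sketch of the upload_file route, assuming the package has been zipped into a file named local_project.zip (that name and the zipping step are assumptions, not part of the original question):

from dask.distributed import Client

client = Client(cluster)                    # connect to the KubeCluster defined above
client.upload_file("local_project.zip")     # ship the zipped package to every worker

# local_func should now be importable on the workers, so the original groupby works
df = dask.datasets.timeseries()
df.groupby('id').apply(local_func).compute()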
How can I convert an asciidoc to html using the asciidoc3 python package from within my python script? I'm not able to find a working example. The official docs are oriented mainly towards those who will use asciidoc3 as a command line tool, not for those who wish to do conversions in their python apps.
I'm finding that sometimes packages are refactored with significant improvements and old examples on the interwebs are not updated. Python examples frequently omit import statements for brevity, but for newer developers like me, the correct entry point is not obvious.
In my venv, I ran
pip install asciidoc3
Then I tried...
import io
from asciidoc3.asciidoc3api import AsciiDoc3API
infile = io.StringIO('Hello world')
outfile = io.StringIO()
asciidoc3_ = AsciiDoc3API()
asciidoc3_.options('--no-header-footer')
asciidoc3_.execute(infile, outfile, backend='html4')
print(outfile.getvalue())
and
import io
from asciidoc3 import asciidoc3api
asciidoc3_ = asciidoc3api.AsciiDoc3API()
infile = io.StringIO('Hello world')
asciidoc3_.execute(infile)
PyCharm doesn't have a problem with either import attempt when it does its syntax check, and everything looks right based on what I'm seeing in my venv's site-packages: "./venv/lib/python3.10/site-packages/asciidoc3/asciidoc3api.py" is there as expected. But both of my attempts raise "AttributeError: module 'asciidoc3' has no attribute 'execute'"
That's true. asciidoc3 doesn't have any such attribute. It's a method of class AsciiDoc3API defined in asciidoc3api.py. I assume the problem is my import statement?
I figured it out. It wasn't the import statement. The error message was sending me down the wrong rabbit hole but I found this in the module's doc folder...
[NOTE]
.PyPI, venv (Windows or GNU/Linux and other POSIX OS)
Unfortunately, sometimes (not always - depends on your directory-layout, operating system etc.) AsciiDoc3 cannot find the 'asciidoc3' module when you installed via venv and/or PyPI. +
The solution:
from asciidoc3api import AsciiDoc3API
asciidoc3 = AsciiDoc3API('/full/path/to/asciidoc3.py')
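Putting the two pieces together, a minimal sketch of the workaround, assuming the venv layout from the question (the site-packages path is illustrative):

import io
from asciidoc3.asciidoc3api import AsciiDoc3API

# pass the full path of asciidoc3.py so the API can locate the asciidoc3 module
asciidoc3_ = AsciiDoc3API('./venv/lib/python3.10/site-packages/asciidoc3/asciidoc3.py')
asciidoc3_.options('--no-header-footer')

infile = io.StringIO('Hello world')
outfile = io.StringIO()
asciidoc3_.execute(infile, outfile, backend='html4')
print(outfile.getvalue())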
I would like to import a .py file that contains some functions. I have saved the files __init__.py and util_func.py under this folder:
/usr/local/lib/python3.4/site-packages/myutil
The util_func.py contains all the functions that I would like to use. I also need to create a PySpark UDF so I can use it to transform my dataframe. My code looks like this:
import pyspark.sql.functions
from pyspark.sql.types import StringType

import myutil
from myutil import util_func

myudf = pyspark.sql.functions.udf(util_func.ConvString, StringType())
Somewhere down in the code, I am using this to convert one of the columns in my dataframe:
df = df.withColumn("newcol", myudf(df["oldcol"]))
Then I am trying to see if it converts it by using:
df.head()
It fails with an error "No module named myutil".
I am able to bring up the functions within IPython. Somehow the PySpark engine does not see the module. Any idea how to make sure that the PySpark engine picks up the module?
You must build an egg file of your package using setuptools and add the egg file to your application like below
sc.addPyFile('<path of the egg file>')
Here sc is the SparkContext variable.
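For reference, a minimal sketch of that approach for the myutil package; the egg file name and paths are illustrative, and the egg is assumed to have been built beforehand with setuptools (for example via python setup.py bdist_egg):

from pyspark import SparkContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="myutil-example")
# ship the packaged egg so every executor can import myutil
sc.addPyFile("/path/to/dist/myutil-0.1-py3.4.egg")

from myutil import util_func

myudf = udf(util_func.ConvString, StringType())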
Sorry for hijacking the thread. I want to reply to #rouge-one's comment but I don't have enough reputation to do it.
I'm having the same problem as the OP, but this time the module is not a single .py file but the Spotify annoy package in Python https://github.com/spotify/annoy/tree/master/annoy
I tried sc.addPyFile('venv.zip') and added --archives ./venv.zip#PYTHON \ in the spark-submit file,
but it still threw the same error message.
I can still use from annoy import AnnoyIndex in the spark-submit file, but every time I try to import it in the UDF like this
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

schema = ArrayType(StructType([
    StructField("char", IntegerType(), False),
    StructField("count", IntegerType(), False)
]))

f = 128

def return_candidate(x):
    from annoy import AnnoyIndex
    from pyspark import SparkFiles

    annoy = AnnoyIndex(f)
    annoy.load(SparkFiles.get("annoy.ann"))

    neighbor = 5
    annoy_object = annoy.get_nns_by_item(x, n=neighbor, include_distances=True)
    return annoy_object

return_candidate_udf = udf(lambda y: return_candidate(y), schema)
inter4 = inter3.select('*', return_candidate_udf('annoy_id').alias('annoy_candidate_list'))
I found the cause! Spark UDFs run in another executor process, and when you have a problem like yours it is because the environment variables are different!
In my case, I was developing, debugging and testing on Zeppelin, and it has two different interpreters for Python and Spark! When I installed the libs in the terminal, I could use the functions normally, but not inside a UDF!
Solution: just set the same environment for driver and executors via PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON.
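A minimal sketch of that fix, assuming driver and executors should share one virtualenv interpreter (the path is illustrative); the variables have to be set before the SparkSession/SparkContext is created:

import os

# make the driver and the executors use the same Python interpreter
os.environ["PYSPARK_PYTHON"] = "/path/to/venv/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/path/to/venv/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("annoy-udf").getOrCreate()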
I use pyspark.sql.functions.udf to define a UDF that uses a class imported from a .py module written by me.
from czech_simple_stemmer import CzechSimpleStemmer  # this is my class in my module
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

...some code here...

def clean_one_raw_doc(my_raw_doc):
    ... calls something from CzechSimpleStemmer ...

udf_clean_one_raw_doc = udf(clean_one_raw_doc, StringType())
When I call
df = spark.sql("SELECT * FROM mytable").withColumn("output_text", udf_clean_one_raw_doc("input_text"))
I get a typical huge error message where probably this is the relevant part:
File "/data2/hadoop/yarn/local/usercache/ja063930/appcache/application_1472572954011_132777/container_e23_1472572954011_132777_01_000003/pyspark.zip/pyspark/serializers.py", line 431, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'czech_simple_stemmer'
Do I understand it correctly that pyspark distributes udf_clean_one_raw_doc to all the worker nodes but czech_simple_stemmer.py is missing there in the nodes' python installations (being present only on the edge node where I run the spark driver)?
And if yes, is there any way I could tell PySpark to distribute this module too? I guess I could manually copy czech_simple_stemmer.py to all the nodes' Python installations, but 1) I don't have admin access to the nodes, and 2) even if I beg the admin to put it there and he does it, then if I ever need to do some tuning to the module itself, he'd probably kill me.
SparkContext.addPyFile("my_module.py") will do it.
From the spark-submit documentation:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
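For this case that would look roughly like the sketch below; the path of czech_simple_stemmer.py on the edge node is an assumption:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# ship the module from the edge node to every executor before defining the UDF
spark.sparkContext.addPyFile("/path/on/edge/node/czech_simple_stemmer.py")

from czech_simple_stemmer import CzechSimpleStemmer

udf_clean_one_raw_doc = udf(clean_one_raw_doc, StringType())  # clean_one_raw_doc as defined in the question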
I have a question regarding pymatbridge. I have been trying to use it as an alternative to the MATLAB Engine, which for some reason broke on me recently, and I haven't been able to get it to work again. I followed the instructions from GitHub, and when testing my script in the terminal, the ZMQ connection works great and gets established every single time. But when I copy-paste what's working in the terminal into a Python script, the connection fails every single time. I'm not familiar with ZMQ, but the problem seems to be systematic, so I was wondering if there is something obvious I'm missing. Here is my code.
import os
import glob
import csv
import numpy as np
import matplotlib.pylab as plt

# Alternative to the MATLAB Engine: pymatbridge
import pymatbridge as pymat

matlab = pymat.Matlab(executable='/Applications/MATLAB_R2015a.app/bin/matlab')

# Directory of Matlab functions
Matlab_dir = '/Users/cynthiagerlein/Dropbox (Personal)/Scatterometer/Matlab/'
# Directory with SIR data
SIR_dir = '/Volumes/blahblahblah/OriginalData/'
# Directory with matrix data
Data_dir = '/Volumes/blahblahblah/Data/'

# Create list of names of SIR files to open and save as matrices
os.chdir(SIR_dir)
# Save list of SIR file names
SIR_File_List = glob.glob("*.sir")

# Launch pymatbridge
matlab.start()

for the_file in SIR_File_List:
    print 'We are on file ', the_file
    Running_name = SIR_dir + the_file
    image = matlab.run_func('/Users/cynthiagerlein/Dropbox\ \(Personal\)/Scatterometer/Matlab/loadsir.m', Running_name)
    np.savetxt(Data_dir + the_file[:22] + '.txt.gz', np.array(image['result']))
I ended up using matlab_wrapper instead, and it's working great and was a lot easier to install and set up, but I am just curious to understand why pymatbridge is failing in my script but working in the terminal. By the way, I learned about both pymatbridge and matlab_wrapper in the amazing answer to this post (scroll down, 3rd answer).
I am trying to distribute some programs to a local cluster I built using Spark. The aim of this project is to pass some data to each worker, pass the data to an external MATLAB function for processing, and collect the results back to the master node. I ran into the problem of how to call a MATLAB function. Is it possible for Spark to call an external function? In other words, can we make each function parallelized in Spark search the local path of each node and execute an external function?
Here is a small test code:
run.py
import sys
from operator import add
from pyspark import SparkContext

import callmatlab

def run(a):
    # print '__a'
    callmatlab.sparktest()

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    output = sc.parallelize(range(1, 2)).map(run)
    print output
    sc.stop()
sparktest.py
import matlab.engine as eng
import numpy as np

eng = eng.start_matlab()

def sparktest():
    print "-----------------------------------------------"
    data = eng.sparktest()
    print "----the return data:\n", type(data), data

if __name__ == "__main__":
    sparktest()
The spark-submit script:
#!/bin/bash
path=/home/zzz/ProgramFiles/spark
$path/bin/spark-submit \
--verbose \
--py-files $path/hpc/callmatlab.py $path/hpc/sparktest.m \
--master local[4] \
$path/hpc/run.py \
README.md
It seems Spark expects all files attached as parameters of --py-files to be .py files, and it does not recognize sparktest.m.
I do not know how to continue. Could anyone give me some advice? Does Spark allow this at all? Or can anyone recommend another distributed Python framework?
Thanks
Thanks for trying to answer my question. I used a different way to solve this problem. I uploaded the MATLAB files and the data that needs to be called and loaded to a path on each node's file system, and the Python code just adds that path and calls it using the matlab.engine module.
So my callmatlab.py becomes
import matlab.engine as eng
import numpy as np
import os

eng = eng.start_matlab()

def sparktest():
    print "-----------------------------------------------"
    eng.addpath(os.path.join(os.getenv("HOME"), 'zzz/hpc/'), nargout=0)
    data = eng.sparktest([12, 1, 2])
    print data
Firstly, I do not see any reason to pass sparktest.m that way.
Secondly, the recommended way is to put the Python files in a .zip file. From the documentation:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
Finally, remember that your function will be executed in an executor JVM on a remote machine, so the Spark framework ships the function, its closure, and additional files as part of the job. Hope that helps.
Add the --files option before the sparktest.m.
That tells Spark to ship the sparktest.m file to all workers.
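On the worker side, the shipped file can then be located with SparkFiles; a minimal sketch, assuming sparktest.m was passed via --files and that starting a MATLAB engine inside the task is acceptable:

import os
import matlab.engine as eng
from pyspark import SparkFiles

def sparktest():
    engine = eng.start_matlab()
    # --files puts sparktest.m in each executor's working directory;
    # SparkFiles.get resolves its local path there
    engine.addpath(os.path.dirname(SparkFiles.get("sparktest.m")), nargout=0)
    return engine.sparktest([12, 1, 2])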