My goal is to import a custom .py file into my spark application and call some of the functions included inside that file
Here is what I tried:
I have a test file called Test.py which looks as follows:
def func():
    print("Import is working")
Inside my Spark application I do the following (as described in the docs):
sc = SparkContext(conf=conf, pyFiles=['/[AbsolutePathTo]/Test.py'])
I also tried this instead (after the Spark context is created):
sc.addFile("/[AbsolutePathTo]/Test.py")
I even tried the following when submitting my spark application:
./bin/spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --py-files /[AbsolutePath]/Test.py ../Main/Code/app.py
However, I always get a name error:
NameError: name 'func' is not defined
when I am calling func() inside my app.py. (same error with 'Test' if I try to call Test.func())
Finally, I also tried importing the file inside the pyspark shell with the same command as above:
sc.addFile("/[AbsolutePathTo]/Test.py")
Strangely, I do not get an error on the import, but I still cannot call func() without getting the error. Also, not sure if it matters, but I'm running Spark locally on one machine.
I really tried everything I could think of, but still cannot get it to work. Probably I am missing something very simple. Any help would be appreciated.
Alright, actually my question is rather stupid. After doing:
sc.addFile("/[AbsolutePathTo]/Test.py")
I still have to import the Test.py file like I would import a regular python file with:
import Test
then I can call
Test.func()
and it works. I thought the "import Test" was unnecessary since I add the file to the Spark context, but apparently adding the file does not have the same effect as importing it.
Thanks mark91 for pointing me in the right direction.
UPDATE 28.10.2017:
As asked in the comments, here are more details on app.py:
from pyspark import SparkContext
from pyspark.conf import SparkConf
conf = SparkConf()
conf.setMaster("local[4]")
conf.setAppName("Spark Stream")
sc = SparkContext(conf=conf)
sc.addFile("Test.py")
import Test
Test.func()
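One caveat worth noting: sc.addFile() ships the file to the executors, but it does not put it on their sys.path, so tasks running on the workers still cannot import it directly. A minimal, hedged sketch of one workaround, assuming the default SparkFiles layout (uses_test is an illustrative name, not part of the original code):

from pyspark import SparkFiles
import sys

def uses_test(x):
    # on each executor, the shipped copy lives in SparkFiles' root directory
    sys.path.insert(0, SparkFiles.getRootDirectory())
    import Test
    Test.func()
    return x

sc.parallelize([1, 2, 3]).foreach(uses_test)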
Related
I am trying to call one Python script from another Python script using import, but it is giving an error. Can you please help me with how to do this?
Example:
case.py is one of my scripts, which has one function, generate_case('rule_id'). This function returns some value.
final.py is my other script, in which I am trying to call the above script and store the returned value in a variable.
This is what I am trying in Python:
import case as f_case
qry = ''
qry += f_case.generate_case('R162')
print(qry)
Error:
ModuleNotFoundError: No module named 'case'
Both the scripts are available in the same location.
Try this:
import os
import sys

scriptpath = ''  # directory containing case.py ('' resolves to the current working directory)
sys.path.append(os.path.abspath(scriptpath))

# Do the import
import case as f_case
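If final.py is launched from a different directory, the current working directory may not contain case.py at all. A small sketch that avoids depending on the working directory, assuming the two scripts really are in the same folder:

import os
import sys

# append the directory of this script rather than the current working directory
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import case as f_case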
I renamed the script to new_cons.py, and now it is working for me. Maybe because I had added an integer to the script name, it was not importing the script.
I want to make my own programming language based on Python which will provide additional features that Python does not, for example multiline anonymous functions with custom syntax. I want my programming language to be simple to use: you just import my script. My script then reads the script file which imported it, processes its code, and stops any further execution of the calling script to prevent syntax errors...
Let's say there are 2 .py files, main.py and MyLanguage.py.
main.py imports MyLanguage.py.
Then how do I get the main.py file from MyLanguage.py, given that main.py could have any other name (a dynamic name)?
Additional information:
I am using Python 3.4.4 on Windows 7.
Like Colonder, I believe the project you have in mind is far more difficult than you imagine.
But, to get you started, here is how to get the main.py file from inside MyLanguage.py. If your importing module looks like this:
# main.py
import MyLanguage

if __name__ == "__main__":
    print("Hello world from main.py")
and the module it is importing looks like this, in Python 3:
# MyLanguage.py
import inspect

def caller_discoverer():
    print('Importing file is', inspect.stack()[-1].filename)

caller_discoverer()
or (edit) like this, in Python 2:
# MyLanguage.py
import inspect

def caller_discoverer():
    print 'Importing file is', inspect.stack()[-1][1]

caller_discoverer()
then the output you will get when you run main.py is
Importing file is E:/..blahblahblah../StackOverflow-3.6/48034902/main.py
Hello world from main.py
I believe this answers the question you asked, though I don't think it goes very far towards achieving what you want. The reason for my scepticism is simple: the import statement expects a file containing valid Python, and if you want to import a file with your own non-Python syntax, then you are going to have to do some very clever stuff with import hooks. Without that, your program will simply fail at the import statement with a syntax error.
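To give a flavour of what those import hooks involve, here is a minimal, hedged sketch; the .mylang extension and the translate() function are assumptions of mine, and translate() is the identity here so the sketch stays runnable:

# mylang_hook.py -- minimal sketch of a meta-path import hook
import importlib.abc
import importlib.util
import os
import sys

def translate(source):
    # hypothetical translator from the custom syntax to Python
    return source

class MyLangLoader(importlib.abc.SourceLoader):
    def __init__(self, path):
        self._path = path

    def get_filename(self, fullname):
        return self._path

    def get_data(self, path):
        # feed the translated source to the normal compilation machinery
        with open(path, encoding='utf-8') as f:
            return translate(f.read()).encode('utf-8')

class MyLangFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        candidate = fullname + '.mylang'
        if os.path.exists(candidate):
            return importlib.util.spec_from_loader(fullname, MyLangLoader(candidate))
        return None

# after this, 'import foo' will also consider a foo.mylang file
sys.meta_path.insert(0, MyLangFinder())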
Best of luck.
I would like to import a .py file that contains some functions I need. I have saved the files __init__.py and util_func.py under this folder:
/usr/local/lib/python3.4/site-packages/myutil
util_func.py contains all the functions that I would like to use. I also need to create a PySpark UDF so I can use it to transform my dataframe. My code looks like this:
import pyspark.sql.functions
from pyspark.sql.types import StringType
import myutil
from myutil import util_func

myudf = pyspark.sql.functions.udf(util_func.ConvString, StringType())
Somewhere further down in the code, I am using this to convert one of the columns in my dataframe:
df = df.withColumn("newcol", myudf(df["oldcol"]))
Then I try to see whether it converts the column by using:
df.head()
It fails with an error "No module named myutil".
I am able to bring up the functions within IPython. Somehow the PySpark engine does not see the module. Any idea how to make sure that the PySpark engine picks up the module?
You must build an egg file of your package using setuptools and add the egg file to your application like below:
sc.addPyFile('<path of the egg file>')
here sc is the spark context variable.
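For illustration, a minimal sketch of building such an egg with setuptools; the package name, version, and resulting file name are assumptions based on the question:

# setup.py -- minimal sketch, placed next to the myutil package directory
from setuptools import setup, find_packages

setup(
    name='myutil',
    version='0.1',
    packages=find_packages(),
)

# build the egg:    python setup.py bdist_egg
# then in the app:  sc.addPyFile('dist/myutil-0.1-py3.4.egg')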
Sorry for hijacking the thread. I want to reply to #rouge-one's comment but I don't have enough reputation to do it.
I'm having the same problem as the OP, but this time the module is not a single .py file but the Spotify annoy package: https://github.com/spotify/annoy/tree/master/annoy
I tried sc.addPyFile('venv.zip') and added --archives ./venv.zip#PYTHON \ to the spark-submit command, but it still threw the same error message.
I can still use from annoy import AnnoyIndex at the top of the submitted script, but every time I try to import it in the UDF like this, it fails:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

schema = ArrayType(StructType([
    StructField("char", IntegerType(), False),
    StructField("count", IntegerType(), False)
]))

f = 128

def return_candidate(x):
    from annoy import AnnoyIndex
    from pyspark import SparkFiles
    annoy = AnnoyIndex(f)
    annoy.load(SparkFiles.get("annoy.ann"))
    neighbor = 5
    annoy_object = annoy.get_nns_by_item(x, n=neighbor, include_distances=True)
    return annoy_object

return_candidate_udf = udf(lambda y: return_candidate(y), schema)
inter4 = inter3.select('*', return_candidate_udf('annoy_id').alias('annoy_candidate_list'))
I found the issue! A Spark UDF runs in a separate executor process, and when you have a problem like yours, its environment variables are different!
In my case, I was developing, debugging, and testing on Zeppelin, which has two different interpreters for Python and Spark! When I installed the libs in the terminal, I could use the functions normally, but in a UDF I could not!
Solution: just set the same environment for the driver and the executors via PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON.
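A minimal sketch of setting that from the driver script itself, assuming the same interpreter path exists on the worker nodes; it must run before the SparkContext is created:

import os
import sys

# make the executors use the same Python as the driver
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable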
I am trying to distribute some programs to a local cluster I built using Spark. The aim of this project is to pass some data to each worker, pass the data to an external MATLAB function to process, and collect the results back on the master node. I ran into the problem of how to call the MATLAB function. Is it possible for Spark to call an external function? In other words, can we make each function parallelized by Spark search the local path of its node to execute the external function?
Here is some small test code:
run.py
import sys
from operator import add
from pyspark import SparkContext
import callmatlab

def run(a):
    # print '__a'
    callmatlab.sparktest()

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    output = sc.parallelize(range(1, 2)).map(run)
    print(output)
    sc.stop()
sparktest.py
import matlab.engine as eng
import numpy as np

eng = eng.start_matlab()

def sparktest():
    print("-----------------------------------------------")
    data = eng.sparktest()
    print("----the return data:\n", type(data), data)

if __name__ == "__main__":
    sparktest()
The submit script:
#!/bin/bash
path=/home/zzz/ProgramFiles/spark
$path/bin/spark-submit \
--verbose \
--py-files $path/hpc/callmatlab.py $path/hpc/sparktest.m \
--master local[4] \
$path/hpc/run.py \
README.md
It seems Spark expects every file attached via --py-files to be a .py file; Spark does not recognize sparktest.m.
I do not know how to continue. Could anyone give me some advice? Does Spark allow this? Or is there another distributed Python framework you would recommend?
Thanks
Thanks for trying to answer my question. I used a different way to solve this problem: I uploaded the MATLAB files and the data they need to a path on each node's file system, and the Python code just adds that path and calls MATLAB using the matlab.engine module.
So my callmatlab.py becomes:
import matlab.engine as eng
import numpy as np
import os

eng = eng.start_matlab()

def sparktest():
    print("-----------------------------------------------")
    eng.addpath(os.path.join(os.getenv("HOME"), 'zzz/hpc/'), nargout=0)
    data = eng.sparktest([12, 1, 2])
    print(data)
Firstly, I do not see any reason to pass sparktest.m at all.
Secondly, the recommended way is to put your Python dependencies in a .zip file. From the documentation:
For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
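As a hedged illustration of that packaging advice, using the file names from the question (make_deps.py and deps.zip are names I made up):

# make_deps.py -- bundle the Python helpers into a zip for --py-files
import zipfile

with zipfile.ZipFile('deps.zip', 'w') as zf:
    zf.write('callmatlab.py')

# then submit with: spark-submit --py-files deps.zip run.py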
Finally, remember that your function will be executed by an executor on a remote machine, so the Spark framework ships the function, its closure, and the additional files as part of the job. Hope that helps.
Add the
--files
option before sparktest.m (that is, ship sparktest.m via --files rather than --py-files).
That tells Spark to ship the sparktest.m file to all workers.
I am working on Project Euler and wanted to time all of my code. What I have is a directory of files of the form 'problemxxx.py', where xxx is the problem number. Each of these files has a main() function that returns the answer. So I have created a file called run.py, located in the same directory as the problem files. I am able to get the name of the file through the command prompt, but when I try to import the problem file, I keep getting ImportError: No module named problem. Below is the code for run.py so far, along with the commands I used.
# run.py
import sys

problem = sys.argv[1]
import problem  # I have also tried 'from problem import main', with the same result
# will add timeit functions later, but trying to get this to run first
problem.main()
The commands that I have tried are the following (both of which give the ImportError stated above):
python run.py problem001
python run.py problem001.py
How can I import the function main() from the file problem001.py? Does importing not work with the file name stored in a variable? Is there a better solution than getting the file name through the command prompt? Let me know if I need to add more information, and thank you for any help!
You can do this by using the __import__() function.
# run.py
import sys

problem = __import__(sys.argv[1], fromlist=["main"])
problem.main()
Then if you have problem001.py like this:
def main():
    print("In sub_main")
Calling python run.py problem001 prints:
In sub_main
A cleaner way to do this (instead of the __import__ way) is to use the importlib module. Your run.py then becomes:
import importlib
import sys

problem = importlib.import_module(sys.argv[1])
Alternatives are mentioned in this question.
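Since the question mentions adding timeit functions later, here is a hedged sketch (Python 3) of how run.py might grow; the timing code is mine, only the surrounding names come from the question:

# run.py -- importlib variant with the timing the question mentions
import importlib
import sys
import time

problem = importlib.import_module(sys.argv[1])

start = time.perf_counter()
answer = problem.main()
elapsed = time.perf_counter() - start

print(answer)
print('%s took %.3f seconds' % (sys.argv[1], elapsed))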
For sure! You can use the __import__ built-in function, like __import__(problem). However, this is not recommended, because it is not nice in terms of coding style. I think if you are using this for testing purposes then you should use the unittest module; either way, try to avoid these constructions.
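As a hedged illustration of that unittest suggestion, with the module and function names taken from the question:

# test_problems.py -- sketch of checking a problem module with unittest
import importlib
import unittest

class TestProblem001(unittest.TestCase):
    def test_main_returns_an_answer(self):
        problem = importlib.import_module('problem001')
        self.assertIsNotNone(problem.main())

if __name__ == '__main__':
    unittest.main()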
Regards
You can use the exec() trick:
import sys
problem = sys.argv[1]
exec('import %s' % problem)
exec('%s.main()' % problem)