How to call python script in Spark? - python

I have a metrics.py which calculates a graph.
I can call it in the terminal command line (python ./metrics.py -i [input] [output]).
I want to write a function in Spark that calls the metrics.py script on a provided file path and collects the values that metrics.py prints out.
How can I do that?

To run metrics.py, you essentially need to ship it to all the executor nodes that run your Spark job.
To do this, you either pass it via SparkContext -
sc = SparkContext(conf=conf, pyFiles=['path_to_metrics.py'])
or pass it later using the Spark Context's addPyFile method -
sc.addPyFile('path_to_metrics.py')
In either case, remember to import metrics afterwards and then call whichever function produces the output you need.
import metrics
metrics.relevant_function()
Also make sure that all the Python libraries imported inside metrics.py are installed on all executor nodes. Otherwise, ship them using the --py-files and --jars options when you spark-submit your job.
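Putting it together, a minimal sketch could look like the following; metrics.compute is a hypothetical function standing in for whatever metrics.py actually exposes, and the input paths are placeholders:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("metrics-example")
sc = SparkContext(conf=conf)
sc.addPyFile('path_to_metrics.py')          # ship metrics.py to every executor

paths = sc.parallelize(['/data/input1', '/data/input2'])   # placeholder paths

def run_metrics(path):
    import metrics                          # resolved on the executor after addPyFile
    return metrics.compute(path)            # hypothetical function inside metrics.py

results = paths.map(run_metrics).collect()
print(results)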

Related

How to call . /home/test.sh file in python script

I have a file /home/test.sh, sourced as . /home/test.sh (the space between the first . and / is intentional), which sets some environment variables. I need to load this file and then run the .py. If I first run the command manually on the Linux server and then run the Python script, it generates the required output. However, I want to call . /home/test.sh from within Python to load the profile and then run the rest of the code. If this profile is not loaded, the Python script runs but gives 0 as output.
The call
subprocess.call('. /home/test.sh',shell=True)
runs fine but the profile is not loaded on the Linux terminal to execute python code and give the desired output.
Can someone help?
Environment variables set in a child process are not propagated back to the parent process, which is why your simple approach does not work: subprocess.call starts a shell, sources the file there, and then exits, leaving the environment of your Python process unchanged.
If you are trying to pick up environment variables that have been set in your test.sh, then one thing you could do instead is to use env in a sub-shell to write them to stdout after sourcing the script, and then in Python you can parse these and set them locally.
The code below will work provided that test.sh does not write any output itself. (If it does, a workaround is to echo some separator string after sourcing it and before running env, and then in the Python code strip off the separator string and everything before it; a sketch of that variant follows after the code below.)
import subprocess
import os

p = subprocess.Popen(". /home/test.sh; env -0", shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)
out, _ = p.communicate()

for varspec in out.decode().split("\x00")[:-1]:
    pos = varspec.index("=")
    name = varspec[:pos]
    value = varspec[pos + 1:]
    os.environ[name] = value

# just to test whether it works - output of the following should include
# the variables that were set
os.system("env")
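For completeness, the separator variant mentioned above could look roughly like this sketch; __ENV_START__ is an arbitrary marker string chosen here, not anything test.sh requires:
import os
import subprocess

# Source the profile, let it print whatever it wants, then emit a marker
# followed by the NUL-separated environment.
cmd = ". /home/test.sh; echo __ENV_START__; env -0"
out = subprocess.check_output(cmd, shell=True)

# Keep only what comes after the marker (plus the newline echo adds).
env_blob = out.split(b"__ENV_START__", 1)[1].lstrip(b"\n")
for varspec in env_blob.decode().split("\x00")[:-1]:
    name, _, value = varspec.partition("=")
    os.environ[name] = value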
It is also worth considering that if all you want is to set some environment variables every time before you run any Python code, one option is simply to source your test.sh from a shell-script wrapper and not try to set them inside Python at all:
#!/bin/sh
. /home/test.sh
exec "/path/to/your/python/script $#"
Then when you want to run the Python code, you run the wrapper instead.

Export environment variables at runtime with airflow

I am currently converting workflows that were previously implemented in bash scripts to Airflow DAGs. In the bash scripts, I was just exporting the variables at run time with
export HADOOP_CONF_DIR="/etc/hadoop/conf"
Now I'd like to do the same in Airflow, but haven't found a solution for this yet. The one workaround I found was setting the variables with os.environ[VAR_NAME]='some_text' outside of any method or operator, but that means they get exported the moment the script gets loaded, not at run time.
Now when I try to call os.environ[VAR_NAME] = 'some_text' in a function that gets called by a PythonOperator, it does not work. My code looks like this
def set_env():
    os.environ['HADOOP_CONF_DIR'] = "/etc/hadoop/conf"
    os.environ['PATH'] = "somePath:" + os.environ['PATH']
    os.environ['SPARK_HOME'] = "pathToSparkHome"
    os.environ['PYTHONPATH'] = "somePythonPath"
    os.environ['PYSPARK_PYTHON'] = os.popen('which python').read().strip()
    os.environ['PYSPARK_DRIVER_PYTHON'] = os.popen('which python').read().strip()

set_env_operator = PythonOperator(
    task_id='set_env_vars_NOT_WORKING',
    python_callable=set_env,
    dag=dag)
Now when my SparkSubmitOperator gets executed, I get the exception:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
My use case is a SparkSubmitOperator that submits jobs to YARN, so either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. Setting them in my .bashrc or any other config is sadly not possible for me, which is why I need to set them at runtime.
Preferably I'd like to set them in an Operator before executing the SparkSubmitOperator, but if there was the possibility to pass them as arguments to the SparkSubmitOperator, that would be at least something.
From what I can see in the SparkSubmitOperator, you can pass environment variables to spark-submit as a dictionary:
:param env_vars: Environment variables for spark-submit. It
supports yarn and k8s mode too.
:type env_vars: dict
Have you tried this?
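A minimal sketch of what that could look like, assuming an Airflow 1.x contrib import path; the application path and connection id are placeholders:
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

spark_submit = SparkSubmitOperator(
    task_id='spark_submit_with_env',
    application='/path/to/app.py',            # placeholder
    conn_id='spark_default',
    env_vars={'HADOOP_CONF_DIR': '/etc/hadoop/conf',
              'YARN_CONF_DIR': '/etc/hadoop/conf'},
    dag=dag)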

How to run External collector from Scollector?

I am trying to run an external sample.py script, placed in the /path-to-scollector/collectors/0 folder, from scollector.
scollector.toml:
Host = "localhost:0"
ColDir="//path-to-scollector//collectors//"
BatchSize=500
DisableSelf=true
command to run scollector:
scollector-windows-amd64.exe -conf scollector.toml -p
But I am not getting the sample.py metrics in the output. It is expected to run continuously and print output to the console. Also, when I run:
scollector-windows-amd64.exe -conf scollector.toml -l
my external collector is not listed.
In your scollector.toml, you should have a line like this:
Filter=["sample.py"]
In your sample.py, you need this line:
#!/usr/bin/python
The above solution works well when running scollector on a Linux machine, but Windows is a bit trickier: scollector running on Windows can only execute batch files, so we need to do a little extra work there.
Create an external collector:
It can be written in any language (Python, Java, etc.). It contains the main code that gets the data and prints it to the console (see the sketch after these steps).
Example my_external_collector.py
Create a wrapper batch script:
wrapper_external_collector.bat.
Trigger my_external_collector.py inside wrapper_external_collector.bat.
python path_to_external/my_external_collector.py
You can pass arguments to the script as well. The only disadvantage is that we need to maintain two scripts.
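As a rough sketch of the collector itself, assuming the simple "metric timestamp value tag=value" line format that scollector reads from an external collector's stdout (the metric name and tag below are made up); the wrapper batch file then just contains the python line shown above:
# my_external_collector.py - minimal external collector sketch
import time

now = int(time.time())
# one data point: hypothetical metric "example.dummy_metric" with value 42
print("example.dummy_metric %d 42 host=myhost" % now)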

How can I call an OpenModelica model in Python with OMPython?

I have an OpenModelica model made with OMEdit. In order to get a concrete example I designed the following:
Now I would like to run the model in Python. I can do this by using OMPython. After importing OMPython and loading the files I use the following command to run the simulation:
result = OMPython.execute("simulate(myGain, numberOfIntervals=2, outputFormat=\"mat\")")
The simulation now runs and the results are written to a file.
Now I would like to run the same model but with a different parameter for the constant block.
How can I do this?
Since the parameter is compiled into the model it should not be possible to change it. So what I need is a model like that:
Is it possible to call the model from Python and set the variable "a" to a specific value?
With the command OMPython.execute("simulate(...)") I can specify some environment variables like "numberOfIntervals" or "outputFormat" but not more.
You can send more flags to the simulate command, for example simflags, to override parameters. See https://openmodelica.org/index.php/forum/topic?id=1011 for some details.
You can also use the buildModel(...) command followed by system("./ModelName -overrideFile ...") to avoid re-translation and re-compilation, or, with some minor scripting, run parallel parameter sweeps. If you use Linux or OSX it should be easy to call OMPython to create the executable and then call it yourself. On Windows you need to set up some environment variables for it to work as expected.
I believe you are looking for the setParameterValue command. You can read about it here: https://openmodelica.org/download/OMC_API-HowTo.pdf
Basically you would add a line similar to OMPython.execute("setParameterValue(myGain, a, 20)") to your python script before the line where you run the simulation, so long as a is a parameter in your model.
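In the style of the question's own OMPython.execute calls, that would be roughly as follows (assuming a is a parameter of the myGain model):
OMPython.execute("setParameterValue(myGain, a, 20)")
result = OMPython.execute("simulate(myGain, numberOfIntervals=2, outputFormat=\"mat\")")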
Create a new folder in Windows.
In this folder create two new files, file1.py and file2.bat.
The file1.py content is:
import os
import sys
sys.path.insert(0, r"C:\OpenModelica1.11.0-32bit\share\omc\scripts\PythonInterface")
from OMPython import OMCSession
sys.path.insert(0, r"C:\OpenModelica1.11.0-32bit\lib\python")
os.environ['USER'] = 'stefanache'
omc = OMCSession()
omc.sendExpression("loadModel(Modelica)")
omc.sendExpression("loadFile(getInstallationDirectoryPath() + \"/share/doc/omc/testmodels/BouncingBall.mo\")")
omc.sendExpression("instantiateModel(BouncingBall)")
omc.sendExpression("simulate(BouncingBall)")
omc.sendExpression("plot(h)")
The file2.bat content is:
@echo off
python file1.py
pause
Then double-click file2.bat... and please be patient!
The plotted result window will appear.

String parameter using subprocess module

I am using Python to simplify some commands in Maven. I have this script which calls mvn test in debug mode.
from subprocess import call
commands = []
commands.append("mvn")
commands.append("test")
commands.append("-Dmaven.surefire.debug=\"-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Xnoagent -Djava.compiler=NONE\"")
call(commands)
The problem is with the -Dmaven.surefire.debug line, which takes a parameter that has to be in quotes, and I don't know how to do that correctly. The list looks fine when I print it, but when I run the script I get Error translating CommandLine and the debugging line is never executed.
The quotes are only required for the shell executing the command.
If you made the same call directly from the shell, you would probably run
mvn test -Dmaven.surefire.debug="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Xnoagent -Djava.compiler=NONE"
With these " signs you (simply spoken) tell the shell to ignore the spaces within.
The program is called with the arguments
mvn
test
-Dmaven.surefire.debug=-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Xnoagent -Djava.compiler=NONE
so
from subprocess import call
commands = []
commands.append("mvn")
commands.append("test")
commands.append("-Dmaven.surefire.debug=-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Xnoagent -Djava.compiler=NONE")
call(commands)
should be the way to go.
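If you want to double-check what argument list the shell form actually produces, shlex applies the same quoting rules as the shell:
import shlex

cmd = 'mvn test -Dmaven.surefire.debug="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Xnoagent -Djava.compiler=NONE"'
# prints three items: 'mvn', 'test', and the single -Dmaven.surefire.debug=... argument
print(shlex.split(cmd))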
