PySpark and argparse - python

How does one specify command line arguments using argparse for a PySpark script? I've been breaking my head over this one and I swear I can't find the solution anywhere else.
Here's my test script:
import argparse
from pyspark.sql import SparkSession

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--v1", "-a", type=int, default=2)
    parser.add_argument("--v2", "-b", type=int, default=3)
    args = vars(parser.parse_args())
    spark = (SparkSession.builder
             .appName("Test")
             .master("local[*]")
             .getOrCreate()
             )
    result = args['v1'] + args['v2']
    return result

if __name__ == "__main__":
    result = main()
    print(result)
When I try running the file using spark-submit file.py --v1 5 --v2 4, I get an error as shown below:
[TerminalIPythonApp] CRITICAL | Bad config encountered during initialization:
[TerminalIPythonApp] CRITICAL | Unrecognized flag: '--v1'
However, when I don't specify the arguments (just spark-submit file.py), it does the sum correctly, using the default values 2 and 3 from the argument parser, and displays "5" as expected. So clearly it's reading the values from argparse correctly. What's going wrong with the command when I actually pass non-default values?
NOTE: Am using PySpark 2.4.4 and Python 3.6.
EDIT: Of course, I could just use sys.argv and be done with it, but argparse is so much better!

Based on the TerminalIPythonApp error message (similar to this one), PySpark is launching the driver with ipython instead of python, so the script's arguments are handed to IPython, which rejects --v1 as an unrecognized flag. To fix this, point the Spark environment at python3, not ipython.
Add/modify the lines in /path/to/pyspark/conf/spark-env.sh:
export SPARK_HOME=/home/user/spark-2.4.0-bin-hadoop2.7/
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=python3"
This ensures that PySpark uses the python3 executable for both the driver and the workers, after which the argparse arguments should be read without any issues.
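Once the driver runs under plain python3, everything after the script name in spark-submit file.py --v1 5 --v2 4 reaches the script as ordinary sys.argv entries. Below is a minimal sketch, assuming you want to check the parsing on its own without starting Spark. The helper names build_parser and parse_sum are hypothetical, and the explicit list stands in for the command line.

```python
import argparse


def build_parser():
    # Same two options as the test script in the question.
    parser = argparse.ArgumentParser()
    parser.add_argument("--v1", "-a", type=int, default=2)
    parser.add_argument("--v2", "-b", type=int, default=3)
    return parser


def parse_sum(argv=None):
    # argv=None makes argparse read sys.argv[1:], which is what happens
    # under spark-submit; an explicit list is handy for testing.
    args = build_parser().parse_args(argv)
    return args.v1 + args.v2


# Equivalent of: spark-submit file.py --v1 5 --v2 4
total = parse_sum(["--v1", "5", "--v2", "4"])
```

Keeping the parser in its own function means the argument handling can be exercised without a Spark installation.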

Related

argparse not overriding default value

I managed to write some simple Python code using values from .csv to create an .svg file.
However argparse does not "pass" a value from command line - I'm not able to override the default.
I want to define [c]/[columns] with command line.
import argparse
parser = argparse.ArgumentParser("svg.py")
parser.add_argument("-c","--columns",help="number of columns (default=20)",default=20)
args = parser.parse_args()
when I run
svg.py -c 24
the value is still 20.
Maybe you have an error in how you pass the argument. Here is my working example:
svg.py
import argparse
parser = argparse.ArgumentParser("svg.py")
parser.add_argument("-c","--columns",help="number of columns (default=20)",default=20)
args = parser.parse_args()
print(args.columns)
It runs correctly
$ > python svg.py -c 24
24
Hope this helps to figure out the problem
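One detail worth noting in the example above: without type=int, argparse keeps a command-line value as a string, so -c 24 yields '24' while the default stays the integer 20. Here is a small sketch, assuming the column count should always be an integer. The explicit argument lists stand in for the command line.

```python
import argparse

parser = argparse.ArgumentParser("svg.py")
parser.add_argument("-c", "--columns", type=int, default=20,
                    help="number of columns (default=20)")

# Explicit lists replace sys.argv[1:] so both cases can be compared.
overridden = parser.parse_args(["-c", "24"])
defaulted = parser.parse_args([])
```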

running a python program using argparse to setup input In Jupyter Notebook

I have a .py file following the normal code structure
def main( args ):
.......
.......
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="forecasting example")
parser.add_argument("--train-window", default=2160, type=int)
parser.add_argument("--test-window", default=336, type=int)
parser.add_argument("--stride", default=168, type=int)
parser.add_argument("-n", "--num-steps", default=501, type=int)
parser.add_argument("-lr", "--learning-rate", default=0.05, type=float)
parser.add_argument("--dct", action="store_true")
parser.add_argument("--num-samples", default=100, type=int)
parser.add_argument("--log-every", default=50, type=int)
parser.add_argument("--seed", default=1234567890, type=int)
args = parser.parse_args()
main(args)
I was trying to run this program in a Jupyter notebook, but I get errors such as
usage: ipykernel_launcher.py [-h] [--train-window TRAIN_WINDOW]
[--test-window TEST_WINDOW] [--stride STRIDE]
[-n NUM_STEPS] [-lr LEARNING_RATE] [--dct]
[--num-samples NUM_SAMPLES]
[--log-every LOG_EVERY] [--seed SEED]
ipykernel_launcher.py: error: unrecognized arguments: -f C:\Users\AppData\Roaming\jupyter\runtime\kernel-4c744f03-3278-4aaf-af5e-50c96e9e41cd.json
An exception has occurred, use %tb to see the full traceback.
SystemExit: 2
My question is: what is the right approach, or what modifications do I need to make, to run a Python program that sets up its input parameters with argparse in a Jupyter notebook?
Your code should be indented differently so you can import it into your notebook, or into another Python script. The whole point of the if __name__ == "__main__": block is that its condition is true only when you run the file directly, not when you import it, so the code inside it is skipped on import. For that to work, the block must sit at the top level of the file, not inside any def or class or other block structure.
The way to use this from a notebook, then, is to call main (or whichever other functions from the imported code you want to run) with your desired parameters.
In this case, main has been designed to expect an argparse Namespace object as its argument, which is quite unfortunate. A better design would simply do the argument parsing inside main, and expose a different function or set of functions for reuse as a library.
Assuming your main function's internals look something like
def main(args):
    print(
        real_main(args.train_window, args.test_window,
                  stride=args.stride, num_steps=args.num_steps,
                  learning_rate=args.learning_rate,
                  dct=args.dct, num_samples=args.num_samples,
                  log_every=args.log_every, seed=args.seed))
and supposing you wanted to run the equivalent of
python thatfile.py -n 23 -lr 0.7 --dct --num-samples 2300
the equivalent code in your notebook would look like
from thatfile import real_main as that_main
print(that_main(2160, 336, num_steps=23,
                learning_rate=0.7, dct=True,
                num_samples=2300))
where the first two values are simply copied from the argparse defaults, and I obviously had to speculate a great deal about which parameters are required and which are optional keyword parameters, and whether they are named identically to the argparse field names.
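The design suggested above, with parsing kept inside main and a separate function exposed for reuse, might look like the following sketch. real_main and its body are placeholders, since the original program's internals are not shown, and only three of the options are reproduced.

```python
import argparse


def real_main(train_window, test_window, num_steps=501):
    # Placeholder body: the real forecasting code would go here.
    return {"train_window": train_window,
            "test_window": test_window,
            "num_steps": num_steps}


def main(argv=None):
    # Parsing lives inside main, so importing this module never reads sys.argv.
    parser = argparse.ArgumentParser(description="forecasting example")
    parser.add_argument("--train-window", default=2160, type=int)
    parser.add_argument("--test-window", default=336, type=int)
    parser.add_argument("-n", "--num-steps", default=501, type=int)
    args = parser.parse_args(argv)
    return real_main(args.train_window, args.test_window,
                     num_steps=args.num_steps)


# From a notebook: call the library function directly...
from_notebook = real_main(2160, 336, num_steps=23)
# ...or drive the command-line entry point with an explicit list.
from_cli = main(["-n", "23"])
```

Both calls produce the same result, and neither depends on what Jupyter has placed in sys.argv.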

how to fix "SystemExit: 2 error when calling parse_args()" in jupyter notebook [duplicate]

I am trying to pass BioPython sequences to Ilya Stepanov's implementation of Ukkonen's suffix tree algorithm in iPython's notebook environment. I am stumbling on the argparse component.
I have never had to deal directly with argparse before. How can I use this without rewriting main()?
By the by, this writeup of Ukkonen's algorithm is fantastic.
An alternative for using argparse in IPython notebooks is to pass the arguments explicitly, as a list of strings, to:
args = parser.parse_args()
(line 303 from the git repo you referenced.)
Would be something like:
parser = argparse.ArgumentParser(
    description='Searching longest common substring. '
                'Uses Ukkonen\'s suffix tree algorithm and generalized suffix tree. '
                'Written by Ilya Stepanov (c) 2013')
parser.add_argument(
    'strings',
    metavar='STRING',
    nargs='*',
    help='String for searching',
)
parser.add_argument(
    '-f',
    '--file',
    help='Path for input file. First line should contain number of lines to search in'
)
and
args = parser.parse_args("AAA --file /path/to/sequences.txt".split())
Edit: It works
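If an argument contains spaces, a plain str.split() breaks it apart. Here is a hedged sketch using the standard library's shlex.split, which honours shell-style quoting. The path below is made up for illustration, and the parser is a trimmed copy of the one above.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
parser.add_argument("strings", metavar="STRING", nargs="*")
parser.add_argument("-f", "--file")

# Hypothetical path containing a space; the quotes keep it as one argument.
command_line = 'AAA BBB --file "/path/to/my sequences.txt"'
args = parser.parse_args(shlex.split(command_line))
```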
Using args = parser.parse_args(args=[]) would solve the execution problem, since argparse then ignores sys.argv and falls back to the defaults.
Or you can declare the arguments as a plain class:
class Args:
    data = './data/penn'
    model = 'LSTM'
    emsize = 200
    nhid = 200

args = Args()
I've had a similar problem before, but using optparse instead of argparse.
You don't need to change anything in the original script, just assign a new list to sys.argv like so:
import sys

if __name__ == "__main__":
    from Bio import SeqIO
    path = '/path/to/sequences.txt'
    sequences = [str(record.seq) for record in SeqIO.parse(path, 'fasta')]
    # The first element fills the program-name slot; parsing starts at sys.argv[1:].
    sys.argv = ['-f'] + sequences
    main()
If all arguments have a default value, then adding this to the top of the notebook should be enough:
import sys
sys.argv = ['']
(otherwise, just add necessary arguments instead of the empty string)
I ended up using BioPython to extract the sequences and then editing Ilya Stepanov's implementation to remove the argparse methods.
import imp  # deprecated; removed in Python 3.12
from Bio import SeqIO

seqs = []
lcsm = imp.load_source('lcsm', '/path/to/ukkonen.py')
for record in SeqIO.parse('/path/to/sequences.txt', 'fasta'):
    seqs.append(record)
lcsm.main(seqs)
For the algorithm, I had main() take one argument, his strings variable, but this sends the algorithm a list of special BioPython Sequence objects, which the re module doesn't like. So I had to extract the sequence string, changing
suffix_tree.append_string(s)
to
suffix_tree.append_string(str(s.seq))
which seems kind of brittle, but that's all I've got for now.
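The imp module used above is deprecated and was removed in Python 3.12. Below is a sketch of the same load-from-path step with importlib. load_source here is a hypothetical helper, and the temporary file merely stands in for /path/to/ukkonen.py.

```python
import importlib.util
import os
import tempfile


def load_source(name, path):
    # Build a module spec from a file path, then execute the module.
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in module written to a temporary file for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_module.py")
    with open(path, "w") as f:
        f.write("def main(seqs):\n    return len(seqs)\n")
    demo = load_source("demo_module", path)
    count = demo.main(["ACGT", "TTGA"])
```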
I faced a similar problem when invoking argparse: the string '-f' was causing the problem. Just removing it from sys.argv does the trick.
import sys

if __name__ == '__main__':
    if '-f' in sys.argv:
        sys.argv.remove('-f')
    main()
Clean sys.argv
import sys; sys.argv=['']; del sys
https://github.com/spyder-ide/spyder/issues/3883#issuecomment-269131039
Here is my code, which works well and restores sys.argv afterwards, so I don't have to worry about changing the environment:
import sys
temp_argv = sys.argv
try:
    sys.argv = ['']
    print(sys.argv)
    args = parser.parse_args()  # parser is the ArgumentParser built earlier
finally:
    sys.argv = temp_argv
    print(sys.argv)
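The same save-and-restore idea can be wrapped in a context manager so it is reusable across notebook cells. This is only a sketch: patched_argv is a hypothetical helper name, and the --seed option is there purely to have something to parse.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(*args):
    # Temporarily replace sys.argv; the first element is the program-name slot.
    saved = sys.argv
    sys.argv = ["notebook"] + list(args)
    try:
        yield
    finally:
        # Always restore the original list, even if parsing raises.
        sys.argv = saved


parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=0)

before = list(sys.argv)
with patched_argv("--seed", "7"):
    args = parser.parse_args()  # reads the patched sys.argv[1:]
after = list(sys.argv)
```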
Suppose you have this small piece of code in Python:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", help="increase output verbosity",
                    action="store_true")
parser.add_argument("-v_1", "--verbose_1", help="increase output verbosity",
                    action="store_true")
args = parser.parse_args()
To get the same args object in a Jupyter notebook, write this:
import argparse
args = argparse.Namespace(verbose=False, verbose_1=False)
Note: when run as a script, argparse converts the command-line strings according to each argument's type, but a Namespace built by hand in a notebook does no conversion, so be careful with the data types of your arguments.
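To make the point about data types concrete, here is a small sketch comparing a parsed value with a hand-built Namespace. The --num-steps option is borrowed from the forecasting example purely for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--num-steps", type=int, default=501)

# Converted to int by type=int.
parsed = parser.parse_args(["--num-steps", "23"])
# Stored exactly as given: a hand-built Namespace performs no conversion.
by_hand = argparse.Namespace(num_steps="23")
# An empty list yields the declared defaults without reading sys.argv.
from_defaults = parser.parse_args([])
```

Passing an empty list to parse_args is one way to get a Namespace whose fields and types match the parser's defaults without retyping them.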
If arguments passed by the iPython environment can be ignored (do not conflict with the specified arguments), then the following works like a charm:
# REPLACE args = parser.parse_args() with:
args, unknown = parser.parse_known_args()
From: https://stackoverflow.com/a/12818237/11750716
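A sketch of what parse_known_args does with Jupyter-style arguments. The argument list is simulated and the JSON path is invented for this example. The --seed option stands in for whatever the real parser defines.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=1234567890)

# Simulated kernel arguments; the JSON path is invented for this example.
jupyter_style_argv = ["-f", "/tmp/kernel-example.json", "--seed", "7"]

# Recognized options land in args; everything else is collected in unknown.
args, unknown = parser.parse_known_args(jupyter_style_argv)
```

Note that this only helps when the parser does not define its own -f option; if it does, the kernel's -f value would be consumed as that option's argument.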
If you don't want to change any of the arguments or the working mechanism of the original argparse code you have written or copied, there is a simple solution that works most of the time.
You could just install jupyter-argparser using the command below:
pip install jupyter_argparser
The code then works without any changes, thanks to the maintainer of the package.

Using Ruffus library in Python 2.7, just_print flag fails

I've got a ruffus pipeline in Python 2.7, but when I call it with -n or --just_print it still runs all the actual tasks instead of just printing the pipeline like it's supposed to. I:
* don't have a -n argument that would supersede the built-in (although I do have other command-line arguments)
* have a bunch of functions with @transform() or @merge() decorators
* end the pipeline with a run_pipeline() call
Has anyone else experienced this problem? Many thanks!
As of ruffus version 2.4, you can use the builtin ruffus.cmdline which stores the appropriate flags via the cmdline.py module that uses argparse, for example:
from ruffus import *

parser = cmdline.get_argparse(description='Example pipeline')
options = parser.parse_args()

@originate("test_out.txt")
def run_testFunction(output):
    with open(output, "w") as f:
        f.write("it's working!\n")

cmdline.run(options)
Then run your pipeline from the terminal with a command like:
python script.py --verbose 6 --target_tasks run_testFunction --just_print
If you want to do this manually instead (which is necessary for older versions of ruffus), you can call pipeline_printout() rather than pipeline_run(), using argparse so that the --just_print flag leads to the appropriate call, for example:
from ruffus import *
import argparse
import sys

parser = argparse.ArgumentParser(description='Example pipeline')
parser.add_argument('--just_print', dest='feature', action='store_true')
parser.set_defaults(feature=False)
args = parser.parse_args()

@originate("test_out.txt")
def run_testFunction(output):
    with open(output, "w") as f:
        f.write("it's working!\n")

if args.feature:
    pipeline_printout(sys.stdout, run_testFunction, verbose=6)
else:
    pipeline_run(run_testFunction, verbose=6)
You would then run the command like:
python script.py --just_print
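Stripped of ruffus, the manual dispatch above comes down to a store_true flag choosing between a print-only path and a run path. The sketch below uses stand-in functions (print_plan, run_plan, and dispatch are hypothetical names, not ruffus APIs) so that the flag handling can be checked in isolation.

```python
import argparse


def print_plan(tasks):
    # Stand-in for pipeline_printout(): describe the tasks without running them.
    return ["would run " + name for name in tasks]


def run_plan(tasks):
    # Stand-in for pipeline_run(): pretend to execute each task.
    return ["ran " + name for name in tasks]


def dispatch(argv):
    parser = argparse.ArgumentParser(description="Example pipeline")
    # store_true makes the flag False unless it appears on the command line.
    parser.add_argument("-n", "--just_print", action="store_true")
    args = parser.parse_args(argv)
    tasks = ["run_testFunction"]
    return print_plan(tasks) if args.just_print else run_plan(tasks)


dry = dispatch(["--just_print"])
real = dispatch([])
```

If a pipeline still runs its tasks when the flag is given, checking the parsed flag value at this point is a reasonable first step.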
