Change Mapreduce intermediate output location using MRJob

I am trying to run a Python script using MRJob on a cluster where I don't have admin permissions, and I got the error pasted below. I think the job is trying to write its intermediate files to the default /tmp/... directory; since I don't have write permission there, the job fails and exits. How can I change this temporary output location to a directory I can write to, for example /home/myusername/some_path_in_my_local_filesystem_on_the_cluster? In other words, what additional parameters do I have to pass to move the intermediate output from /tmp/... to a location where I have write permission?
I invoke my script as:
python myscript.py input.txt -r hadoop > output.txt
The error:
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/13435.1.all.q/mr_word_freq_count.myusername.20131215.004905.274232
writing wrapper script to /tmp/13435.1.all.q/mr_word_freq_count.myusername.20131215.004905.274232/setup-wrapper.sh
STDERR: mkdir: org.apache.hadoop.security.AccessControlException: Permission denied: user=myusername, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
Traceback (most recent call last):
File "/home/myusername/privatemodules/python/examples/mr_word_freq_count.py", line 37, in <module>
MRWordFreqCount.run()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/job.py", line 500, in run
mr_job.execute()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/job.py", line 518, in execute
super(MRJob, self).execute()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/launch.py", line 146, in execute
self.run_job()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/launch.py", line 207, in run_job
runner.run()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
self._run()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/hadoop.py", line 236, in _run
self._upload_local_files_to_hdfs()
File "/home/myusername/.local/lib/python2.7/site-packages/mrjob/hadoop.py", line 263, in _upload_local_files_to_hdfs
self._mkdir_on_hdfs(self._upload_mgr.prefix)

Are you running mrjob as a "local" job, or trying to run it on your Hadoop cluster?
If you are actually trying to use it on Hadoop, you can control the "scratch" HDFS location (where mrjob will store intermediate files) using the --base-tmp-dir flag:
python mr.py -r hadoop -o hdfs:///user/you/output_dir --base-tmp-dir hdfs:///user/you/tmp hdfs:///user/you/data.txt
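A minimal sketch of assembling that command line from Python, for example in a small wrapper script. The helper name build_mrjob_argv and all paths are hypothetical; the flags -r, -o and --base-tmp-dir are simply the ones used in the command above, and the name of the temp-dir option may differ between mrjob versions, so it is worth checking the script's --help output first.

```python
import shlex


def build_mrjob_argv(script, input_path, output_dir, tmp_dir):
    """Return the argv list for running an mrjob script on Hadoop.

    All paths are illustrative; the flags mirror the command above.
    """
    return [
        "python", script,
        "-r", "hadoop",
        "-o", output_dir,
        "--base-tmp-dir", tmp_dir,  # flag name taken from the answer above
        input_path,
    ]


argv = build_mrjob_argv(
    "mr.py",
    "hdfs:///user/you/data.txt",
    "hdfs:///user/you/output_dir",
    "hdfs:///user/you/tmp",
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(a) for a in argv))
```

The list form can be handed to subprocess.run(argv) directly, which avoids shell quoting problems if a path contains spaces.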

Related

"libclntsh.so: cannot open shared object file in ubuntu to run python program in Spark Cluster

I have a Python program that works without any issue locally. But when I run it on a Spark cluster, I get an error about libclntsh.so. The cluster has two nodes.
To explain further: to run the program on the cluster, I first set the master IP address in spark-env.sh like this:
export SPARK_MASTER_HOST=x.x.x.x
Then I write the IP addresses of the worker nodes to $SPARK_HOME/conf/workers.
After that, I start the master with this line:
/opt/spark/sbin/start-master.sh
Then I start the workers:
/opt/spark/sbin/start-worker.sh spark://x.x.x.x:7077
Next I check that the Spark UI is up. Then I run the program on the master node like this:
/opt/spark/bin/spark-submit --master spark://x.x.x.x:7077 --files sparkConfig.json --py-files cst_utils.py,grouping.py,group_state.py,g_utils.py,csts.py,oracle_connection.py,config.py,brn_utils.py,emp_utils.py main.py
When the above command is run, I receive this error:
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 594, in process
out_iter = func(split_index, iterator)
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 418, in func
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2144, in combineLocally
File "/opt/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 240, in mergeValues
for k, v in iterator:
File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
return f(*args, **kwargs)
File "/opt/spark/work/app-20220221165611-0005/0/customer_utils.py", line 340, in read_cst
df_group = connection.read_sql(query_cnt)
File "/opt/spark/work/app-20220221165611-0005/0/oracle_connection.py", line 109, in read_sql
self.connect()
File "/opt/spark/work/app-20220221165611-0005/0/oracle_connection.py", line 40, in connect
self.conn = cx_Oracle.connect(db_url)
cx_Oracle.DatabaseError: DPI-1047: Cannot locate a 64-bit Oracle Client library:
"libclntsh.so: cannot open shared object file: No such file or directory".
I set these environment variables in ~/.bashrc:
export ORACLE_HOME=/usr/share/oracle/instantclient_19_8
export LD_LIBRARY_PATH=$ORACLE_HOME:$LD_LIBRARY_PATH
export PATH=$ORACLE_HOME:$PATH
export JAVA_HOME=/usr/lib/jvm/java/jdk1.8.0_271
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PATH=$PATH:$JAVA_HOME/bin
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_HOME=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=python3.8
Could you please tell me what is wrong?
Any help would be appreciated.

Problem solved. Following the troubleshooting link, I first created a file named InstantClient.conf in /etc/ld.so.conf.d/ and wrote the path to the Instant Client directory in it:
# instant client Path
/usr/share/oracle/instantclient_19_8
Finally, I ran this command:
sudo ldconfig
Then I ran spark-submit again and it worked without the Instant Client error.
I hope this is helpful for others.
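Because the traceback paths are under /opt/spark/work, the error is raised on a worker, so the library has to be resolvable on every node, not only on the machine that runs spark-submit. A minimal sketch for checking this after running ldconfig; the helper name can_load_shared_library is hypothetical, and the script should be run with the same interpreter the workers use.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False
    return True


# libclntsh.so is the Oracle Client library named in the DPI-1047 error.
print("libclntsh.so loadable:", can_load_shared_library("libclntsh.so"))
```

If this prints False on a node, cx_Oracle will most likely fail there with the same DPI-1047 error.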

Getting an error while running an mrjob Python script on a Hadoop cluster

Hi, I want to sort movie ratings with a Python script, but I am getting an error:
`[root@sandbox-hdp maria_dev]# python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in $PATH...
Found hadoop binary: /usr/bin/hadoop
Using Hadoop version 3.1.1.3.0.1.0
Creating temp directory /tmp/RatingsBreakdown.maria_dev.20190830.233300.332634
STDERR: mkdir: Permission denied: user=root, access=WRITE, inode="/user/maria_dev":maria_dev:hdfs:drwxr-xr-x
Traceback (most recent call last):
File "RatingsBreakdown.py", line 19, in <module>
RatingsBreakdown.run()
File "/usr/lib/python2.7/site-packages/mrjob/job.py", line 446, in run
mr_job.execute()
File "/usr/lib/python2.7/site-packages/mrjob/job.py", line 473, in execute
super(MRJob, self).execute()
File "/usr/lib/python2.7/site-packages/mrjob/launch.py", line 202, in execute
self.run_job()
File "/usr/lib/python2.7/site-packages/mrjob/launch.py", line 247, in run_job
return self._handle(name, path, path)
File "/usr/lib/python2.7/site-packages/mrjob/fs/composite.py", line 118, in _handle
return getattr(fs, name)(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/mrjob/fs/hadoop.py", line 298, in mkdir
raise IOError("Could not mkdir %s" % path)
IOError: Could not mkdir hdfs:///user/maria_dev/tmp/mrjob/RatingsBreakdown.maria_dev.20190830.233300.332634/files/wd`
Can you please describe what the problem is here?

Please take a look at these 2 references:
Permission denied at hdfs
Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:dr

I found that the Hortonworks sandbox takes a long time to boot up.
Once it had booted up completely, it worked fine.
It took about 1 hour to boot.

Tensorflow access denied in config_util.py while training

I have a problem with training a model in TensorFlow. I am working on Windows 10. When I run the command:
python ./object_detection/model_main.py --pipeline_config_path=C:/Tensorflow/object-detection/ssd_mobilenet_v1_coco_2018_01_28 --model_dir=C:/Tensorflow/object-detection/output-model --num_train_steps=50000 --sample_1_of_n_eval_examples=1 --alsologtostderr
from C:/Tensorflow/models/research to start the training process, I get an error at line 95 (proto_str = f.read()) of the config_util.py script. Below you can see my whole console output:
Traceback (most recent call last):
File "./object_detection/model_main.py", line 109, in <module>
tf.app.run()
File "C:\Users\lucci\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "./object_detection/model_main.py", line 71, in main
FLAGS.sample_1_of_n_eval_on_train_examples))
File "C:\Tensorflow\models\research\object_detection\model_lib.py", line 536, in create_estimator_and_inputs
config_override=config_override)
File "C:\Tensorflow\models\research\object_detection\utils\config_util.py", line 95, in get_configs_from_pipeline_file
proto_str = f.read()
File "C:\Users\lucci\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 125, in read
self._preread_check()
File "C:\Users\lucci\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 85, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "C:\Users\lucci\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: NewRandomAccessFile failed to Create/Open: C:/Tensorflow/object-detection/ssd_mobilenet_v1_coco_2018_01_28 : Zugriff verweigert
; Input/output error
The error is in the last two lines: "Zugriff verweigert" is German and means "access denied".
I am an admin on this PC (it's my own PC) and I have FullControl on the folders (I double-checked with PowerShell). When I move the folder to another place, e.g. C:\Users\lucci\Documents\, I get the same error. The problem also remains when I run the console as admin, when I try the command runas /user:lucci ..., and so on.
Can anyone help me with this?
I am using Python 3.6.
EDIT: This also does not help: Tensorflow Windows Accessing Folders Denied:"NewRandomAccessFile failed to Create/Open: Access is denied. ; Input/output error"

I finally found the solution on my own. It does not matter whether you are on Linux or Windows.
When you run the command, you always have to specify the full path to the pipeline.config file itself, not just its directory.
So if your pipeline.config file is located at C:/ObjectDetection/Model/pipeline.config, it is not sufficient to specify the location as C:/ObjectDetection/Model/. You have to specify it as C:/ObjectDetection/Model/pipeline.config.
Running the command again with the full path for the --pipeline_config_path parameter works.
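A minimal sketch of a check that could be run before starting training, assuming the config file is named pipeline.config as in the answer above. The helper name resolve_pipeline_config is hypothetical and is not part of the object detection API; it only turns a directory into the path of the file inside it and fails early with a clearer message than "access denied".

```python
import tempfile
from pathlib import Path


def resolve_pipeline_config(path, filename="pipeline.config"):
    """Return the path of the config file itself, not its directory."""
    p = Path(path)
    if p.is_dir():
        # A directory was given: look for the config file inside it.
        p = p / filename
    if not p.is_file():
        raise FileNotFoundError(f"No pipeline config file at {p}")
    return p


# Demonstration with a throwaway directory standing in for the model folder.
with tempfile.TemporaryDirectory() as model_dir:
    (Path(model_dir) / "pipeline.config").write_text("# placeholder\n")
    resolved = resolve_pipeline_config(model_dir)
    print(resolved.name)
```

The resolved path is what would be passed to --pipeline_config_path.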

`/bin/sh` not found in tox within Jenkins inside docker

I am trying to set up a Jenkins pipeline which runs tox inside a docker container. There is a known issue that shebang lines get very long inside Jenkins, and two solutions are proposed. The first is to use --workdir to select a shorter path. This option works in principle, but I lose the automatic unique path names per project from Jenkins. I would thus prefer to use the second option, TOX_LIMITED_SHEBANG.
Unfortunately, that seems to fail with the following error when the package under test is supposed to be installed: FileNotFoundError: [Errno 2] No such file or directory: "b'/bin/sh'": "b'/bin/sh'". I have verified that /bin/sh is in fact available in the docker container. The Jenkinsfile looks as follows:
node("docker") {
// burnpanck/tox-base contains tox and many python versions
docker.image('burnpanck/tox-base').inside {
checkout scm
stage('Matrix-test using Tox') {
// verify that /bin/sh exists
sh 'ls -al /bin'
// the following does not work
sh 'TOX_LIMITED_SHEBANG=1 tox -vv'
// the following works
// sh 'tox --workdir=/var/jenkins_home/tox'
}
}
}
Tox is version 3.1.2 and runs under Python 3.6 (the docker image is generated from this Dockerfile). What surprises me a little is the "b'/bin/sh'", which comes from calling str on a bytes instance. Could it be that tox is in fact trying to run a program named sh' in the path b'/bin?
The tox.ini in use simply calls pytest:
[tox]
envlist = py36
[testenv]
recreate = True
commands =
pytest
The full backtrace from tox (Jenkins console output) is the following:
py36 create: /var/jenkins_home/workspace/_debug_jenkins-long-shebang-L7UHBNCVPSOBSTKZ7COPFJBJLWR5XZXFIAD7TBGC4WQLVDLZYVQQ/.tox/py36
py36 inst: /var/jenkins_home/workspace/_debug_jenkins-long-shebang-L7UHBNCVPSOBSTKZ7COPFJBJLWR5XZXFIAD7TBGC4WQLVDLZYVQQ/.tox/dist/test_model-0.dev20180717.zip
ERROR: invocation failed (errno 2), args: [b'/bin/sh', '/var/jenkins_home/workspace/_debug_jenkins-long-shebang-L7UHBNCVPSOBSTKZ7COPFJBJLWR5XZXFIAD7TBGC4WQLVDLZYVQQ/.tox/py36/bin/pip', 'install', '/var/jenkins_home/workspace/_debug_jenkins-long-shebang-L7UHBNCVPSOBSTKZ7COPFJBJLWR5XZXFIAD7TBGC4WQLVDLZYVQQ/.tox/dist/test_model-0.dev20180717.zip'], cwd: /var/jenkins_home/workspace/_debug_jenkins-long-shebang-L7UHBNCVPSOBSTKZ7COPFJBJLWR5XZXFIAD7TBGC4WQLVDLZYVQQ
Traceback (most recent call last):
File "/.pyenv/versions/3.6.6/bin/tox", line 11, in <module>
sys.exit(cmdline())
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 39, in cmdline
main(args)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 45, in main
retcode = Session(config).runcommand()
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 422, in runcommand
return self.subcommand_test()
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 620, in subcommand_test
self.installpkg(venv, path)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 561, in installpkg
venv.installpkg(path, action)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/venv.py", line 277, in installpkg
self._install([sdistpath], extraopts=extraopts, action=action)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/venv.py", line 342, in _install
self.run_install_command(packages=packages, options=options, action=action)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/venv.py", line 314, in run_install_command
redirect=self.session.report.verbosity < 2,
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/venv.py", line 427, in _pcall
return action.popen(args, cwd=cwd, env=env, redirect=redirect, ignore_ret=ignore_ret)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 153, in popen
popen = self._popen(args, cwd, env=env, stdout=stdout, stderr=subprocess.STDOUT)
File "/.pyenv/versions/3.6.6/lib/python3.6/site-packages/tox/session.py", line 248, in _popen
env=env,
File "/.pyenv/versions/3.6.6/lib/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/.pyenv/versions/3.6.6/lib/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: "b'/bin/sh'": "b'/bin/sh'"

We currently have an open bug for this. TOX_LIMITED_SHEBANG is broken when tox is running under python3.
The root cause is there's a .decode() missing and a list ends up with mixed bytes and str instances.
Workarounds until I fix it are to use tox with a python2 interpreter or to avoid TOX_LIMITED_SHEBANG.
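The root cause described above can be illustrated with a short sketch. This is not the actual tox patch; the helper name normalize_args and the argument values are hypothetical. It shows how a bytes entry turns into the odd "b'/bin/sh'" text when converted with str, and how decoding every entry gives a uniform list of str.

```python
import os


def normalize_args(args):
    """Decode any bytes entries so the argument list contains only str."""
    return [os.fsdecode(a) for a in args]


# A list with mixed bytes and str entries, like the args in the error message.
mixed = [b"/bin/sh", "/path/to/.tox/py36/bin/pip", "install", "pkg.zip"]

print(str(mixed[0]))          # b'/bin/sh'
print(normalize_args(mixed))  # every entry is now a str
```

os.fsdecode leaves str entries unchanged and decodes bytes with the filesystem encoding, which is why it is used here instead of calling .decode() on each item.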

Python 2.7 sub process

I have a python script that uses a package called flopy. My script generates a series of inputs to a fortran executable. Flopy writes these into text files and then calls the fortran executable, which uses the text files to run a model.
I'm using a Mac (OS X) and I downloaded Python 2.7 from python.org, i.e. I'm not using the Apple system version of Python. The version of Python I'm using is in /Library/Frameworks/Python.framework/
I can run my script if I call it from the Terminal window by typing:
python myscriptname.py
However, if I run my script through IDLE (the version that came with the Python I downloaded), it returns an error:
Traceback (most recent call last):
File "/Users/neilthomas/RotatedModel_v4_Tr_mfnwt.py", line 355, in <module>
success, mfoutput = mf.run_model(silent=False, pause=False)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/flopy/mbase.py", line 638, in run_model
normal_msg=normal_msg)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/flopy/mbase.py", line 1034, in run_model
stdout=sp.PIPE, stderr=sp.STDOUT, cwd=model_ws)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
The file 'mfnwt' absolutely does exist. I'm sure I'm missing something obvious, but is there something I need to do to allow IDLE to run programs/subprocesses via the shell it uses? Thanks.

The problem here is that you have to identify the specific MODFLOW executable file you are calling ('mfnwt' in your case). I do the same with a MODFLOW 2000 file:
mf = flopy.modflow.mf.Modflow(modelname,namefile_ext='nam',version='mf2k',exe_name='/home/MODFLOW-and-related-codes/build-08/bin-windows/mf2k.exe')
In your case, you would do something similar, only replacing the version='mf2k' and exe_name=path to match where you are storing your MODFLOW file.
See the documentation for further details: https://modflowpy.github.io/flopydoc/mf.html
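A minimal sketch of locating the executable before handing it to flopy, so that a missing path fails with a clear message instead of a bare OSError. The helper name resolve_executable and the directory in the usage note are hypothetical. It uses shutil.which, which requires Python 3.3 or later, so it would not run unchanged on the Python 2.7 from the question.

```python
import os
import shutil


def resolve_executable(name, extra_dirs=()):
    """Return an absolute path for `name`, searching PATH plus extra_dirs."""
    dirs = [d for d in [os.environ.get("PATH", ""), *extra_dirs] if d]
    search_path = os.pathsep.join(dirs)
    found = shutil.which(name, path=search_path)
    if found is None:
        raise FileNotFoundError(f"{name!r} not found in: {search_path}")
    return os.path.abspath(found)
```

For example, resolve_executable("mfnwt", extra_dirs=["/path/to/modflow/bin"]) would return an absolute path that can be passed as exe_name, as in the answer above. An absolute path does not depend on the PATH of the process that launches the script, which may differ between Terminal and IDLE.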
