I'm new to MapReduce and I'm trying to run a MapReduce job using the mrjob package for Python. However, I encountered this error:
ERROR:mrjob.launch:Step 1 of 1 failed: Command '['/usr/bin/hadoop', 'jar', '/usr/lib/hadoop-mapreduce/hadoop-streaming.jar', '-files',
'hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/mrjob.zip#mrjob.zip,
hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/setup-wrapper.sh#setup-wrapper.sh,
hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/word_count.py#word_count.py', '-archives',
'hdfs:///user/hadoop/tmp/mrjob/word_count.hadoop.20180831.035452.437014/files/word_count_ccmr.tar.gz#word_count_ccmr.tar.gz', '-D',
'mapreduce.job.maps=4', '-D', 'mapreduce.job.reduces=4', '-D', 'mapreduce.map.java.opts=-Xmx1024m', '-D', 'mapreduce.map.memory.mb=1200', '-D',
'mapreduce.output.fileoutputformat.compress=true', '-D', 'mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec', '-D',
'mapreduce.reduce.java.opts=-Xmx1024m', '-D', 'mapreduce.reduce.memory.mb=1200', '-input', 'hdfs:///user/hadoop/test-1.warc', '-output',
'hdfs:///user/hadoop/gg', '-mapper', 'sh -ex setup-wrapper.sh python word_count.py --step-num=0 --mapper', '-combiner',
'sh -ex setup-wrapper.sh python word_count.py --step-num=0 --combiner', '-reducer', 'sh -ex setup-wrapper.sh python word_count.py --step-num=0 --reducer']'
returned non-zero exit status 256
I've tried running it locally with python ./word_count.py input/test-1.warc > output and it's successful.
I'm using
python 2.7.14
Hadoop 2.8.3-amzn-1
pip 18.0
mrjob 0.6.4
Any ideas? Thanks!
This is the command I use to run the MapReduce job. I got it from the cc-mrjob repository. The file is called run_hadoop.sh and I made it executable with chmod +x run_hadoop.sh.
#!/bin/sh
JOB="$1"
INPUT="$2"
OUTPUT="$3"
sudo chmod +x $JOB.py
if [ -z "$JOB" ] || [ -z "$INPUT" ] || [ -z "$OUTPUT" ]; then
echo "Usage: $0 <job> <input> <outputdir>"
echo " Run a CommonCrawl mrjob on Hadoop"
echo
echo "Arguments:"
echo " <job> CCJob implementation"
echo " <input> input path"
echo " <output> output path (must not exist)"
echo
echo "Example:"
echo " $0 word_count input/test-1.warc hdfs:///.../output/"
echo
echo "Note: don't forget to adapt the number of maps/reduces and the memory requirements"
exit 1
fi
# strip .py from job name
JOB=${JOB%.py}
# wrap Python files for deployment, cf. below option --setup,
# see for details
# http://pythonhosted.org/mrjob/guides/setup-cookbook.html#putting-your-source-tree-in-pythonpath
tar cvfz ${JOB}_ccmr.tar.gz *.py
# number of maps resp. reduces
NUM_MAPS=4
NUM_REDUCES=4
if [ -n "$S3_LOCAL_TEMP_DIR" ]; then
S3_LOCAL_TEMP_DIR="--s3_local_temp_dir=$S3_LOCAL_TEMP_DIR"
else
S3_LOCAL_TEMP_DIR=""
fi
python $JOB.py \
-r hadoop \
--jobconf "mapreduce.map.memory.mb=1200" \
--jobconf "mapreduce.map.java.opts=-Xmx1024m" \
--jobconf "mapreduce.reduce.memory.mb=1200" \
--jobconf "mapreduce.reduce.java.opts=-Xmx1024m" \
--jobconf "mapreduce.output.fileoutputformat.compress=true" \
--jobconf "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec" \
--jobconf "mapreduce.job.reduces=$NUM_REDUCES" \
--jobconf "mapreduce.job.maps=$NUM_MAPS" \
--setup 'export PYTHONPATH=$PYTHONPATH:'${JOB}'_ccmr.tar.gz#/' \
--no-output \
--cleanup NONE \
$S3_LOCAL_TEMP_DIR \
--output-dir "$OUTPUT" \
"hdfs:///user/hadoop/$INPUT"
and I run it with ./run_hadoop.sh word_count test-1.warc output
where
word_count is the job (file called word_count.py)
test-1.warc is the input (located in hdfs:///user/hadoop/test-1.warc)
and output is the output dir (located in hdfs:///user/hadoop/output). I also make sure to use a different output directory for each job, to avoid clashing with an existing folder.
* Update *
I took a look at the syslog in the Hue interface, and there's this error:
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Could not deallocate container for task attemptId attempt_1536113332062_0001_r_000003_0
Is this related to the error I'm getting?
I also got this in the stderr of one of the map attempts:
/bin/sh: run_prestart: line 1: syntax error: unexpected end of file
and
No module named boto3
However, I installed boto3 using pip install boto3 on my EMR cluster. Is the module not available to Hadoop?
I got it working by following this blog
http://benjamincongdon.me/blog/2018/02/02/MapReduce-on-Python-is-better-with-MRJob-and-EMR/
Essentially, you have to include a .conf file for the Hadoop runner, e.g. mrjob.conf. Inside that file, use this:
runners:
  hadoop:
    setup:
      - 'set -e'
      - VENV=/tmp/$mapreduce_job_id
      - if [ ! -e $VENV ]; then virtualenv $VENV; fi
      - . $VENV/bin/activate
      - 'pip install boto3'
      - 'pip install warc'
      - 'pip install https://github.com/commoncrawl/gzipstream/archive/master.zip'
    sh_bin: '/bin/bash -x'
and use the conf file by referencing it in run_hadoop.sh with --conf-path (the new line in the command below):
python $JOB.py \
--conf-path mrjob.conf \
-r hadoop \
--jobconf "mapreduce.map.memory.mb=1200" \
--jobconf "mapreduce.map.java.opts=-Xmx1024m" \
--jobconf "mapreduce.reduce.memory.mb=1200" \
--jobconf "mapreduce.reduce.java.opts=-Xmx1024m" \
--jobconf "mapreduce.output.fileoutputformat.compress=true" \
--jobconf "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec" \
--jobconf "mapreduce.job.reduces=$NUM_REDUCES" \
--jobconf "mapreduce.job.maps=$NUM_MAPS" \
--setup 'export PYTHONPATH=$PYTHONPATH:'${JOB}'_ccmr.tar.gz#/' \
--cleanup NONE \
$S3_LOCAL_TEMP_DIR \
--output-dir "hdfs:///user/hadoop/$OUTPUT" \
"hdfs:///user/hadoop/$INPUT"
now if you call ./run_hadoop.sh word_count input/test-1.warc output, it should work!
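If you want to double-check that the job actually produced output, a quick sanity check looks something like this (a sketch only; the output path is whatever you passed as the third argument, and the part files end up BZip2-compressed because of the mapreduce.output.fileoutputformat.compress* settings above):
# List the part files and peek at a few decompressed records (illustrative paths).
hdfs dfs -ls hdfs:///user/hadoop/output
hdfs dfs -cat 'hdfs:///user/hadoop/output/part-*' | bzip2 -dc | head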
Related
I installed the latest version of Docker, installed WSL 2 according to the manual, and started the container with docker-compose up. I need to run the tests with the command tests/run_tests.sh. But a few seconds after launching, the window with the test closes, my container disappears from Docker, and when I try to run docker-compose up again, I get an error Error response from daemon: open \\.\pipe\docker_engine_linux: The system cannot find the file specified.
run_tests:
#!/usr/bin/env sh
# To run locally, execute the command NOT in container:
# bash tests/run_tests.sh
set -x
if [ -z "$API_ENV" ]; then
API_ENV=test
fi
if [ "$API_ENV" = "bitbucket_test" ]; then
COMPOSE_FILE="-f docker-compose.test.yml"
fi
docker-compose build connectors
API_ENV=$API_ENV docker-compose ${COMPOSE_FILE} up -d --force-recreate
connectors_container=$(docker ps -f name=connectors -q | tail -n1)
if [ "$API_ENV" = "bitbucket_test" ]; then
mkdir -p artifacts && docker logs --follow ${connectors_container} > ./artifacts/docker_connectors_logs.txt 2>&1 &
pytest_n_processes=100
else
pytest_n_processes=25
fi
# Timeout for the tests. In bitbucket we want to stop the tests a bit before the max time, so that
# artifacts are created and logs can be inspected
timeout_cmd="timeout 3.5m"
if [ "$API_ENV" = "bitbucket_test" ] || [ "$API_ENV" = "test" ]; then
export PYTEST_SENTRY_DSN='http://d07ba0bfff4b41888e311f8398321d14#sentry.windsor.ai/4'
export PYTEST_SENTRY_ALWAYS_REPORT=1
fi
git fetch origin "+refs/heads/master:refs/remotes/origin/master"
# Lint all the files that are modified in this branch
$(dirname "$0")/run_linters.sh &
linting_pid=$!
# bitbucket pipelines have 8 workers, use 6 for tests
#
# WARNING: Tests require gunicorn, which is enabled when containers are started with: API_ENV=test docker-compose up -d --force-recreate
# Tests are run in parallel, and the cache-locking in threaded Flask doesn't work in this case
${timeout_cmd} docker exec ${connectors_container} bash -c \
"PYTEST_SENTRY_DSN=$PYTEST_SENTRY_DSN \
PYTEST_SENTRY_ALWAYS_REPORT=$PYTEST_SENTRY_ALWAYS_REPORT \
pytest \
--cov=connectors --cov=api --cov=base \
--cov-branch --cov-report term-missing --cov-fail-under=71.60 \
--timeout 60 \
-v \
--durations=50 \
-n $pytest_n_processes \
tests || ( \
code=$? `# store the exit code to exit with it` \
&& echo 'TESTS FAILED' \
&& mkdir -p ./artifacts \
&& docker logs ${connectors_container} > ./artifacts/docker_connectors_failure_logs.txt 2>&1 `# Ensure that the logs are complete` \
) "&
# Get the tests pid
tests_pid=$!
# wait for linting to finish
wait $linting_pid
linting_code=$?
echo "Linting code: ${linting_code}"
if [ $linting_code -ne 0 ]; then
echo 'Linting failed'
# kill running jobs on exit in local ubuntu. Some tests were left running by only killing the test_pid.
kill "$(jobs -p)"
# kills the test process explicitly in gitlab pipelines. Was needed because jobs returns empty in gitlab pipelines.
kill $tests_pid
exit 1
fi
# wait for tests to finish
wait $tests_pid
testing_code=$?
echo "Testing code: ${testing_code}"
if [ $testing_code -ne 0 ]; then
echo 'Tests failed'
exit 1
else
echo 'Tests and linting passed'
exit 0
fi
I'm running on an SGE cluster. I had no issues until recently, when the cluster admins told me I need to start using scratch space: "The jobs were overloading the bossock fibres on the storage and the IO profile of the job was such that our storage had trouble serving the requests. We will want the output files and possibly the input files to be read and written by snakemake in scratch space, and then you will need to copy anything you want back and delete the files in scratch."
I've been submitting snakemake with the following bash script.
#!/bin/bash
#Submit to the cluster, give it a unique name
#$ -S /bin/bash
#$ -V
#$ -l h_vmem=1.9G,h_rt=20:00:00,tmem=1.9G
#$ -l tscratch=100G
#$ -pe smp 2
# join stdout and stderr output
#$ -j y
#$ -R y
if [ "$1" != "" ]; then
RUN_NAME=$1
else
RUN_NAME=$""
fi
#setup scratch space
scratch=/scratch0/annbrown/$JOB_ID
#FOLDER=submissions/$(date +"%Y%m%d%H%M")
FOLDER=${scratch}/submissions/$(date +"%Y%m%d%H%M")
mkdir -p $FOLDER
# make a symlink of my snakemake pipeline in the scratch space
ln -s /home/annbrown/pipelines/rna_seq_snakemake $FOLDER
cd $FOLDER/rna_seq_snakemake
cp config/config.yaml $FOLDER/${RUN_NAME}config.yaml
snakemake -s rna_seq.snakefile \
--jobscript cluster_qsub.sh \
--cluster-config config/cluster.yaml \
--cluster-sync "qsub -R y -l h_vmem={cluster.h_vmem},h_rt={cluster.h_rt} -pe {cluster.pe} -o $FOLDER" \
-j 25 \
--nolock \
--rerun-incomplete \
--latency-wait 100
#copy work out of scratch -you may need to change the destination
cp -r $FOLDER ~/annbrown
#delete scratch once you have finished (very important)
rm -rf $scratch
But whenever snakemake starts running, it's still writing its submission temporary files to the wrong spot:
/SAN/vyplab/alb_projects/pipelines/rna_seq_snakemake/.snakemake/tmp.gkxzqok7/snakejob.run_star_pe.139.sh
#$ -S /bin/bash
#$ -cwd
#$ -V
#$ -l h_vmem=4G,h_rt=6:00:00,tmem=4G
# join stdout and stderr output
#$ -j y
#$ -sync y
#$ -R y
cd /SAN/vyplab/alb_projects/pipelines/rna_seq_snakemake && \
/share/apps/python-3.7.2-shared/bin/python3.7 \
-m snakemake /SAN/vyplab/alb_projects/data/muscle/analysis/STAR_aligned/12_9_5_18.Aligned.out.bam --snakefile /SAN/vyplab/alb_projects/pipelines/rna_seq_snakemake/rna_seq.snakefile \
--force -j --keep-target-files --keep-remote \
--wait-for-files /SAN/vyplab/alb_projects/pipelines/rna_seq_snakemake/.snakemake/tmp.gkxzqok7 /SAN/vyplab/vyplab_reference_genomes/STAR/human/GRCh38/star_indices_overhang150/SA /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/12_9_5_18_1.merged.fastq.gz /SAN/vyplab/alb_projects/data/muscle/analysis/merged_fastqs/12_9_5_18_2.merged.fastq.gz --latency-wait 100 \
--attempt 1 --force-use-threads \
--wrapper-prefix https://bitbucket.org/snakemake/snakemake-wrappers/raw/ \
--allowed-rules run_star_pe --nocolor --notemp --no-hooks --nolock \
--mode 2
Has anyone gotten snakemake to run in temporary scratch spaces? I'm not really sure if I'm meant to be using the shadow directive or what, to be honest.
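For context, the shadow mechanism mentioned above looks roughly like this (a sketch, not verified against this pipeline; it assumes the relevant rules carry a shadow: "full" directive in the Snakefile, and as far as I can tell it relocates rule execution into scratch but not the .snakemake jobscript directory itself):
# Hypothetical invocation: --shadow-prefix tells Snakemake to create its per-rule
# shadow (working) directories under the given path instead of inside .snakemake
# in the pipeline directory. Rules still need shadow: "full" (or "minimal") for
# their work to actually happen inside that directory.
snakemake -s rna_seq.snakefile \
    --shadow-prefix "$scratch" \
    --jobscript cluster_qsub.sh \
    --cluster-config config/cluster.yaml \
    --cluster-sync "qsub -R y -l h_vmem={cluster.h_vmem},h_rt={cluster.h_rt} -pe {cluster.pe} -o $FOLDER" \
    -j 25 --nolock --rerun-incomplete --latency-wait 100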
I'm trying to call make to compile my code but I keep getting this error:
C:\Users\lovel\Anaconda3\S4>make
mkdir -p build
mkdir -p build / S4k
The syntax of the command is incorrect.
make: *** [objdir] Error 1
Here's the relevant part of my makefile:
objdir:
	mkdir -p $(OBJDIR)
	mkdir -p $(OBJDIR) / S4k
	mkdir -p $(OBJDIR) / modules
S4_LIBOBJS = \
$(OBJDIR)/S4k/S4.o \
$(OBJDIR)/S4k/rcwa.o \
$(OBJDIR)/S4k/fmm_common.o \
$(OBJDIR)/S4k/fmm_FFT.o \
$(OBJDIR)/S4k/fmm_kottke.o \
$(OBJDIR)/S4k/fmm_closed.o \
$(OBJDIR)/S4k/fmm_PolBasisNV.o \
$(OBJDIR)/S4k/fmm_PolBasisVL.o \
$(OBJDIR)/S4k/fmm_PolBasisJones.o \
$(OBJDIR)/S4k/fmm_experimental.o \
$(OBJDIR)/S4k/fft_iface.o \
$(OBJDIR)/S4k/pattern.o \
$(OBJDIR)/S4k/intersection.o \
$(OBJDIR)/S4k/predicates.o \
$(OBJDIR)/S4k/numalloc.o \
$(OBJDIR)/S4k/gsel.o \
$(OBJDIR)/S4k/sort.o \
$(OBJDIR)/S4k/kiss_fft.o \
$(OBJDIR)/S4k/kiss_fftnd.o \
$(OBJDIR)/S4k/SpectrumSampler.o \
$(OBJDIR)/S4k/cubature.o \
$(OBJDIR)/S4k/Interpolator.o \
$(OBJDIR)/S4k/convert.o
I'm working on Windows. I changed the '/' to '\' and it didn't work; I also tried adding '\' at the end, which didn't work either.
You must remove the spacing around the slashes, or argument parsing will believe there is more than one argument (and you can also skip creating $(OBJDIR) on its own, because the -p option creates all non-existing directories up to the target dir).
The Unix/Linux-compliant version should be:
objdir:
	mkdir -p $(OBJDIR)/S4k
	mkdir -p $(OBJDIR)/modules
Note that when using the Windows mkdir command, the -p option should be dropped (the Windows version behaves that way by default, and the option isn't recognized). Given the message you're getting, you're probably running the Windows version, so it should be:
objdir:
	mkdir $(OBJDIR)\S4k
	mkdir $(OBJDIR)\modules
(Slashes are not accepted by the Windows mkdir command, so $(OBJDIR) should be built with backslashes as well.)
Slashes are used for command switches in basic Windows commands like mkdir, so otherwise you have to quote the paths:
objdir:
	mkdir "$(OBJDIR)/S4k"
	mkdir "$(OBJDIR)/modules"
(As you can see, it's rather difficult to have a makefile that is portable from Linux to Windows, unless you're running it within an MSYS shell where mkdir is the MSYS one. Keep in mind that there are native & MSYS versions of the make command too; I got caught once: How to force make to use bash as a shell on Windows/MSYS2.)
I am working on a project which involves the wapiti and nikto web tools. I have managed to produce one report for both of these tools with this command:
python wapiti.py www.kca.ac.ke ;perl nikto.pl -h www.kca.ac.ke -Display V -F htm -output /root/.wapiti/generated_report/index.html.
But I would like to run a command like
python wapiti.py www.kca.ac.ke
and get both the wapiti and nikto web scan reports. How do I achieve this, guys?
A shell script would work. Save the following as 'run_wapiti_and_nikto_scans', then run it as:
bash run_wapiti_and_nikto_scans www.my.site.com
Here is the script:
#!/bin/bash
SITE=$1
if [ -n "$SITE" ]; then # -n tests to see if the argument is non empty
echo "Looking to scan $SITE"
echo "Running 'python wapiti.py $SITE'"
python wapiti.py $SITE || { echo "Failed to run wapiti!"; exit 1; }
echo "Running 'perl nikto.pl -h $SITE -Display V -F htm -output /root/.wapiti/generated_report/index.html'"
perl nikto.pl -h $SITE -Display V -F htm -output /root/.wapiti/generated_report/index.html || { echo "Failed to run nikto!"; exit 1; }
echo "Done!"
exit 0; # Success
fi
echo "usage: run_wapiti_and_nikto_scans www.my.site.com";
exit 1; # Failure
I have some questions about the webiopi init script (/etc/init.d/webiopi), shown below with my questions inline.
LOG_FILE=/var/log/webiopi
CONFIG_FILE=/etc/webiopi/config
PATH=/sbin:/usr/sbin:/bin:/usr/bin
DESC="WebIOPi"
NAME=webiopi
HOME=/usr/share/webiopi/htdocs
DAEMON=/usr/bin/python
DAEMON_ARGS="-m webiopi -l $LOG_FILE -c $CONFIG_FILE"
PIDFILE=/var/run/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
# Exit if the package is not installed
[ -x "$DAEMON" ] || exit 0
1) The [ -x "$DAEMON" ] only checks whether python is installed, but it doesn't check for the webiopi package, does it?
2) Does python -m run the whole package, not just an individual module?
# Read configuration variable file if it is present
[ -r /etc/default/$NAME ] && . /etc/default/$NAME
3) How do the values from the configuration file /etc/webiopi/config end up in /etc/default/webiopi? From the above, I don't see the command that does that.
#
# Function that starts the daemon/service
#
do_start()
{
# Return
# 0 if daemon has been started
# 1 if daemon was already running
# 2 if daemon could not be started
start-stop-daemon --start --quiet --chdir $HOME --pidfile $PIDFILE --exec $DAEMON --test > /dev/null \
|| return 1
4) Does the above start only the python process and not webiopi? What's the point of the --test run? It doesn't explicitly return 0, does it?
start-stop-daemon --start --quiet --chdir $HOME --pidfile $PIDFILE --exec $DAEMON --background --make-pidfile -- \
$DAEMON_ARGS \
|| return 2
5) Does the above start webiopi by running python -m webiopi with its arguments in the background?
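For what it's worth, with the variables defined at the top of the script substituted in, that second start-stop-daemon call boils down to roughly the following (a sketch only; start-stop-daemon additionally writes the pidfile at /var/run/webiopi.pid and properly detaches the process):
# Rough manual equivalent, for illustration only (values taken from LOG_FILE,
# CONFIG_FILE, HOME, DAEMON and DAEMON_ARGS above); the real script lets
# start-stop-daemon manage backgrounding and the pidfile.
cd /usr/share/webiopi/htdocs
/usr/bin/python -m webiopi -l /var/log/webiopi -c /etc/webiopi/config &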