I am new to mrjob and I am having trouble getting a job to run on Amazon EMR. I will describe the problems in order.
I can run an mrjob job on my local machine. However, when I have mrjob.conf at /home/ankit/.mrjob.conf and at /etc/mrjob.conf, the job does not run on my local machine.
Here is what I am getting: https://s3-ap-southeast-1.amazonaws.com/imagna.sample/local.txt
What is MRJOB_CONF in "the location specified by MRJOB_CONF" in the documentation?
What is the use of 'base_tmp_directory'? Also, do I need to upload the input data to S3 before starting the job, or will it be loaded from my local computer when execution starts?
Do I need to do any bootstrapping if I use libraries like numpy, scikit-learn, etc.? If so, how?
This is what I am getting when I execute the command to run a job on EMR: https://s3-ap-southeast-1.amazonaws.com/imagna.sample/emr.txt
Any solutions?
Thanks a lot.
Your URL is invalid (I get an "Access Denied" error).
mrjob.conf is a configuration file. It can live in any of several locations; see http://pythonhosted.org/mrjob/configs-conf.html
You can use input data from your local machine just by specifying the paths to the input files on the command line. MRJob will upload the data to S3 for you. If you specify an s3://... URL, MRJob will use the data at that S3 path.
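For example, assuming a job script named my_job.py (the name is just a placeholder), either of these forms should work:
python my_job.py -r emr input.txt              # local file; mrjob uploads it to S3 for you
python my_job.py -r emr s3://my-bucket/input/  # data already in S3; read from the bucket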
To use non-standard packages, see http://pythonhosted.org/mrjob/writing-and-running.html#custom-python-packages
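As a rough sketch of such a bootstrap, a ~/.mrjob.conf along these lines should install numpy and scikit-learn on the EMR nodes (the exact key depends on your mrjob version; older releases use bootstrap_cmds instead of bootstrap):
runners:
  emr:
    bootstrap:
      - sudo pip install numpy scipy scikit-learn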
Your second URL is also invalid (again, an "Access Denied" error).
I'm trying to execute this jar file, https://github.com/RMLio/rmlmapper-java, from Airflow, but for some reason it fails straight away. I'm using a PythonOperator to execute some Python code, and inside it I have a subprocess call to the java command.
The test command is:
java -jar /root/airflow/dags/rmlmapper-6.0.0-r363-all.jar -v
I'm running Airflow inside a Docker container. The weird thing is that if I execute the exact same command inside the container it works fine.
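For reference, the call inside the PythonOperator is roughly the following sketch (simplified; the real task wiring differs slightly):
import subprocess

def run_rmlmapper():
    # Run the jar with an explicit working directory and capture its output,
    # so the exit code and any stderr end up in the Airflow task log.
    result = subprocess.run(
        ["java", "-jar", "/root/airflow/dags/rmlmapper-6.0.0-r363-all.jar", "-v"],
        cwd="/root/airflow/dags",
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    print(result.returncode, result.stdout, result.stderr)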
I have tried a bit of everything, but the result is always the same: a segfault (exit code 139).
The container's memory looks fine, so it shouldn't be directly related to an OOM issue. I also tried adjusting the default memory in the Docker Compose file, with no success.
My guess is that the Java application tries to load some files that are stored inside the jar, but perhaps Airflow changes the 'user.dir' working directory, so the application cannot find them and fails.
I'm really out of ideas, so any help would be highly appreciated. Thank you.
I am running a Python shell job in AWS Glue, but after running for around 10 minutes it fails with "Internal service error". The logs and error logs give no useful information. Most of the time it fails with just "Internal service error"; rarely, it runs for 2 days and then times out. The code uses pandas for transformations and looks fine: it runs correctly on my local machine, and I made the necessary changes so that it works on AWS Glue (it reads/writes files from an S3 location instead of a local folder). What could be wrong here? Any input is appreciated.
This issue was figured out. The job was unable to download its dependent Python libraries because of an access issue with the S3 bucket. Once the access issue was resolved, the job ran fine.
I was reading through this post, https://nycdatascience.com/blog/student-works/yelp-recommender-part-2/, and followed basically everything they showed. However, even after reading this post, "Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)", when I run
SPARK_HOME/bin/spark-submit read_stream_spark.py --master local[4] --jars spark-sql-kafka-0.10_2.11-2.1.0.jar
I still get the error 'Failed to find data source: kafka'.
I also read through https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html. The official doc asks for two hosts and two ports, while I only use one. Should I specify another host and port besides my cloud server and the Kafka port? Thanks.
Could you please let me know what I am missing? Or should I not have run the script on its own?
"The official doc asks for two hosts and two ports"
That's not related to your error. A minimum of one bootstrap server is required.
You need to move your Python file to the end of the command; otherwise, all of the options you provided are passed as command-line arguments to the Python script rather than to spark-submit, so it runs with the default master and no external jars.
It's also recommended to use --packages rather than --jars, since this ensures that transitive dependencies are included with the submission.
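A corrected invocation might look like the following (the package coordinates assume Spark 2.1 with Scala 2.11; adjust them to your Spark build):
SPARK_HOME/bin/spark-submit \
  --master local[4] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 \
  read_stream_spark.py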
I am trying to create a VM, upload some Python files to GCP, and run the code against my buckets.
I created the VM and SSHed into it, then set up the instance with all the Python libraries I need. Now I'm trying to upload my Python files to GCP so that I can execute the code.
So on my Mac, I did gcloud init and then tried the following:
gcloud compute scp /Users/username/Desktop/LZ_demo_poc/helper_functions.py /home/user_name/lz_text_classification/
However, I keep getting these error messages:
WARNING: `gcloud compute copy-files` is deprecated. Please use `gcloud compute scp` instead. Note that `gcloud compute scp` does not have recursive copy on by default. To turn on recursion, use the `--recurse` flag.
ERROR: (gcloud.compute.copy-files) Source(s) must be remote when destination is local. Got sources: [/Users/username/Desktop/LZ_demo_poc/helper_functions.py], destination: /home/user_name/lz_text_classification/
Can anyone help me with the process of running a Python script on GCP using data that is saved in buckets?
You also need to specify the instance to which you want the files copied; otherwise, the destination is interpreted as a local path, which is what the second line of your error message is telling you. From the gcloud compute scp examples:
Conversely, files from your local computer can be copied to a virtual machine:
$ gcloud compute scp ~/localtest.txt ~/localtest2.txt \
example-instance:~/narnia
In your case it should be something like:
gcloud compute scp /Users/username/Desktop/LZ_demo_poc/helper_functions.py your_instance_name:/home/user_name/lz_text_classification/
I have a Hadoop job packaged in a jar file that I can execute on a server from the command line, storing the results in that server's HDFS.
Now I need to create a web service in Python (Tornado) that must execute the Hadoop job and fetch the results to present them to the user. The web service is hosted on another server.
I googled a lot for ways to call the job from outside the server in a Python script, but unfortunately found no answers.
Anyone have a solution for this?
Thanks
One option could be to install the Hadoop binaries on your web service server, using the same configuration as your Hadoop cluster. You will need that in order to talk to the cluster; you don't need to launch any Hadoop daemons there. At a minimum, configure HADOOP_HOME, HADOOP_CONF_DIR, HADOOP_LIBS, and set the PATH environment variable properly.
You need the binaries because you will use them to submit the job, and the configuration files to tell the Hadoop client where the cluster is (the NameNode and the ResourceManager).
Then, in Python, you can execute the hadoop jar command using subprocess: https://docs.python.org/2/library/subprocess.html
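A minimal sketch (the jar path, driver class, and HDFS paths are placeholders):
import subprocess

# Placeholder jar, driver class, and HDFS paths -- replace with your own.
cmd = ["hadoop", "jar", "/path/to/my-job.jar", "com.example.MyJobDriver",
       "/input/path", "/output/path"]
# check_call raises CalledProcessError if the submission exits with a non-zero code.
subprocess.check_call(cmd)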
You can configure the job to notify your server when it has finished, using a callback: https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
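As a sketch, on Hadoop 2.x the notification URL can be passed as a job property, assuming your driver uses ToolRunner so that -D options are picked up (the web service host and port are placeholders; Hadoop substitutes $jobId and $jobStatus in the URL):
hadoop jar /path/to/my-job.jar com.example.MyJobDriver \
  -D mapreduce.job.end-notification.url='http://your-webservice-host:8888/notify?jobId=$jobId&status=$jobStatus' \
  /input/path /output/path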
Finally, you can read the results from HDFS using WebHDFS (the HDFS REST API) or a Python HDFS package such as https://pypi.python.org/pypi/hdfs/
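For instance, with that hdfs package something like this reads one of the job's output files over WebHDFS (the NameNode host, port, user, and path are placeholders):
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host/port/user are placeholders).
client = InsecureClient('http://namenode-host:50070', user='hadoop-user')

# Read one of the part files written by the job.
with client.read('/output/path/part-r-00000', encoding='utf-8') as reader:
    content = reader.read()
print(content)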