File not caching on AWS Elastic MapReduce - python

I'm running the following MapReduce on AWS Elastic MapReduce:
./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE \
  --mapper s3://classify.mysite.com/mapper.py \
  --reducer s3://classify.mysite.com/reducer.py \
  --input s3n://classify.mysite.com/s3_list.txt \
  --output s3://classify.mysite.com/dat_output4/ \
  --cache s3n://classify.mysite.com/classifier.py#classifier.py \
  --cache-archive s3n://classify.mysite.com/policies.tar.gz#policies \
  --bootstrap-action s3://classify.mysite.com/bootstrap.sh \
  --enable-debugging \
  --master-instance-type m1.large --slave-instance-type m1.large --instance-type m1.large
For some reason the cacheFile classifier.py does not seem to be cached. I get this error when reducer.py tries to import it:
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
from classifier import text_from_html, train_classifiers
ImportError: No module named classifier
classifier.py is most definitely present at s3n://classify.mysite.com/classifier.py. For what it's worth, the policies archive seems to load in just fine.

I don't know how to fix this problem in EC2, but I've seen it before with Python in traditional Hadoop deployments. Hopefully the lesson translates over.
What we need to do is add the directory reducer.py is in to the Python path, because presumably classifier.py is in there too. For whatever reason, this directory is not on the Python path, so the import fails to find classifier.
import sys
import os.path

# add the directory where reducer.py is to the Python path
# __file__ is the path to reducer.py, including the file name;
# dirname strips the file name and leaves only the directory.
# sys.path is the list of places Python looks for modules.
sys.path.append(os.path.dirname(__file__))

from classifier import text_from_html, train_classifiers
The reason your code might work locally is the current working directory you run it from. Hadoop might not be running it from the same working directory.

orangeoctopus deserves credit for this from his comment. I had to append the working directory to the Python path:
sys.path.append('./')
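Putting both pieces together, the top of reducer.py ends up looking roughly like this (just a sketch combining the snippets above, nothing new beyond them):
import sys
import os.path

# Cover both cases: the task's working directory (where Hadoop places
# cached files) and the directory containing reducer.py itself.
sys.path.append('./')
sys.path.append(os.path.dirname(__file__))

from classifier import text_from_html, train_classifiers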
Also, I recommend that anyone with issues similar to mine read this great article on using the Distributed Cache on AWS:
https://forums.aws.amazon.com/message.jspa?messageID=152538

Related

ROS universal directory definition

I am working with executable files that are to be included in the main node file. The problem is that, although I can define the path on my PC and open a text file or execute the file, the path is not universal for my other team members. They have to change the code after pulling from git.
I have the following commands:
os.chdir( "/home/user/epsilon/epsilon_catkin_ws/src/knowledge_source/sense_manager/pddl_files/")
os.system("./ff -p /home/user/epsilon/epsilon_catkin_ws/src/knowledge_source/sense_manager/pddl_files/ -o domain.pddl -f problem.pddl >
solution_path = "/home/user/epsilon/epsilon_catkin_ws/src/knowledge_source/sense_manager/pddl_files/solution.txt" solution_detail.txt")
Since the paths are unique to my laptop, everyone would have to change them. Is there a way to make the path definition universal? (The commands are part of the service call that is present in that node.) I am using rosrun to run the node.
I tried the following pattern but it does not work:
os.chdir( "$(find knowledge_source)/sense_manager/pddl_files/")
Do I need to do something extra to make this work?
knowledge_source is the name of the package
Any recommendations?
I think you are looking for a combination of rospkg and os to get the path to a file inside a ROS package.
import rospkg
import os
conf_file_path = os.path.join(rospkg.RosPack().get_path('your_ros_pkg'), 'your_folder_in_the_package', 'your_file_name')
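Applied to the question, a minimal sketch (reusing the knowledge_source package name and the ff invocation from the question) might look like this:
import os
import rospkg

# Resolve the package path at runtime instead of hard-coding /home/user/...
pddl_dir = os.path.join(rospkg.RosPack().get_path('knowledge_source'),
                        'sense_manager', 'pddl_files')

os.chdir(pddl_dir)
os.system("./ff -p {} -o domain.pddl -f problem.pddl > solution_detail.txt".format(pddl_dir))
solution_path = os.path.join(pddl_dir, 'solution.txt')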

Using dlopen to load one .so in Python says it can't find another in the same directory

I connected yesterday over SSH to another computer and tried to load, through Python, a .so file (compiled C). Here is what I got in the CLI:
The file being requested next to "OSError:" (libLMR_Demodulator.so) is in the same dir as the file I want to load (libDemodulatorJNI_lmr.so).
The Python code (v3.5.2) is the following:
import ctypes
sh_obj = ctypes.cdll.LoadLibrary('./libLMR_Demodulator.so')
actual_start_frequency = sh_obj.getActualStartFrequency(ctypes.c_long(0))
print('The Current Actual Frequency Is: ' + str(actual_start_frequency))
@Charles Duffy is right. The issue comes from missing dependencies. You can verify this with the command:
ldd libLMR_Demodulator.so
You have several ways to fix this issue:
Put all the libs in /lib or /usr/lib, or install them directly on your system.
Add the libs' path to the /etc/ld.so.conf file, then run ldconfig to refresh the cache.
Use LD_LIBRARY_PATH to add the libs' path, then try to run your script:
LD_LIBRARY_PATH=[..path] python [script.py]
or
export LD_LIBRARY_PATH=[..path]
python [script.py]
You can check the dlopen manual for more details.
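If changing the environment isn't an option, another Python-level workaround (not one of the three above, just a sketch with placeholder library names) is to preload the dependency explicitly before loading the library you actually want:
import os
import ctypes

lib_dir = os.path.dirname(os.path.abspath(__file__))

# Preload the dependency by absolute path and export its symbols globally,
# so the subsequent dlopen() of the main library can resolve against it.
# 'libdependency.so' and 'libmain.so' are placeholder names.
ctypes.CDLL(os.path.join(lib_dir, 'libdependency.so'), mode=ctypes.RTLD_GLOBAL)
main_lib = ctypes.cdll.LoadLibrary(os.path.join(lib_dir, 'libmain.so'))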
I got here looking for how to ensure that a module / package with a .so file was able to load another .so file that it depends upon -- changing the current directory to the location of the first .so file (i.e., in the directory where the module is) seems to work for me:
import os,sys,inspect
cwd = os.getcwd()
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
os.chdir(currentdir)
import _myotherlib
os.chdir(cwd) # go back
might also work for the OP case?

Can't run binary from within python aws lambda function

I am trying to run this tool within a lambda function: https://github.com/nicolas-f/7DTD-leaflet
The tool depends on Pillow, which depends on imaging libraries that are not available in the AWS Lambda container. To try to get around this I ran pyinstaller to create a binary that I can hopefully execute. This file is named map_reader and sits at the top level of the lambda zip package.
Below is the code I am using to try and run the tool:
command = 'chmod 755 map_reader'
args = shlex.split(command)
print subprocess.Popen(args)
command = './map_reader -g "{}" -t "{}"'.format('/tmp/mapFiles', '/tmp/tiles')
args = shlex.split(command)
print subprocess.Popen(args)
And here is the error, which occurs on the second subprocess.Popen call:
<subprocess.Popen object at 0x7f08fa100d10>
[Errno 13] Permission denied: OSError
How can I run this correctly?
You may have been misled about what the issue actually is.
I don't think that the first Popen ran successfully. I think that it just dumped a message in standard error and you're not seeing it. It's probably saying that
chmod: map_reader: No such file or directory
I suggest you try either of these two options:
Extract the map_reader from the package into /tmp. Then reference it with /tmp/map_reader.
Do it as recommended by Tim Wagner, General Manager of AWS Lambda, who said the following in the article Running Arbitrary Executables in AWS Lambda:
Including your own executables is easy; just package them in the ZIP file you upload, and then reference them (including the relative path within the ZIP file you created) when you call them from Node.js or from other processes that you’ve previously started. Ensure that you include the following at the start of your function code:
process.env['PATH'] = process.env['PATH'] + ':' + process.env['LAMBDA_TASK_ROOT']
The above code is for Node.js; for Python, it's like the following:
import os
os.environ['PATH'] = os.environ['PATH'] + ':' + os.environ['LAMBDA_TASK_ROOT']
That should make command = './map_reader <arguments>' work.
If they still don't work, you may also consider running chmod 755 map_reader before creating the package and uploading it (as suggested in this other question).
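Putting the first suggestion into the question's code, a rough sketch (the copy to /tmp is the part added here; paths and arguments are the ones from the question):
import os
import shlex
import shutil
import subprocess

task_root = os.environ.get('LAMBDA_TASK_ROOT', os.path.dirname(os.path.abspath(__file__)))

# Copy the packaged binary to /tmp, the only writable location, and make it executable.
shutil.copyfile(os.path.join(task_root, 'map_reader'), '/tmp/map_reader')
os.chmod('/tmp/map_reader', 0o755)

command = '/tmp/map_reader -g "{}" -t "{}"'.format('/tmp/mapFiles', '/tmp/tiles')
print(subprocess.Popen(shlex.split(command)))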
I know I'm a bit late for this, but if you want a more generic way of doing this (for instance if you have a lot of binaries and might not use them all), this is how I do it, provided you put all your binaries in a bin folder next to your py file, and all the libraries in a lib folder:
import shutil
import time
import os
import subprocess
LAMBDA_TASK_ROOT = os.environ.get('LAMBDA_TASK_ROOT', os.path.dirname(os.path.abspath(__file__)))
CURR_BIN_DIR = os.path.join(LAMBDA_TASK_ROOT, 'bin')
LIB_DIR = os.path.join(LAMBDA_TASK_ROOT, 'lib')
### In order to get permissions right, we have to copy them to /tmp
BIN_DIR = '/tmp/bin'
# This is necessary as we don't have permissions in /var/tasks/bin where the lambda function is running
def _init_bin(executable_name):
    start = time.clock()
    if not os.path.exists(BIN_DIR):
        print("Creating bin folder")
        os.makedirs(BIN_DIR)
    print("Copying binaries for "+executable_name+" in /tmp/bin")
    currfile = os.path.join(CURR_BIN_DIR, executable_name)
    newfile = os.path.join(BIN_DIR, executable_name)
    shutil.copy2(currfile, newfile)
    print("Giving new binaries permissions for lambda")
    os.chmod(newfile, 0775)
    elapsed = (time.clock() - start)
    print(executable_name+" ready in "+str(elapsed)+'s.')
# then if you're going to call a binary in a cmd, for instance pdftotext :
_init_bin('pdftotext')
cmdline = [os.path.join(BIN_DIR, 'pdftotext'), '-nopgbrk', '/tmp/test.pdf']
subprocess.check_call(cmdline, shell=False, stderr=subprocess.STDOUT)
There were two issues here. First, as per Jeshan's answer, I had to move the binary to /tmp before I could properly access it.
The other issue was that I'd run pyinstaller on Ubuntu, creating a single file. I saw comments elsewhere about making sure to compile on the same architecture that the Lambda container runs on. Therefore I ran pyinstaller on EC2, based on the Amazon Linux AMI. The output was multiple .so files, which, when moved to /tmp, worked as expected.
copyfile('/var/task/yourbinary', '/tmp/yourbinary')
os.chmod('/tmp/yourbinary', 0555)
Moving the binary to /tmp and making it executable worked for me
There is no need to copy the files to /tmp. You can just use ld-linux to execute any file, including those not marked executable.
So, for running a non-executable on AWS Lambda, you use the following command:
/lib64/ld-linux-x86-64.so.2 /opt/map_reader
P.S. It would make more sense to add the map_reader binary or any other static files to a Lambda Layer, hence the /opt folder.
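From Python, that call could look roughly like this (a sketch; /opt/map_reader assumes the binary ships in a Lambda Layer as described above, and the -g/-t arguments are the ones from the question):
import subprocess

# Run the non-executable binary through the dynamic loader instead of chmod-ing it.
subprocess.check_call(['/lib64/ld-linux-x86-64.so.2', '/opt/map_reader',
                       '-g', '/tmp/mapFiles', '-t', '/tmp/tiles'])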
Like the docs mention for Node.js, you need to update the $PATH, else you'll get command not found when trying to run the executables you added at the root of your Lambda package. In Node.js, that's:
process.env['PATH'] = process.env['PATH'] + ':' + process.env['LAMBDA_TASK_ROOT']
Now, the same thing in Python:
import os
# Make the path stored in $LAMBDA_TASK_ROOT available to $PATH, so that we
# can run the executables we added at the root of our package.
os.environ["PATH"] += os.pathsep + os.environ['LAMBDA_TASK_ROOT']
Tested OK with Python 3.8.
(As a bonus, here are some more env variables used by Lambda.)

How to import local Python package in Amazon Elastic MapReduce (EMR)?

I have two Python scripts that are intended to run on Amazon Elastic MapReduce - one as a mapper and one as a reducer. I've just recently expanded the mapper script to require a couple more local models I've created, both of which live in a package called SentimentAnalysis. What's the right way to have a Python script import from a local Python package on S3? I tried creating S3 keys that mimic my file system in the hope that the relative paths would work, but alas they didn't. Here's what I see in the log files on S3 after the step failed:
Traceback (most recent call last):
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201407250000_0001/attempt_201407250000_0001_m_000000_0/work/./sa_mapper.py", line 15, in <module>
from SentimentAnalysis import NB, LR
ImportError: No module named SentimentAnalysis
The relevant file structure is like this:
sa_mapper.py
sa_reducer.py
SentimentAnalysis/NB.py
SentimentAnalysis/LR.py
And sa_mapper.py has:
from SentimentAnalysis import NB, LR
I tried to mirror the file structure in S3, but that doesn't seem to work.
What's the best way to setup S3 or EMR so that sa_mapper.py can import NB.py and LR.py? Is there some special trick to doing this?
Do you have an
__init__.py
in the SentimentAnalysis folder?
What is the command that you are running?
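For reference, the layout Hadoop would need to see for the import to work is roughly (the __init__.py, which may be empty, is the addition):
sa_mapper.py
sa_reducer.py
SentimentAnalysis/__init__.py
SentimentAnalysis/NB.py
SentimentAnalysis/LR.py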
The only way to do it is to pass additional fields when you create the step. For example, if you are using the boto package to run a task on EMR, you have the StreamingStep class.
In it (if you use version 2.43) you have the parameters:
cache_files (list(str)) – A list of cache files to be bundled with the job
cache_archives (list(str)) – A list of jar archives to be bundled with the job
meaning that you need to pass the paths of the files or folders that you want copied from S3 onto your cluster.
The syntax is:
s3://{s3 bucket path}/EMR_Config.py#EMR_Config.py
The hash (#) is the separator: the part before it is the location in S3, and the part after it is the name the file will have on the cluster. The file will be placed in the same directory as the task you are running.
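For example, a minimal sketch with boto (the {s3 bucket path} placeholder and the EMR_Config.py name are taken from this answer; the step name and jobflow id are placeholders):
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # credentials come from the usual boto configuration

step = StreamingStep(
    name='my streaming step',                        # placeholder name
    mapper='s3://{s3 bucket path}/sa_mapper.py',
    reducer='s3://{s3 bucket path}/sa_reducer.py',
    input='s3://{s3 bucket path}/input/',
    output='s3://{s3 bucket path}/output/',
    # copy EMR_Config.py from S3 next to the running task
    cache_files=['s3://{s3 bucket path}/EMR_Config.py#EMR_Config.py'],
)

conn.add_jobflow_steps('j-XXXXXXXXXXXX', [step])     # placeholder jobflow id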
Once you have them on your cluster, you can't simply import them.
What worked is:
# we added a file named EMR_Config.py,
sys.path.append(".")
#loading the module this way because of the EMR file system
module_name = 'EMR_Config'
__import__(module_name)
Config = sys.modules[module_name]
#now you can access the methods in the file, for example:
topic_name = Config.clean_key(row.get("Topic"))

import local python module in HTCondor

This concerns importing my own Python modules in an HTCondor job.
Suppose 'mymodule.py' is the module I want to import, and it is saved in a directory called XDIR.
In another directory called YDIR, I have written a file called xImport.py:
#!/usr/bin/env python
import os
import sys
print sys.path
import numpy
import mymodule
and a condor submit file:
executable = xImport.py
getenv = True
universe = Vanilla
output = xImport.out
error = xImport.error
log = xImport.log
queue 1
The result of submitting this is that, in xImport.out, sys.path is printed out, showing XDIR. But in xImport.error, there is an ImportError saying 'No module named mymodule'. So it seems that the path to mymodule is in sys.path, but Python does not find it. I'd also like to mention that the error message says the ImportError originates from the file
/mnt/novowhatsit/YDIR/xImport.py
and not YDIR/xImport.py.
How can I edit the above files to import mymodule.py?
When condor runs your process, it creates a directory on that machine (usually on a local hard drive) and sets that as the working directory. That's probably the issue you are seeing. If XDIR is local to the machine where you are running condor_submit, then its contents don't exist on the remote machine where xImport.py is running.
Try using the transfer_input_files mechanism in the .submit file (see http://research.cs.wisc.edu/htcondor/manual/v7.6/2_5Submitting_Job.html) to copy mymodule.py to the remote machines, as sketched below.
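A sketch of the submit file with the transfer directives added (the path to mymodule.py is a placeholder for wherever it lives in XDIR):
executable = xImport.py
getenv = True
universe = Vanilla
output = xImport.out
error = xImport.error
log = xImport.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /path/to/XDIR/mymodule.py
queue 1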
