I have two Python scripts that are intended to run on Amazon Elastic MapReduce - one as a mapper and one as a reducer. I've just recently expanded the mapper script to require a couple more local models that I've created, both of which live in a package called SentimentAnalysis. What's the right way to have a Python script import from a local Python package on S3? I tried creating S3 keys that mimic my file system in hopes that the relative paths would work, but alas they didn't. Here's what I see in the log files on S3 after the step failed:
Traceback (most recent call last):
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201407250000_0001/attempt_201407250000_0001_m_000000_0/work/./sa_mapper.py", line 15, in <module>
from SentimentAnalysis import NB, LR
ImportError: No module named SentimentAnalysis
The relevant file structure is like this:
sa_mapper.py
sa_reducer.py
SentimentAnalysis/NB.py
SentimentAnalysis/LR.py
And sa_mapper.py has:
from SentimentAnalysis import NB, LR
I tried to mirror the file structure in S3, but that doesn't seem to work.
What's the best way to set up S3 or EMR so that sa_mapper.py can import NB.py and LR.py? Is there some special trick to doing this?
Do you have an __init__.py file in the SentimentAnalysis folder?
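Without an __init__.py, Python 2 will not treat the directory as a package at all. A minimal sketch of the layout that would let the import in the question work (an empty __init__.py is enough):
sa_mapper.py
sa_reducer.py
SentimentAnalysis/__init__.py
SentimentAnalysis/NB.py
SentimentAnalysis/LR.py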
What is the command that you are running?
The only way to do it is to use the additional fields you can add to the step when you run it. For example, if you are using the boto package to run tasks on EMR, you have the class StreamingStep.
In it you have the parameters (if you use version 2.43):
cache_files (list(str)) – A list of cache files to be bundled with the job
cache_archives (list(str)) – A list of jar archives to be bundled with the job
meaning that you need to pass the paths of the files or folders that you want to be taken from S3 into your cluster.
The syntax is:
s3://{s3 bucket path}/EMR_Config.py#EMR_Config.py
The hash (#) is the separator: the part before it is the location in your S3 bucket, and the part after it is the name the file will have on the cluster. It will be placed in the same directory as the task you are running.
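For the setup in the original question, a minimal boto 2.x sketch might look like the following (the bucket and key names are hypothetical); the SentimentAnalysis package is archived and shipped via cache_archives so it is unpacked next to the mapper under that name:
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

# hypothetical S3 locations; the archive should contain __init__.py, NB.py and LR.py
# at its top level so that it unpacks into a directory named SentimentAnalysis
step = StreamingStep(
    name='Sentiment analysis streaming step',
    mapper='s3n://mybucket/sa_mapper.py',
    reducer='s3n://mybucket/sa_reducer.py',
    input='s3n://mybucket/input/',
    output='s3n://mybucket/output/',
    cache_archives=['s3n://mybucket/SentimentAnalysis.tar.gz#SentimentAnalysis'],
)

conn = EmrConnection()  # credentials come from the boto config / environment
conn.run_jobflow(name='sentiment-job', steps=[step])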
Once you have them in your cluster, you can't simply do an import.
What worked is:
import sys

# we added a file named EMR_Config.py
sys.path.append(".")

# load the module this way because of the EMR file system
module_name = 'EMR_Config'
__import__(module_name)
Config = sys.modules[module_name]

# now you can access the functions in the file, for example:
topic_name = Config.clean_key(row.get("Topic"))
Opening and loading data from a file that is situated in the same folder as the currently executing Python 3.x script can be done like this:
import os
mydata_path = os.path.join(os.path.dirname(__file__), "mydata.txt")
with open(mydata_path, 'r') as file:
data = file.read()
However once the script and mydata.txt files become part of a Python package this is not as straight forward anymore. I have managed to do this using a concoction of functions from the pkg_resources module such as resource_exists(), resource_listdir(), resource_isdir() and resource_string(). I won't put my code here because it is horrible and broken (but it sort of works).
Anyhow my question is: is there no way to manage the loading of a file in the same folder as the currently executing Python script that works regardless of whether the files are in a package or not?
You can use importlib.resources.read_text in order to read a file that's located relative to a package:
from importlib.resources import read_text
data = read_text('mypkg.foo', 'mydata.txt')
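On Python 3.9 and later there is also the files() API from the same module, which can do the same thing; a small sketch with the same hypothetical package and file names:
from importlib.resources import files

# locate mydata.txt relative to the mypkg.foo package and read it as text
data = files('mypkg.foo').joinpath('mydata.txt').read_text()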
I am writing code organized in several files. For the sake of the organization of folders and the CMakeLists.txt, a pythonlibs folder is created during the build process, and some links to Python files are created there, pointing to libraries in the /build/src/XXXX/ folder.
In the python file, I add to the python path:
sys.path.insert(1,'/opt/hpc/softwares/erfe/erfe/build/pythonlibs')
import libmsym as msym
When I run the main Python file, there is this one library, libmsym, that fails with:
import libmsym as msym
File "/opt/hpc/softwares/erfe/erfe/build/pythonlibs/libmsym.py", line 15, in <module>
from . import _libmsym_install_location, export
ImportError: attempted relative import with no known parent package
I created a link using cmake, but I believe it uses the ln command (I tried both hard and symbolic links). Is there a way to prevent this behavior without changing the library itself, just another way to create this link?
Thanks.
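No answer is recorded here, but one possible workaround (only a sketch, and only if the directory holding the real libmsym.py, together with its sibling modules _libmsym_install_location and export, can be treated as a package with an __init__.py) is to put the parent of that directory on sys.path and import the submodule, so the relative import has a parent package to resolve against:
import sys

# hypothetical layout under the build tree:
#   /opt/hpc/softwares/erfe/erfe/build/pythonlibs/__init__.py   (may be empty)
#   /opt/hpc/softwares/erfe/erfe/build/pythonlibs/libmsym.py
#   /opt/hpc/softwares/erfe/erfe/build/pythonlibs/_libmsym_install_location.py
#   /opt/hpc/softwares/erfe/erfe/build/pythonlibs/export.py
sys.path.insert(1, '/opt/hpc/softwares/erfe/erfe/build')  # the parent of the package, not the package itself
from pythonlibs import libmsym as msym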
I am working with executable files that are to be included in the main node file. The problem is that, although I can define the path on my PC and open a text file or execute the file, it is not universal for the other members of my team. They have to change the code after pulling from git.
I have the following commands:
os.chdir("/home/user/epsilon/epsilon_catkin_ws/src/knowledge_source/sense_manager/pddl_files/")
os.system("./ff -p /home/user/epsilon/epsilon_catkin_ws/src/knowledge_source/sense_manager/pddl_files/ -o domain.pddl -f problem.pddl > solution_detail.txt")
solution_path = "/home/user/epsilon/epsilon_catkin_ws/src/knowledge_source/sense_manager/pddl_files/solution.txt"
Since the paths are unique to my laptop, everyone would need to change them. Is there a way to make the path definition universal? (The commands are part of the service call that is present in that node.) I am using rosrun to run the node.
I tried the following pattern but it does not work:
os.chdir( "$(find knowledge_source)/sense_manager/pddl_files/")
Do I need to do something extra to make this work?
knowledge_source is the name of the package
Any recommendations?
I think you are looking for a combination of rospkg and os to get the path to a file inside a ROS package.
import rospkg
import os
conf_file_path = os.path.join(rospkg.RosPack().get_path('your_ros_pkg'), 'your_folder_in_the_package', 'your_file_name')
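Applied to the commands from the question, that might look roughly like this (a sketch; the package and folder names are taken from the question):
import os
import rospkg

# resolve the pddl_files folder inside the knowledge_source package on any machine
pddl_dir = os.path.join(rospkg.RosPack().get_path('knowledge_source'),
                        'sense_manager', 'pddl_files')
os.chdir(pddl_dir)
os.system("./ff -p {0}/ -o domain.pddl -f problem.pddl > solution_detail.txt".format(pddl_dir))
solution_path = os.path.join(pddl_dir, 'solution.txt')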
This concerns the importing of my own Python modules in an HTCondor job.
Suppose 'mymodule.py' is the module I want to import, and it is saved in a directory called XDIR.
In another directory called YDIR, I have written a file called xImport.py:
#!/usr/bin/env python
import os
import sys
print sys.path
import numpy
import mymodule
and a condor submit file:
executable = xImport.py
getenv = True
universe = Vanilla
output = xImport.out
error = xImport.error
log = xImport.log
queue 1
The result of submitting this is that, in xImport.out, sys.path is printed out, showing XDIR. But in xImport.error, there is an ImportError saying 'No module named mymodule'. So it seems that the path to mymodule is in sys.path, but Python does not find it. I'd also like to mention that the error message says the ImportError originates from the file
/mnt/novowhatsit/YDIR/xImport.py
and not YDIR/xImport.py.
How can I edit the above files to import mymodule.py?
When condor runs your process, it creates a directory on that machine (usually on a local hard drive) and sets that as the working directory. That's probably the issue you are seeing. If XDIR is local to the machine where you are running condor_submit, then its contents don't exist on the remote machine where xImport.py is running.
Try using the submit-file transfer_input_files mechanism (see http://research.cs.wisc.edu/htcondor/manual/v7.6/2_5Submitting_Job.html) to copy mymodule.py to the remote machines.
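A sketch of what the submit file might look like with file transfer enabled (the path under XDIR is whatever it is on the submit machine):
executable              = xImport.py
getenv                  = True
universe                = vanilla
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = /path/to/XDIR/mymodule.py
output                  = xImport.out
error                   = xImport.error
log                     = xImport.log
queue 1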
I'm running the following MapReduce on AWS Elastic MapReduce:
./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE --mapper
s3://classify.mysite.com/mapper.py --reducer
s3://classify.mysite.com/reducer.py --input
s3n://classify.mysite.com/s3_list.txt --output
s3://classify.mysite.com/dat_output4/ --cache
s3n://classify.mysite.com/classifier.py#classifier.py --cache-archive
s3n://classify.mysite.com/policies.tar.gz#policies --bootstrap-action
s3://classify.mysite.com/bootstrap.sh --enable-debugging
--master-instance-type m1.large --slave-instance-type m1.large --instance-type m1.large
For some reason the cacheFile classifier.py is not being cached, it would seem. I get this error when the reducer.py tries to import it:
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
from classifier import text_from_html, train_classifiers
ImportError: No module named classifier
classifier.py is most definitely present at s3n://classify.mysite.com/classifier.py. For what it's worth, the policies archive seems to load in just fine.
I don't know how to fix this problem in EC2, but I've seen it before with Python in traditional Hadoop deployments. Hopefully the lesson translates over.
What we need to do is add the directory reducer.py is in to the Python path, because presumably classifier.py is in there too. For whatever reason, this place is not in the Python path, so it is failing to find classifier.
import sys
import os.path

# __file__ is the path to reducer.py; dirname strips the file name and leaves the directory.
# Appending that directory to sys.path (where Python looks for modules) lets the import below work.
sys.path.append(os.path.dirname(__file__))

from classifier import text_from_html, train_classifiers
The reason your code might work locally is the current working directory you are running it from; Hadoop might not be running it from the same place.
orangeoctopus deserves credit for this from his comment. I had to append the working directory to the system path:
sys.path.append('./')
Also, I recommend that anyone who has issues similar to mine read this great article on using the Distributed Cache on AWS:
https://forums.aws.amazon.com/message.jspa?messageID=152538