I have a recurring problem with my hadoop cluster in that occasionally a functioning code stops seeing python modules that are in the proper location. I'm looking for tips from someone who might have faced the same problem.
When I first started programming and a code stopped working I asked a question here on SO and someone told me to just go to bed and in the morning it should work, or some other "you're a dummy, you must have changed something" kind of comment.
I run the code several times, it works, I go to sleep, in the morning I try to run it again and it fails. Sometimes I kill jobs with CTRL+C, and sometimes I use CTRL+Z. But this just takes up resources and doesn't cause any other issue besides that - the code still runs. I have yet to have see this problem right after the code works. This usually happens the morning after, when I come into work after the code worked when I left 10 hours ago. Restarting the cluster typically solves the issue
I'm currently checking to see if the cluster restarts itself for some reason, or if some part of it is failing, but so far the ambari screens show everything green. I'm not sure if there is some automated maintenance or something that is known to screw things up.
Still working my way through the elephant book, sorry if this topic is clearly addressed on page XXXX, I just haven't made it to that page yet.
I looked through all the error logs, but the only meaningful thing I see is in stderr:
File "/data5/hadoop/yarn/local/usercache/melvyn/appcache/application_1470668235545_0029/container_e80_1470668235545_0029_01_000002/format_text.py", line 3, in <module>
from formatting_functions import *
ImportError: No module named formatting_functions
So we solved the problem. The issue is particular to our set up. We have all of our datanodes nfs mounted. Occasionally a node fails, and someone has to bring it back up and remount it.
Our script specifies the path to libraries like:'
pig -Dmapred.child.env="PYTHONPATH=$path_to_mnt$hdfs_library_path" ...
so pig couldn't find the libraries, because $path_to_mnt was invalid for one of the nodes.
Related
having a strange issue when trying to run a simple "hello world" program with MPI.
I eventually want to use 100 processes for this MPI script I'm writing in python and was even able to run the hello world test earlier with up to 100 processes. However, now I keep encountering the same error when I try to run the script with ~50 processes.
The specific error I see seems to be stating:
ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file util/listener.c at line 321
After trying to research this, I understand that it has something to do with a process running out of file descriptors and it seems like the most common solutions state that a file is not closing properly. However, my issue here is, I'm not opening any files? My script is just:
print('I am process:', rank)
So what could the issue be stemming from here?
I seem to have found a slight workaround.
I am working on a Mac, so I'm assuming that earlier I was able to stay under my file limit that is at a certain default amount set by the OS. By configuring the max file limit, I was able to bypass the limit amount I was originally hitting, causing my program to crash.
This fix isn't ideal, since my script now takes quite a while to run, but it is at least a temporary one until I can find a better fix.
If anyone would like to attempt this, the solution I found was posted by #tombigel on GitHub and can be found here.
I am facing a problem with a python script getting killed. I had always used this script with no problem at all until two days ago, then it started to print, without any change in the code, the string 'killed' before aborting the execution.
Other people have tried to run the same code on their system and it works fine, as it used to do with me until two days ago.
I have read some old similar question, and I have got the problem could be an out-of-memory issue due to a bad memory management in my code. It sounds a little strange to me, since it used to work perfectly until some days ago and the problem appears on my system only.
Do you have any idea on how to inspect the problem and find a possible solution, please?
Python version: Python 2.7.14+
System: Scientific Linux CERN 7
In your case, it's highly probale that the script you're processing reached some given limit of the amount of resources it's able to use and that depends on your OS and other parameters, are you running something else with the script ? or are there many open files etc ?
The most likely reason for such an error is exceeding memory use, whiwh forces the system to not take risks and break when allocating more starts failing. Maybe you can print in parallel the total memory you're using to have a glimpse of what's happening since the information you've given are not enough to help you :
import os, psutil
process = psutil.Process(os.getpid())
then: (for python 3)
print(process.memory_info().rss)
or: (for python 2.7) (tested)
print(process.memory_info()[0])
I wrote a script for my company that randomly selects employees for random drug tests. It works wonderfully, except when I gave it to the person who would use the program. She clicked on it and a message popped up asking if she trusts the program. AFter clicking run anyways, AVG flagged it two more times before it would finally load. I read someone else's comment saying to make an exception for it on the antivirus. The problem is, I wrote another program that reads other scripts and reads/writes txt files, generates excel spreadsheets and many other things. I'm really close to releasing the final product to a few select companies as a trial, and this certificate thing is going to be an issue. I code for fun, so there's a lot of lingo that goes right by me. Can someone point me in the right direction where I can get some information on creating a trusted program?
It appears to be a whole long process to obtain a digital certification. You need one to be issued by a certification authority. Microsoft appears to have a docs page on it.
After you have the certification, you'd need to sign your .exe file after it's been created using a tool like SignTool. You may find more useful and detailed answers than I can provide you in this thread, as I actually only know quite little about this whole process and can only redirect you to those who know more. I'd suggest you look through what I have listed here before asking me any more, since I probably know about as much as you do past this point.
If anyone else is having this problem, I stumbled on a solution that works for me.
I created an Install Wizard using Inno Setup. Before I could install the software (My drug test program), it got flagged, asking me if I trust the software. I clicked "run anyway" and my antivirus flagged it two more times. After the program was installed. it never flagged me again. Since my main program will probably be used by 100-200 people, I'm completely fine having to do that procedure once. However, for a more "professional" result, it's probably work investing in certificates.
I'm trying to run an automated test in Ride.py. This test works on my colleague's computer but for some reason does not work on mine. The test starts but at a certain point i get the following error:
[ ERROR ] Calling method 'start_keyword' of listener 'C:\Python27\lib\site-packages\robotide\contrib\testrunner\TestRunnerAgent.py' failed: IndexError: list index out of range
The interesting part is that this error occurs on the same spot ever time, but with a different test it happens at a different time.
I tried to google several things and nothing worked. One solution suggested there was a '#' commented somewhere and this caused the crash. I looked but I don't see a '#' commented anywhere.
Another suggestion lead me to believe my testrunneragent.py file must have been installed wrong. I went online to find the file and replaced it. This did not work either (reran the test before and after a restart of ride)
We tried to re-import the test files thinking perhaps something went wrong there. This did not help either.
Googling juts the last part (IndexError: list index out of range) gave me the suggestion it does not recognize all the lines of code in the back-end file. I would have no clue how to solve this as im not a major coder.
One difference between me and my colleague could be the versions. I downloaded python version 2.7.16 and ride 1.7.3.1. My colleague uses an older version of both python and RIDE. Perhaps the problem could be here?
https://paste.fedoraproject.org/paste/TLekH3az0m4wuUyM8C2RYw
I expect the test will run without failing (it is a happy flow) I have included some screenshots with code in the previous segment that might help
downgraded to the same version of Ride.py
This seems to have fixed the issue
I have a problem that I seriously spent months on now!
Essentially I am running code that requires to read from and save to HD5 files. I am using h5py for this.
It's very hard to debug because the problem (whatever it is) only occurs in like 5% of the cases (each run takes several hours) and when it gets there it crashes python completely so debugging with python itself is impossible. Using simple logs it's also impossible to pinpoint to the exact crashing situation - it appears to be very random, crashing at different points within the code, or with a lag.
I tried using OllyDbg to figure out whats happening and can safely conclude that it consistently crashes at the following location: http://i.imgur.com/c4X5W.png
It seems to be shortly after calling the python native PyObject_ClearWeakRefs, with an access violation error message. The weird thing is that the file is successfully written to. What would cause the access violation error? Or is that python internal (e.g. the stack?) and not file (i.e. my code) related?
Has anyone an idea whats happening here? If not, is there a smarter way of finding out what exactly is happening? maybe some hidden python logs or something I don't know about?
Thank you
PyObject_ClearWeakRefs is in the python interpreter itself. But if it only happens in a small number of runs, it could be hardware related. Things you could try:
Run your program on a different machine. if it doesn't crash there, it is probably a hardware issue.
Reinstall python, in case the installed version has somehow become corrupted.
Run a memory test program.
Thanks for all the answers. I ran two versions this time, one with a new python install and my same program, another one on my original computer/install, but replacing all HDF5 read/write procedures with numpy read/write procedures.
The program continued to crash on my second computer at odd times, but on my primary computer I had zero crashes with the changed code. I think it is thus safe to conclude that the problems were HDF5 or more specifically h5py related. It appears that more people encountered issues with h5py in that respect. Given that any error in my application translates to potentially large financial losses I decided to dump HDF5 completely in favor of other stable solutions.
Use a try catch statement. This can be put into the program in order to stop the program from crashing when erroneous data is entered