Why does my python script randomly get killed?

Why does my python script randomly get killed? - python

Basically, i have a list of 30,000 URLs.
The script goes through the URLs and downloads them (with a 3 second delay in between).
And then it stores the HTML in a database.
And it loops and loops...
Why does it randomly get "Killed."? I didn't touch anything.
Edit: this happens on 3 of my linux machines.
The machines are on a Rackspace cloud with 256 MB memory. Nothing else is running.

Looks like you might be running out of memory -- might easily happen on a long-running program if you have a "leak" (e.g., due to accumulating circular references). Does Rackspace offer any easily usable tools to keep track of a process's memory, so you can confirm if this is the case? Otherwise, this kind of thing is not hard to monitor with normal Linux tools from outside the process. Once you have determined that "out of memory" is the likely cause of death, Python-specific tools such as pympler can help you track exactly where the problem is coming from (and thus determine how to avoid those references -- be it by changing them to weak references, or other simpler approaches -- or otherwise remove the leaks).

In cases like this, you should check the log files.
I use Debian and Ubuntu, so the main log file for me is: /var/log/syslog
If you use Red Hat, I think that log is: /var/log/messages
If something happens that is as exceptional as the kernel killing your process, there will be a log event explaining it.
I suspect you are being hit by the Out Of Memory Killer.

Is it possible that it's hitting an uncaught exception? Are you running this from a shell, or is it being run from cron or in some other automated way? If it's automated, the output may not be displayed anywhere.

Are you using some sort of queue manager or process manager of some sort ?
I got apparently random killed messages when the batch queue manager I was using was sending SIGUSR2 when the time was up.
Otherwise I strongly favor the out of memory option.

For those who came here with mysql, I found this answers may by helpful:
use SSCursor as suggented by this
conn = MySQLdb.connect(host=DB_HOST, user=DB_USER, db=DB_NAME,
passwd=DB_PASSWORD, charset="utf8",
cursorclass=MySQLdb.cursors.SSCursor)
and iterate over cursor as suggested by this
cursor = conn.cursor()
cursor.execute("select * from very_big_table;")
for row in cur:
# do what you want here
pass
Do pay attention to what the doc says You MUST retrieve the entire result set and close() the cursor before additional queries can be peformed on the connection., so if you want write and the same time, you should use another connection, or you will get
`_mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now")`

Related

ansible runner runs to long

When I use ansible's python API to run a script on remote machines(thousands), the code is:
runner = ansible.runner.Runner(
module_name='script',
module_args='/tmp/get_config.py',
pattern='*',
forks=30
)
then, I use
datastructure = runner.run()
This takes too long. I want to insert the datastructure stdout into MySQL. What I want is if once a machine has return data, just insert the data into MySQL, then the next, until all the machines have returned.
Is this a good idea, or is there a better way?

The runner call will not complete until all machines have returned data, can't be contacted or the SSH session times out. Given that this is targeting 1000's of machines and you're only doing 30 machines in parallel (forks=30) it's going to take roughly Time_to_run_script * Num_Machines/30 to complete. Does this align with your expectation?
You could up the number of forks to a much higher number to have the runner complete sooner. I've pushed this into the 100's without much issue.
If you want max visibility into what's going on and aren't sure if there is one machine holding you up, you could run through each hosts serially in your python code.
FYI - this module and class is completely gone in Ansible 2.0 so you might want to make the jump now to avoid having to rewrite code later

Avoiding or handling "BadRequestError: The requested query has expired."?

I'm looping over data in app engine using chained deferred tasks and query cursors. Python 2.7, using db (not ndb). E.g.
def loop_assets(cursor = None):
try:
assets = models.Asset.all().order('-size')
if cursor:
assets.with_cursor(cursor)
for asset in assets.run():
if asset.is_special():
asset.yay = True
asset.put()
except db.Timeout:
cursor = assets.cursor()
deferred.defer(loop_assets, cursor = cursor, _countdown = 3, _target = version, _retry_options = dont_retry)
return
This ran for ~75 minutes total (each task for ~ 1 minute), then raised this exception:
BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
Reading the docs, the only stated cause of this is:
New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.
So maybe that's what happened, but it seems a co-incidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).
Is there another explanation for this?
Is there a way to re-architect my code to avoid this?
If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it cant be the same issue.

You may want to look at AppEngine MapReduce
MapReduce is a programming model for processing large amounts of data
in a parallel and distributed fashion. It is useful for large,
long-running jobs that cannot be handled within the scope of a single
request, tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis

When I asked this question, I had run the code once, and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.
In my specific case, it was also possible to tweak the query so that in the case that the cursor 'expires', the query can restart w/o a cursor where it left off. Effectively change the query to:
assets = models.Asset.all().order('-size').filter('size <', last_seen_size)
Where last_seen_size is a value passed from each task to the next.

postgres database: When does a job get killed

I am using a postgres database with sql-alchemy and flask. I have a couple of jobs which I have to run through the entire database to updates entries. When I do this on my local machine I get a very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can get from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I have too many entries the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (sqlalchemy, postgres)?
2. Since this does seem to behave differently on my local machine there must be a way to control this. Where would that be?
thanks
carl

Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries which connect to PostgreSQL will read the entire result set into memory, by default. But some libraries have a way to tell it to process the results row by row, so they aren't all read into memory at once. I don't know if flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.

Most likely kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000 entry batches in a loop in one Python process. Python does not release all used memory, so the memory usage grows until it gets killed. (You can watch this with top command.)
You should try adapting your script to process 2000 records in a step and then quit. If you run in multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in separate worker. Run the jobs serially and let them die, when they finish. This way they will release the memory back to OS when they exit.

SQLite3 and Multiprocessing

I noticed that sqlite3 isn´t really capable nor reliable when i use it inside a multiprocessing enviroment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: Sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together or is there a reliable way to handle multi-connections with sqlite?

First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than
one thread.
On some operating systems, a database connection should
always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?

I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information
from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each
process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.

sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.

If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each others toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions, in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if then capability so that you could SELECT a record, check that a field is equal to X, and only then, UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/

python sqlalchemy + postgresql program freezes

I've ran into a strange situation. I'm writing some test cases for my program. The program is written to work on sqllite or postgresqul depending on preferences. Now I'm writing my test code using unittest. Very basically what I'm doing:
def setUp(self):
"""
Reset the database before each test.
"""
if os.path.exists(root_storage):
shutil.rmtree(root_storage)
reset_database()
initialize_startup()
self.project_service = ProjectService()
self.structure_helper = FilesHelper()
user = model.User("test_user", "test_pass", "test_mail#tvb.org",
True, "user")
self.test_user = dao.store_entity(user)
In the setUp I remove any folders that exist(created by some tests) then I reset my database (drop tables cascade basically) then I initialize the database again and create some services that will be used for testing.
def tearDown(self):
"""
Remove project folders and clean up database.
"""
created_projects = dao.get_projects_for_user(self.test_user.id)
for project in created_projects:
self.structure_helper.remove_project_structure(project.name)
reset_database()
Tear down does the same thing except creating the services, because this test module is part of the same suite with other modules and I don't want things to be left behind by some tests.
Now all my tests run fine with sqllite. With postgresql I'm running into a very weird situation: at some point in the execution, which actually differs from run to run by a small margin (ex one or two extra calls) the program just halts. I mean no error is generated, no exception thrown, the program just stops.
Now only thing I can think of is that somehow I forget a connection opened somewhere and after I while it timesout and something happens. But I have A LOT of connections so before I start going trough all that code, I would appreciate some suggestions/ opinions.
What could cause this kind of behaviour? Where to start looking?
Regards,
Bogdan

PostgreSQL based applications freeze because PG locks tables fairly aggressively, in particular it will not allow a DROP command to continue if any connections are open in a pending transaction, which have accessed that table in any way (SELECT included).
If you're on a unix system, the command "ps -ef | grep 'post'" will show you all the Postgresql processes and you'll see the status of current commands, including your hung "DROP TABLE" or whatever it is that's freezing. You can also see it if you select from the pg_stat_activity view.
So the key is to ensure that no pending transactions remain - this means at a DBAPI level that any result cursors are closed, and any connection that is currently open has rollback() called on it, or is otherwise explicitly closed. In SQLAlchemy, this means any result sets (i.e. ResultProxy) with pending rows are fully exhausted and any Connection objects have been close()d, which returns them to the pool and calls rollback() on the underlying DBAPI connection. you'd want to make sure there is some kind of unconditional teardown code which makes sure this happens before any DROP TABLE type of command is emitted.
As far as "I have A LOT of connections", you should get that under control. When the SQLA test suite runs through its 3000 something tests, we make sure we're absolutely in control of connections and typically only one connection is opened at a time (still, running on Pypy has some behaviors that still cause hangs with PG..its tough). There's a pool class called AssertionPool you can use for this which ensures only one connection is ever checked out at a time else an informative error is raised (shows where it was checked out).

One solution I found to this problem was to call db.session.close() before any attempt to call db.drop_all(). This will close the connection before dropping the tables, preventing Postgres from locking the tables.
See a much more in-depth discussion of the problem here.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why does my python script randomly get killed? - python

Is it possible that it's hitting an uncaught exception? Are you running this from a shell, or is it being run from cron or in some other automated way? If it's automated, the output may not be displayed anywhere.

Are you using some sort of queue manager or process manager of some sort ? I got apparently random killed messages when the batch queue manager I was using was sending SIGUSR2 when the time was up. Otherwise I strongly favor the out of memory option.

Related

ansible runner runs to long

Avoiding or handling "BadRequestError: The requested query has expired."?

postgres database: When does a job get killed

SQLite3 and Multiprocessing

python sqlalchemy + postgresql program freezes

Categories

Resources