postgres database: When does a job get killed - python

I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs which have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can get from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I have too many entries the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (sqlalchemy, postgres)?
2. Since this does seem to behave differently on my local machine there must be a way to control this. Where would that be?
thanks
carl

Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries which connect to PostgreSQL will read the entire result set into memory by default. But some libraries have a way to tell them to process the results row by row, so they aren't all read into memory at once. I don't know if Flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
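If your stack does support it, in SQLAlchemy this is typically done with a server-side cursor plus yield_per. A minimal sketch, assuming a mapped Entry model and a connection URL that are placeholders for your own:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder connection URL; substitute your own.
engine = create_engine("postgresql://user:password@localhost/mydb")
session = sessionmaker(bind=engine)()

# stream_results asks the driver for a server-side cursor, and yield_per(2000)
# makes the ORM materialize only 2000 rows at a time instead of the whole table.
query = (
    session.query(Entry)                       # Entry is a placeholder model
    .execution_options(stream_results=True)
    .yield_per(2000)
)

for entry in query:
    update_entry(entry)   # placeholder for your per-row update logic

session.commit()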

Most likely the kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in one Python process. Python does not release all of the memory it has used back to the OS, so memory usage grows until the process gets killed. (You can watch this with the top command.)
You should try adapting your script to process 2000 records in a step and then quit. If you run it multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in a separate worker. Run the jobs serially and let them die when they finish. This way they release the memory back to the OS when they exit.
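A minimal sketch of that multiprocessing idea, assuming a hypothetical process_batch(offset, limit) function that opens its own connection and does the actual work:

# Sketch: run each 2000-row batch in its own worker process so the memory
# is returned to the OS when the process exits. process_batch() and
# TOTAL_ROWS are placeholders for your own code.
from multiprocessing import Process

BATCH_SIZE = 2000
TOTAL_ROWS = 100_000  # placeholder

def worker(offset):
    # open the DB connection inside the child, not in the parent
    process_batch(offset, BATCH_SIZE)

if __name__ == "__main__":
    for offset in range(0, TOTAL_ROWS, BATCH_SIZE):
        p = Process(target=worker, args=(offset,))
        p.start()
        p.join()   # run the jobs serially; each child frees its memory on exit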

Related

How to prevent python script from freezing in ubuntu 16.04

I work for a digital marketing agency with multiple clients. In one of the projects, I have a very resource-intensive Python script (which fetches data for Facebook ads) that has to be run for all of those clients (say 500+ in number) on an Ubuntu 16.04 server.
Originally the script took around 2 minutes to complete for one client, with 300 MB RES and 1000 MB VM (as per htop). I therefore optimized it with ThreadPoolExecutor (max_workers=10) so that the script can run on 4 clients (almost) concurrently.
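(For reference, a minimal sketch of such a ThreadPoolExecutor setup; fetch_ads() and the client list are placeholders, not the actual script:)

# Sketch: submit one job per client to a thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

clients = ["client_a", "client_b", "client_c", "client_d"]  # placeholder

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(fetch_ads, c): c for c in clients}  # fetch_ads is a placeholder
    for future in as_completed(futures):
        print(futures[future], future.result())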
Then I found out that sometimes the script froze during a run (basically it's in a "comatose state"). I debugged and profiled it and found that it's not the script causing the issue, but the system.
Then I batched the script, meaning if there are 20 input clients, I ran 5 instances of the script (4*5=20). Sometimes it went fine, but sometimes the last instance froze.
Then I found out that RAM (2 GB) was being overused, so I increased swap space from 0 to 1 GB. That did the trick. But if a few clients are heavy in memory, the same thing happens.
I have attached the screenshot of the latest run where, after running 8 instances, the last 2 froze. They can be left like that for days.
I am thinking of increasing the server RAM from 2 GB to 4 GB, but I'm not sure that's a permanent solution. Has anyone faced a similar issue?
You need to fix the RAM consumption of your script.
If your script allocates more memory than your system can provide, it gets memory errors; if that happens inside thread pools or similar constructs, the threads may never return under some circumstances.
You can fix this by using async functions with timeouts and implementing automatic restart handlers, in case a process does not yield the expected result.
The best way to do that is heavily dependent on the script and will probably require altering already-written code.
The issue is definitely with your script and not with the OS.
The fastest workaround would be to increase the system memory or to reduce the number of threads.
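A minimal sketch of the timeout-and-restart idea, assuming a hypothetical run_client(client) function that does one client's work:

# Sketch: run each client's job in a child process, give it a deadline,
# and restart it if it hangs. run_client() and TIMEOUT are placeholders.
from multiprocessing import Process

TIMEOUT = 300  # seconds; placeholder value

def run_with_restart(client, retries=2):
    for attempt in range(retries + 1):
        p = Process(target=run_client, args=(client,))
        p.start()
        p.join(TIMEOUT)
        if not p.is_alive():
            return             # finished (or crashed) within the deadline
        p.terminate()          # kill the hung worker...
        p.join()
        print(f"{client}: attempt {attempt + 1} timed out, restarting")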
If just adding 1GB of swap area "almost" did the trick then definitely increasing the physical memory is a good way to go. Btw remember that swapping means you're using disk storage, whose speed is measured in millisecs, while RAM speed is measured in nanosecs - so avoiding swap guarantees a performance boost.
And then, reboot your system every now and then. Although Linux is far better than Windows in this respect, memory leaks do occur in Linux too, and a reboot every few months will surely help.
As Gornoka stated, you need to reduce the memory consumption of the script. As an added detail, this can also be done by removing variables within the script once they have been used, with the del keyword.
It also helps to make sure that, if the script is processing massive files, it does so line by line, saving the result as it finishes each line.
I have had this happen and it is usually an indicator of working with too much data at once in RAM. It is always better to work with the data in parts whenever possible, and if that is not possible, get more RAM.
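A minimal sketch of that line-by-line pattern, with placeholder file names and a placeholder transform():

# Sketch: process a huge input file one line at a time and free large
# intermediates explicitly instead of holding everything in memory.
with open("huge_input.txt") as src, open("results.txt", "w") as dst:
    for line in src:
        result = transform(line)   # placeholder per-line work
        dst.write(result + "\n")   # save as each line is finished
        del result                 # drop the reference right away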

Can I do vagrant ssh in a parallel window for the same guest machine which is currently running a job? Will it disturb the current job in any way?

As part of my homework I need to load large data files into two MySQL tables, parsed using Python, on my guest machine that is called via Vagrant SSH.
I then need to run a Sqoop job on one of the two tables. I'm now at the point where I have loaded one of the tables successfully and have run the Python script to load the second table; it's been more than 3 hours and it's still loading.
I was wondering whether I could complete my Sqoop job on the already loaded table instead of staring at a black screen for almost 4 hours.
My questions are:
Is there any other way to vagrant ssh into the same machine without doing vagrant reload (because --reload eventually shuts down my virtual machine, thereby killing all the jobs currently running on my guest)?
If there is, and I open a parallel window, log in to the guest machine as usual, and start working on my Sqoop job on the first table that has already loaded: will it in any way affect my current job with the second table that is still loading? Could it cause data loss? I can't risk redoing it, as it is super large and extremely time-consuming.
The Python code goes like this:

def parser():
    with open('1950-sample.txt', 'r', encoding='latin_1') as input:
        for line in input:
            ...  # (parsing logic omitted in the question)

# Inserting into tables
def insert():
    if tablename == '1950_psr':
        cursor.execute(
            "INSERT INTO 1950_psr (usaf,wban,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir, sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
            (USAF, WBAN, obs_da_dt, lati, longi, elev, win_dir, qc_wind_dir, sky, qc_sky, visib, qc_visib, air_temp, qc_air_temp, dew_temp, qc_dew_temp, atm_press, qc_atm_press))
    elif tablename == '1986_psr':
        cursor.execute(
            "INSERT INTO 1986_psr (usaf,wban,obs_da_dt,lati,longi,elev,win_dir,qc_wind_dir, sky,qc_sky,visib,qc_visib,air_temp,qc_air_temp,dew_temp,qc_dew_temp,atm_press,qc_atm_press) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
            (USAF, WBAN, obs_da_dt, lati, longi, elev, win_dir, qc_wind_dir, sky, qc_sky, visib, qc_visib, air_temp, qc_air_temp, dew_temp, qc_dew_temp, atm_press, qc_atm_press))

parser()

# Saving & closing
conn.commit()
conn.close()
I don't know what's in your login scripts, and I'm not clear what that --reload flag is, but in general you can have multiple ssh sessions to the same machine. Just open another terminal and ssh into the VM.
However, in your case, that's probably not a good idea. I suspect that the second table is taking a long time to load because your database is reindexing or it's waiting on a lock to be released.
Unless you are loading hundreds of megs, I suggest you first check for locks and see what queries are pending.
Even if you are loading a very large dataset and there are no constraints on the table your script needs, you are just going to pile more load onto a machine that's already taxed pretty heavily...
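A minimal sketch of that check from Python, assuming a PyMySQL connection (the driver and credentials are placeholders; any MySQL client works):

# Sketch: look at what the MySQL server is currently doing before piling
# on more work. Connection parameters are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="user", password="pass", db="mydb")
cur = conn.cursor()

# Every connected session and what it is executing right now
cur.execute("SHOW FULL PROCESSLIST")
for row in cur.fetchall():
    print(row)

# Storage-engine level view, including lock information (InnoDB)
cur.execute("SHOW ENGINE INNODB STATUS")
print(cur.fetchone()[2])

conn.close()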

Python Sqlite Concurrent access

I have a scenario where a third-party application scans a folder and triggers my Python script/generated EXE once for each file in the folder (yes, that many separate processes).
My script/application writes the path of the file to a local SQLite database, calls the next application and exits.
My script/application takes care that it calls only one instance of the next application. But nothing can be done about the third-party application that calls my script.
The ISSUE
Sometimes more than 1000 instances of my script/application can be called at the same time resulting in almost 1000 concurrent connections to the local sqlite database.
Due to the limited number of concurrent connections possible with sqlite, some of the processes are getting a "database is locked" Exception. This results in some of the file names NOT being written to the database
We came up with a work around for this. We write in the database in an infinite loop. On encountering the exception, we make the thread sleep for say 50 milliseconds and try again till such time that the write works. I know that this is not a clean approach.
Is there a better way of doing this? How do I handle 1000 may be 10000 or may be more concurrent connections and yet each script succeeds?
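(For reference, the workaround described above looks roughly like this; the database path and table are placeholders:)

# Sketch of the retry-until-it-works workaround described in the question.
import sqlite3
import time

def record_file(path):
    while True:
        try:
            conn = sqlite3.connect("files.db")   # placeholder path
            with conn:
                conn.execute("INSERT INTO files (path) VALUES (?)", (path,))
            conn.close()
            return
        except sqlite3.OperationalError:   # "database is locked"
            time.sleep(0.05)               # back off 50 ms and try again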
Your workaround is correct, but you can make the database do most of the work by setting a busy timeout. (For 1000 connections, this needs to be set extremely high, essentially infinite, like you're already doing.)
But this still results in random wait times.
SQLite does not wait until one writer's transaction has finished and then signal the next, because there is no portable API for this. However, on Windows, you could use a named mutex object (which requires some trickery to access from Python).
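A minimal sketch of the busy-timeout approach with the standard library sqlite3 module (database path and table are placeholders):

# Sketch: let SQLite itself retry on lock contention instead of hand-rolling
# the sleep loop.
import sqlite3

# timeout= is how long sqlite3 waits on a locked database before raising
# "database is locked" (seconds). Set it very high for heavy contention.
conn = sqlite3.connect("files.db", timeout=600)

# The same thing at the SQLite level, in milliseconds:
conn.execute("PRAGMA busy_timeout = 600000")

with conn:
    conn.execute("INSERT INTO files (path) VALUES (?)", ("C:/some/file.txt",))
conn.close()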

ansible runner runs too long

When I use Ansible's Python API to run a script on remote machines (thousands of them), the code is:

import ansible.runner

runner = ansible.runner.Runner(
    module_name='script',
    module_args='/tmp/get_config.py',
    pattern='*',
    forks=30
)

Then I use:

datastructure = runner.run()

This takes too long. I want to insert the stdout from datastructure into MySQL. What I want is: as soon as one machine has returned its data, insert that data into MySQL, then move on to the next, until all the machines have returned.
Is this a good idea, or is there a better way?
The runner call will not complete until all machines have returned data, can't be contacted, or the SSH session times out. Given that this is targeting thousands of machines and you're only doing 30 machines in parallel (forks=30), it's going to take roughly Time_to_run_script * Num_Machines/30 to complete. Does this align with your expectation?
You could up the number of forks to a much higher number to have the runner complete sooner. I've pushed this into the hundreds without much issue.
If you want maximum visibility into what's going on and aren't sure if one machine is holding you up, you could run through each host serially in your Python code.
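A rough sketch of that serial, per-host loop with the pre-2.0 Runner API; the host list and insert_into_mysql() are placeholders, and the exact shape of the result dictionary may vary by Ansible 1.x version:

# Sketch: run the script host by host so each result can be stored as soon
# as it arrives.
import ansible.runner

hosts = ["web01", "web02", "web03"]  # placeholder inventory

for host in hosts:
    result = ansible.runner.Runner(
        module_name='script',
        module_args='/tmp/get_config.py',
        pattern=host,
        forks=1,
    ).run()

    contacted = result.get('contacted', {})
    if host in contacted:
        insert_into_mysql(host, contacted[host].get('stdout'))  # placeholder
    else:
        print("could not contact", host)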
FYI - this module and class are completely gone in Ansible 2.0, so you might want to make the jump now to avoid having to rewrite code later.

ubuntu django: run management commands much faster (I tried renice by setting -18 priority to the python process pid)

I am using Ubuntu. I have some management commands which, when run, do lots of database manipulation, so they take nearly 15 minutes.
My system monitor shows that my system has 4 CPUs and 6 GB of RAM. But this process is not utilising all the CPUs: I think it is using only one of them, and very little RAM. I think that if I can make it use all the CPUs and most of the RAM, the process will complete in much less time.
I tried renice, setting the priority to -18 (which means very high), but the speed is still low.
Details:
It's a Python script with a loop count of nearly 10,000, and nearly ten such loops. In every loop it saves to the Postgres database.
If you are looking to make this application run across multiple CPUs then there are a number of things you can try, depending on your setup.
The most obvious thing that comes to mind is making the application use threads and multiple processes. This will allow the application to "do more" at once. The issue you might have here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiple processes in the first place).
Secondly, make sure you are not opening and closing lots of database connections; ensure your application can hold a connection open for as long as it needs.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings then things are going to be slow.
Fourthly, do everything you can in SQL, leaving little manipulation to Python; SQL is extremely quick at data manipulation if you let it. As soon as you start pulling data out of the database and into code, things are going to slow down big time.
Fifthly, make use of stored procedures, which can be cached and optimized internally within the database. These can be a lot quicker than application-built queries, which cannot be optimized as easily.
Sixthly, don't save on each iteration of the loop. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one batch. This will reduce the amount of IO on each iteration and speed up the process massively.
Django does support a bulk update method; there was also a question on Stack Overflow a while back about saving multiple Django objects at once:
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
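A minimal sketch of that batch-save idea using Django's bulk_create (MyModel, its field, and the batch size are placeholders):

# Sketch: build objects in memory and insert them in batches, one INSERT
# per batch instead of one per iteration.
from myapp.models import MyModel  # placeholder import

batch = []
for i in range(10_000):
    batch.append(MyModel(value=i))      # no DB hit here
    if len(batch) >= 1000:
        MyModel.objects.bulk_create(batch)
        batch = []

if batch:
    MyModel.objects.bulk_create(batch)  # flush the remainder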
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.

Categories

Resources