ansible runner runs too long - python

When I use Ansible's Python API to run a script on remote machines (thousands of them), the code is:
import ansible.runner

runner = ansible.runner.Runner(
    module_name='script',
    module_args='/tmp/get_config.py',
    pattern='*',
    forks=30
)
Then I call:
datastructure = runner.run()
This takes too long. I want to insert each machine's stdout from datastructure into MySQL. What I would like is: as soon as one machine returns its data, insert that data into MySQL, then move on to the next, until all the machines have returned.
Is this a good idea, or is there a better way?

The runner call will not complete until every machine has returned data, couldn't be contacted, or had its SSH session time out. Given that this targets thousands of machines and you're only running 30 in parallel (forks=30), it's going to take roughly Time_to_run_script * Num_Machines / 30 to complete; for example, 3,000 machines at 60 seconds each works out to 3,000 / 30 * 60 s, about an hour and forty minutes. Does that align with your expectation?
You could raise forks to a much higher number to have the runner complete sooner. I've pushed this into the hundreds without much issue.
If you want maximum visibility into what's going on, and aren't sure whether one machine is holding you up, you could run through each host serially in your Python code.
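A rough sketch of that serial approach against the Ansible 1.x Runner API, inserting each host's stdout into MySQL as soon as it comes back. The host list, the pymysql driver, the connection details and the host_config table are all placeholders for whatever you actually use:
import ansible.runner
import pymysql  # any MySQL driver works; pymysql is just an example

conn = pymysql.connect(host='localhost', user='me', password='secret', db='inventory')
hosts = ['web01', 'web02']  # however you enumerate your thousands of machines

with conn.cursor() as cur:
    for host in hosts:
        result = ansible.runner.Runner(
            module_name='script',
            module_args='/tmp/get_config.py',
            pattern=host,
            forks=1,
        ).run()
        data = result['contacted'].get(host)
        if data:
            # hypothetical table: host_config(host VARCHAR, stdout TEXT)
            cur.execute(
                "INSERT INTO host_config (host, stdout) VALUES (%s, %s)",
                (host, data.get('stdout', '')),
            )
            conn.commit()
Running one host at a time is of course the slowest possible schedule, so treat this as a debugging aid rather than the production approach.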
FYI, this module and class are completely gone in Ansible 2.0, so you might want to make the jump now to avoid having to rewrite code later.

Related

How to properly administrate a database using a script

I use a Python script to insert data into my database (using pandas and SQLAlchemy). The script reads from various sources, cleans the data and inserts it into the database. I plan on running this script once in a while to completely overwrite the existing data with more recent data.
At first I wanted to have a single service and simply add an endpoint that would require higher privileges to run the script. But in the end that looks a bit odd and, more importantly, the Python script uses quite a lot of memory (~700 MB), which makes me wonder how I should configure my deployment.
Increasing the memory limit of my pod for this once-in-a-while operation looks like a bad idea to me, but I'm quite new to Kubernetes, so maybe I'm wrong. Hence this question.
So what would be a good (better) solution? Run another service just for this, or simply connect to my machine and run the update manually using the Python script directly?
To run on demand:
https://kubernetes.io/docs/concepts/workloads/controllers/job/
This creates a Pod that runs to completion (exit) exactly once - a Job.
To run on a schedule:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
Every time the schedule fires, this creates a new, separate Job.
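If you'd rather create the one-off Job from Python than from a YAML manifest, a minimal sketch with the official kubernetes client could look like this; the image name, namespace and memory figure are placeholders you would replace with your own:
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="db-refresh"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="db-refresh",
                    image="registry.example.com/db-refresh:latest",  # placeholder image
                    # give only this Pod the memory the ~700 MB script needs
                    resources=client.V1ResourceRequirements(
                        requests={"memory": "1Gi"},
                        limits={"memory": "1Gi"},
                    ),
                )],
            ),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
This way the memory limit lives on the Job's Pod only, and your regular service pods keep their smaller limits.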

How to check CPU usage before running a Python script by shell_exec? Or alternative queuing approaches to prevent server overload?

I have an Apache-based PHP website that:
Accepts a document uploaded by a user (typically a text file)
Runs a Python script over the uploaded file using shell_exec (this produces a text file output locally on disk); the PHP call looks something like $result = shell_exec("python3 memory_hog_script.py user_uploaded_text_file.txt");
Shows the text output to the user
The Python script executed by PHP's shell_exec typically takes 5-10 seconds to run, but uses open-source libraries that take up a lot of memory/CPU (e.g. 50-70%) while it runs.
More commonly now, I have multiple users uploading files at the exact same time. When a handful of users upload at once, my CPU gets overloaded, since a separate instance of the memory-hogging Python script gets executed via shell_exec for each user. When this happens, the server crashes.
I'm trying to figure out the best way to handle this. I had a few ideas, but wasn't sure which one is the best approach or even feasible:
Before shell_exec gets run by PHP, check CPU usage and see if it is greater than a certain threshold (e.g. 70%); if so, wait 10 seconds and try again (see the sketch after this list)
Alternatively, I can check the database for current active uploads; if there are more than a certain threshold (e.g. 5 uploads), then do something
Create some sort of queueing system; I'm not sure what my options are here, but I imagine there is a way to move the shell_exec into some kind of managed queue - I'm very happy to look into new technologies here, so if there are any approaches you can introduce me to, I'd love to dig deeper!
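For idea 1, a minimal sketch of what I have in mind, assuming the check could live at the top of the Python script itself rather than in the PHP layer (the threshold and wait times are made up):
import os
import sys
import time

MAX_LOAD = 0.7 * os.cpu_count()  # "70%" expressed as a 1-minute load-average threshold
MAX_WAIT = 120                   # give up after two minutes

waited = 0
while os.getloadavg()[0] > MAX_LOAD:
    if waited >= MAX_WAIT:
        sys.exit("Server busy, please try again later")  # bail out instead of piling onto an overloaded box
    time.sleep(10)
    waited += 10

# ...the heavy processing of the uploaded file goes here...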
Any input would be much appreciated! Thank you :)

postgres database: When does a job get killed

I am using a Postgres database with SQLAlchemy and Flask. I have a couple of jobs that have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can fetch from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries at a time.
If I fetch too many entries, the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (SQLAlchemy, Postgres)?
2. Since this behaves differently on my local machine, there must be a way to control it. Where would that be?
thanks
carl
Just the message "Killed" appearing in the terminal window usually means the kernel ran out of memory and killed the process as an emergency measure.
Most libraries that connect to PostgreSQL will read the entire result set into memory by default. But some libraries have a way to process the results row by row, so they aren't all read into memory at once. I don't know whether Flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
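For example, a sketch assuming plain SQLAlchemy underneath Flask (the Entry model, the DSN and the per-row work are placeholders): yield_per fetches rows in chunks rather than materializing the whole result set.
from sqlalchemy import create_engine, Column, Integer, Text
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Entry(Base):           # hypothetical model standing in for the real table
    __tablename__ = "entries"
    id = Column(Integer, primary_key=True)
    payload = Column(Text)

engine = create_engine("postgresql://user:password@host/dbname")  # placeholder DSN

processed = 0
with Session(engine) as session:
    # yield_per(2000) fetches rows in chunks of 2000; on SQLAlchemy 1.4+ it also
    # enables streaming from the server, so the whole table is never held in
    # memory at once.
    for entry in session.query(Entry).yield_per(2000):
        processed += 1       # placeholder for the real per-row work
print(processed)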
Most likely the kernel is killing your Python script; Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in a single Python process. Python does not release all the memory it has used, so memory usage grows until the process gets killed. (You can watch this with the top command.)
You should try adapting your script to process 2000 records per run and then quit. If you run it multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in a separate worker. Run the jobs serially and let the workers die when they finish; that way they release their memory back to the OS when they exit.
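A rough sketch of that "one worker process per batch" idea; process_batch and the row count are placeholders for your real update logic:
from multiprocessing import Process

BATCH_SIZE = 2000

def process_batch(offset):
    # open the database connection inside the worker, not in the parent,
    # and update rows [offset, offset + BATCH_SIZE) here
    print("processing rows", offset, "to", offset + BATCH_SIZE)

if __name__ == "__main__":
    total_rows = 100_000  # placeholder; query the real row count instead
    for offset in range(0, total_rows, BATCH_SIZE):
        worker = Process(target=process_batch, args=(offset,))
        worker.start()
        worker.join()  # batches run one after another; each worker's memory
                       # is returned to the OS when its process exits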

ubuntu django: run management commands much faster (I tried renice by setting -18 priority to the python process pid)

I am using Ubuntu. I have some management commands which, when run, do lots of database manipulation, so they take nearly 15 minutes.
My system monitor shows that my machine has 4 CPUs and 6 GB of RAM. But this process is not utilising all the CPUs: I think it is using only one of them, and very little RAM at that. I think that if I can make it use all the CPUs and most of the RAM, the process will complete in much less time.
I tried renice, setting the priority to -18 (meaning very high), but it is still just as slow.
Details:
It's a Python script with roughly ten loops of nearly 10,000 iterations each, and every iteration saves to the Postgres database.
If you are looking to make this application run across multiple CPUs, there are a number of things you can try depending on your setup.
The most obvious thing that comes to mind is making the application use threads and multiple processes. This will allow the application to "do more" at once. The issue you might have here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiple processes in the first place).
Secondly, make sure you are not opening and closing lots of database connections; ensure your application can hold a connection open for as long as it needs.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings, things are going to be slow.
Fourthly, do everything you can in SQL, leaving as little manipulation as possible to Python; SQL is extremely quick at data manipulation if you let it. As soon as you start taking data out of the database and into code, things slow down considerably.
Fifthly, make use of stored procedures, which can be cached and optimized internally by the database. These can be a lot quicker than application-built queries, which cannot be optimized as easily.
Sixthly, don't save on each iteration of the program. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one go. This reduces the amount of IO on each iteration and speeds up the process massively.
Django does support bulk operations; there was also a question on Stack Overflow a while back about saving multiple Django objects at once:
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
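For example, a sketch of that batch-style save using Django's bulk_create (the Record model, the app name and the numbers are placeholders):
from django.db import transaction
from myapp.models import Record  # hypothetical app and model

objs = [Record(value=i) for i in range(10000)]  # build objects in memory, no DB hits yet

with transaction.atomic():
    # a handful of multi-row INSERTs instead of 10,000 individual saves
    Record.objects.bulk_create(objs, batch_size=1000)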
Just in case: did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case the process will be given the lowest priority.

Python (or maybe Linux in general) file operation flow control or file lock

I am using a cluster of computers to do some parallel computation. My home directory is shared across the cluster. On one machine I have Ruby code that creates bash scripts containing the computation commands and writes them to, say, the ~/q/ directory. The scripts are named *.worker1.sh, *.worker2.sh, etc.
On 20 other machines I have 20 Python programs running (one on each machine) that constantly check the ~/q/ directory and look for jobs that belong to that machine, using Python code like this:
import glob, os

jobs = glob.glob('q/*.worker1.sh')
[os.system('sh ' + job + ' &') for job in jobs]
For some additional control, the Ruby code creates an empty file like workeri.start (i = 1..20) in the q directory after it writes the bash script there, and the Python code checks for that 'start' file before it runs the code above. In the bash script, if the command finishes successfully, the script creates an empty file like workeri.success; the Python code checks for this file after running the code above to make sure the computation finished successfully. If Python finds that the computation finished successfully, it removes the 'start' file from the q directory, so the Ruby code knows the job finished successfully. After all 20 bash scripts have finished, the Ruby code creates new bash scripts, the Python code reads and executes the new scripts, and so on.
I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines.
Now the question is: I expect the 20 jobs to run more or less in parallel, so the total time to finish all 20 jobs should not be much longer than the time to finish one job. However, it seems that the jobs run sequentially and the total time is much longer than I expected.
I suspect that part of the reason is that multiple programs are reading and writing the same directory at once, and that Linux or Python locks the directory and only allows one process to operate on it at a time, which would make the jobs execute one at a time.
I am not sure if this is the case. If I split the bash scripts into different directories, and let the Python code on different machines read and write different directories, will that solve the problem? Or is there some other reason causing it?
Thanks a lot for any suggestions! Let me know if I didn't explain anything clearly.
Some additional info:
my home directory is at /home/my_group/my_home, here is the mount info for it
:/vol/my_group on /home/my_group type nfs (rw,nosuid,nodev,noatime,tcp,timeo=600,retrans=2,rsize=65536,wsize=65536,addr=...)
When I say the Python code constantly checks the q directory, I mean a loop like this:
while True:
    if the 'start' file exists:
        find the scripts and execute them as mentioned above
"I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines."
While this isn't directly what you asked, you should really, really consider fixing your problem at this level. Using some sort of shared message queue is likely to be a lot simpler to manage and debug than relying on the locking semantics of a particular networked filesystem.
The simplest solution to set up and run, in my experience, is Redis on the machine currently running the Ruby script that creates the jobs. It should literally be as simple as downloading the source, compiling it and starting it up. Once the Redis server is up and running, you change your code to append the computation commands to one or more Redis lists. In Ruby you would use the redis-rb library like this:
require "redis"
redis = Redis.new
# Your other code to build up command lists...
redis.lpush 'commands', command1, command2...
If the computations need to be handled by certain machines, use a list per-machine like this:
redis.lpush 'jobs:machine1', command1
# etc.
Then in your Python code, you can use redis-py to connect to the Redis server and pull jobs off the list like so:
import os
from redis import Redis

r = Redis(host="hostname-of-machine-running-redis")
while r.llen('jobs:machine1'):
    job = r.lpop('jobs:machine1').decode()  # redis-py returns bytes on Python 3
    os.system('sh ' + job + ' &')
Of course, you could just as easily pull jobs off the queue and execute them in Ruby:
require 'redis'
redis = Redis.new(:host => 'hostname-of-machine-running-redis')
while redis.llen('jobs:machine1') > 0
  job = redis.lpop('jobs:machine1')
  `sh #{job} &`
end
With some more details about the needs of the computation and the environment it's running in, it would be possible to recommend even simpler approaches to managing it.
Try a while loop? If that doesn't work, on the Python side try using a try statement like so:
try:
    with open("myfile.whatever", "r") as f:
        f.read()
except IOError:
    pass  # do something if it doesn't work (must be in a loop to keep checking)
else:
    pass  # execute your code if the file opened successfully
