(Unix/Python) Activity Monitor % CPU


I have a project I am working on for developing a distributed computing model for one of my other projects. One of my scripts is multiprocessed, monitoring for changes in directories, decoding pickles, creating authentication keys for them based on the node to which they are headed, and repickling them. (Edit: These processes all operate in loops.)
My question is, when looking at the activity monitor in OSX, my %CPU column displays 100% for the primary processes that run the scripts. These three that are showing 100% are the manager script, and the two nodes (I am simulating the model on one machine, the intent is to move the model to a live cluster network in the future). Is this bad? My system usage shows 27.50%, user 12.50%, and 65% idle.
I've attempted research myself, first, and my only thought is that these numbers display that the process is utilizing the CPU for the entire time it's alive, and is never idle.
Can I please get some clarification?
Update based on comments:
My processes run in an endless loop, monitoring for changes to files in their respective directories, in order to receive new 'jobs' from the manager process/script (a separate computer in the cluster in the project's final implementation). Maybe there is a better way to be waiting for I/O in this form that doesn't require so much processor time?

Solution (albeit sub-optimal):
I added a time.sleep(0.1) call at the end of each loop iteration, which brought the CPU usage down to no more than 0.4%.
Still, I am looking for a better way of reducing CPU time that does not rely on modules outside the Python Standard Library. I'd like to avoid a fixed time.sleep(n) period, because I want the system to be able to respond at any moment: if the load of input files gets very high, I do not want it to waste time sleeping when it could be processing the files.
My processes operate by busy waiting, so it was causing them to use excessive CPU time:
import os
while True:
    # os.listdir() lists a directory; os.path has no listdir()
    files = os.listdir('./sub_directory/')
    if files:
        do_something()
This is the basis for each script/process I am executing.
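One stdlib-only way to keep such a loop responsive without spinning is to sleep only when the directory is empty, and to go straight back for more work whenever files were found. The sketch below assumes that handling a file removes it from the directory (or moves it elsewhere). The names watch_directory, handle, idle_delay and max_idle_polls are illustrative and not part of the original script.

```python
import os
import time


def watch_directory(path, handle, idle_delay=0.1, max_idle_polls=None):
    """Call handle(full_path) for every file that appears in path.

    Sleeps only when the directory is empty, so a backlog of files is
    processed back to back. max_idle_polls is a hypothetical escape hatch
    for demos and tests; leave it as None to run forever.
    """
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        files = sorted(os.listdir(path))
        if files:
            idle_polls = 0
            for name in files:
                # Assumption: handle() consumes (deletes or moves) the file.
                handle(os.path.join(path, name))
        else:
            idle_polls += 1
            time.sleep(idle_delay)  # yield the CPU only when there is nothing to do
```

Under load the loop never sleeps, so the delay only costs latency on the first file to arrive after an idle period.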

Related

Pros and cons of using shared Value/Array vs Queue/Pipe in Python multiprocessing

I've been slowly learning to use the multiprocessing library in Python these last few days, and I've come to a point where I'm asking myself this question, and I can't find an answer to this.
I understand that the answer might vary depending on the application, so I'll explain what my application is.
I've created a scheduler in my main process that controls when multiple processes execute (these processes are spawned earlier, loop continuously and execute code when my scheduler raises a flag in a shared Value). Using counters in my scheduler, I can have multiple processes executing code at different frequencies (in the 100-400 Hz range), and they are all perfectly synchronized.
For example, one process executes a dynamic model of a quadcopter (ode) at 400 Hz and updates the quadcopter's state. My other processes (command generation and trajectory generation) run at lower frequencies (200 Hz and 100 Hz), but require the updated state. I see currently 2 methods of doing this:
With Pipes: This requires separate Pipes for the dynamics/control and dynamics/trajectory connections. Furthermore, I need the control and trajectory processes to use the latest calculated quadcopter's state, so I need to flush the Pipes until the last value in them. This works, but doesn't look very clean.
With a shared Value/Array : I would only need one Array for the state, my dynamics process would write to it, while my other processes would read from it. I would probably have to implement locks (can I read a shared Value/Array from 2 processes at the same time without a lock?). This hasn't been tested yet, but would probably be cleaner.
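The "flush the Pipe until the last value" step in the first method can be written as a small helper. This is only a sketch: latest_from_pipe is a hypothetical name, and it assumes the reader is the only consumer of that connection.

```python
from multiprocessing import Pipe


def latest_from_pipe(conn, default=None):
    """Drain conn and return the most recent item, or default if it is empty."""
    latest = default
    while conn.poll():          # True while at least one message is waiting
        latest = conn.recv()    # older messages are read and discarded
    return latest


if __name__ == "__main__":
    reader, writer = Pipe(duplex=False)
    for step in range(3):
        writer.send({"step": step})   # stand-in for the 400 Hz state updates
    print(latest_from_pipe(reader))
```

Each reader still needs its own Pipe, since a message received by one process is gone for the others.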
I've read around that it is a bad practice to use shared memory too much (why is that?). Yes, I'll be updating it at 400 Hz and reading it at 200 and 100 Hz, but it's not going to be such a large array (10-ish floats or doubles). However, I've also read that shared memory is faster than Pipes/Queues, and I would like to prioritize speed in my code, if it's not too much of an issue to use shared memory.
Mind you, I'll have to send generated commands to my dynamics process (another 5-ish floats), and generated desired states to my control process (another 10-ish floats), so that's either more shared Arrays, or more Pipes.
So I was wondering, for my application, what are the pros and cons of both methods. Thanks!
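For the shared Array method, a minimal sketch of lock-protected access might look like the following. Several readers can read at once without harm, but a reader that overlaps with the writer could see a half-updated state vector, which is why both helpers take the array's lock. STATE_SIZE, write_state and read_state are illustrative names, not anything from the question's code.

```python
from multiprocessing import Array

STATE_SIZE = 10  # assumption: the "10-ish" doubles mentioned in the question


def write_state(shared, values):
    """Copy a full state vector into the shared array under its lock."""
    with shared.get_lock():
        shared[:] = values


def read_state(shared):
    """Return a private snapshot (a plain list) of the shared state."""
    with shared.get_lock():
        return shared[:]


if __name__ == "__main__":
    state = Array("d", STATE_SIZE)  # lock=True by default, so get_lock() exists
    write_state(state, [float(i) for i in range(STATE_SIZE)])
    print(read_state(state))
```

In the real application the Array would be created in the main process and passed to each Process as an argument. Readers always get the latest state with no flushing.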

Queuing and dispatching long simulations

I am developing a Python module to manage the execution of long-baseline simulation processes. Whilst I have had some successes to date, I would like to improve it in order to make it more usable for its intended, er, users.
My code to date uses sys, time, subprocess and the package watchdog; it monitors a folder for new run files, places these in a FIFO queue and executes them simultaneously up to a pre-defined limit. This works, but is a bit clunky, and managing this dynamically is impractical.
I am not a computer scientist or a software engineer; this is why I am asking for help. Despite that, I am confident that I would be able to do the (bulk of the) coding work, but I would like some guidance on:
What are the computer science-y names for some of the concepts I'm describing
What packages are out there for doing some of these things in Python 3.x
What approaches you would use to do these things
What data structures you might use to do this
I appreciate any help you're able to offer, and of course please let me know if there is a better StackExchange site where I could submit this question. Thank you in advance!
Some detail about the simulations:
Entirely based in Windows 10
One of three or four different varieties
Executed from command line with reference to an input file
Simulations either fail instantly or almost instantly, or run for several hours, days or even weeks
Typical batches are up to tens of simulations
Simulations use lots of RAM and an entire CPU
I should also point out that, according to the terms of the simulation software licenses, I am expressly forbidden to use third-party cloud computing platforms.
Here's what I need the module to do:
Simulations dispatched automatically
Simultaneous simulations up to a limit
Queue runs indefinitely
Queue can be added to dynamically (reference in a new file)
Feedback on what's running and what's in the queue
First in, first out queuing
And some extra ideas, from users, about what would be nice to have:
Can dispatch simulations onto multiple computers
Simultaneous simulation limit could be changed dynamically
System of prioritisation
Simulations can be terminated early
If the module crashes, can be restarted without having to rebuild the queue
A browser-based interface where the queue could be viewed / managed
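The core requirements (first in, first out; a limit on simultaneous runs; feedback on what is running and waiting) can be sketched with the standard library alone. The Dispatcher class below is a hypothetical illustration, not the asker's module. It assumes each simulation is started as a command-line list suitable for subprocess.Popen.

```python
import subprocess
from collections import deque


class Dispatcher:
    """FIFO queue of commands, with at most `limit` running at once."""

    def __init__(self, limit):
        self.limit = limit          # can be changed while the queue is running
        self.waiting = deque()      # commands not yet started, oldest first
        self.running = []           # (command, Popen) pairs
        self.finished = []          # (command, return code) pairs

    def submit(self, command):
        self.waiting.append(command)

    def poll(self):
        """Reap finished runs, then start queued ones up to the limit."""
        still_running = []
        for command, proc in self.running:
            code = proc.poll()      # None while the process is still alive
            if code is None:
                still_running.append((command, proc))
            else:
                self.finished.append((command, code))
        self.running = still_running
        while self.waiting and len(self.running) < self.limit:
            command = self.waiting.popleft()
            self.running.append((command, subprocess.Popen(command)))

    def status(self):
        return {"waiting": len(self.waiting),
                "running": len(self.running),
                "finished": len(self.finished)}
```

Calling poll() every few seconds from the existing folder-watching loop would be enough to drive it. Because limit is read on every poll, changing it takes effect at the next poll. Persisting the waiting deque to disk is one possible route to restart-after-crash, but that is not shown here.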

Comparing two different processes based on the real, user and sys times

I have been through other answers on SO about real, user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times being reported by two different processes, achieving the same task.
I have a python program and a nodejs program https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs. Both use libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall clock time, as per other answers) is lower for NodeJS, which as per my understanding means that the entire process, from input to output, finishes much quicker on NodeJS. But the user and sys times are very high compared to Python. Also, using the htop utility, I see that the NodeJS process has a CPU usage of about 360% during the entire process, maxing out the 4 cores. Python, on the other hand, has a CPU usage from 250% to 120% during the entire process.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?

My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations, resize will thread well, with the CPU load increasing as the square of the reduction, the rotate is too simple to have much effect on runtime, and the crop is instant. Although the jpegsave is single-threaded (of course) vips runs this in a separate background thread from a write-behind buffer, so you effectively get it for free.
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.16 / 2.91, or about a 6.2x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has its own profiler which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads, how long they are spending in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html

Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is utilising more user space processor time, and in turn completes the task quicker. As stated by Luke Exton, the CPU is spending more time on "Code you wrote/might look at".
The higher sys time means there is more context switching happening, which makes sense from your htop utilisation numbers. This means the scheduler (kernel process) is jumping between Operating system actions, and user space actions. This is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one, and has many caveats. I would assume from the python vs Node numbers that the Python threads are longer, and in turn doing more processing inline. Another thing to note is the GIL in python. Essentially python is a single threaded application, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node implementation appears to be written to be correctly threaded and to split many tasks out. The advantages of a highly threaded application will have a tipping point where you will spend MORE time trying to find a free CPU for a new thread than actually doing the work. When this happens your Python implementation might start being faster again.

The higher user+sys time means that the process had more running threads and, as you noticed from the 360% figure, used almost all the available CPU resources of your 4 cores. That means the NodeJS process is already limited by available CPU resources and is unable to process more requests. Also, any other CPU-intensive processes that you might eventually run on that machine will compete with your NodeJS process. On the other hand, the Python process doesn't take all available CPU resources and could probably scale with the number of requests.

So these times are not reliable in and of themselves, they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine and could fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, a small real time means it actually finished faster. Does that mean it did the job better for sure? No. There could have been less happening on the machine at the same time, for instance.
In terms of scale, these numbers are a little irrelevant, and it depends on the architecture/bottlenecks. For instance, in terms of scale and specifically, cloud compute, it's about efficiently allocating resources and the relevant IO for each, generally (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine bottlenecks and specifics to be sure. It could for instance overwhelm your network link and then you are constrained there, before you hit compute limits. Or you might be constrained by how quickly you can write to the disk.
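Given the definitions of real, user and sys above, one quick sanity check is (user + sys) / real, which gives the average number of cores the process kept busy. The sketch below applies it to the figures quoted in the question. The helper names are illustrative, and the parser only handles the XmY.YYYs format shown there.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m17.253s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def average_busy_cores(real, user, sys_time):
    """(user + sys) / real: CPU-seconds consumed per wall-clock second."""
    return (to_seconds(user) + to_seconds(sys_time)) / to_seconds(real)


# Figures quoted in the question.
python_cores = average_busy_cores("1m17.253s", "1m54.766s", "0m2.988s")
node_cores = average_busy_cores("1m3.616s", "3m25.097s", "0m8.494s")
print(f"Python: {python_cores:.2f} cores, NodeJS: {node_cores:.2f} cores")
```

This works out to roughly 1.5 cores for Python and 3.4 for NodeJS, which is in line with the htop readings reported in the question.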

One potentially important aspect of this which no one mentioned is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many
threads as you have CPU cores. Use vips_concurrency_set() to change
this.
This explains the thing that surprised me initially the most -- NodeJS should (to my understanding) be pretty single threaded, just as Python with its GIL. It being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)

Python CGI queue

I'm working on a fairly simple CGI with Python. I'm about to put it into Django, etc. The overall setup is pretty standard server side (i.e. computation is done on the server):
User uploads data files and clicks "Run" button
Server forks jobs in parallel behind the scenes, using lots of RAM and processor power. ~5-10 minutes later (average use case), the program terminates, having created a file of its output and some .png figure files.
Server displays web page with figures and some summary text
I don't think there are going to be hundreds or thousands of people using this at once. However, because the computation takes a fair amount of RAM and processor power (each instance forks the most CPU-intensive task using Python's Pool), I wondered whether it would be worth the trouble to use a queueing system. I came across a Python module called beanstalkc, but on the page it said it was an "in-memory" queueing system.
What does "in-memory" mean in this context? I worry about memory, not just CPU time, and so I want to ensure that only one job runs (or is held in RAM, whether it receives CPU time or not) at a time.
Also, I was trying to decide whether
the result page (served by the CGI) should tell you its position in the queue (until it runs and then displays the actual results page)
OR
the user should submit their email address to the CGI, which will email them the link to the results page when it is complete.
What do you think is the appropriate design methodology for a light traffic CGI for a problem of this sort? Advice is much appreciated.
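To make the "one job at a time, with a visible queue position" idea concrete, here is a minimal in-process sketch. It assumes a single long-lived worker process that owns the queue, rather than one process per CGI request, since separate CGI processes would not share this object. SerialJobQueue and its methods are hypothetical names.

```python
import itertools
import threading
from collections import deque


class SerialJobQueue:
    """Runs one job at a time and reports each waiting job's position."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiting = deque()            # (job_id, callable), oldest first
        self._ids = itertools.count(1)
        self.results = {}                  # job_id -> return value

    def submit(self, func):
        with self._lock:
            job_id = next(self._ids)
            self._waiting.append((job_id, func))
            return job_id

    def position(self, job_id):
        """1 means next to run; None means not waiting (running, done or unknown)."""
        with self._lock:
            for index, (waiting_id, _) in enumerate(self._waiting, start=1):
                if waiting_id == job_id:
                    return index
        return None

    def run_next(self):
        """Run the oldest waiting job to completion; return its id, or None."""
        with self._lock:
            if not self._waiting:
                return None
            job_id, func = self._waiting.popleft()
        self.results[job_id] = func()      # heavy work happens outside the lock
        return job_id
```

A worker would simply call run_next() in a loop, so only one job is ever held in RAM, while the results page calls position(job_id) to show the user where they are in the queue.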

Definitely use celery. You can run an amqp server, or I think you can use the database as a queue for the messages. It allows you to run tasks in the background, and it can use multiple worker machines to do the processing if you want. It can also do cron jobs that are database based if you use django-celery.
It's as simple as this to run a task in the background:
@task
def add(x, y):
    return x + y
In a project I have it's distributing the work over 4 machines and it works great.

Python parallel processing libraries

Python seems to have many different packages available to assist one in parallel processing on an SMP based system or across a cluster. I'm interested in building a client server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for awhile. I need to collect enough samples from the simulation to estimate a mean within a user specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which report back to the server at some interval with the samples that they have collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the processes.
With this need for intercommunication between the client and server processes, I question whether batch-scheduling is a viable solution. Sorry I should have been more clear to begin with.
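The server-side stopping rule described above can be sketched as follows. This assumes a normal-approximation confidence interval with z = 1.96 (roughly 95%), which is my assumption and not something stated in the question. The function names are illustrative.

```python
import math
import statistics


def ci_half_width(samples, z=1.96):
    """Half-width of a normal-approximation confidence interval for the mean.

    z=1.96 corresponds to roughly 95% confidence; this is an assumption,
    and a t-quantile would be more appropriate for small sample counts.
    """
    if len(samples) < 2:
        return math.inf
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


def needs_more_samples(samples, target_half_width, z=1.96):
    """True while the interval is still wider than the user's target."""
    return ci_half_width(samples, z) > target_half_width
```

The server would pool the samples reported by all clients, call needs_more_samples() after each report, and tell the clients to stop once it returns False.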

Have a go with ParallelPython. Seems easy to use, and should provide the jobs and queues interface that you want.

There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.

Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.

The simplest way to do this would probably be just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see if they're sufficient or if more jobs need to be submitted.
