How to check cpu consumption by a python script

How to check cpu consumption by a python script - python

I have implemented kalman filter. I want to find out how much of cpu energy is being consumed by my script. I have checked other posts on Stackoverflow and following them I downloaded psutil library. Now, I am unaware of where to put the statements to get the correct answer. Here is my code:
if __name__ == "__main__":
#kalman code
pid = os.getpid()
py = psutil.Process(pid)
current_process = psutil.Process();
memoryUse = py.memory_info()[0]/2.**30 # memory use in GB...I think
print('memory use:', memoryUse)
print(current_process.cpu_percent())
print(psutil.virtual_memory()) # physical memory usage
Please inform whether I am headed in the right direction or not.
The above code generated following results.
('memory use:', 0.1001129150390625)
0.0
svmem(total=6123679744, available=4229349376, percent=30.9, used=1334358016, free=3152703488, active=1790803968, inactive=956125184, buffers=82894848, cached=1553723392, shared=289931264, slab=132927488)
Edit: Goal: find out energy consumed by CPU while running this script

I'm not sure whether you're trying to find the peak CPU usage during execution of that #kalman code, or the average, or the total, or something else?
But you can't get any of those by calling cpu_percent() after it's done. You're just measuring the CPU usage of calling cpu_percent().
Some of these, you can get by calling cpu_times(). This will tell you how many seconds were spent doing various things, including total "user CPU time" and "system CPU time". The docs indicate exactly which values you get on each platform, and what they mean.
In fact, if this is all you want (including only needing those two values), you don't even need psutils, you can just use os.times() in the standard library.
Or you can just use the time command from outside your program, by running time python myscript.py instead of python myscript.py.
For others, you need to call cpu_percent while your code is running. One simple solution is to run a background process with multiprocessing or concurrent.futures that just calls cpu_percent on your main process. To get anything useful, the child process may need to call it periodically, aggregating the results (e.g., to find the maximum), until it's told to stop, at which point it can return the aggregate.
Since this is not quite trivial to write, and definitely not easy to explain without knowing how much familiarity you have with multiprocessing, and there's a good chance cpu_times() is actually what you want here, I won't go into details unless asked.

Related

How can I make python busy-wait for an exact duration?

More out of curiosity, I was wondering how might I make a python script sleep for 1 second without using the time module?
Is there a computation that can be conducted in a while loop which takes a machine of n processing power a designated and indexable amount of time?

As mentioned in comments for your second part of question:
The processing time is depends on the machine(computer and its configuration) you are working with and active processes on it. There isnt fixed amount of time for an operation.
It's been a long time since you could get a reliable delay out of just trying to execute code that would take a certain time to complete. Computers don't work like that any more.
But to answer your first question: you can use system calls and open a os process to sleep 1 second like:
import subprocess
subprocess.run(["sleep", "1"])

How to measure process memory usage in given moment

Here is the way to measure peak memory usage of current process since the start of the process.
process= psutil.Process(os.getpid())
process.memory_full_info().peak_wset
But what if I want to do few measurements for different parts(functions) of the program? How can I get memory used by program in any desired moment to check difference before and after?
Maybe there is the way to reset peak_wset?

Currently you no longer need os.getpid() when inspecting current process. Just use psutil.Process()
1) To measure if the peak memory increased or not (never decreased) before a function call, invoke this before and after a function call and take the difference:
On Windows:
psutil.Process().memory_info().peak_wset
psutil.Process().memory_full_info().peak_wset
On Linux:
I read here the resource.getrusage(resource.RUSAGE_SELF).ru_maxrss would do the trick but I haven't tested.
2) To measure the current memory variation before and after a function call, invoke it before and after a function call and take the difference:
On Windows and Linux:
psutil.Process().memory_info().rss
psutil.Process().memory_full_info().rss
psutil.Process().memory_full_info().uss
From docs:
memory_full_info() returns the same information as memory_info(), plus, on some platform (Linux, macOS, Windows), also provides additional metrics (USS, PSS and swap). The additional metrics provide a better representation of “effective” process memory consumption (in case of USS) as explained in detail in this blog post. It does so by passing through the whole process address. As such it usually requires higher user privileges than memory_info() and is considerably slower. On platforms where extra fields are not implemented this simply returns the same metrics as memory_info().
uss (Linux, macOS, Windows): aka “Unique Set Size”, this is the memory which is unique to a process and which would be freed if the process was terminated right now.
Note: uss is probably the most representative metric for determining how much memory is actually being used by a process.
3) To measure the specific memory a function call used during its execution, before any garbage collection takes place, you need a memory profiler.

Comparing two different processes based on the real, user and sys times

I have been through other answers on SO about real,user and sys times. In this question, apart from the theory, I am interested in understanding the practical implications of the times being reported by two different processes, achieving the same task.
I have a python program and a nodejs program https://github.com/rnanwani/vips_performance. Both work on a set of input images and process them to obtain different outputs. Both using libvips implementations.
Here are the times for the two
Python
real 1m17.253s
user 1m54.766s
sys 0m2.988s
NodeJS
real 1m3.616s
user 3m25.097s
sys 0m8.494s
The real time (the wall clock time as per other answers is lesser for NodeJS, which as per my understanding means that the entire process from input to output, finishes much quicker on NodeJS. But the user and sys times are very high as compared to Python. Also using the htop utility, I see that NodeJS process has a CPU usage of about 360% during the entire process maxing out the 4 cores. Python on the other hand has a CPU usage from 250% to 120% during the entire process.
I want to understand a couple of things
Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?

My guess would be that node is running more than one vips pipeline at once, whereas python is strictly running one after the other. Pipeline startup and shutdown is mostly single-threaded, so if node starts several pipelines at once, it can probably save some time, as you observed.
You load your JPEG images in random access mode, so the whole image will be decompressed to memory with libjpeg. This is a single-threaded library, so you will never see more than 100% CPU use there.
Next, you do resize/rotate/crop/jpegsave. Running through these operations, resize will thread well, with the CPU load increasing as the square of the reduction, the rotate is too simple to have much effect on runtime, and the crop is instant. Although the jpegsave is single-threaded (of course) vips runs this in a separate background thread from a write-behind buffer, so you effectively get it for free.
I tried your program on my desktop PC (six hyperthreaded cores, so 12 hardware threads). I see:
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m2.907s
user 0m9.744s
sys 0m0.784s
That looks like we're seeing 9.7 / 2.9, or about a 3.4x speedup from threading, but that's very misleading. If I set the vips threadpool size to 1, you see something closer to the true single-threaded performance (though it still uses the jpegsave write-behind thread):
$ export VIPS_CONCURRENCY=1
$ time ./rahul.py indir outdir
clearing output directory - outdir
real 0m18.160s
user 0m18.364s
sys 0m0.204s
So we're really getting 18.1 / 2.97, or a 6.1x speedup.
Benchmarking is difficult and real/user/sys can be hard to interpret. You need to consider a lot of factors:
Number of cores and number of hardware threads
CPU features like SpeedStep and TurboBoost, which will clock cores up and down depending on thermal load
Which parts of the program are single-threaded
IO load
Kernel scheduler settings
And I'm sure many others I've forgotten.
If you're curious, libvips has it's own profiler which can help give more insight into the runtime behaviour. It can show you graphs of the various worker threads, how long they are spending in synchronisation, how long in housekeeping, how long actually processing your pixels, when memory is allocated, and when it finally gets freed again. There's a blog post about it here:
http://libvips.blogspot.co.uk/2013/11/profiling-libvips.html

Does a smaller real time and a higher user+sys time mean that the process (in this case Node) utilizes the CPU more efficiently to complete the task sooner?
It doesn't necessarily mean they utilise the processor(s) more efficiently.
The higher user time means that Node is utilising more user space processor time, and in turn complete the task quicker. As stated by Luke Exton, the cpu is spending more time on "Code you wrote/might look at"
The higher sys time means there is more context switching happening, which makes sense from your htop utilisation numbers. This means the scheduler (kernel process) is jumping between Operating system actions, and user space actions. This is the time spent finding a CPU to schedule the task onto.
What is the practical implication of these times - which is faster/better/would scale well as the number of requests increase?
The question of implementation is a long one, and has many caveats. I would assume from the python vs Node numbers that the Python threads are longer, and in turn doing more processing inline. Another thing to note is the GIL in python. Essentially python is a single threaded application, and you can't easily break out of this. This could be a contributing factor to the Node implementation being quicker (using real threads).
The Node appears to be written to be correctly threaded and to split many tasks out. The advantages of the highly threaded application will have a tipping point where you will spend MORE time trying to find a free cpu for a new thread, than actually doing the work. When this happens your python implementation might start being faster again.

The higher user+sys time means that the process had more running threads and as you've noticed by looking at 360% used almost all available CPU resources of your 4-cores. That means that NodeJS process is already limited by available CPU resources and unable to process more requests. Also any other CPU intensive processes that you could eventually run on that machine will hit your NodeJS process. On the other hand Python process doesn't take all available CPU resources and probably could scale with a number of requests.

So these times are not reliable in and of themselves, they say how long the process took to perform an action on the CPU. This is coupled very tightly to whatever else was happening at the same time on that machine and could fluctuate wildly based entirely on physical resources.
In terms of these times specifically:
real = Wall Clock time (Start to finish time)
user = Userspace CPU time (i.e. Code you wrote/might look at) e.g. node/python libs/your code
sys = Kernel CPU time (i.e. Syscalls, e.g Open a file from the OS.)
Specifically, small real time means it actually finished faster. Does it mean it did it better for sure, NO. There could have been less happening on the machine at the same time for instance.
In terms of scale, these numbers are a little irrelevant, and it depends on the architecture/bottlenecks. For instance, in terms of scale and specifically, cloud compute, it's about efficiently allocating resources and the relevant IO for each, generally (compute, disk, network). Does processing this image as fast as possible help with scale? Maybe? You need to examine bottlenecks and specifics to be sure. It could for instance overwhelm your network link and then you are constrained there, before you hit compute limits. Or you might be constrained by how quickly you can write to the disk.

One potentially important aspect of this which no one mention is the fact that your library (vips) will itself launch threads:
http://www.vips.ecs.soton.ac.uk/supported/current/doc/html/libvips/using-threads.html
When libvips calculates an image, by default it will use as many
threads as you have CPU cores. Use vips_concurrency_set() to change
this.
This explains the thing that surprised me initially the most -- NodeJS should (to my understanding) be pretty single threaded, just as Python with its GIL. It being all about asynchronous processing and all.
So perhaps Python and Node bindings for vips just use different threading settings. That's worth investigating.
(that said, a quick look doesn't find any evidence of changes to the default concurrency levels in either library)

issue in trying to execute certain number of python scripts at certain intervals

I am trying to execute certain number of python scripts at certain intervals. Each script takes a lot of time to execute and hence I do not want to waste time in waiting to run them sequentially. I tired this code but it is not executing them simultaneously and is executing them one by one:
Main_file.py
import time
def func(argument):
print 'Starting the execution for argument:',argument
execfile('test_'+argument+'.py')
if __name__ == '__main__':
arg = ['01','02','03','04','05']
for val in arg:
func(val)
time.sleep(60)
What I want is to kick off by starting the executing of first file(test_01.py). This will keep on executing for some time. After 1 minute has passed I want to start the simultaneous execution of second file (test_02.py). This will also keep on executing for some time. Like this I want to start the executing of all the scripts after gaps of 1 minute.
With the above code, I notice that the execution is happening one after other file and not simultaneously as the print statements which are there in these files appear one after the other and not mixed up.
How can I achieve above needed functionality?

Using python 2.7 on my computer, the following seems to work with small python scripts as test_01.py, test_02.py, etc. when threading with the following code:
import time
import thread
def func(argument):
print('Starting the execution for argument:',argument)
execfile('test_'+argument+'.py')
if __name__ == '__main__':
arg = ['01','02','03']
for val in arg:
thread.start_new_thread(func, (val,))
time.sleep(10)
However, you indicated that you kept getting a memory exception error. This is likely due to your scripts using more stack memory than was allocated to them, as each thread is allocated 8 kb by default (on Linux). You could attempt to give them more memory by calling
thread.stack_size([size])
which is outlined here: https://docs.python.org/2/library/thread.html
Without knowing the number of threads that you're attempting to create or how memory intensive they are, it's difficult to if a better solution should be sought. Since you seem to be looking into executing multiple scripts essentially independently of one another (no shared data), you could also look into the Multiprocessing module here:
https://docs.python.org/2/library/multiprocessing.html

If you need them to run parallel you will need to look into threading. Take a look at https://docs.python.org/3/library/threading.html or https://docs.python.org/2/library/threading.html depending on the version of python you are using.

CPU (all cores) become idle during python multiprocessing on windows

My system is windows 7. I wrote python program to do data analysis. I use multiprocessing library to achieve parallelism. When I open windows powershell, and type python MyScript.py. It starts to use all the cpu cores. But after a while, the CPU (all cores) became idle. But if I hit Enter in powershell window, all cores are back to full-load. To be clear, the program is fine, and has been tested. The problem here is that CPU-cores went idle by themselves.
This happened not only on my office computer, which runs Windows 7 Pro, but also on my home desktop, which runs Windows 7 Ultimate.
The parallel part of the program is very simple:
def myfunc(input):
##some operations based on a huge data and a small data##
operation1: read in a piece of HugeData #query based HDF5
operation2: some operation based on HugeData and SmallData
return output
# read in Small data
SmallData=pd.read_csv('data.csv')
if __name__ == '__main__':
pool = mp.Pool()
result=pool.map_async(myfunc, a_list_of_input)
out=result.get()
My function are mainly data manipulations using Pandas.
There is nothing wrong with the program, because I've successfully finished my program couple times. But I have to keep watching it, and hit Enter when cores become idle. The job takes couple hours, and I really don't keep watching it.
Is this a problem of windows system itself or my program?
By the way, can all the cores have access to the same variable stored in the memory? e.g. I have a data set mydata read into memory right before if __name__ == '__main__':. This data will be used in myfunc. All the cores should be able to access mydata in the same time, right?
Please help!

I was re-directed to this question as I was facing a similar problem while using Python's Multiprocessing library in Ubuntu. In my case, the processes do not start by hitting enter or such, however, they start after sometime abruptly. My code is an iterative heuristic that uses multiprocessing in each of its iterations. I have to rerun the code after completion of some iterations in order to get a steady runtime-performance. As the question was posted way long back, did you come across the actual reason behind it and solution to it?

I confess to not understanding the subtleties of map_async, but I'm not sure whether you can use it like that (I can't seem to get it to work at all)...
I usually use the following recipe (a list comprehension of the calls I want doing):
In [11]: procs = [multiprocessing.Process(target=f, args=()) for _ in xrange(4)]
....: for p in procs: p.start()
....: for p in procs: p.join()
....:
It's simple and waits until the jobs are finished before continuing.
This works fine with pandas objects provided you're not doing modifications... (I think) copies of the object are passed to each thread and if you perform mutations they not propogate and will be garbage collected.
You can use multiprocessing's version of a dict or list with the Manager class, this is useful for storing the result of each job (simply access the dict/list from within the function):
mgr = multiproccessing.Manager()
d = mgr.dict()
L = mgr.list()
and they will have shared access (as if you had written a Lock). It's hardly worth mentioning, that if you are appending to a list then the order will not just be the same as procs!
You may be able to do something similar to the Manager for pandas objects (writing a lock to objects in memory without copying), but I think this would be a non-trivial task...

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.