Pretty simple: I'd like to run an external command/program from within a Python script, and once it has finished I'd also like to know how much CPU time it consumed.
Hard mode: running multiple commands in parallel must not cause inaccuracies in the measured CPU time.
On UNIX: either (a) use the resource module (see also the answer by icktoofay), or (b) use the time command and parse its output, or (c) use the /proc filesystem, parsing /proc/[pid]/stat and extracting the utime and stime fields. The last of these is Linux-specific.
Example of using resource:
import subprocess, resource

usage_start = resource.getrusage(resource.RUSAGE_CHILDREN)
subprocess.call(["yourcommand"])
usage_end = resource.getrusage(resource.RUSAGE_CHILDREN)
# user CPU time of the child; add the ru_stime delta as well
# if you also want system CPU time
cpu_time = usage_end.ru_utime - usage_start.ru_utime
Note: it is not necessary to do fork/execvp here; subprocess.call() and the other subprocess helpers are fine and much easier to use.
Note: you can run multiple commands from the same Python script simultaneously, either with subprocess.Popen or with subprocess.call plus threads, but resource won't return their correct individual CPU times; it returns the sum of their times between the two calls to getrusage. To get the individual times, either run one small Python wrapper per command that times it as above (you could launch those wrappers from your main script), or use the time method below, which works correctly with multiple simultaneous commands (time is essentially just such a wrapper).
Example of using time:
import subprocess

# the `time` command writes its summary to stderr, not stdout,
# and subprocess needs a real pipe rather than a StringIO object
proc = subprocess.run(["time", "yourcommand", "youroptions"],
                      capture_output=True, text=True)
time_output = proc.stderr
# parse time_output
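And a minimal sketch of option (c), parsing /proc directly (Linux-only; proc_cpu_seconds is a name made up for this example, and the naive split assumes the process name contains no spaces):
import os

def proc_cpu_seconds(pid):
    # fields 14 (utime) and 15 (stime) of /proc/[pid]/stat hold the
    # CPU times in clock ticks; a naive split() breaks if the comm
    # field (field 2) contains spaces
    with open('/proc/%d/stat' % pid) as f:
        fields = f.read().split()
    ticks = int(fields[13]) + int(fields[14])
    return ticks / os.sysconf('SC_CLK_TCK')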
On Windows: You need to use performance counters (aka "performance data helpers") somehow. Here is a C example of the underlying API. To get it from python, you can use one of two modules: win32pdh (part of pywin32; sample code) or pyrfcon (cross-platform, also works on Unix; sample code).
All of these methods meet the "hard mode" requirement above: they should be accurate even with multiple instances of different processes running on a busy system. They may not produce exactly the same results as running just one process on an idle system, because process switching does have some overhead, but they will be very close, because they ultimately get their data from the OS scheduler.
On platforms where it's available, the resource module may provide what you need. If you need to time multiple commands simultaneously, you may want to (for each command you want to run) fork and then create the subprocess so you get information for only that process. Here's one way you might do this:
import os, resource, struct, sys

def start_running(command):
    time_read_pipe, time_write_pipe = os.pipe()
    want_read_pipe, want_write_pipe = os.pipe()
    runner_pid = os.fork()
    if runner_pid != 0:
        # parent: return a function that asks the runner for the time
        os.close(time_write_pipe)
        os.close(want_read_pipe)
        def finish_running():
            os.write(want_write_pipe, b'x')  # signal that we want the time
            os.close(want_write_pipe)
            time = os.read(time_read_pipe, struct.calcsize('f'))
            os.close(time_read_pipe)
            time = struct.unpack('f', time)[0]
            return time
        return finish_running
    # runner child: fork again and exec the command
    os.close(time_read_pipe)
    os.close(want_write_pipe)
    sub_pid = os.fork()
    if sub_pid == 0:
        os.close(time_write_pipe)
        os.close(want_read_pipe)
        os.execvp(command[0], command)
    os.wait()
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    os.read(want_read_pipe, 1)  # wait until the parent asks for the time
    os.write(time_write_pipe, struct.pack('f', usage.ru_utime))
    sys.exit(0)
You can then use it to run a few commands:
get_ls_time = start_running(['ls'])
get_work_time = start_running(['python', '-c', 'print (2 ** 512) ** 200'])
After that code has executed, both of those commands should be running in parallel. When you want to wait for them to finish and get the time they took to execute, call the function returned by start_running:
ls_time = get_ls_time()
work_time = get_work_time()
Now ls_time will contain the time ls took to execute and work_time will contain the time python -c "print (2 ** 512) ** 200" took to execute.
You can do timings within Python, but if you want to know the overall CPU consumption of your program, doing it there is kind of silly. The best thing to do is to just use the GNU time program. It even comes standard on most operating systems.
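For example (invoking /usr/bin/time explicitly to bypass the shell built-in; the verbose -v flag is GNU-specific):
$ /usr/bin/time -v yourcommand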
The timeit module of Python is very useful for benchmarking/profiling purposes. In addition, you can even call it from the command-line interface. To benchmark an external command, you would go like this:
>>> import timeit
>>> timeit.timeit("call(['ls','-l'])", setup="from subprocess import call", number=1)  # number defaults to 1 million if omitted
total 16
-rw-rw-r-- 1 nilanjan nilanjan 3675 Dec 17 08:23 icon.png
-rw-rw-r-- 1 nilanjan nilanjan 279 Dec 17 08:24 manifest.json
-rw-rw-r-- 1 nilanjan nilanjan 476 Dec 17 08:25 popup.html
-rw-rw-r-- 1 nilanjan nilanjan 1218 Dec 17 08:25 popup.js
0.02114391326904297
The last line is the returned execution time. Here, the first argument to timeit.timeit() is the code that calls the external command, and the setup argument specifies the code to run before the time measurement starts. The number argument is the number of times you wish to run the specified code; you can then divide the returned time by number to get the average time.
You can also use the timeit.repeat() method, which takes similar arguments to timeit.timeit() but takes an additional repeat argument to specify the number of times timeit.timeit() should be called, and returns a list of execution times for each run.
Note: the execution time returned by timeit.timeit() is wall-clock time, not CPU time, so other processes may interfere with the timing. Therefore, with timeit.repeat() you should take the minimum value rather than computing an average or standard deviation.
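For instance, a minimal sketch of that approach:
import timeit

times = timeit.repeat("call(['ls', '-l'])",
                      setup="from subprocess import call",
                      repeat=5, number=1)
print(min(times))  # the minimum is the least noisy estimate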
You can do this using IPython's %time magic function:
In [1]: time 2**128
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00
Out[1]: 340282366920938463463374607431768211456L
In [2]: n = 1000000
In [3]: time sum(range(n))
CPU times: user 1.20 s, sys: 0.05 s, total: 1.25 s
Wall time: 1.37
Out[3]: 499999500000L
I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty, it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 GB disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost-effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value, because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here

def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                pass  # set up and run analysis module, etc.
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am doing for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1, I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string, and empty brackets (tuple?), and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but they still have iterables as arguments. If there is anything critical that I have left out, please advise and I will modify my post. Thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process()es.
If you want to use Pool() to run the function 2 times then pass a list with 2 arguments (even if you don't need arguments), because that is the information Pool() uses to run it 2 times (see the sketch below).
But if you run the function 2 times against the same folder then it may run the same task 2 times; if you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
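A minimal sketch of both approaches, with a stand-in worker in place of your real workermaster (the dummy argument exists only so Pool.map() has something to iterate over):
import multiprocessing

def workermaster(dummy):
    # stand-in for the real infinite loop
    print('worker started: %s' % dummy)

if __name__ == '__main__':
    # option 1: create the two processes by hand
    p1 = multiprocessing.Process(target=workermaster, args=(1,))
    p2 = multiprocessing.Process(target=workermaster, args=(2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()

    # option 2: let Pool run the function once per list element
    pool = multiprocessing.Pool()
    pool.map(workermaster, [1, 2])
    pool.close()
    pool.join()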
As for Ctrl+C I found on Stackoverflow Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python but I don't know if it resolves your problem.
I am building a device based on a Raspberry Pi. It will have several concurrent functions that should work simultaneously. In this case using asyncio looks like a reasonable choice (well, I could write all this stuff in C++ with threads, but the Python code looks much more compact).
One of the functions is to drive a stepper motor via GPIO pulses. These pulses should be 5-10 microseconds long. Is there a way to sleep for sub-millisecond intervals with asyncio.sleep?
Is there a way to sleep for sub-millisecond intervals with asyncio.sleep?
On Linux asyncio uses the epoll_wait system call, which specifies the timeout in milliseconds, so anything sub-millisecond won't work, despite being able to specify it in asyncio.sleep().
You can test it on your machine by running the following program:
import asyncio, os

SLEEP_DURATION = 5e-3  # 5 ms sleep

async def main():
    while True:
        # suspend execution
        await asyncio.sleep(SLEEP_DURATION)
        # execute a syscall visible in strace output
        os.stat('/tmp')

asyncio.run(main())
Save the program e.g. as sleep1.py and run it under strace, like this:
$ strace -fo trc -T python3.7 sleep1.py
<wait a second or two, then press Ctrl-C to interrupt>
The trc file will contain reasonably precise timings of what goes on under the hood. After the Python startup sequence, the program basically does the following in an infinite loop:
24015 getpid() = 24015 <0.000010>
24015 epoll_wait(3, [], 1, 5) = 0 <0.005071>
24015 epoll_wait(3, [], 1, 0) = 0 <0.000010>
24015 stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=45056, ...}) = 0 <0.000014>
We see a call to getpid(), two calls to epoll_wait, and finally the call to stat. The first epoll_wait is the relevant one: it specifies the timeout in milliseconds and sleeps for approximately the desired period. If we lower the sleep duration to sub-milliseconds, e.g. 100e-6, strace shows that asyncio still requests a 1 ms timeout from epoll_wait, and gets as much. The same happens with timeouts down to 15 us. If you specify a 14 us or smaller timeout, asyncio actually requests a no-timeout poll, and epoll_wait completes in 8 us. However, the second epoll_wait also takes 8 us, so you can't really count on microsecond resolution in any shape or form.
Even if you use threads and busy-looping, you are likely to encounter synchronization issues with the GIL. This should likely be done in a lower-level language such as C++ or Rust, and even then you'll need to be careful about the OS scheduler.
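If you still want to approximate short delays from within asyncio, one possible workaround (a sketch only; it burns CPU and remains at the mercy of the scheduler) is to busy-wait on a monotonic clock:
import asyncio, time

async def sleep_us(duration_us):
    # busy-wait until the deadline, yielding to the event loop
    # on each iteration so other tasks can still run
    deadline = time.perf_counter() + duration_us / 1e6
    while time.perf_counter() < deadline:
        await asyncio.sleep(0)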
I would like to know how many processes the Linux kernel created during a period of time, usually one minute.
My background: if too many processes get created during a minute, then something is wrong. Most of our legacy code base was moved from shell to Python, but sometimes there are still some shell scripts which are slow because they spawn a lot of processes.
I would like to create a graph from this number. Then I would like to check on which host and why so many processes got created.
I want to implement this with Python.
Answers how to read this from /proc or /sys would be great.
It would be nice if the solution works across the PID wrap-around which happens when pid_max is reached.
The limit (maximum number of PIDs) is /proc/sys/kernel/pid_max. The manual says:

/proc/sys/kernel/pid_max (since Linux 2.5.34)
This file specifies the value at which PIDs wrap around (i.e., the value in this file is one greater than the maximum PID). The default value for this file, 32768, results in the same range of PIDs as on earlier kernels.
Check /proc/stat: there is a processes field, which counts the number of forks since boot (see the doc):
$ grep processes /proc/stat
processes 81579558
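A minimal Python sketch built on that counter (forks_since_boot is a name made up for this example); because the counter increases monotonically, PID wrap-around does not affect it:
import time

def forks_since_boot():
    # the 'processes' line in /proc/stat is a monotonically
    # increasing count of forks since boot
    with open('/proc/stat') as f:
        for line in f:
            if line.startswith('processes '):
                return int(line.split()[1])

before = forks_since_boot()
time.sleep(60)
print('processes created in the last minute:', forks_since_boot() - before)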
I use this under Windows, but maybe you can try it as a starting point:
>>> import subprocess
>>> subprocess.Popen('tasklist')
<subprocess.Popen object at 0x00000268164C3CC0>
>>>
Image Name                     PID Session Name     Session#     Mem Usage
========================= ======== ================ =========== ============
This will give you a table, which you can capture with
subprocess.Popen('tasklist', stdout=subprocess.PIPE).communicate()[0] (stdout must be a pipe for communicate() to return the output). Just count the lines and you'll get the current number of processes. Do it again in 1 minute and see what's changed.
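A rough sketch of the counting (assuming the default English tasklist layout with a header line and a separator line):
import subprocess

out = subprocess.Popen('tasklist', stdout=subprocess.PIPE).communicate()[0]
lines = [ln for ln in out.splitlines() if ln.strip()]
process_count = len(lines) - 2  # drop the header and separator lines
print(process_count)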
Instead of manually looking at /proc or /sys, let Linux do it for you:
import subprocess
from time import sleep

time = 0
ps = subprocess.Popen(['ps', '-A', '-o', 'pid'], stdout=subprocess.PIPE)
pids = [int(x) for x in ps.communicate()[0].split()[1:]]
new_pids_count = 0
while time < 60:
    ps = subprocess.Popen(['ps', '-A', '-o', 'pid'], stdout=subprocess.PIPE)
    output = [int(x) for x in ps.communicate()[0].split()[1:]]
    for x in output:
        if x not in pids:
            new_pids_count += 1
            pids.append(x)
    time += 1
    sleep(1)
Initially, I get all the currently running PIDs using ps -A -o pid and put them all in a list.
I repeat this every second to check for newly spawned processes, comparing each new result against the growing pids list.
So a friend noticed something curious about numpy. Here is a minimal example that runs the same script, first serially, then as two instances in parallel, each in their own process:
#!/bin/bash
# This is runner.sh
fl=$(mktemp /tmp/test_XXXXX.py)
trap "rm -fv '$fl'" EXIT
cat - > "$fl" <<-'EndOfHereDoc'
#!/usr/bin/env python
import datetime
import numpy as np
import sys

if __name__ == '__main__':
    if len(sys.argv) > 1: print(sys.argv[1] + ' start: ' + str(datetime.datetime.now()))
    cube_size = 100
    cube = np.zeros((cube_size, cube_size, cube_size))
    cube_ones = np.ones((cube_size, cube_size, cube_size))
    for x in range(10000):
        np.add(cube_ones, cube, out=cube)
    if len(sys.argv) > 1: print(sys.argv[1] + ' end: ' + str(datetime.datetime.now()))
EndOfHereDoc

echo "Serial"
time python "$fl" 0
echo
echo "Parallel"
time python "$fl" 1 &
time python "$fl" 2 &
wait
rm -fv "$fl"
trap '' EXIT
The output of which is:
$ runner.sh
Serial
0 start: 2018-09-19 15:46:52.540881
0 end: 2018-09-19 15:47:04.592280
real 0m12,105s
user 0m12,084s
sys 0m0,020s
Parallel
1 start: 2018-09-19 15:47:04.665260
2 start: 2018-09-19 15:47:04.780635
2 end: 2018-09-19 15:47:27.053261
real 0m22,480s
user 0m22,448s
sys 0m0,128s
1 end: 2018-09-19 15:47:27.097312
real 0m22,505s
user 0m22,409s
sys 0m0,040s
removed '/tmp/test_zWN0V.py'
No speedup. It is as if the processes were run one after the other. I assume numpy is using a resource exclusively, and the other process waits for that resource to be freed. But what exactly is going on here? The GIL should only be an issue with multi-threading, not multiple processes, right? I find it especially weird that p2 is not simply waiting for p1 to finish. Instead, BOTH processes take ~22 s to finish. I'd expect one to get the resource and finish in half the time, while the other waits until the first releases it and takes an additional ~12 s.
Note that this also occurs when running the Python code with Python's own multiprocessing module in a Pool. It does not, however, occur if you do something that doesn't involve certain numpy functions, like:
cube_size = 25
cube = [0 for i in range(cube_size**3)]
for x in range(10000):
    cube = [value + 1 for value in cube]
Edit:
I have a real 4-core CPU. I kept hyperthreading in mind; it's not the issue here. During the single-process part, one CPU is at 100% and the rest are idle. During the two-process part, two are at 100% and the rest are idle (as per htop). I understand that numpy uses the ATLAS, LAPACK and BLAS libraries in the background, which are not Python (in fact pure C or Fortran). These might utilize parallel techniques. My question here is: why doesn't that show up in the CPU utilization?
Numpy is not restricted by the GIL as much as core Python is. This is because numpy only wraps the array in a Python object; the actual data itself is stored as "primitive" types defined in C. This is also why iterating over a numpy array is much slower than iterating over a Python list: numpy has to build a Python object for each value it yields, whereas the Python list already consists of Python objects.
As numpy is not hampered by the GIL, it is able to use threaded math libraries where available. That is to say, your parallel processes took longer to run because each single process was already maxing out your machine, so both processes were competing for the same resources.
Take a look at the output of the following and see what's available on your machine (be warned: it's quite verbose).
import numpy.distutils.system_info as sysinfo
sysinfo.show_all()
I'm currently doing some work at uni that requires generating multiple benchmarks for multiple short C programs. I've written a Python script to automate this process. Up until now I've been using the time module, essentially calculating the benchmark like this:
start = time.time()
successful = run_program(path)
end = time.time()
runtime = end - start
where the run_program function just uses the subprocess module to run the C program:
def run_program(path):
    p = subprocess.Popen(path, shell=True, stdout=subprocess.PIPE)
    p.communicate()[0]
    if p.returncode > 1:
        return False
    return True
However, I've recently discovered that this measures elapsed time and not CPU time, i.e. this sort of measurement is sensitive to noise from the OS. Similar questions on SO suggest that the timeit module is better for measuring CPU time, so I've adapted the run method as such:
def run_program(path):
    command = "p = subprocess.Popen('time " + path + "', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE); out, err = p.communicate()"
    result = timeit.Timer(command, setup='import subprocess').repeat(1, 10)
    return numpy.median(result)
But from looking at the timeit documentation, it seems that the timeit module is only meant for small snippets of Python code passed in as a string, so I'm not sure it's giving me accurate results for this computation. My question is: will timeit measure the CPU time of every step of the process it runs, or only the CPU time taken for the actual Python code (i.e. the subprocess module calls) to run? Is this an accurate way to benchmark a set of C programs?
timeit will measure the CPU time used by the Python process in which it runs. Execution time of external processes will not be "credited" to those times.
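If the goal is the CPU time consumed by the child process itself, one option (a sketch, not part of this answer's original suggestion) is os.times(), whose children_user and children_system fields accumulate the CPU time of terminated child processes; ./your_c_program is a placeholder path:
import os
import subprocess

t0 = os.times()
subprocess.call(['./your_c_program'])  # placeholder path
t1 = os.times()

cpu = (t1.children_user - t0.children_user) + (t1.children_system - t0.children_system)
print('child CPU time: %.3f s' % cpu)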
A more accurate way would be to do it in C, where you can get true speed and throughput.