I am trying to create a simple app that wastes CPU cycles for multi-core research. The one I created takes up 100% core usage. I want it to be around 30%, 60%, or 70%; what adjustments should I make in order to achieve this? Thanks in advance.
Current version:
a=999999999
while True:
    a=a/2
Starting at a large number isn't necessary: repeatedly dividing by 2 quickly underflows to 0.0, after which you're just computing 0.0/2 over and over anyway. Besides, you don't have to actually do anything in a loop to consume CPU cycles; the mere act of looping is enough. This is why any infinite loop, even something as simple as while 1: pass, will eat up an entire CPU core until killed. To avoid taking up an entire core, use time.sleep to pause execution of the thread for a certain period. It takes a single argument, the time in seconds for the thread to sleep, and accepts a floating-point number.
import time

while 1:
    time.sleep(0.0001)
Simply run an instance of this script (with an appropriate sleep time for the workload you'd like to put on your particular system) for each core you'd like to test.
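If you want to land near a specific utilization (the 30%, 60%, or 70% targets in the question) rather than just lighten the load, one rough approach is a busy/sleep duty cycle. The sketch below is only an approximation, not a precise controller: the achieved percentage will drift with scheduler behavior, and the 0.1-second cycle length is just an illustrative choice.
import time

def burn_at(target=0.3, cycle=0.1):
    # Busy-wait for target*cycle seconds, then sleep for the rest of the cycle.
    while True:
        busy_until = time.perf_counter() + cycle * target
        while time.perf_counter() < busy_until:
            pass                               # burns CPU
        time.sleep(cycle * (1.0 - target))     # yields CPU

burn_at(0.3)   # aim for roughly 30%; try 0.6 or 0.7 for the other targets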
Note that some operating systems may not support sleep times of less than one millisecond, causing shorter sleep times to come through as zero, making them incompatible with this strategy. See Python: high precision time.sleep and How accurate is python's time.sleep()? for more.
Related
I have to run code that takes about 3 hours to complete 100 times, which amounts to a total computation time of 300 hours. The code is expensive: it involves taking Laplacians, computing contour points, and outputting plots. I also have access to a computing cluster, which gives me 100 cores at once. So I wondered if I could run my 100 jobs on 100 individual cores at once (no communication between cores) and keep the total computation time at somewhere around 3 hours.
My online reading took me to multiprocessing.Pool, and I ended up using Pool.apply_async(). I quickly realized that the average completion time per task was way over 3 hours, the completion time of a single task before parallelizing. Further research suggested it might be an overhead issue one has to deal with when parallelizing, but I don't think overhead can account for an 8-fold increase in computation time.
I was able to reproduce the issue on a simple, and shareable piece of code:
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time

def f(x):
    '''Junk to just waste computing power!'''
    arr = np.ones(shape=(10000,10000))*x
    val1 = arr*arr
    val2 = np.sqrt(val1)
    val3 = np.roll(arr,-1) - np.roll(arr,1)
    plt.subplot(121)
    plt.imshow(arr)
    plt.subplot(122)
    plt.imshow(val3)
    plt.savefig('f.png')
    plt.close()
    print('Done')

N = 1
pool = Pool(processes=N)
t0 = time.time()
for i in range(N):
    pool.apply_async(f,[i])
pool.close()
pool.join()
print('Time taken = ', time.time()-t0)
The machine I'm on has 8 cores, and if I've understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:
N=1 || Time taken = 7.964005947113037
N=7 || Time taken = 40.3266499042511.
Put simply, I don't understand why the times aren't the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.
[Question 1] Is this all a case of overhead (whatever it entails!)?
[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?
For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.
Thank you in advance for your help and input!
I suspect that (perhaps due to memory overhead) your program is simply not CPU bound when running multiple copies, therefore process parallelism is not speeding it up as much as you might think.
An easy way to test this is just to have something else do the parallelism and see if anything is improved.
With N = 8 it took 31.2 seconds on my machine. With N = 1 my machine took 7.2 seconds. I then just fired off 8 copies of the N = 1 version.
$ for i in $(seq 8) ; do python paralleltest & done
...
Time taken = 32.07925891876221
Done
Time taken = 33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken = 34.14084982872009
Time taken = 34.21410894393921
Time taken = 34.44455814361572
Time taken = 34.49029612541199
Time taken = 34.502259969711304
Time taken = 34.516881227493286
I also tried this with the entire multiprocessing stuff removed and just a single call to f(0) and the results were quite similar. So the python multiprocessing is doing what it can, but this job has constraints beyond just CPU.
When I replaced the code with something less memory intensive (math.factorial(1400000)), then the parallelism looked a lot better. 14.0 seconds for 1 copy, 16.44 seconds for 8 copies.
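The substitution was nothing more elaborate than swapping the body of f for a pure-CPU task, roughly along these lines (a sketch, not the exact code; the point is only that it does essentially no memory traffic):
import math
import time

def f(x):
    '''CPU-bound junk with essentially no memory traffic.'''
    math.factorial(1400000)
    print('Done')

if __name__ == '__main__':
    t0 = time.time()
    f(0)
    print('Time taken = ', time.time() - t0)
The same Pool harness as above can drive this f for the multi-copy comparison.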
SLURM has the ability to make use of multiple hosts as well (depending on your configuration). The multiprocessing pool does not. This method will be limited to the resources on a single host.
This occurs at least partly because you're already taking advantage of most of your cpu in just one process. Numpy is generally built to take advantage of multiple threads without the need for multiprocessing. The benefit is greatest when a significant amount of time is spent inside numpy functions or else the comparatively slower python will dominate the time taken in-between numpy calls. When you add tasks in parallel with the CPU already fully loaded, each one runs slower. See this question on adjusting the number of threads numpy uses, and possibly play with setting it to only 1 thread to convince yourself this is the case.
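For example, one way to pin NumPy's internal thread pools to a single thread (so each worker process really does use one core) is to set the relevant environment variables before NumPy is imported. Which variable actually takes effect depends on the BLAS backend your NumPy build links against, so treat this as a sketch:
import os

# Must be set before numpy is imported; which one matters depends on the
# BLAS/LAPACK backend (OpenBLAS, MKL, ...) behind your numpy build.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

arr = np.ones((10000, 10000))
val = np.sqrt(arr * arr)   # should now stay on a single thread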
I solved a MIP problem while making the solving process sleep for 1 second at every branch using a BranchCallback (single thread). I noticed from the log that the system time, measured in seconds, changed on every run, while the deterministic time, measured in ticks, didn't. The problem, however, was that the deterministic time didn't change at all whether the 1-second sleep was applied or not, whereas the system time did record the sleep time.
I also tried to get the deterministic time through the callback API, but it counted only 0.0 ticks for the 1-second sleep. It's not a problem with the sleep itself, because a simple piece of code counting up to a large number also showed 0.0 ticks. It seems the deterministic time simply doesn't record the running time of my code.
What exactly does the deterministic time measure in CPLEX? Is there any method to measure the real running time (especially the real callback running time) the way the system time does, but in a deterministic way?
The deterministic time is an approximation for the work that CPLEX does (you can think of it as number of instructions executed inside CPLEX). Doing nothing does not execute any instructions so does not count towards deterministic time.
Moreover, deterministic time is only measured inside CPLEX. It does not account for time spent in user code like callbacks.
If you want to measure the time spent in your callback then you have to do that yourself (there is no point for CPLEX to track this): just take a time stamp at the beginning of your callback, one at the end of your callback and then compute the difference. The CPLEX callbacks have functions to take time stamps, see the reference documentation.
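As a sketch of that, assuming the legacy cplex.callbacks API (adapt the class to whichever callback you actually use), timing the callback body yourself with plain Python time stamps could look like this:
import time
from cplex.callbacks import BranchCallback

class TimedBranchCallback(BranchCallback):
    """Accumulates wall-clock time spent inside the callback body."""
    total_time = 0.0

    def __call__(self):
        start = time.perf_counter()
        # ... your existing branching logic goes here ...
        TimedBranchCallback.total_time += time.perf_counter() - start
You would register it on the model as usual and read TimedBranchCallback.total_time after the solve; the CPLEX time-stamp functions mentioned above can replace perf_counter if you prefer CPLEX's own stamps.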
In case you want a deterministic time for code you wrote, you have to roll your own and, first of all, define what deterministic time means for your code.
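For instance, if "deterministic" for your own code just means "identical across runs regardless of machine load", counting work units instead of seconds is one possible definition. Purely a sketch of the idea, with a hypothetical WorkCounter helper:
class WorkCounter:
    """A hand-rolled 'deterministic clock': counts abstract work units,
    so the total is identical on every run regardless of machine load."""
    def __init__(self):
        self.ticks = 0

    def charge(self, units=1):
        self.ticks += units

counter = WorkCounter()
for node in range(1000):    # stand-in for e.g. nodes your callback processes
    counter.charge(1)       # one tick per unit of work you define
print(counter.ticks)        # always 1000, however long it actually took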
I have a performance measurement issue while migrating to Cython from C-compiled functions (built through scipy.weave) that are called from a Python engine.
The new Cython functions, profiled end-to-end with cProfile (if it's not necessary, I won't dig into Cython-level profiling), record highly variable cumulative times.
E.g., the cumulative time of a Cython function executed 9 times in each of 5 repetitions (after a warm-up of 5 executions, not taken into consideration by the profiling function) is:
in a first round, 215.627339 seconds
in a second round, 235.336131 seconds
Each execution calls the function many times with different, but fixed, parameters.
Maybe this variability depends on the CPU load of the test machine (a cloud-hosted dedicated one), but I wonder whether such variability (almost 10%) could somehow be due to Cython or a lack of optimization (I already use hints on division, bounds checking, wrap-around, ...).
Any idea on how to take reliable metrics?
First of all, you need to ensure that your measurement device is capable of measuring what you need: specifically, only the system resources you consume. UNIX's user time (the utime figure reported by the time command) is one such measure, although even that one still includes swap time. Check the documentation of your profiler: it should be able to measure only the CPU time consumed by the function. If so, then your figures are due to something else.
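For instance, cProfile lets you swap its wall-clock timer for a CPU-time one. A sketch (time.process_time covers the whole process, user plus system, so it is still not perfect, but it ignores time spent sleeping or preempted by other processes):
import cProfile
import time

def work():
    return sum(i * i for i in range(10 ** 6))

# Profile with CPU time instead of wall-clock time.
profiler = cProfile.Profile(timer=time.process_time)
profiler.enable()
work()
profiler.disable()
profiler.print_stats(sort="cumulative")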
Once you've controlled the external variations, you need to examine the internal ones. You've said nothing about the nature of your function. Some (many?) functions have short-cuts available for data-driven trivialities, such as multiplication by 0 or 1. Some depend on an overt or covert iteration that varies with the data. You need to analyze the input data with respect to the algorithm.
One tool you can use is a line-oriented profiler to detail where the variations originate; seeing which lines take the extra time should help determine where the "noise" comes from.
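If you go that route, the third-party line_profiler package is the usual choice. A sketch of its programmatic use (for the Cython parts you would additionally need to compile with Cython's profiling/linetrace support enabled):
from line_profiler import LineProfiler   # pip install line_profiler

def hot_function(n):
    total = 0.0
    for i in range(n):
        total += i * i
    return total

lp = LineProfiler()
profiled = lp(hot_function)   # wrap the function whose lines we want timed
profiled(10 ** 6)
lp.print_stats()              # per-line hit counts and timings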
I'm not a performance expert, but from my understanding the thing you should be measuring is the average time per execution, not the cumulative time. Other than that: is your function doing anything like reading from disk and/or making network requests?
In my university class I studied the order of growth of functions theoretically, and I tried implementing it practically at home. Although the order of growth turned out to be exactly the same as in the textbooks, the execution times change every single time I execute the program. Why is that?
Source Code
import time
import math
from tabulate import tabulate
n=eval(input("Enter the value of n: "))
t1=time.time()
a=12
t2=time.time()
A=t2-t1
t3=time.time()
b=n
t4=time.time()
B=t4-t3
t5=time.time()
c=math.log10(n)
t6=time.time()
C=t6-t5
t7=time.time()
d=n*math.log10(n)
t8=time.time()
D=t8-t7
t9=time.time()
e=n**2
t10=time.time()
E=t10-t9
t11=time.time()
f=2**n
t12=time.time()
F=t12-t11
print(tabulate([['constant',a,A], ['n',b,B], ['logn',c,C], ['nlogn',d,D], ['n**2',e,E], ['2**n',f,F]], headers=['Function', 'Value', 'Time']))
templist= [A,B,C,D,E,F]
print("The time order in acsending order is: ", sorted(templist,key=int))
First Execution
naufil@naufil-Inspiron-7559:~/Desktop/python$ python3 time_order.py
Enter the value of n: 100
Function Value Time
---------- --------------- -----------
constant 12 2.14577e-06
n 100 1.43051e-06
logn 2 4.1008e-05
nlogn 200 3.57628e-06
n**2 10000 3.33786e-06
2**n 1.26765e+30 3.8147e-06
The time order in acsending order is: [2.1457672119140625e-06, 1.430511474609375e-06, 4.100799560546875e-05, 3.5762786865234375e-06, 3.337860107421875e-06, 3.814697265625e-06]
Second Execution
naufil@naufil-Inspiron-7559:~/Desktop/python$ python3 time_order.py
Enter the value of n: 100
Function Value Time
---------- --------------- -----------
constant 12 2.14577e-06
n 100 1.19209e-06
logn 2 4.64916e-05
nlogn 200 4.05312e-06
n**2 10000 3.33786e-06
2**n 1.26765e+30 3.57628e-06
The time order in acsending order is: [2.1457672119140625e-06, 1.1920928955078125e-06, 4.649162292480469e-05, 4.0531158447265625e-06, 3.337860107421875e-06, 3.5762786865234375e-06]
As other comments and answers have rightly pointed out, the difference in execution times that you observe comes from the way operating systems work. But doing rigorous measurements is a complicated matter, so let me elaborate a bit and give you pointers to where you might direct your experimentation.
What your OS does behind your back
You can see the OS as a conductor and programs as instrument players, and imagine there are only so many instruments that can play at the same time. The conductor must therefore choose at each moment who should play, while also making sure nobody is left frustrated in the end! Likewise, the OS is constantly in charge of choosing which programs to execute, i.e. which programs to dedicate CPU time to. The number of programs (or rather processes) that can execute at the same time is usually limited by the number of cores in your processor.
In practice, the way the OS chooses what to execute is a very complex and fascinating subject, which relies on experimentation-backed heuristics. (Read more here). What you have to understand is that there is hardly any way for you to alter this behavior, and none to guarantee the same execution time between two calls.
Using Linux's time command
Calling Python's time like you do measures the physical (wall-clock) time elapsed between two calls, so because of what we have said, you don't only measure time spent on your program's execution. If you want a better sense of how much time the OS actually dedicated to your program, you can use the Linux command time. The user time will give you the actual CPU time dedicated to the execution of your program. Check out this thread for more info. But understand that this time is subject to oscillations as well!
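If you would rather stay inside Python than wrap the whole script in the time command, the standard library exposes the same distinction. A small sketch (process_time covers the whole process, not a single statement):
import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

total = sum(i * i for i in range(10 ** 6))
time.sleep(0.5)   # adds wall-clock time but essentially no CPU time

print("wall-clock:", time.perf_counter() - wall_start)
print("cpu time  :", time.process_time() - cpu_start)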
What wisdom are you trying to draw from your measurements?
Finally, you should ask yourself whether the exact time is really what you want. Do you care about the value itself, or do you want to exhibit a behavior?
Usually what is done to measure performance is to average the execution times of repeated calls. This way, the effects that pertain to the OS's business should be averaged out. (You can see that as building an unbiased estimator for a random process.) From what I understand, you are trying to show the difference in execution times for algorithms of different complexity. So the actual execution time is not so relevant; what is relevant is the relative order. That is why averaging multiple calls will reduce the variance of the observation and let you make stronger statements about the relative execution times.
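A minimal way to do that averaging with the standard library's timeit (the statements and repetition counts below are arbitrary illustrations, not tuned values):
import math
import timeit

n = 100
for label, stmt in [("logn", "math.log10(n)"),
                    ("nlogn", "n * math.log10(n)"),
                    ("n**2", "n ** 2")]:
    # Each round runs the statement 100000 times; repeat() gives 5 round totals.
    rounds = timeit.repeat(stmt, globals={"math": math, "n": n},
                           repeat=5, number=100000)
    print(label, sum(rounds) / len(rounds) / 100000)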
You should address this question to your operating system. What else runs on your computer? List the various processes and see how many there are; all it takes is a process swap or even a context switch to alter your execution time. Among other things, calling time.time can trigger such a switch, since it involves a system call.
It also depends on what system support routines are already loaded when you call them -- many of those calls being implicit or secondary. If you need to allocate more memory for a particular instruction because another process took the last of your RAM and then swapped out ... well, you get the idea, I hope.
The Setup
I'm working on training some neural networks. These have lots of hyperparameters, and typically you see how each set of hyperparameters performs, then pick your favorite. This is often done by (say) training a network with the given parameters for n epochs, then evaluating its performance, yielding a numerical score for each set of parameters and allowing you to pick the best.
There's a problem with this, though. Some sets of parameters let you go through more epochs more quickly, but benefit less from each epoch. Additionally, pretty much any set of parameters will always do better, given more epochs, so given infinite time, they would all do really well (to a point, but that's not the point right now).
The Problem
What I would prefer to do is to let each process figure out how long it's been running, and cut itself off (gracefully) after a specified number of seconds. The problem is, I would like to multithread this, so just because the program has been running for 60 seconds doesn't mean the process has had 60 seconds of fair CPU time.
So how can I measure how much time the process has actually had available to it, within the process itself?
The time.clock() method gives system time, which is problematic (as above).
The timeit module seems a bit better, but it's external to the script, so the process wouldn't know when to stop.
Is there a better way? Am I wrong about one of the above ways?
Specific Question
How can a python process see how many seconds it has been allocated so far? Not the amount of time that has passed, but how many seconds it itself has been allowed to execute for?
Use os.times().
This gives you the user and system times for the current process. Below is an example limiting the amount of user time.
import os

start = os.times()
limit = 5  # seconds of user time

while True:
    # your code here

    check = os.times()
    if check.user - start.user > limit:
        break
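If lumping user and system time together is acceptable, time.process_time() returns the CPU time of the current process as a single float, so the same check can be written with one call:
import time

start = time.process_time()
limit = 5  # seconds of CPU time (user + system) for this process

while True:
    # your code here
    if time.process_time() - start > limit:
        break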