Assert that one run is significantly faster than another - python

I need to write a functional test that asserts that one run is significantly faster than another.
Here is the code I have written so far:
def test_run5(self):
    cmd_line = ["python", self.__right_def_file_only_files]
    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime1 = end - start
    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime2 = end - start
    self.assertTrue(runtime2 < runtime1 * 1.4)
It works, but I don't like this approach because the 1.4 factor was chosen experimentally for my specific example of execution.
How would you test that the second execution is always faster than the first?
EDIT
I didn't think it would be necessary to explain, but in the context of my program it is not up to me to decide which factor counts as significant for an unknown execution.
The whole program is a kind of Make, and it is the pipeline definition file that defines what a "significant difference in speed" is, not me:
If the definition file contains a lot of rules that are very fast, the difference in execution time between two consecutive executions will be very small, let's say 5% faster, but still significant.
If instead the definition file contains few but very long rules, the difference will be big, let's say 90% faster, so a difference of 5% would not be significant at all.

I found an equation, the Michaelis-Menten kinetics equation, which fits my needs. Here is the function that should do the trick:
def get_best_factor(full_exec_time, rule_count, maximum_ratio=1):
    average_rule_time = full_exec_time / rule_count
    return 1 + (maximum_ratio * average_rule_time / (1.5 + average_rule_time))
The full_exec_time parameter is runtime1, the maximum execution time for a given pipeline definition file.
rule_count is the number of rules in the given pipeline definition file.
maximum_ratio means that the second execution will be, at most, 100% faster than the first (impossible in practice).
The variable parameter of the Michaelis-Menten equation is the average rule execution time. I have arbitrarily chosen 1.5 seconds as the average rule execution time at which the execution should be maximum_ratio / 2 faster; it is the one parameter that depends on how you use this equation.
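For illustration, a minimal sketch of how the factor could replace the hard-coded 1.4 in the test, assuming rule_count is available from the parsed definition file and reading the factor as the minimum speedup ratio required of the second run:

# sketch: rule_count is assumed to come from the parsed pipeline definition file
factor = get_best_factor(runtime1, rule_count)
self.assertLess(runtime2 * factor, runtime1)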

Related

simple CPython program and its performance measure

How do I make a simple CPython program from scratch and measure its performance gain relative to the corresponding Python program?
For example, if I write a program that computes 10! in Python and then in CPython, how would I know how much it improves the performance of the computation?
Take a look at: How to use the timeit module.
Simple code to test the time taken by a function:
from time import time

def time_taken(func):
    start = time()
    func()
    end = time()
    return end - start

def your_func():
    pass  # your code or logic

# To test
time_taken(your_func)
Use this mechanism to test time taken in both situations.
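If you prefer the timeit module mentioned above, a minimal sketch reusing the your_func placeholder from the snippet above:

from timeit import timeit

# run your_func 10 times and report the average duration in seconds
print(timeit(your_func, number=10) / 10)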
Let's say CPython took c seconds and Python took p seconds; then CPython is faster than Python by (p - c) * 100 / c percent.
If p is 2 secs and c is 1 sec, CPython is faster than Python by (2 - 1) * 100 / 1 = 100%.
However, the performance keeps changing depending on the code & problem statement.
Share the output. Good luck!

Why Python tries to compute some calculations in multiprocessing in advance

I am using Ubuntu 17.04 64-bit with an Intel® Core™ i7-7500U CPU @ 2.70GHz × 4 and 16 GB of RAM.
When I run this program, it uses a single core instead of all 4 cores.
import time
import multiprocessing

def boom1(*args):
    print(5**10000000000)

def boom2(*args):
    print(5**10000000000)

def boom3(*args):
    print(5**10000000000)

def boom4(*args):
    print(5**10000000000)

if __name__ == "__main__":
    array = []
    p1 = multiprocessing.Process(target=boom1, args=(array,))
    p2 = multiprocessing.Process(target=boom2, args=(array,))
    p3 = multiprocessing.Process(target=boom3, args=(array,))
    p4 = multiprocessing.Process(target=boom4, args=(array,))
    p1.start()
    p2.start()
    p3.start()
    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()
    print('Done')
Now, if I print a lower power in each function:
print(5 ** 10000000)
Now, for a short while a single core runs at 100%, and only then do all 4 cores run at 100%.
Why is this so? Shouldn't it start with all cores at 100%?
What I had come to understand is that Python performs some of the computation in advance, and hence was doing that on a single core. If that is so, then what is the point of Python being an interpreted language, or am I missing something?
The peephole optimizer is trying to constant-fold the 5**10000000000 calculation. This happens before any worker processes launch.
Most languages have a constant-folding optimization: when an operation between constants appears, the compiler will perform the operation and replace the expression with the single-constant result.
Python does this, as well. I expect that your multi-node operation was simply the sequence of start-print-join on each node.
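A quick way to see constant folding in CPython is to disassemble a small constant expression (a sketch; the exact bytecode output varies between versions):

import dis

# the compiled code already contains the folded constant 81;
# no power is computed at run time
dis.dis(compile("3 ** 4", "<example>", "eval"))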
If you want to get longer runs on the four nodes, try an expression that can't be evaluated at compile time. For instance, pass the base in the argument list and use that instead of 5, or perhaps have each process pick a random number in the range 1-10 and add that to the exponent. This should force run-time evaluation.
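A sketch of the first suggestion (boom here stands in for the four boomN functions above; because the base only arrives as an argument, the power has to be computed at run time, inside each worker):

import multiprocessing

def boom(base):
    # base is only known at run time, so base ** 10000000 cannot be constant-folded
    print(base ** 10000000)

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=boom, args=(5,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()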
I believe the previous answers are correct, but may not fully explain your observations. As others have pointed out, the single-processor time you're seeing is the time the interpreter spends calculating the value of the exponential expression. Because you're using integers, and Python supports arbitrarily long integers, this takes quite a while, probably growing exponentially with the number of zeros in your exponent. In the first case, the calculation takes so long that it appears never to get past that point (I don't know if you ran it to completion, or if your machine could even do it without running out of memory).
In the second case, you've removed enough zeros that the (single-threaded) interpreter can finish the calculation and then proceed to print the result (in parallel). However long that took, the first case will probably take at least 1000 times longer.

Multiprocessing in Python. Why is there no speed-up?

I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time

def tester(num):
    return np.cos(num)

if __name__ == '__main__':
    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1

    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2

    print('The time taken with multiple processes:', timetaken)
    print('The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this.
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Each job here is just the computation of a single cos value. That is going to be basically unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1,000,000 cos values each and you should see them run in parallel.
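A rough sketch of that idea, reusing the names from the question (tester_chunk and the chunking are mine): each task now receives a whole chunk of the range instead of a single number:

import multiprocessing
import numpy as np

def tester_chunk(nums):
    # one task now computes a large batch of cosines instead of a single one
    return np.cos(nums)

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    chunks = np.array_split(np.arange(5000000), pool_size)
    pool = multiprocessing.Pool(processes=pool_size)
    results = pool.map(tester_chunk, chunks)
    pool.close()
    pool.join()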
First, you wrote:
timetaken2 = timetaken = endtime2 - starttime2
So it is normal that the same time is displayed twice. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get:
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed version is slower than the plain for loop. Why?
The multiprocessing module lets you use multiple processes, but memory is not shared between them. So when you launch a Pool, the useful variables have to be copied to each worker, the calculation performed there, and the result retrieved. This costs a little time for every task and makes you less effective.
That hurts here because you do a very small computation: multiprocessing is only useful for larger calculations, where copying the memory and retrieving the results is cheaper (in time) than the calculation itself.
I tried with the following tester, which is much more expensive, on 2000 runs:
def expenser_tester(num):
    A = np.random.rand(10 * num)    # creation of a random 1D array
    for k in range(0, len(A) - 1):  # some useless but costly operation
        A[k + 1] = A[k] * A[k + 1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that for an expensive calculation, multiprocessing is more efficient, even if you don't always get what you might expect (I could have hoped for a 4x speedup, but I only got 2x).
Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be memory-expensive.
If you really want to speed up a small calculation like your example, make it big by grouping and sending a list of variables to the pool instead of one variable per process.
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already parallelized, so there is not much you can do to speed them up.
If the problem is CPU-bound, then you should see the expected speed-up (provided the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes), it is easier to end up with a memory-bound problem.

Recording time of execution: How do computers calculate arithmetic so fast?

My question seems elementary at first, but bear with me.
I wrote the following code in order to test how long it would take python to count from 1 to 1,000,000.
import time

class StopWatch:
    def __init__(self, startTime = time.time()):
        self.__startTime = startTime
        self.__endTime = 0

    def getStartTime(self):
        return self.__startTime

    def getEndTime(self):
        return self.__endTime

    def stop(self):
        self.__endTime = time.time()

    def start(self):
        self.__startTime = time.time()

    def getElapsedTime(self):
        return self.__endTime - self.__startTime

count = 0
Timer = StopWatch()
for i in range(1, 1000001):
    count += i
Timer.stop()
total_Time = Timer.getElapsedTime()
print("Total time elapsed to count to 1,000,000: ", total_Time, " milliseconds")
I calculated a surprisingly short time span. It was 0.20280098915100098 milliseconds. I first want to ask: Is this correct?
I expected execution to be at least 2 or 3 milliseconds, but I did not anticipate it would be able to make that computation in less than a half of a millisecond!
If this is correct, that leads me to my secondary question: WHY is it so fast?
I know CPUs are essentially built for arithmetic, but I still wouldn't anticipate it being able to count to one million in two tenths of a millisecond!
Maybe you were tricked by the time measurement unit, as @jonrsharpe commented.
Nevertheless, a 3rd generation Intel i7 is capable of 120+GIPS (i.e. billions of elementary operations per second), so assuming all cache hits and no context switch (put simply, no unexpected waits), it could easily count from 0 to 1G in said time and even more. Probably not with Python, since it has some overhead, but still possible.
Explaining how a modern CPU can achieve such an... "insane" speed is quite a broad subject, actually a collaboration of more than one technology:
a dynamic scheduler will rearrange elementary instructions to reduce conflicts (thus, waits) as much as possible
a well-engineered cache will promptly provide code and (although less problematic for this benchmark) data.
a dynamic branch predictor will profile code and speculate on branch conditions (e.g. "for loop is over or not?") to anticipate jumps with a high chance of "winning".
a good compiler will provide some additional effort by rearranging instructions in order to reduce conflicts or making loops faster (by unrolling, merging, etc.)
multi-precision arithmetic could exploit vector operations provided by the MMX instruction set and its successors.
In short, there is more than one reason why those small wonders are so expensive :)
First, as has been pointed out, time() output is actually in seconds, not milliseconds.
Also, you are actually performing 1M additions summing to about 1M**2 / 2, not just counting to 1M, and with range you are initializing a million-long list (unless you are on Python 3).
I ran a simpler test on my laptop:
import time

start = time.time()
i = 0
while i < 1000000:
    i += 1
print time.time() - start
Result:
0.069179093451
So, 70 milliseconds. That translates to 14 million operations per second.
Let's look at the table that Stefano probably referred to (http://en.wikipedia.org/wiki/Instructions_per_second) and do a rough estimation.
They don't have an i5 like I do, but the slowest i7 will be close enough. It clocks 80 GIPS with 4 cores, 20 GIPS per core.
(By the way, if your question is "how does it manage to get 20 GIPS per core?", can't help you. It's maaaagic nanotechnology)
So the core is capable of 20 billion operations per second, and we get only 14 million - different by a factor of 1400.
At this point the right question is not "why so fast?" but "why so slow?". Probably Python overhead. What if we try this in C?
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int i = 0;
int million = 1000000;

int main() {
    clock_t cstart = clock();
    while (i < million) {
        i += 1;
    }
    clock_t cend = clock();
    printf("%.3f cpu sec\n", ((double)cend - (double)cstart) / CLOCKS_PER_SEC);
    return 0;
}
Result:
0.003 cpu sec
This is 23 times faster than Python, and only 60 times away from the theoretical number of 'elementary operations' per second. I see two operations here, comparison and addition, so make that 30 times. This is entirely reasonable, as elementary operations are probably much smaller than our addition and comparison (let assembler experts tell us), and also we didn't factor in context switches, cache misses, time-calculation overhead and who knows what else.
This also suggests that Python performs 23 times as many operations to do the same thing. This is also entirely reasonable, because Python is a high-level language. This is the kind of penalty you get in high-level languages, and now you understand why speed-critical sections are usually written in C.
Also, Python's integers are immutable, and memory has to be allocated for each new integer (the Python runtime is smart about it, but nevertheless).
I hope that answers your question and teaches you a little bit about how to perform incredibly rough estimations =)
Short answer: As jonrsharpe mentioned in the comments, it's seconds, not milliseconds.
Also, as Stefano said, check his posted answer; it has a lot of detail beyond just the ALU.
I'm just writing to mention: when you define default values in your classes or functions, make sure to use a simple immutable value instead of a function call or something like that. Your class is actually setting the same start time for every instance; you will get a nasty surprise if you create a new Timer, because it will reuse the value computed when the class was defined. Try this, and the timer does not get reset for the second Timer:
#...
count = 0
Timer = StopWatch()
time.sleep(1)
Timer = StopWatch()  # note: this second StopWatch still reuses the default start time
for i in range(1, 1000001):
    count += i
Timer.stop()
total_Time = Timer.getElapsedTime()
print("Total time elapsed to count to 1,000,000: ", total_Time, " milliseconds")
You will get about 1 second instead of what you expect.
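One common fix, sketched for just the constructor, is to default to None and read the clock inside __init__, so every instance gets a fresh start time:

import time

class StopWatch:
    def __init__(self, startTime=None):
        # time.time() is now evaluated per instance, not once when the class is defined
        self.__startTime = startTime if startTime is not None else time.time()
        self.__endTime = 0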

Why does the execution time of this Python code increase with each call?

import time

word = {"success": 0, "desire": 0, "effort": 0, ...}

def cleaner(x):
    dust = ",./<>?;''[]{}\=+_)(*&^%$##!`~"
    for letter in x:
        if letter in dust:
            x = x[0:x.index(letter)] + x[x.index(letter)+1:]
        else:
            pass
    return x  # alhamdlillah it worked 31.07.12

print "input text to analyze"
itext = cleaner(raw_input()).split()

t = time.clock()
for iword in itext:
    if iword in word:
        word[iword] += 1
    else:
        pass
print t
print len(itext)
Every time I run the code, t increases. Can anyone explain the underlying concept/reason behind this, perhaps in terms of system processes? Thank you very much, programming lads.
Because you're printing out the current time each time you run the script.
That's how time works; it advances, constantly.
If you want to measure the time taken for your for loop (between the first call to time.clock() and the end), print out the difference in times:
print time.clock() - t
You are printing the current time... of course it increases every time you run the code.
From the python documentation for time.clock():
On Unix, return the current processor time as a floating point number
expressed in seconds. The precision, and in fact the very definition
of the meaning of “processor time”, depends on that of the C function
of the same name, but in any case, this is the function to use for
benchmarking Python or timing algorithms.
On Windows, this function returns wall-clock seconds elapsed since the
first call to this function, as a floating point number, based on the
Win32 function QueryPerformanceCounter(). The resolution is typically
better than one microsecond.
time.clock() returns the elapsed CPU time since the process was created. CPU time is based on how many cycles the CPU spent in the context of the process. It is a monotonic function during the lifetime of a process, i.e. if you call time.clock() several times in the same execution, you will get a list of increasing numbers. The difference between two successive invocations of clock() could be less than the elapsed wall-clock time or more, depending on whether the CPU was not running at 100% (e.g. there was some waiting for I/O) or whether you have a multithreaded program consuming more than 100% of CPU time (e.g. a multicore CPU with 2 threads using 75% each gives you 150% of the wall-clock time). But if you call clock() once in one process and then rerun the program, you might get a lower value than before, if the new process takes less time to handle the input.
What you should do instead is use time.time(), which returns the current Unix timestamp with fractional (subsecond) precision. Call it once before the processing starts and once after, and subtract the two values to get the wall-clock time elapsed between the two invocations.
Note that on Windows time.clock() returns the elapsed wall-clock time since the first call to it. It is like calling time.time() immediately at the beginning of the script and then subtracting that value from later calls to time.time().
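Applied to the loop in the question, a minimal sketch (same itext and word as above):

import time

start = time.time()
for iword in itext:
    if iword in word:
        word[iword] += 1
print time.time() - start  # wall-clock seconds spent in the loop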
There is a really good library called jackedCodeTimerPy that works better than the time module. It also has some clever error checking so you may want to try it out.
Using jackedCodeTimerPy your code should look like this:
# import time
from jackedCodeTimerPY import JackedTiming
JTimer = JackedTiming()

word = {"success": 0, "desire": 0, "effort": 0}

def cleaner(x):
    dust = ",./<>?;''[]{}\=+_)(*&^%$##!`~"
    for letter in x:
        if letter in dust:
            x = x[0:x.index(letter)] + x[x.index(letter)+1:]
        else:
            pass
    return x  # alhamdlillah it worked 31.07.12

print "input text to analyze"
itext = cleaner(raw_input()).split()

# t = time.clock()
JTimer.start('timer_1')
for iword in itext:
    if iword in word:
        word[iword] += 1
    else:
        pass
# print t
JTimer.stop('timer_1')
print JTimer.report()
print len(itext)
It gives really good reports like
label min max mean total run count
------- ----------- ----------- ----------- ----------- -----------
imports 0.00283813 0.00283813 0.00283813 0.00283813 1
loop 5.96046e-06 1.50204e-05 6.71864e-06 0.000335932 50
I like how it gives you statistics on it and the number of times the timer is run.
It's simple to use. If I want to measure the time code takes in a for loop, I just do the following:
from jackedCodeTimerPY import JackedTiming
JTimer = JackedTiming()

for i in range(50):
    JTimer.start('loop')  # 'loop' is the name of the timer
    doSomethingHere = 'This is really useful!'
    JTimer.stop('loop')

print(JTimer.report())  # prints the timing report
You can also have multiple timers running at the same time.
JTimer.start('first timer')
JTimer.start('second timer')
do_something = 'amazing'
JTimer.stop('first timer')
do_something = 'else'
JTimer.stop('second timer')
print(JTimer.report()) # prints the timing report
There are more usage examples in the repo. Hope this helps.
https://github.com/BebeSparkelSparkel/jackedCodeTimerPY
