simple CPython program and its performance measure - python

How do I make a simple CPython program from scratch and measure its performance gain over the corresponding Python program?
For example, if I write a program that computes 10! in Python and then in CPython, how would I know how much it improves the performance of the computation?

Take a look at: How to use the timeit module.
Here is simple code to test the time taken by a function:
def time_taken(func):
    from time import time
    start = time()
    func()
    end = time()
    return end - start

def your_func():
    # your code or logic
    pass

# To test
time_taken(your_func)
Use this mechanism to test time taken in both situations.
Let's say CPython took c seconds and Python took p seconds; then CPython is faster than Python by (p - c) * 100 / c percent.
If p is 2 secs and c is 1 sec, CPython is faster than Python by (2 - 1) * 100 / 1 = 100%.
However, the performance keeps changing depending on the code & problem statement.
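Since the answer points at the timeit module, here is a minimal sketch of the same measurement done with it (factorial_10 is a placeholder workload, not code from the original question):
import timeit

def factorial_10():
    # placeholder workload: compute 10!
    result = 1
    for n in range(2, 11):
        result *= n
    return result

# timeit runs the callable many times and returns the total elapsed time,
# which smooths out one-off OS noise better than a single pair of time() calls.
print(timeit.timeit(factorial_10, number=100000))
Run the same snippet under both interpreters and compare the two totals with the formula above.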
Share the output. Good luck!

Related

Dask delayed performance issues

I'm relatively new to Dask. I'm trying to parallelize a "custom" function that doesn't use Dask containers; I would just like to speed up the computation. But my result is that when I try parallelizing with dask.delayed, it has significantly worse performance than running the serial version. Here is a minimal implementation demonstrating the issue (the code I actually want to do this with is significantly more involved :) )
import dask, time

def mysum(rng):
    # CPU intensive
    z = 0
    for i in rng:
        z += i
    return z

# serial
b = time.time(); zz = mysum(range(1, 1_000_000_000)); t = time.time() - b
print(f'time to run in serial {t}')

# parallel
ms_parallel = dask.delayed(mysum)
ss = []
ncores = 10
m = 100_000_000
for i in range(ncores):
    lower = m * i
    upper = (i + 1) * m
    r = range(lower, upper)
    s = ms_parallel(r)
    ss.append(s)

j = dask.delayed(ss)
b = time.time(); yy = j.compute(); t = time.time() - b
print(f'time to run in parallel {t}')
Typical results are:
time to run in serial 55.682398080825806
time to run in parallel 135.2043571472168
It seems I'm missing something basic here.
You are running a pure CPU-bound computation in threads by default. Because of Python's Global Interpreter Lock (GIL), only one thread is actually running at a time. In short, you are only adding overhead to your original computation, due to thread switching and task execution.
To actually get faster for this workload, you should use dask-distributed. Just adding
import dask.distributed
client = dask.distributed.Client(threads_per_worker=1)
at the start of your script may well give you a decent speed-up, since this starts a number of worker processes, each with its own GIL. The scheduler it provides becomes the default one simply by creating it.
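For context, a minimal sketch of how the pieces fit together (the n_workers value is an assumption; by default the Client picks a number based on your CPU count):
import dask
import dask.distributed

# Creating a Client starts a local cluster of worker processes,
# each with its own interpreter and GIL, and registers it as the default scheduler.
client = dask.distributed.Client(n_workers=10, threads_per_worker=1)

def mysum(rng):
    z = 0
    for i in rng:
        z += i
    return z

m = 100_000_000
parts = [dask.delayed(mysum)(range(m * i, m * (i + 1))) for i in range(10)]
total = dask.delayed(sum)(parts).compute()  # the chunks now run in separate processes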
EDIT: ignore the following, I see you are already doing it :). Leaving it here for others, unless people want it gone... The second problem, for Dask, is the sheer number of tasks. For any task-execution system, there is an overhead associated with each task (and it is actually higher for distributed than for the default threaded scheduler). You can get around it by computing batches of function calls per task. This is, in practice, what dask.array and dask.dataframe do: they operate on largeish pieces of the overall problem, so that the overhead becomes small compared to the useful CPU execution time.

How to benchmark a C program from a python script?

I'm currently doing some work in uni that requires generating multiple benchmarks for multiple short C programs. I've written a Python script to automate this process. Up until now I've been using the time module, essentially calculating the benchmark like this:
start = time.time()
successful = run_program(path)
end = time.time()
runtime = end - start
where the run_program function just uses the subprocess module to run the C program:
def run_program(path):
    p = subprocess.Popen(path, shell=True, stdout=subprocess.PIPE)
    p.communicate()[0]
    if p.returncode > 1:
        return False
    return True
However, I've recently discovered that this measures elapsed (wall-clock) time and not CPU time, i.e. this sort of measurement is sensitive to noise from the OS. Similar questions on SO suggest that the timeit module is better for measuring CPU time, so I've adapted the run method as follows:
def run_program(path):
    command = 'p = subprocess.Popen(\'time ' + path + '\', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE); out, err = p.communicate()'
    result = timeit.Timer(command, setup='import subprocess').repeat(1, 10)
    return numpy.median(result)
But from looking at the timeit documentation it seems that the timeit module is only meant for small snippets of Python code passed in as a string, so I'm not sure whether timeit is giving me accurate results for this computation. My question is: will timeit measure the CPU time for every step of the process that it runs, or will it only measure the CPU time for the actual Python code (i.e. the subprocess module) to run? Is this an accurate way to benchmark a set of C programs?
timeit will measure the CPU time used by the Python process in which it runs. Execution time of external processes will not be "credited" to those times.
A more accurate way would be to do it in C, where you can get true speed and throughput.
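If you want the CPU time consumed by the C program itself rather than by the Python wrapper, one option (a sketch, POSIX only, not from the original answer) is the standard resource module, which reports the accumulated CPU time of child processes:
import resource
import subprocess

def child_cpu_time(path):
    # Snapshot the children's CPU usage before and after running the program;
    # the difference is the CPU time consumed by this child alone.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(path, shell=True, stdout=subprocess.DEVNULL, check=False)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)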

Assert that one run is significantly faster than one other

I need to write a functional test that asserts that one run is significantly faster than another.
Here is the code I have written so far:
def test_run5(self):
    cmd_line = ["python", self.__right_def_file_only_files]

    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime1 = end - start

    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime2 = end - start

    self.assertTrue(runtime2 < runtime1 * 1.4)
It works, but I don't like this approach because the 1.4 factor was chosen experimentally for my specific example of execution.
How would you test that the second execution is always faster than the first?
EDIT
I didn't think it would be necessary to explain it, but in the context of my program it is not up to me to say what factor is significant for an unknown execution.
The whole program is a kind of Make, and it is the pipeline definition file that defines what the "significant difference of speed" is, not me:
If the definition file contains a lot of rules that are very fast, the difference in execution time between two consecutive executions will be very small, let's say 5% faster, but still significant.
Else, if the definition file contains few but very long rules, the difference will be big, let's say 90% faster, so a difference of 5% would not be significant at all.
I found an equation, the Michaelis-Menten kinetics equation, that fits my needs. Here is the function that should do the trick:
def get_best_factor(full_exec_time, rule_count, maximum_ratio=1):
    average_rule_time = full_exec_time / rule_count
    return 1 + (maximum_ratio * average_rule_time / (1.5 + average_rule_time))
The full_exec_time parameter is runtime1, the maximum execution time for a given pipeline definition file.
rule_count is the number of rules in the given pipeline definition file.
maximum_ratio means that the second execution will be, at most, 100% faster than the first (impossible in practice).
The variable parameter of the Michaelis-Menten kinetics equation is the average rule execution time, and I have arbitrarily chosen 1.5 seconds as the average rule execution time at which the execution should be maximum_ratio / 2 faster. That is the actual parameter that depends on your use of this equation.
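For illustration, the computed factor can then replace the hard-coded 1.4 in the original assertion. This is a hypothetical sketch, mirroring the shape of the test above; the rule_count value of 12 is made up, and runtime1/runtime2 are the timings measured in the test:
factor = get_best_factor(full_exec_time=runtime1, rule_count=12)
self.assertTrue(runtime2 < runtime1 * factor)  # tolerance now scales with the average rule time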

Recording time of execution: How do computers calculate arithemetic so fast?

My question seems elementary at first, but bear with me.
I wrote the following code in order to test how long it would take python to count from 1 to 1,000,000.
import time

class StopWatch:
    def __init__(self, startTime = time.time()):
        self.__startTime = startTime
        self.__endTime = 0

    def getStartTime(self):
        return self.__startTime

    def getEndTime(self):
        return self.__endTime

    def stop(self):
        self.__endTime = time.time()

    def start(self):
        self.__startTime = time.time()

    def getElapsedTime(self):
        return self.__endTime - self.__startTime

count = 0
Timer = StopWatch()
for i in range(1, 1000001):
    count += i
Timer.stop()
total_Time = Timer.getElapsedTime()
print("Total time elapsed to count to 1,000,000: ", total_Time, " milliseconds")
I calculated a surprisingly short time span. It was 0.20280098915100098 milliseconds. I first want to ask: Is this correct?
I expected execution to be at least 2 or 3 milliseconds, but I did not anticipate it would be able to make that computation in less than a half of a millisecond!
If this is correct, that leads me to my secondary question: WHY is it so fast?
I know CPUs are essentially built for arithmetic, but I still wouldn't anticipate it being able to count to one million in two tenths of a millisecond!
Maybe you were tricked by the time measurement unit, as jonrsharpe commented.
Nevertheless, a 3rd generation Intel i7 is capable of 120+GIPS (i.e. billions of elementary operations per second), so assuming all cache hits and no context switch (put simply, no unexpected waits), it could easily count from 0 to 1G in said time and even more. Probably not with Python, since it has some overhead, but still possible.
Explaining how a modern CPU can achieve such an... "insane" speed is quite a broad subject, actually the collaboration of more than one technology:
a dynamic scheduler will rearrange elementary instructions to reduce conflicts (thus, waits) as much as possible
a well-engineered cache will promptly provide code and (although less problematic for this benchmark) data.
a dynamic branch predictor will profile code and speculate on branch conditions (e.g. "for loop is over or not?") to anticipate jumps with a high chance of "winning".
a good compiler will provide some additional effort by rearranging instructions in order to reduce conflicts, or by making loops faster (by unrolling, merging, etc.)
multi-precision arithmetic could exploit vectorial operations provided by the MMX set and similar.
In short, there is more than a reason why those small wonders are so expensive :)
First, as has been pointed out, time() output is actually in seconds, not milliseconds.
Also, you are not just counting to 1m: you are actually performing 1m additions, summing to roughly 1m**2 / 2, and you are initializing a million-element list with range (unless you are on Python 3).
I ran a simpler test on my laptop:
start = time.time()
i = 0
while i < 1000000:
    i += 1
print time.time() - start
Result:
0.069179093451
So, 70 milliseconds. That translates to 14 million operations per second.
Let's look at the table that Stefano probably referred to (http://en.wikipedia.org/wiki/Instructions_per_second) and do a rough estimation.
They don't have an i5 like I do, but the slowest i7 will be close enough. It clocks 80 GIPS with 4 cores, 20 GIPS per core.
(By the way, if your question is "how does it manage to get 20 GIPS per core?", can't help you. It's maaaagic nanotechnology)
So the core is capable of 20 billion operations per second, and we get only 14 million - different by a factor of 1400.
At this point the right question is not "why so fast?" but "why so slow?". Probably Python overhead. What if we try this in C?
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int i = 0;
int million = 1000000;

int main() {
    clock_t cstart = clock();
    while (i < million) {
        i += 1;
    }
    clock_t cend = clock();
    printf("%.3f cpu sec\n", ((double)cend - (double)cstart) / CLOCKS_PER_SEC);
    return 0;
}
Result:
0.003 cpu sec
This is 23 times faster than python, and only 60 times different from the number of theoretical 'elementary operations' per second. I see two operations here - comparison and addition, so 30 times different. This is entirely reasonable, as elementary operations are probably much smaller than our addition and comparison (let assembler experts tell us), and also we didn't factor in context switches, cache misses, time calculation overhead and who knows what else.
This also suggests that Python performs roughly 23 times as many operations to do the same thing. This is also entirely reasonable, because Python is a high-level language. This is the kind of penalty you get in high-level languages - and now you understand why speed-critical sections are usually written in C.
Also, python's integers are immutable, and memory should be allocated for each new integer (python runtime is smart about it, but nevertheless).
I hope that answers your question and teaches you a little bit about how to perform incredibly rough estimations =)
Short answer: As jonrsharpe mentioned in the comments, it's seconds, not milliseconds.
Also, as Stefano said, check his posted answer - it has a lot of detail beyond just the ALU.
I'm just writing to mention - when you give default values in your classes or functions, make sure to use simple immutable values instead of a function call or something like that. Your class is actually fixing the start time of the timer for all instances, because the default is evaluated only once - you will get a nasty surprise if you create a new Timer, since it will reuse the earlier value as its initial value. Try this, and the timer does not get reset for the second Timer:
# ...
count = 0
Timer = StopWatch()
time.sleep(1)
Timer = StopWatch()  # intended to restart the clock, but the default start time was captured earlier
for i in range(1, 1000001):
    count += i
Timer.stop()
total_Time = Timer.getElapsedTime()
print("Total time elapsed to count to 1,000,000: ", total_Time, " milliseconds")
You will get about 1 second instead of what you expect.
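A common way to avoid that pitfall (a sketch, not the original poster's code) is to default the argument to None and call time.time() inside __init__:
import time

class StopWatch:
    def __init__(self, startTime=None):
        # time.time() is evaluated at instantiation time, not at class-definition time.
        self.__startTime = time.time() if startTime is None else startTime
        self.__endTime = 0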

What do I gain from using Profile or cProfile

versus something like this:
import time
import functools

def time_this(func):
    @functools.wraps(func)
    def what_time_is_it(*args, **kwargs):
        start_time = time.clock()
        print 'STARTING TIME: %f' % start_time
        result = func(*args, **kwargs)
        end_time = time.clock()
        print 'ENDING TIME: %f' % end_time
        print 'TOTAL TIME: %f' % (end_time - start_time)
        return result
    return what_time_is_it
I am asking because writing a decorator like this seems easier and clearer to me. I recognize that profile/cProfile attempts to estimate bytecode compilation time and such (and subtracts those times from the running time), so more specifically:
I am wondering:
a) when does compilation time become significant enough for such differences to matter?
b) How might I go about writing my own profiler that takes into account compilation time?
profile is slower than cProfile, but it does support threads.
cProfile is a lot faster, but AFAIK it won't profile threads (only the main one; the others will be ignored).
Profile and cProfile have nothing to do with estimating compilation time. They estimate run time.
Compilation time isn't a performance issue. Don't want your code to be compiled every time it's run? import it, and it will be saved as a .pyc, and only recompiled if you change it. It simply doesn't matter how long code takes to compile (it's very fast) since this doesn't have to be done every time it's run.
If you want to time compilation, you can use the compiler package.
Basically:
from timeit import timeit
print timeit('compiler.compileFile(%r)' % filename, 'import compiler', number=100)
will print the time it takes to compile filename 100 times.
If inside func, you append to some lists, do some addition, look up some variables in dictionaries, profile will tell you how long each of those things takes.
Your version doesn't tell you any of those things. It's also pretty inaccurate -- the time you get depends on the time it takes to look up the clock attribute of time and then call it.
If what you want is to time a short section of code, use timeit. If you want to profile code, use profile or cProfile. If what you want to know is how long arbitrary code took to run, but not which parts of it were the slowest, then your version is fine, as long as the code doesn't take just a few milliseconds.
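For comparison, a minimal cProfile run looks like this (a sketch; your_function is a placeholder workload, not code from the question):
import cProfile

def your_function():
    total = 0
    for i in range(100000):
        total += i
    return total

# Prints per-function call counts and cumulative times,
# which a wrapper decorator around the whole call cannot break down.
cProfile.run('your_function()', sort='cumulative')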
