I was trying to figure out which integers Python only instantiates once (-5 to 256, it seems), and in the process stumbled on some string behaviour I can't see the pattern in. Sometimes equal strings created in different ways share the same id, sometimes not. This code:
A = "10000"
B = "10000"
C = "100" + "00"
D = "%i"%10000
E = str(10000)
F = str(10000)
G = str(100) + "00"
H = "0".join(("10","00"))
for obj in (A,B,C,D,E,F,G,H):
    print obj, id(obj), obj is A
prints:
10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959776 True
10000 4959456 False
10000 4959488 False
10000 4959520 False
10000 4959680 False
I don't even see the pattern - save for the fact that the first four don't have an explicit function call - but surely that can't be it, since the "+" in C for example implies a function call to add. I especially don't understand why C and G are different, seeing as that implies that the ids of the components of the addition are more important than the outcome.
So, what is the special treatment that A-D undergo, making them come out as the same instance?
In terms of language specification, any compliant Python compiler and runtime is fully allowed, for any instance of an immutable type, to make a new instance OR find an existing instance of the same type that's equal to the required value and use a new reference to that same instance. This means it's always incorrect to use is or by-id comparison among immutables, and any minor release may tweak or change strategy in this matter to enhance optimization.
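As a minimal sketch of that advice (written in Python 3 syntax, with variable names made up for illustration), compare string values with == and treat the outcome of is as an unspecified implementation detail:

```python
# Two equal strings built in different ways.
literal = "10000"
built_at_runtime = "".join(["100", "00"])

# Value comparison: the language guarantees this is True.
print(literal == built_at_runtime)  # True

# Identity comparison: may be True or False depending on the
# implementation and version, so program logic should never rely on it.
shares_identity = literal is built_at_runtime
print("same object?", shares_identity)
```

The second print is deliberately left without an expected value, since that is exactly the part the language does not promise.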
In terms of implementations, the tradeoffs are pretty clear: trying to reuse an existing instance may mean time spent (perhaps wasted) trying to find such an instance, but if the attempt succeeds then some memory is saved (as well as the time to allocate and later free the memory bits needed to hold a new instance).
How to solve those implementation tradeoffs is not entirely obvious -- if you can identify heuristics that indicate that finding a suitable existing instance is likely and the search (even if it fails) will be fast, then you may want to attempt the search-and-reuse when the heuristics suggest it, but skip it otherwise.
In your observations you seem to have found a particular dot-release implementation that performs a modicum of peephole optimization when that's entirely safe, fast, and simple, so the assignments A to D all boil down to exactly the same as A (but E to H don't, as they involve named functions or methods that the optimizer's authors may reasonably have considered not 100% safe to assume semantics for -- and low-ROI if that was done -- so they're not peephole-optimized).
Thus, A to D reusing the same instance boils down to A and B doing so (as C and D get peephole-optimized to exactly the same construct).
That reuse, in turn, clearly suggests compiler tactics/optimizer heuristics whereby identical literal constants of an immutable type in the same function's local namespace are collapsed to references to just one instance in the function's .func_code.co_consts (to use current CPython's terminology for attributes of functions and code objects) -- reasonable tactics and heuristics, as reuse of the same immutable constant literal within one function is somewhat frequent, AND the price is only paid once (at compile time) while the advantage is accrued many times (every time the function runs, maybe within loops etc etc).
(It so happens that these specific tactics and heuristics, given their clearly-positive tradeoffs, have been pervasive in all recent versions of CPython, and, I believe, IronPython, Jython, and PyPy as well;-).
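A small sketch of how one might look at those stored constants. It assumes a Python 3 CPython, where the attribute is spelled `__code__` rather than `func_code`; the function name `sample` is made up for illustration, and the exact layout of `co_consts` is an implementation detail that varies between versions:

```python
def sample():
    a = "10000"
    b = "10000"
    c = "100" + "00"   # constant expression, typically folded at compile time
    return a, b, c

# Constants the compiler stored for this function (layout is version-specific).
print(sample.__code__.co_consts)

a, b, c = sample()
# Equality is guaranteed; whether the three names share one object is not.
print(a == b == c)  # True
```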
This is a somewhat worthy and interesting area of study if you're planning to write compilers, runtime environments, peephole optimizers, etc etc, for Python itself or similar languages. I guess that deep study of the internals (ideally of many different correct implementations, of course, so as not to fixate on the quirks of a specific one -- good thing Python currently enjoys at least 4 separate production-worthy implementations, not to mention several versions of each!) can also help, indirectly, make one a better Python programmer -- but it's particularly important to focus on what's guaranteed by the language itself, which is somewhat less than what you'll find in common among separate implementations, because the parts that "just happen" to be in common right now (without being required to be so by the language specs) may perfectly well change under you at the next point release of one or another implementation and, if your production code was mistakenly relying on such details, that might cause nasty surprises;-). Plus -- it's hardly ever necessary, or even particularly helpful, to rely on such variable implementation details rather than on language-mandated behavior (unless you're coding something like an optimizer, debugger, profiler, or the like, of course;-).
Python is allowed to inline string constants; A,B,C,D are actually the same literals (if Python sees a constant expression, it treats it as a constant).
str is actually a class, so str(whatever) is calling this class' constructor, which should yield a fresh object. This explains E,F,G (note that each of these has separate identity).
As for H, I am not sure, but I'd go for explanation that this expression is too complicated for Python to figure out it's actually a constant, so it computes a new string.
I believe short strings that can be evaluated at compile time will be interned automatically. In the last examples, the result can't be evaluated at compile time because str or join might be redefined.
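To illustrate why a call such as str(10000) cannot safely be evaluated at compile time, here is a small Python 3 sketch. The namespaces and the replacement function are invented for the example; the point is only that the same source text can produce different results depending on what the name str refers to at run time:

```python
source = "result = str(10000)"

# Normal case: the name str resolves to the built-in type.
normal_namespace = {}
exec(source, normal_namespace)
print(normal_namespace["result"])    # 10000

# Same source, but a global named str shadows the built-in.
shadowed_namespace = {"str": lambda value: "shadowed"}
exec(source, shadowed_namespace)
print(shadowed_namespace["result"])  # shadowed
```

A literal expression like "100" + "00" has no such name lookup, which is why a compiler can fold it without changing behaviour.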
In answer to S.Lott's suggestion of examining the bytecode:
import dis

def moo():
    A = "10000"
    B = "10000"
    C = "100" + "00"
    D = "%i"%10000
    E = str(10000)
    F = str(10000)
    G = "1000"+str(0)
    H = "0".join(("10","00"))
    I = str("10000")

    for obj in (A,B,C,D,E,F,G,H, I):
        print obj, id(obj), obj is A

moo()
dis.dis(moo)  # dis.dis() prints the listing itself and returns None
yields:
10000 4968128 True
10000 4968128 True
10000 4968128 True
10000 4968128 True
10000 2840928 False
10000 2840896 False
10000 2840864 False
10000 2840832 False
10000 4968128 True
4 0 LOAD_CONST 1 ('10000')
3 STORE_FAST 0 (A)
5 6 LOAD_CONST 1 ('10000')
9 STORE_FAST 1 (B)
6 12 LOAD_CONST 10 ('10000')
15 STORE_FAST 2 (C)
7 18 LOAD_CONST 11 ('10000')
21 STORE_FAST 3 (D)
8 24 LOAD_GLOBAL 0 (str)
27 LOAD_CONST 5 (10000)
30 CALL_FUNCTION 1
33 STORE_FAST 4 (E)
9 36 LOAD_GLOBAL 0 (str)
39 LOAD_CONST 5 (10000)
42 CALL_FUNCTION 1
45 STORE_FAST 5 (F)
10 48 LOAD_CONST 6 ('1000')
51 LOAD_GLOBAL 0 (str)
54 LOAD_CONST 7 (0)
57 CALL_FUNCTION 1
60 BINARY_ADD
61 STORE_FAST 6 (G)
11 64 LOAD_CONST 8 ('0')
67 LOAD_ATTR 1 (join)
70 LOAD_CONST 12 (('10', '00'))
73 CALL_FUNCTION 1
76 STORE_FAST 7 (H)
12 79 LOAD_GLOBAL 0 (str)
82 LOAD_CONST 1 ('10000')
85 CALL_FUNCTION 1
88 STORE_FAST 8 (I)
14 91 SETUP_LOOP 66 (to 160)
94 LOAD_FAST 0 (A)
97 LOAD_FAST 1 (B)
100 LOAD_FAST 2 (C)
103 LOAD_FAST 3 (D)
106 LOAD_FAST 4 (E)
109 LOAD_FAST 5 (F)
112 LOAD_FAST 6 (G)
115 LOAD_FAST 7 (H)
118 LOAD_FAST 8 (I)
121 BUILD_TUPLE 9
124 GET_ITER
>> 125 FOR_ITER 31 (to 159)
128 STORE_FAST 9 (obj)
15 131 LOAD_FAST 9 (obj)
134 PRINT_ITEM
135 LOAD_GLOBAL 2 (id)
138 LOAD_FAST 9 (obj)
141 CALL_FUNCTION 1
144 PRINT_ITEM
145 LOAD_FAST 9 (obj)
148 LOAD_FAST 0 (A)
151 COMPARE_OP 8 (is)
154 PRINT_ITEM
155 PRINT_NEWLINE
156 JUMP_ABSOLUTE 125
>> 159 POP_BLOCK
>> 160 LOAD_CONST 0 (None)
163 RETURN_VALUE
So it would seem that the compiler does indeed understand A-D to mean the same thing, and so it saves memory by only generating the string once (as suggested by Alex, Maciej and Greg). (The added case I seems to be str() realising that it has been handed a string already, and just passing it through.)
Thanks everyone, that's a lot clearer now.
I am trying to find the most efficient way to check whether a given string is a palindrome or not.
Firstly, I tried brute force, which has a running time of the order O(N). Then I optimized the code a little bit by making only n/2 comparisons instead of n.
Here is the code:
def palindrome(a):
    length=len(a)
    iterator=0
    while iterator <= length/2:
        if a[iterator]==a[length-iterator-1]:
            iterator+=1
        else:
            return False
    return True
It takes half the time compared to brute force, but it is still of order O(N).
Meanwhile, I also thought of a solution which uses the slice operator.
Here is the code:
def palindrome_py(a):
    return a==a[::-1]
Then I did running time analysis of both. Here is the result:
Running time for 100000 iterations
The base string has length 50; the length multiplier gives the length of the tested string (50 * multiplier)

Length multiplier   palindrome      palindrome_py
1                   0.6559998989    0.5309998989
2                   1.2970001698    0.5939998627
3                   3.5149998665    0.7820000648
4                   13.4249999523   1.5310001373
5                   65.5319998264   5.2660000324
The code I used can be accessed here: Running Time Table Generator
Now, I want to know why there is a difference between the running time of the slice operator (palindrome_py) and the palindrome function. Why am I getting these running times?
Why is the slice operator so efficient compared to the palindrome function? What is happening behind the scenes?
My observations:
The running time appears to be proportional to the multiplier, i.e. the running time for multiplier n (say 2) can be obtained by multiplying the running time of case n-1 (here the 1st) by the multiplier n (here 2).
Generalizing, we get Running Time(n) = Running Time(n-1) * Multiplier
Your slicing-based solution is still O(n); only the constant got smaller (that's your multiplier). It's faster because less work is done in Python and more is done in C. The bytecode shows it all.
In [1]: import dis
In [2]: %paste
def palindrome(a):
    length=len(a)
    iterator=0
    while iterator <= length/2:
        if a[iterator]==a[length-iterator-1]:
            iterator+=1
        else:
            return False
    return True
## -- End pasted text --
In [3]: dis.dis(palindrome)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (a)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 STORE_FAST 1 (length)
3 12 LOAD_CONST 1 (0)
15 STORE_FAST 2 (iterator)
4 18 SETUP_LOOP 65 (to 86)
>> 21 LOAD_FAST 2 (iterator)
24 LOAD_FAST 1 (length)
27 LOAD_CONST 2 (2)
30 BINARY_TRUE_DIVIDE
31 COMPARE_OP 1 (<=)
34 POP_JUMP_IF_FALSE 85
5 37 LOAD_FAST 0 (a)
40 LOAD_FAST 2 (iterator)
43 BINARY_SUBSCR
44 LOAD_FAST 0 (a)
47 LOAD_FAST 1 (length)
50 LOAD_FAST 2 (iterator)
53 BINARY_SUBTRACT
54 LOAD_CONST 3 (1)
57 BINARY_SUBTRACT
58 BINARY_SUBSCR
59 COMPARE_OP 2 (==)
62 POP_JUMP_IF_FALSE 78
6 65 LOAD_FAST 2 (iterator)
68 LOAD_CONST 3 (1)
71 INPLACE_ADD
72 STORE_FAST 2 (iterator)
75 JUMP_ABSOLUTE 21
8 >> 78 LOAD_CONST 4 (False)
81 RETURN_VALUE
82 JUMP_ABSOLUTE 21
>> 85 POP_BLOCK
10 >> 86 LOAD_CONST 5 (True)
89 RETURN_VALUE
There are a hell of a lot of Python virtual-machine level instructions here, many of which are basically function calls, and those are very expensive in Python.
Now, what about the second function?
In [4]: %paste
def palindrome_py(a):
    return a==a[::-1]
## -- End pasted text --
In [5]: dis.dis(palindrome_py)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 0 (a)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 LOAD_CONST 2 (-1)
15 BUILD_SLICE 3
18 BINARY_SUBSCR
19 COMPARE_OP 2 (==)
22 RETURN_VALUE
No Python-level iteration (jumps) is involved here, and you only get 3 calls (these instructions call methods): BUILD_SLICE, BINARY_SUBSCR, COMPARE_OP, all done in C, because str is a built-in type with all methods written in C. To be fair, we've seen the same instructions in the first function (along with a lot of other instructions), but there they are repeated for each character, multiplying the method-call overhead by n. Here you only pay Python's function call overhead once; the rest is done in C.
The bottom line: you shouldn't do low-level stuff in Python manually, because it will run slower than a high-level counterpart (unless you have an asymptotically faster alternative that literally requires low-level magic). Python, unlike many other languages, most of the time encourages you to use abstractions and rewards you with higher performance.
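The comparison can be reproduced with the standard timeit module. The following is only a sketch in Python 3: the loop variant is rewritten with two indices rather than copied from the question, the 1000-character sample string and the repetition count are arbitrary choices, and the absolute timings will differ from machine to machine:

```python
import timeit

def palindrome_loop(text):
    # Compare characters pairwise, moving inwards from both ends.
    left, right = 0, len(text) - 1
    while left < right:
        if text[left] != text[right]:
            return False
        left += 1
        right -= 1
    return True

def palindrome_slice(text):
    # Reverse and compare; both steps run inside the interpreter's C code.
    return text == text[::-1]

sample = "ab" * 250 + "ba" * 250   # a 1000-character palindrome
assert palindrome_loop(sample) and palindrome_slice(sample)

loop_time = timeit.timeit(lambda: palindrome_loop(sample), number=2000)
slice_time = timeit.timeit(lambda: palindrome_slice(sample), number=2000)
print("loop: %.4fs  slice: %.4fs" % (loop_time, slice_time))
```

Both functions are O(n); what the timings expose is the per-character interpreter overhead of the explicit loop.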
Disclaimer: I'm not new to programming, but I am new to Python. This may be a pretty basic question.
I have the following block of code:
for x in range(0, 100):
    y = 1 + 1;
Is the calculation of 1 + 1 in the second line executed 100 times?
I have two suspicions why it might not:
1) The compiler sees 1 + 1 as a constant value, and thus compiles this line into y = 2;.
2) The compiler sees that y is only set and never referenced, so it omits this line of code.
Are either/both of these correct, or does it actually get executed each iteration over the loop?
Option 1 is executed; the CPython compiler simplifies mathematical expressions with constants in the peephole optimiser.
Python will not eliminate the loop body however.
You can introspect what Python produces by looking at the bytecode; use the dis module to take a look:
>>> import dis
>>> def f():
...     for x in range(100):
...         y = 1 + 1
...
>>> dis.dis(f)
2 0 SETUP_LOOP 26 (to 29)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (100)
9 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
12 GET_ITER
>> 13 FOR_ITER 12 (to 28)
16 STORE_FAST 0 (x)
3 19 LOAD_CONST 3 (2)
22 STORE_FAST 1 (y)
25 JUMP_ABSOLUTE 13
>> 28 POP_BLOCK
>> 29 LOAD_CONST 0 (None)
32 RETURN_VALUE
The bytecode at position 19, LOAD_CONST loads the value 2 to store in y.
You can see the constants associated with the code object in the co_consts attribute of a code object; for functions you can find that object under the __code__ attribute:
>>> f.__code__.co_consts
(None, 100, 1, 2)
None is the default return value for any function, 100 the literal passed to the range() call, 1 the original literal, left in place by the peephole optimiser and 2 is the result of the optimisation.
The work is done in peephole.c, in the fold_binops_on_constants() function:
/* Replace LOAD_CONST c1. LOAD_CONST c2 BINOP
with LOAD_CONST binop(c1,c2)
The consts table must still be in list form so that the
new constant can be appended.
Called with codestr pointing to the first LOAD_CONST.
Abandons the transformation if the folding fails (i.e. 1+'a').
If the new constant is a sequence, only folds when the size
is below a threshold value. That keeps pyc files from
becoming large in the presence of code like: (None,)*1000.
*/
Take into account that Python is a highly dynamic language; such optimisations can only be applied to literals and constants that you cannot later dynamically replace.
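A short sketch of that limitation, assuming a Python 3 CPython and run at module level. The names `folded`, `not_folded` and `SECONDS_PER_MINUTE` are invented for the example, and the exact contents of `co_consts` vary between versions, so no expected value is shown for it:

```python
def folded():
    return 60 * 60 * 24                   # only literals: eligible for folding

SECONDS_PER_MINUTE = 60                   # hypothetical module-level name

def not_folded():
    return SECONDS_PER_MINUTE * 60 * 24   # global is looked up on every call

print(folded.__code__.co_consts)          # layout is version-specific
print(not_folded.__code__.co_names)       # ('SECONDS_PER_MINUTE',)

print(not_folded())                       # 86400
SECONDS_PER_MINUTE = 1                    # rebinding the name changes the result
print(not_folded())                       # 1440
```

Because the global can be rebound at any time, the compiler has to keep the lookup and the multiplication in the bytecode.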
Which of the following if statements is more Pythonic?
if not a and not b:
    do_something
OR
if not (a or b):
    do_something
It's not predicate logic, so I should use the Python keywords because that's more readable, right?
Is the latter solution more optimal than the other? (I don't believe so.)
Are there any PEP 8 guidelines on this?
Byte code of the two approaches(if it matters):
In [43]: def func1():
   ....:     if not a and not b:
   ....:         return
   ....:

In [46]: def func2():
   ....:     if not(a or b):
   ....:         return
   ....:
In [49]: dis.dis(func1)
2 0 LOAD_GLOBAL 0 (a)
3 UNARY_NOT
4 JUMP_IF_FALSE 13 (to 20)
7 POP_TOP
8 LOAD_GLOBAL 1 (b)
11 UNARY_NOT
12 JUMP_IF_FALSE 5 (to 20)
15 POP_TOP
3 16 LOAD_CONST 0 (None)
19 RETURN_VALUE
>> 20 POP_TOP
21 LOAD_CONST 0 (None)
24 RETURN_VALUE
In [50]: dis.dis(func2)
2 0 LOAD_GLOBAL 0 (a)
3 JUMP_IF_TRUE 4 (to 10)
6 POP_TOP
7 LOAD_GLOBAL 1 (b)
>> 10 JUMP_IF_TRUE 5 (to 18)
13 POP_TOP
3 14 LOAD_CONST 0 (None)
17 RETURN_VALUE
>> 18 POP_TOP
19 LOAD_CONST 0 (None)
22 RETURN_VALUE
I'd say whichever is easier for you to read, depending on what a and b are.
I think both your examples are equally readable, however if I wanted to "push the boat out" on readability I would go with:
not any((a, b))
Since to me this reads much more like English, and hence is the most Pythonic.
Which to use? Whichever is more readable for what you're trying to do.
As to which is more efficient, the first one does do an extra not so it is technically less efficient, but not so you'd notice in a normal situation.
They are equivalent and whether one is faster than the other depends on circumstances (the values of a and b).
So just choose the version which you find most readable and/or understandable.
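The equivalence claimed above is De Morgan's law, and it can be checked exhaustively over a handful of truthy and falsy values. A minimal sketch, with helper names made up for illustration:

```python
from itertools import product

def form_one(a, b):
    return not a and not b

def form_two(a, b):
    return not (a or b)

# A mix of falsy and truthy values of different types.
samples = [0, 1, "", "x", None, [], [0]]

for a, b in product(samples, repeat=2):
    assert form_one(a, b) == form_two(a, b)

print("both forms agree on", len(samples) ** 2, "pairs")  # 49 pairs
```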
I personally like the Eiffel approach, put into pythonic form
if a and then b:
    dosomething
if a and b:
    dosomething
In Eiffel, the first form differs from the second when a is false: the first does not evaluate b, while the second does.
The or equivalent is "or else".
http://en.wikipedia.org/wiki/Short-circuit_evaluation
In Eiffel, and/or are eager,
while and then/or else short-circuit the evaluation. (Python's own and/or already short-circuit.)
The nice thing about the syntax is that it reads well, and it doesn't introduce new keywords.
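For comparison, Python's built-in and/or already short-circuit, so no extra syntax is needed to get that behaviour. A small sketch with a hypothetical check() helper that records which operands were actually evaluated:

```python
calls = []

def check(name, value):
    # Record that this operand was evaluated, then return its value.
    calls.append(name)
    return value

result = check("a", False) and check("b", True)
print(result, calls)   # False ['a']

calls.clear()
result = check("a", True) or check("b", False)
print(result, calls)   # True ['a']
```

In both cases the right-hand operand is never evaluated once the left-hand one has decided the outcome.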
For a piece of code to be Pythonic, it must be both pleasing to the reader in and of itself (readable) and in the context of its surroundings (consistent). Without having the context of this piece of code, a good opinion is hard to give.
But, on the other hand... If I were being Pythonic in my opinion giving I would need to operate consistently with my surroundings, which seem not to take context into consideration (e.g. the OP).
The top one.
Why do the following two implementations have different performance in Python?
from cStringIO import StringIO
from itertools import imap
from sys import stdin
input = imap(int, StringIO(stdin.read()))
print '\n'.join(imap(str, sorted(input)))
AND
import sys
l = []
for line in sys.stdin:
    l.append(int(line.strip('\n')))
l.sort()
for x in l:
    print x
The first implementation is faster than the second for inputs of the order of 10^6 lines. Why so?
>>> dis.dis(first)
2 0 LOAD_GLOBAL 0 (imap)
3 LOAD_GLOBAL 1 (int)
6 LOAD_GLOBAL 2 (StringIO)
9 LOAD_GLOBAL 3 (stdin)
12 LOAD_ATTR 4 (read)
15 CALL_FUNCTION 0
18 CALL_FUNCTION 1
21 CALL_FUNCTION 2
24 STORE_FAST 0 (input)
27 LOAD_CONST 0 (None)
30 RETURN_VALUE
>>> dis.dis(second)
2 0 SETUP_LOOP 48 (to 51)
3 LOAD_GLOBAL 0 (sys)
6 LOAD_ATTR 1 (stdin)
9 CALL_FUNCTION 0
12 GET_ITER
>> 13 FOR_ITER 34 (to 50)
16 STORE_FAST 0 (line)
3 19 LOAD_GLOBAL 2 (l)
22 LOAD_ATTR 3 (append)
25 LOAD_GLOBAL 4 (int)
28 LOAD_FAST 0 (line)
31 LOAD_ATTR 5 (strip)
34 LOAD_CONST 1 ('\n')
37 CALL_FUNCTION 1
40 CALL_FUNCTION 1
43 CALL_FUNCTION 1
46 POP_TOP
47 JUMP_ABSOLUTE 13
>> 50 POP_BLOCK
4 >> 51 LOAD_GLOBAL 2 (l)
54 LOAD_ATTR 6 (sort)
57 CALL_FUNCTION 0
60 POP_TOP
61 LOAD_CONST 0 (None)
64 RETURN_VALUE
first is your first function.
second is your second function.
dis tells one of the reasons why the first one is faster.
Two primary reasons:
The 2nd code explicitly constructs a list and sorts it afterwards, while the 1st version lets sorted create only an internal list while sorting at the same time.
The 2nd code explicitly loops over a list with for (on the Python VM), while the 1st version implicitly loops with imap (over the underlying structure in C).
Anyways, why is StringIO in there? The most straightforward and probably fastest way is:
from sys import stdin, stdout
stdout.writelines(sorted(stdin, key=int))
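A sketch of the same idea in Python 3, using io.StringIO as a stand-in for stdin so that it can be run without piping input. The function name and sample data are made up for illustration:

```python
import io
import sys

def sort_numeric_lines(stream):
    # int() ignores the trailing newline, so the lines can be used as keys
    # directly and written back out unchanged.
    return sorted(stream, key=int)

fake_stdin = io.StringIO("10\n2\n33\n1\n")
sys.stdout.writelines(sort_numeric_lines(fake_stdin))
# prints 1, 2, 10, 33 on separate lines
```

Because the original lines are kept as they are, no string-to-int-to-string round trip is needed for the output.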
Do a step-by-step conversion from the second to the first one and see how the performance changes with each step.
1) Remove line.strip. This will cause some speed up, whether it would be significant is another matter. The stripping is superfluous as has been mentioned by you and THC4k.
2) Then replace the for loop using l.append with map(int, sys.stdin). My guess is that this would give a significant speed-up.
3) Replace map and l.sort with imap and sorted. My guess is that it won't affect the performance, there could be a slight slowdown, but it would be far from significant. Between the two, I'd usually go with the former, but with Python 3 on the horizon the latter is probably preferable.
4) Replace the for loop using print with print '\n'.join(...). My guess is that this would be another speed-up, but it would cost you some memory.
5) Add cStringIO (which is completely unnecessary by the way) to see how it affects performance. My guess is that it would be slightly slower, but not enough to counter 4 and 2.
Then, if you try THC4k's answer, it would probably be faster than all of the above, while being simpler and easier to read, and using less memory than 4 and 5. It has slightly different behaviour (it doesn't strip leading zeros from the numbers).
Of course, try this yourself instead of trusting anyone's guesses. Also run cProfile on your code and see which parts take the most time.
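As a rough sketch of what such a profiling run could look like in Python 3 (the function `parse_and_sort` and the generated input are placeholders, not the code from the question):

```python
import cProfile
import io
import pstats

def parse_and_sort(lines):
    # Stand-in for the code under test: parse each line, then sort.
    return sorted(int(line) for line in lines)

# 50000 numeric lines in descending order, as made-up input.
lines = ["%d\n" % n for n in range(50000, 0, -1)]

profiler = cProfile.Profile()
result = profiler.runcall(parse_and_sort, lines)

# Collect the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

For a whole script, running `python -m cProfile script.py` gives a similar report without any code changes.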
A few years ago, someone posted three Python/NumPy functions on Active State Recipes for comparison purposes; each of these accepted the same arguments and returned the same result, a distance matrix.
Two of these were taken from published sources; they are both--or they appear to me to be--idiomatic numpy code. The repetitive calculations required to create a distance matrix are driven by numpy's elegant index syntax. Here's one of them:
# sqrt and sum must be NumPy's versions (the builtin sum has no axis argument)
from numpy import sqrt, sum
from numpy.matlib import repmat, repeat

def calcDistanceMatrixFastEuclidean(points):
    numPoints = len(points)
    distMat = sqrt(sum((repmat(points, numPoints, 1) -
                        repeat(points, numPoints, axis=0))**2, axis=1))
    return distMat.reshape((numPoints,numPoints))
The third created the distance matrix using a single loop (which, obviously is a lot of looping given that a distance matrix of just 1,000 2D points, has one million entries). At first glance this function looked to me like the code I used to write when I was learning NumPy and I would write NumPy code by first writing Python code and then translating it, line by line.
Several months after the Active State post, results of performance tests comparing the three were posted and discussed in a thread on the NumPy mailing list.
The function with the loop in fact significantly outperformed the other two:
from numpy import array, zeros, newaxis, sqrt

def calcDistanceMatrixFastEuclidean2(nDimPoints):
    nDimPoints = array(nDimPoints)
    n,m = nDimPoints.shape
    delta = zeros((n,n),'d')
    for d in xrange(m):
        data = nDimPoints[:,d]
        delta += (data - data[:,newaxis])**2
    return sqrt(delta)
One participant in the thread (Keir Mierle) offered a reason why this might be true:
"The reason that I suspect this will be faster is that it has better locality, completely finishing a computation on a relatively small working set before moving onto the next one. The one liners have to pull the potentially large MxN array into the processor repeatedly."
By this poster's own account, his remark is only a suspicion, and it doesn't appear that it was discussed any further.
Any other thoughts about how to account for these results?
In particular, is there a useful rule--regarding when to loop and when to index--that can be extracted from this example as guidance in writing numpy code?
For those not familiar with NumPy, or who haven't looked at the code, this comparison is not based on an edge case--it certainly wouldn't be that interesting to me if it were. Instead, this comparison involves a function that performs a common task in matrix computation (i.e., creating a result array given two antecedents); moreover, each function is built from some of the most common numpy built-ins.
TL;DR: The second code above is only looping over the number of dimensions of the points (3 times through the for loop for 3D points), so there isn't much looping there. The real speed-up in the second code above is that it better harnesses the power of Numpy to avoid creating some extra matrices when finding the differences between points. This reduces memory used and computational effort.
Longer Explanation
I think that the calcDistanceMatrixFastEuclidean2 function is deceiving you with its loop perhaps. It is only looping over the number of dimensions of the points. For 1D points, the loop only executes once, for 2D, twice, and for 3D, thrice. This is really not much looping at all.
Let's analyze the code a little bit to see why the one is faster than the other. calcDistanceMatrixFastEuclidean I will call fast1 and calcDistanceMatrixFastEuclidean2 will be fast2.
fast1 is based on the Matlab way of doing things, as is evidenced by the repmat function. The repmat function creates an array that, in this case, is just the original data repeated over and over again. However, if you look at the code for the function, it is very inefficient. It uses many Numpy functions (3 reshapes and 2 repeats) to do this. The repeat function is also used to create an array that contains the original data with each data item repeated many times. If our input data is [1,2,3] then we are subtracting [1,1,1,2,2,2,3,3,3] from [1,2,3,1,2,3,1,2,3]. Numpy has had to create a lot of extra matrices in between running Numpy's C code, which could have been avoided.
fast2 uses more of Numpy's heavy lifting without creating as many matrices between Numpy calls. fast2 loops through each dimension of the points, does the subtraction and keeps a running total of the squared differences between each dimension. Only at the end is the square root done. So far, this may not sound quite as efficient as fast1, but fast2 avoids doing the repmat stuff by using Numpy's indexing. Let's look at the 1D case for simplicity. fast2 makes a 1D array of the data and subtracts a 2D (N x 1) view of the same data from it. This creates the difference matrix between each point and all of the other points without having to use repmat and repeat, and thereby bypasses creating a lot of extra arrays. This is where the real speed difference lies, in my opinion. fast1 creates a lot of extra intermediate matrices (and they are computationally expensive to create) to find the differences between points, while fast2 better harnesses the power of Numpy to avoid these.
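The two strategies can be put side by side on a tiny input. This is only a sketch in Python 3 with current NumPy: np.tile is used here as an assumed stand-in for repmat, and the three sample points are made up so that the distances are easy to verify by hand:

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
n = len(points)

# fast1 style: materialise two (n*n, m) arrays, then subtract them.
tiled = np.tile(points, (n, 1))            # points repeated as a block
repeated = np.repeat(points, n, axis=0)    # each point repeated n times
dist_tiled = np.sqrt(((tiled - repeated) ** 2).sum(axis=1)).reshape(n, n)

# fast2 style: one broadcast (n, n) difference per coordinate.
delta = np.zeros((n, n))
for d in range(points.shape[1]):
    column = points[:, d]
    delta += (column - column[:, np.newaxis]) ** 2
dist_broadcast = np.sqrt(delta)

print(np.allclose(dist_tiled, dist_broadcast))  # True
print(dist_broadcast[0, 1])                      # 5.0
```

Both routes give the same matrix; the broadcast version simply never builds the two large intermediate arrays.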
By the way, here is a little bit faster version of fast2:
def calcDistanceMatrixFastEuclidean3(nDimPoints):
    nDimPoints = array(nDimPoints)
    n,m = nDimPoints.shape
    data = nDimPoints[:,0]
    delta = (data - data[:,newaxis])**2
    for d in xrange(1,m):
        data = nDimPoints[:,d]
        delta += (data - data[:,newaxis])**2
    return sqrt(delta)
The difference is that we are no longer creating delta as a zeros matrix.
dis for fun:
dis.dis(calcDistanceMatrixFastEuclidean)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (points)
6 CALL_FUNCTION 1
9 STORE_FAST 1 (numPoints)
3 12 LOAD_GLOBAL 1 (sqrt)
15 LOAD_GLOBAL 2 (sum)
18 LOAD_GLOBAL 3 (repmat)
21 LOAD_FAST 0 (points)
24 LOAD_FAST 1 (numPoints)
27 LOAD_CONST 1 (1)
30 CALL_FUNCTION 3
4 33 LOAD_GLOBAL 4 (repeat)
36 LOAD_FAST 0 (points)
39 LOAD_FAST 1 (numPoints)
42 LOAD_CONST 2 ('axis')
45 LOAD_CONST 3 (0)
48 CALL_FUNCTION 258
51 BINARY_SUBTRACT
52 LOAD_CONST 4 (2)
55 BINARY_POWER
56 LOAD_CONST 2 ('axis')
59 LOAD_CONST 1 (1)
62 CALL_FUNCTION 257
65 CALL_FUNCTION 1
68 STORE_FAST 2 (distMat)
5 71 LOAD_FAST 2 (distMat)
74 LOAD_ATTR 5 (reshape)
77 LOAD_FAST 1 (numPoints)
80 LOAD_FAST 1 (numPoints)
83 BUILD_TUPLE 2
86 CALL_FUNCTION 1
89 RETURN_VALUE
dis.dis(calcDistanceMatrixFastEuclidean2)
2 0 LOAD_GLOBAL 0 (array)
3 LOAD_FAST 0 (nDimPoints)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (nDimPoints)
3 12 LOAD_FAST 0 (nDimPoints)
15 LOAD_ATTR 1 (shape)
18 UNPACK_SEQUENCE 2
21 STORE_FAST 1 (n)
24 STORE_FAST 2 (m)
4 27 LOAD_GLOBAL 2 (zeros)
30 LOAD_FAST 1 (n)
33 LOAD_FAST 1 (n)
36 BUILD_TUPLE 2
39 LOAD_CONST 1 ('d')
42 CALL_FUNCTION 2
45 STORE_FAST 3 (delta)
5 48 SETUP_LOOP 76 (to 127)
51 LOAD_GLOBAL 3 (xrange)
54 LOAD_FAST 2 (m)
57 CALL_FUNCTION 1
60 GET_ITER
>> 61 FOR_ITER 62 (to 126)
64 STORE_FAST 4 (d)
6 67 LOAD_FAST 0 (nDimPoints)
70 LOAD_CONST 0 (None)
73 LOAD_CONST 0 (None)
76 BUILD_SLICE 2
79 LOAD_FAST 4 (d)
82 BUILD_TUPLE 2
85 BINARY_SUBSCR
86 STORE_FAST 5 (data)
7 89 LOAD_FAST 3 (delta)
92 LOAD_FAST 5 (data)
95 LOAD_FAST 5 (data)
98 LOAD_CONST 0 (None)
101 LOAD_CONST 0 (None)
104 BUILD_SLICE 2
107 LOAD_GLOBAL 4 (newaxis)
110 BUILD_TUPLE 2
113 BINARY_SUBSCR
114 BINARY_SUBTRACT
115 LOAD_CONST 2 (2)
118 BINARY_POWER
119 INPLACE_ADD
120 STORE_FAST 3 (delta)
123 JUMP_ABSOLUTE 61
>> 126 POP_BLOCK
8 >> 127 LOAD_GLOBAL 5 (sqrt)
130 LOAD_FAST 3 (delta)
133 CALL_FUNCTION 1
136 RETURN_VALUE
I'm not an expert on dis, but it seems like you would have to look more closely at the functions that the first one is calling to know why they take a while. Python also ships with a performance profiler, cProfile.