I have some pretty ugly indexing going on. For example, things like
valid[ data[ index[valid[:,0],0] ] == 0, 1] = False
where valid and index are {Nx2} arrays of bools and ints respectively, and data is {N} long.
If I concentrate really hard, I can convince myself that this is doing what I want... but it's incredibly obfuscated. How can I unobfuscate something like this efficiently?
I could break it up, for example:
valid_index = index[valid[:,0],0]
invalid_index = (data[ valid_index ] == 0)
valid[ invalid_index, 1 ] = False
But my arrays will have up to 100's of millions of entries so I don't want to duplicate the memory; and I need to remain as speed efficient as possible.
These two code sequences are nearly identical, and should have very similar performance. That's my "gut feeling"--but then I did static analysis and ran a partial benchmark to confirm.
The clearer option requires four more bytecodes to implement, so will probably be slightly slower. But the extra work is restricted to LOAD_FAST and STORE_FAST, which are just moves from the top of stack (TOS) to/from variables. As the extra work is modest, so should be the performance impact.
You could benchmark the two approaches on your target equipment for more quantitative precision, but on my 3-year-old laptop, 100 million extra LOAD_FAST / STORE_FAST pairs take just over 3 seconds on standard CPython 2.7.5. So I estimate this clarity will cost you about 6 seconds per 100M entries. While the PyPy just-in-time Python compiler doesn't use the same bytecodes, I timed its overhead for the clear version at about half that, or 3 seconds per 100M. Compared to other work you're doing to process the items, the clearer version probably is not a significant slowdown.
The TL;DR Backstory
My first impression is that the code sequences, while different in readability and clarity, are technically very similar, and should have similar performance characteristics. But let's analyze a bit further using the Python disassembler. I dropped each code snippet into a function:
def one(valid, data):
    valid[ data[ index[valid[:,0],0] ] == 0, 1] = False
def two(valid, data):
    valid_index = index[valid[:,0],0]
    invalid_index = (data[ valid_index ] == 0)
    valid[ invalid_index, 1 ] = False
Then using Python's bytecode disassembler:
import dis
dis.dis(one)
print "---"
dis.dis(two)
Gives:
15 0 LOAD_GLOBAL 0 (False)
3 LOAD_FAST 0 (valid)
6 LOAD_FAST 1 (data)
9 LOAD_GLOBAL 1 (index)
12 LOAD_FAST 0 (valid)
15 LOAD_CONST 0 (None)
18 LOAD_CONST 0 (None)
21 BUILD_SLICE 2
24 LOAD_CONST 1 (0)
27 BUILD_TUPLE 2
30 BINARY_SUBSCR
31 LOAD_CONST 1 (0)
34 BUILD_TUPLE 2
37 BINARY_SUBSCR
38 BINARY_SUBSCR
39 LOAD_CONST 1 (0)
42 COMPARE_OP 2 (==)
45 LOAD_CONST 2 (1)
48 BUILD_TUPLE 2
51 STORE_SUBSCR
52 LOAD_CONST 0 (None)
55 RETURN_VALUE
18 0 LOAD_GLOBAL 0 (index)
3 LOAD_FAST 0 (valid)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 BUILD_SLICE 2
15 LOAD_CONST 1 (0)
18 BUILD_TUPLE 2
21 BINARY_SUBSCR
22 LOAD_CONST 1 (0)
25 BUILD_TUPLE 2
28 BINARY_SUBSCR
29 STORE_FAST 2 (valid_index)
19 32 LOAD_FAST 1 (data)
35 LOAD_FAST 2 (valid_index)
38 BINARY_SUBSCR
39 LOAD_CONST 1 (0)
42 COMPARE_OP 2 (==)
45 STORE_FAST 3 (invalid_index)
20 48 LOAD_GLOBAL 1 (False)
51 LOAD_FAST 0 (valid)
54 LOAD_FAST 3 (invalid_index)
57 LOAD_CONST 2 (1)
60 BUILD_TUPLE 2
63 STORE_SUBSCR
64 LOAD_CONST 0 (None)
67 RETURN_VALUE
Similar but not identical, and not in the same order. A quick diff of the two shows the same, plus the possibility the clearer function requires more byte codes.
I parsed the bytecode opcodes out of each function's disassembler listing, dropped them into a collections.Counter, and compared the counts:
Bytecode Count(s)
======== ========
BINARY_SUBSCR 3
BUILD_SLICE 1
BUILD_TUPLE 3
COMPARE_OP 1
LOAD_CONST 7
LOAD_FAST 3, 5 *** differs ***
LOAD_GLOBAL 2
RETURN_VALUE 1
STORE_FAST 0, 2 *** differs ***
STORE_SUBSCR 1
Here is where it becomes evident that the second, clearer approach uses only four more bytecodes, and of the simple, fast LOAD_FAST / STORE_FAST variety. Static analysis thus shows no particular reason to fear additional memory allocation or other performance-killing side effects.
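The opcode tally can be reproduced along the following lines. This is a minimal sketch rather than the script originally used: it assumes Python 3.6+ (for dis.get_instructions and f-strings), passes index as a parameter instead of reading it as a global, and never calls the two functions, so NumPy is not needed. Opcode names and counts vary between CPython versions, so the printed differences will not match the Python 2.7 table exactly.

```python
import dis
from collections import Counter


def one(valid, data, index):
    valid[data[index[valid[:, 0], 0]] == 0, 1] = False


def two(valid, data, index):
    valid_index = index[valid[:, 0], 0]
    invalid_index = (data[valid_index] == 0)
    valid[invalid_index, 1] = False


def opcode_counts(func):
    """Count how often each opcode name appears in func's bytecode."""
    return Counter(instr.opname for instr in dis.get_instructions(func))


counts_one = opcode_counts(one)
counts_two = opcode_counts(two)

# Print only the opcodes whose counts differ between the two functions.
for name in sorted(set(counts_one) | set(counts_two)):
    if counts_one[name] != counts_two[name]:
        print(f"{name:<24} {counts_one[name]:>3} {counts_two[name]:>3}")
```

A Counter returns 0 for missing keys, which is why opcodes that appear in only one of the functions can be compared without special handling.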
I then constructed two functions, very similar to one another, that the disassembler shows differ only in that the second one has an extra LOAD_FAST / STORE_FAST pair. I ran them 100,000,000 times, and compared their runtimes. They differed by just over 3 seconds in CPython 2.7.5, and about 1.5 seconds under PyPy 2.2.1 (based on Python 2.7.3). Even when you double those times (because you have two pairs), it's pretty clear those extra load/store pairs are not going to slow you down much.
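A timing experiment of that kind might look like the sketch below. The two functions and the call count are hypothetical stand-ins, not the ones from the original measurement; the only intended difference between them is one extra store to and load from a local variable.

```python
import timeit


def without_temp(x):
    return x + 1


def with_temp(x):
    tmp = x + 1      # extra store to a local variable ...
    return tmp       # ... and extra load from it


calls = 200_000  # hypothetical count; raise it for a steadier measurement
t_without = timeit.timeit(lambda: without_temp(1), number=calls)
t_with = timeit.timeit(lambda: with_temp(1), number=calls)
print(f"without temp: {t_without:.4f}s  with temp: {t_with:.4f}s")
print(f"extra cost per call: {(t_with - t_without) / calls * 1e9:.1f} ns")
```

The difference is small compared with the call overhead itself, so expect noisy numbers; repeating the measurement several times and taking the minimum is usually more reliable.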
Related
When using the in operator on a literal, is it most idiomatic for that literal to be a list, set, or tuple?
e.g.
for x in {'foo', 'bar', 'baz'}:
doSomething(x)
...
if val in {1, 2, 3}:
doSomethingElse(val)
I don't see any benefit to the list, but the tuple's immutability means it could be hoisted or reused by an efficient interpreter. And in the case of the if, if it's reused, there's an efficiency benefit.
Which is the most idiomatic, and which is most performant in cpython?
Python provides a disassembler, so you can often just check the bytecode:
In [4]: def checktup():
   ...:     for _ in range(10):
   ...:         if val in (1, 2, 3):
   ...:             print("foo")
   ...:
In [5]: def checkset():
   ...:     for _ in range(10):
   ...:         if val in {1, 2, 3}:
   ...:             print("foo")
   ...:
In [6]: import dis
For the tuple literal:
In [7]: dis.dis(checktup)
2 0 SETUP_LOOP 32 (to 34)
2 LOAD_GLOBAL 0 (range)
4 LOAD_CONST 1 (10)
6 CALL_FUNCTION 1
8 GET_ITER
>> 10 FOR_ITER 20 (to 32)
12 STORE_FAST 0 (_)
3 14 LOAD_GLOBAL 1 (val)
16 LOAD_CONST 6 ((1, 2, 3))
18 COMPARE_OP 6 (in)
20 POP_JUMP_IF_FALSE 10
4 22 LOAD_GLOBAL 2 (print)
24 LOAD_CONST 5 ('foo')
26 CALL_FUNCTION 1
28 POP_TOP
30 JUMP_ABSOLUTE 10
>> 32 POP_BLOCK
>> 34 LOAD_CONST 0 (None)
36 RETURN_VALUE
For the set-literal:
In [8]: dis.dis(checkset)
2 0 SETUP_LOOP 32 (to 34)
2 LOAD_GLOBAL 0 (range)
4 LOAD_CONST 1 (10)
6 CALL_FUNCTION 1
8 GET_ITER
>> 10 FOR_ITER 20 (to 32)
12 STORE_FAST 0 (_)
3 14 LOAD_GLOBAL 1 (val)
16 LOAD_CONST 6 (frozenset({1, 2, 3}))
18 COMPARE_OP 6 (in)
20 POP_JUMP_IF_FALSE 10
4 22 LOAD_GLOBAL 2 (print)
24 LOAD_CONST 5 ('foo')
26 CALL_FUNCTION 1
28 POP_TOP
30 JUMP_ABSOLUTE 10
>> 32 POP_BLOCK
>> 34 LOAD_CONST 0 (None)
36 RETURN_VALUE
You'll notice that in both cases the function does a LOAD_CONST, i.e., both literals have been optimized into constants. Even better, for the set literal the peephole optimizer has worked out, while the function was being compiled, that the set can be replaced by its immutable equivalent, so a frozenset is stored as the constant.
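If reading a full disassembly is more than you need, the stored constants can also be inspected directly on the code object. The sketch below assumes CPython 3.2 or later; container_constants is a hypothetical helper name, and the functions take val as a parameter to keep the example self-contained.

```python
def checktup(val):
    return val in (1, 2, 3)


def checkset(val):
    return val in {1, 2, 3}


def container_constants(func):
    """Return the tuple/frozenset constants stored on func's code object."""
    return [c for c in func.__code__.co_consts
            if isinstance(c, (tuple, frozenset))]


print(container_constants(checktup))
print(container_constants(checkset))
```

On CPython 3 the second list should contain a frozenset even though the source uses a set literal; other implementations are free to behave differently.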
Note that on Python 2, the compiler builds a set every time:
In [1]: import dis
In [2]: def checkset():
   ...:     for _ in range(10):
   ...:         if val in {1, 2, 3}:
   ...:             print("foo")
   ...:
In [3]: dis.dis(checkset)
2 0 SETUP_LOOP 49 (to 52)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (10)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 35 (to 51)
16 STORE_FAST 0 (_)
3 19 LOAD_GLOBAL 1 (val)
22 LOAD_CONST 2 (1)
25 LOAD_CONST 3 (2)
28 LOAD_CONST 4 (3)
31 BUILD_SET 3
34 COMPARE_OP 6 (in)
37 POP_JUMP_IF_FALSE 13
4 40 LOAD_CONST 5 ('foo')
43 PRINT_ITEM
44 PRINT_NEWLINE
45 JUMP_ABSOLUTE 13
48 JUMP_ABSOLUTE 13
>> 51 POP_BLOCK
>> 52 LOAD_CONST 0 (None)
55 RETURN_VALUE
IMO, there's essentially no such thing as "idiomatic" usage of literal values as shown in the question. Such values look like "magic numbers" to me. Using literals for "performance" is probably misguided because it sacrifices readability for marginal gains. In cases where performance really matters, using literals is unlikely to help much and there are better options regardless.
I think the idiomatic thing to do would be to store such values in a global or class variable, especially if you're using them in multiple places (but also even if you aren't). This provides some documentation as to what a value's purpose is and makes it easier to update. You can then memoize these values in function/method definitions to improve performance if necessary.
As to what type of data structure is most appropriate, that would depend on what your program does and how it uses the data. For example, does ordering matter? With an if x in y, it won't, but maybe you're using the data in a for and an if. Without context, it's hard to say what the best choice would be.
Here's an example I think is readable, extensible, and also efficient. Memoizing the global ITEMS in the function definitions makes lookup fast because items is in the local namespace of the function. If you look at the disassembled code, you'll see that items is looked up via LOAD_FAST instead of LOAD_GLOBAL. This approach also avoids making multiple copies of the list of items, which might be relevant if it's big enough (although, if it was big enough, you probably wouldn't try to inline it anyway). Personally, I wouldn't bother with these kinds of optimizations most of the time, but they can be useful in some cases.
# In real code, this would have a domain-specific name instead of the
# generic `ITEMS`.
ITEMS = {'a', 'b', 'c'}

def filter_in_items(values, items=ITEMS):
    matching_items = []
    for value in values:
        if value in items:
            matching_items.append(value)
    return matching_items

def filter_not_in_items(values, items=ITEMS):
    non_matching_items = []
    for value in values:
        if value not in items:
            non_matching_items.append(value)
    return non_matching_items

print(filter_in_items(('a', 'x')))  # -> ['a']
print(filter_not_in_items(('a', 'x')))  # -> ['x']

import dis
dis.dis(filter_in_items)
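The local-versus-global distinction can also be checked without reading bytecode, by looking at the name tables on the code object. A small sketch follows; uses_global and uses_default are hypothetical names introduced only for this comparison.

```python
ITEMS = {'a', 'b', 'c'}


def uses_global(value):
    return value in ITEMS


def uses_default(value, items=ITEMS):
    return value in items


# Names a function reads from the global/builtin scope are in co_names;
# parameters and other locals are in co_varnames.
print(uses_global.__code__.co_names)      # includes 'ITEMS'
print(uses_default.__code__.co_varnames)  # includes 'items'
print(uses_default.__defaults__)          # the set bound at definition time
```

One consequence of binding the default at definition time: rebinding the global ITEMS later does not change what uses_default sees, although mutating the same set object in place does.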
I am trying to find the most efficient way to check whether the given string is palindrome or not.
Firstly, I tried brute force which has running time of the order O(N). Then I optimized the code a little bit by making only n/2 comparisons instead of n.
Here is the code:
def palindrome(a):
    length=len(a)
    iterator=0
    while iterator <= length/2:
        if a[iterator]==a[length-iterator-1]:
            iterator+=1
        else:
            return False
    return True
It takes half the time compared to brute force, but it is still of order O(N).
Meanwhile, I also thought of a solution which uses slice operator.
Here is the code:
def palindrome_py(a):
    return a==a[::-1]
Then I did running time analysis of both. Here is the result:
Running time for 100000 iterations
Length of the base string is 50; the length multiplier gives the length of the tested string (50 * multiplier)

Length multiplier    palindrome       palindrome_py
1                    0.6559998989     0.5309998989
2                    1.2970001698     0.5939998627
3                    3.5149998665     0.7820000648
4                    13.4249999523    1.5310001373
5                    65.5319998264    5.2660000324
The code I used can be accessed here: Running Time Table Generator
Now, I want to know why there is a difference between the running time of the slice operator (palindrome_py) and the palindrome function. Why am I getting this type of running time?
Why is the slice operator so efficient as compared to the palindrome function, what is happening behind the scenes?
My observations:
The running time seems to scale with the multiplier, i.e. the running time for multiplier n can be obtained by multiplying the running time for multiplier n-1 by n (for example, the time for multiplier 2 is the time for multiplier 1 times 2).
Generalizing, we get Running Time(n) = Running Time(n-1) * n
Your slicing-based solution is still O(n), the constant got smaller (that's your multiplier). It's faster, because less stuff is done in Python and more stuff is done in C. The bytecode shows it all.
In [1]: import dis
In [2]: %paste
def palindrome(a):
    length=len(a)
    iterator=0
    while iterator <= length/2:
        if a[iterator]==a[length-iterator-1]:
            iterator+=1
        else:
            return False
    return True
## -- End pasted text --
In [3]: dis.dis(palindrome)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (a)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 STORE_FAST 1 (length)
3 12 LOAD_CONST 1 (0)
15 STORE_FAST 2 (iterator)
4 18 SETUP_LOOP 65 (to 86)
>> 21 LOAD_FAST 2 (iterator)
24 LOAD_FAST 1 (length)
27 LOAD_CONST 2 (2)
30 BINARY_TRUE_DIVIDE
31 COMPARE_OP 1 (<=)
34 POP_JUMP_IF_FALSE 85
5 37 LOAD_FAST 0 (a)
40 LOAD_FAST 2 (iterator)
43 BINARY_SUBSCR
44 LOAD_FAST 0 (a)
47 LOAD_FAST 1 (length)
50 LOAD_FAST 2 (iterator)
53 BINARY_SUBTRACT
54 LOAD_CONST 3 (1)
57 BINARY_SUBTRACT
58 BINARY_SUBSCR
59 COMPARE_OP 2 (==)
62 POP_JUMP_IF_FALSE 78
6 65 LOAD_FAST 2 (iterator)
68 LOAD_CONST 3 (1)
71 INPLACE_ADD
72 STORE_FAST 2 (iterator)
75 JUMP_ABSOLUTE 21
8 >> 78 LOAD_CONST 4 (False)
81 RETURN_VALUE
82 JUMP_ABSOLUTE 21
>> 85 POP_BLOCK
10 >> 86 LOAD_CONST 5 (True)
89 RETURN_VALUE
That is a lot of Python virtual-machine instructions, many of which are essentially function calls, and function calls are very expensive in Python.
Now, what's with the second function.
In [4]: %paste
def palindrome_py(a):
    return a==a[::-1]
## -- End pasted text --
In [5]: dis.dis(palindrome_py)
2 0 LOAD_FAST 0 (a)
3 LOAD_FAST 0 (a)
6 LOAD_CONST 0 (None)
9 LOAD_CONST 0 (None)
12 LOAD_CONST 2 (-1)
15 BUILD_SLICE 3
18 BINARY_SUBSCR
19 COMPARE_OP 2 (==)
22 RETURN_VALUE
No Python-level iteration (jumps) is involved here, and you only get 3 calls (these instructions call methods): BUILD_SLICE, BINARY_SUBSCR, COMPARE_OP, all done in C, because str is a built-in type with all its methods written in C. To be fair, we've seen the same instructions in the first function (along with a lot of other instructions), but there they are repeated for each character, multiplying the method-call overhead by n. Here you only pay Python's function-call overhead once; the rest is done in C.
The bottom line: you shouldn't do low-level stuff in Python manually, because it will run slower than a high-level counterpart (unless you have an asymptotically faster alternative that literally requires low-level magic). Python, unlike many other languages, most of the time encourages you to use abstractions and rewards you with higher performance.
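To see the constant-factor difference on your own machine, a comparison along these lines can be used. It is a sketch under a few assumptions: the loop version is rewritten with range(length // 2) (which also avoids the IndexError the original loop raises on an empty string), and the test string and repetition count are arbitrary choices.

```python
import timeit


def palindrome_loop(a):
    # Compare mirrored characters; only the first half needs checking.
    length = len(a)
    for i in range(length // 2):
        if a[i] != a[length - i - 1]:
            return False
    return True


def palindrome_slice(a):
    return a == a[::-1]


text = "ab" * 250 + "ba" * 250  # hypothetical 1000-character palindrome
assert palindrome_loop(text) and palindrome_slice(text)

for func in (palindrome_loop, palindrome_slice):
    seconds = timeit.timeit(lambda: func(text), number=2_000)
    print(f"{func.__name__:<18} {seconds:.4f}s")
```

Both functions do O(n) work; the slice version simply does that work inside the interpreter's C code instead of executing several bytecodes per character.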
Will the following snippet create and destroy the list of constants on each loop, incurring whatever (albeit small) overhead this implies, or is the list created once?
for i in <some-type-of-iterable>:
    if i in [1,3,5,18,3457,40567]:
        print(i)
I am asking about both the Python "standard", such as one exists, and about the common CPython implementation.
I am aware that this example is contrived, as well as that trying to worry about performance using CPython is silly, but I am just curious.
This depends on the Python implementation and version and how the "constant lists" are used. On CPython 2.7.10 with your example, it looks like the answer is that the list in the condition of the if statement is only created once...
>>> def foo():
...     for i in iterable:
...         if i in [1, 3, 5]:
...             print(i)
...
>>> import dis
>>> dis.dis(foo)
2 0 SETUP_LOOP 34 (to 37)
3 LOAD_GLOBAL 0 (iterable)
6 GET_ITER
>> 7 FOR_ITER 26 (to 36)
10 STORE_FAST 0 (i)
3 13 LOAD_FAST 0 (i)
16 LOAD_CONST 4 ((1, 3, 5))
19 COMPARE_OP 6 (in)
22 POP_JUMP_IF_FALSE 7
4 25 LOAD_FAST 0 (i)
28 PRINT_ITEM
29 PRINT_NEWLINE
30 JUMP_ABSOLUTE 7
33 JUMP_ABSOLUTE 7
>> 36 POP_BLOCK
>> 37 LOAD_CONST 0 (None)
40 RETURN_VALUE
Notice: 16 LOAD_CONST 4 ((1, 3, 5))
Python's peephole optimizer has turned our list into a tuple (thanks python!) and stored it as a constant. Note that the peephole optimizer can only do these transforms on objects if it knows that you as the programmer have absolutely no way of getting a reference to the list (otherwise, you could mutate the list and change the meaning of the code). As far as I'm aware, it only does this optimization for list and set literals that are composed entirely of constants and are the RHS of an in operator. There might be other cases that I'm not aware of (dis.dis is your friend for finding these optimizations).
I hinted at it above, but you can do the same thing with set-literals in more recent versions of python (in python3.2+, the set is converted to a constant frozenset). The benefit there is that set/frozenset have faster membership testing on average than list/tuple.
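The membership-testing claim is easy to check with timeit. The sketch below uses a deliberately large, hypothetical collection and searches for the last element, which is the worst case for the tuple's linear scan; for a three-element literal the gap is much smaller.

```python
import timeit

values = tuple(range(1_000))        # hypothetical data, sized for illustration
as_frozenset = frozenset(values)
needle = 999                        # worst case for the tuple: last element

t_tuple = timeit.timeit(lambda: needle in values, number=20_000)
t_set = timeit.timeit(lambda: needle in as_frozenset, number=20_000)
print(f"tuple: {t_tuple:.4f}s  frozenset: {t_set:.4f}s")
```

The tuple lookup scans elements one by one, while the frozenset lookup hashes the needle and probes a table, which is why the difference grows with the size of the collection.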
Another example, with Python 3.5: here the list is rebuilt on each iteration of the outer loop.
>>> import dis
>>> def func():
...     for i in iterable:
...         for j in [1,2,3]:
...             print(i+j)
...
>>> dis.dis(func)
2 0 SETUP_LOOP 54 (to 57)
3 LOAD_GLOBAL 0 (iterable)
6 GET_ITER
>> 7 FOR_ITER 46 (to 56)
10 STORE_FAST 0 (i)
3 13 SETUP_LOOP 37 (to 53)
16 LOAD_CONST 1 (1) # building list
19 LOAD_CONST 2 (2)
22 LOAD_CONST 3 (3)
25 BUILD_LIST 3
28 GET_ITER
>> 29 FOR_ITER 20 (to 52) # inner loop body begin
32 STORE_FAST 1 (j)
4 35 LOAD_GLOBAL 1 (print)
38 LOAD_FAST 0 (i)
41 LOAD_FAST 1 (j)
44 BINARY_ADD
45 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
48 POP_TOP
49 JUMP_ABSOLUTE 29 # inner loop body end
>> 52 POP_BLOCK
>> 53 JUMP_ABSOLUTE 7 # outer loop end,
# jumping back before list creation
>> 56 POP_BLOCK
>> 57 LOAD_CONST 0 (None)
60 RETURN_VALUE
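Since this behaviour depends on the interpreter version, it may be worth checking your own interpreter rather than relying on a listing from 3.5. A small sketch follows: builds_list is a hypothetical helper that just looks for a BUILD_LIST instruction, and iterable is passed as a parameter here. Newer CPython releases may well report False because the literal is turned into a constant tuple.

```python
import dis


def func(iterable):
    for i in iterable:
        for j in [1, 2, 3]:
            print(i + j)


def builds_list(function):
    """True if the compiled bytecode contains a BUILD_LIST instruction."""
    return any(instr.opname == "BUILD_LIST"
               for instr in dis.get_instructions(function))


print("list rebuilt on every outer iteration:", builds_list(func))
```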
Why do the following two implementations have different performance in Python?
from cStringIO import StringIO
from itertools import imap
from sys import stdin
input = imap(int, StringIO(stdin.read()))
print '\n'.join(imap(str, sorted(input)))
AND
import sys
l = []
for line in sys.stdin:
    l.append(int(line.strip('\n')))
l.sort()
for x in l:
    print x
The first implementation is faster than the second for inputs of the order of 10^6 lines. Why so?
>>> dis.dis(first)
2 0 LOAD_GLOBAL 0 (imap)
3 LOAD_GLOBAL 1 (int)
6 LOAD_GLOBAL 2 (StringIO)
9 LOAD_GLOBAL 3 (stdin)
12 LOAD_ATTR 4 (read)
15 CALL_FUNCTION 0
18 CALL_FUNCTION 1
21 CALL_FUNCTION 2
24 STORE_FAST 0 (input)
27 LOAD_CONST 0 (None)
30 RETURN_VALUE
>>> dis.dis(second)
2 0 SETUP_LOOP 48 (to 51)
3 LOAD_GLOBAL 0 (sys)
6 LOAD_ATTR 1 (stdin)
9 CALL_FUNCTION 0
12 GET_ITER
>> 13 FOR_ITER 34 (to 50)
16 STORE_FAST 0 (line)
3 19 LOAD_GLOBAL 2 (l)
22 LOAD_ATTR 3 (append)
25 LOAD_GLOBAL 4 (int)
28 LOAD_FAST 0 (line)
31 LOAD_ATTR 5 (strip)
34 LOAD_CONST 1 ('\n')
37 CALL_FUNCTION 1
40 CALL_FUNCTION 1
43 CALL_FUNCTION 1
46 POP_TOP
47 JUMP_ABSOLUTE 13
>> 50 POP_BLOCK
4 >> 51 LOAD_GLOBAL 2 (l)
54 LOAD_ATTR 6 (sort)
57 CALL_FUNCTION 0
60 POP_TOP
61 LOAD_CONST 0 (None)
64 RETURN_VALUE
first is your first function.
second is your second function.
dis tells one of the reasons why the first one is faster.
Two primary reasons:
The 2nd code explicitly constructs a list and sorts it afterwards, while the 1st version lets sorted create only an internal list while sorting at the same time.
The 2nd code explicitly loops over a list with for (on the Python VM), while the 1st version implicitly loops with imap (over the underlying structure in C).
Anyways, why is StringIO in there? The most straightforward and probably fastest way is:
from sys import stdin, stdout
stdout.writelines(sorted(stdin, key=int))
Do a step-by-step conversion from the second to the first one and see how the performance changes with each step.
1. Remove line.strip. This will cause some speed up, whether it would be significant is another matter. The stripping is superfluous as has been mentioned by you and THC4k.
2. Then replace the for loop using l.append with map(int, sys.stdin). My guess is that this would give a significant speed-up.
3. Replace map and l.sort with imap and sorted. My guess is that it won't affect the performance, there could be a slight slowdown, but it would be far from significant. Between the two, I'd usually go with the former, but with Python 3 on the horizon the latter is probably preferable.
4. Replace the for loop using print with print '\n'.join(...). My guess is that this would be another speed-up, but it would cost you some memory.
5. Add cStringIO (which is completely unnecessary by the way) to see how it affects performance. My guess is that it would be slightly slower, but not enough to counter 4 and 2.
Then, if you try THC4k's answer, it would probably be faster than all of the above, while being simpler and easier to read, and using less memory than 4 and 5. It has slightly different behaviour (it doesn't strip leading zeros from the numbers).
Of course, try this yourself instead of trusting anyone's guesses. Also run cProfile on your code and see which parts are taking the most time.
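The behaviour of the sorted(stdin, key=int) approach, including the point about leading zeros, can be tried without piping real input. This is a Python 3 sketch (the thread itself is Python 2) that uses io.StringIO as a stand-in for sys.stdin; sort_numeric_lines is a hypothetical helper name.

```python
import io


def sort_numeric_lines(stream):
    """Return the lines of stream sorted by their integer value."""
    # int() ignores surrounding whitespace, so no strip() is needed.
    return sorted(stream, key=int)


fake_stdin = io.StringIO("10\n007\n3\n")   # stand-in for sys.stdin
result = sort_numeric_lines(fake_stdin)
print("".join(result), end="")
```

The lines come back as the original strings, so "007" keeps its leading zeros. With real input, the call would be sys.stdout.writelines(sort_numeric_lines(sys.stdin)).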
A few years ago, someone posted on Active State Recipes for comparison purposes, three python/NumPy functions; each of these accepted the same arguments and returned the same result, a distance matrix.
Two of these were taken from published sources; they are both--or they appear to me to be--idiomatic numpy code. The repetitive calculations required to create a distance matrix are driven by numpy's elegant index syntax. Here's one of them:
from numpy import sqrt, sum
from numpy.matlib import repmat, repeat

def calcDistanceMatrixFastEuclidean(points):
    numPoints = len(points)
    distMat = sqrt(sum((repmat(points, numPoints, 1) -
                        repeat(points, numPoints, axis=0))**2, axis=1))
    return distMat.reshape((numPoints,numPoints))
The third created the distance matrix using a single loop (which, obviously is a lot of looping given that a distance matrix of just 1,000 2D points, has one million entries). At first glance this function looked to me like the code I used to write when I was learning NumPy and I would write NumPy code by first writing Python code and then translating it, line by line.
Several months after the Active State post, results of performance tests comparing the three were posted and discussed in a thread on the NumPy mailing list.
The function with the loop in fact significantly outperformed the other two:
from numpy import array, zeros, newaxis, sqrt

def calcDistanceMatrixFastEuclidean2(nDimPoints):
    nDimPoints = array(nDimPoints)
    n,m = nDimPoints.shape
    delta = zeros((n,n),'d')
    for d in xrange(m):
        data = nDimPoints[:,d]
        delta += (data - data[:,newaxis])**2
    return sqrt(delta)
One participant in the thread (Keir Mierle) offered a reason why this might be true:
The reason that I suspect this will be faster is
that it has better locality, completely finishing a computation on a
relatively small working set before moving onto the next one. The one liners
have to pull the potentially large MxN array into the processor repeatedly.
By this poster's own account, his remark is only a suspicion, and it doesn't appear that it was discussed any further.
Any other thoughts about how to account for these results?
In particular, is there a useful rule--regarding when to loop and when to index--that can be extracted from this example as guidance in writing numpy code?
For those not familiar with NumPy, or who haven't looked at the code, this comparison is not based on an edge case--it certainly wouldn't be that interesting to me if it were. Instead, this comparison involves a function that performs a common task in matrix computation (i.e., creating a result array given two antecedents); moreover, each function is in turn composed of some of the most common numpy built-ins.
TL; DR The second code above is only looping over the number of dimensions of the points (3 times through the for loop for 3D points) so the looping isn't much there. The real speed-up in the second code above is that it better harnesses the power of Numpy to avoid creating some extra matrices when finding the differences between points. This reduces memory used and computational effort.
Longer Explanation
I think that the calcDistanceMatrixFastEuclidean2 function is deceiving you with its loop perhaps. It is only looping over the number of dimensions of the points. For 1D points, the loop only executes once, for 2D, twice, and for 3D, thrice. This is really not much looping at all.
Let's analyze the code a little bit to see why the one is faster than the other. calcDistanceMatrixFastEuclidean I will call fast1 and calcDistanceMatrixFastEuclidean2 will be fast2.
fast1 is based on the Matlab way of doing things as is evidenced by the repmat function. The repmat function creates an array in this case that is just the original data repeated over and over again. However, if you look at the code for the function, it is very inefficient. It uses many Numpy functions (3 reshapes and 2 repeats) to do this. The repeat function is also used to create an array that contains the original data with each data item repeated many times. If our input data is [1,2,3] then we are subtracting [1,1,1,2,2,2,3,3,3] from [1,2,3,1,2,3,1,2,3]. Numpy has had to create a lot of extra matrices in between running Numpy's C code which could have been avoided.
fast2 uses more of Numpy's heavy lifting without creating as many matrices between Numpy calls. fast2 loops through each dimension of the points, does the subtraction and keeps a running total of the squared differences between each dimension. Only at the end is the square root done. So far, this may not sound quite as efficient as fast1, but fast2 avoids doing the repmat stuff by using Numpy's indexing. Let's look at the 1D case for simplicity. fast2 makes a 1D array of the data and subtracts it from a 2D (N x 1) array of the data. This creates the difference matrix between each point and all of the other points without having to use repmat and repeat and thereby bypasses creating a lot of extra arrays. This is where the real speed difference lies in my opinion. fast1 creates a lot of extra in between matrices (and they are created expensively computationally) to find the differences between points while fast2 better harnesses the power of Numpy to avoid these.
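The 1D case described above can be written out as a short sketch. It assumes NumPy is installed and uses np.tile in place of repmat to build the explicitly repeated array; the sample data is hypothetical.

```python
import numpy as np

points = np.array([1.0, 2.0, 4.0])   # hypothetical 1D sample data
n = len(points)

# Broadcasting: (n,) minus (n, 1) gives an (n, n) array of differences.
diff_broadcast = points - points[:, np.newaxis]

# The explicit-copy route: materialize both repeated arrays first.
tiled = np.tile(points, n)            # [1, 2, 4, 1, 2, 4, 1, 2, 4]
repeated = np.repeat(points, n)       # [1, 1, 1, 2, 2, 2, 4, 4, 4]
diff_copies = (tiled - repeated).reshape(n, n)

print(np.array_equal(diff_broadcast, diff_copies))
```

Both routes produce the same difference matrix, but the broadcasting version never allocates the two length n*n intermediate arrays.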
By the way, here is a little bit faster version of fast2:
def calcDistanceMatrixFastEuclidean3(nDimPoints):
    nDimPoints = array(nDimPoints)
    n,m = nDimPoints.shape
    data = nDimPoints[:,0]
    delta = (data - data[:,newaxis])**2
    for d in xrange(1,m):
        data = nDimPoints[:,d]
        delta += (data - data[:,newaxis])**2
    return sqrt(delta)
The difference is that we are no longer creating delta as a zeros matrix.
dis for fun:
dis.dis(calcDistanceMatrixFastEuclidean)
2 0 LOAD_GLOBAL 0 (len)
3 LOAD_FAST 0 (points)
6 CALL_FUNCTION 1
9 STORE_FAST 1 (numPoints)
3 12 LOAD_GLOBAL 1 (sqrt)
15 LOAD_GLOBAL 2 (sum)
18 LOAD_GLOBAL 3 (repmat)
21 LOAD_FAST 0 (points)
24 LOAD_FAST 1 (numPoints)
27 LOAD_CONST 1 (1)
30 CALL_FUNCTION 3
4 33 LOAD_GLOBAL 4 (repeat)
36 LOAD_FAST 0 (points)
39 LOAD_FAST 1 (numPoints)
42 LOAD_CONST 2 ('axis')
45 LOAD_CONST 3 (0)
48 CALL_FUNCTION 258
51 BINARY_SUBTRACT
52 LOAD_CONST 4 (2)
55 BINARY_POWER
56 LOAD_CONST 2 ('axis')
59 LOAD_CONST 1 (1)
62 CALL_FUNCTION 257
65 CALL_FUNCTION 1
68 STORE_FAST 2 (distMat)
5 71 LOAD_FAST 2 (distMat)
74 LOAD_ATTR 5 (reshape)
77 LOAD_FAST 1 (numPoints)
80 LOAD_FAST 1 (numPoints)
83 BUILD_TUPLE 2
86 CALL_FUNCTION 1
89 RETURN_VALUE
dis.dis(calcDistanceMatrixFastEuclidean2)
2 0 LOAD_GLOBAL 0 (array)
3 LOAD_FAST 0 (nDimPoints)
6 CALL_FUNCTION 1
9 STORE_FAST 0 (nDimPoints)
3 12 LOAD_FAST 0 (nDimPoints)
15 LOAD_ATTR 1 (shape)
18 UNPACK_SEQUENCE 2
21 STORE_FAST 1 (n)
24 STORE_FAST 2 (m)
4 27 LOAD_GLOBAL 2 (zeros)
30 LOAD_FAST 1 (n)
33 LOAD_FAST 1 (n)
36 BUILD_TUPLE 2
39 LOAD_CONST 1 ('d')
42 CALL_FUNCTION 2
45 STORE_FAST 3 (delta)
5 48 SETUP_LOOP 76 (to 127)
51 LOAD_GLOBAL 3 (xrange)
54 LOAD_FAST 2 (m)
57 CALL_FUNCTION 1
60 GET_ITER
>> 61 FOR_ITER 62 (to 126)
64 STORE_FAST 4 (d)
6 67 LOAD_FAST 0 (nDimPoints)
70 LOAD_CONST 0 (None)
73 LOAD_CONST 0 (None)
76 BUILD_SLICE 2
79 LOAD_FAST 4 (d)
82 BUILD_TUPLE 2
85 BINARY_SUBSCR
86 STORE_FAST 5 (data)
7 89 LOAD_FAST 3 (delta)
92 LOAD_FAST 5 (data)
95 LOAD_FAST 5 (data)
98 LOAD_CONST 0 (None)
101 LOAD_CONST 0 (None)
104 BUILD_SLICE 2
107 LOAD_GLOBAL 4 (newaxis)
110 BUILD_TUPLE 2
113 BINARY_SUBSCR
114 BINARY_SUBTRACT
115 LOAD_CONST 2 (2)
118 BINARY_POWER
119 INPLACE_ADD
120 STORE_FAST 3 (delta)
123 JUMP_ABSOLUTE 61
>> 126 POP_BLOCK
8 >> 127 LOAD_GLOBAL 5 (sqrt)
130 LOAD_FAST 3 (delta)
133 CALL_FUNCTION 1
136 RETURN_VALUE
I'm not an expert on dis, but it seems like you would have to look more at the functions that the first is calling to know why they take a while. There is a performance profiler tool with Python as well, cProfile.
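A minimal cProfile session might look like the following sketch. slow_sum is a hypothetical stand-in for the function under investigation; for the distance-matrix code you would call one of the calcDistanceMatrix functions between enable() and disable() instead.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Unlike dis, which only shows the bytecode of the function you hand it, the profiler attributes time to every function that was called, including the library routines where most of the time may actually be spent.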