When comparing the following code samples in Python (the 300 is just an example; in practice, assume there are 10,000,000 items to loop through):
sum = 0
for i in xrange(0, 300):
    sum += i
Vs.
sum = 0
for i in xrange(0, 100):
    sum += i
for i in xrange(100, 200):
    sum += i
for i in xrange(200, 300):
    sum += i
Are both versions the same in terms of running time? I have a problem where I have a huge vector (say, 10,000 items). I don't necessarily have to loop through all of it: I should continue looping only if some conditions are met, and those conditions can only be assessed once the first loop -- which uses the first 100 items -- is done.
Then I should continue looping only if further conditions are met, which can only be assessed once the next 100 items have been examined, and so on after each subsequent chunk of 100 items, until I reach the end of the vector.
Given that I can solve my problem in a modular way, by examining the index ranges:
{0, 100}
{100, 200}
{200, 300}
........
{9900, 10000}
All I am trying to do is get a sense of whether the second approach is as efficient as the first one. Thanks.
Note: the sum is used here for simplicity. In practice, the running time of the mathematical operations involved is significantly greater, which is why I am trying to optimize the running time by examining different coding approaches...
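In pseudo-Python, the pattern I have in mind is something like this (expensive_op and conditions_met are hypothetical placeholders for the real math and the stopping test):
total = 0
for start in xrange(0, len(vec), 100):
    for i in xrange(start, min(start + 100, len(vec))):
        total += expensive_op(vec[i])    # stand-in for the real work
    if not conditions_met(total):        # assessed after each 100-item chunk
        break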
What kind of improvement were you hoping to see?
timeit will help you observe the difference:
$ python -m timeit 'sum = 0
> for i in xrange(0, 300):
>     sum += i'
10000 loops, best of 3: 21.8 usec per loop
$ python -m timeit 'sum = 0
> for i in xrange(0, 100):
>     sum += i
> for i in xrange(100, 200):
>     sum += i
> for i in xrange(200, 300):
>     sum += i'
10000 loops, best of 3: 22.8 usec per loop
Personally, I would go for:
$ python -m timeit 'sum(xrange(300))'
100000 loops, best of 3: 5.34 usec per loop
This example is perhaps not the best way to assess whether you'll be able to optimise your actual code, but it does demonstrate how you can test a small snippet for its running time.
Just remember what Donald Knuth said about optimisation ;-)
Here's a way you could checkpoint your iteration and break out if required:
>>> for index, item in enumerate(vect):
...     results = do_stuff_with(item)
...     if index % 100 == 0 and should_we_stop(results):
...         break
Even better would be to have do_stuff_with() return a tuple indicating whether it has finished; then you can check on every iteration whether you're done, rather than waiting for a checkpoint:
>>> for item in vect:
...     finished, results = do_stuff_with(item)
...     if finished:
...         break
Just to reiterate (no pun intended), it's very hard to say how (or even whether) to optimise your actual code without actually seeing it!
If you are not going to loop through all of it, then the cost of setting up each new xrange object is quite small.
So the second approach would be faster if you sometimes skip one or two of those loops.
That said, if your list is only 300 items long, the difference is going to be negligible, on the order of microseconds or milliseconds.
If you want to compare speeds, just time them both:
import timeit
chunk1 = '''
sum = 0
for i in xrange(0, 300):
    sum += i
'''
chunk2 = '''
sum = 0
for i in xrange(0, 100):
    sum += i
for i in xrange(100, 200):
    sum += i
for i in xrange(200, 300):
    sum += i
'''
timer1 = timeit.Timer(chunk1)
timer2 = timeit.Timer(chunk2)
print timer1.timeit(100000)
print timer2.timeit(100000)
I get these numbers for 100,000 iterations:
3.44955992699
3.56597089767
As you can see, the second chunk of code is slightly slower.
You might be able to use a while loop with a break statement; here's what I interpret you as trying to do (vect is your vector, limit is its length, and foo is a placeholder for whatever value triggers the stop):
i = 0
while i < limit:
    stop = False
    for j in xrange(i, i + 100):   # examine the next chunk of 100 items
        if vect[j] == foo:         # placeholder stopping condition
            stop = True
            break
    if stop:
        break
    i += 100
fib1 = 1
fib2 = 2
i = 0
sum = 0
while i < 3999998:
    fibn = fib1 + fib2
    fib1 = fib2
    fib2 = fibn
    i += 1
    if fibn % 2 == 0:
        sum = sum + fibn
print(sum + 2)
The challenge is to sum the even Fibonacci numbers under 4000000. It works for small limits, say 10 numbers, but goes on forever when set to 4000000.
The code is in Python.
Yes, there are inefficiencies in your code, but the biggest one is that you're mistaken about what you're computing.
At each iteration i increases by one, and you are checking at each step whether i < 3999998. So you are effectively computing the first 4 million Fibonacci numbers.
You should change your loop condition to while fib2 < 3999998.
A couple of other minor optimisations: leverage Python's swapping syntax x, y = y, x and its sum function. Computing the sum once over a list is slightly faster than summing the values up successively in a loop.
a, b = 1, 2
fib = []
while b < 3999998:
    a, b = b, a + b
    if b % 2 == 0:
        fib.append(b)
sum(fib) + 2
This runs in 100000 loops, best of 3: 7.51 µs per loop, a whopping 3 microseconds faster than your current code (once you fix it, that is).
You are computing the first 4 million fibonacci numbers. It's going to take a while. It took me almost 5 minutes to compute the result, which was about 817 KB of digits, after I replaced fibn % 2 == 0 with fibn & 1 == 0 - an optimization that makes a big difference on such large numbers.
In other words, your code will eventually finish - it will just take a long time.
Update: your version finished after 42 minutes.
Say I have a range r=numpy.array(range(1, 6)) and I am calculating the cumulative sum using numpy.cumsum(r). But instead of returning [1, 3, 6, 10, 15] I would like it to return [1, 3, 6] because of the condition that the cumulative result must be less than 10.
If the array is very large, I would like the cumulative sum to break out before it starts calculating values that are redundant and will be thrown away later. Of course, I am trivializing everything here for the sake of the question.
Is it possible to break out of cumsum or cumprod early based on a condition?
I don't think this is possible with any function in numpy, since in most cases these are meant for vectorized computations on fixed-length arrays. One obvious way to do what you want is to break out of a standard for-loop in Python (as I assume you know):
def limited_cumsum(x, limit):
    y = []
    sm = 0
    for item in x:
        sm += item
        if sm >= limit:   # stop before storing a sum that reaches the limit
            return y
        y.append(sm)
    return y
But this would obviously be an order of magnitude slower than numpy's cumsum.
Since you probably need some very specialized function, the chances are low that you will find the exact function you need in numpy. You should probably have a look at Cython, which allows you to implement custom functions that are as flexible as a Python function (using a syntax that is almost Python), with a speed close to that of C.
Depending on the size of the array you are computing the cumulative sum for, and how quickly you expect the target value to be reached, it may be quicker to calculate the cumulative sum in steps.
import numpy as np
size = 1000000
target = size
def stepped_cumsum():
    arr = np.arange(size)
    out = np.empty(len(arr), dtype=int)
    step = 1000
    last_value = 0
    for i in range(0, len(arr), step):
        np.cumsum(arr[i:i+step], out=out[i:i+step])
        out[i:i+step] += last_value
        last_value = out[i+step-1]
        if last_value >= target:
            break
    else:
        return out
    greater_than_target_index = i + (out[i:i+step] >= target).argmax()
    # .copy() required so rest of backing array can be freed
    return out[:greater_than_target_index].copy()

def normal_cumsum():
    arr = np.arange(size)
    out = np.cumsum(arr)
    return out
stepped_result = stepped_cumsum()
normal_result = normal_cumsum()
assert (stepped_result < target).all()
assert (stepped_result == normal_result[:len(stepped_result)]).all()
Results:
In [60]: %timeit cumsum.stepped_cumsum()
1000 loops, best of 3: 1.22 ms per loop
In [61]: %timeit cumsum.normal_cumsum()
100 loops, best of 3: 3.69 ms per loop
I expected this Python implementation of ThreeSum to be slow:
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
But I was shocked that this version looks pretty slow too:
import itertools

def count_python(a):
    """ThreeSum using itertools"""
    return sum(map(lambda X: sum(X) == 0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how all the various O(N^3) versions provided in this thread (kept for educational purposes; not something you would use in real life) ran on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: I needed to modify Veedrak's solution as follows to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
    do_something(some_list[i])
And instead just write:
for x in some_list:
    do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i, x in enumerate(some_list):
    # etc.
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
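You can get a rough feel for that overhead with timeit (commands only; exact numbers will vary by machine, and the second loop body does no indexing at all):
$ python -m timeit -s "a = range(1000)" "for i in range(len(a)): a[i]"
$ python -m timeit -s "a = range(1000)" "for x in a: pass"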
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
    cnt = 0
    for i, x in enumerate(a):
        for j, y in enumerate(a[i+1:], i+1):
            for z in a[j+1:]:
                if x + y + z == 0:
                    cnt += 1
    return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
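For reference, that comprehension variant might look like the following sketch (count_comprehension is just a name I made up; it is equivalent to count_enumerate above):
def count_comprehension(a):
    # same triple loop as count_enumerate, expressed as a generator expression
    return sum(1 for i, x in enumerate(a)
                 for j, y in enumerate(a[i+1:], i+1)
                 for z in a[j+1:]
                 if x + y + z == 0)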
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to @Veedrak: unfortunately constructing s - {a,b} is a slow O(len(s)) operation itself - so simply check if -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0, since all numbers in your input are distinct.
def count_python_faster(a):
    s = frozenset(a)
    return sum(1 for x, y in itertools.combinations(a, 2)
               if -(x+y) not in (x, y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted (each zero-sum triple is found once for each of its three pairs). It's possible to avoid that, but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
    return sum(sum(x) == 0 for x in itertools.combinations(a, r=3))
count_generator(a) -> 7.63 ms
But other than that, it's the sheer number of combinations that's dominating execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)
Below is a loop to find smallest common multiple of the numbers 1-20:
count = 0
while not all(count % 1 == 0, count % 2 == 0,
              count % 3 == 0, count % 4 == 0,
              ... count % 20 == 0):
    count += 1
print count
It's quite tedious to type out so many conditions. This needs improvement, especially if the number is bigger than 20. However, being new to python, my knee-jerk reaction was:
while not all(count % range(1,21)==0):
...which doesn't work because python can't read minds. I've thought about putting a list inside the all(), but I'm not sure how to generate a list with variables in it.
Is there a shorthand to input a pattern of conditions like these, or is there a smarter way to do this that I'm missing?
Generator expressions:
while not all(count % i == 0 for i in range(1,21)):
Incidentally, this is pretty easy to work out by hand if you factorize the numbers 1..20 into prime factors. It's on the order of 200 million so the loop might take a while.
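Worked out: the highest prime powers not exceeding 20 are 2^4 = 16, 3^2 = 9, 5, 7, 11, 13, 17 and 19, so the answer is 16 * 9 * 5 * 7 * 11 * 13 * 17 * 19 = 232792560.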
Use a generator expression:
while not all(count % x == 0 for x in range(1,21)):
You could also use any here:
while any(count % x for x in range(1,21)):
since 0 evaluates to False in Python.
A better solution to your current problem is to use a useful property of the least common multiple function (assuming you implemented it correctly):
lcm(a, b, c) == lcm(lcm(a, b), c)
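If you don't already have an lcm, here is a minimal Python 2 sketch using fractions.gcd (one common implementation, not necessarily how you wrote yours; reduce is a builtin in Python 2):
from fractions import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

print reduce(lcm, range(1, 21))   # prints 232792560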
It runs pretty quickly, even for fairly large inputs (the least common multiple of the first 20,000 numbers has 8,676 digits):
>>> %timeit reduce(lcm, range(1, 20001))
1 loops, best of 3: 240 ms per loop
I want to zip up a list of entities with a new entity to generate a list of coordinates (2-tuples), but I want to ensure that i < j always holds for each (i, j).
However, I am not extremely pleased with my current solutions:
from itertools import repeat
mems = range(1, 10, 2)
mem = 8
def ij(i, j):
    if i < j:
        return (i, j)
    else:
        return (j, i)

def zipij(m=mem, ms=mems, f=ij):
    return map(lambda i: f(i, m), ms)

def zipij2(m=mem, ms=mems):
    return map(lambda i: tuple(sorted([i, m])), ms)

def zipij3(m=mem, ms=mems):
    return [tuple(sorted([i, m])) for i in ms]

def zipij4(m=mem, ms=mems):
    mems = zip(ms, repeat(m))
    half1 = [(i, j) for i, j in mems if i < j]
    half2 = [(j, i) for i, j in mems[len(half1):]]
    return half1 + half2

def zipij5(m=mem, ms=mems):
    mems = zip(ms, repeat(m))
    return [(i, j) for i, j in mems if i < j] + [(j, i) for i, j in mems if i > j]
Output for above:
>>> print zipij() # or zipij{2-5}
[(1, 8), (3, 8), (5, 8), (7, 8), (8, 9)]
Instead of normally:
>>> print zip(mems, repeat(mem))
[(1, 8), (3, 8), (5, 8), (7, 8), (9, 8)]
Timings: snipped (no longer relevant, see much faster results in answers below)
For len(mems) == 5, there is no real issue with any solution, but for zipij5() for instance, the second list comprehension is needlessly going back over the first four values when i > j was already evaluated to be True for those in the first comprehension.
For my purposes, I'm positive that len(mems) will never exceed ~10000, if that helps form any answers for what solution is best. To explain my use case a bit (I find it interesting), I will be storing a sparse, upper-triangular, similarity matrix of sorts, and so I need the coordinate (i, j) to not be duplicated at (j, i). I say of sorts because I will be utilizing the new Counter() object in 2.7 to perform quasi matrix-matrix and matrix-vector addition. I then simply feed counter_obj.update() a list of 2-tuples and it increments those coordinates how many times they occur. SciPy sparse matrices ran about 50x slower, to my dismay, for my use cases... so I quickly ditched those.
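To illustrate the Counter idea with a toy example (the coordinate tuples here are made up):
from collections import Counter

counter_obj = Counter()
counter_obj.update([(1, 8), (3, 8), (1, 8)])   # counts each (i, j) coordinate
print counter_obj                              # Counter({(1, 8): 2, (3, 8): 1})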
So anyway, I was surprised by my results... The first methods I came up with were zipij4 and zipij5, and yet they are still the fastest, despite building a normal zip() and then generating a new zip after changing the values. I'm still rather new to Python, relatively speaking (Alex Martelli, can you hear me?), so here are my naive conclusions:
tuple(sorted([i, j])) is extremely expensive (Why is that?)
map(lambda ...) seems to always do worse than a list comp (I think I've read this and it makes sense)
Somehow zipij5() isn't much slower despite going over the list twice to check for i-j inequality. (Why is this?)
And lastly, I would like to know which is considered most efficient... or if there are any other fast and memory-inexpensive ways that I haven't yet thought of. Thank you.
Current Best Solutions
## Most BRIEF, Quickest with UNSORTED input list:
## truppo's
def zipij9(m=mem, ms=mems):
    return [(i, m) if i < m else (m, i) for i in ms]

## Quickest with pre-SORTED input list:
## Michal's
def zipij10(m=mem, ms=mems):
    i = binsearch(m, ms)   ## See Michal's answer for binsearch()
    return zip(ms[:i], repeat(m)) + zip(repeat(m), ms[i:])
Timings
# Michal's
Presorted - 410µs per loop
Unsorted - 2.09ms per loop ## Due solely to the expensive sorted()
# truppo's
Presorted - 880µs per loop
Unsorted - 896µs per loop ## No sorted() needed
Timings were using mems = range(1, 10000, 2), which is only ~5000 in length. sorted() will probably become worse at higher values, and with lists that are more shuffled. random.shuffle() was used for the "Unsorted" timings.
Current version:
(Fastest at the time of posting with Python 2.6.4 on my machine.)
Update 3: Since we're going all out, let's do a binary search -- in a way which doesn't require injecting m into mems:
def binsearch(x, lst):
    low, high = -1, len(lst)
    while low < high:
        i = (high - low) // 2
        if i > 0:
            i += low
            if lst[i] < x:
                low = i
            else:
                high = i
        else:
            i = high
            high = low
    return i

def zipij(m=mem, ms=mems):
    i = binsearch(m, ms)
    return zip(ms[:i], repeat(m)) + zip(repeat(m), ms[i:])
This runs in 828 µs = 0.828 ms on my machine vs the OP's current solution's 1.14 ms. Input list assumed sorted (and the test case is the usual one, of course).
This binary search implementation returns the index of the first element in the given list which is not smaller than the object being searched for. Thus there's no need to inject m into mems and sort the whole thing (like in the OP's current solution with .index(m)) or walk through the beginning of the list step by step (like I did previously) to find the offset at which it should be divided.
Earlier attempts:
How about this? (Proposed solution next to In [25] below, 2.42 ms to zipij5's 3.13 ms.)
In [24]: timeit zipij5(m = mem, ms = mems)
100 loops, best of 3: 3.13 ms per loop
In [25]: timeit [(i, j) if i < j else (j, i) for (i, j) in zip(mems, repeat(mem))]
100 loops, best of 3: 2.42 ms per loop
In [27]: [(i, j) if i < j else (j, i) for (i, j) in zip(mems, repeat(mem))] == zipij5(m=mem, ms=mems)
Out[27]: True
Update: This appears to be just about exactly as fast as the OP's self-answer. Seems more straightforward, though.
Update 2: An implementation of the OP's proposed simplified solution:
def zipij(m=mem, ms=mems):
    split_at = 0
    for item in ms:
        if item < m:
            split_at += 1
        else:
            break
    return [(item, m) for item in ms[:split_at]] + [(m, item) for item in ms[split_at:]]
In [54]: timeit zipij()
1000 loops, best of 3: 1.15 ms per loop
Also, truppo's solution runs in 1.36 ms on my machine. I guess the above is the fastest so far. Note you need to sort mems before passing them into this function! If you're generating it with range, it is of course already sorted, though.
Why not just inline your ij()-function?
def zipij(m=mem, ms=mems):
    return [(i, m) if i < m else (m, i) for i in ms]
(This runs in 0.64 ms instead of 2.12 ms on my computer)
Some benchmarks:
zipit.py:
from itertools import repeat
mems = range(1, 50000, 2)
mem = 8
def zipij7(m=mem, ms=mems):
    cpy = sorted(ms + [m])
    loc = cpy.index(m)
    return zip(ms[:loc], repeat(m)) + zip(repeat(m), ms[loc:])

def zipinline(m=mem, ms=mems):
    return [(i, m) if i < m else (m, i) for i in ms]
Sorted:
>python -m timeit -s "import zipit" "zipit.zipinline()"
100 loops, best of 3: 4.44 msec per loop
>python -m timeit -s "import zipit" "zipit.zipij7()"
100 loops, best of 3: 4.8 msec per loop
Unsorted:
>python -m timeit -s "import zipit, random; random.shuffle(zipit.mems)" "zipit.zipinline()"
100 loops, best of 3: 4.65 msec per loop
>python -m timeit -s "import zipit, random; random.shuffle(zipit.mems)" "zipit.zipij7()"
100 loops, best of 3: 17.1 msec per loop
Most recent version:
def zipij7(m=mem, ms=mems):
    cpy = sorted(ms + [m])
    loc = cpy.index(m)
    return zip(ms[:loc], repeat(m)) + zip(repeat(m), ms[loc:])
It benches slightly faster for me than truppo's, and about 30% slower than Michal's. (Looking into that now.)
I may have found my answer (for now). It seems I forgot to make a list comp version of zipij():
def zipij1(m=mem, ms=mems, f=ij):
    return [f(i, m) for i in ms]
It still relies on my silly ij() helper function, so it doesn't win the award for brevity, certainly, but timings have improved:
# 10000
1.27s
# 50000
6.74s
So it is now my current "winner": it does not need to generate more than one list or make a lot of function calls (other than the ij() helper), so I believe it is also the most efficient.
However, I think this could still be improved... I think that making N ij() function calls (where N is the length of the resultant list) is not needed:
1. Find at what index mem would fit into mems when ordered.
2. Split mems at that index into two parts.
3. Do zip(part1, repeat(mem)).
4. Add zip(repeat(mem), part2) to it.
It'd basically be an improvement on zipij4(), and this avoids N extra function calls, but I am not sure of the speed/memory benefits over the cost of brevity. I will maybe add that version to this answer if I figure it out.
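Here is a quick sketch of those steps using the standard bisect module (zipij_split is a hypothetical name; it assumes ms is sorted, and I haven't timed it against the versions above):
from bisect import bisect_left
from itertools import repeat

def zipij_split(m=mem, ms=mems):
    i = bisect_left(ms, m)   # index where m would fit in the sorted ms
    return zip(ms[:i], repeat(m)) + zip(repeat(m), ms[i:])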