This question and my answer got me thinking about this peculiar difference between Python 2.7 and Python 3.4. Take the simple example code:
import timeit
import dis
c = 1000000
r = range(c)
def slow():
    for pos in range(c):
        r[pos:pos+3]
dis.dis(slow)
time = timeit.Timer(lambda: slow()).timeit(number=1)
print('%3.3f' % time)
In Python 2.7, I consistently get 0.165~ and for Python 3.4 I consistently get 0.554~. The only significant difference between the disassemblies is that Python 2.7 emits the SLICE+3 byte code while Python 3.4 emits BUILD_SLICE followed by BINARY_SUBSCR. Note that I've eliminated the candidates for potential slowdown from the other question, namely strings and the fact that xrange doesn't exist in Python 3.4 (which is supposed to be similar to the latter's range class anyways).
Using itertools' islice yields nearly identical timings between the two, so I highly suspect that it's the slicing that's the cause of the difference here.
Why is this happening and is there a link to an authoritative source documenting change in behavior?
EDIT: In response to the answer, I have wrapped the range objects in list, which did give a noticeable speedup. However as I increased the number of iterations in timeit I noticed that the timing differences became larger and larger. As a sanity check, I replaced the slicing with None to see what would happen.
500 iterations in timeit.
c = 1000000
r = list(range(c))
def slow():
    for pos in r:
        None
yields 10.688 and 9.915 respectively. Replacing the for loop with for pos in islice(r, 0, c, 3) yields 7.626 and 6.270 respectively. Replacing None with r[pos] yielded 20~ and 28~ respectively. r[pos:pos+3] yields 67.531 and 106.784 respectively.
As you can see, the timing differences are huge. Again, I'm still convinced the issue is not directly related to range.
On Python 2.7, you're iterating over a list and slicing a list. On Python 3.4, you're iterating over a range and slicing a range.
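A small sketch of what that difference looks like in practice, assuming a Python 3 interpreter (the variable names here are illustrative, not from the question): slicing a range builds a new range object, while slicing a list copies the selected elements into a new list.

```python
# Python 3: range is a lazy sequence, so slicing it creates another range
# object instead of copying elements the way list slicing does.
r = range(1000000)
lst = list(r)

range_slice = r[10:13]
list_slice = lst[10:13]

print(type(range_slice).__name__, range_slice)  # range range(10, 13)
print(type(list_slice).__name__, list_slice)    # list [10, 11, 12]
```

Both slices hold the same values, but they are produced by different code paths, which is why the two timings are not directly comparable.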
When I run a test with a list on both Python versions:
from __future__ import print_function
import timeit
print(timeit.timeit('x[5:8]', setup='x = list(range(10))'))
I get 0.243554830551 seconds on Python 2.7 and 0.29082867689430714 seconds on Python 3.4, a much smaller difference.
The performance difference you see after eliminating the range object is much smaller. It comes primarily from two factors: addition is a bit slower on Python 3, and Python 3 needs to go through __getitem__ with a slice object for slicing, while Python 2 has __getslice__.
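To make the second factor concrete, here is a minimal sketch for Python 3. ShowKey is a hypothetical helper class written only for this illustration; it returns whatever key __getitem__ receives, which shows that a slice expression arrives as a slice object that has to be built first.

```python
class ShowKey:
    """Hypothetical helper: returns the key that __getitem__ receives."""

    def __getitem__(self, key):
        return key


probe = ShowKey()
index_key = probe[5]      # plain index: the int is passed through as-is
slice_key = probe[5:8]    # slice syntax: a slice object is created and passed

print(index_key)   # 5
print(slice_key)   # slice(5, 8, None)
```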
I wasn't able to replicate the timing difference you saw with r[pos]; you may have had some confounding factor in that test.
I have two functions to check whether two words word1 and word2 are anagrams or not (note: two words are said to be anagrams if you can rearrange the letters of one of the words to get the other).
def is_anagram(word1, word2):
    histogram = {}
    for char in word1:
        histogram[char] = histogram.get(char, 0) + 1
    # Try to exhaust all the letters in the histogram with the second word
    for char in word2:
        histogram[char] = histogram.get(char, 0) - 1
    for vals in histogram.values():
        if vals != 0: return False
    return True
Clearly, in the above function 3 loops are running so it has an overall complexity of O(n)
Here is the second implementation:
def is_anagram2(word1, word2):
    sorted_word1 = ''.join(sorted(word1))
    sorted_word2 = ''.join(sorted(word2))
    return sorted_word1 == sorted_word2
The sorted function has a complexity of nlogn, so the complexity of this function should be O(nlogn).
But still, if you measure the execution time of these two functions (e.g. through the timeit command in ipython), you find that the is_anagram2 function is faster.
Please explain why...
This is because the second function is leaving all the important work to the underlying native code of the Python interpreter (namely, the sorted function). The first function is instead doing everything in Python code, breaking the task into smaller operations.
Python (CPython at least, which is the official implementation) is implemented in C, and C code is an order of magnitude faster than Python code. This is because C is optimized and compiled to machine code that runs directly on your CPU. On the other hand, Python code runs on top of a virtual machine implemented in C (the Python interpreter): it has to be parsed, turned into Python bytecode, and then every single bytecode operation needs to be executed by the interpreter, which is much slower.
This is a common problem that will come up countless times when optimizing for performance in Python. Sometimes it is just better to let native code do the job because the overhead of Python bytecode outweighs its advantages. This is also the reason why libraries such as NumPy are very popular: they implement most of the functionality in C, and only expose high level APIs through Python modules, so they can be very efficient for certain tasks if used correctly instead of plain Python code.
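The following is a rough sketch, not a definitive benchmark, of how one might compare the approaches in Python 3. The function names and the test words are made up for this illustration. is_anagram_counter is an extra variant, not from the question, that uses collections.Counter; in CPython its counting step is backed by a C helper, so it keeps the O(n) approach while moving the per-character work out of Python bytecode.

```python
import timeit
from collections import Counter


def is_anagram_loop(word1, word2):
    # Pure-Python histogram: every step runs as interpreted bytecode.
    histogram = {}
    for char in word1:
        histogram[char] = histogram.get(char, 0) + 1
    for char in word2:
        histogram[char] = histogram.get(char, 0) - 1
    return all(count == 0 for count in histogram.values())


def is_anagram_sorted(word1, word2):
    # O(n log n), but the sorting itself happens inside the interpreter's C code.
    return sorted(word1) == sorted(word2)


def is_anagram_counter(word1, word2):
    # O(n) counting, delegated to collections.Counter.
    return Counter(word1) == Counter(word2)


if __name__ == "__main__":
    a, b = "listen" * 50, "silent" * 50
    for fn in (is_anagram_loop, is_anagram_sorted, is_anagram_counter):
        seconds = timeit.timeit(lambda: fn(a, b), number=2000)
        print("%-20s %.4f s" % (fn.__name__, seconds))
```

The absolute numbers depend on the machine, the interpreter version and the word length, so treat the output as an indication only.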
I recently began teaching myself Python, and have been using the language for an online course in algorithms. For some reason, much of the code I have written for this course is very slow (relative to the C/C++ and MATLAB code I have written in the past), and I'm starting to worry that I am not using Python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
    a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
    a=1 + 1
The MATLAB code takes about 0.3 seconds, and the Python code takes about 7 seconds. Is this normal? My Python code for much more complex problems is very slow. For example, as a homework assignment, I'm running depth-first search on a graph with about 900000 nodes, and this is taking forever. Thank you.
Performance is not an explicit design goal of Python:
Don’t fret too much about performance--plan to optimize later when
needed.
That's one of the reasons why Python integrates with a lot of high-performance computing backends, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way to go forward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (by using xrange instead of range in Python 2.7) won't get you very dramatic results.
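The same idea can be tried without any third-party package. The sketch below assumes Python 3 and uses only the standard library: the built-in sum() runs its loop in C, while python_loop does the same additions one bytecode at a time. The function names and the value of N are illustrative.

```python
import timeit

N = 1000000


def python_loop(n):
    # Every iteration and every addition is executed by the bytecode interpreter.
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    # The iteration and the additions happen inside the C implementation of sum().
    return sum(range(n))


if __name__ == "__main__":
    print("python loop: %.3f s" % timeit.timeit(lambda: python_loop(N), number=5))
    print("builtin sum: %.3f s" % timeit.timeit(lambda: builtin_sum(N), number=5))
```

How large the gap is will vary between machines and interpreter versions.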
Here is a bit of code that compares different approaches:
Your original list(range())
The suggested use of xrange()
Leaving the i out
Using numpy to do the addition using numpy arrays (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
    "for i in list(range(1000000)): a = 1+1",
    "for i in xrange(1000000): a = 1+1",
    "for _ in xrange(1000000): a = 1+1",
    "import numpy; one = numpy.ones(1000000); a = one+one",
    "import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;" \
    "one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph) ran on Python 2.7.13 on OSX:
The reason that Numpy performs faster than the CUDA solution is that the overhead of using CUDA does not beat the efficiency of Python+Numpy. For larger, floating point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more than 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (Depth-First Search): here is an interesting article on DFS in Python.
Try using xrange instead of range.
The difference between them is that **xrange** generates the values as you use them instead of range, which tries to generate a static list at runtime.
Unfortunately, python's amazing flexibility and ease comes at the cost of being slow. And also, for such large values of iteration, I suggest using itertools module as it has faster caching.
The xrange is a good solution however if you want to iterate over dictionaries and such, it's better to use itertools as in that, you can iterate over any type of sequence object.
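For readers on Python 3, where xrange no longer exists and range itself is lazy, the difference between a lazy range and a materialized list can be seen with a small sketch like this one (variable names are illustrative):

```python
import sys

n = 10**6

lazy = range(n)        # Python 3 range: values are produced on demand
eager = list(lazy)     # a list: all n references are stored up front

# getsizeof reports the size of the container itself, not of the ints it refers to.
print(sys.getsizeof(lazy))
print(sys.getsizeof(eager))
```

The range object stays small no matter how large n is, while the list grows with n.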
EDIT: I'm redoing the question entirely. The issue has nothing to do with time.time()
Here's a program:
import time
start=time.time()
a=9<<(1<<26) # The line that makes it take a while
print(time.time()-start)
This program, when saved as a file and run with IDLE in Python 3.4, takes about 10 seconds, even though 0.0 is printed out from time.time(). The issue is very clearly with IDLE, because when run from the command line this program takes almost no time at all.
Another program that has the same effect, as found by senshin, is:
def f():
    a = 9<<(1<<26)
I have confirmed that this same program, when run in Python 2.7 IDLE or from the command line on python 2.7 or 3.4, is near instantaneous.
So what is Python 3.4 IDLE doing that makes it take so long? I understand that calculating this number and saving it to memory is disk intensive, but what I'd like to know is why Python 3.4 IDLE performs this computation and write when Python 2.7 IDLE and command line Python presumably do not.
I would look at that line and pick it apart. You have:
9 << (1 << 26)
(1 << 26) is the first expression evaluated, and it produces a fairly large number: you are multiplying 1 by 2 to the power of 26, which produces 2 ** 26 (67,108,864) in memory. This is not the problem, however. You then shift 9 left by 2 ** 26 bits. This produces a number that is about 67 million bits long in memory (roughly 20 million decimal digits), because the shift is so large. Be careful in the future, as shifts by what seem to be small amounts do in fact grow very fast. If it were any larger, your program might not have run at all. Your expression mathematically evaluates to 9 * 2 ** (2 ** 26), if you were curious.
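The size of the result can be checked without printing it. This is a small sketch assuming Python 3; bit_length() gives the exact number of bits, and the digit count is only an estimate derived from it.

```python
import math

value = 9 << (1 << 26)

# 9 is 0b1001 (4 bits), so shifting it left by 2**26 gives 2**26 + 4 bits.
bits = value.bit_length()

# Rough estimate of the number of decimal digits, derived from the bit count.
approx_digits = int(bits * math.log10(2)) + 1

print(bits)
print(approx_digits)
```

Converting the number to a string, on the other hand, is a far more expensive operation, so it is best avoided.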
The ambiguity in the comment section is probably actually dealing with how this huge portion of memory is handled by python under the hood, and not IDLE.
EDIT 1:
I think that what is happening is that a mathematical expression evaluates to its answer, even when placed inside of a function that isn't called yet, but only if the expression is self-sufficient. This means that if a variable is used in the equation, the equation will be untouched in the byte code, and not evaluated until hard execution. The function has to be interpreted, and in that process, I think that your value is actually computed, resulting in the slower times. I am not sure about this, but I strongly suspect this behavior to be the root cause. Even if it is not so, you have to admit that 9<<(1<<26) kicks the computer in the behind; there's not much optimization that can be done there.
In[73]: def create_number():
            return 9<<(1<<26)
In[74]: #Note that this seems instantaneous, but try calling the function!
In[75]: %timeit create_number()
#Python environment crashes because task is too hard
There is a slight deception in this kind of testing however. When trying this with the regular timeit, I got:
In[3]: from timeit import timeit
In[4]: timeit(setup = 'from __main__ import create_number', stmt = 'create_number()', number = 1)
Out[4]: .004942887388800443
Also keep in mind that printing the value is not do-able, so something like:
In[102]: 9<<(1<<26)
should not even be attempted.
For even more added support:
I felt like a rebel, so I decided to see what would happen if I timeit the raw execution of the equation:
In[107]: %timeit 9<<(1<<26)
10000000 loops, best of 3: 22.8 ns per loop
In[108]: def empty(): pass
In[109]: %timeit empty()
10000000 loops, best of 3: 96.3 ns per loop
This is really fishy, because apparently this calculation happens faster than the time it takes Python to call an empty function, which is obviously not the case. I repeat, this is not instantaneous, but probably has something to do with retrieving an already calculated object somewhere in memory, and reusing that value to calculate the expression. Anyways, nice question.
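The "already calculated object" suspicion can be checked by looking at the constants stored in a compiled function. The sketch below assumes CPython 3 and uses a deliberately small expression; folded and not_folded are made-up names for this illustration. Whether a particular expression such as 9<<(1<<26) gets folded depends on the CPython version, so this only demonstrates the mechanism.

```python
def folded():
    # Both operands are literals, so CPython can compute the product at
    # compile time and store the result as a constant of the function.
    return 1000 * 1000


def not_folded(n):
    # One operand is a variable, so the multiplication has to wait until
    # the function is actually called.
    return n * 1000


print(folded.__code__.co_consts)
print(not_folded.__code__.co_consts)
```

If the precomputed product shows up among the constants of folded, then timing the expression mostly measures loading that constant rather than doing the arithmetic.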
I am really puzzled. Here are more results with 3.4.1. Running either of the first two lines from the editor in either 3.4.1 or 3.3.5 gives the same contrast.
>>> a = 1 << 26; b = 9 << a # fast, , .5 sec
>>> c = 9 << (1 << 26) # slow, about 3 sec
>>> b == c # fast
True
>>> exec('d=9<<(1<<26)', globals()) # fast
>>> c == d # fast
True
The difference between normal execution and Idle's is that Idle exec's code in an exec call like the above, except that the 'globals' passed to exec is not globals() but a dict configured to look like globals(). I do not know of any 2.7 -- 3.4 difference in Idle in this respect except for the change of exec from statement to function. How can exec'ing an exec be faster than a single exec? How can adding an intermediate binding be faster?
I think I may have implemented this incorrectly because the results do not make sense. I have a Go program that counts to 1000000000:
package main
import (
    "fmt"
)
func main() {
    for i := 0; i < 1000000000; i++ {}
    fmt.Println("Done")
}
It finishes in less than a second. On the other hand I have a Python script:
x = 0
while x < 1000000000:
    x+=1
print 'Done'
It finishes in a few minutes.
Why is the Go version so much faster? Are they both counting up to 1000000000 or am I missing something?
One billion is not a very big number. Any reasonably modern machine should be able to do this in a few seconds at most, if it's able to do the work with native types. I verified this by writing an equivalent C program, reading the assembly to make sure that it actually was doing addition, and timing it (it completes in about 1.8 seconds on my machine).
Python, however, doesn't have a concept of natively typed variables (or meaningful type annotations at all), so it has to do hundreds of times as much work in this case. In short, the answer to your headline question is "yes". Go really can be that much faster than Python, even without any kind of compiler trickery like optimizing away a side-effect-free loop.
pypy actually does an impressive job of speeding up this loop
import time

def main():
    x = 0
    while x < 1000000000:
        x+=1

if __name__ == "__main__":
    s=time.time()
    main()
    print time.time() - s
$ python count.py
44.221405983
$ pypy count.py
1.03511095047
~97% speedup!
Clarification for 3 people who didn't "get it". The Python language itself isn't slow. The CPython implementation is a relatively straightforward way of running the code. PyPy is another implementation of the language that does many tricky things (especially the JIT) that can make enormous differences. Directly answering the question in the title: Go isn't "that much" faster than Python, Go is that much faster than CPython.
Having said that, the code samples aren't really doing the same thing. Python needs to instantiate 1000000000 of its int objects. Go is just incrementing one memory location.
This scenario will highly favor decent natively-compiled statically-typed languages. Natively compiled statically-typed languages are capable of emitting a very trivial loop of say, 4-6 CPU opcodes that utilizes simple check-condition for termination. This loop has effectively zero branch prediction misses and can be effectively thought of as performing an increment every CPU cycle (this isn't entirely true, but..)
Python implementations have to do significantly more work, primarily due to the dynamic typing. Python must make several different calls (internal and external) just to add two ints together. In Python it must call __add__ (it is effectively i = i.__add__(1), but this syntax will only work in Python 3.x), which in turn has to check the type of the value passed (to make sure it is an int), then it adds the integer values (extracting them from both of the objects), and then the new integer value is wrapped up again in a new object. Finally it re-assigns the new object to the local variable. That's significantly more work than a single opcode to increment, and doesn't even address the loop itself - by comparison, the Go/native version is likely only incrementing a register by side-effect.
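A short sketch of those steps in Python 3 (the starting value 1000 is arbitrary and chosen only for this illustration): calling __add__ explicitly gives the same result as the += statement, and the name ends up bound to a different object than before.

```python
x = 1000
before = x           # keep a second reference to the original int object

x = x.__add__(1)     # roughly what x += 1 does for ints: build a new object, rebind x

print(x)             # 1001
print(before)        # 1000
print(x is before)   # False
```

The original object is left untouched, which is why each pass through the loop involves creating, and eventually discarding, an int object.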
Java will fare much better in a trivial benchmark like this and will likely be fairly close to Go; the JIT and static typing of the counter variable can ensure this (it uses a special integer add JVM instruction). Once again, Python has no such advantage. Now, there are some implementations like PyPy/RPython, which run a static-typing phase and should fare much better than CPython here.
You've got two things at work here. The first of which is that Go is compiled to machine code and run directly on the CPU while Python is compiled to bytecode run against a (particularly slow) VM.
The second, and more significant, thing impacting performance is that the semantics of the two programs are actually significantly different. The Go version makes a "box" called "x" that holds a number and increments that by 1 on each pass through the program. The Python version actually has to create a new "box" (int object) on each cycle (and, eventually, has to throw them away). We can demonstrate this by modifying your programs slightly:
package main
import (
    "fmt"
)
func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf("%d %p\n", i, &i)
    }
}
...and:
x = 0
while x < 10:
    x += 1
    print x, id(x)
This is because Go, due to its C roots, takes a variable name to refer to a place, where Python takes variable names to refer to things. Since an integer is considered a unique, immutable entity in Python, we must constantly make new ones. Python should be slower than Go, but you've picked a worst-case scenario: in the Benchmarks Game, we see Go being, on average, about 25 times faster (100 times in the worst case).
You've probably read that, if your Python programs are too slow, you can speed them up by moving things into C. Fortunately, in this case, somebody's already done this for you. If you rewrite your empty loop to use xrange() like so:
for x in xrange(1000000000):
    pass
print "Done."
...you'll see it run about twice as fast. If you find loop counters to actually be a major bottleneck in your program, it might be time to investigate a new way of solving the problem.
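For anyone who wants to measure this themselves on Python 3 (where range plays the role of xrange), a sketch along these lines could be used. The function names and the smaller loop count are assumptions made for this illustration so that it finishes quickly.

```python
import timeit

N = 1000000  # much smaller than the question's 1000000000, to keep the run short


def count_with_while(n):
    # The counter is compared and incremented by explicit Python bytecode.
    x = 0
    while x < n:
        x += 1
    return x


def count_with_for(n):
    # The counting is done by the range iterator, which is implemented in C.
    x = 0
    for x in range(1, n + 1):
        pass
    return x


if __name__ == "__main__":
    print("while: %.3f s" % timeit.timeit(lambda: count_with_while(N), number=5))
    print("for:   %.3f s" % timeit.timeit(lambda: count_with_for(N), number=5))
```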
@troq
I'm a little late to the party, but I'd say the answer is yes and no. As @gnibbler pointed out, CPython is slower in the simple implementation, but PyPy is JIT-compiled for much faster code when you need it.
If you're doing numeric processing with CPython, most people will do it with numpy, resulting in fast operations on arrays and matrices. Recently I've been doing a lot with numba, which allows you to add a simple wrapper to your code. For this one I just added @njit to a function incALot() which runs your code above.
On my machine CPython takes 61 seconds, but with the numba wrapper it takes 7.2 microseconds, which will be similar to C and maybe faster than Go. That's an 8 million times speedup.
So, in Python, if things with numbers seem a bit slow, there are tools to address it - and you still get Python's programmer productivity and the REPL.
import time
from numba import njit

def incALot(y):
    x = 0
    while x < y:
        x += 1

@njit('i8(i8)')
def nbIncALot(y):
    x = 0
    while x < y:
        x += 1
    return x

size = 1000000000
start = time.time()
incALot(size)
t1 = time.time() - start
start = time.time()
x = nbIncALot(size)
t2 = time.time() - start
print('CPython3 takes %.3fs, Numba takes %.9fs' %(t1, t2))
print('Speedup is: %.1f' % (t1/t2))
print('Just Checking:', x)
CPython3 takes 58.958s, Numba takes 0.000007153s
Speedup is: 8242982.2
Just Checking: 1000000000
The problem is that Python is interpreted and Go isn't, so there's no real way to benchmark their speeds against each other. Interpreted languages usually (though not always) have a VM component, and that's where the problem lies: any test you run is bound by the interpreter, not by the actual runtime. Go is slightly slower than C, mostly because it uses garbage collection instead of manual memory management. That said, Go is fast compared to Python because it's a compiled language. The only thing lacking in Go is bug testing; I stand corrected if I'm wrong.
It is possible that the compiler realized that you didn't use the "i" variable after the loop, so it optimized the final code by removing the loop.
Even if you used it afterwards, the compiler is probably smart enough to substitute the loop with
i = 1000000000;
Hope this helps =)
I'm not familiar with go, but I'd guess that go version ignores the loop since the body of the loop does nothing. On the other hand, in the python version, you are incrementing x in the body of the loop so it's probably actually executing the loop.
Here are two programs that naively calculate the number of prime numbers <= n.
One is in Python and the other is in Java.
public class prime{
    public static void main(String args[]){
        int n = Integer.parseInt(args[0]);
        int nps = 0;
        boolean isp;
        for(int i = 1; i <= n; i++){
            isp = true;
            for(int k = 2; k < i; k++){
                if( (i*1.0 / k) == (i/k) ) isp = false;
            }
            if(isp){nps++;}
        }
        System.out.println(nps);
    }
}
#!/usr/bin/python
import sys
n = int(sys.argv[1])
nps = 0
for i in range(1,n+1):
    isp = True
    for k in range(2,i):
        if( (i*1.0 / k) == (i/k) ): isp = False
    if isp == True: nps = nps + 1
print nps
Running them on n=10000 I get the following timings.
shell:~$ time python prime.py 10000 && time java prime 10000
1230
real 0m49.833s
user 0m49.815s
sys 0m0.012s
1230
real 0m1.491s
user 0m1.468s
sys 0m0.016s
Am I using for loops in python in an incorrect manner here or is python actually just this much slower?
I'm not looking for an answer that is specifically crafted for calculating primes but rather I am wondering if python code is typically utilized in a smarter fashion.
The Java code was compiled with
javac 1.6.0_20
Run with java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1~9.10.1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
Python is:
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
As has been pointed out, straight Python really isn't made for this sort of thing. That the prime checking algorithm is naive is also not the point. However, with two simple things I was able to greatly reduce the time in Python while using the original algorithm.
First, put everything inside of a function, call it main() or something. This decreased the time on my machine in Python from 20.6 seconds to 14.54 seconds. Doing things globally is slower than doing them in a function.
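One way to see why, sketched here for CPython 3 with made-up names (SOURCE, summed): the same loop compiles to name-based instructions at module level and to "fast" local-variable instructions inside a function. The exact instruction names vary between CPython versions, so the sketch only filters on the NAME and FAST parts of the opcode names.

```python
import dis

# The loop as it would appear at module (global) level.
SOURCE = """
total = 0
for i in range(100):
    total += i
"""


def summed():
    # The same loop inside a function, where total and i are locals.
    total = 0
    for i in range(100):
        total += i
    return total


module_code = compile(SOURCE, "<module-level>", "exec")
module_ops = {ins.opname for ins in dis.get_instructions(module_code)}
function_ops = {ins.opname for ins in dis.get_instructions(summed)}

# Module-level variables go through dictionary-based name lookups ...
print(sorted(op for op in module_ops if "NAME" in op))
# ... while function locals use indexed "fast" slots.
print(sorted(op for op in function_ops if "FAST" in op))
```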
Second, use Psyco, a JIT compiler. This requires adding two lines to the top of the file (and of course having psyco installed):
import psyco
psyco.full()
This brought the final time to 2.77 seconds.
One last note. I decided for kicks to use Cython on this and got the time down to 0.8533. However, knowing how to make the few changes to make it fast Cython code isn't something that I recommend for the casual user.
Yes, Python is slow, about a hundred times slower than C. You can use xrange instead of range for a small speedup, but other than that it's fine.
Ultimately what you're doing wrong is that you do this in plain Python, instead of using optimized libraries such as Numpy or Psyco.
Java comes with a jit compiler that makes a big difference where you're just crunching numbers.
You can make your Python about twice as fast by replacing that complicated test with
if i % k == 0: isp = False
You can also make it about eight times faster (for n=10000) than that by adding a break after that isp = False.
Also, do yourself a favor and skip the even numbers (adding one to nps to start to include 2).
Finally, you only need k to go up to sqrt(i).
Of course, if you make the same changes in the Java, it's still about 10x faster than the optimized Python.
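Put together, the suggestions above might look like the following sketch. It assumes Python 3.8 or later (for math.isqrt), and count_primes is a name chosen for this illustration. Note that it does not count 1 as a prime, so for n=10000 it reports 1229 where the original programs print 1230.

```python
import math


def count_primes(n):
    """Count primes <= n by trial division, using the shortcuts listed above."""
    if n < 2:
        return 0
    count = 1  # start at 1 to account for 2, the only even prime
    for i in range(3, n + 1, 2):                  # skip the even numbers
        is_prime = True
        for k in range(3, math.isqrt(i) + 1, 2):  # odd divisors up to sqrt(i)
            if i % k == 0:                        # modulo instead of float division
                is_prime = False
                break                             # stop at the first divisor found
        if is_prime:
            count += 1
    return count


print(count_primes(10000))  # 1229
```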
Boy, when you said it was a naive implementation, you sure weren't joking!
But yes, a one to two order of magnitude difference in performance is not unexpected when comparing JIT-compiled, optimized machine code with an interpreted language. An alternative Python implementation such as Jython, which runs on the Java VM, may well be faster for this task; you could give it a whirl. Cython, which allows you to add static typing to Python and get C-like performance in some cases, may be worth investigating as well.
Even when considering the standard Python interpreter, CPython, though, the question is: is Python fast enough for the task at hand? Will the time you save writing the code in a dynamic language like Python make up for the extra time spent running it? If you had to write a given program in Java, would it seem like too much work to be worth the trouble?
Consider, for example, that a Python program running on a modern computer will be about as fast as a Java program running on a 10-year-old computer. The computer you had ten years ago was fast enough for many things, wasn't it?
Python does have a number of features that make it great for numerical work. These include an integer type that supports an unlimited number of digits, a decimal type with unlimited precision, and an optional library called NumPy specifically for calculations. Speed of execution, however, is not generally one of its major claims to fame. Where it excels is in getting the computer to do what you want with minimal cognitive friction.
If you're looking to do it fast, Python probably isn't the way forward, but you could speed it up a bit. First, you're using quite a slow way to test for divisibility. Modulo is quicker. You can also stop the inner loop (with k) as soon as it detects a match. I'd do something like this:
nps = 0
for i in range(1, n+1):
    if all(i % k for k in range(2, i)): # i.e. if divisible by none of them
        nps += 1
That brings it down from 25 s to 1.5 s for me. Using xrange brings it down to 0.9 s.
You could speed it up further by keeping a list of primes you've already found, and only testing those, rather than every number up to i (if i isn't divisible by 2, it won't be divisible by 4, 6, 8...).
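A minimal sketch of that idea in Python 3, with count_primes_with_list as a made-up name. It also stops once the candidate divisor exceeds the square root of i, which is an extra shortcut borrowed from the other answers rather than part of this suggestion.

```python
def count_primes_with_list(n):
    """Count primes <= n, dividing only by the primes found so far."""
    primes = []
    for i in range(2, n + 1):
        is_prime = True
        for p in primes:
            if p * p > i:      # no prime divisor up to sqrt(i), so i is prime
                break
            if i % p == 0:     # divisible by a smaller prime, so not prime
                is_prime = False
                break
        if is_prime:
            primes.append(i)
    return len(primes)


print(count_primes_with_list(10000))  # 1229
```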
Why don't you post something about the memory usage - and not just the speed? Trying to get a simple servlet on tomcat is wasting 3GB on my server.
What you did with the examples up there is not very good. You need to use numpy. Replace for/range with while loops, thus avoiding the list creation.
At last, python is quite suitable for number crunching, at least by people that do it the right way, and know what Sieve of Eratosthenes is, or mod operation is.
There are lots of things you can do to this algorithm to speed it up, but most of them would also speed up the Java version as well. Some of those will speed up the Python more than the Java, so they're worth testing.
Here's just a couple of changes that speed it up from 11.4 to 2.8 seconds on my system:
nps = 0
for i in range(1,n+1):
    isp = True
    for k in range(2,i):
        isp = isp and (i % k != 0)
    if isp: nps = nps + 1
print nps
Python is a language which, ironically, is well-suited for developing algorithms. Even a modified algorithm like this:
# See Thomas K for use of all(), many posters for sqrt optimization
nps = 0
for i in xrange(1, n+1):
    if all(i % k for k in xrange(2, 1 + int(i ** 0.5))):
        nps += 1
runs in significantly under one second. Code like this:
def eras(n):
    last = n + 1
    sieve = [0,0] + range(2, last)
    sqn = int(round(n ** 0.5))
    it = (i for i in xrange(2, sqn + 1) if sieve[i])
    for i in it:
        sieve[i*i:last:i] = [0] * (n//i - i + 1)
    return filter(None, sieve)
is faster still. Or try out these.
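The sieve above is written for Python 2 (it relies on range returning a list and on filter returning one). A possible Python 3 adaptation is sketched below; it is not the answer's own code. sieve_primes is a made-up name, it uses a bytearray of flags instead of a list of numbers, and it assumes Python 3.8 or later for math.isqrt.

```python
import math


def sieve_primes(n):
    """Return the list of primes <= n using the Sieve of Eratosthenes."""
    if n < 2:
        return []
    is_candidate = bytearray([1]) * (n + 1)   # one flag per number 0..n
    is_candidate[0] = is_candidate[1] = 0     # 0 and 1 are not prime
    for i in range(2, math.isqrt(n) + 1):
        if is_candidate[i]:
            # Clear every multiple of i, starting at i*i, in one slice assignment.
            multiples = len(range(i * i, n + 1, i))
            is_candidate[i * i::i] = bytes(multiples)
    return [i for i, flag in enumerate(is_candidate) if flag]


print(len(sieve_primes(10000)))  # 1229
```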
The thing is, python is usually fast enough for designing your solution. If it is not fast enough for production, use numpy or Jython to goose more performance out of it. Or move it to a compiled language, taking your algorithm observations learned in python with you.
Yes, Python is one of the slowest practical languages you'll encounter. While loops are marginally faster than for i in xrange(), but ultimately Python will always be much, much slower than anything else.
Python has its place: Prototyping theory and ideas, or in any situation where the ability to produce code fast is more important than the code's performance.
Python is a scripting language. Not a programming language.