Rust function as slow as its python counterpart


I am trying to speed up Python programs using Rust, a language in which I am a total beginner. I wrote a function that counts the occurrences of each possible string of length n within a larger string. For instance, if the main string is "AAAAT" and n=3, the outcome would be a hashmap {"AAA":2,"AAT":1}. I use pyo3 to call the Rust function from Python. The code of the Rust function is:
fn count_nmers(seq: &str, n: usize) -> PyResult<HashMap<&str,u64>> {
    let mut current_pos: usize = 0;
    let mut counts: HashMap<&str,u64> = HashMap::new();
    while current_pos+n <= seq.len() {
        //print!("{}\n", &seq[current_pos..current_pos+n]);
        match counts.get(&seq[current_pos..current_pos+n]) {
            Some(repeats) => counts.insert(&seq[current_pos..current_pos+n],repeats+1),
            None => counts.insert(&seq[current_pos..current_pos+n],1)
        };
        current_pos +=1;
    }
    //print!("{:?}",counts)
    Ok(counts)
}
When I use small values for n (n<10), Rust is about an order of magnitude faster than Python, but as the length of n increases, the gap tends to zero with both functions having the same speed by n=200. (see graph)
Times to count for different n-mer lengths (Python black, rust red)
I must be doing something wrong with the strings, but I can't find the mistake.
The python code is:
def nmer_freq_table(sequence,nmer_length=6):
    nmer_dict=dict()
    for nmer in seq_win(sequence,window_size=nmer_length):
        if str(nmer) in nmer_dict.keys():
            nmer_dict[str(nmer)]=nmer_dict[str(nmer)]+1
        else:
            nmer_dict[str(nmer)]=1
    return nmer_dict
def seq_win(seq,window_size=2):
    length=len(seq)
    i=0
    while i+window_size <= length:
        yield seq[i:i+window_size]
        i+=1

You are computing the hash function several times per iteration, which may matter for large values of n. Try using the entry API instead of manual inserts:
while current_pos+n <= seq.len() {
    let en = counts.entry(&seq[current_pos..current_pos+n]).or_default();
    *en += 1;
    current_pos +=1;
}
Next, make sure you are running code compiled with --release, for example with cargo run --release.
One more thing to keep in mind: Rust's default hash function may not be optimal for your case, and you can change it.
Finally, on large data most of the time is spent in the HashMap/dict internals, which are compiled code rather than Python even on the Python side. So don't expect the speedup to scale well.
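For comparison, the same single-lookup idea can be sketched on the Python side. This is a minimal sketch, not the asker's code: `nmer_counts` is a hypothetical name, and it assumes plain `str` input. `collections.Counter` does the counting in one pass, so the explicit membership test, read and write from the question's loop disappear.

```python
from collections import Counter


def nmer_counts(seq, n):
    """Count every window of length n in seq (hypothetical helper)."""
    # One pass over the windows; Counter handles the missing-key case itself,
    # so there is no separate membership test before each update.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


print(dict(nmer_counts("AAAAT", 3)))
```

As with the Rust version, each window still has to be sliced and hashed, so the cost per window grows with n either way.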

Could it be because as n gets larger the number of iterations through the loop gets smaller?
Fewer iterations through the loop would reduce the performance gain seen by using Rust. I'm sure there is a small per function call performance cost for transition/marshaling to Rust from Python. This would explain how eventually the performance from pure Python and Python/Rust becomes the same.

Related

Higher time complexity but still faster

I have two functions to check whether two words word1 and word2 are anagrams or not (note: two words are said to be anagrams if you can rearrange the letters of one of the words to get the other).
def is_anagram(word1, word2):
    histogram = {}
    for char in word1:
        histogram[char] = histogram.get(char, 0) + 1
    # Trying to exhaust all the letters in a histogram by second word
    for char in word2:
        histogram[char] = histogram.get(char, 0) - 1
    for vals in histogram.values():
        if vals != 0: return False
    return True
Clearly, in the above function 3 loops are running so it has an overall complexity of O(n)
Here is the second implementation:
def is_anagram2(word1, word2):
    sorted_word1 = ''.join(sorted(word1))
    sorted_word2 = ''.join(sorted(word2))
    return sorted_word1 == sorted_word2
The sorted function has a complexity of nlogn, so the complexity of this function should be O(nlogn).
But if you measure the execution time of these two functions (e.g. with the timeit command in IPython), you find that the is_anagram2 function is faster.
Please explain why...

This is because the second function is leaving all the important work to the underlying native code of the Python interpreter (namely, the sorted function). The first function is instead doing everything in Python code, breaking the task into smaller operations.
Python (CPython at least, which is the official implementation) is implemented in C, and C code is an order of magnitude faster than Python code. This is because C is optimized and compiled to machine code that runs directly on your CPU. On the other hand, Python code runs on top of a virtual machine implemented in C (the Python interpreter): it has to be parsed, turned into Python bytecode, and then every single bytecode operation needs to be executed by the interpreter, which is much slower.
This is a common problem that will come up countless times when optimizing for performance in Python. Sometimes it is just better to let native code do the job, because the overhead of Python bytecode outweighs its advantages. This is also the reason why libraries such as NumPy are very popular: they implement most of the functionality in C, and only expose high-level APIs through Python modules, so they can be very efficient for certain tasks if used correctly instead of plain Python code.
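The same reasoning suggests a third variant that keeps the O(n) histogram approach but hands the counting to library code. A minimal sketch, assuming CPython (where `collections.Counter` counts elements with a C helper); `is_anagram3` is a hypothetical name, not from the question:

```python
from collections import Counter


def is_anagram3(word1, word2):
    # Each Counter builds a letter histogram in a single pass;
    # comparing the two histograms replaces the three explicit Python loops.
    return Counter(word1) == Counter(word2)


print(is_anagram3("listen", "silent"))
```

Whether this beats the sorted() version for short words is something to measure with timeit rather than assume.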

String concatenation much faster in Python than Go

I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.
The "Go is going to be faster" idea was supported by a trivial test:
# test.py
print("Hello world")
$ time python test.py
Hello world
real 0m0.029s
user 0m0.019s
sys 0m0.010s
// test.go
package main
import "fmt"
func main() {
    fmt.Println("hello world")
}
$ time ./test
hello world
real 0m0.001s
user 0m0.001s
sys 0m0.000s
Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:
$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174
However, my next contrived test was how fast is Go when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:
# test2.py
s = ""
for i in range(1000000):
    s += "a"
$ time python test2.py
real 0m0.179s
user 0m0.145s
sys 0m0.013s
// test2.go
package main
func main() {
    s := ""
    for i := 0; i < 1000000; i++ {
        s += "a"
    }
}
$ time ./test2
real 0m56.840s
user 1m50.836s
sys 0m17.653
So Go is hundreds of times slower than Python.
Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in i (i is 10 times bigger leads to 100 times slowdown).
However, the Python implementation seems much faster: 10 times more loops only slows it down by twice. The same effect persists if you concatenate str(i), so I doubt there's some kind of magical JIT optimization to s = 100000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.
Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, ya'know, think about things and do my homework)?
Or have I run into a case where Python happens to work well, but falls apart under more complex use?
Versions used: Python 3.8.2, Go 1.14.2

TL;DR summary: basically you're testing the two implementations' allocators / garbage collectors and heavily weighting the scale on the Python side (by chance, as it were, but this is something the Python folks optimized at some point).
To expand my comments into a real answer:
Both Go and Python have counted strings, i.e., strings are implemented as a two-element header thingy containing a length (byte count or, for Python 3 strings, Unicode characters count) and data pointer.
Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.
But the underlying implementations differ quite a bit in one important way: the version of Python you are using has a reference-counting GC. The Go system you are using does not.
With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go) although the actual Python implementation is in C and I have not made all the details line up properly:
// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need = s.len + t.len
    if s.refcount == 1 { // can modify string in-place
        data = s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need
    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string
    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}
By over-allocating—rounding up the need value so that cap(new_s) is large—we get about log2(n) calls to the allocator, where n is the number of times you do s += "a". With n being 1000000 (one million), that's about 20 times that we actually have to invoke the make_new_string function and release (for gc purposes because the collector uses refcounts as a first pass) the old string s.
[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling but still a multiplicative increase. To other readers, see comment.]
The current Go implementation allocates strings without a separate capacity header field (see reflect.StringHeader and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount—we can't tell in the runtime routine that adds two strings, that the target has only one reference—and the inability to observe the equivalent of cap(s) (or cap(s.data)), the Go runtime has to create a new string every time. That's one million memory allocations.
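To make the difference in allocator calls concrete, here is a small simulation. It is only a sketch under stated assumptions: it models capacity doubling (the real CPython growth factor is reportedly smaller, per the edit note above), it starts from an empty buffer, and `count_allocations` is a hypothetical helper that counts allocator calls without touching any real string.

```python
def count_allocations(n_appends, overallocate):
    """Count how often a growing buffer must be reallocated (simulation only)."""
    capacity = 0
    length = 0
    allocations = 0
    for _ in range(n_appends):
        length += 1
        if length > capacity:
            # Assumed policy: double the capacity when over-allocating,
            # otherwise allocate exactly what is needed right now.
            capacity = max(length, capacity * 2) if overallocate else length
            allocations += 1
    return allocations


print(count_allocations(1_000_000, overallocate=True))
print(count_allocations(1_000_000, overallocate=False))
```

With doubling, the count grows with log2(n); with exact-fit allocation it equals the number of appends, which is the one-million-allocations case described above.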
To show that the Python code really does use the refcount, take your original Python:
s = ""
for i in range(1000000):
    s += "a"
and add a second variable t like this:
s = ""
t = s
for i in range(1000000):
    s += "a"
    t = s
The difference in execution time is impressive:
$ time python test2.py
0.68 real 0.65 user 0.03 sys
$ time python test3.py
34.60 real 34.08 user 0.51 sys
The modified Python program still beats Go (1.13.5) on this same system:
$ time ./test2
67.32 real 103.27 user 13.60 sys
and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The refcounting nature of the Python GC allows it to never stop: even with a refcount of 2, the refcount on t drops to 1 and then next assignment to t drops it to zero, releasing the memory block for re-use in the next trip through the main loop. So it's probably picking up the same memory block over and over again.
(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])

Well. You should never, ever use string concatenation in this way :-)
In Go, try strings.Builder:
package main
import (
    "strings"
)
func main() {
    var b1 strings.Builder
    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }
}

Can we say that Mathematically the time complexity of the code in python is better than that of C?

The python code for printing a diamond pattern is :
def main():
    n= input('The size of the diamond :: ')
    a=n
    for i in range(n):
        print ' '*a,'*'*(2*i-1)
        a=a-1
    a=0
    p=n
    for i in range(n):
        print ' '*a,'*'*(2*p-1)
        a=a+1
        p=p-1
main()
for a similar output the code in C is ::
#include<stdio.h>
int main()
{
    int i,j;
    int n;
    printf("---PATTERN---\n");
    printf("enter the number of rows :: \n");
    scanf("%d",&n);
    for(i=0;i<=n;i++)
    {
        for(j=n;j>i;j--)
        {
            printf(" ");
        }
        for(j=0;j<2*i+1;j++)
        {
            printf("*");
        }
        printf("\n");
    }
    i=0;
    for(i=0;i<=n;i++)
    {
        for(j=0;j<i;j++)
        {
            printf(" ");
        }
        for(j=2*n-1;j>=2*i-1;j--)
        {
            printf("*");
        }
        printf("\n");
    }
    return 0;
}
My Question is : Can we say that Mathematically the time complexity of the code in python is better than that of C?
Although the run time of the C program is lower than that of the Python one, the Python program does not involve nested loops the way the C program does. Can we say that structurally Python is a more efficient language?
-my apologies if the doubt sounds stupid.

Can we say that Mathematically the time complexity of the code in python is better than that of C?
No, you cannot. The time complexity of code with respect to a machine (even a theoretical one) depends on what the machine ultimately does (the time spent solving the problem), not on how you tell it what to do; C and Python are just two of many ways to tell it what to do. However, I'd recommend compiling the Python code to C and then checking it against your C program, so that you compare apples to apples and not to oranges. Even better, compile (and link) both into binaries and disassemble them to verify your assumptions and see that both do in fact loop.
Python does not involve nesting of loops
That's just syntactic sugar that Python as a language provides; however, under the hood, it'd loop too, like the other answers mention.
can we say that structurally Python is a more efficient language?
No again, because structure is a matter of style and not efficiency.
If you're talking about performance (time) or memory efficiency, then it's not an inherent property of the language itself but of an implementation of the language and how well it performs on a given architecture; this again should be measured and not assumed or guessed. For instance, take Lua: the same language has different interpreters (implementations), of which a few are remarkably faster than the others. So efficiency is a matter of implementation and not of the language itself.

The nested loops on Python are just not so easy to see. The statement print ' '*a,'*'*(2*i-1) is also a loop (in fact one loop per "multiplication") – how else would you be able to do a variable amount of repeated work? It's just not a loop that you spell out.
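Spelled out, the hidden loops look roughly like this. A Python 3 sketch; `repeat_char` and `diamond_row` are hypothetical helpers that mimic what `' ' * a` and the Python 2 print statement in the question do, not how the interpreter implements them:

```python
def repeat_char(ch, count):
    # Explicit version of `ch * count`: one loop iteration per repeated character.
    pieces = []
    for _ in range(count):
        pieces.append(ch)
    return "".join(pieces)


def diamond_row(spaces, stars):
    # Mirrors `print ' '*a, '*'*k`: the comma inserts a single space separator.
    return repeat_char(" ", spaces) + " " + repeat_char("*", stars)


print(repr(diamond_row(2, 3)))
```

Called once per row from the outer loop, this is the same nested-loop structure as the C program; it is just written inside the helper instead of inline.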

Well, when you execute ' '*a in Python you are actually running a hidden for loop.

Can Go really be that much faster than Python?

I think I may have implemented this incorrectly because the results do not make sense. I have a Go program that counts to 1000000000:
package main
import (
    "fmt"
)
func main() {
    for i := 0; i < 1000000000; i++ {}
    fmt.Println("Done")
}
It finishes in less than a second. On the other hand I have a Python script:
x = 0
while x < 1000000000:
    x+=1
print 'Done'
It finishes in a few minutes.
Why is the Go version so much faster? Are they both counting up to 1000000000 or am I missing something?

One billion is not a very big number. Any reasonably modern machine should be able to do this in a few seconds at most, if it's able to do the work with native types. I verified this by writing an equivalent C program, reading the assembly to make sure that it actually was doing addition, and timing it (it completes in about 1.8 seconds on my machine).
Python, however, doesn't have a concept of natively typed variables (or meaningful type annotations at all), so it has to do hundreds of times as much work in this case. In short, the answer to your headline question is "yes". Go really can be that much faster than Python, even without any kind of compiler trickery like optimizing away a side-effect-free loop.

pypy actually does an impressive job of speeding up this loop
import time
def main():
    x = 0
    while x < 1000000000:
        x+=1
if __name__ == "__main__":
    s=time.time()
    main()
    print time.time() - s
$ python count.py
44.221405983
$ pypy count.py
1.03511095047
~97% speedup!
Clarification for the 3 people who didn't "get it": the Python language itself isn't slow. The CPython implementation is a relatively straightforward way of running the code. PyPy is another implementation of the language that does many tricky things (especially the JIT) that can make enormous differences. Directly answering the question in the title: Go isn't "that much" faster than Python; Go is that much faster than CPython.
Having said that, the code samples aren't really doing the same thing. Python needs to instantiate 1000000000 of its int objects. Go is just incrementing one memory location.

This scenario will highly favor decent natively-compiled statically-typed languages. Natively compiled statically-typed languages are capable of emitting a very trivial loop of say, 4-6 CPU opcodes that utilizes simple check-condition for termination. This loop has effectively zero branch prediction misses and can be effectively thought of as performing an increment every CPU cycle (this isn't entirely true, but..)
Python implementations have to do significantly more work, primarily due to the dynamic typing. Python must make several different calls (internal and external) just to add two ints together. In Python it must call __add__ (it is effectively i = i.__add__(1), but this syntax will only work in Python 3.x), which in turn has to check the type of the value passed (to make sure it is an int), then it adds the integer values (extracting them from both of the objects), and then the new integer value is wrapped up again in a new object. Finally it re-assigns the new object to the local variable. That's significantly more work than a single opcode to increment, and doesn't even address the loop itself - by comparison, the Go/native version is likely only incrementing a register by side-effect.
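The steps just described can be imitated from Python itself. A rough Python 3 sketch of the dispatch, not the interpreter's actual C code path; `add_one` is a hypothetical helper:

```python
def add_one(x):
    # Look up __add__ on the operand's type, roughly as `x + 1` does.
    result = type(x).__add__(x, 1)
    if result is NotImplemented:
        raise TypeError("unsupported operand type for +")
    return result


big = 10 ** 6
bigger = add_one(big)
# The sum is wrapped in a separate int object; `big` itself is unchanged.
print(bigger, bigger is big, big)
```

Every pass through the counting loop repeats this lookup, type check and object creation, where the compiled loop executes a single increment instruction.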
Java will fare much better in a trivial benchmark like this and will likely be fairly close to Go; the JIT and static typing of the counter variable can ensure this (it uses a special integer add JVM instruction). Once again, Python has no such advantage. Now, there are some implementations like PyPy/RPython, which run a static-typing phase and should fare much better than CPython here.

You've got two things at work here. The first of which is that Go is compiled to machine code and run directly on the CPU while Python is compiled to bytecode run against a (particularly slow) VM.
The second, and more significant, thing impacting performance is that the semantics of the two programs are actually significantly different. The Go version makes a "box" called "x" that holds a number and increments that by 1 on each pass through the program. The Python version actually has to create a new "box" (int object) on each cycle (and, eventually, has to throw them away). We can demonstrate this by modifying your programs slightly:
package main
import (
    "fmt"
)
func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf("%d %p\n", i, &i)
    }
}
...and:
x = 0
while x < 10:
    x += 1
    print x, id(x)
This is because Go, due to its C roots, takes a variable name to refer to a place, whereas Python takes variable names to refer to things. Since an integer is considered a unique, immutable entity in Python, we must constantly make new ones. Python should be slower than Go, but you've picked a worst-case scenario: in the Benchmarks Game, we see Go being, on average, about 25 times faster (100 times in the worst case).
You've probably read that, if your Python programs are too slow, you can speed them up by moving things into C. Fortunately, in this case, somebody's already done this for you. If you rewrite your empty loop to use xrange() like so:
for x in xrange(1000000000):
    pass
print "Done."
...you'll see it run about twice as fast. If you find loop counters to actually be a major bottleneck in your program, it might be time to investigate a new way of solving the problem.

@troq
I'm a little late to the party, but I'd say the answer is yes and no. As @gnibbler pointed out, CPython is slower in the simple implementation, but PyPy is JIT-compiled for much faster code when you need it.
If you're doing numeric processing with CPython, most people will do it with NumPy, resulting in fast operations on arrays and matrices. Recently I've been doing a lot with Numba, which allows you to add a simple wrapper to your code. For this one I just added @njit to a function incALot() which runs your code above.
On my machine CPython takes 61 seconds, but with the Numba wrapper it takes 7.2 microseconds, which will be similar to C and maybe faster than Go. That's an 8 million times speedup.
So, in Python, if things with numbers seem a bit slow, there are tools to address it - and you still get Python's programmer productivity and the REPL.
import time
from numba import njit
def incALot(y):
    x = 0
    while x < y:
        x += 1
@njit('i8(i8)')
def nbIncALot(y):
    x = 0
    while x < y:
        x += 1
    return x
size = 1000000000
start = time.time()
incALot(size)
t1 = time.time() - start
start = time.time()
x = nbIncALot(size)
t2 = time.time() - start
print('CPython3 takes %.3fs, Numba takes %.9fs' %(t1, t2))
print('Speedup is: %.1f' % (t1/t2))
print('Just Checking:', x)
CPython3 takes 58.958s, Numba takes 0.000007153s
Speedup is: 8242982.2
Just Checking: 1000000000

The problem is that Python is interpreted and Go isn't, so there's no real way to benchmark their speeds against each other. Interpreted languages usually (though not always) have a VM component, and that's where the problem lies: any test you run is bounded by the interpreter, not by the actual runtime. Go is slightly slower than C, mostly because it uses garbage collection instead of manual memory management. That said, Go is fast compared to Python because it's a compiled language. The only thing lacking in Go is bug testing; I stand corrected if I'm wrong.

It is possible that the compiler realized that you didn't use the "i" variable after the loop, so it optimized the final code by removing the loop.
Even if you used it afterwards, the compiler is probably smart enough to substitute the loop with
i = 1000000000;
Hope this helps =)

I'm not familiar with go, but I'd guess that go version ignores the loop since the body of the loop does nothing. On the other hand, in the python version, you are incrementing x in the body of the loop so it's probably actually executing the loop.

Optimizing python for loops

Here are two programs that naively calculate the number of prime numbers <= n.
One is in Python and the other is in Java.
public class prime{
    public static void main(String args[]){
        int n = Integer.parseInt(args[0]);
        int nps = 0;
        boolean isp;
        for(int i = 1; i <= n; i++){
            isp = true;
            for(int k = 2; k < i; k++){
                if( (i*1.0 / k) == (i/k) ) isp = false;
            }
            if(isp){nps++;}
        }
        System.out.println(nps);
    }
}
#!/usr/bin/python
import sys
n = int(sys.argv[1])
nps = 0
for i in range(1,n+1):
    isp = True
    for k in range(2,i):
        if( (i*1.0 / k) == (i/k) ): isp = False
    if isp == True: nps = nps + 1
print nps
Running them on n=10000 I get the following timings.
shell:~$ time python prime.py 10000 && time java prime 10000
1230
real 0m49.833s
user 0m49.815s
sys 0m0.012s
1230
real 0m1.491s
user 0m1.468s
sys 0m0.016s
Am I using for loops in python in an incorrect manner here or is python actually just this much slower?
I'm not looking for an answer that is specifically crafted for calculating primes but rather I am wondering if python code is typically utilized in a smarter fashion.
The Java code was compiled with
javac 1.6.0_20
Run with java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1~9.10.1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
Python is:
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)

As has been pointed out, straight Python really isn't made for this sort of thing. That the prime checking algorithm is naive is also not the point. However, with two simple things I was able to greatly reduce the time in Python while using the original algorithm.
First, put everything inside of a function, call it main() or something. This decreased the time on my machine in Python from 20.6 seconds to 14.54 seconds. Doing things globally is slower than doing them in a function.
Second, use Psyco, a JIT compiler. This requires adding two lines to the top of the file (and of course having psyco installed):
import psyco
psyco.full()
This brought the final time to 2.77 seconds.
One last note. I decided for kicks to use Cython on this and got the time down to 0.8533. However, knowing how to make the few changes to make it fast Cython code isn't something that I recommend for the casual user.

Yes, Python is slow, about a hundred times slower than C. You can use xrange instead of range for a small speedup, but other than that it's fine.
Ultimately what you're doing wrong is that you do this in plain Python, instead of using optimized libraries such as Numpy or Psyco.
Java comes with a jit compiler that makes a big difference where you're just crunching numbers.

You can make your Python about twice as fast by replacing that complicated test with
if i % k == 0: isp = False
You can also make it about eight times faster (for n=10000) than that by adding a break after that isp = False.
Also, do yourself a favor and skip the even numbers (adding one to nps to start to include 2).
Finally, you only need k to go up to sqrt(i).
Of course, if you make the same changes in the Java, it's still about 10x faster than the optimized Python.
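Put together, the changes listed above (modulo test, early break, skipping even numbers, stopping at sqrt(i)) might look like the following. A Python 3 sketch with a hypothetical name `count_primes`; it assumes Python 3.8+ for `math.isqrt`. Unlike the question's program it does not count 1 as a prime, so its result is one lower than the 1230 printed there.

```python
import math


def count_primes(n):
    """Count primes <= n by trial division with the shortcuts above."""
    if n < 2:
        return 0
    count = 1  # start at one to include 2, then test odd numbers only
    for i in range(3, n + 1, 2):
        is_prime = True
        # Odd divisors up to sqrt(i) are enough for an odd candidate.
        for k in range(3, math.isqrt(i) + 1, 2):
            if i % k == 0:
                is_prime = False
                break  # stop at the first divisor found
        if is_prime:
            count += 1
    return count


print(count_primes(10000))
```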

Boy, when you said it was a naive implementation, you sure weren't joking!
But yes, a one to two order of magnitude difference in performance is not unexpected when comparing JIT-compiled, optimized machine code with an interpreted language. An alternative Python implementation such as Jython, which runs on the Java VM, may well be faster for this task; you could give it a whirl. Cython, which allows you to add static typing to Python and get C-like performance in some cases, may be worth investigating as well.
Even when considering the standard Python interpreter, CPython, though, the question is: is Python fast enough for the task at hand? Will the time you save writing the code in a dynamic language like Python make up for the extra time spent running it? If you had to write a given program in Java, would it seem like too much work to be worth the trouble?
Consider, for example, that a Python program running on a modern computer will be about as fast as a Java program running on a 10-year-old computer. The computer you had ten years ago was fast enough for many things, wasn't it?
Python does have a number of features that make it great for numerical work. These include an integer type that supports an unlimited number of digits, a decimal type with unlimited precision, and an optional library called NumPy specifically for calculations. Speed of execution, however, is not generally one of its major claims to fame. Where it excels is in getting the computer to do what you want with minimal cognitive friction.

If you're looking to do it fast, Python probably isn't the way forward, but you could speed it up a bit. First, you're using quite a slow way to test for divisibility. Modulo is quicker. You can also stop the inner loop (with k) as soon as it detects a match. I'd do something like this:
nps = 0
for i in range(1, n+1):
    if all(i % k for k in range(2, i)): # i.e. if divisible by none of them
        nps += 1
That brings it down from 25 s to 1.5 s for me. Using xrange brings it down to 0.9 s.
You could speed it up further by keeping a list of primes you've already found, and only testing those, rather than every number up to i (if i isn't divisible by 2, it won't be divisible by 4, 6, 8...).

Why don't you post something about the memory usage - and not just the speed? Trying to get a simple servlet on tomcat is wasting 3GB on my server.
What you did with the examples up there is not very good. You need to use numpy. Replace for/range with while loops, thus avoiding the list creation.
At last, python is quite suitable for number crunching, at least by people that do it the right way, and know what Sieve of Eratosthenes is, or mod operation is.

There are lots of things you can do to this algorithm to speed it up, but most of them would also speed up the Java version as well. Some of those will speed up the Python more than the Java, so they're worth testing.
Here's just a couple of changes that speed it up from 11.4 to 2.8 seconds on my system:
nps = 0
for i in range(1,n+1):
    isp = True
    for k in range(2,i):
        isp = isp and (i % k != 0)
    if isp: nps = nps + 1
print nps

Python is a language which, ironically, is well-suited for developing algorithms. Even a modified algorithm like this:
# See Thomas K for use of all(), many posters for sqrt optimization
nps = 0
for i in xrange(1, n+1):
    if all(i % k for k in xrange(2, 1 + int(i ** 0.5))):
        nps += 1
runs in significantly under one second. Code like this:
def eras(n):
    last = n + 1
    sieve = [0,0] + range(2, last)
    sqn = int(round(n ** 0.5))
    it = (i for i in xrange(2, sqn + 1) if sieve[i])
    for i in it:
        sieve[i*i:last:i] = [0] * (n//i - i + 1)
    return filter(None, sieve)
is faster still. Or try out these.
The thing is, python is usually fast enough for designing your solution. If it is not fast enough for production, use numpy or Jython to goose more performance out of it. Or move it to a compiled language, taking your algorithm observations learned in python with you.
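The sieve above is Python 2 code (it relies on range() and filter() returning lists). For readers on Python 3, the same idea could be sketched as below; this is an assumption-laden rewrite rather than the poster's code, with `primes_up_to` as a hypothetical name and a bytearray standing in for the list of flags.

```python
def primes_up_to(n):
    """Sieve of Eratosthenes returning all primes <= n (Python 3 sketch)."""
    if n < 2:
        return []
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    i = 2
    while i * i <= n:
        if is_prime[i]:
            # Clear every multiple of i from i*i upward with one slice assignment;
            # the replacement must have exactly as many bytes as the slice.
            is_prime[i * i::i] = bytearray(len(range(i * i, n + 1, i)))
        i += 1
    return [k for k, flag in enumerate(is_prime) if flag]


print(len(primes_up_to(10000)))
```

The slice assignment keeps the inner loop in compiled code, which is the same trick the original eras() uses.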

Yes, Python is one of the slowest practical languages you'll encounter. While loops are marginally faster than for i in xrange(), but ultimately Python will always be much, much slower than anything else.
Python has its place: Prototyping theory and ideas, or in any situation where the ability to produce code fast is more important than the code's performance.
Python is a scripting language. Not a programming language.
