Optimizing python for loops

Optimizing python for loops - python

Here are two programs that naively calculate the number of prime numbers <= n.
One is in Python and the other is in Java.
public class prime{
public static void main(String args[]){
int n = Integer.parseInt(args[0]);
int nps = 0;
boolean isp;
for(int i = 1; i <= n; i++){
isp = true;
for(int k = 2; k < i; k++){
if( (i*1.0 / k) == (i/k) ) isp = false;
}
if(isp){nps++;}
}
System.out.println(nps);
}
}
`#!/usr/bin/python`
import sys
n = int(sys.argv[1])
nps = 0
for i in range(1,n+1):
isp = True
for k in range(2,i):
if( (i*1.0 / k) == (i/k) ): isp = False
if isp == True: nps = nps + 1
print nps
Running them on n=10000 I get the following timings.
shell:~$ time python prime.py 10000 && time java prime 10000
1230
real 0m49.833s
user 0m49.815s
sys 0m0.012s
1230
real 0m1.491s
user 0m1.468s
sys 0m0.016s
Am I using for loops in python in an incorrect manner here or is python actually just this much slower?
I'm not looking for an answer that is specifically crafted for calculating primes but rather I am wondering if python code is typically utilized in a smarter fashion.
The Java code was compiled with
javac 1.6.0_20
Run with java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1~9.10.1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
Python is:
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)

As has been pointed out, straight Python really isn't made for this sort of thing. That the prime checking algorithm is naive is also not the point. However, with two simple things I was able to greatly reduce the time in Python while using the original algorithm.
First, put everything inside of a function, call it main() or something. This decreased the time on my machine in Python from 20.6 seconds to 14.54 seconds. Doing things globally is slower than doing them in a function.
Second, use Psyco, a JIT compiler. This requires adding two lines to the top of the file (and of course having psyco installed):
import psyco
psyco.full()
This brought the final time to 2.77 seconds.
One last note. I decided for kicks to use Cython on this and got the time down to 0.8533. However, knowing how to make the few changes to make it fast Cython code isn't something that I recommend for the casual user.

Yes, Python is slow, about a hundred times slower than C. You can use xrange instead of range for a small speedup, but other than that it's fine.
Ultimately what you're doing wrong is that you do this in plain Python, instead of using optimized libraries such as Numpy or Psyco.
Java comes with a jit compiler that makes a big difference where you're just crunching numbers.

You can make your Python about twice as fast by replacing that complicated test with
if i % k == 0: isp = False
You can also make it about eight times faster (for n=10000) than that by adding a break after that isp = False.
Also, do yourself a favor and skip the even numbers (adding one to nps to start to include 2).
Finally, you only need k to go up to sqrt(i).
Of course, if you make the same changes in the Java, it's still about 10x faster than the optimized Python.

Boy, when you said it was a naive implementation, you sure weren't joking!
But yes, a one to two order of magnitude difference in performance is not unexpected when comparing JIT-compiled, optimized machine code with an interpreted language. An alternative Python implementation such as Jython, which runs on the Java VM, may well be faster for this task; you could give it a whirl. Cython, which allows you to add static typing to Python and get C-like performance in some cases, may be worth investigating as well.
Even when considering the standard Python interpreter, CPython, though, the question is: is Python fast enough for the task at hand? Will the time you save writing the code in a dynamic language like Python make up for the extra time spent running it? If you had to write a given program in Java, would it seem like too much work to be worth the trouble?
Consider, for example, that a Python program running on a modern computer will be about as fast as a Java program running on a 10-year-old computer. The computer you had ten years ago was fast enough for many things, wasn't it?
Python does have a number of features that make it great for numerical work. These include an integer type that supports an unlimited number of digits, a decimal type with unlimited precision, and an optional library called NumPy specifically for calculations. Speed of execution, however, is not generally one of its major claims to fame. Where it excels is in getting the computer to do what you want with minimal cognitive friction.

If you're looking to do it fast, Python probably isn't the way forward, but you could speed it up a bit. First, you're using quite a slow way to test for divisibility. Modulo is quicker. You can also stop the inner loop (with k) as soon as it detects a match. I'd do something like this:
nps = 0
for i in range(1, n+1):
if all(i % k for k in range(2, i)): # i.e. if divisible by none of them
nps += 1
That brings it down from 25 s to 1.5 s for me. Using xrange brings it down to 0.9 s.
You could speed it up further by keeping a list of primes you've already found, and only testing those, rather than every number up to i (if i isn't divisible by 2, it won't be divisible by 4, 6, 8...).

Why don't you post something about the memory usage - and not just the speed? Trying to get a simple servlet on tomcat is wasting 3GB on my server.
What you did with the examples up there is not very good. You need to use numpy. Replace for/range with while loops, thus avoiding the list creation.
At last, python is quite suitable for number crunching, at least by people that do it the right way, and know what Sieve of Eratosthenes is, or mod operation is.

There are lots of things you can do to this algorithm to speed it up, but most of them would also speed up the Java version as well. Some of those will speed up the Python more than the Java, so they're worth testing.
Here's just a couple of changes that speed it up from 11.4 to 2.8 seconds on my system:
nps = 0
for i in range(1,n+1):
isp = True
for k in range(2,i):
isp = isp and (i % k != 0)
if isp: nps = nps + 1
print nps

Python is a language which, ironically, is well-suited for developing algorithms. Even a modified algorithm like this:
# See Thomas K for use of all(), many posters for sqrt optimization
nps = 0
for i in xrange(1, n+1):
if all(i % k for k in xrange(2, 1 + int(i ** 0.5))):
nps += 1
runs in significantly under one second. Code like this:
def eras(n):
last = n + 1
sieve = [0,0] + range(2, last)
sqn = int(round(n ** 0.5))
it = (i for i in xrange(2, sqn + 1) if sieve[i])
for i in it:
sieve[i*i:last:i] = [0] * (n//i - i + 1)
return filter(None, sieve)
is faster still. Or try out these.
The thing is, python is usually fast enough for designing your solution. If it is not fast enough for production, use numpy or Jython to goose more performance out of it. Or move it to a compiled language, taking your algorithm observations learned in python with you.

Yes, Python is one of the slowest practical languages you'll encounter. While loops are marginally faster than for i in xrange(), but ultimately Python will always be much, much slower than anything else.
Python has its place: Prototyping theory and ideas, or in any situation where the ability to produce code fast is more important than the code's performance.
Python is a scripting language. Not a programming language.

Related

Why is Python recursion so expensive and what can we do about it?

Suppose we want to compute some Fibonacci numbers, modulo 997.
For n=500 in C++ we can run
#include <iostream>
#include <array>
std::array<int, 2> fib(unsigned n) {
if (!n)
return {1, 1};
auto x = fib(n - 1);
return {(x[0] + x[1]) % 997, (x[0] + 2 * x[1]) % 997};
}
int main() {
std::cout << fib(500)[0];
}
and in Python
def fib(n):
if n==1:
return (1, 2)
x=fib(n-1)
return ((x[0]+x[1]) % 997, (x[0]+2*x[1]) % 997)
if __name__=='__main__':
print(fib(500)[0])
Both will find the answer 996 without issues. We are taking modulos to keep the output size reasonable and using pairs to avoid exponential branching.
For n=5000, the C++ code outputs 783, but Python will complain
RecursionError: maximum recursion depth exceeded in comparison
If we add a couple of lines
import sys
def fib(n):
if n==1:
return (1, 2)
x=fib(n-1)
return ((x[0]+x[1]) % 997, (x[0]+2*x[1]) % 997)
if __name__=='__main__':
sys.setrecursionlimit(5000)
print(fib(5000)[0])
then Python too will give the right answer.
For n=50000 C++ finds the answer 151 within milliseconds while Python crashes (at least on my machine).
Why are recursive calls so much cheaper in C++? Can we somehow modify the Python compiler to make it more receptive to recursion?
Of course, one solution is to replace recursion with iteration. For Fibonacci numbers, this is easy to do. However, this will swap the initial and the terminal conditions, and the latter is tricky for many problems (e.g. alpha–beta pruning). So generally, this will require a lot of hard work on the part of the programmer.

The issue is that Python has an internal limit on number of recursive function calls.
That limit is configurable as shown in Quentin Coumes' answer. However, too deep a function chain will result in a stack overflow. This underlying limitation¹ applies to both C++ and Python. This limitation also applies to all function calls, not just recursive ones.
In general: You should not write² algorithms that have recursion depth growth with linear complexity or worse. Logarithmically growing recursion is typically fine. Tail-recursive functions are trivial to re-write as iterations. Other recursions may be converted to iteration using external data structures (usually, a dynamic stack).
A related rule of thumb is that you shouldn't have large objects with automatic storage. This is C++-specific since Python doesn't have the concept of automatic storage.
¹ The underlying limitation is the execution stack size. The default size differs between systems, and different function calls consume different amounts of memory, so the limit isn't specified as a number of calls but in bytes instead. This too is configurable on some systems. I wouldn't typically recommend touching that limit due to portability concerns.
² Exceptions to this rule of thumb are certain functional languages that guarantee tail-recursion elimination - such as Haskell - where that rule can be relaxed in case of recursions that are guaranteed to be eliminated. Python is not such a language, and the function in question isn't tail-recursive. While C++ compilers can perform the elimination as an optimization, it isn't guaranteed, and is typically not optimized in debug builds. Hence, the exception doesn't generally apply to C++ either.
Disclaimer: The following is my hypothesis; I don't actually know their rationale: The Python limit is probably a feature that detects potentially infinite recursions, preventing likely unsafe stack overflow crashes and substituting a more controlled RecursionError.
Why are recursive calls so much cheaper in C++?
C++ is a compiled language. Python is interpreted. (Nearly) everything is cheaper in C++, except the translation from source code to an executable program.

Let me first answer your direct questions:
Why are recursive calls so much cheaper in C++?
Because C++ has no limitation on recursive call depth, except the size of the stack. And being a fully compiled language, loops (including recursion) are much faster in C++ than in Python (the reason why special Python modules like numpy/scipy directly use C routines). Additionally, most C++ implementations use a special feature called tail recursion elimination (see later in this post) and optimize some recursive code into iterative equivalents. This is nice here but not guaranteed by the standard, so other compilations could result in a program crashing miserably - but tail recursion is probably not involved here.
If the recursion is too deep and exhausts the available stack, you will invoke the well-known Undefined Behaviour where anything can happen, from an immediate crash to a program giving wrong results (IMHO the latter is much worse and cannot be detected...)
Can we somehow modify the Python compiler to make it more receptive to recursion?
No. Python implementation explicitly never uses tail recursion elimination. You could increase the recursion limit, but this is almost always a bad idea (see later in this post why).
Now for the true explanation of the underlying rationale.
Deep recursion is evil, full stop. You should never use it. Recursion is a handy tool when you can make sure that the depth will stay in sane limits. Python uses a soft limit to warn the programmer that something is going wrong before crashing the system. On the other hand, optimizing C and C++ compilers often internally change tail recursion into an iterative loop. But relying on it is highly dangerous because a slight change could prevent that optimization and cause an application crash.
As found in this other SO post, common Python implementations do not implement that tail recursion elimination. So you should not use recursion at a 5000 depth but instead use an iterative algorithm.
As your underlying computation will need all Fibonacci numbers up to the specified one, it is not hard to iteratively compute them. Furthermore, it will be much more efficient!

A solution is a trampoline: the recursive function, instead of calling another function, returns a function that makes that call with the appropriate arguments. There's a loop one level higher that calls all those functions in a loop until we have the final result. I'm probably not explaining it very well; you can find resources online that do a better job.
The point is that this converts recursion to iteration. I don't think this is faster, maybe it's even slower, but the recursion depth stays low.
An implementation could look like below. I split the pair x into a and b for clarity. I then converted the recursive function to a version that keeps track of a and b as arguments, making it tail recursive.
def fib_acc(n, a, b):
if n == 1:
return (a, b)
return lambda: fib_acc(n - 1, (a+b) % 997, (a+2*b) % 997)
def fib(n):
x = fib_acc(n, 1, 2)
while callable(x):
x = x()
return x
if __name__=='__main__':
print(fib(50000)[0])

"Both will find the answer 996 without issues"
I do see at least one issue : the answer should be 836, not 996.
It seems that both your functions calculate Fibonacci(2*n) % p, and not Fibonacci(n) % p.
996 is the result of Fibonacci(1000) % 997.
Pick a more efficient algorithm
An inefficient algorithm stays an inefficient algorithm, regardless if it's written in C++ or Python.
In order to compute large Fibonacci numbers, there are much faster methods than simple recursion with O(n) calls (see related article).
For large n, this recursive O(log n) Python function should run in circles around your above C++ code:
from functools import lru_cache
#lru_cache(maxsize=None)
def fibonacci(n, p):
"Calculate Fibonacci(n) modulo p"
if n < 3:
return [0, 1, 1][n]
if n % 2 == 0:
m = n // 2
v1 = fibonacci(m - 1, p)
v2 = fibonacci(m, p)
return (2*v1 + v2) * v2 % p
else:
m = (n + 1) // 2
v1 = fibonacci(m, p) ** 2
v2 = fibonacci(m - 1, p) ** 2
return (v1 + v2) % p
print(fibonacci(500, 997))
#=> 836
print(fibonacci(1000, 997))
#=> 996
Try it online!
It will happily calculate fibonacci(10_000_000_000_000_000, 997).
It's possible to add recursion level as parameter, in order to see how deep the recursion needs to go, and display it with indentation. Here's an example for n=500:
# Recursion tree:
500
249
124
61
30
14
6
2
3
1
2
7
4
15
8
31
16
62
125
63
32
250
Try it online!
Your examples would simply look like very long diagonals:
500
499
498
...
...
1

For Windows executables, the stack size is specified in the header of the executable. For the Windows version of Python 3.7 x64, that size is 0x1E8480 or exactly 2.000.000 bytes.
That version crashes with
Process finished with exit code -1073741571 (0xC00000FD)
and if we look that up we find that it's a Stack Overflow.
What we can see on the (native) stack with a native debugger like WinDbg (enable child process debugging) is
[...]
fa 000000e9`6da1b680 00007fff`fb698a6e python37!PyArg_UnpackStack+0x371
fb 000000e9`6da1b740 00007fff`fb68b841 python37!PyEval_EvalFrameDefault+0x73e
fc 000000e9`6da1b980 00007fff`fb698a6e python37!PyArg_UnpackStack+0x371
fd 000000e9`6da1ba40 00007fff`fb68b841 python37!PyEval_EvalFrameDefault+0x73e
fe 000000e9`6da1bc80 00007fff`fb698a6e python37!PyArg_UnpackStack+0x371
ff 000000e9`6da1bd40 00007fff`fb68b841 python37!PyEval_EvalFrameDefault+0x73e
2:011> ? 000000e9`6da1bd40 - 000000e9`6da1ba40
Evaluate expression: 768 = 00000000`00000300
So Python will use 2 Stack frames per method call and there's an enormous 768 bytes difference in the stack positions.
If you modify that value inside the EXE (make a backup copy) with a hex editor to, let's say 256 MB
you can run the following code
[...]
if __name__=='__main__':
sys.setrecursionlimit(60000)
print(fib(50000)[0])
and it will give 151 as the answer.
In C++, we can also force a Stack Overflow, e.g. by passing 500.000 as the parameter. While debugging, we get
0:000> .exr -1
ExceptionAddress: 00961015 (RecursionCpp!fib+0x00000015)
ExceptionCode: c00000fd (Stack overflow)
[...]
0:000> k
[...]
fc 00604f90 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp # 7]
fd 00604fb0 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp # 7]
fe 00604fd0 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp # 7]
ff 00604ff0 00961045 RecursionCpp!fib+0x45 [C:\...\RecursionCpp.cpp # 7]
0:000> ? 00604ff0 - 00604fd0
Evaluate expression: 32 = 00000020
which is just 1 stack frame per method call and only 32 bytes difference on the stack. Compared to Python, C++ can do 768/32 = 24x more recursions for the same stack size.
My Microsoft compiler created the executable with the default stack size of 1 MB (Release build, 32 Bit):
The 64 bit version has a stack difference of 64 bit (also Release build).
Tools used:
Microsoft WinDbg Preview (free)
Sweetscape 010 Editor (commercial) with the template for PE files

You can increase the recursion limit using:
import sys
sys.setrecursionlimit(new_limit)
But note that this limit exists for a reason and that pure Python is not optimized for recursion (and compute-intensive tasks in general).

An alternative to a trampoline is to use reduce.
if you can change the recursive function to be tail-recursive, you can implement it with reduce, here's a possible implementation.
reduce is internally implemented iteratively, so you get to use your recursive function without blowing up the stack.
def inner_fib(acc, num):
# acc is a list of two values
return [(acc[0]+acc[1]) % 997, (acc[0]+2*acc[1]) % 997]
def fib(n):
return reduce(inner_fib,
range(2, n+1), # start from 2 since n=1 is covered in the base case
[1,2]) # [1,2] is the base case

My python codes in general are very slow, is this normal?

I recently began self-learning python, and have been using this language for an online course in algorithms. For some reason, many of my codes I created for this course are very slow (relatively to C/C++ Matlab codes I have created in the past), and I'm starting to worry that I am not using python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
a=1 + 1
The matlab code takes about 0.3 second, and the python code takes about 7 seconds. Is this normal? My python codes for much complex problems are very slow. For example, as a HW assignment, I'm running depth first search on a graph with about 900000 nodes, and this is taking forever. Thank you.

Performance is not an explicit design goal of Python:
Don’t fret too much about performance--plan to optimize later when
needed.
That's one of the reasons why Python integrated with a lot of high performance calculating backend engines, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way to go foreward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (by using xrange instead of range in Python 2.7) won't get you very dramatic results.
Here is a bit of code that compares different approaches:
Your original list(range())
The suggestes use of xrange()
Leaving the i out
Using numpy to do the addition using numpy array's (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
"for i in list(range(1000000)): a = 1+1",
"for i in xrange(1000000): a = 1+1",
"for _ in xrange(1000000): a = 1+1",
"import numpy; one = numpy.ones(1000000); a = one+one",
"import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;" \
"one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph) ran on Python 2.7.13 on OSX:
The reason that Numpy performs faster than the CUDA solution is that the overhead of using CUDA does not beat the efficiency of Python+Numpy. For larger, floating point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more that 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (Depth-afirst-Search): here is an interesting article on DFS in Python.

Try using xrange instead of range.
The difference between them is that **xrange** generates the values as you use them instead of range, which tries to generate a static list at runtime.

Unfortunately, python's amazing flexibility and ease comes at the cost of being slow. And also, for such large values of iteration, I suggest using itertools module as it has faster caching.
The xrange is a good solution however if you want to iterate over dictionaries and such, it's better to use itertools as in that, you can iterate over any type of sequence object.

Performance issues in simulation with python lists

I'm currently writing a simulation in python 3.4 (miniconda).
The entire simulation is quite fast but the measurement of some simulation data is bloody slow and takes about 35% of the entire simulation time. I hope I can increase the performance of the entire simulation if I could get rid of that bottleneck. I spend quite some time to figure out how to do that but unfortunately with little success. The function MeasureValues is called in every period of the simulation run.
If anybody has an idea how to improve the code, I would be really grateful.
Thank you guys.
def MeasureValues(self, CurrentPeriod):
if CurrentPeriod > self.WarmUp:
self.ValueOne[CurrentPeriod] = self.FixedValueOne if self.Futurevalue[CurrentPeriod + self.Reload] > 0 else 0
self.ValueTwo[CurrentPeriod] = self.VarValueTwo * self.AmountValueTwo[CurrentPeriod]
self.ValueThree[CurrentPeriod] = self.VarValueThree * self.AmountValueThree[CurrentPeriod]
self.SumOfValues[CurrentPeriod] = self.ValueOne[CurrentPeriod] + self.ValueTwo[CurrentPeriod] + self.ValueThree[CurrentPeriod]
self.TotalSumOfValues += self.SumOfValues[CurrentPeriod]
self.HelperSumValueFour += self.ValueFour[CurrentPeriod]
self.HelperSumValueTwo += self.AmountValueTwo[CurrentPeriod]
self.HelperSumValueFive += self.ValueFive[CurrentPeriod]
self.RatioOne[CurrentPeriod] = (1 - (self.HelperSumValueFour / self.HelperSumValueFive )) if self.HelperSumValueFive > 0 else 1
self.RatioTwo[CurrentPeriod] = (1 - (self.HelperSumValueTwo / self.HelperSumValueFive )) if self.HelperSumValueFive > 0 else 1

The code looks basic enough that there aren't any obvious gaps for optimisation without substantial restructuring (which I don't know enough about your overall architecture to suggest).
Try installing cython - I believe nowadays you can install it with pip install cython - then use it to see if you can speed up your code any.

As stated in the comments the function is quite simple, and with the provided code I don't see a way to optimize it directly.
You can try different approaches:
PyPy, this may works depending on your current codebase and external dependecies.
Cython as suggested by holdenweb, but you need to redefine a lot of stuff to be static-typed (using cdef) or nothing will change.
rewrite the function in C as a Python extension, this may take some time, especially if you have no experience in C programming.
The PyPy way seems the most reasonable, if it works you will gain a boost for all the simulations code.

Can Go really be that much faster than Python?

I think I may have implemented this incorrectly because the results do not make sense. I have a Go program that counts to 1000000000:
package main
import (
"fmt"
)
func main() {
for i := 0; i < 1000000000; i++ {}
fmt.Println("Done")
}
It finishes in less than a second. On the other hand I have a Python script:
x = 0
while x < 1000000000:
x+=1
print 'Done'
It finishes in a few minutes.
Why is the Go version so much faster? Are they both counting up to 1000000000 or am I missing something?

One billion is not a very big number. Any reasonably modern machine should be able to do this in a few seconds at most, if it's able to do the work with native types. I verified this by writing an equivalent C program, reading the assembly to make sure that it actually was doing addition, and timing it (it completes in about 1.8 seconds on my machine).
Python, however, doesn't have a concept of natively typed variables (or meaningful type annotations at all), so it has to do hundreds of times as much work in this case. In short, the answer to your headline question is "yes". Go really can be that much faster than Python, even without any kind of compiler trickery like optimizing away a side-effect-free loop.

pypy actually does an impressive job of speeding up this loop
def main():
x = 0
while x < 1000000000:
x+=1
if __name__ == "__main__":
s=time.time()
main()
print time.time() - s
$ python count.py
44.221405983
$ pypy count.py
1.03511095047
~97% speedup!
Clarification for 3 people who didn't "get it". The Python language itself isn't slow. The CPython implementation is a relatively straight forward way of running the code. Pypy is another implementation of the language that does many tricky (especiallt the JIT) things that can make enormous differences. Directly answering the question in the title - Go isn't "that much" faster than Python, Go is that much faster than CPython.
Having said that, the code samples aren't really doing the same thing. Python needs to instantiate 1000000000 of its int objects. Go is just incrementing one memory location.

This scenario will highly favor decent natively-compiled statically-typed languages. Natively compiled statically-typed languages are capable of emitting a very trivial loop of say, 4-6 CPU opcodes that utilizes simple check-condition for termination. This loop has effectively zero branch prediction misses and can be effectively thought of as performing an increment every CPU cycle (this isn't entirely true, but..)
Python implementations have to do significantly more work, primarily due to the dynamic typing. Python must make several different calls (internal and external) just to add two ints together. In Python it must call __add__ (it is effectively i = i.__add__(1), but this syntax will only work in Python 3.x), which in turn has to check the type of the value passed (to make sure it is an int), then it adds the integer values (extracting them from both of the objects), and then the new integer value is wrapped up again in a new object. Finally it re-assigns the new object to the local variable. That's significantly more work than a single opcode to increment, and doesn't even address the loop itself - by comparison, the Go/native version is likely only incrementing a register by side-effect.
Java will fair much better in a trivial benchmark like this and will likely be fairly close to Go; the JIT and static-typing of the counter variable can ensure this (it uses a special integer add JVM instruction). Once again, Python has no such advantage. Now, there are some implementations like PyPy/RPython, which run a static-typing phase and should fare much better than CPython here ..

You've got two things at work here. The first of which is that Go is compiled to machine code and run directly on the CPU while Python is compiled to bytecode run against a (particularly slow) VM.
The second, and more significant, thing impacting performance is that the semantics of the two programs are actually significantly different. The Go version makes a "box" called "x" that holds a number and increments that by 1 on each pass through the program. The Python version actually has to create a new "box" (int object) on each cycle (and, eventually, has to throw them away). We can demonstrate this by modifying your programs slightly:
package main
import (
"fmt"
)
func main() {
for i := 0; i < 10; i++ {
fmt.Printf("%d %p\n", i, &i)
}
}
...and:
x = 0;
while x < 10:
x += 1
print x, id(x)
This is because Go, due to it's C roots, takes a variable name to refer to a place, where Python takes variable names to refer to things. Since an integer is considered a unique, immutable entity in python, we must constantly make new ones. Python should be slower than Go but you've picked a worst-case scenario - in the Benchmarks Game, we see go being, on average, about 25x times faster (100x in the worst case).
You've probably read that, if your Python programs are too slow, you can speed them up by moving things into C. Fortunately, in this case, somebody's already done this for you. If you rewrite your empty loop to use xrange() like so:
for x in xrange(1000000000):
pass
print "Done."
...you'll see it run about twice as fast. If you find loop counters to actually be a major bottleneck in your program, it might be time to investigate a new way of solving the problem.

#troq
I'm a little late to the party but I'd say the answer is yes and no. As #gnibbler pointed out, CPython is slower in the simple implementation but pypy is jit compiled for much faster code when you need it.
If you're doing numeric processing with CPython most will do it with numpy resulting in fast operations on arrays and matrices. Recently I've been doing a lot with numba which allows you to add a simple wrapper to your code. For this one I just added #njit to a function incALot() which runs your code above.
On my machine CPython takes 61 seconds, but with the numba wrapper it takes 7.2 microseconds which will be similar to C and maybe faster than Go. Thats an 8 million times speedup.
So, in Python, if things with numbers seem a bit slow, there are tools to address it - and you still get Python's programmer productivity and the REPL.
def incALot(y):
x = 0
while x < y:
x += 1
#njit('i8(i8)')
def nbIncALot(y):
x = 0
while x < y:
x += 1
return x
size = 1000000000
start = time.time()
incALot(size)
t1 = time.time() - start
start = time.time()
x = nbIncALot(size)
t2 = time.time() - start
print('CPython3 takes %.3fs, Numba takes %.9fs' %(t1, t2))
print('Speedup is: %.1f' % (t1/t2))
print('Just Checking:', x)
CPython3 takes 58.958s, Numba takes 0.000007153s
Speedup is: 8242982.2
Just Checking: 1000000000

Problem is Python is interpreted, GO isn't so there's no real way to bench test speeds. Interpreted languages usually (not always have a vm component) that's where the problem lies, any test you run is being run in interpreted bounds not actual runtime bounds. Go is slightly slower than C in terms of speed and that is mostly due to it using garbage collection instead of manual memory management. That said GO compared to Python is fast because its a compiled language, the only thing lacking in GO is bug testing I stand corrected if I'm wrong.

It is possible that the compiler realized that you didn't use the "i" variable after the loop, so it optimized the final code by removing the loop.
Even if you used it afterwards, the compiler is probably smart enough to substitute the loop with
i = 1000000000;
Hope this helps =)

I'm not familiar with go, but I'd guess that go version ignores the loop since the body of the loop does nothing. On the other hand, in the python version, you are incrementing x in the body of the loop so it's probably actually executing the loop.

Python app to make a random prime number between 10^300 and 10^301

i need to make Python app to make a random prime number between 10^300 and 10^301 , i done it with this but its very slow. any solution?
import random , math
check_prime = 0
print "Please wait ..."
def is_prime(n):
import math
n = abs(n)
i = 2
while i <= math.sqrt(n):
if n % i == 0:
return False
i += 1
return True
while check_prime == 0 :
randomnumber = random.randrange(math.pow(10,300),math.pow(10,301)-1)
if is_prime(randomnumber):
print randomnumber
break

First things first: Don't use math.pow(), as it is just a wrapper for the C floating-point function, and your numbers are WAY too big to be accurately represented as floats. Use Python's exponentiation operator, which is **.
Second: If you are on a platform which has a version of gmpy, use that for your primality testing.
Third: As eumiro notes, you might be dealing with too big of a problem space for there to be any truly fast solution.

If you need something quick and dirty simply use Fermat's little theorem.
def isPrime(p):
if(p==2): return True
if(not(p&1)): return False
return pow(2,p-1,p)==1
Although this works quite well for random numbers this will fail for numbers know as "pseudoprimes". (quite scarce)
But If you need something foolproof and simple enough to implement I'd suggest you
read up about Miller Rabin's Primality Test.
PS: pow(a,b,c) takes O(log(b)) time in python.

Take a look at Miller-Rabin test I am sure you will find some implementations in Python on the internet.

As explained in the comments, you can forget about building a list of all primes between 10^300 and 10^301 and picking one randomly - there's orders of magnitudes too many primes in that range for that to work.
The only practical approach, as far as I can see, is to randomly pick a number from the range, then test if it is prime. Repeat until you find a prime.
For the primality testing, that's a whole subject by itself (libraries [in the book sense] have been written about it). Basically you first do a few quick tests to discard numbers that are obviously not prime, then you bring out the big guns and use one of the common primality tests (Miller-Rabin (iterated), AKS, ...). I don't know enough to recommend one, but that's really a topic for research-level maths, so you should probably head over to https://math.stackexchange.com/ .
See e.g. this question for a starting point and some simple recipes:
How do you determine if a number is prime or composite?
About your code:
The code you post basically does just what I described, however it uses an extremely naive primality test (trial division by all numbers 1...sqrt(n)). That's why it's so slow. If you use a better primatlity test, it will be orders of magnitude faster.

Since you are handling extremely large numbers, you should really build on what clever people already did for you. Given that you need to know for sure your number is a prime (Miller Rabin is out), and you do want a random prime (not of a specific structure), I would suggest AKS. I am not sure how to optimize your numbers under test, but just choosing a random number might be okay given the speed of the primality test.

If you want to generate a large number of primes, or generate them quickly, pure python may not be the way to go - one option would be to use a python openssl wrapper, and use openssl's facility for generating RSA private keys (which are in fact a pair of primes), or some of openssl's other primality-related functions. Another way to acheive speed would be a C extension implementing one of the tests below...
If opessl isn't an option, your two choices (as many of the comments have mentioned) are the Miller-Rabin test and the AKS test. The main differences - AKS is deterministic, and is guaranteed to give no false results; whereas Miller-Rabin is probabalistic, it may occasionally give false positives - but the longer you run it, the lower that probability becomes (the odds are 1/4**k for k rounds of testing). You would think AKS would obviously be the way to go, except that it's much slower - O(log(n)**12) compared to Miller-Rabin's O(k*log(n)**3). (For comparison, the scan test you presented will take O(n**.5), so either of these will be much faster for large numbers).
If it would be useful, I can paste in a Miller-Rabin implementation I have, but it's rather long.

It is probably better to use a different algorithm, but you could optimise this code pretty easily as well.
This function will be twice as fast:
def is_prime(n):
import math
n = abs(n)
if n % 2 == 0:
return False
i = 3
while i <= math.sqrt(n):
if n % i == 0:
return False
i += 2
return True

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.