Most of NumPy's functions enable multithreading by default.
For example, I work on an 8-core Intel CPU workstation. If I run this script
import numpy as np
x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
the Linux top command shows about 800% CPU usage while it runs.
Which means NumPy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate the computation.
However, I found a weird bug. If I run a script
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((10,10)))
df + df
x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
the CPU usage is only 100%!
It means that if you add two pandas DataFrames before running any NumPy function, NumPy's automatic multithreading is gone, without any warning! This is not reasonable at all: why would a pandas DataFrame calculation affect NumPy's threading setting? Is it a bug? How can I work around this?
PS:
I dug further using the Linux perf tool.
Profiling the two scripts shows that both involve libmkl_vml_avx2.so, while the first script additionally involves libiomp5.so, which seems to be related to OpenMP.
And since VML means Intel Vector Math Library, according to the VML doc I guess that at least the VML-provided functions are all automatically multithreaded.
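By the way, a quick way to confirm that NumPy is built against MKL in the first place (an Anaconda/MKL build mentions "mkl" in the output; the exact output varies by installation):
import numpy as np
# prints the BLAS/LAPACK libraries this NumPy was built with
np.show_config()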
Pandas uses numexpr under the hood to calculate some operations, and numexpr sets the maximal number of threads for VML to 1 when it is imported:
# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)
and it gets imported by pandas when df+df is evaluated in expressions.py:
from pandas.core.computation.check import _NUMEXPR_INSTALLED
if _NUMEXPR_INSTALLED:
    import numexpr as ne
However, the Anaconda distribution also uses VML functionality for functions such as sqrt, sin, cos and so on, and once numexpr sets the maximal number of VML threads to 1, the NumPy functions no longer use parallelization.
The problem can be easily seen in gdb (using your slow script):
>>> gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gdb) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1 0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) () from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2 0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1
i.e. we can see that numexpr sets the number of threads to 1, which is later used when the VML sqrt function is called:
(gdb) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1 0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2 0x00007fffedf78563 in vdSqrt () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3 0x00007ffff5ac04ac in trivial_two_operand_loop () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
So we can see that NumPy uses VML's implementation of vdSqrt, which utilizes mkl_vml_serv_threader_d_1i_1o to decide whether the calculation should be done in parallel, and it looks up the number of threads:
(gdb) fin
Run till exit from #0 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1
The register %rax holds the maximal number of threads, and it is 1.
Now we can use numexpr to increase the number of vml-threads, i.e.:
import numpy as np
import numexpr as ne
import pandas as pd
df = pd.DataFrame(np.random.random((10,10)))
df + df

# HERE: reset the number of VML threads
ne.set_vml_num_threads(8)

x = np.random.random(1000000)
for i in range(10000):
    np.sqrt(x)  # now in parallel
Now multiple cores are utilized!
Looking at NumPy, it seems that under the hood it has had on/off issues with multithreading, and depending on which version you are using you may start to see crashes when you bump up ne.set_vml_num_threads().
http://numpy-discussion.10968.n7.nabble.com/ANN-NumExpr-2-7-0-Release-td47414.html
I need to get my head around how this is glued into the Python interpreter, given your code example where it seems to somehow allow multiple apparently synchronous/ordered calls to np.sqrt() to proceed in parallel. I guess if the Python interpreter is always just returning a reference to an object when it pops the stack, and in your example it is just pitching those references without assigning or manipulating them in any way, it would be fine. But if subsequent loop iterations depend on previous ones, then it seems less clear how these could be safely parallelized. Arguably, silent failure / wrong results is an outcome worse than crashes.
I think that your initial premise may be incorrect.
You stated: "Which means NumPy automatically detects that my workstation has 8 cores, and np.sqrt automatically uses all 8 cores to accelerate the computation."
A single function like np.sqrt() cannot guess how it will next be invoked, or return before it has at least partially completed. There are parallelism mechanisms in Python, but none are automatic.
Now, having said that, the Python interpreter may be able to optimize the for loop for parallelism, which may be what you are seeing, but I strongly suspect that if you look at the wall-clock time for this loop to execute, it will be no different regardless of whether you are (apparently) using 8 cores or 1 core.
UPDATE: Having read a bit more of the comments, it seems as though the multi-core behavior you are seeing is related to the Anaconda distribution of the Python interpreter. I took a look but was unable to find any source code for it, but it seems that the Python license permits entities (like anaconda.com) to compile and distribute derivatives of the interpreter without requiring their changes to be published.
I guess you could reach out to the Anaconda folks; the behaviour you are seeing will be difficult to figure out without knowing what, if anything, they've changed in the interpreter.
Also do a quick check of the wall-clock time with and without the optimization to see if it is indeed 8x faster; even if you've really got all 8 cores working instead of 1, it would be good to know whether the results are actually 8x faster or whether there are spinlocks in use that still serialize on a single mutex.
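For reference, a rough sketch of that wall-clock check, reusing the numexpr workaround from the answer above (the loop count is arbitrary; shrink or grow it to taste):
import time
import numpy as np
import numexpr as ne

x = np.random.random(1000000)

# importing numexpr limits VML to 1 thread (see above), so this loop is single-threaded
start = time.perf_counter()
for i in range(1000):
    np.sqrt(x)
print("1 VML thread :", time.perf_counter() - start, "s")

# raise the VML thread count and time the same loop again
ne.set_vml_num_threads(8)
start = time.perf_counter()
for i in range(1000):
    np.sqrt(x)
print("8 VML threads:", time.perf_counter() - start, "s")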
Related
I'm looking at using Go to write a small program that's mostly handling text. I'm pretty sure, based on what I've heard about Go and Python that Go will be substantially faster. I don't actually have a specific need for insane speeds, but I'd like to get to know Go.
The "Go is going to be faster" idea was supported by a trivial test:
# test.py
print("Hello world")
$ time python test.py
Hello world
real 0m0.029s
user 0m0.019s
sys 0m0.010s
// test.go
package main
import "fmt"
func main() {
    fmt.Println("hello world")
}
$ time ./test
hello world
real 0m0.001s
user 0m0.001s
sys 0m0.000s
Looks good in terms of raw startup speed (which is entirely expected). Highly non-scientific justification:
$ strace python test.py 2>&1 | wc -l
1223
$ strace ./test 2>&1 | wc -l
174
However, my next contrived test was how fast is Go when faffing with strings, and I was expecting to be similarly blown away by Go's raw speed. So, this was surprising:
# test2.py
s = ""
for i in range(1000000):
    s += "a"
$ time python test2.py
real 0m0.179s
user 0m0.145s
sys 0m0.013s
// test2.go
package main
func main() {
    s := ""
    for i := 0; i < 1000000; i++ {
        s += "a"
    }
}
$ time ./test2
real 0m56.840s
user 1m50.836s
sys 0m17.653s
So Go is hundreds of times slower than Python.
Now, I know this is probably due to Schlemiel the Painter's algorithm, which explains why the Go implementation is quadratic in the number of iterations (10 times more iterations leads to a 100 times slowdown).
However, the Python implementation seems much faster: 10 times more loops only slows it down by a factor of two. The same effect persists if you concatenate str(i), so I doubt there's some kind of magical JIT optimization to s = 100000 * 'a' going on. And it's not much slower if I print(s) at the end, so the variable isn't being optimised out.
Naivety of the concatenation methods aside (there are surely more idiomatic ways in each language), is there something here that I have misunderstood, or is it simply easier in Go than in Python to run into cases where you have to deal with C/C++-style algorithmic issues when handling strings (in which case a straight Go port might not be as uh-may-zing as I might hope without having to, ya'know, think about things and do my homework)?
Or have I run into a case where Python happens to work well, but falls apart under more complex use?
Versions used: Python 3.8.2, Go 1.14.2
TL;DR summary: basically you're testing the two implementations' allocators / garbage collectors and heavily weighting the scale on the Python side (by chance, as it were, but this is something the Python folks optimized at some point).
To expand my comments into a real answer:
Both Go and Python have counted strings, i.e., strings are implemented as a two-element header thingy containing a length (byte count or, for Python 3 strings, Unicode character count) and a data pointer.
Both Go and Python are garbage-collected (GCed) languages. That is, in both languages, you can allocate memory without having to worry about freeing it yourself: the system takes care of that automatically.
But the underlying implementations differ quite a bit in one particularly important way here: the version of Python you are using has a reference-counting GC. The Go system you are using does not.
With a reference count, the inner bits of the Python string handler can do this. I'll express it as Go (or at least pseudo-Go) although the actual Python implementation is in C and I have not made all the details line up properly:
// add (append) new string t to existing string s
func add_to_string(s, t string_header) string_header {
    need := s.len + t.len
    if s.refcount == 1 { // can modify string in-place
        data := s.data
        if cap(data) >= need {
            copy_into(data + s.len, t.data, t.len)
            return s
        }
    }
    // s is shared or s.cap < need
    new_s := make_new_string(roundup(need))
    // important: new_s has extra space for the next call to add_to_string
    copy_into(new_s.data, s.data, s.len)
    copy_into(new_s.data + s.len, t.data, t.len)
    s.refcount--
    if s.refcount == 0 {
        gc_release_string(s)
    }
    return new_s
}
By over-allocating—rounding up the need value so that cap(new_s) is large—we get about log2(n) calls to the allocator, where n is the number of times you do s += "a". With n being 1000000 (one million), that's about 20 times that we actually have to invoke the make_new_string function and release (for gc purposes because the collector uses refcounts as a first pass) the old string s.
[Edit: your source archaeology led to commit 2c9c7a5f33d, which suggests less than doubling but still a multiplicative increase. To other readers, see comment.]
The current Go implementation allocates strings without a separate capacity header field (see reflect.StringHeader and note the big caveat that says "don't depend on this, it might be different in future implementations"). Between the lack of a refcount—we can't tell in the runtime routine that adds two strings, that the target has only one reference—and the inability to observe the equivalent of cap(s) (or cap(s.data)), the Go runtime has to create a new string every time. That's one million memory allocations.
To show that the Python code really does use the refcount, take your original Python:
s = ""
for i in range(1000000):
s += "a"
and add a second variable t like this:
s = ""
t = s
for i in range(1000000):
s += "a"
t = s
The difference in execution time is impressive:
$ time python test2.py
0.68 real 0.65 user 0.03 sys
$ time python test3.py
34.60 real 34.08 user 0.51 sys
The modified Python program still beats Go (1.13.5) on this same system:
$ time ./test2
67.32 real 103.27 user 13.60 sys
and I have not poked any further into the details, but I suspect the Go GC is running more aggressively than the Python one. The Go GC is very different internally, requiring write barriers and occasional "stop the world" behavior (of all goroutines that are not doing the GC work). The refcounting nature of the Python GC allows it to never stop: even with a refcount of 2, the refcount on t drops to 1 and then the next assignment to t drops it to zero, releasing the memory block for re-use in the next trip through the main loop. So it's probably picking up the same memory block over and over again.
(If my memory is correct, Python's "over-allocate strings and check the refcount to allow expand-in-place" trick was not in all versions of Python. It may have first been added around Python 2.4 or so. This memory is extremely vague and a quick Google search did not turn up any evidence one way or the other. [Edit: Python 2.7.4, apparently.])
Well. You should never, ever use string concatenation in this way :-)
In Go, try strings.Builder:
package main
import (
    "strings"
)
func main() {
    var b1 strings.Builder
    for i := 0; i < 1000000; i++ {
        b1.WriteString("a")
    }
}
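(Call b1.String() afterwards if you actually need the result.) For comparison, the idiomatic Python counterpart also avoids repeated concatenation; a minimal sketch:
# simplest for a repeated single character
s = "a" * 1000000
# or, for the general case of accumulating arbitrary pieces
parts = []
for i in range(1000000):
    parts.append("a")
s = "".join(parts)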
I was randomly comparing the computation times of an explicit for-loop with vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. For-loop took about 646ms while the np.exp() function computed the same result in less than 20ms.
import time
import math
import numpy as np

iter = 1000000
x = np.zeros((iter, 1))
v = np.random.randn(iter, 1)

before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")

before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using an explicit for-loop and thus the computation time is justified.
What is happening "behind the scenes" in the second case?
How can one implement such computations (second case) without using numpy (in plain Python)?
What is happening is that NumPy is calling high quality numerical libraries (BLAS for instance) which are very good at vector arithmetic.
I imagine you could specifically call the exact libraries used by NumPy, however, NumPy would likely know best which to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) assembly instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy on the other hand has functions built in C, and C is a language capable of running SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
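To answer the third question directly: without NumPy you are stuck looping in the interpreter, e.g. with math.exp in a list comprehension. A minimal sketch (assuming v is a plain Python list; it gives the same numbers, but stays roughly as slow as the explicit for-loop because the per-element work is still done in Python bytecode):
import math

v = [0.5, -1.2, 2.0]                  # plain Python list instead of a NumPy array
x = [math.exp(value) for value in v]  # same result, but no SIMD / C loop underneath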
I recently began self-learning Python, and have been using this language for an online course in algorithms. For some reason, many of the programs I created for this course are very slow (relative to the C/C++/Matlab code I have written in the past), and I'm starting to worry that I am not using Python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
    a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
    a = 1 + 1
The Matlab code takes about 0.3 seconds, and the Python code takes about 7 seconds. Is this normal? My Python code for more complex problems is very slow. For example, as a HW assignment, I'm running depth-first search on a graph with about 900000 nodes, and this is taking forever. Thank you.
Performance is not an explicit design goal of Python:
Don’t fret too much about performance--plan to optimize later when needed.
That's one of the reasons why Python integrates with a lot of high-performance calculation backends, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way to go forward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (e.g. by using xrange instead of range in Python 2.7) won't get you very dramatic results.
Here is a bit of code that compares different approaches:
Your original list(range())
The suggested use of xrange()
Leaving the i out
Using numpy to do the addition using numpy arrays (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
    "for i in list(range(1000000)): a = 1+1",
    "for i in xrange(1000000): a = 1+1",
    "for _ in xrange(1000000): a = 1+1",
    "import numpy; one = numpy.ones(1000000); a = one+one",
    "import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;"
    "one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph) ran on Python 2.7.13 on OSX:
The reason that Numpy performs faster than the CUDA solution here is that the overhead of using CUDA outweighs its benefit over Python+Numpy for a calculation this small. For larger, floating-point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more than 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (depth-first search): here is an interesting article on DFS in Python.
Try using xrange instead of range.
The difference between them is that xrange generates the values lazily as you use them, whereas range builds the whole list up front at runtime.
Unfortunately, Python's amazing flexibility and ease come at the cost of being slow. Also, for such large iteration counts, I suggest using the itertools module, since its looping constructs are implemented in C and run faster.
xrange is a good solution; however, if you want to iterate over dictionaries and the like, it's better to use itertools, since with it you can iterate over any type of sequence object.
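One concrete way to apply the itertools suggestion to a bare counting loop: itertools.repeat avoids creating a new integer object on every iteration, which is usually a bit faster than range/xrange (a sketch; works on both Python 2 and 3):
from itertools import repeat

a = 0
for _ in repeat(None, 100000000):  # loop 100,000,000 times without building a list
    a = 1 + 1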
Per this question and answer -- Python multiprocessing.cpu_count() returns '1' on 4-core Nvidia Jetson TK1 -- the output of Python's multiprocessing.cpu_count() function on certain systems reflects the number of CPUs actively in use, as opposed to the number of CPUs actually usable by the calling Python program.
A common Python idiom is to use the return-value of cpu_count() to initialize the number of processes in a Pool. However, on systems that use such a "dynamic CPU activation" strategy, that idiom breaks rather badly (at least on a relatively quiescent system).
Is there some straightforward (and portable) way to get at the number of usable processors (as opposed the number currently in use) from Python?
Notes:
This question is not answered by the accepted answer to How to find out the number of CPUs using python, since as noted in the question linked at the top of this question, printing the contents of /proc/self/status shows all 4 cores as being available to the program.
To my mind, "portable" excludes any approach that involves parsing the contents of /proc/self/status, whose format may vary from release to release of Linux, and which doesn`t even exist on OS X. (The same goes for any other pseudo-file, as well.)
I don't think you will get any truly portable answers, so I will give a correct one.
The correct* answer for Linux is len(os.sched_getaffinity(pid)), where pid may be 0 for the current process. This function is exposed in Python 3.3 and later; if you need it in earlier versions, you'll have to do some fancy cffi coding.
Edit: you might try to see if you can use the function int omp_get_num_procs(); if it exists, it is the only meaningful answer I found on this question, but I haven't tried it from Python.
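So the Pool idiom from the question would become something like the following on Linux (a sketch; it falls back to cpu_count() on platforms without sched_getaffinity):
import os
from multiprocessing import Pool

try:
    n_procs = len(os.sched_getaffinity(0))  # CPUs this process may actually run on
except AttributeError:
    n_procs = os.cpu_count()                 # fallback where sched_getaffinity is missing

with Pool(processes=n_procs) as pool:
    print(pool.map(abs, range(-5, 5)))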
Use psutil:
from the doc https://psutil.readthedocs.io/en/latest/:
>>> import psutil
>>> psutil.cpu_count()
4
>>> psutil.cpu_count(logical=False) # Ignoring virtual cores
2
This is portable
Here's an approach that gets the number of available CPU cores for the current process on systems that implement sched_getaffinity, and Windows:
import ctypes
import ctypes.wintypes
import os
from platform import system

def num_available_cores() -> int:
    if hasattr(os, 'sched_getaffinity'):
        return len(os.sched_getaffinity(0))
    elif system() == 'Windows':
        kernel32 = ctypes.WinDLL('kernel32')
        DWORD_PTR = ctypes.wintypes.WPARAM
        PDWORD_PTR = ctypes.POINTER(DWORD_PTR)
        GetCurrentProcess = kernel32.GetCurrentProcess
        GetCurrentProcess.restype = ctypes.wintypes.HANDLE
        GetProcessAffinityMask = kernel32.GetProcessAffinityMask
        GetProcessAffinityMask.argtypes = (ctypes.wintypes.HANDLE, PDWORD_PTR, PDWORD_PTR)
        mask = DWORD_PTR()
        if not GetProcessAffinityMask(GetCurrentProcess(), ctypes.byref(mask), ctypes.byref(DWORD_PTR())):
            raise Exception("Call to 'GetProcessAffinityMask' failed")
        return bin(mask.value).count('1')
    else:
        raise Exception('Cannot determine the number of available cores')
On Linux and any other systems that implement sched_getaffinity, we use Python's built-in wrapper for it.
On Windows we use ctypes to call GetProcessAffinityMask.
As far as I know there are no user APIs or tools to get/set the CPU affinity on macOS. In most cases os.cpu_count() will work fine, but if you truly need the number of available cores you may be out of luck.
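A quick sanity check on Linux is to restrict the affinity from the shell and compare what the two calls report; on a 4-core machine you would expect something like this (numbers illustrative):
$ taskset -c 0,1 python -c "import os; print(len(os.sched_getaffinity(0)))"
2
$ taskset -c 0,1 python -c "import os; print(os.cpu_count())"
4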
I think I may have implemented this incorrectly because the results do not make sense. I have a Go program that counts to 1000000000:
package main
import (
    "fmt"
)
func main() {
    for i := 0; i < 1000000000; i++ {}
    fmt.Println("Done")
}
It finishes in less than a second. On the other hand I have a Python script:
x = 0
while x < 1000000000:
    x += 1
print 'Done'
It finishes in a few minutes.
Why is the Go version so much faster? Are they both counting up to 1000000000 or am I missing something?
One billion is not a very big number. Any reasonably modern machine should be able to do this in a few seconds at most, if it's able to do the work with native types. I verified this by writing an equivalent C program, reading the assembly to make sure that it actually was doing addition, and timing it (it completes in about 1.8 seconds on my machine).
Python, however, doesn't have a concept of natively typed variables (or meaningful type annotations at all), so it has to do hundreds of times as much work in this case. In short, the answer to your headline question is "yes". Go really can be that much faster than Python, even without any kind of compiler trickery like optimizing away a side-effect-free loop.
pypy actually does an impressive job of speeding up this loop
import time

def main():
    x = 0
    while x < 1000000000:
        x += 1

if __name__ == "__main__":
    s = time.time()
    main()
    print time.time() - s
$ python count.py
44.221405983
$ pypy count.py
1.03511095047
~97% speedup!
Clarification for the 3 people who didn't "get it": the Python language itself isn't slow. The CPython implementation is a relatively straightforward way of running the code. PyPy is another implementation of the language that does many tricky things (especially the JIT) that can make enormous differences. Directly answering the question in the title: Go isn't "that much" faster than Python; Go is that much faster than CPython.
Having said that, the code samples aren't really doing the same thing. Python needs to instantiate 1000000000 of its int objects. Go is just incrementing one memory location.
This scenario will highly favor decent natively compiled, statically typed languages. Natively compiled, statically typed languages are capable of emitting a very trivial loop of, say, 4-6 CPU opcodes that uses a simple check condition for termination. This loop has effectively zero branch-prediction misses and can effectively be thought of as performing an increment every CPU cycle (this isn't entirely true, but...).
Python implementations have to do significantly more work, primarily due to the dynamic typing. Python must make several different calls (internal and external) just to add two ints together. In Python it must call __add__ (it is effectively i = i.__add__(1), but this syntax will only work in Python 3.x), which in turn has to check the type of the value passed (to make sure it is an int), then it adds the integer values (extracting them from both of the objects), and then the new integer value is wrapped up again in a new object. Finally it re-assigns the new object to the local variable. That's significantly more work than a single opcode to increment, and doesn't even address the loop itself - by comparison, the Go/native version is likely only incrementing a register by side-effect.
Java will fare much better in a trivial benchmark like this and will likely be fairly close to Go; the JIT and static typing of the counter variable can ensure this (it uses a special integer-add JVM instruction). Once again, Python has no such advantage. Now, there are some implementations like PyPy/RPython, which run a static-typing phase and should fare much better than CPython here.
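If you want to see the per-iteration work CPython does in that loop, the dis module will dump the bytecode the interpreter dispatches on every pass (a small illustrative snippet; the exact opcodes vary between CPython versions):
import dis

def count_up(n):
    x = 0
    while x < n:
        x += 1
    return x

dis.dis(count_up)  # every opcode inside the loop body is dispatched a billion times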
You've got two things at work here. The first of which is that Go is compiled to machine code and run directly on the CPU while Python is compiled to bytecode run against a (particularly slow) VM.
The second, and more significant, thing impacting performance is that the semantics of the two programs are actually significantly different. The Go version makes a "box" called "x" that holds a number and increments that by 1 on each pass through the program. The Python version actually has to create a new "box" (int object) on each cycle (and, eventually, has to throw them away). We can demonstrate this by modifying your programs slightly:
package main
import (
"fmt"
)
func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf("%d %p\n", i, &i)
    }
}
...and:
x = 0
while x < 10:
    x += 1
    print x, id(x)
This is because Go, due to its C roots, takes a variable name to refer to a place, whereas Python takes variable names to refer to things. Since an integer is considered a unique, immutable entity in Python, we must constantly make new ones. Python should be slower than Go, but you've picked a worst-case scenario; in the Benchmarks Game, we see Go being, on average, about 25x faster (100x in the worst case).
You've probably read that, if your Python programs are too slow, you can speed them up by moving things into C. Fortunately, in this case, somebody's already done this for you. If you rewrite your empty loop to use xrange() like so:
for x in xrange(1000000000):
    pass
print "Done."
...you'll see it run about twice as fast. If you find loop counters to actually be a major bottleneck in your program, it might be time to investigate a new way of solving the problem.
@troq
I'm a little late to the party but I'd say the answer is yes and no. As @gnibbler pointed out, CPython is slower in the simple implementation, but PyPy is JIT-compiled for much faster code when you need it.
If you're doing numeric processing with CPython, most people do it with numpy, resulting in fast operations on arrays and matrices. Recently I've been doing a lot with numba, which allows you to add a simple wrapper to your code. For this one I just added @njit to a function incALot() which runs your code above.
On my machine CPython takes 61 seconds, but with the numba wrapper it takes 7.2 microseconds, which will be similar to C and maybe faster than Go. That's an 8-million-times speedup.
So, in Python, if things with numbers seem a bit slow, there are tools to address it - and you still get Python's programmer productivity and the REPL.
import time
from numba import njit

def incALot(y):
    x = 0
    while x < y:
        x += 1

@njit('i8(i8)')
def nbIncALot(y):
    x = 0
    while x < y:
        x += 1
    return x

size = 1000000000

start = time.time()
incALot(size)
t1 = time.time() - start

start = time.time()
x = nbIncALot(size)
t2 = time.time() - start

print('CPython3 takes %.3fs, Numba takes %.9fs' % (t1, t2))
print('Speedup is: %.1f' % (t1/t2))
print('Just Checking:', x)
CPython3 takes 58.958s, Numba takes 0.000007153s
Speedup is: 8242982.2
Just Checking: 1000000000
The problem is that Python is interpreted and Go isn't, so there's no real way to benchmark the speeds against each other. Interpreted languages usually (though not always) have a VM component, and that's where the problem lies: any test you run is being run within the interpreter's bounds, not actual runtime bounds. Go is slightly slower than C in terms of speed, and that is mostly due to it using garbage collection instead of manual memory management. That said, Go compared to Python is fast because it's a compiled language; the only thing lacking in Go is bug testing. I stand corrected if I'm wrong.
It is possible that the compiler realized that you didn't use the "i" variable after the loop, so it optimized the final code by removing the loop.
Even if you used it afterwards, the compiler is probably smart enough to substitute the loop with
i = 1000000000;
Hope this helps =)
I'm not familiar with Go, but I'd guess that the Go version ignores the loop since the body of the loop does nothing. On the other hand, in the Python version you are incrementing x in the body of the loop, so it's probably actually executing the loop.