I'm looking for a single-pass algorithm for finding the top X percent of floats in a stream where I do not know the total number ahead of time ... but it's on the order of 5-30 million floats. It needs to be single-pass since the data is generated on the fly and I cannot recreate the exact stream a second time.
The algorithm I have so far is to keep a sorted list of the top X items that I've seen so far. As the stream continues I enlarge the list as needed, and I use bisect_left to find the insertion point for new values.
Here is my current implementation:
from bisect import bisect_left
from random import uniform
from itertools import islice

def data_gen(num):
    for _ in xrange(num):
        yield uniform(0, 1)

def get_top_X_percent(iterable, percent=0.01, min_guess=1000):
    top_nums = sorted(list(islice(iterable, int(percent * min_guess))))  # get an initial guess
    for ind, val in enumerate(iterable, len(top_nums)):
        if int(percent * ind) > len(top_nums):
            top_nums.insert(0, None)
        newind = bisect_left(top_nums, val)
        if newind > 0:
            top_nums.insert(newind, val)
            top_nums.pop(0)
    return top_nums

if __name__ == '__main__':
    num = 1000000
    all_data = sorted(data_gen(num))
    result = get_top_X_percent(all_data)
    assert result[0] == all_data[-int(num*0.01)], 'Too far off, lowest num:%f' % result[0]
    print result[0]
In the real case the data does not come from any standard distribution (otherwise I could use some statistics knowledge).
Any suggestions would be appreciated.
I'm not sure there's any way to actually do that reliably, as the range denoted by the "top X percent" can grow unpredictably as you see more elements. Consider the following input:
101 102 103 104 105 106 107 108 109 110 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
If you wanted the top 25% of elements, you'd end up keeping only 109 and 110 out of the first ten elements, but after seeing enough zeroes you'd eventually have to end up selecting all of the first ten. This same pattern can be expanded to any sufficiently large stream -- it's always possible to be misled by what you've seen so far and discard elements that you actually should have kept. As such, unless you know the exact length of the stream ahead of time, I don't think this is possible (short of keeping every element in memory until you hit the end of the stream).
You must store the entire stream in memory.
Proof: You have a sequence of numbers n1,…,nk. The value of k is unknown. How do you know when ni can be forgotten? Only once you have seen x*k/100 numbers greater than ni -- but since k is unknown, you can never know this.
So any "one-pass" algorithm must store the entire sequence in memory.
As the other answers have discussed, you can't really do a whole lot better than just storing the entire stream in memory. Consider doing it that way, especially since 5-30 million floats is probably only 40-240 MB of memory (8 bytes per double), which is manageable.
Given that you store the entire stream, the algorithmically fastest way to get the top X percent is to first find the cutoff element (the smallest element that is in the top X percent) using a linear-time selection algorithm:
http://en.wikipedia.org/wiki/Selection_algorithm
Then, make another pass through the stream and filter out all the elements smaller than the cutoff element.
This method is linear time and linear space, which is the best you can hope for.
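As a concrete, hedged sketch of that two-pass approach (not from the answer itself), assuming the stream has been stored in a NumPy array: numpy.partition uses introselect, which gives linear average-time selection, and a boolean mask does the filtering pass. Note that ties at the cutoff may return slightly more than X percent of the items.

import numpy as np

def top_x_percent(values, percent=0.01):
    data = np.asarray(values, dtype=np.float64)
    k = max(1, int(len(data) * percent))                        # how many items to keep
    cutoff = np.partition(data, len(data) - k)[len(data) - k]   # k-th largest value
    return data[data >= cutoff]                                 # second pass: filter by cutoff

# e.g. top 1% of 5 million uniform floats:
# result = top_x_percent(np.random.uniform(size=5 * 10**6), percent=0.01)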
Below is the output from running this with cProfile. Your code itself seems fine, as most calls take 0.000 per call; it is slow simply because you have many items to process. If you wanted to optimize, you'd have to find a way to pop fewer items: pop has been called 999,999 times and accounts for most of the runtime, which seems unnecessary (see the heap-based sketch after the profile output below).
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 __future__.py:48(<module>)
1 0.000 0.000 0.000 0.000 __future__.py:74(_Feature)
7 0.000 0.000 0.000 0.000 __future__.py:75(__init__)
1 0.001 0.001 0.001 0.001 bisect.py:1(<module>)
1 0.001 0.001 0.001 0.001 hashlib.py:55(<module>)
6 0.000 0.000 0.000 0.000 hashlib.py:91(__get_openssl_constructor)
1 0.000 0.000 0.000 0.000 os.py:743(urandom)
1 0.000 0.000 0.000 0.000 random.py:100(seed)
1000000 0.731 0.000 0.876 0.000 random.py:355(uniform)
1 0.003 0.003 0.004 0.004 random.py:40(<module>)
1 0.000 0.000 0.000 0.000 random.py:647(WichmannHill)
1 0.000 0.000 0.000 0.000 random.py:72(Random)
1 0.000 0.000 0.000 0.000 random.py:797(SystemRandom)
1 0.000 0.000 0.000 0.000 random.py:91(__init__)
1 2.498 2.498 13.313 13.313 test.py:12(get_top_X_percent)
1 0.006 0.006 16.330 16.330 test.py:3(<module>)
1000001 0.545 0.000 1.422 0.000 test.py:8(data_gen)
1000000 1.744 0.000 1.744 0.000 {_bisect.bisect_left}
1 0.000 0.000 0.000 0.000 {_hashlib.openssl_md5}
1 0.000 0.000 0.000 0.000 {_hashlib.openssl_sha1}
1 0.000 0.000 0.000 0.000 {_hashlib.openssl_sha224}
1 0.000 0.000 0.000 0.000 {_hashlib.openssl_sha256}
1 0.000 0.000 0.000 0.000 {_hashlib.openssl_sha384}
1 0.000 0.000 0.000 0.000 {_hashlib.openssl_sha512}
1 0.000 0.000 0.000 0.000 {binascii.hexlify}
1 0.000 0.000 0.000 0.000 {function seed at 0x100684a28}
6 0.000 0.000 0.000 0.000 {getattr}
6 0.000 0.000 0.000 0.000 {globals}
1000004 0.125 0.000 0.125 0.000 {len}
1 0.000 0.000 0.000 0.000 {math.exp}
2 0.000 0.000 0.000 0.000 {math.log}
1 0.000 0.000 0.000 0.000 {math.sqrt}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1009989 0.469 0.000 0.469 0.000 {method 'insert' of 'list' objects}
999999 8.477 0.000 8.477 0.000 {method 'pop' of 'list' objects}
1000000 0.146 0.000 0.146 0.000 {method 'random' of '_random.Random' objects}
1 0.000 0.000 0.000 0.000 {posix.close}
1 0.000 0.000 0.000 0.000 {posix.open}
1 0.000 0.000 0.000 0.000 {posix.read}
2 1.585 0.792 3.006 1.503 {sorted}
BTW, you can use cProfile with:
python -m cProfile test.py
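One way to act on the pop observation above (a hedged sketch, not part of the original code): keep the running top-X set in a min-heap, so the smallest retained value is replaced in O(log n) instead of paying for list.insert plus list.pop(0), both of which are O(n) on a Python list. It keeps the same grow-as-you-go behaviour, and therefore the same approximation caveat discussed in the other answers.

import heapq

def get_top_x_percent_heap(iterable, percent=0.01):
    heap = []                                   # heap[0] is the smallest kept value
    for ind, val in enumerate(iterable, 1):
        target = max(1, int(percent * ind))     # how many values to keep so far
        if len(heap) < target:
            heapq.heappush(heap, val)           # still growing the kept set
        elif val > heap[0]:
            heapq.heapreplace(heap, val)        # evict the current minimum, keep val
    return sorted(heap)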
Related
I tried to reproduce the functionality of IPython's %timeit, but for some strange reason the timing results for one function are wildly different.
IPython:
In [11]: from random import shuffle
   ....: import numpy as np
   ....: def numpy_seq_el_rank(seq, el):
   ....:     return sum(seq < el)
   ....:
   ....: seq = np.array(xrange(10000))
   ....: shuffle(seq)
   ....:
In [12]: %timeit numpy_seq_el_rank(seq, 10000//2)
10000 loops, best of 3: 46.1 µs per loop
Python:
from timeit import timeit, repeat

def my_timeit(code, setup, rep, loops):
    result = repeat(code, setup=setup, repeat=rep, number=loops)
    return '%d loops, best of %d: %0.9f sec per loop' % (loops, rep, min(result))

np_setup = '''
from random import shuffle
import numpy as np
def numpy_seq_el_rank(seq, el):
    return sum(seq < el)
seq = np.array(xrange(10000))
shuffle(seq)
'''

np_code = 'numpy_seq_el_rank(seq, 10000//2)'
print 'Numpy seq_el_rank:\n\t%s' % my_timeit(code=np_code, setup=np_setup, rep=3, loops=100)
And its output:
Numpy seq_el_rank:
100 loops, best of 3: 1.655324947 sec per loop
As you can see, in plain Python I ran only 100 loops instead of the 10,000 in IPython, because it takes such a long time, and the per-loop result comes out roughly 35,000 times slower. Can anybody explain why the result in plain Python is so slow?
UPD:
Here is cProfile.run('my_timeit(code=np_code, setup=np_setup, rep=3, loops=10000)') output:
30650 function calls in 4.987 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.987 4.987 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 <timeit-src>:2(<module>)
3 0.001 0.000 4.985 1.662 <timeit-src>:2(inner)
300 0.006 0.000 4.961 0.017 <timeit-src>:7(numpy_seq_el_rank)
1 0.000 0.000 4.987 4.987 Lab10.py:47(my_timeit)
3 0.019 0.006 0.021 0.007 random.py:277(shuffle)
1 0.000 0.000 0.002 0.002 timeit.py:121(__init__)
3 0.000 0.000 4.985 1.662 timeit.py:185(timeit)
1 0.000 0.000 4.985 4.985 timeit.py:208(repeat)
1 0.000 0.000 4.987 4.987 timeit.py:239(repeat)
2 0.000 0.000 0.000 0.000 timeit.py:90(reindent)
3 0.002 0.001 0.002 0.001 {compile}
3 0.000 0.000 0.000 0.000 {gc.disable}
3 0.000 0.000 0.000 0.000 {gc.enable}
3 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
3 0.000 0.000 0.000 0.000 {isinstance}
3 0.000 0.000 0.000 0.000 {len}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
29997 0.001 0.000 0.001 0.000 {method 'random' of '_random.Random' objects}
2 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {min}
3 0.003 0.001 0.003 0.001 {numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {range}
300 4.955 0.017 4.955 0.017 {sum}
6 0.000 0.000 0.000 0.000 {time.clock}
Well, one issue is that you're misreading the results. IPython is telling you how long each of the 10,000 iterations took, for the set of 10,000 iterations with the lowest total time. timeit.repeat is reporting how long the whole round of 100 iterations took (again, for the shortest of three). So the real discrepancy is 46.1 µs per loop (IPython) vs. 16.5 ms per loop (Python), still a factor of ~350x, but not 35,000x.
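For reference, a minimal sketch of how the asker's helper could report per-loop numbers the way %timeit does: divide the best total round time by the loop count (the function name here is hypothetical).

from timeit import repeat

def my_timeit_per_loop(code, setup, rep, loops):
    result = repeat(code, setup=setup, repeat=rep, number=loops)
    best = min(result) / loops                 # per-iteration time, like %timeit
    return '%d loops, best of %d: %.3f usec per loop' % (loops, rep, best * 1e6)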
You didn't show profiling results for IPython. Is it possible that in your IPython session you did either from numpy import sum or from numpy import *? If so, you'd have been timing numpy.sum (which is optimized for numpy arrays and would run several orders of magnitude faster), while your Python code (which isolated the globals in a way that IPython does not) ran the normal builtin sum (which has to convert all the values to Python ints and sum them).
If you check your profiling output, virtually all of your work is being done in sum; if that part of your code were sped up by several orders of magnitude, the total time would shrink similarly. That would explain the "real" discrepancy; in the test case linked above, it was a 40x difference, and that was for a smaller array (the smaller the array, the less numpy can "show off") with more complex values (vs. summing 0s and 1s here, I believe).
The remainder (if any) is probably an issue of how the code is being evaluated slightly differently, or possibly weirdness with the random shuffle (for consistent tests, you'd want to seed random with a consistent seed to make the "randomness" repeatable), but I doubt that's a difference of more than a few percent.
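To illustrate the sum point, a quick hedged comparison (exact timings vary by machine): the builtin sum walks the array element by element in Python, while the ndarray's own .sum() (or numpy.sum) does the reduction in C.

import numpy as np

seq = np.arange(10000)
np.random.shuffle(seq)

mask = seq < 5000
slow = sum(mask)      # builtin sum over numpy scalars -- the slow path
fast = mask.sum()     # numpy reduction -- typically orders of magnitude faster
assert slow == fast == 5000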
There could be any number of reasons this code is running slower in one implementation of python than another. One may be optimized differently than another, one may pre-compile certain parts while the other is fully interpreted. The only way to figure out why is to profile your code.
https://docs.python.org/2/library/profile.html
import cProfile
cProfile.run('repeat(code, setup=setup, repeat=rep, number=loops)')
Will give a result similar to
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <stdin>:1(testing)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
This shows you which functions were called, how many times they were called, and how long they took.
I have a Python script in a file which takes just over 30 seconds to run. I am trying to profile it as I would like to cut down this time dramatically.
I am trying to profile the script using cProfile, but essentially all it seems to be telling me is that yes, the main script took a long time to run, but doesn't give the kind of breakdown I was expecting. At the terminal, I type something like:
cat my_script_input.txt | python -m cProfile -s time my_script.py
The results I get are:
<my_script_output>
683121 function calls (682169 primitive calls) in 32.133 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
121089 0.050 0.000 0.050 0.000 {method 'split' of 'str' objects}
121090 0.038 0.000 0.049 0.000 fileinput.py:243(next)
2 0.027 0.014 0.036 0.018 {method 'sort' of 'list' objects}
121089 0.009 0.000 0.009 0.000 {method 'strip' of 'str' objects}
201534 0.009 0.000 0.009 0.000 {method 'append' of 'list' objects}
100858 0.009 0.000 0.009 0.000 my_script.py:51(<lambda>)
952 0.008 0.000 0.008 0.000 {method 'readlines' of 'file' objects}
1904/952 0.003 0.000 0.011 0.000 fileinput.py:292(readline)
14412 0.001 0.000 0.001 0.000 {method 'add' of 'set' objects}
182 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
This doesn't seem to be telling me anything useful. The vast majority of the time is simply listed as:
ncalls tottime percall cumtime percall filename:lineno(function)
1 31.980 31.980 32.133 32.133 my_script.py:18(<module>)
In my_script.py, Line 18 is nothing more than the closing """ of the file's header block comment, so it's not that there is a whole load of work concentrated in Line 18. The script as a whole is mostly made up of line-based processing with mostly some string splitting, sorting and set work, so I was expecting to find the majority of time going to one or more of these activities. As it stands, seeing all the time grouped in cProfile's results as occurring on a comment line doesn't make any sense or at least does not shed any light on what is actually consuming all the time.
EDIT: I've constructed a minimum working example similar to my above case to demonstrate the same behavior:
mwe.py
import fileinput

for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
And call it with:
perl -e 'for(1..1000000){print "$_\n"}' | python -m cProfile -s time mwe.py
To get the result:
22002536 function calls (22001694 primitive calls) in 9.433 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 8.004 8.004 9.433 9.433 mwe.py:1(<module>)
20000000 1.021 0.000 1.021 0.000 {method 'strip' of 'str' objects}
1000001 0.270 0.000 0.301 0.000 fileinput.py:243(next)
1000000 0.107 0.000 0.107 0.000 {range}
842 0.024 0.000 0.024 0.000 {method 'readlines' of 'file' objects}
1684/842 0.007 0.000 0.032 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:80(<module>)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:184(FileInput)
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Am I using cProfile incorrectly somehow?
As I mentioned in a comment, when you can't get cProfile to work externally, you can often use it internally instead. It's not that hard.
For example, when I run with -m cProfile in my Python 2.7, I get effectively the same results you did. But when I manually instrument your example program:
import fileinput
import cProfile

pr = cProfile.Profile()
pr.enable()
for line in fileinput.input():
    for i in range(10):
        y = int(line.strip()) + int(line.strip())
pr.disable()
pr.print_stats(sort='time')
… here's what I get:
22002533 function calls (22001691 primitive calls) in 3.352 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
20000000 2.326 0.000 2.326 0.000 {method 'strip' of 'str' objects}
1000001 0.646 0.000 0.700 0.000 fileinput.py:243(next)
1000000 0.325 0.000 0.325 0.000 {range}
842 0.042 0.000 0.042 0.000 {method 'readlines' of 'file' objects}
1684/842 0.013 0.000 0.055 0.000 fileinput.py:292(readline)
1 0.000 0.000 0.000 0.000 fileinput.py:197(__init__)
1 0.000 0.000 0.000 0.000 fileinput.py:91(input)
1 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 fileinput.py:266(nextfile)
1 0.000 0.000 0.000 0.000 fileinput.py:240(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
That's a lot more useful: It tells you what you probably already expected, that more than half your time is spent calling str.strip().
Also, note that if you can't edit the file containing code you wish to profile (mwe.py), you can always do this:
import cProfile
pr = cProfile.Profile()
pr.enable()
import mwe
pr.disable()
pr.print_stats(sort='time')
Even that doesn't always work. If your program calls exit(), for example, you'll have to use a try:/finally: wrapper and/or an atexit handler. And if it calls os._exit(), or segfaults, you're probably completely hosed. But that isn't very common.
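A sketch of that try:/finally: wrapper, assuming the code under test is in mwe.py and may call exit() (which raises SystemExit):

import cProfile

pr = cProfile.Profile()
pr.enable()
try:
    import mwe           # runs the module's top-level code
except SystemExit:
    pass                 # swallow exit() so we still reach the report
finally:
    pr.disable()
    pr.print_stats(sort='time')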
However, something I discovered later: If you move all code out of the global scope, -m cProfile seems to work, at least for this case. For example:
import fileinput

def f():
    for line in fileinput.input():
        for i in range(10):
            y = int(line.strip()) + int(line.strip())

f()
Now the output from -m cProfile includes, among other things:
2000000 4.819 0.000 4.819 0.000 :0(strip)
100001 0.288 0.000 0.295 0.000 fileinput.py:243(next)
I have no idea why this also made it twice as slow… or maybe that's just a cache effect; it's been a few minutes since I last ran it, and I've done lots of web browsing in between. But that's not important, what's important is that most of the time is getting charged to reasonable places.
But if I change this to move the outer loop to the global level, and only its body into a function, most of the time disappears again.
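For reference, a sketch of that restructuring (same mwe as above): only the loop body lives in a function, while the outer loop stays at module level.

import fileinput

def body(line):
    for i in range(10):
        y = int(line.strip()) + int(line.strip())

for line in fileinput.input():
    body(line)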
Another alternative, which I wouldn't suggest except as a last resort…
I notice that if I use profile instead of cProfile, it works both internally and externally, charging time to the right calls. However, those calls are also about 5x slower. And there seems to be an additional 10 seconds of constant overhead (which gets charged to import profile if used internally, or to whatever's on line 1 if used externally). So, to find out that strip is using 70% of my time, instead of waiting 4 seconds and doing 2.326 / 3.352, I have to wait 27 seconds and do 10.93 / (26.34 - 10.01). Not much fun…
One last thing: I get the same results with a CPython 3.4 dev build—correct results when used internally, everything charged to the first line of code when used externally. But PyPy 2.2/2.7.3 and PyPy3 2.1b1/3.2.3 both seem to give me correct results with -m cProfile. This may just mean that PyPy's cProfile is faked on top of profile because the pure-Python code is fast enough.
Anyway, if someone can figure out/explain why -m cProfile isn't working, that would be great… but otherwise, this is usually a perfectly good workaround.
Here is my problem: I have a dict in python such as:
a = {1:[2, 3], 2:[1]}
I would like to output:
1, 2
1, 3
2, 1
what I am doing is
for i in a:
    for j in a[i]:
        print i, j
So is there any easier way to do this that avoids the two loops, or is this already the simplest way?
The code you have is about as good as it gets. One minor improvement might be iterating over the dictionary's items in the outer loop, rather than doing indexing:
for i, lst in a.items():  # use a.iteritems() in Python 2
    for j in lst:
        print("{}, {}".format(i, j))
Here are a couple of alternatives using list comprehensions, if you want to avoid explicit for loops.
# Method 1
# Python 2.7
for key, value in a.iteritems():  # Use a.items() for Python 3
    print "\n".join(["%d, %d" % (key, val) for val in value])

# Method 2 - a fancier way with nested list comprehensions
print "\n".join(["\n".join(["%d, %d" % (key, val) for val in value]) for key, value in a.iteritems()])
Both will output
1, 2
1, 3
2, 1
Remember that in Python, readability counts, so ideally @Blckknght's solution is the one you should prefer. But purely as a proof of concept that your expression can be rewritten as a single loop, here is a solution.
One caveat: if you want your code to be readable, remember that explicit is better than implicit.
>>> from itertools import chain, izip, cycle
>>> def foo():
        return '\n'.join('{},{}'.format(*e) for e in chain(*(izip(cycle([k]), v) for k, v in a.items())))

>>> def bar():
        return '\n'.join("{},{}".format(i, j) for i in a for j in a[i])
>>> cProfile.run("foo()")
20 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <pyshell#240>:1(foo)
5 0.000 0.000 0.000 0.000 <pyshell#240>:2(<genexpr>)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
>>> cProfile.run("bar()")
25 function calls in 0.000 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <pyshell#242>:1(bar)
11 0.000 0.000 0.000 0.000 <pyshell#242>:2(<genexpr>)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
I am testing dynamodb via boto and have found it to be surprisingly slow in retrieving data sets based on hashkey, rangekey condition queries. I have seen some discussion about the oddity that causes ssl (is_secure) to perform about 6x faster than non-ssl and I can confirm that finding. But even using ssl I am seeing 1-2 seconds to retrieve 300 records using a hashkey/range key condition on a fairly small data set (less than 1K records).
Running the profilehooks profiler I see a lot of extraneous time spent in ssl.py, on the order of 20617 ncalls to retrieve the 300 records. It seems like even at 10 calls per record it's still 6x more than I would expect. This is on a medium instance, though the same results occur on a micro instance. 500 reads/sec, 1000 writes/sec provisioning with no throttles logged.
I have looked at doing a batch request, but the inability to use range key conditions eliminates that option for me.
Any ideas on where I'm losing time would be greatly appreciated!
144244 function calls in 2.083 CPU seconds
Ordered by: cumulative time, internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 2.083 2.083 eventstream.py:427(session_range)
107 0.006 0.000 2.081 0.019 dynamoDB.py:36(rangeQ)
408 0.003 0.000 2.073 0.005 layer2.py:493(query)
107 0.001 0.000 2.046 0.019 layer1.py:435(query)
107 0.002 0.000 2.040 0.019 layer1.py:119(make_request)
107 0.006 0.000 1.988 0.019 connection.py:699(_mexe)
107 0.001 0.000 1.916 0.018 httplib.py:956(getresponse)
107 0.002 0.000 1.913 0.018 httplib.py:384(begin)
662 0.049 0.000 1.888 0.003 socket.py:403(readline)
20617 0.040 0.000 1.824 0.000 ssl.py:209(recv)
20617 0.036 0.000 1.785 0.000 ssl.py:130(read)
20617 1.748 0.000 1.748 0.000 {built-in method read}
107 0.002 0.000 1.738 0.016 httplib.py:347(_read_status)
107 0.001 0.000 0.170 0.002 mimetools.py:24(__init__)
107 0.000 0.000 0.165 0.002 rfc822.py:88(__init__)
107 0.007 0.000 0.165 0.002 httplib.py:230(readheaders)
107 0.001 0.000 0.031 0.000 __init__.py:332(loads)
107 0.001 0.000 0.028 0.000 decoder.py:397(decode)
107 0.008 0.000 0.026 0.000 decoder.py:408(raw_decode)
107 0.001 0.000 0.026 0.000 httplib.py:910(request)
107 0.003 0.000 0.026 0.000 httplib.py:922(_send_request)
107 0.001 0.000 0.025 0.000 connection.py:350(authorize)
107 0.004 0.000 0.024 0.000 auth.py:239(add_auth)
3719 0.011 0.000 0.019 0.000 layer2.py:31(item_object_hook)
301 0.010 0.000 0.018 0.000 item.py:38(__init__)
22330 0.015 0.000 0.015 0.000 {method 'append' of 'list' objects}
107 0.001 0.000 0.012 0.000 httplib.py:513(read)
214 0.001 0.000 0.011 0.000 httplib.py:735(send)
856 0.002 0.000 0.010 0.000 __init__.py:1034(debug)
214 0.001 0.000 0.009 0.000 ssl.py:194(sendall)
107 0.000 0.000 0.008 0.000 httplib.py:900(endheaders)
107 0.001 0.000 0.008 0.000 httplib.py:772(_send_output)
107 0.001 0.000 0.008 0.000 auth.py:223(string_to_sign)
856 0.002 0.000 0.008 0.000 __init__.py:1244(isEnabledFor)
137 0.001 0.000 0.008 0.000 httplib.py:603(_safe_read)
214 0.001 0.000 0.007 0.000 ssl.py:166(send)
214 0.007 0.000 0.007 0.000 {built-in method write}
3311 0.006 0.000 0.006 0.000 item.py:186(__setitem__)
107 0.001 0.000 0.006 0.000 auth.py:95(sign_string)
137 0.001 0.000 0.006 0.000 socket.py:333(read)
This isn't a complete answer but I thought it was worth posting it at this time.
I've heard reports like this from a couple of people over the last few weeks. I was able to reproduce the anomaly of HTTPS being considerably faster than HTTP but wasn't able to track it down. It seemed like the problem was unique to Python/boto, but it turns out the same issue was found with C#/.Net, and investigating that revealed that the underlying problem was the use of Nagle's algorithm in the Python and .Net libraries. In .Net it's easy to turn this off, but unfortunately it's not as easy in Python.
To test this, I wrote a simple script that performed 1000 GetItem requests in a loop. The item that was being fetch was very small, well under 1K. Running this on Python 2.6.7 on an m1.medium instance in the us-east-1 region produced these results:
>>> http_data = speed_test(False, 1000)
dynamoDB_speed_test - RUNTIME = 53.120193
Throttling exceptions: 0
>>> https_data = speed_test(True, 1000)
dynamoDB_speed_test - RUNTIME = 8.167652
Throttling exceptions: 0
Note that there is sufficient provisioned capacity in the table to avoid any throttling from the service and the unexpected gap between HTTP and HTTPS is clear.
I next ran the same test in Python 2.7.2:
>>> http_data = speed_test(False, 1000)
dynamoDB_speed_test - RUNTIME = 5.668544
Throttling exceptions: 0
>>> https_data = speed_test(True, 1000)
dynamoDB_speed_test - RUNTIME = 7.425210
Throttling exceptions: 0
So, 2.7 seems to have fixed this issue. I then applied a simple patch to httplib.py in 2.6.7. The patch simply sets the TCP_NODELAY option on the socket associated with the HTTPConnection object, like this:
self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
I then re-ran the test on 2.6.7:
>>> http_data = speed_test(False, 1000)
dynamoDB_speed_test - RUNTIME = 5.914109
Throttling exceptions: 0
>>> https_data = speed_test(True, 1000)
dynamoDB_speed_test - RUNTIME = 5.137570
Throttling exceptions: 0
Even better, although HTTPS is still, unexpectedly, a bit faster than HTTP. It's hard to know whether that difference is significant or not.
So, I'm looking for ways to programmatically configure the socket for HTTPConnection objects to have TCP_NODELAY set correctly. It's not easy to get at that in httplib.py. My best advice for the moment is to use Python 2.7, if possible.
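An untested sketch of one way to get TCP_NODELAY onto httplib's socket without patching httplib.py itself: subclass HTTPConnection and set the option right after the socket is created. (Python 2; boto would still need to be pointed at this class, which is the hard part alluded to above.)

import socket
import httplib

class NoDelayHTTPConnection(httplib.HTTPConnection):
    def connect(self):
        httplib.HTTPConnection.connect(self)
        # Disable Nagle's algorithm so small requests go out immediately.
        self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)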
I'm using python to do some Bayesian statistics. I've coded it up in python and in Fortran 95. The Fortran code is waaay faster... like a factor of 100. I expected the Fortran to be faster, but I was really hoping that by using numpy I could get the python code to come close, maybe within a factor of 2. I've profiled the python code and it looks like the majority of the time is spent doing the following things:
scipy.stats.rvs: taking a random draw from a distribution. I do this ~19000 times and it takes a total time of 3.552 sec
numpy.slogdet: computing the log of the determinant of a matrix. I do this ~10,000 and it takes a total of 2.48 s
numpy.solve: solve a linear system: I call this routine ~10,000 times for a total time of 2.557 s
In total my code runs in ~11 sec whereas my Fortran code takes 0.092 sec. Is this a joke? I'm really not trying to be unrealistic in my expectations of Python, and I certainly don't expect my Python code to be as fast as Fortran, but to be slower by a factor of >100... Python's got to be able to do better than that. Just in case you are curious, here is the full output of my profiler (I don't know why it broke the text into several blocks):
1290611 function calls in 11.296 CPU seconds
Ordered by: internal time, function name
ncalls tottime percall cumtime percall filename:lineno(function)
18973 0.864 0.000 3.552 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:484(rvs)
9976 0.819 0.000 2.480 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:1559(slogdet)
9976 0.627 0.000 6.659 0.001 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:77(evaluate_posterior)
9384 0.591 0.000 0.753 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:39(construct_R_matrix)
77852 0.533 0.000 0.533 0.000 :0(array)
37946 0.520 0.000 1.489 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:32(_wrapit)
77851 0.423 0.000 0.956 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:216(asarray)
37946 0.360 0.000 0.360 0.000 :0(all)
9976 0.335 0.000 2.557 0.000 /usr/lib64/python2.6/sitepackages/scipy/linalg/basic.py:23(solve)
107799 0.322 0.000 0.322 0.000 :0(len)
109740 0.301 0.000 0.301 0.000 :0(issubclass)
28357 0.294 0.000 0.294 0.000 :0(prod)
9976 0.287 0.000 0.957 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:45(find_best_lapack_type)
1 0.282 0.282 11.294 11.294 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:199(get_rho_lambda_draws)
9976 0.269 0.000 1.386 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:60(get_lapack_funcs)
19952 0.263 0.000 0.476 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:23(cast_to_lapack_prefix)
19952 0.235 0.000 0.669 0.000 /usr/lib64/python2.6/site-packages/numpy/lib/function_base.py:483(asarray_chkfinite)
66833 0.212 0.000 0.212 0.000 :0(log)
18973 0.207 0.000 1.054 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1427(product)
29931 0.205 0.000 0.205 0.000 :0(reduce)
28949 0.187 0.000 0.856 0.000 :0(map)
9976 0.175 0.000 0.175 0.000 :0(dot)
47922 0.163 0.000 0.163 0.000 :0(getattr)
9976 0.157 0.000 0.206 0.000 /usr/lib64/python2.6/site-packages/numpy/lib/twodim_base.py:169(eye)
19952 0.154 0.000 0.271 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:32(loggbeta)
18973 0.151 0.000 0.793 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1548(all)
19953 0.146 0.000 0.146 0.000 :0(any)
9976 0.142 0.000 0.316 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:99(_commonType)
9976 0.133 0.000 0.133 0.000 :0(dgetrf)
18973 0.125 0.000 0.175 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:462(_fix_loc_scale)
39904 0.117 0.000 0.117 0.000 :0(append)
18973 0.105 0.000 0.292 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1461(alltrue)
19952 0.102 0.000 0.102 0.000 :0(zeros)
19952 0.093 0.000 0.154 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:71(isComplexType)
19952 0.090 0.000 0.090 0.000 :0(split)
9976 0.089 0.000 2.569 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:62(get_log_determinant_of_matrix)
19952 0.087 0.000 0.134 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:35(logggamma)
9976 0.083 0.000 0.154 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:139(_fastCopyAndTranspose)
9976 0.076 0.000 0.125 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:157(_assertSquareness)
9976 0.074 0.000 0.097 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:151(_assertRank2)
9976 0.072 0.000 0.119 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:127(_to_native_byte_order)
18973 0.072 0.000 0.072 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:832(_argcheck)
9976 0.072 0.000 0.228 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:901(diagonal)
9976 0.070 0.000 0.070 0.000 :0(arange)
9976 0.061 0.000 0.061 0.000 :0(diagonal)
9976 0.055 0.000 0.055 0.000 :0(sum)
9976 0.053 0.000 0.075 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:84(_realType)
11996 0.050 0.000 0.091 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1412(_rvs)
9384 0.047 0.000 0.162 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1898(prod)
9976 0.045 0.000 0.045 0.000 :0(sort)
11996 0.041 0.000 0.041 0.000 :0(standard_normal)
9976 0.037 0.000 0.037 0.000 :0(_fastCopyAndTranspose)
9976 0.037 0.000 0.037 0.000 :0(hasattr)
9976 0.037 0.000 0.037 0.000 :0(range)
6977 0.034 0.000 0.055 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:3731(_rvs)
9977 0.027 0.000 0.027 0.000 :0(max)
9976 0.023 0.000 0.023 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:498(isfortran)
9977 0.022 0.000 0.022 0.000 :0(min)
9976 0.022 0.000 0.022 0.000 :0(get)
6977 0.021 0.000 0.021 0.000 :0(uniform)
1 0.001 0.001 11.295 11.295 <string>:1(<module>)
1 0.001 0.001 11.296 11.296 profile:0(get_rho_lambda_draws(correlations,energies,rho_priors,lambda_e_prior,lambda_z_prior,candidate_sig2_rhos,candidate_sig2_lambda_e,candidate_sig2_lambda_z,3000))
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:445(__call__)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:385(__init__)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:175(_array2string)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:475(_digits)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:309(_extendLine)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:317(_formatArray)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1477(any)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:243(array2string)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:1390(array_str)
1 0.000 0.000 0.000 0.000 :0(compress)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:394(fillFormat)
6 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:2166(geterr)
12 0.000 0.000 0.000 0.000 :0(geterrobj)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1043(ravel)
1 0.000 0.000 0.000 0.000 :0(ravel)
8 0.000 0.000 0.000 0.000 :0(rstrip)
6 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:2070(seterr)
6 0.000 0.000 0.000 0.000 :0(seterrobj)
1 0.000 0.000 0.000 0.000 :0(setprofile)
EDIT:
Here is a copy of the relevant routines:
def get_rho_lambda_draws(correlations, energies, rho_priors, lam_e_prior, lam_z_prior,
                         candidate_sig2_rhos, candidate_sig2_lambda_e,
                         candidate_sig2_lambda_z, ndraws):
    nBasis = len(correlations[0])
    nStruct = len(correlations)

    rho_draws = [[0.5 for x in xrange(nBasis)] for y in xrange(ndraws)]
    lambda_e_draws = [5 for x in xrange(ndraws)]
    lambda_z_draws = [5 for x in xrange(ndraws)]
    accept_rhos = array([0. for x in xrange(nBasis)])
    accept_lambda_e = 0.
    accept_lambda_z = 0.

    for i in xrange(1, ndraws):
        if i % 100 == 0:
            print i, "REP<---------------------------------------------------------------------------------"

        # do metropolis to get rho
        rho_draws[i] = [x for x in rho_draws[i-1]]
        lambda_e_draws[i] = lambda_e_draws[i-1]
        lambda_z_draws[i] = lambda_z_draws[i-1]

        rho_vec = [x for x in rho_draws[i-1]]
        R_matrix_before = construct_R_matrix(correlations, correlations, rho_vec)
        post_before = evaluate_posterior(R_matrix_before, rho_vec, energies, lambda_e_draws[i-1], lambda_z_draws[i-1], lam_e_prior, lam_z_prior, rho_priors)

        index = 0
        for j in xrange(nBasis):
            cand = norm.rvs(rho_draws[i-1][j], scale=candidate_sig2_rhos[j])
            if 0.0 < cand < 1.0:
                rho_vec[j] = cand
                R_matrix_after = construct_R_matrix(correlations, correlations, rho_vec)
                post_after = evaluate_posterior(R_matrix_after, rho_vec, energies, lambda_e_draws[i-1], lambda_z_draws[i-1], lam_e_prior, lam_z_prior, rho_priors)

                metrop_value = post_after - post_before
                unif = log(uniform.rvs(0, 1))
                if metrop_value > unif:
                    rho_draws[i][j] = cand
                    post_before = post_after
                    accept_rhos[j] += 1
                else:
                    rho_vec[j] = rho_draws[i-1][j]

        R_matrix = construct_R_matrix(correlations, correlations, rho_vec)

        cand = norm.rvs(lambda_e_draws[i-1], scale=candidate_sig2_lambda_e)
        if cand > 0.0:
            post_after = evaluate_posterior(R_matrix, rho_vec, energies, cand, lambda_z_draws[i-1], lam_e_prior, lam_z_prior, rho_priors)
            metrop_value = post_after - post_before
            unif = log(uniform.rvs(0, 1))
            if metrop_value > unif:
                lambda_e_draws[i] = cand
                post_before = post_after
                accept_lambda_e = accept_lambda_e + 1

        cand = norm.rvs(lambda_z_draws[i-1], scale=candidate_sig2_lambda_z)
        if cand > 0.0:
            post_after = evaluate_posterior(R_matrix, rho_vec, energies, lambda_e_draws[i], cand, lam_e_prior, lam_z_prior, rho_priors)
            metrop_value = post_after - post_before
            unif = log(uniform.rvs(0, 1))
            if metrop_value > unif:
                lambda_z_draws[i] = cand
                post_before = post_after
                accept_lambda_z = accept_lambda_z + 1

    print accept_rhos / ndraws
    print accept_lambda_e / ndraws
    print accept_lambda_z / ndraws

    return [rho_draws, lambda_e_draws, lambda_z_draws]

def evaluate_posterior(R_matrix, rho_vec, energies, lambda_e, lambda_z, lam_e_prior, lam_z_prior, rho_prior_params):
    # from scipy.linalg import solve
    # from numpy import allclose
    working_matrix = eye(len(R_matrix)) / lambda_e + R_matrix / lambda_z
    logdet = get_log_determinant_of_matrix(working_matrix)
    x = solve(working_matrix, energies, sym_pos=True)
    # if not allclose(dot(working_matrix, x), energies):
    #     exit('solve routine didnt work')
    rho_priors = sum([loggbeta(rho_vec[j], rho_prior_params[j][0], rho_prior_params[j][1]) for j in xrange(len(rho_vec))])
    loggposterior = -.5 * logdet - .5 * dot(energies, x) + logggamma(lambda_e, lam_e_prior[0], lam_e_prior[1]) + logggamma(lambda_z, lam_z_prior[0], lam_z_prior[1]) + rho_priors  # (a_e-1)*log(lambda_e) - b_e*lambda_e + (a_z-1)*log(lambda_z) - b_z*lambda_z + rho_priors
    return loggposterior

def construct_R_matrix(listone, listtwo, rhos):
    return prod(rhos[:]**(4*(listone[:, newaxis] - listtwo)**2), axis=2)
(Once again... I don't know why it breaks my input up into several blocks when I post. I hope you can decipher it.)
It is hard to tell exactly what's going on with your code, but my suspicion is that you just have some data which is not (or cannot be) very vectorized.
Obviously, calling .rvs() 19,000 times is going to be way slower than a single .rvs(size=19000) call. See:
In [5]: %timeit x=[scipy.stats.norm().rvs() for i in range(19000)]
1 loops, best of 3: 1.23 s per loop
In [6]: %timeit x=scipy.stats.norm().rvs(size=19000)
1000 loops, best of 3: 1.67 ms per loop
So if your code or algorithm really cannot be vectorized, it is to be expected that it will be slower than Fortran.
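In the posted code, the per-coordinate proposals are all centered on the previous iteration's values, so they (and the log-uniforms used for the accept/reject tests) can be drawn in one vectorized call per sweep instead of nBasis separate rvs() calls. A hedged sketch with assumed shapes; the accept/reject decisions themselves still have to be made sequentially:

import numpy as np
from scipy.stats import norm

nBasis = 50
prev_rhos = np.full(nBasis, 0.5)              # rho_draws[i-1], assumed as an array
candidate_sig2_rhos = np.full(nBasis, 0.05)   # proposal scales, assumed as an array

cands = norm.rvs(loc=prev_rhos, scale=candidate_sig2_rhos, size=nBasis)
log_unifs = np.log(np.random.uniform(size=nBasis))   # pre-drawn acceptance thresholds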
Check out the performance page created by the SciPy/NumPy folks. There are a number of remarkably easy extras that foster very fast code. Among them are (a) using the weave module, especially the inline and blitz options, and (b) using Cython to write some of your functions in C while still being able to call them from Python.
I do a lot of large-scale scientific computing work in Python for statistics, finance, and (in grad school) computer vision. The reason why Python is excellent for these kinds of problems is not that my naive, first-hack code yields the fastest solution, but that in Python I can easily interface with tons of other tasks. I can easily issue Linux commands for other programs, easily read and parse most data files, easily interface with SQL and other database software; I have all of the R statistics library available, use of OpenCV commands (in much, much nicer syntax than the C++ version), and much more.
When the importance of my task was to manipulate a new dataset and get my hands dirty, feeling out the nuances of that data, then Python's ease of programming, along with matplotlib, made it much better. Later on, when I need to scale things up, I can always use PyCUDA, Cython, or just rewrite things in C++ if high-end performance is required. Since most machines have multiprocessors now, the multiprocessing module, as well as mpi4py, allow me to quickly and cheaply turn annoying for-loop style tasks into much shorter tasks, without needing to migrate to C++.
In short, the real utility of Python doesn't come from the language all by itself, but from becoming really proficient with the add-ons and extras that let you cheaply make your little set of common problems execute quickly on the data sets that matter in the day-to-day.
Real-time embedded communications software is going to be using C++ for a long time to come... same for high-frequency trading strategies. But then again, professional solutions to these types of problems are not really what Python is meant for. And in some cases, folks prefer unusual solutions for that stuff anyway.
Get rid of the two for loops and the two list comprehensions by replacing them with NumPy functions and constructs that use numpy.ndarrays. Also, do not print in the middle of the computation; that is slow too. You can probably get a 10-50 fold speed increase just by following this advice.
Also see http://www.scipy.org/PerformancePython/
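A hedged sketch of the first suggestion: replace the nested list comprehensions in get_rho_lambda_draws with preallocated NumPy arrays (shapes here are assumed from the posted code).

import numpy as np

ndraws, nBasis = 3000, 50
rho_draws = np.full((ndraws, nBasis), 0.5)
lambda_e_draws = np.full(ndraws, 5.0)
lambda_z_draws = np.full(ndraws, 5.0)
accept_rhos = np.zeros(nBasis)

# Copying the previous draw then becomes a single slice assignment:
i = 1
rho_draws[i] = rho_draws[i - 1]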
You usually shouldn't use NumPy or SciPy to compute scalar values. Use 'ordinary' Python instead. Extending the example provided by @sega_sai:
In [11]: %timeit x = [normalvariate(0, 1) for i in range(190)]
1000 loops, best of 3: 274 µs per loop
In [12]: %timeit x = [scipy.stats.norm().rvs() for i in range(190)]
10 loops, best of 3: 180 ms per loop
In [13]: %timeit x = scipy.stats.norm().rvs(size=190)
1000 loops, best of 3: 987 µs per loop
It is faster if you bind scipy.stats.norm().rvs to a name once:
In [14]: rvs = scipy.stats.norm().rvs
In [15]: %timeit x = [rvs() for i in range(190)]
100 loops, best of 3: 3.8 ms per loop
In [16]: %timeit x = rvs(size=190)
10000 loops, best of 3: 44 µs per loop
Also note that PyMC has complained about Scipy's probability distributions:
"Based on informal comparison using version 2.0, the distributions in PyMC tend to be approximately an order of magnitude faster than their counterparts in SciPy (using version 0.7)"
http://www.map.ox.ac.uk/media/PDF/Patil_et_al_2010.pdf
import pymc
s = pymc.Normal('s', 0, 1)
%timeit x = [s.rand() for i in range(190)]
100 loops, best of 3: 3.76 ms per loop
Also note that SciPy is faster if you avoid creating a new distribution instance at each iteration:
generate = scipy.stats.norm().rvs
%timeit x = [generate() for i in range(190)]
100 loops, best of 3: 7.98 ms per loop
Try doing this:
import psyco
psyco.full()
Or use PyPy; these can sometimes yield significant speed improvements, although PyPy doesn't have full NumPy support yet.
Recently I posted something on Stack Overflow about the performance of C/C++/Fortran versus Python:
comparing python with c/fortran
What I concluded from that post was that it is better to combine Python with a low-level programming language than to use Python itself for numeric computations. I am currently using F2PY.