How to speed up matrix code - python

I have the following simple code which estimates the probability that an h by n binary matrix has a certain property. It runs in exponential time (which is bad to start with) but I am surprised it is so slow even for n = 12 and h = 9.
#!/usr/bin/python
import numpy as np
import itertools

n = 12
h = 9
F = np.matrix(list(itertools.product([0,1], repeat=n))).transpose()
count = 0
iters = 100
for i in xrange(iters):
    M = np.random.randint(2, size=(h,n))
    product = np.dot(M,F)
    setofcols = set()
    for column in product.T:
        setofcols.add(repr(column))
    if len(setofcols) == 2**n:
        count = count + 1
print count*1.0/iters
I have profiled it using n = 10 and h = 7. The output is rather long, but here are the lines that took the most time.
23447867 function calls (23038179 primitive calls) in 35.785 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.002 0.001 0.019 0.010 __init__.py:1(<module>)
1 0.001 0.001 0.054 0.054 __init__.py:106(<module>)
1 0.001 0.001 0.022 0.022 __init__.py:15(<module>)
2 0.003 0.002 0.013 0.006 __init__.py:2(<module>)
1 0.001 0.001 0.003 0.003 __init__.py:38(<module>)
1 0.001 0.001 0.001 0.001 __init__.py:4(<module>)
1 0.001 0.001 0.004 0.004 __init__.py:45(<module>)
1 0.001 0.001 0.002 0.002 __init__.py:88(<module>)
307200 0.306 0.000 1.584 0.000 _methods.py:24(_any)
102400 0.026 0.000 0.026 0.000 arrayprint.py:22(product)
102400 1.345 0.000 32.795 0.000 arrayprint.py:225(_array2string)
307200/102400 1.166 0.000 33.350 0.000 arrayprint.py:335(array2string)
716800 0.820 0.000 1.162 0.000 arrayprint.py:448(_extendLine)
204800/102400 1.699 0.000 5.090 0.000 arrayprint.py:456(_formatArray)
307200 0.651 0.000 22.510 0.000 arrayprint.py:524(__init__)
307200 11.783 0.000 21.859 0.000 arrayprint.py:538(fillFormat)
1353748 1.920 0.000 2.537 0.000 arrayprint.py:627(_digits)
102400 0.576 0.000 2.523 0.000 arrayprint.py:636(__init__)
716800 2.159 0.000 2.159 0.000 arrayprint.py:649(__call__)
307200 0.099 0.000 0.099 0.000 arrayprint.py:658(__init__)
102400 0.163 0.000 0.225 0.000 arrayprint.py:686(__init__)
102400 0.307 0.000 13.784 0.000 arrayprint.py:697(__init__)
102400 0.110 0.000 0.110 0.000 arrayprint.py:713(__init__)
102400 0.043 0.000 0.043 0.000 arrayprint.py:741(__init__)
1 0.003 0.003 0.003 0.003 chebyshev.py:87(<module>)
2 0.001 0.000 0.001 0.000 collections.py:284(namedtuple)
1 0.277 0.277 35.786 35.786 counterfeit.py:3(<module>)
205002 0.222 0.000 0.247 0.000 defmatrix.py:279(__array_finalize__)
102500 0.747 0.000 1.077 0.000 defmatrix.py:301(__getitem__)
102400 0.322 0.000 34.236 0.000 defmatrix.py:352(__repr__)
102400 0.100 0.000 0.508 0.000 fromnumeric.py:1087(ravel)
307200 0.382 0.000 2.829 0.000 fromnumeric.py:1563(any)
271 0.004 0.000 0.005 0.000 function_base.py:3220(add_newdoc)
1 0.003 0.003 0.003 0.003 hermite.py:59(<module>)
1 0.003 0.003 0.003 0.003 hermite_e.py:59(<module>)
1 0.001 0.001 0.002 0.002 index_tricks.py:1(<module>)
1 0.003 0.003 0.003 0.003 laguerre.py:59(<module>)
1 0.003 0.003 0.003 0.003 legendre.py:83(<module>)
1 0.001 0.001 0.001 0.001 linalg.py:10(<module>)
1 0.001 0.001 0.001 0.001 numeric.py:1(<module>)
102400 0.247 0.000 33.598 0.000 numeric.py:1365(array_repr)
204800 0.321 0.000 1.143 0.000 numeric.py:1437(array_str)
614400 1.199 0.000 2.627 0.000 numeric.py:2178(seterr)
614400 0.837 0.000 0.918 0.000 numeric.py:2274(geterr)
102400 0.081 0.000 0.186 0.000 numeric.py:252(asarray)
307200 0.259 0.000 0.622 0.000 numeric.py:322(asanyarray)
1 0.003 0.003 0.004 0.004 polynomial.py:54(<module>)
513130 0.134 0.000 0.134 0.000 {isinstance}
307229 0.075 0.000 0.075 0.000 {issubclass}
5985327/5985305 0.595 0.000 0.595 0.000 {len}
306988 0.120 0.000 0.120 0.000 {max}
102400 0.061 0.000 0.061 0.000 {method '__array__' of 'numpy.ndarray' objects}
102406 0.027 0.000 0.027 0.000 {method 'add' of 'set' objects}
307200 0.241 0.000 1.824 0.000 {method 'any' of 'numpy.ndarray' objects}
307200 0.482 0.000 0.482 0.000 {method 'compress' of 'numpy.ndarray' objects}
204800 0.035 0.000 0.035 0.000 {method 'item' of 'numpy.ndarray' objects}
102451 0.014 0.000 0.014 0.000 {method 'join' of 'str' objects}
102400 0.222 0.000 0.222 0.000 {method 'ravel' of 'numpy.ndarray' objects}
921176 3.330 0.000 3.330 0.000 {method 'reduce' of 'numpy.ufunc' objects}
102405 0.057 0.000 0.057 0.000 {method 'replace' of 'str' objects}
2992167 0.660 0.000 0.660 0.000 {method 'rstrip' of 'str' objects}
102400 0.041 0.000 0.041 0.000 {method 'splitlines' of 'str' objects}
6 0.003 0.000 0.003 0.001 {method 'sub' of '_sre.SRE_Pattern' objects}
307276 0.090 0.000 0.090 0.000 {min}
100 0.013 0.000 0.013 0.000 {numpy.core._dotblas.dot}
409639 0.473 0.000 0.473 0.000 {numpy.core.multiarray.array}
1228800 0.239 0.000 0.239 0.000 {numpy.core.umath.geterrobj}
614401 0.352 0.000 0.352 0.000 {numpy.core.umath.seterrobj}
102475 0.031 0.000 0.031 0.000 {range}
102400 0.076 0.000 0.102 0.000 {reduce}
204845/102445 0.198 0.000 34.333 0.000 {repr}
The multiplication of the matrices seems to take a tiny fraction of the time. Is it possible to speed up the rest?
Results
There are now three answers but one seems to have a bug currently. I have tested the remaining two with n=18, h=11 and iters=10.
bubble - 21 seconds, 185MB of RAM. 16 seconds on "sort".
hpaulj - 7.5 seconds, 130MB of RAM. 3 seconds on "tolist", 1.5 seconds on "numpy.core.multiarray.array", 1.5 seconds on "genexpr" (the 'set' line).
Interestingly, the time for multiplying the matrices is still a tiny fraction of the overall time taken.

To speed up the code above you should avoid loops.
import numpy as np
import itertools

def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

n = 12
h = 9
iters = 100
F = np.matrix(list(itertools.product([0,1], repeat=n))).transpose()
M = np.random.randint(2, size=(h*iters, n))
product = np.dot(M, F)
counts = map(lambda x: len(unique_rows(x.T)) == 2**n, np.split(product, iters, axis=0))
prob = float(sum(counts))/iters

# All unique submatrices M (h x n) with the sophisticated property...
[np.split(M, iters, axis=0)[j] for j in range(len(counts)) if counts[j]]
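As an aside, NumPy 1.13 added an axis argument to np.unique, so on newer NumPy the structured-view helper above can be replaced by a simpler equivalent (a sketch; np.asarray is there so it also accepts np.matrix input, and it should behave the same for counting unique rows):
def unique_rows(a):
    # np.unique with axis=0 deduplicates whole rows (NumPy >= 1.13)
    return np.unique(np.asarray(a), axis=0)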

Try replacing repr(column) with
setofcols.add(tuple(column.A1.tolist()))
set accepts a tuple. column.A1 is the matrix converted to a 1d array. The tuple is then something like (0, 1, 0), which set can easily compare.
Just replacing the expensive repr formatting lops off a lot of time (25x speedup).
EDIT
By creating and filling the set in one statement I get a further 10x speed up. In my tests it is 2x faster than bubble's vectorization.
count = 0
for i in xrange(iters):
    M = np.random.randint(2, size=(h,n))
    product = np.dot(M, F)
    setofcols = set(tuple(x) for x in product.T.tolist())
    # or {tuple(x) for x in product.T.tolist()} if new enough Python
    if len(setofcols) == 2**n:
        count += 1
        # print M # to see the unique M
print count*1.0/iters
EDIT
Here's something even faster - transform each column of h integers into a single integer with a dot product against powers of a base, dot([1, b, b**2, ...], column). The base b must be larger than any entry of the product (entries can be as large as n), otherwise distinct columns can collide. With that caveat, applying np.unique (or set) to the resulting integers gives a further 2-3x speedup.
count = 0
X = (n + 1)**np.arange(h)  # base n+1, since entries of the product are at most n
for i in xrange(iters):
    M = np.random.randint(2, size=(h,n))
    product = np.dot(M, F)
    setofcols = np.unique(np.dot(X, product).A1)
    if setofcols.size == 2**n:
        count += 1
print count*1.0/iters
With this the top calls are
200 0.201 0.001 0.204 0.001 {numpy.core._dotblas.dot}
100 0.026 0.000 0.026 0.000 {method 'sort' of 'numpy.ndarray' objects}
100 0.007 0.000 0.035 0.000 arraysetops.py:93(unique)

As alko and seberg pointed out, you are losing a lot of time converting your arrays to large strings to store them in your set of columns.
If I understood your code correctly, you are trying to find whether the number of distinct columns in your product matrix equals its total number of columns (2**n). You can do that by sorting it and looking at differences from one column to the next:
D = (np.diff(np.sort(product.T, axis=0), axis=0) == 0)
This will give you a matrix of booleans D. You can then see whether at least one element changes from one column to the next:
C = (1 - np.prod(D, axis=1)) # i.e. 'not all(D[i,:]) for all i'
You then simply have to check whether all the values are different:
hasproperty = np.all(C)
Which gives you the complete code:
def f(n, h, iters):
    F = np.array(list(itertools.product([0,1], repeat=n))).T
    counts = []
    for _ in xrange(iters):
        M = np.random.randint(2, size=(h,n))
        product = M.dot(F)
        D = (np.diff(np.sort(product.T, axis=1), axis=0) == 0)
        C = (1 - np.prod(D, axis=1))
        hasproperty = np.all(C)
        counts.append(1. if hasproperty else 0.)
    return np.mean(counts)
Which takes roughly 8s for f(12, 9, 100).
If you prefer comically compact expressions:
def g(n, h, iters):
    F = np.array(list(itertools.product([0,1], repeat=n))).T
    return np.mean([np.all(1 - np.prod(np.diff(np.sort(np.random.randint(2,size=(h,n)).dot(F).T, axis=1), axis=0)==0, axis=1)) for _ in xrange(iters)])
Timing it gives:
>>> setup = """import numpy as np
import itertools
def g(n, h, iters):
    F = np.array(list(itertools.product([0,1], repeat=n))).T
    return np.mean([np.all(1 - np.prod(np.diff(np.sort(np.random.randint(2,size=(h,n)).dot(F).T, axis=1), axis=0)==0, axis=1)) for _ in xrange(iters)])
"""
>>> timeit.timeit('g(10, 7, 100)', setup=setup, number=10)
17.358669997900734
>>> timeit.timeit('g(10, 7, 100)', setup=setup, number=50)
83.06966196163967
Or approximately 1.7s per call to g(10, 7, 100).

Related

Optimization of N digits all permutations sum algorithm

[Note: Programming Challenge]
Challenge description:
Input: List of Digits 1, 5, 6, 2, 7, 3, 4
Digits are non-repeated, with no zeros (digits 1-9 are all valid). The goal is to create all permutations and sum them all. I used itertools permutations for this. I was able to do some test assertions with 7, 8, and 9 digits on my local machine. At 9 digits the time increases quite a bit, to the point where the server times out at 12 seconds, though my computer doesn't take that long. It just says to optimize my algorithm.
Below is an example of how arrays are formed, and itertools does this in each iteration of the inner loop.
array for the list sum (terms of arrays)
[1] 1 # arrays with only one term (7 different arrays)
[5] 5
[6] 6
[2] 2
[7] 7
[3] 3
[4] 4
[1, 5] 6 # arrays with two terms (42 different arrays)
[1, 6] 7
[1, 2] 3
[1, 7] 8
[1, 3] 4
[1, 4] 5
..... ...
[1, 5, 6] 12 # arrays with three terms(210 different arrays)
[1, 5, 2] 8
[1, 5, 7] 13
[1, 5, 3] 9
........ ...
This is what I have so far; it gets the job done, but my main bottleneck is the summing: from 8 digits to 9 digits the running time blows up, since the number of permutations grows factorially. The function below passed two tests, but on the server it times out.
for i in range(1, limit + 1):
    for j in permutations(digits, i):
        grandSum += sum(j)
And below are the results from cProfile runs for 7, 8, and 9 digits:
13778 function calls in 0.011 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.011 0.011 <string>:1(<module>)
7 0.000 0.000 0.000 0.000 GreatTotalAdditions.py:25(removefirstindex)
1 0.006 0.006 0.011 0.011 GreatTotalAdditions.py:31(gta)
1 0.000 0.000 0.011 0.011 {built-in method builtins.exec}
21 0.000 0.000 0.000 0.000 {built-in method builtins.len}
8 0.000 0.000 0.000 0.000 {built-in method builtins.print}
13699 0.004 0.000 0.004 0.000 {built-in method builtins.sum}
25 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
7 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
7 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
109701 function calls in 0.086 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.085 0.085 <string>:1(<module>)
11 0.000 0.000 0.000 0.000 GreatTotalAdditions.py:25(removefirstindex)
1 0.047 0.047 0.085 0.085 GreatTotalAdditions.py:31(gta)
1 0.000 0.000 0.086 0.086 {built-in method builtins.exec}
27 0.000 0.000 0.000 0.000 {built-in method builtins.len}
9 0.000 0.000 0.000 0.000 {built-in method builtins.print}
109600 0.038 0.000 0.038 0.000 {built-in method builtins.sum}
28 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
11 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
11 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
986553 function calls in 0.701 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.701 0.701 <string>:1(<module>)
16 0.000 0.000 0.000 0.000 GreatTotalAdditions.py:25(removefirstindex)
1 0.374 0.374 0.701 0.701 GreatTotalAdditions.py:31(gta)
1 0.000 0.000 0.701 0.701 {built-in method builtins.exec}
34 0.000 0.000 0.000 0.000 {built-in method builtins.len}
10 0.000 0.000 0.000 0.000 {built-in method builtins.print}
986409 0.327 0.000 0.327 0.000 {built-in method builtins.sum}
48 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
16 0.000 0.000 0.000 0.000 {method 'insert' of 'list' objects}
16 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
Below is where I would like some help.
Knowing that the sum is taking most of the computation time, I came across this math formula; for example, for a 3-digit number:
(3-1)! * (1 + 2 + 3) * 111
And my implementation is below; bear in mind I'm new to Python:
def sum_them_quick(digits):
    ones = ""
    digitsSum = 0
    # if 3 digits then form 111
    for i in range(len(digits)):
        ones += "1"
    for j in digits:
        digitsSum = digitsSum + j
    return factorial(len(digits) - 1) * digitsSum * int(ones)
However, the results from this last function are wrong and don't match the good results from the passed tests. I'm not entirely sure whether the number of permutations it assumes differs from what itertools' permutations method produces, but I'll confirm tomorrow.
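Note that the brute force sums the digit sums of the permutations (sum(j)), not their concatenated numeric values, so the repunit factor (111) in the formula above does not apply to it. A closed form that matches the brute force instead counts digit occurrences: among the length-k permutations of n distinct digits, each digit appears k * (n-1)!/(n-k)! times in total. A minimal sketch (the function name is mine):
from math import factorial

def permutation_digit_sum(digits):
    # Closed form for: sum(sum(p) for k in 1..n for p in permutations(digits, k)).
    # Each digit occupies each of the k positions in P(n-1, k-1) = (n-1)!/(n-k)!
    # permutations, hence k * (n-1)!/(n-k)! occurrences per digit in total.
    n = len(digits)
    s = sum(digits)
    return sum(k * factorial(n - 1) // factorial(n - k) * s
               for k in range(1, n + 1))
For digits (1, 2) this gives 9, matching the brute force over [1], [2], [1, 2], [2, 1].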

Improve performance of MongoDB client (sockets)

I am using Python 2.7 (Anaconda distribution) on Windows 8.1 Pro.
I have a database of articles with their respective topics.
I am building an application which queries textual phrases in my database and associates article topics to each queried phrase. The topics are assigned based on the relevance of the phrase for the article.
The bottleneck seems to be Python socket communication with the localhost.
Here are my cProfile outputs:
topics_fit (PhraseVectorizer_1_1.py:668)
function called 1 times
1930698 function calls (1929630 primitive calls) in 148.209 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 286 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.224 1.224 148.209 148.209 PhraseVectorizer_1_1.py:668(topics_fit)
206272 0.193 0.000 146.780 0.001 cursor.py:1041(next)
601 0.189 0.000 146.455 0.244 cursor.py:944(_refresh)
534 0.030 0.000 146.263 0.274 cursor.py:796(__send_message)
534 0.009 0.000 141.532 0.265 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 141.484 0.265 mongo_client.py:768(_reset_on_error)
534 0.019 0.000 141.482 0.265 server.py:69(send_message_with_response)
534 0.002 0.000 141.364 0.265 pool.py:225(receive_message)
535 0.083 0.000 141.362 0.264 network.py:106(receive_message)
1070 1.202 0.001 141.278 0.132 network.py:127(_receive_data_on_socket)
3340 140.074 0.042 140.074 0.042 {method 'recv' of '_socket.socket' objects}
535 0.778 0.001 4.700 0.009 helpers.py:88(_unpack_response)
535 3.828 0.007 3.920 0.007 {bson._cbson.decode_all}
67 0.099 0.001 0.196 0.003 {method 'sort' of 'list' objects}
206187 0.096 0.000 0.096 0.000 PhraseVectorizer_1_1.py:705(<lambda>)
206187 0.096 0.000 0.096 0.000 database.py:339(_fix_outgoing)
206187 0.074 0.000 0.092 0.000 objectid.py:68(__init__)
1068 0.005 0.000 0.054 0.000 server.py:135(get_socket)
1068/534 0.010 0.000 0.041 0.000 contextlib.py:21(__exit__)
1068 0.004 0.000 0.041 0.000 pool.py:501(get_socket)
534 0.003 0.000 0.028 0.000 pool.py:208(send_message)
534 0.009 0.000 0.026 0.000 pool.py:573(return_socket)
567 0.001 0.000 0.026 0.000 socket.py:227(meth)
535 0.024 0.000 0.024 0.000 {method 'sendall' of '_socket.socket' objects}
534 0.003 0.000 0.023 0.000 topology.py:134(select_server)
206806 0.020 0.000 0.020 0.000 collection.py:249(database)
418997 0.019 0.000 0.019 0.000 {len}
449 0.001 0.000 0.018 0.000 topology.py:143(select_server_by_address)
534 0.005 0.000 0.018 0.000 topology.py:82(select_servers)
1068/534 0.001 0.000 0.018 0.000 contextlib.py:15(__enter__)
534 0.002 0.000 0.013 0.000 thread_util.py:83(release)
207307 0.010 0.000 0.011 0.000 {isinstance}
534 0.005 0.000 0.011 0.000 pool.py:538(_get_socket_no_auth)
534 0.004 0.000 0.011 0.000 thread_util.py:63(release)
534 0.001 0.000 0.011 0.000 mongo_client.py:673(_get_topology)
535 0.003 0.000 0.010 0.000 topology.py:57(open)
206187 0.008 0.000 0.008 0.000 {method 'popleft' of 'collections.deque' objects}
535 0.002 0.000 0.007 0.000 topology.py:327(_apply_selector)
536 0.003 0.000 0.007 0.000 topology.py:286(_ensure_opened)
1071 0.004 0.000 0.007 0.000 periodic_executor.py:50(open)
In particular: {method 'recv' of '_socket.socket' objects} seems to cause trouble.
According to suggestions found in What can I do to improve socket performance in Python 3?, I tried gevent.
I added this snippet at the beginning of my script (before importing anything):
from gevent import monkey
monkey.patch_all()
This resulted in even slower performance...
*** PROFILER RESULTS ***
topics_fit (PhraseVectorizer_1_1.py:671)
function called 1 times
1956879 function calls (1951292 primitive calls) in 158.260 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 427 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 158.170 158.170 hub.py:358(run)
1 0.000 0.000 158.170 158.170 {method 'run' of 'gevent.core.loop' objects}
2/1 1.286 0.643 158.166 158.166 PhraseVectorizer_1_1.py:671(topics_fit)
206272 0.198 0.000 156.670 0.001 cursor.py:1041(next)
601 0.192 0.000 156.203 0.260 cursor.py:944(_refresh)
534 0.029 0.000 156.008 0.292 cursor.py:796(__send_message)
534 0.012 0.000 150.514 0.282 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 150.439 0.282 mongo_client.py:768(_reset_on_error)
534 0.017 0.000 150.437 0.282 server.py:69(send_message_with_response)
551/535 0.002 0.000 150.316 0.281 pool.py:225(receive_message)
552/536 0.079 0.000 150.314 0.280 network.py:106(receive_message)
1104/1072 0.815 0.001 150.234 0.140 network.py:127(_receive_data_on_socket)
2427/2395 0.019 0.000 149.418 0.062 socket.py:381(recv)
608/592 0.003 0.000 48.541 0.082 socket.py:284(_wait)
552 0.885 0.002 5.464 0.010 helpers.py:88(_unpack_response)
552 4.475 0.008 4.577 0.008 {bson._cbson.decode_all}
3033 2.021 0.001 2.021 0.001 {method 'recv' of '_socket.socket' objects}
7/4 0.000 0.000 0.221 0.055 hub.py:189(_import)
4 0.127 0.032 0.221 0.055 {__import__}
67 0.104 0.002 0.202 0.003 {method 'sort' of 'list' objects}
536/535 0.003 0.000 0.142 0.000 topology.py:57(open)
537/536 0.002 0.000 0.139 0.000 topology.py:286(_ensure_opened)
1072/1071 0.003 0.000 0.138 0.000 periodic_executor.py:50(open)
537/536 0.001 0.000 0.136 0.000 server.py:33(open)
537/536 0.001 0.000 0.135 0.000 monitor.py:69(open)
20/19 0.000 0.000 0.132 0.007 topology.py:342(_update_servers)
4 0.000 0.000 0.131 0.033 hub.py:418(_get_resolver)
1 0.000 0.000 0.122 0.122 resolver_thread.py:13(__init__)
1 0.000 0.000 0.122 0.122 hub.py:433(_get_threadpool)
206187 0.081 0.000 0.101 0.000 objectid.py:68(__init__)
206187 0.100 0.000 0.100 0.000 database.py:339(_fix_outgoing)
206187 0.098 0.000 0.098 0.000 PhraseVectorizer_1_1.py:708(<lambda>)
1 0.073 0.073 0.093 0.093 threadpool.py:2(<module>)
2037 0.003 0.000 0.092 0.000 hub.py:159(get_hub)
2 0.000 0.000 0.090 0.045 thread.py:39(start_new_thread)
2 0.000 0.000 0.090 0.045 greenlet.py:195(spawn)
2 0.000 0.000 0.090 0.045 greenlet.py:74(__init__)
1 0.000 0.000 0.090 0.090 hub.py:259(__init__)
1102 0.004 0.000 0.078 0.000 pool.py:501(get_socket)
1068 0.005 0.000 0.074 0.000 server.py:135(get_socket)
This performance is somewhat unacceptable for my application - I would like it to be much faster (this is timed and profiled for a subset of ~20 documents, and I need to process a few tens of thousands).
Any ideas on how to speed it up?
Much appreciated.
Edit:
Code snippet that I profiled:
# also tried monkey patching all here, see profiler
from pymongo import MongoClient

def topics_fit(self):
    client = MongoClient()
    # tried motor for multithreading - also slow
    #client = motor.motor_tornado.MotorClient()
    # initialize DB cursors
    db_wiki = client.wiki
    # initialize topic feature dictionary
    self.topics = OrderedDict()
    self.topic_mapping = OrderedDict()
    vocabulary_keys = self.vocabulary.keys()
    num_categories = 0
    for phrase in vocabulary_keys:
        phrase_tokens = phrase.split()
        if len(phrase_tokens) > 1:
            # query for current phrase
            AND_phrase = "\"" + phrase + "\""
            cursor = db_wiki.categories.find({ "$text" : { "$search": AND_phrase } }, { "score": { "$meta": "textScore" } })
            cursor = list(cursor)
            if cursor:
                cursor.sort(key=lambda k: k["score"], reverse=True)
                added_categories = cursor[0]["category_ids"]
                for added_category in added_categories:
                    if not (added_category in self.topics):
                        self.topics[added_category] = num_categories
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [num_categories, ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(num_categories)
                        num_categories += 1
                    else:
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [self.topics[added_category], ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(self.topics[added_category])
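Given that almost all of the time in the profile is spent in socket recv, the most direct lever is to transfer less data per query: project only the fields the loop actually uses, and let the server sort by text score and return just the top document instead of materializing the whole result with list(cursor). A hedged sketch of how the query block above might look (same collection and field names as in the snippet; untested against this schema):
cursor = db_wiki.categories.find(
    {"$text": {"$search": AND_phrase}},
    {"score": {"$meta": "textScore"}, "category_ids": 1},  # fetch only what is used
).sort([("score", {"$meta": "textScore"})]).limit(1)       # server-side top-1
top = next(cursor, None)
if top is not None:
    added_categories = top["category_ids"]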
Edit 2: output of index_information():
{u'_id_':
{u'ns': u'wiki.categories', u'key': [(u'_id', 1)], u'v': 1},
u'article_title_text_article_body_text_category_names_text': {u'default_language': u'english', u'weights': SON([(u'article_body', 1), (u'article_title', 1), (u'category_names', 1)]), u'key': [(u'_fts', u'text'), (u'_ftsx', 1)], u'v': 1, u'language_override': u'language', u'ns': u'wiki.categories', u'textIndexVersion': 2}}

Characters in Strings in Python

Implement a function with signature find_chars(string1, string2) that
takes two strings and returns a string that contains only the
characters found in string1 and string2, in the order that they are
found in string1. Implement a version of order N*N and one of order N.
(Source: http://thereq.com/q/138/python-software-interview-question/characters-in-strings)
Here are my solutions:
Order N*N:
def find_chars_slow(string1, string2):
    res = []
    for char in string1:
        if char in string2:
            res.append(char)
    return ''.join(res)
So the for loop goes through N elements, and each char-in-string2 check is another N operations, which gives N*N.
Order N:
from collections import defaultdict

def find_char_fast(string1, string2):
    d = defaultdict(int)
    for char in string2:
        d[char] += 1
    res = []
    for char in string1:
        if char in d:
            res.append(char)
    return ''.join(res)
First store the characters of string2 as a dictionary (O(N)). Then scan string1 (O(N)) and check if it is in the dict (O(1)). This gives a total runtime of O(2N) = O(N).
Is the above correct? Is there a faster method?
Your solution is algorithmically correct (the first is O(n**2), and the second is O(n)) but you're doing some things that are going to be possible red flags to an interviewer.
The first function is basically okay. You might get bonus points for writing it like this:
''.join([c for c in string1 if c in string2])
...which does essentially the same thing.
My problem (if I'm wearing my interviewer pants) with how you've written the second function is that you use a defaultdict where you don't care at all about the count - you only care about membership. This is the classic case for when to use a set.
seen = set(string2)
''.join([c for c in string1 if c in seen])
The way I've written these functions, they are going to be slightly faster than what you wrote, since list comprehensions loop in native code rather than in Python bytecode. They are of the same algorithmic complexity.
Your method is sound, and there is no method with time complexity less than O(N), since you obviously need to go through each character at least once.
That is not to say that there's no method that runs faster. There's no need to actually increment the numbers in the dictionary; you could, for example, use a set. You could also make further use of Python's features, such as list comprehensions/generators:
def find_char_fast2(string1, string2):
    s = set(string2)
    return "".join(x for x in string1 if x in s)
The algorithms you have used are perfectly fine. There are a few improvements you can make:
Since you are converting the second string to a dictionary, I would recommend using a set instead, like this:
d = set(string2)
Apart from that, you can use a list comprehension as a filter, like this:
return "".join([char for char in string1 if char in d])
If the order of the characters in the output doesn't matter, you can simply convert both strings to sets and take the set intersection (note that this also collapses duplicate characters), like this:
return "".join(set(string1) & set(string2))
I tried profiling the various solutions given here.
In my snippet, I am using a module called faker to generate fake words, so I can test on very long strings of more than 20k characters.
Snippet:
from faker import Faker
from timeit import Timer
from collections import defaultdict

def first(string1, string2):
    sets = set(string2)
    return ''.join(c for c in string1 if c in sets)

def second(string1, string2):  # originally took (s1, s2) but used the globals
    res = []
    for char in string1:
        if char in string2:
            res.append(char)
    return ''.join(res)

def third(string1, string2):  # same parameter fix as above
    d = defaultdict(int)
    for char in string2:
        d[char] += 1
    res = []
    for char in string1:
        if char in d:
            res.append(char)
    return ''.join(res)

f = Faker()
string1 = ''.join(f.paragraph(nb_sentences=10000).split())
string2 = ''.join(f.paragraph(nb_sentences=10000).split())
funcs = [first, second, third]

import cProfile
print 'Length of String1: ', len(string1)
print 'Length of String2: ', len(string2)
print 'Time taken to execute:'
for f in funcs:
    t = Timer(lambda: f(string1, string2))
    print f.__name__, cProfile.run('t.timeit(number=100)')
Output:
Length of String1: 525133
Length of String2: 501050
Time taken to execute:
first 52513711 function calls in 18.169 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
100 0.001 0.000 18.164 0.182 s.py:39(<lambda>)
100 1.723 0.017 18.163 0.182 s.py:5(first)
52513400 9.442 0.000 9.442 0.000 s.py:7(<genexpr>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 18.169 18.169 timeit.py:178(timeit)
1 0.005 0.005 18.169 18.169 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 6.998 0.070 16.440 0.164 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
second 52513611 function calls in 22.280 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 22.280 22.280 <string>:1(<module>)
100 0.121 0.001 22.275 0.223 s.py:39(<lambda>)
100 16.957 0.170 22.153 0.222 s.py:9(second)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 22.280 22.280 timeit.py:178(timeit)
1 0.005 0.005 22.280 22.280 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
52513300 4.018 0.000 4.018 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 1.179 0.012 1.179 0.012 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
third 52513611 function calls in 28.184 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 28.184 28.184 <string>:1(<module>)
100 22.847 0.228 28.059 0.281 s.py:16(third)
100 0.120 0.001 28.179 0.282 s.py:39(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 28.184 28.184 timeit.py:178(timeit)
1 0.005 0.005 28.184 28.184 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
52513300 4.032 0.000 4.032 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 1.180 0.012 1.180 0.012 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
Conclusion:
So the first function, with the comprehension, is the fastest.
But when string1 is around 25K characters, the second function wins:
Length of String1: 22959
Length of String2: 452919
Time taken to execute:
first 2296311 function calls in 2.216 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
100 0.000 0.000 2.216 0.022 s.py:39(<lambda>)
100 1.530 0.015 2.216 0.022 s.py:5(first)
2296000 0.402 0.000 0.402 0.000 s.py:7(<genexpr>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 2.216 2.216 timeit.py:178(timeit)
1 0.000 0.000 2.216 2.216 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 0.284 0.003 0.686 0.007 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
second 2296211 function calls in 0.939 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.939 0.939 <string>:1(<module>)
100 0.003 0.000 0.939 0.009 s.py:39(<lambda>)
100 0.729 0.007 0.936 0.009 s.py:9(second)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.939 0.939 timeit.py:178(timeit)
1 0.000 0.000 0.939 0.939 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
2295900 0.165 0.000 0.165 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 0.042 0.000 0.042 0.000 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
third 2296211 function calls in 8.361 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 8.361 8.361 <string>:1(<module>)
100 8.145 0.081 8.357 0.084 s.py:16(third)
100 0.004 0.000 8.361 0.084 s.py:39(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 8.361 8.361 timeit.py:178(timeit)
1 0.000 0.000 8.361 8.361 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
2295900 0.169 0.000 0.169 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
100 0.043 0.000 0.043 0.000 {method 'join' of 'str' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None

Testing whether a tuple has all distinct elements

I was looking for a way to test whether a tuple has all distinct elements - that is, whether it is effectively a set - and ended up with this quick and dirty solution.
def distinct(tup):
    n = 0
    for t in tup:
        for k in tup:
            #print t, k, n
            if t == k:
                n = n + 1
    if n != len(tup):
        return False
    else:
        return True

print distinct((1,3,2,10))
print distinct((3,3,4,2,7))
Any error in my thinking? Is there a builtin for this on tuples?
You can very easily do it as:
len(set(tup))==len(tup)
This creates a set of tup and checks whether it is the same length as the original tup. The only case in which they would have the same length is if all elements in tup were unique.
Examples
>>> a = (1,2,3)
>>> print len(set(a))==len(a)
True
>>> b = (1,2,2)
>>> print len(set(b))==len(b)
False
>>> c = (1,2,3,4,5,6,7,8,5)
>>> print len(set(c))==len(c)
False
In the majority of the cases, where all of the items in the tuple are hashable and support comparison (using the == operator) with each other, @sshashank124's solution is what you're after:
len(set(tup))==len(tup)
For the example you posted, i.e. a tuple of ints, that would do.
Else, if the items are not hashable but do have an order defined on them (support the '==', '<' operators, etc.), the best you can do is sort them (O(N log N) worst case) and then look for adjacent equal elements:
sorted_tup = sorted(tup)
all(x != y for x, y in zip(sorted_tup[:-1], sorted_tup[1:]))
Else, if the items only support equality comparison (==), the best you can do is the O(N^2) worst case algorithm, i.e. comparing every item to every other.
I would implement it this way, using itertools.combinations:
def distinct(tup):
    for x, y in itertools.combinations(tup, 2):
        if x == y:
            return False
    return True
Or as a one-liner:
all(x != y for x, y in itertools.combinations(tup, 2))
What about using an early exit?
This will be faster:
def distinct(tup):
    s = set()
    for x in tup:
        if x in s:
            return False
        s.add(x)
    return True
OK, here is the test and profiling with 1000 ints:
#!/usr/bin/python
def distinct1(tup):
    n = 0
    for t in tup:
        for k in tup:
            if t == k:
                n = n + 1
    if n != len(tup):
        return False
    else:
        return True

def distinct2(tup):
    return len(set(tup)) == len(tup)

def distinct3(tup):
    s = set()
    for x in tup:
        if x in s:
            return False
        s.add(x)
    return True

import cProfile
from faker import Faker
from timeit import Timer

s = Faker()
func = [distinct1, distinct2, distinct3]
tuples = tuple([s.random_int(min=1, max=9999) for x in range(1000)])
for fun in func:
    t = Timer(lambda: fun(tuples))
    print fun.__name__, cProfile.run('t.timeit(number=1000)')
Output: distinct2 and distinct3 are almost the same.
distinct1 3011 function calls in 60.289 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 60.289 60.289 <string>:1(<module>)
1000 60.287 0.060 60.287 0.060 compare_tuple.py:3(distinct1)
1000 0.001 0.000 60.288 0.060 compare_tuple.py:34(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 60.289 60.289 timeit.py:178(timeit)
1 0.001 0.001 60.289 60.289 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
1000 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
distinct2 4011 function calls in 0.053 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.053 0.053 <string>:1(<module>)
1000 0.052 0.000 0.052 0.000 compare_tuple.py:14(distinct2)
1000 0.000 0.000 0.053 0.000 compare_tuple.py:34(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.053 0.053 timeit.py:178(timeit)
1 0.000 0.000 0.053 0.053 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
2000 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
distinct3 183011 function calls in 0.072 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.072 0.072 <string>:1(<module>)
1000 0.051 0.000 0.070 0.000 compare_tuple.py:17(distinct3)
1000 0.002 0.000 0.072 0.000 compare_tuple.py:34(<lambda>)
1 0.000 0.000 0.000 0.000 timeit.py:143(setup)
1 0.000 0.000 0.072 0.072 timeit.py:178(timeit)
1 0.000 0.000 0.072 0.072 timeit.py:96(inner)
1 0.000 0.000 0.000 0.000 {gc.disable}
1 0.000 0.000 0.000 0.000 {gc.enable}
1 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
181000 0.019 0.000 0.019 0.000 {method 'add' of 'set' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {time.time}
None
a = (1,2,3,4,5,6,7,8,9)
b = (1,2,3,4,5,6,6,7,8,9)

def unique(tup):
    i = 0
    while i < len(tup):
        if tup[i] in tup[i+1:]:
            return False
        i = i + 1
    return True

unique(a)
True
unique(b)
False
Readable, and it works even with unhashable items, since membership in a tuple slice only needs == comparisons. People seem averse to "while". This approach also allows for "special cases" and filtering on attributes, as sketched below.
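For instance, a hypothetical variant (the key parameter is my addition, not part of the original post) that tests distinctness on a derived attribute with the same early-exit behavior:
def unique_by(tup, key=lambda x: x):
    # early-exit distinctness test on a derived attribute;
    # requires the derived keys (not the items) to be hashable
    seen = set()
    for item in tup:
        k = key(item)
        if k in seen:
            return False
        seen.add(k)
    return True

print unique_by((("a", 1), ("b", 2)), key=lambda t: t[0])  # True
print unique_by((("a", 1), ("a", 2)), key=lambda t: t[0])  # False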

Slow conversion of binary data

I have a binary file with a particular format, described here for those who are interested. The format isn't the important thing. I can read and convert this data into the form that I want, but the problem is that these binary files tend to have a lot of information in them. If I just return the bytes as read, it is very quick (less than 1 second), but I can't do anything useful with those bytes; they need to be converted into genotypes first, and that is the code that appears to be slowing things down.
The conversion of a series of bytes into genotypes is as follows:
h = ['%02x' % ord(b) for b in currBytes]
b = ''.join([bin(int(i, 16))[2:].zfill(8)[::-1] for i in h])[:nBits]
genotypes = [b[i:i+2] for i in range(0, len(b), 2)]
map = {'00': 0, '01': 1, '11': 2, '10': None}
return [map[i] for i in genotypes]
What I am hoping is that there is a faster way to do this. Any ideas? Below are the results of running python -m cProfile test.py, where test.py calls a reader object I have written to read these files.
vlan1711:src davykavanagh$ python -m cProfile test.py
183, 593483, 108607389, 366, 368, 46
that took 93.6410450935
86649088 function calls in 96.396 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.248 1.248 2.753 2.753 plinkReader.py:13(__init__)
1 0.000 0.000 0.000 0.000 plinkReader.py:47(plinkReader)
1 0.000 0.000 0.000 0.000 plinkReader.py:48(__init__)
1 0.000 0.000 0.000 0.000 plinkReader.py:5(<module>)
1 0.000 0.000 0.000 0.000 plinkReader.py:55(__iter__)
593484 77.634 0.000 91.477 0.000 plinkReader.py:58(next)
1 0.000 0.000 0.000 0.000 plinkReader.py:71(SNP)
593483 1.123 0.000 1.504 0.000 plinkReader.py:75(__init__)
1 0.000 0.000 0.000 0.000 plinkReader.py:8(plinkFiles)
1 0.000 0.000 0.000 0.000 plinkReader.py:85(Person)
183 0.000 0.000 0.001 0.000 plinkReader.py:89(__init__)
1 2.166 2.166 96.396 96.396 test.py:5(<module>)
27300218 5.909 0.000 5.909 0.000 {bin}
593483 0.080 0.000 0.080 0.000 {len}
1 0.000 0.000 0.000 0.000 {math.ceil}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
593483 0.531 0.000 0.531 0.000 {method 'join' of 'str' objects}
593485 0.588 0.000 0.588 0.000 {method 'read' of 'file' objects}
593666 0.257 0.000 0.257 0.000 {method 'rsplit' of 'str' objects}
593666 0.125 0.000 0.125 0.000 {method 'rstrip' of 'str' objects}
27300218 4.098 0.000 4.098 0.000 {method 'zfill' of 'str' objects}
3 0.000 0.000 0.000 0.000 {open}
27300218 1.820 0.000 1.820 0.000 {ord}
593483 0.817 0.000 0.817 0.000 {range}
2 0.000 0.000 0.000 0.000 {time.time}
You are slowing things down by creating lists and large strings you don't need. You are just examining bits of the bytes and converting two-bit groups into numbers. That can be achieved much more simply, e.g. by this code:
def convert(currBytes, nBits):
    for byte in currBytes:
        for p in range(4):
            bits = (ord(byte) >> (p*2)) & 3
            yield None if bits == 1 else 1 if bits == 2 else 2 if bits == 3 else 0
            nBits -= 2
            if nBits <= 0:
                raise StopIteration()
In case you really need a list in the end, just use
list(convert(currBytes, nBits))
But I guess there can be cases in which you just want to iterate over the results:
for blurp in convert(currBytes, nBits):
    pass  # handle your blurp (0, 1, 2, or None)
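If that is still too slow, a lookup-table approach in NumPy should cut out the per-byte Python loop entirely. A sketch under the same bit layout as the generator above (the names are mine, and -1 stands in for None because NumPy integer arrays cannot hold None):
import numpy as np

# Precompute, once, the four genotype codes packed into each possible byte
# value (two bits per genotype, least-significant pair first), using the
# same bits -> code mapping as convert() above.
LUT = np.empty((256, 4), dtype=np.int8)
for byte in range(256):
    for p in range(4):
        bits = (byte >> (p * 2)) & 3
        LUT[byte, p] = {0: 0, 2: 1, 3: 2, 1: -1}[bits]

def convert_np(currBytes, nBits):
    raw = np.frombuffer(currBytes, dtype=np.uint8)
    return LUT[raw].ravel()[:nBits // 2]  # one int8 code per genotype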
