Best way in Python to determine all possible intersections in a matrix?

Suppose I have a matrix (list of lists) where each column represents a unique word, each row represents a distinct document, and every entry is a 1 or 0, indicating whether or not the word for a given column exists in the document for a given row.
What I'd like to know is how to determine all the possible combinations of words and documents where more than one word is in common with more than one document. The result might look something like:
[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']],
[[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
...
and so on for each possible combination. I'd love both a solution that provides the complete set of combinations and one that yields only the combinations that are not a subset of another; so, from the example, not [[Docid_3, Docid_5], ['word1', 'word17']], since it's a complete subset of the first example.
I feel like there is an elegant solution that just isn't coming to mind and the beer isn't helping.
Thanks.

Normalize the text. You only want strings made of string.lowercase. Split/strip on everything else.
Make sets out of this.
Use something like this to get all possible groupings of all sizes:
import itertools

def get_all_lengths_combinations_of(elements):
    for no_of_items in range(2, len(elements) + 1):
        for items in itertools.combinations(elements, no_of_items):
            yield items
I'm sure the real itertools wizards will come up with something better, possibly involving izip().
Remember you should be able to use the set.intersection() method like this:
set.intersection(*list_of_sets_to_intersect)
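Putting those pieces together, a minimal sketch (the doc_sets dict is made-up input, standing in for your normalized word sets per document; filtering out groupings that are subsets of larger ones would be a separate pass afterwards):

import itertools

# Hypothetical input: one set of normalized words per document id.
doc_sets = {
    'Docid_3': {'word1', 'word17', 'word23'},
    'Docid_5': {'word1', 'word17', 'word23', 'word99'},
    'Docid_9': {'word2', 'word23'},
}

docids = sorted(doc_sets)
for size in range(2, len(docids) + 1):
    for group in itertools.combinations(docids, size):
        # Intersect the word sets of every document in the group.
        common = set.intersection(*(doc_sets[d] for d in group))
        if len(common) >= 2:
            print(list(group), sorted(common))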

First, build a mapping from document ID to set of words -- your matrix of 0 and 1 is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus presumably the first item) and the "row headings" (docids) are the first items of each row (minus presumably the first row). Then (assuming Python 2.6 or better):
def makemap(matrix):
    im = iter(matrix)
    words = next(im)[1:]
    themap = {}
    for row in im:
        mapent = set()
        docid = row[0]
        for w, bit in zip(words, row[1:]):
            try:
                if bit: mapent.add(w)
            except:
                print 'w is %r' % (w,)
                raise
        themap[docid] = mapent
    return themap
Now you need to check all feasible subsets of documents -- the total number of subsets is huge so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will not perform any pruning of course.
I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this of course) and make the first batch (those pairs which have 2+ words in common) of "feasible 2-length subsets". That can go in another mapping with tuples or frozensets of docids as the keys.
Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking the total intersection is still 2+ long of course). This at least does some pruning rather than blindly trying all the 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).
It might perhaps be possible to do even better by recording all docids that proved unable to extend a certain N-length subset -- no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.
There may be further tweaks you could make for extra pruning, but offhand none immediately comes to mind, so that's where I'd start from. (For readability, I'm not bothering below with micro-optimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc. -- they're probably marginal given those sequences are all O(N) vs the exponential size of the computed structures, although of course worth trying in the tuning/optimizing phase.)
def allfeasiblesubsets(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    dont_bother = dict((d, set([d])) for d in docids)
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
        else:
            dont_bother[d1].add(d2)
            dont_bother[d2].add(d1)
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            for d, w in mapping.items():
                if d <= md or any(d in dont_bother[d1] for d1 in ds):
                    continue
                commonw = w.intersection(ws)
                if len(commonw) >= 2:
                    feasible_Np1[ds+(d,)] = commonw
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]
You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and others -- this may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but is not as powerful in pruning as the harder "full" alternative. I'm also trying to keep all keys in the feasible* dicts as sorted tuples of docids (as the itertools.combinations originally provides for the pairs) to avoid duplications and therefore redundant work.
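For what it's worth, here is a rough, untested sketch of how that harder per-subset pruning might look (my guess at the idea, not part of the code above; it reuses makemap from earlier and carries, alongside each feasible subset, the docids already known to be unable to extend it, handing that set down to the subsets derived from it):

import itertools

def allfeasiblesubsets_fullprune(matrix):
    # Same result shape as allfeasiblesubsets above, but with per-subset pruning.
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    level = {}
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1] & mapping[d2]
        if len(commonw) >= 2:
            level[d1, d2] = (commonw, set())   # (common words, known dead-end docids)
    all_feasible = []
    while level:
        all_feasible.append(dict((ds, ws) for ds, (ws, dead) in level.items()))
        nextlevel = {}
        for ds, (ws, dead) in level.items():
            md = max(ds)
            for d in docids:
                if d <= md or d in dead:
                    continue
                commonw = ws & mapping[d]
                if len(commonw) >= 2:
                    # children inherit every docid already known to be hopeless
                    nextlevel[ds + (d,)] = (commonw, set(dead))
                else:
                    # d can't extend ds, so it can't extend any superset of ds either
                    dead.add(d)
        level = nextlevel
    return all_feasible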
Here's the small example I've tried (in the same file as these functions after, of course, the import for itertools and collections):
mat = [['doc'] + 'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
       ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
       ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
       ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs
The results, which appear OK, are:
{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]
but of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for various increasing lengths, each dict having the ordered tuple forms of the docids-sets as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
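For example, a quick post-processing pass into the nested-list shape from the question, optionally dropping any grouping whose docids and words are both contained in a bigger one, might look like this (my own sketch, not part of the code above):

def as_nested_lists(all_feasible, prune_subsets=True):
    # Flatten the list-of-dicts result into [[docids], [words]] pairs.
    flat = [(set(ds), ws) for level in all_feasible for ds, ws in level.items()]
    out = []
    for ds, ws in flat:
        dominated = any(ds < ds2 and ws <= ws2 for ds2, ws2 in flat)
        if prune_subsets and dominated:
            continue
        out.append([sorted(ds), sorted(ws)])
    return out

# e.g. as_nested_lists(allfeasiblesubsets(mat))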
(Not that it matters, but the text I'm using in the example is an old Italian proverb;-).

Take a look at the SO question "What tried-and-true algorithms for suggesting related articles are out there?" (https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there).
For real problem sizes, say > 100 docs and 10000 words, get the nice bitarray module (which says, by the way, "the same algorithm in Python ... is about 20 times slower than in C").
On "only the combinations that are not a subset of another":
define a hit22 as a 2x2 submatrix of all ones (2 docs, 2 words in common), a hit23 as a 2x3 submatrix of all ones (2 docs, 3 words in common), and so on.
A given hit22 may be part of many hit2n's (2 docs, n words) and also of many hitn2's (n docs, 2 words). Looks fun.
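For instance, checking a single doc pair for a hit2n is just one AND plus a popcount; a tiny sketch with plain Python ints as the bit rows (toy data made up here), before moving to bitarray for real sizes:

from itertools import combinations

# Each row is a doc as a bitmask over the word columns (toy data).
rows = [0b1011010, 0b1011001, 0b0111000]

def common_words(a, b):
    return bin(a & b).count("1")

# every doc pair that forms at least a hit22 (>= 2 words in common)
hits = [(j, k, common_words(rows[j], rows[k]))
        for j, k in combinations(range(len(rows)), 2)
        if common_words(rows[j], rows[k]) >= 2]
print(hits)   # [(0, 1, 3), (0, 2, 2), (1, 2, 2)]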
Added Monday 14Jun: little functions using bitarray.
(Intro / python modules for real doc classification ? Dunno.)
""" docs-words with bitarray, randombits """
# google "document classification" (tutorial | python) ...
# https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there
from __future__ import division
import random
import sys
from bitarray import bitarray # http://pypi.python.org/pypi/bitarray
__date__ = "14jun 2010 denis"
ndoc = 100
nbits = 1000
exec "\n".join( sys.argv[1:] ) # run this.py ndoc= ...
random.seed(1)
me = __file__.split('/') [-1]
print "%s ndoc=%d nbits=%d" % (me, ndoc, nbits)
# bitarray stuff --
def bitslist( bits ):
""" 011001 -> [1,2,5] """
return [ j for j in range(len(bits)) if bits[j] ]
hex_01 = {
"0": "0000", "1": "0001", "2": "0010", "3": "0011",
"4": "0100", "5": "0101", "6": "0110", "7": "0111",
"8": "1000", "9": "1001", "a": "1010", "b": "1011",
"c": "1100", "d": "1101", "e": "1110", "f": "1111",
}
def to01( x, len_ ):
x = "%x" % x
s = "".join( hex_01[c] for c in x )
return (len_ - len(s)) * "0" + s
def randombits( nbits ):
""" -> bitarray 1/16 1, 15/16 0 """
hibit = 1 << (nbits - 1)
r = (random.randint( 0, hibit - 1 )
& random.randint( 0, hibit - 1 )
& random.randint( 0, hibit - 1 )
& random.randint( 0, hibit - 1 )) # prob 1/16
return bitarray( to01( r, nbits ))
#...............................................................................
doc = [ randombits(nbits) for j in range(ndoc) ] # ndoc x nbits
def mostsimilarpair():
""" -> (sim, j, k) most similar pair of docs """
mostsim = (-1,-1,-1)
for j in range(ndoc):
for k in range(j+1, ndoc):
# allpairs[j,k] -> scipy.cluster.hier ?
sim = (doc[j] & doc[k]).count() # nr bits (words) in common, crude
mostsim = max( mostsim, (sim,j,k) )
return mostsim
sim, jdoc, kdoc = mostsimilarpair()
print "The 2 most similar docs:" ,
print "doc %d has %d words," % ( jdoc, doc[jdoc].count() ) ,
print "doc %d has %d," % ( kdoc, doc[kdoc].count() )
print "%d words in common: %s" % ( sim, bitslist( doc[jdoc] & doc[kdoc] ))
print ""
#...............................................................................
def docslike( jdoc, thresh ):
""" -> (doc index, sim >= thresh) ... """
for j in range(ndoc):
if j == jdoc: continue
sim = (doc[j] & doc[jdoc]).count()
if sim >= thresh:
yield (j, sim)
thresh = sim // 2
print "Docs like %d, with >= %d words in common:" % (jdoc, thresh)
for j, sim in docslike( jdoc, thresh ):
print "%3d %s" % ( j, bitslist( doc[j] & doc[jdoc] ))
"""
The 2 most similar docs: doc 72 has 66 words, doc 84 has 60,
12 words in common: [11, 51, 119, 154, 162, 438, 592, 696, 800, 858, 860, 872]
Docs like 72, with >= 6 words in common:
2 [3, 171, 258, 309, 592, 962]
...
"""

How many documents? How many unique words? How much RAM do you have?
What do you want to produce in the following scenario: document A has words 1, 2, 3; B has 1, 2; C has 2, 3

Related

How to find the maximum probability of satisfying the conditions in all combinations of arrays

For example, I have a list of tokens, and each token's number of characters (its length) is
length = [2, 1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 2]
and here is the list of each token's probability of [not inserting a linefeed, inserting a linefeed] after the token:
prob = [[9.9978e-01, 2.2339e-04], [9.9995e-01, 4.9344e-05], [0.9469, 0.0531],
[9.9994e-01, 5.8422e-05], [0.9964, 0.0036], [9.9991e-01, 9.4295e-05],
[9.9980e-01, 1.9620e-04], [1.0000e+00, 5.2492e-08], [9.9998e-01, 1.8293e-05],
[9.9999e-01, 5.1220e-06], [1.0000e+00, 3.9795e-06], [0.0142, 0.9858]]
and the result for these probabilities is
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
which means inserting a linefeed after the last token.
The whole length of this line is 21, and I would like to have a maximum of 20 characters per line.
In that case, I have to insert one (in this example, maybe more in other situations) more linefeed to make sure every line has 20 characters at most.
In this example, the best answer is
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
since the 3rd token gets the highest probability of inserting a linefeed.
My thought is to enumerate all combinations of these probabilities (multiplying them instead of adding). I have 12 tokens in this example, and each token gets its own 0/1 classification probability, so there are 2^12 combinations. I use a binary sequence to record each situation (since it's a 0/1 classification problem) and store them in a dictionary in the format {binary sequence: combined probability}.
nums = 12            # number of tokens
num = 1
for i in range(nums):
    num *= 2         # num = 2 ** nums combinations

dic = {}
for i in range(num):
    str1 = bin(i)[2:].zfill(nums)   # binary sequence for this combination
    probb = 1
    for k in range(len(str1)):
        x = str1[k]
        if int(x) == 0:  # [0, 1]
            probb *= prob_task2[k][0]   # prob_task2 is the prob list shown above
        else:
            probb *= prob_task2[k][1]
    dic[str1] = probb
Then I want to sort all the combinations and search the possible results from high probability to low.
I use two loops to build all the combinations, and another two loops to search them from top to bottom until the character restriction is met. But I ran into efficiency trouble: once there are 40 tokens, I have to evaluate 2^40 combinations.
I am not good at algorithms, so I want to ask: is there an efficient way to solve this problem?
To rephrase, you have a list of tokens of given lengths, each with an independent probability of being followed by a line break, and you want to find the maximum-likelihood outcome whose longest line doesn't exceed the given maximum.
There is an efficient dynamic program (O(n L), where n is the number of tokens and L is the line length). The idea is that we can prevent the search tree from blowing up exponentially by pruning the less likely possibilities that have the same current line length. In Python:
import collections
import math

length = [2, 1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 2]
prob = [
    [9.9978e-1, 2.2339e-4],
    [9.9995e-1, 4.9344e-5],
    [0.9469, 0.0531],
    [9.9994e-1, 5.8422e-5],
    [0.9964, 0.0036],
    [9.9991e-1, 9.4295e-5],
    [9.998e-1, 1.962e-4],
    [1.0e0, 5.2492e-8],
    [9.9998e-1, 1.8293e-5],
    [9.9999e-1, 5.122e-6],
    [1.0e0, 3.9795e-6],
    [0.0142, 0.9858],
]
max_line_length = 20

line_length_to_best = {length[0]: (0, [])}
for i, (p_no_break, p_break) in enumerate(prob[:-1]):
    line_length_to_options = collections.defaultdict(list)
    for line_length, (likelihood, breaks) in line_length_to_best.items():
        length_without_break = line_length + length[i + 1]
        if length_without_break <= max_line_length:
            line_length_to_options[length_without_break].append(
                (likelihood + math.log2(p_no_break), breaks + [0])
            )
        line_length_to_options[length[i + 1]].append(
            (likelihood + math.log2(p_break), breaks + [1])
        )
    line_length_to_best = {
        line_length: max(options)
        for (line_length, options) in line_length_to_options.items()
    }
_, breaks = max(line_length_to_best.values())
print(breaks + [1])
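A small sanity check (mine, reusing the names from the snippet above): confirm that the chosen breaks keep every line within max_line_length.

def line_lengths(token_lengths, break_flags):
    # break_flags[i] == 1 means a linefeed goes after token i.
    lines, current = [], 0
    for tok_len, brk in zip(token_lengths, break_flags):
        current += tok_len
        if brk:
            lines.append(current)
            current = 0
    if current:
        lines.append(current)
    return lines

assert max(line_lengths(length, breaks + [1])) <= max_line_length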

Is there a statistical test that can compare two ordered lists

I would like to get a statistical test statistic to compare two lists. Suppose my Benchmark list is
Benchmark = [a,b,c,d,e,f,g]
and I have two other lists
A = [g,c,b,a,f,e,d]
C = [c,d,e,a,b,f,g]
I want the test to inform me which list is closer to the Benchmark. The test should consider the absolute location, but also the relative location: for example, it should penalize the fact that in list A 'g' is at the start but in the Benchmark it is at the end (how far is something from its true location), but it should also reward the fact that 'a' and 'b' are close to each other in list C just like in the Benchmark.
A and C are always shuffled versions of the Benchmark. I would like a statistical test or some kind of metric that informs me that the orderings of lists A, B and C are not statistically different from that of the Benchmark, but that a certain list D is significantly different at a certain threshold or p-value such as 5%. And even among lists A, B and C, the test should clearly indicate which ordering is closer to the Benchmark.
Well, if you come to the conclusion that a metric will suffice, here you go:
def dist(a, b):
    perm = []
    for v in b:
        perm.append(a.index(v))
    perm_vals = [a[p] for p in perm]
    # displacement
    ret = 0
    for i, v in enumerate(perm):
        ret += abs(v - i)
    # coherence break
    current = perm_vals.index(a[0])
    for v in a[1:]:
        new = perm_vals.index(v)
        ret += abs(new - current) - 1
        current = new
    return ret
I've created a few samples to test this:
import random

ground_truth = [0, 1, 2, 3, 4, 5, 6]

samples = []
for i in range(7):
    samples.append(random.sample(ground_truth, len(ground_truth)))
samples.append([0, 6, 1, 5, 3, 4, 2])
samples.append([6, 5, 4, 3, 2, 1, 0])
samples.append([0, 1, 2, 3, 4, 5, 6])

def dist(a, b):
    perm = []
    for v in b:
        perm.append(a.index(v))
    perm_vals = [a[p] for p in perm]
    # displacement
    ret = 0
    for i, v in enumerate(perm):
        ret += abs(v - i)
    # coherence break
    current = perm_vals.index(a[0])
    for v in a[1:]:
        new = perm_vals.index(v)
        ret += abs(new - current) - 1
        current = new
    return ret

for s in samples:
    print(s, dist(ground_truth, s))
The metric is a cost, that is, the lower it is, the better. I designed it to yield 0 iff the permutation is an identity. The job left for you, that which none can do for you, is deciding how strict you want to be when evaluating samples using this metric, which definitely depends on what you're trying to achieve.
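If you also want a p-value rather than just a raw cost, one simple option (my suggestion, not something built into the metric above) is a permutation test: compare the observed cost against the cost distribution of random shufflings of the benchmark, so a small p-value means the sample is closer to the benchmark than chance would explain.

import random

def permutation_pvalue(benchmark, sample, dist, n_trials=10000, seed=0):
    """Fraction of random shuffles that score at least as well as `sample`."""
    rng = random.Random(seed)
    observed = dist(benchmark, sample)
    hits = 0
    shuffled = list(benchmark)
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        if dist(benchmark, shuffled) <= observed:
            hits += 1
    return hits / n_trials

# e.g. permutation_pvalue(ground_truth, samples[0], dist)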

Fill order from smaller packages?

The input is an integer that specifies the amount to be ordered.
There are predefined package sizes that have to be used to create that order.
e.g.
Packs
3 for $5
5 for $9
9 for $16
For an input order of 13, the output should be:
2x5 + 1x3
So far I have the following approach:
remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

while remaining_order > 0:
    found = False
    for pack_num in package_numbers:
        if pack_num <= remaining_order:
            required_packages.append(pack_num)
            remaining_order -= pack_num
            found = True
            break
    if not found:
        break
But this will lead to the wrong result:
1x9 + 1x3
remaining: 1
So, you need to fill the order with the packages such that the total price is maximal? This is known as Knapsack problem. In that Wikipedia article you'll find several solutions written in Python.
To be more precise, you need a solution for the unbounded knapsack problem, in contrast to the popular 0/1 knapsack problem (where each item can be packed only once). Here is working code from Rosetta Code:
from itertools import product

NAME, SIZE, VALUE = range(3)
items = (
    # NAME, SIZE, VALUE
    ('A', 3, 5),
    ('B', 5, 9),
    ('C', 9, 16))

capacity = 13

def knapsack_unbounded_enumeration(items, C):
    # find max of any one item
    max1 = [int(C / item[SIZE]) for item in items]
    itemsizes = [item[SIZE] for item in items]
    itemvalues = [item[VALUE] for item in items]

    # def totvalue(itemscount, itemsizes=itemsizes, itemvalues=itemvalues, C=C):
    def totvalue(itemscount):
        # nonlocal itemsizes, itemvalues, C
        totsize = sum(n * size for n, size in zip(itemscount, itemsizes))
        totval = sum(n * val for n, val in zip(itemscount, itemvalues))
        return (totval, -totsize) if totsize <= C else (-1, 0)

    # Try all combinations of bounty items from 0 up to max1
    bagged = max(product(*[range(n + 1) for n in max1]), key=totvalue)
    numbagged = sum(bagged)
    value, size = totvalue(bagged)
    size = -size
    # convert to (item, count) pairs in name order
    bagged = ['%dx%d' % (n, items[i][SIZE]) for i, n in enumerate(bagged) if n]
    return value, size, numbagged, bagged

if __name__ == '__main__':
    value, size, numbagged, bagged = knapsack_unbounded_enumeration(items, capacity)
    print(value)
    print(bagged)
Output is:
23
['1x3', '2x5']
Keep in mind that this is an NP-hard problem, so it will blow up as you enter larger values :)
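If the enumeration does blow up, a standard dynamic-programming version of the unbounded knapsack runs in O(capacity x number of pack sizes) instead. A rough sketch reusing the items/capacity tuples above (the usual textbook recurrence, not the Rosetta code):

def knapsack_unbounded_dp(items, capacity):
    # best[c] = (best value achievable with total size <= c, packs used)
    best = [(0, [])] * (capacity + 1)
    for c in range(1, capacity + 1):
        best[c] = best[c - 1]    # option: leave this unit of capacity unused
        for name, size, value in items:
            if size <= c and best[c - size][0] + value > best[c][0]:
                best[c] = (best[c - size][0] + value, best[c - size][1] + [name])
    return best[capacity]

# knapsack_unbounded_dp(items, capacity) should give (23, ['B', 'B', 'A']) for the data above.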
You can use itertools.product:
import itertools

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

a = min([x for i in range(1, (remaining_order + 1) // min(package_numbers))
         for x in itertools.product(package_numbers, repeat=i)],
        key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)

print(a)
print(remaining_order)
Output:
(5, 5, 3)
0
This simply does the following steps:
Get the tuple whose sum is closest to 13 from the list of all product tuples.
Then subtract its sum from remaining_order.
If you want it output with 'x':
import itertools
from collections import Counter

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

a = min([x for i in range(1, (remaining_order + 1) // min(package_numbers))
         for x in itertools.product(package_numbers, repeat=i)],
        key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)

print(' + '.join(['{0}x{1}'.format(v, k) for k, v in Counter(a).items()]))
print(remaining_order)
print(remaining_order)
Output:
2x5 + 1x3
0
For your problem, I tried two implementations, depending on what you want. In both solutions I assumed you absolutely need the remaining order to reach 0; otherwise the algorithm returns -1. If you need that relaxed, tell me and I can adapt the algorithm.
As the algorithm is implemented via dynamic programming, it handles large inputs well, at least more than 130 packages!
In the first solution, I assumed we fill with the biggest package each time.
In the second solution, I try to minimize the price, while the remaining order must still end at 0.
remaining_order = 13
package_numbers = sorted([9, 5, 3], reverse=True)  # To make sure the biggest package is the first element
prices = {9: 16, 5: 9, 3: 5}
required_packages = []

# First solution, using the biggest package each time, and making the total order remaining at 0 each time
ans = [[] for _ in range(remaining_order + 1)]
ans[0] = [0, 0, 0]
for i in range(1, remaining_order + 1):
    for index, package_number in enumerate(package_numbers):
        if i - package_number > -1:
            tmp = ans[i - package_number]
            if tmp != -1:
                ans[i] = [tmp[x] if x != index else tmp[x] + 1 for x in range(len(tmp))]
                break
    else:  # Using for/else instead of a boolean value `found`
        ans[i] = -1  # -1 is the not-found combination

print(ans[13])  # [0, 2, 1]
print(ans[9])   # [1, 0, 0]

# Second solution, minimizing the price with order at 0
def price(x):
    return 16 * x[0] + 9 * x[1] + 5 * x[2]

ans = [[] for _ in range(remaining_order + 1)]
ans[0] = ([0, 0, 0], 0)  # combination + price
for i in range(1, remaining_order + 1):
    # The not-found packages will be (-1, float('inf'))
    minimal_price = float('inf')
    minimal_combinations = -1
    for index, package_number in enumerate(package_numbers):
        if i - package_number > -1:
            tmp = ans[i - package_number]
            if tmp != (-1, float('inf')):
                tmp_price = price(tmp[0]) + prices[package_number]
                if tmp_price < minimal_price:
                    minimal_price = tmp_price
                    minimal_combinations = [tmp[0][x] if x != index else tmp[0][x] + 1
                                            for x in range(len(tmp[0]))]
    ans[i] = (minimal_combinations, minimal_price)

print(ans[13])  # ([0, 2, 1], 23)
print(ans[9])   # ([0, 0, 3], 15) because three packages of 3 cost less than one package of 9
In case you need a solution for a small number of possible package_numbers but a possibly very big remaining_order, in which case all the other solutions would fail, you can use this to reduce remaining_order:
import numpy as np

remaining_order = 13
package_numbers = [9, 5, 3]
required_packages = []

sub_max = np.sum([(np.product(package_numbers) / i - 1) * i for i in package_numbers])
while remaining_order > sub_max:
    remaining_order -= np.product(package_numbers)
    required_packages.append([max(package_numbers)] * (np.product(package_numbers) // max(package_numbers)))
Because if any package i appears in required_packages often enough to contribute more than (np.product(package_numbers)/i - 1)*i, then np.product(package_numbers)/i copies of it sum to exactly np.product(package_numbers). In case the package max(package_numbers) isn't the one with the smallest price per unit, take the one with the smallest price per unit instead.
Example:
remaining_order = 100
package_numbers = [5,3]
Any part of remaining_order bigger than 5*2 plus 3*4 = 22 can be sorted out by adding 5 three times to the solution and taking remaining_order - 5*3.
So the remaining order that actually needs to be calculated is 10, which can then be solved as 2 times 5. The rest is filled with 6 times 15, which is 18 times 5.
In case the number of possible package_numbers is bigger than just a handful, I recommend building a lookup table (with one of the others answers' code) for all numbers below sub_max which will make this immensely fast for any input.
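Roughly, the combination could look like this (names and the small-order DP are my own sketch; any of the DP answers above could serve as the lookup table instead):

import numpy as np

package_numbers = [9, 5, 3]
block = int(np.prod(package_numbers))            # 135, always fillable as 15 packs of 9
sub_max = sum((block // p - 1) * p for p in package_numbers)

def fill_small(n):
    """Exact fill for a small order via DP (fewest packs); None if impossible."""
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for p in package_numbers:
            if p <= i and best[i - p] is not None:
                candidate = best[i - p] + [p]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

def fill(order):
    bulk = []
    while order > sub_max:                       # strip whole blocks off huge orders
        order -= block
        bulk += [max(package_numbers)] * (block // max(package_numbers))
    small = fill_small(order)
    return None if small is None else bulk + small

"Fewest packs" is an arbitrary choice here; swap in the price-minimizing DP if that is the actual goal.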
Since no objective function is declared, I assume your goal is to maximize the package value within the pack's capacity.
Explanation: the time complexity is essentially the same as enumeration; the optimal solution may not be to pack as many of the highest-valued item as possible, so you have to search all possible combinations. However, you can reuse the partial solutions you have already searched to save work. For example, [5, 5, 3] is derived from adding 3 to a previous [5, 5] try, so the intermediate result can be "cached". You may either use an array or a set to store possible solutions. The code below has the same performance as the Rosetta code, but I think it's clearer.
To further optimize, use a priority queue for opts.
costs = [3, 5, 9]
value = [5, 9, 16]
volume = 130

# solutions
opts = set()
opts.add(tuple([0]))

# calc total value
cost_val = dict(zip(costs, value))

def total_value(opt):
    return sum([cost_val.get(cost, 0) for cost in opt])

def possible_solutions():
    solutions = set()
    for opt in opts:
        for cost in costs:
            if cost + sum(opt) > volume:
                continue
            cnt = (volume - sum(opt)) // cost
            for _ in range(1, cnt + 1):
                sol = tuple(list(opt) + [cost] * _)
                solutions.add(sol)
    return solutions

def optimize_max_return(opts):
    if not opts:
        return tuple([])
    cur = list(opts)[0]
    for sol in opts:
        if total_value(sol) > total_value(cur):
            cur = sol
    return cur

while sum(optimize_max_return(opts)) <= volume - min(costs):
    opts = opts.union(possible_solutions())

print(optimize_max_return(opts))
If your requirement is "just fill the pack" it'll be even simpler using the volume for each item instead.
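For that simpler "just fill the pack" reading, a sketch that tracks reachable volumes only (my own, not derived from the code above):

costs = [3, 5, 9]
volume = 13

reachable = {0: []}            # volume -> one list of packs that sums to it exactly
for v in range(1, volume + 1):
    for c in costs:
        if c <= v and (v - c) in reachable:
            reachable[v] = reachable[v - c] + [c]
            break

print(reachable.get(volume))   # [5, 5, 3] here; None if the volume can't be hit exactly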

How to get a list of "fastest miles" from a set of GPS points

I'm trying to solve a weird problem. Maybe you guys know of some algorithm that takes care of this.
I have data for a cargo freight truck and want to extract some information from it. Suppose I've got a list of sorted points that I get from the GPS. That's the route for that truck:
[
    {
        "lng": "-111.5373066",
        "lat": "40.7231711",
        "time": "1970-01-01T00:00:04Z",
        "elev": "1942.1789265256325"
    },
    {
        "lng": "-111.5372056",
        "lat": "40.7228762",
        "time": "1970-01-01T00:00:07Z",
        "elev": "1942.109892409177"
    }
]
Now, what I want to get is a list of the "fastest miles". I'll do an example:
Given the points:
A, B, C, D, E, F
the distance from point A to point B is 1 mile, and the cargo took 10:32 minutes. From point B to point D I've got another mile, and the cargo took 10 minutes, etc. So I need a list sorted by time, similar to:
B -> D: 10
A -> B: 10:32
D -> F: 11:02
Do you know any efficient algorithm that let me calculate that?
Thank you all.
PS: I'm using Python.
EDIT:
I've got the distance. I know how to calculate it, and there are plenty of posts on how to do that. What I need is an algorithm to tokenize by mile and get the speed from that. Having a distance function is not helpful enough:
results = {}
for point in points:
    aux_points = points.takeWhile(point > n)  # This doesn't exist, just trying to be simple
    for aux_point in aux_points:
        d = distance(point, aux_point)
        if d == 1_MILE:
            time_elapsed = time(point, aux_point)
            results[time_elapsed] = (point, aux_point)
I'm still doing some pretty inefficient calculations.
If you have locations and timestamps for when the location data was fetched, you can simply do something like this:
def CalculateSpeeds(list_of_points_in_time_order):
    """Calculate a list of (average) speeds for a list of geographic points."""
    points = list_of_points_in_time_order
    segment_start = points[0]
    speed_list = []
    for segment_end in points[1:]:
        dt = ElapsedTime(segment_start, segment_end)
        # If you're looking at skipping points, with a slight risk of degraded data,
        # you could do something like "if dt < MIN_ELAPSED_TIME:" and indent
        # the rest of the loop. However, you'd need to then check if the last point
        # has been accounted for, as it might've been too close to the last considered
        # point.
        d = Distance(segment_start, segment_end)
        speed_list.append(d / dt)
        segment_start = segment_end
    return speed_list
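ElapsedTime and Distance are assumed helpers here; for the GPS dicts in the question they could look roughly like this (haversine distance in miles, ISO-8601 timestamps; my sketch, adjust to taste):

import math
from datetime import datetime

EARTH_RADIUS_MILES = 3958.8

def Distance(p1, p2):
    """Great-circle distance in miles between two {'lat': .., 'lng': ..} points."""
    lat1, lon1 = math.radians(float(p1["lat"])), math.radians(float(p1["lng"]))
    lat2, lon2 = math.radians(float(p2["lat"])), math.radians(float(p2["lng"]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def ElapsedTime(p1, p2):
    """Elapsed time in hours between the two points' timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    t1 = datetime.strptime(p1["time"], fmt)
    t2 = datetime.strptime(p2["time"], fmt)
    return (t2 - t1).total_seconds() / 3600.0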
You've said (in comments) that you can do this for a single pair, so all you need to do is to do it for all consecutive pairs.
So, if you have n such points, there will be n - 1 "legs" on the journey. You can form that list by simply:
legs = []
for i in xrange(n - 1):
    legs.append(build_leg(point[i], point[i + 1]))
Assuming point is the list of points and build_leg() is a function that accepts two points and computes the distance and average speed.
The above loop will call build_leg first with points 0 and 1, then with 1 and 2, and so on up to n - 2 and n - 1, the two last points.
I've grown to love the sliding window, and it may be helpful here. Same concept as the other answers, just a slightly different method.
from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

results = {}
# presort your points in time if necessary
for point_a, point_b in window(points):
    d = distance(point_a, point_b)
    if d == 1_MILE:
        time_elapsed = time(point_a, point_b)
        results[time_elapsed] = (point_a, point_b)

Is there a faster way to get subtrees from tree-like structures in Python than the standard recursive one?

Let's assume the following data structure with two numpy arrays (id, parent_id), where the parent_id of the root element is -1:
import numpy as np

class MyStructure(object):
    def __init__(self):
        """
        Default structure for now:

              1
             / \
            2   3
               / \
              4   5
        """
        self.ids = np.array([1, 2, 3, 4, 5])
        self.parent_ids = np.array([-1, 1, 1, 3, 3])

    def id_successors(self, idOfInterest):
        """
        Return logical index.
        """
        return self.parent_ids == idOfInterest

    def subtree(self, newRootElement):
        """
        Return logical index pointing to elements of the subtree.
        """
        init_vector = np.zeros(len(self.ids), bool)
        init_vector[np.where(self.ids == newRootElement)[0]] = 1
        if sum(self.id_successors(newRootElement)) == 0:
            return init_vector
        else:
            subtree_vec = init_vector
            for sucs in self.ids[self.id_successors(newRootElement) == 1]:
                subtree_vec += self.subtree(sucs)
            return subtree_vec
This gets really slow for many ids (> 1000). Is there a faster way to implement it?
Have you tried the psyco module, if you are using Python 2.6? It can sometimes speed code up dramatically.
Have you considered a recursive data structure: a list?
Your example as a standard list would be:
[1, 2, [3, [4],[5]]]
or
[1, [2, None, None], [3, [4, None, None],[5, None, None]]]
By my pretty printer:
[1,
 [2, None, None],
 [3,
  [4, None, None],
  [5, None, None]]]
Subtrees are ready there; it just costs you some time inserting values into the right tree. It is also worthwhile to check whether the heapq module fits your needs.
Guido himself also gives some insight on traversing and trees in http://python.org/doc/essays/graphs.html; maybe you are aware of it.
Here is some advanced-looking tree stuff, actually proposed for Python as a basic list type replacement, but rejected in that role: the blist module.
I think it's not the recursion as such that's hurting you, but the multitude of very wide operations (over all elements) for every step. Consider:
init_vector[np.where(self.ids==newRootElement)[0]] = 1
That runs a scan through all elements, calculates the index of every matching element, then uses only the index of the first one. This particular operation is available as the method index for lists, tuples, and arrays - and faster there. If IDs are unique, init_vector is simply ids==newRootElement anyway.
if sum(self.id_successors(newRootElement))==0:
Again a linear scan of every element, then a reduction on the whole array, just to check if any matches are there. Use any for this type of operation, but once again we don't even need to do the check on all elements - "if newRootElement not in self.parent_ids" does the job, but it's not necessary as it's perfectly valid to do a for loop over an empty list.
Finally there's the last loop:
for sucs in self.ids[self.id_successors(newRootElement)==1]:
This time, an id_successors call is repeated, and then the result is compared to 1 needlessly. Only after that comes the recursion, making sure all the above operations are repeated (for different newRootElement) for each branch.
The whole code is a reversed traversal of a unidirectional tree. We have parents and need children. If we're to do wide operations such as numpy is designed for, we'd best make them count - and thus the only operation we care about is building a list of children per parent. That's not very hard to do with one iteration:
import collections

children = collections.defaultdict(list)
for i, p in zip(ids, parent_ids):
    children[p].append(i)

def subtree(i):
    return i, map(subtree, children[i])
The exact structure you need will depend on more factors, such as how often the tree changes, how large it is, how much it branches, and how large and many subtrees you need to request. The dictionary+list structure above isn't terribly memory efficient, for instance. Your example is also sorted, which could make the operation even easier.
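For instance, restated self-contained on the example tree from the question (a list comprehension instead of map, so it reads the same on Python 2 and 3):

import collections

ids = [1, 2, 3, 4, 5]
parent_ids = [-1, 1, 1, 3, 3]

children = collections.defaultdict(list)
for i, p in zip(ids, parent_ids):
    children[p].append(i)

def subtree(i):
    return (i, [subtree(c) for c in children[i]])

print(subtree(3))   # (3, [(4, []), (5, [])])
print(subtree(1))   # (1, [(2, []), (3, [(4, []), (5, [])])])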
In theory, every algorithm can be written iteratively as well as recursively. But this is a fallacy (like Turing-completeness). In practice, walking an arbitrarily-nested tree via iteration is generally not feasible. I doubt there is much to optimize (at least you're modifying subtree_vec in-place). Doing x on thousands of elements is inherently damn expensive, no matter whether you do it iteratively or recursively. At most there are a few micro-optimizations possible on the concrete implementation, which will at most yield <5% improvement. Best bet would be caching/memoization, if you need the same data several times. Maybe someone has a fancy O(log n) algorithm for your specific tree structure up their sleeve, I don't even know if one is possible (I'd assume no, but tree manipulation isn't my staff of life).
This is my answer (written without access to your class, so the interface is slightly different, but I'm attaching it as is so that you can test if it is fast enough):
=======================file graph_array.py==========================

import collections
import numpy

def find_subtree(pids, subtree_id):
    N = len(pids)
    assert 1 <= subtree_id <= N
    subtreeids = numpy.zeros(pids.shape, dtype=bool)
    todo = collections.deque([subtree_id])
    iter = 0
    while todo:
        id = todo.popleft()
        assert 1 <= id <= N
        subtreeids[id - 1] = True
        sons = (pids == id).nonzero()[0] + 1
        #print 'id={0} sons={1} todo={2}'.format(id, sons, todo)
        todo.extend(sons)
        iter = iter + 1
        if iter > N:
            raise ValueError()
    return subtreeids
=======================file graph_array_test.py==========================

import numpy
from graph_array import find_subtree

def _random_graph(n, maxsons):
    import random
    pids = numpy.zeros(n, dtype=int)
    sons = numpy.zeros(n, dtype=int)
    available = []
    for id in xrange(1, n + 1):
        if available:
            pid = random.choice(available)
            sons[pid - 1] += 1
            if sons[pid - 1] == maxsons:
                available.remove(pid)
        else:
            pid = -1
        pids[id - 1] = pid
        available.append(id)
    assert sons.max() <= maxsons
    return pids

def verify_subtree(pids, subtree_id, subtree):
    ids = set(subtree.nonzero()[0] + 1)
    sons = set(ids) - set([subtree_id])
    fathers = set(pids[id - 1] for id in sons)
    leafs = set(id for id in ids if not (pids == id).any())
    rest = set(xrange(1, pids.size + 1)) - fathers - leafs
    assert fathers & leafs == set()
    assert fathers | leafs == ids
    assert ids & rest == set()

def test_linear_graph_gen(n, genfunc, maxsons):
    assert maxsons == 1
    pids = genfunc(n, maxsons)
    last = -1
    seen = set()
    for _ in xrange(pids.size):
        id = int((pids == last).nonzero()[0]) + 1
        assert id not in seen
        seen.add(id)
        last = id
    assert seen == set(xrange(1, pids.size + 1))

def test_case1():
    """
          1
         / \
        2   4
       /
      3
    """
    pids = numpy.array([-1, 1, 2, 1])
    subtrees = {1: [True, True, True, True],
                2: [False, True, True, False],
                3: [False, False, True, False],
                4: [False, False, False, True]}
    for id in xrange(1, 5):
        sub = find_subtree(pids, id)
        assert (sub == numpy.array(subtrees[id])).all()
        verify_subtree(pids, id, sub)

def test_random(n, genfunc, maxsons):
    pids = genfunc(n, maxsons)
    for subtree_id in numpy.arange(1, n + 1):
        subtree = find_subtree(pids, subtree_id)
        verify_subtree(pids, subtree_id, subtree)

def test_timing(n, genfunc, maxsons):
    import time
    pids = genfunc(n, maxsons)
    t = time.time()
    for subtree_id in numpy.arange(1, n + 1):
        subtree = find_subtree(pids, subtree_id)
    t = time.time() - t
    print 't={0}s = {1:.2}ms/subtree = {2:.5}ms/subtree/node '.format(
        t, t / n * 1000, t / n**2 * 1000),

def pytest_generate_tests(metafunc):
    if 'case' in metafunc.function.__name__:
        return
    ns = [1, 2, 3, 4, 5, 10, 20, 50, 100, 1000]
    if 'timing' in metafunc.function.__name__:
        ns += [10000, 100000, 1000000]
        pass
    for n in ns:
        func = _random_graph
        for maxsons in sorted(set([1, 2, 3, 4, 5, 10, (n + 1) // 2, n])):
            metafunc.addcall(
                funcargs=dict(n=n, genfunc=func, maxsons=maxsons),
                id='n={0} {1.__name__}/{2}'.format(n, func, maxsons))
            if 'linear' in metafunc.function.__name__:
                break
===================py.test --tb=short -v -s test_graph_array.py============
...
test_graph_array.py:72: test_timing[n=1000 _random_graph/1] t=13.4850590229s = 13.0ms/subtree = 0.013485ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/2] t=0.318281888962s = 0.32ms/subtree = 0.00031828ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/3] t=0.265519142151s = 0.27ms/subtree = 0.00026552ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/4] t=0.24147105217s = 0.24ms/subtree = 0.00024147ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/5] t=0.211434841156s = 0.21ms/subtree = 0.00021143ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/10] t=0.178458213806s = 0.18ms/subtree = 0.00017846ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/500] t=0.209936141968s = 0.21ms/subtree = 0.00020994ms/subtree/node PASS
test_graph_array.py:72: test_timing[n=1000 _random_graph/1000] t=0.245707988739s = 0.25ms/subtree = 0.00024571ms/subtree/node PASS
...
Here every subtree of every tree is taken, and the interesting value is the mean time to extract a tree: ~0.2ms per subtree, except for strictly linear trees. I'm not sure what is happening here.
