I'm working on a script that takes the elements of companies and pairs them with the elements of people. The goal is to optimize the pairings so that the sum of all pair values is maximized (the value of each individual pairing is precomputed and stored in the dictionary ctrPairs).
The pairing is 1:1: each company gets exactly one person, each person belongs to exactly one company, and the number of companies equals the number of people. I used a top-down approach with a memoization table (memDict) to avoid recomputing subproblems that have already been solved.
I believe I could vastly improve the speed of what's going on here, but I'm not sure how. The areas I'm worried about are marked with #slow?; any advice would be appreciated. (The script works for inputs of lists with n < 15, but it gets incredibly slow for n > ~15.)
def getMaxCTR(companies, people):
    # here's where we return the memoized version if it exists
    if (companies, people) in memDict:
        return memDict[(companies, people)]
    if not companies or not people:
        return 0
    maxCTR = float("-inf")
    remainingCompanies = companies[1:len(companies)]  # slow?
    for p in people:
        remainingPeople = list(people)  # slow?
        remainingPeople.remove(p)       # slow?
        ctr = ctrPairs[(companies[0], p)] + getMaxCTR(remainingCompanies, tuple(remainingPeople))  # recurse
        if ctr > maxCTR:
            maxCTR = ctr
    memDict[(companies, people)] = maxCTR
    return maxCTR
To all those who wonder about the usefulness of learning theory, this question is a good illustration. The right question is not about a "fast way to bounce between lists and tuples in Python": the reason for the slowness is something deeper.
What you're trying to solve here is known as the assignment problem: given two lists of n elements each and n×n values (the value of each pair), how do you assign them so that the total "value" is maximized (or, equivalently, minimized)? There are several algorithms for this, such as the Hungarian algorithm (Python implementations exist), or you could solve it using more general min-cost flow algorithms, or even cast it as a linear program and use an LP solver. Most of these have a running time of O(n^3).
What your algorithm above does is try each possible way of pairing them. (The memoization only helps to avoid recomputing answers for pairs of subsets, but you're still looking at all pairs of subsets.) This approach is at least Ω(n^2 * 2^(2n)). For n = 16, n^3 is 4096 while n^2 * 2^(2n) is 1099511627776. There are constant factors in each algorithm, of course, but see the difference? :-) (The approach in the question is still better than the naive O(n!), which would be much worse.) Use one of the O(n^3) algorithms, and I predict it will run in time for up to n = 10000 or so, instead of just up to n = 15.
"Premature optimization is the root of all evil", as Knuth said, but so is delayed/overdue optimization: you should carefully consider an appropriate algorithm before implementing it, not pick a bad one and then wonder which parts of it are slow. :-) Even a bad Python implementation of a good algorithm would be orders of magnitude faster than fixing all the "slow?" parts of the code above (e.g., by rewriting them in C).
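This is not one of the O(n^3) algorithms recommended above, but as a middle ground, here is a sketch (function and parameter names are mine) of what subset memoization looks like when the state is an integer bitmask of used people rather than a tuple of remaining ones: O(n^2 * 2^n) states and work instead of Ω(n^2 * 2^(2n)); for large n, an O(n^3) method such as the Hungarian algorithm is still the right tool.

```python
from functools import lru_cache

def max_total_value(value):
    """Best 1:1 assignment value via bitmask DP.

    value[i][j] is the precomputed value of pairing company i with
    person j (the role ctrPairs plays in the question). Each state is
    (next company to assign, bitmask of people already used), so the
    memo table has at most n * 2^n entries.
    """
    n = len(value)

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == n:  # every company has been assigned a person
            return 0
        return max(value[i][j] + best(i + 1, used | (1 << j))
                   for j in range(n) if not used & (1 << j))

    return best(0, 0)
```

Companies are always assigned in order, so a single integer index replaces the tuple of remaining companies, and the frozen set of remaining people collapses into one machine word.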
I see two issues here:
Efficiency: you're recreating the same remainingPeople sublists for each company. It would be better to create all the remainingPeople and all the remainingCompanies once, and then build all the combinations from those.
Memoization: you're using tuples instead of lists so they can serve as dict keys for memoization, but tuple equality is order-sensitive. IOW, (1,2) != (2,1). You'd be better off using sets and frozensets for this: frozenset((1,2)) == frozenset((2,1)).
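A minimal sketch of that memoization fix (the helper name is mine): keep the companies as an ordered tuple, since companies[0] is always the one paired next, but store the remaining people as a frozenset, so two subproblems that differ only in people order share a single cache entry.

```python
def memo_key(companies, people):
    # companies stays an ordered tuple (companies[0] is paired first);
    # only the *set* of remaining people matters, not their order
    return (tuple(companies), frozenset(people))

key_a = memo_key(("c1", "c2"), ["alice", "bob"])
key_b = memo_key(("c1", "c2"), ["bob", "alice"])
assert key_a == key_b   # one memo entry instead of two
assert (1, 2) != (2, 1) # whereas plain tuples are order-sensitive
```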
This line:
remainingCompanies = companies[1:len(companies)]
can be replaced with this line:
remainingCompanies = companies[1:]
for a very slight speed increase. That's the only improvement I see.
If you want to get a copy of a tuple as a list you can do
mylist = list(mytuple)
I was looking for solutions to the "two number sum" problem, and everybody seems to use two for loops.
Another way I saw uses a hash table:
def twoSumHashing(num_arr, pair_sum):
    hashTable = {}
    for i in range(len(num_arr)):
        complement = pair_sum - num_arr[i]
        if complement in hashTable:
            print("Pair with sum", pair_sum, "is: (", num_arr[i], ",", complement, ")")
        hashTable[num_arr[i]] = num_arr[i]

# Driver code
num_arr = [4, 5, 1, 8]
pair_sum = 9

# Calling the function
twoSumHashing(num_arr, pair_sum)
But why does nobody discuss this solution?
def two_num_sum(array, target):
    for num in array:
        match = target - num
        if match in array:
            return [match, num]
    return "no result found"
When using a hash table, we have to store values in the table, but here there is no need for that.
1) Does that affect the time complexity of the solution?
2) Looking up a value in a hash table is easy compared to an array, but if there are a huge number of values, does storing them in a hash table take more space?
First of all, the second function you provide as a solution is not correct: it does not return a complete list of answers.
Second, as a Pythonista it's better to say "dictionary" instead of "hash table"; a Python dictionary is one implementation of a hash table.
Anyhow, regarding the other questions you asked:
Using two for loops is a brute-force approach and is usually not an optimal approach in practice. Dictionaries are much faster than lists for lookups in Python, so for the sake of time complexity, dictionaries are the clear winner.
From the point of view of space complexity, a dictionary certainly takes more memory, but with current hardware this is rarely a deal-breaker even for very large inputs. It depends on your situation: whether speed or memory is more important to you.
The first function:
uses O(n) time complexity, as you iterate over the n members of the array once;
uses O(n) space complexity: in the worst case (e.g. when the only pair is the first and last elements), you store up to n-1 numbers in the table before finding it.
The second function:
uses O(n^2) time complexity: you iterate over the array, and for each element the 'in' check calls list.__contains__, which is O(n) in the worst case.
So the second function amounts to two nested loops brute-forcing the solution.
Another thing to point out about the second function is that it doesn't return all the pairs, just the first pair it finds. You could try to fix that by searching only from the next index onward, but then you would get duplicates.
It all comes down to a preference between time complexity and space complexity. This is one of many interview (and interview-preparation) questions where you need to explain why you would use function two (if it worked properly) over function one, and vice versa.
Answers to your questions:
1. "When using a hash table we have to store values into the hash table. But here there is no need for that. Does that affect the time complexity of the solution?"
Yes: the time complexity becomes O(n^2), which is worse.
2. "Looking up a value in a hash table is easy compared to an array, but if the values are huge in number, does storing them in a hash table take more space?"
In a computer, numbers are just patterns of bits. Larger numbers take up more space because they need more bits to represent them, but storing a given number costs the same no matter where you store it; a hash table does not make it bigger than an array would.
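For completeness, here is a sketch (the function name is mine) that keeps the O(n) dictionary/set-based approach of the first function but collects every pair instead of printing them:

```python
def two_sum_all_pairs(nums, target):
    """Return every pair (a, b) with a + b == target, in one pass."""
    seen = set()   # values encountered so far
    pairs = []
    for num in nums:
        complement = target - num
        if complement in seen:
            pairs.append((complement, num))
        seen.add(num)
    return pairs
```

For example, two_sum_all_pairs([4, 5, 1, 8], 9) gives [(4, 5), (1, 8)], and because each element is checked against earlier elements only, nothing is paired with itself.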
I have a dataframe with daily returns for 6 portfolios (PORT1, PORT2, PORT3, ... PORT6).
I have defined functions for compound annual return and risk-adjusted return, and I can run them for any one PORT.
I want to find the combination of portfolios (assume equal weighting) that gives the highest return. For example, a combination of PORT1, PORT3, PORT4 and PORT6 may provide the highest risk-adjusted return. Is there a way to automatically run the defined function on all combinations and obtain the desired combination?
No code is included, as I don't think it's necessary to show the computation used to determine the risk-adjusted return.
def returns(PORT):
    val = ...  # [computation of return here for PORT]
    return val
Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You have six dimensions, and presumably you want to allocate 1 unit of "stuff" across all six, such that the vector of allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of choices, so perhaps you start with increments of size 0.10: ten possible increments across 6 dimensions gives you 10^6 possibilities.
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values, and pick the best one.
That may not be the answer you want. Other methods exist, including randomising your guesses and limiting your results to a more manageable number, but the performance gain is offset by some uncertainty, and some potentially difficult conversations with your clients ("What do you mean, you did it randomly?!").
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
You can do:

import itertools

best_return = 0
for r in range(1, len(PORTS) + 1):
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT

(Note that r has to start at 1; with r = 0 the first candidate would be the empty combination.)
You can also do:

max([max(itertools.combinations(PORTS, r), key=returns)
     for r in range(1, len(PORTS) + 1)],
    key=returns)
However, this is more of an economics question than a CS one. Given a set of positions and their returns and risk, there are explicit formulae to find the optimal portfolio without having to brute force it.
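To make the search above concrete, here is a self-contained sketch with toy data; the portfolio names, the daily numbers, and the equal-weight compounding inside returns() are all invented for illustration, not taken from the question:

```python
import itertools

# Invented daily-return series for three hypothetical portfolios
daily = {
    "PORT1": [0.010, -0.020, 0.015],
    "PORT2": [0.005, 0.007, -0.001],
    "PORT3": [0.020, -0.030, 0.025],
}

def returns(combo):
    """Equal-weight the selected portfolios, then compound the daily returns."""
    total = 1.0
    for day in zip(*(daily[name] for name in combo)):
        total *= 1.0 + sum(day) / len(day)
    return total - 1.0

# Exhaustive search over every non-empty combination of portfolios
best_PORT = max(
    (combo
     for r in range(1, len(daily) + 1)
     for combo in itertools.combinations(sorted(daily), r)),
    key=returns,
)
```

With 6 portfolios this is only 63 combinations, so brute force is entirely practical; the cost grows as 2^n with the number of portfolios.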
In bioinformatics, we do the following transformation an awful lot:
>>> data = {
(90,100):1,
(91,101):1,
(92,102):2,
(93,103):1,
(94,104):1
}
>>> someFunction(data)
{
90:1,
91:2,
92:4,
93:5,
94:6,
95:6,
96:6,
97:6,
98:6,
99:6,
100:6,
101:5,
102:4,
103:2,
104:1
}
Where the tuple in data is always a unique pair.
But there are many methods for doing this transform, some significantly better than others. One I have tried is:
newData = {}
for pos, values in data.iteritems():
    A, B = pos
    for i in xrange(A, B + 1):
        try:
            newData[i] += values
        except KeyError:
            newData[i] = values
This has the benefit of being short and sweet, but I'm not actually sure it is that efficient...
I have a feeling that somehow turning the dict into a list of lists, and then doing the xrange, would save an awful lot of time. We're talking weeks of computational work per experiment. Something like this:
>>> someFunction(data)
[
[90,90,1],
[91,91,2],
[92,92,4],
[93,93,5],
[94,100,6],
[101,101,5],
[102,102,4],
[103,103,2],
[104,104,1]
]
and THEN do the for/xrange loop.
People on #python have recommended bisect and heapy, but after struggling with bisect all day I can't come up with a nice algorithm that I can be 100% sure will always work. If anyone here could help, or even point me in the right direction, I'd be massively grateful :)
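Not from the thread, but the standard technique for this shape of problem is a sweep line over a difference array: record +value where an interval starts and -value just past where it ends, then prefix-sum across the sorted breakpoints. The cost is proportional to the number of intervals plus the covered range rather than the total interval length, and the flat segments between breakpoints are exactly the non-overlapping [start, end, multiplier] runs sketched above. A sketch (function name is mine):

```python
from collections import defaultdict

def coverage(data):
    """Turn {(start, end): count} into {position: total} via a sweep line."""
    delta = defaultdict(int)
    for (a, b), value in data.items():
        delta[a] += value       # coverage rises on entering the interval
        delta[b + 1] -= value   # and falls just past its end
    result = {}
    running = 0
    points = sorted(delta)
    for lo, hi in zip(points, points[1:]):
        running += delta[lo]
        for i in range(lo, hi):  # one flat run between breakpoints
            if running:
                result[i] = running
    return result
```

Running coverage() on the example data above reproduces the expected per-position dictionary, and dropping the inner expansion loop would yield the run-length [start, end, multiplier] form directly.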
I worked out a solution last night that takes the total run time for one file from roughly 400 minutes down to 251 minutes. I would post the code, but it's pretty long and likely to have bugs in the edge cases; for that reason I'll just say that the 'working' code can be found in the program 'rawSeQL'. The algorithmic improvements that helped the most were:
Looping over the overlapping arrays and flattening them into non-overlapping arrays with a multiplier value made an enormous difference, as xrange() no longer needs to repeat itself.
Using collections.defaultdict(int) made a big difference over the try/except loop above. collections.Counter() and OrderedDict were a LOT slower than the try/except.
I went with bisect_left() to find where to insert the next non-overlapping piece, which on its own was so-so, but adding bisect's 'lo' parameter to limit the range of the list it needs to check gave a sizeable reduction in compute time. If you sort the input list, your value for 'lo' is always the last value returned by bisect, which makes this easy :)
It is possible that heapy would provide even more benefit, but for now the main algorithmic improvements above will probably outweigh any micro-optimization tricks. I have 75 files to process, so saving roughly 149 minutes per file comes to about 11,000 minutes (over a week) of compute time saved :)
My goal is to iterate through a set S of elements, given a single element and an action G: S -> S that acts transitively on S (i.e., for any elt, elt' in S, there is a map f in G such that f(elt) = elt'). The action is finitely generated, so I can apply each generator to a given element.
The algorithm I use is:

def orbit(act, elt):
    new_elements = [elt]
    seen_elements = set([elt])
    yield elt
    while new_elements:
        elt = new_elements.pop()
        for f in act.gens():
            elt_new = f(elt)
            if elt_new not in seen_elements:
                new_elements.append(elt_new)
                seen_elements.add(elt_new)
                yield elt_new
This algorithm seems well-suited and very generic, BUT it has one major and one minor slowdown in big computations that I would like to get rid of:
The major one: seen_elements collects all the elements and thus consumes too much memory, given that I don't need the actual elements any more.
How can I avoid keeping all the elements in memory?
Very likely this depends on what the elements are. For me, they are short lists (< 10 entries) of ints (each < 10^3). So first: is there a fast way to associate a (with high probability) unique integer with such a list? Does that save much memory? If so, should I put those integers into a dict to check containment (in which case first a hash equality test and then an int equality test are done, right?), or how should I do it?
The minor one: popping the element takes a lot of time, given that I don't really need that list. Is there a better way of doing it?
Thanks a lot for your suggestions!
So first, is there a fast way to associate a (with high probability) unique integer to such a list?
If the list entries all are in range(1, 1024), then sum(x << (i * 10) for i, x in enumerate(elt)) yields a unique integer.
Does that save much memory?
The short answer is yes. The long answer is that it's complicated to determine how much. Python's long integer representation uses (probably) 30-bit digits, so the digits will pack 3 to the 32-bit word instead of 1 (or 0.5 for 64-bit). There's some object overhead (8/16 bytes?), and then there's the question of how many of the list entries require separate objects, which is where the big win may lie.
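A sketch of that packing (helper names are mine): since every entry is at least 1 and fits in 10 bits, the encoding is reversible, so no two distinct lists can collide.

```python
def pack(elt):
    """Pack a list of ints from range(1, 1024) into one integer, 10 bits each."""
    return sum(x << (10 * i) for i, x in enumerate(elt))

def unpack(key):
    """Recover the list; entries >= 1 mean there is no trailing-zero ambiguity."""
    out = []
    while key:
        out.append(key & 0x3FF)  # take the low 10 bits
        key >>= 10
    return out
```

pack([3, 5, 7]) round-trips through unpack, and lists of different lengths always get different keys precisely because no entry is 0.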
If you can tolerate errors, then a Bloom filter would be a possibility.
the minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
I find that claim surprising: popping from the end of a Python list is O(1). Have you measured?
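The Bloom-filter idea mentioned above can be sketched as follows (the sizes and the use of SHA-256 to derive probe positions are my choices, not part of the answer). Membership tests can return false positives, so some orbit elements might be wrongly skipped, but memory stays constant regardless of how many elements are seen:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: constant memory, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # derive num_hashes probe positions from one SHA-256 digest
        digest = hashlib.sha256(repr(item).encode()).digest()
        for k in range(self.num_hashes):
            chunk = digest[4 * k:4 * k + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

With 2^20 bits and 4 hashes, the false-positive rate stays tiny until tens of thousands of elements have been added; scale num_bits with the expected orbit size.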
I'm working on a statistical project that involves iterating over every possible way to partition a collection of strings and running a simple calculation on each. Specifically, each possible substring has a probability associated with it, and I'm trying to get the sum across all partitions of the product of the substring probability in the partition.
For example, if the string is 'abc', there would be probabilities for 'a', 'b', 'c', 'ab', 'bc' and 'abc'. There are four possible partitionings of the string: 'abc', 'ab|c', 'a|bc' and 'a|b|c'. The algorithm needs to find the product of the component probabilities for each partitioning, then sum the four resulting numbers.
Currently, I've written a python iterator that uses binary representations of integers for the partitions (eg 00, 01, 10, 11 for the example above) and simply runs through the integers. Unfortunately, this is immensely slow for strings longer than 20 or so characters.
Can anybody think of a clever way to perform this operation without simply running through every partition one at a time? I've been stuck on this for days now.
In response to some comments here is some more information:
The string can be just about anything, e.g. "foobar(foo2)": our alphabet is lowercase alphanumerics plus all three kinds of brackets ("(", "[", "{"), hyphens and spaces.
The goal is to get the likelihood of the string given individual 'word' likelihoods. So L(S='abc')=P('abc') + P('ab')P('c') + P('a')P('bc') + P('a')P('b')P('c') (Here "P('abc')" indicates the probability of the 'word' 'abc', while "L(S='abc')" is the statistical likelihood of observing the string 'abc').
A dynamic-programming solution (if I understood the question right):

def dynProgSolution(text, probs):
    probUpTo = [1]
    for i in range(1, len(text) + 1):
        cur = sum(v * probs[text[k:i]] for k, v in enumerate(probUpTo))
        probUpTo.append(cur)
    return probUpTo[-1]

print(dynProgSolution(
    'abc',
    {'a': 0.1, 'b': 0.2, 'c': 0.3,
     'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
))
The complexity is O(N^2), so it will easily solve the problem for N = 20.
Why this works:
Everything you would multiply by probs['a'] * probs['b'], you would also multiply by probs['ab'].
Thanks to the distributive property of multiplication over addition, you can add those two together once and multiply that single sum by all of its continuations.
Concretely, probUpTo[i] holds the summed probability of all partitionings of the first i characters: each step extends every shorter prefix by one final substring, adding that substring's probability times the prefix's accumulated sum.
First, profile to find the bottleneck.
If the bottleneck is simply the massive number of possible partitions, I recommend parallelization, possibly via multiprocessing. If that's still not enough, you might look into a Beowulf cluster.
If the bottleneck is just that the calculation itself is slow, try moving it into C; that's pretty easy to do via ctypes.
Also, I'm not really sure how you're storing the partitions, but you could probably squash memory consumption a pretty good bit by using one string and a suffix array. If your bottleneck is swapping and/or cache misses, that might be a big win.
Your substrings are going to be reused over and over again by the longer strings, so caching the values using a memoizing technique seems like an obvious thing to try. This is just a time-space trade off. The simplest implementation is to use a dictionary to cache values as you calculate them. Do a dictionary lookup for every string calculation; if it's not in the dictionary, calculate and add it. Subsequent calls will make use of the pre-computed value. If the dictionary lookup is faster than the calculation, you're in luck.
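The memoization described above can be sketched as a top-down twin of the dynamic-programming answer, with functools.lru_cache doing the dictionary bookkeeping (the function names are mine, and I assume substrings absent from probs have probability 0):

```python
from functools import lru_cache

def likelihood(s, probs):
    """Sum, over all partitions of s, of the product of the parts' probabilities."""
    @lru_cache(maxsize=None)
    def from_pos(i):
        if i == len(s):
            return 1.0  # empty remainder contributes an empty product
        # choose every possible next part s[i:j], then recurse on the rest
        return sum(probs.get(s[i:j], 0.0) * from_pos(j)
                   for j in range(i + 1, len(s) + 1))
    return from_pos(0)
```

With the probabilities from the DP answer, likelihood('abc', ...) gives 0.776, matching 0.6 + 0.4*0.3 + 0.1*0.5 + 0.1*0.2*0.3; each of the N starting positions is computed once, so the cost is O(N^2) just like the bottom-up version.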
I realise you are using Python, but... as a side note that may be of interest, if you do this in Perl, you don't even have to write any code; the built in Memoize module will do the caching for you!
You may get a minor reduction in the amount of computation via a small refactoring based on the distributive and associative properties of arithmetic (and string concatenation), though I'm not sure it will be a life-changer. The core idea is as follows:
Consider a longish string, e.g. 'abcdefghik' (10 characters, for definiteness, without loss of generality). In a naive approach you'd multiply p(a) by the many partitions of the 9-tail, p(ab) by the many partitions of the 8-tail, etc.; in particular, p(a) * p(b) will multiply exactly the same partitions of the 8-tail (all of them) as p(ab) will: 3 multiplications and 2 sums among them. So factor that out:

(p(ab) + p(a) * p(b)) * (partitions of the 8-tail)

and we're down to 2 multiplications and 1 sum for this part, having saved 1 product and 1 sum, while still covering all partitions whose leftmost split point sits just right of 'b'. For partitions with a split just right of 'c':

(p(abc) + p(ab) * p(c) + p(a) * (p(b) * p(c) + p(bc))) * (partitions of the 7-tail)

the savings mount, partly thanks to the internal refactoring, though of course one must be careful about double-counting.
I'm thinking this approach may be generalized: start at the midpoint and consider all partitions that have a split there, handling the left and right halves separately (and recursively), multiplying and summing; then add all partitions that DON'T have a split there. In the example, with 'abcde' on the left and 'fghik' on the right, the second part covers all partitions where 'ef' stay together rather than apart, so "collapse" those by treating 'ef' as a new "superletter" X, leaving a string one character shorter, 'abcdXghik'. (Of course the probabilities of the substrings of THAT map directly to the originals; e.g., p(cdXg) in the new string is exactly p(cdefg) in the original.)
You should look into the itertools module. It can create very fast generators for you: given your input string, it will provide all possible permutations, and depending on what you need there is also a combinations() generator. I'm not quite sure whether you are considering 'b|ca' when you look at 'abc', but either way this module may prove useful to you.