Recursive Python matching algorithm based on subsets working too slowly

I'm building a web app to match high school students considering a gap year to students who have taken a gap year, based on interests as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm, so though people have suggested things like collaborative filtering and association rule mining, or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset (a few hundred users right now, a few thousand soon). So I wrote my own algorithm using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (and who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset of it). If it can't find an exact match with ALL of the user's inputted tags, it checks all (n-1)-length subsets of the tags list to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to worker timeout.
I did some tests by putting variables inside the algorithm to count computations and it seems that when it goes a bit deep into the recursive stack, it checks a few hundred subsets each time (to see if there's an exactMatch for that subset, and if there is, add it to results list to output), and the total number of computations doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations for more and more tags). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, be gone through around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, as selecting and iterating over all subsets seems like something that could get expensive quickly, but my question is about why it's so slow despite only doing a few thousand computations. I know my computer operates in GHz and I expect web servers are similar, so surely a few thousand computations would be near-instantaneous? What am I missing, and how can I improve this algorithm? Any other approaches I should look into?
import itertools

counter = 0  # global debug counter used to measure how much work is being done

# takes in a list and a depth and returns all subsets of size len(seq) - depth
def arbSubsets(seq, depth):
    return list(itertools.combinations(seq, len(seq) - depth))

# takes in a tagsList and checks Gapper.objects.all() to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches

# takes in a tagsList that has been cleaned to remove any tags that NO gappers have,
# then checks gapper objects to find the optimal match
def matchGapper(tagsList, depth, results):
    global counter
    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []
    # counter variable is to measure complexity for debugging
    counter += 1
    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3
    # now we must check subsets for a match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # need to check because we might be adding depth-2 results on top of depth-1 results,
                # which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results: break
                    results.append(match)
    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if there aren't enough matching gappers
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])

It doesn't seem that you are doing only a few hundred computation steps. In fact, you have a few hundred options at each depth, so you should not add but multiply the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "...or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.

Okay, so after much fiddling with timers, I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper and arbSubsets. When I put the counter into a global variable and measured operations (counted as lines of my code being executed), it came in at around 2-10K for large inputs (around 10 tags).
It is true that arbSubsets, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling small numbers of tags (on the order of 10-50) and, more importantly, 2) only calling arbSubsets when we recurse into matchGapper, which happens at most about 10 times, since tagsList can only be around 10 long (on the order of 10-50, as above). And when I checked the time it took to generate the subsets, it was on the order of 2e-5 seconds. So the total time spent generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so with that aside, knowing that arbSubsets is only called on the order of 10 times and is fast at that, and knowing that there are only around 10K computations at most taking place in my code, it starts to become clear that I must be using some out-of-the-box function (like set() or .issubset() or something like that) that takes a nontrivial amount of time to compute and is executed many times. Adding some counters in some more places, it becomes clear that exactMatches() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exact matches).
So the problem, at this point, is reduced to the fact that exactMatches takes around 0.02s (empirically) as implemented, and is called several thousand times. So we can either try to make it faster by a couple of orders of magnitude (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting each one equal to the list of registered profiles with that exact combination. This way, querying is just a lookup in a (huge) dict, which can be done fast. Any other suggestions are welcome.
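For reference, a minimal sketch of a cheaper fix (this is not code from the app, and it assumes the same Gapper model and gapper.tags.names() accessor shown above): since exactMatches calls Gapper.objects.all() and rebuilds every gapper's tag set on every call, much of that 0.02s is probably database and set-construction overhead repeated thousands of times. Fetching the tag sets once per request turns each subset check into pure in-memory set work; buildTagIndex and exactMatchesCached are hypothetical names.

# hypothetical helpers: build the gapper -> tag-set index once per request
def buildTagIndex():
    return [(gapper, set(gapper.tags.names())) for gapper in Gapper.objects.all()]

# same subset test as exactMatches, but against the prebuilt in-memory index
def exactMatchesCached(tagsList, tagIndex):
    tagsSet = set(tagsList)
    return [gapper for gapper, gapperSet in tagIndex if tagsSet.issubset(gapperSet)]

# usage: build once, then reuse for every subset checked during the recursion
tagIndex = buildTagIndex()
matches = exactMatchesCached(tagsList, tagIndex)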

Related

Finding combination of columns which provides best combination based on function return

I have a dataframe with daily returns for 6 portfolios (PORT1, PORT2, PORT3, ... PORT6).
I have defined functions for compound annual returns and risk-adjusted returns. I can run this function for any one PORT.
I want to find a combination of portfolios (assume equal weighting) to obtain the highest returns. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk-adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included as I do not think it is necessary to show the computation used to determine the risk-adjusted return.
def returns(PORT):
    val = ...  # [computation of return here for PORT]
    return val
Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You've six dimensions, and presumably you want to allocate 1 unit of "stuff" across all those six, such that a vector of the allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of numbers, so maybe we only start off with increments of size 0.10. So 10 increments possible, across 6 dimensions, gives you 10^6 possibilities.
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values and pick the best one.
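A rough sketch of that sweep (assuming the six portfolio names from the question; portfolio_value is a hypothetical stand-in for whatever compound or risk-adjusted return function you already have for a weighted combination):

import itertools

PORTS = ["PORT1", "PORT2", "PORT3", "PORT4", "PORT5", "PORT6"]

best_value, best_weights = None, None
# enumerate every allocation in 0.1 increments whose weights sum to 1.0
for combo in itertools.product(range(11), repeat=len(PORTS)):
    if sum(combo) != 10:
        continue
    weights = dict(zip(PORTS, (c / 10.0 for c in combo)))
    value = portfolio_value(weights)  # hypothetical: score this allocation
    if best_value is None or value > best_value:
        best_value, best_weights = value, weights

print(best_weights, best_value)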
That may not be the answer you want; other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset by some uncertainty, and by some potentially difficult conversations with your clients ("What do you mean you did it randomly?!").
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
You can do
import itertools

best_return = 0
for r in range(1, len(PORTS) + 1):  # start at 1: r = 0 is the empty combination, and stopping at len(PORTS) includes the full set
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT
You can also do
max([max([PORT for PORT in itertools.combinations(PORTS, r)], key=returns)
     for r in range(1, len(PORTS) + 1)], key=returns)
However, this is more of an economics question than a CS one. Given a set of positions and their returns and risk, there are explicit formulae to find the optimal portfolio without having to brute force it.

Skipping a pattern of elements using itertools and accompanying list

I have some code that is slow (30-60 mins by last count) that I need to optimize; it is a data extraction script for Abaqus, for a structural engineering model. The worst part of the script is the loop where it iterates through the object model database, first by frame (i.e. the time in the time history of the simulation) and, nested under this, by each of the nodes. The silly thing is that there are ~100k 'nodes' but only about ~20k useful nodes. Luckily for me the nodes are always in the same order, meaning I do not need to look up each node's uniqueLabel; I can do that in a separate loop once and then filter what I get at the end. That is why I have dumped everything into one list and then removed all the nodes that are repeats. But as you can see from the code:
timeValues = []
peeqValues = []
for frame in frames:  # 760 loops
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    for value in setValues:  # 100k loops
        peeqValues.append(value.data)
It still needs to make the value.data calls unnecessarily, about ~80k times. If anyone is familiar with Abaqus odb (object database) objects, they're super slow under python. To add insult to injury they only run in a single thread, under Abaqus which has its own python version (2.6.x) and packages (so e.g. numpy is available, pandas is not). Another thing that may be annoying is the fact that you can address the objects by position e.g. frames[-1] gives you the last frame, but you cannot slice, so e.g. you can't do this: for frame in frames[0:10]: # iterate first 10 elements.
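One workaround for the slicing limitation, for what it's worth (a sketch, untested against a real odb): itertools.islice only needs the object to be iterable, which frames already is, so you can walk just a window of it without indexing:

from itertools import islice

# iterate only the first 10 frames without slicing the odb sequence
for frame in islice(frames, 10):
    print(frame.frameValue)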
I don't have any experience with itertools but I'd want to provide it a list of nodeIDs (or list of True/False) to map onto the setValues. The length and pattern of setValues to skip is always the same for each of the 760 frames. Maybe something like:
for frame in frames:  # still 760 calls
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    # nodeSet_IDs_TF = [True, True, False, False, False, ...] same length as setValues
    filteredSetValues = ifilter(nodeSet_IDs_TF, setValues)
    for value in filteredSetValues:  # only 20k calls
        peeqValues.append(value.data)
Any other tips also appreciated, after this I did want to "avoid the dots" by removing the .append() from the loop, and then putting the whole thing in a function to see if it helps. The whole script already runs in under 1.5 hours (down from 6 and at one point 21 hours), but once you start optimizing there is no way to stop.
Memory considerations also appreciated, I run these on a cluster and I believe I got away once with 80 GB of RAM. The scripts definitely work on 160 GB, the issue is getting the resources allocated to me.
I've searched around for a solution but maybe I'm using the wrong keywords, I'm sure this is not an uncommon issue in looping.
EDIT 1
Here is what I ended up using:
# there is no compress under 2.6.x ... so use the equivalent recipe:
from itertools import izip

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> ACEF
    return (d for d, s in izip(data, selectors) if s)

def iterateOdb(frames, selectors):  # minor speed up
    peeqValues = []
    timeValues = []
    append = peeqValues.append  # minor speed up
    for frame in frames:
        setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
            region=abaqusSet, position=ELEMENT_NODAL).values
        timeValues.append(frame.frameValue)
        for value in compress(setValues, selectors):  # massive speed up
            append(value.data)
    return peeqValues, timeValues

peeqValues, timeValues = iterateOdb(frames, selectors)
The biggest improvement came from using the compress(values, selectors) method (the whole script, including the odb portion, went from ~1:30 hours to 25 mins). There was also a minor improvement from append = peeqValues.append, as well as from enclosing everything in def iterateOdb(frames, selectors):.
I used tips from: https://wiki.python.org/moin/PythonSpeed/PerformanceTips
Thanks to everyone for answering & helping!
If you're not confident with itertools, try using an if statement in your for loop first, e.g.:
for index, item in enumerate(values):
    if not selectors[index]:
        continue
    ...
# where selectors is a truth array like nodeSet_IDs_TF
This way you can be more sure that you are getting the correct behaviour, and you will be getting most of the performance increase you would get from using itertools.
The itertools equivalent is compress.
for item in compress(values, selectors):
    ...
I'm not familiar with abaqus, but the best optimisation you could achieve would be to see if there is any way to give abaqus your selectors so it doesn't have to waste time creating each value, only for it to be thrown away. If abaqus is used for doing large array-based manipulations of data then it is likely this is the case.
Another variant in addition to those in Dunes's solution:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
If you want to keep the output list the same length as setValues, then add an else clause:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
    else:
        peeqValues.append(None)
Here selectors is a vector of True/False values with the same length as setValues.
In this case it is really a matter of taste which one you like. If the full iteration of 76 million nodes (760 x 100 000) takes 30 minutes, the time is not spent in python's loops.
I tried this:
def loopit(a):
for i in range(760):
for j in range(100000):
a = a + 1
return a
IPython's %timeit reports the loop time as 3.54 s. So, the looping spends maybe 0.1 % of the total time.

Hashtables over large natural language word sets

I'm writing a program in Python to do a unigram (and eventually bigram, etc.) analysis of movie reviews. The goal is to create feature vectors to feed into libsvm. I have 50,000-odd unique words in my feature vector (which seems rather large to me, but I am relatively sure I'm right about that).
I'm using the python dictionary implementation as a hashtable to keep track of new words as I meet them, but I'm noticing an enormous slowdown after the first 1000 odd documents are processed. Would I have better efficiency (given the distribution of natural language) if I used several smaller hashtable/dictionaries or would it be the same/worse?
More info:
The data is split into 1500 or so documents, 500-ish words each. There are between 100 and 300 unique words (with respect to all previous documents) in each document.
My current code:
# processes each individual file, tok == filename, v == predefined class
def processtok(tok, v):
    # n is the number of unique words so far,
    # reference is the mapping reference in case I want to add new data later
    # hash is the hashtable
    # statlist is the massive feature vector I'm trying to build
    global n
    global reference
    global hash
    global statlist
    cin = open(tok, 'r')
    statlist = [0]*43990
    statlist[0] = v
    lines = cin.readlines()
    for l in lines:
        line = l.split(" ")
        for word in line:
            if word in hash.keys():
                if statlist[hash[word]] == 0:
                    statlist[hash[word]] = 1
                else:
                    hash[word] = n
                    n += 1
                    ref.write('['+str(word)+','+str(n)+']'+'\n')
                    statlist[hash[word]] = 1
    cin.close()
    return statlist
Also keep in mind that my input data is about 6mb and my output data is about 300mb. I'm simply startled at how long this takes, and I feel that it shouldn't be slowing down so dramatically as it's running.
Slowing down: the first 50 documents take about 5 seconds, the last 50 take about 5 minutes.
@ThatGuy has made the fix, but hasn't actually told you this:
The major cause of your slowdown is the line
if word in hash.keys():
which laboriously makes a list of all the keys so far, then laboriously searches that list for 'word'. The time taken is proportional to the number of keys, i.e. the number of unique words found so far. That's why it starts fast and becomes slower and slower.
All you need is if word in hash: which in 99.9999999% of cases takes time independent of the number of keys -- one of the major reasons for having a dict.
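A quick way to see the difference (a sketch; the second timing is only dramatic under Python 2, where .keys() builds a fresh list on every call):

import timeit

setup = "d = dict.fromkeys(range(50000))"
# membership test against the dict itself: constant-time hash lookup
print(timeit.timeit("49999 in d", setup=setup, number=1000))
# membership test against d.keys(): builds and scans a list each time on Python 2
print(timeit.timeit("49999 in d.keys()", setup=setup, number=1000))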
The faffing about with statlist[hash[word]] doesn't help, either. By the way, the fixed size in statlist=[0]*43990 needs explanation.
More problems
Problem A: Either (1) your code suffered from indentation distortion when you published it, or (2) hash will never be updated by that function. Quite simply, if word is not in hash, i.e. it's the first time you've seen it, absolutely nothing happens. The hash[word] = n statement (the ONLY code that updates hash) is NOT executed. So no word will ever be in hash.
It looks like this block of code needs to be shifted left 4 columns, so that it's aligned with the outer if:
else:
    hash[word] = n
    ref.write('['+str(word)+','+str(n)+']'+'\n')
    statlist[hash[word]] = 1
Problem B: There is no code at all to update n (allegedly the number of unique words so far).
I strongly suggest that you take as many of the suggestions that @ThatGuy and I have made as you care to, rip out all the global stuff, fix up your code, chuck in a few print statements at salient points, and run it over, say, 2 documents each of 3 lines with about 4 words in each. Ensure that it is working properly. THEN run it on your big data set (with the prints suppressed). In any case you may want to put out stats (like number of documents, lines, words, unique words, elapsed time, etc.) at regular intervals.
Another problem
Problem C: I mentioned this in a comment on @ThatGuy's answer, and he agreed with me, but you haven't mentioned taking it up:
>>> line = "foo bar foo\n"
>>> line.split(" ")
['foo', 'bar', 'foo\n']
>>> line.split()
['foo', 'bar', 'foo']
>>>
Your use of .split(" ") will lead to spurious "words" and distort your statistics, including the number of unique words that you have. You may well find the need to change that hard-coded magic number.
I say again: there is no code that updates n in the function. Doing hash[word] = n seems very strange, even if n is updated for each document.
I don't think Python's dictionary has anything to do with your slowdown here, especially when you are saying that the entries are around 100. I am hoping that you are referring to insertion and retrieval, which are both O(1) in a dictionary. The problem could be that you are not using iterators (or loading key, value pairs one at a time) when creating a dictionary, and that you are loading all the words into memory. In that case, the slowdown is due to memory consumption.
I think you've got a few problems going on here. Mostly, I am unsure of what you are trying to accomplish with statlist. It seems to me like it is serving as a poor duplicate of your dictionary. Create it after you have found all of your words.
Here is my guess as to what you want:
def processtok(tok, v):
    global n
    global reference
    global hash
    cin = open(tok, 'rb')
    for l in cin:
        line = l.split(" ")
        for word in line:
            if word in hash:
                hash[word] += 1
            else:
                hash[word] = 1
                n += 1
                ref.write('['+str(word)+','+str(n)+']'+'\n')
    cin.close()
    return hash
Note that this means you no longer need an "n", as you can discover the number of unique words by doing len(hash).

Maximal Length of List to Shuffle with Python random.shuffle?

I have a list which I shuffle with the Python built-in shuffle function (random.shuffle).
However, the Python reference states:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
Now, I wonder what this "rather small len(x)" means. 100, 1000, 10000,...
TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)
Complete answer:
First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the lists, and picking one of these permutations at random.
Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try to generate a number that is hard to guess in advance or to reproduce purposefully. Among these factors is usually the previously generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).
Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1."
So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1 generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.
So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a long number. But, still, depending on your shuffling needs, you should bear all that in mind. On a simplistic case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.
I wrote that comment in the Python source originally, so maybe I can clarify ;-)
When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.
The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.
There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:
>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
What they mean is that the number of permutations of n objects (written n!) grows absurdly fast.
Basically n! = n x n-1 x ... x 1; for example, 5! = 5 x 4 x 3 x 2 x 1 = 120 which means there are 120 possible ways of shuffling a 5-items list.
On the same Python documentation page they give 2^19937-1 as the period, which is 4.something × 10^6001 or so. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn't find the exact figure.)
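A few lines of Python pin down that figure, and they agree with the 2081 cutoff quoted in the earlier answer:

import math

period = 2**19937 - 1  # Mersenne Twister period quoted in the docs
n = 1
while math.factorial(n) <= period:
    n += 1
print(n)  # prints 2081, the smallest list length whose n! exceeds the period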
So basically there are so many possible permutations for shuffle to pick from that there's probably no real reason to worry about the ones it will never generate.
But if it really is an issue (pesky customer asking for a guarantee of randomness perhaps?), you could also offload the task to some third-party; see http://www.random.org/ for example.

How to tractably solve the assignment optimisation task

I'm working on a script that takes the elements from companies and pairs them up with the elements of people. The goal is to optimize the pairings such that the sum of all pair values is maximized (the value of each individual pairing is precomputed and stored in the dictionary ctrPairs).
They're all paired in a 1:1, each company has only one person and each person belongs to only one company, and the number of companies is equal to the number of people. I used a top-down approach with a memoization table (memDict) to avoid recomputing areas that have already been solved.
I believe that I could vastly improve the speed of what's going on here, but I'm not really sure how. The areas I'm worried about are marked with #slow?; any advice would be appreciated (the script works for input lists of n < 15, but it gets incredibly slow for n > ~15).
def getMaxCTR(companies, people):
    if memDict.has_key((companies, people)):
        return memDict[(companies, people)]  # here's where we return the memoized version if it exists
    if not len(companies) or not len(people):
        return 0
    maxCTR = None
    remainingCompanies = companies[1:len(companies)]  # slow?
    for p in people:
        remainingPeople = list(people)  # slow?
        remainingPeople.remove(p)  # slow?
        ctr = ctrPairs[(companies[0], p)] + getMaxCTR(remainingCompanies, tuple(remainingPeople))  # recurse
        if ctr > maxCTR:
            maxCTR = ctr
    memDict[(companies, people)] = maxCTR
    return maxCTR
To all those who wonder about the use of learning theory, this question is a good illustration. The right question is not about a "fast way to bounce between lists and tuples in python" — the reason for the slowness is something deeper.
What you're trying to solve here is known as the assignment problem: given two lists of n elements each and n×n values (the value of each pair), how to assign them so that the total "value" is maximized (or, equivalently, minimized). There are several algorithms for this, such as the Hungarian algorithm (Python implementation), or you could solve it using more general min-cost flow algorithms, or even cast it as a linear program and use an LP solver. Most of these would have a running time of O(n^3).
What your algorithm above does is to try each possible way of pairing them. (The memoisation only helps to avoid recomputing answers for pairs of subsets, but you're still looking at all pairs of subsets.) This approach is at least Ω(n^2 * 2^(2n)). For n=16, n^3 is 4096 and n^2 * 2^(2n) is 1099511627776. There are constant factors in each algorithm of course, but see the difference? :-) (The approach in the question is still better than the naive O(n!), which would be much worse.) Use one of the O(n^3) algorithms, and I predict it should run in time for up to n=10000 or so, instead of just up to n=15.
"Premature optimization is the root of all evil", as Knuth said, but so is delayed/overdue optimization: you should first carefully consider an appropriate algorithm before implementing it, not pick a bad one and then wonder what parts of it are slow. :-) Even badly implementing a good algorithm in Python would be orders of magnitude faster than fixing all the "slow?" parts of the code above (e.g., by rewriting in C).
I see two issues here:
Efficiency: you're recreating the same remainingPeople sublists for each company. It would be better to create all the remainingPeople and all the remainingCompanies once and then do all the combinations.
Memoization: you're using tuples instead of lists so that they can be used as dict keys for memoization, but tuple equality is order-sensitive, i.e. (1,2) != (2,1). You'd be better off using sets and frozensets for this: frozenset((1,2)) == frozenset((2,1)).
This line:
remainingCompanies = companies[1:len(companies)]
Can be replaced with this line:
remainingCompanies = companies[1:]
For a very slight speed increase. That's the only improvement I see.
If you want to get a copy of a tuple as a list you can do
mylist = list(mytuple)
