You are given an array A of integers, each of which is in the range [0, 1000], along with some number m. For example, you might get this input:
A=[5,6,7,8] m=1
The question is to determine, as efficiently as possible, how many distinct, nonempty subarrays there are of the array A that contain at most m even numbers. For example, for the above array, there are eight distinct subarrays with at most one even number, as shown here:
[(5, 6, 7), (6, 7), (5, 6), (8), (5), (6), (7), (7, 8)]
Here's the solution I have so far, which runs in time O(n³):
def beautiful(A, m):
    subs = [tuple(A[i:j]) for i in range(0, len(A)) for j in range(i + 1, len(A) + 1)]
    uniqSubs = set(subs)
    return len([n for n in uniqSubs if sum(int(i) % 2 == 0 for i in n) <= m])
Is there a better solution to this problem - ideally, one that runs in linear time, or at least in O(n²)?
I believe you can do this in linear time by using suffix trees. This is certainly not a lightweight solution - good luck coding up a linear-time algorithm for building a suffix tree with a variable-size alphabet! - but it shows that it's possible.
Here's the idea. Start by building a suffix tree for the array, treating it not as a list of numbers but as a string of characters, where each character is a number. Since you know all the numbers are at most 1,000, the number of distinct characters is a constant, so using a fast linear-time construction (for example, building a suffix array with SA-IS and converting it), you can build the suffix tree in time O(n).
Suffix trees are a nice structure here because they collapse repeated copies of the same substring onto a single shared path, which makes it easier to deduplicate things. For example, if the pattern [1, 3, 7] appears multiple times in the array, there will be exactly one path from the root starting with [1, 3, 7].
The question now is how to go from the suffix tree to the number of distinct subarrays. For now, let's tackle an easier question - how do you count up the number of distinct subarrays, period, completely ignoring the restriction on odd and even numbers? This, fortunately, turns out to be a well-studied problem that can be solved in linear time. Essentially, every prefix encoded in the suffix tree corresponds to a distinct subarray of the original array, so you just need to count up how many prefixes there are. That can be done by recursively walking the tree, adding up, for each edge in the tree, how many characters are along that edge. This can be done in time O(n) because a suffix tree for an array/string of length n has O(n) nodes, and we spend a constant amount of time processing each node (just by looking at the edge above it.)
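To make the counting step concrete, here is a hedged sketch of the same idea phrased with a suffix array built naively by sorting suffixes (so it runs in O(n² log n), not linear time - it only shows where the number comes from; a real solution would read the same quantities off the O(n) suffix tree). Each suffix contributes its length minus its longest common prefix with the previous suffix in sorted order, which is exactly the count of characters on "new" edges:

def count_distinct_subarrays(A):
    # Each sorted suffix contributes one new distinct subarray per character
    # it does not share with the previous suffix.
    if not A:
        return 0
    suffixes = sorted(tuple(A[i:]) for i in range(len(A)))
    def lcp(x, y):
        k = 0
        while k < min(len(x), len(y)) and x[k] == y[k]:
            k += 1
        return k
    total = len(suffixes[0])
    for prev, cur in zip(suffixes, suffixes[1:]):
        total += len(cur) - lcp(prev, cur)
    return total

print(count_distinct_subarrays([5, 6, 7, 8]))   # 10

For A = [5, 6, 7, 8] this reports 10 distinct subarrays in total, of which only the 8 listed in the question also satisfy the at-most-one-even constraint - which is exactly what the rest of the machinery below takes care of.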
So now we just need to incorporate the restriction on the number of even numbers you're allowed to use. This complicates things a little bit, but the reason why is subtle. Intuitively, it seems like this shouldn't be a problem. We could, after all, just do a DFS of the suffix tree and, as we go, count the number of even numbers on the path we've traversed, stopping as soon as we exceed m.
The problem with this approach is that even though the suffix tree has O(n) nodes in it, the edges implicitly encode ranges whose lengths can be as high as n itself. As a result, the act of scanning the edges could itself blow the runtime up to Ω(n²): visiting Θ(n) edges and doing Ω(n) work per edge.
We can, however, speed things up a little bit. Each edge in a suffix tree is typically represented as a pair of indices [start, stop] into the original array. So let's imagine that, as an additional preprocessing step, we build a table Evens such that Evens[n] returns the number of even numbers in the array up to and including position n (with Evens[0] = 0). Then we can count the number of even numbers in any range [start, stop] by computing Evens[stop] - Evens[start - 1]. That takes time O(1), and it means that we can aggregate the number of even numbers we encounter along a path in time proportional to the number of edges followed, not the number of characters encountered.
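Here's a minimal sketch of that preprocessing step (the helper names build_evens and evens_in_range are mine, not anything standard); it uses 1-indexed positions so that the subtraction above works directly:

def build_evens(A):
    # Evens[i] = number of even values among A[1..i] (1-indexed); Evens[0] = 0
    evens = [0] * (len(A) + 1)
    for i, x in enumerate(A, start=1):
        evens[i] = evens[i - 1] + (x % 2 == 0)
    return evens

def evens_in_range(evens, start, stop):
    # number of even values in the inclusive, 1-indexed range [start, stop]
    return evens[stop] - evens[start - 1]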
... except that there's one complication. What happens if we have a very long edge where, prior to reading that edge, we know that we're below the even number limit, and after reading that edge, we know that we're above the limit? That means that we need to stop partway through the edge, but we're not sure exactly where that is. That might require us to do a linear search over the edge to find the crossover point, and there goes our runtime.
But fortunately, there's a way out of that little dilemma. (This next section contains an improvement found by @Matt Timmermans.) As part of the preprocessing, in addition to the Evens array, build a second table KthEven, where KthEven[k] returns the position of the kth even number in the array. This can be built in time O(n) using the Evens array. Once you have this, let's imagine that you have a bad edge, one that will push you over the limit. If you know how many even numbers you've encountered so far, you can determine the index of the even number that will push you over the limit. Then, you can look up where that even number is by indexing into the KthEven table in time O(1). This means that we only need to spend O(1) work per edge in the suffix tree, pushing our runtime down to O(n)!
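As a sketch (again with hypothetical helper names, reusing the 1-indexed Evens table from above), the KthEven table and the "where does this edge push me over the limit?" lookup might look like this:

def build_kth_even(A):
    # kth_even[k] = 1-indexed position of the k-th even number in A
    kth_even = [None]
    for i, x in enumerate(A, start=1):
        if x % 2 == 0:
            kth_even.append(i)
    return kth_even

def usable_edge_length(evens, kth_even, start, stop, used, m):
    # The edge covers A[start..stop]; `used` evens were already seen on the
    # path down to this edge, and at most m evens are allowed in total.
    budget = m - used                        # evens we may still consume
    rank = evens[start - 1] + budget + 1     # rank of the first even we must not reach
    if rank >= len(kth_even) or kth_even[rank] > stop:
        return stop - start + 1              # the whole edge stays within the limit
    return kth_even[rank] - start            # usable characters before the crossover

For A = [5, 6, 7, 8] and m = 1, an edge covering the whole array with no evens used yet gives a usable length of 3, matching the fact that [5, 6, 7] is still allowed but [5, 6, 7, 8] is not.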
So, to recap, here's a linear-time solution to this problem:
Build a suffix tree for the array using a fast suffix tree construction algorithm, like SA-IS or Ukkonen's algorithm. This takes time O(n) because there are at most 1,000 different numbers in the string, and 1,000 is a constant.
Compute the table Evens[n] in time O(n).
Compute the table KthEven[n] in time O(n).
Do a DFS over the tree, keeping track of the number of even numbers encountered so far. When encountering an edge [start, stop], compute how many even numbers are in that range using Evens in time O(1). If that keeps you at or below the limit, keep recursing. If not, use the KthEven table to figure out how much of the edge is usable in time O(1). Either way, increment the global count of the number of distinct subarrays by the usable length of the current edge. This does O(1) work for each of the O(n) edges in the suffix tree, for a total of O(n) work.
Phew! That wasn't an easy problem. I imagine there's some way to simplify this construction, and I'd welcome comments and suggestions about how to do this. But it shows that it is indeed possible to solve this problem in O(n) time, which is not immediately obvious!
I'm building a web app to match high school students considering a gap year with students who have taken a gap year, based on interests as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm. People have suggested things like collaborative filtering, association rule mining, or adapting the stable marriage problem, but I don't think any of those will work because it's a small dataset (a few hundred users right now, a few thousand soon), so I wrote my own algorithm using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (and who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset). If it can't find an exact match with ALL of the user's inputted tags, it checks all (n-1)-length subsets of the tags list to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to a worker timeout.
I did some tests by putting variables inside the algorithm to count computations, and it seems that when it goes a bit deep into the recursive stack, it checks a few hundred subsets each time (to see if there's an exactMatch for that subset and, if there is, add it to the results list to output), and the total number of computations roughly doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations for more and more tags). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, be gone through around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, since selecting and iterating over all subsets seems like something that could get expensive quickly, but my question is about why it's so slow despite only doing a few thousand computations. I know my computer operates in GHz and I expect web servers are similar, so surely a few thousand computations would be near-instantaneous? What am I missing, and how can I improve this algorithm? Any other approaches I should look into?
import itertools

counter = 0

# takes in a list of length n and returns a list of all combos of subsets of depth n
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq) - n))

# takes in a tagsList and checks Gapper.objects.all to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches

# takes in a tagsList that has been cleaned to remove any tags that NO gappers have
# and then checks gapper objects to find the optimal match
def matchGapper(tagsList, depth, results):
    global counter
    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []
    # counter variable is to measure complexity for debugging
    counter += 1
    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3
    # now we must check subsets for a match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # we need to check because we might be adding depth-2 matches to results from depth 1,
                # which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results:
                        break
                    results.append(match)
    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if there aren't enough matching gappers
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])
It doesn't seem like you are doing only a few hundred computation steps. In fact, you have a few hundred options at each depth, so you should not add but multiply the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.
Okay, so after much fiddling with timers, I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper and arbSubsets. When I put the counter into a global variable and measured operations (counted as lines of my code being executed), it came in at around 2-10K for large inputs (around 10 tags).
It is true that arbSubsets, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling small numbers of tags (on the order of 10-50) and, more importantly, 2) only calling arbSubsets when we recurse matchGapper, which only happens a maximum of about 10 times, since tagsList can only be around 10 long (on the order of 10-50, as above). And when I checked the time it took to generate arbSubsets, it was on the order of 2e-5 seconds. And so the total time spent on generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so, with that aside, knowing that arbSubsets is only called on the order of 10 times, and is fast at that, and knowing that there are only around a maximum of 10K computations taking place in my code, it starts to become clear that I must be using some out-of-the-box function--like set() or .issubset() or something like that--that takes a nontrivial amount of time to compute and is executed many times. Adding some counters in some more places, it becomes clear that exactMatches() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exact matches).
So the problem, at this point, is reduced to the fact that exactMatches takes around 0.02s (empirically) as implemented and is called several thousand times. And so we can either try to make it faster by a couple of orders of magnitude (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting them equal to lists of registered profiles with that exact combination. This way, querying is just a lookup in a (huge) dict, which can be done fast. Any other suggestions are welcome.
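For what it's worth, here is a hedged sketch of that precomputation idea (Gapper and tags.names() are the models from the question; build_tag_index, the size cap, and everything else are hypothetical names of mine): index, once, every tag combination that each profile covers, so that a query becomes a single dictionary lookup instead of thousands of database scans.

from collections import defaultdict
from itertools import combinations

def build_tag_index(gappers, max_query_size=10):
    # index[frozenset_of_tags] -> list of profiles containing all of those tags
    index = defaultdict(list)
    for gapper in gappers:
        tags = sorted(set(gapper.tags.names()))
        for r in range(1, min(len(tags), max_query_size) + 1):
            for combo in combinations(tags, r):
                index[frozenset(combo)].append(gapper)
    return index

# at query time, an exact match is just:
# matches = index.get(frozenset(tagsList), [])

Note that the index can blow up if profiles carry many tags (it stores up to 2^t entries per profile, capped here by max_query_size), so whether this beats simply loading all profiles into memory once and filtering them there depends on your data.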
Assume that I have two lists named a and b, both of size n, and I want to do the following slice-assignment operation with k < n:
a[:k] = b[:k]
The Python wiki's Time Complexity page says that the complexity of slice assignment is O(n+k), where k is the length of the slice. I just cannot understand why it is not simply O(k) in the above situation.
I know that slicing returns a new list, so that is O(k), and I know that a list holds its data contiguously, so inserting an item in the middle would take O(n) time. But the above operation can easily be done in O(k) time. Am I missing something?
Furthermore, is there any documentation where I can find detailed information about such issues? Should I look into the CPython implementation?
Thanks.
O(n+k) is the average case, which includes having to grow or shrink the list to adjust for the number of elements inserted to replace the original slice.
In your case, where you replace the slice with an equal number of new elements, the implementation only takes O(k) steps. But taken over all possible combinations of the number of elements inserted and deleted, the average case has to move the n remaining elements in the list up or down.
See the list_ass_slice function for the exact implementation.
You're right, if you want to know the exact details it's best to use the source. The CPython implementation of setting a slice is in listobject.c.
If I read it correctly, it will...
Count how many new elements you're inserting (or deleting!)
Shift the n existing elements of the list over enough places to make room for the new elements, taking O(n) time in the worst case (when every element of the list has to be shifted).
Copy over the new elements into the space that was just created, taking O(k) time.
That adds up to O(n+k).
Of course, your case is probably not that worst case: you're replacing the first k elements with exactly k new elements, so no shifting is needed at all, reducing the complexity to the O(k) you expected. However, that is not true in general.
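A small illustration of the difference (plain Python; this demonstrates the observable behavior, not the CPython internals):

a = list(range(10))
b = list(range(100, 110))
k = 4

a[:k] = b[:k]    # same number of elements: only k slots are overwritten
print(a)         # [100, 101, 102, 103, 4, 5, 6, 7, 8, 9]

a[:k] = [0]      # fewer elements: the remaining items must shift left
print(a)         # [0, 4, 5, 6, 7, 8, 9]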
I am writing a small script that guesses numeric passwords (including ones with leading zeros). The script works fine but I am having trouble understanding what the worst case time complexity would be for this algorithm. Any insight on the complexity of this implementation would be great, thanks.
import hashlib
import itertools

def bruteforce(cipherText):
    for pLen in itertools.count():
        for password in itertools.product("0123456789", repeat=pLen):
            if hashlib.sha256("".join(password)).hexdigest() == cipherText:
                return "".join(password)
First, it's always possible that you're going to find a hash collision before you find the right password. And, for a long-enough input string, this is guaranteed. So, really, the algorithm is constant time: it will complete in about 2^256 steps no matter what the input is.
But this isn't very helpful when you're asking about how it scales with more reasonable N, so let's assume we had some upper limit that was low enough where hash collisions weren't relevant.
Now, if the password is length N, the outer loop will run N times.*
* I'm assuming here that the password really will be purely numeric. Otherwise, of course, it'll fail to find the right answer at N and keep going until it finds a hash collision.
How long does the inner loop take? Well, the main thing it does is iterate each element in product("0123456789", repeat=pLen). That just iterates the Cartesian product of pLen copies of the 10-character alphabet—in other words, there are 10^pLen elements in the product.
Since 10**pLen is greater than sum(10**i for i in range(pLen)) (e.g., 100000 > 11111), we can ignore all but the last time through the outer loop, so that 10**pLen is the total number of inner loops.
The stuff that it does inside each inner loop is all linear on pLen (joining a string, hashing a string) or constant (comparing two hashes), so there are (10^pLen)*pLen total steps.
So, the worst-case complexity is exponential: O(10^N). Since 10^N = 2^(log2(10)*N), it is exponential in N in the same sense as 2^N, just with a larger constant in the exponent.
If you want to combine the two together, you could call this O(2^min(log2(10)*N, 256)). Which, again, is constant-time (since the asymptote is still 2^256), but shows how it's practically exponential if you only care about smaller N.
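As a quick sanity check of the "last length dominates" claim (assuming a purely numeric password of length N), this compares the number of shorter candidates tried against the 10**N candidates of length N itself:

for N in range(1, 8):
    shorter = sum(10**i for i in range(N))   # candidates of length 0..N-1
    print(N, shorter, 10**N)                 # 10**N is always roughly 9x larger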
I have a list which I shuffle with the Python built-in shuffle function (random.shuffle).
However, the Python reference states:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
Now, I wonder what this "rather small len(x)" means. 100, 1000, 10000,...
TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)
Complete answer:
First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the lists, and picking one of these permutations at random.
Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try to generate a number that is hard to guess in advance, or to purposefully reproduce. Among these factors is usually the previously generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).
Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1."
So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1 generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.
So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a large number. But, still, depending on your shuffling needs, you should bear all that in mind. In a simple case (and with a quick calculation): for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.
I wrote that comment in the Python source originally, so maybe I can clarify ;-)
When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.
The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.
There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:
>>> import math
>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
What they mean is that the number of permutations of n objects (written n!) grows absurdly fast.
Basically n! = n × (n−1) × ... × 1; for example, 5! = 5 × 4 × 3 × 2 × 1 = 120, which means there are 120 possible ways of shuffling a 5-item list.
On the same Python documentation page they give 2^19937-1 as the period, which is roughly 4.3 × 10^6001. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn't find the exact figure.)
So basically there are so many possible permutations for shuffle to pick from that there's probably no real reason to worry about the ones it will never produce.
But if it really is an issue (pesky customer asking for a guarantee of randomness perhaps?), you could also offload the task to some third-party; see http://www.random.org/ for example.
I'm working on a statistical project that involves iterating over every possible way to partition a collection of strings and running a simple calculation on each. Specifically, each possible substring has a probability associated with it, and I'm trying to get the sum across all partitions of the product of the substring probability in the partition.
For example, if the string is 'abc', then there would be probabilities for 'a', 'b', 'c', 'ab', 'bc' and 'abc'. There are four possible partitionings of the string: 'abc', 'ab|c', 'a|bc' and 'a|b|c'. The algorithm needs to find the product of the component probabilities for each partitioning, then sum the four resultant numbers.
Currently, I've written a Python iterator that uses binary representations of integers for the partitions (e.g., 00, 01, 10, 11 for the example above) and simply runs through the integers. Unfortunately, this is immensely slow for strings longer than 20 or so characters.
Can anybody think of a clever way to perform this operation without simply running through every partition one at a time? I've been stuck on this for days now.
In response to some comments here is some more information:
The string can be just about anything, e.g. "foobar(foo2)" -- our alphabet is lowercase alphanumeric plus all three types of brackets ("(", "[", "{"), hyphens and spaces.
The goal is to get the likelihood of the string given individual 'word' likelihoods. So L(S='abc')=P('abc') + P('ab')P('c') + P('a')P('bc') + P('a')P('b')P('c') (Here "P('abc')" indicates the probability of the 'word' 'abc', while "L(S='abc')" is the statistical likelihood of observing the string 'abc').
A Dynamic Programming solution (if I understood the question right):
def dynProgSolution(text, probs):
    probUpTo = [1]
    for i in range(1, len(text) + 1):
        cur = sum(v * probs[text[k:i]] for k, v in enumerate(probUpTo))
        probUpTo.append(cur)
    return probUpTo[-1]

print(dynProgSolution(
    'abc',
    {'a': 0.1, 'b': 0.2, 'c': 0.3,
     'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
))
The complexity is O(N²) so it will easily solve the problem for N=20.
Why does this work:
Everything you will multiply by probs['a']*probs['b'] you will also multiply by probs['ab']
Thanks to the Distributive Property of multiplication and addition, you can sum those two together and multiply this single sum by all of its continuations.
For every possible last substring, it adds the sum of all splits ending with that substring: its probability multiplied by the sum of the probabilities of all ways of splitting everything that comes before it.
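As a quick cross-check of the recurrence (assumptions: the same probs dictionary as above, and a hypothetical bruteForce helper that enumerates partitions with the binary-mask scheme described in the question):

from itertools import product

def bruteForce(text, probs):
    total = 0.0
    for mask in product([0, 1], repeat=len(text) - 1):   # 1 = cut after that position
        prob, start = 1.0, 0
        for i, cut in enumerate(mask, start=1):
            if cut:
                prob *= probs[text[start:i]]
                start = i
        prob *= probs[text[start:]]                       # last piece
        total += prob
    return total

probs = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
print(bruteForce('abc', probs))        # 0.776
print(dynProgSolution('abc', probs))   # 0.776, matching the DP above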
First, profile to find the bottleneck.
If the bottleneck is simply the massive number of possible partitions, I recommend parallelization, possibly via multiprocessing. If that's still not enough, you might look into a Beowulf cluster.
If the bottleneck is just that the calculation is slow, try shelling out to C. It's pretty easy to do via ctypes.
Also, I'm not really sure how you're storing the partitions, but you could probably squash memory consumption a pretty good bit by using one string and a suffix array. If your bottleneck is swapping and/or cache misses, that might be a big win.
Your substrings are going to be reused over and over again by the longer strings, so caching the values using a memoizing technique seems like an obvious thing to try. This is just a time-space trade off. The simplest implementation is to use a dictionary to cache values as you calculate them. Do a dictionary lookup for every string calculation; if it's not in the dictionary, calculate and add it. Subsequent calls will make use of the pre-computed value. If the dictionary lookup is faster than the calculation, you're in luck.
I realise you are using Python, but... as a side note that may be of interest, if you do this in Perl, you don't even have to write any code; the built-in Memoize module will do the caching for you!
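In Python, a hedged sketch of the same memoization idea (not the asker's code; functools.lru_cache plays the role of Perl's Memoize, and likelihood is a hypothetical name) could look like this:

from functools import lru_cache

def likelihood(text, probs):
    @lru_cache(maxsize=None)
    def L(s):
        if not s:
            return 1.0
        # peel off every possible first 'word' and recurse on the remainder;
        # the cache ensures each suffix is computed only once
        return sum(probs[s[:i]] * L(s[i:]) for i in range(1, len(s) + 1))
    return L(text)

print(likelihood('abc', {'a': 0.1, 'b': 0.2, 'c': 0.3,
                         'ab': 0.4, 'bc': 0.5, 'abc': 0.6}))   # 0.776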
You may get a minor reduction in the amount of computation by a small refactoring based on the distributive property of arithmetic (and string concatenation), though I'm not sure it will be a life-changer. The core idea would be as follows:
consider a longish string e.g. 'abcdefghik', 10-long, for definiteness w/o loss of generality. In a naive approach you'd be multiplying p(a) by the many partitions of the 9-tail, p(ab) by the many partitions of the 8-tail, etc; in particular p(a) and p(b) will be multiplying exactly the same partitions of the 8-tail (all of them) as p(ab) will -- 3 multiplications and two sums among them. So factor that out:
(p(ab) + p(a) * p(b)) * (partitions of the 8-tail)
and we're down to 2 multiplications and 1 sum for this part, having saved 1 product and 1 sum, to cover all partitions with a split point just right of 'b'. When it comes to partitions with a split just right of 'c',
(p(abc) + p(ab) * p(c) + p(a) * (p(b) * p(c) + p(bc))) * (partitions of the 7-tail)
the savings mount, partly thanks to the internal refactoring -- though of course one must be careful about double-counting. I'm thinking that this approach may be generalized -- start with the midpoint and consider all partitions that have a split there, separately (and recursively) for the left and right part, multiplying and summing; then add all partitions that DON'T have a split there, e.g. in the example, the halves being 'abcde' on the left and 'fghik' on the right, the second part is about all partitions where 'ef' are together rather than apart -- so "collapse" all probabilities by considering that 'ef' as a new 'superletter' X, and you're left with a string one shorter, 'abcdXghik' (of course the probabilities for the substrings of THAT map directly to the originals, e.g. the p(cdXg) in the new string is just exactly the p(cdefg) in the original).
You should look into the itertools module. It can create a generator for you that is very fast. Given your input string, it will provide you with all possible permutations. Depending on what you need, there is also a combinations() generator. I'm not quite sure if you're looking at "b|ca" when you're looking at "abc," but either way, this module may prove useful to you.