How to compute the largest anagram group using a trie? - python

I am currently working on a project that requires me to compute the largest anagram group in a dictionary using a trie, and the returned result must be in alphabetical order. The solution should compute the result in O(C) time and space, where C is the total number of characters in the dictionary.
I plan to construct the trie by creating a new array of size 26 at each node, where each index of the array represents a letter from A to Z, e.g. array[0] = A, array[1] = B, etc. I will also map each letter from A to Z to a unique prime so that I can compute a unique number for each anagram group.
During insertion, at each letter's index I store a pointer to the next letter's array together with the product of the current letter's prime and the running product from the previous letters.
I am not certain whether my approach is correct, as I feel the space complexity probably does not meet the requirement; any suggestion or hint would be appreciated.
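For what it's worth, here is a minimal sketch of just the prime-product grouping part (it ignores the trie requirement, assumes lowercase a-z words, and the word list is made up): every anagram of a word multiplies out to the same key, so the largest group is the biggest bucket, returned in alphabetical order.

from collections import defaultdict

# First 26 primes, one per letter a-z.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def largest_anagram_group(words):
    groups = defaultdict(list)
    for w in words:
        key = 1
        for ch in w:
            key *= PRIMES[ord(ch) - ord('a')]   # anagrams share this product
        groups[key].append(w)
    best = max(groups.values(), key=len)        # largest anagram group
    return sorted(best)                         # alphabetical order, as required

print(largest_anagram_group(["listen", "silent", "enlist", "google", "banana"]))
# ['enlist', 'listen', 'silent']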

Related

Time complexity of quick sorting array of strings with specific total number of characters

How can I change quick sort so that the average time complexity of sorting an array of strings becomes K + n log n comparisons? K is the total number of characters in the array, n the length of the array. As far as I know, quick sort has an average complexity of n log n for an array of numbers. The average length of each string is K/n, so the average complexity for an array of strings is K/n * (n log n).
I believe this can be achieved in two steps: first change the strings to numbers in K comparisons, and then quick sort the numbers. However, I am struggling to convert the strings to numbers using the ASCII value of each character in K time.
Is there a way to achieve my goal by just changing quick sort? Is there a way to convert strings to numbers and vice versa in K time? I tried a loop using reduce/map, but I could not do this in K comparisons.

How can I scramble an array in linear time without having duplicate elements next to one another?

For example:
T = [b, c, b, b, a, c] #each element represents questions related to topics a, b, c
What I want is to make sure that no 2 questions from the same topic are next to one another (see T, where two b's are next to each other).
So I want to rearrange T in such a way that no two questions belonging to the same topic are next to each other, i.e. Tnew = [b, c, b, a, b, c].
But the condition is that we have to do it in linear time, i.e. O(n).
The algorithm that I thought of:
1) Create a dict or map to hold the occurrences of each topic:
a --> 1
b --> 3
c --> 2
2) Now based on the counts we can create new array such that:
A = [a, b, b, b, c, c]
3) Now perform an "unsort" of the array, which I believe runs in O(n).
(Unsorting is basically: find the midpoint and then merge the elements alternately from each half.)
Can someone please help me design a pseudocode or algorithm that can do this better on any input with k topics?
This is a random question that I am practicing for an exam.
There is a linearithmic approach with time complexity O(n log c), where c is the number of unique items and n is the total number of items.
It works as follows:
Create a frequency table of the items. This is just a list of tuples that tells how many of each item are available. Let's store the tuples in the list as (quantity, item). This step is O(n). You could use collections.Counter in Python, or collections.defaultdict(int), or a vanilla dict.
Heapify the list of tuples in a max heap. This can be done in O(n). This heap has the items with the largest number of quantity at the front. You could use heapq module in python. Let's call this heap hp.
Have a list for the results called res.
Now run a loop while len(hp) > 0: and do as follows:
pop the 2 largest elements from the heap. O(log c) operation.
add one from each to res. Make sure you handle edge cases properly, if any.
decrement the quantity of both items. If their quantity > 0 push them back on the heap. O(log c) operation.
At the end, you could end up with one item that has no peer left to interleave with. This can happen if the quantity of one item is larger than the sum of the quantities of all the other items. But there's no way around this; your input data must respect this condition.
One final note about time complexity: If the number of unique items is constant, we could drop the log c factor from the time complexity and consider it linear. This is mainly a case of how we define things.
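A minimal sketch of the approach just described (heapq is a min-heap, so quantities are stored negated; the leftover check at the end is one way to handle the edge case):

import heapq
from collections import Counter

def interleave(items):
    counts = Counter(items)                          # frequency table, O(n)
    hp = [(-qty, item) for item, qty in counts.items()]
    heapq.heapify(hp)                                # O(c)
    res = []
    while len(hp) > 1:
        q1, a = heapq.heappop(hp)                    # two most frequent items, O(log c)
        q2, b = heapq.heappop(hp)
        res.extend([a, b])
        if q1 + 1 < 0:                               # decrement and push back if any left
            heapq.heappush(hp, (q1 + 1, a))
        if q2 + 1 < 0:
            heapq.heappush(hp, (q2 + 1, b))
    if hp:                                           # at most one item left without a peer
        q, a = heapq.heappop(hp)
        if -q > 1:
            raise ValueError("no valid arrangement exists")
        res.append(a)
    return res

print(interleave(['b', 'c', 'b', 'b', 'a', 'c']))    # e.g. ['b', 'c', 'b', 'a', 'b', 'c']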
Here's the O(n) solution (inspired in part by user1984's answer):
Imagine you know how many of each element to insert, and have ordered these counts. Say we then decided to build up a solution by interleaving groups of elements incrementally. We start off with just our group of elements G0 with lowest frequency. Then, we take the next most popular group G1, and interleave these values into our existing list.
If we were to continue in this fashion, we could observe a few rules:
if the current group G has more elements than all the other, smaller groups combined plus one, then:
the result will have elements of G neighboring each other
regardless of its prior state, no elements from the smaller groups will neighbor each other (neither inter-group nor intra-group)
otherwise
the result will have no elements of G neighboring each other
regardless, G contains enough elements to separate individual elements of the next smallest group G-1, if positioned wisely.
With this in mind, we can see (recursively) that as long as we shift our interleaving to overlap any outstanding neighbor violations, we can guarantee that as long as G has fewer elements than the smaller groups combined plus two, the overall result absolutely will not have any neighbor violations.
Of course, the logic I outlined above for computing this poses some performance issues, so we're going to compute the same result in a slightly different, but equivalent, way. We'll instead insert the largest group first, and work our way smaller.
First, create a frequency table. You need to form an item: quantity mapping, so something like collections.Counter() in Python. This takes O(N) time.
Next, order this mapping by frequency. This can be done in O(N) time using counting sort, since the counts are integers bounded by N. Note there are c distinct items, but c <= N, so O(c) is O(N).
After that, build a linked list of length N, with node values from [0, N) (ascending). This will help us track which indices to write into next.
For each item in our ordered mapping, iterate from 0 to the associated count (exclusive). Each iteration, remove the current element from the linked list ((re)starting at the head), and traverse two nodes forward in the linked list. Insert the item/number into the destination array at the index of each removed node. This will take O(N) time since we traverse ~2k nodes per item (where k is the group size), and the combined size of all groups is N, so 2N traversals. Each removal can be performed in O(1) time, so O(N) for traversing and removing.
So all in all, this will take O(N) time, utilizing linked lists, hash tables (or O(1) access mappings of some sort), and counting sort.
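The linked-list bookkeeping is fiddly to spell out, so here is a short sketch that gets the same O(n) result by a simpler, equivalent route (my own simplification, not the literal linked list above): lay the groups out from most to least frequent, then write them into the even indices first and the odd indices second.

from collections import Counter

def rearrange(items):
    n = len(items)
    counts = Counter(items)                              # O(n) frequency table
    if max(counts.values()) > (n + 1) // 2:
        raise ValueError("no valid arrangement exists")
    # Counting sort by frequency (counts are bounded by n), largest group first: O(n).
    buckets = [[] for _ in range(n + 1)]
    for item, cnt in counts.items():
        buckets[cnt].append(item)
    laid_out = [item for cnt in range(n, 0, -1) for item in buckets[cnt] for _ in range(cnt)]
    # Fill even positions first, then odd positions.
    res = [None] * n
    slots = list(range(0, n, 2)) + list(range(1, n, 2))
    for slot, item in zip(slots, laid_out):
        res[slot] = item
    return res

print(rearrange(['b', 'c', 'b', 'b', 'a', 'c']))         # ['b', 'c', 'b', 'c', 'b', 'a']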
The collections module can help. Using Counter gets you the number of occurrences of each question in O(n) time. Converting those into iterators in a deque will allow you to interleave the questions sequentially, but you need to process them in decreasing order of occurrences. Getting the counters in order of frequency would normally require a sort, which is O(n log n), but you can use a pigeonhole approach to group the iterators by common frequency in O(n) time and then go through the groups in reverse order of frequency. The maximum number of distinct frequencies is knowable and will be less than or equal to √(0.25+2n) - 0.5, which is O(√n).
There will be at most n iterators, so building the deque is O(n). Going through the iterators until exhaustion will take at most 2n iterations:
T = ['b', 'c', 'b', 'b', 'a', 'c']
from collections import Counter, deque
from itertools import repeat
result = []
counts = Counter(T)                                    # count occurrences O(n)
common = [[] for _ in range(max(counts.values()))]     # frequency groups
for t, n in counts.items():
    common[n-1].append(repeat(t, n))                   # iterators by frequency
Q = deque(iq for cq in reversed(common) for iq in cq)  # queue iterators, most frequent first
while Q:                            # 2n iterations or less
    iq = Q.popleft()                # O(1) - extract first iterator
    q = next(iq, None)              # O(1) - next question
    if q is None: continue          #        exhausted, drop the iterator
    result.append(q)                # O(1) - add question to result
    Q.insert(1, iq)                 # O(1) - put the iterator back as 2nd
print(result)
['b', 'c', 'b', 'c', 'b', 'a']

find longest noncontiguous nondecreasing subsequence in array

I want to prove that finding the longest non-contiguous non-decreasing subsequence of an array of size n cannot be done in O(n).
By "find" I mean know its length, and a list of the relevant indices.
Here is a solution in O(n log n).
Here is the Wikipedia article.
I want to convince myself that it can't be done any faster.
Here is partial proof:
Assume this were possible faster than O(n log n); for simplicity, say O(n), but the argument holds for anything better than O(n log n).
We can merge two sorted arrays into a single sorted array composed of all elements of both in O(n1 + n2).
Given an array A, we could then find its longest non-contiguous non-decreasing subsequence in O(n).
If this sequence is shorter than n/2, then for reversed(A) it is larger than or equal to n/2 [I need proof for that].
This way, we can split the array into sorted chunks, each time in O(n), being left with a remainder of size k that can also be split and sorted in O(k) + O(remainder), until we are left with one element, which is O(1).
Thus, sorting the array would take O(n), contradicting the Ω(n log n) lower bound for comparison sorting.
Here is a solution in O(n + k log k), where k is the number of elements that are in an unsorted position.
Armed with the above algorithm, we can find which of the k elements are out of their position by sorting with the above algorithm in O(n + k log k) and traversing the unsorted and the sorted arrays together, finding the indices of the k (or fewer) differences (single pass, O(n)).
The rest of the indices define a sorted array which is also a non-contiguous subsequence of the input array of maximal length, because by definition those indices are sorted, and the other k aren't.
In general, this is O(n log n), but when there is a need to search for a longest non-contiguous non-decreasing subsequence it can make sense to optimize like this, because the problem will probably be such that k << n.
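For reference, the O(n log n) solution linked above is usually the patience-sorting / binary-search approach; here is a minimal sketch of the non-decreasing variant (length only), where tails[i] holds the smallest possible tail of a subsequence of length i+1:

from bisect import bisect_right

def longest_nondecreasing_length(a):
    tails = []
    for x in a:
        i = bisect_right(tails, x)   # bisect_right so equal values may extend a run
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

print(longest_nondecreasing_length([3, 1, 2, 2, 5, 4]))  # 4, e.g. [1, 2, 2, 4]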

Best way to hash ordered permutation of [1,9]

I'm trying to implement a method to keep the visited states of the 8-puzzle from being generated again.
My initial approach was to save each visited pattern in a list and do a linear check each time the algorithm wants to generate a child.
Now I want to do this in O(1) time through list access. Each pattern in the 8-puzzle is an ordered permutation of the numbers 1 to 9 (9 being the blank block); for example 125346987 is:
1 2 5
3 4 6
_ 8 7
The number of all possible permutations of this kind is 362,880 (9!). What is the best way to hash these numbers to indexes of a list of that size?
You can map a permutation of N items to its index in the list of all permutations of N items (ordered lexicographically).
Here's some code that does this, and a demonstration that it produces indexes 0 to 23 once each for all permutations of a 4-letter sequence.
import itertools

def fact(n):
    r = 1
    for i in range(n):
        r *= i + 1
    return r

def I(perm):
    if len(perm) == 1:
        return 0
    return sum(p < perm[0] for p in perm) * fact(len(perm) - 1) + I(perm[1:])

for p in itertools.permutations('abcd'):
    print(p, I(p))
The best way to understand the code is to prove its correctness. For an array of length n, there are (n-1)! permutations with the smallest element of the array appearing first, (n-1)! permutations with the second smallest element appearing first, and so on.
So, to find the index of a given permutation, count how many items are smaller than the first thing in the permutation and multiply that by (n-1)!. Then recursively add the index of the remainder of the permutation, considered as a permutation of (n-1) elements. The base case is a permutation of length 1. Obviously there's only one such permutation, so its index is 0.
A worked example: [1324].
[1324]: 1 appears first, and that's the smallest element in the array, so that gives 0 * (3!)
Removing 1 gives us [324]. The first element is 3. There's one element that's smaller, so that gives us 1 * (2!).
Removing 3 gives us [24]. The first element is 2. That's the smallest element remaining, so that gives us 0 * (1!).
Removing 2 gives us [4]. There's only one element, so we use the base case and get 0.
Adding up, we get 0*3! + 1*2! + 0*1! + 0 = 1*2! = 2. So [1324] is at index 2 in the sorted list of permutations of 4 elements. That's correct, because at index 0 is [1234], index 1 is [1243], and the lexicographically next permutation is our [1324].
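Applied to the 8-puzzle itself, the same function (assuming the code above) gives an index straight into a list of 9! booleans:

print(I((1, 2, 5, 3, 4, 6, 9, 8, 7)))   # 1445, an index in [0, 9!-1]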
I believe you're asking for a function to map permutations to array indices. This dictionary maps all permutations of numbers 1-9 to values from 0 to 9!-1.
import itertools
index = itertools.count(0)
permutations = itertools.permutations(range(1, 10))
hashes = {h:next(index) for h in permutations}
For example, hashes[(1,2,5,3,4,6,9,8,7)] gives a value of 1445.
If you need them in strings instead of tuples, use:
permutations = [''.join(x) for x in itertools.permutations('123456789')]
or as integers:
permutations = [int(''.join(x)) for x in itertools.permutations('123456789')]
It looks like you are only interested in whether or not you have already visited the permutation.
You should use a set. It grants the O(1) look-up you are interested in.
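For instance (a small sketch; the state is stored as a tuple because lists are not hashable):

visited = set()
state = (1, 2, 5, 3, 4, 6, 9, 8, 7)
if state not in visited:      # average O(1) membership test
    visited.add(state)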
A structure that is efficient in both space and lookup for this problem is a trie, as it uses common space for shared prefixes of the permutations.
I.e. the space used for the prefix "123" in 1234 and in 1235 is the same.
Let's assume 0 as a replacement for '_' in your example, for simplicity.
Storing
Your trie will be a tree of booleans. The root node will be an empty node, and each node will contain 9 children with a boolean flag initially set to false; the 9 children represent the digits 0 to 8 (with 0 standing in for _).
You can create the trie on the go: as you encounter a permutation, store it by setting the booleans along its digit path to true.
Lookup
The trie is traversed from the root towards the children based on the digits of the permutation, and if the nodes have been marked as true, that means the permutation has occurred before. The complexity of a lookup is just 9 node hops.
Here is how the trie would look for a 4-digit example: [image: Python trie]
This trie can be easily stored in a list of booleans, say myList.
Where myList[0] is the root, as explained in the concept here :
https://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html
The final trie in a list would be around 9 + 9^2 + 9^3 + ... + 9^8 bits, i.e. less than 10 MB for all lookups.
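A minimal sketch of the idea, using lazily created dict nodes rather than the flat boolean list described above; the permutation is passed as a string of digits with 0 standing in for the blank (so the example pattern 125346987 becomes '125346087'):

def make_node():
    return {"seen": False, "children": [None] * 9}

def visit(root, perm):
    # Walk digit by digit, creating children as needed; returns True on first visit.
    node = root
    for ch in perm:
        d = int(ch)
        if node["children"][d] is None:
            node["children"][d] = make_node()
        node = node["children"][d]
    first_time = not node["seen"]
    node["seen"] = True
    return first_time

root = make_node()
print(visit(root, "125346087"))  # True, first visit
print(visit(root, "125346087"))  # False, already seen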
Use
I've developed a heuristic function for this specific case. It is not a perfect hash, as the mapping is not between [0, 9!-1] but between [1, 767359], but it is O(1).
Let's assume we already have a file / reserved memory / whatever with 767359 bits set to 0 (e.g., mem = [False] * 767359). Let an 8-puzzle pattern be mapped to a Python string (e.g., '125346987'). Then, the hash function is determined by:
def getPosition(input_str):
    data = []
    opts = list(range(1, 10))
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind) + str(n)
    output = int(output_str)
    return output
I.e., in order to check if an 8-puzzle pattern 125346987 has already been used, we need to:
pattern = '125346987'
pos = getPosition(pattern)
used = mem[pos-1] #mem starts in 0, getPosition in 1.
With a perfect hash we would need 9! bits to store the booleans. In this case we need about 2x more (767359/9! = 2.11), but recall that it is not even 1 Mb (barely 100 KB).
Note that the function is easily invertible.
Check
I could prove mathematically why this works and why there won't be any collision, but since this is a programming forum let's just run it for every possible permutation and check that all the hash values (positions) are indeed different:
def getPosition(input_str):
    data = []
    opts = list(range(1, 10))
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind) + str(n)
    output = int(output_str)
    return output

# CHECKING PURPOSES
def addperm(x, l):
    return [l[0:i] + [x] + l[i:] for i in range(len(l) + 1)]

def perm(l):
    if len(l) == 0:
        return [[]]
    return [x for y in perm(l[1:]) for x in addperm(l[0], y)]

# We generate all the permutations
all_perms = perm([i for i in range(1, 10)])
print("Number of all possible perms.: " + str(len(all_perms)))  # indeed 9! = 362880

# We execute our hash function over all the perms and store the output.
all_positions = []
for permutation in all_perms:
    perm_string = ''.join(map(str, permutation))
    all_positions.append(getPosition(perm_string))

# We want to check if there has been any collision, i.e., if there
# is one position that is repeated at least twice.
print("Number of different hashes: " + str(len(set(all_positions))))
# also 9!, so the hash works properly.
How does it work?
The idea behind this has to do with a tree: at the beginning it has 9 branches going to 9 nodes, each corresponding to a digit. From each of these nodes we have 8 branches going to 8 nodes, each corresponding to a digit except its parent, then 7, and so on.
We first store the first digit of our input string in a separate variable and pop it out from our 'node' list, because we have already taken the branch corresponding to the first digit.
Then we have 8 branches, and we choose the one corresponding to our second digit. Note that, since there are 8 branches, we need 3 bits to store the index of our chosen branch, and the maximum value it can take is 111 for the 8th branch (we map branches 1-8 to binary 000-111). Once we have chosen and stored the branch index, we pop that value out, so that the next node list doesn't include this digit again.
We proceed in the same way for branches 7, 6 and 5. Note that when we have 7 branches we still need 3 bits, though the maximum value will be 110. When we have 5 branches, the index will be at most binary 100.
Then we get to 4 branches and we notice that this can be stored just with 2 bits, same for 3 branches. For 2 branches we will just need 1bit, and for the last branch we don't need any bit: there will be just one branch pointing to the last digit, which will be the remaining from our 1-9 original list.
So, what we have so far: the first digit stored in a separate variable and a list of 7 indexes representing branches. The first 4 indexes can be represented with 3 bits, the following 2 indexes with 2 bits, and the last index with 1 bit.
The idea is to concatenate all these indexes in their bit form to create a larger number. Since we have 17 bits, this number will be less than 2^17 = 131072. Now we just append the first digit we had stored to the end of that number (at most this digit will be 9), so the biggest number we can create is below 1310729.
But we can do better: recall that when we had 5 branches we needed 3 bits, though the maximum value was binary 100. What if we arrange our bits so that those with more 0s come first? If so, in the worst case scenario our final bit number will be the concatenation of:
100 10 101 110 111 11 1
Which in decimal is 76735. Then we proceed as before (appending the 9 at the end) and we get that our biggest possible generated number is 767359, which is the amount of bits we need; it corresponds to the input string 987654321, while the lowest possible number is 1, which corresponds to the input string 123456789.
Just to finish: one might wonder why we have stored the first digit in a separate variable and appended it at the end. The reason is that if we had kept it, then the number of branches at the beginning would have been 9, so for storing the first index (1-9) we would have needed 4 bits (0000 to 1000), which would have made our mapping much less efficient, as in that case the biggest possible number (and therefore the amount of memory needed) would have been
1000 100 10 101 110 111 11 1
which is 1125311 in decimal (1.13 Mb vs 768 Kb). It is quite interesting to see that the ratio 1.13M/0.77M ≈ 1.47 has something to do with the ratio of the four bits compared to just appending a decimal digit (2^4/10 = 1.6), which makes a lot of sense (the difference is due to the fact that with the first approach we are not fully using the 4 bits).
First. There is nothing faster than a list of booleans. There's a total of 9! == 362880 possible permutations for your task, which is a reasonably small amount of data to store in memory:
import math
visited_states = [False] * math.factorial(9)
Alternatively, you can use an array of bytes, which is slightly slower (not by much though) and has a much lower memory footprint (by an order of magnitude at least). However, any memory savings from using an array will probably be of little value considering the next step.
Second. You need to convert your specific permutation to it's index. There are algorithms which do this, one of the best StackOverflow questions on this topic is probably this one:
Finding the index of a given permutation
You have fixed permutation size n == 9, so whatever complexity an algorithm has, it will be equivalent to O(1) in your situation.
However to produce even faster results, you can pre-populate a mapping dictionary which will give you an O(1) lookup:
import itertools
all_permutations = map(lambda p: ''.join(p), itertools.permutations('123456789'))
permutation_index = dict((perm, index) for index, perm in enumerate(all_permutations))
This dictionary will consume about 50 Mb of memory, which is... not that much actually. Especially since you only need to create it once.
After all this is done, checking your specific combination is done with:
visited = visited_states[permutation_index['168249357']]
Marking it to visited is done in the same manner:
visited_states[permutation_index['168249357']] = True
Note that using any of the permutation index algorithms will be much slower than the mapping dictionary. Most of those algorithms are of O(n^2) complexity, and in your case this results in 81 times worse performance, even discounting the extra Python code itself. So unless you have heavy memory constraints, using the mapping dictionary is probably the best solution speed-wise.
Addendum. As has been pointed out by Palec, visited_states list is actually not needed at all - it's perfectly possible to store True/False values directly in the permutation_index dictionary, which saves some memory and an extra list lookup.
Notice if you type hash(125346987) it returns 125346987. That is for a reason, because there is no point in hashing an integer to anything other than an integer.
What you should do is, when you find a pattern, add it to a dictionary rather than a list. This will provide the fast lookup you need, rather than traversing the list like you are doing now.
So say you find the pattern 125346987 you can do:
foundPatterns = {}
#some code to find the pattern
foundPatterns[1] = 125346987
#more code
#test if there?
125346987 in foundPatterns.values()
True
If you must always have O(1), then it seems like a bit array would do the job. You'd only need to store 362,880 (9!) elements, which seems doable. Though note that in practice it's not always faster. The simplest implementation looks like:
Create data structure
visited_bitset = [False for _ in range(362880)]  # 9!
Test current state and add if not visited yet
if not visited_bitset[current_state]:
    visited_bitset[current_state] = True
Paul's answer might work.
Elisha's answer is a perfectly valid hash function that guarantees no collisions will happen. 9! would be the bare minimum for a guaranteed collision-free hash function, but (unless someone corrects me, Paul probably has) I don't believe there exists a function to map each board to a value in the domain [0, 9!], let alone a hash function that is nothing more than O(1).
If you have 1 GB of memory to support a boolean array of 864197532 (i.e. 987654321 - 123456789) indices, you guarantee (computationally) the O(1) requirement.
Practically (meaning when you run it on a real system) this isn't going to be cache friendly, but on paper this solution will definitely work. Even if a perfect function did exist, I doubt it would be cache friendly either.
Using prebuilts like set or hashmap (sorry, I haven't programmed Python in a while, so I don't remember the datatype) must have an amortized O(1). But using one of these with a suboptimal hash function like n % RANDOM_PRIME_NUM_GREATER_THAN_100000 might give the best solution.

Sum of maximal element from every possible subset of size 'k' of an array

I have a very large list comprising about 10,000 elements, and each element is an integer as big as 5 billion. I would like to find the sum of the maximal elements from every possible subset of size 'k' (given by the user) of an array whose maximum size is 10,000 elements. The only solution that comes to my head is to generate each of the subsets (using itertools) and find its maximum element. But this would take an insane amount of time! What would be a pythonic way to solve this?
Don't use python, use mathematics first. This is a combinatorial problem: If you have an array S of n numbers (n large), and generate all possible subsets of size k, you want to calculate the sum of the maximal elements of the subsets.
Assuming the numbers are all distinct (though it also works if they are not), you can calculate exactly how often each will appear in a subset, and go on from there without ever actually constructing a subset. You should have taken it over to math.stackexchange.com, they'd have sorted you out in a jiffy. Here it is, but without the nice math notation:
Sort your array in increasing order and let S_1 be the smallest (first) number,
S_2 the next smallest, and so on. (Note: Indexing from 1).
S_n, the largest element, is obviously the maximal element of any subset
it is part of, and there are exactly (n-1 choose k-1) such subsets.
Of the subsets that don't contain S_n, there are (n-2 choose k-1)
subsets that contain S_{n-1}, in which it is the largest element.
Continue this until you come down to S_k, the k-th smallest number
(counting from the smallest), which will be the maximum of exactly one
subset: (k-1 choose k-1) = 1. Smaller numbers (S_1 to S_{k-1})
can never be maximal: Every set of k elements will contain something
larger.
Sum the above (n-k+1 terms), and there's your answer:
S_n*(n-1 choose k-1) + S_{n-1}*(n-2 choose k-1) + ... + S_k*(k-1 choose k-1)
Writing the terms from smallest to largest, this is just the sum
Sum(i=k..n) S_i * (i-1 choose k-1)
If we were on math.stackexchange you'd get it in the proper mathematical notation, but you get the idea.
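In code, a direct transcription of that formula (math.comb needs Python 3.8+):

from math import comb

def sum_of_subset_maxima(arr, k):
    s = sorted(arr)                  # s[i-1] is S_i in the notation above
    n = len(s)
    # S_i is the maximum of exactly C(i-1, k-1) subsets, for i from k to n.
    return sum(s[i - 1] * comb(i - 1, k - 1) for i in range(k, n + 1))

print(sum_of_subset_maxima([3, 1, 4, 2], 2))   # 20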
