I was recently given a question in a coding challenge where I had to merge n strings of alphanumeric characters and then sort the new merged string, allowing only alphabetical characters in the sorted string. Now, this would be fairly straightforward except that the caveat added was that the algorithm had to be O(n) (it didn't specify whether this was time or space complexity or both).
My initial approach was to concatenate the strings into a new one, adding only alphabetical characters, and then sorting at the end. I wanted to come up with a more efficient solution, but I was given less time than I was initially told. There isn't any sorting algorithm (that I know of) which runs in O(n) time, so the only thing I can think of is that I could increase the space complexity and use a sorted hashtable (e.g. C++ map) to store the counts of each character and then print the hashtable in sorted order. But as this could require printing n characters n times, I think it would still run in quadratic time. Also, I was using Python, which I don't think has a way to keep a dictionary sorted (maybe it does).
Is there any way this problem could have been solved in O(n) time and/or space complexity?
Your counting sort is the way to go: build a simple count table for the 26 letters in order. Iterate through your strings, counting letters and ignoring non-letters. This is one pass of O(n). Now, simply go through your table, printing each letter the number of times indicated. This is also O(n), since the sum of the counts cannot exceed n. You're not printing n letters n times each: you're printing a total of n letters.
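Here is a minimal Python sketch of that counting approach (the function name and sample input are illustrative, and lower-casing the letters is an assumption):

def merge_and_sort(strings):
    # One pass: count each letter, ignoring non-alphabetical characters.
    counts = [0] * 26
    for s in strings:
        for ch in s.lower():
            if 'a' <= ch <= 'z':
                counts[ord(ch) - ord('a')] += 1
    # Second pass: emit each letter as many times as it was counted.
    return ''.join(chr(ord('a') + i) * c for i, c in enumerate(counts))

print(merge_and_sort(["ab3C", "z1a"]))  # aabcz

Both passes touch each character a constant number of times, so the whole thing is O(n) time with O(26) = O(1) extra space.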
Concatenate your strings (not really needed; you can also count the chars in the individual strings)
Create an array with length equal to the total number of char codes
Read through your concatenated string and count occurrences in the array made in the previous step
By reading through the char frequency array, build up an output array with the right number of repetitions of each char.
Since each step is O(n), the whole thing is O(n)
If I've understood the requirement correctly, you're simply sorting the characters in the string?
I.e. ADFSACVB becomes AABCDFSV?
If so then the trick is to not really "sort". You have a fixed (and small) number of values. So you can simply keep a count of each value and generate your result from that.
E.g. Given ABACBA
In the first pass, increment the counters in an array indexed by character. This produces:
[A] == 3
[B] == 2
[C] == 1
In the second pass, output the number of each character indicated by the counters: AAABBC
In summary, you're told to sort, but thinking outside the box, you really want a counting algorithm.
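A quick Python sketch of the counting idea above, using collections.Counter for the counting pass:

from collections import Counter

s = "ABACBA"
counts = Counter(s)  # {'A': 3, 'B': 2, 'C': 1}
print(''.join(c * counts[c] for c in sorted(counts)))  # AAABBC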
So I've been trying to attack this problem for a while but have no idea how to do it efficiently.
I'm given a string of N (N >= 3) characters, which consists solely of the characters 'A' and 'B'. I have to find an efficient way to count all possible substrings, in the given order, which have only one A or only one B.
For example ABABA:
For three letters, the substrings would be: ABA, BAB, ABA. For this all three count because all three of them contain only one B or only one A.
For four letters, the substrings would be: ABAB, BABA. Neither of these counts, because neither has only one A or only one B.
For five letters: ABABA. This doesn't count because it doesn't have only one A or B.
If the string was bigger, then all substring combinations would be checked.
I need to implement this in O(n^2) or even O(n log n) time, but the best I've been able to do is O(n^3): I loop from 3 to the string's length for the length of the substrings, use a nested for loop to check each substring, then use indexOf and lastIndexOf on each substring and check that they match and don't equal -1 (meaning that there is only one of the character), for both A and B.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
Efficiently remove single letter substrings from a string
This is completely impossible. Removing a letter is O(n) time already. The right answer is to not remove anything anywhere. You don't need to.
The actual answer is to stop removing letters and making substrings. If you call substring you messed up.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
I have no clue. Also seems kinda silly. But, there's some good news: There's an O(n) algorithm available, why mess about with pointlessly inefficient algorithms?
charAt(i) is efficient. We can use that.
Here's your algorithm, in pseudocode because if I just write it for you, you wouldn't learn much:
First do the setup. It's a little bit complicated:
Maintain counters for # of times A and B occurs.
Maintain the position of the start of the current substring you're on. This starts at 0, obviously.
Start off the proceedings by looping from 0 to x (x = substring length), and update your A/B counters. So, if x is 3, and input is ABABA, you want to end with aCount = 2 and bCount = 1.
With that prepwork completed, let's run the algorithm:
Check for your current substring (that's the substring that starts at 0) if it 'works'. You do not need to run substring or do any string manipulation at all to know this. Just check your aCount and bCount variables. Is one of them precisely 1? Then this substring works. If not, it doesn't. Increment your answer counter by 1 if it works, don't do that if it doesn't.
Next, move to the next substring. To calculate this, first get the character at your current position (0). Then subtract 1 from aCount or bCount depending on what's there. Then, fetch the char at 'the end' (.charAt(pos + x)) and add 1 to aCount or bCount depending on what's there. Your aCount and bCount vars now represent how many As and Bs, respectively, are in the substring that starts at pos 1. And it only took 2 constant steps to update these vars.
... and loop. Keep looping until the end (pos + x) is at the end of the string.
This is O(n): Given, say, an input string of 1000 chars, and a substring check of 10, then the setup costs 10, and the central loop costs 990 loops. O(n) to the dot. .charAt is O(1), and you need two of them on every loop. Constant factors don't change big-O number.
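A sketch of that sliding window in Python (the answer is phrased in Java terms, but the idea is the same); this counts the windows of a single length x, and summing it over x = 3 .. n gives the O(n^2) total the question asks for:

def count_windows(s, x):
    # Count length-x substrings of s with exactly one 'A' or exactly one 'B'.
    if x > len(s):
        return 0
    a = s[:x].count('A')          # setup pass over the first window
    b = x - a                     # assumes the string contains only 'A' and 'B'
    total = 1 if a == 1 or b == 1 else 0
    for pos in range(1, len(s) - x + 1):
        # Drop the character leaving the window, add the one entering it.
        if s[pos - 1] == 'A': a -= 1
        else: b -= 1
        if s[pos + x - 1] == 'A': a += 1
        else: b += 1
        if a == 1 or b == 1:
            total += 1
    return total

print(count_windows("ABABA", 3))  # 3, matching the example above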
My question is around the Space Complexity of updating a dictionary with an input string.
e.g. string = "thisisarandomstring", my_dict = dict()
Say we iterate over the input string, and for each character, we store and update the index of the latest character.
i.e.
for i in range(len(string)):
    my_dict[string[i]] = i
Would the space complexity for the above be O(n)? Or O(1)?
In the question I am solving, the solution says it is O(n), but the way I see it, we will store at most 26 characters in the dictionary, so shouldn't it be O(1)? I can see that the number of updates depends on the length of the input string, but does this impact the space? Since for every update, we replace the previous index of an already-seen element.
You are right: eventually we can store at most 26 different keys in the dictionary, so the space consumed by the dictionary will be constant, that is O(1). However, you are creating a string variable that consists of n characters. The space complexity of storing a new string of length n is Θ(n), because each individual character must be stored somewhere. That may be the reason why it is indicated as O(n).
Btw, if we denote the space complexity (the amount of space consumed) as f(n), then f(n) ∈ O(1) implies f(n) ∈ O(n). Therefore f(n) ∈ O(n) is not a wrong statement. Well, technically.
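A quick way to convince yourself of the O(1) dictionary size (the sample string is the one from the question):

string = "thisisarandomstring"
my_dict = {}
for i, ch in enumerate(string):
    my_dict[ch] = i                 # a repeated character just overwrites its old index
print(len(my_dict), len(string))    # the dict never exceeds 26 keys, whatever n is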
I need to write a Python function which takes two sorted strings (the characters in each string are in increasing alphabetical order) containing only lowercase letters, and checks whether or not the strings are equal.
The function's time complexity needs to be O(log n), where n is the length of each string.
I can't figure out how to check it without comparing each character in the first string with the parallel character of the second string.
This is, in fact, possible in O(log n) time in the worst case, since the strings are formed from an alphabet of constant size.
You can do 26 binary searches on each string to find the left-most occurrence of each letter. If the strings are equal, then all 26 binary searches will give the same results; either that the letter exists in neither string, or that its left-most occurrence is the same in both strings.
Conversely, if all of the binary searches give the same result, then the strings must be equal, because (1) the alphabet is fixed, (2) the indices of the left-most occurrences determine the frequency of each letter in the string, and (3) the strings are sorted, so the letter frequencies uniquely determine the string.
I'm assuming here that the strings have the same length. If they might not, then check that first and return False if the lengths are different. Getting the length of a string takes O(1) time.
As @wim notes in the comments, this solution cannot be generalised to lists of numbers; it specifically only works with strings. When you have an algorithmic problem involving strings, the alphabet size is usually a constant, and this fact can often be exploited to achieve a better time complexity than would otherwise be possible.
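A sketch of the 26-binary-searches idea using Python's bisect module (bisect_left on a sorted string gives the left-most position at which a letter occurs, or would be inserted):

from bisect import bisect_left
from string import ascii_lowercase

def sorted_strings_equal(s, t):
    if len(s) != len(t):            # O(1) length check first
        return False
    # 26 binary searches per string: O(26 log n) = O(log n).
    return all(bisect_left(s, c) == bisect_left(t, c) for c in ascii_lowercase)

print(sorted_strings_equal("abbc", "abbc"))  # True
print(sorted_strings_equal("abbc", "abcc"))  # False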
I am doing an internship, writing a program to do gene matching.
For example:
File "A" contains some strings of gene type. (the original data is not sorted)
rs17760268
rs10439884
rs4911642
rs157640
rs1958589
rs10886159
rs424232
....
and file "B" contains 900 thousands of rs number like above (also not sorted)
My program can now get correct results, but I would like to make it more efficient.
Is there any algorithm that can be applied to this program?
BTW, I will try to make my program do multi-processing and see if it gets better performance.
pseudocode:
read File "A" by string, append to A[]
A[] = rs numbers from File "A"
read File "B" by string
for gene_B in file_B_reader:
    for gene_A in A:
        if gene_A == gene_B:
            # append to result[]
I don't think there's a need to sort anything first.
Process larger list B into a hashmap or hashset, O(n) amortized
Iterate over list A and remove from A if not in B, O(m)
return A
Total: O(n + m)
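A sketch of that hash-set approach in Python (the file names are placeholders):

# Build a set from the larger file B: O(n) to build, O(1) per lookup.
with open("B.txt") as f:
    b_numbers = set(line.strip() for line in f)

# One pass over A, keeping only the rs numbers that also appear in B: O(m).
with open("A.txt") as f:
    result = [line.strip() for line in f if line.strip() in b_numbers]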
Though your explanations are quite unclear, I guess that you are appending the A values to a list. Use a dictionary instead, and you can look up A much more efficiently.
From the description it appears you want result[] to contain rs strings that are in both A and B (aka Intersection).
Your algorithm is O(n*m), but you could easily improve this by sorting both files first (O(n log n) for comparison-based sorts), and then reading from both at the same time, advancing the position in whichever one has the lower current rs number, and adding matches to result[] as you go.
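A sketch of that merge-style intersection in Python, assuming both lists have already been sorted:

def sorted_intersection(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:            # match: record it and advance both sides
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:           # advance whichever side is behind
            i += 1
        else:
            j += 1
    return result

print(sorted_intersection(["rs157640", "rs424232"], ["rs157640", "rs999999"]))  # ['rs157640']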
I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement this. Also, if anyone can find existing code for this that would be great as well. Basically the functionality is going to be like a boggle solver but without being a matrix, just searching for all word possibilities from a string of characters. I do already have adequate dictionaries.
I was planning to do this in either python or ruby.
Thanks in advance for your help guys!
I'd use a Trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)
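In case the link goes away, here is a minimal dict-based trie sketch (not the linked implementation, just the idea):

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True            # marks the end of a complete word
    return root

def is_word(trie, s):
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

trie = build_trie(["cat", "car", "cats"])
print(is_word(trie, "cat"), is_word(trie, "ca"))  # True False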
I may be missing an understanding of the game, but barring some complications in the rules, such as the introduction of "joker" (wildcard) letters, missing or additional letters, multiple words, etc., I think the following ideas would turn the problem into something relatively uninteresting. :-(
Main idea: index words by the ordered sequence of their letters.
For example, "computer" gets keyed as "cemoprtu". Whatever the random drawing provides is sorted in the same fashion and used as the key to find possible matches.
Using trie structures, as suggested by perimosocordiae, as the underlying storage for these sorted keys and the associated word(s)/word IDs in the "leaf" nodes, word lookup can be done in O(n) time, where n is the number of letters (or better, on average, thanks to non-existing words).
To further help with indexing we can have several tables/dictionaries, one per number of letters. Also, depending on statistics, the vowels and consonants could be handled separately. Another trick would be a custom sort order, placing the most selective letters first.
Additional twists to the game (such as finding words made from a subset of the letters) are mostly a matter of iterating over the power set of those letters and checking the dictionary for each combination.
A few heuristics can be introduced to help prune some of the combinations (for example, combinations without vowels [and of a given length] are not possible solutions, etc.). One should manage these heuristics carefully, for the lookup cost is relatively small.
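A small sketch of the sorted-key index plus the subset lookup described above (the four-word dictionary is just a stand-in for a real word list):

from itertools import combinations

# Index: sorted letters -> words spelled with exactly those letters.
index = {}
for word in ["computer", "dog", "god", "poet"]:
    index.setdefault(''.join(sorted(word)), []).append(word)

def words_from(letters):
    # Try every subset of the drawn letters against the index.
    found = set()
    letters = sorted(letters)
    for size in range(1, len(letters) + 1):
        for combo in set(combinations(letters, size)):
            found.update(index.get(''.join(combo), []))
    return found

print(words_from("dgoq"))  # {'dog', 'god'}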
For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you can get O(1) word lookup. A Bag[Char] is an identifier for a word that is unique up to character order. It is basically a hash map from Char to Int: the Char is a given character in the word and the Int is the number of times that character appears in the word.
Example:
{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]
To find words, take every combination of characters from the input string and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the length of the dictionary.
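In Python, a hashable Bag[Char] can be approximated with a frozenset of collections.Counter items; a tiny sketch with a three-word stand-in dictionary:

from collections import Counter

def bag(word):
    # A hashable character -> count mapping, usable as a dict key.
    return frozenset(Counter(word).items())

index = {}
for word in ["anagram", "stressed", "desserts"]:
    index.setdefault(bag(word), []).append(word)

print(index[bag("desserts")])  # ['stressed', 'desserts']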
This sounds like Rabin-Karp string search would be a good choice. If you use a rolling hash-function then at each position you need one hash value update and one dictionary lookup. You also need to create a good way to cope with different word lengths, like truncating all words to the shortest word in the set and rechecking possible matches. Splitting the word set into separate length ranges will reduce the amount of false positives at the expense of increasing the hashing work.
There are two ways to do this. One is to check every candidate permutation of letters in the word to see if the candidate is in your dictionary of words. That's an O(N!) operation, depending on the length of the word.
The other way is to check every candidate word in your dictionary to see if it's contained within the word. This can be sped up by aggregating the dictionary; instead of every candidate word, you check all words that are anagrams of each other at once, since if any one of them is contained in your word, all of them are.
So start by building a dictionary whose key is a sorted string of letters and whose value is a list of the words that are anagrams of the key:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
...     for line in f.readlines():
...         if line[0].isupper(): continue
...         word = line.strip()
...         key = "".join(sorted(word.lower()))
...         d[key].append(word)
Now we need a function to see if a word contains a candidate. This function assumes that the word and candidate are both sorted, so that it can go through them both letter by letter and give up quickly when it finds that they don't match.
>>> def contains(sorted_word, sorted_candidate):
...     wchars = (c for c in sorted_word)
...     for cc in sorted_candidate:
...         while True:
...             try:
...                 wc = next(wchars)
...             except StopIteration:
...                 return False            # ran out of word letters before matching cc
...             if wc < cc: continue        # skip word letters smaller than cc
...             if wc == cc: break          # matched cc, move to the next candidate letter
...             return False                # wc > cc: cc cannot appear later in a sorted word
...     return True
Now find all the candidate keys in the dictionary that are contained by the word, and aggregate all of their values into a single list:
>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
...     if contains(w, k): result.extend(d[k])
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']
That last step takes about a quarter second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).