Efficiently remove single letter substrings from a string


So I've been trying to attack this problem for a while but have no idea how to do it efficiently.
I'm given a string of N (N >= 3) characters, and the string consists solely of the characters 'A' and 'B'. I have to efficiently count all the possible substrings (contiguous, in the order given) that contain only one A or only one B.
For example ABABA:
For three letters, the substrings would be: ABA, BAB, ABA. For this all three count because all three of them contain only one B or only one A.
For four letters, the substrings would be: ABAB, BABA. None of these count because they both don't have only one A or B.
For five letters: ABABA. This doesn't count because it doesn't have only one A or B.
If the string was bigger, then all substring combinations would be checked.
I need to implement this in O(n^2) or even O(nlogn) time, but the best I've been able to do is O(n^3) time: I loop from 3 to the string's length for the length of the substrings, use a nested for loop to check each substring, then use indexOf and lastIndexOf and check for each substring whether they match and don't equal -1 (meaning that there is only 1 of the character), for both A and B.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!

Efficiently remove single letter substrings from a string
This is completely impossible. Removing a letter is O(n) time already. The right answer is to not remove anything anywhere. You don't need to.
The actual answer is to stop removing letters and making substrings. If you call substring you messed up.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
I have no clue. Also seems kinda silly. But, there's some good news: There's an O(n) algorithm available, why mess about with pointlessly inefficient algorithms?
charAt(i) is efficient. We can use that.
Here's your algorithm, in pseudocode because if I just write it for you, you wouldn't learn much:
First do the setup. It's a little bit complicated:
Maintain counters for # of times A and B occurs.
Maintain the position of the start of the current substring you're on. This starts at 0, obviously.
Start off the proceedings by looping from 0 to x (x = substring length), and update your A/B counters. So, if x is 3, and input is ABABA, you want to end with aCount = 2 and bCount = 1.
With that prepwork completed, let's run the algorithm:
Check for your current substring (that's the substring that starts at 0) if it 'works'. You do not need to run substring or do any string manipulation at all to know this. Just check your aCount and bCount variables. Is one of them precisely 1? Then this substring works. If not, it doesn't. Increment your answer counter by 1 if it works, don't do that if it doesn't.
Next, move to the next substring. To calculate this, first get the character at your current position (0). Then subtract 1 from aCount or bCount depending on what's there. Then, fetch the char at 'the end' (.charAt(pos + x)) and add 1 to aCount or bCount depending on what's there. Your aCount and bCount vars now represent how many As and Bs, respectively, are in the substring that starts at pos 1. And it only took 2 constant steps to update these vars.
... and loop. Keep looping until the end (pos + x) is at the end of the string.
This is O(n): Given, say, an input string of 1000 chars, and a substring check of 10, then the setup costs 10, and the central loop costs 990 loops. O(n) to the dot. .charAt is O(1), and you need two of them on every loop. Constant factors don't change big-O number.
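The pseudocode above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are made up here, the input is assumed to contain nothing but 'A' and 'B', and string indexing plays the role of charAt. Since the question asks about every length from 3 to N, the second helper simply repeats the O(n) pass once per length, which gives O(n^2) overall.

```python
def count_windows(s, x):
    """Count the length-x windows of s that hold exactly one 'A' or exactly one 'B'.

    Hypothetical helper; assumes s consists only of 'A' and 'B'.
    """
    if x > len(s):
        return 0
    # Setup: count the letters in the first window by index, no slicing.
    a_count = sum(1 for i in range(x) if s[i] == "A")
    b_count = x - a_count
    found = 1 if (a_count == 1 or b_count == 1) else 0
    # Slide the window: drop s[pos], take in s[pos + x].
    for pos in range(len(s) - x):
        if s[pos] == "A":
            a_count -= 1
        else:
            b_count -= 1
        if s[pos + x] == "A":
            a_count += 1
        else:
            b_count += 1
        if a_count == 1 or b_count == 1:
            found += 1
    return found


def count_all_lengths(s, min_len=3):
    """Repeat the O(n) pass for every window length from min_len up to len(s)."""
    return sum(count_windows(s, x) for x in range(min_len, len(s) + 1))
```

For the question's example ABABA, the length-3 pass counts ABA, BAB and ABA, and the longer lengths add nothing.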

Related

Finding maximum possible count of elements in list whom all sharing at least one same digit

There is a problem where array containing numbers is given, the statement is to find the maximum length of a sub-sequence formed from the given array such that all the elements in that sub-sequence share at least one common digit.
Now what's the catch? Well, I intended to use a dictionary b to store each digit as a key and its count so far as the value while traversing the given array, digit by digit. I thought the maximum value in the dictionary, i.e. the biggest count of any digit, would be the problem's answer, except for one glitch: we should not count the same digit more than once when it appears several times in ONE element of the array. To overcome that glitch, I used a set c.
The function for this, along with the driver code, is written below for convenience.
def solve (a):
    b={}
    answer=1
    for i in a:
        j=i
        c=set()
        c.clear()
        while(j):
            last_digit=i%10
            if(last_digit not in b and last_digit not in c):
                b[last_digit]=1
                c.add(last_digit)
            elif(last_digit in b and last_digit not in c):
                b[last_digit]+=1
                c.add(last_digit)
            answer=max(answer,b[last_digit])
            j//=10
    return answer
a=list(map(int,input().strip().split()))
print(solve(a))
There are a lot of test cases this code has to pass. For one of them, the input is 12 11 3 4 5; the code outputs 1 but the expected output is 2. What gives?

You have good ideas. But your code would be easier if you use the Counter object from the collections module. It is designed to do just what you are trying to do: count the number of occurrences of an item in an iterable.
This code uses a generator expression to look at each value in the list alist, uses the built-in str() function to convert that integer to a string of digits, then uses the set() built-in function to convert that to a set. As you said, this removes the duplicate digits since you want to count each digit only once per item. The Counter object then looks at these digits and counts their occurrences. The code then uses Counter's most_common method to choose the digit that occurs most (the (1) parameter returns only the single most popular digit in a list, and the 0 index takes that digit and its count out of the list) then takes the count of that digit (that's the 1 index). That count is then returned to the caller.
If you are not familiar with Counter or with generator expressions, you could do the counting yourself and use regular for loops. But this code is short and fairly clear to anyone who knows the Counter object. You could use the line in the comment to replace the following four lines, if you want brief code, but I expanded out the code to make it more clear.
from collections import Counter
def solve(alist):
    digitscount = Counter(digit for val in alist for digit in set(str(abs(val))))
    # return digitscount.most_common(1)[0][1]
    most_common_list = digitscount.most_common(1)
    most_common_item = most_common_list[0]
    most_common_count = most_common_item[1]
    return most_common_count
alist = list(map(int, input().strip().split()))
print(solve(alist))
For your example input 12 11 3 4 5, this prints the correct answer 2. Note that my code will give an error if the input is empty or contains a non-integer. This version of my code takes the absolute value of the list values, which prevents a minus (or negative) sign from being counted as a digit.
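For comparison, the original loop can also be repaired in place. As far as I can tell, the wrong output comes from taking the last digit of i (which never changes inside the while loop) instead of j (which is the value being shortened). The sketch below is a lightly renamed version of the question's approach with that one change; the name solve_fixed and the use of abs() are my own additions, and like the original it skips an element that is exactly 0.

```python
def solve_fixed(numbers):
    """Hypothetical repaired version of the question's dictionary-plus-set approach."""
    digit_counts = {}  # digit -> number of elements that contain it
    answer = 1
    for number in numbers:
        remaining = abs(number)  # abs() keeps a minus sign out of the digit loop
        seen = set()  # digits already counted for this element
        while remaining:
            last_digit = remaining % 10  # the question used i % 10 here
            if last_digit not in seen:
                seen.add(last_digit)
                digit_counts[last_digit] = digit_counts.get(last_digit, 0) + 1
                answer = max(answer, digit_counts[last_digit])
            remaining //= 10
    return answer
```

With the input 12 11 3 4 5 the digit 1 is counted once for 12 and once for 11, so this returns 2.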

Here's my own implementation of this:
def solve(list_of_numbers):
    counts = {str(i): 0 for i in range(10)}  # make a dict of placeholders 0 through 9
    longest_sequence = 0  # store the length of the longest sequence so far
    for num in list_of_numbers:  # iterate through our list of numbers
        num_str = str(num)  # convert number to a string (this makes "is digit present?" easier)
        for digit in counts:  # evaluate whether each digit is present
            if digit in num_str:  # if the digit is present in this number...
                counts[digit] += 1  # ...then increment the running count of this digit
            else:  # otherwise, we've broken the sequence...
                counts[digit] = 0  # ...so we reset the running count to 0.
            if counts[digit] > longest_sequence:  # check if we've found a new longest sequence...
                longest_sequence = counts[digit]  # ...and if so, replace the current longest sequence
    return longest_sequence  # finally, return the length of the longest running sequence.
It uses a dict to store the running counts of each digit's consecutive occurrences - for every number, the count is incremented if the digit is present, and reset to 0 if it isn't. The longest sequence's length so far is saved in its own variable for storage.
There are a few details that your implementation, I think, is overlooking:
Your code might be returning the digit that appears the most, rather than the greatest number of appearances. I'm not sure, because your code is kind of hard for me to parse, and you only gave the one test example.
Try to stay away from using single-letter variable names, if you can. Notice how much more clear my code above is, since I've used full names (and, at worst, clear abbreviations like ct for "count"). This makes it much easier to debug your own code.
I see what you're doing to find the digits that are present in the number, but that's a bit more verbose than the situation requires. A simpler solution, which I used, was to simply cast the number into a string and use each individual digit's character instead of its value. In your implementation, you could do something like this: c = set(int(digit) for digit in str(j))
You don't seem to have anything in your code that detects when a digit is no longer present, which could lead to an incorrect result.

I'm having trouble understanding the original problem, but I think what you need to do is cast each item as a string if it is an int and then split out each digit.
digits = {}
for item in thelist:
    digits[item] = []
    for digit in str(item):
        digits[item].append(int(digit))
If your test case is 12 11 3 4 5 then this would produce a dictionary of {12: [1, 2], 11: [1, 1], 3: [3], etc}

Merging and sorting n strings in O(n)

I was recently given a question in a coding challenge where I had to merge n strings of alphanumeric characters and then sort the new merged string while only allowing alphabetical characters in the sorted string. Now, this would be fairly straightforward, except for the added caveat that the algorithm had to be O(n) (it didn't specify whether this was time or space complexity or both).
My initial approach was to concatenate the strings into a new one, only adding alphabetical characters and then sorting at the end. I wanted to come up with a more efficient solution but I was given less time than I was initially told. There isn't any sorting algorithm (that I know of) which runs in O(n) time, so the only thing I can think of is that I could increase the space complexity and use a sorted hashtable (e.g. C++ map) to store the counts of each character and then print the hashtable in sorted order. But as this would require possibly printing n characters n times, I think it would still run in quadratic time. Also, I was using python which I don't think has a way to keep a dictionary sorted (maybe it does).
Is there any way this problem could have been solved in O(n) time and/or space complexity?

Your counting sort is the way to go: build a simple count table for the 26 letters in order. Iterate through your strings, counting letters, ignoring non-letters. This is one pass of O(n). Now, simply go through your table, printing each letter the number of times indicated. This is also O(n), since the sum of the counts cannot exceed n. You're not printing n letters n times each: you're printing a total of n letters.
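A minimal sketch of that counting approach, under a few assumptions of my own: the function name is made up, "letters" means ASCII A-Z and a-z, and uppercase sorts before lowercase (plain character-code order). The question does not say how case should be handled, so adjust the table if the challenge expected something else.

```python
import string


def merge_and_sort_letters(strings):
    """Counting sort over a fixed table of ASCII letters (hypothetical helper)."""
    order = sorted(string.ascii_letters)  # 'A'..'Z' then 'a'..'z'
    counts = dict.fromkeys(order, 0)  # fixed-size count table
    # Pass 1: count letters across all input strings, ignoring everything else.
    for text in strings:
        for ch in text:
            if ch in counts:
                counts[ch] += 1
    # Pass 2: emit each letter as many times as it was seen.
    return "".join(ch * counts[ch] for ch in order)
```

The table has a constant size (52 entries here), so both passes are linear in the total number of input characters.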

1. Concatenate your strings (not really needed; you can also count chars in the individual strings).
2. Create an array with length equal to the total number of char codes.
3. Read through your concatenated string and count occurrences in the array made at step 2.
4. By reading through the char frequency array, build up an output array with the right number of repetitions of each char.
Since each step is O(n) the whole thing is O(n)
[#patatahooligan: had made this edit before I saw your remark, accidentally duplicated the answer]

If I've understood the requirement correctly, you're simply sorting the characters in the string?
I.e. ADFSACVB becomes AABCDFSV?
If so then the trick is to not really "sort". You have a fixed (and small) number of values. So you can simply keep a count of each value and generate your result from that.
E.g. Given ABACBA
In the first pass, increment counters in an array indexed by character. This produces:
[A] == 3
[B] == 2
[C] == 1
In second pass output the number of each character indicated by the counters. AAABBC
In summary, you're told to sort, but thinking outside the box, you really want a counting algorithm.

What is more efficient? Using .replace() or passing string to list

Solving the following problem from CodeFights:
Given two strings, find the number of common characters between them.
For s1 = "aabcc" and s2 = "adcaa", the output should be
commonCharacterCount(s1, s2) = 3.
Strings have 3 common characters - 2 "a"s and 1 "c".
The way I approached it, whenever I took a letter into account I wanted to cancel it out so as not to count it again. I know strings are immutable, even when using methods such as .replace() (the replace() method returns a copy of the string; it does not change the original string).
In order to mutate said string what I tend to do at the start is simply pass it on to a list with list(mystring) and then mutate that.
Question is... what is more efficient of the following? Take into account that option B gets done over and over, worst case scenario the strings are equal and have a match for match. Meanwhile option A happens once.
Option A)
list(mystring)
Option B)
mystring = mystring.replace(letterThatMatches, "")

The idea of calling replace on the string for each element you find, is simply not a good idea: it takes O(n) to do that. If you do that for every character in the other string, it will result in an O(m×n) algorithm with m the number of characters of the first string, and n the number of characters from the second string.
You can simply use two Counters, then calculate the minimum of the two, and then calculate the sum of the counts. Like:
from collections import Counter
def common_chars(s1,s2):
    c1 = Counter(s1) # count for every character in s1, the amount
    c2 = Counter(s2) # count for every character in s2, the amount
    c3 = c1 & c2 # produce the minimum of each character
    return sum(c3.values()) # sum up the resulting counts
Or as a one-liner:
def common_chars(s1,s2):
    return sum((Counter(s1) & Counter(s2)).values())
If dictionary lookup can be done in O(1) (which usually holds for the average case), this is an O(m+n) algorithm: the counting then happens in O(m) and O(n) respectively, and calculating the minimum runs in the number of different characters (which is at most O(max(m,n))). Finally, taking the sum is again an O(max(m,n)) operation.
For the sample input, this results in:
>>> common_chars("aabcc","adcaa")
3

Can this python code be more efficient?

I have written some code to find how many substrings of a string are anagram pairs. The function to find anagram(anagramSolution) is of complexity O(N). The substring function has complexity less than N square. But, this code here is the problem. Can it be more optimized?
for i in range(T):
    x = raw_input()
    alist = get_all_substrings(x)
    for k, j in itertools.combinations(alist,2):
        if(len(k) == len(j)):
            if(anagramSolution(k,j)):
                counter +=1
    counterlist.append(counter)
    counter = 0
The alist can have thousands of items (subsets). The main problem is the loop. It is taking a lot of time to iterate over all the items. Is there any faster or more efficient way to do this?

Define the anagram class of a string to be the set of counts of how many times each letter appears in the string. For example, 'banana' has anagram class a: 3, b: 1, n: 2. Two strings are anagrams of each other if they have the same anagram class. We can count how many substrings of the string are in each anagram class, then compute the number of pairs by computing (n choose 2) for every anagram class with n substrings:
from collections import Counter
anagram_class_counts = Counter()
for substring in get_all_substrings(x):
    anagram_class_counts[frozenset(Counter(substring).viewitems())] += 1
anagram_pair_count = sum(x*(x-1)/2 for x in anagram_class_counts.viewvalues())
frozenset(Counter(substring).viewitems()) builds a hashable representation of a string's anagram class.
Counter takes an iterable and builds a mapping representing how many times each item appeared, so
Counter(substring) builds a mapping representing a string's anagram class.
viewitems() gives a set-like collection of letter: count pairs, and
frozenset turns that into an immutable set that can be used as a dict key.
These steps together take time proportional to the size of the substring; on average, substrings are about a third of the size of the whole string, so on average, processing each substring takes O(len(x)) time. There are O(len(x)**2) substrings, so processing all substrings takes O(len(x)**3) time.
If there are x substrings with the same anagram class, they can be paired up in x*(x-1)/2 ways, so the sum goes through the number of occurrences of each anagram class and computes the number of pairs. This takes O(len(x)**2) time, since it has to go through each anagram class once, and there can't be more anagram classes than substrings.
Overall, this algorithm takes O(len(x)**3) time, which isn't great, but it's a lot better than the original. There's still room to optimize this, such as by computing anagram classes in a way that takes advantage of the overlap between substrings or by using a more efficient anagram class representation.
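One way the overlap idea could look, written for Python 3 (where viewitems() is spelled items()). This is a sketch rather than a tuned solution, and the function name is an assumption of mine. For each start index it grows the substring one character at a time and updates a single running Counter, so no substring is ever re-counted from scratch. Building the frozenset key still costs time proportional to the number of distinct letters, so this is O(len(x)**2) only when the alphabet is small and fixed.

```python
from collections import Counter


def count_anagram_pairs(text):
    """Count unordered pairs of substrings of text that are anagrams of each other."""
    anagram_class_counts = Counter()
    for start in range(len(text)):
        running = Counter()  # letter counts of text[start:end + 1]
        for end in range(start, len(text)):
            running[text[end]] += 1  # extend the substring by one letter
            # Hashable snapshot of the current anagram class.
            anagram_class_counts[frozenset(running.items())] += 1
    # n substrings in one class can be paired up in n * (n - 1) / 2 ways.
    return sum(n * (n - 1) // 2 for n in anagram_class_counts.values())
```

For example, "abba" has four such pairs: (a, a), (b, b), (ab, ba) and (abb, bba).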

I don't think you can wholly escape iterations for this problem but at least you can make the task at hand smaller by a factor of O(2^nC2/2^n).
You want to group substrings into their respective lengths before starting to iterate as you are adding a lot more cases to check.
The current method compares all pairs from a set, which takes (2^n)C2 = (2^n)! / ((2^n - 2)! * 2!) comparisons. This is a huge number.
If we first make a list of length 1-n substrings and then compare, we spend:
2^n operations going through all substrings
nC2 operations going through substring of length 1
...
That is, we are doing logarithmically better.
Edit: I realised that strings aren't sets and substrings aren't subsets, but that only affects the number of subsets and does not affect the main argument.

Finding words from random input letters in python. What algorithm to use/code already there?

I am trying to code a word descrambler like this one here and was wondering what algorithms I should use to implement this. Also, if anyone can find existing code for this that would be great as well. Basically the functionality is going to be like a boggle solver but without being a matrix, just searching for all word possibilities from a string of characters. I do already have adequate dictionaries.
I was planning to do this in either python or ruby.
Thanks in advance for your help guys!

I'd use a Trie. Here's an implementation in Python: http://jtauber.com/2005/02/trie.py (credit to James Tauber)

I may be missing an understanding of the game but barring some complications in the rules, such as with the introduction of "joker" (wildcard) letters, missing or additional letters, multiple words etc... I think the following ideas would help turn the problem in a somewhat relatively uninteresting thing. :-(
Main idea: index words by the ordered sequence of their letters.
For example "computer" gets keyed as "cemoprtu". Whatever the random drawing provides is sorted in kind, and used as the key to find possible matches.
Using trie structures as suggested by perimosocordiae, as the underlying storage for these sorted keys and associated words(s)/wordIds in the "leaf" nodes, Word lookup can be done in O(n) time, where n is the number of letters (or better, on average due to non-existing words).
To further help with indexing we can have several tables/dictionaries, one per number of letters. Also depending on statistics the vowels and consonants could be handled separately. Another trick would be to have a custom sort order, placing the most selective letters first.
Additional twists to the game (such as finding words made from a subset of the letters) are mostly a matter of iterating over the power set of these letters and checking the dictionary for each combination.
A few heuristics can be introduced to help prune some of the combinations (for example, combinations without vowels [and of a given length] are not possible solutions). One should manage these heuristics carefully, for the lookup cost is relatively small.

For your dictionary index, build a map (Map[Bag[Char], List[String]]). It should be a hash map so you can get O(1) word lookup. A Bag[Char] is an identifier for a word that is unique up to character order. It is basically a hash map from Char to Int. The Char is a given character in the word and the Int is the number of times that character appears in the word.
Example:
{'a'=>3, 'n'=>1, 'g'=>1, 'r'=>1, 'm'=>1} => ["anagram"]
{'s'=>3, 't'=>1, 'r'=>1, 'e'=>2, 'd'=>1} => ["stressed", "desserts"]
To find words, take every combination of characters from the input string and look it up in this map. The complexity of this algorithm is O(2^n) in the length of the input string. Notably, the complexity does not depend on the length of the dictionary.
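A rough Python sketch of this lookup scheme. It swaps the Bag[Char] key for a string of the word's letters in sorted order, which identifies the same thing (a word up to character order) and is hashable out of the box. The function names and the tiny word list in the usage note are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_index(words):
    """Map each sorted-letter key to the list of words that share it."""
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index


def find_words(letters, index):
    """Look up every distinct combination of the input letters in the index."""
    pool = sorted(letters)  # combinations of a sorted pool come out sorted
    seen = set()  # repeated letters would otherwise produce duplicate keys
    found = []
    for size in range(1, len(pool) + 1):
        for combo in combinations(pool, size):
            key = "".join(combo)
            if key in seen:
                continue
            seen.add(key)
            found.extend(index.get(key, []))
    return found
```

With index = build_index(["stressed", "desserts", "red", "tree", "zoo"]), calling find_words("desserts", index) finds "stressed", "desserts", "red" and "tree" but not "zoo". The loop visits up to 2^n combinations, matching the complexity stated above, so it is only practical for short inputs.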

This sounds like Rabin-Karp string search would be a good choice. If you use a rolling hash-function then at each position you need one hash value update and one dictionary lookup. You also need to create a good way to cope with different word lengths, like truncating all words to the shortest word in the set and rechecking possible matches. Splitting the word set into separate length ranges will reduce the amount of false positives at the expense of increasing the hashing work.

There are two ways to do this. One is to check every candidate permutation of letters in the word to see if the candidate is in your dictionary of words. That's an O(N!) operation, depending on the length of the word.
The other way is to check every candidate word in your dictionary to see if it's contained within the word. This can be sped up by aggregating the dictionary; instead of every candidate word, you check all words that are anagrams of each other at once, since if any one of them is contained in your word, all of them are.
So start by building a dictionary whose key is a sorted string of letters and whose value is a list of the words that are anagrams of the key:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> with open(r"c:\temp\words.txt", "r") as f:
        for line in f.readlines():
            if line[0].isupper(): continue
            word = line.strip()
            key = "".join(sorted(word.lower()))
            d[key].append(word)
Now we need a function to see if a word contains a candidate. This function assumes that the word and candidate are both sorted, so that it can go through them both letter by letter and give up quickly when it finds that they don't match.
>>> def contains(sorted_word, sorted_candidate):
        wchars = (c for c in sorted_word)
        for cc in sorted_candidate:
            while(True):
                try:
                    wc = next(wchars)
                except StopIteration:
                    return False
                if wc < cc: continue
                if wc == cc: break
                return False
        return True
Now find all the candidate keys in the dictionary that are contained by the word, and aggregate all of their values into a single list:
>>> w = sorted("mythopoetic")
>>> result = []
>>> for k in d.keys():
        if contains(w, k): result.extend(d[k])
>>> len(result)
429
>>> sorted(result)[:20]
['c', 'ce', 'cep', 'ceti', 'che', 'chetty', 'chi', 'chime', 'chip', 'chit', 'chitty', 'cho', 'chomp', 'choop', 'chop', 'chott', 'chyme', 'cipo', 'cit', 'cite']
That last step takes about a quarter second on my laptop; there are 195K keys in my dictionary (I'm using the BSD Unix words file).
