Python optimizing loops for checking number of anagrams

I am trying to solve a problem that takes a string as input and outputs the number of anagrams possible. This can be solved using dictionaries, but I could only think in terms of for loops and string indices.
count = 0
for i in range(1,len(s)):
    for j in range(0,len(s)):
        for k in range(j+1,len(s)):
            if(k+i>len(s)):
                continue
            # print(list(s[j:j+i]),list(s[k:k+i]),j,j+i,k,k+i)
            if(sorted(list(s[j:j+i]))==sorted(list(s[k:k+i]))):
                count +=1
return count
I have coded this far and tried to optimize by skipping iterations with the k+i check. Can someone suggest other techniques to optimize the code without losing the logic? The code keeps getting terminated due to a time-out for larger strings. Should I replace the sorted function with something else?
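One possible way to keep this logic while dropping the innermost loop is sketched here. It assumes the loops are meant to count pairs of equal-length substrings (at different start positions) that are anagrams of each other. The function name count_anagram_pairs is made up for illustration. Each substring is reduced to its sorted letters, a dictionary counts how often each sorted form occurs, and a group of n matching substrings contributes n*(n-1)/2 pairs.

```python
from collections import Counter

def count_anagram_pairs(s):
    """Count pairs of same-length substrings of s that are anagrams."""
    total = 0
    # Same lengths as the original outer loop: 1 .. len(s) - 1.
    for length in range(1, len(s)):
        # Sorted letters act as a signature shared by all anagrams.
        signatures = Counter(
            "".join(sorted(s[start:start + length]))
            for start in range(len(s) - length + 1)
        )
        # n substrings with the same signature form n * (n - 1) / 2 pairs.
        total += sum(n * (n - 1) // 2 for n in signatures.values())
    return total
```

This still sorts every substring, but each one is sorted once instead of once per comparison.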

The number of anagrams if each letter was unique would be n!, with n as the length of the string (e.g. law has 3!=6). If a given letter is repeated, say, twice (e.g. wall), then you have twice as many answers as you should (since things like w-(second l)-(first l)-a are actually indistinguishable from things like w-(first l)-(second l)-a). It turns out that if a letter is repeated k times (k is 2 for the letter "l" in wall), n! overcounts by a factor of k!. This is true for each repeated letter.
So to get the number of anagrams, you can do:
letter_counts = get_letter_counts(s)  # returns something like [1, 1, 2] when given wall, since there is one w, one a, two ls
n_anagrams = factorial(len(s))
# account for overcounts
for letter_count in letter_counts:
    n_anagrams //= factorial(letter_count)  # integer division; the quotient is always exact here
return n_anagrams
Implementing factorial and get_letter_counts is left as an exercise for the reader :-) . Note: Be careful to consider that repeated letters can show up more than once, and not always next to each other. Ex: "aardvark" should return a count of 3 for the "a"s, 2 for the "r"s, and 1 for everything else.
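For reference, here is a minimal sketch of the whole calculation, assuming the standard library's math.factorial and collections.Counter stand in for the two helpers. The name count_anagrams is illustrative, not from the answer.

```python
from collections import Counter
from math import factorial

def count_anagrams(s):
    """Number of distinct rearrangements of the letters in s."""
    n_anagrams = factorial(len(s))
    # Counter tallies each letter wherever it occurs, adjacent or not.
    for letter_count in Counter(s).values():
        # Divide out the k! indistinguishable orderings of a repeated letter.
        n_anagrams //= factorial(letter_count)
    return n_anagrams
```

For example, count_anagrams("wall") is 4!/2! = 12.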

Related

Check how many character need to be deleted to make an anagram in Python

I wrote Python code to check how many characters need to be deleted from two strings for them to become anagrams of each other.
This is the problem statement: "Given two strings, a and b, that may or may not be of the same length, determine the minimum number of character deletions required to make a and b anagrams. Any characters can be deleted from either of the strings."
def makeAnagram(a, b):
    # Write your code here
    ac=0 # to count the number of occurrences of a character in a
    bc=0 # to count the number of occurrences of a character in b
    p=False # used to store the result of whether an element is in that string
    c=0 # count of characters to be deleted to make these two strings anagrams
    t=[] # list of previously checked characters
    for x in a:
        if x in t == True:
            continue
        ac=a.count(x)
        t.insert(0,x)
        for y in b:
            p = x in b
            if p==True:
                bc=b.count(x)
                if bc!=ac:
                    d=ac-bc
                    c=c+abs(d)
            elif p==False:
                c=c+1
    return(c)

You can use collections.Counter for this:
from collections import Counter
def makeAnagram(a, b):
    return sum((Counter(a) - Counter(b) | Counter(b) - Counter(a)).values())
Counter(x) (where x is a string) returns a dictionary that maps characters to how many times they appear in the string.
Counter(a) - Counter(b) gives you a dictionary that maps characters which are overabundant in a to how many more times they appear in a than in b.
Counter(b) - Counter(a) is like above, but for characters which are overabundant in b.
The | merges the two resulting counters. We then take the values of this, and sum them to get the total number of characters which are overrepresented in either string. This is equivalent to the minimum number of characters that need to be deleted to form an anagram.
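To see the intermediate values, here is a small sketch that splits the one-liner into steps. The helper name deletions_breakdown and the sample strings are illustrative only. Counter subtraction drops zero and negative counts, so a character can appear in at most one of the two differences, which is why merging them with | loses nothing.

```python
from collections import Counter

def deletions_breakdown(a, b):
    """Return the surplus characters of each string and the total deletions."""
    extra_in_a = Counter(a) - Counter(b)  # counts <= 0 are discarded
    extra_in_b = Counter(b) - Counter(a)
    merged = extra_in_a | extra_in_b      # keys never overlap, so nothing is lost
    return extra_in_a, extra_in_b, sum(merged.values())

extra_a, extra_b, total = deletions_breakdown("cde", "abc")
# extra_a holds d and e, extra_b holds a and b, so four deletions are needed.
```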
As for why your code doesn't work, I can't pin down any one problem with it. To obtain the code below, all I did was some simplification (e.g. removing unnecessary variables, looping over a and b together, removing == True and == False, replacing t with a set, giving variables descriptive names, etc.), and the code began working. Here is that simplified working code:
def makeAnagram(a, b):
    c = 0 # count of characters to be deleted to make these two strings anagrams
    seen = set() # set of previously checked characters
    for character in a + b:
        if character not in seen:
            seen.add(character)
            c += abs(a.count(character) - b.count(character))
    return c
I recommend you make it a point to learn how to write simple/short code. It may not seem important compared to actually tackling the algorithms and getting results. It may seem like cleanup or styling work. But it pays off enormously. Bugs are harder to introduce in simple code, and easier to spot. Oftentimes simple code will be more performant than equivalent complex code too, either because the programmer was able to more easily see ways to improve it, or because the more performant approach just arose naturally from the cleaner code.

Assuming there are only lowercase letters
The idea is to build a character-count array for each string, storing the frequency of each character. Then iterate over the two count arrays together: for each letter index i, abs(count1[i] - count2[i]) is the number of copies of that letter that must be removed from one string or the other, and the sum over all letters is the answer.
CHARS = 26
# function to calculate minimum
# numbers of characters
# to be removed to make two
# strings anagram
def remAnagram(str1, str2):
    count1 = [0]*CHARS
    count2 = [0]*CHARS
    i = 0
    while i < len(str1):
        count1[ord(str1[i])-ord('a')] += 1
        i += 1
    i = 0
    while i < len(str2):
        count2[ord(str2[i])-ord('a')] += 1
        i += 1
    # traverse count arrays to find
    # number of characters
    # to be removed
    result = 0
    for i in range(CHARS):
        result += abs(count1[i] - count2[i])
    return result
Here time complexity is O(n + m) where n and m are the length of the two strings
Space complexity is O(1) as we use only array of size 26
This can be further optimised by just using a single array for taking the count.
In this case for string s1 -> we increment the counter
for string s2 -> we decrement the counter
def makeAnagram(a, b):
    buffer = [0] * 26
    for char in a:
        buffer[ord(char) - ord('a')] += 1
    for char in b:
        buffer[ord(char) - ord('a')] -= 1
    return sum(map(abs, buffer))
if __name__ == "__main__":
    str1 = "bcadeh"
    str2 = "hea"
    print(makeAnagram(str1, str2))
Output: 3

Generating all possibilities with letters in python and exploiting results in python3

First, I got this problem: how many words are there (counting all of them, even those that don't make sense) of 5 letters that have at least one I and at least two T's, but no K or Y?
First, I defined the alphabet, which has 24 letters (K and Y aren't counted). After that, I wrote code to generate all possibilities:
alphabet = list(range(1, 24))
for L in range(0, len(alphabet)+1):
for subset in itertools.permutations(alphabet, L):
I don't know how to use the data.

If a “brute force” method is enough for you, this will work:
import string
import itertools
alphabet = string.ascii_uppercase.replace("K", "").replace("Y", "")
count = 0
for word in itertools.product(alphabet, repeat = 5):
    if "I" in word and word.count("T") >= 2:
        count += 1
print(count)
It will print the result 15645.
Note that you have to use itertools.product(), because itertools.permutations() never reuses an element of its input, so with each letter listed once in the alphabet it can never produce "T" twice.
Edit: Alternatively, you can calculate the count with a list comprehension or a generator expression. It takes advantage of the fact that boolean True and False are equivalent to integer values 1 and 0, respectively.
count = sum(
    "I" in word and word.count("T") >= 2
    for word in itertools.product(alphabet, repeat = 5)
)
NB: Interestingly, the first solution (explicit for loop with counter += 1) runs about 15 % faster on my computer than the second solution (generator expression with sum()). Both require the same amount of memory (this is expected).
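The brute-force result can also be cross-checked with a direct count. This is a sketch under the same assumptions (5 positions, 24 allowed letters). The function name count_words and its parameters are illustrative. For each possible number t of T's, choose their positions, then fill the remaining positions with non-T letters so that at least one is an I (all fillings minus the fillings without any I).

```python
from math import comb  # available in Python 3.8+

def count_words(length=5, alphabet_size=24, min_t=2):
    """Count words with at least one I and at least min_t T's."""
    non_t = alphabet_size - 1  # letters other than T (I is among them)
    total = 0
    for t in range(min_t, length + 1):
        rest = length - t
        # Fillings of the non-T positions that contain at least one I.
        with_i = non_t ** rest - (non_t - 1) ** rest
        total += comb(length, t) * with_i
    return total
```

With the defaults this gives 10 * 1519 + 10 * 45 + 5 * 1 + 0 = 15645, matching the loop.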

Longest repeated substring in massive string

Given a long string, find the longest repeated sub-string.
The brute-force approach of course is to find all substrings and check the substrings of the remaining string, but the string(s) in question have millions of characters (like a DNA sequence, AGGCTAGCT etc) and I'd like something that finishes before the universe collapses in on itself.
Tried a number of approaches, and I have one solution that works quite fast on strings of up to several million, but takes literally forever (6+ hours) for larger strings, particularly when the length of the repeated sequence gets really long.
from collections import defaultdict
def find_lrs(text, cntr=2):
    sol = (0, 0, 0)
    del_list = ['01','01','01']
    while len(del_list) != 0:
        d = defaultdict(list)
        for i in range(len(text)):
            d[text[i:i + cntr]].append(i)
        del_list = [(item, d[item]) for item in d if len(d[item]) > 1]
        # if list is empty, we're done
        if len(del_list) == 0:
            return sol
        else:
            sol = (del_list[0][1][0], (del_list[0][1][1]),len(del_list[0][0]))
            cntr += 1
    return sol
I know it's ugly, but hey, I'm a beginner, and I'm just happy I got something to work. Idea is to go through the string starting out with length-2 substrings as the keys, and the index the substring is at the value. If the text was, say, 'BANANA', after the first pass through, the dict would look like this:
{'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}
BA shows up only once - starting at index 0. AN and NA show up twice, showing up at index 1/3 and 2/4, respectively.
I then create a list that only includes keys that showed up at least twice. In the example above, we can remove BA, since it only showed up once - if there's no repeated substring of length 2 starting with 'BA', there won't be a repeated substring of length 3 starting with BA.
So after the first pass through, the pruned list is:
[('AN', [1, 3]), ('NA', [2, 4])]
Since there are at least two possibilities, we save the longest substring and indices found so far and increment the substring length to 3. We continue until no substring is repeated.
As noted, this works on strings up to 10 million in about 2 minutes, which apparently is reasonable - BUT, that's with the longest repeated sequence being fairly short. On a shorter string but longer repeated sequence, it takes -hours- to run. I suspect that it has something to do with how big the dictionary gets, but not quite sure why.
What I'd like to do of course is keep the dictionary short by removing the substrings that clearly aren't repeated, but I can't delete items from the dict while iterating over it. I know there are suffix tree approaches and such that - for now - are outside my ken.
Could simply be that this is beyond my current knowledge, which of course is fine, but I can't help shaking the idea that there is a solution here.

I forgot to update this. After going over my code again, away from my PC - literally writing out little diagrams on my iPad - I realized that the code above wasn't doing what I thought it was doing.
As noted above, my plan of attack was to start out by going through the string with length-2 substrings as the keys and the indices where each substring starts as the values, create a list that captures only length-2 substrings that occurred at least twice, and only look at those locations.
All well and good - but look closely and you'll realize that I'm never actually updating the default dictionary to only have locations with two or more repeats! //bangs head against table.
I ultimately came up with two solutions. The first solution used a slightly different approach, the 'sorted suffixes' approach. This gets all the suffixes of the word, then sorts them in alphabetical order. For example, the suffixes of "BANANA", sorted, would be:
A
ANA
ANANA
BANANA
NA
NANA
We then look at each adjacent suffix and find how many letters each pair start out having in common. A and ANA have only 'A' in common. ANA and ANANA have "ANA" in common, so we have length 3 as the longest repeated substring. ANANA and BANANA have nothing in common at the start, ditto BANANA and NA. NA and NANA have "NA" in common. So 'ANA', length 3, is the longest repeated substring.
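The BANANA walk-through above can be reproduced with a short sketch. It uses os.path.commonprefix from the standard library, which compares strings character by character, as a stand-in for a hand-written comparison. The function name adjacent_common_prefixes is illustrative, and building every suffix like this is only practical for small inputs.

```python
from os.path import commonprefix

def adjacent_common_prefixes(text):
    """Pair each sorted suffix with its neighbour and their shared prefix."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    return [(a, b, commonprefix([a, b]))
            for a, b in zip(suffixes, suffixes[1:])]

pairs = adjacent_common_prefixes("BANANA")
# The longest shared prefix among neighbours is the longest repeated substring.
longest = max((prefix for _, _, prefix in pairs), key=len)
```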
I made a little helper function to do the actual comparing. The code looks like this:
def longest_prefix(suf1, suf2):
    # return the common prefix of two suffixes and its length
    min_len = min(len(suf1), len(suf2))
    for i in range(min_len):
        if suf1[i] != suf2[i]:
            return (suf1[:i], i)
    # no mismatch: the shorter suffix is itself the common prefix
    return (suf1[:min_len], min_len)
def longest_repeat(txt):
    lst = sorted([txt[i:] for i in range(len(txt))])
    mxLen = 0
    mx_string = ""
    for x in range(len(lst) - 1):
        temp = longest_prefix(lst[x], lst[x + 1])
        if temp[1] > mxLen:
            mxLen = temp[1]
            mx_string = temp[0]
    first = txt.find(mx_string)
    last = txt.rfind(mx_string)
    return (first, last, mxLen)
This works. I then went back and relooked at my original code and saw that I wasn't resetting the dictionary. The key is that after each pass through I update the dictionary to -only- look at repeat candidates.
from collections import defaultdict
def longest_repeat(text):
    # create the initial dictionary with all length-2 substrings
    cntr = 2 # size of initial substring length we look for
    d = defaultdict(list)
    for i in range(len(text)):
        d[text[i:i + cntr]].append(i)
    # keep only the position lists of substrings that occur more than once
    del_list = [(d[item]) for item in d if len(d[item]) > 1]
    sol = (0,0,0)
    # Keep looking as long as del_list isn't empty,
    while len(del_list) > 0:
        d = defaultdict(list) # reset dictionary
        cntr += 1 # increment search length
        for item in del_list:
            for i in item:
                d[text[i:i + cntr]].append(i)
        # filter as above
        del_list = [(d[item]) for item in d if len(d[item]) > 1]
        # if not empty, update solution
        if len(del_list) != 0:
            sol = (del_list[0][0], del_list[0][1], cntr)
    return sol
This was quite fast, and I think it's easier to follow.

Find longest quasi-constant sub-sequence of a sequence

I had this test earlier today, and I tried to be too clever and hit a road block. Unfortunately I got stuck in this mental rut and wasted too much time, failing this portion of the test. I solved it afterward, but maybe y'all can help me get out of the initial rut I was in.
Problem definition:
An unordered and non-unique sequence A consisting of N integers (all positive) is given. A subsequence of A is any sequence obtained by removing none, some or all elements from A. The amplitude of a sequence is the difference between the largest and the smallest element in this sequence. The amplitude of the empty subsequence is assumed to be 0.
For example, consider the sequence A consisting of six elements such that:
A[0] = 1
A[1] = 7
A[2] = 6
A[3] = 2
A[4] = 6
A[5] = 4
A subsequence of array A is called quasi-constant if its amplitude does not exceed 1. In the example above, the subsequences [1,2], [6,6], and [6,6,7] are quasi-constant. Subsequence [6, 6, 7] is the longest possible quasi-constant subsequence of A.
Now, find a solution that, given a non-empty zero-indexed array A consisting of N integers, returns the length of the longest quasi-constant subsequence of array A. For example, given sequence A outlined above, the function should return 3, as explained.
Now, I solved this in python 3.6 after the fact using a sort-based method with no classes (my code is below), but I didn't initially want to do that as sorting on large lists can be very slow. It seemed this should have a relatively simple formulation as a breadth-first tree-based class, but I couldn't get it right. Any thoughts on this?
My class-less sort-based solution:
def amp(sub_list):
    if len(sub_list) <2:
        return 0
    else:
        return max(sub_list) - min(sub_list)
def solution(A):
    A.sort()
    longest = 0
    idxStart = 0
    idxEnd = idxStart + 1
    while idxEnd <= len(A):
        tmp = A[idxStart:idxEnd]
        if amp(tmp) < 2:
            idxEnd += 1
            if len(tmp) > longest:
                longest = len(tmp)
        else:
            idxStart = idxEnd
            idxEnd = idxStart + 1
    return longest

As Andrey Tyukin pointed out, you can solve this problem in O(n) time, which is better than the O(n log n) time you'd likely get from either sorting or any kind of tree based solution. The trick is to use dictionaries to count the number of occurrences of each number in the input, and use the count to figure out the longest subsequence.
I had a similar idea to him, but I had thought of a slightly different implementation. After a little testing, it looks like my approach is quite a bit faster, so I'm posting it as my own answer. It's quite short!
from collections import Counter
def solution(seq):
    if not seq: # special case for empty input sequence
        return 0
    counts = Counter(seq)
    return max(counts[x] + counts[x+1] for x in counts)
I suspect this is faster than Andrey's solution because the running time for both of our solutions really is O(n) + O(k), where k is the number of distinct values in the input (and n is the total number of values in the input). My code handles the O(n) part very efficiently by handing off the sequence to the Counter constructor, which is implemented in C. It is likely to be a bit slower (on a per-item basis) to deal with the O(k) part, since it needs a generator expression. Andrey's code does the reverse (it runs slower Python code for the O(n) part, and uses faster builtin C functions for the O(k) part). Since k is always less than or equal to n (perhaps a lot less if the sequence has a lot of repeated values), my code is faster overall. Both solutions are still O(n) though, and both should be much better than sorting for large inputs.

I don't know how BFS is supposed to help here.
Why not simply run once through the sequence and count how many elements every possible quasi-constant subsequence would have?
from collections import defaultdict
def longestQuasiConstantSubseqLength(seq):
    d = defaultdict(int)
    for s in seq:
        d[s] += 1
        d[s+1] += 1
    return max(d.values() or [0])
s = [1,7,6,2,6,4]
print(longestQuasiConstantSubseqLength(s))
prints:
3
as expected.
Explanation: Every non-constant quasi-constant subsequence is uniquely identified by the greatest number that it contains (there can be only two, take the greater one). Now, if you have a number s, it can either contribute to the quasi-constant subsequence that has s or s + 1 as the greatest number. So, just add +1 to the subsequences identified by s and s + 1. Then output the maximum of all counts.
You can't get it faster than O(n), because you have to look at every entry of the input sequence at least once.
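To make the counting explanation concrete, here is a sketch that returns the whole table of counts instead of just the maximum. The function name quasi_constant_counts is illustrative. Each key is the greatest value of a candidate subsequence, and each value is how many input elements could belong to it.

```python
from collections import defaultdict

def quasi_constant_counts(seq):
    """Map each candidate greatest value to the number of elements it covers."""
    counts = defaultdict(int)
    for value in seq:
        counts[value] += 1      # value is the greatest element
        counts[value + 1] += 1  # value sits one below the greatest element
    return dict(counts)

counts = quasi_constant_counts([1, 7, 6, 2, 6, 4])
# counts[7] is 3: the single 7 plus the two 6s, i.e. the subsequence [6, 6, 7].
```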

Sorting a string to make a new one

Here I had to repeatedly remove the most frequent letter of a string (if two letters have the same frequency, the alphabetically first one) and append it to a new string.
Input:
abbcccdddd
Output:
dcdbcdabcd
The code I wrote is:
s = list(sorted(<the input string>))
a = []
for c in range(len(s)):
    freq =[0 for _ in range(26)]
    for x in s:
        freq[ord(x)-ord('a')] += 1
    m = max(freq)
    allindices = [p for p,q in enumerate(freq) if q == m]
    r = chr(97+allindices[0])
    a.append(r)
    s.remove(r)
print(''.join(a))
But it exceeded the allowed runtime limit, maybe due to too many loops. (There's another for loop which separates the strings from user input.)
I was hoping someone could suggest a more optimised version of it that uses less memory.

Your solution rescans the string once for every output character and does a bunch of unnecessary conversions to count the frequencies. You can save some work by replacing all those linear scans with a linear count step, another linear repetition generation, then a sort to order your letters and a final linear pass to strip counts:
from collections import Counter # For unsorted input
from itertools import groupby # For already sorted input
from operator import itemgetter
def makenewstring(inp):
    # When inp not guaranteed to be sorted:
    counts = Counter(inp).items()
    # Alternative if inp is guaranteed to be sorted (use instead of the line above):
    # counts = ((let, len(list(g))) for let, g in groupby(inp))
    # Create appropriate number of repetitions of each letter tagged with a count
    # and sort to put each repetition of a letter in correct order
    # Use negative n's so much more common letters appear repeatedly at start, not end
    repeats = sorted((n, let) for let, cnt in counts for n in range(0, -cnt, -1))
    # Remove counts and join letters
    return ''.join(map(itemgetter(1), repeats))
Updated: It occurred to me that my original solution could be made much more concise, a one-liner actually (excluding required imports), that minimizes temporaries, in favor of a single sort-by-key operation that uses a trick to sort each letter by the count of that letter seen so far:
from collections import defaultdict
from itertools import count
def makenewstring(inp):
    return ''.join(sorted(inp, key=lambda c, d=defaultdict(count): (-next(d[c]), c)))
This is actually the same basic logic as the original answer, it just accomplishes it by having sorted perform the decoration and undecoration of the values implicitly instead of doing it ourselves explicitly (implicit decorate/undecorate is the whole point of sorted's key argument; it's doing the Schwartzian transform for you).
Performance-wise, both approaches are similar; they both (in practice) scale linearly for smaller inputs (the one-liner up to inputs around 150 characters long, the longer code, using Counter, up to inputs in the len 2000 range), and while the growth is super-linear above that point, it's always below the theoretical O(n log_2 n) (likely due to the data being not entirely random thanks to the counts and limited alphabet, ensuring Python's TimSort has some existing ordering to take advantage of). The one-liner is somewhat faster for smaller strings (len 100 or less), the longer code is somewhat faster for larger strings (I'm guessing it has something to do with the longer code creating some ordering by grouping runs of counts for each letter). Really though, it hardly matters unless the input strings are expected to be huge.

Since the alphabet is always a constant 26 characters, this works in O(N) time and needs only a constant amount of extra space (26 counters):
from collections import Counter
from string import ascii_lowercase
def sorted_alphabet(text):
    freq = Counter(text)
    alphabet = [c for c in ascii_lowercase if freq[c]] # letters with freq >= 1
    top_freq = max(freq.values()) if text else 0 # handle empty text eg. ''
    for threshold in range(top_freq, 0, -1): # from top_freq to 1
        for letter in alphabet:
            if freq[letter] >= threshold:
                yield letter
print(''.join(sorted_alphabet('abbcccdddd')))
print(''.join(sorted_alphabet('dbdd')))
print(''.join(sorted_alphabet('')))
print(''.join(sorted_alphabet('xxxxaaax')))
dcdbcdabcd
ddbd
xxaxaxax

What about this?
I am making use of in-built python functions to eliminate loops and improve efficiency.
test_str = 'abbcccdddd'
remaining_letters = [1] # dummy initialisation
# sort alphabetically
unique_letters = sorted(set(test_str))
frequencies = [test_str.count(letter) for letter in unique_letters]
out = []
while remaining_letters:
    # in case of ties, index takes the first occurrence, so the alphabetical order is preserved
    max_idx = frequencies.index(max(frequencies))
    out.append(unique_letters[max_idx])
    # directly update frequencies instead of calculating them again
    frequencies[max_idx] -= 1
    remaining_letters = [idx for idx, freq in enumerate(frequencies) if freq>0]
print(''.join(out)) # dcdbcdabcd
