Sorting a string to make a new one

Sorting a string to make a new one - python

Here I had to remove the most frequent alphabet of a string(if frequency of two alphabets is same, then in alphabetical order) and put it into new string.
Input:
abbcccdddd
Output:
dcdbcdabcd
The code I wrote is:
s = list(sorted(<the input string>))
a = []
for c in range(len(s)):
freq =[0 for _ in range(26)]
for x in s:
freq[ord(x)-ord('a')] += 1
m = max(freq)
allindices = [p for p,q in enumerate(freq) if q == m]
r = chr(97+allindices[0])
a.append(r)
s.remove(r)
print''.join(a)
But it passed the allowed runtime limit maybe due to too many loops.(There's another for loop which seperates the strings from user input)
I was hoping if someone could suggest a more optimised version of it using less memory space.

Your solution involves 26 linear scans of the string and a bunch of unnecessary
conversions to count the frequencies. You can save some work by replacing all those linear scans with a linear count step, another linear repetition generation, then a sort to order your letters and a final linear pass to strip counts:
from collections import Counter # For unsorted input
from itertools import groupby # For already sorted input
from operator import itemgetter
def makenewstring(inp):
# When inp not guaranteed to be sorted:
counts = Counter(inp).iteritems()
# Alternative if inp is guaranteed to be sorted:
counts = ((let, len(list(g))) for let, g in groupby(inp))
# Create appropriate number of repetitions of each letter tagged with a count
# and sort to put each repetition of a letter in correct order
# Use negative n's so much more common letters appear repeatedly at start, not end
repeats = sorted((n, let) for let, cnt in counts for n in range(0, -cnt, -1))
# Remove counts and join letters
return ''.join(map(itemgetter(1), repeats))
Updated: It occurred to me that my original solution could be made much more concise, a one-liner actually (excluding required imports), that minimizes temporaries, in favor of a single sort-by-key operation that uses a trick to sort each letter by the count of that letter seen so far:
from collections import defaultdict
from itertools import count
def makenewstring(inp):
return ''.join(sorted(inp, key=lambda c, d=defaultdict(count): (-next(d[c]), c)))
This is actually the same basic logic as the original answer, it just accomplishes it by having sorted perform the decoration and undecoration of the values implicitly instead of doing it ourselves explicitly (implicit decorate/undecorate is the whole point of sorted's key argument; it's doing the Schwartzian transform for you).
Performance-wise, both approaches are similar; they both (in practice) scale linearly for smaller inputs (the one-liner up to inputs around 150 characters long, the longer code, using Counter, up to inputs in the len 2000 range), and while the growth is super-linear above that point, it's always below the theoretical O(n log_2 n) (likely due to the data being not entirely random thanks to the counts and limited alphabet, ensuring Python's TimSort has some existing ordering to take advantage of). The one-liner is somewhat faster for smaller strings (len 100 or less), the longer code is somewhat faster for larger strings (I'm guessing it has something to do with the longer code creating some ordering by grouping runs of counts for each letter). Really though, it hardly matters unless the input strings are expected to be huge.

Since the alphabet will always be a constant 26 characters,
this will work in O(N) and only takes a constant amount of space of 26
from collections import Counter
from string import ascii_lowercase
def sorted_alphabet(text):
freq = Counter(text)
alphabet = filter(freq.get, ascii_lowercase) # alphabet filtered with freq >= 1
top_freq = max(freq.values()) if text else 0 # handle empty text eg. ''
for top_freq in range(top_freq, 0, -1): # from top_freq to 1
for letter in alphabet:
if freq[letter] >= top_freq:
yield letter
print ''.join(sorted_alphabet('abbcccdddd'))
print ''.join(sorted_alphabet('dbdd'))
print ''.join(sorted_alphabet(''))
print ''.join(sorted_alphabet('xxxxaaax'))
dcdbcdabcd
ddbd
xxaxaxax

What about this?
I am making use of in-built python functions to eliminate loops and improve efficiency.
test_str = 'abbcccdddd'
remaining_letters = [1] # dummy initialisation
# sort alphabetically
unique_letters = sorted(set(test_str))
frequencies = [test_str.count(letter) for letter in unique_letters]
out = []
while(remaining_letters):
# in case of ties, index takes the first occurence, so the alphabetical order is preserved
max_idx = frequencies.index(max(frequencies))
out.append(unique_letters[max_idx])
#directly update frequencies instead of calculating them again
frequencies[max_idx] -= 1
remaining_letters = [idx for idx, freq in enumerate(frequencies) if freq>0]
print''.join(out) #dcdbcdabcd

Related

Check how many character need to be deleted to make an anagram in Python

I wrote python code to check how many characters need to be deleted from two strings for them to become anagrams of each other.
This is the problem statement "Given two strings, and , that may or may not be of the same length, determine the minimum number of character deletions required to make and anagrams. Any characters can be deleted from either of the strings"
def makeAnagram(a, b):
# Write your code here
ac=0 # tocount the no of occurences of chracter in a
bc=0 # tocount the no of occurences of chracter in b
p=False #used to store result of whether an element is in that string
c=0 #count of characters to be deleted to make these two strings anagrams
t=[] # list of previously checked chracters
for x in a:
if x in t == True:
continue
ac=a.count(x)
t.insert(0,x)
for y in b:
p = x in b
if p==True:
bc=b.count(x)
if bc!=ac:
d=ac-bc
c=c+abs(d)
elif p==False:
c=c+1
return(c)

You can use collections.Counter for this:
from collections import Counter
def makeAnagram(a, b):
return sum((Counter(a) - Counter(b) | Counter(b) - Counter(a)).values())
Counter(x) (where x is a string) returns a dictionary that maps characters to how many times they appear in the string.
Counter(a) - Counter(b) gives you a dictionary that maps characters which are overabundant in b to how many times they appear in b more than the number of times they appear in a.
Counter(b) - Counter(a) is like above, but for characters which are overabundant in a.
The | merges the two resulting counters. We then take the values of this, and sum them to get the total number of characters which are overrepresented in either string. This is equivalent to the minimum number of characters that need to be deleted to form an anagram.
As for why your code doesn't work, I can't pin down any one problem with it. To obtain the code below, all I did was some simplification (e.g. removing unnecessary variables, looping over a and b together, removing == True and == False, replacing t with a set, giving variables descriptive names, etc.), and the code began working. Here is that simplified working code:
def makeAnagram(a, b):
c = 0 # count of characters to be deleted to make these two strings anagrams
seen = set() # set of previously checked characters
for character in a + b:
if character not in seen:
seen.add(character)
c += abs(a.count(character) - b.count(character))
return c
I recommend you make it a point to learn how to write simple/short code. It may not seem important compared to actually tackling the algorithms and getting results. It may seem like cleanup or styling work. But it pays off enormously. Bug are harder to introduce in simple code, and easier to spot. Oftentimes simple code will be more performant than equivalent complex code too, either because the programmer was able to more easily see ways to improve it, or because the more performant approach just arose naturally from the cleaner code.

Assuming there are only lowercase letters
The idea is to make character count arrays for both the strings and store frequency of each character. Now iterate the count arrays of both strings and difference in frequency of any character abs(count1[str1[i]-‘a’] – count2[str2[i]-‘a’]) in both the strings is the number of character to be removed in either string.
CHARS = 26
# function to calculate minimum
# numbers of characters
# to be removed to make two
# strings anagram
def remAnagram(str1, str2):
count1 = [0]*CHARS
count2 = [0]*CHARS
i = 0
while i < len(str1):
count1[ord(str1[i])-ord('a')] += 1
i += 1
i =0
while i < len(str2):
count2[ord(str2[i])-ord('a')] += 1
i += 1
# traverse count arrays to find
# number of characters
# to be removed
result = 0
for i in range(26):
result += abs(count1[i] - count2[i])
return result
Here time complexity is O(n + m) where n and m are the length of the two strings
Space complexity is O(1) as we use only array of size 26
This can be further optimised by just using a single array for taking the count.
In this case for string s1 -> we increment the counter
for string s2 -> we decrement the counter
def makeAnagram(a, b):
buffer = [0] * 26
for char in a:
buffer[ord(char) - ord('a')] += 1
for char in b:
buffer[ord(char) - ord('a')] -= 1
return sum(map(abs, buffer))
if __name__ == "__main__" :
str1 = "bcadeh"
str2 = "hea"
print(makeAnagram(str1, str2))
Output : 3

Generating all possibilities with letters in python and exploiting results in python3

First, I got this problem: how many words are there (counting all of them, even those that don't make sense) of 5 letters that have at least one I and at least two T's, but no K or Y?
First, I defined the alphabet, which has 24 letters ( k and y aren't counted). After that, i made a code to generate all possibilites
alphabet = list(range(1, 24))
for L in range(0, len(alphabet)+1):
for subset in itertools.permutations(alphabet, L):
I don't know how to use the data.

If a “brute force” method is enough for you, this will work:
import string
import itertools
alphabet = string.ascii_uppercase.replace("K", "").replace("Y", "")
count = 0
for word in itertools.product(alphabet, repeat = 5):
if "I" in word and word.count("T") >= 2:
count += 1
print (count)
It will print the result 15645.
Note that you have to use itertools.product(), because itertools.permutations() will not contain repeated occurences, so it will never contain “T” twice.
Edit: Alternatively, you can calculate the count with a list comprehension or a generator expression. It takes advantage of the fact that boolean True and False are equivalent to integer values 1 and 0, respectively.
count = sum(
"I" in word and word.count("T") >= 2
for word in itertools.product(alphabet, repeat = 5)
)
NB: Interestingly, the first solution (explicit for loop with counter += 1) runs about 15 % faster on my computer than the second solution (generator expression with sum()). Both require the same amount of memory (this is expected).

How can I merge overlapping strings in python?

I have some strings,
['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
These strings partially overlap each other. If you manually overlapped them you would get:
SGALWDVPSPV
I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.
Does anyone have any advice on a nice way to do this in python? Thanks!

Here is a quick sorting solution:
s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b
Output:
'SGALWDVPSPV'
This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.

My proposed solution with a more challenging test list:
#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']
for repeat in range(0, len(strFrag)-1):
bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
for otherStr in strFrag[1:]:
for x in range(0,len(otherStr)):
if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
if len(otherStr)-x > bestMatch[0]:
bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
if x > bestMatch[0]:
bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
if bestMatch[0] > 2:
strFrag[0] = bestMatch[2]
strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]
print(strFrag)
print(strFrag[0])
Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.

To determine the overlap of two strings a and b, you can check if any prefix of b is a suffix of a. You can then use that check in a simple loop, aggregating the result and slicing the next string in the list according to the overlap.
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
def overlap(a, b):
return max(i for i in range(len(b)+1) if a.endswith(b[:i]))
res = lst[0]
for s in lst[1:]:
o = overlap(res, s)
res += s[o:]
print(res) # SGALWDVPSPV
Or using reduce:
from functools import reduce # Python 3
print(reduce(lambda a, b: a + b[overlap(a,b):], lst))
This is probably not super-efficient, with complexity of about O(n k), with n being the number of strings in the list and k the average length per string. You can make it a bit more efficient by only testing whether the last char of the presumed overlap of b is the last character of a, thus reducing the amount of string slicing and function calls in the generator expression:
def overlap(a, b):
return max(i for i in range(len(b)) if b[i-1] == a[-1] and a.endswith(b[:i]))

Here's my solution which borders on brute force from the OP's perspective. It's not bothered by order (threw in a random shuffle to confirm that) and there can be non-matching elements in the list, as well as other independent matches. Assumes overlap means not a proper subset but independent strings with elements in common at the start and end:
from collections import defaultdict
from random import choice, shuffle
def overlap(a, b):
""" get the maximum overlap of a & b plus where the overlap starts """
overlaps = []
for i in range(len(b)):
for j in range(len(a)):
if a.endswith(b[:i + 1], j):
overlaps.append((i, j))
return max(overlaps) if overlaps else (0, -1)
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV', 'NONSEQUITUR']
shuffle(lst) # to verify order doesn't matter
overlaps = defaultdict(list)
while len(lst) > 1:
overlaps.clear()
for a in lst:
for b in lst:
if a == b:
continue
amount, start = overlap(a, b)
overlaps[amount].append((start, a, b))
maximum = max(overlaps)
if maximum == 0:
break
start, a, b = choice(overlaps[maximum]) # pick one among equals
lst.remove(a)
lst.remove(b)
lst.append(a[:start] + b)
print(*lst)
OUTPUT
% python3 test.py
NONSEQUITUR SGALWDVPSPV
%
Computes all the overlaps and combines the largest overlap into a single element, replacing the original two, and starts process over again until we're down to a single element or no overlaps.
The overlap() function is horribly inefficient and likely can be improved but that doesn't matter if this isn't the type of matching the OP desires.

Once the peptides start to grow to 20 aminoacids cdlane's code chokes and spams (multiple) incorrect answer(s) with various amino acid lengths.
Try to add and use AA sequence 'VPSGALWDVPS' with or without 'D' and the code starts to fail its task because the N-and C-terminus grow and do not reflect what Adam Price is asking for. The output is: 'SGALWDVPSGALWDVPSPV' and thus 100% incorrect despite the effort.
Tbh imo there is only one 100% answer and that is to use BLAST and its protein search page or BLAST in the BioPython package. Or adapt cdlane's code to reflect AA gaps, substitutions and AA additions.

Dredging up an old thread, but had to solve this myself today.
For this specific case, where the fragments are already in order, and each overlap by the same amount (in this case 1), the following fairly simply concatenation works, though might not be the worlds most robust solution:
lst = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
reference = "SGALWDVPSPV"
string = "".join([i[0] for i in lst] + [lst[-1][1:]])
reference == string
True

How many minimum numbers of characters from a given string S, should delete to make it a sorted string [duplicate]

This question already has answers here:
How to determine the longest increasing subsequence using dynamic programming?
(20 answers)
Closed 6 years ago.
I need to find the minimum number of deletions required to make string sorted.
Sample Test case:
# Given Input:
teststr = "abcb"
# Expected output:
1
# Explanation
# In this test case, if I delete last 'b' from "abcb",
# then the remaining string "abc" is sorted.
# That is, a single deletion is required.
# Given Input:
teststr = "vwzyx"
# Expected output:
2
# Explanation
# Here, if I delete 'z' and 'x' from "vwzyx",
# then the remaining string "vwy" is a sorted string.
I tried the following but it gives time limit exceeded error.
Any other approach to this problem?
string = input()
prev_ord = ord(string[0])
deletion = 0
for char in string[1:]:
if ord(char) > prev_ord +1 or ord(char) < prev_ord:
deletion += 1
continue
prev_ord = ord(char)
print(deletion)

Your current algorithm will give incorrect results for many strings.
I suspect that there's a more efficient way to solve this problem, but here's a brute-force solution. It generates subsets of the input string, ordered by length, descending. The elements in the subsets retain the order from the original string. As soon as count_deletions finds an ordered subset it returns it (converted back into a string), as well as the number of deletions. Thus the solution it finds is guaranteed to be no shorter than any other sorted selection of the input string.
Please see the itertools docs for info about the various itertools functions I've used; the algorithm for generating subsets was derived from the powerset example in the Recipes section.
from itertools import chain, combinations
def count_deletions(s):
for t in chain.from_iterable(combinations(s, r) for r in range(len(s), 0, -1)):
t = list(t)
if t == sorted(t):
return ''.join(t), len(s) - len(t)
# Some test data.
data = [
"abcdefg",
"cba",
"abcb",
"vwzyx",
"zvwzyx",
"adabcef",
"fantastic",
]
for s in data:
print(s, count_deletions(s))
output
abcdefg ('abcdefg', 0)
cba ('c', 2)
abcb ('abc', 1)
vwzyx ('vwz', 2)
zvwzyx ('vwz', 3)
adabcef ('aabcef', 1)
fantastic ('fntt', 5)
That data set is not really adequate to fully test algorithms designed to solve this problem, but I guess it's an ok starting point. :)
Update
Here's a Python 3 implementation of the algorithm mentioned by Salvador Dali on the linked page. It's much faster than my previous brute-force approach, especially for longer strings.
We can find the longest sorted subsequence by sorting a copy of the string and then finding the Longest Common Subsequence (LCS) of the original string & the sorted string. Salvador's version removes duplicate elements from the sorted string because he wants the result to be strictly increasing, but we don't need that here.
This code only returns the number of deletions required, but it's easy enough to modify it to return the actual sorted string.
To make this recursive function more efficient it uses the lru_cache decorator from functools.
from functools import lru_cache
#lru_cache(maxsize=None)
def lcs_len(x, y):
if not x or not y:
return 0
xhead, xtail = x[0], x[1:]
yhead, ytail = y[0], y[1:]
if xhead == yhead:
return 1 + lcs_len(xtail, ytail)
return max(lcs_len(x, ytail), lcs_len(xtail, y))
def count_deletions(s):
lcs_len.cache_clear()
return len(s) - lcs_len(s, ''.join(sorted(s)))
data = [
"abcdefg",
"cba",
"abcb",
"vwzyx",
"zvwzyx",
"adabcef",
"fantastic",
]
for s in data:
print(s, count_deletions(s))
output
abcdefg 0
cba 2
abcb 1
vwzyx 2
zvwzyx 3
adabcef 1
fantastic 5

Hope it works for all cases :)
s = input()
s_2 = ''.join(sorted(set(s), key=s.index))
sorted_string = sorted(s_2)
str_to_list = list(s_2)
dif = 0
for i in range(len(sorted_string)):
if sorted_string[i]!=str_to_list[i]:
dif+=1
print(dif+abs(len(s)-len(s_2)))

longest common sub-string, Python complexity analysis

I built a function which finds the longest common sub-string of two text files in ascending order based on Rabin–Karp algorithm.
the main function is "find_longest" and the inner functions are: "make_hashtable","extend_fingerprints" and "has_match".
I'm having trouble analyzing the average case complexity of has_match.
Denote n1,n2 as text1,text2 and l as the size of the currunt "window".
fingers are the hash table of the substring.
def has_match(text1,text2,fingers1,fingers2,l,r):
h = make_hashtable(fingers2,r)
for i in range(len(fingers1)):
for j in h[fingers1[i]]:
if text1[i:i+l] == text2[j:j+l]:
return text1[i:i+l]
return None
this is "make_hashtable", here I'm pretty sure that the complexcity is O(n2-l+1):
def make_hashtable(fingers, table_size):
hash_table=[[] for i in range(table_size)]
count=0
for f in fingers:
hash_table[f].append(count)
count+=1
return hash_table
this is "find_longest", im adding this function despite the fact that i dont need it for the complexity analyzing.
def find_longest(text1,text2,basis=2**8,r=2**17-1):
match = ''
l = 0 #initial "window" size
#fingerprints of "windows" of size 0 - all are 0
fingers1 = [0]*(len(text1)+1)
fingers2 = [0]*(len(text2)+1)
while match != None: #there was a common substring of len l
l += 1
extend_fingerprints(text1, fingers1, l, basis, r)
extend_fingerprints(text2, fingers2, l, basis, r)
match = has_match(text1,text2,fingers1,fingers2,l,r)
print(match)
return l-1
and this is "extend_fingerprints":
def extend_fingerprints(text, fingers, length, basis=2**8, r=2**17-1):
count=0
for f in fingers:
if count==len(fingers)-1:
fingers.pop(len(fingers)-1)
break
fingers[count]=(f*basis+ord(text[length-1+count]))%r
count+=1
I'm having doubts between this two options:
1.O(n_2-l+1)+O(n_1-l+1)*O(l)
Refer to r as a constant number while n1,n2 are very large therefore a lot of collisions would be made at the hash table (let's say O(1) items at every 'cell', yet, always some "false-positives")
2.O(n_2-l+1)+O(n_1-l+1)+O(l)
Refer to r as optimal for a decent hash function, therefore almost no collisions which means that if two texts are the same cell at the hash table we may assume they are actually the same text?
Personally I lean towards the Bold statement.
tnx.

I think the answer is
O((n_2-l) + l*(n_1-l))
.
(n_2-l) represents the complexity of make_hashtable for the second text.
l*(n_1-l) represents the two nested loops who go through every item in the finger prints of the first text and perform 1 comparison operation (for l length slice), for some constant 'm' if there are some items of the same index in the hash table.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sorting a string to make a new one - python

Related

Check how many character need to be deleted to make an anagram in Python

Generating all possibilities with letters in python and exploiting results in python3

How can I merge overlapping strings in python?

How many minimum numbers of characters from a given string S, should delete to make it a sorted string [duplicate]

longest common sub-string, Python complexity analysis

Categories

Resources