Let's say I have the string abcixigea and I want to replace the first 'i', the 'e' and the second 'a' with '1', '3' and '4', getting all the combinations of those "progressive" replacements.
So, I need to get:
abc1xigea
abcixig3a
abcixig34
abc1xige4
...and so on.
I tried with itertools.product, following this question: python string replacement, all possible combinations #2, but the result I get is not exactly what I need, and I can see why.
However I'm stuck on trying with combinations and keeping parts of the string fixed (changing just some chars as explained above).
from itertools import product
s = "abc{}xig{}{}"
for combo in product(("i", 1), ("e", 3), ("a", 4)):
    print(s.format(*combo))
produces
abcixigea
abcixige4
abcixig3a
abcixig34
abc1xigea
abc1xige4
abc1xig3a
abc1xig34
Edit: in a more general way, you want something like:
from itertools import product
def find_nth(s, char, n):
    """
    Return the offset of the nth occurrence of char in s,
    or -1 on failure.
    """
    assert len(char) == 1
    offs = -1
    for _ in range(n):
        offs = s.find(char, offs + 1)
        if offs == -1:
            break
    return offs
def gen_replacements(base_string, *replacement_values):
    """
    Generate all string combinations from base_string
    by replacing some characters according to replacement_values.
    Each replacement_value is a tuple of
    (original_char, occurrence, replacement_char).
    """
    assert len(replacement_values) > 0
    # find the location of each character to be replaced
    replacement_offsets = [
        (find_nth(base_string, orig, occ), orig, occ, (orig, repl))
        for orig, occ, repl in replacement_values
    ]
    # put them in ascending order
    replacement_offsets.sort()
    # make sure all replacements are actually possible
    if replacement_offsets[0][0] == -1:
        raise ValueError("'{}' occurs less than {} times".format(
            replacement_offsets[0][1], replacement_offsets[0][2]))
    # create the format string and argument list
    args = []
    for i, (offs, _, _, arg) in enumerate(replacement_offsets):
        # we are replacing one char with two ("{}"), so we have to
        # increase the offset of each replacement by
        # the number of replacements already made
        base_string = base_string[:offs + i] + "{}" + base_string[offs + i + 1:]
        args.append(arg)
    # ... and we feed that into the original code from above:
    for combo in product(*args):
        yield base_string.format(*combo)
def main():
    s = "abcixigea"
    for result in gen_replacements(s, ("i", 1, "1"), ("e", 1, "3"), ("a", 2, "4")):
        print(result)

if __name__ == "__main__":
    main()
which produces exactly the same output as above.
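As a quick check of the error branch, a sketch reusing the functions above ("z" is just an arbitrary character that never occurs in the string):

try:
    # gen_replacements is a generator, so list() forces it to run
    list(gen_replacements("abcixigea", ("z", 1, "9")))
except ValueError as e:
    print(e)  # 'z' occurs less than 1 times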
Related
I want to generate a string from a dictionary of substrings: I input a string in which each character corresponds to a key of the dictionary, and it spits out a new string built from the values associated with those keys. However, I also want to minimise certain characters ending up next to each other.
For example:
dict = {'I': ['ATA', 'ATC', 'ATT'], 'M': ['ATG'], 'T': ['ACA', 'ACC', 'ACG', 'ACT'], 'N':['AAC', 'AAT'], 'K': ['AAA', 'AAG'], 'S': ['AGC', 'AGT'], 'R': ['AGA', 'AGG']}
input_str = "IIMTSTTKRI"
The output would be a string of the three-character substrings associated with each key. However, since there are many three-character substrings that could be used, I would like to minimise the number of G's and C's that end up next to one another.
I currently have this:
import itertools

n = []
# make a list of possible substrings for each character in the input string
for i in input_str:
    if i in dict.keys():
        n.append(dict[i])

# generate all combinations (Cartesian product)
p = [''.join(s) for s in itertools.product(*n)]

# if a combination has no consecutive "GC", add it to the list
ls = []
for i in p:
    q = i.count('GC')
    if q == 0:
        ls.append(i)
Which 'works', but there are a couple of problems. The first (minor) one is that I have to assume the number of consecutive "GC" pairs can be 0, and for some strings that may not be possible. The second (major) one is that it is extremely slow for longer strings, because it has to generate every combination.
Can anyone provide a way to improve the speed or an alternative way?
Based on your comments, you can look at the problem as an optimal path search (think of it as a graph where you must follow the path defined by input_str, and at each vertex you must choose from a list of predefined 3-character strings).
There are many search algorithms; my solution uses A*:
from heapq import heappop, heappush

dct = {
    "I": ["ATA", "ATC", "ATT"],
    "M": ["ATG"],
    "T": ["ACA", "ACC", "ACG", "ACT"],
    "N": ["AAC", "AAT"],
    "K": ["AAA", "AAG"],
    "S": ["AGC", "AGT"],
    "R": ["AGA", "AGG"],
}

input_str = "IIMTSTTKRI"

def valid_moves(s):
    key = input_str[len(s) // 3]
    for i in dct[key]:
        yield s + i

def distance(s):
    return len(input_str) - (len(s) // 3)

def my_cost_func(_from, _to):
    return _to.count("GC")

def a_star(start, moves_func, h_func, cost_func):
    """
    Find a shortest sequence of states from start to a goal state
    (a state s with h_func(s) == 0).
    """
    frontier = [
        (h_func(start), start)
    ]  # A priority queue, ordered by path length, f = g + h
    previous = {
        start: None
    }  # start state has no previous state; other states will
    path_cost = {start: 0}  # The cost of the best path to a state.
    Path = lambda s: ([] if (s is None) else Path(previous[s]) + [s])
    while frontier:
        (f, s) = heappop(frontier)
        if h_func(s) == 0:
            return Path(s)
        for s2 in moves_func(s):
            g = path_cost[s] + cost_func(s, s2)
            if s2 not in path_cost or g < path_cost[s2]:
                heappush(frontier, (g + h_func(s2), s2))
                path_cost[s2] = g
                previous[s2] = s

path = a_star("", valid_moves, distance, my_cost_func)
print("Result:", path[-1])
This prints:
Result: ATAATAATGACAAGTACAACAAAAAGAATAAGT
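As a quick sanity check, a sketch reusing the path computed above; the chosen sequence should contain no adjacent "GC":

print(path[-1].count("GC"))  # prints 0 for this input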
I want to design guide RNAs to find palindromic sequences in a FASTA file. I want to write a Python script that finds all the palindromic sequences of length 18 throughout my sequence. I have a logic in mind, but I don't know how to put it into Python. My logic is:
1) If i is [ATCG] and i+17 is [TAGC], then check:
2) if i+1 is [ATCG] and i+16 is [TAGC], then check:
3) if i+2 is [ATCG] and i+15 is [TAGC], then check:
...
10) if i+9 is [ATCG] and i+10 is [TAGC], and all the above are true,
then recognize the sequence from i to i+17 as palindromic. But I need to make sure that for an A at position i it considers only a T at i+17.
Any idea how I can write this logic in Python?
Thanks,
So you want to match A+T and G+C. We can use a dictionary for that. Then we just check if opposite sides are pairs.
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
for i in range(len(sequence) - 18 + 1):
pal = True
for j in range(9):
if pairs[ sequence[i+j] ] != sequence[i+17-j]:
pal = False
break
if pal:
print(sequence[i : i+18])
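To try it out, you could define sequence before running the loop, e.g. with a made-up test string (nine A's followed by their reverse complement, nine T's, plus some padding):

sequence = "AAAAAAAAATTTTTTTTTGGG"  # hypothetical test input
# the loop above then prints: AAAAAAAAATTTTTTTTT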
For any n-length palindrome (including odd n):
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
n=18
for i in range(len(sequence) - n + 1):
pal = True
for j in range(n//2):
if pairs[ sequence[i+j] ] != sequence[i-j+n-1]:
pal = False
break
if pal:
print(sequence[i : i+n])
Looping through strings character by character takes too much time; Python's built-in string handling is far more efficient.
#create random test sequence
import random
random.seed(1234)
seq = "".join(random.choices(["A", "T", "C", "G"], k=99))
n = 4 #not exactly 18 but good enough as a test case
print(seq)
>>>GTAGGCCAGAAGTCCAAAATGACTCACTCCTTAGTCACAATTACACAGGGATATGAAGAGATTTGTGTGGTGGTAATACGTGCCTCGAGTAGCGTATAT
# dictionary for base-pair translation
bp = {"A": "T", "T": "A", "G": "C", "C": "G"}

# checks whether the first half translates into the reversed second half;
# returns False if not, e.g. if the length ls of s is not an even number
def palin(s):
    ls = len(s)
    if ls % 2:
        return False
    return s[:ls//2] == "".join([bp[i] for i in s[ls:ls//2-1:-1]])
#now to the actual test, checking all substrings of length n in our test sequence seq
#returns tuples of the index within seq and the found substring
res = [(i, seq[i:i+n]) for i in range(len(seq)-n+1) if palin(seq[i:i+n])]
print(res)
>>>[(3, 'GGCC'), (38, 'AATT'), (50, 'ATAT'), (77, 'ACGT'), (84, 'TCGA'), (94, 'TATA'), (95, 'ATAT')]
I'm trying to write a program that would find the longest intersection between two strings. The conditions are:
If there is no common character, the program returns an empty string.
If there are several common substrings of the same length, it should return whichever is the largest; for example, for "bbaacc" and "aabb" the repeated substrings are "aa" and "bb", and since "bb" > "aa" the program must return only "bb".
Finally, the program should return the longest common substring; for instance, for "programme" and "grammaire" it should return "gramm", not "gramme".
My code has a problem with this last condition, how could I change it so it works as expected?
def intersection(v, w):
    if not v or not w:
        return ""
    x, xs, y, ys = v[0], v[1:], w[0], w[1:]
    if x == y:
        return x + intersection(xs, ys)
    else:
        return max(intersection(v, ys), intersection(xs, w), key=len)
Driver:
print(intersection('programme', 'grammaire'))
Can't find the issue with your code, but I solved it like this:
def longest_str_intersection(a: str, b: str):
    # identify all possible character sequences from str a
    seqs = []
    for pos1 in range(len(a)):
        for pos2 in range(len(a)):
            seqs.append(a[pos1:pos2+1])

    # remove empty sequences
    seqs = [seq for seq in seqs if seq != '']

    # find segments in str b
    max_len_match = 0
    max_match_sequence = ''
    for seq in seqs:
        if seq in b:
            if len(seq) > max_len_match:
                max_len_match = len(seq)
                max_match_sequence = seq

    return max_match_sequence
longest_str_intersection('programme', 'grammaire')
-> 'gramm'
also interested to see if you found a more elegant solution!
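If you want something more systematic than generating every substring, here is a common dynamic-programming sketch (not from either post above; note that ties are broken by first occurrence, not by the "largest" rule from the question):

def longest_common_substring(a: str, b: str) -> str:
    # curr[j] / prev[j]: length of the common substring ending at a[i-1] and b[j-1]
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

print(longest_common_substring('programme', 'grammaire'))  # gramm

This runs in O(len(a) * len(b)) time instead of checking every substring of one string against the other.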
I'm trying to write a function that takes a string and an integer and removes adjacent duplicates beyond that number, outputting the remaining string. I have this function right now that removes all the duplicates in a string, and I'm not sure how to add the integer constraint to it:
def remove_duplicates(string):
    s = set()
    list = []
    for i in string:
        if i not in s:
            s.add(i)
            list.append(i)
    return ''.join(list)
string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abc
What I would want is a function like
def remove_duplicates(string, int):
.....
Where, if I pass int=2 for the same string, only the duplicates beyond the first two of each run are removed rather than all of them. The output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!
Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
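To see what feeds into that expression, a quick sketch of the groups for the same string:

from itertools import groupby
print([(k, len(list(g))) for k, g in groupby("abbbccaaadddd")])
# [('a', 1), ('b', 3), ('c', 2), ('a', 3), ('d', 4)]

With n = 2 this keeps 1, 1, 2, 1 and 2 characters respectively, giving 'abccadd' as above.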
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'
Create groups of letters, but cap the length of each group at your parameter.
Then rebuild the groups and join:
import itertools

def remove_duplicates(string, maxnb):
    groups = ((k, min(len(list(v)), maxnb)) for k, v in itertools.groupby(string))
    return "".join(itertools.chain.from_iterable(v*k for k, v in groups))

string = "abbbccaaadddd"
print(remove_duplicates(string, 2))
this prints:
abbccaadd
can be a one-liner as well (cover your eyes!)
return "".join(itertools.chain.from_iterable(v*k for k,v in ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))))
Not sure about the min(len(list(v)), maxnb) repeat count; it can be adapted to suit your needs with a modulo (like len(list(v)) % maxnb), etc., as sketched below.
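For instance, a sketch of that modulo variant (it reproduces the 'abccadd' output from the other answer's first interpretation):

import itertools
string, maxnb = "abbbccaaadddd", 2
# keep len % maxnb repetitions of each run, or maxnb when that remainder is 0
groups = ((k, len(list(v)) % maxnb or maxnb) for k, v in itertools.groupby(string))
print("".join(itertools.chain.from_iterable(v * k for k, v in groups)))  # abccadd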
You should avoid using int as a parameter name, as it shadows the built-in Python type.
Here is a vanilla function that does the job:
def deduplicate(string: str, threshold: int) -> str:
    res = ""
    last = ""
    count = 0
    for c in string:
        if c != last:
            # start a new run; the first occurrence counts toward the threshold
            count = 1
            res += c
            last = c
        else:
            # keep at most `threshold` copies of each character in a run
            if count < threshold:
                res += c
                count += 1
    return res
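A quick usage check against the example from the question:

print(deduplicate("abbbccaaadddd", 2))  # abbccaadd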
I am trying to find all occurrences of sub-strings in a main string (of all lengths). My function takes one string and then returns a dictionary of every sub-string (which occurs more than once, of course) and how many times it occurs (format of the dictionary: {substring: # of occurrences, ...}). I am using collections.Counter(s) to help me with it.
Here is my function:
from collections import Counter

def patternFind(s):
    patterns = {}
    for index in range(1, len(s)+1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

def nChunks(iterable, start=0, step=1):
    return [iterable[i:i+step] for i in range(start, len(iterable), step)]
I have a string, data, with about 2500 random letters (in a random order), into which two copies of a string are inserted at random points. Say this string is 'TEST'. data.count('TEST') returns 2, but patternFind(data)['TEST'] gives me a KeyError, so my program does not detect the two inserted strings.
What have I done wrong? Thanks!
Edit: My method of creating testing-instances:
from random import randint, choice
from string import ascii_uppercase as uppercase  # string.uppercase in Python 2

def createNewTest():
    n = randint(500, 2500)
    x, y = randint(500, n), randint(500, n)
    s = ''
    for i in range(n):
        s += choice(uppercase)
        if i == x or i == y: s += "TEST"
    return s
Using Regular Expressions
Apart from the count() method you described, regex is an obvious alternative
import re

needle = r'TEST'
haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'

pattern = re.compile(needle)
print(len(re.findall(pattern, haystack)))
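One caveat (not specific to this example): re.findall counts non-overlapping matches only. If overlapping occurrences matter, a zero-width lookahead is a common workaround, e.g.:

# each match is an empty string at a position where the needle starts,
# so the length of the list counts overlapping occurrences too
print(len(re.findall(r'(?=TEST)', haystack)))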
Short Cut
If you need to build a dictionary of substrings, you can possibly do this with only a subset of those strings. Assuming you know the needle you are looking for in the data, you only need the dictionary of substrings of data that have the same length as the needle. This is very fast.
from collections import Counter

needle = "TEST"

def gen_sub(s, len_chunk):
    for start in range(0, len(s)-len_chunk+1):
        yield s[start:start+len_chunk]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter([sub for sub in gen_sub(data, len(needle))])
print(parts[needle])
Brute Force: building dictionary of all substrings
If you need to have a count of all possible substrings, this works but it is very slow:
from collections import Counter

def gen_sub(s):
    for start in range(0, len(s)):
        for end in range(start+1, len(s)+1):
            yield s[start:end]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter([sub for sub in gen_sub(data)])
print(parts['TEST'])
Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420
While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster or more):
from collections import defaultdict

def pfind(prefix, sequences):
    collector = defaultdict(list)
    for sequence in sequences:
        collector[sequence[0]].append(sequence)
    for item, matching_sequences in collector.items():
        if len(matching_sequences) >= 2:
            new_prefix = prefix + item
            yield (new_prefix, len(matching_sequences))
            for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
                yield r

def find_repeated_substrings(s):
    s0 = s + " "
    return pfind("", [s0[i:] for i in range(len(s))])
If you want a dict, you call it like this:
result = dict(find_repeated_substrings(s))
On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.
(Note that this is a rather naive implementation; using indexes instead of substrings should be even faster.)
Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.
from collections import defaultdict

def pfind(s, length, ends):
    collector = defaultdict(list)
    if ends[-1] >= len(s):
        del ends[-1]
    for end in ends:
        if end < len(s):
            collector[s[end]].append(end)
    for key, matching_ends in collector.items():
        if len(matching_ends) >= 2:
            end = matching_ends[0]
            yield (s[end - length: end + 1], len(matching_ends))
            for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
                yield r

def find_repeated_substrings(s):
    return pfind(s, 0, list(range(len(s))))
This still has the problem that very long substrings will exceed recursion depth. You might want to catch the exception.
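A minimal sketch of that, assuming Python 3 (where exceeding the limit raises RecursionError) and that s is the input sequence:

try:
    result = dict(find_repeated_substrings(s))
except RecursionError:
    # very long repeated substrings exceeded the recursion limit;
    # either fall back to another approach or increase the limit
    # with sys.setrecursionlimit() and retry
    result = {}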
The problem is in your nChunks function. It does not give you all the chunks that are necessary.
Let's consider a test string:
s='1test2345test'
For the chunks of size 4 your nChunks function gives this output:
>>>nChunks(s, step=4)
['1tes', 't234', '5tes', 't']
But what you really want is:
>>>def nChunks(iterable, start=0, step=1):
return [iterable[i:i+step] for i in range(len(iterable)-step+1)]
>>>nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']
You can see that this way there are two 'test' chunks and your patternFind(s) will work like a charm:
>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
Here you can find a solution that uses a recursive wrapper around str.find() to search for all the occurrences of a substring in the main string.
The collectallchuncks() function returns a defaultdict with all the substrings as keys and, for each substring, a list of all the indexes where it is found in the main string.
import collections

# Minimum substring size, may be 1
MINSIZE = 3

# Recursive wrapper
def recfind(p, data, pos, acc):
    res = data.find(p, pos)
    if res == -1:
        return acc
    else:
        acc.append(res)
        return recfind(p, data, res+1, acc)

def collectallchuncks(data):
    res = collections.defaultdict(list)
    size = len(data)
    for base in range(size):
        for seg in range(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            if data.count(chunk) > 1:
                res[chunk] = recfind(chunk, data, 0, [])
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks(data)
    print('TEST', allchuncks['TEST'])
    print('hklag', allchuncks['hklag'])
EDIT: If you just need the number of occurrences of each substring in the main string, you can easily obtain it by getting rid of the recursive function:
import collections

MINSIZE = 3

def collectallchuncks2(data):
    res = collections.defaultdict(int)
    size = len(data)
    for base in range(size):
        for seg in range(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            cnt = data.count(chunk)
            if cnt > 1:
                res[chunk] = cnt
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks2(data)
    print('TEST', allchuncks['TEST'])
    print('hklag', allchuncks['hklag'])