Finding motif in a sequence of characters


I also have a dictionary in which the keys are ids and the values are long sequences made up not only of K and M but also of other characters that are not important to me.
li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
      'id2': "MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}
I want to find "KMKMKM"-style motifs with a length of at least 6. The length can be even or odd, as long as it is 6 or more. The result should be a dictionary with the same keys, but each value should be the list of motifs instead of the whole sequence, as in the following example.
results = {'id1': ["KMKMKMK"], 'id2': ["KMKMKMKMKM", "KMKMKM"]}
I wrote this code, but it does not return the motifs I am interested in.
{k: re.findall(r'(?:KM){6,1000}', v) for k, v in li.items()}

This one does the job:
((?:KM){3,}K?)
Explanation:
( : group 1
(?:KM){3,} : non capture group, 3 or more times KM
K? : optional K
) : end group 1
In action:
import re
li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
'id2':"MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}
res = {k: re.findall(r'((?:KM){3,}K?)', v) for k, v in li.items()}
print(res)
Output:
{'id2': ['KMKMKMKMKM', 'KMKMKM'], 'id1': ['KMKMKMK']}
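To see why the pattern from the question returns nothing, it may help to run both quantifiers side by side. The sketch below reuses the same li dictionary; find_motifs is a hypothetical helper name introduced only for this illustration. The point is that (?:KM){6,1000} asks for at least six repetitions of the two-character unit, i.e. at least 12 characters, whereas {3,} asks for at least 6 characters.

```python
import re

li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
      'id2': "MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}

def find_motifs(pattern, sequences):
    """Return {id: [matches]} for the given regex pattern (hypothetical helper)."""
    return {k: re.findall(pattern, v) for k, v in sequences.items()}

# Six repetitions of "KM" means 12+ characters: no run in the data is that long.
too_strict = find_motifs(r'(?:KM){6,1000}', li)

# Three repetitions of "KM" means 6+ characters, plus an optional trailing K.
wanted = find_motifs(r'(?:KM){3,}K?', li)

print(too_strict)
print(wanted)
```

Without a capturing group, re.findall returns the whole match, so the outer parentheses of the answer's pattern are optional here.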

Is this what you are looking for?
import re
stringA = "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM"
motifs = "KMKMKM"
m = re.search(motifs, stringA)
if m:
    print(motifs)
In reply to your comment below:
stringA = "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM"
motifs = "KMKMKM"
i = 0
while True:
    seq = stringA[i:]
    i = i + 1
    if seq.startswith(motifs):
        print(seq)
    if len(stringA) == i:
        break

Related

counting specific dictionary entries in python

I have a Python dictionary and I want to count the number of keys that have a specific format.
The keys that I want to count are all the keys that have the format ‘letter, number, number’.
In my specific case the key always begins with the letter ‘A’; only the numbers change.
Example: A12, A16, A71
For example, I want to count all the entries that have this AXX format (where the X’s are digits).
{'A34': 83, 'B32': 70, 'A44': 66, 'A12': 47, 'B90': 71}
I know I can count all the entries of my dictionary by using:
print(len(my_dict.keys()))
but how do I count all the entries that have the specific format I need?

You can use a generator comprehension inside the sum function:
print(sum(1 for k in d.keys() if k.startswith('A') and len(k) == 3 and k[1:3].isdigit()))
This does three checks: whether the key starts with A, whether the length of the key is 3, and whether the last two characters of the key are digits.
You can also use Regex:
import re
print(sum(1 for k in d.keys() if re.match('^A\\d{2}$', k)))
Both snippets output 3.

You can try a list comprehension.
len([key for key in my_dict if 'A' in key])
For your specific condition, you can try the following; if you need to be more specific, write a regex in the if clause.
len([key for key in my_dict if key.startswith('A') and len(key) == 3])
Should work!
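If the check needs to be stricter than startswith plus a length test, the regex can go directly in the if clause. This is a minimal sketch, assuming the sample dictionary from the question plus one made-up key ('AB7') added only to show the difference; re.fullmatch is used so that no ^ or $ anchors are needed.

```python
import re

# Sample dictionary from the question, plus one hypothetical key ('AB7')
# that starts with 'A' and has length 3 but is not letter-digit-digit.
my_dict = {'A34': 83, 'B32': 70, 'A44': 66, 'A12': 47, 'B90': 71, 'AB7': 5}

pattern = re.compile(r'A\d{2}')

# fullmatch succeeds only if the whole key matches the pattern.
count = len([key for key in my_dict if pattern.fullmatch(key)])

# The looser check also counts 'AB7'.
loose_count = len([key for key in my_dict
                   if key.startswith('A') and len(key) == 3])

print(count)        # 3
print(loose_count)  # 4
```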

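import re
my_dict = { ... }
# filter() returns an iterator in Python 3, so build a list before calling len();
# the trailing $ rejects keys with extra characters after the two digits
filtered = list(filter(lambda k: re.match(r"^[A-Z][0-9]{2}$", k), my_dict.keys()))
print(len(filtered))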

Go through all possibilities and check?
result = sum(f'A{i:02}' in my_dict for i in range(100))
Benchmark along with the solutions from the accepted answer, with a dict like you described your real one ("about 5000 items" and "all my A values have 2 digits. However other values that I have such as B and C values will have 3 digits."):
41.5 μs sum(f'A{i:02}' in my_dict for i in range(100))
573.5 μs sum(1 for k in my_dict.keys() if k.startswith('A') and len(k) == 3 and k[1:3].isdigit())
3546.0 μs sum(1 for k in my_dict.keys() if re.match('^A\d{2}$', k))
Benchmark code:
from timeit import repeat
setup = '''
from random import sample
from string import ascii_uppercase as letters
import re
A = [f'A{i:02}' for i in range(100)]
B2Z = [f'{letter}{i}' for letter in letters for i in range(10, 1000)]
A2Z = sample(A + sample(B2Z, 4900), 5000)
my_dict = dict.fromkeys(A2Z)
'''
E = [
"sum(f'A{i:02}' in my_dict for i in range(100))",
"sum(1 for k in my_dict.keys() if k.startswith('A') and len(k) == 3 and k[1:3].isdigit())",
"sum(1 for k in my_dict.keys() if re.match('^A\\d{2}$', k))",
]
for _ in range(3):
    for e in E:
        number = 10
        t = min(repeat(e, setup, number=number)) / number
        print('%6.1f μs ' % (t * 1e6), e)
    print()

create all possible combinations with multiple variants from list

OK, so the problem is as follows:
Let's say I have a list like this: [12R,102A,102L,250L]. What I want is a list of all possible combinations, but with only one variant per number. So for the example above, the output I would like is:
[12R,102A,250L]
[12R,102L,250L]
My actual problem is a lot more complex, with many more sites. Thanks for your help.
Edit: after reading some comments, I guess this is slightly unclear. I have 3 unique numbers here, [12, 102, and 250], and for some numbers I have different variations, for example [102A, 102L]. What I need is a way to combine the different positions [12,102,250] and all possible variations within them, just like the lists I presented above. They are the only valid solutions. [12R] is not, and neither is [12R,102A,102L,250L]. So far I have done this with nested loops, but I have a LOT of variation within these numbers, so I can't really do that anymore.
I'll edit this again: OK, it seems there is still some confusion, so I will expand on the point I made before. What I am dealing with here is DNA. 12R means the 12th position in the sequence was changed to an R. So the solution [12R,102A,250L] means that the amino acid at position 12 is R, at 102 is A, and at 250 is L.
This is why a solution like [102L, 102R, 250L] is not usable: the same position cannot be occupied by 2 different amino acids.
Thank you

You can use a recursive generator function:
from itertools import groupby as gb
import re
def combos(d, c = []):
    if not d:
        yield c
    else:
        for a, b in d[0]:
            yield from combos(d[1:], c + [a+b])
d = ['12R', '102A', '102L', '250L']
# split each entry into [position, variant]; raw string avoids invalid-escape warnings
vals = [re.findall(r'^\d+|\w+$', i) for i in d]
new_d = [list(b) for _, b in gb(sorted(vals, key=lambda x:x[0]), key=lambda x:x[0])]
print(list(combos(new_d)))
Output:
[['102A', '12R', '250L'], ['102L', '12R', '250L']]

So it works with ["10A","100B","12C","100R"] (case 1) and ['12R','102A','102L','250L'] (case 2)
import itertools as it
liste = ['12R','102A','102L','250L']
comb = []
for e in it.combinations(range(4), 3):
    e1 = liste[e[0]][:-1]
    e2 = liste[e[1]][:-1]
    e3 = liste[e[2]][:-1]
    if e1 != e2 and e2 != e3 and e3 != e1:
        comb.append([e1+liste[e[0]][-1], e2+liste[e[1]][-1], e3+liste[e[2]][-1]])
print(list(comb))
# case 1 : [['10A', '100B', '12C'], ['10A', '12C', '100R']]
# case 2 : [['12R', '102A', '250L'], ['12R', '102L', '250L']]

import re
def get_grouped_options(input):
    options = {}
    for option in input:
        m = re.match(r'([\d]+)([A-Z])$', option)
        if m:
            position = int(m.group(1))
            acid = m.group(2)
        else:
            continue
        if position not in options:
            options[position] = []
        options[position].append(acid)
    return options
def yield_all_combos(options):
    n = len(options)
    positions = list(options.keys())
    indices = [0] * n
    while True:
        yield ["{}{}".format(position, options[position][indices[i]])
               for i, position in enumerate(positions)]
        j = 0
        indices[j] += 1
        while indices[j] == len(options[positions[j]]):
            # carry
            indices[j] = 0
            j += 1
            if j == n:
                # overflow
                return
            indices[j] += 1
input = ['12R', '102A', '102L', '250L']
options = get_grouped_options(input)
for combo in yield_all_combos(options):
    print("[{}]".format(",".join(combo)))
Gives:
[12R,102A,250L]
[12R,102L,250L]

Try this:
from itertools import groupby
import re
def __genComb(arr, res=None):
    # res is extended in place below, so avoid a shared mutable default argument
    if res is None:
        res = []
    for i in range(len(res), len(arr)):
        el=arr[i]
        if(len(el[1])==1):
            res+=el[1]
        else:
            for el_2 in el[1]:
                yield from __genComb(arr, res+[el_2])
            break
    if(len(res)==len(arr)): yield res
def genComb(arr):
    res=[(k, list(v)) for k,v in groupby(sorted(arr), key=lambda x: re.match(r"(\d*)", x).group(1))]
    yield from __genComb(res)
Sample output (using the input you provided):
test=["12R","102A","102L","250L"]
for el in genComb(test):
    print(el)
# returns:
['102A', '12R', '250L']
['102L', '12R', '250L']

I believe this is what you're looking for!
This works by:
1. generating a collection of all the postfixes each prefix can have
2. finding the total count of positions (multiply the length of each sublist together)
3. rotating through each postfix by basing the read index on both its member postfix position in the collection and the absolute result index (known location in final results)
import collections
import functools
import operator
import re
# initial input
starting_values = ["12R","102A","102L","250L"]
d = collections.defaultdict(list)  # use a set if duplicates are possible
for value in starting_values:
    numeric, postfix = re.match(r"(\d+)(.*)", value).groups()
    d[numeric].append(postfix)  # .* matches ""; consider (postfix or "_") to give value a size
# d is now a dictionary of lists where each key is the prefix
# and each value is a list of possible postfixes
# each set of postfixes multiplies the total combinations by its length
total_combinations = functools.reduce(
    operator.mul,
    (len(sublist) for sublist in d.values())
)
results = collections.defaultdict(list)
for results_pos in range(total_combinations):
    for index, (prefix, postfix_set) in enumerate(d.items()):
        results[results_pos].append(
            "{}{}".format(  # recombine the values
                prefix,  # numeric prefix
                postfix_set[(results_pos + index) % len(postfix_set)]
        ))
# results is now a dictionary mapping { result index: unique list }
Displaying:
# set width of column by longest prefix string
# need a collection for intermediate cols, but beyond scope of Q
col_width = max(len(str(k)) for k in results)
for k, v in results.items():
    print("{:<{w}}: {}".format(k, v, w=col_width))
0: ['12R', '102L', '250L']
1: ['12R', '102A', '250L']
with a more advanced input
["12R","102A","102L","250L","1234","1234A","1234C"]
0: ['12R', '102L', '250L', '1234']
1: ['12R', '102A', '250L', '1234A']
2: ['12R', '102L', '250L', '1234C']
3: ['12R', '102A', '250L', '1234']
4: ['12R', '102L', '250L', '1234A']
5: ['12R', '102A', '250L', '1234C']
You can confirm the values are indeed unique with a set
final = set(",".join(x) for x in results.values())
for f in final:
    print(f)
12R,102L,250L,1234
12R,102A,250L,1234A
12R,102L,250L,1234C
12R,102A,250L,1234
12R,102L,250L,1234A
12R,102A,250L,1234C
Notes:
- in CPython, regexes are cached after their first compile
- list member multiplier from "How can I multiply all items in a list together with Python?"
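One caveat about the rotation indexing above: (results_pos + index) % len(postfix_set) visits every combination exactly once only when the group sizes are pairwise coprime (2 and 3 in the larger example). With two positions that each have two variants, it would repeat results. As an alternative sketch, not the method used in this answer, itertools.product enumerates the Cartesian product directly. The names group_by_position and all_combinations are hypothetical, and the code assumes every value starts with a number.

```python
import itertools
import re
from collections import defaultdict

def group_by_position(values):
    """Map each numeric prefix to the list of full strings that share it."""
    groups = defaultdict(list)
    for value in values:
        prefix = re.match(r"\d+", value).group(0)  # assumes a leading number
        groups[prefix].append(value)
    return groups

def all_combinations(values):
    """Pick exactly one variant per position, in first-seen position order."""
    groups = group_by_position(values)
    return [list(combo) for combo in itertools.product(*groups.values())]

combos = all_combinations(["12R", "102A", "102L", "250L"])
print(combos)
# [['12R', '102A', '250L'], ['12R', '102L', '250L']]

# Two positions with two variants each give four distinct combinations.
square = all_combinations(["1A", "1B", "2C", "2D"])
print(len(square))  # 4
```

For very large inputs, iterating over itertools.product lazily instead of building the full list keeps memory use flat.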

Count occurrences of an item in a list and store it in another list if it exists more than once

Let's say I have the following list.
my_list = ['4/10', '8/-', '9/2', '4/11', '-/13', '19/10', '25/-', '26/-', '4/12', '10/16']
I would like to check the occurrence of each item, and if it exists more than once I would like to store it in a new list.
For example, in the above list, 4 appears 3 times before the / as 4/10, 4/11, 4/12. So I would like to create a new list called new_list and store them as new_list = ['4/10', '4/11', '4/12', '19/10'].
As an additional example, I also want to take the / into account: if 10 appears twice as 4/10 and 10/16, I don't want to consider it a duplicate, since its position relative to the / is different.
Is there any way to count the occurrences of an item in a list and store them in a new list?
I tried the following but got an error.
new_list = []
d = Counter(my_list)
for v in d.items():
    if v > 1:
        new_list.append(v)
The error: TypeError: '>' not supported between instances of 'tuple' and 'int'
Can anyone help with this?

I think the code below is quite self-explanatory. It should work fine. If you have any issues or need clarification, feel free to ask.
NOTE: This code is not very efficient and can be improved a lot, but it will work all right as long as you are not running it on extremely large data.
my_list = ['4/10', '8/-', '9/2', '4/11', '-/13', '19/10', '25/-', '26/-', '4/12', '10/16']
frequency = {}
new_list = []
for string in my_list:
    x = ''
    for j in string:
        if j == '/':
            break
        x += j
    if x.isdigit():
        frequency[x] = frequency.get(x, 0) + 1
for string in my_list:
    x = ''
    for j in string:
        if j == '/':
            break
        x += j
    if x.isdigit():
        if frequency[x] > 1:
            new_list.append(string)
print(new_list)

.items() is not what you think: it returns key-value pairs (tuples), not just the values. You want:
from collections import Counter
d = Counter(node)
new_list = [ k for (k,v) in d.items() if v > 1 ]
Besides, I am not sure how node is related to my_list, but I think there is some additional processing you didn't show.
Update: after reading your comment clarifying the problem, I think it requires two separate counters:
first_parts = Counter([x.split('/')[0] for x in my_list])
second_parts = Counter([x.split('/')[1] for x in my_list])
first_duplicates = { k for (k,v) in first_parts.items() if v > 1 and k != '-' }
second_duplicates = { k for (k,v) in second_parts.items() if v > 1 and k != '-' }
new_list = [ e for e in my_list if
             e.split('/')[0] in first_duplicates or e.split('/')[1] in second_duplicates ]

This might help: create a dict to contain the pairings and then extract the pairings that have a length of more than one. defaultdict helps with aggregating data based on the common keys.
from collections import defaultdict
d = defaultdict(list)
e = defaultdict(list)
m = [ent for ent in my_list if '-' not in ent]
for ent in m:
    front, back = ent.split('/')
    d[front].append(ent)
    e[back].append(ent)
new_list = []
for k,v in d.items():
    if len(v) > 1:
        new_list.extend(v)
for k,v in e.items():
    if len(v) > 1:
        new_list.extend(v)
sortr = lambda x: [int(ent) for ent in x.split("/")]
# sorted() returns a new list, so assign it back; set() drops entries found by both passes
new_list = sorted(set(new_list), key = sortr)
print(new_list)
['4/10', '4/11', '4/12', '19/10']

Remove adjacent duplicates given a condition

I'm trying to write a function that will take a string and an integer, remove any adjacent duplicates in excess of that integer, and output the remaining string. I have this function right now that removes all the duplicates in a string, and I'm not sure how to put the integer constraint into it:
def remove_duplicates(string):
    s = set()
    list = []
    for i in string:
        if i not in s:
            s.add(i)
            list.append(i)
    return ''.join(list)
string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abc
What I would want is a function like
def remove_duplicates(string, int):
    .....
where, if I input int=2 for the same string, each run of adjacent duplicates is cut down to at most 2 characters instead of all duplicates being removed. The output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!

Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'

Create groups of letters, but compute the length of each group, capped by your parameter.
Then rebuild the groups and join:
import itertools
def remove_duplicates(string,maxnb):
    groups = ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))
    return "".join(itertools.chain.from_iterable(v*k for k,v in groups))
string = "abbbccaaadddd"
print(remove_duplicates(string,2))
This prints:
abbccaadd
It can be a one-liner as well (cover your eyes!):
return "".join(itertools.chain.from_iterable(v*k for k,v in ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))))
I'm not sure about the min(len(list(v)),maxnb) repeat value; it can be adapted to suit your needs with a modulo (like len(list(v)) % maxnb), etc.

You should avoid using int as a variable name, as it shadows Python's built-in int type.
Here is a vanilla function that does the job:
def deduplicate(string: str, threshold: int) -> str:
    res = ""
    last = ""
    count = 0
    for c in string:
        if c != last:
            count = 1  # c itself is the first copy in the new run
            res += c
            last = c
        else:
            if count < threshold:
                res += c
                count += 1
    return res

Find values in list which differ from reference list by up to N characters

I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
And a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
I want to extract the values from Test if they differ by N or fewer characters from any one of the items in Ref.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements fit this criterion and should be returned.
It should be noted that I am looking for values of the same character length (ASDFGY -> ASDFG matching doesn't work for N = 1), so I want something more efficient than Levenshtein distance.
I have over 1000 values in Ref and a couple hundred million in Test, so efficiency is key.

Using a generator expression with sum:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
def comparer(x, y, n):
    return (len(x) == len(y)) and (sum(i != j for i, j in zip(x, y)) <= n)
res = [a for a, b in zip(Ref, Test) if comparer(a, b, 1)]
print(res)
['ASDFGY', 'QWERTYI']
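The snippet above compares the two lists pairwise and returns the values from Ref. To keep a value from Test when it is within N substitutions of any same-length entry in Ref, as the question describes, a sketch could group Ref by length first. The names hamming_within and close_to_any are hypothetical, and this brute-force version is only meant to illustrate the logic; it is not tuned for hundreds of millions of values.

```python
from collections import defaultdict

def hamming_within(x, y, n):
    """True if equal-length strings x and y differ in at most n positions."""
    mismatches = 0
    for a, b in zip(x, y):
        if a != b:
            mismatches += 1
            if mismatches > n:  # stop early once the budget is exceeded
                return False
    return True

def close_to_any(test, ref, n):
    """Return items of test within n substitutions of any same-length ref item."""
    by_length = defaultdict(list)
    for r in ref:
        by_length[len(r)].append(r)
    return [t for t in test
            if any(hamming_within(t, r, n) for r in by_length.get(len(t), []))]

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
print(close_to_any(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
print(close_to_any(Test, Ref, 2))  # ['ASDFGH', 'QWERTYU', 'ZXCVB']
```

Grouping by length means each Test value is only compared against Ref entries it could possibly match, and the early exit keeps each comparison short when strings differ a lot.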

Using difflib
Demo:
import difflib
N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for i,v in zip(Test, Ref):
    c = 0
    for j,s in enumerate(difflib.ndiff(i, v)):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append( i )
print(result)
Output:
['ASDFGH', 'QWERTYU']

The third-party regex module offers a "fuzzy" match possibility:
import regex as re
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
for item in Test:
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This yields
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
You can control it via the {s<=3} part, which allows three or fewer substitutions.
To have pairs, you could write
pairs = [(origin, difference)
         for origin in Test
         for rx in [re.compile(rf"({origin}){{s<=3}}")]
         for difference in Ref
         if rx.search(difference)]
Which would yield for
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
('ZXCVB', 'ZXCAA')]
