Identify patterns within list of words with pattern threshold - python

I'm working on a pattern-recognition function in Python that is supposed to return an array of patterns, each with a count.
Let's imagine a list of strings:
m = ['ABA','ABB', 'ABC','BCA','BCB','BCC','ABBC', 'ABBA', 'ABBC']
At a high level, what I would like to get back is:
Pattern | Count
----------------
AB | 6
ABB | 4
BC | 3
----------------
The problem: all I know is that patterns are at least 2 characters long and are the leading characters of each string (i.e. XXZZZ, XXXZZZ, where XX is the pattern I'm looking for). I would like to pass the minimum pattern length as a parameter of the function to optimize the run time.
PS. Each item in the list is already a single word.
My problem is that I need to iterate over each letter starting from the threshold, and I'm getting stuck there.
I'd prefer to use startswith('AB').

First, let's define your list of strings:
>>> m = ['ABA','ABB', 'ABC','BCA','BCB','BCC','ABBC', 'ABBA', 'ABBC']
Now, let's get a count of all leading strings of length 2 or 3:
>>> from collections import Counter
>>> c = Counter([s[:2] for s in m] + [s[:3] for s in m if len(s)>=3])
To compare with your table, here are the three most common leading strings:
>>> c.most_common(3)
[('AB', 6), ('ABB', 4), ('BC', 3)]
Update
To include all leading strings up to length len(max(m, key=len))-1:
>>> n = len(max(m, key=len))
>>> c = Counter(s[:i] for s in m for i in range(2, min(n, 1+len(s))))
Additional Test
To demonstrate that we are working correctly with longer strings, let's consider different input:
>>> m = ['ab', 'abc', 'abcdef']
>>> n = len(max(m, key=len))
>>> c = Counter(s[:i] for s in m for i in range(2, min(n, 1+len(s))))
>>> c.most_common()
[('ab', 3), ('abc', 2), ('abcd', 1), ('abcde', 1)]
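Since the question asks for the minimum length as a function parameter, the one-liner above could be wrapped roughly as follows. This is only a sketch: the name prefix_counts and the max_len parameter are made up for illustration, and the default upper bound simply mirrors the answer above (one short of the longest word).

```python
from collections import Counter

def prefix_counts(words, min_len=2, max_len=None):
    """Count every leading substring whose length lies in [min_len, max_len]."""
    if max_len is None:
        # Assumption: mirror the answer above and stop one short of the longest word.
        max_len = max(map(len, words), default=0) - 1
    counts = Counter()
    for word in words:
        # Never slice past the end of the current word.
        for length in range(min_len, min(max_len, len(word)) + 1):
            counts[word[:length]] += 1
    return counts

m = ['ABA', 'ABB', 'ABC', 'BCA', 'BCB', 'BCC', 'ABBC', 'ABBA', 'ABBC']
print(prefix_counts(m).most_common(3))
```

Raising min_len skips the short prefixes entirely, which is where most of the work goes for long lists.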

Using collections.Counter:
import collections
counter = collections.Counter()
min_length = 2
max_length = len(max(m, key=len))
# range() excludes max_length, so prefixes run from min_length to max_length - 1
for length in range(min_length, max_length):
    counter.update(word[:length] for word in m if len(word) >= length)

You can use the function accumulate() to generate accumulated strings and the function islice() to get the strings with a minimal length:
from itertools import accumulate, islice
from collections import Counter
m = ['ABA','ABB', 'ABC','BCA','BCB','BCC','ABBC', 'ABBA', 'ABBC']
c = Counter()
for i in map(accumulate, m):
    c.update(islice(i, 1, None)) # get strings with a minimal length of 2
print(c.most_common(3))
# [('AB', 6), ('ABB', 4), ('BC', 3)]
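The hard-coded 1 passed to islice() is what fixes the minimum length at 2. A possible way to expose it as a parameter is sketched below; leading_counts and min_length are illustrative names, not part of the answer. Note that, like the answer, this also counts each full word as a prefix of itself.

```python
from collections import Counter
from itertools import accumulate, islice

def leading_counts(words, min_length=2):
    counts = Counter()
    for word in words:
        # accumulate('ABC') yields 'A', 'AB', 'ABC';
        # skipping min_length - 1 items drops the prefixes that are too short.
        counts.update(islice(accumulate(word), min_length - 1, None))
    return counts

m = ['ABA', 'ABB', 'ABC', 'BCA', 'BCB', 'BCC', 'ABBC', 'ABBA', 'ABBC']
print(leading_counts(m).most_common(3))
print(leading_counts(m, min_length=3)['ABB'])
```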

Related

Python - Finding the list of words that occurs n times in the given file

I want to find the list of words that occur n times (for example, 200) in a given file. I get each unique token in the file with the following code, but I can't work out how to select the ones that occur exactly n times.
from collections import Counter
import re
seen = list()
words = re.findall(r'[\w+]+', open('deneme.txt').read())
seen = Counter(words).most_common()
Output is:
[('Erke', 4), ('aƧ+Noun', 4), ('Antalya', 3), ('123', 3), ('ol+Verb', 3), ('Varol', 2), ('Koleji', 1), ('asdfsdf', 1), ('birak+Verb', 1)]
For example, I want to get the tokens that occur 3 times. How can I achieve this? I can't get at the occurrence counts in the list.
You could use a list comprehension:
from collections import Counter
import re
seen = list()
words = re.findall(r'[\w+]+', open('deneme.txt').read())
seen = Counter(words).most_common()
print([w for w, c in seen if c == 3])
Output
['123', 'Antalya', 'ol+Verb']
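The same filter also works directly on the Counter, without going through most_common(). A minimal sketch, using a made-up inline string in place of deneme.txt (the helper name words_with_count is hypothetical):

```python
from collections import Counter
import re

# Hypothetical sample text standing in for the contents of the file.
text = "a b a c b a d d d"
counts = Counter(re.findall(r'[\w+]+', text))

def words_with_count(counts, n):
    # Keep only the tokens whose count is exactly n; sort for a stable order.
    return sorted(word for word, c in counts.items() if c == n)

print(words_with_count(counts, 3))
```

Sorting is optional; it is only there because the order of tokens with equal counts should not be relied on.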

How to find the index of exact match?

I know how to use python to report exact match in a string:
import re
word='hello,_hello,"hello'
re.findall('\\bhello\\b',word)
['hello', 'hello']
How do I report the indices of the exact matches? (in this case, 0 and 14)
Use finditer:
[(g.start(), g.group()) for g in re.finditer('\\b(hello)\\b',word)]
# [(0, 'hello'), (14, 'hello')]
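If only the indices are needed, and the search term may contain regex metacharacters, the finditer approach could be wrapped like this. The function name exact_match_indices is an assumption for illustration; re.escape and Match.start are standard library calls.

```python
import re

def exact_match_indices(text, target):
    # re.escape keeps characters such as '.' or '+' in target from being
    # interpreted as regex syntax; \b requires a word boundary on both sides.
    pattern = r'\b' + re.escape(target) + r'\b'
    return [m.start() for m in re.finditer(pattern, text)]

word = 'hello,_hello,"hello'
print(exact_match_indices(word, 'hello'))
```

The occurrence at index 7 is skipped because the underscore before it counts as a word character, so there is no boundary there.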
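Alternatively, use word.find('hello', x) in a loop:
word = 'hello,_hello,"hello'
tmp = 0
index = []
for i in range(len(word)):
    tmp = word.find('hello', tmp)
    if tmp >= 0:
        index.append(tmp)
        tmp += 1
# Note: str.find matches plain substrings, so index also contains 7 (the 'hello' after '_')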

A more complex version of "How can I tell if a string repeats itself in Python?"

I was reading this post and I wonder if someone can find a way to catch repetitive motifs in a more complex string.
For example, find all the repetitive motifs in
string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
Here are the repetitive motifs:
'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
So, the output should be something like this:
output = {'ACGT': {'repeat': 2,
                   'region': (5,13)},
          'GT': {'repeat': 3,
                 'region': (19,24)},
          'TATACG': {'repeat': 2,
                     'region': (29,40)}}
This example comes from a typical biological phenomenon, the microsatellite, which is present in DNA.
UPDATE 1: Asterisks were removed from the string variable. It was a mistake.
UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.
You can use a recursive function as follows:
Note: the result argument behaves like a global variable, because a mutable default argument is created once and shared by every call (so calling finder a second time keeps appending to the same list).
import re
def finder(st,past_ind=0,result=[]):
    # find the leftmost run made of a substring repeated at least twice
    m=re.search(r'(.+)\1+',st)
    if m:
        i,j=m.span()
        sub=st[i:j]
        # reduce the matched run to its shortest repeating unit
        ind = (sub+sub).find(sub, 1)
        sub=sub[:ind]
        if len(sub)>1:
            result.append([sub,(i+past_ind+1,j+past_ind+1)])
        past_ind+=j
        return finder(st[j:],past_ind)
    else:
        return result
s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print(finder(s))
result:
[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
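The same idea can be written without recursion or the shared default list by letting re.finditer walk the string. This is a sketch rather than the answer's own code: the name find_tandem_repeats and the min_len parameter are made up, and the regions it reports are 0-based, half-open spans as returned by Match.span(), so they are shifted by one relative to the output above.

```python
import re

def find_tandem_repeats(seq, min_len=2):
    results = []
    # Each match is a run consisting of some substring repeated at least twice.
    for m in re.finditer(r'(.+)\1+', seq):
        block = m.group(0)
        # Shortest repeating unit of the run (same trick as in finder above).
        period = (block + block).find(block, 1)
        unit = block[:period]
        if len(unit) >= min_len:
            results.append((unit, len(block) // len(unit), m.span()))
    return results

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
for unit, repeat, region in find_tandem_repeats(s):
    print(unit, repeat, region)
```

Because a fresh list is built on every call, repeated calls do not accumulate results from earlier ones.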
Answer to the previous version of the question, for the following string:
s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
You can use both answers from the mentioned question plus some extra recipes:
First, split the string on ** and build a new list containing the repeated strings, using the regex r'(.+)\1+':
So the result will be:
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
Note that 'ACGTACGT' has lost the trailing A!
Then you can use the principal_period function to get the repeated substrings:
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
>>> l, sub = [], []
>>> for i in new:
...     p=principal_period(i)
...     if p is not None and len(p)>1:
...         l.append(p)
...         sub.append(i)
...
So you will have the repeated strings in l and the main strings in sub:
>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
Then you need the region, which you can get with the span method:
>>> regons = []
>>> for t in sub:
...     regons.append(re.search(t,s).span())
...
>>> regons
[(6, 14), (24, 30), (38, 50)]
Finally, you can zip the three lists sub, l and regons and use a dict comprehension to create the expected result:
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
The main code:
>>> import re
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regons=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...     p=principal_period(i)
...     if p is not None and len(p)>1:
...         l.append(p)
...         sub.append(i)
...
>>> for t in sub:
...     regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
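To make the role of principal_period clearer, here is a small standalone demonstration on a few blocks (the sample inputs are chosen for illustration). It returns the shortest unit whose repetition rebuilds the string, or None when the string is not a repetition of anything shorter.

```python
def principal_period(s):
    # s is periodic exactly when it occurs inside s+s at an offset
    # strictly between 0 and len(s); that offset is the period length.
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

for block in ['ACGTACGT', 'GTGTGT', 'AAA', 'ABC']:
    print(block, '->', principal_period(block))
```

The len(p)>1 check in the answer is what discards single-character units such as the 'A' of 'AAA'.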
If you can bound your query, you can use a single pass over the string. The number of comparisons is roughly length of string * (max_length - min_length), so it scales linearly with the length of the string.
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
def find_repeats(s, max_length, min_length=2):
    for i in range(len(s)):
        for j in range(min_length, max_length+1):
            count = 1
            # count how many consecutive copies of s[i:i+j] start at position i
            while s[i:i+j] == s[i+j*count:i+j*count+j]: count += 1
            if count > 1:
                yield s[i:i+j], i, count
for pattern, position, count in find_repeats(s, 6, 2):
    print("%6s at region (%d, %d), %d repeats" % (pattern, position, position + count*len(pattern), count))
Output:
AC at region (2, 6), 2 repeats
ACGT at region (4, 12), 2 repeats
CGTA at region (5, 13), 2 repeats
GT at region (18, 24), 3 repeats
TG at region (19, 23), 2 repeats
GT at region (20, 24), 2 repeats
CC at region (24, 28), 2 repeats
TA at region (28, 32), 2 repeats
TATACG at region (28, 40), 2 repeats
ATACGT at region (29, 41), 2 repeats
TA at region (34, 38), 2 repeats
Note that this catches a fair few more overlapping patterns than the regexp answers, but without knowing more about what you consider a good match it is difficult to reduce it further, for example why is TATACG better than ATACGT?
Extra: Using a dict to return matches is a bad idea as the patterns are not going to be unique.
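One way around the uniqueness problem, if a dict is still wanted, is to map each pattern to a list of its occurrences. The sketch below repeats find_repeats so that it runs on its own; the grouping step and the name groups are assumptions added for illustration.

```python
from collections import defaultdict

def find_repeats(s, max_length, min_length=2):
    for i in range(len(s)):
        for j in range(min_length, max_length + 1):
            count = 1
            # Count consecutive copies of s[i:i+j] starting at position i.
            while s[i:i+j] == s[i+j*count:i+j*count+j]:
                count += 1
            if count > 1:
                yield s[i:i+j], i, count

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
groups = defaultdict(list)
for pattern, position, count in find_repeats(s, 6, 2):
    # Every pattern keeps all of its (start, repeats) occurrences.
    groups[pattern].append((position, count))

print(groups['GT'])
print(groups['TA'])
```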
This simple while loop detects all repeated patterns:
def expand():
    global hi
    hi += 1
def shrink():
    global lo
    lo += 1
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
motifs = set()
lo = 0
hi = 0
f = expand
while hi <= len(s):
    sub = s[lo : hi+1]
    if s.count(sub) > 1:
        motifs.add(sub)
        if lo==hi: f = expand
        f()
    else:
        f = shrink if lo<=hi else expand
        f()
At this point, motifs contains all the repeated patterns... Let's filter them with some criteria:
minlen = 3
for m in filter(lambda m: len(m)>=minlen and s.count(2*m)>=1, motifs):
    print(m)
'''
ATACGT
ACGT
TATACG
CGTA
'''
You can use the fact that in regex, lookaheads do not advance the primary iterator. Thus, you can nest a capture group within a lookahead to find the (potentially overlapping) patterns that repeat and have a specified minimum length:
>>> import re
>>> s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
>>> re.findall(r'(?=(.{2,})\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TATACG', 'ATACGT', 'TA']
>>> re.findall(r'(?=(.{2,}?)\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TA', 'ATACGT', 'TA']
Note the slightly different results between using a greedy and a non-greedy quantifier. The greedy quantifier searches for the longest repeating substring starting from every index in the original string, if one exists. The non-greedy quantifier searches for the shortest of the same. The limitation is that you can only get a maximum one pattern per starting index in the string. If you have any ideas to solve this problem, let me know! Potentially, we can use the greedy quantifier regex to set up a recursive solution that finds every repeating pattern starting from each index, but let's avoid "premature computation" for now.
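To see why wrapping the pattern in a lookahead matters, here is a toy comparison on a made-up string. A plain pattern consumes the text it matches, so overlapping occurrences are lost; a lookahead matches zero characters, so the scan restarts at the very next index.

```python
import re

text = 'ababa'

# Plain pattern: the first match consumes 'aba', leaving only 'ba' to scan.
plain = re.findall(r'aba', text)

# Lookahead with a capture group: nothing is consumed, so the match
# starting at index 2 is found as well.
overlapping = re.findall(r'(?=(aba))', text)
starts = [m.start() for m in re.finditer(r'(?=(aba))', text)]

print(plain)
print(overlapping)
print(starts)
```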
Now if we take the regex (?=(.{2,})\1+) and modify it, we can also capture the entire substring that contains repeated motifs. By doing this, we can use the span of the matches to calculate the number of repetitions:
(?=((.{2,})\2+))
In the above regex, we have a capture group inside a capture group inside a lookahead. Now we have everything we need to solve the problem:
def repeated_motifs(s):
    import re
    from collections import defaultdict
    rmdict = defaultdict(list)
    for match in re.finditer(r'(?=((.{2,})\2+))', s):
        motif = match.group(2)
        span1, span2 = match.span(1), match.span(2)
        startindex = span1[0]
        repetitions = (span1[1] - startindex) // (span2[1] - startindex)
        others = rmdict[motif]
        if not others or startindex > others[-1]['region'][1]:
            others.append({'repeat': repetitions, 'region': span1})
    return rmdict
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
d = repeated_motifs(s)
print(d)
# list of the repeating motifs we have found sorted by first region
print(sorted(list(d.keys()), key=lambda k: d[k][0]['region']))
Because desired behavior in the situation where a motif repeats itself in multiple "regions" of the string was not specified, I have made the assumption that OP would like a dictionary of string->list where each list contains its own set of dictionaries.

How to return the count of the same elements in two lists?

I have two very large lists(that's why I used ... ), a list of lists:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],...,['how to match and return the frequency?']]
and a list of strings:
y = ['hi', 'nice', 'ok',..., 'frequency']
I would like to return in a new list the times (count) that any word in y occurred in all the lists of x. For example, for the above lists, this should be the correct output:
[(1,2),(2,0),(3,1),...,(n,count)]
That is, [(1,count),...,(n,count)], where n is the number of the sublist and count is the number of times any word from y appeared in that sublist of x. Any idea how to approach this?
First, you should preprocess x into a list of sets of lowercased words -- that will speed up the following lookups enormously. E.g.:
import re
ppx = []
for subx in x:
    # each item of x is a one-element list holding the sentence string
    ppx.append(set(w.lower() for w in re.findall(r'\w+', subx[0])))
(yes, you could collapse this into a list comprehension, but I'm aiming for some legibility).
Next, you loop over y, checking how many of the sets in ppx contain each item of y -- that would be
[sum(1 for s in ppx if w in s) for w in y]
That doesn't give you those redundant first items you crave, but enumerate to the rescue...:
list(enumerate((sum(1 for s in ppx if w in s) for w in y), 1))
should give exactly what you require.
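If the intent is instead one count per sublist of x, as the sample output in the question suggests, the same preprocessing can be turned around. This is a sketch under that assumption; it counts distinct words of y present in each sublist, not repeated occurrences.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

wanted = set(y)
counts = []
for n, sub in enumerate(x, 1):
    # Lowercased set of the words in this sublist's sentence.
    words = set(w.lower() for w in re.findall(r'\w+', sub[0]))
    counts.append((n, len(wanted & words)))

print(counts)
```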
Here is a more readable solution. Check my comments in the code.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
assert len(x)==len(y), "you have to make sure length of x equals y's"
num = []
for i in range(len(y)):
    # lower all the strings in x for comparison
    # find all matched patterns in x and count it, and store result in variable num
    num.append(len(re.findall(y[i], x[i][0].lower())))
res = []
# use enumerate to give output in format you want
for k, v in enumerate(num):
    res.append((k,v))
# here is what you want
print(res)
OUTPUT:
[(0, 1), (1, 0), (2, 1), (3, 1)]
INPUT:
x = [['I like stackoverflow. Hi ok!'],['this is a great community'],
['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']
CODE:
import re
s1 = set(y)
index = 0
result = []
for itr in x:
    # remove special chars (including commas, so that 'ok,' becomes 'ok') and convert to lower case
    itr = re.sub('[!.?,]', '',itr[0].lower()).split(' ')
    s2 = set(itr)
    # find intersection of common strings
    intersection = s1 & s2
    num = len(intersection)
    result.append((index,num))
    index = index+1
OUTPUT:
result = [(0, 2), (1, 0), (2, 1), (3, 1)]
You could also do it like this:
>>> import re
>>> x = [['I like stackoverflow. Hi ok!'],['this is a great community'],['Ok, I didn\'t like this!.'],['how to match and return the frequency?']]
>>> y = ['hi', 'nice', 'ok', 'frequency']
>>> l = []
>>> for i,j in enumerate(x):
...     c = 0
...     for word in y:
...         if re.search(r'(?i)\b'+word+r'\b', j[0]):
...             c += 1
...     l.append((i+1,c))
...
>>> l
[(1, 2), (2, 0), (3, 1), (4, 1)]
(?i) makes the match case-insensitive. \b is a word boundary, which matches between a word character and a non-word character.
Maybe you could concatenate the strings in x to make the computation easy:
w = ' '.join(i[0] for i in x)
Now w is a long string like this:
>>> w
"I like stackoverflow. Hi ok! this is a great community Ok, I didn't like this!. how to match and return the frequency?"
With this conversion, you can simply do this:
>>> l = []
>>> for i in range(len(y)):
...     l.append((i+1, w.count(str(y[i]))))
...
which gives you:
>>> l
[(1, 2), (2, 0), (3, 1), (4, 0), (5, 1)]
You can make a dictionary whose keys are the items of the y list. Loop through the words in the nested x list, look each one up in the dictionary, and increment its value whenever you encounter it.
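A minimal sketch of that dictionary idea, assuming the four-element sample data used in the other answers. The name totals is made up, and this version yields one total per word of y across all of x rather than one count per sublist.

```python
import re

x = [['I like stackoverflow. Hi ok!'], ['this is a great community'],
     ["Ok, I didn't like this!."], ['how to match and return the frequency?']]
y = ['hi', 'nice', 'ok', 'frequency']

# One counter per word of y, all starting at zero.
totals = dict.fromkeys(y, 0)
for sub in x:
    for w in re.findall(r'\w+', sub[0].lower()):
        # Dictionary lookup is constant time, so each word of x is checked once.
        if w in totals:
            totals[w] += 1

print(totals)
```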

How to generate a list of palindromes with only 'x','y' and a given length n in Python?

Is there any pattern I can use to work out how to create palindromic strings made up of only 'x' and 'y'?
Let's assume n is even. Generate every string of length n/2 that consists of x and y, and append its mirror image to get a palindrome.
Exercise 1: prove that this generates all palindromes of length n.
Exercise 2: figure out what to do when n is odd.
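One possible way to put the mirroring idea into code is sketched below (it gives away Exercise 2). The function name palindromes is hypothetical. It generates the first ceil(n/2) characters and appends the reverse of the first floor(n/2), so for odd n the middle character is not duplicated.

```python
from itertools import product

def palindromes(n, characters='xy'):
    result = []
    # The first (n + 1) // 2 characters determine the whole palindrome.
    for first in product(characters, repeat=(n + 1) // 2):
        first = ''.join(first)
        # Mirror only the first n // 2 characters; for odd n this leaves
        # the middle character unpaired.
        result.append(first + first[:n // 2][::-1])
    return result

print(palindromes(4))
print(palindromes(5))
```

This builds 2**ceil(n/2) strings directly instead of generating all 2**n candidates and filtering them.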
First generate all possible strings given a list of characters:
>>> from itertools import product
>>> characters = ['x','y']
>>> n = 5
>>> [''.join(i) for i in product(characters, repeat=n)]
['xxxxx', 'xxxxy', 'xxxyx', 'xxxyy', 'xxyxx', 'xxyxy', 'xxyyx', 'xxyyy', 'xyxxx', 'xyxxy', 'xyxyx', 'xyxyy', 'xyyxx', 'xyyxy', 'xyyyx', 'xyyyy', 'yxxxx', 'yxxxy', 'yxxyx', 'yxxyy', 'yxyxx', 'yxyxy', 'yxyyx', 'yxyyy', 'yyxxx', 'yyxxy', 'yyxyx', 'yyxyy', 'yyyxx', 'yyyxy', 'yyyyx', 'yyyyy']
Then filter out the non-palindromes (note the integer division n//2, which slicing requires in Python 3):
>>> n = 4
>>> [''.join(i) for i in product(characters, repeat=n) if i[:n//2] == i[::-1][:n//2]]
['xxxx', 'xyyx', 'yxxy', 'yyyy']
>>> n = 5
>>> [''.join(i) for i in product(characters, repeat=n) if i[:n//2] == i[::-1][:n//2]]
['xxxxx', 'xxyxx', 'xyxyx', 'xyyyx', 'yxxxy', 'yxyxy', 'yyxyy', 'yyyyy']
If you don't like if conditions in a list comprehension, you can use filter():
>>> from itertools import product
>>> characters = ['x','y']
>>> n = 5
>>> def ispalindrome(x): return x[:n//2] == x[::-1][:n//2]
...
>>> list(filter(ispalindrome, [''.join(i) for i in product(characters, repeat=n)]))
['xxxxx', 'xxyxx', 'xyxyx', 'xyyyx', 'yxxxy', 'yxyxy', 'yyxyy', 'yyyyy']
