How to iterate over all overlapping matches in string?


I want to iterate over all overlapping matches of terms in a sentence/string. If it is not possible to solve the problem with just a single regex, it is fine to go with more than one -- but the number of expressions should be independent of the number of terms.
>>> import re
>>> text = 'X has effect on Y.'
>>> terms = ['has effect', 'effect', 'effect on']
>>> pattern = r"(?=\b(" + "|".join(re.escape(term) for term in terms) + r")\b)"
>>> print pattern
(?=\b(has\ effect|effect|effect\ on)\b)
>>> [(m.span(1), text[m.start(1):m.end(1)]) for m in re.finditer(pattern, text)]
[((2, 12), 'has effect'), ((6, 12), 'effect')]
I'm still not able to extract the 'effect on' term -- I think I'm stuck on the 'lookbehind assertion'. Thanks!
EDIT #1:
Problem description in terms of input/output specs with example:
Input:
a sentence, e.g. 'X has effect on Y.'
a list of terms, e.g. ['has effect', 'effect', 'effect on']
Output:
a list of mentions (((start, end), term) tuples), e.g. [((2, 12), 'has effect'), ((6, 12), 'effect'), ((6, 15), 'effect on')]
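One way to get every overlapping mention is to stop relying on a single alternation (which can report only one alternative per starting position) and instead test every term at every position. The following is a minimal sketch, not a pure-regex solution: the names find_mentions and _is_word_boundary are hypothetical, and the boundary helper is a hand-written approximation of \b for ASCII-style word characters.

```python
def _is_word_boundary(text, i):
    """Approximate regex \\b: true when exactly one side of index i is a word character."""
    def is_word_char(ch):
        return ch.isalnum() or ch == "_"
    before = i > 0 and is_word_char(text[i - 1])
    after = i < len(text) and is_word_char(text[i])
    return before != after


def find_mentions(text, terms):
    """Return ((start, end), term) for every term occurrence, overlapping ones included."""
    mentions = []
    for start in range(len(text)):
        for term in terms:
            end = start + len(term)
            # The term must occur at this position and sit between word boundaries.
            if (text.startswith(term, start)
                    and _is_word_boundary(text, start)
                    and _is_word_boundary(text, end)):
                mentions.append(((start, end), term))
    return mentions


print(find_mentions('X has effect on Y.', ['has effect', 'effect', 'effect on']))
```

The work grows with len(text) * len(terms), but no regular expression is built per term, which keeps the number of expressions independent of the number of terms.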

Related

split int and string to tuple from string python

I have a list of strings with names and numbers like:
["mike5","john","sara2","bob","nick6"]
and I want to create from each string a tuple (name, age) like this:
[('mike', 5), ('john', 0), ('sara', 2), ('bob', 0), ('nick', 6)]
So if a string doesn't contain a number, the age is 0.
What is the simplest way to do it?
I tried to use:
temp = re.compile("([a-zA-Z]+)([0-9]+)")
res = temp.match(type).group()
but it fails

You can use the regex ([a-z]+)(\d+)? to find the name and the optional number, and pass 0 to .groups() so that a missing number defaults to 0 (see match.groups()):
import re
def split_vals(word):
    # groups(0) substitutes 0 for the optional digit group when it did not match
    name, number = re.search(r"([a-z]+)(\d+)?", word).groups(0)
    return name, int(number)
values = ["mike5", "john", "sara2", "bob", "nick6"]
values = [split_vals(value) for value in values]
# [('mike', 5), ('john', 0), ('sara', 2), ('bob', 0), ('nick', 6)]

It fails because your pattern doesn't match anything, so match() returns None:
temp.match('john') is None
True
You need to change your regex to:
# The * means 0 or more. Otherwise, you've required a number to be present
temp = re.compile("([a-zA-Z]+)([0-9]*)")
temp.match('john')
<re.Match object; span=(0, 4), match='john'>
Last, if you want tuples, use groups(), not group()
[temp.match(item).groups() for item in x]
[('mike', '5'), ('john', ''), ('sara', '2'), ('bob', ''), ('nick', '6')]

A couple of things:
The regex is correct up to [0-9]+, which requires one or more digits. Not all of your strings contain a digit (john, for example), so I would suggest using *, which matches zero or more digits.
Calling pattern.match(string) on a compiled pattern is valid, but it returns None when nothing matches, and calling .group() on None raises an AttributeError. The module-level form re.match(pattern, string) behaves the same way (see below).
In addition, using groups() instead of group() returns a tuple of all the captured groups in your regex (again, see below).
Using a loop over your items and an if statement, you should be able to achieve your desired result:
import re
lst = ["mike5", "john", "sara2", "bob", "nick6"]
pattern = re.compile("([a-zA-Z]+)([0-9]*)")
name_age = []
for value in lst:
    name, age = re.match(pattern, value).groups()
    # age is '' when the string has no digits
    if not age:
        age = 0
    name_age.append((name, int(age)))
print(name_age)

import re
inArr = ["mike5", "john", "sara2", "bob", "nick6"]
outArr = []
for item in inArr:
    # \d* captures all trailing digits (possibly none)
    regexResult = re.search(r'([a-z]+)(\d*)', item, re.IGNORECASE)
    if regexResult:
        name = regexResult.group(1)
        age = regexResult.group(2) or 0
        outArr.append((name, int(age)))
print(outArr) # [('mike', 5), ('john', 0), ('sara', 2), ('bob', 0), ('nick', 6)]

How to Identify Repetitive Characters in a String Using Python?

I am new to python and I want to write a program that determines if a string consists of repetitive characters. The list of strings that I want to test are:
Str1 = "AAAA"
Str2 = "AGAGAG"
Str3 = "AAA"
The pseudo-code that I come up with:
WHEN len(str) % 2 with zero remainder:
- Divide the string into two sub-strings.
- Then, compare the two sub-strings and check if they have the same characters, or not.
- if the two sub-strings are not the same, divide the string into three sub-strings and compare them to check if repetition occurs.
I am not sure if this is applicable way to solve the problem, Any ideas how to approach this problem?
Thank you!
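The pseudo-code above can be generalised from "two or three sub-strings" to any number of equal parts. The following is only a sketch under that reading of the question; the function name repeating_unit is hypothetical.

```python
def repeating_unit(s):
    """Return the shortest unit whose repetition rebuilds s, or None if there is none."""
    n = len(s)
    # Try the largest number of parts first, so the shortest unit wins.
    for parts in range(n, 1, -1):
        if n % parts == 0:
            unit = s[: n // parts]
            if unit * parts == s:
                return unit
    return None


for candidate in ("AAAA", "AGAGAG", "AAA", "ABC"):
    print(candidate, repeating_unit(candidate))
```

Only part counts that divide the length evenly are tried, which is the same idea as the len(str) % 2 check in the pseudo-code.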

You could use Counter from the collections module to count how often each character occurs.
>>> from collections import Counter
>>> s = 'abcaaada'
>>> c = Counter(s)
>>> c.most_common()
[('a', 5), ('c', 1), ('b', 1), ('d', 1)]
To get the single most repetitive (common) character:
>>> c.most_common(1)
[('a', 5)]

You could do this using regex backreferences.
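A minimal sketch of the backreference idea, assuming "repetitive" means the whole string is one unit repeated at least twice. The function name is_repetitive is hypothetical; re.fullmatch requires Python 3.4 or later.

```python
import re


def is_repetitive(s):
    """Return the shortest repeating unit if s is that unit repeated two or more times, else None."""
    # (.+?) lazily captures a candidate unit; \1+ requires one or more further copies of it.
    m = re.fullmatch(r"(.+?)\1+", s)
    return m.group(1) if m else None


for candidate in ("AAAA", "AGAGAG", "AAA", "ABC"):
    print(candidate, is_repetitive(candidate))
```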

To find a pattern in Python, you can use regular expressions. A regular expression search is typically written as:
match = re.search(pat, str)
This is usually followed by an if statement to determine whether the search succeeded.
For example, this is how you would find the pattern "AAAA" in a string:
import re
string = ' blah blahAAAA this is an example'
match = re.search(r'AAAA', string)
if match:
    print('found', match.group())
else:
    print('did not find')
This prints "found AAAA".
Do the same for your other two strings and it will work the same way.
Regular expressions can do a lot more than this, so experiment with them and see what else they can do.

Assuming you mean the whole string is a repeating pattern, this answer has a good solution:
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
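The trick works because s can be found inside s+s at an offset strictly between 0 and len(s) only when s equals a rotation of itself, which happens exactly when s is built from a shorter repeated unit. A short usage sketch on the strings from the question follows; the extra string "AGAGA" is an added example of a non-repeating input.

```python
def principal_period(s):
    # Search for s inside s+s, skipping offset 0 and excluding the trivial match at offset len(s).
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]


for candidate in ("AAAA", "AGAGAG", "AAA", "AGAGA"):
    print(candidate, principal_period(candidate))
```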

A more complex version of "How can I tell if a string repeats itself in Python?"

I was reading this post and I wonder if someone can find a way to catch repetitive motifs in a more complex string.
For example, find all the repetitive motifs in
string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
Here are the repetitive motifs:
'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
So, the output should be something like this:
output = {'ACGT': {'repeat': 2,
                   'region': (5,13)},
          'GT': {'repeat': 3,
                 'region': (19,24)},
          'TATACG': {'repeat': 2,
                     'region': (29,40)}}
This example comes from a typical biological phenomenon termed a microsatellite, which is present in DNA.
UPDATE 1: Asterisks were removed from the string variable. It was a mistake.
UPDATE 2: Single character motif doesn't count. For example: in ACGUGAAAGUC, the 'A' motif is not taken into account.

You can use a recursive function as follows:
Note: The result argument behaves like a global variable (its default list is created once and mutated by every call), so results accumulate if you call finder more than once without passing a fresh list.
import re
def finder(st, past_ind=0, result=[]):
    # find the first run of an immediately repeated substring
    m = re.search(r'(.+)\1+', st)
    if m:
        i, j = m.span()
        sub = st[i:j]
        # reduce the matched run to its shortest repeating unit
        ind = (sub+sub).find(sub, 1)
        sub = sub[:ind]
        if len(sub) > 1:
            # positions are shifted by 1 relative to Python's 0-based span
            result.append([sub, (i+past_ind+1, j+past_ind+1)])
        past_ind += j
        return finder(st[j:], past_ind)
    else:
        return result
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print(finder(s))
Result:
[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
Answer to the previous version of the question, for the following string:
s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
You can use both answers from the mentioned question plus some extra steps:
First, split the string on ** and then build a new list containing the repeated strings, using the regex r'(.+)\1+':
So the result will be:
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
Note that 'ACGTACGT' is missing the trailing A of 'ACGTACGTA'!
Then you can use the principal_period function to get the repeated substrings:
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
>>> for i in new:
...     p=principal_period(i)
...     if p is not None and len(p)>1:
...         l.append(p)
...         sub.append(i)
...
So you will have the repeated strings in l and the main strings in sub:
>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
Then you need the regions, which you can get with the span method:
>>> for t in sub:
...     regons.append(re.search(t,s).span())
>>> regons
[(6, 14), (24, 30), (38, 50)]
Finally, you can zip the three lists sub, l and regons and use a dict comprehension to create the expected result:
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
The main code:
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regons=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...     p=principal_period(i)
...     if p is not None and len(p)>1:
...         l.append(p)
...         sub.append(i)
...
>>> for t in sub:
...     regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

If you can bound the motif length, you can use a single pass over the string. The number of comparisons is on the order of len(s) * (max_length - min_length + 1), so for fixed bounds it scales linearly with the length of the string.
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
def find_repeats(s, max_length, min_length=2):
    for i in range(len(s)):
        for j in range(min_length, max_length+1):
            count = 1
            # count how many times s[i:i+j] is immediately followed by itself
            while s[i:i+j] == s[i+j*count:i+j*count+j]:
                count += 1
            if count > 1:
                yield s[i:i+j], i, count
for pattern, position, count in find_repeats(s, 6, 2):
    print("%6s at region (%d, %d), %d repeats" % (pattern, position, position + count*len(pattern), count))
Output:
AC at region (2, 6), 2 repeats
ACGT at region (4, 12), 2 repeats
CGTA at region (5, 13), 2 repeats
GT at region (18, 24), 3 repeats
TG at region (19, 23), 2 repeats
GT at region (20, 24), 2 repeats
CC at region (24, 28), 2 repeats
TA at region (28, 32), 2 repeats
TATACG at region (28, 40), 2 repeats
ATACGT at region (29, 41), 2 repeats
TA at region (34, 38), 2 repeats
Note that this catches a fair few more overlapping patterns than the regexp answers, but without knowing more about what you consider a good match it is difficult to reduce it further, for example why is TATACG better than ATACGT?
Extra: Using a dict to return matches is a bad idea as the patterns are not going to be unique.

This simple while loop detects all repeated patterns:
def expand():
    global hi
    hi += 1
def shrink():
    global lo
    lo += 1
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
motifs = set()
lo = 0
hi = 0
f = expand
while hi <= len(s):
    sub = s[lo : hi+1]
    if s.count(sub) > 1:
        motifs.add(sub)
        if lo==hi: f = expand
        f()
    else:
        f = shrink if lo<=hi else expand
        f()
At this point, motifs contains all the repeated patterns... Let's filter them with some criteria:
minlen = 3
for m in filter(lambda m: len(m)>=minlen and s.count(2*m)>=1, motifs):
    print(m)
'''
ATACGT
ACGT
TATACG
CGTA
'''

You can use the fact that in regex, lookaheads do not advance the primary iterator. Thus, you can nest a capture group within a lookahead to find the (potentially overlapping) patterns that repeat and have a specified minimum length:
>>> import re
>>> s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
>>> re.findall(r'(?=(.{2,})\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TATACG', 'ATACGT', 'TA']
>>> re.findall(r'(?=(.{2,}?)\1+)', s)
['AC', 'ACGT', 'CGTA', 'GT', 'TG', 'GT', 'CC', 'TA', 'ATACGT', 'TA']
Note the slightly different results between using a greedy and a non-greedy quantifier. The greedy quantifier searches for the longest repeating substring starting from every index in the original string, if one exists. The non-greedy quantifier searches for the shortest of the same. The limitation is that you can get at most one pattern per starting index in the string. If you have any ideas to solve this problem, let me know! Potentially, we could use the greedy quantifier regex to set up a recursive solution that finds every repeating pattern starting from each index, but let's avoid "premature computation" for now.
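One possible way around the one-pattern-per-index limitation, sketched here as an alternative to the recursive idea, is to fix the unit length with {n} and run the lookahead once per length. The function name all_repeats and its min_len parameter are hypothetical, and the sketch assumes the string contains no newlines (since . does not match them by default).

```python
import re


def all_repeats(s, min_len=2):
    """Return sorted (start, unit) pairs for every unit of length >= min_len
    that is immediately followed by at least one more copy of itself."""
    found = []
    # A unit longer than half the string cannot be followed by a full copy of itself.
    for n in range(min_len, len(s) // 2 + 1):
        pattern = re.compile(r"(?=(.{%d})\1+)" % n)
        for m in pattern.finditer(s):
            found.append((m.start(), m.group(1)))
    return sorted(found)


print(all_repeats("ABABABAB"))
```

For "ABABABAB" this reports both 'AB' and 'ABAB' at index 0, which a single greedy or non-greedy pattern cannot do. The cost is one scan of the string per candidate length.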
Now if we take the regex (?=(.{2,})\1+) and modify it, we can also capture the entire substring that contains repeated motifs. By doing this, we can use the span of the matches to calculate the number of repetitions:
(?=((.{2,})\2+))
In the above regex, we have a capture group inside a capture group inside a lookahead. Now we have everything we need to solve the problem:
def repeated_motifs(s):
    import re
    from collections import defaultdict
    rmdict = defaultdict(list)
    for match in re.finditer(r'(?=((.{2,})\2+))', s):
        motif = match.group(2)
        span1, span2 = match.span(1), match.span(2)
        startindex = span1[0]
        repetitions = (span1[1] - startindex) // (span2[1] - startindex)
        others = rmdict[motif]
        if not others or startindex > others[-1]['region'][1]:
            others.append({'repeat': repetitions, 'region': span1})
    return rmdict
s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
d = repeated_motifs(s)
print(d)
# list of the repeating motifs we have found sorted by first region
print(sorted(list(d.keys()), key=lambda k: d[k][0]['region']))
Because desired behavior in the situation where a motif repeats itself in multiple "regions" of the string was not specified, I have made the assumption that OP would like a dictionary of string->list where each list contains its own set of dictionaries.

Check if collection of words (pyenchant)

I want to check if a string in Python is a collection of words using PyEnchant.
For example, I want to check whether a concatenated string is made up of words:
import enchant
eng = enchant.Dict("en_US")
eng.check("Applebanana")
I know this will return False, but I want it to return True, since Apple + banana are legit words by PyEnchant.

If you limit yourself to words combined from two other words, you can check the combinations yourself:
>>> s = "applebanana"
>>> splits = [(s[:i], s[i:]) for i in range(1,len(s))]
>>> splits
[('a', 'pplebanana'), ('ap', 'plebanana'), ('app', 'lebanana'),
('appl', 'ebanana'), ('apple', 'banana'), ('appleb', 'anana'),
('appleba', 'nana'), ('appleban', 'ana'), ('applebana', 'na'),
('applebanan', 'a')]
>>> any((eng.check(item[0]) and eng.check(item[1])) for item in splits)
True
Of course you can expand that to more than two, but this should give you a general idea of where you're headed.
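Expanding to more than two words can be sketched as a memoised recursion over split points. The function name can_segment and its is_word parameter are hypothetical; is_word stands in for any callable that says whether a string is a word, and eng.check from the question would be one candidate. The demo below uses a plain set so it runs without PyEnchant.

```python
from functools import lru_cache


def can_segment(s, is_word):
    """Return True if s can be split into one or more pieces that all satisfy is_word."""

    @lru_cache(maxsize=None)
    def helper(start):
        # Reaching the end means every earlier piece was accepted.
        if start == len(s):
            return True
        # Try every possible end for the next piece.
        for end in range(start + 1, len(s) + 1):
            if is_word(s[start:end]) and helper(end):
                return True
        return False

    return len(s) > 0 and helper(0)


# Stand-in vocabulary (an assumption for the demo, not PyEnchant's dictionary).
words = {"apple", "banana", "cherry"}
print(can_segment("applebananacherry", words.__contains__))
print(can_segment("applebananax", words.__contains__))
```

Note that a single valid word also counts as a segmentation here, and that a real dictionary may accept very short strings such as single letters, which can make more inputs pass than you might expect.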

Detect repetitions in string

I have a simple problem, but can't come up with a simple solution :)
Let's say I have a string. I want to detect if there is a repetition in it.
I'd like:
"blablabla" # => (bla, 3)
"rablabla" # => (bla, 2)
The thing is I don't know what pattern I am searching for (I don't have "bla" as input).
Any idea?
EDIT:
Seeing the comments, I think I should clarify what I have in mind:
In a string, there is either a pattern that is repeated or not.
The repeated pattern can be of any length.
If there is a pattern, it is repeated over and over again until the end. But the string can end in the middle of the pattern.
Example:
"testblblblblb" # => ("bl",4)

import re
def repetitions(s):
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        # integer division keeps the repeat count an int on Python 3
        yield (match.group(1), len(match.group(0)) // len(match.group(1)))
This finds all non-overlapping repeating matches, using the shortest possible unit of repetition:
>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]
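The EDIT in the question adds that the pattern repeats until the end of the string and may be cut off mid-pattern. A sketch under that reading, without regex, follows. The function name trailing_repetition is hypothetical, and the choice of the earliest start with the shortest unit is an assumption.

```python
def trailing_repetition(s):
    """Return (unit, full_repeats) for the earliest suffix of s that is periodic
    up to the end of the string (a partial final copy is allowed), or None."""
    n = len(s)
    for start in range(n):
        tail_len = n - start
        # At least two full copies of the unit must fit in the tail.
        for period in range(1, tail_len // 2 + 1):
            # The tail is periodic if every character equals the one a period later.
            if all(s[k] == s[k + period] for k in range(start, n - period)):
                return s[start:start + period], tail_len // period
    return None


for candidate in ("blablabla", "testblblblblb", "rablabla", "abc"):
    print(candidate, trailing_repetition(candidate))
```

Like the regex answer, this reports ('abl', 2) rather than ('bla', 2) for "rablabla", because the periodic part already starts at the first 'a'.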
