How to extract span in re.finditer method in python? [duplicate]

This question already has answers here:
Python Regex - How to Get Positions and Values of Matches
The result of re.finditer is shown below.
[i for i in result]
=[<re.Match object; span=(0, 10), match='sin theta '>,
<re.Match object; span=(12, 18), match='cos x '>,
<re.Match object; span=(20, 26), match='e ^ x '>,
<re.Match object; span=(26, 32), match='f( x )'>,
<re.Match object; span=(37, 45), match='log_ {x}'>]
Here, I used i.span instead of i, but I just got the following:
[<function Match.span(group=0, /)>,
<function Match.span(group=0, /)>,
<function Match.span(group=0, /)>,
<function Match.span(group=0, /)>,
<function Match.span(group=0, /)>]
I want to extract the spans from the re.finditer matches,
like (0, 10), (12, 18), ...
Help me please!
I defined the following function to get the re.finditer results:
import re
def convert_ftn_to_token(seq):
    va = r'[a-z]{1,}'
    ftn_lst = ['sin', 'cos', 'tan', 'log_', r'e ?\^']
    ftn_lst = [ftn + r' ?\{? ?' + va + r' ?\}?' for ftn in ftn_lst]
    ftn_lst2 = [chr(i) for i in range(65, 91)] + [chr(i) for i in range(97, 123)]
    ftn_lst2 = [ftn + r' ?\( ?' + va + r' ?\)' for ftn in ftn_lst2]
    ftn_c = re.compile(
        '|'.join(ftn_lst2) + '|' +
        '|'.join(ftn_lst)
    )
    return re.finditer(ftn_c, seq)
[i.span for i in results]

.span is a method, not an attribute. You want .span() which will give the start, end tuple.
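As a minimal sketch of that fix, the snippet below calls span() on every match from finditer. The input string and the pattern are hypothetical and much simpler than the ones in the question; they are only there to make the example self-contained.

```python
import re

# Hypothetical input and pattern, chosen only for illustration.
text = "sin theta + cos x"
pattern = re.compile(r"(?:sin|cos) [a-z]+")

# Call span() on each Match object to get (start, end) tuples.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(0, 9), (12, 17)]
```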

You can use the start() and end() methods of the regex Match object, which are described in the re module documentation. They correspond to the lower and upper bound of the span respectively. The group argument mentioned in the docs only matters if you intend to use the grouping functionality of Match. If you want the span of the entire match, you can simply call match.start() and match.end(), where match is the Match object returned by the regex.
Another option is the span() method of the same Match object. Note that this is different from writing just span without parentheses, which gives you the method object itself rather than calling it. Calling match.span() gives you a tuple of the start and end. For your first match object, this would return (0, 10).
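The following sketch shows these calls side by side. The string and pattern are hypothetical (they are not taken from the question), and the capture group is only there to illustrate the optional group argument.

```python
import re

# Hypothetical example string; group 1 captures the function argument.
m = re.search(r"cos (\w+)", "sin theta + cos x")

print(m.start(), m.end())  # bounds of the whole match: 12 17
print(m.span())            # the same bounds as one tuple: (12, 17)
print(m.span(1))           # bounds of capture group 1 only: (16, 17)
print(callable(m.span))    # True: without parentheses you get the method itself
```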

Related

Finding a given substring in a string with Regular Expression in Python

I am trying to find all the occurrences of a substring in a string like below:
import re
S = 'aaadaa'
matches = re.finditer('(aa)', S)
if matches:
    #print(matches)
    for match in matches:
        print(match)
else:
    print("No match")
The current output is:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
But I expected it to give:
<re.Match object; span=(0, 2), match='aa'>
<re.Match object; span=(1, 3), match='aa'>
<re.Match object; span=(4, 6), match='aa'>
Could someone please help me on this?

Taken from the answer I linked in the comments, here is the pattern you need:
(?=(aa)).
You'll have to access the matched substring using match_obj.group(1), and the match indices using match_obj.span(1).
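A short sketch of the lookahead approach, using the string from the question. It uses the slightly simpler pattern (?=(aa)) without the trailing dot, which is an assumption on my part; for this input it reports the same positions.

```python
import re

S = "aaadaa"

# The lookahead succeeds at every position that is followed by "aa" but
# consumes nothing, so overlapping occurrences are all reported.
overlapping = [(m.group(1), m.span(1)) for m in re.finditer(r"(?=(aa))", S)]
print(overlapping)  # [('aa', (0, 2)), ('aa', (1, 3)), ('aa', (4, 6))]
```

The text lives in group 1 because the overall match is empty; that is why group(1) and span(1) are used instead of group() and span().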

The problem here is that once the re module matches a double aa, it will also consume both of the letters. But, you want overlapping matches. One trick you could use here would be to search for a(?=a):
S = 'aaadaa'
matches = re.findall(r'a(?=a)', S)
matches = [s + "a" for s in matches]
print(matches)
['aa', 'aa', 'aa']
Note that we tag on the second a to the output list, since only the first letter is actually matched at each step.

Python: I am trying to match number but it is matching something else

import re
list = []
string = "[50,40]"
print(string)
for line in string.split(","):
    print(line)
    match = re.search(r'\d[0-9]', line)
    print(match)
    if match:
        list.append(match)
print("list is", list)
list is:
[<_sre.SRE_Match object; span=(1, 3), match='50'>,
<_sre.SRE_Match object; span=(0, 2), match='40'>]
I want to match only 40 and 50 and not some other useless info like
[<_sre.SRE_Match object; span=(1, 3),
<_sre.SRE_Match object; span=(0, 2),]
How can I avoid the other information and get only 40 and 50?

Use the re.findall function, which will "return all non-overlapping matches of pattern in string, as a list of strings":
import re
string = "[50,40]"
result = re.findall(r'\d+', string)
print(result)
The output:
['50', '40']

Your code is matching the numbers, but you need to extract the strings from the Match objects. You can do that with the .group method.
Here's a repaired version of your code. I've changed some of your names because you should not shadow the builtin list type; string is also the name of a standard module.
import re
lst = []
s = "[50,40]"
print(s)
for line in s.split(","):
    print(line)
    match = re.search(r'\d[0-9]', line)
    print(match)
    if match:
        lst.append(match.group(0))
print("list is", lst)
Output:
[50,40]
[50
<_sre.SRE_Match object; span=(1, 3), match='50'>
40]
<_sre.SRE_Match object; span=(0, 2), match='40'>
list is ['50', '40']
Your regex is a little strange. If you only want to match 2-digit numbers, you could use r'\d\d' or r'\d{2}'. If you want to match any non-negative integer, you should use r'\d+'.
And you really don't need that loop. Just use the re.findall function:
import re
s = "[50,40]"
lst = re.findall(r'\d+', s)
print("list is", lst)
If you intend to do lots of searches with the same pattern, it's a good idea to use a compiled regex. The re module compiles and caches recently used regexes anyway, but doing it explicitly is considered good style, and is a little more efficient.
import re
pat = re.compile(r'\d+')
s = "[50,40]"
lst = pat.findall(s)
print("list is", lst)
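To make the remark about the pattern concrete, here is a small comparison of the three regexes mentioned above. The input string is hypothetical; it mixes 1-, 2- and 3-digit numbers so that the patterns give different results.

```python
import re

s = "[5,40,123]"  # hypothetical input with 1-, 2- and 3-digit numbers

two_a = re.findall(r"\d[0-9]", s)  # exactly two digits per match
two_b = re.findall(r"\d{2}", s)    # same meaning, written with a counted quantifier
any_len = re.findall(r"\d+", s)    # one or more digits

print(two_a)    # ['40', '12']
print(two_b)    # ['40', '12']
print(any_len)  # ['5', '40', '123']
```

Note that the two-digit patterns skip 5 entirely and cut 123 down to 12, which is why r'\d+' is usually the safer choice for whole numbers.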

How to Identify Repetitive Characters in a String Using Python?

I am new to python and I want to write a program that determines if a string consists of repetitive characters. The list of strings that I want to test are:
Str1 = "AAAA"
Str2 = "AGAGAG"
Str3 = "AAA"
The pseudo-code that I came up with:
WHEN len(str) % 2 with zero remainder:
- Divide the string into two sub-strings.
- Then, compare the two sub-strings and check if they have the same characters, or not.
- if the two sub-strings are not the same, divide the string into three sub-strings and compare them to check if repetition occurs.
I am not sure if this is a suitable way to solve the problem. Any ideas on how to approach it?
Thank you!

You could use the Counter class from the collections module to count how often each character occurs.
>>> from collections import Counter
>>> s = 'abcaaada'
>>> c = Counter(s)
>>> c.most_common()
[('a', 5), ('c', 1), ('b', 1), ('d', 1)]
To get the single most repetitive (common) character:
>>> c.most_common(1)
[('a', 5)]

You could do this using regex backreferences.
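One possible way to apply backreferences here, as a sketch rather than a definitive solution: capture a candidate unit and require the rest of the string to consist of further copies of it. The helper name repeated_unit is hypothetical.

```python
import re

def repeated_unit(s):
    """Return the shortest unit that repeats to form s, or None."""
    # (.+?) lazily captures a candidate unit; \1+ requires one or more
    # further copies of exactly that unit up to the end of the string.
    m = re.fullmatch(r"(.+?)\1+", s)
    return m.group(1) if m else None

for s in ("AAAA", "AGAGAG", "AAA", "AGAGA"):
    print(s, repeated_unit(s))
```

With the strings from the question this gives A, AG and A; AGAGA is not a whole number of repetitions, so it yields None.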

To find a pattern in Python, you can use regular expressions from the re module. A search is typically written as:
match = re.search(pat, str)
This is usually followed by an if-statement to determine whether the search succeeded.
For example, this is how you would find the pattern "AAAA" in a string:
import re
string = ' blah blahAAAA this is an example'
match = re.search(r'AAAA', string)
if match:
    print('found', match.group())
else:
    print('did not find')
This prints "found AAAA".
Do the same for your other two strings and it will work the same way.
Regular expressions can do a lot more than this, so experiment with them and see what else they can do.

Assuming you mean the whole string is a repeating pattern, this answer has a good solution:
def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
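For reference, here is the same function applied to the strings from the question, plus one extra non-repeating string (AGAGA) that I added as a counterexample. The idea is that a string made of repeated units reappears inside its own doubling at an offset smaller than its length.

```python
def principal_period(s):
    # Look for s inside s+s, excluding the trivial hits at offset 0 and len(s).
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

for s in ("AAAA", "AGAGAG", "AAA", "AGAGA"):
    print(s, principal_period(s))
```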

Python Regular Expressions - How is "+?" equivalent to "*"

* : 0 or more occurrences of the pattern to its left
+ : 1 or more occurrences of the pattern to its left
? : 0 or 1 occurrences of the pattern to its left
How is "+?" equivalent to "*"?
Consider a search for any 3-letter word, if it exists.
re1 = re.search(r'(\w\w\w)*', "abc")
In the case of re1, * tries to get 0 or more occurrences of the pattern to its left, which in this case is the group of 3 letters. So it will either find a 3-letter word or fail.
re2 = re.search(r'(\w\w\w)+?', "abc")
In the case of re2, it's supposed to give the same output, but I'm confused as to why "*" and "+?" are equivalent. Can you please explain this?

* and +? are not equivalent. The ? takes on a special meaning if it follows a quantifier, making that quantifier lazy.
Usually, quantifiers are greedy, meaning they will try to match as many repetitions as they can; lazy quantifiers match as few as they can. But a+? will still match at least one a.
In [1]: re.search("(a*)(.*)", "aaaaaa").groups()
Out[1]: ('aaaaaa', '')
In [2]: re.search("(a+?)(.*)", "aaaaaa").groups()
Out[2]: ('a', 'aaaaa')
In your example, both regexes happen to match the same text because both (\w\w\w)* and (\w\w\w)+? can match three letters, and there are exactly three letters in your input. But they will differ in other strings:
In [12]: re.search(r"(\w\w\w)+?", "abcdef")
Out[12]: <_sre.SRE_Match object; span=(0, 3), match='abc'>
In [13]: re.search(r"(\w\w\w)+?", "ab") # No match
In [14]: re.search(r"(\w\w\w)*", "abcdef")
Out[14]: <_sre.SRE_Match object; span=(0, 6), match='abcdef'>
In [15]: re.search(r"(\w\w\w)*", "ab")
Out[15]: <_sre.SRE_Match object; span=(0, 0), match=''>

If you run a simpler expression, you will see that they are not the same:
>>> import re
>>> re.search("[0-9]*", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search("[0-9]*", "")
<_sre.SRE_Match object; span=(0, 0), match=''>
>>> re.search("[0-9]+", "")
>>> re.search("[0-9]+", "1")
<_sre.SRE_Match object; span=(0, 1), match='1'>
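The problem in your code is the (\w\w\w)+? part: +? still means one or more occurrences (matched lazily), whereas * means zero or more, so only * can match nothing.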

Matching a Regex against a multiline string

I am trying to match a Regex against a multi-line string, but the match fails after the first line.
These expressions work as expected:
>>> import re
>>> r = re.compile("a")
>>> a = "a"
>>> r.match(a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> a = "a\n"
>>> r.match(a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>>
Whilst this expression does not work:
>>> a = "\na"
>>> r.match(a)
>>>

re.match was designed to match from the first character (the start) of the string. In the first two examples, the match works fine because a is the first character. In the last example however, the match fails because \n is the first character.
You need to use re.search in this case to have Python search for the a:
>>> import re
>>> r = re.compile("a")
>>> a = "\na"
>>> r.search(a)
<_sre.SRE_Match object; span=(1, 2), match='a'>
>>>
Also, just a note: if you are working with multi-line strings and want . to match newlines as well, you need to set the dot-all flag, re.DOTALL.
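The points above can be checked with a short script. It also uses re.MULTILINE, which is not mentioned in this answer but is a related standard flag that lets ^ match at the start of each line.

```python
import re

text = "\na"

print(re.match("a", text))          # None: match only tries position 0
print(re.search("a", text).span())  # (1, 2): search scans the whole string

# With re.MULTILINE, ^ also matches right after a newline.
print(re.search("^a", text, re.MULTILINE).span())  # (1, 2)

# By default "." does not match a newline; re.DOTALL changes that.
print(re.search("a.b", "a\nb"))                    # None
print(re.search("a.b", "a\nb", re.DOTALL).span())  # (0, 3)
```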

Why doesn't match work?
match looks for the pattern only at the start of the string.
How to correct it?
Use search instead:
>>> import re
>>> pat=re.compile('a')
>>> pat.search('\na')
<_sre.SRE_Match object at 0x7faef636d440>
>>>
