I am trying to capture element names and digits as regex groups. Even in the simplest case, shown below, I can't:
>>> import re
>>> t = "Fe35C65"
>>> m = re.match("(\D*\d+\D*\d+)", t)
>>> print(m.group(1))
Fe35C65
>>> print(m.group(0))
Fe35C65
>>> print(m)
<_sre.SRE_Match object; span=(0, 7), match='Fe35C65'>
>>>
What I am looking for is output like
Name[0] = "Fe", Name[1] = "C"
Num[0] = 35, Num[1] = 65
Here there are 2 sets, but the number should not be limited. The underlying problem, though, is extracting the data.
The problem is that re.match only returns 1 match and the number of capturing groups is fixed.
To match multiple occurrences of your pattern, you may use re.findall with the r'(\D*)(\d+)' pattern, which matches and captures 0+ non-digit symbols into Group 1 and then 1+ digits into Group 2:
re.findall(r'(\D*)(\d+)', t)
Since re.findall returns captured texts only, you will get a list of 2-element tuples.
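As a minimal sketch of how that list of tuples could be turned into the Name and Num lists asked for in the question (the variable names pairs, names and nums are illustrative, not from the original post):

```python
import re

t = "Fe35C65"

# Each tuple holds (non-digits, digits) for one occurrence of the pattern.
pairs = re.findall(r'(\D*)(\d+)', t)

# Split the pairs into two parallel lists; the digits are converted to int.
names = [name for name, _ in pairs]
nums = [int(num) for _, num in pairs]

print(pairs)  # [('Fe', '35'), ('C', '65')]
print(names)  # ['Fe', 'C']
print(nums)   # [35, 65]
```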
Alternative solution with the PyPI regex module
You can both validate a string and capture all occurrences of the repeated pair of groups with the PyPI regex module like this:
>>> import regex
>>> t = "Fe35C65"
>>> pat = r"(?:(\D*)(\d+))+"
>>> m = regex.fullmatch(pat, t)
>>> if m:
...     print(list(zip(m.captures(1), m.captures(2))))
...
[('Fe', '35'), ('C', '65')]
The point here is:
(?:(\D*)(\d+))+ matches 1+ occurrences of (Group 1) 0+ non-digits and (Group 2) 1+ digits (extraction)
regex.fullmatch requires the entire string to match the pattern (validation)
The captures are stored in a group capture stack and can be accessed with .captures(n).
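If installing the third-party module is not an option, a similar validate-then-extract flow can be approximated with the standard re module in two steps. This is only a sketch: the helper name parse_formula is hypothetical, and it assumes Python 3.4+ for re.fullmatch.

```python
import re

def parse_formula(text):
    """Return a list of (name, digits) pairs, or None if text is not valid."""
    # Validation: the whole string must be a sequence of "non-digits, digits" units.
    if re.fullmatch(r'(?:\D*\d+)+', text) is None:
        return None
    # Extraction: collect every (non-digits, digits) pair.
    return re.findall(r'(\D*)(\d+)', text)

print(parse_formula("Fe35C65"))  # [('Fe', '35'), ('C', '65')]
print(parse_formula("Fe35C"))    # None, because the trailing "C" has no digits
```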
If there can be many pairs, you can iterate over them with finditer:
import re
x = "Fe35C65"
m = re.compile(r"(\D+)(\d+)")
for i in m.finditer(x):
    print(i.groups())
Output:
('Fe', '35')
('C', '65')
Metacharacter +: one or more occurrences. What is the general method to get the number of these occurrences?
For example:
import re
x = re.finditer(r'0(10)+(20)+', '0001010202020000')
for i in x:
    print(i)  # <re.Match object; span=(2, 13), match='01010202020'>
I want to get: [('01', 2), ('02', 3)] due to (10)+ and (20)+ in regex.
One way you could do this is to enclose each repeating capture group inside another group, then you can divide the length of the outer match by the length of the inner match to determine how many times each inner group matched. For example:
import re
m = re.search(r'0((10)+)((20)+)', '0001010202020000')
num_grps = len(m.groups())
for i in range(1, num_grps + 1, 2):
    outer = m.end(i) - m.start(i)
    inner = m.end(i + 1) - m.start(i + 1)
    print((m.group(i + 1), outer // inner))
Output:
('10', 2)
('20', 3)
You are using re, but as an alternative, with the PyPI regex module you can keep the same pattern and count the result of captures(), which gives a list of all the captures of a group.
import regex
x = regex.search(r'0(10)+(20)+', '0001010202020000')
res = []
for i, val in enumerate(x.groups(), 1):
    res.append((val, len(x.captures(i))))
print(res)
Output
[('10', 2), ('20', 3)]
This is not supported by the regular expression engine; you have to do it yourself. In this case you can capture both the repeated string and the result of the repetition, then count the repetitions yourself.
import re
matches = re.finditer(r'0((10)+)((20)+)', '0001010202020000')
for match in matches:
    # groups() is (outer1, inner1, outer2, inner2); zip pairs them up two at a time
    item = [(pattern, len(instance) // len(pattern))
            for instance, pattern in zip(*[iter(match.groups())] * 2)]
    print(item)
Note that this won't work for subpatterns that are not of fixed length. In that case, you would have to run findall (or finditer) on the matched group itself and count the results.
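A rough sketch of that fallback for variable-length subpatterns follows. The input string and the subpatterns a+b and c+d are made up for illustration; the idea is to capture each whole repetition in an outer group and then count the inner pattern's matches inside it.

```python
import re

text = '0abaabcdccdcd'

# Outer groups capture the full repetition; the inner units are non-capturing.
m = re.search(r'0((?:a+b)+)((?:c+d)+)', text)

# Re-run each inner subpattern on its outer group and count the matches.
sub_patterns = [r'a+b', r'c+d']
counts = [len(re.findall(sub, m.group(i)))
          for i, sub in enumerate(sub_patterns, 1)]

print(m.groups())  # ('abaab', 'cdccdcd')
print(counts)      # [2, 3]
```

Dividing lengths would not work here, because 'ab' and 'aab' are both single repetitions of a+b but have different lengths.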
Let's say I have a string :
s = "id_john, num847, id_000, num___"
I know how to retrieve either of 2 patterns with |:
re.findall("id_[a-z]+|num[0-9]+", s)
# ['id_john', 'num847']  # OK
I know how to capture a portion only of a match with parenthesis:
re.findall("id_([a-z]+)", s)
# ['john']
But I fail when I try to combine those two features. This is my desired output:
# ['john', '847']
Thanks for your help. (I work with Python.)
No need for lookaheads or complex patterns.
Consider this:
>>> re.findall('id_([a-z]+)|num([0-9]+)', s)
[('john', ''), ('', '847')]
When the first pattern matches, the first group will contain the match, and the second group will be empty. When the second pattern matches, the first group is empty, and the second group contains the match.
Since one of the two groups will always be empty, joining them couldn't hurt.
>>> [a+b for a,b in re.findall('id_([a-z]+)|num([0-9]+)', s)]
['john', '847']
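A variation on the same idea, sketched here under the assumption that exactly one group participates in each match, is to use finditer and pick the participating group via the match object's lastindex attribute instead of concatenating strings:

```python
import re

s = "id_john, num847, id_000, num___"
pattern = re.compile(r'id_([a-z]+)|num([0-9]+)')

# m.lastindex is the index of the last group that took part in the match
# (1 for the id_ branch, 2 for the num branch).
values = [m.group(m.lastindex) for m in pattern.finditer(s)]

print(values)  # ['john', '847']
```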
You may use this code in Python with lookaheads:
>>> s = "id_john, num847, id_000, num___"
>>> print(re.findall(r'(?:id_(?=[a-z]+\b)|num(?=\d+\b))([a-z\d]+)', s))
['john', '847']
RegEx Details:
(?: : Start non-capture group
id_(?=[a-z]+\b) : Match id_ with a lookahead assertion to make sure it is followed by [a-z]+ characters and then a word boundary
| : OR
num(?=\d+\b) : Match num with a lookahead assertion to make sure it is followed by digits and then a word boundary
) : End non-capture group
([a-z\d]+) : Match and capture 1+ lowercase letters or digits
I have the following Python regex:
>>> p = re.compile(r"(\b\w+)\s+\1")
\b : word boundary
\w+ : one or more word characters (letters, digits, or underscore)
\s+ : one or more whitespace characters (space, \t, \n, ...)
\1 : backreference to group 1 (the part between the parentheses)
This regex should find all double occurrences of a word, provided the two occurrences are next to each other with some whitespace in between.
The regex seems to work fine when using the search function:
>>> p.search("I am in the the car.")
<_sre.SRE_Match object; span=(8, 15), match='the the'>
The found match is the the, just as I had expected. The weird behaviour is in the findall function:
>>> p.findall("I am in the the car.")
['the']
The found match is now only the. Why the difference?
When using groups in a regular expression, findall() returns only the groups; from the documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:
>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]
The outer group is group 1, so the backreference should be pointing to group 2. You now have two groups, so there are two results per entry. Using a named group might make this more readable:
>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")
You can filter that back to just the outer group result:
>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
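Another option, sketched below, is to keep the original single-group pattern and iterate with finditer, which yields match objects, so the whole match is available as group(0) no matter how many groups the pattern has:

```python
import re

p = re.compile(r"(\b\w+)\s+\1")

# finditer returns match objects; group(0) is the entire matched text.
doubles = [m.group(0) for m in p.finditer("I am in the the car.")]

print(doubles)  # ['the the']
```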
Consider the text below:
foobar¬
nextline
The regex (.*?(?: *?\n)) matches foobar¬
where ¬ denotes a newline \n.
Why does the regex match it? Shouldn't the non-capturing group exclude it?
Tested on Regex101 for the python dialect.
“Non-capturing group” refers to the fact that matches within that group will not be available as separate groups in the resulting match object. For example:
>>> re.search('(foo)(bar)', 'foobarbaz').groups()
('foo', 'bar')
>>> re.search('(foo)(?:bar)', 'foobarbaz').groups()
('foo',)
However, everything that is part of an expression is matched and as such appears in the resulting match (Group 0 shows the whole match):
>>> re.search('(foo)(bar)', 'foobarbaz').group(0)
'foobar'
>>> re.search('(foo)(?:bar)', 'foobarbaz').group(0)
'foobar'
If you don’t want to match that part but still want to make sure it’s there, you can use a lookahead expression:
>>> re.search('(foo)(?=bar)', 'foobarbaz')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> print(re.search('(foo)(?=bar)', 'foobaz'))
None
So in your case, you could use (.*?(?= *?\n)).
The \n is captured because the non-capturing group is inside the capturing group:
>>> s = 'foobar\nnextline'
>>> re.search(r'(.*?(?: *?\n))', s).groups()
('foobar\n',)
If you don't want that, place the non-capturing group outside of the capturing one:
>>> re.search(r'(.*?)(?: *?\n)', s).groups()
('foobar',)
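To put the three variants side by side, here is a small sketch using the same sample string (the variable names inside, outside and lookahead are illustrative only):

```python
import re

s = 'foobar\nnextline'

# Non-capturing group inside the capturing group: \n is part of group 1.
inside = re.search(r'(.*?(?: *?\n))', s)

# Non-capturing group outside: \n is consumed by the match but not captured.
outside = re.search(r'(.*?)(?: *?\n)', s)

# Lookahead: \n must follow, but it is not part of the match at all.
lookahead = re.search(r'(.*?(?= *?\n))', s)

print(repr(inside.group(1)))     # 'foobar\n'
print(repr(outside.group(1)))    # 'foobar'
print(repr(outside.group(0)))    # 'foobar\n'
print(repr(lookahead.group(0)))  # 'foobar'
```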
I'm using python and regex (new to both) to find sequence of chars in a string as follows:
Grab the first instance of p followed by a number (it will always have the form p_ _, where each _ is a digit). Then find either an 's' or a 'go', then all integers until the end of the string. For example:
ascjksdcvyp12nbvnzxcmgonbmbh12hjg23
should yield p12 go 12 23.
ascjksdcvyp12nbvnzxcmsnbmbh12hjg23
should yield p12 s 12 23.
I've only managed to get the p12 part of the string and this is what I've tried so far to extract the 'go' or 's':
decoded = (re.findall(r'([p][0-9]*)',myStr))
print(decoded)  # prints ['p12']
I know by doing something like
re.findall(r'[s]|[go]',myStr)
will give me all occurrences of s and g and o, but something like that is not what I'm looking for. And I'm not sure how I'd combine these regexes to get the desired output.
Use re.findall with pattern grouping:
>>> string = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 'go', '12', '23')]
>>> string = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 's', '12', '23')]
With re.findall, only the text matched by the capturing groups () is returned
p\d{2} matches p followed by exactly two digits
After that .* matches anything
Then, s|go matches either s or go
\D* matches any number of non-digits
\d+ indicates one or more digits
(?:) is a non-capturing group i.e. the match inside won't show up in the output, it is only for the sake of grouping tokens
Note:
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+?', string)
[('p12', 's', '12')]
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', string)
[('p12', 's', '23')]
It would be tempting to use one of the two patterns above, since matching the later digits is a repeated task, but both the non-greedy and the greedy version fall short: a repeated capturing group keeps only its last capture. Hence the digits after s or go have to be matched explicitly.
First, try to match your line with a minimal pattern, as a test. Use (capturing) and (?:non-capturing) parentheses to capture the interesting parts and not capture the uninteresting parts. Store away what you care about, then chop off the remainder of the string and search for numbers as a second step.
import re

line = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'  # sample input from the question
# ^.*?p is lazy so that the first p followed by two digits is used
simple_test = r'^.*?p(\d{2}).*?(?:s|go).*?(\d+)'
m = re.match(simple_test, line)
if m is not None:
    p_num = m.group(1)
    trailing_numbers = [m.group(2)]
    remainder = line[m.end():]                # everything after the first number
    trailing_numbers.extend(                  # extend list by appending
        map(                                  # list from applying
            lambda m: m.group(1),             # get group(1) from match
            re.finditer(r"(\d+)", remainder)  # of each number in string
        )
    )
    print("P:", p_num, "Numbers:", trailing_numbers)
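For completeness, here is a compact sketch of the same two-step idea that also returns the 's' or 'go' token, as the question asks. The function name decode is hypothetical, and the sketch assumes the first 's' or 'go' after the p-number is the one wanted (the findall answer above uses a greedy .* and would pick the last one instead).

```python
import re

def decode(line):
    """Return ['pNN', 's' or 'go', number, ...] or None if the head is missing."""
    # Step 1: first p followed by two digits, then the nearest 'go' or 's'.
    head = re.search(r'(p\d{2}).*?(go|s)', line)
    if head is None:
        return None
    # Step 2: every run of digits in the rest of the string.
    numbers = re.findall(r'\d+', line[head.end():])
    return [head.group(1), head.group(2)] + numbers

print(decode('ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'))  # ['p12', 'go', '12', '23']
print(decode('ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'))   # ['p12', 's', '12', '23']
```

Because the trailing numbers come from findall on the remainder, this works for any number of integers, not just two.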