I want to find the first number that is sandwiched with commas on either end, and I came up with this:
m = re.search("\,([0-9])*\,",line)
However, this returns the number together with the commas. How do I exclude them?
m.group(0) returns
',1620693,'
group(0) will always return the entire match.
See python documentation:
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
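Use m.group(1). m.group(0) refers to the entire match, and each number after that refers to the corresponding captured group. You also don't need to escape (backslash) the commas. Note that in ([0-9])* the * repeats the group itself, so group(1) holds only the last digit matched; put the quantifier inside the parentheses, as in ,([0-9]+), to capture the whole number.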
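To make the group(0) / group(1) difference concrete, here is a minimal sketch. The sample line is made up for illustration (only the ',1620693,' fragment comes from the question), and it also shows what happens when the quantifier sits outside the parentheses:

```python
import re

# Hypothetical input line; only ',1620693,' comes from the question.
line = "abc,1620693,def,42,"

# Quantifier inside the group: the group captures every digit.
m = re.search(r",([0-9]+),", line)
whole = m.group(0) if m else None    # entire match, commas included
number = m.group(1) if m else None   # just the captured digits
print(whole, number)

# Quantifier outside the group: the group keeps only its last repetition.
last_digit = re.search(r",([0-9])*,", line).group(1)
print(last_digit)
```

The `if m else None` guard is there because re.search returns None when nothing matches.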
I am trying to use a regular expression to extract words inside of a pattern.
I have some string that looks like this
someline abc
someother line
name my_user_name is valid
some more lines
I want to extract the word my_user_name. I do something like
import re
s = #that big string
p = re.compile("name .* is valid", re.flags)
p.match(s) # this gives me <_sre.SRE_Match object at 0x026B6838>
How do I extract my_user_name now?
You need a capture group in the regex. Search for the pattern and, if it is found, retrieve the string using group(index). Assuming valid checks are performed:
>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1) # group(1) will return the 1st capture (stuff within the brackets).
# group(0) will return the entire matched text.
'my_user_name'
You can use matching groups:
p = re.compile('name (.*) is valid')
e.g.
>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']
Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you'd need to get the data from the group on the match object:
>>> p.search(s) #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first group that match in the string that matched
'my_user_name'
As mentioned in the comments, you might want to make your regex non-greedy:
p = re.compile('name (.*?) is valid')
to only pick up the stuff between 'name ' and the next ' is valid' (rather than allowing your regex to pick up other ' is valid' in your group).
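A quick way to see the difference is a string that contains ' is valid' twice. The sample string below is invented for the demonstration and is not from the question:

```python
import re

# Hypothetical line containing ' is valid' twice.
s = "name my_user_name is valid and name other is valid"

# Greedy: .* runs to the LAST ' is valid' it can still match against.
greedy = re.search(r"name (.*) is valid", s).group(1)

# Non-greedy: .*? stops at the FIRST ' is valid'.
lazy = re.search(r"name (.*?) is valid", s).group(1)

print(repr(greedy))
print(repr(lazy))
```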
You could use something like this:
import re
s = "name my_user_name is valid"  # stands in for that big string
# the parentheses create a group with what was matched
# and '\w' matches only alphanumeric characters and underscores
p = re.compile(r"name +(\w+) +is valid")
# use search(), so the match doesn't have to happen
# at the beginning of "big string"
m = p.search(s)
# search() returns a Match object with information about what was matched,
# or None if nothing matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')
You can use groups (indicated with '(' and ')') to capture parts of the string. The match object's group() method then gives you the group's contents:
>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0) # the entire match
'name my_user_name is valid'
>>> match.group(1) # the first parenthesized subgroup
'my_user_name'
In Python 3.6+ you can also index into a match object instead of using group():
>>> match[0] # the entire match
'name my_user_name is valid'
>>> match[1] # the first parenthesized subgroup
'my_user_name'
Maybe that's a bit shorter and easier to understand:
import re
text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'
You want a capture group.
p = re.compile("name (.*) is valid") # parentheses for capture groups
print(p.search(s).groups()) # This gives you a tuple of your captured groups. search() is used because match() only matches at the start of s.
Here's a way to do it without using groups (Python 3.6 or above):
>>> re.search(r'2\d\d\d[01]\d[0-3]\d', 'report_20191207.xml')[0]
'20191207'
You can also use a named capture group, (?P<user>pattern), and access the group by name, like a dictionary key: match['user'].
import re
string = '''someline abc\n
someother line\n
name my_user_name is valid\n
some more lines\n'''
pattern = r'name (?P<user>.*) is valid'
matches = re.search(pattern, str(string), re.DOTALL)
print(matches['user'])
# my_user_name
I found this answer via google because I wanted to unpack a re.search() result with multiple groups directly into multiple variables. While this might be obvious for some, it was not for me because I always used group() in the past, so maybe it helps someone in the future who also did not know about group*s*().
s = "2020:12:30"
year, month, day = re.search(r"(\d+):(\d+):(\d+)", s).groups()
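One caveat with unpacking directly: re.search returns None when nothing matches, so calling .groups() on the result would raise AttributeError. A minimal sketch of a guarded version follows; the helper name parse_date is made up for this example:

```python
import re

def parse_date(s):
    """Return (year, month, day) as strings, or None if s does not match."""
    m = re.search(r"(\d+):(\d+):(\d+)", s)
    # re.search returns None on failure, so check before calling groups().
    return m.groups() if m else None

result = parse_date("2020:12:30")
if result is not None:
    year, month, day = result
    print(year, month, day)
```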
It seems like you're actually trying to extract a name rather than simply find a match. If this is the case, having span indexes for your match is helpful and I'd recommend using re.finditer. As a shortcut, you know the 'name ' part of your regex has length 5 and the ' is valid' part has length 9, so you can slice the matching text to extract the name.
Note - In your example, it looks like s is string with line breaks, so that's what's assumed below.
## convert s to a list of strings separated by line:
s2 = s.splitlines()
## find matches by line:
for i, j in enumerate(s2):
    ## finditer returns an iterator, which is always truthy, so no
    ## "if matches:" test is needed; lines without a match yield nothing
    for k in re.finditer("name (.*) is valid", j):
        ## get text
        match_txt = k.group(0)
        ## get line span
        match_span = k.span(0)
        ## extract username
        my_user_name = match_txt[5:-9]
        ## compare with original text
        print(f'Extracted Username: {my_user_name} - found on line {i}')
        print('Match Text:', match_txt)
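As a variation on the slicing shortcut, the capture group already knows where the name starts and ends, so group(1) and span(1) can be used instead of hard-coded offsets. This is a sketch of that alternative, not the approach taken above; the variable names are illustrative:

```python
import re

s = """someline abc
someother line
name my_user_name is valid
some more lines"""

found = []
for line_no, line in enumerate(s.splitlines()):
    for m in re.finditer(r"name (.*) is valid", line):
        # span(1) is the (start, end) offset of the captured name in the line
        found.append((line_no, m.group(1), m.span(1)))

for line_no, user, span in found:
    print(f"line {line_no}: {user} at {span}")
```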
I have the following Python regex:
>>> p = re.compile(r"(\b\w+)\s+\1")
\b : word boundary
\w+ : one or more alphanumerical characters
\s+ : one or more whitespace characters (can be ' ', \t, \n, ..)
\1 : backreference to group 1 ( = the part between (..))
This regex should find all double occurrences of a word, provided the two occurrences are next to each other with some whitespace in between.
The regex seems to work fine when using the search function:
>>> p.search("I am in the the car.")
<_sre.SRE_Match object; span=(8, 15), match='the the'>
The found match is the the, just as I had expected. The weird behaviour is in the findall function:
>>> p.findall("I am in the the car.")
['the']
The found match is now only the. Why the difference?
When using groups in a regular expression, findall() returns only the groups; from the documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
You can't avoid using groups when using backreferences, but you can put a new group around the whole pattern:
>>> p = re.compile(r"((\b\w+)\s+\2)")
>>> p.findall("I am in the the car.")
[('the the', 'the')]
The outer group is group 1, so the backreference should be pointing to group 2. You now have two groups, so there are two results per entry. Using a named group might make this more readable:
>>> p = re.compile(r"((?P<word>\b\w+)\s+(?P=word))")
You can filter that back to just the outer group result:
>>> [m[0] for m in p.findall("I am in the the car.")]
['the the']
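Another option, if you would rather leave the pattern untouched, is re.finditer: it yields match objects, so the whole match is always available as group(0) no matter how many groups the pattern has. A small sketch using the pattern and sentence from the question:

```python
import re

p = re.compile(r"(\b\w+)\s+\1")
text = "I am in the the car."

# Each item from finditer is a match object: group(0) is the whole match,
# group(1) is the captured word.
doubles = [m.group(0) for m in p.finditer(text)]
words = [m.group(1) for m in p.finditer(text)]
print(doubles)
print(words)
```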
How does one replace a pattern when the substitution itself is a variable?
I have the following string:
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
I would like to retain only the right-most word in the brackets ('merited', 'eaten', 'go'), stripping away what surrounds these words, thus producing:
merited and eaten and go
I have the regex:
p = '''\[\[[a-zA-Z]*\[|]*([a-zA-Z]*)\]\]'''
...which produces:
>>> re.findall(p, s)
['merited', 'eaten', 'go']
However, as this varies, I don't see a way to use re.sub() or s.replace().
s = '''[[merit|merited]] and [[eat|eaten]] and [[go]]'''
p = r'''\[\[[a-zA-Z]*?[|]*([a-zA-Z]*)\]\]'''
re.sub(p, r'\1', s)
The ? makes the first [a-zA-Z]* non-greedy, so that for [[go]] it matches the empty (shortest) string and the second [a-zA-Z]* captures the actual string go.
\1 substitutes first (in this case the only) match group in a pattern for each non-overlapping match in the string s. r'\1' is used so that \1 is not interpreted as the character with code 0x1
Well, first you need to fix your regex to capture the whole group:
>>> s = '[[merit|merited]] and [[eat|eaten]] and [[go]]'
>>> p = '(\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\])'
>>> re.findall(p, s)
[('[[merit|merited]]', 'merited'), ('[[eat|eaten]]', 'eaten'), ('[[go]]', 'go')]
This matches the whole [[whateverisinhere]] and separates the whole match as group 1 and just the final word as group 2. You can then use the \2 token to replace the whole match with just group 2:
>>> re.sub(p,r'\2',s)
'merited and eaten and go'
or change your pattern to:
p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
which gets rid of grouping the entire match as group 1 and only groups what you want. You can then do:
>>> re.sub(p,r'\1',s)
to have the same effect.
POST EDIT:
I forgot to mention that I actually changed your regex so here is the explanation:
\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]
\[\[                           \]\] #literal matches of brackets
    (?:           )*                #non-capturing group that can match 0 or more of what's inside
       [a-zA-Z]*\|                  #matches any word that is followed by a '|' character
                    (   ...   )     #captures into group one the final word
I feel like this is more robust than what you originally had because it also works when there are more than two options:
>>> s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'
>>> p = '\[\[(?:[a-zA-Z]*\|)*([a-zA-Z]*)\]\]'
>>> re.sub(p,r'\1',s)
'merited and eaten and go'
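For cases where the replacement really has to be computed per match, re.sub also accepts a function in place of the replacement string. The sketch below is an alternative to the backreference approach above, not a restatement of it; the helper name last_option and the simplified pattern are assumptions made for the example:

```python
import re

s = '[[merit|merited]] and [[ate|eat|eaten]] and [[go]]'

def last_option(match):
    # match.group(1) is the text between the brackets, e.g. 'ate|eat|eaten'
    return match.group(1).split('|')[-1]

# The function is called once per match and its return value is substituted.
result = re.sub(r'\[\[([a-zA-Z|]*)\]\]', last_option, s)
print(result)
```

The trade-off is readability: the pattern stays simple and the "keep the last word" rule lives in ordinary Python.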
I have a sequence
seq = 'CCGATGACCTCACAGCCGCCTCCTGCCTTGAGGAAAGGAACTGCAATTCC'
I am trying to get sequences of length 23 bp ending with NGG, where N = A, C, T, or G.
I use p = re.compile('([ACGT]{21}GG)') and when I say
for m in p.finditer(seq):
    print(m.group())
I get only CACAGCCGCCTCCTGCCTTGAGG.
But clearly CCGCCTCCTGCCTTGAGGAAAGG also matches. Why is it not reported?
Put your regex inside a lookahead assertion in order to get overlapping matches. Your regex fails to report both strings because the second match overlaps the first, and by default the regex engine does not return overlapping matches: each search resumes where the previous match ended. Wrapping the pattern in a lookaround, with a capture group inside it, lets you capture the overlapping matches.
(?=([ACGT]{21}GG))
Code:
>>> seq = 'CCGATGACCTCACAGCCGCCTCCTGCCTTGAGGAAAGGAACTGCAATTCC'
>>> p = re.compile(r'(?=([ACGT]{21}GG))')
>>> for m in p.finditer(seq):
...     print(m.group(1))
...
CACAGCCGCCTCCTGCCTTGAGG
CCGCCTCCTGCCTTGAGGAAAGG
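For reference, here is the same idea as a standalone script that also records where each hit starts. It is a sketch using the sequence from the question; the variable names are illustrative:

```python
import re

seq = 'CCGATGACCTCACAGCCGCCTCCTGCCTTGAGGAAAGGAACTGCAATTCC'

# The lookahead consumes no characters, so the scan advances one position
# at a time and the capture group inside it is free to overlap.
p = re.compile(r'(?=([ACGT]{21}GG))')
hits = [(m.start(), m.group(1)) for m in p.finditer(seq)]

for start, site in hits:
    print(start, site)
```

Note that m.group(0) is an empty string here, since the lookahead itself matches nothing; the sequence lives in group(1).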
I am trying to get first pair of numbers from "09_135624.jpg"
My code now:
import re
string = "09_135624.jpg"
pattern = r"(?P<pair>(.*))_135624.jpg"
match = re.findall(pattern, string)
print match
Output:
[('09', '09')]
Why do I have a tuple in the output?
Can you help me modify my code to get this:
['09']
Or:
'09'
re.findall returns different things depending on the number of capturing groups in the pattern:
>>> re.findall(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
['09']
According to the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
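The three cases described in that passage can be checked side by side. This is a small sketch using the filename from the question with unnamed groups for brevity:

```python
import re

name = "09_135624.jpg"

# No groups: findall returns the whole matches.
no_group = re.findall(r".*_135624\.jpg", name)
# One group: findall returns just that group's text.
one_group = re.findall(r"(.*)_135624\.jpg", name)
# Two groups (one nested in the other): findall returns tuples.
two_groups = re.findall(r"((.*))_135624\.jpg", name)

print(no_group)
print(one_group)
print(two_groups)
```

The last case is what the pattern in the question does: the named group and the extra parentheses inside it both capture '09'.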
Alternative using re.search:
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
<_sre.SRE_Match object at 0x00000000025D0D50>
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group('pair')
'09'
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group(1)
'09'
UPDATE
To match . literally, you need to escape it: \..
(?P<pair>(?:.*))_135624.jpg
Try this. You are getting two results because you are capturing them twice. I have modified it to capture only once:
http://regex101.com/r/lS5tT3/62