I'm trying to apply regex to gather data from a file, and I only get empty results. For example, this little test (python3.4.3)
import re
a = 'abcde'
r = re.search('a',a)
print(r.groups())
exit()
This results in an empty tuple (()). Clearly, I'm doing something wrong here, but what?
Comment:
What I'm actually trying to do is to interpret expressions such as 0.7*sqrt(2), by finding the value inside the parentheses.
It happens because there are no groups in your regex. If you replace it with:
>>> r = re.search('(a)',a)
you'll get the groups:
>>> print(r.groups())
('a',)
Calling group() instead of groups() works with the original pattern, because group() with no argument returns the whole match:
>>> print(re.search('a',a).group())
a
r.groups() returns an empty tuple, because your regular expression did not contain any group.
>>> import re
>>> a = 'abcde'
>>> re.search('a', a)
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('a', a).groups()
()
>>> re.search('(a)', a).groups()
('a',)
Have a look at the re module documentation:
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group;
Edit: If you want to catch the bit between the parentheses in the expression 0.7*sqrt(2), you could use the following pattern:
>>> re.search(r'[\d\.]+\*sqrt\((\d)\)', '0.7*sqrt(2)').group(1)
'2'
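Note that the pattern above captures only a single digit. As a rough sketch, assuming you want whatever text sits between the first pair of parentheses (the helper name inner_value is made up for illustration), something like this may be closer to what you need:

```python
import re

def inner_value(expression):
    """Return the text between the first '(' and the next ')', or None."""
    # \( and \) match literal parentheses; ([^)]*) captures everything
    # up to (but not including) the closing parenthesis.
    match = re.search(r'\(([^)]*)\)', expression)
    return match.group(1) if match else None

print(inner_value('0.7*sqrt(2)'))
print(inner_value('0.7*sqrt(12.5)'))
print(inner_value('no parentheses here'))
```

This does not handle nested parentheses; it is only meant to show the capture group doing the work.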
Related
The code below should be self-explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case, the string doesn't have a digit (\d) at the beginning. For \w, however, it has t at the beginning of the string.
If you want to check for a digit in your string using the same mechanism, prepend .* to your regex:
digit_regex = re.compile(r'.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
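To make the difference concrete, here is a small sketch comparing match, search and findall on the string from the question (the variable names are just for illustration):

```python
import re

string = 'this is a string with a 4 digit in it'
digit_regex = re.compile(r'\d')

anchored = digit_regex.match(string)       # only tries position 0
anywhere = digit_regex.search(string)      # scans forward for the first match
all_digits = digit_regex.findall(string)   # list of every non-overlapping match

print(anchored)
print(anywhere.group(), anywhere.start())
print(all_digits)
```

match returns None here because the string starts with a letter, while search and findall both find the 4.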
In the case of re.search(), is there a way I can get hold of just the part of input string that matches the regex? i.e. I just want the "heeehe" part and not the stuff that comes before it:
>>> s = "i got away with it, heeehe"
>>> import re
>>> match = re.search("he*he", s)
>>> match.string
'i got away with it, heeehe'
>>> match.?
'heeehe'
match.group(0) is the matched string.
Demo:
>>> import re
>>> s = "i got away with it, heeehe"
>>> match = re.search("he*he", s)
>>> match.group(0)
'heeehe'
You can also omit the argument; 0 is the default.
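If you also need to know where the match sits in the input, the match object exposes the indices. A short sketch using the same string:

```python
import re

s = "i got away with it, heeehe"
match = re.search(r"he*he", s)

matched_text = match.group()        # identical to match.group(0)
start, end = match.span()           # indices into the original string
sliced = match.string[start:end]    # slicing the input gives the same text

print(matched_text, start, end, sliced)
```

Slicing match.string with span() is equivalent to group() here; group() is simply the shorter way to write it.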
I am trying to match different expressions for addresses:
Example: '398 W. Broadway'
I would like to match W. or E. (east) or Pl. for place ...etc
This should be very simple using a regex such as
(W.|West)
yet Python's re module doesn't match anything when I try it:
>>> a
'398 W. Broadway'
>>> x = re.match('(W.|West)', a)
>>> x
>>> x == None
True
>>>
re.match matches at the beginning of the input string.
To match anywhere, use re.search instead.
>>> import re
>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x0000000001E18578>
>>> re.match('a', 'bac')
>>> re.search('a', 'bac')
<_sre.SRE_Match object at 0x0000000002654370>
See search() vs. match():
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
.match() constrains the search to begin at the first character of the string. Use .search() instead. Note too that . matches any character (except a newline). If you want to match a literal period, escape it (\. instead of plain .).
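Putting both points together, a minimal sketch with the address from the question (the second address, '12 Water St', is a made-up example to show what the unescaped dot does):

```python
import re

a = '398 W. Broadway'

# match() only tries the start of the string, where '3' is not 'W'.
at_start = re.match(r'(W\.|West)', a)

# search() scans the whole string; \. matches a literal period only.
anywhere = re.search(r'(W\.|West)', a)

# An unescaped '.' matches any character, so 'W.' also matches 'Wa'.
unescaped = re.search(r'W.', '12 Water St')
escaped = re.search(r'W\.', '12 Water St')

print(at_start, anywhere.group(1), unescaped.group(), escaped)
```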
I am parsing a log file, and I am trying to match for expressions like "key"=>"value" to extract the value. I can't figure out how to match for the greater than symbol.
I am not having any luck with re.match(r">", ...), r"\>", or r"\\r". How do I do this?
re.match only matches from the start of the string. Try re.search:
>>> re.match(">", "a>b")
>>> re.search(">", "a>b")
<_sre.SRE_Match object at 0x7f4dd577e3d8>
or match the other text first:
>>> re.match(".*>", "a>b")
<_sre.SRE_Match object at 0x7f4dd577e440>
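For pulling the values out, a sketch along these lines might work. The sample line and the name pair_regex are invented for illustration, and it assumes keys and values contain no embedded double quotes:

```python
import re

# Made-up sample line; real log lines may differ.
line = 'event {"user"=>"alice", "status"=>"ok"}'

# '>' has no special meaning in a regex, so it needs no escaping.
pair_regex = re.compile(r'"([^"]+)"=>"([^"]*)"')

pairs = pair_regex.findall(line)   # list of (key, value) tuples
values = dict(pairs)

print(pairs)
print(values['status'])
```

With two capture groups, findall returns a list of tuples, which converts directly into a dict.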
I knew that [] denotes a set of allowable characters -
>>> p = r'^[ab]$'
>>>
>>> re.search(p, '')
>>> re.search(p, 'a')
<_sre.SRE_Match object at 0x1004823d8>
>>> re.search(p, 'b')
<_sre.SRE_Match object at 0x100482370>
>>> re.search(p, 'ab')
>>> re.search(p, 'ba')
But ... today I came across an expression with vertical bars within parentheses to define mutually exclusive patterns -
>>> q = r'^(a|b)$'
>>>
>>> re.search(q, '')
>>> re.search(q, 'a')
<_sre.SRE_Match object at 0x100498dc8>
>>> re.search(q, 'b')
<_sre.SRE_Match object at 0x100498e40>
>>> re.search(q, 'ab')
>>> re.search(q, 'ba')
This seems to mimic the same functionality as above, or am I missing something?
PS: In Python, parentheses themselves are used to define logical groups of matched text. If I use the second technique, then how do I use parentheses for both jobs?
In this case it is the same.
However, the alternation is not just limited to a single character. For instance,
^(hello|world)$
will match "hello" or "world" (and only these two inputs) while
^[helloworld]$
would just match a single character ("h" or "w" or "d" or whatnot).
Happy coding.
[ab] matches one character (a or b) and doesn't capture it. (a|b) matches a or b and captures it. In this case there is no big difference, but in more complex cases [] can only contain characters and character classes, while (|) can contain arbitrarily complex regexes on either side of the pipe.
In the example you gave they are interchangeable. There are some differences worth noting:
Inside a character class you don't have to escape anything but a backslash, a closing square bracket, a dash (unless it is the first or last character), or the caret ^ (and then only if it's the first character).
Parentheses capture matches so you can refer to them later. Character class matches don't do that.
You can match multi-character strings in parentheses but not in character classes.
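As for the PS in the question: if you want alternation without creating a capture group, the re module supports non-capturing groups written as (?:...). A small sketch of the differences listed above (the test strings are made up):

```python
import re

text = 'hello world'

# Capturing alternation: the chosen alternative is stored as group 1.
capturing = re.search(r'^(hello|world) (\w+)$', text)

# (?:...) groups for alternation only and does not create a capture group,
# so the (\w+) that follows becomes group 1.
non_capturing = re.search(r'^(?:hello|world) (\w+)$', text)

# A character class matches exactly one character from the set.
one_char = re.search(r'^[helloworld]$', 'h')
whole_word = re.search(r'^[helloworld]$', 'hello')

print(capturing.groups())
print(non_capturing.groups())
print(one_char.group(), whole_word)
```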