python Regex match exact word - python

I am trying to match different expressions for addresses:
Example: '398 W. Broadway'
I would like to match W. or E. (east) or Pl. for place ...etc
It is very simple using this regex
(W.|West) for example.
Yet python re module doesn't match anything when I input that
>>> a
'398 W. Broadway'
>>> x = re.match('(W.|West)', a)
>>> x
>>> x == None
True
>>>

re.match matches at the beginning of the input string.
To match anywhere, use re.search instead.
>>> import re
>>> re.match('a', 'abc')
<_sre.SRE_Match object at 0x0000000001E18578>
>>> re.match('a', 'bac')
>>> re.search('a', 'bac')
<_sre.SRE_Match object at 0x0000000002654370>
See search() vs. match():
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).

.match() constrains the search to begin at the first character of the string. Use .search() instead. Note too that . matches any character (except a newline). If you want to match a literal period, escape it (\. instead of plain .).

Related

Python: Difference between "re.match(pattern)" v/s "re.search('^' + pattern)" [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it a good practice to only use re.search()?
You should take a look at Python's re.search() vs. re.match() document which clearly mentions about the other difference which is:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) being:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
If you look at this from a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're looking from the beginning of the string, re.match, would be preferable to re.search, because the former has one character less in its name, thus saving a byte. Furthermore, with re.search, you also have to add the start-of-line anchor ^ to signify matching from the start. You don't need to specify this with re.match because it is implied, further saving another byte.

Add [] around numbers in strings

I like to add [] around any sequence of numbers in a string e.g
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way but its eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar
Here the (\d+) will match any combinations of digits and re.sub function will replace it with the first group match within brackets r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/

Python - why doesn't this simple regex work?

This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what re.match() says If zero or more characters at the beginning of string match the regular expression pattern ...
In your case the string doesn't have any digit \d at the beginning. But for the \w it has t at the beginning at your string.
If you want to check for digit in your string using same mechanism, then add .* with your regex:
digit_regex = re.compile('.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.

Python re.search

I have a string variable containing
string = "123hello456world789"
string contain no spacess. I want to write a regex such that prints only words containing(a-z)
I tried a simple regex
pat = "([a-z]+){1,}"
match = re.search(r""+pat,word,re.DEBUG)
match object contains only the word Hello and the word World is not matched.
When is used re.findall() I could get both Hello and World.
My question is why we can't do this with re.search()?
How do this with re.search()?
re.search() finds the pattern once in the string, documenation:
Scan through string looking for a location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
In order to match every occurrence, you need re.findall(), documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Example:
>>> import re
>>> regex = re.compile(r'([a-z]+)', re.I)
>>> # using search we only get the first item.
>>> regex.search("123hello456world789").groups()
('hello',)
>>> # using findall we get every item.
>>> regex.findall("123hello456world789")
['hello', 'world']
UPDATE:
Due to your duplicate question (as discussed at this link) I have added my other answer here as well:
>>> import re
>>> regex = re.compile(r'([a-z][a-z-\']+[a-z])')
>>> regex.findall("HELLO W-O-R-L-D") # this has uppercase
[] # there are no results here, because the string is uppercase
>>> regex.findall("HELLO W-O-R-L-D".lower()) # lets lowercase
['hello', 'w-o-r-l-d'] # now we have results
>>> regex.findall("123hello456world789")
['hello', 'world']
As you can see, the reason why you were failing on the first sample you provided is because of the uppercase, you can simply add the re.IGNORECASE flag, though you mentioned that matches should be lowercase only.
#InbarRose answer shows why re.search works that way, but if you want match objects rather than just the string outputs from re.findall, use re.finditer
>>> for match in re.finditer(pat, string):
... print match.groups()
...
('hello',)
('world',)
>>>
Or alternatively if you wanted a list
>>> list(re.finditer(pat, string))
[<_sre.SRE_Match object at 0x022DB320>, <_sre.SRE_Match object at 0x022DB660>]
It's also generally a bad idea to use string as a variable name given that it's a common module.

regular expression to parse option string in python

I can't seem to create the correct regular expression to extract the correct tokens from my string. Padding the beginning of the string with a space generates the correct output, but seems less than optimal:
>>> import re
>>> s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'
>>> re.findall(r'\W(-[\w_]+)',' '+s)
['-edge_0triggered', '-level_Sensitive'] # correct output
Here are some of the regular expressions I've tried, does anyone have a regex suggestion that doesn't involve changing the original string and generates the correct output
>>> re.findall(r'(-[\w_]+)',s)
['-edge_0triggered', '-b', '-level_Sensitive', '-d', '-b', '-c']
>>> re.findall(r'\W(-[\w_]+)',s)
['-level_Sensitive']
r'(?:^|\W)(-\w+)'
\w already includes the underscore.
Change the first qualifier to accept either a beginning anchor or a not-word, instead of only a not-word:
>>> re.findall(r'(?:^|\W)(-[\w_]+)', s)
['-edge_0triggered', '-level_Sensitive']
The ?: at the beginning of the group simply tells the regex engine to not treat that as a group for purposes of results.
You could use a negative-lookbehind:
re.findall(r'(?<!\w)(-\w+)', s)
the (?<!\w) part means "match only if not preceded by a word-character".

Categories

Resources