Python regular expressions match end of word - python

For example, how can I match the second _ab in the sentence _ab_ab is a test? I tried \> to match the end of a word, but it does not work in Python 2.7. Note: I want to match the end of a single word, not the end of the string.
There are implicit answers in other posts, but I believe a question like this deserves a simple, direct answer. I am asking after reading the following posts and finding no direct, concise solution:
Python Regex to find whitespace, end of string, and/or word boundary
Does Python re module support word boundaries (\b)?

You may use the word boundary \b at the end of the pattern. Note that adding \b before _ab won't work, because a b (a word character) precedes the underscore, and the underscore is itself a word character. \b matches only between a word character and a non-word character (or vice versa).
r'_ab\b'
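As a quick check of the point above, here is a minimal sketch (the variable names are illustrative and not from the original post) that compares \b placed after and before _ab on the example sentence:

```python
import re

sentence = "_ab_ab is a test"

# \b after _ab: only the _ab that ends the word qualifies.
end_spans = [m.span() for m in re.finditer(r"_ab\b", sentence)]

# \b before _ab: only the _ab at the start of the word qualifies,
# because b and _ are both word characters (no boundary between them).
start_spans = [m.span() for m in re.finditer(r"\b_ab", sentence)]

print(end_spans)    # [(3, 6)]
print(start_spans)  # [(0, 3)]
```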

_ab(?!\w) #if you want `_` as word character
or
_ab(?![a-zA-Z0-9])
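You can simply use a negative lookahead to indicate the end of a word.
import re
test_str = "_ab_ab"
p = re.compile(r'_ab(?!\w)')  # treats the underscore as a word character too
print(p.findall(test_str))  # ['_ab'] (only the second one)
# or, if the underscore should not count as a word character:
p = re.compile(r'_ab(?![a-zA-Z0-9])')
print(p.findall(test_str))  # ['_ab', '_ab']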

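Use r'\>' rather than just '\>'.
I found this solution after reading this post: https://stackoverflow.com/a/3995242/2728388
When using the re module in Python, remember Python's raw string notation: add an r prefix so that the backslashes in your regular expressions reach the regex engine unchanged. Be aware, though, that Python's re module does not define \> as an end-of-word anchor; it matches a literal > character, so \b remains the construct to use for word ends.
Are there any other solutions, such as using the word boundary \b?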
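The following sketch, assuming Python 3 and using illustrative variable names, shows what the r prefix changes and how re treats \> in practice:

```python
import re

# Without the r prefix, "\b" is a single backspace character.
plain_len = len("\b")
raw_len = len(r"\b")

# With a raw string, \b reaches the regex engine as a word boundary.
word_end = re.findall(r"_ab\b", "_ab_ab is a test")

# In re, \> is just an escaped literal ">" rather than a word-end anchor.
literal_gt = re.findall(r"\>", "a > b")

print(plain_len, raw_len)  # 1 2
print(word_end)            # ['_ab']
print(literal_gt)          # ['>']
```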

import re
string = '''ab_ab _ab_ab ab__ab abab_ ab_ababab_ '''
patt = re.compile(r'_ab\b')
# \b makes this match _ab only where it ends a word
allmatches = patt.findall(string)
print(allmatches)
This will match every _ab that ends a word in the string.

Related

How to write a word boundary inside a character class in Python without losing its meaning? I wish to add underscore (_) to the definition of word boundary (\b)

I am aware that the definition of a word boundary is (?<!\w)(?=\w)|(?<=\w)(?!\w)
and I wish to (optionally) include the underscore in that definition.
One way of doing it is to simply modify the definition,
so that the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w)),
but I don't wish to use such a long expression.
An easier approach might be this:
If I could write a word boundary inside a character class, then adding the underscore would be very easy, as in [\b_]. The problem is that \b inside a character class, i.e. [\b], means the backspace character, not a word boundary.
Please tell me how to put \b inside a character class without losing its original meaning.
You may use lookarounds:
(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^
Here, (?:\b|(?<=_)) is a non-capturing group matching either a word boundary or a location preceded by _, and (?=\b|_) is a positive lookahead requiring either a word boundary or a _ symbol right after the word.
Unfortunately, Python re won't allow (?<=\b|_), because a lookbehind pattern must have a fixed width (otherwise you get a "look-behind requires fixed-width pattern" error).
A Python demo:
import re
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx,s))
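The fixed-width restriction can be checked directly. This is a small sketch, assuming the standard-library re module (the third-party regex package behaves differently); the variable names are illustrative:

```python
import re

# \b has width 0 and _ has width 1, so this lookbehind is not fixed-width.
try:
    re.compile(r"(?<=\b|_)word")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# Moving the alternation outside the lookbehind compiles fine.
rx = re.compile(r"(?:\b|(?<=_))word(?=\b|_)")
found = rx.findall("some_word_here and a word there")

print(lookbehind_error)
print(found)  # ['word', 'word']
```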
An alternative solution is to use custom word boundaries like (?<![^\W_]) / (?![^\W_]):
rx = r"(?<![^\W_])word(?![^\W_])"
The (?<![^\W_]) negative lookbehind fails the match if the character immediately before the search word is a letter or digit (a word character other than _), so it requires the start of the string, a non-word character, or _ before the word. Likewise, the (?![^\W_]) negative lookahead fails the match if the next character is a letter or digit, so it requires the end of the string, a non-word character, or _ after the word.
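To see the custom boundaries at work, here is a short sketch on a made-up sample string (the string and variable names are illustrative only):

```python
import re

rx = re.compile(r"(?<![^\W_])word(?![^\W_])")
s = "some_word_here and a word there, swordfish, word2"

# "word" inside "swordfish" and "word2" is rejected because a letter
# or digit touches it; underscores and spaces are accepted as borders.
starts = [m.start() for m in rx.finditer(s)]
print(starts)  # [5, 21]
```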

Regular expressions in python to match Twitter handles

I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that
Contain a specific string
Are of unknown length
May be followed by either
punctuation
whitespace
or the end of string.
For example, for each of these strings, I've marked in brackets what I'd like returned.
"#handle what is your problem?" [RETURN '#handle']
"what is your problem #handle?" [RETURN '#handle']
"#123handle what is your problem #handle123?" [RETURN '#123handle', '#handle123']
This is what I have so far:
>>> import re
>>> re.findall(r'(#.*handle.*?)\W','hi #123handle, hello #handle123')
['#123handle']
# This misses the handles that are followed by end-of-string
I tried modifying it to use an alternation that also allows the end of the string. Instead, it just returns the whole string.
>>> re.findall(r'(#.*handle.*?)(?=\W|$)','hi #123handle, hello #handle123')
['#123handle, hello #handle123']
# This looks like it is too greedy and ends up returning too much
How can I write an expression that will satisfy both conditions?
I've looked at a couple other places, but am still stuck.
It seems you are trying to match strings starting with #, then having 0+ word chars, then handle, and then again 0+ word chars.
Use
r'#\w*handle\w*'
or - to avoid matching #+word chars in emails:
r'\B#\w*handle\w*'
The \B non-word boundary requires a non-word char or the start of the string right before the #.
Note that .* is a greedy dot pattern that matches any characters other than newline, as many as possible. \w* also matches as many characters as possible, but only from the [a-zA-Z0-9_] set when the re.UNICODE flag is not used (and it is not used in your code). In Python 3, str patterns are Unicode-aware by default unless re.ASCII is passed.
Python demo:
import re
p = re.compile(r'#\w*handle\w*')
test_str = "#handle what is your problem?\nwhat is your problem #handle?\n#123handle what is your problem #handle123?\n"
print(p.findall(test_str))
# => ['#handle', '#handle', '#123handle', '#handle123']
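The difference the \B prefix makes shows up on text that contains an email-like token. A minimal sketch with an invented sample string:

```python
import re

text = "mail user#handle@example.com or ping #handle99"

# Without \B, the # glued to "user" is picked up as well.
plain = re.findall(r"#\w*handle\w*", text)

# With \B, a # that directly follows a word character is skipped.
guarded = re.findall(r"\B#\w*handle\w*", text)

print(plain)    # ['#handle', '#handle99']
print(guarded)  # ['#handle99']
```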
This matches only handles made of characters in the range [a-zA-Z0-9_]:
import re
s = "#123handle what is your problem #handle123?"
# \w already covers digits and the underscore, so [\w\d_] reduces to \w
print(re.findall(r'\B(#\w+)', s))
# ['#123handle', '#handle123']
s = '#The quick brown fox#jumped over the LAAZY #_dog.'
print(re.findall(r'\B(#\w+)', s))
# ['#The', '#_dog']

Using re to sanitize a word file, allowing letters with hyphens and apostrophes

Here's what I have so far:
import re
def read_file(file):
    words = []
    for line in file:
        for word in line.split():
            words.append(re.sub("[^a-z]", "", word.lower()))
As it stands, this will read in "can't" as "cant" and "co-ordinate" as "coordinate". I want to read in the words so that these 2 punctuation marks are allowed. How do I modify my code to do this?
There can be two approaches: one is suggested by ritesht93 in the comment to the question, though I'd use
words.append(re.sub("[^-'a-z]+", "", word.lower()))
                       ^^    ^ - One or more occurrences to remove in one go
                       |  - Apostrophe and hyphen added
The + quantifier will remove the unwanted characters matching the pattern in one go.
Note that the hyphen is added at the beginning of the negated character class and thus does not have to be escaped. NOTE: It is still recommended to escape it if other, less regex-savvy developers are going to maintain this later.
The second approach will be helpful if you have Unicode letters.
ur'((?![-'])[\W\d_])+'
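Compile it with the re.UNICODE flag. The ur'' prefix is Python 2 syntax; in Python 3, write r'...' instead, and str patterns are Unicode-aware by default.
The pattern matches any non-letter, that is, a non-word character, digit or underscore ([\W\d_]), except a hyphen or an apostrophe, which the negative lookahead (?![-']) excludes.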
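Both approaches can be compared side by side. The sketch below assumes Python 3 (so no ur prefix or re.UNICODE flag is needed), and the helper names clean_ascii and clean_unicode are hypothetical, introduced only for illustration:

```python
import re

# Approach 1: keep only ASCII letters, hyphens and apostrophes.
ASCII_JUNK = re.compile(r"[^-'a-z]+")
# Approach 2: remove non-letters but keep hyphens, apostrophes
# and any Unicode letter.
UNICODE_JUNK = re.compile(r"(?:(?![-'])[\W\d_])+")


def clean_ascii(word):
    return ASCII_JUNK.sub("", word.lower())


def clean_unicode(word):
    return UNICODE_JUNK.sub("", word.lower())


print(clean_ascii("Can't,"))            # can't
print(clean_ascii("co-ordinate!"))      # co-ordinate
print(clean_ascii("Caf\u00e9"))         # caf
print(clean_unicode("Caf\u00e9_2!"))    # café
```

The first version silently drops accented letters, which is why the second one is the safer choice for non-English text.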

Regular expressions with \b and non-word characters (like '.')

Why does this regular expression:
r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$'
not match J. F. Kennedy?
I have to remove \b in groups first_init and mid_init to make it match.
I am using Python, and for testing I am using https://regex101.com/
Thanks
You are over-applying the \b word breaks.
\b will only match if on one side there is a valid "word" character and on the other side not. Now you use this construction twice:
\b\w\.\b\s
.. and, rightly so, it does not match: the final \b sits between a full stop and a space, so there is a non-word character on its left and a non-word character on its right as well.
Removing the \b between the full stop and \s is enough to make it work.
\b matches the empty string only at the beginning or end of a word. A word is a sequence of alphanumeric or underscore characters. The dot (.) cannot be part of a word.
>>> import re
# does not match when \. is within word boundary
>>> re.match(r'^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy')
# matches when \b is moved to left of \.
>>> re.match(r'^(?P<first_init>\b\w\b\.)\s(?P<mid_init>\b\w\b\.)\s(?P<last_name>\b\w+\b)$', 'J. F. Kennedy') # matches
The . is not part of the word in this sense; see the re module documentation on \b.
It does not match because of the \. (dot) character: a dot is not a word character, so the regex definition of a word may differ from the one you would like. You can easily rewrite the pattern without \b.
Just remove the second boundary:
^(?P<first_init>\b\w\.)\s
(?P<mid_init>\b\w\.)\s
(?P<last_name>\b\w+\b)$
The background is that the second \b sits between a dot and a space, so it fails (remember that one of the sides needs to be a word character, i.e. one of a-zA-Z0-9_).
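A minimal sketch confirming this, with the original pattern and the corrected one run against the same name (the variable names are illustrative):

```python
import re

name = "J. F. Kennedy"

# Original pattern: a \b after each dot can never match before a space.
broken = (r"^(?P<first_init>\b\w\.\b)\s(?P<mid_init>\b\w\.\b)\s"
          r"(?P<last_name>\b\w+\b)$")
# Corrected pattern: the \b after each dot is removed.
fixed = (r"^(?P<first_init>\b\w\.)\s(?P<mid_init>\b\w\.)\s"
         r"(?P<last_name>\b\w+\b)$")

broken_match = re.match(broken, name)
fixed_match = re.match(fixed, name)

print(broken_match)             # None
print(fixed_match.groupdict())
# {'first_init': 'J.', 'mid_init': 'F.', 'last_name': 'Kennedy'}
```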
\b means the border of a word.
A word here is defined like so:
A word is a run of alphanumeric or underscore characters; it ends where the next character is anything else, not only a space.
"J", "F" and "Kennedy" are the words here; the dots are not part of them.
Your example asks for a word boundary between the dot and the space, where there is none, so it never matches J. F. Kennedy.

Regex that recognizes ^^ and ^ as different

I am using the Python re module.
I can use the regex r'\bA\b' (a raw string) to differentiate between 'A' and 'AA': it will find a match in the string 'A' and no matches in the string 'AA'.
I would like to achieve the same thing with a caret ^ instead of the A: I want a regex which differentiates between '^' and '^^'.
The problem I have is that the regex r'\b\^\b' does not find a match in '^'.
Any ideas?
You need to use lookaround for this:
(?<!\^)\^(?!\^)
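\b is a word boundary, a place between a word character and a non-word character. Since ^ is itself a non-word character, a lone '^' has no word character on either side, so r'\b\^\b' cannot match it. Your original pattern is also not specific to A: it only requires non-word neighbours, so A_ would not match either, given that _ is a word character.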
Here, we assert that there needs to be a place where the preceding character is not a caret, then a caret, then a place where the following character is not a caret (which boils down to "the caret must not be in caret company").
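The lookaround pattern can be tried out as follows; this is a small sketch with made-up test strings and illustrative variable names:

```python
import re

lone_caret = re.compile(r"(?<!\^)\^(?!\^)")

single = lone_caret.findall("^")
double = lone_caret.findall("^^")
# Positions of carets that have no caret neighbour.
mixed = [m.start() for m in lone_caret.finditer("a^b ^^ c^")]
# The \b version finds nothing in a lone caret.
boundary_attempt = re.search(r"\b\^\b", "^")

print(single)            # ['^']
print(double)            # []
print(mixed)             # [1, 8]
print(boundary_attempt)  # None
```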
