Strange regex behavior when matching with strings that include dots - python

I have the following case where the matching does not seem to work properly:
import re
test_case1 = u"I will meet you at 2 pm"
test_case2 = u"I will meet you at 2 p.m."
test_case3 = u"I will meet you at 2 p.m. "
test_case4 = u"I will meet you at 2 p.m. pm "
list_of_words = ['p.m.', 'pm'] # list of words that can be enlarged
# join all words into an or expression and escape all punctuation
joined_words = '|'.join([re.escape(x) for x in list_of_words])
# create a regex that will match a word from the list of words only if it is
# at the start/end of the sentence or it is between two word boundaries
match_regex = r'(^|\b)('+joined_words+r')(\b|$)'
comp_regex = re.compile(match_regex, re.IGNORECASE) # compile the final regex
print comp_regex.findall(test_case1), len(comp_regex.findall(test_case1))
print comp_regex.findall(test_case2), len(comp_regex.findall(test_case2))
print comp_regex.findall(test_case3), len(comp_regex.findall(test_case3))
print comp_regex.findall(test_case4), len(comp_regex.findall(test_case4))
I get the following results for the 4 test cases:
[(u'', u'pm', u'')] 1
[(u'', u'p.m.', u'')] 1
[] 0
[(u'', u'pm', u'')] 1
The 1st and 2nd cases seem to work fine, but the 3rd doesn't match "p.m." when there is a space after it, even though I used the "\b" word boundary in the regex.
The 4th case doesn't match the "p.m." at all and only matches the "pm".
I can't understand where the problem lies; any help is appreciated.

The Python docs state the following about \b:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
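According to that definition, . is not a word character, so there is no word boundary between the final . of p.m. and the space that follows it (both are \W), and \b doesn't match there. The second test case only succeeds because $ matches at the end of the string. If you make the following change to your regex, you get the behavior you expect: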
match_regex = r'(^|\b)('+joined_words+r')(\s|$)'
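A quick way to see the difference is to run both patterns over the four test cases. This is a minimal sketch in Python 3 syntax; the names with_boundary and with_space are illustrative, and the outer groups are made non-capturing (an assumption for readability) so that findall returns only the matched word.

```python
import re

list_of_words = ['p.m.', 'pm']
joined_words = '|'.join(re.escape(x) for x in list_of_words)

# Original pattern: requires \b (or end of string) after the word
with_boundary = re.compile(r'(?:^|\b)(' + joined_words + r')(?:\b|$)', re.IGNORECASE)
# Suggested pattern: requires whitespace (or end of string) after the word
with_space = re.compile(r'(?:^|\b)(' + joined_words + r')(?:\s|$)', re.IGNORECASE)

test_cases = [
    "I will meet you at 2 pm",
    "I will meet you at 2 p.m.",
    "I will meet you at 2 p.m. ",
    "I will meet you at 2 p.m. pm ",
]

boundary_results = [with_boundary.findall(t) for t in test_cases]
space_results = [with_space.findall(t) for t in test_cases]

for text, old, new in zip(test_cases, boundary_results, space_results):
    print(repr(text), old, new)
```

With the \s variant, the third case finds p.m. and the fourth finds both p.m. and pm.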

You could try:
match_regex = r'(^|\b)('+joined_words+r')(\s|$)'
if you want both p.m. and pm matched in the last test case, or
match_regex = r'(^|\s)('+joined_words+r')(\s|$)'
if you want only the first p.m. matched.
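The two variants differ because the trailing \s consumes the space after p.m., so a leading \s has nothing left to match before pm. A small sketch of that on the fourth test case (the helper matched_words is hypothetical, and the outer groups are non-capturing for brevity):

```python
import re

joined_words = '|'.join(re.escape(x) for x in ['p.m.', 'pm'])
text = "I will meet you at 2 p.m. pm "

def matched_words(prefix):
    """Return the words found when the given prefix precedes the word group."""
    pattern = re.compile(prefix + '(' + joined_words + r')(?:\s|$)', re.IGNORECASE)
    return [m.group(1) for m in pattern.finditer(text)]

both = matched_words(r'(?:^|\b)')        # \b is zero-width, so pm can still match
first_only = matched_words(r'(?:^|\s)')  # the space before pm was already consumed

print(both)
print(first_only)
```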

Related

How to replace single digit by same digit followed by punctuation?

I want to replace any single digit with the same digit followed by punctuation (a comma) using a Python regex.
text = 'I am going at 5pm to type 3 and the 9 later'
I want this to be converted to
text = 'I am going at 5pm to type 3, and the 9, later'
My attempt:
match = re.search(r'\s\d{1}\s', text)
I am able to detect them, but I don't know how to replace each one with the same digit followed by a comma.
Regex #1
(?<=\b\d)\b
Replace with ,
How it works:
(?<=\b\d) positive lookbehind ensuring the following precedes:
\b assert position as a word boundary
\d match a digit
\b assert position as a word boundary
To prevent it from matching where the digit is already followed by a comma (as in "3,"), simply append (?!,) to the regex.
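The effect of the (?!,) guard can be checked with a short sketch; the sample text here is a variation of the question's input in which one digit already has a comma, chosen purely for illustration.

```python
import re

text = 'I am going at 5pm to type 3, and the 9 later'

# Without the guard, a second comma is inserted after "3"
without_guard = re.sub(r'(?<=\b\d)\b', ',', text)
# With the guard, positions already followed by a comma are skipped
with_guard = re.sub(r'(?<=\b\d)\b(?!,)', ',', text)

print(without_guard)
print(with_guard)
```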
Regex #2
To prevent matching a single digit at the start and end of the string, you can use the following regex:
(?<=(?<!^)\b\d)\b(?!$)
Same as above regex, but adds following:
(?<!^) ensures the word boundary \b that it precedes doesn't match the start of the line
(?!$) ensure the word boundary \b that it follows doesn't match the end of the line
You can remove either token if that's not the behaviour you want.
To prevent it from matching where the digit is already followed by a comma (as in "3,"), simply change the negative lookahead to (?!,|$) or append (?!,) to the regex.
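As a rough illustration of Regex #2, the sketch below compares it with Regex #1 on a made-up string that has single digits at the start, in the middle, and at the end:

```python
import re

text = '7 apples, 3 pears and 9'

# Regex #1: a comma after every standalone digit
everywhere = re.sub(r'(?<=\b\d)\b', ',', text)
# Regex #2: skips a digit at the very start or very end of the string
interior_only = re.sub(r'(?<=(?<!^)\b\d)\b(?!$)', ',', text)

print(everywhere)
print(interior_only)
```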
Regex #3
If \b can't be used (e.g. if you have some numbers like 3.3), you can use the following instead:
(?:(?<=\s\d)|(?<=^\d))(?=\s)
How it works:
(?:(?<=\s\d)|(?<=^\d)) match either of the following:
(?<=\s\d) positive lookbehind ensuring what precedes is a whitespace character followed by a digit
(?<=^\d) positive lookbehind ensuring what precedes is the start of the line followed by a digit
(?=\s) positive lookahead ensuring what follows is a whitespace character
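The difference from the \b-based pattern shows up on input containing a decimal such as 3.3. A minimal sketch, using a made-up sample string:

```python
import re

text = '5 is 3.3 and 3 more'

# \b sees a boundary on both sides of each "3" in "3.3"
with_word_boundary = re.sub(r'(?<=\b\d)\b', ',', text)
# Whitespace-based lookarounds leave "3.3" untouched
with_whitespace = re.sub(r'(?:(?<=\s\d)|(?<=^\d))(?=\s)', ',', text)

print(with_word_boundary)
print(with_whitespace)
```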
Regex #4
If you don't need to match digits at the start of the string, modify the previous regex (Regex #3) by removing the second lookbehind as such:
(?<=\s\d)(?=\s)
Code
Sample code (replace regex pattern with whichever pattern works best for you):
import re
x = 'I am going at 5pm to type 3 and the 9 later'
r = re.sub(r'(?<=\b\d)\b', ',', x)
print(r)
You could use a word boundary and a capture group to achieve this:
import re
text = 'I am going at 5pm to type 3 and the 9 later'
re.sub(r'\b(\d)\b', r"\1,", text)
# => 'I am going at 5pm to type 3, and the 9, later'

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
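In Python, the same pattern could be tried roughly as follows. This sketch assumes Python 3, where str patterns are Unicode-aware by default, so only the multiline flag is passed:

```python
import re

text = 'ÀÇÆ some words\nÀÇÆ some words 123'

# re.MULTILINE makes ^ and $ match at each line, mirroring the m flag
pattern = re.compile(r'^(?:(?!\d)[\w ])+$', re.MULTILINE)
matches = pattern.findall(text)

print(matches)
```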
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']
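One caveat worth checking: rx.match only anchors the start of [\w\s]+, so a digit-free string containing other characters (for example a hyphen) still produces a partial match. If that matters for your data, fullmatch is one way to require the whole string; a sketch with an extra, made-up input 'hell-o':

```python
import re

rx = re.compile(r'(?=^\D+$)[\w\s]+')
strings = ['hello 123', 'hell o', 'hell-o']

# match() accepts 'hell-o' because [\w\s]+ can stop at 'hell'
with_match = [s for s in strings if rx.match(s)]
# fullmatch() requires [\w\s]+ to cover the entire string
with_fullmatch = [s for s in strings if rx.fullmatch(s)]

print(with_match)
print(with_fullmatch)
```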

Trying to repeat the regex breaks the regex

I have a working regex that matches ONE of the following lines:
A punctuation from the following list [.,!?;]
A word that is preceded by the beginning of the string or a space.
Here's the regex in question ([.,!?;] *|(?<= |\A)[\-'’:\w]+)
What I need it to do however is for it to match 3 instances of this. So, for example, the ideal end result would be something like this.
Sample text: "This is a test. Test"
Output
"This" "is" "a"
"is" "a" "test"
"a" "test" "."
"test" "." "Test"
I've tried simply adding {3} to the end in the hopes of it matching 3 times. This however results in it matching nothing at all or the occasional odd character. The other possibility I've tried is just repeating the whole regex 3 times like so ([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+) which is horrible to look at but I hoped it would work. This had the odd effect of working, but only if at least one of the matches was one of the previously listed punctuation.
Any insights would be appreciated.
I'm using the new regex module found here so that I can have overlapping searches.
What is wrong with your approach
The ([.,!?;] *|(?<= |\A)[\-'’:\w]+) pattern matches a single "unit" (either a word or a single punctuation mark from the specified set [.,!?;] followed by 0+ spaces). Thus, when you feed this pattern to regex.findall, it can only return the list of individual chunks ['This', 'is', 'a', 'test', '. ', 'Test'].
Solution
You can use a slightly different approach: match all words, and all chunks that are not words. Here is a demo (note that C'est and AUX-USB are treated as single "words"):
>>> import regex
>>> pat = r"((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*))\s*((?1))\s*((?1))"
>>> results = regex.findall(pat, text, overlapped = True)
>>> results
[("C'est", 'un', 'test'), ('un', 'test', '....'), ('test', '....', 'aux-usb')]
Here, the pattern has 3 capture groups, and the second and third one contain the same pattern as in Group 1 ((?1) is a subroutine call used in order to avoid repeating the same pattern used in Group 1). Group 2 and Group 3 can be separated with whitespaces (not necessarily, or the punctuation glued to a word would not be matched). Also, note the negative lookbehind (?<!') that will ensure that C'est is treated as a single entity.
Explanation
The pattern details:
((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*)) - Group 1 matching:
(?:[^\w\s'-]+(?=\s|\b) - 1+ characters other than [a-zA-Z0-9_], whitespace, ' and - immediately followed with a whitespace or a word boundary
| - or
\b(?<!')\w+(?:['-]\w+)*) - 1+ word characters not preceded with a ' (due to (?<!')) and preceded with a word boundary (\b) and followed with 0+ sequences of - or ' followed with 1+ word characters.
\s* - 0+ whitespaces
((?1)) - Group 2 (same pattern as for Group 1)
\s*((?1)) - see above
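If overlapping regex matches are not a hard requirement, a different technique (not the one used in the answer above) is to tokenize once with the standard re module and then take a sliding window of three tokens. The sketch below assumes the token definition from the question (one punctuation mark from [.,!?;], or a run of word characters, hyphens, apostrophes and colons); the function name token_triples is hypothetical.

```python
import re

def token_triples(text):
    """Split text into word/punctuation tokens and return overlapping triples."""
    tokens = re.findall(r"[.,!?;]|[-'’:\w]+", text)
    # zip over three staggered views of the list gives the sliding window
    return list(zip(tokens, tokens[1:], tokens[2:]))

triples = token_triples("This is a test. Test")
for triple in triples:
    print(triple)
```

For the sample text from the question, this yields the four triples listed there.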

RegEx match word in string containing + and - using re.findall() Python

myreg = r"\babcb\"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg, mystr))   # []
print(re.findall(myreg, mystr1))  # ['abc']
So far everything works as expected, but if I change my regex and my string to:
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg, mystr))   # [], but I would like to get ['+abc']
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel
Your problem is the following:
\b is defined as the boundary between a \w and a \W character
(or vice versa).
\w contains the character set [a-zA-Z0-9_]
\W contains the character set [^a-zA-Z0-9_], which means all characters except [a-zA-Z0-9_]
'+' is not contained in \w, and neither is the whitespace before it, so there is no boundary between the whitespace and the '+' for \b to match.
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print(matches)
# ['+abc']
There are two problems
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letters, digits and underscore) and a non-word character (or a line beginning or ending). For instance, in +abc there is a word boundary between the + and the a, but not between a preceding space and the +.
Solution: Make your Own boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".
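To compare the options, the sketch below records where each pattern matches in a made-up string. The third pattern, which uses (?<!\S) and (?!\S) to require whitespace or a string edge on both sides, is an assumption about what a stricter "own boundary" could look like, not part of the answer above; the helper starts is hypothetical.

```python
import re

text = 'sdf +abc def+abc -+abc'

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# \b only matches before '+' when a word character precedes it (inside def+abc)
with_word_boundary = starts(r'\b\+abc\b')
# Lookbehind: '+abc' not preceded by a word character
not_after_word = starts(r'(?<!\w)\+abc\b')
# Stricter variant: '+abc' delimited by whitespace or the string edges
whitespace_delimited = starts(r'(?<!\S)\+abc(?!\S)')

print(with_word_boundary, not_after_word, whitespace_delimited)
```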

How to match one character word?

How do I match only words of character length one? Or do I have to check the length of the match after I performed the match operation? My filter looks like this:
sw = r'\w+,\s+([A-Za-z]){1}'
So it should match
rs =re.match(sw,'Herb, A')
But shouldn't match
rs =re.match(sw,'Herb, Abc')
If you use \b\w\b you will only match one character of type word. So your expression would be
sw = r'\w+,\s+\w\b'
(since \w is preceded by at least one \s you don't need the first \b)
Verification:
>>> sw = r'\w+,\s+\w\b'
>>> print re.match(sw,'Herb, A')
<_sre.SRE_Match object at 0xb7242058>
>>> print re.match(sw,'Herb, Abc')
None
You can use
(?<=\s|^)\p{L}(?=[\s,.!?]|$)
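which will match a single letter that is preceded by a whitespace character or the start of the string, and followed by a whitespace character, a punctuation mark, or the end of the string. Note that Python's built-in re module supports neither \p{L} nor the variable-width lookbehind (?<=\s|^); the third-party regex module does. The lookahead is augmented with punctuation marks as well; how far to take this depends on your input data. You could also do a lookahead on a non-letter, but that raises the question of whether “a123” is really a one-letter word. Or “I'm”.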
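For the standard library re module, a roughly equivalent pattern can be sketched as follows. It assumes Python 3 and substitutes two common idioms: [^\W\d_] for "any letter" in place of \p{L}, and (?<!\S) for "preceded by whitespace or the start of the string". The name one_letter and the third sample string are illustrative.

```python
import re

# (?<!\S)   : not preceded by a non-whitespace character
# [^\W\d_]  : a word character that is neither a digit nor an underscore
one_letter = re.compile(r'(?<!\S)[^\W\d_](?=[\s,.!?]|$)')

match_a = one_letter.findall('Herb, A')
match_abc = one_letter.findall('Herb, Abc')
match_mixed = one_letter.findall("a b. I'm é")

print(match_a, match_abc, match_mixed)
```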
