Python Regex \pL matching issues - python

I'm trying to match a list of keywords I have, taking care to treat all Latin letters (e.g. accented ones) as letters.
Here's an example
import regex as re
p = r'((?!\pL)|^)blah((?!\pL)|$)'
print(re.search(p, "blah u"))
print(re.search(p, "blahé u"))
print(re.search(p, "éblah u"))
print(re.search(p, "blahaha"))
gives:
<regex.Match object; span=(0, 4), match='blah'>
None
None
None
Which looks correct. However:
print(re.search(p, "u blah"))
gives:
None
This is wrong, as I expect a match for "u blah".
I've also tried to use Python's built-in re module, but I cannot get it to work with \pL or \p{Latin} due to "bad escape" errors. I've also tried using non-raw strings (without the "r"), but even after escaping the backslashes as \\\\pL, I just can't get this to work right.
Note: I'm using Python 3.8

The problem with your ((?!\pL)|^)blah((?!\pL)|$) regex is that the first alternative in the ((?!\pL)|^) group can never succeed. (?!\pL) is a negative lookahead that fails if the next character is a letter, and the next character to match is always the b in blah. Only the ^ alternative can succeed, so your regex is equivalent to ^blah((?!\pL)|$) and only matches at the start of the string.
Note that (?!\pL) also succeeds at the end of the string, so ((?!\pL)|$) is equivalent to (?!\pL).
What you need on the left-hand side is a negative lookbehind, which checks the character before blah:
(?<!\pL)blah(?!\pL)
See the regex demo (switched to PCRE for the demo purposes).
Note that the re-compatible version of the regex, where [^\W\d_] (a word character that is neither a digit nor an underscore) stands in for \pL, is
(?<![^\W\d_])blah(?![^\W\d_])
See the regex demo.
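The re-compatible pattern can be checked against the strings from the question with a short sketch that uses only the standard library. The helper name find_keyword and the LETTER constant are assumptions made for illustration, not part of the original answer.

```python
import re

# [^\W\d_] = a word character that is neither a digit nor an underscore, i.e. a letter.
LETTER = r'[^\W\d_]'

def find_keyword(keyword, text):
    """Return the first match of keyword that is not adjacent to a letter, or None."""
    pattern = r'(?<!{0}){1}(?!{0})'.format(LETTER, re.escape(keyword))
    return re.search(pattern, text)

for sample in ["blah u", "blahé u", "éblah u", "blahaha", "u blah"]:
    m = find_keyword("blah", sample)
    print(repr(sample), m.span() if m else None)
```

Only "blah u" and "u blah" should produce a match; the other three have a letter directly before or after the keyword.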

Related

Use Python 3.4+ Regex to match up to, but not including a # symbol, plus a range of lowercase letters

I would like to use Regex in Python 3.4+ to match a combination of the '#' symbol + the next lowercase letter. There's a bunch of obfuscating data in the strings that's making it tricky for me to do this in one clean line of regex. Here's an example string:
Stack #Overflow is a question and answer website for #professional and enthusiast programmers.
I'd like the regex here to match up to the word '#professional' (because it's lowercase), skipping over the '#Overflow' occurrence (because it's uppercase). After the operation I want to be left with:
professional and enthusiast programmers
or
#professional and enthusiast programmers
I can get it to match up to the first # with ^[^#]*, but I'm not seeing a good way to put a range of chars in there to specify that the following character needs to be lowercase(a-z, etc).
My initial thought was to try ^[^#a-z]*, but this doesn't work.
Any ideas of how to make this work with Python?
You're looking for a "positive lookahead": a zero-width assertion that consumes no part of the string but checks the characters that follow.
>>> s = 'Stack #Overflow is a question and answer website for #professional and enthusiast programmers.'
>>> re.search('#(?=[a-z])', s)
<re.Match object; span=(53, 54), match='#'>
the (?=...) part is the positive lookahead, asserting that the # is immediately followed by a lowercase character -- notice this matches the second # and not the first. from here you can get the rest of the string:
>>> s[_.end():]
'professional and enthusiast programmers.'
_ here being the last expression in the repl (you'd want to assign the match to a variable in your actual code)
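Outside the REPL, the same lookahead can be wrapped in a small function that assigns the match to a variable. This is a minimal sketch; the function name from_lowercase_hash and the keep_hash parameter are hypothetical, chosen to cover both outputs the question asks for.

```python
import re

def from_lowercase_hash(text, keep_hash=False):
    """Return the text after the first '#' that is followed by a lowercase ASCII letter."""
    m = re.search(r'#(?=[a-z])', text)
    if m is None:
        return None  # no '#' followed by a lowercase letter
    # m.start() points at the '#', m.end() just past it.
    return text[m.start():] if keep_hash else text[m.end():]

s = 'Stack #Overflow is a question and answer website for #professional and enthusiast programmers.'
print(from_lowercase_hash(s))
print(from_lowercase_hash(s, keep_hash=True))
```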
I think you can use pattern r'#([a-z])(.*)' with re.search to get the expected result
import re
line = "Stack #Overflow is a question and answer website for #professional and enthusiast programmers."
# group(1) is the first lowercase letter after '#', group(2) is the rest of the line
matchObj = re.search(r'#([a-z])(.*)', line)
if matchObj:
    print("match string : ", matchObj.group())

how to understand re.sub("", "-", "abxc") in python [duplicate]

These are the results from Python 2.7.
>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'
The results I thought should be as follows.
>>> re.sub('.*?', '-', 'abc')
'-------'
But it's not. Why?
The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).
Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?
Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.
Examples:
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'
# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'
(?VX) sets the version flag in the regex. The second example is what you expect, and is supposedly what PCRE does. Python's re is somewhat nonstandard, and is kept as it is probably solely due to backwards compatibility concerns. I've found an example of something similar (with re.split).
For your new, edited question:
The .*? can match any number of characters, including zero. So what it does is it matches zero characters at every position in the string: before the "a", between the "a" and "b", etc. It replaces each of those zero-width matches with a hyphen, giving the result you see.
The regex does not try to match each character one by one; it tries to match at each position in the string. Your regex allows it to match zero characters. So it matches zero at each position and moves on to the next. You seem to be thinking that in a string like "abc" there is one position before the "b", one position "inside" the "b", and one position after "b", but there isn't a position "inside" an individual character. If it matches zero characters starting before "b", the next thing it tries is to match starting after "b". There's no way you can get a regex to match seven times in a three-character string, because there are only four positions to match at.
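The "four positions" can be made visible by trying the pattern at each index by hand. This is a small sketch that relies only on the pos argument of a compiled pattern's match method; the variable names are illustrative.

```python
import re

pat = re.compile(r'.*?')
text = 'abc'

# Try the pattern at every position, including the one after the last character.
spans = [pat.match(text, pos).span() for pos in range(len(text) + 1)]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Each span has equal start and end, so every match is zero-width, and there is exactly one per position.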
Are you sure you interpreted re.sub's documentation correctly?
*?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if
the RE <.*> is matched against '<H1>title</H1>', it will match the
entire string, and not just '<H1>'. Adding '?' after the qualifier
makes it perform the match in non-greedy or minimal fashion; as few
characters as possible will be matched. Using .*? in the previous
expression will match only '<H1>'.
Adding a ? will turn the expression into a non-greedy one.
Greedy:
re.sub(".*", "-", "abc")
non-Greedy:
re.sub(".*?", "-", "abc")
Update: FWIW re.sub does exactly what it should:
>>> from re import sub
>>> sub(".*?", "-", "abc")
'-a-b-c-'
>>> sub(".*", "-", "abc")
'-'
See @BrenBarn's awesome answer on why you get -a-b-c- :)
To elaborate on Veedrac's answer, different implementations treat zero-width matches differently in FindAll (or ReplaceAll) operations. Two behaviors can be observed among implementations, and Python's re simply follows the first.
1. Always bump along by one character on zero-width match
In Java and JavaScript, a zero-width match causes the index to bump along by one character, since staying at the same index would cause an infinite loop in FindAll or ReplaceAll operations.
As a result, the output of a FindAll operation in such an implementation can contain at most 1 match starting at a particular index.
The default Python re package probably also follows this approach (and the same seems to be true for Ruby).
2. Disallow zero-width match on next match at same index
In PHP, which provides a wrapper over the PCRE library, a zero-width match does not cause the index to bump along immediately. Instead, it sets a flag (PCRE_NOTEMPTY) requiring the next match (which starts at the same index) to be a non-zero-width match. If that match succeeds, the index bumps along by the length of the match (non-zero); otherwise, it bumps along by one character.
By the way, the PCRE library does not provide a built-in FindAll or ReplaceAll operation. It is actually provided by the PHP wrapper.
As a result, the output of a FindAll operation in such an implementation can contain up to 2 matches starting at the same index.
The Python regex package probably follows this approach.
This approach is more complex, since it requires the implementation of FindAll or ReplaceAll to keep extra state recording whether a zero-width match is disallowed. Developers also need to keep track of this extra flag when they use the low-level matching API.
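The first strategy (bump along by one character after a zero-width match) can be sketched by hand on top of a compiled pattern's search method. This is an illustration only, not how any of the libraries above are actually implemented, and the function name sub_bump_along is made up. The loop is written out manually because re.sub on Python 3.7 and later also replaces empty matches adjacent to a previous non-empty match, so calling re.sub directly may not reproduce the Python 2.7 transcript in the question.

```python
import re

def sub_bump_along(pattern, repl, text):
    """Replace matches left to right, stepping one character past every zero-width match."""
    pat = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        m = pat.search(text, pos)
        if m is None:
            break
        out.append(text[pos:m.start()])  # unmatched text before the match
        out.append(repl)
        if m.end() == m.start():
            # Zero-width match: copy one character and bump along past it.
            out.append(text[m.end():m.end() + 1])
            pos = m.end() + 1
        else:
            pos = m.end()
    out.append(text[pos:])  # whatever is left after the last match
    return ''.join(out)

print(sub_bump_along(r'.*?', '-', 'abc'))  # -a-b-c-
```

With '.*?' every match is empty, so each of the four positions yields one hyphen and the characters in between are copied through unchanged.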

Regular expressions in python to match Twitter handles

I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that
Contain a specific string
Are of unknown length
May be followed by either
punctuation
whitespace
or the end of string.
For example, for each of these strings, I've marked in brackets what I'd like to return.
"#handle what is your problem?" [RETURN '#handle']
"what is your problem #handle?" [RETURN '#handle']
"#123handle what is your problem #handle123?" [RETURN '#123handle', '#handle123']
This is what I have so far:
>>> import re
>>> re.findall(r'(#.*handle.*?)\W','hi #123handle, hello #handle123')
['#123handle']
# This misses the handles that are followed by end-of-string
I tried modifying to include an or character allowing the end-of-string character. Instead, it just returns the whole string.
>>> re.findall(r'(#.*handle.*?)(?=\W|$)','hi #123handle, hello #handle123')
['#123handle, hello #handle123']
# This looks like it is too greedy and ends up returning too much
How can I write an expression that will satisfy both conditions?
I've looked at a couple other places, but am still stuck.
It seems you are trying to match strings starting with #, then having 0+ word chars, then handle, and then again 0+ word chars.
Use
r'#\w*handle\w*'
or - to avoid matching #+word chars in emails:
r'\B#\w*handle\w*'
See the Regex 1 demo and the Regex 2 demo (the \B non-word boundary requires a non-word char or start of string to be right before the #).
Note that .* is a greedy dot pattern that matches any characters other than a newline, as many as possible. \w* also matches 0+ characters, as many as possible, but only word characters. In Python 2 that means the [a-zA-Z0-9_] set unless the re.UNICODE flag is used (and it is not used in your code); in Python 3, \w on str patterns also matches Unicode letters and digits unless re.ASCII is set.
Python demo:
import re
p = re.compile(r'#\w*handle\w*')
test_str = "#handle what is your problem?\nwhat is your problem #handle?\n#123handle what is your problem #handle123?\n"
print(p.findall(test_str))
# => ['#handle', '#handle', '#123handle', '#handle123']
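The effect of the leading \B is easiest to see on a string that contains an email-like token. The sketch below compares the two patterns from this answer; the sample text and variable names are invented for illustration.

```python
import re

plain = re.compile(r'#\w*handle\w*')
guarded = re.compile(r'\B#\w*handle\w*')

text = "mail user#handle.com or ping #handle123"

# Without \B the '#' inside user#handle.com is picked up as well.
print(plain.findall(text))    # ['#handle', '#handle123']
# With \B the '#' must not be directly preceded by a word character.
print(guarded.findall(text))  # ['#handle123']
```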
The following matches only handles made of characters from the range [a-zA-Z0-9_]:
import re
s = "#123handle what is your problem #handle123?"
print(re.findall(r'\B(#[\w\d_]+)', s))
# => ['#123handle', '#handle123']
s = '#The quick brown fox#jumped over the LAAZY #_dog.'
print(re.findall(r'\B(#[\w\d_]+)', s))
# => ['#The', '#_dog']

regular expression match issue in Python

For an input string, I want to match text that starts with {(P) and ends with (P)}, and I only want the part in the middle. Can this be done with a single regular expression?
For example, from the input string below I want to retrieve the hello world part. I'm using Python 2.7.
python {(P)hello world(P)} java
You can try {\(P\)(.*)\(P\)}, and use parenthesis in the pattern to capture everything between {(P) and (P)}:
import re
re.findall(r'{\(P\)(.*)\(P\)}', "python {(P)hello world(P)} java")
# ['hello world']
.* also matches non-ASCII characters, for example:
import re
str1 = "python {(P)£1,073,142.68(P)} java"
str2 = re.findall(r'{\(P\)(.*)\(P\)}', str1)[0]
str2
# '\xc2\xa31,073,142.68' (the Python 2.7 byte-string repr; Python 3 shows '£1,073,142.68')
print(str2)
# £1,073,142.68
You can use positive look-arounds to ensure that it only matches if the text is preceded and followed by the start and end tags. For instance, you could use this pattern:
(?<={\(P\)).*?(?=\(P\)})
See the demo.
(?<={\(P\)) - Look-behind expression stating that a match must be preceded by {(P).
.*? - Matches all text between the start and end tags. The ? makes the star lazy (i.e. non-greedy). That means it will match as little as possible.
(?=\(P\)}) - Look-ahead expression stating that a match must be followed by (P)}.
For what it's worth, lazy patterns are technically less efficient, so if you know that there will be no ( characters in the match, it would be better to use a negated character class:
(?<={\(P\))[^(]*(?=\(P\)})
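The difference between the greedy capture and the lookaround patterns shows up once the input contains more than one tagged section. A minimal sketch with Python's re module follows; the sample string and variable names are made up for illustration.

```python
import re

text = "a {(P)one(P)} b {(P)two(P)} c"

greedy = re.findall(r'{\(P\)(.*)\(P\)}', text)
lazy = re.findall(r'(?<={\(P\)).*?(?=\(P\)})', text)
negated = re.findall(r'(?<={\(P\))[^(]*(?=\(P\)})', text)

print(greedy)   # ['one(P)} b {(P)two']  (runs from the first start tag to the last end tag)
print(lazy)     # ['one', 'two']
print(negated)  # ['one', 'two']
```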
You can also do this without regular expressions, as long as you don't need to check for the surrounding braces:
s = 'python {(P)hello world(P)} java'
r = s.split('(P)')[1]
print(r)
# hello world

Python re match at specific point in string

If I have a given string s in Python, is it possible to easily check if a regex matches the string starting at a specific position i in the string?
I would rather not slice the entire string from i to the end as it doesn't seem very scalable (ruling out re.match I think).
re.match doesn't support this directly. However, if you pre-compile your regular expression with re.compile (often a good idea anyway), the match and search methods of the compiled pattern object (RegexObject in the Python 2 docs) both take an optional pos parameter:
The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.
Example:
import re
s = 'this is a test 4242 did you get it'
pat = re.compile('[a-zA-Z]+ ([0-9]+)')
print(pat.match(s, 10).group(0))
Output:
test 4242
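The caveat about '^' in the quoted documentation can be checked directly. The sketch below uses only the standard re module and compares pos with slicing; the variable names are illustrative.

```python
import re

s = 'abc123def'
digits = re.compile(r'\d+')
anchored = re.compile(r'^\d+')

# pos (and the optional endpos) restrict where matching happens without copying the string.
print(digits.match(s, 3).group())     # 123
print(digits.match(s, 3, 5).group())  # 12

# '^' still refers to the real start of the string, so it fails at pos=3 ...
print(anchored.match(s, 3))           # None
# ... whereas slicing creates a new string whose start is index 3 of the original.
print(anchored.match(s[3:]).group())  # 123
```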
Although re.match does not support this, the new regex module (intended to replace the re module) has a treasure trove of new features, including pos and endpos arguments for search, match, sub, and subn. Although not official yet, the regex module can be pip installed and works for Python versions 2.5 through 3.4. Here's an example:
>>> import regex
>>> regex.match(r'\d+', 'abc123def')
>>> regex.match(r'\d+', 'abc123def', pos=3)
<regex.Match object; span=(3, 6), match='123'>
>>> regex.match(r'\d+', 'abc123def', pos=3, endpos=5)
<regex.Match object; span=(3, 5), match='12'>
