Trying to repeat the regex breaks the regex - python

I have a working regex that matches ONE of the following lines:
A punctuation from the following list [.,!?;]
A word that is preceded by the beginning of the string or a space.
Here's the regex in question ([.,!?;] *|(?<= |\A)[\-'’:\w]+)
What I need it to do however is for it to match 3 instances of this. So, for example, the ideal end result would be something like this.
Sample text: "This is a test. Test"
Output
"This" "is" "a"
"is" "a" "test"
"a" "test" "."
"test" "." "Test"
I've tried simply adding {3} to the end in the hopes of it matching 3 times. This however results in it matching nothing at all or the occasional odd character. The other possibility I've tried is just repeating the whole regex 3 times like so ([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+) which is horrible to look at but I hoped it would work. This had the odd effect of working, but only if at least one of the matches was one of the previously listed punctuation.
Any insights would be appreciated.
I'm using the new regex module found here so that I can have overlapping searches.

What is wrong with your approach
The ([.,!?;] *|(?<= |\A)[\-'’:\w]+) pattern matches a single "unit" (either a word or a single punctuation from the specified set [.,!?;] followed with 0+ spaces. Thus, when you fed this pattern to the regex.findall, it only could return just the chunk list ['This', 'is', 'a', 'test', '. ', 'Test'].
Solution
You can use a slightly different approach: match all words, and all chunks that are not words. Here is a demo (note that C'est and AUX-USB are treated as single "words"):
>>> pat = r"((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*))\s*((?1))\s*((?1))"
>>> results = regex.findall(pat, text, overlapped = True)
>>> results
[("C'est", 'un', 'test'), ('un', 'test', '....'), ('test', '....', 'aux-usb')]
Here, the pattern has 3 capture groups, and the second and third one contain the same pattern as in Group 1 ((?1) is a subroutine call used in order to avoid repeating the same pattern used in Group 1). Group 2 and Group 3 can be separated with whitespaces (not necessarily, or the punctuation glued to a word would not be matched). Also, note the negative lookbehind (?<!') that will ensure that C'est is treated as a single entity.
Explanation
The pattern details:
((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*)) - Group 1 matching:
(?:[^\w\s'-]+(?=\s|\b) - 1+ characters other than [a-zA-Z0-9_], whitespace, ' and - immediately followed with a whitespace or a word boundary
| - or
\b(?<!')\w+(?:['-]\w+)*) - 1+ word characters not preceded with a ' (due to (?<!')) and preceded with a word boundary (\b) and followed with 0+ sequences of - or ' followed with 1+ word characters.
\s* - 0+ whitespaces
((?1)) - Group 2 (same pattern as for Group 1)
\s*((?1)) - see above

Related

FutureWarning: Possible nested set at position 1 Error Python

I was working on something and at some point, I needed to check whether the string satisfies this:
The string must contain at least 5 words and each separated by a hyphen(-) or an underscore(_).
Here is the code that I wrote:
password=eval(input('Password:'))
pattern=r'[[\w][-_]]{5,}'
import re
re.fullmatch(pattern,password)
But it gives ' ipython-input-32-7c87b09218f8>:4: FutureWarning: Possible nested set at position 1
re.fullmatch(pattern,password) ' error. Why that happens, any idea?Thanks in advance.Btw I'm using Jupyter notebook.
You can match 1+ word characters, and then repeat at least 4 times matching either _ or / and again 1 or more word characters.
\w+(?:[/_]\w+){4,}
Explanation
\w+ Match 1+ word characters
(?: Non capture group to repeat as a whole part
[/_] Character class matching either / or _
\w+ Match 1+ word characters
){4,} close the no capture group and repeat 4 or more times
See a regex demo.

Regex to match dollar amount with uppercase letter or word

I'm trying to match some sort of amount, here are all possibilities:
$5.6 million
$4,1 million
$8,1M
$6.3M
$333,333
$2 million
$5 million
I have already this regex:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
See online demo.
But I'm not able to match those ones:
$5.6 million
$4,1 million
$8,1M
$6.3M
Any help would be appreciated.
Let's look at your regular expression:
\$\d{1,3}(?:,\d{3})*(?:\s+(?:thousand|[mb]illion|[MB]illion)|[M])?
\$\d{1,3} is fine. What follows? One way to answer that is to consider the following three possibilities.
The string to be matched ends ' million'
This string (which begins with a space, in case you missed that) is preceded by an empty string or a single digit preceded by a comma or period:
(?:[,.]\d)? million
Evidently, "million" can be "thousand" or "billion", and the first in last might be capitalized, so we change the expression to
(?:[,.]\d)? (?:[MmBb]illion|thousand)
One potential problem is that this matches '$5.6 millionaire'. We can avoid that problem by tacking on a word boundary preventing the match to be followed by a word character:
(?:[,.]\d)? (?:[MmBb]illion|thousand)\b
The string ends 'M'
In this case the 'M' must be preceded by a single digit preceded by a comma or period:
[,.]\dM\b
You could accept 'B' as well by changing M to [MB].
The string ends with three digits preceded by a comma
Here we need
,\d{3}\b
Here the word boundary avoids matching, for example, $333,3333'. It will not match, however, '$333,333,333' or '$333,333,333,333'. If we want to match those we could change the expression to
(?:,\d{3})+\b
or to match '$333' as well, change it to
(?:,\d{3})*\b
Construct the alternation
We therefore can use the following regular expression.
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)\b|[,.]\dMb|,\d{3}b)
Factoring out the end-of-string anchor we obtain
\$\d{1,3}(?:(?:[,.]\d)? (?:[MmBb]illion|thousand)|[,.]\dM|,\d{3})b
Demo
You can use
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?
If you need to make sure you do not match m that is part of another word:
(?i)\$\d+(?:[.,]\d+)*(?:\s+(?:thousand|[mb]illion)|m)?\b
See the regex demo. Details:
(?i) - case insensitive option
\$ - a $ char
\d+ - one or more digits
(?:[.,]\d+)* - zero or more repetitions of . or , and then one or more digits
(?:\s+(?:thousand|[mb]illion)|m)? - an optional occurrence of
\s+(?:thousand|[mb]illion) - one or more whitespaces and then thousand, million or billion
| - or
m - an m char
\b - a word boundary.

Split by '.' when not preceded by digit

I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is not preceded or followed with a digit, and that is not at the end of the string a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re
doc = '10.1 This is a sentence. Another sentence.'
def sentences(doc):
#split all sentences
s = re.split(r'\.\s+', doc)
#remove empty index or remove period from absolute last index, if present
if s[-1] == '':
s = s[0:-1]
elif s[-1].endswith('.'):
s[-1] = s[-1][:-1]
#return sentences
return s
print(sentences(doc))
The way I structured my regex it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehinds so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that has a non-digit character. (?=\D) then makes sure there's a non digit after the current position. When everything applies, it splits correctly.

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As #revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match('.*?(?<=-)(((?<=\s+)?[a-zA-Z\d]+(?=\s+)?)+)', inputString)
If you want to find each string of alphanumerics after a each '-' then you can do this.
re.findall('(?<=-)[a-zA-Z\d]+')

How to change a quantifier in a Regex based on a condition?

I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
>>> [u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
>>> [u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier, it will find a but also area- instead of area.
Then how could create a conditional regex saying: if the word contains an apostrophe or an hyphen, use the + quantifier else use the * quantifier ?
I suggest you to change the last [-\']?[a-z]+ part as optional by putting it into a group and then adding a ? quantifier next to that group.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
Reason for why the a is not printed is because of your regex contains two [a-z]+ which asserts that there must be atleast two lowercase letters present in the match.
Note that the regex i mentioned won't match area- because (?:[-\'][a-z]+)? optional group asserts that there must be atleast one lowercase letter would present just after to the - symbol. If no, then stop matching until it reaches the hyphen. So that you got area at the output instead of area- because there isn't an lowercase letter exists next to the -. Here it stops matching until it finds an hyphen without following lowercase letter.

Categories

Resources