Python, regex negative lookbehind behavior

I have a regular expression that should find up to 10 words in a line. That is, it should include the word just preceding the line feed but not the words after the line feed. I am using a negative lookbehind with "\n".
a = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
r = a.search("THe car is parked in the garage\nBut the sun is shining hot.")
When I execute this regex and call the method r.group(), I get back the whole sentence except the last word, which contains a period. I was expecting only the complete string preceding the new line. That is, "THe car is parked in the garage\n".
What is the mistake that I am making here with the negative look behind...?

I don't know why you would use a negative lookbehind. You are saying that you want a maximum of 10 words before a linefeed. The regex below should work. It uses a positive lookahead to ensure that there is a linefeed after the words. Also, when searching for words, use `\b\w+\b` instead of what you were using.
/(\b\w+\b)*(?=.*\\n)/
Python :
result = re.findall(r"(\b\w+\b)*(?=.*\\n)", subject)
Explanation :
# (\b\w+\b)*(?=.*\\n)
#
# Match the regular expression below and capture its match into backreference number 1 «(\b\w+\b)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*\\n)»
# Match any single character that is not a line break character «.*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the character “\” literally «\\»
# Match the character “n” literally «n»
You may also wish to consider the fact that there could be no \n in your string.

If I read you right, you want to read up to 10 words, or the first newline, whichever comes first:
((?:(?<!\n)\w+\b[\s.]*){0,10})
This uses a negative lookbehind, but just before the word match, so it blocks getting any word after a newline.
This will need some tuning for imperfect input, but it's a start.
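To see what the lookbehind placement changes, here is a minimal sketch that runs the pattern from the question and the pattern from this answer on the question's sample string. The variable names are only illustrative and are not from the original posts.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# Pattern from the question: \s also matches "\n", so the repetition
# carries on into the second line until ten words have been consumed.
original = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
original_match = original.search(text).group()

# Pattern from this answer: the lookbehind sits in front of each word,
# so a word that directly follows "\n" can never start an iteration.
guarded = re.compile(r"((?:(?<!\n)\w+\b[\s.]*){0,10})")
guarded_match = guarded.search(text).group(1)

print(repr(original_match))  # 'THe car is parked in the garage\nBut the sun '
print(repr(guarded_match))   # 'THe car is parked in the garage\n'
```

In the question's pattern the lookbehind is only checked once, after the whole repetition has finished, which is why it does not stop the match at the newline.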

For this task there is the anchor $ to find the end of the string, and together with the modifier re.MULTILINE/re.M it will find the end of the line. So you would end up with something like this:
(\b\w+\b[.\s /]{0,2}){0,10}$
The \b is a word boundary. I included [.\s /]{0,2} to match a dot followed by a whitespace in my example. If you don't want the dots you need to make this part at least optional like this [\s /]? otherwise it will be missing at the last word and then the \s is matching the \n.
Update/Idea 2
OK, maybe I misunderstood your question with my first solution.
If you just want to stop at a newline instead of continuing into the second row, then just don't allow it. The problem is that the newline is matched by the \s in your character class. The \s is a class for whitespace, and this also includes the newline characters \r and \n.
You already have a space in the class, so just replace the \s with \t if you want to allow tabs, and you should be fine without a lookbehind. And of course, make the character class optional; otherwise the last word will not be matched.
((\w)+[\t /]?){0,10}
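The point about \s can be checked directly. The following is a small sketch, with illustrative variable names, that confirms the newline characters belong to \s and then runs the pattern without \s on the question's sample string.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# \s is a whitespace class; "\n" and "\r" are members just like tab and space.
newline_is_whitespace = all(re.fullmatch(r"\s", ch) for ch in "\n\r\t ")

# With \s replaced by \t and the separator made optional, the match
# cannot step over the newline, so it stops after "garage".
no_newline = re.compile(r"((\w)+[\t /]?){0,10}")
first_line_words = no_newline.search(text).group()

print(newline_is_whitespace)   # True
print(repr(first_line_words))  # 'THe car is parked in the garage'
```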

I think you shouldn't be using a lookbehind at all. If you want to match up to ten words not including a newline, try this:
\S+(?:[ \t]+\S+){0,9}
A word is defined here as one or more non-whitespace characters, which includes periods, apostrophes, and other sentence punctuation as well as letters. If you know the text you're matching is regular prose, there's no point limiting yourself to \w+, which isn't really meant to match natural-language words anyway.
After the first word, it repeatedly matches one or more horizontal whitespace characters (space or TAB) followed by another word, for a maximum of ten words. If it encounters a newline before the tenth word, it simply stops matching at that point. There's no need to mention newlines in the regex at all.
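As a rough illustration of both stopping conditions, the sketch below applies this pattern to the question's two-line string and to a made-up twelve-word line (the second string and the variable names are assumptions for the example).

```python
import re

up_to_ten_words = re.compile(r"\S+(?:[ \t]+\S+){0,9}")

two_lines = "THe car is parked in the garage\nBut the sun is shining hot."
long_line = "one two three four five six seven eight nine ten eleven twelve"

# Stops at the newline because [ \t]+ cannot match "\n".
short_match = up_to_ten_words.search(two_lines).group()

# Stops after the tenth word because of the {0,9} limit.
capped_match = up_to_ten_words.search(long_line).group()

print(repr(short_match))   # 'THe car is parked in the garage'
print(repr(capped_match))  # 'one two three four five six seven eight nine ten'
```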

Related

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.

Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top right panel and shows how it would match against some sample inputs.
RegEx 101 Explanation
/(?!.*-)(?=.*\d)^.+$/gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
But putting them together as [^-]*\d would wrongly accept aaa3-, because the - comes after a valid match. Let's make sure no dashes can sneak in before or after our match by anchoring it: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says: start, any number of chars that aren't dashes, a digit, any number of chars that aren't dashes, end.
Depending on style, we could also do ^([^-]*\d)+[^-]*$; since the order of the digits and the other characters doesn't matter, we can have as many of those groups as we want.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search(r"\d", text):
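For what it's worth, that check can be wrapped in a small function and compared against the single-regex version on the sample words from the question. This is only a sketch; the function name has_digit_and_no_hyphen and the list names are made up for the example.

```python
import re

def has_digit_and_no_hyphen(text):
    """Plain-Python check: no hyphen anywhere and at least one digit."""
    return "-" not in text and re.search(r"\d", text) is not None

# The single-regex equivalent from this answer, for comparison.
digit_no_hyphen = re.compile(r"^[^-]*\d[^-]*$")

should_match = ["1DRIVE", "ID100", "W1RELESS"]
should_not_match = ["STACK", "OVERFLOW", "Test-11", "24-hours"]

for word in should_match + should_not_match:
    # Both columns should agree for every sample word.
    print(word, has_digit_and_no_hyphen(word), bool(digit_no_hyphen.search(word)))
```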

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I know, you cannot "add an ascii value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?

You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?!\1)(?i:\1) makes sure the fifth char from the end is the same (with the case insensitive mode on), and the (?!\1) negative lookahead makes sure this letter is not identical to the one captured.
The re module did not support inline modifier groups such as (?i:...) before Python 3.6, and it does not support \p{L}, so this expression won't work with that module as written.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...
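Since the question says a regex is not strictly required and asks about strings of arbitrary length, here is a non-regex sketch built on str.swapcase. The function name is_case_mirrored is hypothetical, and the sketch assumes the input consists only of letters whose case flip is a single character (plain ASCII letters, for instance).

```python
def is_case_mirrored(s):
    """Return True if the second half of s is the first half with case flipped."""
    half, remainder = divmod(len(s), 2)
    # Odd lengths, empty strings, and non-letter characters cannot qualify.
    if remainder or not s.isalpha():
        return False
    return s[:half].swapcase() == s[half:]

strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
for s in strs:
    print(s, is_case_mirrored(s))
```

This swaps the regex for a direct string comparison, so it handles any even length without having to write one capturing group per character.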

What is the purpose of .* in a Python lookahead regex?

I am learning about regular expressions, and I found an interesting and helpful page on using them for password input validation here. The question I have is about the .* in the following expression:
"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$"
I understand that .* is a wildcard character representing any amount of text (or no text) but I'm having trouble wrapping my head around its purpose in these lookahead expressions. Why are these necessary in order to make these lookaheads function as needed?

Lookahead means direct lookahead. So if you write:
(?=a)
it means that the first character should be a. Sometimes, for instance with password checking, you do not want that. You want to express that somewhere there should be an a. So:
(?=.*a)
means that the first character can for instance be a b, 8 or #. But that eventually there should be an a somewhere.
Your regex thus means:
^ # start a match at the beginning of the string
(?=.*[a-z]) # should contain at least one a-z character
(?=.*[A-Z]) # should contain at least one A-Z character
(?=.*\d) # should contain at least one digit
[a-zA-Z\d]{8,} # consists of 8 or more characters and only A-Za-z0-9
$ # end the match at the end of the string
Without the .*, there could never be a match, since:
"^(?=[a-z])(?=[A-Z])(?=\d)[a-zA-Z\d]{8,}$"
means:
^ # start a match at the beginning of the string
(?=[a-z]) # first character should be an a-z character
(?=[A-Z]) # first character should be an A-Z character
(?=\d) # first character should be a digit
[a-zA-Z\d]{8,} # consists of 8 or more characters and only A-Za-z0-9
$ # end the match at the end of the string
Since no character is both an A-Z character and a digit at the same time, this can never be satisfied.
Side notes:
we do not capture in the lookahead so greediness does not matter;
the dot . by default does not match the new line character;
even if it did the fact that you have a constraint ^[A-Za-z0-9]{8,}$ means that you only would validate input with no new line.
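The effect of the .* can be tried out with a short script. This is a sketch only; the candidate passwords and the variable names are invented for the illustration.

```python
import re

with_dot_star = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$")
without_dot_star = re.compile(r"^(?=[a-z])(?=[A-Z])(?=\d)[a-zA-Z\d]{8,}$")

candidates = ["Password1", "password1", "PASSWORD1", "Passw0rd!", "Pw1"]

# With .* each lookahead may find its character anywhere in the string.
results = {c: bool(with_dot_star.search(c)) for c in candidates}

# Without .* all three lookaheads test the same first character,
# so no candidate can satisfy them simultaneously.
any_without = any(without_dot_star.search(c) for c in candidates)

print(results)
print(any_without)  # False
```

Only "Password1" passes the first pattern: the others lack an uppercase letter, lack a lowercase letter, contain a character outside [a-zA-Z\d], or are too short.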

Why doesn't this regex pattern work as intended?

I needed a regex pattern to catch any 16 digit string of numbers (each four number group separated by a hyphen) without any number being repeated more than 3 times, with or without hyphens in between.
So the pattern I wrote is
a=re.compile(r'(?!(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)')
But the example "5133-3367-8912-3456" gets matched even when 3 is repeated 4 times. (What is the problem with the negative lookahead section?)

Lookaheads only do the check at the position they are at, so in your case at the start of the string. If you want a lookahead to check the whole string for whether a certain pattern can or can't be matched, you can add .* in front to make it look deeper into the string.
In your case, you could change it to r'(?!.*(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)'.
There is also no need to escape the hyphens where they appear, and I would move the lookahead right after the ^. I don't know how well Python regexes are optimized, but that way the start-of-string anchor is matched first (only 1 valid position) instead of checking the lookahead at every position just to fail the match at ^. This would give r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)'
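A quick comparison of the two lookahead placements, as a sketch: the first pattern is the question's approach (assuming the first digit group was meant to be \d{4}), the second is the final pattern from this answer. The variable names and the two extra sample numbers are made up for the example.

```python
import re

# Lookahead without .* only inspects the digits at the very start.
start_only = re.compile(r"(?!(\d)-?\1-?\1-?\1)(^\d{4}-?\d{4}-?\d{4}-?\d{4}$)")

# Lookahead with .* can find a run of four equal digits anywhere.
whole_string = re.compile(r"^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)")

repeated = "5133-3367-8912-3456"   # "33-33" is four 3s in a row
clean_hyphens = "5123-4567-8912-3456"
clean_plain = "5123456789123456"

print(bool(start_only.search(repeated)))         # True: the run is not at the start
print(bool(whole_string.search(repeated)))       # False: rejected as intended
print(bool(whole_string.search(clean_hyphens)))  # True
print(bool(whole_string.search(clean_plain)))    # True
```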

Python regex positive look ahead

I have the following regex that is supposed to find sequence of words that are ended with a punctuation. The look ahead function assures that after the match there is a space and a capital letter or digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")
Is Python 3.2 supporting variable lookahead (\s+)? I do not get any error. Furthermore I cannot see any differences in both patterns. Both seem to work the same regardless the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the look ahead?

I'm not really sure what you are trying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
the lookahead requires another string to follow?
In any case:
\s requires one and only one space;
\s+ requires at least one space.
And yes, the lookahead accepts the "+" modifier even in python 2.x
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead may or may not be necessary, and using "+" to allow more than one space may or may not change the result.

The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter while the second one expects at least one whitespace character but as many as possible.
The + is called a quantifier. It means 1 to n as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
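One way to make the difference visible is an input where the punctuation is followed by several blanks. The sketch below uses the two patterns from the question on a made-up sample string (the string and the result names are assumptions for the example).

```python
import re

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")   # exactly one whitespace, then capital or digit
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")  # one or more whitespace, then capital or digit

sample = "Hi there.   Next one. Done"  # three blanks after the first period

single = pat1.findall(sample)
multi = pat2.findall(sample)

print(single)  # ['Hi there.   Next one.']
print(multi)   # ['Hi there.', 'Next one.']
```

With \s the first period is rejected because the character after the single whitespace is another blank, so the lazy .+? keeps extending to the next period. With \s+ the lookahead skips over all three blanks and the first sentence is matched on its own.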
Further studying.
I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter
To answer this comment please consider :
What does \w.+? actually match?
A single word character [a-zA-Z0-9_] followed by at least one "any" character (except newline), but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead later matches. Therefore you consume all the blanks except one. This is why you see them in your output.
