Python regex to match words not having dot

Python regex to match words not having dot - python

I want to accept only those strings having the pattern 'wild.flower', 'pink.flower',...i.e any word preceding '.flower', but the word should not contain dot. For example, "pink.blue.flower" is unacceptable. Can anyone help how to do this in python using regex?

You are looking for "^\w+\.flower$".

Is this sufficient?
^\w+\.\w+$

Here is the regex for you. ^([^\.]*)\.flower$.
Example: https://regex101.com/r/cSL445/1.

Your case of pink.blue.flower is unclear. There are 2 possibilities:
Match only blue (cut off preceding dot and what was before).
Reject this case altogether (you want to match a word preceding .flower
only if it is not preceded with a dot).
In the first case accept other answers.
But if you want the second solution, use: \b(?<!\.)[a-z]+(?=\.flower).
Description:
\b - Start from a word boundary (but it allows the "after a dot" case).
(?<!\.) - Negative lookbehind - exclude the "after a dot" case.
[a-z]+ - Match a sequence of letters.
(?=\.flower) - Positive lookahead for .flower.
I assumed that you have only lower case letters, but if it is not the case,
then add i (case insensitive) option.
Another remark: Other answers include \w, which matches also digits and
_ or even [^\.] - any char other than a dot (including e.g. \n).
Are you happy with that? If you aren't, change to [a-z] (again, maybe
with i option).

To match any character except a newline or a dot you could use a negated character class [^.\r\n]+ and repeat that one or more times and use anchors to assert the start ^ and the end $ of the line.
^[^.\r\n]+\.flower$
Or you could specify in a character class which characters you would allow to match followed by a dot \. and flower.
^[a-z0-9]+\.flower$

Related

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain atleast 1 number
The expressions that I tried are:=
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.

Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
RegEx 101 Explanation
/
(?!.*-)(?=.*\d)^.+$
/
gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)

I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
but putting them together like [^-]*\d would fail on aaa3- because the - comes after a valid match- lets make sure no dashes can sneak in before or after our match ^[-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again- ^[^-]*\d[^-]$ --- which says start - any number of chars that aren't dashes - a digit - any number of chars that aren't dashes - end
Depending on style, we could also do ^([^-]*\d)+$ since the order of the digits/numbers dont matter, we can have as many of those as we want.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search("\d", text):

Two word boundaries (\b) to isolate a single word

I am trying to match the word that appears immediately after a number - in the sentence below, it is the word "meters".
The tower is 100 meters tall.
Here's the pattern that I tried which didn't work:
\d+\s*(\b.+\b)
But this one did:
\d+\s*(\w+)
The first incorrect pattern matched this:
The tower is 100 meters tall.
I didn't want the word "tall" to be matched. I expected the following behavior:
\d+ match one or more occurrence of a digit
\s* match any or no spaces
( start new capturing group
\b find the word/non-word boundary
.+ match 1 or more of everything except new line
\b find the next word/non-word boundary
) stop capturing group
The problem is I don't know tiddly-twat about regex, and I am very much a noob as a noob can be. I am practicing by making my own problems and trying to solve them - this is one of them. Why didn't the match stop at the second break (\b)?
This is Python flavored
Here's the regex101 test link of the above regex.

It didn't stop because + is greedy by default, you want +? for a non-greedy match.
A concise explanation — * and + are greedy quantifiers/operators meaning they will match as much as they can and still allow the remainder of the regular expression to match.
You need to follow these operators with ? for a non-greedy match, going in the above order it would be (*?) "zero or more" or (+?) "one or more" — but preferably "as few as possible".
Also a word boundary \b matches positions where one side is a word character (letter, digit or underscore OR a unicode letter, digit or underscore in Python 3) and the other side is not a word character. I wouldn't use \b around the . if you're unclear what's in between the boundaries.

It match both words because . match (nearly) all characters, so also space character, and because + is greedy, so it will match as much as it could. If you would use \w instead of . it would work (because \w match only word characters - a-zA-Z_0-9).

Python reference to regex in parentheses

I have a text file that needs to have the letter 't' removed if it is not immediately preceded by a number.
I am trying to do this using re.sub and I have this:
f=open('File.txt').read()
g=f
g=re.sub('([^0-9])t','',g)
This identifies the letters to be removed correctly but also removes the preceding character. How can I refer to the parenthesized regex in the replacement String?
Thanks!

Use a lookbehind (or negative lookbehind) instead.
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)

Three options:
g=re.sub('([^0-9])t','\\1',g)
or
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
The first option is what you are looking for, a backreference to the captured string. \\1 will refer to the first captured group.
Lookarounds don't consume characters, so you don't need to replace them back. Here, I have used a positive lookbehind for the first one and a negative lookbehind for the second one. Those don't consume the characters within their brackets, so you are not taking the [^0-9] or [0-9] in the replacement. It might be better to use those since it prevents overlapping matches.
The positive lookbehind makes sure that t has a non-digit character before it. The negative lookbehind makes sure that t does not have a digit character before it.

Python regex positive look ahead

I have the following regex that is supposed to find sequence of words that are ended with a punctuation. The look ahead function assures that after the match there is a space and a capital letter or digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])"
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])"
Is Python 3.2 supporting variable lookahead (\s+)? I do not get any error. Furthermore I cannot see any differences in both patterns. Both seem to work the same regardless the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the look ahead?

I'm not really sure what you are tying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
the lookahead requires another string to follow?
In any case:
\s requires one and only one space;
\s+ requires at least one space.
And yes, the lookahead accepts the "+" modifier even in python 2.x
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead might be necessary or not, and might or might not change to have "+" more than one space option.

The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter while the second one expects at least one whitespace character but as many as possible.
The + is called a quantifier. It means 1 to n as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
Further studying.
I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter
To answer this comment please consider :
What does \w.+? actually matches?
A single word character [a-zA-Z0-9_] followed by at least one "any" character(except newline) but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead later matches. Therefore you consume all the blanks except one. This is why you see them at your output.

Could you explain why this regex is not working?

>>> d = "Batman,Superman"
>>> m = re.search("(?<!Bat)\w+",d)
>>> m.group(0)
'Batman'
Why isn't group(0) matching Superman? This lookaround tutorial says:
(?<!a)b matches a "b" that is not
preceded by an "a", using negative
lookbehind

Batman isn't directly preceded by Bat, so that matches first. In fact, neither is Superman; there's a comma in-between in your string which will do just fine to allow that RE to match, but that's not matched anyway because it's possible to match earlier in the string.
Maybe this will explain better: if the string was Batman and you were starting to try to match from the m, the RE would not match until the character after (giving a match of an) because that's the only place in the string which is preceded by Bat.

At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.
At position 0 of your string (ie. prior to the B in Batman), the assertion succeeded, as Bat is not present before the current position - thus, \w+ can match the entire word Batman (remember, regexes are inherently greedy - ie. will match as much as possible).
See this page for more information on engine internals.
To achieve what you wanted, you could instead use something like:
\b(?!Bat)\w+
In this pattern, the engine will match a word boundary (\b)1, followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because using a lookbehind here would have the same problem as your original pattern; it would look before the position directly following the word boundary, and since its already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.
1 Note that word boundaries match a boundary between \w and \W (ie. between [A-Za-z0-9_] and any other character; it also matches the ^ and $ anchors). If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.

From the manual:
Patterns which start with negative
lookbehind assertions may match at the
beginning of the string being
searched.
http://docs.python.org/library/re.html#regular-expression-syntax

You're looking for the first set of one or more alphanumeric characters (\w+) that is not preceded by 'Bat'. Batman is the first such match. (Note that negative lookbehind assertions can match the start of a string.)

To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w greedily matches anything including 'Batman'. As in:
>>> re.search("\w+(?<!Bat)man","Batman,Superman").group(0)
'Superman'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.