I have a text file that needs to have the letter 't' removed if it is not immediately preceded by a number.
I am trying to do this using re.sub and I have this:
import re
f=open('File.txt').read()
g=f
g=re.sub('([^0-9])t','',g)
This identifies the letters to be removed correctly, but it also removes the preceding character. How can I refer to the parenthesized group in the replacement string?
Thanks!
Use a positive lookbehind (or a negative lookbehind) instead.
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
Three options:
g=re.sub('([^0-9])t','\\1',g)
or
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
The first option is what you are looking for, a backreference to the captured string. \\1 will refer to the first captured group.
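Lookarounds don't consume characters, so there is nothing to put back. Here, I have used a positive lookbehind for the second option and a negative lookbehind for the third. Neither includes the character inside its brackets in the match, so the [^0-9] or [0-9] character is not removed by the replacement. It might be better to use those, since they also handle consecutive matches such as "tt", where the capturing version has already consumed the first t as the preceding character.
The positive lookbehind makes sure that t has a non-digit character before it. The negative lookbehind makes sure that t does not have a digit character before it. They differ at the start of the string: the positive lookbehind needs some character to be there, so a leading t is kept, while the negative lookbehind removes it.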
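To see how the three options differ in practice, here is a minimal sketch run on a made-up sample string (the string and the variable names are only for illustration):

```python
import re

# Hypothetical sample: three consecutive t's, then a 't' that follows a digit.
text = "ttt 2t"

# Backreference: the preceding character is consumed, then put back via \1.
with_backref = re.sub(r'([^0-9])t', r'\1', text)

# Positive lookbehind: some non-digit character must precede the 't'.
with_positive = re.sub(r'(?<=[^0-9])t', '', text)

# Negative lookbehind: only requires that no digit precedes the 't'.
with_negative = re.sub(r'(?<![0-9])t', '', text)

print(repr(with_backref))   # 'tt 2t'
print(repr(with_positive))  # 't 2t'
print(repr(with_negative))  # ' 2t'
```

Only the negative lookbehind removes a 't' at the very start of the string, and the backreference version can leave a 't' behind in a run such as "ttt" because each match consumes the character before it.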
How can I match the first character only if there is no integer after it?
I need to find every place that has a '[' without an integer right after it.
For example:
[abc] pass
[cxvjk234] pass
[123] fail
Right now, I have this:
((([[])([^0-9])))
It gets the first 2 characters while I need only one.
In general, to match some pattern not followed with a digit, you need to add a (?!\d) / (?![0-9]) negative lookahead to the expression:
\[(?!\d)
\[(?![0-9])
  ^^^^^^^^^
See the regex demo. This matches any [ symbol that is not immediately followed with a digit.
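In Python, the negative lookahead can be checked against the three examples from the question. This is only a small sketch; the variable names are mine.

```python
import re

# '[' that is not immediately followed by a digit; only the '[' itself is matched.
pattern = re.compile(r'\[(?!\d)')

results = {}
for sample in ["[abc]", "[cxvjk234]", "[123]"]:
    match = pattern.search(sample)
    results[sample] = "pass" if match else "fail"
    print(sample, results[sample])
```

The lookahead is zero-width, so when there is a match, match.group() is just "[".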
Your current regex pattern is overloaded with capturing groups, and if we remove those redundant ones, it looks like (\[)([^0-9]) - it matches a [ and then a char other than an ASCII digit.
You may use
(?<=\[)\D
or (if you want the pattern to exclude only ASCII digits)
(?<=\[)[^0-9]
See the regex demo
Details:
(?<=\[) - a positive lookbehind requiring a [ (but not consuming the [ char, i.e. not returning it as part of the match value) before...
\D / [^0-9] - a non-digit. NOTE: \D may also exclude non-ASCII Unicode digits; to negate only ASCII digits with \D in .NET, use the RegexOptions.ECMAScript flag.
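The same lookbehind works unchanged in Python's re module. The sketch below also uses re.ASCII, which, as far as I can tell, is the closest Python counterpart to the flag mentioned in the note: without it, \d (and therefore \D) is Unicode-aware for str patterns. The sample strings are made up.

```python
import re

unicode_aware = re.compile(r'(?<=\[)\D')
ascii_only = re.compile(r'(?<=\[)\D', re.ASCII)

print(unicode_aware.search("[abc]").group())  # a
print(unicode_aware.search("[123]"))          # None

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit for \d by default,
# but a non-digit once re.ASCII restricts \d to [0-9].
arabic_three = "[\u0663]"
print(unicode_aware.search(arabic_three))           # None
print(ascii_only.search(arabic_three) is not None)  # True
```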
One possible solution would be:
\[\D[^]]*\]
# look for [
# \D - not a digit
# anything not ], zero or more times
# followed by ]
See a demo on regex101.com.
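For reference, a short Python check of this pattern on the question's examples, joined into one made-up string:

```python
import re

# '[', one non-digit, then anything up to the closing ']'.
pattern = re.compile(r'\[\D[^]]*\]')

found = pattern.findall("[abc] [cxvjk234] [123]")
print(found)  # ['[abc]', '[cxvjk234]']
```

Unlike the lookaround versions, this one returns the whole bracketed token rather than a single character.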
Don't use so many parentheses. Parentheses both group and capture, and every captured group is returned as part of the match result.
\[([^0-9])
If you absolutely need to use the parentheses, use (?:…) for parentheses that group but are not returned as part of the match.
(?:(?:(?:\[)([^0-9])))
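A small sketch of what each variant actually captures in Python (the sample string is taken from the question, the variable names are mine):

```python
import re

sample = "[abc]"

two_groups = re.search(r'(\[)([^0-9])', sample)
one_group = re.search(r'\[([^0-9])', sample)
non_capturing = re.search(r'(?:(?:(?:\[)([^0-9])))', sample)

print(two_groups.groups())     # ('[', 'a')
print(one_group.groups())      # ('a',)
print(non_capturing.groups())  # ('a',)
print(one_group.group(0))      # [a
```

In all three cases the overall match (group 0) is still "[a"; only the list of captured groups changes.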
Maybe what you're looking for is \[([^0-9])
Do you expect [123abc] to pass?
I want to accept only strings that follow the pattern 'wild.flower', 'pink.flower', ..., i.e. any word preceding '.flower', where the word must not contain a dot. For example, "pink.blue.flower" is unacceptable. Can anyone show me how to do this in Python using a regex?
You are looking for "^\w+\.flower$".
Is this sufficient?
^\w+\.\w+$
Here is the regex for you: ^([^\.]*)\.flower$. Note that * also accepts a bare ".flower"; use + instead if the word must not be empty.
Example: https://regex101.com/r/cSL445/1.
Your case of pink.blue.flower is unclear. There are 2 possibilities:
Match only blue (cut off the preceding dot and everything before it).
Reject this case altogether (you want to match a word preceding .flower only if it is not preceded by a dot).
In the first case, accept one of the other answers.
But if you want the second solution, use: \b(?<!\.)[a-z]+(?=\.flower).
Description:
\b - Start from a word boundary (but it allows the "after a dot" case).
(?<!\.) - Negative lookbehind - exclude the "after a dot" case.
[a-z]+ - Match a sequence of letters.
(?=\.flower) - Positive lookahead for .flower.
I assumed that you have only lower case letters, but if that is not the case, then add the i (case insensitive) option.
Another remark: other answers include \w, which also matches digits and _, or even [^\.], which matches any char other than a dot (including e.g. \n). Are you happy with that? If you aren't, change to [a-z] (again, maybe with the i option).
To match any character except a newline or a dot you could use a negated character class [^.\r\n]+ and repeat that one or more times and use anchors to assert the start ^ and the end $ of the line.
^[^.\r\n]+\.flower$
Or you could specify in a character class which characters you would allow to match followed by a dot \. and flower.
^[a-z0-9]+\.flower$
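The difference between the two anchored patterns shows up on inputs that contain characters other than lowercase letters and digits. A quick sketch with made-up samples (the dictionary keys are just labels I chose):

```python
import re

patterns = {
    "any_but_dot": re.compile(r'^[^.\r\n]+\.flower$'),
    "lower_alnum": re.compile(r'^[a-z0-9]+\.flower$'),
}
samples = ["wild.flower", "pink.blue.flower", "wild rose.flower", ".flower"]

accepted = {
    name: [s for s in samples if pattern.match(s)]
    for name, pattern in patterns.items()
}
print(accepted["any_but_dot"])  # ['wild.flower', 'wild rose.flower']
print(accepted["lower_alnum"])  # ['wild.flower']
```

Both reject "pink.blue.flower" and the bare ".flower"; the negated class additionally lets spaces and other punctuation through.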
I needed a regex pattern to match any 16-digit number, with or without hyphens between each group of four digits, in which no digit is repeated more than 3 times in a row.
So the pattern I wrote is
a=re.compile(r'(?!(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)')
But the example "5133-3367-8912-3456" gets matched even when 3 is repeated 4 times. (What is the problem with the negative lookahead section?)
Lookaheads only perform their check at the position where they appear, which in your case is the start of the string. If you want a lookahead to check whether a certain pattern can or cannot be matched anywhere in the rest of the string, add .* in front of it so that it looks deeper into the string.
In your case, you could change it to r'(?!.*(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)'.
There is also no need to escape the hyphens, since they are outside a character class, and I would move the lookahead right after the ^. I don't know how well Python regexes are optimized, but that way the start-of-string anchor is matched first (only 1 valid position) instead of the lookahead being checked at every position only for the match to fail at ^. This would give r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)'
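A minimal sketch comparing the two lookahead placements. I am assuming the first group of the question's pattern is meant to be \d{4}; the variable names and the two extra sample numbers are mine.

```python
import re

# Lookahead evaluated only at position 0 (the question's approach).
first_position_only = re.compile(r'(?!(\d)-?\1-?\1-?\1)^\d{4}-?\d{4}-?\d{4}-?\d{4}$')

# Lookahead that scans the whole string, placed right after the ^ anchor.
whole_string = re.compile(r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)')

repeated = "5133-3367-8912-3456"  # '3' occurs four times in a row across a hyphen

print(bool(first_position_only.search(repeated)))            # True
print(bool(whole_string.search(repeated)))                   # False
print(bool(whole_string.search("5123-4567-8912-3456")))      # True
print(bool(whole_string.search("5123456789123456")))         # True
```

The first pattern only tests whether the string starts with four equal digits, which is why the example from the question slips through.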
I have the following regex that is supposed to find sequences of words that end with a punctuation mark. The lookahead ensures that the match is followed by a whitespace character and then a capital letter or a digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")
Does Python 3.2 support variable-length lookahead (\s+)? I do not get any error. Furthermore, I cannot see any difference between the two patterns. Both seem to work the same regardless of the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the lookahead?
I'm not really sure what you are trying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
Does the lookahead require another string to follow?
In any case:
\s requires exactly one whitespace character;
\s+ requires at least one whitespace character.
And yes, the lookahead accepts the "+" quantifier, even in Python 2.x.
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead may or may not be necessary, and adding "+" to allow more than one whitespace character may or may not change the result.
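Running the three findall calls above on that sample string gives the following. The last part is my own addition: as far as I know, Python's re module accepts variable-length patterns in a lookahead but rejects them in a lookbehind, which is probably where the question about "variable lookahead" comes from.

```python
import re

s = 'Stefano ciao. a domani. a presto;'

plain = re.findall(r'([\w\s]*[\?\!\.;])', s)
one_space = re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
many_spaces = re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)

print(plain)        # ['Stefano ciao.', ' a domani.', ' a presto;']
print(one_space)    # ['Stefano ciao.', ' a domani.']
print(many_spaces)  # ['Stefano ciao.', ' a domani.']

# A variable-length lookbehind, by contrast, is rejected when it is compiled.
try:
    re.compile(r'(?<=\s+)\w')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```

With single blanks between the sentences, \s and \s+ give the same result; the last sentence drops out in both lookahead versions because nothing follows it.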
The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter, while the second one expects at least one whitespace character and takes as many as possible.
The + is called a quantifier. It means one or more, matching as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
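The difference only becomes visible when more than one blank follows the punctuation. A minimal sketch with a made-up text (the two blanks are concatenated explicitly so that they are easy to see):

```python
import re

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

# Two blanks after the first sentence, one blank after the second.
text = "First one." + "  " + "Second one. Third"

single = pat1.findall(text)
multiple = pat2.findall(text)

print(single)    # ['First one.  Second one.']
print(multiple)  # ['First one.', 'Second one.']
```

With pat1 the first full stop is not accepted as an end point because two blanks follow it, so the lazy .+? keeps going until the next punctuation mark that is followed by exactly one blank and a capital letter.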
Further study.
"I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter"
To answer this comment, please consider:
What does \w.+? actually match?
A single word character [a-zA-Z0-9_] followed by at least one "any" character (except newline), but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead can match afterwards. Therefore you consume all the blanks except one. This is why you see them in your output.
Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about underscores state, so thanks for warnings about this being matched by "\w" and for the elected solution that addresses this issue.
You want [^\W\d]: the set of characters that are neither non-word characters nor digits, i.e. the word characters minus the digits. Add an underscore to that negated set, [^\W\d_], if you don't want underscores either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.
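A quick check of both character classes on a made-up sample that contains an underscore, digits and a non-ASCII letter:

```python
import re

sample = "abc_123 caf\u00e9"  # "café", written with an escape for portability

with_underscore = re.findall(r'[^\W\d]+', sample)
without_underscore = re.findall(r'[^\W\d_]+', sample)

print(with_underscore)     # ['abc_', 'café']
print(without_underscore)  # ['abc', 'café']
```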
(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match several of these in a row, enclose the expression in a non-capturing group and repeat it:
(?:(?!\d)\w)+
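The lookahead forms can be tried out in the same way. In the sketch below the sample string is made up, and the third pattern, which also excludes underscores, is my own variation rather than part of the answer above.

```python
import re

sample = "abc_123 caf\u00e9"  # "café", written with an escape for portability

negative = re.findall(r'(?:(?!\d)\w)+', sample)
positive = re.findall(r'(?:(?=\D)\w)+', sample)

# Assumed variation: put the underscore into the lookahead to exclude it as well.
no_underscore = re.findall(r'(?:(?![\d_])\w)+', sample)

print(negative)       # ['abc_', 'café']
print(positive)       # ['abc_', 'café']
print(no_underscore)  # ['abc', 'café']
```

Because the group is non-capturing, findall returns the whole matched run rather than only its last character.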