Regex RE: All but not this pattern - python

quick question...
I need a regex that matches a particular letter in a code unless it is contained in a certain pattern.
I want something that matches N followed or preceded by anything, as long as it isn't IMMEDIATELY preceded by C(=O).
Example:
C(=O)N
Should not match
C(=O)CN
Should match
But it doesn't need an anchor because:
C(=O)NCCCN
Should match because of the N at the end
So far I have this:
(?!C\(=O\)N$)[N]
Any help would be appreciated.

You can use a negative lookbehind:
(?<!C\(=O\))N
See the regex demo
The N will get matched only when not preceded immediately with a literal C(=O) sequence.
The (?<!...) construct is called a negative lookbehind. It does not consume characters (it does not move the regex index); it only checks that something is absent from the string immediately before the current position. If the lookbehind pattern does match there, the overall match fails at that position. See Lookarounds for more details.
In Python: r'(?<!C\(=O\))N':
import re
p = re.compile(r'(?<!C\(=O\))N')
strs = ["C(=O)N", "C(=O)CN", "C(=O)NCCCN"]
print([x for x in strs if p.search(x)])
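As a rough sketch of why the lookahead from the question did not help while the lookbehind does, the snippet below compares both patterns on the three sample strings. The variable names are illustrative only and are not part of the original answer.

```python
import re

samples = ["C(=O)N", "C(=O)CN", "C(=O)NCCCN"]

# Pattern from the question: the lookahead inspects the text that starts AT
# the N, which can never begin with "C(=O)N", so it never rejects anything.
attempt = re.compile(r'(?!C\(=O\)N$)[N]')

# Pattern from the answer: the lookbehind inspects the text just BEFORE the N.
lookbehind = re.compile(r'(?<!C\(=O\))N')

# Record the index of every N each pattern matches.
attempt_hits = {s: [m.start() for m in attempt.finditer(s)] for s in samples}
lookbehind_hits = {s: [m.start() for m in lookbehind.finditer(s)] for s in samples}

for s in samples:
    print(s, attempt_hits[s], lookbehind_hits[s])
```

The question's pattern matches every N, including the one directly after C(=O), whereas the lookbehind version skips exactly those.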

Use a negative look-behind instead:
(?<!C\(=O\))N
See this regex101 example.
Regards.

Related

how to write regex to accept the string which end with string

I want to write a regex which accepts this:
Accept:
done
done1
done1,done2,done3
Do not accept:
done1,
done1,done2,
I tried to write this regex
([a-zA-Z]+)?(/d)?(,)([a-zA-Z]+)
but it is not working.
What's wrong? How can I fix it?
I would phrase the regex pattern as:
(?<!\S)\w+(?:,\w+)*(?!\S)
Sample script:
import re
inp = "done done1 done1,done2,done3 done1, done1,done2,"
matches = re.findall(r'(?<!\S)\w+(?:,\w+)*(?!\S)', inp)
print(matches) # ['done', 'done1', 'done1,done2,done3']
Here is an explanation of the regex pattern:
(?<!\S) assert that what precedes is either whitespace or the start of the input
\w+ match a word
(?:,\w+)* followed by a comma and another word, together repeated zero or more times
(?!\S) assert that what follows the final word is either whitespace
or the end of the input
It also depends on how you apply the regex. The regex alone (e.g. when used with re.search()) tells you whether the input contains any substring which matches your regex. In the trivial case, if you are examining one line at a time, add start and end of line anchors around your regex to force it to match the entire line.
Also, of course, notice that the regex to match a single digit is \d, not /d.
Your regex looks like you want both the alphabetics and the numbers to be optional, but the group of alphabetics and numbers to be non-empty; is that correct? One way to do that is to add a lookahead (?=[a-zA-Z\d]) before the phrase which matches both optionally.
import re
tests = """\
done
done1
done1,done2,done3
done1,
done1,done2,
"""
regex = re.compile(r'^(?=[a-zA-Z\d])[a-zA-Z]*\d?(?:,(?=[a-zA-Z\d])[a-zA-Z]*\d?)*$')
for line in tests.splitlines():
    match = regex.search(line)
    if match:
        print(line)
The individual phrases here should be easy to understand. [a-zA-Z]* matches zero or more alphabetics, and \d? matches zero or one digits. We require one of those, followed by zero or more repetitions of a comma followed by a repeat of the first expression.
Perhaps also note that [a-zA-Z\d] is almost the same as \w (the latter also matches an underscore). If you don't care about this inexactness, the expression could be simplified. It would certainly be useful in the lookahead, where the regex after it will not match an underscore anyhow. But I've left in the more complex expression just to make the code easier to follow in relation to the original example.
Demo: https://ideone.com/4mVGDh
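If the underscore inexactness of \w is acceptable, the whole-line check described above could also be sketched with re.fullmatch(), which removes the need for explicit ^ and $ anchors. This is a simplified variant under that assumption, not the answer's exact pattern; the names item_list and accepted are illustrative.

```python
import re

# \w+ is one item; (?:,\w+)* allows further comma-separated items.
# Note: \w also accepts underscores, unlike [a-zA-Z\d].
item_list = re.compile(r'\w+(?:,\w+)*')

lines = ["done", "done1", "done1,done2,done3", "done1,", "done1,done2,"]

# fullmatch() succeeds only if the pattern covers the entire line.
accepted = [line for line in lines if item_list.fullmatch(line)]
print(accepted)
```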

How to match substring or whole string

In Python regex, how would I match only the facebook.com...777 substrings given either string? I don't want the ?sfnsn=mo at the end.
I have (?<=https://m\.)([^\s]+) to match everything after the https://m. prefix. I also have (?=\?sfnsn) to match everything in front of ?sfnsn.
How do I combine the regexes to return only the facebook.com...777 part for either string?
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777?sfnsn=mo
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
have: https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777
want: facebook.com/story.php?story_fbid=123456789&id=7777777777
Here's what I was messing around with https://regex101.com/r/WYz5dn/2
(?<=https://m\.)([^\s]+)(?=\?sfnsn)
You could use a capturing group instead of a positive lookbehind and match either ?sfnsn or the end of the string.
https://m\.(\S*?)(?:\?sfnsn|$)
Regex demo
Using the lookarounds, the pattern could be:
(?<=https://m\.)\S*?(?=\?sfnsn|$)
Regex demo
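A small sketch of how the two patterns above behave on the URLs from the question. The variable names are illustrative; only the patterns come from the answer.

```python
import re

urls = [
    "https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777?sfnsn=mo",
    "https://m.facebook.com/story.php?story_fbid=123456789&id=7777777777",
]

# Variant 1: match the prefix and read the wanted part from group 1.
capture = re.compile(r'https://m\.(\S*?)(?:\?sfnsn|$)')
# Variant 2: lookarounds, so the whole match is the wanted part.
lookaround = re.compile(r'(?<=https://m\.)\S*?(?=\?sfnsn|$)')

captured = [capture.search(u).group(1) for u in urls]
looked = [lookaround.search(u).group(0) for u in urls]

print(captured)
print(looked)
```

In both variants the lazy \S*? stops at the first position where either ?sfnsn or the end of the string follows, so the trailing parameter is dropped when present.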
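Putting a ? after the final lookahead makes the pattern match both strings, since the lookahead may or may not be satisfied. Be aware, though, that an optional lookahead no longer constrains the match, so the greedy [^\s]+ runs to the end of the URL and the result still includes ?sfnsn=mo when it is present: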
(?<=https://m\.)([^\s]+)(?=\?sfnsn)?

Regex to match parenthesis and its contents if it does not start with an underscore

I have this regex:
\([^\(]*?\)
Which matches parentheses in a string and the contents within them. I would like it to match only if there is no _ before the opening parenthesis.
For example I would like it to match (text) in this example:
This is some random (text)
But I do not want it to match anything in this example:
This is another_(text)
How would I go about this?
You can use negative lookbehind for that:
(?<!_)\([^\(]*\)
# ^ negative lookbehind
As is demonstrated in this regex101
Like @SebastianProske says, there is no reason to make [^\(]* non-greedy, since it will never match a closing bracket. So I made it greedy.
Add negative lookbehind: (?<!_) checking just what you said (no "_" before).
One more remark: the content between the parentheses should be any sequence of characters other than a closing parenthesis.
So the whole regex should be:
(?<!_)\([^\)]*\)
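A minimal sketch checking the final pattern against the two example sentences from the question (the variable names are illustrative):

```python
import re

# (?<!_) rejects an opening parenthesis that directly follows an underscore;
# [^)]* then takes everything up to the first closing parenthesis.
pattern = re.compile(r'(?<!_)\([^)]*\)')

plain = pattern.findall("This is some random (text)")
underscored = pattern.findall("This is another_(text)")

print(plain)
print(underscored)
```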

Regex, find pattern only in middle of string

I am using python 2.6 and trying to find a bunch of repeating characters in a string, let's say a bunch of n's, e.g. nnnnnnnABCnnnnnnnnnDEF. In any place of the string the number of n's can be variable.
If I construct a regex like this:
re.findall(r'^(((?i)n)\2{2,})', s),
I can find occurrences of case-insensitive n's only at the beginning of the string, which is fine. If I do it like this:
re.findall(r'(((?i)n)\2{2,}$)', s),
I can detect the ones only at the end of the sequence. But what about just in the middle?
At first, I thought of using re.findall(r'(((?i)n)\2{2,})', s) and the two previous regex(-ices?) to check the length of the returned list and the presence of n's either in the beginning or end of the string and make logical tests, but it became an ugly if-else mess very quickly.
Then, I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s), which seems to exclude the beginning just fine, but (?!$) or (?!\z) at the end of the regex only excludes the last n in ABCnnnn. Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s), which seems to work sometimes, but I get weird results at others. It feels like I need a lookahead or lookbehind, but I can't wrap my head around them.
Instead of using a complicated regex to avoid matching the leading and trailing n characters, a more Pythonic approach is to strip() the string first and then find all the sequences of ns with re.findall() and a simple regex:
>>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn"
>>> import re
>>>
>>> re.findall(r'n{2,}', s.strip('n'), re.I)
['nnnn', 'nnnnn']
Note: re.I is the ignore-case flag, which makes the regex engine match both upper-case and lower-case characters.
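One caveat with the approach above: strip('n') removes only lowercase n, while the question asks for case-insensitive runs of three or more. A possible adaptation, assuming both cases should be stripped, is sketched below; the sample string is made up for illustration.

```python
import re

s = "NnnABCnnNnDEFnnnGHINNN"

# Strip both cases from the ends so leading/trailing runs are discarded.
inner = s.strip('nN')

# n{3,} mirrors the question's "n followed by at least two more n's".
middle_runs = re.findall(r'n{3,}', inner, re.I)
print(inner)
print(middle_runs)
```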
Since "n" is a character (and not a subpattern), you can simply use:
re.findall(r'(?<=[^n])nn+(?=[^n])', s, re.I)
or better:
re.findall(r'n(?<=[^n]n)n+(?=[^n])', s, re.I)
NOTE: This solution assumes n may be a sequence of some characters. For more efficient alternatives when n is just 1 character, see other answers here.
You can use
(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)
See the regex demo
The regex will match repeated consecutive ns (ignoring case can be achieved with re.I flag) that are not at the beginning ((?<!^)) or end ((?!$)) of the string and not before ((?!n)) or after ((?<!n)) another n.
The (?<!^)(?<!n) part is a sequence of two lookbehinds: (?<!^) fails the match if the current position is the start of the string, and (?<!n) fails it if the current position is preceded by n. The negative lookaheads (?!$) and (?!n) work similarly: (?!$) fails the match if the end of the string follows the current position, and (?!n) fails it if an n follows (that is, right after all the consecutive ns have been matched). All the lookaround conditions must be met, which is why we only get the innermost matches.
See IDEONE demo:
import re
p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print([x.group() for x in p.finditer(s)])

Could you explain why this regex is not working?

>>> d = "Batman,Superman"
>>> m = re.search("(?<!Bat)\w+",d)
>>> m.group(0)
'Batman'
Why isn't group(0) matching Superman? This lookaround tutorial says:
(?<!a)b matches a "b" that is not
preceded by an "a", using negative
lookbehind
Batman isn't directly preceded by Bat, so that matches first. In fact, neither is Superman; there's a comma in-between in your string which will do just fine to allow that RE to match, but that's not matched anyway because it's possible to match earlier in the string.
Maybe this will explain better: if the string was Batman and you were starting to try to match from the m, the RE would not match until the character after (giving a match of an) because that's the only place in the string which is preceded by Bat.
At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.
At position 0 of your string (i.e. prior to the B in Batman), the assertion succeeds, as Bat is not present before the current position. Thus, \w+ can match the entire word Batman (remember, quantifiers such as + are greedy by default, i.e. they match as much as possible).
See this page for more information on engine internals.
To achieve what you wanted, you could instead use something like:
\b(?!Bat)\w+
In this pattern, the engine will match a word boundary (\b) [1], followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because using a lookbehind here would have the same problem as your original pattern; it would look before the position directly following the word boundary, and since it's already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.
[1] Note that word boundaries match a boundary between \w and \W (i.e. between [A-Za-z0-9_] and any other character); \b also matches at the start or end of the string when the adjacent character is a word character. If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.
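To make the difference concrete, here is a short sketch that lists every match of the original pattern and of the \b(?!Bat)\w+ alternative on the string from the question (the variable names are illustrative):

```python
import re

d = "Batman,Superman"

# Original pattern: nothing precedes position 0, so "Batman" matches first.
lookbehind_matches = re.findall(r'(?<!Bat)\w+', d)

# Alternative: anchor at a word boundary and reject words starting with "Bat".
lookahead_matches = re.findall(r'\b(?!Bat)\w+', d)

print(lookbehind_matches)
print(lookahead_matches)
```

The lookbehind version accepts both words because neither is preceded by Bat, while the lookahead version rejects Batman at its very first character.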
From the manual:
Patterns which start with negative
lookbehind assertions may match at the
beginning of the string being
searched.
http://docs.python.org/library/re.html#regular-expression-syntax
You're looking for the first set of one or more alphanumeric characters (\w+) that is not preceded by 'Bat'. Batman is the first such match. (Note that negative lookbehind assertions can match the start of a string.)
To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w greedily matches anything including 'Batman'. As in:
>>> re.search("\w+(?<!Bat)man","Batman,Superman").group(0)
'Superman'
