Regexp to remove specific number of occurrences of character only - python

In Python re, I have long strings of text with > character chunks of different lengths. One string can have 3 consecutive > chars in the middle, >> in the beginning, or any such combination.
I want to write a regexp that, after splitting the string based on spaces, iterates through each word to only identify those regions with exactly 2 occurrences >>, and I can't be sure if it's at the beginning, middle or end of the whole string, or what characters are before or after it, or if it's even the only 2 characters in the string.
So far I could come up with:
word = re.sub(r'>{2}', '', word)
This ends up removing all occurrences of 2 or more. What regular expression would work for this requirement? Any help is appreciated.

You need to make sure there is no character of your choice both on the left and right using a pair of lookaround, a lookahead and a lookbehind. The general scheme is
(?<!X)X{n}(?!X)
where (?<!X) means no X immediately on the left is allowed, X{n} means n occurrences of X, and (?!X) means no X immediately on the right is allowed.
In this case, use
r'(?<!>)>{2}(?!>)'
See the regex demo.

no need to split on spaces first if dont needs to
try (?<![^ ])[^ >]*>>[^ >]*(?![^ ])
finds segments on space boundry's with only >> in it and no more

Related

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next though was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its' lowercase counterpart. But, as far as I'm concerned, you cannot "add an ascii value" to backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?
You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a ppositive lookahead that make sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?!\1)(?i:\1) makes sure the fifth char from the end is the same (with the case insensitive mode on), and (?!\1) negative lookahead make sure this letter is not identical to the one captured.
The re module does not support inline modifier groups, so this expression won't work with that moduue.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
print("Testing {}...".format(s))
if regex.search(rx, s):
print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...

Why doesn't this regex pattern work as intended?

I needed a regex pattern to catch any 16 digit string of numbers (each four number group separated by a hyphen) without any number being repeated more than 3 times, with or without hyphens in between.
So the pattern I wrote is
a=re.compile(r'(?!(\d)\-?\1\-?\1\-?\1)(^d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)')
But the example "5133-3367-8912-3456" gets matched even when 3 is repeated 4 times. (What is the problem with the negative lookahead section?)
Lookaheads only do the check at the position they are at, so in your case at the start of the string. If you want a lookahead to basically check the whole string, if a certain pattern can or can't be matched, you can add .* in front to make go deeper into the string.
In your case, you could change it to r'(?!.*(\d)\-?\1\-?\1\-?\1)(^d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)'.
There is also no need to escape the minus at the position they are at and I would move the lookahead right after the ^. I don't know how well python regexes are optimized, but that way the start of the string anchor is matched first (only 1 valid position) instead of checking the lookahead at any place just to fail the match at ^. This would give r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)'

Regex, find pattern only in middle of string

I am using python 2.6 and trying to find a bunch of repeating characters in a string, let's say a bunch of n's, e.g. nnnnnnnABCnnnnnnnnnDEF. In any place of the string the number of n's can be variable.
If I construct a regex like this:
re.findall(r'^(((?i)n)\2{2,})', s),
I can find occurences of case-insensitive n's only in the beginning of the string, which is fine. If I do it like this:
re.findall(r'(((?i)n)\2{2,}$)', s),
I can detect the ones only in the end of the sequence. But what about just in the middle?
At first, I thought of using re.findall(r'(((?i)n)\2{2,})', s) and the two previous regex(-ices?) to check the length of the returned list and the presence of n's either in the beginning or end of the string and make logical tests, but it became an ugly if-else mess very quickly.
Then, I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s), which seems to exlude the beginning just fine but (?!$) or (?!\z) at the end of the regex only excludes the last n in ABCnnnn. Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s) which seems to work sometimes, but I get weird results at others. It feels like I need a lookahead or lookbehind, but I can't wrap my head around them.
Instead of using a complicated regex in order to refuse of matching the leading and trailing n characters. As a more pythonic approach you can strip() your string then find all the sequence of ns using re.findall() and a simple regex:
>>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn"
>>> import re
>>>
>>> re.findall(r'n{2,}', s.strip('n'), re.I)
['nnnn', 'nnnnn']
Note : re.I is Ignore-case flag which makes the regex engine matches upper case and lower case characters.
Since "n" is a character (and not a subpattern), you can simply use:
re.findall(r'(?<=[^n])nn+(?=[^n])(?i)', s)
or better:
re.findall(r'n(?<=[^n]n)n+(?=[^n])(?i)', s)
NOTE: This solution assumes n may be a sequence of some characters. For more efficient alternatives when n is just 1 character, see other answers here.
You can use
(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)
See the regex demo
The regex will match repeated consecutive ns (ignoring case can be achieved with re.I flag) that are not at the beginning ((?<!^)) or end ((?!$)) of the string and not before ((?!n)) or after ((?<!n)) another n.
The (?<!^)(?<!n) is a sequence of 2 lookbehinds: (?<!^) means do not consume the next pattern if preceded with the start of the string. The (?<!n) negative lookbehind means do not consume the next pattern if preceded with n. The negative lookaheads (?!$) and (?!n)have similar meanings: (?!$) fails a match if after the current position the end of string occurs and (?!n) will fail a match if n occurs after the current position in string (that is, right after matching all consecutive ns. The lookaround conditions must all be met, that is why we only get the innermost matches.
See IDEONE demo:
import re
p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print([x.group() for x in p.finditer(s)])

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
As a pythonic way for such tasks you can use str.startswith() method, and don't need to use regex.
But about your regex "[>]{1}.*" you don't need {1} after your character class and you can specify the start of your regex with anchor ^.So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally exactly one time (but it denotes {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex expression would be ^[>].*, meaning at the beginning of a string find exactly one > character followed by anything else (and with only one character in the square brackets, you can remove those to simplify it even further: ^>.*

Match to string length by using regex in python

Writing python regex for string. I want the string to be at least 1 symbol and max 30. The problem is that im using 3 sub-blocks in regex letters, so there always must be 3 characters long length.
Is it possible to add that condition in this regex (1-30 characters length):
regex = re.compile("^[a-zA-Z]+[a-zA-Z0-9\.\-]+[a-zA-Z0-9]$")
r = regex.search(login)
Thank you.
Although it is not clear which 1 or 2 length character strings you want to accept I propose the following regex:
regex = re.compile("^[a-zA-Z][a-zA-Z0-9\.\-]{0,28}[a-zA-Z0-9]$")
As the middle set includes all other this will directly match all words with length 3-30 as you wish.
I hope this regex also matches your 2 length strings (I just assumed that the first character must be a letter), you need to add something (using '|') for single letter matches.
In general, this is difficult and doing some work outside of the RE (as suggested in the comment by M. Buettner) is often required. Your problem is easier because it can be reduced to a pattern with only one repeating element.
You have one or more letters, followed by one or more of (letter, digit, dot, hyphen) followed by a single (letter or digit), right? If so, the repetition of the first group is not needed. Leave off the + to get
r"^[a-zA-Z][a-zA-Z0-9\.\-]+[a-zA-Z0-9]$"
and you will match exactly the same set of strings. Any extra leading letters past the first will be matched in the second group instead of the first.
Now, the only variable portion of your RE is the middle section. To limit the overall length to 30, all you need do is limit that middle portion to 28 characters. Change the + to {1,28} to get:
r"^[a-zA-Z][a-zA-Z0-9\.\-]{1,28}[a-zA-Z0-9]$"
You can read more about Python REs at:
http://docs.python.org/2/library/re.html

Categories

Resources