Consecutive uppercase letters regex

Consecutive uppercase letters regex - python

I'm trying to use Regular expressions to find three consecutive uppercase letters within a string.
I've tried using:
\b([A-Z]){3}\b
as my regex which works to an extent.
However, this only matches the three letters when they stand alone as a word. I also want it to find three consecutive uppercase letters nested within a string, e.g. thisISAtest.

I wonder why you have those word boundaries (\b) in your regexp? A word boundary matches only where a word character is next to a non-word character (or the start or end of the string). They are what prevent thisISAtest from being matched. Remove them and you should be good!
([A-Z]){3}
Another thing is that I'm not sure why you're using a capture group. Are you extracting the last letter of the three uppercase letters? If not, you can simply use:
[A-Z]{3}
You don't necessarily need groups to use definite quantifiers. :)
EDIT: To reject runs of more than three consecutive uppercase letters, you can make use of negative lookarounds:
(?<![A-Z])[A-Z]{3}(?![A-Z])
(?<![A-Z]) makes sure there's no preceding uppercase letter;
(?![A-Z]) makes sure there's no following uppercase letter.
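
To compare the three patterns from this answer side by side, here is a small sketch using Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# Illustrative inputs (not from the original question, apart from thisISAtest).
samples = ["ABC", "thisISAtest", "thisISALongtest", "abc"]

bounded = re.compile(r"\b[A-Z]{3}\b")                   # original attempt, with word boundaries
anywhere = re.compile(r"[A-Z]{3}")                      # boundaries removed
exactly3 = re.compile(r"(?<![A-Z])[A-Z]{3}(?![A-Z])")   # run of exactly three uppercase letters

for s in samples:
    print(s, bounded.findall(s), anywhere.findall(s), exactly3.findall(s))
```

Only the lookaround version rejects thisISALongtest, where the uppercase run is four letters long.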

Related

Regex python find uppercase names

I have a text file of the type:
[...speech...]
NAME_OF_SPEAKER_1: [...speech...]
NAME_OF_SPEAKER_2: [...speech...]
My aim is to isolate the speeches of the various speakers. They are clearly identified because the name of each speaker is always written in uppercase letters (name + surname). However, the speeches can contain nouns (not people's names) in uppercase letters, but only one of them is long enough to cause a problem (it has four letters, say it is 'ABCD'). I was thinking of identifying the position of each speaker's name (I assume every name is at least 3 letters long) with something like
re.search('[A-Z^(ABCD)]{3,}',text_to_search)
in order to exclude that specific (constant) word 'ABCD'. However, the command identifies that word instead of excluding it. Any ideas about how to overcome this problem?

In the pattern that you tried, you get partial matches: there are no boundaries, and [A-Z^(ABCD)]{3,} matches three or more of any of the listed characters. Inside a character class, ^ only negates when it comes first, and parentheses are literal characters.
A-Z already covers A, B, C and D, so the class could equally be written as [A-Z^)(]{3,}
Instead of using a negated character class, you could assert that the word, which consists only of uppercase chars A-Z, does not contain ABCD, using a negative lookahead (?!...):
\b(?![A-Z]*ABCD)[A-Z]{3,}\b
If the name should start with 3 uppercase char, and can contain also lowercase chars, an underscore or digits, you could add \w* after matching 3 uppercase chars:
\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b
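
As a rough check of the first pattern, the sketch below runs it over a made-up transcript. The speaker names and sentences are assumptions for illustration; ABCD stands in for the one uppercase non-name word, as in the question.

```python
import re

# Hypothetical transcript, not taken from the asker's file.
text = "JOHN SMITH: we met at ABCD today.\nMARY JONES: yes, ABCD was busy."

# Uppercase words of 3+ letters that do not contain ABCD.
name_word = re.compile(r"\b(?![A-Z]*ABCD)[A-Z]{3,}\b")
found = name_word.findall(text)
print(found)

# Variant that rejects only the exact word ABCD, so a name that merely
# contains those letters (e.g. the made-up ABCDE) is still matched.
exact_only = re.compile(r"\b(?!ABCD\b)[A-Z]{3,}\b")
print(exact_only.findall("ABCD ABCDE"))
```

Note that the lookahead as written also rejects any longer uppercase word that contains ABCD; the second pattern is one possible way around that.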

Square brackets [] match a single character only. Round brackets () inside square brackets are literal characters, not a group. That means:
[ABCD] is the same as [A-D], and [(ABCD)] matches one of A-D or a literal parenthesis.
[^(ABCD)] matches any single character that is not one of A-D and not a parenthesis.
I would try something different:
^[A-Z]*?: matches each word written in capital letters that starts at the beginning of a line and is followed by a colon (use re.MULTILINE so that ^ matches at the start of every line).
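
A minimal sketch of that idea, assuming single-word speaker labels at the start of each line (the text below is invented). The asker wanted positions, so it records where each label starts.

```python
import re

# Hypothetical input; ABCD inside a speech is ignored because it is not
# at the start of a line and is not followed by a colon.
text = "ALICE: hello ABCD\nBOB: hi there\n"

# re.MULTILINE lets ^ match at the start of every line, not just the string.
speaker_label = re.compile(r"^[A-Z]*?:", flags=re.MULTILINE)
labels = [(m.start(), m.group()) for m in speaker_label.finditer(text)]
print(labels)  # [(0, 'ALICE:'), (18, 'BOB:')]
```

A label made of two words (name and surname separated by a space) would need the space added to the character class, which is not part of the pattern above.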

Regex backreference to match opposite case

Before I begin, it may be worth stating that this technically does not have to be solved using a regex; it's just that I immediately thought of a regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve it using one.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next thought was whether the following is possible in a regex, but I've been reading several regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (which obviously doesn't work) is supposed to match a single uppercase letter immediately followed by itself plus 32 in ASCII; in other words, it should match an uppercase letter followed immediately by its lowercase counterpart. But, as far as I know, you cannot "add an ASCII value" to a backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?

You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a positive lookahead that makes sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but of the opposite case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE. Then, inside the second lookahead, (?i:\1) makes sure the fifth char from the end is the same letter (with case-insensitive mode on), and the (?!\1) negative lookahead makes sure this letter is not identical to the one captured, so it must differ in case.
The re module does not support \p{L}, and it only supports inline modifier groups such as (?i:...) from Python 3.6 onward, so this expression won't work with that module as written.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
    print("Testing {}...".format(s))
    if regex.search(rx, s):
        print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...
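
Since the question says a regex is not strictly required, here is a plain-Python sketch that also covers the bonus case of arbitrary (even) length. It is an alternative technique, not the regex approach above. The function name is hypothetical, and it assumes the input should consist of ASCII letters only.

```python
def is_case_mirrored(s: str) -> bool:
    """Return True if the second half equals the first half with each letter's case flipped."""
    half, remainder = divmod(len(s), 2)
    if remainder or half == 0:
        return False  # odd-length or empty strings cannot be split into two equal halves
    first, second = s[:half], s[half:]
    # Assumption: only ASCII letters are allowed, so swapcase() never changes the length.
    return first.isascii() and first.isalpha() and first.swapcase() == second


for s in ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']:
    print(s, is_case_mirrored(s))
```

The first three strings give True and the last two give False, in line with the regex demo output.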

how do you substitute a repeating character both uppercase and lowercase with regex in python

Say you have a word "Aabrakadaabra" and you want to find the repeating characters and replace each pair with a single one, which in our case should return "Abrakadabra".
What I did was re.sub(r"([a-zA-z])\1",r"\1","Aabrakadaabra"), which returns 'Aabrakadabra'; this regex cannot catch a repeat where one letter is uppercase and the other lowercase. I'm not sure if there is an easy, one-liner way to do this, but any help would be educational.

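Use re.IGNORECASE, so that the backreference \1 also matches the same letter in the other case. (Note that the class should be [a-zA-Z]; the range A-z also covers several punctuation characters.)
>>> re.sub(r"([a-zA-Z])\1", r"\1", "Aabrakadaabra", flags=re.IGNORECASE)
'Abrakadabra'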

Regex pattern to avoid match certain words like customize negation

I have a regular expression to match a particular pattern, say, one that matches all three-letter words. But I want it not to match words like 'and', 'got', etc. What would be the best way to do it in Python?
My pattern is
r'\b\w{3}\b'
I tried
r'(\b\w{3}\b)(?!and)'
but it fails.

Regexes match left to right, and lookaheads are no exception. Your expression matches a three-letter word and then checks that it is not followed by and. That check can never fail: after the closing \b the next character is not a word character, so it cannot be the a of and.
Move the lookahead before the \w to make it work:
r'(\b(?!and)\w{3}\b)'
You can add more words there --
r'(\b(?!and|got|may)\w{3}\b)'
but for more non-matches it may be more effective to match all three letter words and use code to strip the result of them.
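
A small sketch of that second approach, next to the lookahead version for comparison. The exclusion set and the sample sentence are hypothetical.

```python
import re

STOP_WORDS = {"and", "got", "may"}  # hypothetical exclusion list

text = "you and the cat got one toy"

# Approach 1: match every three-letter word, then filter in code.
three_letter = re.findall(r"\b\w{3}\b", text)
kept = [w for w in three_letter if w not in STOP_WORDS]

# Approach 2: the lookahead pattern from above.
via_lookahead = re.findall(r"\b(?!and|got|may)\w{3}\b", text)

print(kept)           # ['you', 'the', 'cat', 'one', 'toy']
print(via_lookahead)  # same list
```

Filtering in code keeps the pattern short and lets the exclusion list grow without touching the regex.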

How to match alphabetical chars without numeric chars with Python regexp?

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about how underscores are handled, so thanks for the warnings that "\w" matches them, and for the accepted solution that addresses this issue.

You want [^\W\d]: the set of characters that are neither non-alphanumeric nor digits. Add an underscore to that negated set if you don't want underscores either: [^\W\d_].
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.
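
A quick sketch of how the two classes behave on a made-up string with accented letters, digits and an underscore (Python 3 str patterns, where \w and \d are Unicode-aware):

```python
import re

sample = "héllo_wörld 42 abc123"  # illustrative input

with_underscore = re.findall(r"[^\W\d]+", sample)   # letters and underscores
letters_only = re.findall(r"[^\W\d_]+", sample)     # letters only

print(with_underscore)  # ['héllo_wörld', 'abc']
print(letters_only)     # ['héllo', 'wörld', 'abc']
```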

(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match multiple of these, enclose in parens:
(?:(?!\d)\w)+
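
For comparison, a short sketch of the lookahead forms on an invented string. Both variants still match underscores, since an underscore is a \w character that is not a digit.

```python
import re

sample = "abc123 foo_bar 9lives"  # illustrative input

negative = re.findall(r"(?:(?!\d)\w)+", sample)   # negative lookahead form
positive = re.findall(r"(?:(?=\D)\w)+", sample)   # positive lookahead with \D

print(negative)  # ['abc', 'foo_bar', 'lives']
print(positive)  # same list
```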
