Regex python find uppercase names - python

I have a text file of the type:
[...speech...]
NAME_OF_SPEAKER_1: [...speech...]
NAME_OF_SPEAKER_2: [...speech...]
My aim is to isolate the speeches of the various speakers. They are clearly identified because the name of each speaker is always written in uppercase letters (name + surname). However, the speeches can also contain nouns (not people's names) in uppercase letters, but only one of them is long enough to cause a problem (it has four letters, say it is 'ABCD'). I was thinking of identifying the position of each speaker's name (I assume every name is at least 3 letters long) with something like
re.search('[A-Z^(ABCD)]{3,}',text_to_search)
in order to exclude that specific (constant) word 'ABCD'. However, the command identifies that word instead of excluding it. Any ideas about how to overcome this problem?

In the pattern that you tried, you get partial matches, as there are no boundaries and [A-Z^(ABCD)]{3,} will match 3 or more times any of the listed characters.
The range A-Z already covers A, B, C and D, and a ^ that is not the first character of the class is a literal caret, so the class could also be written as [A-Z^)(]{3,}
Instead of using the negated character class, you could assert that a word consisting only of uppercase chars A-Z does not contain ABCD, using a negative lookahead (?!...):
\b(?![A-Z]*ABCD)[A-Z]{3,}\b
If the name should start with 3 uppercase chars and can also contain lowercase chars, an underscore or digits, you could add \w* after matching the 3 uppercase chars:
\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b
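A minimal sketch of both lookahead patterns in Python. The sample text and the speaker names are made up for illustration; only the patterns come from the answer above.

```python
import re

# Hypothetical transcript; 'ABCD' stands for the uppercase noun to skip.
text = "JOHN SMITH: I met ABCD yesterday.\nMARY JONES: Was ABCD there too?"

# The pattern from the question: a character class, so it still matches ABCD.
original_hit = re.search(r"[A-Z^(ABCD)]{3,}", "say ABCD here").group()

# Whole uppercase words of 3+ letters that do not contain ABCD.
names = re.findall(r"\b(?![A-Z]*ABCD)[A-Z]{3,}\b", text)

# Variant: 3 uppercase letters, then any word characters (digits, underscore, ...).
labels = re.findall(r"\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b", "NAME_OF_SPEAKER_1: hi ABCD")

print(original_hit)  # ABCD
print(names)         # ['JOHN', 'SMITH', 'MARY', 'JONES']
print(labels)        # ['NAME_OF_SPEAKER_1']
```

Note that the lookahead rejects any uppercase word that contains ABCD anywhere, not only the exact word ABCD.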

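Square brackets [] match a single character only. Round brackets () inside square brackets are literal characters, not a group. That means:
[ABCD] is the same as [A-D], and [(ABCD)] is the same as [A-D()].
[^(ABCD)] matches any single character that is not one of A-D, ( or ). The ^ negates the class only when it is the first character; in [A-Z^(ABCD)] it is a literal caret.
I would try something different:
^[A-Z]*?: matches each word written in capital letters which starts at the beginning of a line and is followed by a colon. In Python, pass re.MULTILINE so that ^ matches at the start of every line and not only at the start of the text. Note that [A-Z] alone does not match a space or an underscore, so extend the class if the names contain them.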
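The line-anchored idea could be sketched as follows. The transcript is invented, and the assumption that a speaker label is one or more uppercase words separated by single spaces (name + surname) is mine, not something the question confirms.

```python
import re

# Hypothetical transcript in the format described in the question.
text = (
    "Opening remarks by the chair.\n"
    "JOHN SMITH: I met ABCD yesterday.\n"
    "MARY JONES: Was ABCD there too?\n"
)

# Assumed label shape: uppercase words separated by single spaces,
# at the start of a line, immediately followed by a colon.
speaker = re.compile(r"^([A-Z]+(?: [A-Z]+)*):", re.MULTILINE)

speakers = [m.group(1) for m in speaker.finditer(text)]
print(speakers)  # ['JOHN SMITH', 'MARY JONES']
```

Because the label must sit at the start of a line and end with a colon, an uppercase word such as ABCD inside a speech is never picked up, so no lookahead is needed.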

Related

start with alphabet and finish with digits

I'd like to define a regular expression in Python 3 that extracts words which start with a letter and end with a digit.
What I've been trying is r'^[a-z][A-Z].[0-9]$'
but it didn't return a single word.
Use
r'\b[A-Za-z]\w*[0-9]\b'
This matches words that begin with a letter, have any number of word characters after it, and end in a digit. Notice the word boundaries, which make sure whole words are matched.
As suggested in a comment, consider an alternative:
r'\b[A-Za-z][A-Za-z0-9]*[0-9]\b'
The [A-Za-z0-9]* won't match underscores while \w will.
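A short comparison of the two patterns, using a made-up sample string, shows where they differ:

```python
import re

# Hypothetical sample: x_2 contains an underscore, 9lives starts with a digit.
text = "abc1 9lives x_2 hello a1 b2c"

# \w also accepts underscores between the first letter and the final digit.
with_w = re.findall(r"\b[A-Za-z]\w*[0-9]\b", text)

# Letters and digits only; words containing an underscore are skipped.
strict = re.findall(r"\b[A-Za-z][A-Za-z0-9]*[0-9]\b", text)

print(with_w)  # ['abc1', 'x_2', 'a1']
print(strict)  # ['abc1', 'a1']
```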

Regex condition after and before a known phrase

I am trying to capture a phrase that starts with a capital letter and sits between 2 known phrases. Let's say between "Known phrase, " and the word "The".
For example in the text below, the phrase I'm trying to capture is: Stuff TO CApture That always start with Capital letter but stop capturing when
Ignore Words Known phrase, ignore random phrase Stuff TO CApture That always start with Capital letter but stop capturing when The appears.
Regex I have tried: (?<=Known phrase, ).*(?= The) and Known phrase, (.*) The
These regexes also capture ignore random phrase. How do I ignore this?
For your example data, you might use:
Known phrase, [a-z ]+([A-Z].*?) The
Explanation
Known phrase, Match literally
[a-z ]+ Match 1+ times a lowercase character or a space (Add to the character class what you would allow to match except an Uppercase character)
([A-Z].*?) Capture in a group an uppercase character followed by 0+ times any character except a newline, as few as possible (non-greedy)
 The Match literally, including the leading space
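Applied to the sample sentence from the question, the pattern explained above could be used like this (a sketch; the variable names are arbitrary):

```python
import re

text = (
    "Ignore Words Known phrase, ignore random phrase Stuff TO CApture That "
    "always start with Capital letter but stop capturing when The appears."
)

# [a-z ]+ consumes the lowercase filler; group 1 starts at the first capital
# and lazily extends up to the first " The".
m = re.search(r"Known phrase, [a-z ]+([A-Z].*?) The", text)
captured = m.group(1) if m else None
print(captured)
# Stuff TO CApture That always start with Capital letter but stop capturing when
```

The word That does not end the capture early, because the pattern requires a space followed by the exact letters The.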
Since quantifiers are greedy and the engine works from left to right, you should first consume everything that is not a capital letter.
Something like /Start[^A-Z]*(.*)stop/
([^A-Z] matches anything that is not a capital letter)
I'm not sure what you are trying to do but, sticking with your code, (?<=Known phrase, )([^A-Z]*)(.*)(?=The) should do the trick: the text you need is in group 2.
If you need to match the whole input, just change it to (.*)(?<=Known phrase, )([^A-Z]*)(.*)(?=The)(.*) and get your text from group 3.
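A sketch of the lookbehind variant on the same sample sentence. One detail worth checking: the greedy (.*) stops right before The, so group 2 keeps the space in front of it, and stripping that space is my addition.

```python
import re

text = (
    "Ignore Words Known phrase, ignore random phrase Stuff TO CApture That "
    "always start with Capital letter but stop capturing when The appears."
)

# Group 1 swallows everything up to the first capital letter;
# group 2 runs greedily up to the last position followed by "The".
m = re.search(r"(?<=Known phrase, )([^A-Z]*)(.*)(?=The)", text)
raw = m.group(2)
captured = raw.strip()

print(repr(raw[-5:]))  # 'when '
print(captured)
```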

Consecutive uppercase letters regex

I'm trying to use Regular expressions to find three consecutive uppercase letters within a string.
I've tried using:
\b([A-Z]){3}\b
as my regex which works to an extent.
However, this only matches three-letter strings standing on their own. I also want it to find three consecutive uppercase letters nested within a string, e.g. thisISAtest.
I wonder why you have those word boundaries \b in your regexp? Word boundaries ensure that a word character is followed by a non-word character (or vice versa). They are what prevents thisISAtest from being matched. Remove them and you should be good!
([A-Z]){3}
Another thing is that I'm not sure why you're using a capture group. Are you extracting the last letter of the three uppercase letters? If not, you can simply use:
[A-Z]{3}
You don't necessarily need groups to use definite quantifiers. :)
EDIT: To prevent more consecutive uppercase letters, you can make use of negative lookarounds:
(?<![A-Z])[A-Z]{3}(?![A-Z])
(?<![A-Z]) makes sure there's no preceding uppercase letter;
(?![A-Z]) makes sure there's no following uppercase letter.
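The differences between these patterns can be seen side by side in a small sketch (the test string is made up, with ABCD added to show the lookaround effect):

```python
import re

text = "thisISAtest ABC ABCD"

# Word boundaries: only a standalone three-letter word matches.
bounded = re.findall(r"\b[A-Z]{3}\b", text)

# No boundaries: nested runs match, and so do the first 3 letters of ABCD.
unbounded = re.findall(r"[A-Z]{3}", text)

# Lookarounds: exactly three uppercase letters, no more on either side.
exact = re.findall(r"(?<![A-Z])[A-Z]{3}(?![A-Z])", text)

# A repeated capture group keeps only its last iteration.
last_only = re.findall(r"([A-Z]){3}", "thisISAtest")

print(bounded)    # ['ABC']
print(unbounded)  # ['ISA', 'ABC', 'ABC']
print(exact)      # ['ISA', 'ABC']
print(last_only)  # ['A']
```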

Python regex and spaces?

If I have a piece of text, i.e.
title="gun control" href="/EBchecked/topic/683775/gun-control"
and want to create a regular expression that matches (see inside <> below)
title="<1 word or many words separated by a space>" href="/EBchecked/topic/\w*/\S*"
How do I solve that part in the <>?
The following regex will match 1 word or many words separated by a space:
\w+( \w+)*
Here a "word" is considered to consist of letters, digits, and underscores. If you only want to allow letters you could use [a-zA-Z]+( [a-zA-Z]+)*.
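Plugged into the full pattern from the question, this might look as follows. The capture group around the title is my addition for illustration, and the inner group is made non-capturing so that group 1 holds the whole title.

```python
import re

text = 'title="gun control" href="/EBchecked/topic/683775/gun-control"'

# One word, optionally followed by more words separated by single spaces.
pattern = re.compile(r'title="(\w+(?: \w+)*)" href="/EBchecked/topic/\w*/\S*"')

m = pattern.search(text)
title = m.group(1) if m else None
print(title)  # gun control
```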

Python, regex negative lookbehind behavior

I have a regular expression that should find up to 10 words in a line. That is, it should include the word just preceding the line feed but not the words after the line feed. I am using a negative lookbehind with "\n".
a = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
r = a.search("THe car is parked in the garage\nBut the sun is shining hot.")
When I execute this regex and call the method r.group(), I get back the whole sentence except the last word, which contains a period. I was expecting only the complete string preceding the new line. That is, "THe car is parked in the garage\n".
What is the mistake that I am making here with the negative lookbehind?
I don't know why you would use a negative lookbehind. You are saying that you want a maximum of 10 words before a linefeed. The regex below should work. It uses a positive lookahead to ensure that there is a linefeed after the words. Also, when searching for words, use `\b\w+\b` instead of what you were using.
/(\b\w+\b)*(?=.*\\n)/
Python :
result = re.findall(r"(\b\w+\b)*(?=.*\\n)", subject)
Explanation :
# (\b\w+\b)*(?=.*\\n)
#
# Match the regular expression below and capture its match into backreference number 1 «(\b\w+\b)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*\\n)»
# Match any single character that is not a line break character «.*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the character “\” literally «\\»
# Match the character “n” literally «n»
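You may also wish to consider that there may be no \n in your string at all. Note too that \\n inside a Python raw string looks for a literal backslash followed by n; to match an actual line feed, write \n.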
If I read you right, you want to read up to 10 words, or the first newline, whichever comes first:
((?:(?<!\n)\w+\b[\s.]*){0,10})
This uses a negative lookbehind, but just before the word match, so it blocks getting any word after a newline.
This will need some tuning for imperfect input, but it's a start.
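Run against the sentence from the question, this pattern appears to give exactly the string the asker expected, trailing newline included. A minimal check:

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# The lookbehind sits in front of each word, so a word that directly
# follows a newline cannot start a new repetition.
m = re.search(r"((?:(?<!\n)\w+\b[\s.]*){0,10})", text)
first_line = m.group(1)
print(repr(first_line))  # 'THe car is parked in the garage\n'
```

The newline itself is still consumed by [\s.]* after the last word of the line; strip it if it is not wanted.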
For this task there is the anchor $, which finds the end of the string; together with the modifier re.MULTILINE/re.M it finds the end of each line. So you would end up with something like this
(\b\w+\b[.\s /]{0,2}){0,10}$
The \b is a word boundary. I included [.\s /]{0,2} to match a dot followed by whitespace in my example. If you don't want the dots, you need to make this part at least optional, like [\s /]?, otherwise the separator is also required after the last word and the \s then matches the \n.
Update/Idea 2
OK, maybe I misunderstood your question with my first solution.
If you just don't want the match to cross a newline and continue on the second line, then simply don't allow it. The problem is that the newline is matched by the \s in your character class. \s is the class for whitespace, and it also includes the newline characters \r and \n.
You already have a space in the class, so just replace the \s with \t if you want to allow tabs, and you should be fine without a lookbehind. And of course, make the character class optional, otherwise the last word will not be matched.
((\w)+[\t /]?){0,10}
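A quick sketch comparing the original pattern with this fix on the sentence from the question:

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# Original: \s in the class matches the newline, so the match runs on
# into the second line.
original = re.search(r"((\w)+[\s /]){0,10}(?<!\n)", text).group()

# Fix: only tab, space or slash may separate words, and the separator is optional.
fixed = re.search(r"((\w)+[\t /]?){0,10}", text).group()

print("\n" in original)  # True
print(repr(fixed))       # 'THe car is parked in the garage'
```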
I think you shouldn't be using a lookbehind at all. If you want to match up to ten words not including a newline, try this:
\S+(?:[ \t]+\S+){0,9}
A word is defined here as one or more non-whitespace characters, which includes periods, apostrophes, and other sentence punctuation as well as letters. If you know the text you're matching is regular prose, there's no point limiting yourself to \w+, which isn't really meant to match natural-language words anyway.
After the first word, it repeatedly matches one or more horizontal whitespace characters (space or TAB) followed by another word, for a maximum of ten words. If it encounters a newline before the tenth word, it simply stops matching at that point. There's no need to mention newlines in the regex at all.
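Both behaviours, stopping at a newline and stopping after ten words, can be checked with a short sketch (the second test string is made up):

```python
import re

# First word, then up to nine more, separated by spaces or tabs only.
pattern = re.compile(r"\S+(?:[ \t]+\S+){0,9}")

text = "THe car is parked in the garage\nBut the sun is shining hot."
first_line = pattern.search(text).group()

long_line = "one two three four five six seven eight nine ten eleven twelve"
first_ten = pattern.search(long_line).group()

print(first_line)  # THe car is parked in the garage
print(first_ten)   # one two three four five six seven eight nine ten
```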
