Find consecutive capitalized words in a string, including apostrophes - python

I am using a regex to find all runs of consecutive capitalized words, where some of the words may contain an apostrophe, e.g. "The mother-daughter bakery, Molly’s Munchies, was founded in 2009". I have written a few lines of code to do this:
import re
string = "The mother-daughter bakery, Molly’s Munchies, was founded in 2009"
test = re.findall(r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)", string)
print(test)
The issue is that I am unable to get the result "Molly’s Munchies".
Instead my output is:
[]
Desired output:
['Molly’s Munchies']
Any help appreciated, thank you!

You may use this regex in python:
r"\b[A-Z][a-z'’]*(?:\s+[A-Z][a-z'’]*)+"
RegEx Demo
RegEx Details:
\b: Word boundary
[A-Z]: Match a capital letter
[a-z'’]*: Match 0 or more characters that are a lowercase letter, ' or ’
(?:\s+[A-Z][a-z'’]*)+: Match 1 or more further capitalized words, each preceded by whitespace
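As a quick check, here is a minimal sketch that applies the pattern to the sentence from the question. The variable names are only illustrative. Because the pattern has no capturing group, re.findall returns the full matched text.

```python
import re

sentence = "The mother-daughter bakery, Molly’s Munchies, was founded in 2009"

# Both the straight (') and the curly (’) apostrophe are allowed inside a word
pattern = re.compile(r"\b[A-Z][a-z'’]*(?:\s+[A-Z][a-z'’]*)+")

matches = pattern.findall(sentence)
print(matches)  # ['Molly’s Munchies']
```

"The" is not reported because the word after it is lowercase, so the repeated group cannot match.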

You need to add the apostrophe in both places where you define a "word". You only added it in one place.
import re
string = "The Cow goes moo, and the Dog's Name is orange"
# the apostrophe goes in both character classes: before the lookahead and inside the repeated group
print(re.findall(r"([A-Z][a-z']+(?=\s[A-Z])(?:\s[A-Z][a-z']+)+)", string))
['The Cow', "Dog's Name"]
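One way to avoid touching the character class in two places is to define the "word" once and reuse it. The sketch below is a variant under stated assumptions, not the code from the question: WORD and pattern are hypothetical names, the curly apostrophe ’ is added next to ', and the lookahead is dropped because the repeated group already requires a following capitalized word.

```python
import re

# Hypothetical helper: a single definition of a capitalized word
WORD = r"[A-Z][a-z'’]+"

# Two or more such words separated by one whitespace character
pattern = re.compile(rf"{WORD}(?:\s{WORD})+")

cow = pattern.findall("The Cow goes moo, and the Dog's Name is orange")
bakery = pattern.findall("The mother-daughter bakery, Molly’s Munchies, was founded in 2009")
print(cow)     # ['The Cow', "Dog's Name"]
print(bakery)  # ['Molly’s Munchies']
```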

Related

Printing words from a text

I've made this Python program for printing the words of a text, but I got stuck: when Python reaches the next 'tab' index, it returns to the initial one when it checks the conditional, and I don't know why. Can anyone explain why it doesn't take the new 'tab' index?
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
text = re.sub(r'\W+', ' ', initial_text)
t = -1
for i in text:
    n = text.find(i)
    if i == ' ':
        print(text[t+1:n])
        t = n
This is because you are using the find() function, which returns the index of the first occurrence of whatever you search for. Every space in the text therefore yields the index of the first space, which is why n keeps moving back to the first index.
You can refer to the find() function documentation.
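To make the behaviour concrete, the sketch below first shows str.find reporting the same index on every call, then walks the string with enumerate so that each space is seen at its real position. The sample string and variable names are illustrative only.

```python
text = "a text is any stretch of language"

# find() always reports the first occurrence, no matter how often it is called
first_space = text.find(" ")
print(first_space, text.find(" "))  # 1 1

# enumerate() yields the actual position of each character
words = []
start = 0
for index, char in enumerate(text):
    if char == " ":
        words.append(text[start:index])
        start = index + 1
words.append(text[start:])  # the last word has no trailing space
print(words)
```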
Use this approach:
import re
initial_text = "whatever your text is"
text = re.sub(r'[^\w\s]', '', initial_text)
words_list = text.split()
for word in words_list:
    print(word)
Explanation using an example :
import re
initial_text = "Hello : David welcome to Stack ! overflow"
text = re.sub(r'[^\w\s]', '', initial_text)
The code above removes the punctuation.
words_list = text.split()
After this step, words_list is: ['Hello', 'David', 'welcome', 'to', 'Stack', 'overflow']
for word in words_list:
    print(word)
The code above takes each element from the list and prints it.
Looks like you can use
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
words = re.findall(r'[^\W_]+', initial_text)
for word in words:
    print(word)
See Python proof.
re.findall extracts all non-overlapping matches from the given text.
[^\W_]+ is a regular expression that matches one or more word characters other than the underscore, which means it matches substrings that consist only of digits and/or letters (all of them, ASCII and other Unicode).
See regex proof.
EXPLANATION
[^\W_]+: any character except non-word characters (all but a-z, A-Z, 0-9, _) and '_', 1 or more times (matching the most amount possible)
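The difference from plain \w+ only shows up when the text contains underscores. A minimal sketch, using a made-up sample string:

```python
import re

sample = "snake_case names, 1-2 words"

# \w treats the underscore as a word character, so snake_case stays whole
with_underscore = re.findall(r"\w+", sample)

# [^\W_] excludes the underscore, so snake_case is split in two
without_underscore = re.findall(r"[^\W_]+", sample)

print(with_underscore)     # ['snake_case', 'names', '1', '2', 'words']
print(without_underscore)  # ['snake', 'case', 'names', '1', '2', 'words']
```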

Regex to find strings containing substring, but not ending on same substring

I'm trying to write a regex that checks if a string contains the substring "ing", but it must not end in "ing".
So the word sing would not work but singer would.
I think I have figured out how to make sure that the string does not end with ing, for that I'm using
(!<?(ing))$
But I can't seem to get it to work when I want the word to contain "ing" as well. I was thinking something like
(\w+(ing))(!<?(ing))$
But that does not work, and every solution of mine that sort of works also matches across more than one word. So it will match singer but not singer crafting; it should still match singer there, just not crafting.
You may use the pattern:
ing(?=\w)
This would only be true for words which contain ing which is also followed by another word character. Here is an example:
import re
inp = 'singer'
if re.search(r'ing(?=\w)', inp):
    print('singer is a MATCH')
inp = 'sing'
if re.search(r'ing(?=\w)', inp):
    print('sing is a MATCH')
This prints:
singer is a MATCH
Edit:
To match entire words containing a non-terminal ing, I suggest using re.findall:
inp = "Madonna is a singer who likes to sing."
matches = re.findall(r'\b\w*ing\w+\b', inp)
print(matches) # prints ['singer']
If the word can not end with ing but must contain ing:
\b\w*ing(?!\w*ing\b)\w+
Explanation
\b A word boundary
\w* Match 0+ word characters
ing Match the required ing
(?!\w*ing\b) Negative lookahead, assert the ing is not at the end of the word
\w+ Match 1+ word chars so that there must be at least a single char following
Regex demo | Python demo
For example
import re
items = ["singer","singing","ing","This is a ing testing singalong"]
pattern = r"\b\w*ing(?!\w*ing\b)\w+\b"
for item in items:
    result = re.findall(pattern, item)
    if result:
        print(result)
Output
['singer']
['singalong']
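Whether the lookahead ends in \b or in $ matters once a string holds several words. The sketch below compares the two on a made-up sentence. With $, the assertion only looks for ing at the very end of the whole string, so a word such as singing in the middle of the text slips through.

```python
import re

text = "singer crafting singing singalong"

# Lookahead closed with \b: rejects an "ing" at the end of any word
with_boundary = re.findall(r"\b\w*ing(?!\w*ing\b)\w+", text)

# Lookahead closed with $: only rejects an "ing" at the end of the string
with_dollar = re.findall(r"\b\w*ing(?!\w*ing$)\w+\b", text)

print(with_boundary)  # ['singer', 'singalong']
print(with_dollar)    # ['singer', 'singing', 'singalong']
```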
You can use this pattern:
import re
pattern = re.compile(r'\w*ing\w+')
print(pattern.match('sing')) # No match
print(pattern.match('singer')) # Match

Replacing or introducing a space after a sequence of letters some known or unknown then write on newlines content

The problem is that I now have a string where some words are stuck together:
fooledDog and I need fooled D****string text continues with inserted " "
whateveredJ and I need whatevered J*******string text continues with inserted " "
string = string.replace("edD","ed D")
string = string.replace("edJ","ed J")
but instead of "D" and "J" I need to allow any possible character, to avoid hard-coding values, so that the code works with any letter or number in this position.
This is a pretty easy problem to solve with regular expressions (not something that is always true, even if regex are the best tool for the job). Try this:
import re
text = "fooledDog whateveredJob"
fixed_text = re.sub(r'ed([A-Z])', r'ed \1', text)
print(fixed_text) # prints "fooled Dog whatevered Job"
The pattern looks for the letters 'ed' in lowercase, followed by any capital letter (which gets captured). The replacement is 'ed' and a space, followed by the capital letter from the capturing group.
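Since the question also mentions numbers, the character class can be widened. The sketch below is one possible variant, assuming that only an uppercase letter or a digit directly after 'ed' marks a missing space. Lowercase letters are deliberately left out, because a word such as "needed" contains 'ed' followed by a lowercase letter. The sample text is made up.

```python
import re

text = "fooledDog whateveredJob played7times"

# 'ed' followed by an uppercase letter or a digit gets a space inserted
fixed_text = re.sub(r"ed([A-Z0-9])", r"ed \1", text)
print(fixed_text)  # fooled Dog whatevered Job played 7times
```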
I don't fully understand your question, but it seems you have some camelCase words you wanna separate. If that's the case, try this:
import re
name = 'CamelCaseTest123'
splitted = re.sub(r'(?!^)([A-Z][a-z]+)', r' \1', name).split()
print(splitted)
Output:
['Camel', 'Case', 'Test123']

Python substitute a word for a word and the next concatenated

I want to be able to take in a string and if r'\snot\s' is located, essentially concatenate 'not' and the next word (replacing the white space in between with an underscore).
So if the string is
string="not that my name is Brian and I am not happy about nothing"
The result after a regular expression would be:
'not_that my name is Brian and I am not_happy about nothing'
(not in nothing is not touched).
I need to locate 'not' that is either separated by white space or at the start of a sentence and then join it to '_' and the next word.
Use re.sub() with saving groups:
>>> re.sub(r"not\s\b(.*?)\b", r"not_\1", string)
'not_that my name is Brian and I am not_happy about nothing'
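not\s\b(.*?)\b matches not followed by a whitespace character and then a word boundary (\b) at the start of the next word. The (.*?) is a capturing group that we reference in the substitution (\1). Because it is lazy and the second \b is already satisfied at the same position, it captures an empty string, so the substitution in effect replaces "not " with "not_" and leaves the following word in place.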
Why not just use the replace method on strings? It's a bit more readable than regex.
>>> msg = "not that my name is Brian and I am not happy about nothing"
>>> msg.replace('not ', 'not_')
'not_that my name is Brian and I am not_happy about nothing'
How about just:
\bnot\s
Example:
>>> string
'not that my name is Brian and I am not happy about nothing'
>>> re.sub(r'\bnot\s', 'not_', string)
'not_that my name is Brian and I am not_happy about nothing'
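The leading \b is what separates this from the plain str.replace approach: it keeps words that merely end in "not" from being joined. A small sketch on a made-up sentence:

```python
import re

sentence = "a knot is not tight"

# str.replace also rewrites the "not " at the end of "knot "
with_replace = sentence.replace("not ", "not_")

# \b requires "not" to start at a word boundary, so "knot" is left alone
with_boundary = re.sub(r"\bnot\s", "not_", sentence)

print(with_replace)   # a knot_is not_tight
print(with_boundary)  # a knot is not_tight
```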

Identifying lines with consecutive upper case letters

I'm looking for logic that searches for capital words in a line in Python. For example, I have a *.txt file:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).
Regexes are the way to go:
import re
pattern = r"([A-Z]+_[A-Z]+)" # matches CAPITALS_CAPITALS only
match = re.search(pattern, text) # text holds the line to check
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.
Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
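The patterns above were demonstrated in Ruby. Python's re module only accepts fixed-width lookbehind, so the alternation with ^ inside (?<=...) is likely to be rejected there. A possible Python rendering, offered as a sketch rather than as the answer's own code, swaps each assertion for a negative lookaround on the alphanumeric class, which also covers the start and end of the line. The names CAPITAL_WORDS and sample_lines are hypothetical.

```python
import re

# "Not preceded / not followed by an alphanumeric" stands in for
# "preceded / followed by a non-alphanumeric, or the line edge"
CAPITAL_WORDS = re.compile(
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
    r".*"
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
)

sample_lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
matching_lines = [line for line in sample_lines if CAPITAL_WORDS.search(line)]
print(matching_lines)  # ['DDD_AAA']
```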
print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match 2 words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
Basically, look into regexes!
Here you go:
import re
with open("r1.txt") as f:
    lines = f.readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA
