I am trying to search a string, which I know is always a sentence, to find the three words that come before and three words that come after a comma. Is regex the right way to do this AND how do you account for the fact that sometimes you will be at the beginning and end of a sentence and there will not be 3 words?
Thanks for the help, trying to learn regex.
Hmm, that one's a bit long, but I guess it works:
>>> import re
>>> string = "The brown fox jumped over the red barn, and found the chickens."
>>> res = re.findall(r'(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*,\s*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?', string, re.IGNORECASE)
>>> res
[('the', 'red', 'barn', 'and', 'found', 'the')]
This will ignore numbers as well, for strings such as:
string = "The brown fox jumped over the red barn, and found 10 chickens."
To give:
[('the', 'red', 'barn', 'and', 'found', 'chickens')]
For things like:
string = "The brown fox jumped over the red barn, and fled."
It gives:
[('the', 'red', 'barn', 'and', 'fled', '')]
And same for words before the comma.
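Note that re.findall fills any optional group that did not take part in the match with an empty string, so each tuple always has six entries. A minimal sketch of dropping those placeholders, reusing the pattern above (the helper name words_around_comma is made up for this illustration):

```python
import re

# Same pattern as in the answer above: up to three words on each side of a comma.
WORD = r'(\b[a-z]+\b)?'
PATTERN = re.compile(
    WORD + r'[^a-z]*' + WORD + r'[^a-z]*' + WORD + r'[^a-z]*,\s*'
    + WORD + r'[^a-z]*' + WORD + r'[^a-z]*' + WORD,
    re.IGNORECASE,
)

def words_around_comma(sentence):
    """Hypothetical helper: return the matched words, skipping empty groups."""
    return [[w for w in groups if w] for groups in PATTERN.findall(sentence)]

print(words_around_comma("The brown fox jumped over the red barn, and fled."))
# [['the', 'red', 'barn', 'and', 'fled']]
```

The filtering loses the information about which side of the comma a word came from, so keep the raw tuples if that matters.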
\b refers to a word boundary and will match only at the start or end of a word (a run of letters, digits, or underscores).
[a-z]+ refers to a character class, namely all the letters from a to z. The + at the end indicates that this character class is repeated one or more times, thus matching a full word.
(\b[a-z]+\b) is a capture group (notice the parentheses) and will be stored in the result. Adding a question mark at the end makes the group optional (i.e. it will match if the word exists and be skipped if it doesn't, which is how you can get results if there are fewer than 3 words before the comma).
[^a-z]* is a negated class, notice the caret just after the opening square bracket. It will match any character, not being letters a through z. The asterisk * indicates an occurrence of 0 or more times.
, is a literal comma.
\s matches any whitespace character, such as a space, tab, or newline. The asterisk after it still means an occurrence of 0 or more times.
re.IGNORECASE, as it suggests, will make the match case-insensitive.
For your example:
sen = "The brown fox jumped over the red barn,and found the chickens"
result_left = sen.split(',')[0].split()[-3:]
# result_left: ['the', 'red', 'barn']
# for the right words
result_right = sen.split(',')[1].split()[:3]
# result_right: ['and', 'found', 'the']
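The split approach raises an IndexError when the sentence has no comma. A small sketch that guards against that with str.partition, which splits at the first comma only (the function name and the n parameter are assumptions for this example):

```python
def around_first_comma(sentence, n=3):
    """Hypothetical helper: up to n words before and after the first comma."""
    left, comma, right = sentence.partition(',')
    if not comma:  # no comma in the sentence
        return [], []
    # Slicing simply returns fewer items when fewer than n words exist.
    return left.split()[-n:], right.split()[:n]

print(around_first_comma("The brown fox jumped over the red barn,and found the chickens"))
# (['the', 'red', 'barn'], ['and', 'found', 'the'])
print(around_first_comma("Well, yes"))
# (['Well'], ['yes'])
```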
Related
I am using regex to find all instances of consecutive words that are both capitalized, and where some of the consecutive words contain an apostrophe, ie ("The mother-daughter bakery, Molly’s Munchies, was founded in 2009"). And I have written a few lines of code to do this:
string = "The mother-daughter bakery, Molly’s Munchies, was founded in 2009"
test = re.findall("([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)", string)
print(test)
The issue is I am unable to print the result ('Molly's Munchies')
Instead my output is:
('[]')
Desired output:
("Molly's Munchies")
Any help appreciated, thank you!
You may use this regex in python:
r"\b[A-Z][a-z'’]*(?:\s+[A-Z][a-z'’]*)+"
RegEx Details:
\b: Word boundary
[A-Z]: Match a capital letter
[a-z'’]*: Match 0 or more characters containing lowercase letter or ' or ’
(?:\s+[A-Z][a-z'’]*)+ Match 1 or more such capital letter words
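As a quick check, here is a sketch that runs this pattern against the sentence from the question (the variable names are only for illustration):

```python
import re

text = "The mother-daughter bakery, Molly’s Munchies, was founded in 2009"
# Both the straight (') and the curly (’) apostrophe are allowed inside a word.
pattern = re.compile(r"\b[A-Z][a-z'’]*(?:\s+[A-Z][a-z'’]*)+")

print(pattern.findall(text))
# ['Molly’s Munchies']
```

"The" is not returned because the pattern requires at least two capitalized words in a row.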
You would need to add it in both places you define a "word". You only added it in one place.
string = "The Cow goes moo, and the Dog's Name is orange"
# e.g. the apostrophe goes into both [a-z'] character classes below
print(re.findall("([A-Z][a-z']+(?=\s[A-Z])(?:\s[A-Z][a-z']+)+)", string))
['The Cow', "Dog's Name"]
This code snippet does not show 'bats', and moreover 'eats' shows up as 'eat' at the end.
When I don't use '[force]', 7 'at' matches are shown.
What is the use of '[force]'?
import re
t="A fat cat doesn't eat oat but a rat eats bats."
mo = re.findall("[force]at", t)
print(mo)
['fat', 'cat', 'eat', 'oat', 'rat', 'eat']
One place where you can find an explanation of Python regular expressions is the re module docs. In your case, the relevant part for [force]at is that [] is
Used to indicate a set of characters. In a set:
Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Therefore [force]at will match: fat, oat, rat, cat, eat.
Play around with the regex in an online regex tester, which explains every part of the pattern you use. For example, compare it to [fc]at to get a feeling for it.
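The comparison can also be done directly in Python. A short sketch using the sentence from the question:

```python
import re

t = "A fat cat doesn't eat oat but a rat eats bats."

print(re.findall(r"[force]at", t))  # one of f, o, r, c, e followed by "at"
# ['fat', 'cat', 'eat', 'oat', 'rat', 'eat']
print(re.findall(r"[fc]at", t))     # only f or c followed by "at"
# ['fat', 'cat']
print(len(re.findall(r"at", t)))    # no set at all: every "at" in the text
# 7
```

'bats' is missing from the first result because b is not in the set, and 'eats' appears as 'eat' because the pattern stops right after "at".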
You can use:
import re
t="A fat cat doesn't eat oat but a rat eats bats."
mo = re.findall(r"\w*at\w*", t)
print(mo)
Output:
['fat', 'cat', 'eat', 'oat', 'rat', 'eats', 'bats']
\w matches any word character (equal to [a-zA-Z0-9_] for ASCII text; in Python 3, str patterns also match Unicode letters and digits)
* Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[force]at
[force] matches a single character present in the list force (case sensitive)
at matches the characters at literally (case sensitive)
Given a phrase in a given line, I need to be able to match that phrase even if the words have a different number of spaces in the line.
Thus, if the phrase is "the quick brown fox" and the line is "the quick brown fox jumped over the lazy dog", the instance of "the quick brown fox" should still be matched.
The method I already tried was to replace all instances of whitespace in the line with a regex pattern for whitespace, but this doesn't always work if the line contains characters that aren't treated as literal by regex.
This should work:
import re
pattern = r'the\s+quick\s+brown\s+fox'
text = 'the quick brown fox jumped over the lazy dog'
match = re.match(pattern, text)
print(match.group(0))
The output is:
the quick brown fox
You can use this regex:
(the\s+quick\s+brown\s+fox)
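Since the question mentions phrases containing characters that regex does not treat as literal, here is a sketch that builds such a pattern from any phrase by escaping each word with re.escape and joining the words with \s+ (the helper name phrase_pattern and the sample line are made up for this example):

```python
import re

def phrase_pattern(phrase):
    """Hypothetical helper: compile a phrase so any whitespace run separates its words."""
    # re.escape makes characters such as ( or . match literally.
    return re.compile(r'\s+'.join(re.escape(word) for word in phrase.split()))

line = "the   quick brown    fox (v2.0) jumped over the lazy dog"
print(bool(phrase_pattern("the quick brown fox").search(line)))  # True
print(bool(phrase_pattern("fox (v2.0) jumped").search(line)))    # True
print(bool(phrase_pattern("quick fox").search(line)))            # False
```

It uses search rather than match, so the phrase may appear anywhere in the line.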
You can split the given string by white spaces and join them back by a white space, so that you can then compare it to the phrase you're looking for:
s = "the quick brown fox"
' '.join(s.split()) == "the quick brown fox" # returns True
For the general case:
replace each sequence of whitespace characters with a single space character;
check whether the given phrase is a substring of the line after the replacement.
import re
pattern = "your pattern"
def line_contains_pattern(line):
    # replace each run of whitespace with a single space
    line_without_spaces = re.sub(r'\s+', ' ', line)
    return pattern in line_without_spaces
As you later clarified, you need to match any line and any series of words. To show this, I added some more examples to clarify what the two similar proposed regexes do:
text = """the quick brown fox
another line with single and multiple spaces
some other instance with six words"""
Matching whole lines
The first one matches the whole line, iterating over the single lines
import re
pattern1 = re.compile(r'((?:\w+)(?:\s+|$))+')
for i, line in enumerate(text.split('\n')):
    match = re.match(pattern1, line)
    print(i, match.group(0))
Its output is:
0 the quick brown fox
1 another line with single and multiple spaces
2 some other instance with six words
Matching single words
The second one matches single words and iterates over them one by one while iterating over the single lines:
pattern2 = re.compile(r'(\w+)(?:\s+|$)')
for i, line in enumerate(text.split('\n')):
    for m in re.finditer(pattern2, line):
        print(m.group(1))
    print()
Its output is:
the
quick
brown
fox
another
line
with
single
and
multiple
spaces
some
other
instance
with
six
words
I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried out various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I've tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alphanumerics greedily, followed by either one or no white-space character, and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, + and ? match some number of (greedy) repetitions of the preceding RE.
In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of white spaces, and that this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
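To see why the capturing group matters, here is a short sketch: a repeated capturing group keeps only its last repetition, and re.findall returns group contents instead of the whole match as soon as the pattern contains a group. Swapping in a non-capturing group (?:...) with + is one possible way around that:

```python
import re

s = "This is a regular sentence."

m = re.match(r"(\w+\s?)*", s)
print(m.group(0))  # the whole match
# This is a regular sentence
print(m.group(1))  # only the last repetition of the group
# sentence

# With a non-capturing group and +, findall returns whole matches
# and no empty strings.
print(re.findall(r"(?:\w+\s?)+", s))
# ['This is a regular sentence']
```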
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of consecutive white-space characters? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; what defines a sentence is that it ends with a punctuation mark, or rather with something that is not part of that sequence of alphanumerics and white space.
([a-zA-Z0-9\s])*
The above regex matches a sentence as a series of alphanumeric or white-space characters, repeated zero or more times. You can refine it to the following, though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with an alphanumeric character.
Hope this is what you were looking for.
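If you do want to allow at most one space between words, as the P.S. in the question asks, one possible pattern is a word followed by any number of "single space plus word" repetitions. This is a sketch under that assumption and is not taken from the answers above:

```python
import re

# A word, then zero or more repetitions of exactly one space and another word.
single_spaced = re.compile(r"\w+(?: \w+)*")

print(single_spaced.match("so is This ONE").group(0))
# so is This ONE
print(single_spaced.match("so is  This ONE").group(0))  # stops at the double space
# so is
```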
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""
# a sentence ends at a period, a newline, or two or more spaces
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')
def main():
    i = 0
    for s in re_sentence.finditer(source):
        print("%d:%s" % (i, s.group(0)))
        i += 1
if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.
This may be a silly question but...
Say you have a sentence like:
The quick brown fox
Or you might get a sentence like:
The quick brown fox jumped over the lazy dog
The simple regexp (\w*) finds the first word "The" and puts it in a group.
For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.
Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.
I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.
Thanks for any insight you may have.
Edit: For completeness's sake here's my actual use case and how I solved it using your help. Thanks again.
>>> import re
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)
>>> print(s)
5 test1 5 test2 5 test3 5 test4 5 test5
>>> names = re.findall(r'\d+\s(\w+)', s)
>>> print(names)
['test1', 'test2', 'test3', 'test4', 'test5']
You can also use the function findall in the module re:
>>> import re
>>> re.findall(r"\w+", "The quick brown fox")
['The', 'quick', 'brown', 'fox']
I don't believe that it is possible. Regexes pair the captures with the parentheses in the given regular expression... if you only listed one group, like '((\w+)\s+){0,99}', then it would just repeatedly capture to the same first and second group... not create new groups for each match found.
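A small sketch of that behaviour, using the pattern from the question: the single group ends up holding only the last word, while looping over matches with re.finditer yields one result per word.

```python
import re

m = re.match(r"(?:(\w+)\s*)*", "The quick brown fox")
print(m.groups())  # only the last repetition is kept
# ('fox',)

# Iterating over the matches gives every word instead.
words = [w.group(0) for w in re.finditer(r"\w+", "The quick brown fox")]
print(words)
# ['The', 'quick', 'brown', 'fox']
```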
You could use str.split with an explicit separator, but that only splits on that exact string, not on a class of characters like whitespace (called with no arguments, it does split on runs of whitespace).
Instead, you can use re.split, which can split on a regular expression, and give it '\s' to match any whitespace. You probably want it to match '\s+' to gather the whitespace greedily.
>>> import re
>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
>>> re.split(r'\s+', 'The quick brown\t fox')
['The', 'quick', 'brown', 'fox']
>>>
Why use a regex when string.split does the same thing?
>>> "The quick brown fox".split()
['The', 'quick', 'brown', 'fox']
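The two are not quite identical, though. A short sketch of one difference: with leading or trailing whitespace, re.split returns empty strings at the ends, while str.split() with no arguments drops them (the sample string is made up):

```python
import re

s = "  The quick brown\t fox "

print(s.split())
# ['The', 'quick', 'brown', 'fox']
print(re.split(r'\s+', s))
# ['', 'The', 'quick', 'brown', 'fox', '']
```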
Regular expressions can't group into unknown number of groups. But there is hope in your case. Look into the 'split' method, it should help in your case.