Regex only returning first part of match - python

I'd like to extract the cat and another mat from this sentence:
>>> text = "the cat sat on another mat"
>>>
>>> re.findall('(the|another)\s+\w+', text)
['the', 'another']
But it won't return the cat and mat following. If I change it to re.findall('another\s+\w+', text) then it finds that part, but why doesn't the (first thing | second thing) work?
(Using Python's re module)

I would do
import re
text = "the cat sat on another mat"
re.findall(r'the\s+\w+|another\s+\w+', text)
The result should be
['the cat', 'another mat']

If the pattern contains a capture group, re.findall returns only the substrings captured by that group rather than the whole match. Use a non-capturing group instead, so that re.findall returns the entire matches:
re.findall(r'(?:the|another)\s+\w+', text)
This returns:
['the cat', 'another mat']
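To make the difference concrete, here is a minimal sketch that runs the same pattern with a capturing and a non-capturing group. The variable names (with_group, without_group, pairs) are only illustrative.

```python
import re

text = "the cat sat on another mat"

# With a capturing group, findall returns only what the group captured.
with_group = re.findall(r'(the|another)\s+\w+', text)

# With a non-capturing group (?:...), findall returns each whole match.
without_group = re.findall(r'(?:the|another)\s+\w+', text)

# finditer exposes both: group(0) is the whole match, group(1) the capture.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer(r'(the|another)\s+\w+', text)]

print(with_group)     # ['the', 'another']
print(without_group)  # ['the cat', 'another mat']
print(pairs)          # [('the cat', 'the'), ('another mat', 'another')]
```

If you need both the whole match and the captured part, finditer is probably the more convenient of the three.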

Join negative particle

How can I attach a negative particle to the word that follows it in Python (wherever the particle can occur in a text)?
For example, turn this list: ['This is not apple']
into this: ['This is not_apple']
You can use a regular expression. The pattern \bnot\s+(?=\w) matches the word not (but not other words ending in not) followed by one or more whitespace characters, and the lookahead requires that a word character comes next without consuming it.
import re
s = 'This is not apple'
s2 = re.sub(r'\bnot\s+(?=\w)', 'not_', s)
output: 'This is not_apple'
with a list
import re
l = ['This is not apple']
[re.sub(r'\bnot\s+(?=\w)', 'not_', s) for s in l]
output: ['This is not_apple']
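If more than one negation word should be handled, the same idea can be extended with a capture group and a backreference. This is only a sketch: the NEGATIONS tuple and the join_negations helper are hypothetical names, and the word list is an assumption you would adjust for your data.

```python
import re

# Hypothetical list of negation words; adjust it for your own texts.
NEGATIONS = ("not", "no", "never")

# \b(...)\s+(?=\w): a whole negation word, the whitespace after it, then a
# lookahead that requires a word character without consuming it.
pattern = re.compile(r'\b(' + '|'.join(NEGATIONS) + r')\s+(?=\w)')

def join_negations(s):
    # \1 re-inserts whichever negation word matched, followed by "_".
    return pattern.sub(r'\1_', s)

texts = ['This is not apple', 'I have no idea', 'knot tied']
result = [join_negations(s) for s in texts]
print(result)  # ['This is not_apple', 'I have no_idea', 'knot tied']
```

The \b at the front is what keeps "knot" untouched. The match is case-sensitive as written, so "Not" at the start of a sentence would need re.IGNORECASE or an extra alternative.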

re.findall() isn't as greedy as expected - Python 2.7

I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in python 2.7. For my purposes, it is not important that everything that could be construed as a complete sentence should be in the list, but everything in the list does need to be a complete sentence. Below is the code that will illustrate the issue:
import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences
Per this regex tester, I should, in theory, be getting a list like this:
>>> ["Hello World!", "This is your captain speaking."]
But the output I am actually getting is like this:
>>> [' World', ' speaking']
The documentation indicates that the findall searches from left to right and that the * and + operators are handled greedily. Appreciate the help.
The issue is that findall() is showing just the captured subgroups rather than the full match. Per the docs for re.findall():
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
It is easy to see what is going on using re.finditer() and exploring the match objects:
>>> import re
>>> text = "Hello World! This is your captain speaking."
>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)
>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)
The solution to your problem is to suppress the subgroups with ?:. Then you get the expected results:
>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
You can change your regex somewhat:
>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']
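The "list of tuples" case from the docs can be seen with a small variation of the pattern. The two-group pattern below is only an illustration (it is not from the question), and the variable names are made up.

```python
import re

text = "Hello World! This is your captain speaking."

# Illustrative pattern with two capturing groups: the first word, and the
# rest of the sentence up to (but not including) the final punctuation.
two_groups = r"([A-Z]\w+)((?:\s+\w+[,;:-]?)*)[.!?]"

# More than one group: findall returns one tuple per match.
tuples = re.findall(two_groups, text)

# group(0) always holds the full match, however many groups there are.
full = [m.group(0) for m in re.finditer(two_groups, text)]

print(tuples)  # [('Hello', ' World'), ('This', ' is your captain speaking')]
print(full)    # ['Hello World!', 'This is your captain speaking.']
```

So if you want to keep the groups for other reasons, iterating with finditer and reading group(0) is one way to still get the complete sentences.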

Python re.sub to change only the word 'a' as opposed to every instance of 'a' as a letter

For example, let's say I want to change every word 'a' into 'an' in the following text:
"a apple is a| awesome fruit."
Assume that the "|" character is there as a garbage character that needs to be worked around.
I want the end product to be as follows:
"an apple is an| awesome fruit."
So far, the closest I've gotten is with the following code:
>>> s = 'a apple is a| awesome fruit.'
>>> regex = '[^A-Za-z0-9](a)[^A-Za-z0-9]'
>>> s = re.sub(regex, 'an', s)
>>> s
'a apple isan awesome fruit.'
'a' showing up at the beginning of the string isn't being affected at all, while the 'a' followed up by garbage mutilates the string in that area. I understand why it's happening this way, I just don't know how I should make the regex pattern to fit this situation. My plan is to only change the substring group (a), but I don't know how to work with that in re.sub. How can I only substitute the substring group? Are there any better ways to use a regex pattern for this situation?
You can make use of word boundaries. \b matches between a \w character and a \W character (or between a \w character and the start or end of the string):
>>> s = 'a apple is a| awesome fruit.'
>>> regex = r'\ba\b'
>>> s = re.sub(regex, 'an', s)
>>> s
'an apple is an| awesome fruit.'
\b for word boundaries is a good answer here. The more general constructs are called "lookahead" and "lookbehind". The \b version is
re.sub(r'\ba\b', 'an', s)
and the equivalent written with lookarounds is
re.sub(r'((?<=\W)|^)a((?=\W)|$)', 'an', s)
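On the question of substituting only the group: one possible approach, sketched here, is to capture the character in front and put it back with a backreference, while checking the character behind with a lookahead so it is never consumed. The pattern and variable names are illustrative, not taken from the answers above.

```python
import re

s = 'a apple is a| awesome fruit.'

# Word boundaries: the shortest way to say "a as a whole word".
with_boundaries = re.sub(r'\ba\b', 'an', s)

# Keeping the original character class: group 1 captures the preceding
# character (or the empty string at the start), \1 writes it back, and the
# lookahead tests the following character without removing it.
with_backref = re.sub(r'(^|[^A-Za-z0-9])a(?=[^A-Za-z0-9]|$)', r'\1an', s)

print(with_boundaries)  # an apple is an| awesome fruit.
print(with_backref)     # an apple is an| awesome fruit.
```

The two differ only in edge cases: \b treats the underscore as a word character, while [^A-Za-z0-9] treats it as a separator.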

match text against multiple regex in python

I have a text corpus of 11 files each having about 190000 lines.
I have 10 strings, one or more of which may appear in each line of the above corpus.
When I encounter any of the 10 strings, I need to record that string which appears in the line separately.
The brute force way of looping through the regex for every line and marking it is taking a long time. Is there an efficient way of doing this?
I found a post (Match a line with multiple regex using Python) which provides a TRUE or FALSE output. But how do I record the matching regex from the line:
any(regex.match(line) for regex in [regex1, regex2, regex3])
Edit: adding example
regex = ['quick','brown','fox']
line1 = "quick brown fox jumps on the lazy dog" # i need to be able to record all of quick, brown and fox
line2 = "quick dog and brown rabbit ran together" # i should record quick and brown
line3 = "fox was quick an rabit was slow" # i should be able to record quick and fox.
Looping through the regexes and recording each one that matches is one solution, but given the scale (11 * 190000 * 10), my script has been running for a while now. I need to repeat this many times in my work, so I am looking for a more efficient way.
The approach below applies if you want the matched text. If you instead need to know which regular expression in the list triggered each match, you are out of luck and will probably need to loop.
Based on the link you have provided:
import re
regexes= 'quick', 'brown', 'fox'
combinedRegex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))
lines = 'The quick brown fox jumps over the lazy dog', 'Lorem ipsum dolor sit amet', 'The lazy dog jumps over the fox'
for line in lines:
    print(combinedRegex.findall(line))
outputs:
['quick', 'brown', 'fox']
[]
['fox']
The point here is that you do not loop over the regex but combine them.
The difference with the looping approach is that re.findall will not find overlapping matches. For instance if your regexes were: regexes= 'bro', 'own', the output of the lines above would be:
['bro']
[]
[]
whereas the looping approach would result in:
['bro', 'own']
[]
[]
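The overlap behaviour described above can be reproduced with a short sketch. The third variant, which wraps the alternation in a lookahead so that nothing is consumed, is an addition of mine and not part of the original answer. It reports at most one alternative per starting position.

```python
import re

regexes = ('bro', 'own')
line = 'The quick brown fox jumps over the lazy dog'

# Combined alternation: 'bro' consumes the first three letters of 'brown',
# so 'own' can no longer match in what is left.
combined = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))
combined_hits = combined.findall(line)

# Looping: each regex scans the whole line independently.
looped_hits = [x for x in regexes if re.search(x, line)]

# Lookahead variant: the match itself is empty, so overlaps are still found.
overlapping = re.compile('(?=(' + '|'.join(regexes) + '))')
overlap_hits = overlapping.findall(line)

print(combined_hits)  # ['bro']
print(looped_hits)    # ['bro', 'own']
print(overlap_hits)   # ['bro', 'own']
```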
If you're just trying to match literal strings, it's probably easier to just do:
strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))
and then you can test the whole thing at once:
match = regex.match(line)
Of course, you can get the string which matched from the resulting MatchObject:
if match:
    matching_string = match.group(0)
In action:
import re
strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))
lines = 'foo is a word I know', 'baz is a word I know', 'buz is unfamiliar to me'
for line in lines:
    match = regex.match(line)
    if match:
        print(match.group(0))
It appears that you're really looking to search the string for your regex. In this case, you'll need to use re.search (or some variant), not re.match no matter what you do. As long as none of your regular expressions overlap, you can use my above posted solution with re.findall:
matches = regex.findall(line)
for word in matches:
    print("found {word} in line".format(word=word))
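A quick sketch of how match, search and findall differ on the same compiled pattern. The sample line is made up for illustration.

```python
import re

strings = ('foo', 'bar', 'baz', 'qux')
regex = re.compile('|'.join(re.escape(x) for x in strings))

line = 'a word I know is baz, and so is foo'

# match only tries at the very start of the string.
at_start = regex.match(line)

# search scans forward and stops at the first hit.
first = regex.search(line)

# findall collects every non-overlapping hit, left to right.
all_hits = regex.findall(line)

print(at_start)        # None
print(first.group(0))  # baz
print(all_hits)        # ['baz', 'foo']
```

For the use case in the question (record every string that appears in a line), findall is probably the one you want, since it returns the matched strings directly.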

How to make a group for each word in a sentence?

This may be a silly question but...
Say you have a sentence like:
The quick brown fox
Or you might get a sentence like:
The quick brown fox jumped over the lazy dog
The simple regexp (\w*) finds the first word "The" and puts it in a group.
For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.
Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.
I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.
Thanks for any insight you may have.
Edit: For completeness's sake here's my actual use case and how I solved it using your help. Thanks again.
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)
>>> print s
5 test1 5 test2 5 test3 5 test4 5 test5
>>> list = re.findall(r'\d+\s(\w+)', s)
>>> print list
['test1', 'test2', 'test3', 'test4', 'test5']
You can also use the findall function from the re module:
>>> import re
>>> re.findall(r"\w+", "The quick brown fox")
['The', 'quick', 'brown', 'fox']
I don't believe that it is possible. Regexes pair the captures with the parentheses in the given regular expression... if you only listed one group, like '((\w+)\s+){0,99}', then it would just repeatedly capture to the same first and second group... not create new groups for each match found.
You could use str.split with an explicit separator, but that splits only on that exact string, not on a class of characters such as whitespace.
Instead, you can use re.split, which can split on a regular expression, and give it '\s' to match any whitespace. You probably want it to match '\s+' to gather the whitespace greedily.
>>> import re
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.
>>> re.split('\s+', 'The quick brown\t fox')
['The', 'quick', 'brown', 'fox']
>>>
Why use a regex when string.split does the same thing?
>>> "The quick brown fox".split()
['The', 'quick', 'brown', 'fox']
Regular expressions can't capture into an unknown number of groups. But there is hope in your case: look into the split method, which should do what you need.
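To see why the repeated group does not help, here is a small sketch: a group inside a repetition keeps only its last capture, whereas findall and split return every word. The variable names are illustrative.

```python
import re

sentence = "The quick brown fox"

# The group is matched four times, but each repetition overwrites the
# previous capture, so only the last word survives.
m = re.match(r'(?:(\w+)\s*)*', sentence)
last_only = m.groups()

# findall returns one entry per match instead of one per group.
words = re.findall(r'\w+', sentence)

# str.split() with no arguments splits on runs of whitespace.
split_words = sentence.split()

print(last_only)    # ('fox',)
print(words)        # ['The', 'quick', 'brown', 'fox']
print(split_words)  # ['The', 'quick', 'brown', 'fox']
```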
