So I'm in a counter-intuitive situation, and I wanted to get some advice.
Mostly I'm just doing some string matching, using extracted strings as the patterns for my regular expressions. While I can generally do this pretty well with a fuzzy regex search, on occasion I run into this situation:
Let's say I've extracted the following pattern from some data (Python regex package).
pattern = 'the quick brown fox jumps over the lazy dog'
Now, I need to have it match a string that could look like either of these, though mostly the first one.
string = 'quick brown fox jumps over the lazy'
string2 = 'and then a quick brown fox jumps onto the cat'
Because of the beginning and trailing characters, obviously I won't get a match if I try and do something like what I've been doing, which currently looks something like this:
# assumes `import regex as re`; the standard-library re module does not support {e<=2}
if re.search("("+pattern+"){e<=2}", string):
    print(True)
Unfortunately the error count is not consistent and there could be many characters leading and/or ending the pattern. Given I don't know a priori if I'll run into this problem, is there anything I could do to get a match if a sufficient substring of the pattern matches it? I looked at Levenshtein distance to account for this, but it requires setting some threshold that seems super sensitive to the length of the strings to match (after normalizing by length), and so it just ends up being a toss up on whether or not I get a match when I want it to. Are there any other options out there or better ways of normalizing results?
Also, one thing that I can't do is to just always take the best match because sometimes the proper entry doesn't actually appear in the text I'm checking.
Anything I've missed in the regex package that could help out with this?
Ouf, it took me quite some time to come up with this (I'm not a Python developer), but this should do the trick:
import re
sentence = "the quick brown fox jumps over the lazy dog"
string = 'quick brown fox jumps over the lazy'
string2 = 'and then a quick brown fox jumps onto the cat'
count1 = 0
count2 = 0
# Build an alternation of every word in the sentence: "the |quick |...|dog"
pattern = re.sub(
    r'(\w+\s*)',
    r'\1|',
    sentence
)
# Skip characters that do not start a known word; otherwise capture each word in its own optional group
pattern = "(?:(?!" + pattern.rstrip("|") + ").|" + re.sub(
    r'(\w+\s*)',
    r'(\1){0,1}',
    sentence
) + ")+"
results = re.match(
    pattern,
    string
)
total = len(results.groups())
# Groups are numbered 1..total, so the range has to include total
for index in range(1, total + 1):
    if results.group(index):
        count1 = count1 + 1
results = re.match(
    pattern,
    string2
)
for index in range(1, total + 1):
    if results.group(index):
        count2 = count2 + 1
message = 'The following string:"' + string + '" matched ' + str(count1) + ' time(s) and the following string:"' + string2 + '" matched ' + str(count2) + ' time(s).'
print(message)
Tested here: http://www.pythontutor.com/visualize.html#mode=edit
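If counting regex groups feels fragile, another option is to score the overlap at word level and apply a threshold to that score. The sketch below is not the approach above: it swaps in difflib.SequenceMatcher from the standard library, and the function name shared_run_ratio is made up for illustration. It returns the length of the longest run of consecutive words shared by the pattern and the text, divided by the number of words in the pattern, so leading and trailing junk in the text does not lower the score.

```python
from difflib import SequenceMatcher

def shared_run_ratio(pattern, text):
    """Fraction of the pattern's words found as one contiguous run in text.

    Hypothetical helper: compares whitespace-separated words, not characters.
    """
    p_words = pattern.split()
    t_words = text.split()
    if not p_words:
        return 0.0
    matcher = SequenceMatcher(None, p_words, t_words, autojunk=False)
    # Longest block of consecutive words common to both word lists
    block = matcher.find_longest_match(0, len(p_words), 0, len(t_words))
    return block.size / len(p_words)

pattern = 'the quick brown fox jumps over the lazy dog'
score1 = shared_run_ratio(pattern, 'quick brown fox jumps over the lazy')
score2 = shared_run_ratio(pattern, 'and then a quick brown fox jumps onto the cat')
print(score1, score2)
```

Because the score is normalized by the pattern length only, the threshold (an assumption you would have to tune, for example 0.4) does not depend on how much extra text surrounds the match. It is an exact word comparison, so typos inside words would still need the fuzzy matching from the regex package.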
Related
Given a phrase in a given line, I need to be able to match that phrase even if the words have a different number of spaces in the line.
Thus, if the phrase is "the quick brown fox" and the line is "the quick brown fox jumped over the lazy dog", the instance of "the quick brown fox" should still be matched.
The method I already tried was to replace all instances of whitespace in the line with a regex pattern for whitespace, but this doesn't always work if the line contains characters that aren't treated as literal by regex.
This should work:
import re
pattern = r'the\s+quick\s+brown\s+fox'
text = 'the quick brown fox jumped over the lazy dog'
match = re.match(pattern, text)
print(match.group(0))
The output is:
the quick brown fox
You can use this regex:
(the\s+quick\s+brown\s+fox)
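To avoid writing the \s+ by hand, and to cope with phrases that contain regex metacharacters (the problem mentioned in the question), the pattern can be built from the phrase. This is only a sketch; phrase_to_pattern is a hypothetical helper name. Each word is passed through re.escape so that it is matched literally, and the words are joined with \s+.

```python
import re

def phrase_to_pattern(phrase):
    """Compile a regex that matches phrase with any amount of whitespace between words."""
    words = phrase.split()
    # re.escape keeps characters such as ( ) . + from being read as regex syntax
    return re.compile(r'\s+'.join(re.escape(word) for word in words))

pat = phrase_to_pattern("the quick (brown) fox")
match = pat.search("see  the   quick \t(brown)  fox run")
print(match.group(0) if match else None)
```

re.search is used here instead of re.match so the phrase may start anywhere in the line.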
You can split the given string by white spaces and join them back by a white space, so that you can then compare it to the phrase you're looking for:
s = "the quick brown fox"
' '.join(s.split()) == "the quick brown fox" # returns True
For the general case:
Replace each sequence of whitespace characters with a single space character.
Check whether the given phrase is a substring of the line after the replacement.
import re
pattern = "your pattern"
def contains_pattern(line):
    # replace each run of whitespace with a single space
    line_without_spaces = re.sub(r'\s+', ' ', line)
    return pattern in line_without_spaces
# lines is your own iterable of input lines
for line in lines:
    print(contains_pattern(line))
As you later clarified, you need to match any line and any series of words. To show how, I added some more examples that clarify what the two similar proposed regexes do:
text = """the quick brown fox
another line with single and multiple spaces
some other instance with six words"""
Matching whole lines
The first pattern matches a whole line at a time while iterating over the lines:
pattern1 = re.compile(r'((?:\w+)(?:\s+|$))+')
for i, line in enumerate(text.split('\n')):
    match = re.match(pattern1, line)
    print(i, match.group(0))
Its output is:
0 the quick brown fox
1 another line with single and multiple spaces
2 some other instance with six words
Matching single words
The second pattern matches single words and iterates over them one by one while iterating over the lines:
pattern2 = re.compile(r'(\w+)(?:\s+|$)')
for i, line in enumerate(text.split('\n')):
    for m in re.finditer(pattern2, line):
        print(m.group(1))
    print()
Its output is:
the
quick
brown
fox
another
line
with
single
and
multiple
spaces
some
other
instance
with
six
words
I am trying to search a string, which I know is always a sentence, to find the three words that come before and three words that come after a comma. Is regex the right way to do this AND how do you account for the fact that sometimes you will be at the beginning and end of a sentence and there will not be 3 words?
Thanks for the help, trying to learn regex.
Hmm, that one's a bit long, but I guess it works:
>>> import re
>>> string = "The brown fox jumped over the red barn, and found the chickens."
>>> res = re.findall(r'(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*,\s*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?', string, re.IGNORECASE)
>>> res
[('the', 'red', 'barn', 'and', 'found', 'the')]
This will ignore numbers as well, for strings such as:
string = "The brown fox jumped over the red barn, and found 10 chickens."
To give:
[('the', 'red', 'barn', 'and', 'found', 'chickens')]
For things like:
string = "The brown fox jumped over the red barn, and fled."
It gives (the group for the missing third word comes back as an empty string):
[('the', 'red', 'barn', 'and', 'fled', '')]
The same applies to the words before the comma.
\b refers to a word boundary and matches only at the start or end of a word (a run of letters, digits, or underscores).
[a-z]+ refers to a character class, namely all the letters from a to z. The + at the end indicates that this character class is repeated one or more times, thus matching a full word.
(\b[a-z]+\b) is a capture group (notice the parentheses) and will be stored in the result. Adding a question mark after it makes the group optional (i.e. it will match if the word exists and be skipped if it doesn't, which is how you still get results if there are fewer than 3 words before or after the comma).
[^a-z]* is a negated class; notice the caret just after the opening square bracket. It will match any character that is not a letter from a through z. The asterisk * indicates an occurrence of 0 or more times.
, is a literal comma.
\s matches a whitespace character (space, tab, or newline). The asterisk after it again means an occurrence of 0 or more times.
re.IGNORECASE, as it suggests, will make the match case-insensitive.
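Putting the pieces above together, here is a sketch that assembles the same pattern from its two building blocks and drops the optional groups that did not match. The names WORD, GAP and words_around_comma are invented for this illustration; the regex itself is the one explained above.

```python
import re

WORD = r'(\b[a-z]+\b)?'   # optional captured word
GAP = r'[^a-z]*'          # anything that is not a letter
PATTERN = re.compile(
    WORD + GAP + WORD + GAP + WORD + GAP + r',\s*' + WORD + GAP + WORD + GAP + WORD,
    re.IGNORECASE,
)

def words_around_comma(sentence):
    """Return (words_before, words_after) for the first comma, up to three words each."""
    match = PATTERN.search(sentence)
    if match is None:
        return [], []
    groups = match.groups()
    # Groups that did not take part in the match are None; filter them out
    before = [g for g in groups[:3] if g]
    after = [g for g in groups[3:] if g]
    return before, after

result = words_around_comma("The brown fox jumped over the red barn, and fled.")
print(result)
```

Filtering the groups means the caller never has to deal with empty placeholders when the sentence starts or ends close to the comma.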
For your example:
sen = "The brown fox jumped over the red barn,and found the chickens"
result_left = sen.split(',')[0].split()[-3:]
# result_left: ['the', 'red', 'barn']
# for the words on the right
result_right = sen.split(',')[1].split()[:3]
# result_right: ['and', 'found', 'the']
I have a text corpus of 11 files each having about 190000 lines.
I have 10 strings, one or more of which may appear in each line of the above corpus.
When I encounter any of the 10 strings, I need to record that string which appears in the line separately.
The brute force way of looping through the regex for every line and marking it is taking a long time. Is there an efficient way of doing this?
I found a post (Match a line with multiple regex using Python) which provides a TRUE or FALSE output. But how do I record the matching regex from the line:
any(regex.match(line) for regex in [regex1, regex2, regex3])
Edit: adding example
regex = ['quick','brown','fox']
line1 = "quick brown fox jumps on the lazy dog" # I need to be able to record all of quick, brown and fox
line2 = "quick dog and brown rabbit ran together" # I should record quick and brown
line3 = "fox was quick an rabit was slow" # I should be able to record quick and fox
Looping through the regexes and recording the matching ones is one solution, but given the scale (11 * 190000 * 10), my script has been running for a while now. I need to repeat this quite often in my work, so I am looking for a more efficient way.
The approach below applies if you want the matched text. If you need to know which regular expression in the list triggered a match, you are out of luck and will probably need to loop.
Based on the link you have provided:
import re
regexes = 'quick', 'brown', 'fox'
# wrap each regex in a non-capturing group and join them into one alternation
combinedRegex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))
lines = 'The quick brown fox jumps over the lazy dog', 'Lorem ipsum dolor sit amet', 'The lazy dog jumps over the fox'
for line in lines:
    print(combinedRegex.findall(line))
outputs:
['quick', 'brown', 'fox']
[]
['fox']
The point here is that you do not loop over the regex but combine them.
The difference with the looping approach is that re.findall will not find overlapping matches. For instance if your regexes were: regexes= 'bro', 'own', the output of the lines above would be:
['bro']
[]
[]
whereas the looping approach would result in:
['bro', 'own']
[]
[]
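If you do need to know which entry of the list produced each match, one possible way to avoid looping is to give every alternative its own named group and read match.lastgroup. This is a sketch under two assumptions: the individual regexes contain no capture groups of their own, and the group names p0, p1, ... as well as the function name matched_patterns are made up here. Like findall, it will not report overlapping matches.

```python
import re

regexes = ['quick', 'brown', 'fox']
# One named group per regex: (?P<p0>quick)|(?P<p1>brown)|(?P<p2>fox)
combined = re.compile(
    '|'.join('(?P<p{0}>{1})'.format(i, rx) for i, rx in enumerate(regexes))
)

def matched_patterns(line):
    """Return the set of regexes from the list that matched somewhere in line."""
    found = set()
    for match in combined.finditer(line):
        # lastgroup is the name of the group that matched, e.g. 'p2'
        index = int(match.lastgroup[1:])
        found.add(regexes[index])
    return found

print(matched_patterns("fox was quick an rabit was slow"))
```

Each line is scanned once regardless of how many regexes are in the list, which is the point of combining them.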
If you're just trying to match literal strings, it's probably easier to just do:
strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))
and then you can test the whole thing at once:
match = regex.match(line)
Of course, you can get the string which matched from the resulting MatchObject:
if match:
    matching_string = match.group(0)
In action:
import re
strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))
lines = 'foo is a word I know', 'baz is a word I know', 'buz is unfamiliar to me'
for line in lines:
    match = regex.match(line)
    if match:
        print(match.group(0))
It appears that you're really looking to search the string for your regex. In this case, you'll need to use re.search (or some variant), not re.match no matter what you do. As long as none of your regular expressions overlap, you can use my above posted solution with re.findall:
matches = regex.findall(line)
for word in matches:
    print("found {word} in line".format(word=word))
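To make the difference between re.match and re.search concrete, here is a small sketch with a made-up sample line: match only succeeds at the very start of the string, while search and findall scan the whole line.

```python
import re

strings = 'foo', 'bar', 'baz', 'qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))

line = 'I know the words bar and qux'   # sample input, invented for this example

at_start = regex.match(line)      # anchored at position 0
anywhere = regex.search(line)     # first occurrence anywhere in the line
everything = regex.findall(line)  # all non-overlapping occurrences

print(at_start, anywhere.group(0), everything)
```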
I have a string in Python, say The quick #red fox jumps over the #lame brown dog.
I'm trying to replace each of the words that begin with # with the output of a function that takes the word as an argument.
def my_replace(match):
    return match + str(match.index('e'))
# Pseudo-code
string = "The quick #red fox jumps over the #lame brown dog."
string.replace('#%match', my_replace(match))
# Result
"The quick #red2 fox jumps over the #lame4 brown dog."
Is there a clever way to do this?
You can pass a function to re.sub. The function will receive a match object as its argument; use .group() to extract the match as a string.
>>> import re
>>> def my_replace(match):
...     match = match.group()
...     return match + str(match.index('e'))
...
>>> string = "The quick #red fox jumps over the #lame brown dog."
>>> re.sub(r'#\w+', my_replace, string)
'The quick #red2 fox jumps over the #lame4 brown dog.'
I wasn't aware you could pass a function to re.sub() either. Riffing on @Janne Karila's answer to solve a problem I had, the approach works for multiple capture groups, too.
import re
def my_replace(match):
    match1 = match.group(1)
    match2 = match.group(2)
    match2 = match2.replace('#', '')
    # format the number with as many decimals as the digit after '#'
    return u"{0:0.{1}f}".format(float(match1), int(match2))
string = 'The first number is 14.2#1, and the second number is 50.6#4.'
# the dot is escaped so it only matches a literal decimal point
result = re.sub(r'([0-9]+\.[0-9]+)(#[0-9]+)', my_replace, string)
print(result)
Output:
The first number is 14.2, and the second number is 50.6000.
This simple example requires all capture groups be present (no optional groups).
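If the second group should be optional, the replacement function has to handle match.group(2) being None. The following is a sketch of one way to do that, not part of the answer above; the function name format_number and the sample text are invented for illustration.

```python
import re

def format_number(match):
    """Format the number to the precision after '#', or leave it unchanged."""
    number = match.group(1)
    precision = match.group(2)   # None when the optional '#n' part is absent
    if precision is None:
        return number
    return "{0:.{1}f}".format(float(number), int(precision[1:]))

text = 'a 14.2#1 b 50.6 c 3.14159#2'
result = re.sub(r'([0-9]+\.[0-9]+)(#[0-9]+)?', format_number, text)
print(result)
```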
Try:
import re
match = re.compile(r"#\w+")
items = re.findall(match, string)
for item in items:
    string = string.replace(item, my_replace(item))
This will allow you to replace anything that starts with # with whatever the output of your function is.
It wasn't clear to me whether you need help with the function as well. Let me know if that's the case.
A short one with regex and reduce:
>>> import re
>>> from functools import reduce  # reduce is no longer a builtin on Python 3
>>> pat = r'#\w+'
>>> reduce(lambda s, m: s.replace(m, m + str(m.index('e'))), re.findall(pat, string), string)
'The quick #red2 fox jumps over the #lame4 brown dog.'
I'm using pyparsing to parse documents containing text in which the line ends vary in location. I need to write a parser expression that matches the text regardless of line break location. The following does NOT work:
from __future__ import print_function
from pyparsing import *
string_1 = """The quick brown
fox jumps over the lazy dog.
"""
string_2 = """The quick brown fox jumps
over the lazy dog.
"""
my_expr = Literal(string_1)
print(my_expr.searchString(string_1))
print(my_expr.searchString(string_2))
This results in the following being displayed on the console:
[['The quick brown \nfox jumps over the lazy dog.\n']]
[]
Since line breaks are included in ParserElement.DEFAULT_WHITE_CHARS, I don't understand why both strings do not match my expression. How do I create a parser element which DOES match text regardless of where the line breaks occur?
Your question is a good example of why I discourage people from defining literals with embedded whitespace, because this defeats pyparsing's built-in whitespace skipping. Pyparsing skips over whitespace between expressions. In your case, you are specifying only a single expression, a Literal comprising an entire string of words, including whitespace between them.
You can get whitespace skipped by breaking your string up into separate Literals (adding a string to a pyparsing expression automatically constructs a Literal from that string):
from pyparsing import *
my_expr = Literal("The") + "quick" + "brown" + "fox" + "jumps" + "over" + "the" + "lazy" + "dog"
string_1 = """The quick brown
fox jumps over the lazy dog.
"""
string_2 = """The quick brown fox jumps
over the lazy dog.
"""
for test in (string_1, string_2):
    print('-'*40)
    print(test)
    print(my_expr.parseString(test))
    print()
If you don't like typing all those separate quoted strings, you can have Python split the string up for you, map them to Literals, and feed the whole list to make up a pyparsing And:
my_expr = And([Literal(word) for word in "The quick brown fox jumps over the lazy dog".split()])
If you want to preserve the original whitespace, wrap your expression in originalTextFor:
my_expr = originalTextFor(my_expr)