regex: finding match that satisfies a specific length constraint - python

I have a regex job to search for a pattern
(This) <some words> (is a/was a) <some words> (0-4 digit number) <some words> (word)
where <some words> can be any number of words/characters, including spaces.
I used the following to achieve this.
(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})((?=\W).*?(?<=\W))*(word)(?=\W)
I also have another constraint: the total length of the match should be less than 30 characters.
Currently, my search works for all lengths and searches for all sets of words.
Is there an option in regex that I can use to enforce this constraint in the regex string itself?
I am currently getting this done by checking the length of the matched regex objects. However, the first match found can be longer than the allowed length, and discarding it means I miss shorter matches inside it that do satisfy the length constraint.
For example, the string:
"hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish."
has 2 matches:
"This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word"
"this is a 12 word"
My search captures the first one and misses the second. But the second one matches my length criteria.
If the first match is shorter than the length constraint, then I can ignore the second match.
I am using re.sub() to replace those strings and use a repl function inside sub() to check the length. My dataset is large, so the search takes a lot of time. The most important thing to me is to do the search efficiently including the length constraints so as to avoid these incorrect matches.
I am using python 3
Thanks in advance

The regex engine doesn't provide a method to do exactly what you're asking for; you'd need to use regex in conjunction with another tool to get the result you want.
Building on some of the comments on your question, the following regex will return the entire match (everything from 'This' through 'word'):
\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b
You can then filter the results to only produce the output you're looking for.
import re
string = 'hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish.'
pat = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')
# returns ['this is a 12 word']
[x[1] for x in pat.finditer(string) if len(x[1]) < 30]
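Since the question performs the replacement through re.sub(), the same length filter can be folded into a manual replacement loop. What follows is only a minimal sketch and not part of the original answer: replace_short_matches, max_len and the '<HIT>' placeholder are hypothetical names chosen for illustration. It walks the overlapping candidates produced by finditer, skips any that are too long or that overlap a span already replaced, and rebuilds the string from the pieces in between.

```python
import re

# Same lookahead pattern as above: every candidate is captured in group 1,
# and the match itself is zero-width, so overlapping candidates are all seen.
PAT = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')

def replace_short_matches(text, repl, max_len=30):
    """Replace every candidate shorter than max_len characters with repl."""
    pieces = []
    pos = 0  # end of the last span that was replaced
    for m in PAT.finditer(text):
        start, end = m.span(1)
        # Skip candidates that are too long or overlap an earlier replacement.
        if start < pos or end - start >= max_len:
            continue
        pieces.append(text[pos:start])
        pieces.append(repl)
        pos = end
    pieces.append(text[pos:])
    return ''.join(pieces)

string = 'hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish.'
print(replace_short_matches(string, '<HIT>'))
```

With the sample string, only the short "this is a 12 word" span is replaced; the long candidate starting at "This" is left untouched.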

Related

Python regex for multiple and single dots

I'm currently trying to clean a 1-gram file. Some of the words are as follows:
word - basic word, classical case
word. - basic word but with a dot
w.s.f.w. - (word stands for word) - correct acronym
w.s.f.w - incorrect acronym (missing the last dot)
My current implementation considers two different RegExes because I haven't succeeded in combining them in one. The first RegEx recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1 where both patterns return the same length.
Case no. 2 is problematic because my approach produces word. (with dot) but I need it to return word (without dot). Currently the case is decided in favour of find_acronym_pattern that produces longer sequence.
The case no. 3 works as expected.
The case no. 4: find_acronym_pattern misses the last character meaning that it produces w.s.f. whereas find_word_pattern produces wsfw.
I'm looking for a RegEx (preferably one instead of two that are currently used) that:
given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.
A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:
found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
    found += "."
As you see I match a word plus any number of ".part" suffixes. Like your version, this matches not only single letter acronyms but longer abbreviations like Ph.D., Prof.Dr., or whatever.
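To check this against the five cases from the question, the snippet can be wrapped in a small function. This is a sketch only; the name normalize and the empty-string fallback for tokens with no word characters are assumptions added here, not part of the answer.

```python
import re

def normalize(token):
    """Drop a trailing period, then add it back if the token has embedded periods."""
    parts = re.findall(r"\w+(?:\.\w+)*", token)
    if not parts:
        return ""  # assumption: nothing word-like in the token
    found = parts[0]
    if "." in found:
        found += "."
    return found

for token in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
    print(token, "->", normalize(token))
```

Note that \w also matches digits and underscores, so this is slightly more permissive than the [A-Za-z] classes used in the question.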
If you want one regex, you can use something like this:
((?:[A-Za-z](\.))*[A-Za-z]+)\.?
And substitute with:
\1\2
Python 3 example:
import re
regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
Output:
word
word
w.s.f.w.
w.s.f.w.
m.in.

Find out till where a regex satisfies a sentence

I have a sentence and a regular expression. Is it possible to find out how far into the regex my sentence matches? For example, consider the sentence MMMV and the regex M+V?T*Z+. The regex is satisfied by the sentence up to M+V?, and the remaining part of the regex, T*Z+, should be my output.
My approach right now is to break the regex into individual parts, store them in a list, and then match by concatenating the first n parts until the sentence matches. For example, if my regex is M+V?T*Z+, then my list is ['M+', 'V?', 'T*', 'Z+']. I then match my string in a loop, first against M+, then against M+V?, and so on until a complete match is found, and take the rest of the list as output. Below is the code:
re_exp = ['M+', 'V?', 'T*', 'Z+']
for n in range(len(re_exp)):
    re_expression = ''.join(re_exp[:n+1])
    if re.match(r'{0}$'.format(re_expression), sentence_language):
        return re_exp[n+1:]
Is there a better approach to achieve this, maybe by using a parsing library?
Assuming that your regex is rather simple, with no groups, backreferences, lookaheads, etc., e.g. as in your case, following the pattern \w[+*?]?, you can first split it up into parts, as you already do. But then instead of iteratively joining the parts and matching them against the entire string, you can test each part individually by slicing away the already matched parts.
def match(pattern, string):
    res = pat = ""
    for p in re.findall(r"\w[+*?]?", pattern):
        m = re.match(p, string)
        if m:
            g = m.group()
            string = string[len(g):]
            res, pat = res + g, pat + p
        else:
            break
    return pat, res
Example:
>>> for s in "MMMV", "MMVVTTZ", "MTTZZZ", "MVZZZ", "MVTZX":
...     print(*match("M+V?T*Z+", s))
...
M+V?T* MMMV
M+V?T* MMV
M+V?T*Z+ MTTZZZ
M+V?T*Z+ MVZZZ
M+V?T*Z+ MVTZ
Note, however, that in the worst case of having a string of length n and a pattern of n parts, each matching just a single character, this will still take O(n²) time because the string is sliced repeatedly.
Also, this may fail if two consecutive parts are about the same character, e.g. a?a+b (which should be equivalent to a+b) will not match ab but only aab as the single a is already "consumed" by the a?.
You could get the complexity down to O(n) by writing your own very simple regex matcher for that very reduced sort of regex, but in the average case that might not be worth it, or even slower.
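Short of writing a custom matcher, the repeated slicing can also be avoided by passing a start position to the compiled pattern's match() method. The following is a sketch under the same assumptions as above (a pattern made only of \w[+*?]? parts); the function name match_prefix and the three-part return value are choices made here for illustration. It also returns the unmatched remainder of the pattern, which is what the question asks for.

```python
import re

def match_prefix(pattern, string):
    """Return (matched pattern parts, remaining pattern parts, consumed text)."""
    parts = re.findall(r"\w[+*?]?", pattern)
    pos = 0   # index in string where the next part must match
    used = 0  # number of pattern parts that matched so far
    for part in parts:
        # Compiled patterns accept a start position, so no slicing is needed.
        m = re.compile(part).match(string, pos)
        if m is None:
            break
        pos = m.end()
        used += 1
    return "".join(parts[:used]), "".join(parts[used:]), string[:pos]

print(match_prefix("M+V?T*Z+", "MMMV"))
```

For MMMV this reports Z+ as the remaining part, because T* succeeds with an empty match. The limitation with consecutive parts on the same character (a?a+b) applies here as well.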
You can use () to enclose groups in regex. For example: M+V?(T*Z+), the output you want is stored in the first group of the regex.
I know the question says python, but here you can see the regex in action:
const regex = /M+V?(T*Z+)/;
const str = `MMMVTZ`;
let m = regex.exec(str);
console.log(m[1]);

Python RegEx Overlapping

The title of this question probably isn't sufficient to describe the problem I'm trying to solve so hopefully my example gets the point across. I am hoping a Python RegEx is the right tool for the job:
First, we're looking for any one of these strings:
CATGTG
CATTTG
CACGTG
Second, the pattern is:
string
6-7 letters
string
Example
match: CATGTGXXXXXXCACGTG
no match: CATGTGXXXCACGTG (because 3 letters between)
Third, when a match is found, begin the next search from the end of the previous match, inclusive. Report index of each match.
Example:
input (spaces for readability): XXX CATGTG XXXXXX CATTTG XXXXXXX CACGTG XXX
workflow (spaces for readability):
found match: CATGTG XXXXXX CATTTG
it starts at 3
resuming search at C in CATTTG
found match: CATTTG XXXXXXX CACGTG
it starts at 15
and so on...
After a few hours of tinkering, my sorry attempt did not yield what I expected:
regex = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group())
3 CATGTG
15 CATTTG (incorrect)
You're a genius if you can figure this out with a RegEx. Thanks :D
You can use this kind of pattern:
import re
s='XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX'
regex = re.compile(r'(?=(((?:CATGTG|CATTTG|CACGTG).{6,7}?)(?:CATGTG|CATTTG|CACGTG)))\2')
for m in regex.finditer(s):
    print(m.start(), m.group(1))
The idea is to put the whole string inside the lookahead and to use a backreference to consume characters you don't want to test after.
The first capture group contains the whole sequence, the second contains all characters until the next start position.
Note that you can change (?:CATGTG|CATTTG|CACGTG) to CA(?:TGTG|TTTG|CGTG) to improve the pattern.
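As a sketch of how this generalizes, the same lookahead-plus-backreference pattern can be built from a list of motifs. The names MOTIFS, ALT and find_chained are hypothetical and only used for illustration.

```python
import re

MOTIFS = ["CATGTG", "CATTTG", "CACGTG"]
ALT = "(?:" + "|".join(re.escape(m) for m in MOTIFS) + ")"

# Group 1: the whole motif-gap-motif sequence (captured inside the lookahead).
# Group 2: first motif plus gap; consuming it via \2 makes the next search
# start at the second motif.
REGEX = re.compile(rf"(?=(({ALT}.{{6,7}}?){ALT}))\2")

def find_chained(s):
    """Return (start index, full sequence) for each chained match."""
    return [(m.start(), m.group(1)) for m in REGEX.finditer(s)]

print(find_chained("XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX"))
# [(3, 'CATGTGXXXXXXCATTTG'), (15, 'CATTTGXXXXXXXCACGTG')]
print(find_chained("CATGTGXXXCACGTG"))
# [] (only 3 letters between the motifs)
```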
The main issue is that in order to use the | character, you need to enclose the alternatives in parentheses.
Assuming from your example that you want only the first matching string, try the following:
regex = re.compile("(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group(1))
Note the .group(1), which returns only what was matched by the first set of parentheses, as opposed to .group(), which returns the whole match.

Try to find repetitive string in a pattern using findall() in Python

I am using Python to write a program that counts how many times a word appears. But, in order to count, the program needs to look at the beginning of a sentence and only count words in a sentence that starts with %. For example,
%act: <dur> pours peanut on plate
and I want to count the word peanut. The program should return 1. While,
*CHI: peanut.
would return 0 because it starts with *
So I used findall()
findall('\%.*?' + "peanut", website_html)
But, if a sentence has two "peanut"'s, the pattern matching would only return 1. For example
%act: <bef> gives peanut . eats . <dur> gives peanut . <aft> gives raisin
would only return 1.
How can I make it return 2?
Thanks
I'd recommend breaking it down into two parts. I.e., something like:
num_peanuts = 0
for sentence in re.findall(r'(?m)^%.*', website_html):
    num_peanuts += len(re.findall(r'\bpeanut\b', sentence))
I'm not sure what the right regexp would be for selecting "a sentence that begins with %". Here I assume that it's a line whose first character is % (note that by default . does not match newlines; the (?m) puts the regexp in multiline mode; and the ^ is a zero-width assertion that matches the beginning of a line).
I'll also note that the \b's in my peanut-related regexp are to make sure that the word peanut is not a substring of some larger word (eg peanuts). You may or may not want them, depending on the details of your task.
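Put together as a self-contained sketch, using the sample lines from the question (the function name count_in_percent_lines and the variable sample are made up for this example):

```python
import re

def count_in_percent_lines(text, word):
    """Count whole-word occurrences of word on lines that start with %."""
    word_re = re.compile(r'\b' + re.escape(word) + r'\b')
    lines = re.findall(r'(?m)^%.*', text)  # one entry per line starting with %
    return sum(len(word_re.findall(line)) for line in lines)

sample = (
    "%act: <dur> pours peanut on plate\n"
    "*CHI: peanut.\n"
    "%act: <bef> gives peanut . eats . <dur> gives peanut . <aft> gives raisin\n"
)
print(count_in_percent_lines(sample, "peanut"))  # 3
```

The *CHI line is ignored, and the line with two occurrences contributes 2 to the total.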

Searching a string and returning only things I specify

Hopefully this post goes better..
So I am stuck on this feature of this program that will return the whole word where a certain keyword is specified.
ie - If I tell it to look for the word "I=" in the string "blah blah blah blah I=1mV blah blah etc?", that it returns the whole word where it is found, so in this case, it would return I=1mV.
I have tried a bunch of different approaches, such as,
text = "One of the values, I=1mV is used"
print(re.split('I=', text))
However, this returns the same String without I in it, so it would return
['One of the values, ', '1mV is used']
If I try regex solutions, I run into the problem that the number could have more than 1 digit, so the code below only works if the number is 1 digit. If the value was I=10mV, it would only return the first digit, but if I put [/0-9] in twice, the code no longer works with a 1-digit value.
text = "One of the values, I=1mV is used"
print(re.findall("I=[/0-9]", text))
['I=1']
When I tried using re.search,
text = "One of the values, I=1mV is used"
print(re.search("I=", text))
<_sre.SRE_Match object at 0x02408BF0>
What is a good way to retrieve the word (In this case, I want to retrieve I=1mV) and cut out the rest of the string?
A better way would be to split the text into words first:
>>> text = "One of the values, I=1mV is used"
>>> words = text.split()
>>> words
['One', 'of', 'the', 'values,', 'I=1mV', 'is', 'used']
And then filter the words to find the one you need:
>>> [w for w in words if 'I=' in w]
['I=1mV']
This returns a list of all words with I= in them. We can then just take the first element found:
>>> [w for w in words if 'I=' in w][0]
'I=1mV'
Done! What we can do to clean this up a bit is to just look for the first match, rather than checking every word. We can use a generator expression for that:
>>> next(w for w in words if 'I=' in w)
'I=1mV'
Of course you could adapt the if condition to fit your needs better. You could, for example, use str.startswith() to check whether the word starts with a certain string, or re.match() to check whether the word matches a pattern.
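A minimal sketch of the str.startswith() variant follows. The function name first_word_with_prefix is hypothetical, and passing None as the default to next() is an addition here so that a text without any match does not raise StopIteration.

```python
def first_word_with_prefix(text, prefix):
    """Return the first whitespace-separated word starting with prefix, or None."""
    # Words keep adjacent punctuation, e.g. "I=1mV," if a comma follows.
    return next((w for w in text.split() if w.startswith(prefix)), None)

print(first_word_with_prefix("One of the values, I=1mV is used", "I="))  # I=1mV
print(first_word_with_prefix("No value in here", "I="))                  # None
```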
Using string methods
For the record, your attempt to split the string in two halves, using I= as the separator, was nearly correct. Instead of using str.split(), which discards the separator, you could have used str.partition(), which keeps it.
>>> my_text = "Loadflow current was I=30.63kA"
>>> my_text.partition("I=")
('Loadflow current was ', 'I=', '30.63kA')
Using regular expressions
A more flexible and robust solution is to use a regular expression:
>>> import re
>>> pattern = r"""
... I= # specific string "I="
... \s* # Possible whitespace
... -? # possible minus sign
... \s* # possible whitespace
... \d+ # at least one digit
... (\.\d+)? # possible decimal part
... """
>>> m = re.search(pattern, my_text, re.VERBOSE)
>>> m
<_sre.SRE_Match object at 0x044CCFA0>
>>> m.group()
'I=30.63'
This accounts for a lot more possibilities (negative numbers, integer or decimal numbers).
Note the use of:
Quantifiers to say how many of each thing you want.
a* - zero or more as
a+ - at least one a
a? - "optional" - one or zero as
Verbose regular expression (re.VERBOSE flag) with comments - much easier to understand the pattern above than the non-verbose equivalent, I=\s*-?\s*\d+(\.\d+)?.
Raw strings for regexp patterns, r"..." instead of plain strings "..." - means that literal backslashes don't have to be escaped. Not strictly required here because \s, \d and \. are not recognized string escapes, so Python passes them through unchanged (recent versions warn about this), but one day you'll need to match C:\Program Files\... and on that day you will need raw strings.
Exercises
Exercise 1: How do you extend this so that it can match the unit as well? And how do you extend this so that it can match the unit as either mA, A, or kA? Hint: "Alternation operator".
Exercise 2: How do you extend this so that it can match numbers in engineering notation, i.e. "1.00e3", or "-3.141e-4"?
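One possible way to approach both exercises is sketched below; it is not the answer author's solution. The group names value and unit, the name CURRENT_RE, and the choice to allow optional whitespace before the unit are assumptions made for this example.

```python
import re

CURRENT_RE = re.compile(r"""
    I=                      # specific string "I="
    \s*                     # possible whitespace
    (?P<value>
        -?\s*\d+            # possible minus sign, at least one digit
        (?:\.\d+)?          # possible decimal part
        (?:[eE][+-]?\d+)?   # possible exponent (engineering notation)
    )
    \s*                     # possible whitespace before the unit
    (?P<unit>mA|kA|A)       # alternation operator: one of the allowed units
""", re.VERBOSE)

for sample in ["Loadflow current was I=30.63kA", "I=1.00e3mA", "I=-3.141e-4 A"]:
    m = CURRENT_RE.search(sample)
    print(m.group("value"), m.group("unit"))
```

A text such as "I=1mV" does not match this pattern, because mV is not among the listed units.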
import re
text = "One of the values, I=1mV is used"
l = re.split('I=', text)
print(l[1].split()[0])
If you have more than one I=, do the above for every element of l after the first, since element 0 is the text before the first I=.
This is a good way, since it also works when someone writes "One of the values, I= 1mV is used",
and I guess you want to get that I is 1mV.
BTW, I is current, and its unit is amperes, not volts :)
With your re.findall attempt you would want to add a + which means one or more.
Here are some examples:
import re
test = "This is a test with I=1mV, I=1.414mv, I=10mv and I=1.618mv."
result = re.findall(r'I=[\d\.]+m[vV]', test)
print(result)
test = "One of the values, I=1mV is used"
result = re.search(r'I=([\d\.]+m[vV])', test)
print(result.group(1))
The first print is: ['I=1mV', 'I=1.414mv', 'I=10mv', 'I=1.618mv']
I've grouped everything other than I= in the re.search example,
so the second print is: 1mV
in case you are interested in extracting that.
