repetition "{}" on fly for regex - python

I am trying to write a function that compare a value with a regex to see if matches. The problem is that I have quite a many regex that are similar with just one difference which the range {} e.g. ^[a-z]{0,500}$ & ^[a-z]{0,200}$ are similar regex with just a diff of range/repetition. I am trying to solve that problem of how to deal with these regex with one function. So far I have written that function. But I think there must be some option that is much better than what I have done below. It should also be able to deal if no max or min is specified as well.
def check(value, min=None, max=None):
regex = "^[a-z]"+"{"+min+","+max+"}$"
r= re.compile(regex)
if r.match(value):
return True
else:
return False

Use min="0" and max="" instead (that way, they will construct valid ranges if left unspecified).
Also, don't do if condition: return True etc. - just return the match object - it will evaluate to True if there is a match (and you can do stuff with it later if you want to).
Further, no need to compile the regex if you're only using it once.
def check(value, min="0", max=""):
regex = "[a-z]{" + min + "," + max + "}$"
return re.match(regex, value)
Also, I've removed the ^ because it's implicit in re.match().

Related

Python - using Regex to return True if all substrings are contained within string

str = "chair table planttest"
substr = ["plant", "tab"]
x = func(substr, str)
print(x)
I want code that will return True if str contains all strings in the substr list, regardless of position. If str does not contain all the elements in substr, it should return False.
I have been trying to use re.search or re.findall but am having a very hard time understanding regex operators.
Thanks in advance!
You need to loop through all of the substrings and use the in operator to decide if they are in the string. Like this:
def hasAllSubstrings(partsArr, fullStr):
allFound = true
for part in partsArr:
if part not in fullStr:
allFound = false
return allFound
# Test
full = "chair table planttest"
parts = ["plant", "tab"]
x = hasAllSubstrings(parts, full)
print(x)
Let's see what the hasAllSubstrings function.
It creates a boolean variable that decides if all substrings have been found.
It loops through each part in the partsArr sent in to the function.
If that part is not in the full string fullStr, it sets the boolean to false.
If multiple are not found, it will still be false, and it won't change. If everything is found, it will always be true and not false.
At the end, it returns whether it found everything.
After the function definition, we test the function with your example. There is also one last thing you should take note of: your variables shouldn't collide with built-ins. You used the str variable, and I changed it to full because str is a (commonly used) function in the Python standard library.
By using that name, you effectively just disabled the str function. Make sure to avoid those names as they can easily make your programs harder to debug and more confusing. Oh, and by the way, str is useful when you need to convert a number or some other object into a string.
You can simply use the built-in function all() and the in keyword:
fullstr = "chair table planttest"
substr = ["plant", "tab"]
x = all(s in fullstr for s in substr)
In continuation to the answer of # Lakshya Raj, you can add break after allFound = false.
Because, as soon as the first item of the sub_strung is not found in str, it gives your desired output. No need to loop further.
allFound = false
break

Python Regex to start from inner match

I am working on a regex in Python that is converting a mathematical expression to the power format in Sympy language pow(x,y). For example, it takes 2^3 and returns pow(2,3).
My current pattern is:
# pattern
pattern = re.compile(r'(.+?)\^(.*)')
To find the math for nested expressions, I use a for loop to count the number of ^(hat)s and then use re.sub to generate the power format:
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
# for nested pow
for i in range(num_hat):
exp = re.sub(pattern, r'pow(\1,\2)', exp)
return exp
This method does not work for nested ^ expressions such as a^b^c^d or sin(x^2)^3 as the position of the final parenthesis is not correct.
For a^b^c^d it returns pow(pow(pow(a,b,c,d)))
For sin(x^2)^3 it returns pow(pow(sin(x,2),3))
Is there any way to overcome this issue? I tried the negative lookahead but it is still not working
There is no nice way of saying this, but you have an extreme case of an XY problem. What you apparently want is to convert some mathematical expression to SymPy. Writing your own regular expression seems like a very tedious, error-prone, and possibly impossible approach to this.
Being a vast symbolic library, SymPy comes with an entire parsing submodule, which allows you to tweak the parsing expressions in all detail, in particular convert_xor governs what happens to the ^ character. However, it appears you do not need to do anything since converting ^ to exponentiation is the default. You can therefore simply do:
from sympy import sympify
print( sympy.sympify("a^b^c^d") ) # a**(b**(c**d))
print( sympy.sympify("sin(x^2)^3") ) # sin(x**2)**3
Note that ** is equivalent to pow, so I am not sure why you are insisting on the latter. If you need an output that shall work in yet another programming language, that’s what the printing module is for and it’s comparably easy to change this yourself. Another thing that may help you obtain the desired form is sympy.srepr.
Why don't you use recursion for this? It might not be the best, but will work for your use case if the nested powers are not a lot,
A small demonstration,
import re
#Your approach
def func(exp):
# pattern
pattern = re.compile(r'(.+?)\^(.*)')
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
# for nested pow
for i in range(num_hat): # num_hat-1 since we created the first match already
exp = re.sub(pattern, r'pow(\1,\2)', exp)
return exp
#With recursion
def refined_func(exp):
# pattern
pattern = '(.+?)\^(.*)'
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
#base condition
if num_hat == 1:
search = re.search(pattern, exp)
group1 = search.group(1)
group2 = search.group(2)
exp = "pow("+group1+", "+group2+")"
return exp
# for nested pow
for i in range(num_hat): # num_hat-1 since we created the first match already
search = re.search(pattern, exp)
if not search: # the point where there are no hats in the exp
break
group1 = search.group(1)
group2 = search.group(2)
exp = "pow("+group1+", "+refined_func(group2)+")"
return exp
if __name__=='__main__':
print(func("a^b^c^d"))
print("###############")
print(refined_func("a^b^c^d"))
The output of the above program is,
pow(pow(pow(a,b,c,d)))
###############
pow(a, pow(b, pow(c, d)))
Problem in your approach:
Initially you start off with the following expression,
a^b^c^d
With your defined regex, two parts are made of the above expression -> part1: a and part2: b^c^d. With these two, you generate pow(a,b^c^d). So the next expression that you work with is,
pow(a,b^c^d)
In this case now, your regex will give part1 to be pow(a,b and part2 will be c^d). Since, the pow statement used to construct the info from part1 and part2 is like pow(part1, part2), you end up having pow( pow(a,b , c^d) ) which is not what you intended.
I made an attempt into your examples but I'll still advise you to find a math parser (from my comment) as this regex is very complex.
import re
pattern = re.compile(r"(\w+(\(.+\))?)\^(\w+(\(.+\))?)([^^]*)$")
def convert(string):
while "^" in string:
string = pattern.sub(r"pow(\1, \3)\5", string, 1)
return string
print(convert("a^b^c^d")) # pow(a, pow(b, pow(c, d)))
print(convert("sin(x^2)^3")) # pow(sin(pow(x, 2)), 3)
Explanation: Loop while there is a ^ and replace the last match (presence of $)

Check logical concatenation of regular expressions

I have the following problem in python, which I hope you can assist with.
The input is 2 regular expressions, and I have to check if their concatenation can have values.
For example if one says take strings with length greater than 10 and the other says at most 5, than
no value can ever pass both expressions.
Is there something in python to solve this issue?
Thanks,
Max.
Getting this brute force algorithm from here:
Generating a list of values a regex COULD match in Python
def all_matching_strings(alphabet, max_length, regex1, regex2):
"""Find the list of all strings over 'alphabet' of length up to 'max_length' that match 'regex'"""
if max_length == 0: return
L = len(alphabet)
for N in range(1, max_length+1):
indices = [0]*N
for z in xrange(L**N):
r = ''.join(alphabet[i] for i in indices)
if regex1.match(r) and regex2.match(r):
yield(r)
i = 0
indices[i] += 1
while (i<N) and (indices[i]==L):
indices[i] = 0
i += 1
if i<N: indices[i] += 1
return
example usage, for your situation (two regexes)... you'd need to add all possible symbols/whitespaces/etc to that alphabet also...:
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
import re
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex1_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2):
print r
That said, the runtime on this is super-crazy and you'll want to do whatever you can to speed it up. One suggestion on the answer I swiped the algorithm from was to filter the alphabet to only have characters that are "possible" for the regex. So if you scan your regex and you only see [1-3] and [a-eA-E], with no ".", "\w", "\s", etc, then you can reduce the size of the alphabet to 13 length. Lots of other little tricks you could implement as well.
Is there something in python to solve this issue?
There is nothing in Python that solves this directly.
That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at Regular Expressions: Is there an AND operator?
This will combine the regexes but won't show directly whether some string exists that satisfies the combined regex.
I highly doubt that something like this is implemented and even that there is a way to efficiently compute it.
One approximative way that comes to my mind now, that detects the most obvious conflicts would be to generate a random string conforming to each of the regexes, and then check if the concatenation of the regexes matches the concatenation of the generated strings.
Something like:
import re, rstr
s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print re.match(r1 + r2, s1 + s2)
Although I can't really think of a way for this to fail. In my opinion, for your example, where r1 matches strings with more than 10 chars, r2 matches strings shorter than 5 chars, then the sum of the two would yield strings with the first part longer than 10 and a tail of less than 5.

Possible to Simplify These Python Regular Expressions?

patterns = {}
patterns[1] = re.compile("[A-Z]\d-[A-Z]\d")
patterns[2] = re.compile("[A-Z]\d-[A-Z]\d\d")
patterns[3] = re.compile("[A-Z]\d\d-[A-Z]\d\d")
patterns[4] = re.compile("[A-Z]\d\d-[A-Z]\d\d\d")
patterns[5] = re.compile("[A-Z]\d\d\d-[A-Z]\d\d\d")
patterns[6] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d")
patterns[7] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d\d")
patterns[8] = re.compile("[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d")
patterns[9] = re.compile("[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d\d")
patterns[10] = re.compile("[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d")
def matchFound(toSearch):
for items in sorted(patterns.keys(), reverse=True):
matchObject = patterns[items].search(toSearch)
if matchObject:
return items
return 0
then I use the following code to look for matches:
while matchFound(toSearch) > 0:
I have 10 different regular expressions but I feel like they could be replaced by one, well written, more elegant regular expression. Do you guys think it's possible?
EDIT: FORGOT TWO MORE EXPRESSIONS:
patterns[11] = re.compile("[A-Z]\d-[A-Z]\d\d\d")
patterns[12] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d\d\d")
EDIT2: I ended up with the following. I realize I COULD get extra results but I don't think they're possible in the data I'm parsing.
patterns = {}
patterns[1] = re.compile("[A-Z]{1,2}\d-[A-Z]{1,2}\d{1,3}")
patterns[2] = re.compile("[A-Z]{1,2}\d\d-[A-Z]{1,2}\d{2,3}")
patterns[3] = re.compile("[A-Z]{1,2}\d\d\d-[A-Z]{1,2}\d\d\d")
Josh Caswell noted that Sean Bright's answer will match more inputs than your original group. Sorry I didn't figure this out. (In the future it might be good to spell out your problem a little bit more.)
So your basic problem is that regular expressions can't count. But we can still solve this in Python in a very slick way. First we make a pattern that matches any of your legal inputs, but would also match some you want to reject.
Next, we define a function that uses the pattern and then examines the match object, and counts to make sure that the matched string meets the length requirements.
import re
_s_pat = r'([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})'
_pat = re.compile(_s_pat)
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])
def check_match(s):
m = _pat.search(s)
try:
a0, n0, a1, n1 = m.groups()
if len(a0) != len(a1):
return False
if not (len(n0), len(n1)) in _valid_n_len:
return False
return True
except (AttributeError, TypeError, ValueError):
return False
Here is some explanation of the above code.
First we use a raw string to define the pattern, and then we pre-compile the pattern. We could just stuff the literal string into the call to re.compile() but I like to have a separate string. Our pattern has four distinct sections enclosed in parentheses; these will become "match groups". There are two match groups to match the alphabet characters, and two match groups to match numbers. This one pattern will match everything you want, but won't exclude some stuff you don't want.
Next we declare a set that has all the valid lengths for numbers. For example, the first group of numbers can be 1 digit long and the second group can be 2 digits; this is (1,2) (a tuple value). A set is a nice way to specify all the possible combinations that we want to be legal, while still being able to check quickly whether a given pair of lengths is legal.
The function check_match() first uses the pattern to match against the string, returning a "match object" which is bound to the name m. If the search fails, m might be set to None. Instead of explicitly testing for None, I used a try/except block; in retrospect it might have been better to just test for None. Sorry, I didn't mean to be confusing. But the try/except block is a pretty simple way to wrap something and make it very reliable, so I often use it for things like this.
Finally, check_match() unpacks the match groups into four variables. The two alpha groups are a0 and a1, and the two number groups are n0 and n1. Then it checks that the lengths are legal. As far as I can tell, the rule is that alpha groups need to be the same length; and then we build a tuple of number group lengths and check to see if the tuple is in our set of valid tuples.
Here's a slightly different version of the above. Maybe you will like it better.
import re
# match alpha: 1 or 2 capital letters
_s_pat_a = r'[A-Z]{1,2}'
# match number: 1-3 digits
_s_pat_n = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
_s_pat = '(%s)(%s)-(%s)(%s)' % (_s_pat_a, _s_pat_n, _s_pat_a, _s_pat_n)
_pat = re.compile(_s_pat)
# set of valid lengths of number groups
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])
def check_match(s):
m = _pat.search(s)
if not m:
return False
a0, n0, a1, n1 = m.groups()
if len(a0) != len(a1):
return False
tup = (len(n0), len(n1)) # make tuple of actual lengths
if not tup in _valid_n_len:
return False
return True
Note: It looks like the rule for valid lengths is actually simple:
if len(n0) > len(n1):
return False
If that rule works for you, you could get rid of the set and the tuple stuff. Hmm, and I'll make the variable names a bit shorter.
import re
# match alpha: 1 or 2 capital letters
pa = r'[A-Z]{1,2}'
# match number: 1-3 digits
pn = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
p = '(%s)(%s)-(%s)(%s)' % (pa, pn, pa, pn)
_pat = re.compile(p)
def check_match(s):
m = _pat.search(s)
if not m:
return False
a0, n0, a1, n1 = m.groups()
if len(a0) != len(a1):
return False
if len(n0) > len(n1):
return False
return True
Sean Bright gave you the answer you need. Here's just a general tip:
Python has wonderful documentation. In this case, you could read it with the "help" command:
import re
help(re)
And if you read through the help, you would see:
{m,n} Matches from m to n repetitions of the preceding RE.
It also helps to use Google. "Python regular expressions" found these links for me:
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html
Both are worth reading.
Josh is right about at least reducing the number of REs.
But you could also take a RE which is wider than allowed and then additionally check if all conditions are met. Such as
pattern = re.compile("([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")
and then
matchObject = pattern.search(toSearch)
if matchObject and <do something with the length of the groups, comparing them)>:
return <stuff>
But even if that does not work due to any reason, there are ways to improve that:
patterns = tuple(re.compile(r) for r in (
"[A-Z]\d-[A-Z]\d{1,2}",
"[A-Z]\d\d-[A-Z]\d{2,3}",
"[A-Z]\d\d\d-[A-Z]\d\d\d",
"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,2}",
"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}",
"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d",
)
def matchFound(toSearch):
for pat in reversed(patterns):
matchObject = pat.search(toSearch)
if matchObject:
return items # maybe more useful?
return None
Building on Sean's (now apparently deleted) answer, you can reduce the number of patterns. Because of the limitations on the combinations of length of digit matches (i.e., if m in the first position, at least m and no more than 3 in the second) I'm not sure you can get it down to one:
"[A-Z]\d-[A-Z]\d{1,3}"
"[A-Z]\d\d-[A-Z]\d{2,3}"
"[A-Z]\d\d\d-[A-Z]\d\d\d"
"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,3}"
"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}"
"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d"
This uses the {m,n} repeat qualifier syntax, which specifies that the immediately preceding match be repeated at least m but no more than n times. You can also specify a single number n; then the match must succeed exactly n times:
"[A-Z]{2}\d-[A-Z]{2}\d{2,3}"

Pythonic way of adding "ly" to end of string if it ends in "ing"?

This is my first effort on solving the exercise. I gotta say, I'm kind of liking Python. :D
# D. verbing
# Given a string, if its length is at least 3,
# add 'ing' to its end.
# Unless it already ends in 'ing', in which case
# add 'ly' instead.
# If the string length is less than 3, leave it unchanged.
# Return the resulting string.
def verbing(s):
if len(s) >= 3:
if s[-3:] == "ing":
s += "ly"
else:
s += "ing"
return s
else:
return s
# +++your code here+++
return
What do you think I could improve on here?
def verbing(s):
if len(s) >= 3:
if s.endswith("ing"):
s += "ly"
else:
s += "ing"
return s
How about this little rewrite:
def verbing(s):
if len(s) < 3:
return s
elif s.endswith('ing'):
return s + 'ly'
else:
return s + 'ing'
I would use s.endswith("ing") in the if, which is also a bit faster, because it doesn't create a new string for the comparision.
And second, I would use docstrings for commenting. This way, you can see your description when you do a help(yourmodule) or when you use some autodoc-tool like Sphinx to create a handbook describing your API. Example:
def verbings(s):
"""Given a string, if its length is at least 3, add 'ing' to its end.
Unless it already ends in 'ing', in which case add 'ly' instead.
If the string length is less than 3, leave it unchanged."""
# rest of the function
Third, it's often considered a bad practice to change input parameters. You can do it for dict or list parameters, which can also act as output parameters. But strings are input parameters only (that's why you have the return). The source you have written is valid of course, but is often confusing. Other languages have often a final or const keyword to avoid this confusion, but Python doesn't. So, I would recommend you, to use either a second variable result = s + "ing" and do a return result afterwards, or write return s + "ing".
The rest is perfectly fine. There are of course some constructs in Python which are shorter to write (you will learn them with the time), but they are often not so readable. Therefore I would stay with your solution.
Pretty good for a beginner! Yes, I would say this is the Pythonic way of doing things. I especially like the way you have commented exactly what the function does. Good work there.
Keep working with Python, though. You're doing fine.
def add_string(str1):
if(len(str1)>=3):
if (str1[-3::] == 'ing'):
str1+='ly'
else:
str1+='ing'
return str1
str1="com"
print(add_string(str1))

Categories

Resources