Possible to Simplify These Python Regular Expressions?

Possible to Simplify These Python Regular Expressions? - python

patterns = {}
patterns[1] = re.compile("[A-Z]\d-[A-Z]\d")
patterns[2] = re.compile("[A-Z]\d-[A-Z]\d\d")
patterns[3] = re.compile("[A-Z]\d\d-[A-Z]\d\d")
patterns[4] = re.compile("[A-Z]\d\d-[A-Z]\d\d\d")
patterns[5] = re.compile("[A-Z]\d\d\d-[A-Z]\d\d\d")
patterns[6] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d")
patterns[7] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d\d")
patterns[8] = re.compile("[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d")
patterns[9] = re.compile("[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d\d")
patterns[10] = re.compile("[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d")
def matchFound(toSearch):
for items in sorted(patterns.keys(), reverse=True):
matchObject = patterns[items].search(toSearch)
if matchObject:
return items
return 0
then I use the following code to look for matches:
while matchFound(toSearch) > 0:
I have 10 different regular expressions but I feel like they could be replaced by one, well written, more elegant regular expression. Do you guys think it's possible?
EDIT: FORGOT TWO MORE EXPRESSIONS:
patterns[11] = re.compile("[A-Z]\d-[A-Z]\d\d\d")
patterns[12] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d\d\d")
EDIT2: I ended up with the following. I realize I COULD get extra results but I don't think they're possible in the data I'm parsing.
patterns = {}
patterns[1] = re.compile("[A-Z]{1,2}\d-[A-Z]{1,2}\d{1,3}")
patterns[2] = re.compile("[A-Z]{1,2}\d\d-[A-Z]{1,2}\d{2,3}")
patterns[3] = re.compile("[A-Z]{1,2}\d\d\d-[A-Z]{1,2}\d\d\d")

Josh Caswell noted that Sean Bright's answer will match more inputs than your original group. Sorry I didn't figure this out. (In the future it might be good to spell out your problem a little bit more.)
So your basic problem is that regular expressions can't count. But we can still solve this in Python in a very slick way. First we make a pattern that matches any of your legal inputs, but would also match some you want to reject.
Next, we define a function that uses the pattern and then examines the match object, and counts to make sure that the matched string meets the length requirements.
import re
_s_pat = r'([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})'
_pat = re.compile(_s_pat)
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])
def check_match(s):
m = _pat.search(s)
try:
a0, n0, a1, n1 = m.groups()
if len(a0) != len(a1):
return False
if not (len(n0), len(n1)) in _valid_n_len:
return False
return True
except (AttributeError, TypeError, ValueError):
return False
Here is some explanation of the above code.
First we use a raw string to define the pattern, and then we pre-compile the pattern. We could just stuff the literal string into the call to re.compile() but I like to have a separate string. Our pattern has four distinct sections enclosed in parentheses; these will become "match groups". There are two match groups to match the alphabet characters, and two match groups to match numbers. This one pattern will match everything you want, but won't exclude some stuff you don't want.
Next we declare a set that has all the valid lengths for numbers. For example, the first group of numbers can be 1 digit long and the second group can be 2 digits; this is (1,2) (a tuple value). A set is a nice way to specify all the possible combinations that we want to be legal, while still being able to check quickly whether a given pair of lengths is legal.
The function check_match() first uses the pattern to match against the string, returning a "match object" which is bound to the name m. If the search fails, m might be set to None. Instead of explicitly testing for None, I used a try/except block; in retrospect it might have been better to just test for None. Sorry, I didn't mean to be confusing. But the try/except block is a pretty simple way to wrap something and make it very reliable, so I often use it for things like this.
Finally, check_match() unpacks the match groups into four variables. The two alpha groups are a0 and a1, and the two number groups are n0 and n1. Then it checks that the lengths are legal. As far as I can tell, the rule is that alpha groups need to be the same length; and then we build a tuple of number group lengths and check to see if the tuple is in our set of valid tuples.
Here's a slightly different version of the above. Maybe you will like it better.
import re
# match alpha: 1 or 2 capital letters
_s_pat_a = r'[A-Z]{1,2}'
# match number: 1-3 digits
_s_pat_n = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
_s_pat = '(%s)(%s)-(%s)(%s)' % (_s_pat_a, _s_pat_n, _s_pat_a, _s_pat_n)
_pat = re.compile(_s_pat)
# set of valid lengths of number groups
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])
def check_match(s):
m = _pat.search(s)
if not m:
return False
a0, n0, a1, n1 = m.groups()
if len(a0) != len(a1):
return False
tup = (len(n0), len(n1)) # make tuple of actual lengths
if not tup in _valid_n_len:
return False
return True
Note: It looks like the rule for valid lengths is actually simple:
if len(n0) > len(n1):
return False
If that rule works for you, you could get rid of the set and the tuple stuff. Hmm, and I'll make the variable names a bit shorter.
import re
# match alpha: 1 or 2 capital letters
pa = r'[A-Z]{1,2}'
# match number: 1-3 digits
pn = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
p = '(%s)(%s)-(%s)(%s)' % (pa, pn, pa, pn)
_pat = re.compile(p)
def check_match(s):
m = _pat.search(s)
if not m:
return False
a0, n0, a1, n1 = m.groups()
if len(a0) != len(a1):
return False
if len(n0) > len(n1):
return False
return True

Sean Bright gave you the answer you need. Here's just a general tip:
Python has wonderful documentation. In this case, you could read it with the "help" command:
import re
help(re)
And if you read through the help, you would see:
{m,n} Matches from m to n repetitions of the preceding RE.
It also helps to use Google. "Python regular expressions" found these links for me:
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html
Both are worth reading.

Josh is right about at least reducing the number of REs.
But you could also take a RE which is wider than allowed and then additionally check if all conditions are met. Such as
pattern = re.compile("([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")
and then
matchObject = pattern.search(toSearch)
if matchObject and <do something with the length of the groups, comparing them)>:
return <stuff>
But even if that does not work due to any reason, there are ways to improve that:
patterns = tuple(re.compile(r) for r in (
"[A-Z]\d-[A-Z]\d{1,2}",
"[A-Z]\d\d-[A-Z]\d{2,3}",
"[A-Z]\d\d\d-[A-Z]\d\d\d",
"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,2}",
"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}",
"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d",
)
def matchFound(toSearch):
for pat in reversed(patterns):
matchObject = pat.search(toSearch)
if matchObject:
return items # maybe more useful?
return None

Building on Sean's (now apparently deleted) answer, you can reduce the number of patterns. Because of the limitations on the combinations of length of digit matches (i.e., if m in the first position, at least m and no more than 3 in the second) I'm not sure you can get it down to one:
"[A-Z]\d-[A-Z]\d{1,3}"
"[A-Z]\d\d-[A-Z]\d{2,3}"
"[A-Z]\d\d\d-[A-Z]\d\d\d"
"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,3}"
"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}"
"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d"
This uses the {m,n} repeat qualifier syntax, which specifies that the immediately preceding match be repeated at least m but no more than n times. You can also specify a single number n; then the match must succeed exactly n times:
"[A-Z]{2}\d-[A-Z]{2}\d{2,3}"

Related

create strings that are very similar (only differ in one bit) in python

this is what I tried but it output different strings
import string
import random
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(5))
print(id_generator())

For your particular example, you can use itertools.product:
from string import ascii_lowercase
from itertools import product
list(map(''.join, product(ascii_lowercase, [ascii_lowercase[1:]])))
Output:
['abcdefghijklmnopqrstuvwxyz',
'bbcdefghijklmnopqrstuvwxyz',
'cbcdefghijklmnopqrstuvwxyz',
'dbcdefghijklmnopqrstuvwxyz',
'ebcdefghijklmnopqrstuvwxyz',
'fbcdefghijklmnopqrstuvwxyz',
'gbcdefghijklmnopqrstuvwxyz',
'hbcdefghijklmnopqrstuvwxyz',
'ibcdefghijklmnopqrstuvwxyz',
'jbcdefghijklmnopqrstuvwxyz',
'kbcdefghijklmnopqrstuvwxyz',
'lbcdefghijklmnopqrstuvwxyz',
'mbcdefghijklmnopqrstuvwxyz',
'nbcdefghijklmnopqrstuvwxyz',
'obcdefghijklmnopqrstuvwxyz',
'pbcdefghijklmnopqrstuvwxyz',
'qbcdefghijklmnopqrstuvwxyz',
'rbcdefghijklmnopqrstuvwxyz',
'sbcdefghijklmnopqrstuvwxyz',
'tbcdefghijklmnopqrstuvwxyz',
'ubcdefghijklmnopqrstuvwxyz',
'vbcdefghijklmnopqrstuvwxyz',
'wbcdefghijklmnopqrstuvwxyz',
'xbcdefghijklmnopqrstuvwxyz',
'ybcdefghijklmnopqrstuvwxyz',
'zbcdefghijklmnopqrstuvwxyz']

For this problem randomizing things doesn't sound a good idea.
Because
Random can reproduce already existing ID.
You need to keep track of already given IDs to make sure you didn't regenerate the same ID.
We should take a look to real example of this kind of problems. Real examples has rules to make sure
An ID is real with small effort.
Make easy to generate.
I am living in Turkey and Turkish ID numbers has some (those of I am aware of) rules:
All numbers should be 11 digits.
All ID numbers must be even. End with an even digit.
The last digit must be equal to the reminder of first 10 digits divided by 10.
In the university that I work, every student is given a number. The rule:
The first 2 digits of this number is the year that the student joined the university. (21 for 2021)
Next 3 number is the identifier of the faculty/institute that student is a member of.
3 number is the department
The reminding numbers are the order that the student is accepted to the department.
Now coming to your question. The image you showed is just the permutation of a given list (alphabet).
For this you should understand the number of permutation grows fast. n!/(n-r)!
where n is number of elements (in your example the alphabet) and r is the length of string (in your example the ID).
Here Python's generators come to rescue.
We need a function to generate permutation. And if it give a state, it should start from there:
def nex_permutation(list_of_elements, r, start_from):
res = ""
if len(start_from) != r:
raise ValueError(f"start_from length and r are not equal")
carry = 1
for i in start_from[::-1]:
ind = list_of_elements.index(i) + carry
carry = ind // len(list_of_elements)
res += list_of_elements[ind % len(list_of_elements)]
return res[::-1]
Notice: with this code
We did not check if the given start_from contains only elements from list_of_elements
We going to the first permutation if we try to continue from last possible permutation.
I leave these problems to you.
Now we can have a function using nex_permutation to generate next one. Here We will use yield instead of return. Thus this function become a generator. And with generators you can calculate the next element as long as you want.
def permuataion(list_of_elements, r, start_from=None):
if start_from is None:
start_from = list_of_elements[0] * r
yield start_from
while True:
start_from = nex_permutation(list_of_elements, r, start_from=start_from)
yield start_from
Now let's to use it:
a = permuataion("abcd", 4)
for _ in range(3):
print(next(a))
output:
aaaa
aaab
aaac
Test if it continues:
a = permuataion("abcd", 4, start_from="aabd")
for _ in range(3):
print(next(a))
output:
aaca
aacb
aacc

This seems like a homework assignment, so I'll give you some pseudocode to work with.
# Create a list of all the ascii lowercase characters.
# Get a random letter from list of ascii lowercase letters.
# Make sure it isn't the first letter.
# Replace the first letter in the list of all ascii lowercase letters
# with the random letter.
# Create a str from the list of all the ascii lowercase letters

Python Regex to start from inner match

I am working on a regex in Python that is converting a mathematical expression to the power format in Sympy language pow(x,y). For example, it takes 2^3 and returns pow(2,3).
My current pattern is:
# pattern
pattern = re.compile(r'(.+?)\^(.*)')
To find the math for nested expressions, I use a for loop to count the number of ^(hat)s and then use re.sub to generate the power format:
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
# for nested pow
for i in range(num_hat):
exp = re.sub(pattern, r'pow(\1,\2)', exp)
return exp
This method does not work for nested ^ expressions such as a^b^c^d or sin(x^2)^3 as the position of the final parenthesis is not correct.
For a^b^c^d it returns pow(pow(pow(a,b,c,d)))
For sin(x^2)^3 it returns pow(pow(sin(x,2),3))
Is there any way to overcome this issue? I tried the negative lookahead but it is still not working

There is no nice way of saying this, but you have an extreme case of an XY problem. What you apparently want is to convert some mathematical expression to SymPy. Writing your own regular expression seems like a very tedious, error-prone, and possibly impossible approach to this.
Being a vast symbolic library, SymPy comes with an entire parsing submodule, which allows you to tweak the parsing expressions in all detail, in particular convert_xor governs what happens to the ^ character. However, it appears you do not need to do anything since converting ^ to exponentiation is the default. You can therefore simply do:
from sympy import sympify
print( sympy.sympify("a^b^c^d") ) # a**(b**(c**d))
print( sympy.sympify("sin(x^2)^3") ) # sin(x**2)**3
Note that ** is equivalent to pow, so I am not sure why you are insisting on the latter. If you need an output that shall work in yet another programming language, that’s what the printing module is for and it’s comparably easy to change this yourself. Another thing that may help you obtain the desired form is sympy.srepr.

Why don't you use recursion for this? It might not be the best, but will work for your use case if the nested powers are not a lot,
A small demonstration,
import re
#Your approach
def func(exp):
# pattern
pattern = re.compile(r'(.+?)\^(.*)')
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
# for nested pow
for i in range(num_hat): # num_hat-1 since we created the first match already
exp = re.sub(pattern, r'pow(\1,\2)', exp)
return exp
#With recursion
def refined_func(exp):
# pattern
pattern = '(.+?)\^(.*)'
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
#base condition
if num_hat == 1:
search = re.search(pattern, exp)
group1 = search.group(1)
group2 = search.group(2)
exp = "pow("+group1+", "+group2+")"
return exp
# for nested pow
for i in range(num_hat): # num_hat-1 since we created the first match already
search = re.search(pattern, exp)
if not search: # the point where there are no hats in the exp
break
group1 = search.group(1)
group2 = search.group(2)
exp = "pow("+group1+", "+refined_func(group2)+")"
return exp
if __name__=='__main__':
print(func("a^b^c^d"))
print("###############")
print(refined_func("a^b^c^d"))
The output of the above program is,
pow(pow(pow(a,b,c,d)))
###############
pow(a, pow(b, pow(c, d)))
Problem in your approach:
Initially you start off with the following expression,
a^b^c^d
With your defined regex, two parts are made of the above expression -> part1: a and part2: b^c^d. With these two, you generate pow(a,b^c^d). So the next expression that you work with is,
pow(a,b^c^d)
In this case now, your regex will give part1 to be pow(a,b and part2 will be c^d). Since, the pow statement used to construct the info from part1 and part2 is like pow(part1, part2), you end up having pow( pow(a,b , c^d) ) which is not what you intended.

I made an attempt into your examples but I'll still advise you to find a math parser (from my comment) as this regex is very complex.
import re
pattern = re.compile(r"(\w+(\(.+\))?)\^(\w+(\(.+\))?)([^^]*)$")
def convert(string):
while "^" in string:
string = pattern.sub(r"pow(\1, \3)\5", string, 1)
return string
print(convert("a^b^c^d")) # pow(a, pow(b, pow(c, d)))
print(convert("sin(x^2)^3")) # pow(sin(pow(x, 2)), 3)
Explanation: Loop while there is a ^ and replace the last match (presence of $)

Navigating around a string in python

Pardon the incredibly trivial/noob question, at least it should be easy to answer. I've been working through the coderbyte problems and solving the easy ones in python, but have come across a wall. the problem is to return True if a string (e.g. d+==f+d++) has all alpha characters surrounded by plus signs (+) and if not return false. I'm blanking on the concept that would help navigate around these strings, I tried doing with a loop and if statement, but it failed to loop through the text entirely, and always returned false (from the first problem):
def SimpleSymbols(str):
split = list(str)
rawletters = "abcdefghijklmnopqrstuvwxyz"
letters = list(rawletters)
for i in split:
if i in letters and (((split.index(i)) - 1) == '+') and (((split.index(i)) + 1) == '+'):
return True
else:
return False
print SimpleSymbols(raw_input())
Also editing to add the problem statement: "Using the Python language, have the function SimpleSymbols(str) take the str parameter being passed and determine if it is an acceptable sequence by either returning the string true or false. The str parameter will be composed of + and = symbols with several letters between them (ie. ++d+===+c++==a) and for the string to be true each letter must be surrounded by a + symbol. So the string to the left would be false. The string will not be empty and will have at least one letter."
Any assistance would be greatly appreciated. Thank you!

Here's how I would do the first part (if I weren't using regex):
import string
LOWERCASE = set(string.ascii_lowercase)
def plus_surrounds(s):
"""Return True if `+` surrounds a single ascii lowercase letter."""
# initialize to 0 -- The first time through the loop is guaranteed not
# to find anything, but it will initialize `idx1` and `idx2` for us.
# We could actually make this more efficient by factoring out
# the first 2 `find` operations (left as an exercise).
idx2 = idx1 = 0
# if the indices are negative, we hit the end of the string.
while idx2 >= 0 and idx1 >= 0:
# if they're 2 spaces apart, check the character between them
# otherwise, keep going.
if (idx2 - idx1 == 2) and (s[idx1+1] in LOWERCASE):
return True
idx1 = s.find('+', idx2)
idx2 = s.find('+', max(idx1+1, 0))
return False
assert plus_surrounds('s+s+s')
assert plus_surrounds('+s+')
assert not plus_surrounds('+aa+')
I think that if you study this code and understand it, you should be able to get the second part without too much trouble.

More of a note than an answer, but I wanted to mention regular expressions as a solution not because it's the right one for your scenario (this looks distinctly homework-ish so I understand you're almost certainly not allowed to use regex) but just to ingrain upon you early that in Python, almost EVERYTHING is solved by import foo.
import re
def SimpleSymbols(target):
return not (re.search(r"[^a-zA-Z+=]",target) and re.search(r"(?<!\+)\w|\w(?!\+)",target))

Check logical concatenation of regular expressions

I have the following problem in python, which I hope you can assist with.
The input is 2 regular expressions, and I have to check if their concatenation can have values.
For example if one says take strings with length greater than 10 and the other says at most 5, than
no value can ever pass both expressions.
Is there something in python to solve this issue?
Thanks,
Max.

Getting this brute force algorithm from here:
Generating a list of values a regex COULD match in Python
def all_matching_strings(alphabet, max_length, regex1, regex2):
"""Find the list of all strings over 'alphabet' of length up to 'max_length' that match 'regex'"""
if max_length == 0: return
L = len(alphabet)
for N in range(1, max_length+1):
indices = [0]*N
for z in xrange(L**N):
r = ''.join(alphabet[i] for i in indices)
if regex1.match(r) and regex2.match(r):
yield(r)
i = 0
indices[i] += 1
while (i<N) and (indices[i]==L):
indices[i] = 0
i += 1
if i<N: indices[i] += 1
return
example usage, for your situation (two regexes)... you'd need to add all possible symbols/whitespaces/etc to that alphabet also...:
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
import re
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex1_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2):
print r
That said, the runtime on this is super-crazy and you'll want to do whatever you can to speed it up. One suggestion on the answer I swiped the algorithm from was to filter the alphabet to only have characters that are "possible" for the regex. So if you scan your regex and you only see [1-3] and [a-eA-E], with no ".", "\w", "\s", etc, then you can reduce the size of the alphabet to 13 length. Lots of other little tricks you could implement as well.

Is there something in python to solve this issue?
There is nothing in Python that solves this directly.
That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at Regular Expressions: Is there an AND operator?
This will combine the regexes but won't show directly whether some string exists that satisfies the combined regex.

I highly doubt that something like this is implemented and even that there is a way to efficiently compute it.
One approximative way that comes to my mind now, that detects the most obvious conflicts would be to generate a random string conforming to each of the regexes, and then check if the concatenation of the regexes matches the concatenation of the generated strings.
Something like:
import re, rstr
s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print re.match(r1 + r2, s1 + s2)
Although I can't really think of a way for this to fail. In my opinion, for your example, where r1 matches strings with more than 10 chars, r2 matches strings shorter than 5 chars, then the sum of the two would yield strings with the first part longer than 10 and a tail of less than 5.

repetition "{}" on fly for regex

I am trying to write a function that compare a value with a regex to see if matches. The problem is that I have quite a many regex that are similar with just one difference which the range {} e.g. ^[a-z]{0,500}$ & ^[a-z]{0,200}$ are similar regex with just a diff of range/repetition. I am trying to solve that problem of how to deal with these regex with one function. So far I have written that function. But I think there must be some option that is much better than what I have done below. It should also be able to deal if no max or min is specified as well.
def check(value, min=None, max=None):
regex = "^[a-z]"+"{"+min+","+max+"}$"
r= re.compile(regex)
if r.match(value):
return True
else:
return False

Use min="0" and max="" instead (that way, they will construct valid ranges if left unspecified).
Also, don't do if condition: return True etc. - just return the match object - it will evaluate to True if there is a match (and you can do stuff with it later if you want to).
Further, no need to compile the regex if you're only using it once.
def check(value, min="0", max=""):
regex = "[a-z]{" + min + "," + max + "}$"
return re.match(regex, value)
Also, I've removed the ^ because it's implicit in re.match().

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Possible to Simplify These Python Regular Expressions? - python

Related

create strings that are very similar (only differ in one bit) in python

Python Regex to start from inner match

Navigating around a string in python

Check logical concatenation of regular expressions

repetition "{}" on fly for regex

Categories

Resources