Check logical concatenation of regular expressions - python

I have the following problem in python, which I hope you can assist with.
The input is 2 regular expressions, and I have to check if their concatenation can have values.
For example, if one says take strings with length greater than 10 and the other says at most 5, then
no value can ever pass both expressions.
Is there something in python to solve this issue?
Thanks,
Max.

This brute-force algorithm is adapted from here:
Generating a list of values a regex COULD match in Python
def all_matching_strings(alphabet, max_length, regex1, regex2):
    """Yield all strings over 'alphabet' of length up to 'max_length'
    that match both 'regex1' and 'regex2'."""
    if max_length == 0:
        return
    L = len(alphabet)
    for N in range(1, max_length + 1):
        indices = [0] * N
        for z in range(L ** N):  # odometer enumeration of all length-N strings
            r = ''.join(alphabet[i] for i in indices)
            if regex1.match(r) and regex2.match(r):
                yield r
            # advance the odometer, carrying as needed
            i = 0
            indices[i] += 1
            while (i < N) and (indices[i] == L):
                indices[i] = 0
                i += 1
                if i < N:
                    indices[i] += 1
Example usage, for your situation (two regexes)... you'd need to add all possible symbols/whitespace/etc. to that alphabet as well:
import re

alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex2_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2):
    print(r)
That said, the runtime on this is super-crazy and you'll want to do whatever you can to speed it up. One suggestion on the answer I swiped the algorithm from was to filter the alphabet down to only the characters that are "possible" for the regexes. So if you scan your regexes and only see [1-3] and [a-eA-E], with no ".", "\w", "\s", etc., then you can reduce the alphabet to 13 characters. Lots of other little tricks you could implement as well.

Is there something in python to solve this issue?
There is nothing in Python that solves this directly.
That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at Regular Expressions: Is there an AND operator?
This will combine the regexes but won't show directly whether some string exists that satisfies the combined regex.
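For example, a minimal sketch (the two patterns here are stand-ins I made up for the question's length constraints):
import re

r1 = r".{11,}"   # stand-in for "length greater than 10"
r2 = r".{0,5}$"  # stand-in for "at most 5 characters"

# Both lookaheads must succeed at the same position, simulating AND.
combined = re.compile("(?=" + r1 + ")(?=" + r2 + ")")

print(combined.match("hello world!"))  # None: the two constraints conflict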

I highly doubt that something like this is implemented, or even that there is a way to compute it efficiently.
One approximate approach that comes to mind, which would detect the most obvious conflicts, is to generate a random string conforming to each of the regexes, and then check whether the concatenation of the regexes matches the concatenation of the generated strings.
Something like:
import re
import rstr

s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print(re.match(r1 + r2, s1 + s2))
Although I can't really think of a way for this to fail: for your example, where r1 matches strings with more than 10 chars and r2 matches strings shorter than 5 chars, the concatenation of the two would yield strings whose first part is longer than 10 chars with a tail of fewer than 5.


Approaches for finding matches in a large dataset

I have a project where, given a list of ~10,000 unique strings, I want to find where those strings occur in a file with 10,000,000+ string entries. I also want to include partial matches if possible. My list of ~10,000 strings is dynamic data and updates every 30 minutes, and currently I'm not able to process all of the searching to keep up with the updated data. My searches take about 3 hours now (compared to the 30 minutes I have to do the search within), so I feel my approach to this problem isn't quite right.
My current approach is to first create a list from the 10,000,000+ string entries. Then each item from the dynamic list is searched for in the larger list using an in-search.
results_boolean = [keyword in n for n in string_data]
Is there a way I can greatly speed this up with a more appropriate approach?
Using a generator with a set is probably your best bet ... this solution I think will work, and should presumably be faster:
def find_matches(target_words, filename_to_search):
    targets = set(target_words)
    with open(filename_to_search) as f:
        for line_no, line in enumerate(f):
            matching_intersection = targets.intersection(line.split())
            if matching_intersection:
                yield (line_no, line, matching_intersection)  # there was a match

for match in find_matches(["unique", "list", "of", "strings"], "search_me.txt"):
    print("Match: %s" % (match,))
    input("Hit Enter for next match:")  # py3 ... just to see your matches
Of course it gets harder if your matches are not single words, especially if there is no reliable grouping delimiter.
In general, you would want to preprocess the large, unchanging data in some way to speed repeated searches. But you've said too little to suggest something clearly practical. Like: how long are these strings? What's the alphabet (e.g., 7-bit ASCII or full-blown Unicode?)? How many characters are there in total? Are characters in the alphabet equally likely to appear in each string position, or is the distribution highly skewed? If so, how? And so on.
Here's about the simplest kind of indexing: building a dict with a number of entries equal to the number of unique characters across all of string_data. It maps each character to the set of string_data indices of strings containing that character. A search for a keyword can then be restricted to only the string_data entries known in advance to contain the keyword's first character.
Now, depending on details that can't be guessed from what you said, it's possible even this modest indexing will consume more RAM than you have - or it's possible that it's already more than good enough to get you the 6x speedup you seem to need:
# Preprocessing - do this just once, when string_data changes.
def build_map(string_data):
    from collections import defaultdict
    ch2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for ch in s:
            ch2ixs[ch].add(i)
    return ch2ixs

def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        ch = keyword[0]
        if ch in ch2ixs:
            result = []
            for i in ch2ixs[ch]:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)
Then, e.g.,
string_data = ['banana', 'bandana', 'bandito']
ch2ixs = build_map(string_data)
find_partial_matches(['ban', 'i', 'dana', 'xyz', 'na'],
                     string_data,
                     ch2ixs)
displays:
'ban' found in strings [0, 1, 2]
'i' found in strings [2]
'dana' found in strings [1]
'na' found in strings [0, 1]
If, e.g., you still have plenty of RAM, but need more speed, and are willing to give up on (probably silly - but can't guess from here) 1-character matches, you could index bigrams (adjacent letter pairs) instead.
In the limit, you could build a trie out of string_data, which would require lots of RAM, but could reduce the time to search for an embedded keyword to a number of operations proportional to the number of characters in the keyword, independent of how many strings are in string_data.
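For illustration, a minimal sketch of the bigram-index variant (my own sketch of the idea, with hypothetical helper names):
from collections import defaultdict

def build_bigram_map(string_data):
    # Map each adjacent character pair to the indices of strings containing it.
    bg2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for j in range(len(s) - 1):
            bg2ixs[s[j:j+2]].add(i)
    return bg2ixs

def candidates(keyword, bg2ixs):
    # A string lacking any bigram of the keyword cannot contain the keyword,
    # so intersect the index sets of all the keyword's bigrams.
    bigrams = {keyword[j:j+2] for j in range(len(keyword) - 1)}
    if not bigrams or not all(bg in bg2ixs for bg in bigrams):
        return set()
    return set.intersection(*(bg2ixs[bg] for bg in bigrams))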
Note that you should really find a way to get rid of this:
results_boolean = [keyword in n for n in string_data]
Building a list with over 10 million entries for every keyword search makes every search expensive, no matter how cleverly you index the data.
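For instance, if you only need to know whether there is any match at all, a generator expression stops at the first hit instead of materializing the full list:
found = any(keyword in n for n in string_data)  # short-circuits on first match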
Note: a probably practical refinement of the above is to restrict the search to strings that contain all of the keyword's characters:
def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        keyset = set(keyword)
        if all(ch in ch2ixs for ch in keyset):
            ixs = set.intersection(*(ch2ixs[ch] for ch in keyset))
            result = []
            for i in ixs:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)

Python Regex to start from inner match

I am working on a regex in Python that converts a mathematical expression to the power format in SymPy language, pow(x,y). For example, it takes 2^3 and returns pow(2,3).
My current pattern is:
# pattern
pattern = re.compile(r'(.+?)\^(.*)')
To find matches in nested expressions, I use a for loop to count the number of ^ (hats) and then use re.sub to generate the power format:
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
# for nested pow
for i in range(num_hat):
    exp = re.sub(pattern, r'pow(\1,\2)', exp)
return exp
This method does not work for nested ^ expressions such as a^b^c^d or sin(x^2)^3 as the position of the final parenthesis is not correct.
For a^b^c^d it returns pow(pow(pow(a,b,c,d)))
For sin(x^2)^3 it returns pow(pow(sin(x,2),3))
Is there any way to overcome this issue? I tried a negative lookahead, but it is still not working.
There is no nice way of saying this, but you have an extreme case of an XY problem. What you apparently want is to convert some mathematical expression to SymPy. Writing your own regular expression seems like a very tedious, error-prone, and possibly impossible approach to this.
Being a vast symbolic library, SymPy comes with an entire parsing submodule, which allows you to tweak the parsing expressions in all detail, in particular convert_xor governs what happens to the ^ character. However, it appears you do not need to do anything since converting ^ to exponentiation is the default. You can therefore simply do:
import sympy

print(sympy.sympify("a^b^c^d"))     # a**(b**(c**d))
print(sympy.sympify("sin(x^2)^3"))  # sin(x**2)**3
Note that ** is equivalent to pow, so I am not sure why you are insisting on the latter. If you need output that shall work in yet another programming language, that's what the printing module is for, and it's comparatively easy to change this yourself. Another thing that may help you obtain the desired form is sympy.srepr.
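For example, sympy.srepr already shows the exact Pow(...) structure (a small sketch; the comment shows the output I would expect):
import sympy

expr = sympy.sympify("sin(x^2)^3")
print(sympy.srepr(expr))
# Pow(sin(Pow(Symbol('x'), Integer(2))), Integer(3))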
Why don't you use recursion for this? It might not be the best approach, but it will work for your use case if the nested powers are not too numerous.
A small demonstration:
import re

# Your approach
def func(exp):
    pattern = re.compile(r'(.+?)\^(.*)')
    # one substitution per hat in the expression
    num_hat = len(re.findall(r'\^', exp))
    for i in range(num_hat):
        exp = re.sub(pattern, r'pow(\1,\2)', exp)
    return exp

# With recursion
def refined_func(exp):
    pattern = r'(.+?)\^(.*)'
    num_hat = len(re.findall(r'\^', exp))
    # base condition
    if num_hat == 1:
        search = re.search(pattern, exp)
        group1 = search.group(1)
        group2 = search.group(2)
        return "pow(" + group1 + ", " + group2 + ")"
    # for nested pow
    for i in range(num_hat):
        search = re.search(pattern, exp)
        if not search:  # the point where there are no hats left in exp
            break
        group1 = search.group(1)
        group2 = search.group(2)
        exp = "pow(" + group1 + ", " + refined_func(group2) + ")"
    return exp

if __name__ == '__main__':
    print(func("a^b^c^d"))
    print("###############")
    print(refined_func("a^b^c^d"))
The output of the above program is,
pow(pow(pow(a,b,c,d)))
###############
pow(a, pow(b, pow(c, d)))
Problem in your approach:
Initially you start off with the following expression,
a^b^c^d
With your defined regex, the above expression is split into two parts -> part1: a and part2: b^c^d. From these, you generate pow(a,b^c^d). So the next expression that you work with is,
pow(a,b^c^d)
Now your regex gives part1 as pow(a,b and part2 as c^d). Since the pow statement is constructed as pow(part1, part2), you end up with pow(pow(a,b, c^d)), which is not what you intended.
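You can watch this happen by applying the substitution one step at a time:
import re

pattern = re.compile(r'(.+?)\^(.*)')
exp = "a^b^c^d"
for _ in range(3):
    exp = pattern.sub(r'pow(\1,\2)', exp)
    print(exp)
# pow(a,b^c^d)
# pow(pow(a,b,c^d))
# pow(pow(pow(a,b,c,d)))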
I made an attempt at your examples, but I'll still advise you to find a math parser (from my comment), as this regex is very complex.
import re

pattern = re.compile(r"(\w+(\(.+\))?)\^(\w+(\(.+\))?)([^^]*)$")

def convert(string):
    while "^" in string:
        string = pattern.sub(r"pow(\1, \3)\5", string, 1)
    return string

print(convert("a^b^c^d"))     # pow(a, pow(b, pow(c, d)))
print(convert("sin(x^2)^3"))  # pow(sin(pow(x, 2)), 3)
Explanation: loop while there is a ^ and replace the last match each time (hence the $).

Efficient algorithm for strings - regex matching

Problem statement:
Given two lists: A of strings and B of regexes (the regexes are strings too).
For every regex in list B, find all the matching strings in list A.
Length of list A <= 10^6 (N)
Length of list B <= 100 (M)
Length of strings and regexes <= 30 (K)
Assume regex matching and string comparison take O(K) time, and a regex can contain any operation Python regexes support.
My algorithm:
for regex in B:
    for s in A:
        if regex.match(s):
            mapping[regex].add(s)
This takes O(N*M*K) time.
Is there any way to make it more time efficient even compromising space (using any data structure)?
This is about as fast as it can go, in terms of time complexity.
Every regex has to be matched with every string at least once. Otherwise, you won't be able to get the information of "match" or "no match".
In terms of absolute time, you can use a filter to avoid the slow Python loops:
mapping = {regex: list(filter(re.compile(regex).match, A)) for regex in B}
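For example, with some made-up sample data:
import re

A = ["cat", "car", "dog", "cart"]  # hypothetical strings
B = [r"ca.", r"d\w+"]              # hypothetical regexes

mapping = {regex: list(filter(re.compile(regex).match, A)) for regex in B}
print(mapping)  # {'ca.': ['cat', 'car', 'cart'], 'd\\w+': ['dog']}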

python longest repeated substring in a single string? [duplicate]

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:
fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld
then I would like the value "helloworld" to be returned.
I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.
This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).
That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.
Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.
There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
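For concreteness, here is a rough sketch of the suffix-array/LCP route (using plain sorting instead of Manber-Myers, so the construction step here is not the fast one):
def longest_thrice_repeated(s):
    n = len(s)
    # Suffix array via plain sorting: simple, but not the fast construction.
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's algorithm: lcp[r] = common prefix length of suffixes sa[r], sa[r+1].
    lcp = [0] * max(n - 1, 1)
    h = 0
    for i in range(n):
        if rank[i] < n - 1:
            j = sa[rank[i] + 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    # A prefix shared by three consecutive suffixes = min of two adjacent LCPs.
    best_len, best_pos = 0, 0
    for r in range(n - 2):
        m = min(lcp[r], lcp[r + 1])
        if m > best_len:
            best_len, best_pos = m, sa[r]
    return s[best_pos:best_pos + best_len]

print(longest_thrice_repeated(
    'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'))
# helloworld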
Hope this helps!
Use defaultdict to tally each substring beginning at each position in the input string. The OP wasn't clear on whether overlapping matches should be included; this brute-force method includes them.
from collections import defaultdict

def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while substr:
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i, r):
            occ[sub] += 1
    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k, v in occ.items() if v >= minocc]
    if occ_minocc:
        maxkey = max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r, minocc))

print(longestRepetitiveSubstring(
    'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'))
prints:
('helloworld', 3)
Let's start from the end: count the frequencies and stop as soon as the most frequent element appears 3 or more times.
from collections import Counter

a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times = 3
for n in range(1, len(a)//times + 1)[::-1]:
    substrings = [a[i:i+n] for i in range(len(a)-n+1)]
    freqs = Counter(substrings)
    if freqs.most_common(1)[0][1] >= times:
        seq = freqs.most_common(1)[0][0]
        break
print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))
Result:
>>> sequence 'helloworld' of length 10 occurs 3 or more times
Edit: if you have the feeling that you're dealing with random input and the common substring should be of small length, you'd better start (if you need the speed) with small substrings and stop when you can't find any that appear at least 3 times:
from collections import Counter

a = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
times = 3
for n in range(1, len(a)//times + 1):
    substrings = [a[i:i+n] for i in range(len(a)-n+1)]
    freqs = Counter(substrings)
    if freqs.most_common(1)[0][1] < times:
        n -= 1
        break
    else:
        seq = freqs.most_common(1)[0][0]
print("sequence '%s' of length %s occurs %s or more times" % (seq, n, times))
The same result as above.
The first idea that came to mind is searching with progressively larger regular expressions:
import re

text = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'
largest = ''
i = 1
while 1:
    m = re.search(r"(" + (r"\w" * i) + r").*\1.*\1", text)
    if not m:
        break
    largest = m.group(1)
    i += 1
print(largest)  # helloworld
The code ran successfully. The time complexity appears to be at least O(n^2).
If you reverse the input string and feed it to a regex like (.+)(?:.*\1){2},
it should give you the longest string repeated 3 times. (Reverse capture group 1 for the answer.)
Edit:
I have to say: cancel this approach. It depends on the first match. Unless it's tested against the current length vs. the max length so far in an iterative loop, regex won't work for this.
In Python you can use the string count method.
We also use an additional generator which will generate all the unique substrings of a given length for our example string.
The code is straightforward:
test_string2 = 'fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

def generate_substrings_of_length(this_string, length):
    '''Generates unique substrings of a given length for a given string'''
    for i in range(len(this_string) - 2*length + 1):
        yield this_string[i:i+length]

def longest_substring(this_string):
    '''Returns the string with at least two repetitions which has maximum length'''
    max_substring = ''
    for subs_length in range(2, len(this_string) // 2 + 1):
        for substring in generate_substrings_of_length(this_string, subs_length):
            count_occurences = this_string.count(substring)
            if count_occurences > 1:
                if len(substring) > len(max_substring):
                    max_substring = substring
    return max_substring
I must note here (and this is important) that the generate_substrings_of_length generator does not generate all the substrings of a certain length. It generates only the substrings required to make the comparisons; otherwise we would get some artificial duplicates. For example, in the case:
test_string = "banana"
GS = generate_substrings_of_length(test_string, 2)
for i in GS:
    print(i)
this results in:
ba
an
na
and this is enough for what we need.
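Hypothetical usage with the question's string:
print(longest_substring(test_string2))  # helloworld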
from collections import Counter

def Longest(string):
    # Note: this considers only runs of a single repeated character.
    b = []
    le = []
    for i in set(string):
        for j in range(Counter(string)[i] + 1):
            b.append(i * (j + 1))
    for i in b:
        if i in string:
            le.append(i)
    return [s for s in le if len(s) == len(max(le, key=len))]

Possible to Simplify These Python Regular Expressions?

patterns = {}
patterns[1] = re.compile(r"[A-Z]\d-[A-Z]\d")
patterns[2] = re.compile(r"[A-Z]\d-[A-Z]\d\d")
patterns[3] = re.compile(r"[A-Z]\d\d-[A-Z]\d\d")
patterns[4] = re.compile(r"[A-Z]\d\d-[A-Z]\d\d\d")
patterns[5] = re.compile(r"[A-Z]\d\d\d-[A-Z]\d\d\d")
patterns[6] = re.compile(r"[A-Z][A-Z]\d-[A-Z][A-Z]\d")
patterns[7] = re.compile(r"[A-Z][A-Z]\d-[A-Z][A-Z]\d\d")
patterns[8] = re.compile(r"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d")
patterns[9] = re.compile(r"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d\d")
patterns[10] = re.compile(r"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d")

def matchFound(toSearch):
    for items in sorted(patterns.keys(), reverse=True):
        matchObject = patterns[items].search(toSearch)
        if matchObject:
            return items
    return 0
then I use the following code to look for matches:
while matchFound(toSearch) > 0:
I have 10 different regular expressions but I feel like they could be replaced by one, well written, more elegant regular expression. Do you guys think it's possible?
EDIT: FORGOT TWO MORE EXPRESSIONS:
patterns[11] = re.compile(r"[A-Z]\d-[A-Z]\d\d\d")
patterns[12] = re.compile(r"[A-Z][A-Z]\d-[A-Z][A-Z]\d\d\d")
EDIT2: I ended up with the following. I realize I COULD get extra results but I don't think they're possible in the data I'm parsing.
patterns = {}
patterns[1] = re.compile(r"[A-Z]{1,2}\d-[A-Z]{1,2}\d{1,3}")
patterns[2] = re.compile(r"[A-Z]{1,2}\d\d-[A-Z]{1,2}\d{2,3}")
patterns[3] = re.compile(r"[A-Z]{1,2}\d\d\d-[A-Z]{1,2}\d\d\d")
Josh Caswell noted that Sean Bright's answer will match more inputs than your original group. Sorry I didn't figure this out. (In the future it might be good to spell out your problem a little bit more.)
So your basic problem is that regular expressions can't count. But we can still solve this in Python in a very slick way. First we make a pattern that matches any of your legal inputs, but would also match some you want to reject.
Next, we define a function that uses the pattern, then examines the match object and checks lengths to make sure the matched string meets the requirements.
import re

_s_pat = r'([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})'
_pat = re.compile(_s_pat)

_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])

def check_match(s):
    m = _pat.search(s)
    try:
        a0, n0, a1, n1 = m.groups()
        if len(a0) != len(a1):
            return False
        if not (len(n0), len(n1)) in _valid_n_len:
            return False
        return True
    except (AttributeError, TypeError, ValueError):
        return False
Here is some explanation of the above code.
First we use a raw string to define the pattern, and then we pre-compile the pattern. We could just stuff the literal string into the call to re.compile() but I like to have a separate string. Our pattern has four distinct sections enclosed in parentheses; these will become "match groups". There are two match groups to match the alphabet characters, and two match groups to match numbers. This one pattern will match everything you want, but won't exclude some stuff you don't want.
Next we declare a set that has all the valid lengths for numbers. For example, the first group of numbers can be 1 digit long and the second group can be 2 digits; this is (1,2) (a tuple value). A set is a nice way to specify all the possible combinations that we want to be legal, while still being able to check quickly whether a given pair of lengths is legal.
The function check_match() first uses the pattern to match against the string, returning a "match object" which is bound to the name m. If the search fails, m will be set to None. Instead of explicitly testing for None, I used a try/except block; in retrospect it might have been better to just test for None. Sorry, I didn't mean to be confusing. But the try/except block is a pretty simple way to wrap something and make it very reliable, so I often use it for things like this.
Finally, check_match() unpacks the match groups into four variables. The two alpha groups are a0 and a1, and the two number groups are n0 and n1. Then it checks that the lengths are legal. As far as I can tell, the rule is that alpha groups need to be the same length; and then we build a tuple of number group lengths and check to see if the tuple is in our set of valid tuples.
Here's a slightly different version of the above. Maybe you will like it better.
import re

# match alpha: 1 or 2 capital letters
_s_pat_a = r'[A-Z]{1,2}'
# match number: 1-3 digits
_s_pat_n = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
_s_pat = '(%s)(%s)-(%s)(%s)' % (_s_pat_a, _s_pat_n, _s_pat_a, _s_pat_n)
_pat = re.compile(_s_pat)

# set of valid lengths of number groups
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])

def check_match(s):
    m = _pat.search(s)
    if not m:
        return False
    a0, n0, a1, n1 = m.groups()
    if len(a0) != len(a1):
        return False
    tup = (len(n0), len(n1))  # make tuple of actual lengths
    if tup not in _valid_n_len:
        return False
    return True
Note: It looks like the rule for valid lengths is actually simple:
if len(n0) > len(n1):
    return False
If that rule works for you, you could get rid of the set and the tuple stuff. Hmm, and I'll make the variable names a bit shorter.
import re

# match alpha: 1 or 2 capital letters
pa = r'[A-Z]{1,2}'
# match number: 1-3 digits
pn = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
p = '(%s)(%s)-(%s)(%s)' % (pa, pn, pa, pn)
_pat = re.compile(p)

def check_match(s):
    m = _pat.search(s)
    if not m:
        return False
    a0, n0, a1, n1 = m.groups()
    if len(a0) != len(a1):
        return False
    if len(n0) > len(n1):
        return False
    return True
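Hypothetical usage of check_match:
for s in ["A1-B23", "AB12-CD123", "A12-B1", "AB1-C2"]:
    print(s, check_match(s))
# A1-B23 True
# AB12-CD123 True
# A12-B1 False  (second number shorter than the first)
# AB1-C2 False  (alpha groups differ in length)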
Sean Bright gave you the answer you need. Here's just a general tip:
Python has wonderful documentation. In this case, you could read it with the "help" command:
import re
help(re)
And if you read through the help, you would see:
{m,n} Matches from m to n repetitions of the preceding RE.
It also helps to use Google. "Python regular expressions" found these links for me:
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html
Both are worth reading.
Josh is right about at least reducing the number of REs.
But you could also use an RE that is wider than allowed and then additionally check whether all conditions are met, such as:
pattern = re.compile(r"([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")
and then
matchObject = pattern.search(toSearch)
if matchObject and <do something with the lengths of the groups, comparing them>:
    return <stuff>
But even if that does not work for some reason, there are ways to improve it:
patterns = tuple(re.compile(r) for r in (
    r"[A-Z]\d-[A-Z]\d{1,2}",
    r"[A-Z]\d\d-[A-Z]\d{2,3}",
    r"[A-Z]\d\d\d-[A-Z]\d\d\d",
    r"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,2}",
    r"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}",
    r"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d",
))

def matchFound(toSearch):
    for pat in reversed(patterns):
        matchObject = pat.search(toSearch)
        if matchObject:
            return pat  # returning the pattern itself may be more useful than an index
    return None
Building on Sean's (now apparently deleted) answer, you can reduce the number of patterns. Because of the limitations on the combinations of lengths of digit matches (i.e., if there are m digits in the first position, there must be at least m and no more than 3 in the second), I'm not sure you can get it down to one:
r"[A-Z]\d-[A-Z]\d{1,3}"
r"[A-Z]\d\d-[A-Z]\d{2,3}"
r"[A-Z]\d\d\d-[A-Z]\d\d\d"
r"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,3}"
r"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}"
r"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d"
This uses the {m,n} repeat qualifier syntax, which specifies that the immediately preceding match be repeated at least m but no more than n times. You can also specify a single number n; then the match must succeed exactly n times:
r"[A-Z]{2}\d-[A-Z]{2}\d{2,3}"
