Smart pythonic way of removing if elif on regular expressions - python

I have a series of regular expressions that are tried in order: I need to check the first one, then the second, then the third, and so on until the end. I need to do some processing on the matched string, so I'm trying to avoid too much logic. In Python, unlike Perl, I do not think I can perform assignment in the if-elif-elif blocks, so I'll end up doing an assignment, then checking for a match, and then getting the results of that match. For example:
m = re.search(patternA, string)
if m:
    stripped = m.group(0)
    xyz = stripped[45:67]
else:
    m = re.search(patternB, string)
    if m:
        stripped = m.group(0)
        abc = stripped[5:7]
    else:
        m = re.search(patternC, string)
        if m:
            stripped = m.group(0)
            txt = stripped[4:5]
        else:
            ......
Ideally I'd like to find a better structure that ensures I preserve the ordering of the tested regular expressions, and also that I can incorporate the assignment into the if-then statements. So for example:
if (m = re.search(patternA, string)):
    stripped = m.group(0)
    xyz = stripped[45:67]
elif (m = re.search(patternB, string)):
    stripped = m.group(0)
    abc = stripped[5:7]
...
What is the most pythonic way of dealing with this? Thanks.
The use case is to read old data - very old data. However each string may include information about particular values and these are only present if the regular expression matches a particular pattern. So the variables extracted are highly dependent upon what matches.

for (pattern, sl) in zip([patternA, patternB, patternC],
                         [slice(45, 67), slice(5, 7), slice(4, 5)]):
    m = re.search(pattern, string)
    if m:
        value = m.group(0)[sl]
        break
else:
    # Handle no match found for any pattern here
    value = None
This iterates over pairs of regular expressions and the relevant portion of their match until a match is found. If there is no match found, the else clause of the for loop will execute. The result of the match is found in value after the loop, regardless of which pattern matches.
Having different variables set based on which "branch" succeeds is not a great idea, since you won't necessarily know which variables are set at any given time. A dictionary would be a better idea if you really want separate labels for each match, since you can query which key or keys are set in a dictionary.
value = {}
for (pattern, sl, key) in zip([patternA, patternB, patternC],
                              [slice(45, 67), slice(5, 7), slice(4, 5)],
                              ['xyz', 'abc', 'txt']):
    m = re.search(pattern, string)
    if m:
        value[key] = m.group(0)[sl]
        break
The general idea, though, is to note that your chain of if statements is like a hard-coded iteration, so you just need to identify which parts of each if/elif clause vary from the preceding ones, and create a list that you can iterate over instead.
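For completeness: Python 3.8 and later support assignment expressions (the := operator), which allow the if/elif structure the question asks for. The following is only a minimal sketch; PATTERN_A, PATTERN_B, the labels, and the slices are made-up placeholders, not the asker's real patterns.

```python
import re

# Hypothetical patterns, used only to make the sketch runnable.
PATTERN_A = re.compile(r'id=\d+')
PATTERN_B = re.compile(r'name=\w+')


def extract(string):
    """Return a (label, text) pair for the first pattern that matches, else None."""
    if (m := PATTERN_A.search(string)):
        # Slice off the 'id=' prefix of the matched text.
        return ('id', m.group(0)[3:])
    elif (m := PATTERN_B.search(string)):
        # Slice off the 'name=' prefix of the matched text.
        return ('name', m.group(0)[5:])
    return None


print(extract('x id=42 name=bob'))
print(extract('name=bob'))
```

The patterns are still tried strictly in the order written, so the ordering requirement is preserved. For a long list of patterns, the loop-based version above is probably easier to maintain.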

Related

Extract all variables from a string of Python code (regex or AST)

I want to find and extract all the variables in a string that contains Python code. I only want to extract the variables (and variables with subscripts) but not function calls.
For example, from the following string:
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Please note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract baz[1:10:var1[2+1]] and var1[2+1].
The first two ideas that come to mind are to use either a regex or an AST. I have tried both but with no success.
When using a regex, in order to make things simpler, I thought it would be a good idea to first extract the "top level" variables, and then recursively the nested ones. Unfortunately, I can't even do that.
This is what I have so far:
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)
Here is a demo: https://regex101.com/r/INPRdN/2
The other solution is to use an AST, extend ast.NodeVisitor, and implement the visit_Name and visit_Subscript methods. However, this doesn't work either because visit_Name is also called for functions.
I would appreciate it if someone could provide me with a solution (regex or AST) to this problem.
Thank you.
I find your question an interesting challenge, so here is code that does what you want. Doing this with a regex alone is impossible because the expressions are nested, so this solution combines regexes with string manipulation to handle the nested expressions:
# -*- coding: utf-8 -*-
import re

RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """ extract all identifier and getitem expression in the given order."""

    def remove_brackets(text):
        # 1. handle `[...]` expression replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep extracting expression until there is no expression
        while re.search(pattern, text):
            # substitute on `text` (not the outer `string`) so each pass makes progress
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """ save the expression in the list, replace it with special key and its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # just to make sure the expression is inserted before its inner identifier
        # if the expression contains identifier extract too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will contain the extracted expressions
    expressions = []
    string = remove_brackets(string)

    # 2. extract all expression and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expression until there is no expression
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the brackets themselves contain a getitem expression
        # so when we extract that expression we handle the brackets
        string = remove_brackets(string)

    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. Order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert it to integer
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))
OUTPUT:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']
I tested this code on very complicated examples and it worked perfectly. Notice that the order of extraction is the same as you wanted. I hope this is what you needed.
This answer might be too late, but it is possible to do this with the third-party regex package for Python.
import regex
code = '''foo + bar[1] + baz[1:10:var1[2+1]] +
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2
(var3[0])'''
p = r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
# overlapped=True is needed to capture something inside a group, like 'var1[2+1]'
result = regex.findall(p, code, overlapped=True)
# result is a list of 2-tuples, because the pattern has two capture groups (see below)
[x[0] for x in result]
output:
['foo','bar[1]','baz[1:10:var1[2+1]]','var1[2+1]','qux[[1,2,int(var2)]]','var2','bob[len("foobar")]','var3[0]']
Pattern explanation:
(                          # start of 1st capturing group
\b[a-z]\w*\b               # variable name, e.g. 'bar'
(?!\s*[\(\"])              # negative lookahead, to ignore something like foobar" or a call like func(
(\[(?:[^\[\]]|(?2))*\])    # 2nd capturing group: captures nested groups in '[ ]',
                           # e.g. '[1:10:var1[2+1]]'; '(?2)' refers to the 2nd capturing group recursively
?                          # the 2nd capturing group is optional, so that something like 'foo' is captured
)                          # end of 1st group
Regex is not a powerful enough tool to do this. If there is a finite depth to your nesting, there are hacky workarounds that would let you build a complicated regex to do what you are looking for, but I would not recommend it.
This question is asked a lot, and the linked response is famous for demonstrating the difficulty of what you are trying to do.
If you really must parse a string for code an AST would technically work but I am not aware of a library to help with this. You would be best off trying to build a recursive function to do the parsing.
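As a rough illustration of the AST route, the standard-library ast module may be enough for inputs like the one in the question. The sketch below is not taken from any of the answers: the names VariableCollector and extract_variables are made up for illustration, and it assumes Python 3.8 or later (for ast.get_source_segment) and input that parses as valid Python. It avoids the visit_Name problem described in the question by overriding visit_Call so that a called function's name is never visited.

```python
import ast


class VariableCollector(ast.NodeVisitor):
    """Collect names and subscript expressions, skipping called function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Name(self, node):
        self.found.append(node.id)

    def visit_Subscript(self, node):
        # Record the subscript exactly as it is written in the source.
        self.found.append(ast.get_source_segment(self.source, node))
        # A plain-name base (bar in bar[1]) is not reported on its own.
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Look inside the index or slice for nested variables.
        self.visit(node.slice)

    def visit_Call(self, node):
        # Skip the function name itself, but still inspect the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)


def extract_variables(source):
    collector = VariableCollector(source)
    collector.visit(ast.parse(source))
    return collector.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

Because the parser handles the nesting, no recursion limit or bracket bookkeeping is needed. The trade-off is that the input has to be syntactically valid Python.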

I want to split a string by a character on its first occurence, which belongs to a list of characters. How to do this in python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something along the lines of:
def find_char(string):
    if "some_char" in string:
        pass  # do xyz with some_char
    elif "another_char" in string:
        pass  # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",", "*", ";", "/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more pythonic way of doing this? Or would the above procedure be fine? Please help, I'm not very proficient in python.
(EDIT): I want it to split on the first occurrence of the character, which is encountered first. That is to say, if the string contains multiple commas, and multiple stars, then I want it to split by the first occurrence of the comma. Please note, if the star comes first, then it will be broken by the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can avoid recompiling the expression on every call by writing a function that compiles it once and returns a second function that reuses the compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple delimiters as one if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that with maxsplit=1 the + only matters when delimiters are adjacent: 'a,,b' splits into ['a', 'b'] with the + and into ['a', ',b'] without it.
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.
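Tying this back to the edit in the question, here is a small sketch of splitting on whichever listed character occurs first. The function name split_on_first_delimiter is made up for illustration, and returning False when no delimiter is present simply mirrors the asker's own function.

```python
import re


def split_on_first_delimiter(text, delimiters=',*;/'):
    """Split text once, at the earliest occurrence of any delimiter character."""
    # re.escape keeps characters such as * literal inside the character class.
    pattern = '[{}]'.format(re.escape(delimiters))
    parts = re.split(pattern, text, maxsplit=1)
    # re.split returns a single-element list when nothing matched.
    return parts if len(parts) == 2 else False


print(split_on_first_delimiter('a*b,c'))
print(split_on_first_delimiter('a,b*c,d'))
print(split_on_first_delimiter('abc'))
```

The regex engine scans the string from left to right, so the split happens at the earliest delimiter in the string, regardless of the order in which the delimiters are listed.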

Python Regex Partial Match or "hitEnd"

I'm writing a scanner, so I'm matching an arbitrary string against a list of regex rules. It would be useful if I could emulate the Java "hitEnd" functionality of knowing not just when the regular expression didn't match, but when it can't match; when the regular expression matcher reached the end of the input before deciding it was rejected, indicating that a longer input might satisfy the rule.
For example, maybe I'm matching html tags for starting to bold a sentence of the form "< b >". So I compile my rule
bold_html_rule = re.compile("<b>")
And I run some tests:
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
How can I tell the difference between the "bad" match, for which goat can never be made valid by more input, and the ambiguous match that isn't a match yet, but could be?
Attempts
It is clear that in the above form there is no way to distinguish the two, because both the uncertain attempt and the bad attempt return None. If I wrap all rules in "(RULE)?" then any input will return a match, because at the least the empty string is a substring of all strings. However, when I try to see how far the regex progressed before rejecting my string by using the group method or the endpos attribute, it is always just the length of the string.
Does the Python regex package do a lot of extra work and traverse the whole string even if it's an invalid match on the first character? I can see why it would have to if I used search, which will verify if the sequence is anywhere in the input, but it seems very strange to do so for match.
I've found the question asked before (on non-stackoverflow places) like this one:
https://mail.python.org/pipermail/python-list/2012-April/622358.html
but he doesn't really get a response.
I looked at the regular expression package itself but wasn't able to discern its behavior; could I extend the package to get this result? Is this the wrong way to tackle my task in the first place? (I've built effective Java scanners using this strategy in the past.)
Try this out. It does feel like a hack, but it achieves the result you are looking for. I am a bit concerned about the PrepareCompileString function, though: it should be able to handle all escaped characters, but it cannot handle wildcards.
import re

# Grouping every single character
def PrepareCompileString(regexString):
    newstring = ''
    escapeFlag = False
    for char in regexString:
        if escapeFlag:
            char = escapeString + char
            escapeFlag = False
            escapeString = ''
        if char == '\\':
            escapeFlag = True
            escapeString = char
        if not escapeFlag:
            newstring += '({})?'.format(char)
    return newstring

def CheckMatch(match):
    # counting the number of non matched groups
    count = match.groups().count(None)
    # If all groups matched - good match
    # all groups did not match - bad match
    # few groups matched - uncertain match
    if count == 0:
        print('Good Match:', match.string)
    elif count < len(match.groups()):
        print('Uncertain Match:', match.string)
    elif count == len(match.groups()):
        print('Bad Match:', match.string)

regexString = '<b>'
bold_html_rule = re.compile(PrepareCompileString(regexString))
good_match = bold_html_rule.match("<b>")
uncertain_match = bold_html_rule.match("<")
bad_match = bold_html_rule.match("goat")
for match in [good_match, uncertain_match, bad_match]:
    CheckMatch(match)
I got this result:
Good Match: <b>
Uncertain Match: <
Bad Match: goat
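One limitation of making every character independently optional is that an input such as "<>" is still reported as uncertain, even though no further input can turn it into "<b>". A possible variation, sketched here under the loud assumption that the rule is plain literal text (no wildcards, classes, or quantifiers), nests the optional groups so that each character can only match after all of the preceding ones. The names prefix_pattern and classify are made up for illustration.

```python
import re


def prefix_pattern(literal):
    """Compile a regex that matches the longest prefix of literal (literal text only)."""
    pattern = ''
    # Build from the last character outwards, e.g. (?:<(?:b(?:>)?)?)? for '<b>'.
    for char in reversed(literal):
        pattern = '(?:{}{})?'.format(re.escape(char), pattern)
    return re.compile(pattern)


def classify(literal, text):
    """Return 'good', 'uncertain', or 'bad' for text checked against literal."""
    matched = prefix_pattern(literal).match(text).end()
    if matched == len(literal):
        return 'good'        # the whole rule matched
    if matched == len(text):
        return 'uncertain'   # the input ran out while it was still matching
    return 'bad'             # a character mismatched before the input ended


for candidate in ['<b>', '<', 'goat', '<>']:
    print(candidate, classify('<b>', candidate))
```

The 'uncertain' case corresponds to what the question calls hitEnd: the matcher consumed the whole input without rejecting it. Extending this idea beyond literal rules would need real support for partial matching in the regex engine.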

replace regex variable with string in python

I have a situation where I have a regular expression like this
regex_string = r'(?P<x>\d+)\s(?P<y>\w+)'
r = re.compile(regex_string)
and, before I start matching things with it, I'd like to replace the regex group named x with a particular value, say 2014. This way, when I search for matches to this regular expression, we will only find things that have x=2014. What is the best way to approach this issue?
The challenge here is that both the original regular expression regex_string and the arbitrary replacement value x=2014 are specified by an end user. In my head, the ideal thing would be to have a function like replace_regex:
r = re.compile(regex_string)
r = replace_regex_variables(r, x=2014)
for match in r.finditer(really_big_string):
do_something_with_each_match(match)
I'm open to any solution, but I'm specifically interested in whether it's possible to do this without checking matches after they are returned by finditer, so as to take advantage of re's performance. In other words, preferably NOT this:
r = re.compile(regex_string)
for match in r.finditer(really_big_string):
    if match.groupdict()['x'] == '2014':
        do_something_with_each_match(match)
You want something like this, don't you?
r = r'(?P<x>%(x)s)\s(?P<y>\w+)'
r = re.compile(r % {'x': 2014})
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
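If the user-supplied pattern cannot be rewritten as a template, one alternative is to rewrite the named group's body before compiling. The following is only a sketch of the replace_regex_variables idea from the question, with two deliberate differences that are assumptions of this example: it takes the pattern string rather than a compiled pattern, and it only handles named groups whose body contains no parentheses at all.

```python
import re


def replace_regex_variables(regex_string, **values):
    """Pin named groups to fixed values, then compile the rewritten pattern.

    Assumption: the body of each targeted group contains no '(' or ')'.
    """
    for name, value in values.items():
        group = re.compile(r'\(\?P<{}>[^()]*\)'.format(re.escape(name)))
        replacement = '(?P<{}>{})'.format(name, re.escape(str(value)))
        # A function replacement keeps backslashes in the new text literal.
        regex_string = group.sub(lambda m, r=replacement: r, regex_string)
    return re.compile(regex_string)


r = replace_regex_variables(r'(?P<x>\d+)\s(?P<y>\w+)', x=2014)
for match in r.finditer('2014 foo 2015 bar 2014 baz'):
    print(match.group('x'), match.group('y'))
```

The filtering then happens inside the regex engine during finditer, which is what the question asks for. A group body with nested or escaped parentheses would need a proper parser for the pattern instead of the simple [^()]* shortcut used here.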

Logic for finding and excluding multiple matches from a list

I need to match the contents of a list against a given pattern and form another list containing everything except the matches. In other words, I am trying to make an exclude list.
With one pattern this is easy, but with more than one it becomes tricky.
Let's see an example:
import re
Lmain = ['arc123', 'arc234', 'xyz111', 'xyz222', 'ppp999', 'ppp888']
result = []
for i in range(len(Lmain)):
    if re.search(pattern, Lmain[i]):
        pass
    else:
        result.append(Lmain[i])
Now let's say pattern = 'arc'; my result will be
result = ['xyz111', 'xyz222', 'ppp999', 'ppp888']
This is just the logic; I will be using regular expressions to find the matches.
Now if we have 2 patterns, then using above logic in a loop :
Pattern = ['arc', 'xyz']
for pat in Pattern:
    for i in range(len(Lmain)):
        if re.search(pat, Lmain[i]):
            pass
        else:
            result.append(Lmain[i])
This will give us the wrong result:
result = ['xyz111', 'xyz222', 'ppp999', 'ppp888', 'arc123', 'arc234', 'ppp999', 'ppp888']
So you can see that the above logic just won't work.
My plan:
First we find the exclude list for the first pattern, which gives us:
result = ['xyz111', 'xyz222', 'ppp999', 'ppp888']
For the second pattern, we need to look into the above result:
for i in range(len(result)):
    if re.search(pattern, result[i]):
        pass
    else:
        result_final.append(result[i])
I think we need to use recursion to implement the above logic. How do we do that?
Also, we don't know the number of patterns the user is going to enter. It can be one or more.
If anybody has any ideas for the logic, please share.
Using a list comprehension and a generator expression, and skipping the intermediate step of building an exclude list and just building the final list:
>>> import re
>>> Lmain=['arc123', 'arc234', 'xyz111', 'xyz222','ppp999','ppp888']
>>> Pattern=['arc','xyz']
>>> [x for x in Lmain if not any(re.search(y, x) for y in Pattern)]
['ppp999', 'ppp888']
for item in lst:
    if all(pat not in item for pat in patterns):
        exclude_list.append(item)
Replace the in test with whatever is more appropriate in your case (e.g. not item.startswith(pat)).
If there are more matches than non-matches, it should be more efficient to find the matches first, and then exclude them:
matches = [x for x in lst if any(x.startswith(p) for p in patterns)]
exclude_list = list(set(lst).difference(matches))
Yet another (and probably the fastest) option is to use regular expressions (here in combination with filter):
import re
expr = '^(?!%s)' % '|'.join(patterns)
# filter() returns an iterator in Python 3, so wrap it in list()
exclude_list = list(filter(re.compile(expr).search, lst))
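Since the number of patterns is not known in advance, the approaches above can be wrapped in a small helper. This is a minimal sketch rather than any answerer's code: the name exclude_matching is made up, and it assumes the user's patterns are valid regular expressions that should be searched for anywhere in each item.

```python
import re


def exclude_matching(items, patterns):
    """Return the items that match none of the given regex patterns, in order."""
    if not patterns:
        return list(items)
    # Wrap each pattern in a non-capturing group so the alternation stays well-formed.
    combined = re.compile('|'.join('(?:{})'.format(p) for p in patterns))
    return [item for item in items if not combined.search(item)]


Lmain = ['arc123', 'arc234', 'xyz111', 'xyz222', 'ppp999', 'ppp888']
print(exclude_matching(Lmain, ['arc', 'xyz']))
print(exclude_matching(Lmain, ['arc']))
```

No recursion is needed: each item is tested once against all patterns together, so the helper works for one pattern or many and keeps the original order of the list.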
matched = False
for pat in Pattern:
    if re.search(pat, Lmain[i]):
        matched = True
        break
if matched:
    pass
else:
    result.append(Lmain[i])
