Python re.findall regex and text processing - python

I'm looking to find and modify some SQL syntax around the convert function. Basically, I want every convert(A,B) or CONVERT(A,B) in all my files to be found and rewritten as B::A.
So far I have tried selecting them with re.findall(r"\bconvert\b\(.*?,.*\)", l, re.IGNORECASE), but it only returns a small selection of what I want, and I also have trouble actually manipulating the A and B parts I mentioned.
For example, take this sample line (the nested structure here is irrelevant; I'm only trying to get the outer layer working if possible):
convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || convert(nvarchar, o.Option_Number) || '/') as LocPath
...should become...
'/' || es.Item_ID::nvarchar || ':' || o.Option_Number::nvarchar || '/' :: varchar as LocPath
Example2:
SELECT LocationID AS ItemId, convert(bigint, -1),
...should become...
SELECT LocationID AS ItemId, -1::bigint,
I think this should be possible with some kind of re.sub with groups. I currently have the following code structure inside a for loop, where line is each line in the file:
matchConvert = ["convert(", "CONVERT("]
a = next((a for a in matchConvert if a in line), False)
if a:
    print("convert() line")
    #line = re.sub(re.escape(a) + r'', '', line)
Edit: In the end I went with a non-re solution and handled each line by identifying each block and manipulating it accordingly.

This may be an X/Y problem, meaning you’re asking how to do something with Regex that may be better solved with parsing (meaning using/modifying/writing a SQL parser). An indication that this is the case is the fact that “convert” calls can be nested. I’m guessing Regex is going to be more of a headache than it’s worth here in the long run if you’re working with a lot of files and they’re at all complicated.
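Short of a full SQL parser, a small hand-written scanner that tracks parenthesis depth and single-quoted strings can already cope with nesting and with upper-case CONVERT. The following is only a sketch under assumptions: the names CONVERT_CALL, SIMPLE_OPERAND, scan_call and rewrite_convert are made up for illustration, only two-argument calls are rewritten, and any operand that is not a bare name or number is wrapped in parentheses, so the output differs slightly from the hand-written example in the question.

```python
import re

# Hypothetical helper names; only the standard re module is used.
CONVERT_CALL = re.compile(r"\bconvert\s*\(", re.IGNORECASE)
SIMPLE_OPERAND = re.compile(r"-?[\w.]+")


def scan_call(sql, open_idx):
    """Return (close_idx, comma_positions) for the call whose '(' is at open_idx.

    Only commas directly inside this call (depth 1, outside quotes) are recorded.
    close_idx is -1 when the parentheses are unbalanced.
    """
    depth = 0
    in_quote = False
    commas = []
    for i in range(open_idx, len(sql)):
        ch = sql[i]
        if ch == "'":
            in_quote = not in_quote
        elif in_quote:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i, commas
        elif ch == "," and depth == 1:
            commas.append(i)
    return -1, commas


def rewrite_convert(sql):
    """Rewrite every two-argument convert(type, expr) as expr::type."""
    match = CONVERT_CALL.search(sql)
    if match is None:
        return sql
    open_idx = match.end() - 1
    close_idx, commas = scan_call(sql, open_idx)
    if close_idx == -1 or len(commas) != 1:
        # Unbalanced or not a two-argument call: keep it, but still process the rest.
        return sql[:match.end()] + rewrite_convert(sql[match.end():])
    type_name = sql[open_idx + 1:commas[0]].strip()
    # Recurse into the expression so nested convert calls are rewritten first.
    expr = rewrite_convert(sql[commas[0] + 1:close_idx].strip())
    if not SIMPLE_OPERAND.fullmatch(expr):
        expr = "(" + expr + ")"
    return sql[:match.start()] + expr + "::" + type_name + rewrite_convert(sql[close_idx + 1:])


print(rewrite_convert("SELECT LocationID AS ItemId, convert(bigint, -1),"))
```

Calls with three or more arguments are left untouched here, which seemed the safer default for a bulk rewrite across many files.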

The task:
Swap the parameters of all the 'convert' functions in the given string. Parameters can contain any character, including nested 'convert' functions.
A solution:
def convert_py(s):
    #capturing start:
    left=s.index('convert')
    start=s[:left]
    #capturing part_1:
    c=0
    line=''
    for n1,i in enumerate(s[left+8:],start=len(start)+8):
        if i==',' and c==0:
            part_1=line
            break
        if i==')':
            c-=1
        if i=='(':
            c+=1
        line+=i
    #capturing part_2:
    c=0
    line=''
    for n2,i in enumerate(s[n1+1:],start=n1+1):
        if i==')':
            c-=1
        if i=='(':
            c+=1
        if c<0:
            part_2=line
            break
        line+=i
    #capturing end:
    end=s[n2+1:]
    #capturing result:
    result=start+part_2.lstrip()+' :: '+part_1+end
    return result
def multi_convert_py(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_py(s)
    return s
Notes:
Unlike the re-based solution presented in another answer, this version should not fail if a 'convert' call in the given string has more than two parameters. However, it only swaps around the first comma, for example: convert(a,b, c) --> b, c :: a
I am afraid that unforeseen cases may arise that lead to failure. Please tell me if you find any flaws.

The task:
Swap the parameters of all the 'convert' functions in the given string. Parameters can contain any character, including nested 'convert' functions.
A solution based on the re module:
def convert_re(s):
    import re
    start,part_1,part_2,end=re.search(r'''
        (.*?)
        convert\(
        ([^,)(]+\(.+?\)[^,)(]*|[^,)(]+)
        ,
        ([^,)(]+\(.+?\)[^,)(]*|[^,)(]+)
        \)
        (.*)
        ''',s,re.X).groups()
    result=start+part_2.lstrip()+' :: '+part_1+end
    return result
def multi_convert_re(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_re(s)
    return s
Description of the 'convert_re' function:
Regular expression:
start is the first group, (.*?), which holds what comes before 'convert'.
Then follows convert\(, which has no group and matches the name of the function and the opening '('.
part_1 is the second group, ([^,)(]+\(.+?\)[^,)(]*|[^,)(]+). This should match the first parameter. It can be anything except the characters ,)( or a function call preceded by anything except ,)( and optionally followed by anything except ,)( with anything inside its parentheses (except a newline).
Then follows a comma, which has no group.
part_2 is the third group; it works like the second, but should catch everything that is left inside the outer function call.
Then follows \), which has no group.
end is the fourth group, (.*), which holds what is left before the newline.
The resulting string is then created by swapping part_1 and part_2, putting ' :: ' between them, removing leading spaces from part_2, and adding start to the beginning and end to the end.
Description of the 'multi_convert_re' function:
Repeatedly calls the 'convert_re' function until no "convert" is left.
Notes:
N.B.: The code assumes that the 'convert' function in the string has exactly two parameters.
The code works on the given examples, but I'm afraid there may still be unforeseen flaws when it comes to other examples. Please tell me if you find any.
I have posted another solution, not based on the re module, in a separate answer. Its results may differ.

Here's my solution based on @Иван-Балван's code. Breaking this structure into blocks makes further specification a lot easier than I previously thought, and I'll be using this method for a lot of other operations as well.
# Check for balanced brackets
def checkBracket(my_string):
    count = 0
    for c in my_string:
        if c == "(":
            count+=1
        elif c == ")":
            count-=1
    return count
# Modify the first convert in line
# Based on suggestions from stackoverflow.com/questions/73040953
def modifyConvert(l):
    # find the location of convert()
    count = l.index('convert(')
    # select the group before convert() call
    before = l[:count]
    group=""
    n1=0
    n2=0
    A=""
    B=""
    operate = False
    operators = ["|", "<", ">", "="]
    # look for A group before comma
    for n1, i in enumerate(l[count+8:], start=len(before)+8):
        # find current position in l
        checkIndex = checkBracket(l[count+8:][:n1-len(before)-8])
        if i == ',' and checkIndex == 0:
            A = group
            break
        group += i
    # look for B group after comma
    group = ""
    for n2, i in enumerate(l[n1+1:], start=n1+1):
        checkIndex = checkBracket(l[count+n1-len(before):][:n2-n1+1])
        if i == ',' and checkIndex == 0:
            return l
        elif checkIndex < 0:
            B = group
            break
        group += i
        # mark operators
        if i in operators:
            operate = True
    # select the group after convert() call
    after = l[n2+1:]
    # (B) if it contains operators
    if operate:
        return before + "(" + B.lstrip() + ') :: ' + A + after
    else:
        return before + B.lstrip() + '::' + A + after
# Modify cast syntax with convert(a,b). return line.
def convertCast(l):
    # Call helper for nested cases
    i = l.count('convert(')
    while i>0:
        i -= 1
        l = modifyConvert(l)
    return l

Related

Formatting in python(Kivy) like in Stack overflow

My issue is that I would like to take input text with formatting like you would use when creating a Stack Overflow post and reformat it into the required text string. The best way I can explain is to give an example:
# This is the input string
Hello **there**, how are **you**
# This is the intended output string
Hello [font=Nunito-Black.ttf]there[/font], how are [font=Nunito-Black.ttf]you[/font]
So each ** is replaced by a different string that has an opening and a closing part, and this needs to work as many times as needed for any string (twice in the example above).
I have tried to use a variable to record whether the ** to be replaced is an opening or a closing part, but I haven't managed to get a function to work yet, hence it being incomplete.
I think replacing the correct ** is hard because I have been trying to use index, which only returns the position of the first occurrence in the string.
My attempt as of now:
def formatting_text(input_text):
    if input_text:
        if '**' in input_text:
            d = '**'
            for line in input_text:
                s = [e+d for e in line.split(d) if e]
                count = 0
                for y in s:
                    if y == '**' and count == 0:
                        s.index(y)
                        # replace with required part
            return output_text
    return input_text
I have tried to find an answer to this, so I'm sorry if it has already been asked, but I have had no luck finding it and don't know what to search for.
Thank you for any help.
A general solution for your case,
Using re
import re
def formatting_text(input_text, special_char, left_repl, right_repl):
    # Escape the marker so that characters such as * are matched literally.
    marker = re.escape(special_char)
    # Match the shortest run of text enclosed by a pair of markers.
    pattern = marker + r"(.+?)" + marker
    # Wrap the enclosed text in the replacement parts.
    return re.sub(pattern, lambda m: left_repl + m.group(1) + right_repl, input_text)
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
Without using re
def formatting_text(input_text, special_char, left_repl, right_repl):
    while True:
        # Replace the left part.
        input_text = input_text.replace(special_char, left_repl, 1)
        # Replace the right part.
        input_text = input_text.replace(special_char, right_repl, 1)
        if input_text.find(special_char) == -1:
            # Nothing found, time to stop.
            break
    return input_text
print(formatting_text("Hello **there**, how are **you**", "**", "[font=Nunito-Black.ttf]", "[/font]"))
The above solution should also work for other markers such as __, * or <. But if you only want bold text, you may prefer Kivy's label markup for bold, i.e. the opening tag [b] and the closing tag [/b].
The formatting Stack Overflow uses is Markdown, implemented in JavaScript. If you just want this single case to be formatted, you can see an implementation here, where they use a regex to find the matches and then just iterate through them.
STRONG_RE = r'(\*{2})(.+?)\1'
I would recommend against re-implementing an entire markdown solution yourself when you can just import one.
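For the single bold case, the STRONG_RE pattern quoted above can be applied directly with re.sub. This is a minimal sketch, assuming the font name from the question; the function name bold_to_kivy and its font parameter are made up for illustration.

```python
import re

# Pattern quoted in the answer: two asterisks, the shortest possible text, the same two asterisks.
STRONG_RE = re.compile(r'(\*{2})(.+?)\1')


def bold_to_kivy(text, font="Nunito-Black.ttf"):
    """Replace each **text** span with Kivy font markup (hypothetical helper)."""
    return STRONG_RE.sub(
        lambda m: "[font={}]{}[/font]".format(font, m.group(2)),
        text,
    )


print(bold_to_kivy("Hello **there**, how are **you**"))
```

A function is used as the replacement so that backslashes in the inserted markup can never be misread as group references.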

Remove the parentheses

Recently I was solving a problem on Codewars and got stuck. Link to the problem: link
Basically what it is asking for is :
You are given a string for example :
"example(unwanted thing)example"
Your task is to remove everything inside the parentheses as well as the parentheses themselves.
The example above would return:
"exampleexample"
Don't worry about other brackets like "[]" and "{}" as these will never appear.
There can be multiple parentheses.
The parentheses can be nested.
Some other test cases are given below :
test.assert_equals(remove_parentheses("hello example (words(more words) here) something"), "hello example something")
test.assert_equals(remove_parentheses("(first group) (second group) (third group)"), " ")
I looked online and found some solutions involving regex, but I wanted to solve this problem without regex.
So far I have tried solutions similar to the one below:
def remove_parentheses(s):
    while s.find('(') != -1 or s.find(')') != -1 :
        f = s.find('(')
        l = s.find(')')
        s = s[:f] + s [l+1:]
    return s
But when I try to run this snippet, I get Execution Timed Out.
You just need to track the number of open parentheses (the nested depth, technically) to see whether the current character should be included in the output.
def remove_parentheses(s):
    parentheses_count = 0
    output = ""
    for i in s:
        if i=="(":
            parentheses_count += 1
        elif i==")":
            parentheses_count -= 1
        else:
            if parentheses_count == 0:
                output += i
    return output
print(remove_parentheses("hello example (words(more words) here) something"))
print(remove_parentheses("(first group) (second group) (third group)"))
Use a stack to check whether each '(' has been closed.
If the stack is not empty, parentheses are still open, so you have to ignore the characters.
The code below will pass all the test cases.
def remove_parentheses(s):
    stack = []
    answer = []
    for character in s:
        if(character == '('):
            stack.append('(')
            continue
        if(character == ')'):
            stack.pop()
            continue
        if(len(stack) == 0):
            answer.append(character)
    return "".join(answer)
The reason your code gets Execution Timed Out is that it is stuck in an infinite loop, because s = s[:f] + s [l+1:] doesn't remove the parentheses properly. Take
a nested example such as hello example (words(more words) here) something
Your code locates the first ( and the first ) and produces hello example here) something in the first iteration, which leads to an incorrect result in the next iteration because one of your ( has already been removed.
To be honest, an approach like this is not ideal: it is difficult to read and understand, since you have to dry-run the indexes in the loop one by one. You could continue to debug this code and fix the indexing, for example by searching only for the closing bracket that pairs with the ( you located, which makes it even harder to read but gets the job done.
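One way the indexing fix could look, as a sketch only: find the first closing bracket, then search backwards for the nearest opening bracket before it, so that the innermost pair is always the one removed. It assumes balanced input and stops as soon as no complete pair is left.

```python
def remove_parentheses(s):
    """Remove innermost (...) groups one at a time using only find/rfind."""
    while True:
        close = s.find(")")
        if close == -1:
            break  # no closing bracket left
        # Nearest opening bracket before that closing bracket.
        open_ = s.rfind("(", 0, close)
        if open_ == -1:
            break  # stray ')' without a partner; stop instead of looping forever
        s = s[:open_] + s[close + 1:]
    return s


print(repr(remove_parentheses("hello example (words(more words) here) something")))
print(repr(remove_parentheses("(first group) (second group) (third group)")))
```

Because each pass removes one complete pair, the loop always terminates, unlike the version in the question.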
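Personally, I would suggest looking into regular expressions, often referred to as regex.
A very simple algorithm that builds on regex, removing the innermost pair of parentheses until none is left, is
import re
def remove_parentheses(s):
    # Strip innermost (...) groups until no balanced pair remains.
    pattern = re.compile(r"\([^()]*\)")
    while pattern.search(s):
        s = pattern.sub("", s)
    return s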
def f(s):
    pairs = []
    output = ''
    for i, v in enumerate(s):
        if "(" == v:
            pairs.append(1)
        if ")" == v:
            pairs.pop()
            continue
        if len(pairs) ==0:
            output +=v
    return output
This can be achieved easily with a recursive function. Try this out:
def rmp(st):
    if st.find('(') == -1 or st.find(')') == -1: return st
    else :
        i=st.rindex('(')
        j=st[i+1:].index(')')
        return rmp(st[:i] + st[i+1+j+1:])
Here are a few cases I tested...
print(rmp("hello example (words(more words) here) something"))
print(rmp("(first group) (second group) (third group)"))
print(rmp("This does(n't) work (so well)"))
print(rmp("(1233)qw()"))
print(rmp("(1(2(3(4(5(6(7(8))))))))abcdqw(hkfjfj)"))
And the results are..
hello example something
This does work
qw
abcdqw

how to delete char after -> without using a regular expression

Given a string s representing characters typed into an editor,
with "->" representing a delete, return the current state of the editor.
Every "->" should delete one character. If there are two in a row, i.e. "->->", the two characters after the symbols should be deleted.
Example 1
Input
s = "a->bcz"
Output
"acz"
Explanation
The "b" got deleted by the delete.
Example 2
Input
s = "->x->z"
Output
empty string
Explanation
All characters are deleted. Also note you can type delete when the editor
is empty as well.
I have tried the following function, but it didn't work:
def delete_forward(text):
    """
    return the current state of the editor after deletion of characters
    """
    f = "->"
    for i in text:
        if (i==f):
            del(text[i+1])
How can I complete this without using regular expressions?
Strings do not support item deletion. You have to create a new string.
>>> astring = 'abc->def'
>>> astring.index('->') # Look at the index of the target string
3
>>> x=3
>>> astring[x:x+3] # Here is the slice you want to remove
'->d'
>>> astring[0:x] + astring[x+3:] # Here is a copy of the string before and after, but not including the slice
'abcef'
This only handles one '->' per string, but you can iterate on it.
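Instead of slicing once per arrow, the same idea can be carried out in a single left-to-right pass that counts pending deletions. This is a sketch under the assumption that arrows always take precedence over ordinary characters and that n consecutive arrows delete the next n characters, as the question describes; the variable name pending is made up for illustration.

```python
def delete_forward(s):
    """Return the editor state, treating each '->' as one forward delete."""
    out = []
    pending = 0  # number of characters still to be deleted
    i = 0
    while i < len(s):
        if s.startswith("->", i):
            pending += 1
            i += 2
        elif pending:
            # This character is consumed by an earlier arrow.
            pending -= 1
            i += 1
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


print(delete_forward("a->bcz"))   # acz
print(repr(delete_forward("->x->z")))  # ''
```

Deletes that are still pending when the string ends are simply dropped, which matches the note that delete may be typed in an empty editor.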
Here's a simple recursive solution:
# Constant storing the length of the arrow
ARROW_LEN = len('->')
def delete_forward(s: str):
    try:
        first_occurence = s.index('->')
    except ValueError:
        # No more arrows in string
        return s
    if s[first_occurence + ARROW_LEN:first_occurence + ARROW_LEN + ARROW_LEN] == '->':
        # Don't delete part of the next arrow
        next_s = s[first_occurence + ARROW_LEN:]
    else:
        # Delete the character immediately following the arrow
        next_s = s[first_occurence + ARROW_LEN + 1:]
    # Keep everything before the arrow and process the remainder
    return s[:first_occurence] + delete_forward(next_s)
Remember, python strings are immutable so you should instead rely on string slicing to create new strings as you go.
In each recursion step, the first index of -> is located and everything before this is extracted out. Then, check if there's another -> immediately following the current location - if there is, don't delete the next character and call delete_forward with everything after the first occurrence. If what is immediately followed is not an arrow, delete the immediately next character after the current arrow, and feed it into delete_forward.
This will turn x->zb into xb.
The base case for the recursion is when .index finds no matches, in which case the result string is returned.
Output
>>> delete_forward('ab->cz')
'abz'
>>> delete_forward('abcz')
'abcz'
>>> delete_forward('->abc->z')
'bc'
>>> delete_forward('abc->z->')
'abc'
>>> delete_forward('a-->b>x-->c>de->f->->g->->->->->')
'a->x->de'
There could be several methods to achieve this in python e.g.:
Using split and a list comprehension (if you want to delete a single character every time one or more delete markers are encountered):
def delete_forward(s):
    return ''.join([s.split('->')[0]] + [i[1:] if len(i)>1 else "" for i in s.split('->')[1:]])
Now delete_forward("a->bcz") returns 'acz' and delete_forward("->x->z") returns ''. Note that this works for every possible case, whether there are many delete markers, one, or none at all. Moreover, it will never raise an exception as long as the input is a str. It does, however, assume that you want to delete a single character every time one or more delete markers are encountered.
If you want to delete as many characters as the number of times delete characters occur:
def delete_forward(s):
    new_str =''
    start = 0
    for end in [i for i in range(len(s)) if s.startswith('->', i)] +[len(s)+1]:
        new_str += s[start:end]
        count = 0
        start = max(start, end)
        while s[start:start+2] =='->':
            count+=1
            start+=2
        start += count
    return new_str
This produces the same output for the two cases above; however, for 'a->->bc' it produces 'a' instead of the 'ac' produced by the first function.

Extract all variables from a string of Python code (regex or AST)

I want to find and extract all the variables in a string that contains Python code. I only want to extract the variables (and variables with subscripts) but not function calls.
For example, from the following string:
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Please note that some variables may be "nested". For example, from baz[1:10:var1[2+1]] I want to extract baz[1:10:var1[2+1]] and var1[2+1].
The first two ideas that come to mind are to use either a regex or an AST. I have tried both but with no success.
When using a regex, in order to make things simpler, I thought it would be a good idea to first extract the "top level" variables, and then recursively the nested ones. Unfortunately, I can't even do that.
This is what I have so far:
regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)
Here is a demo: https://regex101.com/r/INPRdN/2
The other solution is to use an AST, extend ast.NodeVisitor, and implement the visit_Name and visit_Subscript methods. However, this doesn't work either because visit_Name is also called for functions.
I would appreciate if someone could provide me with a solution (regex or AST) to this problem.
Thank you.
I found your question an interesting challenge, so here is some code that does what you want. Doing this with regex alone is impossible because the expressions are nested, so this solution combines regex with string manipulation to handle nested expressions:
# -*- coding: utf-8 -*-
import re
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')
def extract_expression(string):
    """ extract all identifier and getitem expression in the given order."""
    def remove_brackets(text):
        # 1. handle `[...]` expression replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep extracting expression until there is no expression
        while re.search(pattern, text):
            # substitute on text (not the outer string) so every pass makes progress
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text
    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)
    def replace_expression(match):
        """ save the expression in the list, replace it with special key and its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None) # just to make sure the expression is inserted before its inner identifier
        # if the expression contains identifier extract too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)
    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]
    # list that will contain the extracted expressions
    expressions = []
    string = remove_brackets(string)
    # 2. extract all expression and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expression until there is no expression
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the content of brackets contains a getitem expression
        # so when we extract that expression we handle the brackets
        string = remove_brackets(string)
    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')
    # 4. Order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert it to integer
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]
    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])
    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result
code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))
Output:
['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']
I tested this code on very complicated examples and it worked perfectly. Notice that the order of extraction is the same as you wanted. I hope this is what you needed.
This answer might be too late, but it is possible to do this using the Python regex package.
import regex
code= '''foo + bar[1] + baz[1:10:var1[2+1]] +
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2
(var3[0])'''
p=r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
result=regex.findall(p,code,overlapped=True) #overlapped=True is needed to capture something inside a group like 'var1[2+1]'
[x[0] for x in result] #result variable is list of tuple of two,as each pattern capture two groups ,see below.
output:
['foo','bar[1]','baz[1:10:var1[2+1]]','var1[2+1]','qux[[1,2,int(var2)]]','var2','bob[len("foobar")]','var3[0]']
Pattern explanation:
(     # 1st capturing group start
\b[a-z]\w*\b     #variable name ,eg 'bar'
(?!\s*[\(\"])     #negative lookahead. so to ignore something like foobar"
(\[(?:[^\[\]]|(?2))*\])     #2nd capture group,capture nested groups in '[ ]'
                                        #eg '[1:10:var1[2+1]]'.
                                        #'?2' refer to 2nd capturing group recursively
?     #2nd capturing group is optional so to capture something like 'foo'
)     #end of 1st group.
Regex is not a powerful enough tool to do this. If your nesting has a finite depth, there are hacky workarounds that would let you build a complicated regex to do what you are looking for, but I would not recommend it.
This question is asked a lot, and the linked response is famous for demonstrating the difficulty of what you are trying to do.
If you really must parse a string of code, an AST would technically work, but I am not aware of a library to help with this. You would be best off trying to build a recursive function to do the parsing.
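For reference, the standard library's ast module (the one the question mentions) can be used for this if the visitor skips the callee name of each call. The sketch below is one possible approach, not the asker's or any answerer's code; the names VariableCollector and extract_variables are made up, and ast.get_source_segment requires Python 3.8 or later.

```python
import ast


class VariableCollector(ast.NodeVisitor):
    """Collect plain names and subscripted names, skipping function names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Call(self, node):
        # Skip a plain callee name such as `func`, but still visit the arguments.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)

    def visit_Subscript(self, node):
        # Report the whole subscript expression as written in the source.
        self.found.append(ast.get_source_segment(self.source, node))
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Nested variables inside the brackets are reported as well.
        self.visit(node.slice)

    def visit_Name(self, node):
        self.found.append(node.id)


def extract_variables(source):
    collector = VariableCollector(source)
    collector.visit(ast.parse(source))
    return collector.found


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
print(extract_variables(code))
```

The visitor walks the tree in source order, so outer subscripts are listed before the variables nested inside them.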

python's regular expression that repeats

I have a list of lines. I'm writing a typical text modifying function that runs through each line in the list and modifies it when a pattern is detected.
I realized later in writing this type of functions that a pattern may repeat multiple times in the line.
For example, this is one of the functions I wrote:
def change_eq(string):
    #inputs a string and outputs the modified string
    #replaces (X####=#) to (X####==#)
    #set pattern
    pat_eq=r"""(.*) #stuff before
    ([\(\|][A-Z]+[0-9]*) #either ( or | followed by the variable name
    (=) #single equal sign we want to change
    ([0-9]*[\)\|]) #numeric value of the variable followed by ) or |
    (.*)""" #stuff after
    p= re.compile(pat_eq, re.X)
    p1=p.match(string)
    if bool(p1)==1:
        # if pattern in pat_eq is detected, replace that portion of the string with a modified version
        original=p1.group(0)
        fixed=p1.group(1)+p1.group(2)+"=="+p1.group(4)+p1.group(5)
        string_c=string.replace(original,fixed)
        return string_c
    else:
        # returns the original string
        return string
But for an input string such as
'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781=0)*X2727'
, group() only works on the last pattern detected in the string, so it changes it to
'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781==0)*X2727'
, ignoring the first case detected. I understand that's the product of my function using the group attribute.
How would I address this issue? I know there is {m,n}, but does it work with match?
Thank you in advance.
Different languages handle "global" matches in different ways. You'll want to use Python's re.finditer (link) and use a for loop to iterate through the resulting match objects.
Example with some of your code:
p = re.compile(pat_eq, re.X)
string_c = string
for match_obj in p.finditer(string):
    original = match_obj.group(0)
    # use the current match object, not the p1 from the original function
    fixed = match_obj.group(1) + match_obj.group(2) + '==' + match_obj.group(4) + match_obj.group(5)
    string_c = string_c.replace(original, fixed)
return string_c
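Note that pat_eq begins and ends with (.*), so each match swallows the whole line and even finditer will see only one match per line. A sketch of an alternative, assuming the same variable-name and value shapes as in the question: drop the surrounding groups, anchor on ( or | with a lookbehind and lookahead, and let re.sub replace every occurrence. The name EQ_RE is made up for illustration.

```python
import re

# A single '=' that follows "( or | plus a variable name" and precedes "digits plus ) or |".
# The lookarounds do not consume the delimiters, so adjacent terms such as (A=1|B=2) both match.
EQ_RE = re.compile(r"(?<=[(|])([A-Z]+[0-9]*)=(?=[0-9]*[)|])")


def change_eq(line):
    """Replace every single '=' comparison with '==', leaving '!=' and '==' alone."""
    return EQ_RE.sub(r"\1==", line)


print(change_eq('IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781=0)*X2727'))
```

Running the function a second time leaves the line unchanged, because an existing == no longer satisfies the lookahead.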
