Parse a nested function to extract each inner function in Python

I have a nested expression as below
expression = 'position(\'a\' IN Concat("function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" , "function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" ))'
I want the output to list the nested function first and then the outer functions:
['Concat("function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" , "function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" )','position(\'a\' IN Concat("function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" , "function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" ))']
Below is the code I have tried
result = []
for i in range(len(expression)):
    if expression[i]=="(":
        a.append(i)
    elif expression[i]==")":
        fromIdx=a.pop()
        fromIdx2=max(a[-1],expression.rfind(",", 0, fromIdx))
        flag=False
        for (fromIndex, toIndex) in first_Index:
            if fromIdx2 + 1 >= fromIndex and i <= toIndex:
                flag=True
                break
        if flag==False:
            result.append(expression[fromIdx2+1:i+1])
But this works only if the arguments in the expression are separated by ','.
For example:
expression = 'position(\'a\' , Concat("function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" , "function_test"."PRODUCT_CATEGORIES"."CATEGORY_NAME" ))'
For this expression my code produces the expected result.
In the first expression, however, there is an IN operator instead of ',', so my code doesn't work.
Please help

If you want it to be reliable, you need a full-fledged SQL parser. Fortunately, there is an out-of-the-box solution for that: https://pypi.org/project/sqlparse/. Once you have a parsed token tree, you can walk through it and do what you need:
import sqlparse

def extract_functions(tree):
    res = []
    def visit(token):
        # Visit children first so inner calls are recorded before outer ones
        if token.is_group:
            for child in token.tokens:
                visit(child)
        if isinstance(token, sqlparse.sql.Function):
            res.append(token.value)
    visit(tree)
    return res

extract_functions(sqlparse.parse(expression)[0])
Explanation.
sqlparse.parse(expression) parses the string and returns a tuple of statements. As there is only one statement in the example, we can just take the first element. If there are many statements, you should rather iterate over all tuple elements.
extract_functions recursively walks the parsed token tree depth first (since you want inner calls to appear before outer ones). It uses token.is_group to decide whether the current token has children to descend into, then tests whether the current token is a function and, if it is, appends its string representation (token.value) to the result list.
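If installing sqlparse is not an option, the same inner-before-outer ordering can be approximated with a plain parenthesis stack. The following is only a rough sketch, not a SQL parser: the function name extract_calls is made up for illustration, it assumes a call looks like an identifier immediately followed by "(", and it only skips over simple quoted strings (no SQL comments, no backslash escapes).

```python
import re

def extract_calls(expression):
    """Return every name(...) call found in expression, innermost first."""
    calls = []
    stack = []    # start index of each open call, or None for a bare "("
    quote = None  # the quote character we are currently inside, if any
    for i, ch in enumerate(expression):
        if quote:
            # Ignore everything until the matching closing quote
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            # Find the identifier that ends directly before this "("
            m = re.search(r"[A-Za-z_]\w*$", expression[:i])
            stack.append(m.start() if m else None)
        elif ch == ")" and stack:
            start = stack.pop()
            if start is not None:
                calls.append(expression[start:i + 1])
    return calls

expression = 'position(\'a\' IN Concat("t"."A" , "t"."B" ))'
for call in extract_calls(expression):
    print(call)
```

Because a call is emitted when its closing parenthesis is reached, the separator between arguments (',' or IN) does not matter here.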

Related

Python: Index slicing from a list for each index in for loop

I got stuck slicing items from a list of data inside a for loop.
list = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]']
What I am looking for is to print only the key, without its value (running/stopped).
Overall code:
for each in list:
    print(each[:]) #not really sure what may work here
Expected result:
init.svc.logd
Does anyone have a quick solution?
If you want to print only the key, you can use split to take whatever comes before the ':' and then replace '[' and ']' with nothing if you don't want them:
list = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]']
for each in list:
    print(each.split(":")[0].replace('[','').replace(']',''))
which gives:
init.svc.logd
init.svc.logd-reinit
You should probably be using a regular expression. The concept of 'key' in the question is ambiguous as there are no data constructs shown that have keys - it's merely a list of strings. So...
import re
list_ = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]']
for e in list_:
    # Raw string so that \[ and \] are passed to the regex engine unchanged
    if r := re.findall(r'\[(.*?)\]', e):
        print(r[0])
Output:
init.svc.logd
init.svc.logd-reinit
Note:
This is more robust than string-splitting solutions when data are unexpectedly malformed: a line that contains no bracketed text is simply skipped, whereas the split approach would print it whole.
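If the values are needed later as well, the same idea extends to capturing both bracketed parts at once. This is a minimal sketch under the assumption that every well-formed line looks exactly like "[key]: [value]"; the names LINE_RE and parse_props are made up for illustration.

```python
import re

# Two capture groups: the bracketed key and the bracketed value
LINE_RE = re.compile(r'\[(.*?)\]: \[(.*?)\]')

def parse_props(lines):
    """Build a dict of key -> value, silently skipping malformed lines."""
    props = {}
    for line in lines:
        m = LINE_RE.fullmatch(line)
        if m:
            props[m.group(1)] = m.group(2)
    return props

lines = ['[init.svc.logd]: [running]', '[init.svc.logd-reinit]: [stopped]', 'garbage']
for key in parse_props(lines):
    print(key)
```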

Find elements between two tags in a list

Language: Python 3.4
OS: Windows 8.1
I have some lists like the following:
a = ['text1', 'text2', 'text3','text4','text5']
b = ['text1', 'text2', 'text3','text4','New_element', 'text5']
What is the simplest way to find the elements between two tags in a list?
I want to be able to get it even if the lists and tags have variable number of elements or variable length.
Ex: get the elements between text1 and text4, or between text1 and text5, etc. Or get the elements between text1 and text5 that have the longer length.
I tried using regular expressions like:
re.findall(r'text1(.*?)text5', a)
This gives me an error, I guess because re.findall only works on strings, not on lists.
To get the location of an element in a list use index(). Then use the discovered index to create a slice of the list like:
Code:
print(b[b.index('text3')+1:b.index('text5')])
Results:
['text4', 'New_element']
You can use the list.index method to find the first occurrence of each of your tags, then slice the list to get the value between the indexes.
def find_between_tags(lst, start_tag, end_tag):
    start_index = lst.index(start_tag)
    # Only look for the end tag after the start tag
    end_index = lst.index(end_tag, start_index)
    return lst[start_index + 1: end_index]
If either of the tags is not in the list (or if the end tag only occurs before the start tag), one of the index calls will raise a ValueError. You could suppress the exception if you want to do something else, but just letting the caller deal with it seems like a reasonable option to me, so I've left the exception uncaught.
If the tags might occur in this list multiple times, you could extend the logic of the function above to find all of them. For this you'll want to use the start argument to list.index, which will tell it not to look at values before the previous end tag.
def find_all_between_tags(lst, start_tag, end_tag):
    search_from = 0
    try:
        while True:
            start_index = lst.index(start_tag, search_from)
            end_index = lst.index(end_tag, start_index + 1)
            yield lst[start_index + 1:end_index]
            search_from = end_index + 1
    except ValueError:
        pass
This generator does suppress the ValueError, since it keeps on searching until it can't find another pair of tags. If the tags don't exist anywhere in the list, the generator will be empty, but it won't raise any exception (other than StopIteration).
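To make the behaviour of the generator concrete, here is a small usage sketch. The function is repeated so the example runs on its own; the sample list and the tags 'start' and 'end' are invented for illustration.

```python
def find_all_between_tags(lst, start_tag, end_tag):
    search_from = 0
    try:
        while True:
            start_index = lst.index(start_tag, search_from)
            end_index = lst.index(end_tag, start_index + 1)
            yield lst[start_index + 1:end_index]
            search_from = end_index + 1
    except ValueError:
        # No further start/end pair: end the generator quietly
        pass

data = ['start', 'x', 'end', 'noise', 'start', 'y', 'z', 'end']
groups = list(find_all_between_tags(data, 'start', 'end'))
print(groups)

# Tags that never occur give an empty result instead of an exception
print(list(find_all_between_tags(data, 'missing', 'end')))
```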
You can get the items between the values by utilizing the index function to search for the index of both objects in the list. Be sure to add one to the index of the first object so it won't be included in the result. See my code below:
def get_sublist_between(e1, e2, li):
    return li[li.index(e1) + 1:li.index(e2)]

How to create a censor "translator" via function in python

I'm trying to create a "translator" of sorts, in which if the raw_input has any curses (pre-determined, I list maybe 6 test ones), the function will output a string with the curse as ****.
This is my code below:
def censor(sequence):
    curse = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    nsequence = sequence.split()
    aword = ''
    bsequence = []
    for x in range(0, len(nsequence)):
        if nsequence[x] != curse:
            bsequence.append(nsequence[x])
        else:
            bsequence.append('*' * (len(x)))
        latest = ''.join(bsequence)
    return bsequence

if __name__ == "__main__":
    print(censor(raw_input("Your sentence here: ")))
A simple approach is to use Python's built-in string method str.replace:
def censor(string):
    curses = ('badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6')
    for curse in curses:
        # Replace every occurrence with the same number of asterisks
        string = string.replace(curse, '*' * len(curse))
    return string
To improve efficiency, you could try to compile the list of curses into a regular expression and then do a single replacement operation.
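A minimal sketch of that single-pass idea, assuming the same kind of placeholder word list as above (the names CURSES and CURSE_RE are made up for illustration). Like str.replace, it matches substrings and does not check word boundaries.

```python
import re

CURSES = ('badword1', 'badword2', 'badword3')

# Escape each word and try longer words first, so that a longer word
# is preferred over a shorter one that is its prefix.
CURSE_RE = re.compile(
    '|'.join(re.escape(w) for w in sorted(CURSES, key=len, reverse=True))
)

def censor(text):
    # Replace each match with as many asterisks as the match is long
    return CURSE_RE.sub(lambda m: '*' * len(m.group()), text)

print(censor('this badword1 is a badword3 test'))
```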
First, there's no need to iterate over element indices here. Python allows you to iterate over the elements themselves, which is ideal for this case.
Second, you are checking whether each of those words in the given sentence is equal to the entire tuple of potential bad words. You want to check whether each word is in that tuple (a set would be better).
Third, you are mixing up indices and elements when you do len(x) - that assumes that x is the word itself, but it is actually the index, as you use elsewhere.
Fourth, you are joining the sequence within the loop, and on the empty string. You should join it on a space, and only after you've checked each element.
def censor(sequence):
    curse = {'badword1', 'badword2', 'badword3', 'badword4', 'badword5', 'badword6'}
    nsequence = sequence.split()
    bsequence = []
    for x in nsequence:
        if x not in curse:
            bsequence.append(x)
        else:
            bsequence.append('*' * (len(x)))
    return ' '.join(bsequence)

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i]!='Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify, my current code is below. It's used within a class/method-like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools()
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
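The difference between the two tests can be seen in a short sketch (written for Python 3; the variable names are chosen for illustration). If a regex must be used, re.fullmatch with an escaped key behaves like the equality test.

```python
import re

key = 'xyz'
names = ['xyz', 'xyz2', 'other_randoms']

# "in" on two strings is a substring test, so 'xyz2' also qualifies
substring_hits = [n for n in names if key in n]

# "==" only accepts the exact name
exact_hits = [n for n in names if key == n]

# An anchored regex gives the same result as "=="
regex_hits = [n for n in names if re.fullmatch(re.escape(key), n)]

print(substring_hits)  # ['xyz', 'xyz2']
print(exact_hits)      # ['xyz']
print(regex_hits)      # ['xyz']
```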
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn lists into strings. One option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
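Since the goal in the question is to obtain the numerical value that goes with each matched name, a dictionary keyed on the stripped name may be more convenient than a set. This is only a sketch in Python 3: the function name match_initial_values and the sample values are invented, and it assumes specie_name_and_initial_values is a list of [name, value] pairs as described in the question.

```python
def match_initial_values(all_keys, specie_name_and_initial_values):
    """Map each data column name to its initial value, using exact name matches."""
    # Strip the surrounding brackets once and index the values by clean name
    initial = {name.strip('[]'): value
               for name, value in specie_name_and_initial_values}
    return {key: initial[key] for key in all_keys if key in initial}

# Hypothetical sample data, for illustration only
pairs = [['[Cyp26_G_R1]', 2.0], ['[Cyp26_G_R1R2]', 5.0], ['[Cyp26_G_rep1]', 0.5]]
keys = ['Cyp26_G_R1', 'Cyp26_G_rep1', 'Time']
print(match_initial_values(keys, pairs))
```

Because the lookup is by exact key, 'Cyp26_G_R1' does not pick up the value of 'Cyp26_G_R1R2'.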

multiple occurrence of the same thing in python list

I'm trying to implement a postscript interpreter in python. For this part of the program, I'm trying to access multiple occurrences of the same element in a list, but the function call does not do that. I can explain it better with the code.
This loop steps through a list of tokens
for token in tokens:
    process_token(token)
tokens is defined as:
line = "/three 3 def /four 4 def"
tokens = line.strip().split(" ")
So after this is done tokens looks like ['/three', '3', 'def', '/four', '4', 'def'].
process_token keeps pushing things onto a stack until it reaches an operation to perform, in this case def. Once it gets to def it executes:
if (t == "def"):
    handle_def (tokens.index(t)-2, tokens.index(t)-1)
    stack.pop()
and here is handle_def():
def handle_def (t, t1):
    name = tokens[t]
    defin = tokens [t1]
    x=name[1:]
    dict1 [x]= float(defin)
The problem is when it is done adding {'three':3} to the dictionary, it then should keep reading and add {'four':4} to the dictionary. But when handle_def (tokens.index(t)-2, tokens.index(t)-1) is called it will pass in the index numbers for the first occurrence of def, meaning it just puts {'three':3} into the dictionary again. I want it to skip past the first one and go the later occurrences of the word def. How do I make it do that?
Sorry for the long post, but I felt it needed the explanation.
list.index will give only the first occurrence in the list. You can use the enumerate function to get the index of the current item being processed, like this
for index, token in enumerate(tokens):
    process_token(index, token)
...
...
def process_token(index, t):
    ...
    if t == "def":
        handle_def (index - 2, index - 1)
    ...
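Put together, the enumerate approach can be sketched as a small self-contained function. This is a simplified illustration rather than the asker's interpreter: the function name run_defs is made up, the stack handling is left out, and only the def operation is processed.

```python
def run_defs(line):
    """Collect every '/name value def' triple from a line of tokens."""
    tokens = line.strip().split(" ")
    definitions = {}
    for index, token in enumerate(tokens):
        if token == "def":
            # index is the position of *this* def, not of the first one
            name = tokens[index - 2]
            value = tokens[index - 1]
            definitions[name[1:]] = float(value)  # drop the leading '/'
    return definitions

print(run_defs("/three 3 def /four 4 def"))
```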
