For a Django application, I need to turn all occurrences of a pattern in a string into a link if I have the resource related to the match in my database.
Right now, here's the process:
- I use re.sub to process a very long string of text
- When re.sub finds a pattern match, it runs a function that looks up whether that pattern matches an entry in the database
- If there is a match, it wraps a link around the matched text.
The problem is that there are sometimes hundreds of hits on the database. What I'd like to be able to do is a single bulk query to the database.
So: can you do a bulk find and replace using regular expressions in Python?
For reference, here's the code (for the curious, the patterns I'm looking up are for legal citations):
def add_linked_citations(text):
    linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})', create_citation_link, text)
    return linked_text
def create_citation_link(match_object):
    volume = None
    reporter = None
    page = None
    if match_object.group("volume") not in [None, '']:
        volume = match_object.group("volume")
    if match_object.group("reporter") not in [None, '']:
        reporter = match_object.group("reporter")
    if match_object.group("page") not in [None, '']:
        page = match_object.group("page")

    if volume and reporter and page:  # These should all be here...
        # !!! Here's where I keep hitting the database
        citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
        if citations.exists():
            citation = citations[0]
            document = citation.document
            url = document.url()
            return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
        else:
            return '%s %s %s' % (volume, reporter, page)
    return match_object.group(0)  # fall back to the original text if a group is missing
Sorry if this is obvious and wrong (that no one has suggested it in 4 hours is worrying!), but why not search for all matches first, do a batch query for everything (easy once you have all the matches), and then call sub with a dictionary of results, so the replacement function pulls the data from the dict?
You have to run the regexp twice, but it seems like the database access is the expensive part anyway.
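A minimal, untested sketch of that idea, reusing the regex and Citation model from the question (the `<a href="...">` markup is an assumption about the link format you want):
import re

CITATION_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})'
)

def add_linked_citations(text):
    # Pass 1: collect every (volume, reporter, page) triplet in the text.
    triplets = {
        (m.group('volume'), m.group('reporter'), m.group('page'))
        for m in CITATION_RE.finditer(text)
    }

    # One bulk query. The three __in filters fetch a superset of the exact
    # triplets, but keying the dict on the full triplet filters that out.
    url_by_triplet = {}
    qs = Citation.objects.filter(
        volume__in={t[0] for t in triplets},
        reporter__in={t[1] for t in triplets},
        page__in={t[2] for t in triplets},
    ).select_related('document')
    for c in qs:
        url_by_triplet[(c.volume, c.reporter, c.page)] = c.document.url()

    # Pass 2: re.sub pulls URLs from the dict; no further database hits.
    def repl(m):
        url = url_by_triplet.get((m.group('volume'), m.group('reporter'), m.group('page')))
        if url:
            return '<a href="%s">%s</a>' % (url, m.group(0))
        return m.group(0)

    return CITATION_RE.sub(repl, text)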
You can do it with a single regexp pass by using finditer, which returns match objects.
Each match object has:
a method returning a dict of the named groups, groupdict()
the start and the end positions of the match in the original text, span()
the original matching text, group()
So I would suggest that you:
Make a list of all the matches in your text using finditer
Make a list of all the unique volume, reporter, page triplets in the matches
Look up those triplets
Correlate each match object with the result of the triplet lookup if found
Process the original text, splitting by the match spans and interpolating lookup results.
I've implemented the database lookup by combining a list of Q objects, e.g. Q(volume=foo1, reporter=bar1, page=baz1) | Q(volume=foo2, reporter=bar2, page=baz2) | .... There may be more efficient approaches.
Here's an untested implementation:
import re
from functools import reduce  # needed on Python 3
from django.db.models import Q
from collections import namedtuple

Triplet = namedtuple('Triplet', ['volume', 'reporter', 'page'])

def lookup_references(matches):
    match_to_triplet = {}
    triplet_to_url = {}
    for m in matches:
        group_dict = m.groupdict()
        if any(not x for x in group_dict.values()):  # Filter out matches we don't want to look up
            continue
        match_to_triplet[m] = Triplet(**group_dict)
    # Build query
    unique_triplets = set(match_to_triplet.values())
    # List of Q objects
    q_list = [Q(**trip._asdict()) for trip in unique_triplets]
    if not q_list:
        return {}
    # Consolidated Q
    single_q = reduce(Q.__or__, q_list)
    # Assumes a 'url' value can be pulled per row; adapt to citation.document.url() if needed
    for row in Citation.objects.filter(single_q).values('volume', 'reporter', 'page', 'url'):
        url = row.pop('url')
        triplet_to_url[Triplet(**row)] = url
    # Now pair the original match objects with the URL where one was found
    lookups = {}
    for match, triplet in match_to_triplet.items():
        if triplet in triplet_to_url:
            lookups[match] = triplet_to_url[triplet]
    return lookups
def interpolate_citation_matches(text, matches, lookups):
    result = []
    prev = 0
    for m in matches:
        m_start, m_end = m.span()
        if prev != m_start:
            result.append(text[prev:m_start])
        # Wrap the match in a link if we found a URL for it
        if m in lookups:
            result.append('<a href="%s">%s</a>' % (lookups[m], m.group()))
        else:
            result.append(m.group())
        prev = m_end
    # Append whatever follows the last match (or the whole text if there were no matches)
    if prev != len(text):
        result.append(text[prev:])
    return ''.join(result)
def process_citations(text):
    citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})'
    matches = list(re.finditer(citation_regex, text))
    lookups = lookup_references(matches)
    new_text = interpolate_citation_matches(text, matches, lookups)
    return new_text
So I have the following strings:
"xxxxxxx#FUS#xxxxxxxx#ACS#xxxxx"
"xxxxx#3#xxxxxx#FUS#xxxxx"
And I want to generate the following strings from this pattern (I'll use the second example):
Assume that #FUS# represents 2.
"xxxxx0xxxxxx0xxxxx"
"xxxxx0xxxxxx1xxxxx"
"xxxxx0xxxxxx2xxxxx"
"xxxxx1xxxxxx0xxxxx"
"xxxxx1xxxxxx1xxxxx"
"xxxxx1xxxxxx2xxxxx"
"xxxxx2xxxxxx0xxxxx"
"xxxxx2xxxxxx1xxxxx"
"xxxxx2xxxxxx2xxxxx"
"xxxxx3xxxxxx0xxxxx"
"xxxxx3xxxxxx1xxxxx"
"xxxxx3xxxxxx2xxxxx"
Basically, if I'm given a string like the ones above, I want to generate multiple strings by replacing the wildcards, which can be #FUS#, #WHATEVER#, or a number like #20#, producing every combination over the ranges those wildcards represent.
I've managed to get a regex to find the wildcards.
wildcardRegex = f"(#FUS#|#WHATEVER#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
This correctly finds the target wildcards.
For a single wildcard it's easy: one re.sub() call does it.
For more it gets complicated. Or maybe it was a long day...
But I think my algorithm logic is failing hard, because I can't manage to write code that actually generates the signals. I think I need some kind of recursive function that is called for each wildcard present (up to maybe 4 can be present, e.g. xxxxx#2#xxx#2#xx#FUS#xx#2#x).
I need a list of resulting signals.
Is there any easy way to do this that I'm completely missing?
Thanks.
import re

stringV1 = "xxx#FUS#xxxxi#3#xxx#5#xx"
stringV2 = "XXXXXXXXXX#FUS#XXXXXXXXXX#3#xxxxxx#5#xxxx"
regex = "(#FUS#|#DSP#|#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)"
WILDCARD_FUS = "#FUS#"
RANGE_FUS = 3

def getSignalsFromWildcards(app, can):
    sigList = list()
    if WILDCARD_FUS in app:
        for i in range(RANGE_FUS):
            outAppSig = app.replace(WILDCARD_FUS, str(i), 1)
            outCanSig = can.replace(WILDCARD_FUS, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    elif len(re.findall("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app)) > 0:  # check app, not the global stringV1
        wildcard = re.search("(#([0-9]|[1-9][0-9]|[1-9][0-9][0-9])#)", app).group()
        tarRange = int(wildcard.strip("#"))
        for i in range(tarRange):
            outAppSig = app.replace(wildcard, str(i), 1)
            outCanSig = can.replace(wildcard, str(i), 1)
            if "#" in outAppSig:
                newSigList = getSignalsFromWildcards(outAppSig, outCanSig)
                sigList += newSigList
            else:
                sigList.append((outAppSig, outCanSig))
    return sigList

if "#" in stringV1:
    resultList = getSignalsFromWildcards(stringV1, stringV2)
    for item in resultList:
        print(item)
results in
('xxx0xxxxi0xxxxx', 'XXXXXXXXXX0XXXXXXXXXX0xxxxxxxxxx')
('xxx0xxxxi1xxxxx', 'XXXXXXXXXX0XXXXXXXXXX1xxxxxxxxxx')
('xxx0xxxxi2xxxxx', 'XXXXXXXXXX0XXXXXXXXXX2xxxxxxxxxx')
('xxx1xxxxi0xxxxx', 'XXXXXXXXXX1XXXXXXXXXX0xxxxxxxxxx')
('xxx1xxxxi1xxxxx', 'XXXXXXXXXX1XXXXXXXXXX1xxxxxxxxxx')
('xxx1xxxxi2xxxxx', 'XXXXXXXXXX1XXXXXXXXXX2xxxxxxxxxx')
('xxx2xxxxi0xxxxx', 'XXXXXXXXXX2XXXXXXXXXX0xxxxxxxxxx')
('xxx2xxxxi1xxxxx', 'XXXXXXXXXX2XXXXXXXXXX1xxxxxxxxxx')
('xxx2xxxxi2xxxxx', 'XXXXXXXXXX2XXXXXXXXXX2xxxxxxxxxx')
Long day after all...
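For what it's worth, a flatter alternative is to collect the wildcards with finditer and let itertools.product enumerate the combinations. A sketch (the named ranges are assumptions, and it treats #n# as range(n) like the code above; switch to range(n + 1) if the bound is meant to be inclusive):
import re
from itertools import product

WILDCARD_RE = re.compile(r"#(FUS|WHATEVER|[0-9]{1,3})#")
NAMED_RANGES = {"FUS": 3, "WHATEVER": 4}  # assumed ranges for the named wildcards

def expand_wildcards(template):
    matches = list(WILDCARD_RE.finditer(template))
    ranges = [
        range(int(m.group(1))) if m.group(1).isdigit() else range(NAMED_RANGES[m.group(1)])
        for m in matches
    ]
    results = []
    for combo in product(*ranges):            # every combination of wildcard values
        out, prev = [], 0
        for m, value in zip(matches, combo):  # splice each value in place of its wildcard
            out.append(template[prev:m.start()] + str(value))
            prev = m.end()
        out.append(template[prev:])
        results.append("".join(out))
    return results

print(expand_wildcards("xxxxx#3#xxxxxx#FUS#xxxxx"))  # 3 * 3 = 9 strings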
In the previous post, I did not clarify the question properly, so I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (ranging from 3 characters, e.g. "FFK", to 152 characters long);
some long protein sequences, a.k.a. my reference.
I am going to match these patterns against my reference and find the locations where each match occurs. (My friend helped write a script for that.)
import sys
import re
from itertools import chain, izip

# Read input
with open(sys.argv[1], 'r') as f:
    sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
    patterns = g.read().splitlines()

# Write output
with open(sys.argv[3], 'w') as outputFile:
    data_iter = iter(sequences)
    order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
    header = '\t'.join([k for k in order])
    outputFile.write(header + '\n')
    for seq_name, seq in izip(data_iter, data_iter):
        locations = [[{'antibody name': seq_name,
                       'epitope sequence': pattern,
                       'start': match.start() + 1,
                       'end': match.end(),
                       'length': len(pattern)}
                      for match in re.finditer(pattern, seq)]
                     for pattern in patterns]
        for loc in chain.from_iterable(locations):
            output = '\t'.join([str(loc[k]) for k in order])
            outputFile.write(output + '\n')

f.close()
g.close()
outputFile.close()
The problem is that within these 59,000 patterns, after sorting, I found that parts of some patterns overlap with parts of other patterns, and I would like to consolidate those into one big "consensus" pattern and keep just the consensus (see the examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS : I am aligning them here so it's easier to visualize. The 59,000 patterns initially are not sorted so it's hard to see the consensus in the actual file.
In my particular problem, I am not just picking the longest pattern; instead, I need to take every pattern into account to find the consensus. I hope I have explained my specific problem clearly enough.
Thanks!
Here's my solution, with randomized input order to improve confidence in the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
    copy_data_li = data_li[:]
    while len(copy_data_li) > 0:
        remove_li = []
        len_remove_li = len(remove_li)
        longest_str = max(copy_data_li, key=len)
        copy_data_li.remove(longest_str)
        remove_li.append(longest_str)
        while len_remove_li != len(remove_li):
            len_remove_li = len(remove_li)
            for value in copy_data_li:
                value_pattern = "".join([x + "?" for x in value])
                longest_match = max(re.findall(value_pattern, longest_str), key=len)
                if longest_match in value:
                    longest_str_index = longest_str.index(longest_match)
                    value_index = value.index(longest_match)
                    if value_index > longest_str_index and longest_str_index > 0:
                        longest_str = value[:value_index] + longest_str
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
                        longest_str += value[len(longest_str) - longest_str_index:]
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value in longest_str:
                        copy_data_li.remove(value)
                        remove_li.append(value)
        print(longest_str)
        print(remove_li)
random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
    #patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
    patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
    test = find_core(patterns)
    test = find_pre_and_post(test, patterns)
    #final = "YLQMNSLRAED"
    final = "KPGQAPRLLIYGASSRATGIPD"
    if test == final:
        print("worked:" + test)
    else:
        print("fail:" + test)

def find_pre_and_post(core, patterns):
    pre = ""
    post = ""
    for pattern in patterns:
        start_index = pattern.find(core)
        if len(pattern[0:start_index]) > len(pre):
            pre = pattern[0:start_index]
        if len(pattern[start_index + len(core):len(pattern)]) > len(post):
            post = pattern[start_index + len(core):len(pattern)]
    return pre + core + post

def find_core(patterns):
    test = ""
    for i in range(len(patterns)):
        for j in range(2, len(patterns[i])):
            patterncount = 0
            for pattern in patterns:
                if patterns[i][0:j] in pattern:
                    patterncount += 1
            if patterncount == len(patterns):
                test = patterns[i][0:j]
    return test

main()
So what I do first is find the main core in the find_core function, starting with a substring of length two of the first string, since one character is not sufficient information. I then check whether that substring is in ALL the strings, as that is the definition of a "core".
I then find the index of the core in each string to get the pre and post substrings around it, keeping track of their lengths and updating them whenever a longer one is found. I didn't have time to explore edge cases, so here is my first shot.
I am trying to parse a log file that contains multiple entries with the following format:
ITEM_BEGIN item_name
some_text
some_text may optionally contain an expression matched by my_expr anywhere within itself. I am only interested in item_name and my_expr (or None if it is missing). Ideally, what I want is a list of (item_name, my_expr) pairs. What is the best way to extract this information using pyparsing?
If you are not trying to define a parser for the entire input text, but only some pieces of it, look into using pyparsing's searchString or scanString methods - something along these lines:
import pyparsing as pp

ident = pp.Word(pp.alphas, pp.alphanums + '_')
item_header = pp.Keyword("ITEM_BEGIN") + ident("name")
other_expr = ... whatever ...
search_expr = item_header | other_expr

found = {}
current_name = ''
for result in search_expr.searchString(input_text):
    result = result[0]
    if result[0] == "ITEM_BEGIN":
        print("found an item header with name {name}".format_map(result))
        current_name = result.name
        found[result.name] = []
    else:
        # found an other expr
        found[current_name].append(result.asList())
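To get the (item_name, my_expr) pairs the question asks for, you could then flatten that dict, pairing items that had no expression with None (a small sketch built on the loop above):
pairs = []
for name, exprs in found.items():
    if exprs:
        pairs.extend((name, expr) for expr in exprs)
    else:
        pairs.append((name, None))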
So let's assume we have a simple query like this:
Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;
The result should look like this:
tb1 col1
tb1 col7
tb2 col2
tb2 col8
I've tried to solve this problem using some Python libraries:
1) Even extracting only the tables using sqlparse can be a huge problem. For example, this official book doesn't work properly at all.
2) Using regular expressions seems to be really hard to achieve.
3) But then I found this, which might help. However, the problem is that I can't connect to any database and execute that query.
Any ideas?
sql-metadata is a Python library that uses a tokenized query returned by python-sqlparse and generates query metadata.
This metadata can return column and table names from your supplied SQL query. Here are a couple of examples from the sql-metadata GitHub readme:
>>> sql_metadata.get_query_columns("SELECT test, id FROM foo, bar")
[u'test', u'id']
>>> sql_metadata.get_query_tables("SELECT test, id FROM foo, bar")
[u'foo', u'bar']
>>> sql_metadata.get_query_limit_and_offset('SELECT foo_limit FROM bar_offset LIMIT 50 OFFSET 1000')
(50, 1000)
A hosted version of the library exists at sql-app.infocruncher.com, so you can try it and see if it works for you.
Really, this is no easy task. You could use a lexer (ply in this example) and define several rules to get several tokens out of a string. The following code defines these rules for the different parts of your SQL string and puts them back together, since there could be aliases in the input string. As a result, you get a dictionary (result) with the different table names as keys.
import ply.lex as lex, re

tokens = (
    "TABLE",
    "JOIN",
    "COLUMN",
    "TRASH"
)

tables = {"tables": {}, "alias": {}}
columns = []

t_TRASH = r"Select|on|=|;|\s+|,|\t|\r"

def t_TABLE(t):
    r"from\s(\w+)\sas\s(\w+)"
    regex = re.compile(t_TABLE.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t

def t_JOIN(t):
    r"inner\s+join\s+(\w+)\s+as\s+(\w+)"
    regex = re.compile(t_JOIN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        tbl = m.group(1)
        alias = m.group(2)
        tables["tables"][tbl] = ""
        tables["alias"][alias] = tbl
    return t

def t_COLUMN(t):
    r"(\w+\.\w+)"
    regex = re.compile(t_COLUMN.__doc__)
    m = regex.search(t.value)
    if m is not None:
        t.value = m.group(1)
        columns.append(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))
    t.lexer.skip(len(t.value))

# here is where the magic starts
def mylex(inp):
    lexer = lex.lex()
    lexer.input(inp)
    for token in lexer:
        pass

    result = {}
    for col in columns:
        tbl, c = col.split('.')
        if tbl in tables["alias"].keys():
            key = tables["alias"][tbl]
        else:
            key = tbl
        if key in result:
            result[key].append(c)
        else:
            result[key] = list()
            result[key].append(c)
    print(result)
    # {'tb1': ['col1', 'col7'], 'tb2': ['col2', 'col8']}

string = "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"
mylex(string)
moz-sql-parser is a Python library that converts some subset of SQL-92 queries into JSON-izable parse trees. Maybe it's what you want.
Here is an example.
>>> parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")
{'select': [{'value': 'id'}, {'value': 'name'}], 'from': 'dual', 'where': {'and': [{'gt': ['id', 3]}, {'lt': ['id', 10]}]}, 'orderby': {'value': 'name'}}
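Pulling table and column names out of the returned dict is then plain Python. A rough sketch that only handles the simple shapes shown above (real parse trees vary a lot between queries):
def tables_and_columns(tree):
    # 'select' may be a single dict or a list of dicts like {'value': 'id'}
    sel = tree.get('select', [])
    if isinstance(sel, dict):
        sel = [sel]
    columns = [item['value'] for item in sel if isinstance(item, dict)]
    # 'from' may be a single table name or a list of them
    frm = tree.get('from', [])
    tables = frm if isinstance(frm, list) else [frm]
    return tables, columns

print(tables_and_columns(parse("SELECT id,name FROM dual WHERE id>3 and id<10 ORDER BY name")))
# (['dual'], ['id', 'name'])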
I am tackling a similar problem and found a simpler solution and it seems to work well.
import re
def tables_in_query(sql_str):
    # remove the /* */ comments
    q = re.sub(r"/\*[^*]*\*+(?:[^*/][^*]*\*+)*/", "", sql_str)

    # remove whole-line -- and # comments
    lines = [line for line in q.splitlines() if not re.match(r"^\s*(--|#)", line)]

    # remove trailing -- and # comments
    q = " ".join([re.split("--|#", line)[0] for line in lines])

    # split on blanks, parens and semicolons
    tokens = re.split(r"[\s)(;]+", q)

    # scan the tokens. if we see a FROM or JOIN, we set the get_next
    # flag, and grab the next one (unless it's SELECT).
    tables = set()
    get_next = False
    for tok in tokens:
        if get_next:
            if tok.lower() not in ["", "select"]:
                tables.add(tok)
            get_next = False
        get_next = tok.lower() in ["from", "join"]

    dictTables = dict()
    for table in tables:
        fields = []
        for token in tokens:
            if token.startswith(table):
                if token != table:
                    fields.append(token)
        if len(list(set(fields))) >= 1:
            dictTables[table] = list(set(fields))
    return dictTables
code adapted from https://grisha.org/blog/2016/11/14/table-names-from-sql/
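If I'm reading the token scan right, running it on the query from the question gives roughly the following; note that a.col1 and b.col2 are not attributed, because fields are matched on the table name rather than the alias:
print(tables_in_query(
    "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;"))
# {'tb1': ['tb1.col7'], 'tb2': ['tb2.col8']}  (dict order may vary)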
Create a list of all the tables that are present in the DB. You can then search for each table name in the query.
This obviously isn't foolproof, and the code will break if any column/alias name matches a table name.
But it can be done as a workaround.
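A minimal sketch of that workaround; known_tables is hypothetical here, in practice you would pull it from your database catalog (e.g. information_schema.tables):
import re

known_tables = ["tb1", "tb2", "users"]  # hypothetical: fetched from the DB catalog

def tables_used(sql, known_tables):
    sql_lower = sql.lower()
    # word boundaries so 'tb1' does not also match 'tb10'
    return [t for t in known_tables
            if re.search(r"\b%s\b" % re.escape(t.lower()), sql_lower)]

print(tables_used(
    "Select a.col1, b.col2 from tb1 as a inner join tb2 as b on tb1.col7 = tb2.col8;",
    known_tables))  # ['tb1', 'tb2']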
import pandas as pd
#%config PPMagics.autolimit=0

txt = """<your SQL text here>"""
txt_1 = txt
replace_list = ['\n', '(', ')', '*', '=', '-', ';', '/', '.']
count = 0
for i in replace_list:
    txt_1 = txt_1.replace(i, ' ')
txt_1 = txt_1.split()

res = []
for i in range(1, len(txt_1)):
    if txt_1[i-1].lower() in ['from', 'join', 'table'] and txt_1[i].lower() != 'select':
        count += 1
        str_count = str(count)
        res.append(txt_1[i] + "." + txt_1[i+1])
#df.head()

res_l = res
f_res_l = []
for i in range(0, len(res_l)):
    if len(res_l[i]) > 15:  # change it to 0 if you want all the caught strings
        f_res_l.append(res_l[i])
    else:
        pass

All_Table_List = f_res_l
print("All the unique tables from the SQL text, in the order of their appearance in the code : \n", 100*'*')
df = pd.DataFrame(All_Table_List, columns=['Tables_Names'])
df.reset_index(level=0, inplace=True)
list_ = list(df["Tables_Names"].unique())
df_1_Final = pd.DataFrame(list_, columns=['Tables_Names'])
df_1_Final.reset_index(level=0, inplace=True)
df_1_Final
Unfortunately, in order to do this successfully for "complex SQL" queries, you will more or less have to implement a complete parser for the particular database engine you are using.
As an example, consider this very basic complex query:
WITH a AS (
    SELECT col1 AS c FROM b
)
SELECT c FROM a
In this case, a is not a table but a common table expression (CTE), and should be excluded from your output. There's no simple way of using regexps to realize that b is a table access but a is not - your code will really have to understand the SQL at a deeper level.
Also consider
SELECT * FROM tbl
You'd have to know the column names actually present in a particular instance of a database (and accessible to a particular user, too) to answer that correctly.
If by "works with complex SQL" you mean that it must work with any valid SQL statement, you also need to specify for which SQL dialect - or implement dialect-specific solutions. A solution which works with any SQL handled by a database that does not implement CTE:s would not work in one that does.
I am sorry to say so, but I do not think you will find a complete solution which works for arbitrarily complex SQL queries. You'll have to settle for a solution which works with a subset of a particular SQL-dialect.
For my simple use case (one table in the query, no joins), I used the following tweak:
lst = "select * from table".split(" ")
lst = [item for item in lst if len(item)>0]
table_name = lst[lst.index("from")+1]
I have a list with a large collection of substrings, and I have one input string. If any item from the collection is found in the input string, it should be replaced by the given replacement.
I tried the following, but it's returning the wrong result:
#!/bin/python
arr = ['www.', 'http://', '.com', 'many many many....']

def str_replace(arr, replaceby, original):
    temp = ''
    for n, i in enumerate(arr):
        temp = original.replace(i, replaceby)
    return temp

main = 'www.google.com'
main1 = 'www.a.b.c.company.google.co.uk.com'
print str_replace(arr, '', main)
Output:
www.google
Expected:
google
You are deriving temp from the original every time, so only the last element of arr will be replaced in the temp that is returned. Try this instead:
def str_replace(arr, replaceby, original):
    temp = original
    for n, i in enumerate(arr):
        temp = temp.replace(i, replaceby)
    return temp
You don't even need temp (assuming the above code is the whole function):
def str_replace(search, replace, subject):
    for s in search:
        subject = subject.replace(s, replace)
    return subject
Another (probably more efficient) option is to use regular expressions:
import re

def str_replace(search, replace, subject):
    search = '|'.join(map(re.escape, search))
    return re.sub(search, replace, subject)
Do note that these functions may produce different results if replace contains substrings from search.
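A contrived illustration of that caveat: with search = ['a', 'xx'] and replace = 'x', the loop version keeps replacing text it introduced itself, while the single regex pass only scans the original subject:
import re

search, replace, subject = ['a', 'xx'], 'x', 'aa'

out = subject
for s in search:
    out = out.replace(s, replace)
print(out)   # 'x'   ('aa' -> 'xx' -> 'x')

print(re.sub('|'.join(map(re.escape, search)), replace, subject))   # 'xx'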
temp = original.replace(i, replaceby)
It should be
temp = temp.replace(i, replaceby)
You're throwing away the previous substitutions.
Simple way :)
arr=['www.', 'http://', '.com', 'many many many....']
main ='http://www.google.com'
for item in arr:
    main = main.replace(item, '')
print main