I am using Python 2.6.4.
I have a series of select statements in a text file and I need to extract the field names from each select query. This would be easy if some of the fields didn't use nested functions like to_char() etc.
Given select statement fields that could have several nested parentheses, like "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name," or the simple case of just "base_field_name" as a field, is it possible to use Python's re module to write a regex to extract base_field_name? If so, what would the regex look like?
Regular expressions are not suitable for parsing "nested" structures. Try, instead, a full-fledged parsing kit such as pyparsing -- examples of using pyparsing specifically to parse SQL can be found here and here, for example (you'll no doubt need to take the examples just as a starting point, and write some parsing code of your own, but, it's definitely not too difficult).
>>> import re
>>> string = 'ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'
>>> rx = re.compile('^(.*?\()*(.+?)(,.*?)*(,|\).*?)*$')
>>> rx.search(string).group(2)
'base_field_name'
>>> rx.search('base_field_name').group(2)
'base_field_name'
Use either a table-driven parser, as Alex Martelli suggests, or a hand-written recursive descent parser. They're not hard to write and are quite rewarding.
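For what it's worth, here is a very small hand-written sketch of the recursive-descent idea, assuming the only goal is to pull out the innermost first argument as in the question's examples (the helper names are made up):
import re

TOKEN = re.compile(r'\s*([A-Za-z_]\w*|[(),])')

def tokenize(text):
    # Split the expression into identifiers, parentheses and commas,
    # quietly stopping at anything else.
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            break
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_field(tokens, i=0):
    # Parse name [ '(' first_argument ... ')' ] and return (base_name, next_index).
    name = tokens[i]
    i += 1
    if i < len(tokens) and tokens[i] == '(':
        # A function call: the base field is inside the first argument.
        name, i = parse_field(tokens, i + 1)
        depth = 1
        while i < len(tokens) and depth:      # skip the remaining arguments
            if tokens[i] == '(':
                depth += 1
            elif tokens[i] == ')':
                depth -= 1
            i += 1
    return name, i

print parse_field(tokenize('ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name'))[0]
print parse_field(tokenize('base_field_name'))[0]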
This may be good enough:
import re
print re.match(r".*\(([^\)]+)\)", "ltrim(to_char(field_name, format)))").group(1)
You would need to do further processing. For example, pick up the function name as well and pull the field name according to the function signature.
.*(\w+)\(([^\)]+)\)
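For instance, a rough sketch applying that idea (with a \b word boundary added and a [^()] class, so the whole innermost function name lands in group 1):
import re

s = "ltrim(rtrim(to_char(base_field_name, format))) renamed_field_name"
m = re.search(r"\b(\w+)\(([^()]+)\)", s)
print m.group(1)                   # innermost function: to_char
print m.group(2).split(',')[0]     # base field: base_field_name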
Here's a really hacky parser that does what you want.
It works by calling 'eval' on the text to be parsed, mapping all identifiers to a function which returns its first argument (which I'm guessing is what you want given your example).
class FakeFunction(object):
    def __init__(self, name):
        self.name = name
    def __call__(self, *args):
        return args[0]
    def __str__(self):
        return self.name

class FakeGlobals(dict):
    def __getitem__(self, x):
        return FakeFunction(x)

def ExtractBaseFieldName(x):
    return eval(x, FakeGlobals())

print ExtractBaseFieldName('ltrim(rtrim(to_char(base_field_name, format)))')
Do you really need regular expressions? To get the one you've got up there I'd use
s[s.rfind('(')+1:s.find(')')].split(',')[0]
with 's' containing the original string.
Of course, it's not a general solution, but...
I have written a Python-Markdown extension based on InlineProcessor which correctly matches when the pattern appears:
Custom extension:
from markdown.util import AtomicString, etree
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor
RE = r'(#)(\S{3,})'
class MyPattern(InlineProcessor):
    def handleMatch(self, m, data):
        tag = m.group(2)
        el = etree.Element("a")
        el.set('href', f'/{tag}')
        el.text = AtomicString(f'#{tag}')
        return el, m.start(0), m.end(0)

class MyExtension(Extension):
    def extendMarkdown(self, md, md_globals):
        # If processed by attr_list extension, not by this one
        md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200)

def makeExtension(*args, **kwargs):
    return MyExtension(*args, **kwargs)
IN: markdown('foo #bar')
OUT: <p>foo <a href="/bar">#bar</a></p>
But my extension breaks a native feature called attr_list, part of the extra extensions of Python-Markdown.
IN: ### Title {style="color:#FF0000;"}
OUT: <h3>Title {style="color:#FF0000;"}</h3>
I'm not sure I correctly understand how Python-Markdown registers / applies patterns on the text. I tried to register my pattern with a high number to put it at the end of the process, md.inlinePatterns.register(MyPattern(RE, md), 'my_tag', 200), but it doesn't do the job.
I have looked at the source code of the attr_list extension and it uses a Treeprocessor-based class. Do I need a Treeprocessor-based class instead of an InlineProcessor for MyPattern? Is there a way to avoid applying my tag to elements that have already been matched by another pattern (here: attr_list)?
You need a stricter regular expression which won't result in false matches. Or perhaps you need to alter the syntax you use so that it doesn't clash with other legitimate text.
First of all, the order of events is correct. Using your example input:
### Title {style="color:#FF0000;"}
When the InlineProcessor gets it, so far it has been processed to this:
<h3>Title {style="color:#FF0000;"}</h3>
Notice that the block level tags are now present (<h3>), but the attr_list has not been processed. And that is your problem. Your regular expression is matching #FF0000;"} and converting that to a link.
Finally, after all InlineProcessors are done, the attr_list Treeprocessor is run, but with the link in the middle, it doesn't recognize the text as a valid attr_list and ignores it (as it should).
In other words, your problem has nothing to do with order at all. You can't run an inline processor after the attr_list TreeProcessor, so you need to explore other alternatives. You have at least two options:
Rewrite your regular expression to not have false matches. You might want to try using word boundaries or something.
Reconsider your proposed new syntax. #bar is a pretty indistinct syntax which is likely to reoccur elsewhere in the text and result in false matches. Perhaps you could require it to be wrapped in brackets or use some character other than a hash.
Personally, I would strongly suggest the second option. Reading some text with #bar in it, it would not be obvious to me that it is a link. However, [#bar] (or similar) would be much more clear; a sketch of that stricter pattern follows.
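For example, a minimal sketch of that second option; the bracketed syntax and the exact pattern here are only an illustration, not the one true fix:
from markdown.util import AtomicString, etree
from markdown.inlinepatterns import InlineProcessor

# Hypothetical stricter syntax: the tag must be written as [#bar].
RE = r'\[#([^\]\s]{3,})\]'

class MyPattern(InlineProcessor):
    def handleMatch(self, m, data):
        tag = m.group(1)                     # group(1) now, not group(2)
        el = etree.Element("a")
        el.set('href', f'/{tag}')
        el.text = AtomicString(f'#{tag}')
        return el, m.start(0), m.end(0)
With a pattern like this, foo [#bar] still becomes a link, while {style="color:#FF0000;"} no longer matches and is left for the attr_list Treeprocessor to handle.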
I want to use an f-string with my string variable, not with a string defined with a string literal, "...".
Here is my code:
name=["deep","mahesh","nirbhay"]
user_input = r"certi_{element}" # this string I ask from user
for element in name:
    print(f"{user_input}")
This code gives output:
certi_{element}
certi_{element}
certi_{element}
But I want:
certi_{deep}
certi_{mahesh}
certi_{nirbhay}
How can I do this?
f"..." strings are great when interpolating expression results into a literal, but you don't have a literal, you have a template string in a separate variable.
You can use str.format() to apply values to that template:
name=["deep","mahesh","nirbhay"]
user_input = "certi_{element}" # this string i ask from user
for value in name:
    print(user_input.format(element=value))
String formatting placeholders that use names (such as {element}) are not variables. You assign a value for each name in the keyword arguments of the str.format() call instead. In the above example, element=value passes the value of the value variable in to fill the {element} placeholder.
Unlike f-strings, the {...} placeholders are not expressions and you can't use arbitrary Python expressions in the template. This is a good thing: you wouldn't want end-users to be able to execute arbitrary Python code in your program. See the Format String Syntax documentation for details.
You can pass in any number of names; the string template doesn't have to use any of them. If you combine str.format() with the **mapping call convention, you can use any dictionary as the source of values:
template_values = {
    'name': 'Ford Prefect',
    'number': 42,
    'company': 'Sirius Cybernetics Corporation',
    'element': 'Improbability Drive',
}
print(user_input.format(**template_values))
The above would let a user use any of the names in template_values in their template, any number of times they like.
While you can use locals() and globals() to produce dictionaries mapping variable names to values, I'd not recommend that approach. Use a dedicated namespace like the above to limit what names are available, and document those names for your end-users.
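If a user's template might reference a name you did not supply, plain str.format() raises a KeyError. One possible sketch (SafeDict is just an illustrative name) uses str.format_map() with a dict subclass that leaves unknown placeholders untouched:
class SafeDict(dict):
    def __missing__(self, key):
        # Keep the placeholder as-is instead of raising KeyError.
        return '{' + key + '}'

user_input = "certi_{element} ({unknown})"
print(user_input.format_map(SafeDict(element='deep')))
# certi_deep ({unknown})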
If you define:
def fstr(template):
    return eval(f"f'{template}'")
Then you can do:
name=["deep","mahesh","nirbhay"]
user_input = r"certi_{element}" # this string i ask from user
for element in name:
    print(fstr(user_input))
Which gives as output:
certi_deep
certi_mahesh
certi_nirbhay
But be aware that users can use expressions in the template, such as:
import os # assume you have used os somewhere
user_input = r"certi_{os.environ}"
for element in name:
    print(fstr(user_input))
You definitely don't want this!
Therefore, a much safer option is to define:
def fstr(template, **kwargs):
    return eval(f"f'{template}'", kwargs)
Arbitrary code is no longer possible, but users can still use string expressions like:
user_input = r"certi_{element.upper()*2}"
for element in name:
    print(fstr(user_input, element=element))
Gives as output:
certi_DEEPDEEP
certi_MAHESHMAHESH
certi_NIRBHAYNIRBHAY
Which may be desired in some cases.
If you want the user to have access to your namespace, you can do that, but the consequences are entirely on you. Instead of using f-strings, you can use the format method to interpolate dynamically, with a very similar syntax.
If you want the user to have access to only a small number of specific variables, you can do something like
name=["deep", "mahesh", "nirbhay"]
user_input = "certi_{element}" # this string i ask from user
for element in name:
    my_str = user_input.format(element=element)
    print(f"{my_str}")
You can of course rename the key that the user inputs vs the variable name that you use:
my_str = user_input.format(element=some_other_variable)
And you can just go and let the user have access to your whole namespace (or at least most of it). Please don't do this, but be aware that you can:
my_str = user_input.format(**locals(), **globals())
The reason that I went with print(f'{my_str}') instead of print(my_str) is to avoid the situation where literal braces get treated as further, erroneous expansions. For example, user_input = 'certi_{{{element}}}'
I was looking for something similar to your problem.
I came across this other question's answer: https://stackoverflow.com/a/54780825/7381826
Using that idea, I tweaked your code:
user_input = r"certi_"
for element in name:
    print(f"{user_input}{element}")
And I got this result:
certi_deep
certi_mahesh
certi_nirbhay
If you would rather stick to the layout in the question, then this final edit did the trick:
for element in name:
    print(f"{user_input}" "{" f"{element}" "}")
Reading the security concerns in the other answers, I don't think this alternative has serious security risks, because it does not define a new function with eval().
I am no security expert so please do correct me if I am wrong.
This is what you’re looking for. Just change the last line of your original code:
name=["deep","mahesh","nirbhay"]
user_input = "certi_{element}" # this string I ask from user
for element in name:
    print(eval("f'" + f"{user_input}" + "'"))
I have data like this:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
What I want to do is: given an identifier like >Px016979, extract the data below it, like this:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
I am new to Python.
#coding:utf-8
import os,re
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
b = '>Px016979'
matchbj = re.match( r'$b(.*?)>',a,re.M|re.I)
print matchbj.group()
My code does not work. I have two questions:
I think my data has carriage returns, so my code can't work.
I don't know how to use a variable in a Python regular expression. If I write re.match(r'>Px016979(.*?)>', a, re.M|re.I) it works, but I need to use a variable.
Thanks.
It looks like your data is a FASTA file with protein sequences. So instead of using regular expressions, you should consider installing BioPython. That is a library specifically for bioinformatics use and research.
The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.
Using BioPython, you would extract a sequence from a FASTA file for a given identifier in the following way:
from Bio import SeqIO
input_file = r'C:\path\to\proteins.fasta'
record_id = 'Px016979'
record_dict = SeqIO.to_dict(SeqIO.parse(input_file, 'fasta'))
record = record_dict[record_id]
sequence = str(record.seq)
print sequence
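If the FASTA file is large, a variation on the same idea is SeqIO.index, which looks records up lazily instead of loading everything into memory (a sketch, reusing the illustrative path from above):
from Bio import SeqIO

record_dict = SeqIO.index(r'C:\path\to\proteins.fasta', 'fasta')
print str(record_dict['Px016979'].seq)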
The following should work for each of the entries you have:
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
for b in ['>Px016979', '>Px016980', '>Px002185', '>Px006321']:
    re_search = re.search(re.escape(b) + r'(.*?)(?:>|\Z)', a, re.M|re.I|re.S)
    print re_search.group()
This will display the following:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
I would also consider installing Biopython and checking out the book Python for Biologists, which is free online (http://pythonforbiologists.com/). I've worked with FASTA files a lot, and for a quick and dirty solution you can just use this (leave the rest of the code as is):
matchbj = re.findall(r'>.*?(?=\n>|\Z)', a, re.DOTALL)
for item in matchbj:
    print item
It matches across lines because of the re.DOTALL flag, and the non-greedy .*? grabs everything from each '>' header up to (but not including) the next '>' header or the end of the string.
Be advised, this will give them to you as a list of strings, not a match object. In my experience, re.match is the first thing people learn, but they are often looking for the effect of re.findall.
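If regular expressions turn out to be more trouble than they are worth here, a plain-string sketch (illustrative only) that builds an id-to-sequence dict from the same a string looks like this:
records = {}
for chunk in a.strip().split('>')[1:]:
    header, _, body = chunk.partition('\n')
    records[header.strip()] = body.replace('\n', '')

print records['Px016979']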
I like the way ElementTree parses XML, in particular the XPath feature. I have XML output from an application with nested tags.
I'd like to access these tags by name without specifying the namespace. Is that possible?
For example:
root.findall("/molpro/job")
instead of:
root.findall("{http://www.molpro.net/schema/molpro2006}molpro/{http://www.molpro.net/schema/molpro2006}job")
At least with lxml2, it's possible to reduce this overhead somewhat:
root.findall("/n:molpro/n:job",
namespaces=dict(n="http://www.molpro.net/schema/molpro2006"))
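For the record, the standard-library xml.etree.ElementTree accepts the same kind of namespaces mapping in findall, assuming a version where that argument exists (2.7+ / 3.x); the file name below is a placeholder:
import xml.etree.ElementTree as ET

ns = {'n': 'http://www.molpro.net/schema/molpro2006'}
tree = ET.parse('molpro_output.xml')
jobs = tree.findall('.//n:job', namespaces=ns)   # all job elements, any depth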
You could write your own function to wrap the nasty-looking bits, for example:
def my_xpath(doc, ns, xp):
    num = xp.count('/')
    new_xp = xp.replace('/', '/{%s}')
    ns_tup = (ns,) * num
    return doc.findall(new_xp % ns_tup)
namespace = 'http://www.molpro.net/schema/molpro2006'
my_xpath(root, namespace, '/molpro/job')
Not that much fun, I admit, but at least you will be able to read your XPath expressions.
I'm using this REST web service, which returns various templated strings as urls, for example:
"http://api.app.com/{foo}"
In Ruby, I can then use
url = Addressable::Template.new("http://api.app.com/{foo}").expand('foo' => 'bar')
to get
"http://api.app.com/bar"
Is there any way to do this in Python? I know about %() templates, but obviously they're not working here.
In Python 2.6 you can do this if you need exactly that syntax:
from string import Formatter
f = Formatter()
f.format("http://api.app.com/{foo}", foo="bar")
If you need to use an earlier Python version, then you can either copy the 2.6 Formatter class or hand-roll a parser/regex to do it.
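For example, a minimal hand-rolled sketch of that fallback (it only handles simple {name} placeholders, nothing like the full syntax the Formatter class supports):
import re

def expand(template, **values):
    # Replace each {name} with the matching keyword argument.
    return re.sub(r'\{(\w+)\}', lambda m: str(values[m.group(1)]), template)

print expand("http://api.app.com/{foo}", foo="bar")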
Don't use a quick hack.
What is used there (and implemented by Addressable) are URI Templates. There seem to be several libs for this in Python, for example: uri-templates. described_routes_py also has a parser for them.
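For instance, with the uritemplate package, assuming it is installed (the exact import and call may differ between these libraries):
from uritemplate import URITemplate

url = URITemplate("http://api.app.com/{foo}").expand(foo="bar")
print url   # http://api.app.com/bar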
I cannot give you a perfect solution but you could try using string.Template.
You either pre-process your incoming URL and then use string.Template directly, like
In [6]: url="http://api.app.com/{foo}"
In [7]: up=string.Template(re.sub("{", "${", url))
In [8]: up.substitute({"foo":"bar"})
Out[8]: 'http://api.app.com/bar'
taking advantage of the default "${...}" syntax for replacement identifiers. Or you subclass string.Template to control the identifier pattern, like
class MyTemplate(string.Template):
    delimiter = ...
    pattern = ...
but I haven't figured that out.
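For completeness, here is one possible sketch of such a subclass; the pattern is only an illustration and has to define the four named groups Template expects (escaped, named, braced and invalid):
import string

class CurlyTemplate(string.Template):
    delimiter = '{'
    # Template's metaclass compiles this with re.IGNORECASE | re.VERBOSE.
    pattern = r'''
        \{(?:
            (?P<escaped>\{)                 |  # {{ yields a literal brace
            (?P<named>[_a-z][_a-z0-9]*)\}   |  # {foo}
            (?P<braced>[_a-z][_a-z0-9]*)\}  |  # treated the same as named here
            (?P<invalid>)
        )
    '''

print CurlyTemplate("http://api.app.com/{foo}").substitute(foo="bar")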