Example of using libextractor3 from Python

I'm using python-extractor to work with libextractor3, but I cannot find any examples of it. Does anyone have documentation or examples?

The source package of python-extractor contains a file named extract.py, which has a small demo of how to use the libextractor Python binding.
Contents of extract.py:
import extractor
import sys
from ctypes import *
import struct

xtract = extractor.Extractor()

def print_k(xt, plugin, type, format, mime, data, datalen):
    mstr = cast(data, c_char_p)
    # FIXME: this ignores 'datalen', not that great...
    # (in general, depending on the mime type and format, only
    # the first 'datalen' bytes in 'data' should be used).
    if (format == extractor.EXTRACTOR_METAFORMAT_UTF8):
        print "%s - %s" % (xtract.keywordTypes()[type], mstr.value)
    return 0

for arg in sys.argv[1:]:
    print "Keywords from %s:" % arg
    xtract.extract(print_k, None, arg)
For a better understanding of python-extractor, read through the source code in extractor.py.

Related

Python SAX Parser: resolveEntity

I am having a hard time figuring out how to bind a ResolveEntityHandler of my own to a SAX parser. On SO there is this answer, but unfortunately I cannot reproduce the result shown there.
When I run the following code, which is actually copied from the aforementioned answer, just updated to Python 3,
import io
import xml.sax
from xml.sax.handler import ContentHandler

# Inheriting from EntityResolver and DTDHandler is not necessary
class TestHandler(ContentHandler):

    # This method is only called for external entities. Must return a value.
    def resolveEntity(self, publicID, systemID):
        print("TestHandler.resolveEntity(): %s %s" % (publicID, systemID))
        return systemID

    def skippedEntity(self, name):
        print("TestHandler.skippedEntity(): %s" % (name))

    def unparsedEntityDecl(self, name, publicID, systemID, ndata):
        print("TestHandler.unparsedEntityDecl(): %s %s" % (publicID, systemID))

    def startElement(self, name, attrs):
        summary = attrs.get('summary', '')
        print('TestHandler.startElement():', summary)

def main(xml_string):
    try:
        parser = xml.sax.make_parser()
        curHandler = TestHandler()
        parser.setContentHandler(curHandler)
        parser.setEntityResolver(curHandler)
        parser.setDTDHandler(curHandler)
        stream = io.StringIO(xml_string)
        parser.parse(stream)
        stream.close()
    except xml.sax.SAXParseException as e:
        print("ERROR %s" % e)

XML = """<!DOCTYPE test SYSTEM "test.dtd">
<test summary='step: &num;'>Entity: &not;</test>
"""

main(XML)
and the external test.dtd
<!ENTITY num "FOO">
<!ENTITY pic SYSTEM 'bar.gif' NDATA gif>
What I got is
TestHandler.startElement(): step:
TestHandler.skippedEntity(): not
Process finished with exit code 0
So my questions are:
Why was resolveEntity never called?
How do I bind a ResolveEntityHandler to the parser?
What you are seeing has to do with a change in Python 3.7.1:
Changed in version 3.7.1: The SAX parser no longer processes general external entities by default to increase security. Before, the parser created network connections to fetch remote files or loaded local files from the file system for DTD and entities. The feature can be enabled again with method setFeature() on the parser object and argument feature_external_ges.
To get the same behaviour as in earlier versions, add these lines:
from xml.sax.handler import feature_external_ges
and (in the main function)
parser.setFeature(feature_external_ges, True)
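Putting the two lines above into a complete script, a minimal sketch might look like the following. The RecordingHandler class, the parse_with_external_entities helper, and the in-memory DTD_TEXT are assumptions made for illustration. They are not part of the original question, and serving the DTD from memory through an InputSource is only one way to satisfy resolveEntity without touching the file system.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Stand-in for the contents of test.dtd, kept in memory for this sketch.
DTD_TEXT = b'<!ENTITY num "FOO">'


class RecordingHandler(ContentHandler, EntityResolver):
    """Hypothetical handler that records what the parser asks it to resolve."""

    def __init__(self):
        super().__init__()
        self.resolved = []   # system IDs passed to resolveEntity
        self.summaries = []  # 'summary' attribute values seen on elements

    def resolveEntity(self, publicId, systemId):
        self.resolved.append(systemId)
        # Return an InputSource so the parser reads the DTD from memory
        # instead of opening a file or a network connection.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(DTD_TEXT))
        return source

    def startElement(self, name, attrs):
        self.summaries.append(attrs.get("summary", ""))


def parse_with_external_entities(xml_string):
    handler = RecordingHandler()
    parser = xml.sax.make_parser()
    # Re-enable external general entities (off by default since Python 3.7.1).
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.parse(io.StringIO(xml_string))
    return handler


XML = """<!DOCTYPE test SYSTEM "test.dtd">
<test summary='step: &num;'>text</test>
"""

handler = parse_with_external_entities(XML)
print(handler.resolved)
print(handler.summaries)
```

Only enable this feature for documents you trust, since it is disabled by default for security reasons.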

Python3 / SWIG and output streams

I am using the SWIG generated Python wrappers for GDCM (comes with gdcm.py).
I am running the following Python3 script.
import gdcm
import sys
filename="path_to_data/gdcm_test.dcm"
r = gdcm.Reader()
r.SetFileName(filename)
r.Read()
f=r.GetFile()
ds = f.GetDataSet()
csa_t1 = gdcm.CSAHeader()
t1 = csa_t1.GetCSAImageHeaderInfoTag()
csa_t1.LoadFromDataElement(ds.GetDataElement( t1))
csa_t1.Print(sys.stdout)
The relevant snippet from the gdcmswig.py file (with the function that wraps Print) is below.
def Print(self, os: 'std::ostream &') -> "void":
    """
    void
    gdcm::CSAHeader::Print(std::ostream &os) const
    Print the CSAHeader (use only if Format == SV10 or NOMAGIC)
    """
    return _gdcmswig.CSAHeader_Print(self, os)
The problem appears on the last line of my script. The call to Print(sys.stdout).
TypeError: in method 'CSAHeader_Print', argument 2 of type 'std::ostream &'
The problem, I think, is that Python’s sys.stdout is not the actual output file handle, but wraps the handle. What is the best way to solve this? Thanks in advance.

How do I get the Python line number and file name of the point this function was called from? [duplicate]

In C++, I can print debug output like this:
printf(
    "FILE: %s, FUNC: %s, LINE: %d, LOG: %s\n",
    __FILE__,
    __FUNCTION__,
    __LINE__,
    logmessage
);
How can I do something similar in Python?
There is a module named inspect that provides this information.
Example usage:
import inspect

def PrintFrame():
    callerframerecord = inspect.stack()[1]    # 0 represents this line
                                              # 1 represents line at caller
    frame = callerframerecord[0]
    info = inspect.getframeinfo(frame)
    print(info.filename)                      # __FILE__     -> Test.py
    print(info.function)                      # __FUNCTION__ -> Main
    print(info.lineno)                        # __LINE__     -> 13

def Main():
    PrintFrame()                              # for this line

Main()
However, please remember that there is an easier way to obtain the name of the currently executing file:
print(__file__)
For example
import inspect
frame = inspect.currentframe()
# __FILE__
fileName = frame.f_code.co_filename
# __LINE__
fileNo = frame.f_lineno
There's more here http://docs.python.org/library/inspect.html
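As a small sketch combining the two approaches above, the helper below returns the caller's file, function, and line in one call by stepping back one frame from inspect.currentframe(). The names caller_location and demo are made up for this illustration, and the code assumes an implementation such as CPython where inspect.currentframe() does not return None.

```python
import inspect
import os


def caller_location():
    """Return (file name, function name, line number) of the caller."""
    frame = inspect.currentframe().f_back  # one step back: the caller's frame
    try:
        return (
            os.path.basename(frame.f_code.co_filename),  # like __FILE__
            frame.f_code.co_name,                        # like __FUNCTION__
            frame.f_lineno,                              # like __LINE__
        )
    finally:
        # Drop the frame reference promptly to avoid reference cycles.
        del frame


def demo():
    return caller_location()


file_name, func_name, line_no = demo()
print(file_name, func_name, line_no)
```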
Building on geowar's answer:
import sys

class __LINE__(object):
    def __repr__(self):
        try:
            raise Exception
        except Exception:
            return str(sys.exc_info()[2].tb_frame.f_back.f_lineno)

__LINE__ = __LINE__()
If you normally want to use __LINE__ in e.g. print (or any other time an implicit str() or repr() is taken), the above will allow you to omit the ()s.
(Obvious extension to add a __call__ left as an exercise to the reader.)
You can refer to my answer:
https://stackoverflow.com/a/45973480/1591700
import sys
print(sys._getframe().f_lineno)
You can also wrap this in a lambda function.
I was also interested in a __LINE__ command in python.
My starting point was https://stackoverflow.com/a/6811020 and I extended it with a metaclass object. With this modification it behaves the same as in C++.
import inspect

class Meta(type):
    def __repr__(self):
        # Inspiration: https://stackoverflow.com/a/6811020
        callerframerecord = inspect.stack()[1]  # 0 represents this line
                                                # 1 represents line at caller
        frame = callerframerecord[0]
        info = inspect.getframeinfo(frame)
        # print(info.filename)  # __FILE__     -> Test.py
        # print(info.function)  # __FUNCTION__ -> Main
        # print(info.lineno)    # __LINE__     -> 13
        return str(info.lineno)

class __LINE__(metaclass=Meta):
    pass

print(__LINE__)  # prints, for example, 18
wow, 7 year old question :)
Anyway, taking Tugrul's answer, and writing it as a debug type method, it can look something like:
def debug(message):
    import sys
    import inspect
    callerframerecord = inspect.stack()[1]
    frame = callerframerecord[0]
    info = inspect.getframeinfo(frame)
    print(info.filename, 'func=%s' % info.function, 'line=%s:' % info.lineno, message)

def somefunc():
    debug('inside some func')

debug('this')
debug('is a')
debug('test message')
somefunc()
Output:
/tmp/test2.py func=<module> line=12: this
/tmp/test2.py func=<module> line=13: is a
/tmp/test2.py func=<module> line=14: test message
/tmp/test2.py func=somefunc line=10: inside some func
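For comparison, the standard library's logging module can produce similar output without a hand-written helper, because its format strings support %(filename)s, %(funcName)s and %(lineno)d. This is a different technique from the inspect-based debug() above. The logger name "line_demo" and the in-memory stream are assumptions chosen to keep the sketch self-contained.

```python
import io
import logging

# Collect log output in memory so the sketch does not depend on the console.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(filename)s func=%(funcName)s line=%(lineno)d: %(message)s")
)

logger = logging.getLogger("line_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # keep records away from the root logger


def somefunc():
    logger.debug("inside some func")


somefunc()
print(stream.getvalue(), end="")
```

The file name, function name, and line number are filled in by logging from the call site of logger.debug().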
import inspect
import sys
.
.
.
def __LINE__():
    try:
        raise Exception
    except Exception:
        return sys.exc_info()[2].tb_frame.f_back.f_lineno

def __FILE__():
    return inspect.currentframe().f_code.co_filename
.
.
.
print("file: '%s', line: %d" % (__FILE__(), __LINE__()))
Here is a tool to answer this old yet new question!
I recommend using icecream!
Do you ever use print() or log() to debug your code? Of course, you
do. IceCream, or ic for short, makes print debugging a little sweeter.
ic() is like print(), but better:
It prints both expressions/variable names and their values.
It's 40% faster to type.
Data structures are pretty printed.
Output is syntax highlighted.
It optionally includes program context: filename, line number, and parent function.
For example, I created a module icecream_test.py, and put the following code inside it.
from icecream import ic
ic.configureOutput(includeContext=True)
def foo(i):
    return i + 333

ic(foo(123))
Prints
ic| icecream_test.py:6 in <module>- foo(123): 456
To get the line number in Python while importing only a single name from the sys module...
First import the _getframe function:
from sys import _getframe
Then call _getframe() and read the f_lineno attribute of the returned frame object whenever you want to know the line number:
print(_getframe().f_lineno) # prints the line number
From the interpreter:
>>> from sys import _getframe
... _getframe().f_lineno # 2
Word of caution from the official Python Docs:
CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.
In other words: Only use this code for personal testing / debugging reasons.
See the official Python documentation on sys._getframe for more information on the sys module and the _getframe() function.
Based on Mohammad Shahid's answer (above).

How to determine if win32api.ShellExecute was successful using hinstance?

I've been looking for an answer to my original issue: how do I determine programmatically that my win32api.ShellExecute call executed successfully and, if it did, run an os.remove() statement?
While researching, I found out that the ShellExecute() call returns an HINSTANCE, and that the value is greater than 32 if the call was successful. My question now is: how do I use it to control the rest of my program's flow? I tried using an if hinstance > 32: statement to control the next part, but I get a NameError: name 'hinstance' is not defined message. Normally this wouldn't confuse me, because it means I need to define the variable 'hinstance' before referencing it; however, because ShellExecute is supposed to return an HINSTANCE, I thought that would make it available for use.
Here is my full code where I am trying to implement this. Note that in my print_file() function I assign the result of the win32api.ShellExecute() call to hinstance in an attempt to capture it, and I explicitly return it at the end of the function. This isn't working either.
import win32print
import win32api
from os.path import isfile, join
import glob
import os
import time

source_path = "c:\\temp\\source\\"

def main():
    printer_name = win32print.GetDefaultPrinter()
    while True:
        file_queue = [f for f in glob.glob("%s\\*.txt" % source_path) if isfile(f)]
        if len(file_queue) > 0:
            for i in file_queue:
                print_file(i, printer_name)
                if hinstance > 32:
                    time.sleep(.25)
                    delete_file(i)
                    print "Filename: %r has printed" % i
                print
                time.sleep(.25)
            print
        else:
            print "No files to print. Will retry in 15 seconds"
            time.sleep(15)

def print_file(pfile, printer):
    hinstance = win32api.ShellExecute(
        0,
        "print",
        '%s' % pfile,
        '/d:"%s"' % printer,
        ".",
        0
    )
    return hinstance

def delete_file(f):
    os.remove(f)
    print f, "was deleted!"

def alert(email):
    pass

main()
With ShellExecute, you will never know when the printing is complete. That depends on the size of the file and on whether the printer driver buffers the contents (the printer might be waiting for you to fill the paper tray, for example).
According to this SO answer, subprocess.call() looks like a better solution, since it waits for the command to complete; in that case, however, you would need to read the registry to find the executable associated with the file.
ShellExecuteEx is available from pywin32, you can do something like:
from win32com.shell import shell, shellcon
param = '/d:"%s"' % printer
shell.ShellExecuteEx(fMask=shellcon.SEE_MASK_NOASYNC, lpVerb='print', lpFile=pfile, lpParameters=param)
EDIT: code for waiting on the handle from ShellExecuteEx()
import win32com.shell.shell as shell
import win32event
# fMask = SEE_MASK_NOASYNC (0x00000100 = 256) + SEE_MASK_NOCLOSEPROCESS (0x00000040 = 64)
info = shell.ShellExecuteEx(fMask=256 + 64, lpFile='Notepad.exe', lpParameters='Notes.txt')
hh = info['hProcess']
print(hh)
# Block until the launched process exits
ret = win32event.WaitForSingleObject(hh, win32event.INFINITE)
print(ret)
The return value of ShellExecute is what you need to test. You return that from print_file, but you then ignore it. You need to capture it and check that.
hinstance = print_file(i, printer_name)
if hinstance > 32:
    ....
However, having your print_file function leak implementation detail like an HINSTANCE seems bad. I think you would be better to check the return value of ShellExecute directly at the point of use. So try to move the > 32 check inside print_file.
Note that ShellExecute has very weak error reporting. If you want proper error reporting then you should use ShellExecuteEx instead.
Your delete/sleep loop is very brittle indeed. I'm not quite sure I can recommend anything better since I'm not sure what you are trying to achieve. However, expect to run into trouble with that part of your program.
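The suggestion to move the > 32 check inside print_file could be sketched as follows. The shell_execute parameter is a hypothetical addition, not part of the original code, that lets the function be exercised without pywin32; when it is omitted, the sketch falls back to win32api.ShellExecute with the same arguments as the question.

```python
def print_file(pfile, printer, shell_execute=None):
    """Send pfile to the printer; return True if ShellExecute reported success.

    shell_execute is a hypothetical hook for testing. By default the real
    win32api.ShellExecute is used, which requires pywin32 on Windows.
    """
    if shell_execute is None:
        import win32api  # imported lazily so the module loads on any platform
        shell_execute = win32api.ShellExecute
    result = shell_execute(0, "print", pfile, '/d:"%s"' % printer, ".", 0)
    # ShellExecute signals success with a return value greater than 32.
    return result > 32


# Usage with stand-in callables instead of the real API:
print(print_file("report.txt", "Office Printer", lambda *args: 42))
print(print_file("report.txt", "Office Printer", lambda *args: 2))
```

The caller can then write if print_file(i, printer_name): without knowing anything about HINSTANCE values. As noted above, a True result only means the print request was launched, not that printing has finished.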

How to extract chains from a PDB file?

I would like to extract chains from PDB files. I have a file named pdb.txt which contains PDB IDs as shown below. The first four characters are the PDB ID and the last character is the chain ID.
1B68A
1BZ4B
4FUTA
I would like to:
1) read the file line by line,
2) download the atomic coordinates of each chain from the corresponding PDB files,
3) save the output to a folder.
I used the following script to extract chains. But this code prints only A chains from pdb files.
for i in 1B68 1BZ4 4FUT
do
    wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="$i -O $i.pdb
    grep ATOM $i.pdb | grep 'A' > $i\_A.pdb
done
The following BioPython code should suit your needs well.
It uses PDB.Select to only select the desired chains (in your case, one chain) and PDBIO() to create a structure containing just the chain.
import os
from Bio import PDB


class ChainSplitter:
    def __init__(self, out_dir=None):
        """ Create parsing and writing objects, specify output directory. """
        self.parser = PDB.PDBParser()
        self.writer = PDB.PDBIO()
        if out_dir is None:
            out_dir = os.path.join(os.getcwd(), "chain_PDBs")
        self.out_dir = out_dir

    def make_pdb(self, pdb_path, chain_letters, overwrite=False, struct=None):
        """ Create a new PDB file containing only the specified chains.

        Returns the path to the created file.

        :param pdb_path: full path to the crystal structure
        :param chain_letters: iterable of chain characters (case insensitive)
        :param overwrite: write over the output file if it exists
        """
        chain_letters = [chain.upper() for chain in chain_letters]

        # Input/output files
        (pdb_dir, pdb_fn) = os.path.split(pdb_path)
        pdb_id = pdb_fn[3:7]
        out_name = "pdb%s_%s.ent" % (pdb_id, "".join(chain_letters))
        out_path = os.path.join(self.out_dir, out_name)
        print("OUT PATH:", out_path)
        plural = "s" if (len(chain_letters) > 1) else ""  # for printing

        # Skip PDB generation if the file already exists
        if (not overwrite) and (os.path.isfile(out_path)):
            print("Chain%s %s of '%s' already extracted to '%s'." %
                  (plural, ", ".join(chain_letters), pdb_id, out_name))
            return out_path

        print("Extracting chain%s %s from %s..." % (plural,
              ", ".join(chain_letters), pdb_fn))

        # Get structure, write new file with only given chains
        if struct is None:
            struct = self.parser.get_structure(pdb_id, pdb_path)
        self.writer.set_structure(struct)
        self.writer.save(out_path, select=SelectChains(chain_letters))

        return out_path


class SelectChains(PDB.Select):
    """ Only accept the specified chains when saving. """
    def __init__(self, chain_letters):
        self.chain_letters = chain_letters

    def accept_chain(self, chain):
        return (chain.get_id() in self.chain_letters)


if __name__ == "__main__":
    # Parses PDB IDs and desired chains, and creates new PDB structures.
    import sys
    if not len(sys.argv) == 2:
        print("Usage: $ python %s 'pdb.txt'" % __file__)
        sys.exit()

    pdb_textfn = sys.argv[1]

    pdbList = PDB.PDBList()
    splitter = ChainSplitter("/home/steve/chain_pdbs")  # Change me.

    with open(pdb_textfn) as pdb_textfile:
        for line in pdb_textfile:
            pdb_id = line[:4].lower()
            chain = line[4]
            # Newer Biopython releases default to mmCIF, so request the
            # PDB format explicitly to get a "pdbXXXX.ent" file.
            pdb_fn = pdbList.retrieve_pdb_file(pdb_id, file_format="pdb")
            splitter.make_pdb(pdb_fn, chain)
One final note: don't write your own parser for PDB files. The format specification is ugly (really ugly), and the amount of faulty PDB files out there is staggering. Use a tool like BioPython that will handle parsing for you!
Furthermore, instead of using wget, you should use tools that interact with the PDB database for you. They take FTP connection limitations into account, the changing nature of the PDB database, and more. I should know - I updated Bio.PDBList to account for changes in the database. =)
It is probably a little late for answering this question, but I will give my opinion.
Biopython has some really handy features that would help you achieve such a thing easily. You could use a custom selection class and then call it, inside a for loop over the original PDB file, for each of the chains you want to select.
from Bio.PDB import Select, PDBIO
from Bio.PDB.PDBParser import PDBParser

class ChainSelect(Select):
    def __init__(self, chain):
        self.chain = chain

    def accept_chain(self, chain):
        # Keep only the chain whose ID matches the requested one
        return chain.get_id() == self.chain

chains = ['A', 'B', 'C']
pdb_file = 'structure.pdb'  # placeholder: path to the original PDB file
p = PDBParser(PERMISSIVE=1)
structure = p.get_structure(pdb_file, pdb_file)

for chain in chains:
    pdb_chain_file = 'pdb_file_chain_{}.pdb'.format(chain)
    io_w_no_h = PDBIO()
    io_w_no_h.set_structure(structure)
    io_w_no_h.save(pdb_chain_file, ChainSelect(chain))
Let's say you have the following file, pdb_structures:
1B68A
1BZ4B
4FUTA
Then put your code in load_pdb.sh:
while read name
do
    chain=${name:4:1}
    name=${name:0:4}
    wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="$name -O $name.pdb
    # The chain ID is in column 22 of an ATOM record
    awk -v chain=$chain '$0~/^ATOM/ && substr($0,22,1)==chain {print}' $name.pdb > $name\_$chain.pdb
    # rm $name.pdb
done
Uncomment the rm line if you don't need the original PDB files.
Execute it with:
cat pdb_structures | ./load_pdb.sh
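The two text-level steps of the shell script above, splitting an entry such as 1B68A into ID and chain and keeping only the ATOM records of that chain, can also be sketched in Python. The function names split_entry, filter_chain, and atom_line are made up for this illustration, and the sample records are synthetic. The sketch relies only on the fixed-column layout of ATOM records, where the chain ID sits in column 22, and for real files a library such as Biopython, as recommended above, remains the safer choice.

```python
def split_entry(entry):
    """Split an entry like '1B68A' into (pdb_id, chain_id)."""
    entry = entry.strip()
    return entry[:4], entry[4]


def filter_chain(pdb_lines, chain_id):
    """Keep only ATOM records whose chain ID (column 22) equals chain_id."""
    return [
        line for line in pdb_lines
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id
    ]


def atom_line(serial, chain_id):
    """Build a synthetic, truncated ATOM record for demonstration only.

    Columns: 1-6 record name, 7-11 serial, 13-16 atom name,
    18-20 residue name, 22 chain ID, 23-26 residue number.
    """
    return "ATOM  %5d  CA  GLY %s%4d" % (serial, chain_id, serial)


sample = [atom_line(1, "A"), atom_line(2, "B"), "HETATM placeholder record"]
pdb_id, chain_id = split_entry("1BZ4B\n")
print(pdb_id, chain_id)
print(filter_chain(sample, chain_id))
```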
