unix tools to parse file on the command line - python

I have a python script that looks like the following that I want to transform:
import sys
# more imports
''' some comments '''
class Foo:
    def _helper1():
        etc.
    def _helper2():
        etc.
    def foo1():
        d = { a:3, b:2, c:4 }
        etc.
    def foo2():
        d = { a:2, b:2, c:7 }
        etc.
    def foo3():
        d = { a:3, b:2, c:7 }
        etc.
    etc.
if __name__ == "__main__":
    etc.
I'd like to be able to parse JUST the foo*() functions and keep only the ones that have certain attributes, like d={a:3, b:2}. Obviously everything else that is not a foo*() function must be kept so the transformed script will still run. The foo*() functions will be well defined, though d may have different keys and values.
Is there some set of unix tools I can use to do this through chaining? I can use grep to identify foo, but how would I scan the next couple of lines to apply the keep-or-reject portion of my logic?
Edit: note, I'm trying to see if it's reasonable to do this with command line tools before writing a custom parser. I know how to write the parser.

You haven't specified your problem with enough detail to recommend a particular solution, but there are many tools and techniques that will handle this type of problem.
As I understand this, you want to
Identify the boundaries of your class
Identify the methods within the class
Remove the methods lacking certain textual features
My general approach to this would be a script with logic based on "open old and new files; write everything you read from the old file, unless it belongs to a method that lacks the target text."
You can blithely write things until you get into the class (one flag) and start finding methods (another flag). The one slightly tricky part here is the buffering: you need to keep the text of each method until you know whether it contains the target text. You can either read in the entire method (minor parsing task) and search that for the target, or simply hold lines of text until you find the target (then return to your write-it-all mode) or run off the end (empty the buffer without writing).
This is simple enough that you could cobble together a script in any handy language to handle the problem. UNIX provides a variety of tools; in that paradigm I'd use awk. However, I recommend a read-friendly tool, such as Python or Perl. If you want to move formally into the world of parsing, I suggest a trivial Lex-YACC couplet: you can have very simple tokens (perhaps even complete lines, depending on your coding style) and actions (write line, hold line, set status flag, flush buffer, etc.).
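As a rough illustration of the hold-and-flush idea, here is a minimal Python sketch. The names filter_methods, prefix, and required are hypothetical. It assumes methods are recognised by an indented def line and end at the next non-blank line indented no deeper than that def. The match is a plain substring test after stripping spaces, so "a:3" would also match "a:30".

```python
import re

def filter_methods(lines, prefix="foo", required=("a:3", "b:2")):
    """Copy lines through, dropping prefix*() methods that lack a required string."""
    out, buf = [], []
    method_re = re.compile(r"^(\s+)def\s+(\w+)\s*\(")
    indent = None  # indentation of the def currently being buffered

    def flush():
        # Write the held method only if it contains every required string.
        text = "".join(buf).replace(" ", "")
        if all(r in text for r in required):
            out.extend(buf)
        buf.clear()

    for line in lines:
        if buf:
            cur_indent = len(line) - len(line.lstrip())
            if line.strip() and cur_indent <= indent:
                flush()            # the held method has ended
            else:
                buf.append(line)   # still inside the held method
                continue
        m = method_re.match(line)
        if m and m.group(2).startswith(prefix):
            indent = len(m.group(1))
            buf.append(line)       # start holding a candidate method
        else:
            out.append(line)       # write-it-all mode
    if buf:
        flush()                    # the file ended inside a held method
    return out
```

Calling it on the lines of the old file (for example from open(path).readlines()) and writing the returned list to the new file gives the "write everything unless" behaviour described above.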
Is that enough to get you moving?

How can I edit a line in .tcl file?

I need to run a .tcl file via the command line, invoked from a Python script. However, a single line in that .tcl file needs to change based on input from the user. For example:
info = input("Prompt for the user: ")
Now I need the string contained in info to replace one of the lines in .tcl file.

Rewriting the script is one of the trickier options to pick. It makes things harder to audit and it is tremendously easy to make a mess of. It's not recommended at all unless you take special steps, such as factoring out the bit you set into its own file:
File that you edit, e.g., settings.tcl (simple enough that it is pretty trivial to write and you can rewrite the whole lot each time without making a mess of it)
set value "123"
Use of that file:
set value 0
if {[file readable settings.tcl]} {
    source settings.tcl
}
puts "value is $value"
More sophisticated versions of that are possible with safe interpreters and language profiling… but they're only really needed when the settings and the code are in different trust domains.
That said, there are other approaches that are usually easier. If you are invoking the Tcl script by running a subprocess, the easiest ways to pass an arbitrary parameter are to use one of:
A command line argument. These can be read on the Tcl side from the $argv global, which holds a list of all arguments after the script name. (The lindex and lassign commands tend to be useful here, e.g., set value [lindex $argv 0].)
An environment variable. These can be read on the Tcl side from the env global array, e.g., set value $env(MyVarName)
On standard input. A line can be read from that on the Tcl side using set line [gets stdin].
In more complex cases, you'd pass values in their own files, or by writing them into something like an SQLite database, or… well, there's lots of options.
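On the Python side, the three subprocess options above could look roughly like the sketch below. It assumes a tclsh executable on the PATH and reuses the names program.tcl and MyVarName from this answer. The helper names tcl_invocation and run_tcl are hypothetical.

```python
import os
import subprocess

def tcl_invocation(script, value, interpreter="tclsh"):
    """Return (argv, env) that carry `value` as an argument and as an env var."""
    argv = [interpreter, script, value]      # Tcl side: [lindex $argv 0]
    env = dict(os.environ, MyVarName=value)  # Tcl side: $env(MyVarName)
    return argv, env

def run_tcl(script, value):
    """Run the Tcl script, also feeding `value` on standard input."""
    argv, env = tcl_invocation(script, value)
    # input= supplies one line for [gets stdin]; check=True raises on failure.
    return subprocess.run(argv, env=env, input=value + "\n",
                          text=True, check=True)

# Example call (not executed here):
# info = input("Prompt for the user: ")
# run_tcl("program.tcl", info)
```

Passing the arguments as a list rather than a single shell string means no shell quoting of the user's input is needed. In practice you would pick just one of the three channels.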
If on the other hand the Tcl interpreter is in the same process, pass the values by setting the variables in it before asking for the script to run. (Tcl has almost no true globals — environment variables are a special exception, and only because the OS forces it upon us — so everything is specific to the interpreter context.)
Specifically, if you've got a Tcl instance object from tkinter (Tk is a subclass of that) then you can do:
import tkinter
interp = tkinter.Tcl()
interp.call("set", "value", 123)
interp.eval("source program.tcl")
# Or interp.call("source", "program.tcl")
That has the advantage of doing all the quoting for you.

Python script entry point: How to call "main2"?

I have inherited a python script which appears to have multiple distinct entry points. For example:
if __name__ == '__main__1':
    ... Do stuff for option 1
if __name__ == '__main__2':
    ... Do stuff for option 2
etc
Google has turned up a few other examples of this syntax (e.g. here) but I'm still no wiser on how to use it.
So the question is: How can I call a specific entry point in a python script that has multiple numbered __main__ sections?
Update:
I found another example of it here, where the syntax appears to be related to a specific tool.
https://github.com/brython-dev/brython/issues/163

The standard documentation mentions only __main__ as a reserved module namespace. After looking at your sample I notice that every main section seems separate: each does its own imports and performs some enclosed functionality. My suspicion is that the developer wanted to quickly swap functionalities and didn't bother to use command line arguments for that, opting instead to rename '__main__2' to '__main__' as needed.
This is by no means proven, though - any chance of contacting the one who wrote this in the first place?
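If renaming the guard by hand is not an option, one way to trigger a numbered section without editing the file is the standard library's runpy.run_path(), whose run_name argument sets __name__ for the executed script. This is a minimal sketch, not a confirmed account of how the original author ran the script. SCRIPT, legacy_script.py, and run_entry_point are hypothetical names used only for illustration.

```python
import os
import runpy
import tempfile

# Stand-in for the inherited script with numbered sections.
SCRIPT = """
result = None
if __name__ == '__main__1':
    result = 'option 1'
if __name__ == '__main__2':
    result = 'option 2'
"""

def run_entry_point(path, name):
    """Execute the file at `path` with __name__ set to `name`; return its globals."""
    return runpy.run_path(path, run_name=name)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy_script.py")
    with open(path, "w") as fh:
        fh.write(SCRIPT)
    chosen = run_entry_point(path, "__main__2")["result"]

print(chosen)  # prints: option 2
```

A command line argument that selects the option is still the more conventional design if you are free to change the script.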

Ignore the rest of the python file

My python scripts often contain "executable code" (functions, classes, &c) in the first part of the file and "test code" (interactive experiments) at the end.
I want python, py_compile, pylint &c to completely ignore the experimental stuff at the end.
I am looking for something like #if 0 for cpp.
How can this be done?
Here are some ideas and the reasons they are bad:
sys.exit(0): works for python but not py_compile and pylint
put all experimental code under def test():: I can no longer copy/paste the code into a python REPL because it has non-trivial indent
put all experimental code between lines with """: emacs no longer indents and fontifies the code properly
comment and uncomment the code all the time: I am too lazy (yes, this is a single key press, but I have to remember to do that!)
put the test code into a separate file: I want to keep the related stuff together
PS. My IDE is Emacs and my python interpreter is pyspark.

Use ipython rather than python for your REPL. It has better code completion and introspection, and when you paste indented code it can automatically "de-indent" the pasted code.
Thus you can put your experimental code in a test function and then paste in parts without worrying and having to de-indent your code.
If you are pasting large blocks that can be considered individual blocks then you will need to use the %paste or %cpaste magics.
e.g.
for i in range(3):
    i *= 2
    # with the following blank line this is a complete block

    print(i)
With a normal paste:
In [1]: for i in range(3):
   ...:     i *= 2
   ...:
In [2]: print(i)
4
Using %paste
In [3]: %paste
for i in range(3):
    i *= 2

    print(i)
## -- End pasted text --
0
2
4
In [4]:
PySpark and IPython
It is also possible to launch PySpark in IPython, the enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To use IPython, set the IPYTHON variable to 1 when running bin/pyspark:
$ IPYTHON=1 ./bin/pyspark

Unfortunately, there is no widely used standard (or any standard at all) for what you are describing, so getting a bunch of Python-specific tools to work like this will be difficult.
However, you could wrap these commands in such a way that they only read until a signifier. For example (assuming you are on a unix system):
cat $file | sed '/exit(0)/q' |sed '/exit(0)/d'
The command will read until 'exit(0)' is found. You could pipe this into your checkers, or create a temp file that your checkers read. You could create wrapper executable files on your path that may work with your editors.
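The same truncation can be sketched in Python, which may be handy if the wrapper has to run where sed is unavailable. The function name executable_part and the sample text are hypothetical. The marker is matched as a plain substring, as the sed pattern above effectively does.

```python
def executable_part(source, marker="exit(0)"):
    """Return the lines of `source` that precede the first line containing `marker`."""
    kept = []
    for line in source.splitlines(keepends=True):
        if marker in line:
            break  # stop at the marker line and drop it, like the sed pipeline
        kept.append(line)
    return "".join(kept)


example = (
    "def square(x):\n"
    "    return x * x\n"
    "sys.exit(0)\n"
    "this is scratch text, not valid Python\n"
)
trimmed = executable_part(example)
# The scratch part is gone, so a syntax check of the remainder succeeds.
compile(trimmed, "<trimmed>", "exec")
```

The trimmed text could then be written to a temporary file and handed to py_compile or pylint.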
Windows may be able to use a similar technique.
I might advise a different approach. Separate files might be best. You might explore IPython notebooks as a possible solution, but I'm not sure exactly what your use case is.

Follow something like option 2.
I usually put experimental code in a main method.
def main():
    # experimental code goes here
Then if you want to execute the experimental code just call the main.
main()

With python-mode.el mark arbitrary chunks as section - for example via py-sectionize-region.
Then call py-execute-section.
Updated after comment:
python-mode.el is delivered by melpa.
M-x list-packages RET
Look for python-mode - the built-in python.el provides 'python, while python-mode.el provides 'python-mode.
Development has moved to: https://gitlab.com/python-mode-devs/python-mode

I think the standard ('Pythonic') way to deal with this is to do it like so:
class MyClass(object):
    ...
def my_function():
    ...
if __name__ == '__main__':
    # testing code here
Edit after your comment
I don't think what you want is possible using a plain Python interpreter. You could have a look at the IEP Python editor (website, bitbucket): it supports something like Matlab's cell mode, where a cell can be defined with a double comment character (##):
## main code
class MyClass(object):
    ...
def my_function():
    ...
## testing code
do_some_testing_please()
All code from a ##-beginning line until either the next such line or end-of-file constitutes a single cell.
Whenever the cursor is within a particular cell and you strike some hotkey (default Ctrl+Enter), the code within that cell is executed in the currently running interpreter. An additional feature of IEP is that selected code can be executed with F9; a pretty standard feature but the nice thing here is that IEP will smartly deal with whitespace, so just selecting and pasting stuff from inside a method will automatically work.

I suggest you use a proper version control system to keep the "real" and the "experimental" parts separated.
For example, using Git, you could only include the real code without the experimental parts in your commits (using add -p), and then temporarily stash the experimental parts for running your various tools.
You could also keep the experimental parts in their own branch which you then rebase on top of the non-experimental parts when you need them.

Another possibility is to put tests as doctests into the docstrings of your code, which admittedly is only practical for simpler cases.
This way, they are only treated as executable code by the doctest module, but as comments otherwise.
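A minimal sketch of that idea, using a hypothetical function double: the examples live in the docstring, so python, py_compile and pylint see only a string, while the doctest module runs them on request.

```python
import doctest

def double(x):
    """Return x multiplied by two.

    >>> double(2)
    4
    >>> double("ab")
    'abab'
    """
    return x * 2

if __name__ == "__main__":
    # Runs every docstring example in this module and reports any failures.
    doctest.testmod()
```

The same checks can be run without the __main__ guard via python -m doctest yourfile.py.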

How to read LV2 ttl file in Python?

I have an LV2 plugin and I want to use Python to extract its metadata - plugin name, description, list of control and audio ports and specification of each port.
With LADSPA the instructions were pretty clear, although a bit difficult to implement in Python: I just needed to call the ladspa_descriptor() function. Now with LV2 there's a .ttl file, simpler to access but more complicated to parse.
Is there any python library that will make this job simple?

The LV2 documentation generation tools use RDFLib. It is probably the most popular RDF interface for Python, though it does much more than just parse Turtle. It is a good choice if performance is not an issue, but is unfortunately really slow.
If you need to actually instantiate and use plugins, you probably want to use an existing LV2 implementation. As Steve mentioned, Lilv is for this. It is not limited to any static default location, but will look in all the locations in LV2_PATH. You can set this environment variable to whatever you want before calling Lilv and it will only look in those locations. Alternatively, if you want to specifically load just one bundle at a time, there is a function for that: lilv_world_load_bundle().
There are SWIG-based Python bindings included with Lilv, but they stop short of actually allowing you to process data. However there is a project to wrap Lilv that allows processing of audio using scipy arrays: http://pyslv2.sourceforge.net/ (despite the name they are indeed Lilv bindings and not bindings for its predecessor SLV2)
That said, if you only need to get static information from the Turtle files, involving C libraries is probably more trouble than it is worth. One of the big advantages of using standard data files is ease of use with existing tools. To get the number of ports on a plugin, you simply need to count the number of triples that match the pattern (plugin, lv2:port, *). Here is an example Python script that prints the number of ports of a plugin, given the file to read and the plugin URI as command line arguments:
#!/usr/bin/env python
import rdflib
import sys
lv2 = rdflib.Namespace('http://lv2plug.in/ns/lv2core#')
path = sys.argv[1]
plugin = rdflib.URIRef(sys.argv[2])
model = rdflib.ConjunctiveGraph()
model.parse(path, format='n3')
# Count the triples matching (plugin, lv2:port, *); triples() takes one tuple.
num_ports = 0
for i in model.triples((plugin, lv2.port, None)):
    num_ports += 1
print('%s has %u ports' % (plugin, num_ports))

This is how to get the number of ports each plugin supports:
import lilv
w = lilv.World()
w.load_all()
for p in w.get_all_plugins():
    print(p.get_name().as_string(), p.get_num_ports())
At least this is all I got while trying to figure this out.

Reducing capabilities of markdown in python

I'm writing a comment system. It has to have a formatting system like Stack Overflow's.
Users can use some inline Markdown syntax like bold or italic. I thought I could handle that with regex replacements.
But there is another thing I have to do: by indenting with 4 spaces, users can create code blocks. I don't think I can do this with regex, or at least parsing indents is too advanced for me :) Creating lists via regex replacements also looks impossible to me.
What would be the best approach for doing this?
Are there any Markdown libraries whose capabilities I can reduce? (For example, I'd like to remove table support.)
If I should write my own parser, should I write a finite state machine from scratch, or are there libraries that make it easier?
Thanks for your time and your responses.

I'd just go ahead and use python-markdown and monkey-patch it. You can write your own build_block_parser() function and substitute that in for the default one to disable some of the Markdown functionality:
from markdown import blockprocessors as bp
def build_block_parser(md_instance, **kwargs):
    """ Build the default block parser used by Markdown. """
    parser = bp.BlockParser(md_instance)
    parser.blockprocessors['empty'] = bp.EmptyBlockProcessor(parser)
    parser.blockprocessors['indent'] = bp.ListIndentProcessor(parser)
    # parser.blockprocessors['code'] = bp.CodeBlockProcessor(parser)
    parser.blockprocessors['hashheader'] = bp.HashHeaderProcessor(parser)
    parser.blockprocessors['setextheader'] = bp.SetextHeaderProcessor(parser)
    parser.blockprocessors['hr'] = bp.HRProcessor(parser)
    parser.blockprocessors['olist'] = bp.OListProcessor(parser)
    parser.blockprocessors['ulist'] = bp.UListProcessor(parser)
    parser.blockprocessors['quote'] = bp.BlockQuoteProcessor(parser)
    parser.blockprocessors['paragraph'] = bp.ParagraphProcessor(parser)
    return parser
bp.build_block_parser = build_block_parser
Note that I've simply copied and pasted the default build_block_parser() function from the blockprocessors.py file, tweaked it a bit (inserting bp. in front of all the names from that module), and commented out the line where it adds the code block processor. The resulting function is then monkey-patched back into the module. A similar method looks feasible for inlinepatterns.py, treeprocessors.py, preprocessors.py, and postprocessors.py, each of which does a different kind of processing.
Rather than rewriting the function that sets up the individual parsers, as I've done above, you could also patch out the parser classes themselves with do-nothing subclasses that would still be invoked but which would do nothing. That is probably simpler:
from markdown import blockprocessors as bp
class NoProcessing(bp.BlockProcessor):
    def test(self, parent, block):
        return False # never invoke this processor
bp.CodeBlockProcessor = NoProcessing
There might be other Markdown libraries that more explicitly allow functionality to be disabled, but python-markdown looks like it is reasonably hackable.
