Reducing capabilities of markdown in python

Reducing capabilities of markdown in python - python

I'm writing a comment system. It has to be have formatting system like stackoverflow's.
Users can use some inline markdown syntax like bold or italic. I thought that i can solve that need with using regex replacements.
But there is another thing i have to do: by giving 4 space indents users can create code blocks. I think that i can't do this by using regex. or parsing idents is too advanced usage for me :) Also, creating lists via using regex replacements looks like imposible for me.
What would be best approach for doing this?
Are there any markdown libraries that can i reduce capabilities of it? (for example i'll try to remove tables support)
If i should write my own parser, should i write a finite state machine from the scratch? or are there any other libraries to make it easier?
Thanks for giving time, and your responses.

I'd just go ahead and use python-markdown and monkey-patch it. You can write your own def_block_parser() function and substitute that in for the default one to disable some of the Markdown functionality:
from markdown import blockprocessors as bp
def build_block_parser(md_instance, **kwargs):
""" Build the default block parser used by Markdown. """
parser = bp.BlockParser(md_instance)
parser.blockprocessors['empty'] = bp.EmptyBlockProcessor(parser)
parser.blockprocessors['indent'] = bp.ListIndentProcessor(parser)
# parser.blockprocessors['code'] = bp.CodeBlockProcessor(parser)
parser.blockprocessors['hashheader'] = bp.HashHeaderProcessor(parser)
parser.blockprocessors['setextheader'] = bp.SetextHeaderProcessor(parser)
parser.blockprocessors['hr'] = bp.HRProcessor(parser)
parser.blockprocessors['olist'] = bp.OListProcessor(parser)
parser.blockprocessors['ulist'] = bp.UListProcessor(parser)
parser.blockprocessors['quote'] = bp.BlockQuoteProcessor(parser)
parser.blockprocessors['paragraph'] = bp.ParagraphProcessor(parser)
return parser
bp.build_block_parser = build_block_parser
Note that I've simply copied and pasted the default build_block_processor() function from the blockprocessors.py file, tweaked it a bit (inserting bp. in front of all the names from that module), and commented out the line where it adds the code block processor. The resulting function is then monkey-patched back into the module. A similar method looks feasible for inlinepatterns.py, treeprocessor.py, preprocessor.py, and postprocessor.py, each of which does a different kind of processing.
Rather than rewriting the function that sets up the individual parsers, as I've done above, you could also patch out the parser classes themselves with do-nothing subclasses that would still be invoked but which would do nothing. That is probably simpler:
from markdown import blockprocessors as bp
class NoProcessing(bp.BlockProcessor):
def test(self, parent, block):
return False # never invoke this processor
bp.CodeBlockProcessor = NoProcessing
There might be other Markdown libraries that more explicitly allow functionality to be disabled, but python-markdown looks like it is reasonably hackable.

Related

How to use pyftsubset of Fonttools inside of the python environment, not from the command line

I need to subset very many font files and I need to do that from within the python environment. Yet, Fonttools is very poorly documented and I cannot find a module and the proper function syntax to perform subsetting based on unicode from within python, not as a command line tool (pyftsubset). Some of my files contain various errors when read by the Fonttools and I cannot catch exceptions using !command inside jupyter.

pyftsubset is itself just a Python script, which calls fontTools.subset.main, which in turn parses sys.argv (command-line args) to perform subsetting. You can do the same thing pretty easily in your own script, for example:
import sys
from fontTools.subset import main as ss
sys.argv = [None, '/path/to/font/file.ttf', '--unicodes=U+0020-002F']
ss() # this is what actually does the subsetting and writes the output file
Obviously you'll want to use your own values for --unicodes plus the numerous other pyftsubset options, but in general this scheme should work. Possible caveat is if you have other parts of your program that use/rely on sys.argv; if that's the case you might want to capture the initial values in another variable before modifying sys.argv and calling the subsetter, then re-set it to the initial values after.

I think that should be a pythonic way to do it properly:
from fontTools import subset
subsetter = subset.Subsetter()
subsetter.populate(unicodes=["U+0020", "U+0021"])
subsetter.subset(font)
While font is your TTFont and you might need to check the documentation for how to exactly pass in the the list of unicodes. I didn’t test this exact code, but I tested it with subsetter.populate(glyphs=["a", "b"]) which does a similar job, but with glyphNames instead. The populate method can take these arguments as documented: populate(self, glyphs=[], gids=[], unicodes=[], text='')
I found a clue to that in this discussion.

unix tools to parse file on the command line

I have a python script that looks like the following that I want to transform:
import sys
# more imports
''' some comments '''
class Foo:
def _helper1():
etc.
def _helper2():
etc.
def foo1():
d = { a:3, b:2, c:4 }
etc.
def foo2():
d = { a:2, b:2, c:7 }
etc.
def foo3():
d = { a:3, b:2, c:7 }
etc.
etc.
if __name__ == "__main__":
etc.
I'd like to be able to parse JUST the foo*() functions and keep just the ones that have certain attributes, like d={a:3, b:2}. Obviously keep everything else that is non foo*() so the transformation will still run. The foo*() will be well defined though d may have different key, values.
Is there some set of unix tools I can use to do this through chaining? I can use grep to identify foo but how would I scan the next couple of lines to apply the keep or reject portion of my logic?
edit: note, i'm trying to see if it's reasonable to do this with command line tools before writing a custom parser. i know how to write the parser.

You haven't specified your problem with enough detail to recommend a particular solution, but there are many tools and techniques that will handle this type of problem.
As I understand this, you want to
Identify the boundaries of your class
Identify the methods within the class
Remove the methods lacking certain textual features
My general approach to this would be a script with logic based on "open old and new files; write everything you read from the old file, unless ."
You can blithely write things until you get into the class (one flag) and start finding methods (another flag). The one slightly tricky part here is the buffering: you need to keep the text of each method until you know whether it contains the target text. You can either read in the entire method (minor parsing task) and search that for the target, or simply hold lines of text until you find the target (then return to your write-it-all mode) or run off the end (empty the buffer without writing).
This is simply enough that you could cobble a script in any handy language to handle the problem. UNIX provides a variety of tools; in that paradigm I'd use awk. However, I recommend a read-friendly tool, such as Python or Perl. If you want to move formally into the world of parsing, I suggest a trivial Lex-YACC couplet: you can have very simple tokens (perhaps even complete lines, depending on your coding style) and actions (write line, hold line, set status flag, flush buffer, etc.).
Is that enough to get you moving?

Ignore the rest of the python file

My python scripts often contain "executable code" (functions, classes, &c) in the first part of the file and "test code" (interactive experiments) at the end.
I want python, py_compile, pylint &c to completely ignore the experimental stuff at the end.
I am looking for something like #if 0 for cpp.
How can this be done?
Here are some ideas and the reasons they are bad:
sys.exit(0): works for python but not py_compile and pylint
put all experimental code under def test():: I can no longer copy/paste the code into a python REPL because it has non-trivial indent
put all experimental code between lines with """: emacs no longer indents and fontifies the code properly
comment and uncomment the code all the time: I am too lazy (yes, this is a single key press, but I have to remember to do that!)
put the test code into a separate file: I want to keep the related stuff together
PS. My IDE is Emacs and my python interpreter is pyspark.

Use ipython rather than python for your REPL It has better code completion and introspection and when you paste indented code it can automatically "de-indent" the pasted code.
Thus you can put your experimental code in a test function and then paste in parts without worrying and having to de-indent your code.
If you are pasting large blocks that can be considered individual blocks then you will need to use the %paste or %cpaste magics.
eg.
for i in range(3):
i *= 2
# with the following the blank line this is a complete block
print(i)
With a normal paste:
In [1]: for i in range(3):
...: i *= 2
...:
In [2]: print(i)
4
Using %paste
In [3]: %paste
for i in range(10):
i *= 2
print(i)
## -- End pasted text --
0
2
4
In [4]:
PySpark and IPython
It is also possible to launch PySpark in IPython, the enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To use IPython, set the IPYTHON variable to 1 when running bin/pyspark:1
$ IPYTHON=1 ./bin/pyspark

Unfortunately, there is no widely (or any) standard describing what you are talking about, so getting a bunch of python specific things to work like this will be difficult.
However, you could wrap these commands in such a way that they only read until a signifier. For example (assuming you are on a unix system):
cat $file | sed '/exit(0)/q' |sed '/exit(0)/d'
The command will read until 'exit(0)' is found. You could pipe this into your checkers, or create a temp file that your checkers read. You could create wrapper executable files on your path that may work with your editors.
Windows may be able to use a similar technique.
I might advise a different approach. Separate files might be best. You might explore iPython notebooks as a possible solution, but I'm not sure exactly what your use case is.

Follow something like option 2.
I usually put experimental code in a main method.
def main ():
*experimental code goes here *
Then if you want to execute the experimental code just call the main.
main()

With python-mode.el mark arbitrary chunks as section - for example via py-sectionize-region.
Than call py-execute-section.
Updated after comment:
python-mode.el is delivered by melpa.
M-x list-packages RET
Look for python-mode - the built-in python.el provides 'python, while python-mode.el provides 'python-mode.
Developement just moved hereto: https://gitlab.com/python-mode-devs/python-mode

I think the standard ('Pythonic') way to deal with this is to do it like so:
class MyClass(object):
...
def my_function():
...
if __name__ == '__main__':
# testing code here
Edit after your comment
I don't think what you want is possible using a plain Python interpreter. You could have a look at the IEP Python editor (website, bitbucket): it supports something like Matlab's cell mode, where a cell can be defined with a double comment character (##):
## main code
class MyClass(object):
...
def my_function():
...
## testing code
do_some_testing_please()
All code from a ##-beginning line until either the next such line or end-of-file constitutes a single cell.
Whenever the cursor is within a particular cell and you strike some hotkey (default Ctrl+Enter), the code within that cell is executed in the currently running interpreter. An additional feature of IEP is that selected code can be executed with F9; a pretty standard feature but the nice thing here is that IEP will smartly deal with whitespace, so just selecting and pasting stuff from inside a method will automatically work.

I suggest you use a proper version control system to keep the "real" and the "experimental" parts separated.
For example, using Git, you could only include the real code without the experimental parts in your commits (using add -p), and then temporarily stash the experimental parts for running your various tools.
You could also keep the experimental parts in their own branch which you then rebase on top of the non-experimental parts when you need them.

Another possibility is to put tests as doctests into the docstrings of your code, which admittedly is only practical for simpler cases.
This way, they are only treated as executable code by the doctest module, but as comments otherwise.

Use entire scope as input in plugin in Sublime Text 2

One of the nice things about Textmate was the ability to pipe the contents of an entire scope into a command, like so:
You could then specify the scope to be used, such as meta.class.python or whatever.
I'm trying to write a small plugin that will pipe the entire current scope in as the input for the plugin (for example (not exactly what I'm trying to do, but close), a function that lets you comment out an entire Python class without selecting it all)
Using the current selection(s) as input is quite easy:
import sublime, sublime_plugin
import re
class DoStuffWithSelection(sublime_plugin.TextCommand):
def run(self, edit):
for region in self.view.sel():
if not region.empty():
changed = region # Do something to the selection
self.view.replace(edit, region, changed) # Replace the selection
I've scoured the Sublime Text plugin API for some way to do something like for region in self.view.scope(), but without success.
Is there a way, then, to use the contents of the current scope under the cursor as input for a plugin function? Or, even better, a way to use the entire scope if there isn't a selection, but use the selection if there is one.
Thanks!

If you want to get text that you select, the following code snippet is an example.
if not region.empty():
selectText = self.view.substr(region)
...
If you want to get text where the cursor is located, the following code snippet is an example.
if region.empty():
lineRegion = self.view.line(region)
lineText = self.view.substr(lineRegion)
...
To get more information, see http://net.tutsplus.com/tutorials/python-tutorials/how-to-create-a-sublime-text-2-plugin/ and http://www.sublimetext.com/docs/api-reference.

Parse a .py file, read the AST, modify it, then write back the modified source code

I want to programmatically edit python source code. Basically I want to read a .py file, generate the AST, and then write back the modified python source code (i.e. another .py file).
There are ways to parse/compile python source code using standard python modules, such as ast or compiler. However, I don't think any of them support ways to modify the source code (e.g. delete this function declaration) and then write back the modifying python source code.
UPDATE: The reason I want to do this is I'd like to write a Mutation testing library for python, mostly by deleting statements / expressions, rerunning tests and seeing what breaks.

Pythoscope does this to the test cases it automatically generates as does the 2to3 tool for python 2.6 (it converts python 2.x source into python 3.x source).
Both these tools uses the lib2to3 library which is an implementation of the python parser/compiler machinery that can preserve comments in source when it's round tripped from source -> AST -> source.
The rope project may meet your needs if you want to do more refactoring like transforms.
The ast module is your other option, and there's an older example of how to "unparse" syntax trees back into code (using the parser module). But the ast module is more useful when doing an AST transform on code that is then transformed into a code object.
The redbaron project also may be a good fit (ht Xavier Combelle)

The builtin ast module doesn't seem to have a method to convert back to source. However, the codegen module here provides a pretty printer for the ast that would enable you do do so.
eg.
import ast
import codegen
expr="""
def foo():
print("hello world")
"""
p=ast.parse(expr)
p.body[0].body = [ ast.parse("return 42").body[0] ] # Replace function body with "return 42"
print(codegen.to_source(p))
This will print:
def foo():
return 42
Note that you may lose the exact formatting and comments, as these are not preserved.
However, you may not need to. If all you require is to execute the replaced AST, you can do so simply by calling compile() on the ast, and execing the resulting code object.

Took a while, but Python 3.9 has this:
https://docs.python.org/3.9/whatsnew/3.9.html#ast
https://docs.python.org/3.9/library/ast.html#ast.unparse
ast.unparse(ast_obj)
Unparse an ast.AST object and generate a string with code that would produce an equivalent ast.AST object if parsed back with ast.parse().

In a different answer I suggested using the astor package, but I have since found a more up-to-date AST un-parsing package called astunparse:
>>> import ast
>>> import astunparse
>>> print(astunparse.unparse(ast.parse('def foo(x): return 2 * x')))
def foo(x):
return (2 * x)
I have tested this on Python 3.5.

You might not need to re-generate source code. That's a bit dangerous for me to say, of course, since you have not actually explained why you think you need to generate a .py file full of code; but:
If you want to generate a .py file that people will actually use, maybe so that they can fill out a form and get a useful .py file to insert into their project, then you don't want to change it into an AST and back because you'll lose all formatting (think of the blank lines that make Python so readable by grouping related sets of lines together) (ast nodes have lineno and col_offset attributes) comments. Instead, you'll probably want to use a templating engine (the Django template language, for example, is designed to make templating even text files easy) to customize the .py file, or else use Rick Copeland's MetaPython extension.
If you are trying to make a change during compilation of a module, note that you don't have to go all the way back to text; you can just compile the AST directly instead of turning it back into a .py file.
But in almost any and every case, you are probably trying to do something dynamic that a language like Python actually makes very easy, without writing new .py files! If you expand your question to let us know what you actually want to accomplish, new .py files will probably not be involved in the answer at all; I have seen hundreds of Python projects doing hundreds of real-world things, and not a single one of them needed to ever writer a .py file. So, I must admit, I'm a bit of a skeptic that you've found the first good use-case. :-)
Update: now that you've explained what you're trying to do, I'd be tempted to just operate on the AST anyway. You will want to mutate by removing, not lines of a file (which could result in half-statements that simply die with a SyntaxError), but whole statements — and what better place to do that than in the AST?

Parsing and modifying the code structure is certainly possible with the help of ast module and I will show it in an example in a moment. However, writing back the modified source code is not possible with ast module alone. There are other modules available for this job such as one here.
NOTE: Example below can be treated as an introductory tutorial on the usage of ast module but a more comprehensive guide on using ast module is available here at Green Tree snakes tutorial and official documentation on ast module.
Introduction to ast:
>>> import ast
>>> tree = ast.parse("print 'Hello Python!!'")
>>> exec(compile(tree, filename="<ast>", mode="exec"))
Hello Python!!
You can parse the python code (represented in string) by simply calling the API ast.parse(). This returns the handle to Abstract Syntax Tree (AST) structure. Interestingly you can compile back this structure and execute it as shown above.
Another very useful API is ast.dump() which dumps the whole AST in a string form. It can be used to inspect the tree structure and is very helpful in debugging. For example,
On Python 2.7:
>>> import ast
>>> tree = ast.parse("print 'Hello Python!!'")
>>> ast.dump(tree)
"Module(body=[Print(dest=None, values=[Str(s='Hello Python!!')], nl=True)])"
On Python 3.5:
>>> import ast
>>> tree = ast.parse("print ('Hello Python!!')")
>>> ast.dump(tree)
"Module(body=[Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Str(s='Hello Python!!')], keywords=[]))])"
Notice the difference in syntax for print statement in Python 2.7 vs. Python 3.5 and the difference in type of AST node in respective trees.
How to modify code using ast:
Now, let's a have a look at an example of modification of python code by ast module. The main tool for modifying AST structure is ast.NodeTransformer class. Whenever one needs to modify the AST, he/she needs to subclass from it and write Node Transformation(s) accordingly.
For our example, let's try to write a simple utility which transforms the Python 2 , print statements to Python 3 function calls.
Print statement to Fun call converter utility: print2to3.py:
#!/usr/bin/env python
'''
This utility converts the python (2.7) statements to Python 3 alike function calls before running the code.
USAGE:
python print2to3.py <filename>
'''
import ast
import sys
class P2to3(ast.NodeTransformer):
def visit_Print(self, node):
new_node = ast.Expr(value=ast.Call(func=ast.Name(id='print', ctx=ast.Load()),
args=node.values,
keywords=[], starargs=None, kwargs=None))
ast.copy_location(new_node, node)
return new_node
def main(filename=None):
if not filename:
return
with open(filename, 'r') as fp:
data = fp.readlines()
data = ''.join(data)
tree = ast.parse(data)
print "Converting python 2 print statements to Python 3 function calls"
print "-" * 35
P2to3().visit(tree)
ast.fix_missing_locations(tree)
# print ast.dump(tree)
exec(compile(tree, filename="p23", mode="exec"))
if __name__ == '__main__':
if len(sys.argv) <=1:
print ("\nUSAGE:\n\t print2to3.py <filename>")
sys.exit(1)
else:
main(sys.argv[1])
This utility can be tried on small example file, such as one below, and it should work fine.
Test Input file : py2.py
class A(object):
def __init__(self):
pass
def good():
print "I am good"
main = good
if __name__ == '__main__':
print "I am in main"
main()
Please note that above transformation is only for ast tutorial purpose and in real case scenario one will have to look at all different scenarios such as print " x is %s" % ("Hello Python").

If you are looking at this in 2019, then you can use this libcst
package. It has syntax similar to ast. This works like a charm, and preserve the code structure. It's basically helpful for the project where you have to preserve comments, whitespace, newline etc.
If you don't need to care about the preserving comments, whitespace and others, then the combination of ast and astor works well.

I've created recently quite stable (core is really well tested) and extensible piece of code which generates code from ast tree: https://github.com/paluh/code-formatter .
I'm using my project as a base for a small vim plugin (which I'm using every day), so my goal is to generate really nice and readable python code.
P.S.
I've tried to extend codegen but it's architecture is based on ast.NodeVisitor interface, so formatters (visitor_ methods) are just functions. I've found this structure quite limiting and hard to optimize (in case of long and nested expressions it's easier to keep objects tree and cache some partial results - in other way you can hit exponential complexity if you want to search for best layout). BUT codegen as every piece of mitsuhiko's work (which I've read) is very well written and concise.

One of the other answers recommends codegen, which seems to have been superceded by astor. The version of astor on PyPI (version 0.5 as of this writing) seems to be a little outdated as well, so you can install the development version of astor as follows.
pip install git+https://github.com/berkerpeksag/astor.git#egg=astor
Then you can use astor.to_source to convert a Python AST to human-readable Python source code:
>>> import ast
>>> import astor
>>> print(astor.to_source(ast.parse('def foo(x): return 2 * x')))
def foo(x):
return 2 * x
I have tested this on Python 3.5.

Unfortunately none of the answers above actually met both of these conditions
Preserve the syntactical integrity for the surrounding source code (e.g keeping comments, other sorts of formatting for the rest of the code)
Actually use AST (not CST).
I've recently written a small toolkit to do pure AST based refactorings, called refactor. For example if you want to replace all placeholders with 42, you can simply write a rule like this;
class Replace(Rule):
def match(self, node):
assert isinstance(node, ast.Name)
assert node.id == 'placeholder'
replacement = ast.Constant(42)
return ReplacementAction(node, replacement)
And it will find all acceptable nodes, replace them with the new nodes and generate the final form;
--- test_file.py
+++ test_file.py
## -1,11 +1,11 ##
def main():
- print(placeholder * 3 + 2)
- print(2 + placeholder + 3)
+ print(42 * 3 + 2)
+ print(2 + 42 + 3)
# some commments
- placeholder # maybe other comments
+ 42 # maybe other comments
if something:
other_thing
- print(placeholder)
+ print(42)
if __name__ == "__main__":
main()

We had a similar need, which wasn't solved by other answers here. So we created a library for this, ASTTokens, which takes an AST tree produced with the ast or astroid modules, and marks it with the ranges of text in the original source code.
It doesn't do modifications of code directly, but that's not hard to add on top, since it does tell you the range of text you need to modify.
For example, this wraps a function call in WRAP(...), preserving comments and everything else:
example = """
def foo(): # Test
'''My func'''
log("hello world") # Print
"""
import ast, asttokens
atok = asttokens.ASTTokens(example, parse=True)
call = next(n for n in ast.walk(atok.tree) if isinstance(n, ast.Call))
start, end = atok.get_text_range(call)
print(atok.text[:start] + ('WRAP(%s)' % atok.text[start:end]) + atok.text[end:])
Produces:
def foo(): # Test
'''My func'''
WRAP(log("hello world")) # Print
Hope this helps!

A Program Transformation System is a tool that parses source text, builds ASTs, allows you to modify them using source-to-source transformations ("if you see this pattern, replace it by that pattern"). Such tools are ideal for doing mutation of existing source codes, which are just "if you see this pattern, replace by a pattern variant".
Of course, you need a program transformation engine that can parse the language of interest to you, and still do the pattern-directed transformations. Our DMS Software Reengineering Toolkit is a system that can do that, and handles Python, and a variety of other languages.
See this SO answer for an example of a DMS-parsed AST for Python capturing comments accurately. DMS can make changes to the AST, and regenerate valid text, including the comments. You can ask it to prettyprint the AST, using its own formatting conventions (you can changes these), or do "fidelity printing", which uses the original line and column information to maximally preserve the original layout (some change in layout where new code is inserted is unavoidable).
To implement a "mutation" rule for Python with DMS, you could write the following:
rule mutate_addition(s:sum, p:product):sum->sum =
" \s + \p " -> " \s - \p"
if mutate_this_place(s);
This rule replace "+" with "-" in a syntactically correct way; it operates on the AST and thus won't touch strings or comments that happen to look right. The extra condition on "mutate_this_place" is to let you control how often this occurs; you don't want to mutate every place in the program.
You'd obviously want a bunch more rules like this that detect various code structures, and replace them by the mutated versions. DMS is happy to apply a set of rules. The mutated AST is then prettyprinted.

I used to use baron for this, but have now switched to parso because it's up to date with modern python. It works great.
I also needed this for a mutation tester. It's really quite simple to make one with parso, check out my code at https://github.com/boxed/mutmut

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.