Sensible Python source line wrapping for printout

I am working on a LaTeX document that will require typesetting significant amounts of Python source code. I'm using Pygments (the Python module, not the online demo) to encapsulate this Python in LaTeX, which works well except in the case of long individual lines, which simply continue off the page. I could manually wrap these lines, except that this just doesn't seem like an elegant solution to me, and I prefer spending time puzzling over crazy automated solutions to repetitive tasks.
What I would like is some way of processing the Python source code to wrap the lines to a certain maximum character length, while preserving functionality. I've had a play around with some Python, and the closest I've come is inserting \\\n at the last whitespace before the maximum line length - but of course, if this ends up in strings or comments, things go wrong. Quite frankly, I'm not sure how to approach this problem.
So, is anyone aware of a module or tool that can process source code so that no lines exceed a certain length - or at least a good way to start to go about coding something like that?

You might want to extend your current approach a bit, but using the tokenize module from the standard library to determine where to put your line breaks. That way you can see the actual tokens (COMMENT, STRING, etc.) of your source code rather than just the whitespace-separated words.
Here is a short example of what tokenize can do:
>>> from cStringIO import StringIO
>>> from tokenize import tokenize
>>>
>>> python_code = '''
... def foo(): # This is a comment
...     print 'foo'
... '''
>>>
>>> fp = StringIO(python_code)
>>>
>>> tokenize(fp.readline)
1,0-1,1: NL '\n'
2,0-2,3: NAME 'def'
2,4-2,7: NAME 'foo'
2,7-2,8: OP '('
2,8-2,9: OP ')'
2,9-2,10: OP ':'
2,11-2,30: COMMENT '# This is a comment'
2,30-2,31: NEWLINE '\n'
3,0-3,4: INDENT '    '
3,4-3,9: NAME 'print'
3,10-3,15: STRING "'foo'"
3,15-3,16: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
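Building on that, here is a rough sketch of how tokenize could drive the wrapping: break a long line only at whitespace that does not fall inside a STRING or COMMENT token. It uses the Python 3 tokenize API (generate_tokens and the TokenInfo attributes), the 60-character limit and the continuation indent are arbitrary choices, and multi-line strings are not handled; treat it as a starting point rather than a finished tool.
import io
import tokenize

MAX_LEN = 60  # illustrative target width

def protected_columns(tokens_on_line):
    # Columns that lie inside a STRING or COMMENT token on this line.
    cols = set()
    for tok in tokens_on_line:
        if tok.type in (tokenize.STRING, tokenize.COMMENT):
            cols.update(range(tok.start[1], tok.end[1]))
    return cols

def wrap_source(source, max_len=MAX_LEN):
    # Group tokens by the physical line they start on (multi-line strings
    # would need every line they span marked as well).
    by_line = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        by_line.setdefault(tok.start[0], []).append(tok)

    out = []
    for lineno, line in enumerate(source.splitlines(), 1):
        protected = protected_columns(by_line.get(lineno, []))
        while len(line) > max_len:
            # Last whitespace before the limit that is not inside a string/comment.
            cuts = [i for i in range(1, max_len)
                    if line[i] == ' ' and i not in protected]
            if not cuts:
                break  # nowhere safe to split; leave the line long
            cut = cuts[-1]
            out.append(line[:cut] + ' \\')
            indent = '        '  # arbitrary continuation indent
            # Re-map the protected columns onto the shortened continuation line.
            protected = {c - (cut + 1) + len(indent)
                         for c in protected if c > cut}
            line = indent + line[cut + 1:]
        out.append(line)
    return '\n'.join(out)

print(wrap_source("x = 'a short string' + 'another string'  # a trailing comment\n"))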

I use the listings package in LaTeX to insert source code; it does syntax highlighting, line breaks et al.
Put the following in your preamble:
\usepackage{listings}
%\lstloadlanguages{Python} % Load only these languages
\newcommand{\MyHookSign}{\hbox{\ensuremath\hookleftarrow}}
\lstset{
% Language
language=Python,
% Basic setup
%basicstyle=\footnotesize,
basicstyle=\scriptsize,
keywordstyle=\bfseries,
commentstyle=,
% Looks
frame=single,
% Linebreaks
breaklines,
prebreak={\space\MyHookSign},
% Line numbering
tabsize=4,
stepnumber=5,
numbers=left,
firstnumber=1,
%numberstyle=\scriptsize,
numberstyle=\tiny,
% Above and beyond ASCII!
extendedchars=true
}
The package has hooks for inline code, including entire files, showing it as figures, ...

I'd check out the reformat tool in an editor like NetBeans.
When you reformat Java, it properly fixes the lengths of lines both inside and outside of comments; if the same algorithm were applied to Python, it would work.
For Java it allows you to set any wrapping width and a bunch of other parameters. I'd be pretty surprised if that didn't exist either natively or as a plugin.
Can't tell for sure just from the description, but it's worth a try:
http://www.netbeans.org/features/python/

Related

Removing a control character using Python

I have a script that processes the output of a command (the aws help cli command).
I step through the output line-by-line and don't start the actual real parsing until I encounter the text "AVAILABLE COMMANDS" at which point I set a flag to true and start further processing on each line.
I've had this working fine - BUT on Ubuntu we encounter a problem, which is this:
The CLI highlights the text in a way I have not seen before:
The output is very long, so I've grep'd the particular line in question - see below:
# aws ec2 help | egrep '^A'
AVAILABLE COMMANDS
# aws ec2 help | egrep '^A' | cat -vet
A^HAV^HVA^HAI^HIL^HLA^HAB^HBL^HLE^HE C^HCO^HOM^HMM^HMA^HAN^HND^HDS^HS$
What I haven't seen before is that each letter that is highlighted is in the format X^HX.
I'd like to apply a simple transformation of the type X^HX --> X (for all a-zA-Z).
What have I tried so far:
Well, my workaround is this - first I remove control characters like this:
String = re.sub(r'[\x00-\x1f\x7f-\x9f]','',String)
but I still have to search for 'AAVVAAIILLAABBLLEE' which is totally ugly. I considered using a further regex to turn doubles to singles but that will catch true doubles and get messy.
I started writing a function with an iteration across a constructed list of alpha characters to translate as described, and I used hexdump to try to figure out the exact \x code of the control characters in question but could not get it working - I could remove H but not the ^.
I really don't want to use any additional modules because I want to make this available to people without them having to install extras. In conclusion, I have a workaround that is quite ugly, but I'm sure someone must know a quick and easy way to do this translation. It's odd that it only seems to show up on Ubuntu.
After looking at this a little further I was able to put in place a solution:
import re
from string import ascii_lowercase
from string import ascii_uppercase

def RemoveUbuntuHighlighting(String):
    # Collapse each "X<backspace>X" pair down to a single "X".
    for Char in ascii_uppercase + ascii_lowercase:
        Match = Char + '\x08' + Char
        String = re.sub(Match, Char, String)
    return String
I'm still a little confounded to see characters highlighted in the format (X\x08X); the arrangement does seem to repeat the same information unnecessarily.
The other thing I would point out to anyone not familiar with reading hex dumps is that hexdump's default output groups the bytes into 16-bit words, so each pair of bytes appears swapped around with respect to the order in which they actually occur.
A much simpler and more reliable fix is to replace any character followed by a backspace and a duplicate of itself with just that character.
I have also augmented this to handle underlining, which uses the same mechanism (character, backspace, underscore).
String = re.sub(r'(.)\x08(\1|_)', r'\1', String)
Demo: https://ideone.com/yzwd2V
This highlighting was standard back when output was to a line printer; backspacing and printing the same character again would add pigmentation to produce boldface. (Backspacing and printing an underscore would produce underlining.)
Probably the AWS CLI can be configured to disable this by setting the TERM variable to something like dumb. There is also a utility col which can remove this formatting (try col -b; maybe see also colcrt). Though perhaps really the best solution would be to import the AWS Python code and extract the help message natively.
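For reference, here is a self-contained version of that substitution; the sample string below is fabricated to mimic the overstruck output (bold letters as X^HX, an underlined letter as x^H_):
import re

# Fabricated sample: "AV" rendered bold (X\x08X) and "x" underlined (x\x08_).
raw = 'A\x08AV\x08V x\x08_'
clean = re.sub(r'(.)\x08(\1|_)', r'\1', raw)
print(repr(clean))  # 'AV x'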

Does pep8 allow the usage of \t in a print

Is the usage of escaped characters such as \t allowed by PEP8 in something like print statements?
Is there a more idiomatic way to left indent some of the printout without importing non standard libraries?
Yeah, that's fine; it is a fundamental ASCII character, and PEP 8 would not deny its use, as it may be essential to your end result (say an API needed tabs or something). PEP 8 is all about styling your source code; I wouldn't consider a character in a string to be something that can be decreed by a style guide.
Though there is nothing wrong with using \t, you might want to use the textwrap module to allow your indented text to be displayed more naturally in your source code. As an alternative to msg = '\teggs\tmilk\tbread', you can write
import textwrap

def show_list():
    msg = """\
        eggs
        milk
        bread"""
    print(textwrap.indent(textwrap.dedent(msg), "\t"))
Then show_list() produces the output
        eggs
        milk
        bread
When you indent the definition of msg, the whitespace is part of the literal. dedent removes the common leading whitespace from each line of the string. The indent method then indents each line with, specifically, a tab character.
There is nothing wrong with using the tab character in a string, at all. See e.g. the Wikipedia link for some common usages. You may be confused by this PEP 8 rule:
Use 4 spaces per indentation level.
This is similar to Joe Iddon's answer. To be clear, writing text (not code, of course) is something different from writing code. Texts and their uses are very heterogeneous, so setting rules for how to format your text does not make much sense (as long as the text is not code).
But you also asked "Is there a more idiomatic way to left indent some of the printout without importing non standard libraries?"
Since Python 3.6 you can use formatted string literals to get additional spaces (indentation) in the strings you want to print. (If you're using Python 3.5 or lower, you can use str.format instead; see the sketch after the examples below.)
The usage is like this:
>>> text = "Hello World"
>>> print(f"\t{text}")
        Hello World
This is just a toy example, of course. F-Strings become more useful with more complex strings. If you don't have such complex strings, you can consider also using the arguments of print() statement like this, for example:
>>> print("Foo", "Bar", "Foo", "Bar", sep="\t\t") # doubled "\t" only for better displaying
Foo             Bar             Foo             Bar
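For the pre-3.6 alternative mentioned above, here is a minimal str.format sketch (the same toy strings, nothing project-specific):
# str.format works on Python 2.7 and 3.x and mirrors the f-string example above.
text = "Hello World"
print("\t{}".format(text))

# Named fields help once the template grows.
print("{name}\t{count}".format(name="eggs", count=12))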
But often it is simply quite enough to include the tab character in your string, e.g.: "Hello World!\tHow are you doing?\tThat's it.". As already said, don't do that with code (PEP-8), but in texts it is fine.
If you want to use a module for that (it is a built-in module), I recommend textwrap. See chepner's answer for more information on how to use it.

Own pretty print option in python script

I'm outputting a pretty huge XML structure to a file and I want the user to be able to enable/disable pretty printing.
I'm working with approximately 150 MB of data. When I tried xml.etree.ElementTree and built the tree structure from its element objects, it used an awful lot of memory, so I do this manually by storing raw strings and outputting them with .write(). My output sequence looks like this:
ofile.write(pretty_print(u'\
\t\t<LexicalEntry id="%s">\n\
\t\t\t<feat att="languageCode" val="cz"/>\n\
\t\t\t<Lemma>\n\
\t\t\t\t<FormRepresentation>\n\
\t\t\t\t\t<feat att="writtenForm" val="%s"/>\n\
\t\t\t\t</FormRepresentation>\n\
\t\t\t</Lemma>\n\
\t\t\t<Sense>%s\n' % (str(lex_id), word['word'], '' if word['pos']=='' else '\n\t\t\t\t<feat att="partOfSpeech" val="%s"/>' % word['pos'])))
Inside the .write() I call my function pretty_print which, depending on a command line option, SHOULD strip all tab and newline characters:
o_parser = OptionParser()
# ....
o_parser.add_option("-p", "--prettyprint", action="store_true", dest="pprint", default=False)
# ....
def pretty_print(string):
    if not options.pprint:
        return string.strip('\n\t')
    return string
I wrote 'should' because it does not; in this particular case it does not strip any of the characters.
BUT in this case, it works fine:
for ss in word['synsets']:
    ofile.write(pretty_print(u'\t\t\t\t<Sense synset="%s-synset"/>\n' % ss))
The first thing that came to my mind was that there might be some issue with the substitution, but when I print the passed string inside the pretty_print function it looks perfectly fine.
Any suggestions as to what might cause .strip() not to work?
Or if there is a better way to do this, I'll accept any advice.
Your issue is that str.strip() only removes from the beginning and end of a string.
You either want str.replace() to remove all instances, or to split it into lines and strip each line, if you want to remove them from the beginning and end of lines.
Also note that for your massive string, Python supports multi-line strings with triple quotes that will make it a lot easier to type out, and the old style string formatting with % has been superseded by str.format() - which you probably want to use instead in new code.
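Putting those two suggestions together, here is a rough sketch; the function signature and the shortened template are illustrative, not the OP's actual code:
def pretty_print(text, pprint_enabled):
    # replace() removes every tab and newline; strip() only touches the ends.
    if not pprint_enabled:
        return text.replace('\n', '').replace('\t', '')
    return text

# A triple-quoted template with str.format() placeholders is easier to read
# than one long backslash-continued string.
template = """\
\t\t<LexicalEntry id="{lex_id}">
\t\t\t<feat att="languageCode" val="cz"/>
"""

print(pretty_print(template.format(lex_id=42), pprint_enabled=False))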

Python: 2.6 and 3.1 string matching inconsistencies

I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.
I'm not going to post all my code since it may cause confusion.
Brief explanation:
I'm writing an XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It will print out anything that doesn't have a match.
I open the XML file and use a simple 'for' loop to read through the file line by line. If I match a regular expression for an 'application' (the XML has different 'application' nodes), then I add it to my dictionary, d, as the key. I perform an lxml.etree xpath() query on the title and store it as the value.
After I go through the whole thing, I iterate through my dictionary, d, and try to match the key to my value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title.
Python 3.1.2 has all matching items in the dictionary, so nothing is printed. In 2.6.4, every single value is printed (~600) in all. I can't figure out why my string comparisons aren't working.
Without further ado, here's the relevant code:
for i in d:
    if i[1:-2] != d[i].get('id'):
        print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
I slice the strings because the strings are different. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and newline. I also added the Xs and Ys to the print statements to make sure that there wasn't any kind of whitespace issues.
Here is an example line of output:
X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY
Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.
Edit:
So the problem seems to be the way that I am slicing.
In Python3,
if i[1:-2] != d[i].get('id'):
this comparison works fine.
In Python2,
if i[1:-3] != d[i].get('id'):
I have to change the offset by one.
Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').
Edit 2:
Updated with requested repr() information.
I added a small amount of code to produce the repr() info from the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character.
Python2:
'"9626-2008olympics_Prod-SH"\r\n'
'9626-2008olympics_Prod-SH'
Python3:
'"9626-2008olympics_Prod-SH"\n'
'9626-2008olympics_Prod-SH'
Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?
You are printing i[1:-3] but comparing i[1:-2] in the loop.
Very Important Question
Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!
Russell Borogrove is right.
Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3: I needed to eliminate three characters, the opening ", the closing ", and the \n.
In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].
I just added a manual check for the Python major version.
Here is the fixed code:
for i in d:
    # The keys in D contain quotes and a newline which need
    # to be removed. In v3, newline = 1 char and in v2,
    # newline = 2 char.
    if sys.version_info[0] < 3:
        if i[1:-3] != d[i].get('id'):
            print('%s %s' % (i[1:-3], d[i].get('id')))
    else:
        if i[1:-2] != d[i].get('id'):
            print('%s %s' % (i[1:-2], d[i].get('id')))
Thanks for the responses everyone! I appreciate your help.
repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.
Instead of
print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
do
print('%r %r' % (i, d[i].get('id')))
Note leaving off the [1:-3] so that you can see what is in i before you slice it.
Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":
How are you opening the file (two answers please, for Python 2 and 3). Are you running on Windows? Have you tried getting the repr() as I suggested?
Update after actual input finally provided by OP:
If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *x text files portably by using the "universal newlines" option ... open('datafile.txt', 'rU') on Python2 -- read this. Universal newlines mode is the default in Python3. Note that the Python3 docs say that you can use 'rU' also in Python3; this would save you having to test which Python version you are using.
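A minimal sketch of that portable reading approach (the file name comes from the example above; the key handling is only illustrative):
import sys

# 'rU' turns on universal newlines under Python 2; Python 3 text mode
# already translates '\r\n' to '\n' by default.
mode = 'rU' if sys.version_info[0] < 3 else 'r'

with open('datafile.txt', mode) as f:
    for line in f:
        key = line.strip().strip('"')  # same result on both versions
        print(key)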
I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?
for i in d:
    stripped = i.strip()
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))

IronPython: Is there an alternative to significant whitespace?

For rapidly changing business rules, I'm storing IronPython fragments in XML files. So far this has been working out well, but I'm starting to get to the point where I need more that just one-line expressions.
The problem is that XML and significant whitespace don't play well together. Before I abandon it for another language, I would like to know if IronPython has an alternative syntax.
IronPython doesn't have an alternate syntax. It's an implementation of Python, and Python uses significant indentation (all languages use significant whitespace, not sure why we talk about whitespace when it's only indentation that's unusual in the Python case).
>>> from __future__ import braces
  File "<stdin>", line 1
    from __future__ import braces
                                ^
SyntaxError: not a chance
All I want is something that will let my users write code like
Ummm... Don't do this. You don't actually want this. In the long run, this will cause endless little issues because you're trying to force too much content into an attribute.
Do this.
<Rule Name="Markup">
    <Formula>(Account.PricingLevel + 1) * .05</Formula>
</Rule>
You should try not to have significant, meaningful stuff in attributes. As a general XML design policy, you should use tags and save attributes for names and ID's and the like. When you look at well-done XSD's and DTD's, you see that attributes are used minimally.
Having the body of the rule in a separate tag (not an attribute) saves much pain. And it allows a tool to provide correct CDATA sections. Use a tool like Altova's XML Spy to assure that your tags have space preserved properly.
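For illustration, here is a short ElementTree sketch (the rule layout follows the example above; nothing in it is IronPython-specific) showing that a multi-line fragment kept as element text survives with its newlines and indentation intact:
import xml.etree.ElementTree as ET

# Example rules document using element bodies rather than attributes.
doc = ET.fromstring("""\
<Rules>
  <Rule Name="Markup">
    <Formula>
base = Account.PricingLevel + 1
result = base * .05
    </Formula>
  </Rule>
</Rules>
""")

for rule in doc.findall('Rule'):
    name = rule.get('Name')
    body = rule.findtext('Formula')
    # Newlines and leading whitespace inside the element are preserved verbatim.
    print(name, repr(body))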
I think you can set the xml:space="preserve" attribute or use a <![CDATA[ section to avoid other issues, for example with quotes and greater-than/equals signs.
Apart from the already mentioned CDATA sections, there's pindent.py which can, among other things, fix broken indentation based on comments a la #end if - to quote the linked file:
When called as "pindent -r" it assumes its input is a Python program with block-closing comments but with its indentation messed up, and outputs a properly indented version.
...
A "block-closing comment" is a comment of the form '# end <keyword>' where is the keyword that opened the block. If the opening keyword is 'def' or 'class', the function or class name may be repeated in the block-closing comment as well. Here is an example of a program fully augmented with block-closing comments:
def foobar(a, b):
    if a == b:
        a = a+1
    elif a < b:
        b = b-1
        if b > a: a = a-1
        # end if
    else:
        print 'oops!'
    # end if
# end def foobar
It's bundled with CPython, but if IronPython doesn't have it, just grab it from the repository.
