How to compact/strip multiple blank lines in source-code-preview - python

My doxygen documentation shows the source-code of each class member function.
My source code sometimes contains multiple blank lines between functions.
How can I get doxygen to compact or strip these multiple blank lines?
(instead of showing them in the preview)
ANSWER (from below)
The answer below pointed me in the right direction: INPUT_FILTER (doxygen documentation)
The INPUT_FILTER tag can be used to specify a program that doxygen should invoke to filter for each input file. Doxygen will invoke the filter program by executing (via popen()) the command:
<filter> <input-file>
where <filter> is the value of the INPUT_FILTER tag, and <input-file> is the name of an input file. Doxygen will then use the output that the filter program writes to standard output.
I quickly wrote a python script (I'm on Win7) that does the 'compacting':
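import re
import sys

# Doxygen passes the input file name as the only argument.
if len(sys.argv) != 2:
    sys.exit()
with open(sys.argv[1]) as f:
    original = f.read()
# Collapse any run of two or more blank lines into a single blank line.
compact = re.sub(r'\n\n\n+', '\n\n', original)
print(compact)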
Then I added it to the filter:
INPUT_FILTER = "python ../DoxyCompact.py"
This also opens up A LOT of possibilities to modify the source before doxygen reads it!

The only way I can think of doing this is to define some preprocessor to strip the multiple blank lines before doxygen uses the source code in the documentation. To define an action (or filter) to perform on the source files use the INPUT_FILTER configuration file option.
Warning: The following has not been tested.
From the question How can I replace multiple empty lines with a single empty line in bash? it seems that one can use
sed /^$/d
to strip blank lines (note that this expression deletes every empty line rather than collapsing a run of them into one), so in your configuration file, set
INPUT_FILTER = sed /^$/d
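To make the difference between deleting blank lines and compacting them concrete, here is a minimal Python sketch. The function names `compact_blank_lines` and `delete_blank_lines` are made up for illustration; the first uses the same regular expression as the script in the question, and the second only approximates what the `/^$/d` expression does.

```python
import re


def compact_blank_lines(text):
    """Collapse runs of two or more blank lines into one blank line."""
    return re.sub(r"\n\n\n+", "\n\n", text)


def delete_blank_lines(text):
    """Drop every empty line (an approximation of sed's /^$/d)."""
    kept = [line for line in text.splitlines(keepends=True)
            if line.strip("\r\n")]
    return "".join(kept)


sample = "def a():\n    pass\n\n\n\ndef b():\n    pass\n"
compacted = compact_blank_lines(sample)
deleted = delete_blank_lines(sample)
print(compacted)
print(deleted)
```

Which of the two is preferable depends on whether a single separating blank line should survive in the preview.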


How do I pass user-input filenames to ImageMagick safely?

I am generating an ImageMagick bash command using Python. Something like
import subprocess
input_file = "hello.png"
output_file = "world.jpg"
subprocess.run(["convert", input_file, output_file])
where there might be more arguments before input_file or output_file. My question is, if either of the filenames is user provided and the user provides a filename that can be parsed as a command line option for ImageMagick, isn't that unsafe?
If the filename starts with a dash, ImageMagick could indeed think that this is an option instead of a filename. Most programs - including, AFAIK, the ImageMagick command line tools - follow the convention that a double-dash (--) denotes the end of the options. If you do a
subprocess.run(["convert", "--", input_file, output_file])
you should be safe in this respect.
From the man page (and a few tests), convert requires an input file and an output file. If you only allow two tokens and if a file name is interpreted as an option then convert is going to miss at least one of the files, so you'll get an ugly message but you should be fine.
Otherwise you can prefix any file name that starts with - with ./ (except - itself, which is stdin or stdout depending on position), so that it becomes an unambiguous file path to the same file.
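The `./` prefix idea could look roughly like this. `protect_filename` is a hypothetical helper name, and the sketch only builds the argument list; it does not call ImageMagick and does not cover any other way a filename might be interpreted specially.

```python
def protect_filename(name):
    """Prefix names starting with '-' so they cannot be read as options.

    A lone '-' is left alone because it conventionally means stdin/stdout.
    """
    if name.startswith("-") and name != "-":
        return "./" + name
    return name


input_file = "-resize.png"   # example of a hostile-looking user input
output_file = "world.jpg"
command = ["convert", protect_filename(input_file), protect_filename(output_file)]
print(command)
```

The resulting list can then be handed to subprocess.run() as in the question.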

Unicode issues with tarfile.extractall() (Python 2.7)

I'm using python 2.7.6 on Windows and I'm using the tarfile module to extract a gzip-compressed archive. The mode option of tarfile.open() is set to "r:gz". After the open call, if I were to print the contents of the archive via tarfile.list(), I see the following directory in the list:
./静态分析 Part 1.v1/
However, after I call tarfile.extractall(), I don't see the above directory in the extracted list of files, instead I see this:
é™æ€åˆ†æž Part 1.v1/
If I were to extract the archive via 7zip, I see a directory with the same name as the first item above. So, clearly, the extractall() method is screwing up, but I don't know how to fix this.
I learned that tar doesn't retain the encoding information as part of the archive and treats filenames as raw byte sequences. So, the output I saw from tarfile.extractall() was simply the raw byte sequence that made up the file's name prior to compression. In order to get the extractall() method to recreate the original filenames, I discovered that you have to manually convert the names of the members of the TarFile object to the appropriate encoding before calling extractall(). In my case, the following did the trick:
modeltar = tarfile.open(zippath, mode="r:gz")
updatedMembers = []
for m in modeltar.getmembers():
    # Python 2: decode the raw byte name as UTF-8 before extracting
    m.name = unicode(m.name, 'utf-8')
    updatedMembers.append(m)
modeltar.extractall(members=updatedMembers, path=dbpath)
The above code is based on this superuser answer: https://superuser.com/a/190786/354642
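The "raw byte sequence" point can be seen without any archive. The following is a small Python 3 sketch (not the Python 2.7 code above). It assumes the name was stored as UTF-8, and it uses Latin-1 only as a stand-in for "some single-byte code page"; the code page actually involved on the asker's machine is not known.

```python
original = "静态分析 Part 1.v1"

# The bytes that end up in the archive when the name is stored as UTF-8.
raw = original.encode("utf-8")

# Reading those bytes with a single-byte encoding yields one character per
# byte, which is the kind of garbled name shown in the question.
wrong = raw.decode("latin-1")

# Decoding with the encoding that was used to store the name recovers it.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```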

Linking back to a source code file in Sphinx

I am documenting a Python module in Sphinx. I have a source code file full of examples of the use of my module. I'd like to reference this file. It is too long to inline as continuous code. Is there a way to create a link to the full source file, formatted in a code-friendly way (i.e: literal or with line numbers)?
If I get the question right, you want a link from your documentation to the original source file. You can do this by adding the sphinx.ext.viewcode extension to your conf file (under extensions entry). This will create a "source" link for every header of a class, method, function, etc. Clicking the link will open the original file highlighting the clicked item. More explanation here
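As a rough illustration, the relevant part of conf.py might look like the fragment below. The presence of sphinx.ext.autodoc is an assumption about the project, not something stated in the question; only the sphinx.ext.viewcode entry is what this answer is about.

```python
# conf.py (fragment)
extensions = [
    "sphinx.ext.autodoc",   # assumption: the module is documented via autodoc
    "sphinx.ext.viewcode",  # adds links from documented objects to their source
]
```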
literalinclude
.. literalinclude:: filename
From the Sphinx (v1.5.1) documentation:
Longer displays of verbatim text may be included by storing the example text in an external file containing only plain text. The file may be included using the literalinclude directive.
For example, to include the Python source file example.py, use:
.. literalinclude:: example.py
The file name is usually relative to the current file’s path. However, if it is absolute (starting with /), it is relative to the top source directory.
Tabs in the input are expanded if you give a tab-width option with the desired tab width.
Like code-block, the directive supports the linenos flag option to switch on line numbers, the lineno-start option to select the first line number, the emphasize-lines option to emphasize particular lines, and a language option to select a language different from the current file’s standard language. Example with options:
.. literalinclude:: example.rb
   :language: ruby
   :emphasize-lines: 12,15-18
   :linenos:
Include files are assumed to be encoded in the source_encoding. If the file has a different encoding, you can specify it with the encoding option:
.. literalinclude:: example.py
   :encoding: latin-1
The directive also supports including only parts of the file. If it is a Python module, you can select a class, function or method to include using the pyobject option:
.. literalinclude:: example.py
   :pyobject: Timer.start
This would only include the code lines belonging to the start() method in the Timer class within the file.
Alternately, you can specify exactly which lines to include by giving a lines option:
.. literalinclude:: example.py
   :lines: 1,3,5-10,20-
This includes the lines 1, 3, 5 to 10 and lines 20 to the last line.
Another way to control which part of the file is included is to use the start-after and end-before options (or only one of them). If start-after is given as a string option, only lines that follow the first line containing that string are included. If end-before is given as a string option, only lines that precede the first lines containing that string are included.
When specifying particular parts of a file to display, it can be useful to display exactly which lines are being presented. This can be done using the lineno-match option.
You can prepend and/or append a line to the included code, using the prepend and append option, respectively. This is useful e.g. for highlighting PHP code that doesn’t include the markers.
If you want to show the diff of the code, you can specify the old file by giving a diff option:
.. literalinclude:: example.py
   :diff: example.py.orig
This shows the diff between example.py and example.py.orig with unified diff format.
Python 3 does this. For example, the argparse docs link to the source code (near the top of the page, where it says "Source code"). You can see how they do it by looking at the source for the docs (linked from the first link, down at the bottom of the left-hand column).
I assume they're using standard Sphinx, but I am having a hard time finding :source: in their docs...
Update: the :source: role is defined here.

Python: how to capture output to a text file? (only 25 of 530 lines captured now)

I've done a fair amount of lurking on SO and a fair amount of searching and reading, but I must also confess to being a relative noob at programming in general. I am trying to learn as I go, and so I have been playing with Python's NLTK. In the script below, I can get everything to work, except it only writes what would be the first screen of a multi-screen output, at least that's how I am thinking about it.
Here's the script:
#! /usr/bin/env python
import nltk
# First we have to open and read the file:
thefile = open('all_no_id.txt')
raw = thefile.read()
# Second we have to process it with nltk functions to do what we want
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
# Now we can actually do stuff with it:
concord = text.concordance("cultural")
# Now to save this to a file
fileconcord = open('ccord-cultural.txt', 'w')
fileconcord.writelines(concord)
fileconcord.close()
And here's the beginning of the output file:
Building index...
Displaying 25 of 530 matches:
y .   The Baobab Tree : Stories of Cultural Continuity The continuity evident
regardless of ethnicity , and the cultural legacy of Africa as well . This Af
What am I missing here to get the entire 530 matches written to the file?
text.concordance(self, word, width=79, lines=25) seems to have other parameters, as per the manual.
I see no way to extract the size of the concordance index; however, the concordance printing code seems to have this part: lines = min(lines, len(offsets)), therefore you can simply pass sys.maxint as the last argument:
concord = text.concordance("cultural", 75, sys.maxint)
Added:
Looking at your original code now, I can't see how it could have worked before. text.concordance does not return anything, but outputs everything to stdout using print. Therefore, the easy option would be redirecting stdout to your file, like this:
import sys
....
# Open the file
fileconcord = open('ccord-cultural.txt', 'w')
# Save old stdout stream
tmpout = sys.stdout
# Redirect all "print" calls to that file
sys.stdout = fileconcord
# Call the method; its print output now goes to the file
text.concordance("cultural", 200, sys.maxint)
# Restore stdout before closing the file it was pointing to
sys.stdout = tmpout
# Close file
fileconcord.close()
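The same redirection idea can be written in Python 3 with contextlib.redirect_stdout, which restores stdout automatically. This is only a sketch: `print_report` is a hypothetical stand-in for any function that prints its results instead of returning them, and NLTK itself is not used here.

```python
import contextlib
import io


def print_report(count):
    """Hypothetical stand-in for a function that only prints its results."""
    for i in range(count):
        print("match", i)


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_report(5)          # everything printed here lands in the buffer

captured = buffer.getvalue()
# The captured text could then be written out, for example:
# with open("ccord-cultural.txt", "w") as f:
#     f.write(captured)
```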
Another option would be to use the respective classes directly and omit the Text wrapper. Just copy bits from here and combine them with bits from here and you are done.
Update:
I found this thread from the NLTK user group: write text.concordance output to a file.
It's from 2010, and states:
Documentation for the Text class says: "is intended to support
initial exploration of texts (via the interactive console). ... If you
wish to write a program which makes use of these analyses, then you
should bypass the Text class, and use the appropriate analysis
function or class directly instead."
If nothing has changed in the package since then, this may be the source of your problem.
--- previously ---
I don't see a problem with writing to the file using writelines():
file.writelines(sequence)
Write a sequence of strings to the file. The sequence can be any
iterable object producing strings, typically a list of strings. There
is no return value. (The name is intended to match readlines();
writelines() does not add line separators.)
Note the final parenthetical: did you examine the output file in different editors? Perhaps the data is there but is not being rendered correctly due to missing end-of-line separators?
Are you sure this part is generating the data you want to output?
concord = text.concordance("cultural")
I'm not familiar with nltk, so I'm just asking as part of eliminating possible sources for the problem.
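The writelines() behaviour quoted above is easy to check in isolation. This sketch uses an in-memory io.StringIO object instead of a real file, and the sample strings are invented.

```python
import io

lines = ["first", "second", "third"]

# writelines() writes the strings back to back, with no separators added.
plain = io.StringIO()
plain.writelines(lines)
joined = plain.getvalue()

# Adding the newline yourself gives one string per line.
separated = io.StringIO()
separated.writelines(line + "\n" for line in lines)
with_newlines = separated.getvalue()

print(repr(joined))
print(repr(with_newlines))
```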

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off or identified within the container with SGML tags. With python I can easily separate the children files. However I am having trouble writing the binary content back out as a binary file (say a gif or jpg). In the simplest case the container might have an embedded html file followed by a graphic that is called by the html. I am assuming that my problem is because I am reading the original .txt file using open(filename,'r'). But that seems the only option to find the sgml tags to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions but I am still struggling with the most basic questions. For example when I open the file with wordpad and scroll down to the section tagged as a gif I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the gif file begin? Does the header start with 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into python does it do anything to the binary code that has to be undone when it is read back out?
I can find the lines where the graphics begin:
filerefbin=file('myfile.txt','rb')
wholeFile=filerefbin.read()
import re
graphicReg=re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics=graphicReg.finditer(wholeFile)
graphicsTags=[]
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successfully gotten to the end of the embedded gif file. But I can't seem to write out the correct combination of things so that, when I double-click on h65803h6580301.gif after it has been isolated and saved, I get to see the graphic.
Interestingly, when I open the file in rb mode, the line endings appear to still be present even though they don't seem to have any effect in Notepad. So that is clearly one of my problems: I might need to readlines and join the lines together after stripping out the \n
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file. That is, since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how to do it.
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The module uu requires the use of temporary files for encoding and decoding. You can accomplish this without resorting to temporary files by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
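On the asker's question of where the data starts: in the uuencode format the `begin 644 h65803h6580301.gif` line is a header carrying the permission mode and the file name, the encoded data starts on the next line, each data line begins with a length character (`M` stands for 45 bytes), and the block is closed by an `end` line. A Python 3 sketch of decoding such a block entirely in memory with the standard binascii module follows; `uudecode_block` is a hypothetical helper name and the sample container is built in the code itself rather than taken from the asker's file.

```python
import binascii


def uudecode_block(text):
    """Decode the first uuencoded block in text; return (filename, data)."""
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith("begin "):
            # Header format: begin <mode> <filename>
            _, _mode, filename = line.split(" ", 2)
            break
    else:
        raise ValueError("no 'begin' line found")
    chunks = []
    for line in lines:
        if line.strip() == "end":
            break
        if line:
            chunks.append(binascii.a2b_uu(line))
    return filename, b"".join(chunks)


# Build a small container to test with (invented data, 45 bytes per line).
payload = bytes(range(100))
body = "".join(
    binascii.b2a_uu(payload[i:i + 45]).decode("ascii")
    for i in range(0, len(payload), 45)
)
sample = "<TEXT>\nbegin 644 demo.bin\n" + body + "`\nend\n</TEXT>\n"

name, data = uudecode_block(sample)
print(name, len(data))
```

The returned bytes can be written with open(name, 'wb'), so no intermediate text file is needed.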
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser, http://docs.python.org/library/sgmllib.html .
There is no example there, but all you need to do is set up do_ methods to handle the sgml tags you wish.
You need to open(filename,'rb') to open the file in binary mode. Be aware that this will cause python to give you confusing, two-byte line endings on some operating systems.
