I'm writing a simple webpage generator based on restructuredtext and I'd like to put tags into the document, like this.
=====
Title
=====
:author: Me
:tags: foo, bar
Here we go ...
What I want now:
get in possession of some kind of document tree
find the tags entry, read it, process it (like print the tags on the command line), remove it and render the remaining tree.
So I'd like to write compatible restructuredtext in case it's being compiled with something different than my program.
Can someone give me a hint? I found this one here http://svn.python.org/projects/external/docutils-0.6/docutils/examples.py showing in the internals method how to obtain the document (and therefore the dom tree), but is this the best way to go or would a regex based approach (find lines, remove them) be a lot easier? Working with the tree would also involve the conversion tree → document and so on.
There are tools that can do this for you. See http://docutils.sourceforge.net/docs/user/links.html
I think I have a nice solution for both problems. First, the core.py file in the docutils distribution shows how to obtain the doctree and how to write it (using a html writer for instance), see publish_from_doctree and publish_doctree. Then, there is docutils.nodes.SparseNodeVisitor which one can subclass and overwrite methods like visit_field to manipulate the document tree in various ways.
Related
LateX and Python-Sphinx are word-processors, transforming formatted content towards a target document. The usual workflow when working with large documents is to split them in smaller files (sections, chapters) and use a specific file as a table of contents (main.tex in most Latex projects, index.rst in Sphinx).
In Latex, it is possible to stop the processing of a specific file by using a \endinput command (then, the processor will move on to the next section or chapter).
I cannot find such a directive in Python-Sphinx's documentation. Does it exists? If not, is there a way to implement this behavior?
None that I know of. But you can do the inverse by including what you want, which would have the same effect as excluding what you don't want, by using the ifconfig Sphinx extension. You might have to indent what you want.
I am trying to build a small program in which I open a docx document and replace characters by others, to do some old school caesar-style encrypting, after checking the documentation: [ https://python-docx.readthedocs.io ] I am afraid I can't find the object methods and attributes, the documentation just kind-of explains how to do certain stuff like creating paragraphs and sections but I can't find anything on retrieving document data and parsing. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
paragraph.append(i)
for i in paragraph:
for y in i:
y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.
The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
I think I found something useful in case future readers might be interested. The problem with python-docx is I could get paragraphs individually and it would take a lot of time. I don't even know if titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files, it integrates with python-docx, or at least that's what the short documentation says. But what I can do, is save my docx file to PDF and use:
text = textract.process(
'path/to/norwegian.pdf',
method='pdftofile',
language='nor',
)
This allows you to get all the text as a string and save it preserving the layout of the pdf. Haven't tested it yet, will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package
I'm documenting my Python classes using Sphinx, and sometimes I want to give my parameters quite long descriptions to explain something in details. Unfortunately, Sphinx generates ugly output for me which wastes a lot of space and breaks the whole page appearance:
It can be seen that Sphinx creates a table, then puts "Parameters" header to the left cell, and the actual list of parameters to the right cell. But there should be way to avoid creating this table completely. After playing with the page DOM tree I finally can show that I want to achieve:
Is there a built-in way to do this or I'd have to create a PR to Sphinx theme or Sphinx itself?
After posting an issue to Sphinx bug-tracker and having no response, I've decided to roll-out my own solution (better say, hack). To achieve the look I want, I have written a simple Sphinx extension which post-processes generated HTML code. It can be found on PyPI:
https://pypi.python.org/pypi/sphinxcontrib.divparams
https://pythonhosted.org/sphinxcontrib.divparams/
This doesn't seem to be the best way to solve the issue, but the behaviour I wanted to change is deeply hard-coded in docutils, not Sphinx.
I am working on a project (in Python) that needs formatted, editable output. Since the end-user isn't going to be technically proficient, the output needs to be in a word processor editable format. The formatting is complex (bullet points, paragraphs, bold face, etc).
Is there a way to generate such a report using Python? I feel like there should be a way to do this using Microsoft Word/OpenOffice templates and Python, but I can't find anything advanced enough to get good formatting. Any suggestions?
A little known, and slightly evil fact: If you create an HTML file, and stick a .doc extension on it, Word will open it as a Word document, and most users will be none the wiser.
Except maybe a very technical person will say, my this is a small Word file! :)
Use the Python Docx module for this - 100% Python, tables, images, document properties, headings, paragraphs, and more.
" The formatting is complex(bullet points, paragraphs, bold face, etc), "
Use RST.
It's trivial to produce, since it's plain text.
It's trivial to edit, since it's plain text with a few extra characters to provide structural information.
It formats nicely using a bunch of tools.
I know there is an odtwriter for docutils. You could generate your output as reStructuredText and feed it to odtwriter or look into what odtwriter is using on the backend to generate the ODT and use that.
(I'd probably go with generating rst output and then hacking odtwriter to output the stuff I want (and contribute the fixes back to the project), because that's probably a whole lot easier that trying to render your stuff to ODT directly.)
I've used xlwt to create Excel documents using python, but I haven't needed to write word files yet. I've found this package, OOoPy, but I haven't used it.
Also you might want to try outputting html files and having the users open them in Word.
You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>>from pyqt4 import QtGui
# Create a document object
>>>doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>>cursor = QtGui.QTextCursor(doc)
# Insert some text
>>>cursor.insertText('Hello world')
# Create a writer to save the document
>>>writer = QtGui.QTextDocumentWriter()
>>>writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>>odf_format = writer.supportedDocumentFormats()[1]
>>>writer.setFormat(odf_format)
>>>writer.setFileName('hello_world.odt')
>>>writer.write(doc) # Return True if successful
True
QTextCursor also can insert tables, frames, blocks, images. More information at:
http://qt-project.org/doc/qt-4.8/qtextcursor.html
As a bonus, you also can print to a pdf file by using QPrinter.
I think OpenOffice has some Python bindings - you should be able to write OO macros in Python.
But I would use HTML instead - Word and OO.org are rather good at editing it and you can write it from Python easily (although Word saves a lot of mess which could complicate parsing it by your Python app).
I have two chunks of text that I would like to compare and see which words/lines have been added/removed/modified in Python (similar to a Wiki's Diff Output).
I have tried difflib.HtmlDiff but it's output is less than pretty.
Is there a way in Python (or external library) that would generate clean looking HTML of the diff of two sets of text chunks? (not just line level, but also word/character modifications within a line)
There's diff_prettyHtml() in the diff-match-patch library from Google.
Generally, if you want some HTML to render in a prettier way, you do it by adding CSS.
For instance, if you generate the HTML like this:
import difflib
import sys
fromfile = "xxx"
tofile = "zzz"
fromlines = open(fromfile, 'U').readlines()
tolines = open(tofile, 'U').readlines()
diff = difflib.HtmlDiff().make_file(fromlines,tolines,fromfile,tofile)
sys.stdout.writelines(diff)
then you get green backgrounds on added lines, yellow on changed lines and red on deleted. If I were doing this I would take take the generated HTML, extract the body, and prefix it with my own handwritten block of HTML with lots of CSS to make it look good. I'd also probably strip out the legend table and move it to the top or put it in a div so that CSS can do that.
Actually, I would give serious consideration to just fixing up the difflib module (which is written in python) to generate better HTML and contribute it back to the project. If you have a CSS expert to help you or are one yourself, please consider doing this.
I recently posted a python script that does just this: diff2HtmlCompare (follow the link for a screenshot). Under the hood it wraps difflib and uses pygments for syntax highlighting.
not just line level, but also word/character modifications within a line
xmldiff seems to be a nice package for this purpose especially when you have XML/HTML to compare. Read more in their documentation.
try first of all clean up both of HTML by lxml.html, and the check the difference by difflib
Since the .. library from google seams to have no active development any more, I suggest to use diff_py
From the github page:
The simple diff tool which is written by Python. The diff result can be printed in console or to html file.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP vesions available).
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.