I have two chunks of text that I would like to compare and see which words/lines have been added/removed/modified in Python (similar to a Wiki's Diff Output).
I have tried difflib.HtmlDiff but it's output is less than pretty.
Is there a way in Python (or external library) that would generate clean looking HTML of the diff of two sets of text chunks? (not just line level, but also word/character modifications within a line)
There's diff_prettyHtml() in the diff-match-patch library from Google.
Generally, if you want some HTML to render in a prettier way, you do it by adding CSS.
For instance, if you generate the HTML like this:
import difflib
import sys
fromfile = "xxx"
tofile = "zzz"
fromlines = open(fromfile, 'U').readlines()
tolines = open(tofile, 'U').readlines()
diff = difflib.HtmlDiff().make_file(fromlines,tolines,fromfile,tofile)
sys.stdout.writelines(diff)
then you get green backgrounds on added lines, yellow on changed lines and red on deleted. If I were doing this I would take take the generated HTML, extract the body, and prefix it with my own handwritten block of HTML with lots of CSS to make it look good. I'd also probably strip out the legend table and move it to the top or put it in a div so that CSS can do that.
Actually, I would give serious consideration to just fixing up the difflib module (which is written in python) to generate better HTML and contribute it back to the project. If you have a CSS expert to help you or are one yourself, please consider doing this.
I recently posted a python script that does just this: diff2HtmlCompare (follow the link for a screenshot). Under the hood it wraps difflib and uses pygments for syntax highlighting.
not just line level, but also word/character modifications within a line
xmldiff seems to be a nice package for this purpose especially when you have XML/HTML to compare. Read more in their documentation.
try first of all clean up both of HTML by lxml.html, and the check the difference by difflib
Since the .. library from google seams to have no active development any more, I suggest to use diff_py
From the github page:
The simple diff tool which is written by Python. The diff result can be printed in console or to html file.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP vesions available).
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
Related
I am trying to build a small program in which I open a docx document and replace characters by others, to do some old school caesar-style encrypting, after checking the documentation: [ https://python-docx.readthedocs.io ] I am afraid I can't find the object methods and attributes, the documentation just kind-of explains how to do certain stuff like creating paragraphs and sections but I can't find anything on retrieving document data and parsing. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
paragraph.append(i)
for i in paragraph:
for y in i:
y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.
The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
I think I found something useful in case future readers might be interested. The problem with python-docx is I could get paragraphs individually and it would take a lot of time. I don't even know if titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files, it integrates with python-docx, or at least that's what the short documentation says. But what I can do, is save my docx file to PDF and use:
text = textract.process(
'path/to/norwegian.pdf',
method='pdftofile',
language='nor',
)
This allows you to get all the text as a string and save it preserving the layout of the pdf. Haven't tested it yet, will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package
I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (eg, hide certain infoboxes etc).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print page.text()
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page%api_content
This way you can design any HTML you want and just insert the API content in a designed spot.
Those APIs that you are using are designed to return raw content so it's up to you to style how you want the raw content to be displayed.
UPDATE
Since you showed me the actual output you are dealing with I realize your dilemma. However luckily for you there are modules that already parse and convert to HTML for you.
There is one called mwlib that will parse the wiki and output to HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options since it was created in cooperation between Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
'coverimage': {
'param': 'FILENAME',
'help': 'filename of an image for the cover page',
}
}
I don't know what the rendered html looks like but I would imagine that it's close to the actual wiki page. But since it's rendered in code I'm sure you have control over modifications as well.
I would go with HTML parsing, page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content which should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
That said, if you really want to manipulate wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and use the parse API to get the modified HTML. Or use the Parsoid API which is a partial reimplementation of the parser returning XHTML/RDFa which is richer in semantic elements.
At any rate, trying to set up a local wikitext->HTML converter is by far the hardest way you can approach this task.
The mediawiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way. There's a good example of just using requests to call the API to "parse" (aka render) a page given its title.
I'm writing a simple webpage generator based on restructuredtext and I'd like to put tags into the document, like this.
=====
Title
=====
:author: Me
:tags: foo, bar
Here we go ...
What I want now:
get in possession of some kind of document tree
find the tags entry, read it, process it (like print the tags on the command line), remove it and render the remaining tree.
So I'd like to write compatible restructuredtext in case it's being compiled with something different than my program.
Can someone give me a hint? I found this one here http://svn.python.org/projects/external/docutils-0.6/docutils/examples.py showing in the internals method how to obtain the document (and therefore the dom tree), but is this the best way to go or would a regex based approach (find lines, remove them) be a lot easier? Working with the tree would also involve the conversion tree → document and so on.
There are tools that can do this for you. See http://docutils.sourceforge.net/docs/user/links.html
I think I have a nice solution for both problems. First, the core.py file in the docutils distribution shows how to obtain the doctree and how to write it (using a html writer for instance), see publish_from_doctree and publish_doctree. Then, there is docutils.nodes.SparseNodeVisitor which one can subclass and overwrite methods like visit_field to manipulate the document tree in various ways.
I am working on a project (in Python) that needs formatted, editable output. Since the end-user isn't going to be technically proficient, the output needs to be in a word processor editable format. The formatting is complex (bullet points, paragraphs, bold face, etc).
Is there a way to generate such a report using Python? I feel like there should be a way to do this using Microsoft Word/OpenOffice templates and Python, but I can't find anything advanced enough to get good formatting. Any suggestions?
A little known, and slightly evil fact: If you create an HTML file, and stick a .doc extension on it, Word will open it as a Word document, and most users will be none the wiser.
Except maybe a very technical person will say, my this is a small Word file! :)
Use the Python Docx module for this - 100% Python, tables, images, document properties, headings, paragraphs, and more.
" The formatting is complex(bullet points, paragraphs, bold face, etc), "
Use RST.
It's trivial to produce, since it's plain text.
It's trivial to edit, since it's plain text with a few extra characters to provide structural information.
It formats nicely using a bunch of tools.
I know there is an odtwriter for docutils. You could generate your output as reStructuredText and feed it to odtwriter or look into what odtwriter is using on the backend to generate the ODT and use that.
(I'd probably go with generating rst output and then hacking odtwriter to output the stuff I want (and contribute the fixes back to the project), because that's probably a whole lot easier that trying to render your stuff to ODT directly.)
I've used xlwt to create Excel documents using python, but I haven't needed to write word files yet. I've found this package, OOoPy, but I haven't used it.
Also you might want to try outputting html files and having the users open them in Word.
You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>>from pyqt4 import QtGui
# Create a document object
>>>doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>>cursor = QtGui.QTextCursor(doc)
# Insert some text
>>>cursor.insertText('Hello world')
# Create a writer to save the document
>>>writer = QtGui.QTextDocumentWriter()
>>>writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>>odf_format = writer.supportedDocumentFormats()[1]
>>>writer.setFormat(odf_format)
>>>writer.setFileName('hello_world.odt')
>>>writer.write(doc) # Return True if successful
True
QTextCursor also can insert tables, frames, blocks, images. More information at:
http://qt-project.org/doc/qt-4.8/qtextcursor.html
As a bonus, you also can print to a pdf file by using QPrinter.
I think OpenOffice has some Python bindings - you should be able to write OO macros in Python.
But I would use HTML instead - Word and OO.org are rather good at editing it and you can write it from Python easily (although Word saves a lot of mess which could complicate parsing it by your Python app).
I'm trying to create XML using the ElementTree object structure in python. It all works very well except when it comes to processing instructions. I can create a PI easily using the factory function ProcessingInstruction(), but it doesn't get added into the elementtree. I can add it manually, but I can't figure out how to add it above the root element where PI's are normally placed. Anyone know how to do this? I know of plenty of alternative methods of doing it, but it seems that this must be built in somewhere that I just can't find.
Try the lxml library: it follows the ElementTree api, plus adds a lot of extras. From the compatibility overview:
ElementTree ignores comments and processing instructions when parsing XML, while etree will read them in and treat them as Comment or ProcessingInstruction elements respectively. This is especially visible where comments are found inside text content, which is then split by the Comment element.
You can disable this behaviour by passing the boolean remove_comments and/or remove_pis keyword arguments to the parser you use. For convenience and to support portable code, you can also use the etree.ETCompatXMLParser instead of the default etree.XMLParser. It tries to provide a default setup that is as close to the ElementTree parser as possible.
Not in the stdlib, I know, but in my experience the best bet when you need stuff that the standard ElementTree doesn't provide.
With the lxml API it couldn't be easier, though it is a bit "underdocumented":
If you need a top-level processing instruction, create it like this:
from lxml import etree
root = etree.Element("anytagname")
root.addprevious(etree.ProcessingInstruction("anypi", "anypicontent"))
The resulting document will look like this:
<?anypi anypicontent?>
<anytagname />
They certainly should add this to their FAQ because IMO it is another feature that sets this fine API apart.
Yeah, I don't believe it's possible, sorry. ElementTree provides a simpler interface to (non-namespaced) element-centric XML processing than DOM, but the price for that is that it doesn't support the whole XML infoset.
There is no apparent way to represent the content that lives outside the root element (comments, PIs, the doctype and the XML declaration), and these are also discarded at parse time. (Aside: this appears to include any default attributes specified in the DTD internal subset, which makes ElementTree strictly-speaking a non-compliant XML processor.)
You can probably work around it by subclassing or monkey-patching the Python native ElementTree implementation's write() method to call _write on your extra PIs before _writeing the _root, but it could be a bit fragile.
If you need support for the full XML infoset, probably best stick with DOM.
I don't know much about ElementTree. But it is possible that you might be able to solve your problem using a library I wrote called "xe".
xe is a set of Python classes designed to make it easy to create structured XML. I haven't worked on it in a long time, for various reasons, but I'd be willing to help you if you have questions about it, or need bugs fixed.
It has the bare bones of support for things like processing instructions, and with a little bit of work I think it could do what you need. (When I started adding processing instructions, I didn't really understand them, and I didn't have any need for them, so the code is sort of half-baked.)
Take a look and see if it seems useful.
http://home.avvanta.com/~steveha/xe.html
Here's an example of using it:
import xe
doc = xe.XMLDoc()
prefs = xe.NestElement("prefs")
prefs.user_name = xe.TextElement("user_name")
prefs.paper = xe.NestElement("paper")
prefs.paper.width = xe.IntElement("width")
prefs.paper.height = xe.IntElement("height")
doc.root_element = prefs
prefs.user_name = "John Doe"
prefs.paper.width = 8
prefs.paper.height = 10
c = xe.Comment("this is a comment")
doc.top.append(c)
If you ran the above code and then ran print doc here is what you would get:
<?xml version="1.0" encoding="utf-8"?>
<!-- this is a comment -->
<prefs>
<user_name>John Doe</user_name>
<paper>
<width>8</width>
<height>10</height>
</paper>
</prefs>
If you are interested in this but need some help, just let me know.
Good luck with your project.
f = open('D:\Python\XML\test.xml', 'r+')
old = f.read()
f.seek(44,0) #place cursor after xml declaration
f.write('<?xml-stylesheet type="text/xsl" href="C:\Stylesheets\expand.xsl"?>'+ old[44:])
I was facing the same problem and came up with this crude solution after failing to insert the PI into the .xml file correctly even after using one of the Element methods in my case root.insert (0, PI) and trying multiple ways to cut and paste the inserted PI to the correct location only to find the data to be deleted from unexpected locations.