python-docx - replacing characters - python

I am trying to build a small program that opens a docx document and replaces characters with others, to do some old-school Caesar-style encryption. After checking the documentation [ https://python-docx.readthedocs.io ], I can't find a reference for the object methods and attributes. The documentation explains how to do certain things, like creating paragraphs and sections, but I can't find anything on retrieving and parsing document data. I would like to find a list of the objects in the document so I can iterate through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
    paragraph.append(i)
for i in paragraph:
    for y in i:
        y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If I am using the wrong library, I would also appreciate it if you could point that out.

The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
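For the replacement itself, here is a minimal sketch. It assumes the text you care about lives in the body paragraphs and that replacing run by run is acceptable. `caesar_shift` and `encrypt_docx` are hypothetical helper names, not part of python-docx; `Document`, `paragraphs`, `runs`, `run.text` and `save` are the python-docx names. Headers, footers and tables are reached through other objects (`document.sections`, `document.tables`) and are not covered here.

```python
import string


def caesar_shift(text, shift):
    """Shift ASCII letters by `shift` places, leaving everything else alone."""
    shift %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)


def encrypt_docx(src_path, dst_path, shift):
    """Rewrite every run in the body paragraphs (hypothetical helper)."""
    from docx import Document  # needs the python-docx package installed

    document = Document(src_path)
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            # Assigning to run.text keeps the run's character formatting.
            run.text = caesar_shift(run.text, shift)
    document.save(dst_path)
```

Calling `encrypt_docx('essay.docx', 'essay_encrypted.docx', 3)` would write a shifted copy, and a negative shift reverses it.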

I think I found something useful, in case future readers are interested. The problem with python-docx is that I would have to get paragraphs individually, which would take a lot of time. I don't even know whether titles, footers and headers count as paragraphs.
There is a library called textract that can read docx and other files; it integrates with python-docx, or at least that's what the short documentation says. What I can do is save my docx file as a PDF and use:
import textract

text = textract.process(
    'path/to/norwegian.pdf',
    method='pdftotext',
    language='nor',
)
This allows you to get all the text as a string and save it preserving the layout of the pdf. Haven't tested it yet, will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package

Related

python-docx How to get content/body of a section

I am using Word's sections so that each page can have a different header, and I mark each page with some markup like {page1}.
Using python-docx I get sections by:
doc = Document(my_file)
doc_sections = doc.sections
doc_page_one = doc_sections[0]
I am able to get header and footer of each page and their texts:
doc_page_one.header.paragraphs[0].text
But I don't see the actual page content/body or shapes; while debugging I was not able to find where they live.
Does python-docx have this possibility?
At present, python-docx does not have API support for getting what I would imagine would be the "block-items" (paragraphs + tables) that are "contained" in a certain section.
You would have to navigate the underlying XML if you wanted it bad enough, probably starting at document._body._body.xml. You could get an idea what that looks like with:
print(document._body._body.xml)
Basically you'd be looking for w:sectPr elements, each of which ends a section. There is some more detail on the XML schema involved in the python-docx analysis page here: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/sections.html
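A rough sketch of that navigation, using only the standard library. It assumes you pass in the string you get from document._body._body.xml (or any w:body element that carries its namespace declaration). `split_body_by_section` and `text_of` are hypothetical helper names, and only the top-level paragraphs and tables are grouped.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def split_body_by_section(body_xml):
    """Group the paragraphs and tables of a <w:body> by section."""
    body = ET.fromstring(body_xml)
    sections, current = [], []
    for child in body:
        if child.tag == f"{{{W}}}sectPr":
            # A sectPr directly under w:body closes the last section.
            sections.append(current)
            current = []
            continue
        current.append(child)
        # Earlier sections end with a sectPr inside the pPr of their last paragraph.
        if child.find(f"{{{W}}}pPr/{{{W}}}sectPr") is not None:
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections


def text_of(block):
    """Concatenate the w:t text found anywhere inside a block item."""
    return "".join(t.text or "" for t in block.iter(f"{{{W}}}t"))
```

Each entry in the returned list lines up with the same index in doc.sections, so the first entry would hold the block items for doc.sections[0].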

Python -- Parsing files (docx, pdf and odt) and converting the content into my data model

I'm writing an import/export tool for importing docx, pdf, and odt files; in which a book has been written.
We already have a tool for the .epub format, and we'd like to extend the functionality beyond that, so users of the site can have more flexibility.
So far I've looked at PDFMiner and also found out that docx is just based on the openxml format, so the word/document.xml is essentially the file containing the whole thing, and I can parse it with lxml.
The question I have is: I'm hoping to parse the contents of these files, and from that content, extract things like chapter names, images (if any), and chapter text, so that I can fit the content into a data model of:
Book --> o2m --> Chapter --> o2m --> Image
Clearly, PDFMiner has a .get_outlines() function that will return the TOC for me. But it can't link any of the returned tuples (chapter numbers and titles) to the actual pages for that chapter.
Even more problematic is that with docx/odt; those are just paragraphs -- <w:sdt> -- elements, with attrs and child elements.
I'm looking for idea(s) to extrapolate some sense of structure from these filetypes, and if need be, I can apply those ideas (2 or 3) as suggested formats for our users who wish to import a book via one of those file formats.
Textract is the best tool that I have encountered so far for parsing different file formats.
It can parse most of the file formats.
You can find the project on Github
Here is the official documentation
(Python 3 answer)
When I was looking for a tool to read .docx files, I was able to find one here: http://etienned.github.io/posts/extract-text-from-word-docx-simply/
What it does is simply get the text from a .docx file and return it as a string; separate paragraphs are still clearly separate, as there are the new lines between, but all other formatting is lost. I think this may include the loss of end- and foot-notes, but if you want the body of a text, it works great.
I have tested it on both Windows 10 and on OS X, and it has worked successfully on both. Here is what it imports:
import zipfile
try:
    from xml.etree.cElementTree import XML
    print("cElementTree")
except ImportError:
    from xml.etree.ElementTree import XML
    print("ElementTree")
EDIT:
If, in the body of the function, you replace
'word/document.xml'
with
'word/footnotes.xml'
or
'word/endnotes.xml'
you can get the footnotes and endnotes, respectively.
The markers for where they were in the text are lost, however.
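For readers who don't want to follow the link, here is a minimal sketch of the same idea. It is not a copy of the linked function: `get_docx_text` and its `part` parameter are names I made up for illustration. It assumes the .docx is a normal zip archive and that paragraphs are w:p elements holding their text in w:t elements.

```python
import zipfile
from xml.etree.ElementTree import XML

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_docx_text(path, part="word/document.xml"):
    """Return the text of one XML part of a .docx, paragraphs separated by blank lines."""
    with zipfile.ZipFile(path) as archive:
        tree = XML(archive.read(part))
    paragraphs = []
    for paragraph in tree.iter(W + "p"):
        texts = [node.text for node in paragraph.iter(W + "t") if node.text]
        if texts:
            # Empty paragraphs are skipped; runs are joined without separators.
            paragraphs.append("".join(texts))
    return "\n\n".join(paragraphs)
```

Passing part="word/footnotes.xml" or part="word/endnotes.xml" reads the notes instead of the body, as described in the edit above.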

restructuredtext: use directives for metadata

I'm writing a simple webpage generator based on restructuredtext and I'd like to put tags into the document, like this.
=====
Title
=====
:author: Me
:tags: foo, bar
Here we go ...
What I want now:
get hold of some kind of document tree
find the tags entry, read it, process it (e.g. print the tags on the command line), remove it and render the remaining tree.
So I'd like to write compatible restructuredtext in case it's being compiled with something different than my program.
Can someone give me a hint? I found this one here http://svn.python.org/projects/external/docutils-0.6/docutils/examples.py showing in the internals method how to obtain the document (and therefore the dom tree), but is this the best way to go or would a regex based approach (find lines, remove them) be a lot easier? Working with the tree would also involve the conversion tree → document and so on.
There are tools that can do this for you. See http://docutils.sourceforge.net/docs/user/links.html
I think I have a nice solution for both problems. First, the core.py file in the docutils distribution shows how to obtain the doctree and how to write it (using a html writer for instance), see publish_from_doctree and publish_doctree. Then, there is docutils.nodes.SparseNodeVisitor which one can subclass and overwrite methods like visit_field to manipulate the document tree in various ways.
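A sketch of how those pieces could fit together, assuming docutils is installed and that the tags sit in a field list directly under the title. `extract_tags` and `TagCollector` are hypothetical names; `publish_doctree`, `SparseNodeVisitor`, `visit_field` and `walk` come from docutils. The imports sit inside the function only so that the sketch needs docutils at call time.

```python
def extract_tags(source):
    """Return (tags, doctree) with any ':tags:' field removed from the tree."""
    from docutils import nodes
    from docutils.core import publish_doctree

    class TagCollector(nodes.SparseNodeVisitor):
        """Remember every field whose name is 'tags'."""

        def __init__(self, document):
            super().__init__(document)
            self.found = []

        def visit_field(self, node):
            # A field node holds a field_name followed by a field_body.
            if node[0].astext().lower() == "tags":
                self.found.append(node)

    doctree = publish_doctree(source)
    collector = TagCollector(doctree)
    doctree.walk(collector)

    tags = []
    for field in collector.found:
        tags.extend(t.strip() for t in field[1].astext().split(",") if t.strip())
        # Remove after the walk so the tree is not changed while it is traversed.
        field.parent.remove(field)
    return tags, doctree
```

The returned tree can then be handed to publish_from_doctree with the writer of your choice to render what is left.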

Formatted output in OpenOffice/Microsoft Word with Python

I am working on a project (in Python) that needs formatted, editable output. Since the end-user isn't going to be technically proficient, the output needs to be in a word processor editable format. The formatting is complex (bullet points, paragraphs, bold face, etc).
Is there a way to generate such a report using Python? I feel like there should be a way to do this using Microsoft Word/OpenOffice templates and Python, but I can't find anything advanced enough to get good formatting. Any suggestions?
A little known, and slightly evil fact: If you create an HTML file, and stick a .doc extension on it, Word will open it as a Word document, and most users will be none the wiser.
Except maybe a very technical person will say, "My, this is a small Word file!" :)
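A minimal sketch of that trick using only the standard library. `write_html_report` and its parameters are made up for illustration, and how well Word renders the result will depend on the Word version.

```python
import html


def write_html_report(path, title, paragraphs, bullets):
    """Write a small HTML page; `path` is expected to end in .doc."""
    parts = [
        "<html><head><meta charset='utf-8'>",
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    parts += [f"<p>{html.escape(p)}</p>" for p in paragraphs]
    parts.append("<ul>")
    parts += [
        f"<li><b>{html.escape(name)}</b>: {html.escape(text)}</li>"
        for name, text in bullets
    ]
    parts.append("</ul></body></html>")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(parts))
```

For example, `write_html_report("report.doc", "Status", ["All good."], [("Risk", "low")])` writes a file that can be opened from Word.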
Use the Python Docx module for this - 100% Python, tables, images, document properties, headings, paragraphs, and more.
"The formatting is complex (bullet points, paragraphs, bold face, etc)"
Use RST.
It's trivial to produce, since it's plain text.
It's trivial to edit, since it's plain text with a few extra characters to provide structural information.
It formats nicely using a bunch of tools.
I know there is an odtwriter for docutils. You could generate your output as reStructuredText and feed it to odtwriter or look into what odtwriter is using on the backend to generate the ODT and use that.
(I'd probably go with generating rst output and then hacking odtwriter to output the stuff I want (and contribute the fixes back to the project), because that's probably a whole lot easier than trying to render your stuff to ODT directly.)
I've used xlwt to create Excel documents using python, but I haven't needed to write word files yet. I've found this package, OOoPy, but I haven't used it.
Also you might want to try outputting html files and having the users open them in Word.
You can use QTextDocument, QTextCursor and QTextDocumentWriter in PyQt4. A simple example to show how to write to an odt file:
>>> from PyQt4 import QtGui
# Create a document object
>>> doc = QtGui.QTextDocument()
# Create a cursor pointing to the beginning of the document
>>> cursor = QtGui.QTextCursor(doc)
# Insert some text
>>> cursor.insertText('Hello world')
# Create a writer to save the document
>>> writer = QtGui.QTextDocumentWriter()
>>> writer.supportedDocumentFormats()
[PyQt4.QtCore.QByteArray(b'HTML'), PyQt4.QtCore.QByteArray(b'ODF'), PyQt4.QtCore.QByteArray(b'plaintext')]
>>> odf_format = writer.supportedDocumentFormats()[1]
>>> writer.setFormat(odf_format)
>>> writer.setFileName('hello_world.odt')
>>> writer.write(doc) # Return True if successful
True
QTextCursor also can insert tables, frames, blocks, images. More information at:
http://qt-project.org/doc/qt-4.8/qtextcursor.html
As a bonus, you also can print to a pdf file by using QPrinter.
I think OpenOffice has some Python bindings - you should be able to write OO macros in Python.
But I would use HTML instead - Word and OO.org are rather good at editing it and you can write it from Python easily (although Word saves a lot of mess which could complicate parsing it by your Python app).

Generate pretty diff html in Python

I have two chunks of text that I would like to compare and see which words/lines have been added/removed/modified in Python (similar to a Wiki's Diff Output).
I have tried difflib.HtmlDiff but its output is less than pretty.
Is there a way in Python (or external library) that would generate clean looking HTML of the diff of two sets of text chunks? (not just line level, but also word/character modifications within a line)
There's diff_prettyHtml() in the diff-match-patch library from Google.
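If you would rather stay in the standard library, word-level marking can be approximated with difflib.SequenceMatcher. To be clear, this is not diff-match-patch and not its algorithm; it is a substitute sketch, and `word_diff_html` is a hypothetical name.

```python
import difflib
import html
import re


def word_diff_html(old, new):
    """Mark word-level changes with <del> and <ins> tags."""
    # Split into words and whitespace runs so the spacing survives.
    a = re.findall(r"\s+|\S+", old)
    b = re.findall(r"\s+|\S+", new)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        removed = html.escape("".join(a[i1:i2]))
        added = html.escape("".join(b[j1:j2]))
        if op == "equal":
            out.append(removed)
            continue
        if op in ("delete", "replace"):
            out.append(f"<del>{removed}</del>")
        if op in ("insert", "replace"):
            out.append(f"<ins>{added}</ins>")
    return "".join(out)
```

Styling the del and ins elements with CSS (red and green backgrounds, say) gets fairly close to a wiki-style diff.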
Generally, if you want some HTML to render in a prettier way, you do it by adding CSS.
For instance, if you generate the HTML like this:
import difflib
import sys

fromfile = "xxx"
tofile = "zzz"
# Text mode already handles universal newlines in Python 3; the old 'U' mode was removed in 3.11.
with open(fromfile) as f:
    fromlines = f.readlines()
with open(tofile) as f:
    tolines = f.readlines()
diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, tofile)
sys.stdout.write(diff)
then you get green backgrounds on added lines, yellow on changed lines and red on deleted lines. If I were doing this, I would take the generated HTML, extract the body, and prefix it with my own handwritten block of HTML with lots of CSS to make it look good. I'd also probably strip out the legend table and move it to the top, or put it in a div so that CSS can position it.
Actually, I would give serious consideration to just fixing up the difflib module (which is written in python) to generate better HTML and contribute it back to the project. If you have a CSS expert to help you or are one yourself, please consider doing this.
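Short of patching difflib, one way to try the CSS approach is to ask HtmlDiff for just the table and wrap it in your own page. This is a sketch: the colours and the `pretty_diff_page` name are made up, while `make_table` and the `diff`, `diff_header`, `diff_add`, `diff_chg` and `diff_sub` class names are what difflib emits.

```python
import difflib

CSS = """
table.diff { font-family: monospace; border-collapse: collapse; }
td.diff_header { color: #888; text-align: right; padding: 0 6px; }
.diff_add { background: #d4f8d4; }
.diff_chg { background: #fff3b0; }
.diff_sub { background: #f8d4d4; }
"""


def pretty_diff_page(old_text, new_text, old_name="old", new_name="new"):
    """Return a full HTML page wrapping difflib's table in our own CSS."""
    # make_table returns only the comparison table, without the legend.
    table = difflib.HtmlDiff().make_table(
        old_text.splitlines(), new_text.splitlines(), old_name, new_name
    )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset='utf-8'>"
        f"<style>{CSS}</style></head>\n<body>{table}</body></html>"
    )
```

Because make_table leaves the legend out, there is nothing to strip afterwards, and the stylesheet is entirely yours to change.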
I recently posted a python script that does just this: diff2HtmlCompare (follow the link for a screenshot). Under the hood it wraps difflib and uses pygments for syntax highlighting.
not just line level, but also word/character modifications within a line
xmldiff seems to be a nice package for this purpose especially when you have XML/HTML to compare. Read more in their documentation.
First clean up both HTML documents with lxml.html, then check the difference with difflib.
Since the .. library from Google seems to have no active development any more, I suggest using diff_py.
From the github page:
The simple diff tool which is written by Python. The diff result can be printed in console or to html file.
A copy of my own answer from here.
What about DaisyDiff (Java and PHP versions available)?
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.
