I want to write a script that generates a report for each team in my unit, where every report uses the same template but is filled in with the numbers specific to that team. The report should be in a format like .pdf that non-programmers know how to open and read. This is in many ways similar to rmarkdown for R, but the reports I want to generate are based on data from code already written in Python.
The solution I am looking for does not need to export directly to pdf. It can export to markdown and then I know how to convert. I do not need any fancier formatting than what markdown provides. It does not need to be markdown, but I know how to do everything else in markdown, if I only find a way to dynamically populate numbers and text in a markdown template from python code.
What I need is something similar to the code block below, but on a bigger scale, and instead of printing the output on screen it would be saved to a file (.md or .pdf) that can then be shared with each team.
user = {'name':'John Doe', 'email':'jd#example.com'}
print('Name is {}, and email is {}'.format(user["name"], user["email"]))
So the desired functionality, heavily influenced by my previous experience with rmarkdown, would look something like the code block below, where the template is a string (or a file read as a string) with placeholders that are populated from variables (or dicts or objects) in the Python code. The output can then be saved and shared with the teams.
user = {'name':'John Doe', 'email':'jd#example.com'}
template = 'Name is `user["name"]`, and email is `user["email"]`'
output = render(template, user)
When trying to find a rmarkdown equivalent in python, I have found a lot of pointers to Jupyter Notebook which I am familiar with, and very much like, but it is not what I am looking for, as the point is not to share the code, only a rendered output.
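For a template as simple as the one above, the standard library's string.Template may already be enough. The following is only a minimal sketch of the "one template, one file per team" idea; the template text, team names, field names and file names are all made up for illustration.

```python
from pathlib import Path
from string import Template
import tempfile

# Hypothetical Markdown template and team data, for illustration only.
TEMPLATE = Template(
    "## Report for $team\n\n"
    "**Contact:** $email\n\n"
    "**Open tickets:** $open_tickets\n"
)

teams = [
    {"team": "Alpha", "email": "alpha@example.com", "open_tickets": 12},
    {"team": "Beta", "email": "beta@example.com", "open_tickets": 3},
]


def write_reports(teams, out_dir):
    """Render one Markdown file per team and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for team in teams:
        # substitute() raises KeyError if the data lacks a placeholder
        text = TEMPLATE.substitute(team)
        path = out_dir / f"report_{team['team'].lower()}.md"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    for report_path in write_reports(teams, tempfile.mkdtemp()):
        print(report_path)
```

string.Template has no loops or conditionals, so anything like a table with a variable number of rows needs a real template engine.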
Since this question was up-voted I want to answer my own question, as I found a solution that was perfect for me. In the end I shared these reports in a repo, so I write the reports in markdown and do not convert them to PDF. The reason I still think this answers my original question is that it works much like creating markdown in Rmarkdown, which was the core of my question, and markdown can easily be converted to PDF.
I solved this by using a library for backend generated HTML pages. I happened to use jinja2 but there are many other options.
First you need a template file in markdown. Let's say this is template.md:
## Overview
**Name:** {{repo.name}}<br>
**URL:** {{repo.url}}
| Branch name | Days since last edit |
|---|---|
{% for branch in repo.branches %}
|{{branch[0]}}|{{branch[1]}}|
{% endfor %}
And then you use this in your Python script:
from jinja2 import Template
import codecs

# create a dict with all the data that will populate the template
# (jinja2 resolves {{repo.name}} against dict keys as well as attributes)
repo = {
    'name': 'training-kit',
    'url': 'https://github.com/github/training-kit',
    'branches': [
        ['master', 15],
        ['dev', 2],
    ],
}

# render the template
with open('template.md', 'r') as file:
    template = Template(file.read(), trim_blocks=True)
rendered_file = template.render(repo=repo)

# output the file
with codecs.open("report.md", "w", "utf-8") as output_file:
    output_file.write(rendered_file)
If you are OK with your dynamic doc being in markdown you are done, and the report is written to report.md. If you want PDF you can use pandoc to convert.
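The pandoc step can be driven from the same script. This is only a sketch: it assumes pandoc is on the PATH and that a PDF engine (pandoc uses LaTeX by default) is installed; the function names are made up for illustration.

```python
import shutil
import subprocess


def pandoc_command(md_path, pdf_path):
    """Build the pandoc call; the output format is inferred from the .pdf extension."""
    return ["pandoc", str(md_path), "-o", str(pdf_path)]


def md_to_pdf(md_path, pdf_path):
    """Convert a Markdown file to PDF, failing loudly if pandoc is missing."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(pandoc_command(md_path, pdf_path), check=True)


# Usage (not executed here): md_to_pdf("report.md", "report.pdf")
```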
I would strongly recommend installing and using the pyFPDF library, which enables you to write and export PDF files directly from Python. The library was ported from PHP and offers the same functionality as its PHP variant.
1.) Clone and install pyFPDF
Git-Bash:
git clone https://github.com/reingart/pyfpdf.git
cd pyfpdf
python setup.py install
2.) After successful installation, you can write Python code much as you would with fpdf in PHP, for example:
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt="Hello", border=0)
pdf.output('myTest.pdf', 'F')
For more Information, take a look at:
https://pypi.org/project/fpdf/
To work with pyFPDF clone repo from: https://github.com/reingart/pyfpdf
pyFPDF Documentation:
https://pyfpdf.readthedocs.io/en/latest/Tutorial/index.html
I'm trying to insert a picture into a Word document using python-docx but running into errors.
The code is simply:
document.add_picture("test.jpg", width = Cm(2.0))
From looking at the python-docx documentation I can see that the following XML should be generated:
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="1" name="python-powered.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId7"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="859536" cy="343814"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
This does in fact get generated in my document.xml file (seen when unzipping the docx file). However, looking into the OOXML format I can see that the image should also be saved under the media folder, and the relationship should be mapped in word/_rels/document.xml.rels:
<Relationship Id="rId20"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
Target="media/image20.png"/>
None of this happens, however, and when I open the Word document I'm met with a "The picture can't be displayed" placeholder.
Can anyone help me understand what is going on?
It looks like the image is not embedded the way it should be and I need to insert it in the media folder and add the mapping for it, however as a well documented feature this should be working as expected.
UPDATE:
Testing it out with an empty docx file that image does get added as expected which leads me to believe it might have something to do with the python-docx-template library. (https://github.com/elapouya/python-docx-template)
It uses python-docx and jinja to allow templating capabilities but runs and works the same way python-docx should. I added the image to a subdoc which then gets inserted into a full document at a given place.
A sample code can be seen below (from https://github.com/elapouya/python-docx-template/blob/master/tests/subdoc.py):
from docxtpl import DocxTemplate
from docx.shared import Inches
tpl=DocxTemplate('test_files/subdoc_tpl.docx')
sd = tpl.new_subdoc()
sd.add_paragraph('A picture :')
sd.add_picture('test_files/python_logo.png', width=Inches(1.25))
context = {
'mysubdoc' : sd,
}
tpl.render(context)
tpl.save('test_files/subdoc.docx')
I'll keep this up in case anyone else manages to make the same mistake as I did :) I managed to debug it in the end.
The problem was in how I used the python-docx-template library. I opened up a DocxTemplate like so:
report_output = DocxTemplate(template_path)
DoThings(value,template_path)
report_output.render(dictionary)
report_output.save(output_path)
But I accidentally opened it up twice. Instead of passing the template to a function, when working with it, I passed a path to it and opened it again when creating subdocs and building them.
def DoThings(data, template_path):
    doc = DocxTemplate(template_path)
    temp_finding = doc.new_subdoc()
    #DO THINGS
Finally after I had the subdocs built, I rendered the first template which seemed to work fine for paragraphs and such but I'm guessing the images were added to the "second" opened template and not to the first one that I was actually rendering. After passing the template to the function it started working as expected!
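The effect can be illustrated without any Word files. The sketch below does not use docxtpl at all: FakeTemplate and FakeSubdoc are hypothetical stand-ins that only model the guess above, namely that images added through a subdoc end up on whichever template instance created that subdoc.

```python
class FakeSubdoc:
    """Hypothetical stand-in for a subdoc; it remembers the template that made it."""

    def __init__(self, owner):
        self.owner = owner

    def add_picture(self, name):
        # The picture is registered on the owning template only.
        self.owner.images.append(name)


class FakeTemplate:
    """Hypothetical stand-in for DocxTemplate; every instance has its own image list."""

    def __init__(self, path):
        self.path = path
        self.images = []

    def new_subdoc(self):
        return FakeSubdoc(self)


def build_wrong(template_path):
    # Re-opens the template: a second, unrelated instance.
    doc = FakeTemplate(template_path)
    subdoc = doc.new_subdoc()
    subdoc.add_picture("logo.png")
    return subdoc


def build_right(template):
    # Reuses the instance that will later be rendered and saved.
    subdoc = template.new_subdoc()
    subdoc.add_picture("logo.png")
    return subdoc


report = FakeTemplate("template.docx")
build_wrong("template.docx")
print(report.images)  # [] : the picture went to the other instance
build_right(report)
print(report.images)  # ['logo.png']
```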
I came across this problem too, and it was solved by removing the width=(1.0) parameter from the add_picture call.
When width=(1.0) was passed, I could not see the picture in test.docx,
so it MIGHT be caused by an inappropriate size being set for the picture (a bare 1.0 is not wrapped in a unit such as Cm or Inches, so the picture may end up far too small to see).
To add pictures, headings and paragraphs to an existing document:
from docx import Document
from docx.shared import Cm

doc = Document(full_path)  # open an existing document with existing styles
for row in tableData:  # list from the json api ...
    print('row {}'.format(row))
    level = row['level']
    levelStyle = 'Heading ' + str(level)
    title = row['title']
    heading = doc.add_heading(title, level)
    heading.style = doc.styles[levelStyle]
    p = doc.add_paragraph(row['description'])
    if row['img_http_path']:
        # put the image name and the picture in a paragraph of their own
        ip = doc.add_paragraph()
        r = ip.add_run()
        r.add_text(row['img_name'])
        r.add_text("\n")
        r.add_picture(row['img_http_path'], width=Cm(15.0))
doc.save(full_path)
For working with MS Word files in Python there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux?
Is there any library?
Use the native Python docx module. Here's how to extract all the text from a doc:
import docx

document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexes invokes Cthulhu. Don't do it!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getoutput function to run a couple of shell commands, namely wvText (which extracts text from a Word document) and cat (to read the output file). After that, the entire text from the Word document will be in the out variable, ready to use.
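Note that the commands module exists only in Python 2. A rough Python 3 equivalent could look like the sketch below. It assumes wvText is installed and takes an input file and an output file as arguments, as in the snippet above; the converter parameter is a hypothetical addition so that the command can be swapped out, and the encoding of the output file is a guess.

```python
import subprocess
import tempfile
from pathlib import Path


def word_to_text(word_file, converter=("wvText",)):
    """Run `converter <word_file> <out.txt>` and return the text it wrote."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "out.txt"
        # Passing a list avoids shell quoting problems with spaces in file names.
        subprocess.run([*converter, str(word_file), str(out_path)], check=True)
        # Assumption: the output is UTF-8; undecodable bytes are replaced.
        return out_path.read_text(encoding="utf-8", errors="replace")


# Usage (not executed here): text = word_to_text("report.doc")
```

Reading the output file directly also removes the need for the second call to cat.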
Hopefully this will help anyone having similar issues in the future.
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]*>//g': Remove everything inside tags
grep -v '^[[:space:]]*$': Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)
If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.
import zipfile

content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the word/document.xml file in the package and decode it into a string
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        content = docx.read(item.orig_filename).decode('utf-8')
    else:
        pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
        else:
            pass
    else:
        pass
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.
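Instead of splitting strings or using regular expressions, the XML can also be parsed properly with the standard library. The following is a sketch only: it reads just the w:p and w:t elements of word/document.xml, so headers, footnotes and similar parts are ignored, and nested structures such as text boxes may show up twice. The function name and the tiny in-memory test document are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of every <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        runs = [t.text or "" for t in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs


# Tiny in-memory stand-in for a .docx (only the part this function reads).
xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", xml)
buf.seek(0)
print(docx_paragraphs(buf))  # ['Hello world', 'Second']
```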
Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv
To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))
If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
Is this an old question?
I believe that such thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000), using Python only, are lacking.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file-system stored in a file. It actually uses FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store more files within a file, like pictures etc.
The same is done in *.docx by using ZIP archive instead.
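Because the two containers start with different signatures, a script can tell them apart before choosing a reader. A small sketch, with made-up function names; note that it only identifies the container, so an OLE2 file is not necessarily a Word document, and a ZIP file is not necessarily a .docx.

```python
# First 8 bytes of an OLE2 compound file, and the ZIP local file header magic.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(header):
    """Classify the first bytes of a file as 'ole2', 'zip' or 'unknown'."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"   # candidate for *.doc (Word 97-2003)
    if header.startswith(ZIP_MAGIC):
        return "zip"    # candidate for *.docx (Word 2007 and later)
    return "unknown"


def sniff_file(path):
    """Read just enough of the file to classify its container."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```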
There are packages available on PyPI that can read OLE files. Like (olefile, compoundfiles, ...)
I used compoundfiles package to open *.doc file.
However, in MS Word 97-2000, internal subfiles are not XML or HTML, but binary files.
And as if this were not enough, each one contains information about the other, so you have to read at least two of them and unravel the stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
Code below is very hastily composed and tested on small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
I copied each occurrence as in the original algo.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack

__all__ = ["doc2text"]


def doc2text(path):
    text = ""
    cr = CompoundFileReader(path)
    # Load WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract file information block and piece table stream information from it.
    # "<" forces little-endian, 4-byte values; a bare "L" is 8 bytes on 64-bit Linux.
    fib = doc[:1472]
    fcClx = unpack("<L", fib[0x01a2:0x01a6])[0]
    lcbClx = unpack("<L", fib[0x01a6:0x01a6 + 4])[0]
    tableFlag = (unpack("<L", fib[0x000a:0x000a + 4])[0] & 0x0200) == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find piece table inside a table stream:
    clx = table[fcClx:fcClx + lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while True:
        if clx[pos:pos + 1] == b"\x02":
            # This is piece table, we store it:
            lcbPieceTable = unpack("<l", clx[pos + 1:pos + 5])[0]
            pieceTable = clx[pos + 5:pos + 5 + lcbPieceTable]
            break
        elif clx[pos:pos + 1] == b"\x01":
            # This is beginning of some other substructure, we skip it:
            pos = pos + 1 + 1 + clx[pos + 1]
        else:
            break
    if not pieceTable:
        raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info from pieceTable, about each piece and extract it from WordDocument stream:
    pieceCount = (lcbPieceTable - 4) // 12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x * 4:x * 4 + 4])[0]
        cpEnd = unpack("<l", pieceTable[(x + 1) * 4:(x + 1) * 4 + 4])[0]
        ofsetDescriptor = ((pieceCount + 1) * 4) + (x * 8)
        pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor + 8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        isANSII = (fcValue & 0x40000000) == 0x40000000
        fc = fcValue & 0xbfffffff
        cb = cpEnd - cpStart
        enc = ("utf-16", "cp1252")[isANSII]
        cb = (cb * 2, cb)[isANSII]
        text += doc[fc:fc + cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())
I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!
Just an option for reading 'doc' files without using COM: miette. Should work on any platform.
Aspose.Words Cloud SDK for Python is a platform independent solution to convert MS Word/Open Office files to text. It is a commercial product but free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
#Convert RTF to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)
I came across this wonderful extension for Open/Libre Office, writer2Latex, which can convert my whole document into TeX format. It's working great, so I was wondering if it is possible to skip the Open/Libre Office application and call writer2Latex directly from Python? I would like my Python application to just call writer2Latex with the Word document as input and get the generated LaTeX file.
This should be doable using Python + uno. The method is very similar to exporting to PDF, which is more common for people to do than export to TeX / Latex. LibreOffice / OpenOffice have a set of export filters that can just be changed.
See DocumentHacker.com for some general documentation about using Python + LibreOffice / OpenOffice, such as how to open documents. The recipe in the cookbook that you need to modify is "Converting to PDF". Simply replace the output filter with the latex filter name
#already have the document open
from com.sun.star.beans import PropertyValue
outputfiltername = "writer_pdf_Export" #for PDF
property = (
PropertyValue( "FilterName" , 0, outputfiltername , 0 ),
)
document.storeToURL("file:///home/my_username/output.pdf", property)
I'm not sure what the FilterName should be, however; maybe you can work it out from the writer2Latex documentation. I found a PDF that suggests it should be:
outputfiltername = "org.openoffice.da.writer2latex"
but I haven't tested it (search for "FilterName").
I wrote a daemon program, for searching through recipes from Gourmet Recipe Manager's database ( that is a recipe manager for GNU/Linux )
My program reads the information that it needs for each recipe element over a loop from the sqlite database.
(Such a daemon for Ubuntu Linux is called a 'scope'.
Such scopes give ubuntu unity more sources for its searching.)
'model' has the information, which gets delivered over DBUS to Ubuntu Unity.
In theory you are able to use an URI as the source for the image in 'model' ,
but the developers told me on IRC that I am not able to use data URIs.
I tested that, too, and for me it did not work.
So I cache(d) the images in /tmp.
Now you are able to see all recipes, and search for specific ones by title, but the image association is simply wrong. If you search for the second recipe, the second recipe gets shown, but with the image of the first recipe in the sqlite table.
Here are two images, to understand the problem:
The second recipe gets the image of the first recipe
I already looked in some IRC rooms for help, but no one could help me...
I think you have to save the state of each image somehow.
I would be pleased if you have a solution that does not require caching the images.
The full source file can be viewed here: http://bazaar.launchpad.net/~gotwig/lens-cooking/lens-cooking/view/head:/unity-scope-gourmet
So, here is the specific part of my code:
if row[14]:
    open('/tmp/unity-scope-gourmet/icon' + str(i), 'wb').write(row[14])
    model.append(uri, '/tmp/unity-scope-gourmet/icon' + str(i), 1, "text/html", title, comment, uri)
else:
    if os.path.exists('/tmp/unity-scope-gourmet/icon' + str(i)): os.remove('/tmp/unity-scope-gourmet/icon' + str(i))
    model.append(uri, '', 1, "text/html", title, comment, uri)
It sounds like you are describing a simple counter error. It's not clear to me how you are initializing your counter 'i', but if it's off by one, the simple solution is to add one to it before using it, i.e.:
open('/tmp/unity-scope-gourmet/icon' + str(i+1), 'wb').write(row[14])
^^^
I solved the problem with combining the filename of the cached image with each ID of the recipe.
Code:
i = row[0]
See the complete solution here : http://bazaar.launchpad.net/~gotwig/lens-cooking/lens-cooking/revision/32
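The idea of naming each cached image after the recipe ID, rather than after a loop counter, can be sketched as follows. This is not the code from the linked revision: the function name, the (recipe_id, image) row shape and the returned dict are assumptions made for illustration.

```python
import os


def cache_icons(rows, cache_dir):
    """Write each recipe image to cache_dir/icon<recipe_id>.

    rows is an iterable of (recipe_id, image_bytes_or_None) pairs.
    Returns {recipe_id: icon_path}, with '' for recipes without an image.
    """
    os.makedirs(cache_dir, exist_ok=True)
    icons = {}
    for recipe_id, image in rows:
        # The file name depends only on the recipe, not on its position
        # in the result set, so a filtered search cannot mix images up.
        path = os.path.join(cache_dir, "icon" + str(recipe_id))
        if image:
            with open(path, "wb") as f:
                f.write(image)
            icons[recipe_id] = path
        else:
            if os.path.exists(path):
                os.remove(path)  # drop a stale icon from an earlier run
            icons[recipe_id] = ""
    return icons
```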