Libre/Open and writer2Latex, call from Python - python

I came across this wonderful extension for Open/Libre Office, writer2Latex, which can convert my whole document into TeX format.It's working great so I was wondering if it is possible to skip open/libre office application and call writer2Latex directly from Python? I would like that my python application just calls writer2Latex with the word document as input, and get generated latex file.

This should be doable using Python + uno. The method is very similar to exporting to PDF, which is more common for people to do than export to TeX / Latex. LibreOffice / OpenOffice have a set of export filters that can just be changed.
See DocumentHacker.com for some general documentation about using Python + LibreOffice / OpenOffice, such as how to open documents. The recipe in the cookbook that you need to modify is "Converting to PDF". Simply replace the output filter with the latex filter name
#already have the document open
from com.sun.star.beans import PropertyValue
outputfiltername = "writer_pdf_Export" #for PDF
property = (
PropertyValue( "FilterName" , 0, outputfiltername , 0 ),
)
document.storeToURL("file:///home/my_username/output.pdf", property)
I'm not sure what the FilterName should be however, maybe you can work it out from the writer3Latex documentation. I found a PDF that suggests it should be:
outputfiltername = "org.openoffice.da.writer2latex"
but I haven't tested it (search for "FilterName").

Related

Python library for dynamic documents

I want to write a script that generates reports for each team in my unit where each report uses the same template, but where the numbers specific to each team is used for each report. The report should be in a format like .pdf that non-programmers know how to open and read. This is in many ways similar to rmarkdown for R, but the reports I want to generate are based on data from code already written in python.
The solution I am looking for does not need to export directly to pdf. It can export to markdown and then I know how to convert. I do not need any fancier formatting than what markdown provides. It does not need to be markdown, but I know how to do everything else in markdown, if I only find a way to dynamically populate numbers and text in a markdown template from python code.
What I need is something that is similar to the code block below, but on a bigger scale and instead of printing output on screen this would saved to a file (.md or .pdf) that can then be shared with each team.
user = {'name':'John Doe', 'email':'jd#example.com'}
print('Name is {}, and email is {}'.format(user["name"], user["email"]))
So the desired functionality heavily influenced by my previous experience using rmarkdown would look something like the code block below, where the the template is a string or a file read as a string, with placeholders that will be populated from variables (or Dicts or objects) from the python code. Then the output can be saved and shared with the teams.
user = {'name':'John Doe', 'email':'jd#example.com'}
template = 'Name is `user["name"]`, and email is `user["email"]`'
output = render(template, user)
When trying to find a rmarkdown equivalent in python, I have found a lot of pointers to Jupyter Notebook which I am familiar with, and very much like, but it is not what I am looking for, as the point is not to share the code, only a rendered output.
Since this question was up-voted I want to answer my own question, as I found a solution that was perfect for me. In the end I shared these reports in a repo, so I write the reports in markdown and do not convert them to PDF. The reason I still think this is an answer to my original quesiton is that this works similar to creating markdown in Rmarkdown which was the core of my question, and markdown can easily be converted to PDF.
I solved this by using a library for backend generated HTML pages. I happened to use jinja2 but there are many other options.
First you need a template file in markdown. Let say this is template.md:
## Overview
**Name:** {{repo.name}}<br>
**URL:** {{repo.url}}
| Branch name | Days since last edit |
|---|---|
{% for branch in repo.branches %}
|{{branch[0]]}}|{{branch[1]}}|
{% endfor %}
And then you have use this in your python script:
from jinja2 import Template
import codecs
#create an dict will all data that will be populate the template
repo = {}
repo.name = 'training-kit'
repo.url = 'https://github.com/github/training-kit'
repo.branches = [
['master',15],
['dev',2]
]
#render the template
with open('template.md', 'r') as file:
template = Template(file.read(),trim_blocks=True)
rendered_file = template.render(repo=repo)
#output the file
output_file = codecs.open("report.md", "w", "utf-8")
output_file.write(rendered_file)
output_file.close()
If you are OK with your dynamic doc being in markdown you are done and the report is written to report.py. If you want PDF you can use pandoc to convert.
I would strongly recommend to install and use the pyFPDF Library, that enables you to write and export PDF files directly from python. The Library was ported from php and offers the same functionality as it's php-variant.
1.) Clone and install pyFPDF
Git-Bash:
git clone https://github.com/reingart/pyfpdf.git
cd pyfpdf
python setup.py install
2.) After successfull installation, you can use python code similar as if you'd work with fpdf in php like:
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)
pdf.set_font('arial', 'B', 13.0)
pdf.cell(ln=0, h=5.0, align='L', w=0, txt="Hello", border=0)
pdf.output('myTest.pdf', 'F')
For more Information, take a look at:
https://pypi.org/project/fpdf/
To work with pyFPDF clone repo from: https://github.com/reingart/pyfpdf
pyFPDF Documentation:
https://pyfpdf.readthedocs.io/en/latest/Tutorial/index.html

Can LibreOffice/OpenOffice programmatically add passwords to existing .docx/.xlsx/.pptx files?

TL;DR version - I need to programmatically add a password to .docx/.xlsx/.pptx files using LibreOffice and it doesn't work, and no errors are reported back either, my request to add a password is simply ignored, and a password-less version of the same file is saved.
In-depth:
I'm trying to script the ability to password-protect existing .docx/.xlsx/.pptx files using LibreOffice.
I'm using 64-bit LibreOffice 6.2.5.2 which is the latest version at the time of writing, on Windows 8.1 64-bit Professional.
Whilst I can do this manually via the UI - specifically, I open the "plain" document, do "Save As" and then tick "Save with Password", and enter the password in there, I cannot get this to work via any kind of automation. I'm been trying via Python/Uno, but to no gain. Although the code below correctly opens and saves the document, my attempt to add a password is completely ignored. Curiously, the file size shrinks from 12kb to 9kb when I do this.
Here is my code:
import socket
import uno
import sys
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve( "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext" )
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",ctx)
from com.sun.star.beans import PropertyValue
properties=[]
oDocB = desktop.loadComponentFromURL ("file:///C:/Docs/PlainDoc.docx","_blank",0, tuple(properties) )
sp=[]
sp1=PropertyValue()
sp1.Name='FilterName'
sp1.Value='MS Word 2007 XML'
sp.append(sp1)
sp2=PropertyValue()
sp2.Name='Password'
sp2.Value='secret'
sp.append(sp2)
oDocB.storeToURL("file:///C:/Docs/PasswordDoc.docx",sp)
oDocB.dispose()
I've had great results using Python/Uno to open password-protected files, but I cannot get it to protect a previously unprotected document. I've tried enabling the macro recorder and recording my actions - it recorded the following LibreOffice BASIC code:
sub SaveDoc
rem ----------------------------------------------------------------------
rem define variables
dim document as object
dim dispatcher as object
rem ----------------------------------------------------------------------
rem get access to the document
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
rem ----------------------------------------------------------------------
dim args1(2) as new com.sun.star.beans.PropertyValue
args1(0).Name = "URL"
args1(0).Value = "file:///C:/Docs/PasswordDoc.docx"
args1(1).Name = "FilterName"
args1(1).Value = "MS Word 2007 XML"
args1(2).Name = "EncryptionData"
args1(2).Value = Array(Array("OOXPassword","secret"))
dispatcher.executeDispatch(document, ".uno:SaveAs", "", 0, args1())
end sub
Even when I try to run that, it...saves an unprotected document, with no password encryption. I've even tried converting the macro above into the equivalent Python code, but to no avail either. I don't get any errors, it simply doesn't protect the document.
Finally, out of desperation, I've even tried other approaches that don't include LibreOffice, for example, using the Apache POI library as per the following existing StackOverflow question:
Python or LibreOffice Save xlsx file encrypted with password
...but I just get an error saying "Error: Could not find or load main class org.python.util.jython". I've tried upgrading my JDK, tweaking the paths used in the example, i.e. had an "intelligent" go, but still no joy. I suspect the error above is trivial to fix, but I'm not a Java developer and lack the experience in this area.
Does anyone have any solution? Do you have some LibreOffice code that can do this (password-protect .docx/.xlsx/.pptx files)? Or OpenOffice for that matter, I'm not precious about which package I use. Or something else entirely!
NOTE: I appreciate this is trivial using full-fat Microsoft Office, but thanks to Microsoft's licensing restrictions, is a complete no-go for this project - I have to use an alternative.
The following example is from page 40 (file page 56) of Useful Macro Information
For OpenOffice.org by Andrew Pitonyak (http://www.pitonyak.org/AndrewMacro.odt). The document is directed to OpenOffice.org Basic but is generally applicable to LibreOffice as well. The example differs from the macro recorder version primarily in its use of the documented API rather than dispatch calls.
5.8.3. Save a document with a password
To save a document with a password, you must set the “Password”
attribute.
Listing 5.19: Save a document using a password.
Sub SaveDocumentWithPassword
Dim args(0) As New com.sun.star.beans.PropertyValue
Dim sURL$
args(0).Name ="Password"
args(0).Value = "test"
sURL=ConvertToURL("/andrew0/home/andy/test.odt")
ThisComponent.storeToURL(sURL, args())
End Sub
The argument name is case sensitive, so “password” will not work.

Python libreoffice set margin values, optimal height and print document with uno

I trying to interact with libreoffice with Python, which integrated in Libreoffice installation. And I didn't found anywhere how can I set margins in PageStyle, to set optimal height of row and prind few copies of document. Or, maybe, I can write macro in Libreoffice and run it from python. Code below is not working.
pageStyle = document.getStyleFamilies().getByName("PageStyles")
page = pageStyle.getByName("Default")
page.LeftMargin = 500
P.S. Sorry for my english.
In most versions of LibreOffice, the name of the default style is "Default Style". In Apache OpenOffice, it is named "Default" instead.
Here is the complete code. For example, name the file change_settings.py.
import uno
def set_page_style_margins():
document = XSCRIPTCONTEXT.getDocument()
pageStyle = document.getStyleFamilies().getByName("PageStyles")
page = pageStyle.getByName("Default Style")
page.LeftMargin = 500
g_exportedScripts = set_page_style_margins,
On my Windows 10 system, this script is located under the directory C:\Users\<your username>\AppData\Roaming\LibreOffice\4\user\Scripts\python. You will need to create the last two directories, and case must match.
Now, in LibreOffice Writer, go to Tools -> Macros -> Run Macro. Expand to My Macros -> change_settings and select the macro name set_page_style_margins.
For a full introduction to Python with LibreOffice:
https://wiki.openoffice.org/wiki/Python/Transfer_from_Basic_to_Python
http://christopher5106.github.io/office/2015/12/06/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html

How to parse doc files on a mac with python? [duplicate]

for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux?
Is there any library?
Use the native Python docx module. Here's how to extract all the text from a doc:
document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexs invokes cthulu. Don't do it!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getouput function to run a couple of shell scripts, namely wvText (which extracts text from a Word document, and cat to read the file output). After that, the entire text from the Word document will be in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]>//g'*: Remove everything inside tags
grep -v '^[[:space:]]$'*: Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despit being a bit of an ugly hack ;)
If your intention is to use purely python modules without calling a subprocess, you can use the zipfile python modude.
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
if item.orig_filename == 'word/document.xml':
content = docx.read(item.orig_filename)
else:
pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
if '>' in item:
bad_good = item.split('>')
if bad_good[-1] != '':
fullyclean.append(bad_good[-1])
else:
pass
else:
pass
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.
Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv
To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))
If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
Is this an old question?
I believe that such thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000), using Python only, lacks.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file-system stored in a file. It actually uses FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store more files within a file, like pictures etc.
The same is done in *.docx by using ZIP archive instead.
There are packages available on PyPI that can read OLE files. Like (olefile, compoundfiles, ...)
I used compoundfiles package to open *.doc file.
However, in MS Word 97-2000, internal subfiles are not XML or HTML, but binary files.
And as this is not enough, each contains an information about other one, so you have to read at least two of them and unravel stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
Code below is very hastily composed and tested on small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of original algorithm used uint32 and int32 when unpacking correctly?
I copied each occurence as in original algo.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack
__all__ = ["doc2text"]
def doc2text (path):
text = u""
cr = CompoundFileReader(path)
# Load WordDocument stream:
try:
f = cr.open("WordDocument")
doc = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupted or it is not a Word document at all."
# Extract file information block and piece table stream informations from it:
fib = doc[:1472]
fcClx = unpack("L", fib[0x01a2l:0x01a6l])[0]
lcbClx = unpack("L", fib[0x01a6l:0x01a6+4l])[0]
tableFlag = unpack("L", fib[0x000al:0x000al+4l])[0] & 0x0200l == 0x0200l
tableName = ("0Table", "1Table")[tableFlag]
# Load piece table stream:
try:
f = cr.open(tableName)
table = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupt. '%s' piece table stream is missing." % tableName
cr.close()
# Find piece table inside a table stream:
clx = table[fcClx:fcClx+lcbClx]
pos = 0
pieceTable = ""
lcbPieceTable = 0
while True:
if clx[pos]=="\x02":
# This is piece table, we store it:
lcbPieceTable = unpack("l", clx[pos+1:pos+5])[0]
pieceTable = clx[pos+5:pos+5+lcbPieceTable]
break
elif clx[pos]=="\x01":
# This is beggining of some other substructure, we skip it:
pos = pos+1+1+ord(clx[pos+1])
else: break
if not pieceTable: raise CompoundFileError, "The file is corrupt. Cannot locate a piece table."
# Read info from pieceTable, about each piece and extract it from WordDocument stream:
pieceCount = (lcbPieceTable-4)/12
for x in xrange(pieceCount):
cpStart = unpack("l", pieceTable[x*4:x*4+4])[0]
cpEnd = unpack("l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
fcValue = unpack("L", pieceDescriptor[2:6])[0]
isANSII = (fcValue & 0x40000000) == 0x40000000
fc = fcValue & 0xbfffffff
cb = cpEnd-cpStart
enc = ("utf-16", "cp1252")[isANSII]
cb = (cb*2, cb)[isANSII]
text += doc[fc:fc+cb].decode(enc, "ignore")
return "\n".join(text.splitlines())
I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!
Just an option for reading 'doc' files without using COM: miette. Should work on any platform.
Aspose.Words Cloud SDK for Python is a platform independent solution to convert MS Word/Open Office files to text. It is a commercial product but free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
#Convert RTF to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)

How to generate HTML report in Python?

I'm looking for a way to print out all my graphs (from matplotlib but already saved as png files) and some data frames in HTML, just like what I usually do with R2HTML.
However I couldn't find detailed descriptions about Python module or functions doing this. Could anyone give me some suggestions?
Looks like a couple of tools exist. I've focused on simple html text writers since I'll be crafting my report page structures completely from scratch. This may differ from the R2HTML which I would guess has a lot of convenience functionality for the sort of things one wishes to stuff into pages from R objects.
HTMLTags
This fella wrote a module from scratch at this ActiveState community page: HTMLTags - generate HTML in Python (Python recipe). In the comments I discovered most of the tools I enumerate for this answer.
htmlgen
This package looks pretty good and basic. I'm not sure if its the same module described in this old article.
XIST
this one looks pretty legit and includes a parser as well. Here's some sample page generation code using the format that you'll need to use to interject all the appropriate python commands inbetween various html elements. The other format utilizes a bunch of nested function calls which will make the python interjection very awkward at best.
with xsc.build() :
with xsc.Frag() as myHtmlReport :
+xml.XML()
+html.DocTypeXHTML10transitional()
with html.html() :
reportTitle = "My report title"
with html.head() :
+meta.contenttype()
+html.title( reportTitle )
with html.body() :
# Insert title header
+html.h1( reportTitle )
with html.table() :
# Header Row
with html.tr() :
with html.td() :
+xsc.Text( "Col 1 Header" )
with html.td() :
+xsc.Text( "Col 2 Header" )
# Data Rows
for i in [ 1, 2, 3, 4, 5 ] :
with html.tr() :
with html.td() :
+xsc.Text( "data1_" + str(i) )
with html.td() :
+xsc.Text( "data2_" + str(i) )
# Write the report to disk
with open( "MyReportfileName.html" , "wb" ) as f:
f.write( myHtmlReport.bytes( encoding="us-ascii" ) )
libxml2
through Python bindings there's a plain vanilla xmlwriter module, which is probably too generic but good to know about nonetheless. Windows binaries for this package can be found here.
first install the package
https://pypi.org/project/html-testRunner/
and use
if __name__ == "__main__":
unittest.main(testRunner=HtmlTestRunner.HTMLTestRunner(
output= r"here the place where you want to store the your
HTML report(mean to say folder path"))

Categories

Resources