Reading/Writing MS Word files in Python

Reading/Writing MS Word files in Python - python

Is it possible to read and write Word (2003 and 2007) files in Python without using a COM object?
I know that I can:
f = open('c:\file.doc', "w")
f.write(text)
f.close()
but Word will read it as an HTML file not a native .doc file.

See python-docx, its official documentation is available here.
This has worked very well for me.
Note that this does not work for .doc files, only .docx files.

If you only what to read, it is simplest to use the linux soffice command to convert it to text, and then load the text into python:

I'd look into IronPython which intrinsically has access to windows/office APIs because it runs on .NET runtime.

doc (Word 2003 in this case) and docx (Word 2007) are different formats, where the latter is usually just an archive of xml and image files. I would imagine that it is very possible to write to docx files by manipulating the contents of those xml files. However I don't see how you could read and write to a doc file without some type of COM component interface.

Related

How to convert docx file to .chm using Python

I want to convert contents(text, images, links) of docx file to .chm file using Python. Can anyone please suggest how to do.
I tried to read the docx file content using docx2txt
https://github.com/ankushshah89/python-docx2txt package. But I am not sure how to read the images and links in the file.
Can someone please suggest how to read each content separately and convert it to .chm file.

You maybe warned this has a learn curve.
You need to extract all sections from your Word document into clean HTML files including the graphic files.
Please try to Save Word as HTML. But I think this don't make clean HTML.
You need the Microsoft Htmlhelp compiler for creating Chm files. I recommend using a converter tool or a Help Authoring Tool (Hat) for your task.
Search by Google for such tool "DoctoChm" and give it a try for your needs.

I recently needed to convert some resumes to plain text. There are any number of use cases for wanting to extract readable text from binary formats.
you can see the url 'http://davidmburke.com/2014/02/04/python-convert-documents-doc-docx-odt-pdf-to-plain-text-without-libreoffice/'

Saving files openoffice Python

I'm working on a script to create an OpenOffice document. After this i want to save the file. Maybe later also as an PDF.. Google doesn't give me any information how to fix this..
My question here is: What method should be used to save an openoffice-writer document?
Thanks in advance!

You should look at this similar question which answer covers both MSWord and OOWriter (by the way, creating a Word file could be the easiest to be read with OpenOffice).
How can I create a Word document using Python?
Alexis

You can create a rtf file with pyrtf or it's variants, and for pdf you can use reportlab. These are libraries for use in python, not to control remotely oo. There are other libraries for other formats.

Read contents of a pdf file

Is there a commandline tool to read a pdf file on linux.Please indicate the appropriate urls for this.
Thanks..

Xpdf and Poppler contain the commandline-utility pdftotext wich converts PDF files to plain text.

There is PyODConverter. It uses OpenOffice working as a service and can convert between various document formats including PDF and simple text.

Not a command line tool but a pdf reading and generation framework
http://www.reportlab.com/software/opensource/
you should also be able to write a simple reader using
https://pypi.org/project/pypdf/
http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
you can also look at:
http://www.unixuser.org/~euske/python/pdfminer/index.html

Dynamic generation of .doc files

How can you dynamically generate a .doc file using AJAX? Python? Adobe AIR? I'm thinking of a situation where an online program/desktop app takes in user feedback in article form (a la wiki) in Icelandic character encoding and then upon pressing a button releases a .doc file containing the user input for the webpage. Any solutions/suggestions would be much appreciated.
PS- I don't want to go the C#/Java way with this.

The problem with the *.doc MS word format is, that it isn't documented enough, therefor it can't have a very good support like, for example, PDF, which is a standard.
Except of the problems with generating the doc, you're users might have problems reading the doc files. For example users on linux machines.
You should consider producing RTF on the server. It is more standard, and thus more supported both for document generation, and for reading the document afterwards. Unless you need very specific features, it should suffice for most of documents types, and MS word opens it by default, just like it opens its own native format.
PyRTF is an project you can use for RTF generation with python.

It don't have to do much with ajax(in th sense that ajax is generally used for dynamic client side interactions)
You need a server side script which takes the input and converts it to doc.
You may use something like openoffice and python if it has some interface
see http://wiki.services.openoffice.org/wiki/Python
or on windows you can directly use Word COM objects to create doc using win32apis
but it is less probable, that a windows server serving python :)
I think better alternative is to generate PDF which would be nicer and easier.
Reportlab has a wonderful pdf generation library and it works like charm from python.
Once you have pdf you may use some pdf to doc converter, but I think PDF would be good enough.
Edit: Doc generation
On second thought if you are insisting on DOC you may have windows server in that case
you can use COM objets to generate DOC, xls or whatever see
http://win32com.goermezer.de/content/view/173/284/

How to programmatically insert comments into a Microsoft Word document?

Looking for a way to programmatically insert comments (using the comments feature in Word) into a specific location in a MS Word document. I would prefer an approach that is usable across recent versions of MS Word standard formats and implementable in a non-Windows environment (ideally using Python and/or Common Lisp). I have been looking at the OpenXML SDK but can't seem to find a solution there.

Here is what I did:
Create a simple document with word (i.e. a very small one)
Add a comment in Word
Save as docx.
Use the zip module of python to access the archive (docx files are ZIP archives).
Dump the content of the entry "word/document.xml" in the archive. This is the XML of the document itself.
This should give you an idea what you need to do. After that, you can use one of the XML libraries in Python to parse the document, change it and add it back to a new ZIP archive with the extension ".docx". Simply copy every other entry from the original ZIP and you have a new, valid Word document.
There is also a library which might help: openxmllib

If this is server side (non-interactive) use of the Word application itself is unsupported (but I see this is not applicable). So either take that route or use the OpenXML SDK to learn the markup needed to create a comment. With that knowledge it is all about manipulating data.
The .docx format is a ZIP of XML files with a defines structure, so mostly once you get into the ZIP and get the right XML file it becomes a matter of modifying an XML DOM.
The best route might be to take a docx, copy it, add a comment (using Word) to one, and compare. A diff will show you the kind of elements/structures you need to be looking up in the SDK (or ISO/Ecma standard).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Reading/Writing MS Word files in Python - python

Is it possible to read and write Word (2003 and 2007) files in Python without using a COM object? I know that I can: f = open('c:\file.doc', "w") f.write(text) f.close() but Word will read it as an HTML file not a native .doc file.

See python-docx, its official documentation is available here. This has worked very well for me. Note that this does not work for .doc files, only .docx files.

If you only what to read, it is simplest to use the linux soffice command to convert it to text, and then load the text into python:

I'd look into IronPython which intrinsically has access to windows/office APIs because it runs on .NET runtime.

Related

How to convert docx file to .chm using Python

Saving files openoffice Python

Read contents of a pdf file

Dynamic generation of .doc files

How to programmatically insert comments into a Microsoft Word document?

Categories

Resources