How can I convert file.docx to file.doc using Python? I have a code that outputs a file in docx format, but this program is for someone who can only use Word 2003, so I need to convert that file to .doc using Python. How can I do it? Thank you in advance.
Bit clunky, but you could use pywinauto to programatically open your .docx documents in word, then save-as .doc. It'd be using word to do the conversion, so it should be as clean as you could get.
This is a snippet of what I've used for converting to pdf within word (it was just a test). You'd have to follow the keystrokes necessary to save as a .doc
import pywinauto
from pywinauto.application import Application
app1 = Application(backend="uia").connect(path="C:\\Program Files (x86)\\Microsoft Office\\root\\Office16\\WINWORD.EXE")
wordhndl = app1.top_window()
wordhndl.type_keys('^o')
wordhndl.type_keys('%f')
wordhndl.type_keys('^o')
wordhndl.type_keys('^o')
#Now that we're in a sub-window, using the top_window() handle doesn't work...
#Instead refer to absolute (using friendly_class_name())
app1.Dialog.Open.type_keys("Y:\\996.Software\\04.Python\\Test\\SampleDoc1.docx")
app1.Dialog.Open.type_keys('~')
#Publish it to pdf
app1.SampleDoc1docx.type_keys('%f')
app1.SampleDoc1docx.type_keys('e')
app1.SampleDoc1docx.type_keys('p')
app1.SampleDoc1docx.type_keys('a')
app1.SampleDoc1docx.PublishasPDForXPS.Publish.type_keys('~')
#Deal with popups & prompts
if app1.Dialog.PublishasPDForXPS.ConfirmSaveAs.exists():
app1.Dialog.PublishasPDForXPS.ConfirmSaveAs.Yes.click() #This line can take some time...
I think the .doc keystrokes would be (not tested)
app1.SampleDoc1docx.type_keys('%f')
app1.SampleDoc1docx.type_keys('a')
app1.SampleDoc1docx.type_keys('y')
app1.SampleDoc1docx.type_keys('4')
app1.SampleDoc1docx.type_keys('{DOWN}')
app1.SampleDoc1docx.type_keys('{DOWN}')
app1.SampleDoc1docx.type_keys('~')
app1.SampleDoc1docx.type_keys('{RIGHT}')
app1.SampleDoc1docx.type_keys('~')
But... the better solution is to use word. I've used VBA within word to do this exact thing before. Don't have the code to hand, but a good pointer would be:
https://www.datanumen.com/blogs/3-quick-ways-to-batch-convert-word-doc-to-docx-files-and-vice-versa/
you can do something of the likes of:
import docx
doc = docx.Document("myWordxFile.docx")
doc.save('myNewWordFile.doc')
check this too. Good luck!
I'am trying to get lines from a text file (.log) into a .txt document.
I need get into my .txt file the same data. But the line itself is sometimes different. From what I have seen on internet, it's usualy done with a pattern that will anticipate how the line is made.
1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1
The line i'm working with usualy looks like this. The data i'am trying to collect are :
\ip\0.0.0.0
\name\NickName_of_the_player
\ja_guid\420D990471FC7EB6B3EEA94045F739B7
And print these data, inside a .txt file. Here is my current code.
As explained above, i'am unsure about what keyword to use for my research on google. And how this could be called (Because the string isn't the same?)
I have been looking around alot, and most of the test I have done, have allowed me to do some things, but i'am not yet able to do as explained above. So i'am in hope for guidance here :) (Sorry if i'am noobish, I understand alot how it works, I just didn't learned language in school, I mostly do small scripts, and usualy they work fine, this time it's way harder)
def readLog(filename):
with open(filename,'r') as eventLog:
data = eventLog.read()
dataList = data.splitlines()
return dataList
eventLog = readLog('games.log')
You'll need to read the files in "raw" mode rather than as strings. When reading the file from disk, use open(filename,'rb'). To use your example, I ran
text_input = r"1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1"
text_as_array = text_input.split('\\')
You'll need to know which columns contain the strings you care about. For example,
with open('output.dat','w') as fil:
fil.write(text_as_array[6])
You can figure these array positions from the sample string
>>> text_as_array[6]
'46.98.134.211:24806'
>>> text_as_array[34]
'Faybell'
>>> text_as_array[44]
'420D990471FC7EB6B3EEA94045F739B7'
If the column positions are not consistent but the key-value pairs are always adjacent, we can leverage that
>>> text_as_array.index("ip")
5
>>> text_as_array[text_as_array.index("ip")+1]
'46.98.134.211:24806'
I'm preprocessing/preparing a batch of MS Word documents, which I automaticly converted from .doc to .docx to use them later to train an NLP-model with entity recognition.
I'm a newbie in Python-programming as well as in Spacy-NLP but I have some programming experience in other languages but right now my biggest question that makes me feel like "I don't know what to do or how to do it" is this:
I have the documents in a folder. I need to parse the raw text and titles (which are in the name of the document itself, not the first line in the document) to make a corpus which is going to be used later on to train the NLP-model.
Since I'm a newbie I have a lot to learn. So I've already done a lot of research on this topic. In the beginning It was a pain in the *** for me to convert all these .doc-files to .docx-files but I've finaly found a way to do that.
Since I need to get the title and text from a bunch of documents I assumed that I need to 'walk' over the documents in the folder, using a for-loop, which I did like this:
path = '/path/to/folder'
for filename in os.listdir(path):
if filename.endswith('.docx'):
path = os.path.join(path, filename)
I've also tried what I found in this stackoverflow-link (using the native python-docx module): extracting text from MS word files in python
But this gave me this TypeError: sequence item 0: expected str instance, bytes found
edit:
The TypeError problem is solved, I tried again 3 different ways to extract text from a Word Document and thisone gave me the best output (without errors):
´´´
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
print(getText('test.docx'))
´´´
So now I (finaly) know how to do a good text-extraction from a Word document. I still need to figure out how to do this on a whole folder and what are my next steps in the proces in order to make a corpus that will be used for NLP.
Btw. I'm using Pycharm in a Ubuntu 18.04 virtual machine and Python version 3.6.
(I've also explained my problem a bit in a different way in this post https://python-forum.io/Thread-Data-extraction-from-multiple-MS-Word-file-s-in-python (see comment #9). I posted this yesterday, it was before trying out what I've found in the stackoverflow-link.)
Could anyone give me any idea about what is a good way to extract titles from MS Word document in order to make a corpus of files to use in SpaCy?
Thank you very much to take your time.
I am working on a process to automate generation of offer letters for candidates. The candidate information is in Excel and contains standard information needed for offer letter generation such as candidate name, date of joining, location, job title, CTC etc.
Is there a way to generate multiple offer letters (output file name _.docx) while preserving the formatting of the docx template?
Using Stackoverflow's help, I was able to utilize python-docx package and generate multiple offer letters. Thus approach however strips all the formatting from the offer letter.
import os
from pandas import *
import datetime
from docxtpl import DocxTemplate
doc = DocxTemplate("\\template\\offer_letter_template.docx")
xls = ExcelFile("\\data\\candidate_data.xlsx")
df = xls.parse(xls.sheet_names[0])
print (df.to_json(orient='records'))
Output:
[{"offer_letter_date":"July 27, 2019","candidate_name":"John Wick","candidate_email":"john.wick#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":283000},{"offer_letter_date":"July 17, 2019","candidate_name":"Jane Doe","candidate_email":"jane.doe#gmail.com","candidate_location":"NYC","candidate_job_title":"Business Development Executive","candidate_ctc":290000}]
context = df.to_json(orient='records')
doc.render(context)
I am struggling with creating a loop around context so that candidate information is saved in respective file rather than one file itself. Can someone please help?
Jinja2 for word templating was really helpful but I could not replicate it with a loop.
It is possible to create multiple docx files, unfortunately nobody said in the docxtpl documentation that once you load the template, replacements are done in-place, thus preventing any further context replacements.
A workaround which you may like would be reopening the file at every iteration.
Something like:
context=df.to_json(orient='records')
for i in len(context):
doc = DocxTemplate("\\template\\offer_letter_template.docx")
template.render(context[i])
template.save("docs-folder\\%s%(context[i][candidate_name]))
^Might need some revision, but you get the point.
for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux?
Is there any library?
Use the native Python docx module. Here's how to extract all the text from a doc:
document = docx.Document(filename)
docText = '\n\n'.join(
paragraph.text for paragraph in document.paragraphs
)
print(docText)
See Python DocX site
Also check out Textract which pulls out tables etc.
Parsing XML with regexs invokes cthulu. Don't do it!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
OpenOffice.org can be scripted with Python: see here.
Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getouput function to run a couple of shell scripts, namely wvText (which extracts text from a Word document, and cat to read the file output). After that, the entire text from the Word document will be in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]>//g'*: Remove everything inside tags
grep -v '^[[:space:]]$'*: Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despit being a bit of an ugly hack ;)
If your intention is to use purely python modules without calling a subprocess, you can use the zipfile python modude.
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
if item.orig_filename == 'word/document.xml':
content = docx.read(item.orig_filename)
else:
pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
if '>' in item:
bad_good = item.split('>')
if bad_good[-1] != '':
fullyclean.append(bad_good[-1])
else:
pass
else:
pass
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module.
Hope this helps.
Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv
To read Word 2007 and later files, including .docx files, you can use the python-docx package:
from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')
To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:
sudo apt-get install antiword
Then just call it from your python script:
import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))
If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
Is this an old question?
I believe that such thing does not exist.
There are only answered and unanswered ones.
This one is pretty unanswered, or half answered if you wish.
Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered.
But methods for extracting text from *.doc (MS Word 97-2000), using Python only, lacks.
Is this complicated?
To do: not really, to understand: well, that's another thing.
When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.
MS Word (*.doc) file is an OLE2 compound file.
Not to bother you with a lot of unnecessary details, think of it as a file-system stored in a file. It actually uses FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???)
In this way, you can store more files within a file, like pictures etc.
The same is done in *.docx by using ZIP archive instead.
There are packages available on PyPI that can read OLE files. Like (olefile, compoundfiles, ...)
I used compoundfiles package to open *.doc file.
However, in MS Word 97-2000, internal subfiles are not XML or HTML, but binary files.
And as this is not enough, each contains an information about other one, so you have to read at least two of them and unravel stored info accordingly.
To understand fully, read the PDF document from which I took the algorithm.
Code below is very hastily composed and tested on small number of files.
As far as I can see, it works as intended.
Sometimes some gibberish appears at the start, and almost always at the end of text.
And there can be some odd characters in-between as well.
Those of you who just wish to search for text will be happy.
Still, I urge anyone who can help to improve this code to do so.
doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf
Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
* Did the author of original algorithm used uint32 and int32 when unpacking correctly?
I copied each occurence as in original algo.
* Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
* Did I interpret each C# command correctly?
I think I did!
"""
from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack
__all__ = ["doc2text"]
def doc2text (path):
text = u""
cr = CompoundFileReader(path)
# Load WordDocument stream:
try:
f = cr.open("WordDocument")
doc = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupted or it is not a Word document at all."
# Extract file information block and piece table stream informations from it:
fib = doc[:1472]
fcClx = unpack("L", fib[0x01a2l:0x01a6l])[0]
lcbClx = unpack("L", fib[0x01a6l:0x01a6+4l])[0]
tableFlag = unpack("L", fib[0x000al:0x000al+4l])[0] & 0x0200l == 0x0200l
tableName = ("0Table", "1Table")[tableFlag]
# Load piece table stream:
try:
f = cr.open(tableName)
table = f.read()
f.close()
except: cr.close(); raise CompoundFileError, "The file is corrupt. '%s' piece table stream is missing." % tableName
cr.close()
# Find piece table inside a table stream:
clx = table[fcClx:fcClx+lcbClx]
pos = 0
pieceTable = ""
lcbPieceTable = 0
while True:
if clx[pos]=="\x02":
# This is piece table, we store it:
lcbPieceTable = unpack("l", clx[pos+1:pos+5])[0]
pieceTable = clx[pos+5:pos+5+lcbPieceTable]
break
elif clx[pos]=="\x01":
# This is beggining of some other substructure, we skip it:
pos = pos+1+1+ord(clx[pos+1])
else: break
if not pieceTable: raise CompoundFileError, "The file is corrupt. Cannot locate a piece table."
# Read info from pieceTable, about each piece and extract it from WordDocument stream:
pieceCount = (lcbPieceTable-4)/12
for x in xrange(pieceCount):
cpStart = unpack("l", pieceTable[x*4:x*4+4])[0]
cpEnd = unpack("l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
fcValue = unpack("L", pieceDescriptor[2:6])[0]
isANSII = (fcValue & 0x40000000) == 0x40000000
fc = fcValue & 0xbfffffff
cb = cpEnd-cpStart
enc = ("utf-16", "cp1252")[isANSII]
cb = (cb*2, cb)[isANSII]
text += doc[fc:fc+cb].decode(enc, "ignore")
return "\n".join(text.splitlines())
I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!
Just an option for reading 'doc' files without using COM: miette. Should work on any platform.
Aspose.Words Cloud SDK for Python is a platform independent solution to convert MS Word/Open Office files to text. It is a commercial product but free trial plan provides 150 monthly API calls.
P.S: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
#Convert RTF to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)