I'm tasked with converting tons of .doc files to .pdf, and the only way my supervisor wants me to do this is through MS Word 2010. I know I should be able to automate this with Python COM automation. The only problem is I don't know how or where to start. I tried searching for tutorials but wasn't able to find any (maybe I have, but I don't know what I'm looking for).
Right now I'm reading through this. I don't know how useful it's going to be.
A simple example using comtypes, converting a single file, with input and output filenames given as command-line arguments:
import sys
import os
import comtypes.client

wdFormatPDF = 17  # WdSaveFormat value for PDF in the Word object model

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
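You would run it from a command prompt along the lines of python convert.py input.doc output.pdf (the script name here is just illustrative).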
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
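Put together, a complete pywin32 version would look like this (a sketch mirroring the comtypes script above):

import sys
import os
import win32com.client

wdFormatPDF = 17  # same PDF format constant as above

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = win32com.client.Dispatch('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()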
You can use the docx2pdf Python package to bulk convert docx to pdf. It can be used as both a CLI and a Python library. It requires Microsoft Office to be installed, and uses COM on Windows and AppleScript (JXA) on macOS.

Install it with:

pip install docx2pdf

As a library:

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

As a CLI:

docx2pdf input.docx output.pdf

Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions, but none of them worked well on Linux distributions. I recommend this one:
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    # LibreOffice prints the output path on stdout, e.g. "... -> /tmp/file.pdf using filter ..."
    match = re.search('-> (.*?) using filter', process.stdout.decode())
    if match is None:
        # conversion failed; surface LibreOffice's error output
        raise RuntimeError(process.stderr.decode())
    return match.group(1)

def libreoffice_exec():
    # TODO: provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'
and you call the function like this:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
Source:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I worked on this problem for half a day, so I think I should share some of my experience. Steven's answer is right, but it fails on my computer. There are two key points to fix it:

(1) When creating the 'Word.Application' object, make the Word application visible before opening any documents. (I honestly cannot explain why this works. If I don't do it on my computer, the program crashes when I try to open a document while Word is invisible, and the 'Word.Application' object is then deleted by the OS.)

(2) After doing (1), the program works most of the time but may still fail often. The crash error COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None)) means that the COM server cannot respond quickly enough, so I added a delay before trying to open a document.

After these two steps, the program works with no failures. The demo code is below. If you have encountered the same problems, try these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute paths are needed
# be careful with the backslash '\': use '\\' or '/' or a raw string r"..."
in_file = r'absolute path of input docx file 1'
out_file = r'absolute path of output pdf file 1'
in_file2 = r'absolute path of input docx file 2'
out_file2 = r'absolute path of output pdf file 2'

# print out filenames
print(in_file)
print(out_file)
print(in_file2)
print(out_file2)
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (written in Python) with OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
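A typical invocation looks like this (assuming the unoconv binary is on your PATH):

unoconv -f pdf document.docx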
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat, which gives you access to the PDF options you would normally see in Word's PDF dialog. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
                        ExportFormat=17,    # 17 = PDF output, 18 = XPS output
                        OpenAfterExport=False,
                        OptimizeFor=0,      # 0 = print (higher res), 1 = screen (lower res)
                        CreateBookmarks=1,  # 0 = no bookmarks, 1 = heading bookmarks only, 2 = bookmarks match Word bookmarks
                        DocStructureTags=True)
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Steven's answer works, but if you use a for loop to export multiple files, make sure to place the CreateObject or Dispatch statement before the loop; the Word object only needs to be created once. See my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
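In other words (a minimal sketch, with illustrative filenames):

import os
import win32com.client

wdFormatPDF = 17
doc_files = ['first.docx', 'second.docx']  # illustrative

word = win32com.client.Dispatch('Word.Application')  # create once, before the loop
for f in doc_files:
    in_file = os.path.abspath(f)
    doc = word.Documents.Open(in_file)
    doc.SaveAs(os.path.splitext(in_file)[0] + '.pdf', FileFormat=wdFormatPDF)
    doc.Close()
word.Quit()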
If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented could be adapted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution supports all of the extensions specified below.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My solution: GitHub link
I modified the code from Docx2PDF.
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing, which were usually an order of magnitude bigger than expected. After looking at how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and I've been rather impressed with its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client

def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName    # make sure we're controlling the right PDF printer

    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)
    statusFile = re.sub(r"\.[^.]+$", ".status", file)
    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no")   # disable balloon tip
    settings.SetValue("StatusFile", statusFile)       # created after print job
    settings.WriteSettings(True)                      # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName)     # send to the Bullzip virtual printer

    # wait until the print job completes before continuing,
    # otherwise the settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while (datetime.datetime.now() - timestamp).seconds < 10:
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")
I was working with this solution, but I needed to search for all .docx, .dotm, .docm, .odt, .doc and .rtf files and then turn them all into .pdf (Python 3.7.5). Hope it works...
import os
import win32com.client

wdFormatPDF = 17
extensions = (".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm")

# create the Word instance once, outside the loop
word = win32com.client.Dispatch('Word.Application')
word.Visible = False

for root, dirs, files in os.walk(r'your directory here'):
    for f in files:
        if f.endswith(extensions):
            try:
                print(f)
                in_file = os.path.join(root, f)
                doc = word.Documents.Open(in_file)
                # strip the extension; Word appends .pdf for FileFormat 17
                doc.SaveAs(os.path.splitext(in_file)[0], FileFormat=wdFormatPDF)
                doc.Close()
                print('done')
                os.remove(in_file)  # delete the source document after converting
            except Exception:
                print('could not open')
                # os.remove(in_file)

word.Quit()
The try/except is for documents I couldn't read; it lets the loop continue to the last document instead of exiting early.
You should start by investigating so-called virtual PDF print drivers.
As soon as you find one, you should be able to write a batch file that prints your DOC files to PDF files. You can probably do this in Python too (set up the printer driver output and issue a document/print command in MS Word; the latter can be done from the command line, AFAIR).
import docx2txt
from win32com import client
import os

files_from_folder = r"c:\doc"
txt_folder = os.path.join(files_from_folder, 'txt_files')
os.makedirs(txt_folder, exist_ok=True)  # create the output folder once, up front

amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True

for filename in os.listdir(files_from_folder):
    print(filename)
    if filename.endswith('docx'):
        # .docx can be read directly, without Word
        text = docx2txt.process(os.path.join(files_from_folder, filename))
    elif filename.endswith('doc'):
        # legacy .doc needs Word itself
        doc = word.Documents.Open(os.path.join(files_from_folder, filename))
        text = doc.Range().Text
        doc.Close()
    else:
        continue
    new_filename = os.path.splitext(filename)[0] + '.txt'
    with open(os.path.join(txt_folder, new_filename), 'w', encoding='utf-8') as t:
        t.write(text)
    print(f'{filename} transferred ({amount})')
    amount += 1

word.Quit()
For the full source code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and using OpenOffice, which has a Python API. OpenOffice has built-in support for Python, and someone has created a library specific to this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with Word.
I'm trying to save a bunch of pages in a folder next to the .py file that creates them. I'm on Windows, so when I try to write the trailing backslash for the file path, it produces a special character instead.
Here's what I'm talking about:
from bs4 import BeautifulSoup
import urllib2, urllib
import csv
import requests
from os.path import expanduser

print "yes"
with open('intjpages.csv', 'rb') as csvfile:
    pagereader = csv.reader(open("intjpages.csv","rb"))
    i=0
    for row in pagereader:
        print row
        agentheader = {'User-Agent': 'Nerd'}
        request = urllib2.Request(row[0],headers=agentheader)
        url = urllib2.urlopen(request)
        soup = BeautifulSoup(url)
        for div in soup.findAll('div', {"class" : "side"}):
            div.extract()
        body = soup.find_all("div", { "class" : "md" })
        name = "page" + str(i) + ".html"
        path_to_file = "\cleanishdata\"
        outfile = open(path_to_file + name, 'w')
        #outfile = open(name,'w') #this works fine
        body=str(body)
        outfile.write(body)
        outfile.close()
        i+=1
I can save the files to the same folder the .py file is in, but when I process the files using RapidMiner it picks up the program file too. Also, it would just be neater if I could save them in their own directory.
I am surprised this hasn't already been answered on the entire internet.
EDIT: Thanks so much! I ended up using information from both of your answers. IDLE was making me use raw strings (r'...') to concatenate the strings with the backslashes. I needed to use the path_to_script technique from abamert to solve the problem of creating a new folder wherever the .py file is. Thanks again! Here are the relevant coding changes:
name = "page" + str(i) + ".txt"
path_to_script_dir = os.path.dirname(os.path.abspath("links.py"))
newpath = path_to_script_dir + r'\\' + 'cleanishdata'
if not os.path.exists(newpath): os.makedirs(newpath)
outfile = open(path_to_script_dir + r'\\cleanishdata\\' + name, 'w')
body=str(body)
outfile.write(body)
outfile.close()
i+=1
Are you sure you're escaping your backslashes properly?
The \" in your string "\cleanishdata\" is actually an escaped quote character (").
You probably want
"\\cleanishdata\\"
Note that a raw string won't help with the trailing backslash here: a raw string literal can't end in an odd number of backslashes, so r"\cleanishdata\" is itself a syntax error.
You probably also want to check out the os.path library, particularly os.path.join and os.path.dirname.
For example, if your file is in C:\Base\myfile.py and you want to save files to C:\Base\cleanishdata\output.txt, you'd want:
os.path.join(
os.path.dirname(os.path.abspath(sys.argv[0])), # C:\Base\
'cleanishdata',
'output.txt')
A better solution than hardcoding the path to the .py file is to just ask Python for it:
import sys
import os
path_to_script = sys.argv[0]
path_to_script_dir = os.path.dirname(os.path.abspath(path_to_script))
Also, it's generally better to use os.path methods instead of string manipulation:
outfile = open(os.path.join(path_to_script_dir, name), 'w')
Besides making your program continue to work as expected even if you move it to a different location or install it on another machine or give it to a friend, getting rid of the hardcoded paths and the string-based path concatenation means you don't have to worry about backslashes anywhere, and this problem never arises in the first place.
I'm a photographer and doing many backups. Over the years I found myself with a lot of hard drives. Now I bought a NAS and copied all my pictures on one 3TB raid 1 using rsync. According to my script about 1TB of those files are duplicates. That comes from doing multiple backups before deleting files on my laptop and being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messes things up. Can you please have a look at my duplicate finder script and tell me if you think I can run it or not? I tried it on a test folder and it seems ok, but I don't want to mess things up on the NAS.
The script has three steps in three files. In this first part I find all image and metadata files and put them into a shelve database (datenbank, German for "database") with their size as the key.
import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__),"test")
path_to_search = "/volume1/backup_2tb_wd/"
file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]

walker = os.walk(path_to_search)
counter = 0
for dirpath, dirnames, filenames in walker:
    if filenames:
        for filename in filenames:
            counter += 1
            print str(counter)
            for file_ext in file_exts:
                if file_ext in filename:
                    filepath = os.path.join(dirpath, filename)
                    filesize = str(os.path.getsize(filepath))
                    if not filesize in datenbank:
                        datenbank[filesize] = []
                    tmp = datenbank[filesize]
                    if filepath not in tmp:
                        tmp.append(filepath)
                        datenbank[filesize] = tmp

datenbank.sync()
print "done"
datenbank.close()
The second part. Now I drop all file sizes which only have one file in their list and create another shelve database with the md5 hash as key and a list of files as value.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)
datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

counter = 0
space = 0

def md5Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.md5()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for filesize in datenbank:
    filepaths = datenbank[filesize]
    filepath_count = len(filepaths)
    if filepath_count > 1:
        counter += filepath_count - 1
        space += (filepath_count - 1) * int(filesize)
        for filepath in filepaths:
            print counter
            checksum = md5Checksum(filepath)
            if checksum not in datenbank_step2:
                datenbank_step2[checksum] = []
            temp = datenbank_step2[checksum]
            if filepath not in temp:
                temp.append(filepath)
                datenbank_step2[checksum] = temp

print counter
print str(space)
datenbank_step2.sync()
datenbank_step2.close()
print "done"
And finally, the most dangerous part. For every md5 key I retrieve the file list and do an additional sha1. If it matches, I delete every file in that list except the first one and create a hard link to replace the deleted files.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

def sha1Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.sha1()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for hashvalue in datenbank:
    switch = True
    for path in datenbank[hashvalue]:
        if switch:
            original = path
            original_checksum = sha1Checksum(path)
            switch = False
        else:
            if sha1Checksum(path) == original_checksum:
                os.unlink(path)
                os.link(original, path)
                print "delete: ", path

print "done"
What do you think?
Thank you very much.
*If that's somehow important: it's a Synology 713+ and has an ext3 or ext4 filesystem.
This looked good, and after sanitizing it a bit (to make it work with Python 3.4), I ran it on my NAS. While I had hard links for files that had not been modified between backups, files that had moved were being duplicated. This recovered that lost disk space for me.
A minor nitpick is that files that are already hard links are deleted and relinked. This does not affect the end result anyway.
I did slightly alter the third file ("3.py"):
if sha1Checksum(path) == original_checksum:
    tmp_filename = path + ".deleteme"
    os.rename(path, tmp_filename)
    os.link(original, path)
    os.unlink(tmp_filename)
    print("Deleted {} ".format(path))
This makes sure that in case of a power failure or some other similar error, no files are lost, though a trailing "deleteme" file is left behind. A recovery script should be quite trivial; a sketch follows.
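A trivial recovery sketch under that ".deleteme" convention (the search path is illustrative):

import os

for dirpath, dirnames, filenames in os.walk('/volume1/backup_2tb_wd/'):
    for name in filenames:
        if name.endswith('.deleteme'):
            leftover = os.path.join(dirpath, name)
            original = leftover[:-len('.deleteme')]
            if os.path.exists(original):
                os.unlink(leftover)  # the hard link was created, so the leftover is redundant
            else:
                os.rename(leftover, original)  # crashed before linking; restore the file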
Why not compare the files byte for byte instead of doing a second checksum? One time in a billion two checksums might accidentally match, but direct comparison shouldn't fail. It shouldn't be slower, and might even be faster. It might be slower when there are more than two files and you have to read the original file once for each of the others. If you really wanted to, you could get around that by comparing blocks of all the files at once.
EDIT:
I don't think it would require more code, just different code. Something like this for the loop body:
data1 = fh1.read(8192)
data2 = fh2.read(8192)
if data1 != data2: return False
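Fleshed out into a self-contained function (a sketch; the helper name files_identical is mine, not from the original):

def files_identical(path1, path2, blocksize=8192):
    # compare two files block by block; True only if every block matches
    with open(path1, 'rb') as fh1, open(path2, 'rb') as fh2:
        while True:
            data1 = fh1.read(blocksize)
            data2 = fh2.read(blocksize)
            if data1 != data2:
                return False
            if not data1:  # both files exhausted at the same point
                return True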
Note: if you're not wedded to Python, there are existing tools to do the heavy lifting for you:
https://unix.stackexchange.com/questions/3037/is-there-an-easy-way-to-replace-duplicate-files-with-hardlinks
How do you create a hard link?
On Linux you do
sudo ln sourcefile linkfile
Sometimes this can fail (it does for me occasionally), and your Python script would need to run with sudo.
So I use symbolic links:
ln -s sourcefile linkfile
I can check for them with os.path.islink
You can call the commands like this in Python:
os.system("ln -s sourcefile linkfile")
or like this using subprocess (note: don't combine an argument list with shell=True; with a list, the extra items are passed to the shell itself rather than to ln):
import subprocess
subprocess.call(["ln", "-s", sourcefile, linkfile])
Have a look at execution from command line and hard vs. soft links
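For what it's worth, Python can also create links directly through the standard library, without shelling out (a small sketch; the filenames are illustrative):

import os

os.link('sourcefile', 'hardlinkfile')    # hard link, like "ln"
os.symlink('sourcefile', 'symlinkfile')  # symbolic link, like "ln -s"
print(os.path.islink('symlinkfile'))     # True only for the symbolic link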
When it works, could you post your whole code? I would like to use it, too.
I'm trying to write a program that will take file names and rename them by changing the order of the words. It works fine for most files, but I have a few files with Japanese characters in the file name for which the program does not work. I think this is because it converts the characters into question marks (I've checked this using print), and then it can't find the file, because the file has Japanese characters in it, not question marks. How can I work around this?
Edit:
Yes, I'm using Windows.
A recreation of my code is posted below (I'm fairly new to this, so it may be very inefficient and hard to read).
import os

def Filenames(filelist):
    filenames = []
    for name in filelist:
        name = name.split(".")  # split the name from its file extension
        filenames.append(name)
    return filenames

def ReformatName(directory):
    filelist = os.listdir(directory)
    filenames = Filenames(filelist)
    for doc in filenames:  # docs are in the form "Date Name Subject DocName"; turn into "Subject DocName Date"
        doc1 = doc[0].split(" ")  # doc is a [name, extension] pair, so split the name part
        date = doc1[0]
        subject = doc1[2]
        docname = doc1[3]
        newdoc = "%s %s %s.docx" % (subject, docname, date)
        doc = ".".join(doc)  # rebuild the original filename
        os.rename(os.path.normpath(directory + os.sep + doc), os.path.normpath(directory + os.sep + newdoc))
I've found a quite complicated solution for problems with the Windows console (note that it's Python 2 specific: reload(sys) and sys.setdefaultencoding don't exist in Python 3):
# -*- coding: utf-8 -*-
import sys
import codecs

def setup_console(sys_enc="utf-8"):
    reload(sys)
    # Call a system library function if we're on win32
    if sys.platform.startswith("win"):
        import ctypes
        enc = "cp%d" % ctypes.windll.kernel32.GetOEMCP()  # TODO: check on win64/python64
    else:
        # It seems like everything already exists on Linux
        enc = (sys.stdout.encoding if sys.stdout.isatty() else
               sys.stderr.encoding if sys.stderr.isatty() else
               sys.getfilesystemencoding() or sys_enc)
    # Set the default encoding for sys
    sys.setdefaultencoding(sys_enc)
    # Redefine the standard output streams if they aren't redirected
    if sys.stdout.isatty() and sys.stdout.encoding != enc:
        sys.stdout = codecs.getwriter(enc)(sys.stdout, 'replace')
    if sys.stderr.isatty() and sys.stderr.encoding != enc:
        sys.stderr = codecs.getwriter(enc)(sys.stderr, 'replace')
Source: http://habrahabr.ru/post/117236/ (available in Russian only)
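A minimal usage sketch (Python 2, matching the code above; the string is illustrative):

setup_console()
print u"日本語"  # non-ASCII console output should now display instead of turning into question marks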