Is there a simple way to lauch the systems default editor from a Python command-line tool, like the webbrowser module?
Under windows you can simply "execute" the file and the default action will be taken:
os.system('c:/tmp/sample.txt')
For this example a default editor will spawn. Under UNIX there is an environment variable called EDITOR, so you need to use something like:
os.system('%s %s' % (os.getenv('EDITOR'), filename))
The modern Linux way to open a file is using xdg-open; however it does not guarantee that a text editor will open the file. Using $EDITOR is appropriate if your program is command-line oriented (and your users).
If you need to open a file for editing, you could be interested in this question.
You can actually use the webbrowser module to do this. All the answers given so far for both this and the linked question are just the same things the webbrowser module does behind the hood.
The ONLY difference is if they have $EDITOR set, which is rare. So perhaps a better flow would be:
editor = os.getenv('EDITOR')
if editor:
os.system(editor + ' ' + filename)
else:
webbrowser.open(filename)
OK, now that I’ve told you that, I should let you know that the webbrowser module does state that it does not support this case.
Note that on some platforms, trying to open a filename using this function, may work and start the operating system's associated program. However, this is neither supported nor portable.
So if it doesn't work, don’t submit a bug report. But for most uses, it should work.
As the committer of the python Y-Principle generator i had the need to check the generated files against the original and wanted to call a diff-capable editor from python.
My search pointed me to this questions and the most upvoted answer had some comments and follow-up issue that i'd also want to address:
make sure the EDITOR env variable is used if set
make sure things work on MacOS (defaulting to Atom in my case)
make sure a text can be opened in a temporary file
make sure that if an url is opened the html text is extracted by default
You'll find the solution at editory.py and the test case at test_editor.py in my project's repository.
Test Code
'''
Created on 2022-11-27
#author: wf
'''
from tests.basetest import Basetest
from yprinciple.editor import Editor
class TestEditor(Basetest):
"""
test opening an editor
"""
def test_Editor(self):
"""
test the editor
"""
if not self.inPublicCI():
# open this source file
Editor.open(__file__)
Editor.open("https://stackoverflow.com/questions/1442841/lauch-default-editor-like-webbrowser-module")
Editor.open_tmp_text("A sample text to be opened in a temporary file")
Screenshot
Source Code
'''
Created on 2022-11-27
#author: wf
'''
from sys import platform
import os
import tempfile
from urllib.request import urlopen
from bs4 import BeautifulSoup
class Editor:
"""
helper class to open the system defined editor
see https://stackoverflow.com/questions/1442841/lauch-default-editor-like-webbrowser-module
"""
#classmethod
def extract_text(cls,html_text:str)->str:
"""
extract the text from the given html_text
Args:
html_text(str): the input for the html text
Returns:
str: the plain text
"""
soup = BeautifulSoup(html_text, features="html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
return text
#classmethod
def open(cls,file_source:str,extract_text:bool=True)->str:
"""
open an editor for the given file_source
Args:
file_source(str): the path to the file
extract_text(bool): if True extract the text from html sources
Returns:
str: the path to the file e.g. a temporary file if the file_source points to an url
"""
# handle urls
# https://stackoverflow.com/a/45886824/1497139
if file_source.startswith("http"):
url_source = urlopen(file_source)
#https://stackoverflow.com/a/19156107/1497139
charset=url_source.headers.get_content_charset()
# if charset fails here you might want to set it to utf-8 as a default!
text = url_source.read().decode(charset)
if extract_text:
# https://stackoverflow.com/a/24618186/1497139
text=cls.extract_text(text)
return cls.open_tmp_text(text)
editor_cmd=None
editor_env=os.getenv('EDITOR')
if editor_env:
editor_cmd=editor_env
if platform == "darwin":
if not editor_env:
# https://stackoverflow.com/questions/22390709/how-can-i-open-the-atom-editor-from-the-command-line-in-os-x
editor_cmd="/usr/local/bin/atom"
os_cmd=f"{editor_cmd} {file_source}"
os.system(os_cmd)
return file_source
#classmethod
def open_tmp_text(cls,text:str)->str:
"""
open an editor for the given text in a newly created temporary file
Args:
text(str): the text to write to a temporary file and then open
Returns:
str: the path to the temp file
"""
# see https://stackoverflow.com/a/8577226/1497139
# https://stackoverflow.com/a/3924253/1497139
with tempfile.NamedTemporaryFile(delete=False) as tmp:
with open(tmp.name,"w") as tmp_file:
tmp_file.write(text)
tmp_file.close()
return cls.open(tmp.name)
Stackoverflow answers applied
https://stackoverflow.com/a/45886824/1497139
https://stackoverflow.com/a/19156107/1497139
https://stackoverflow.com/a/24618186/1497139
How can I open the Atom editor from the command line in OS X?
https://stackoverflow.com/a/8577226/1497139
https://stackoverflow.com/a/3924253/1497139
Related
I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.
A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")
I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.
You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).
import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.
How can I configure rst2html to run on a full Python project instead of a single file?
I'm used to epydoc generating my docs, but thinking to use reStructuredText because it's PyCharm default (I know we as well could change PyCharm to epytext)
rst2html cannot do that by itself. It's a ridiculously thin wrapper over docutils.core.publish_cmdline. Check it's source:
https://github.com/docutils-mirror/docutils/blob/master/tools/rst2html.py
If you want to process several .rst files, you could write a short shell script that calls rst2html on each file.
Alternatively you can write a Python script. Here is an example:
# https://docs.python.org/3/library/pathlib.html
from pathlib import Path
# https://pypi.org/project/docutils/
import docutils.io, docutils.core
def rst2html(source_path):
# mostly taken from
# https://github.com/getpelican/pelican/
pub = docutils.core.Publisher(
source_class=docutils.io.FileInput,
destination_class=docutils.io.StringOutput)
pub.set_components('standalone', 'restructuredtext', 'html')
pub.process_programmatic_settings(None, None, None)
pub.set_source(source_path=str(source_path))
pub.publish()
html = pub.writer.parts['whole']
return html
SRC_DIR = Path('.')
DST_DIR = Path('.')
for rst_file in SRC_DIR.iterdir():
if rst_file.is_file() and rst_file.suffix == '.rst':
html = rst2html(rst_file)
with open(DST_DIR / (rst_file.stem + '.html'), 'w') as f:
f.write(html)
A useful reference could be:
http://docutils.sourceforge.net/docs/api/publisher.html
I can not find a way using the the Tkinter tkFileDialog.askopenfilename / Open() method to get a file name of an already opened file. At least on Windows 7 anyway.
Simply need to return the filename selected, but when I select an 'opened' file I get the "This file is in use" popup. Any searches for getting filenames seems to always point back to using FileDialog.
There seems to be the same question but using c# here:
Open File Which is already open in window explorer
Are there any ways not listed that can be used to arrive at this? A file selection popup that doesn't fail on already opened files on Windows 7?
Python 2.7.11
Windows 7_64
Selection dialog:
opts = {}
opts['title'] = 'Select file.'
filename = tkFileDialog.Open(**opts).show()
or
filename = tkFileDialog.askopenfilename(**opts)
Which are the same as askopenfilename calls Open().show() which in turn is calling command = "tk_getOpenFile".
Searching tk_getOpenFile shows nothing to override this behavior.
Any pointers here would be much appreciated.
Update:
So as to not reiterate everything that's been gone over here's the break down of the Tkinter 8.5 FileDialog.
def askopenfile(mode = "r", **options):
"Ask for a filename to open, and returned the opened file"
filename = Open(**options).show()
if filename:
return open(filename, mode)
return None
def askopenfilename(**options):
"Ask for a filename to open"
return Open(**options).show()
Both call the Open class but askopenfile follows up with an open(file), here:
class Open(_Dialog):
"Ask for a filename to open"
command = "tk_getOpenFile"
def _fixresult(self, widget, result):
if isinstance(result, tuple):
# multiple results:
result = tuple([getattr(r, "string", r) for r in result])
if result:
import os
path, file = os.path.split(result[0])
self.options["initialdir"] = path
# don't set initialfile or filename, as we have multiple of these
return result
if not widget.tk.wantobjects() and "multiple" in self.options:
# Need to split result explicitly
return self._fixresult(widget, widget.tk.splitlist(result))
return _Dialog._fixresult(self, widget, result)
UPDATE2(answerish):
Seems the windows trc/tk tk_getOpenFile ultimately calls
OPENFILENAME ofn;
which apparently can not, at times, return a filename if the file is in use.
Yet to find the reason why but regardless.
In my case, needing to reference a running QuickBooks file, it will not work using askopenfilename().
Again the link above references someone having the same issue but using c# .
Preface
This is my first post on stackoverflow so I apologize if I mess up somewhere. I searched the internet and stackoverflow heavily for a solution to my issues but I couldn't find anything.
Situation
What I am working on is creating a digital photo frame with my raspberry pi that will also automatically download pictures from my wife's facebook page. Luckily I found someone who was working on something similar:
https://github.com/samuelclay/Raspberry-Pi-Photo-Frame
One month ago this gentleman added the download_facebook.py script. This is what I needed! So a few days ago I started working on this script to get it working in my windows environment first (before I throw it on the pi). Unfortunately there is no documentation specific to that script and I am lacking in python experience.
Based on the from urllib import urlopen statement, I can assume that this script was written for Python 2.x. This is because Python 3.x is now from urlib import request.
So I installed Python 2.7.9 interpreter and I've had fewer issues than when I was attempting to work with Python 3.4.3 interpreter.
Problem
I've gotten the script to download pictures from the facebook account; however, the pictures are corrupted.
Here is pictures of the problem: http://imgur.com/a/3u7cG
Now, I originally was using Python 3.4.3 and had issues with my method urlrequest(url) (see code at bottom of post) and how it was working with the image data. I tried decoding with different formats such as utf-8 and utf-16 but according to the content headers, it shows utf-8 format (I think).
Conclusion
I'm not quite sure if the problem is with downloading the image or with writing the image to the file.
If anyone can help me with this I'd be forever grateful! Also let me know what I can do to improve my posts in the future.
Thanks in advance.
Code
from urllib import urlopen
from json import loads
from sys import argv
import dateutil.parser as dateparser
import logging
# plugin your username and access_token (Token can be get and
# modified in the Explorer's Get Access Token button):
# https://graph.facebook.com/USER_NAME/photos?type=uploaded&fields=source&access_token=ACCESS_TOKEN_HERE
FACEBOOK_USER_ID = "**USER ID REMOVED"
FACEBOOK_ACCESS_TOKEN = "** TOKEN REMOVED - GET YOUR OWN **"
def get_logger(label='lvm_cli', level='INFO'):
"""
Return a generic logger.
"""
format = '%(asctime)s - %(levelname)s - %(message)s'
logging.basicConfig(format=format)
logger = logging.getLogger(label)
logger.setLevel(getattr(logging, level))
return logger
def urlrequest(url):
"""
Make a url request
"""
req = urlopen(url)
data = req.read()
return data
def get_json(url):
"""
Make a url request and return as a JSON object
"""
res = urlrequest(url)
data = loads(res)
return data
def get_next(data):
"""
Get next element from facebook JSON response,
or return None if no next present.
"""
try:
return data['paging']['next']
except KeyError:
return None
def get_images(data):
"""
Get all images from facebook JSON response,
or return None if no data present.
"""
try:
return data['data']
except KeyError:
return []
def get_all_images(url):
"""
Get all images using recursion.
"""
data = get_json(url)
images = get_images(data)
next = get_next(data)
if not next:
return images
else:
return images + get_all_images(next)
def get_url(userid, access_token):
"""
Generates a useable facebook graph API url
"""
root = 'https://graph.facebook.com/'
endpoint = '%s/photos?type=uploaded&fields=source,updated_time&access_token=%s' % \
(userid, access_token)
return '%s%s' % (root, endpoint)
def download_file(url, filename):
"""
Write image to a file.
"""
data = urlrequest(url)
path = 'C:/photos/%s' % filename
f = open(path, 'w')
f.write(data)
f.close()
def create_time_stamp(timestring):
"""
Creates a pretty string from time
"""
date = dateparser.parse(timestring)
return date.strftime('%Y-%m-%d-%H-%M-%S')
def download(userid, access_token):
"""
Download all images to current directory.
"""
logger = get_logger()
url = get_url(userid, access_token)
logger.info('Requesting image direct link, please wait..')
images = get_all_images(url)
for image in images:
logger.info('Downloading %s' % image['source'])
filename = '%s.jpg' % create_time_stamp(image['created_time'])
download_file(image['source'], filename)
if __name__ == '__main__':
download(FACEBOOK_USER_ID, FACEBOOK_ACCESS_TOKEN)
Answering the question of why #Alastair's solution from the comments worked:
f = open(path, 'wb')
From https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files:
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
(I was on a Mac, which explains why the problem wasn't reproduced for me.)
Alastair McCormack posted something that worked!
He said Try setting binary mode when you open the file for writing: f = open(path, 'wb')
It is now successfully downloading the images correctly. Does anyone know why this worked?
Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)
Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.
A Python solution would be ideal, but doesn't appear to be available.
(Same answer as extracting text from MS word files in python)
Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:
document = opendocx('Hello world.docx')
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]
# Extract all text
print getdocumenttext(document)
See Python DocX site
100% Python, no COM, no .net, no Java, no parsing serialized XML with regexs.
I use catdoc or antiword for this, whatever gives the result that is the easiest to parse. I have embedded this in python functions, so it is easy to use from the parsing system (which is written in python).
import os
def doc_to_text_catdoc(filename):
(fi, fo, fe) = os.popen3('catdoc -w "%s"' % filename)
fi.close()
retval = fo.read()
erroroutput = fe.read()
fo.close()
fe.close()
if not erroroutput:
return retval
else:
raise OSError("Executing the command caused an error: %s" % erroroutput)
# similar doc_to_text_antiword()
The -w switch to catdoc turns off line wrapping, BTW.
If all you want to do is extracting text from Word files (.docx), it's possible to do it only with Python. Like Guy Starbuck wrote it, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:
try:
from xml.etree.cElementTree import XML
except ImportError:
from xml.etree.ElementTree import XML
import zipfile
"""
Module that extract text from MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
def get_docx_text(path):
"""
Take the path of a docx file as argument, return the text in unicode.
"""
document = zipfile.ZipFile(path)
xml_content = document.read('word/document.xml')
document.close()
tree = XML(xml_content)
paragraphs = []
for paragraph in tree.getiterator(PARA):
texts = [node.text
for node in paragraph.getiterator(TEXT)
if node.text]
if texts:
paragraphs.append(''.join(texts))
return '\n\n'.join(paragraphs)
Using the OpenOffice API, and Python, and Andrew Pitonyak's excellent online macro book I managed to do this. Section 7.16.4 is the place to start.
One other tip to make it work without needing the screen at all is to use the Hidden property:
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL( docpath,"_blank", 0, (RO, Hidden,) )
Otherwise the document flicks up on the screen (probably on the webserver console) when you open it.
tika-python
A Python port of the Apache Tika library, According to the documentation Apache tika supports text extraction from over 1500 file formats.
Note: It also works charmingly with pyinstaller
Install with pip :
pip install tika
Sample:
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file
Link to official GitHub
Open Office has an API
For docx files, check out the Python script docx2txt available at
http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt
for extracting the plain text from a docx document.
This worked well for .doc and .odt.
It calls openoffice on the command line to convert your file to text, which you can then simply load into python.
(It seems to have other format options, though they are not apparenlty documented.)
Honestly don't use "pip install tika", this has been developed for mono-user (one developper working on his laptop) and not for multi-users (multi-developpers).
The small class TikaWrapper.py bellow which uses Tika in command line is widely enough to meet our needs.
You just have to instanciate this class with JAVA_HOME path and the Tika jar path, that's all ! And it works perfectly for lot of formats (e.g: PDF, DOCX, ODT, XLSX, PPT, etc.).
#!/bin/python
# -*- coding: utf-8 -*-
# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
#####################
# TikaWrapper class #
#####################
class TikaWrapper:
java_home = None
tikalib_path = None
# Constructor
def __init__(self, java_home, tikalib_path):
self.java_home = java_home
self.tika_lib_path = tikalib_path
def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
'''
- Description:
Extract metadata from a document
- Params:
filePath: The document file path
encoding: The encoding (default = "UTF-8")
returnTuple: If True return a tuple which contains both the output and the error (default = False)
- Examples:
metadata = extractMetadata(filePath="MyDocument.docx")
metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
'''
cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
out, err = self._execute(cmd, encoding)
if (returnTuple): return out, err
return out
def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
'''
- Description:
Extract text from a document
- Params:
filePath: The document file path
encoding: The encoding (default = "UTF-8")
returnTuple: If True return a tuple which contains both the output and the error (default = False)
- Examples:
text = extractText(filePath="MyDocument.docx")
text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
'''
cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
out, err = self._execute(cmd, encoding)
return out, err
# ===========
# = PRIVATE =
# ===========
_cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
_cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"
def _getCmd(self, cmdModel, filePath, encoding):
cmd = cmdModel.replace("${JAVA_HOME}", self.java_home)
cmd = cmd.replace("${TIKALIB_PATH}", self.tika_lib_path)
cmd = cmd.replace("${ENCODING}", encoding)
cmd = cmd.replace("${FILE_PATH}", filePath)
return cmd
def _execute(self, cmd, encoding):
import subprocess
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
out = out.decode(encoding=encoding)
err = err.decode(encoding=encoding)
return out, err
Just in case if someone wants to do in Java language there is Apache poi api. extractor.getText() will extract plane text from docx . Here is the link https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm
Textract-Plus
Use textract-plus which can extract text from most of the document extensions including doc , docm , dotx and docx.
(It uses antiword as a backend for doc files)
refer docs
Install-
pip install textract-plus
Sample-
import textractplus as tp
text=tp.process('path/to/yourfile.doc')
There is also pandoc the swiss-army-knife of documents. It converts from every format to nearly every other format. From the demos page
pandoc -s input_file.docx -o output_file.txt
Like Etienne's answer.
With python 3.9 getiterator was deprecated in ET, so you need to replace it with iter:
def get_docx_text(path):
"""
Take the path of a docx file as argument, return the text in unicode.
"""
document = zipfile.ZipFile(path)
xml_content = document.read('word/document.xml')
document.close()
tree = XML(xml_content)
paragraphs = []
for paragraph in tree.iter(PARA):
texts = [node.text
for node in paragraph.iter(TEXT)
if node.text]
if texts:
paragraphs.append(''.join(texts))
return '\n\n'.join(paragraphs)