rst2html on full python project - python

How can I configure rst2html to run on a full Python project instead of a single file?
I'm used to epydoc generating my docs, but thinking to use reStructuredText because it's PyCharm default (I know we as well could change PyCharm to epytext)

rst2html cannot do that by itself. It's a ridiculously thin wrapper over docutils.core.publish_cmdline. Check it's source:
https://github.com/docutils-mirror/docutils/blob/master/tools/rst2html.py
If you want to process several .rst files, you could write a short shell script that calls rst2html on each file.
Alternatively you can write a Python script. Here is an example:
# https://docs.python.org/3/library/pathlib.html
from pathlib import Path
# https://pypi.org/project/docutils/
import docutils.io, docutils.core
def rst2html(source_path):
# mostly taken from
# https://github.com/getpelican/pelican/
pub = docutils.core.Publisher(
source_class=docutils.io.FileInput,
destination_class=docutils.io.StringOutput)
pub.set_components('standalone', 'restructuredtext', 'html')
pub.process_programmatic_settings(None, None, None)
pub.set_source(source_path=str(source_path))
pub.publish()
html = pub.writer.parts['whole']
return html
SRC_DIR = Path('.')
DST_DIR = Path('.')
for rst_file in SRC_DIR.iterdir():
if rst_file.is_file() and rst_file.suffix == '.rst':
html = rst2html(rst_file)
with open(DST_DIR / (rst_file.stem + '.html'), 'w') as f:
f.write(html)
A useful reference could be:
http://docutils.sourceforge.net/docs/api/publisher.html

Related

how to do docx to pdf conversion using python library without subprocess in linux? [duplicate]

I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.
A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")
I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.
You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).
import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.

file upload from container using webdav results into empty file upload

I'm attempting to wrap my brain around this because the identical code generates two different sets of outcomes, implying that there must be a fundamental difference between the settings in which the code is performed.
This is the code I use:
from webdav3.client import Client
if __name__ == "__main__":
client = Client(
{
"webdav_hostname": "http://some_address"
+ "project"
+ "/",
"webdav_login": "somelogin",
"webdav_password": "somepass",
}
)
ci = "someci"
version = "someversion"
directory = f'release-{ci.replace("/", "-")}-{version}'
client.webdav.disable_check = (
True # Need to be disabled as the check can not be performed on the root
)
f = "a.rst"
with open(f, "r") as fh:
contents = fh.read()
print(contents)
evaluated = contents.replace("#PIPELINE_URL#", "DUMMY PIPELINE URL")
with open(f, "w") as fh:
fh.write(evaluated)
print(contents)
client.upload(local_path=f, remote_path=f)
The file a.rst contains some text like:
Please follow instruction link below
#####################################
`Click here for instructions <https://some_website>`_
When I execute this code from macOS, a file with the same contents of a.rst appears on my website.
When I execute this script from within a container with a base image of Python 3.9 and the webdav dependencies, it creates a file on my website, but the content is always empty. I'm not sure why, but it could have something to do with the fact that I'm running it from within a Docker container that on top of it can't handle the special characters in the file (plain text seems to work though)?
Anyone have any ideas as to why this is happening and how to fix it?
EDIT:
It seems that the character ":" is creating the problem..

how to download pics to a specific folder location on windows?

I have this script which download all images from a given web url address:
from selenium import webdriver
import urllib
class ChromefoxTest:
def __init__(self,url):
self.url=url
self.uri = []
def chromeTest(self):
# file_name = "C:\Users\Administrator\Downloads\images"
self.driver=webdriver.Chrome()
self.driver.get(self.url)
self.r=self.driver.find_elements_by_tag_name('img')
# output=open(file_name,'w')
for i, v in enumerate(self.r):
src = v.get_attribute("src")
self.uri.append(src)
pos = len(src) - src[::-1].index('/')
print src[pos:]
self.g=urllib.urlretrieve(src, src[pos:])
# output.write(src)
# output.close()
if __name__=='__main__':
FT=ChromefoxTest("http://imgur.com/")
FT.chromeTest()
my question is: how do i make this script to save all the pics to a specific folder location on my windows machine?
You need to specify the path where you want to save the file. This is explained in the documentation for urllib.urlretrieve:
The method is: urllib.urlretrieve(url[, filename[, reporthook[, data]]]).
And the documentation says:
The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).
So...
urllib.urlretrieve(src, 'location/on/my/system/foo.png')
Will save the image to the specified folder.
Also, consider reviewing the documentation for os.path. Those functions will help you manipulate file names and paths.
If you use the requests library you can slurp up really big image files (or small ones) efficiently and arrange to store them in a place of your choice in an obvious way.
Use this code and you'll get a nice picture of a beagle dog!
image_url is the link to the remote image.
file_path is where you want to store the image locally. It can include just a file name or a full path, at your option.
chunk_size is the size of the piece of the file to be downloaded with each slurp from the remote site.
length is the actual size of the piece that is written locally. Since I did this interactively I put this in mainly so that I wouldn't have to look at a long vertical stream of 1024s on my screen.
..
>>> import requests
>>> image_url = 'http://maxpixel.freegreatpicture.com/static/photo/1x/Eyes-Dog-Portrait-Animal-Familiar-Domestic-Beagle-2507963.jpg'
>>> file_path = r'c:\scratch\beagle.jpg'
>>> r = requests.get(image_url, stream=True)
>>> with open(file_path, 'wb') as beagle:
... for chunk in r.iter_content(chunk_size=1024):
... length = beagle.write(chunk)

Lauch default editor (like 'webbrowser' module)

Is there a simple way to lauch the systems default editor from a Python command-line tool, like the webbrowser module?
Under windows you can simply "execute" the file and the default action will be taken:
os.system('c:/tmp/sample.txt')
For this example a default editor will spawn. Under UNIX there is an environment variable called EDITOR, so you need to use something like:
os.system('%s %s' % (os.getenv('EDITOR'), filename))
The modern Linux way to open a file is using xdg-open; however it does not guarantee that a text editor will open the file. Using $EDITOR is appropriate if your program is command-line oriented (and your users).
If you need to open a file for editing, you could be interested in this question.
You can actually use the webbrowser module to do this. All the answers given so far for both this and the linked question are just the same things the webbrowser module does behind the hood.
The ONLY difference is if they have $EDITOR set, which is rare. So perhaps a better flow would be:
editor = os.getenv('EDITOR')
if editor:
os.system(editor + ' ' + filename)
else:
webbrowser.open(filename)
OK, now that I’ve told you that, I should let you know that the webbrowser module does state that it does not support this case.
Note that on some platforms, trying to open a filename using this function, may work and start the operating system's associated program. However, this is neither supported nor portable.
So if it doesn't work, don’t submit a bug report. But for most uses, it should work.
As the committer of the python Y-Principle generator i had the need to check the generated files against the original and wanted to call a diff-capable editor from python.
My search pointed me to this questions and the most upvoted answer had some comments and follow-up issue that i'd also want to address:
make sure the EDITOR env variable is used if set
make sure things work on MacOS (defaulting to Atom in my case)
make sure a text can be opened in a temporary file
make sure that if an url is opened the html text is extracted by default
You'll find the solution at editory.py and the test case at test_editor.py in my project's repository.
Test Code
'''
Created on 2022-11-27
#author: wf
'''
from tests.basetest import Basetest
from yprinciple.editor import Editor
class TestEditor(Basetest):
"""
test opening an editor
"""
def test_Editor(self):
"""
test the editor
"""
if not self.inPublicCI():
# open this source file
Editor.open(__file__)
Editor.open("https://stackoverflow.com/questions/1442841/lauch-default-editor-like-webbrowser-module")
Editor.open_tmp_text("A sample text to be opened in a temporary file")
Screenshot
Source Code
'''
Created on 2022-11-27
#author: wf
'''
from sys import platform
import os
import tempfile
from urllib.request import urlopen
from bs4 import BeautifulSoup
class Editor:
"""
helper class to open the system defined editor
see https://stackoverflow.com/questions/1442841/lauch-default-editor-like-webbrowser-module
"""
#classmethod
def extract_text(cls,html_text:str)->str:
"""
extract the text from the given html_text
Args:
html_text(str): the input for the html text
Returns:
str: the plain text
"""
soup = BeautifulSoup(html_text, features="html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
return text
#classmethod
def open(cls,file_source:str,extract_text:bool=True)->str:
"""
open an editor for the given file_source
Args:
file_source(str): the path to the file
extract_text(bool): if True extract the text from html sources
Returns:
str: the path to the file e.g. a temporary file if the file_source points to an url
"""
# handle urls
# https://stackoverflow.com/a/45886824/1497139
if file_source.startswith("http"):
url_source = urlopen(file_source)
#https://stackoverflow.com/a/19156107/1497139
charset=url_source.headers.get_content_charset()
# if charset fails here you might want to set it to utf-8 as a default!
text = url_source.read().decode(charset)
if extract_text:
# https://stackoverflow.com/a/24618186/1497139
text=cls.extract_text(text)
return cls.open_tmp_text(text)
editor_cmd=None
editor_env=os.getenv('EDITOR')
if editor_env:
editor_cmd=editor_env
if platform == "darwin":
if not editor_env:
# https://stackoverflow.com/questions/22390709/how-can-i-open-the-atom-editor-from-the-command-line-in-os-x
editor_cmd="/usr/local/bin/atom"
os_cmd=f"{editor_cmd} {file_source}"
os.system(os_cmd)
return file_source
#classmethod
def open_tmp_text(cls,text:str)->str:
"""
open an editor for the given text in a newly created temporary file
Args:
text(str): the text to write to a temporary file and then open
Returns:
str: the path to the temp file
"""
# see https://stackoverflow.com/a/8577226/1497139
# https://stackoverflow.com/a/3924253/1497139
with tempfile.NamedTemporaryFile(delete=False) as tmp:
with open(tmp.name,"w") as tmp_file:
tmp_file.write(text)
tmp_file.close()
return cls.open(tmp.name)
Stackoverflow answers applied
https://stackoverflow.com/a/45886824/1497139
https://stackoverflow.com/a/19156107/1497139
https://stackoverflow.com/a/24618186/1497139
How can I open the Atom editor from the command line in OS X?
https://stackoverflow.com/a/8577226/1497139
https://stackoverflow.com/a/3924253/1497139

Best way to extract text from a Word doc without using COM/automation?

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)
Antiword seems like it might be a reasonable option, but it seems like it might be abandoned.
A Python solution would be ideal, but doesn't appear to be available.
(Same answer as extracting text from MS word files in python)
Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:
document = opendocx('Hello world.docx')
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]
# Extract all text
print getdocumenttext(document)
See Python DocX site
100% Python, no COM, no .net, no Java, no parsing serialized XML with regexs.
I use catdoc or antiword for this, whatever gives the result that is the easiest to parse. I have embedded this in python functions, so it is easy to use from the parsing system (which is written in python).
import os
def doc_to_text_catdoc(filename):
(fi, fo, fe) = os.popen3('catdoc -w "%s"' % filename)
fi.close()
retval = fo.read()
erroroutput = fe.read()
fo.close()
fe.close()
if not erroroutput:
return retval
else:
raise OSError("Executing the command caused an error: %s" % erroroutput)
# similar doc_to_text_antiword()
The -w switch to catdoc turns off line wrapping, BTW.
If all you want to do is extracting text from Word files (.docx), it's possible to do it only with Python. Like Guy Starbuck wrote it, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:
try:
from xml.etree.cElementTree import XML
except ImportError:
from xml.etree.ElementTree import XML
import zipfile
"""
Module that extract text from MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
def get_docx_text(path):
"""
Take the path of a docx file as argument, return the text in unicode.
"""
document = zipfile.ZipFile(path)
xml_content = document.read('word/document.xml')
document.close()
tree = XML(xml_content)
paragraphs = []
for paragraph in tree.getiterator(PARA):
texts = [node.text
for node in paragraph.getiterator(TEXT)
if node.text]
if texts:
paragraphs.append(''.join(texts))
return '\n\n'.join(paragraphs)
Using the OpenOffice API, and Python, and Andrew Pitonyak's excellent online macro book I managed to do this. Section 7.16.4 is the place to start.
One other tip to make it work without needing the screen at all is to use the Hidden property:
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL( docpath,"_blank", 0, (RO, Hidden,) )
Otherwise the document flicks up on the screen (probably on the webserver console) when you open it.
tika-python
A Python port of the Apache Tika library, According to the documentation Apache tika supports text extraction from over 1500 file formats.
Note: It also works charmingly with pyinstaller
Install with pip :
pip install tika
Sample:
#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file
Link to official GitHub
Open Office has an API
For docx files, check out the Python script docx2txt available at
http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt
for extracting the plain text from a docx document.
This worked well for .doc and .odt.
It calls openoffice on the command line to convert your file to text, which you can then simply load into python.
(It seems to have other format options, though they are not apparenlty documented.)
Honestly don't use "pip install tika", this has been developed for mono-user (one developper working on his laptop) and not for multi-users (multi-developpers).
The small class TikaWrapper.py bellow which uses Tika in command line is widely enough to meet our needs.
You just have to instanciate this class with JAVA_HOME path and the Tika jar path, that's all ! And it works perfectly for lot of formats (e.g: PDF, DOCX, ODT, XLSX, PPT, etc.).
#!/bin/python
# -*- coding: utf-8 -*-
# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
#####################
# TikaWrapper class #
#####################
class TikaWrapper:
java_home = None
tikalib_path = None
# Constructor
def __init__(self, java_home, tikalib_path):
self.java_home = java_home
self.tika_lib_path = tikalib_path
def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
'''
- Description:
Extract metadata from a document
- Params:
filePath: The document file path
encoding: The encoding (default = "UTF-8")
returnTuple: If True return a tuple which contains both the output and the error (default = False)
- Examples:
metadata = extractMetadata(filePath="MyDocument.docx")
metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
'''
cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
out, err = self._execute(cmd, encoding)
if (returnTuple): return out, err
return out
def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
'''
- Description:
Extract text from a document
- Params:
filePath: The document file path
encoding: The encoding (default = "UTF-8")
returnTuple: If True return a tuple which contains both the output and the error (default = False)
- Examples:
text = extractText(filePath="MyDocument.docx")
text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
'''
cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
out, err = self._execute(cmd, encoding)
return out, err
# ===========
# = PRIVATE =
# ===========
_cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
_cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"
def _getCmd(self, cmdModel, filePath, encoding):
cmd = cmdModel.replace("${JAVA_HOME}", self.java_home)
cmd = cmd.replace("${TIKALIB_PATH}", self.tika_lib_path)
cmd = cmd.replace("${ENCODING}", encoding)
cmd = cmd.replace("${FILE_PATH}", filePath)
return cmd
def _execute(self, cmd, encoding):
import subprocess
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
out = out.decode(encoding=encoding)
err = err.decode(encoding=encoding)
return out, err
Just in case if someone wants to do in Java language there is Apache poi api. extractor.getText() will extract plane text from docx . Here is the link https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm
Textract-Plus
Use textract-plus which can extract text from most of the document extensions including doc , docm , dotx and docx.
(It uses antiword as a backend for doc files)
refer docs
Install-
pip install textract-plus
Sample-
import textractplus as tp
text=tp.process('path/to/yourfile.doc')
There is also pandoc the swiss-army-knife of documents. It converts from every format to nearly every other format. From the demos page
pandoc -s input_file.docx -o output_file.txt
Like Etienne's answer.
With python 3.9 getiterator was deprecated in ET, so you need to replace it with iter:
def get_docx_text(path):
"""
Take the path of a docx file as argument, return the text in unicode.
"""
document = zipfile.ZipFile(path)
xml_content = document.read('word/document.xml')
document.close()
tree = XML(xml_content)
paragraphs = []
for paragraph in tree.iter(PARA):
texts = [node.text
for node in paragraph.iter(TEXT)
if node.text]
if texts:
paragraphs.append(''.join(texts))
return '\n\n'.join(paragraphs)

Categories

Resources