Convert PDF to .docx with Python - python

I'm trying very hard to find the way to convert a PDF file to a .docx file with Python.
I have seen other posts related with this, but none of them seem to work correctly in my case.
I'm using specifically
import os
import subprocess
for top, dirs, files in os.walk('/my/pdf/folder'):
for filename in files:
if filename.endswith('.pdf'):
abspath = os.path.join(top, filename)
subprocess.call('lowriter --invisible --convert-to doc "{}"'
.format(abspath), shell=True)
This gives me Output[1], but then, I can't find any .docx document in my folder.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!

I am not aware of a way to convert a pdf file into a Word file using libreoffice.
However, you can convert from a pdf to a html and then convert the html to a docx.
Firstly, get the commands running on the command line. (The following is on Linux. So you may have to fill in path names to the soffice binary and use a full path for the input file on your OS)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code

I use this for multiple files
####
from pdf2docx import Converter
import os
# # # dir_path for input reading and output files & a for loop # # #
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'
for file in os.listdir(path_input):
cv = Converter(path_input+file)
cv.convert(path_output+file+'.docx', start=0, end=None)
cv.close()
print(file)

Below code worked for me.
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 1
pdfdoc = 'NewDoc.pdf'
todocx = 'NewDoc.docx'
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16) # file format for docx
wb1.Close()
word.Quit()

My approach does not follow the same methodology of using subsystems. However this one does the job of reading through all the pages of a PDF document and moving them to a docx file. Note: It only works with text; images and other objects are usually ignored.
#Description: This python script will allow you to fetch text information from a pdf file
#import libraries
import PyPDF2
import os
import docx
mydoc = docx.Document() # document type
pdfFileObj = open('pdf/filename.pdf', 'rb') # pdffile loction
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) # define pdf reader object
# Loop through all the pages
for pageNum in range(1, pdfReader.numPages):
pageObj = pdfReader.getPage(pageNum)
pdfContent = pageObj.extractText() #extracts the content from the page.
print(pdfContent) # print statement to test output in the terminal. codeline optional.
mydoc.add_paragraph(pdfContent) # this adds the content to the word document
mydoc.save("pdf/filename.docx") # Give a name to your output file.

I have successfully done this with pdf2docx :
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)

Related

how to do docx to pdf conversion using python library without subprocess in linux? [duplicate]

I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.
A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")
I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.
You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).
import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.

How to extract text from pdf in Python 3.7 [duplicate]

This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 10 months ago.
I am trying to extract text from a PDF file using Python. My main goal is I am trying to create a program that reads a bank statement and extracts its text to update an excel file to easily record monthly spendings. Right now I am focusing just extracting the text from the pdf file but I don't know how to do so.
What is currently the best and easiest way to extract text from a PDF file into a string? What library is best to use today and how can I do it?
I have tried using PyPDF2 but everytime I try to extract text from any page using extractText(), it returns empty strings. I have tried installing textract but I get errors because I need more libraries I think.
from PyPDF2 import PdfReader
reader = PdfReader("January2019.pdf")
page = reader.pages[0]
print(page.extract_text())
This prints empty strings when it should be printing the contents of the page
edit: This question was asked for a very old PyPDF2 version. New versions of PyPDF2 have improved text extraction a lot
I have tried many methods but failed, include PyPDF2 and Tika. I finally found the module pdfplumber that is work for me, you also can try it.
Hope this will be helpful to you.
import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
Using tika worked for me!
from tika import parser
rawText = parser.from_file('January2019.pdf')
rawList = rawText['content'].splitlines()
This made it really easy to extract separate each line in the bank statement into a list.
If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf and use it like this:
import fitz
def get_text(filepath: str) -> str:
with fitz.open(filepath) as doc:
text = ""
for page in doc:
text += page.getText().strip()
return text
PyPDF2 is highly unreliable for extracting text from pdf . as pointed out here too.
it says :
While PyPDF2 has .extractText(), which can be used on its page objects
(not shown in this example), it does not work very well. Some PDFs
will return text and some will return an empty string. When you want
to extract text from a PDF, you should check out the PDFMiner project
instead. PDFMiner is much more robust and was specifically designed
for extracting text from PDFs.
You could instead install and use pdfminer using
pip install pdfminer
or you can use another open source utility named pdftotext by xpdfreader. instructions to use the utility is given on the page.
you can download the command line tools from here
and could use the pdftotext.exe utility using subprocess .detailed explanation for using subprocess is given here
PyPDF2 does not read whole pdf correctly. You must use this code.
import pdftotext
pdfFileObj = open("January2019.pdf", 'rb')
pdf = pdftotext.PDF(pdfFileObj)
# Iterate over all the pages
for page in pdf:
print(page)
Here is an alternative solution in Windows 10, Python 3.8
Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing
#pip install pdfminer.six
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert_pdf_to_txt(path):
'''Convert pdf content from a file path to text
:path the file path
'''
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
laparams = LAParams()
with io.StringIO() as retstr:
with TextConverter(rsrcmgr, retstr, codec=codec,
laparams=laparams) as device:
with open(path, 'rb') as fp:
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
return retstr.getvalue()
if __name__ == "__main__":
print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
import pdftables_api
import os
c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"
for file in os.listdir(file_path):
if file.endswith(".pdf"):
c.xlsx(os.path.join(file_path,file), file+'.xlsx')
Go to https://pdftables.com to get an API key.
CSV, format=csv
XML, format=xml
HTML, format=html
XLSX, format=xlsx-single, format=xlsx-multiple
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
while True:
viewer.render()
pdf_markdown += viewer.canvas.text_content
plain_text += "".join(viewer.canvas.strings)
viewer.next()
except PageDoesNotExist:
pass
I think this code will be exactly what you are looking for:
import requests, time, datetime, os, threading, sys, configparser
import glob
import pdfplumber
for filename in glob.glob("*.pdf"):
pdf = pdfplumber.open(filename)
OutputFile = filename.replace('.pdf','.txt')
fx2=open(OutputFile, "a+")
for i in range(0,10000,1):
try:
page = pdf.pages[i]
text = page.extract_text()
print(text)
fx2.write(text)
except Exception as e:
print(e)
fx2.close()
pdf.close()
Try this:
in terminal execute command: pip install PyPDF2
import PyPDF2
reader = PyPDF2.PdfReader("mypdf.pdf")
for page in reader.pages:
print(page.extract_text())

I need an epub to text solution in Python

I need to get text from an epub
from epub_conversion.utils import open_book, convert_epub_to_lines
f = open("demofile.txt", "a")
book = open_book("razvansividra.epub")
lines = convert_epub_to_lines(book)
I use this but if I use print(lines) it does print only one line. And the library is 6 years old. Do you guys know a good way ?
What about https://github.com/aerkalov/ebooklib
EbookLib is a Python library for managing EPUB2/EPUB3 and Kindle
files. It's capable of reading and writing EPUB files programmatically
(Kindle support is under development).
The API is designed to be as simple as possible, while at the same
time making complex things possible too. It has support for covers,
table of contents, spine, guide, metadata and etc.
import ebooklib
from ebooklib import epub
book = epub.read_epub('test.epub')
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
print doc
convert_epub_to_lines returns an iterator to lines, which you need to iterate one by one to get.
Instead, you can get all lines with "convert", see in the documentation of the library:
https://pypi.org/project/epub-conversion/
Epublib has the problem of modifying your epub metadata, so if you want the original file with maybe only a few things changed you can simply unpack the epub into a directory and parse it with Beautifulsoup:
from os import path, listdir
with ZipFile(FILE_NAME, "r") as zip_ref:
zip_ref.extractall(extract_dir)
for filename in listdir(extract_dir):
if filename.endswith(".xhtml"):
print(filename)
with open(path.join(extract_dir, filename), "r", encoding="utf-8") as f:
soup = BeautifulSoup(f.read(), "lxml")
for text_object in soup.find_all(text=True):
Here is a sloppy script that extracts the text from an .epub in the right order. Improvements could be made
Quick explanation:
Takes input(epub) and output(txt) file paths as first and second arguments
Extracts epub content in temporary directory
Parses 'content.opf' file for xhtml content and order
Extracts text from each xhtml
Dependency: lxml
#!/usr/bin/python3
import shutil, os, sys, zipfile, tempfile
from lxml import etree
if len(sys.argv) != 3:
print(f"Usage: {sys.argv[0]} <input.epub> <output.txt>")
exit(1)
inputFilePath=sys.argv[1]
outputFilePath=sys.argv[2]
print(f"Input: {inputFilePath}")
print(f"Output: {outputFilePath}")
with tempfile.TemporaryDirectory() as tmpDir:
print(f"Extracting input to temp directory '{tmpDir}'.")
with zipfile.ZipFile(inputFilePath, 'r') as zip_ref:
zip_ref.extractall(tmpDir)
with open(outputFilePath, "w") as outFile:
print(f"Parsing 'container.xml' file.")
containerFilePath=f"{tmpDir}/META-INF/container.xml"
tree = etree.parse(containerFilePath)
for rootFilePath in tree.xpath( "//*[local-name()='container']"
"/*[local-name()='rootfiles']"
"/*[local-name()='rootfile']"
"/#full-path"):
print(f"Parsing '{rootFilePath}' file.")
contentFilePath = f"{tmpDir}/{rootFilePath}"
contentFileDirPath = os.path.dirname(contentFilePath)
tree = etree.parse(contentFilePath)
for idref in tree.xpath("//*[local-name()='package']"
"/*[local-name()='spine']"
"/*[local-name()='itemref']"
"/#idref"):
for href in tree.xpath( f"//*[local-name()='package']"
f"/*[local-name()='manifest']"
f"/*[local-name()='item'][#id='{idref}']"
f"/#href"):
outFile.write("\n")
xhtmlFilePath = f"{contentFileDirPath}/{href}"
subtree = etree.parse(xhtmlFilePath, etree.HTMLParser())
for ptag in subtree.xpath("//html/body/*"):
for text in ptag.itertext():
outFile.write(f"{text}")
outFile.write("\n")
print(f"Text written to '{outputFilePath}'.")

Python 3.6 Extracting text from PPT files

I'm using textract python-pptx to extract the text contents of the files which works well. Unfortunately our client also has ppt files that needs to be processed but we don't have any MS Office/ Open Office in the server so I'm unable to use comtypes to do conversion of the ppt file to another file type and just do the extraction from there.
Greatly appreciate any recommendations of alternative methods.
I'm running Python 3.6 on Windows 64-bit machine.
Convert them here. https://convertio.co/ppt-pptx/ This will allow you to use them with your program.
from os.path import isfile, join
import os
import re
from pptx import Presentation
def getPptContent(path):
prs = Presentation(path)
text_runs = []
for slide in prs.slides:
for shape in slide.shapes:
if not shape.has_text_frame:
continue
for paragraph in shape.text_frame.paragraphs:
for run in paragraph.runs:
text_runs.append(run.text)
return text_runs
ppt_dir = "ppt_data"
corpus = [str(f) for f in os.listdir(ppt_dir) if not f.startswith('.') and isfile(join(ppt_dir, f))]
for filename in corpus:
Path = ppt_dir + "/" +filename
print(Path)
file_content = getPptContent(Path)
f = open(ppt_dir + "/output/" + filename.split(".")[0] +".txt" ,"w+", encoding="utf-8")
f.write(str(file_content))
f.close()

How do I extract data from a doc/docx file using Python

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and save it in an XML file.
Reading up on python-docx did not help, as it only seems to allow one to write into word documents, rather than read.
To present my task exactly (or how i chose to approach my task): I would like to search for a key word or phrase in the document (the document contains tables) and extract text data from the table where the key word/phrase is found.
Anybody have any ideas?
The docx is a zip file containing an XML of the document. You can open the zip, read the document and parse data using ElementTree.
The advantage of this technique is that you don't need any extra python libraries installed.
import zipfile
import xml.etree.ElementTree
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'
with zipfile.ZipFile('<path to docx file>') as docx:
tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))
for table in tree.iter(TABLE):
for row in table.iter(ROW):
for cell in row.iter(CELL):
print ''.join(node.text for node in cell.iter(TEXT))
See my stackoverflow answer to How to read contents of an Table in MS-Word file Using Python? for more details and references.
In answer to a comment below,
Images are not as clear cut to extract. I have created an empty docx and inserted one image into it. I then open the docx file as a zip archive (using 7zip) and looked at the document.xml. All the image information is stored as attributes in the XML not the CDATA like the text is. So you need to find the tag you are interested in and pull out the information that you are looking for.
For example adding to the script above:
IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'
for image in tree.iter(IMAGE):
print image.attrib
outputs:
{'id': '1', 'name': 'Picture 1'}
I'm no expert at the openxml format but I hope this helps.
I do note that the zip file contains a directory called media which contains a file called image1.jpeg that contains a renamed copy of my embedded image. You can look around in the docx zipfile to investigate what is available.
To search in a document with python-docx
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
You also have a function to get the text of a document:
https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)
Using https://github.com/mikemaccana/python-docx
It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside a table. It's a bit tricky to get the data (the last 2 characters from every entry have to be omitted), but otherwise, it's a ten minute code.
If anyone needs additional details, please say so in the comments.
A more simple library with image extraction capability.
pip install docx2txt
Then use below code to read docx file.
import docx2txt
text = docx2txt.process("file.docx")
Extracting text from doc/docx file using python
import os
import docx2txt
from win32com import client as wc
def extract_text_from_docx(path):
temp = docx2txt.process(path)
text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
final_text = ' '.join(text)
return final_text
def extract_text_from_doc(doc_path):
w = wc.Dispatch('Word.Application')
doc = w.Documents.Open(file_path)
doc.SaveAs(save_file_name, 16)
doc.Close()
w.Quit()
joinedPath = os.path.join(root_path,save_file_name)
text = extract_text_from_docx(joinedPath)
return text
def extract_text(file_path, extension):
text = ''
if extension == '.docx':
text = extract_text_from_docx(file_path)
else extension == '.doc':
text = extract_text_from_doc(file_path)
return text
file_path = #file_path with doc/docx file
root_path = #file_path where the doc downloaded
save_file_name = "Final2_text_docx.docx"
final_text = extract_text(file_path, extension)
print(final_text)

Categories

Resources