How do I extract data from a doc/docx file using Python

I know there are similar questions out there, but I couldn't find something that would answer my prayers. What I need is a way to access certain data from MS-Word files and save it in an XML file.
Reading up on python-docx did not help, as it only seems to allow one to write into Word documents, rather than read them.
To present my task exactly (or how I chose to approach it): I would like to search for a key word or phrase in the document (the document contains tables) and extract the text data from the table where the key word/phrase is found.
Does anybody have any ideas?

A docx file is a zip archive that contains the document as XML. You can open the zip, read the document and parse the data using ElementTree.
The advantage of this technique is that you don't need any extra Python libraries installed.
import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

# A docx is a zip archive; the main document lives in word/document.xml
with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

# Print the text of every cell of every table
for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
See my stackoverflow answer to How to read contents of an Table in MS-Word file Using Python? for more details and references.
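To get closer to what the question asks for (find the tables that mention a key word and save them as XML), the same zipfile/ElementTree approach can be extended. This is only a sketch: the function names and the output element names (`tables`, `table`, `row`, `cell`) are made up for illustration, and nested tables are not handled specially.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def cell_text(cell):
    """Concatenate every text run inside one table cell."""
    return ''.join(node.text or '' for node in cell.iter(W + 't'))


def tables_matching(docx_path, keyword):
    """Return each table (a list of rows of cell strings) that mentions keyword."""
    with zipfile.ZipFile(docx_path) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))
    matches = []
    for table in tree.iter(W + 'tbl'):
        rows = [[cell_text(cell) for cell in row.iter(W + 'tc')]
                for row in table.iter(W + 'tr')]
        if any(keyword in text for row in rows for text in row):
            matches.append(rows)
    return matches


def tables_to_xml(tables):
    """Build a small XML tree; the element names are invented for this sketch."""
    root = ET.Element('tables')
    for rows in tables:
        table_el = ET.SubElement(root, 'table')
        for row in rows:
            row_el = ET.SubElement(table_el, 'row')
            for text in row:
                ET.SubElement(row_el, 'cell').text = text
    return ET.ElementTree(root)


# Usage (both paths are placeholders):
# found = tables_matching('report.docx', 'Invoice')
# tables_to_xml(found).write('tables.xml', encoding='utf-8', xml_declaration=True)
```

Because the text runs of a cell are joined before the comparison, a key word that Word has split across several runs inside one cell is still found.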
In answer to a comment below,
Images are not as clear-cut to extract. I created an empty docx and inserted one image into it. I then opened the docx file as a zip archive (using 7zip) and looked at document.xml. All the image information is stored as attributes in the XML, not as element text the way the document text is. So you need to find the tag you are interested in and pull out the information you are looking for.
For example adding to the script above:
IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'
for image in tree.iter(IMAGE):
    print(image.attrib)
outputs:
{'id': '1', 'name': 'Picture 1'}
I'm no expert at the openxml format but I hope this helps.
I do note that the zip file contains a directory called media which contains a file called image1.jpeg that contains a renamed copy of my embedded image. You can look around in the docx zipfile to investigate what is available.
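If the goal is to get the image files themselves, one option is to copy everything under the media directory out of the archive. A minimal sketch, assuming the images sit under `word/media/` as observed above; the function names are made up for illustration.

```python
import zipfile

MEDIA_PREFIX = 'word/media/'


def list_media(docx_path):
    """Names of the archive members stored under word/media/."""
    with zipfile.ZipFile(docx_path) as docx:
        return [name for name in docx.namelist() if name.startswith(MEDIA_PREFIX)]


def extract_media(docx_path, target_dir):
    """Copy every embedded media file into target_dir and return the member names."""
    with zipfile.ZipFile(docx_path) as docx:
        names = [name for name in docx.namelist() if name.startswith(MEDIA_PREFIX)]
        docx.extractall(target_dir, members=names)
    return names


# Usage (paths are placeholders):
# extract_media('report.docx', 'extracted')  # files land in extracted/word/media/
```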

To search in a document with python-docx
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
# Search returns true if found
search(document,'your search string')
You also have a function to get the text of a document:
https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910
# Import the module
from docx import *
# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)
Using https://github.com/mikemaccana/python-docx

It seems that pywin32 does the trick. You can iterate through all the tables in a document and through all the cells inside each table. Getting the data is slightly tricky (the last 2 characters of every cell's text have to be dropped), but otherwise it takes about ten minutes to write.
If anyone needs additional details, please say so in the comments.
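For reference, a rough sketch of that approach. It assumes Windows with Word and pywin32 installed, and it assumes the two trailing characters are Word's end-of-cell marker (`'\r\x07'`); the function names are made up, and tables with merged cells may need extra handling.

```python
def clean_cell_text(raw):
    """Drop the end-of-cell marker that Word appends to a cell's text."""
    return raw[:-2] if raw.endswith('\r\x07') else raw


def read_tables(doc_path):
    """Return all tables as lists of rows of cell strings (Windows + Word only)."""
    # Imported here so that clean_cell_text can be used without pywin32
    from win32com import client

    word = client.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(doc_path)  # Word expects an absolute path
        tables = []
        for table in doc.Tables:
            rows = []
            for row in table.Rows:
                rows.append([clean_cell_text(cell.Range.Text) for cell in row.Cells])
            tables.append(rows)
        doc.Close(False)  # close without saving changes
        return tables
    finally:
        word.Quit()


# Usage (the path is a placeholder):
# for table in read_tables(r'C:\docs\report.doc'):
#     print(table)
```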

A simpler library with image extraction capability:
pip install docx2txt
Then use the code below to read the docx file.
import docx2txt
text = docx2txt.process("file.docx")

Extracting text from doc/docx file using python
import os
import docx2txt
from win32com import client as wc  # Windows only, requires Word

def extract_text_from_docx(path):
    temp = docx2txt.process(path)
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    final_text = ' '.join(text)
    return final_text

def extract_text_from_doc(doc_path):
    # Let Word convert the .doc to .docx, then read the .docx
    joined_path = os.path.join(root_path, save_file_name)
    w = wc.Dispatch('Word.Application')
    doc = w.Documents.Open(doc_path)
    doc.SaveAs(joined_path, 16)  # 16 is the file format for docx
    doc.Close()
    w.Quit()
    return extract_text_from_docx(joined_path)

def extract_text(file_path, extension):
    text = ''
    if extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif extension == '.doc':
        text = extract_text_from_doc(file_path)
    return text

file_path = '<path to the doc/docx file>'
root_path = '<folder where the converted docx is saved>'
save_file_name = "Final2_text_docx.docx"
extension = os.path.splitext(file_path)[1].lower()
final_text = extract_text(file_path, extension)
print(final_text)

Related

Python - Extract PDF content using PyPDF2

I'm trying to get part of the text from a PDF file, and for that I'm using the PyPDF2 package. In the past I was using Tika, which worked fine, but the same process with PyPDF2 is giving me problems. My process is very simple: I read the PDF content into a text variable, then I pass a start point and an end point, and the string between both points is the value that I want. For example, imagine that my PDF contains this:
TEST
HELO THIS IS A PDF WITH NAME test1234 and it was created
I just pass "PDF WITH NAME" and "and it was created" and I get test1234.
My code is the following:
from PyPDF2 import PdfReader
reader = PdfReader('mypdf.pdf')
page = reader.pages[1]
raw = page.extract_text()
start_point = "PDF WITH NAME"
end_point = "and it was created"
quote_start = raw.index(start_point) + len(start_point)
quote_end = raw.index(end_point , quote_start)
my_value = raw[quote_start:quote_end].strip()
I know that all of these pieces of text exist on that specific page of my PDF, but it looks like the index cannot be found when I search for more than one word. I already tried str(raw.encode('utf-8')), but it still gives this error:
ValueError: substring not found
And if I print raw, it shows my example above.
What am I doing wrong?
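One possible cause (an assumption, since the actual extracted text is not shown) is that the extracted text contains line breaks or repeated spaces between the words, so a multi-word search string does not match character for character. A sketch that normalizes whitespace on both sides before searching; `text_between` is a hypothetical helper name.

```python
def text_between(raw, start_point, end_point):
    """Return the text between two markers, ignoring differences in whitespace."""
    # Collapse newlines, tabs and repeated spaces into single spaces
    normalized = ' '.join(raw.split())
    start = ' '.join(start_point.split())
    end = ' '.join(end_point.split())

    begin = normalized.find(start)
    if begin == -1:
        return None  # start marker not present
    begin += len(start)

    finish = normalized.find(end, begin)
    if finish == -1:
        return None  # end marker not present after the start marker
    return normalized[begin:finish].strip()


# Extracted text where the markers are broken across lines
raw = "TEST\nHELO THIS IS A PDF  WITH\nNAME test1234 and it\nwas created"
print(text_between(raw, "PDF WITH NAME", "and it was created"))  # test1234
```

It is also worth checking the page index: `reader.pages` is zero-based, so `reader.pages[1]` is the second page of the file. Printing `repr(raw)` shows exactly which whitespace characters were extracted.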

Is it possible to extract the Text structure of a PDF with Tika and put in a JSON?

I would like to know if it is possible to put the texts extracted from a PDF with Tika Python in a JSON, so that in the future I can import them to the respective records of a system. So far, I've only been able to return the PDF structure (the 'metadata' in line 4), but not the text structure, which is what I need. Below is the code I'm using to return parsed text from a PDF.
from tika import parser
def extract_text(file):
    parsed = parser.from_file(file)
    parsed_text = parsed['metadata']
    return parsed_text

file_name_with_extension = input("Enter file name: ")
text = extract_text(file_name_with_extension)
print(text)
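The dictionary returned by `parser.from_file` also has a 'content' key that holds the extracted text, next to 'metadata'. A sketch of putting both into JSON; the record layout (a 'lines' list) and the function names are just one possible choice for illustration, not something Tika prescribes.

```python
import json


def parsed_to_json(parsed):
    """Serialize the text and metadata of a Tika result as a JSON string."""
    content = parsed.get('content') or ''  # 'content' can be None for empty files
    record = {
        'metadata': parsed.get('metadata', {}),
        # One entry per non-empty line of extracted text
        'lines': [line.strip() for line in content.splitlines() if line.strip()],
    }
    return json.dumps(record, ensure_ascii=False, indent=2)


def extract_json(file):
    """Parse a file with Tika and return the JSON string."""
    # Imported here because it needs the tika package and a Java runtime
    from tika import parser
    return parsed_to_json(parser.from_file(file))


# Usage (the file name is a placeholder):
# print(extract_json('document.pdf'))
```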

How to extract the title, authors, creation date of a PDF in Python

I manage papers locally and rename each PDF file in the form "creationdate_authors_title.pdf". Hence, I need to extract the title, authors and creation date of each paper from the PDF file automatically.
I have written a Python script using the pdfminer package to extract this info. However, for certain files, the file info stored in the dictionary doc.info[0] (obtained through PDFDocument) may not contain some keys such as "Author", or the values of these keys are empty.
I'm wondering how I can locate the required info, such as the paper's title, directly in the PDF file using a function like "extract_pages". Or, more generally, how can I accurately and efficiently extract the info I need?
Any hint would be appreciated! Many thanks in advance.
You can use this script to extract all the metadata with the PyPDF2 library:
# Older PyPDF2 releases used PdfFileReader, getDocumentInfo() and getNumPages()
from PyPDF2 import PdfReader

def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfReader(f)
        info = pdf.metadata
        number_of_pages = len(pdf.pages)
    print(info)
    author = info.author
    creator = info.creator
    producer = info.producer
    subject = info.subject
    title = info.title

if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    get_info(path)
As you can see, the info variable holds everything you need. Check the PyPDF2 documentation for details.

When I save the docx file in python, data gets corrupted

I am able to edit and save txt files without problems, but when I save a docx file, the data gets corrupted. Any suggestions for saving docx properly? Thanks.
def save(self,MainWindow):
    if self.docURL == "" or self.docURL.split(".")[-1] == "txt":
        text = self.textEdit_3.toPlainText()
        file = open(self.docURL[:-4]+"_EDITED.txt","w+")
        file.write(text)
        file.close()
    elif self.docURL == "" or self.docURL.split(".")[-1] == "docx":
        text = self.textEdit_3.toPlainText()
        file = open(self.docURL[:-4] + "_EDITED.docx", "w+")
        file.write(text)
        file.close()
.docx files are much more than text - they are actually a collection of XML files with a very specific format. To read/write them easily, you need the python-docx module. To adapt your code:
from docx import Document
...
    elif self.docURL == "" or self.docURL.split(".")[-1] == "docx":
        text = self.textEdit_3.toPlainText()
        doc = Document()
        paragraph = doc.add_paragraph(text)
        # [:-5] strips ".docx"; [:-4] would leave a stray dot in the file name
        doc.save(self.docURL[:-5] + "_EDITED.docx")
and you're all set. There is much more you can do regarding text formatting, inserting images and shapes, creating tables, etc., but this will get you going. I linked to the docs above.
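To see why writing plain text under a .docx name produces a file Word cannot open, you can check whether the result is a zip archive at all. A small standard-library sketch; `looks_like_docx` is a made-up helper, and the "packaged" file below is only a stand-in, not a complete Word document.

```python
import os
import tempfile
import zipfile


def looks_like_docx(path):
    """True if the file is a zip archive that contains word/document.xml."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return 'word/document.xml' in archive.namelist()


folder = tempfile.mkdtemp()

# Plain text written under a .docx name, as in the question's save() method
broken = os.path.join(folder, 'broken.docx')
with open(broken, 'w') as f:
    f.write('just some text')

# A zip archive that at least has the main document part
packaged = os.path.join(folder, 'packaged.docx')
with zipfile.ZipFile(packaged, 'w') as archive:
    archive.writestr('word/document.xml', '<document/>')

print(looks_like_docx(broken))    # False
print(looks_like_docx(packaged))  # True
```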

Convert PDF to .docx with Python

I'm trying very hard to find the way to convert a PDF file to a .docx file with Python.
I have seen other posts related with this, but none of them seem to work correctly in my case.
I'm using specifically
import os
import subprocess
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
This gives me Output[1], but then, I can't find any .docx document in my folder.
I have LibreOffice 5.3 installed.
Any clues about it?
Thank you in advance!
I am not aware of a way to convert a PDF file directly into a Word file using LibreOffice.
However, you can convert from PDF to HTML and then convert the HTML to docx.
First, get the commands running on the command line. (The following is on Linux, so on your OS you may have to fill in the path to the soffice binary and use a full path for the input file.)
soffice --convert-to html ./my_pdf_file.pdf
then
soffice --convert-to docx:'MS Word 2007 XML' ./my_pdf_file.html
You should end up with:
my_pdf_file.pdf
my_pdf_file.html
my_pdf_file.docx
Now wrap the commands in your subprocess code
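A sketch of what that wrapping could look like. It passes the arguments as a list instead of using `shell=True`, so the filter name with spaces needs no extra quoting. The function names are made up, and the `--headless` and `--outdir` options are additions that were not part of the commands above.

```python
import os
import subprocess


def build_commands(pdf_path, soffice='soffice'):
    """Return the two argument lists for the pdf -> html -> docx conversion."""
    html_path = os.path.splitext(pdf_path)[0] + '.html'
    out_dir = os.path.dirname(pdf_path) or '.'
    return [
        [soffice, '--headless', '--convert-to', 'html',
         '--outdir', out_dir, pdf_path],
        [soffice, '--headless', '--convert-to', 'docx:MS Word 2007 XML',
         '--outdir', out_dir, html_path],
    ]


def convert(pdf_path):
    """Run both conversion steps; raises CalledProcessError if one fails."""
    for command in build_commands(pdf_path):
        subprocess.run(command, check=True)


# Usage (the path is a placeholder):
# convert('/my/pdf/folder/my_pdf_file.pdf')
```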
I use this for multiple files:
from pdf2docx import Converter
import os

# Input and output directories, then a loop over the input files
path_input = '/pdftodocx/input/'
path_output = '/pdftodocx/output/'
for file in os.listdir(path_input):
    cv = Converter(path_input+file)
    cv.convert(path_output+file+'.docx', start=0, end=None)
    cv.close()
    print(file)
The code below worked for me.
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = 1
pdfdoc = 'NewDoc.pdf'
todocx = 'NewDoc.docx'
wb1 = word.Documents.Open(pdfdoc)
wb1.SaveAs(todocx, FileFormat=16) # file format for docx
wb1.Close()
word.Quit()
My approach does not call an external program. Instead, it reads through all the pages of a PDF document and copies their text to a docx file. Note: it only works with text; images and other objects are usually ignored.
# Description: this Python script copies the text of a PDF file into a docx file
import PyPDF2
import docx

mydoc = docx.Document()  # the Word document to fill
with open('pdf/filename.pdf', 'rb') as pdfFileObj:  # PDF file location
    # Older PyPDF2 releases used PdfFileReader, numPages, getPage() and extractText()
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    # Loop through all the pages, starting with the first one
    for pageObj in pdfReader.pages:
        pdfContent = pageObj.extract_text()  # extracts the text of the page
        print(pdfContent)  # optional: check the output in the terminal
        mydoc.add_paragraph(pdfContent)  # adds the text to the Word document
mydoc.save("pdf/filename.docx")  # name of the output file
I have successfully done this with pdf2docx:
from pdf2docx import parse
pdf_file = "test.pdf"
word_file = "test.docx"
parse(pdf_file, word_file, start=0, end=None)
