I have an MS Word document that contains some text and headings, and I want to extract the headings. I installed Python for Win32, but I don't know which method to use; the help documentation for Python for Windows doesn't seem to list the functions of the Word object. Take the following code as an example:
import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument
How can I find out all the functions of the word object? I didn't find anything useful in the help documentation.
The Word object model can be found here. Your doc object will contain these properties, and you can use them to perform your desired actions (note that I haven't used this feature with Word, so my knowledge of the object model is sparse). For instance, if you wanted to read all the words in a document, you could do:
for word in doc.Words:
    print(word)
And you would get all of the words. Each of those word items would be a Word object (reference here), so you could access those properties during iteration. In your case, here is how you would get the style:
for word in doc.Words:
    print(word.Style)
On a sample doc with a single Heading 1 and normal text, this prints:
Heading 1
Heading 1
Heading 1
Heading 1
Heading 1
Normal
Normal
Normal
Normal
Normal
To group the headings together, you can use itertools.groupby. As explained in the code comments below, you need to reference the str() of the object itself, as using word.Style returns an instance that won't properly group with other instances of the same style:
from itertools import groupby
import win32com.client as win32
# All the same as yours
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("testdoc.doc")
doc = word.ActiveDocument
# Here we use itertools.groupby (without sorting anything) to
# find groups of words that share the same heading (note it picks
# up newlines). The tricky/confusing thing here is that you can't
# just group on the Style itself - you have to group on the str().
# There was some other interesting behavior, but I have zero
# experience with COMObjects so I'll leave it there :)
# All of these comments for two lines of code :)
for heading, grp_wrds in groupby(doc.Words, key=lambda x: str(x.Style)):
    print(heading, ''.join(str(word) for word in grp_wrds))
This outputs:
Heading 1 Here is some text
Normal
No header
If you replace the join with a list comprehension, you get the following (where you can see the newlines):
Heading 1 ['Here ', 'is ', 'some ', 'text', '\r']
Normal ['\r', 'No ', 'header', '\r', '\r']
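The grouping behavior itself is plain itertools: groupby only merges consecutive items whose keys compare equal, which is why the key has to be a stable value like str(word.Style). Here is a self-contained sketch of the same pattern with invented (style, word) pairs, no Word required:

```python
from itertools import groupby

# Invented stand-in for (str(word.Style), str(word)) pairs from a document
words = [('Heading 1', 'Here '), ('Heading 1', 'is '), ('Heading 1', 'some '),
         ('Heading 1', 'text'), ('Normal', 'No '), ('Normal', 'header')]

# groupby merges consecutive pairs that share the same style key
grouped = [(style, ''.join(w for _, w in grp))
           for style, grp in groupby(words, key=lambda pair: pair[0])]
```

Each run of consecutive same-style words collapses into one (style, text) pair, which is exactly what the COM version above does with real Word objects.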
Convert the file to .docx and use the python-docx module:
from docx import Document
file = 'test.docx'
document = Document(file)
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)
You can also use the Google Drive SDK to convert the Word document to something more useful, like HTML, where you can easily extract the headers.
https://developers.google.com/drive/manage-uploads
Related
I have a VBA macro. In it, I have:
.Find Text = 'Pollution'
.Replacement Text = '^p^pChemical'
Here, '^p^pChemical' means: replace the word Pollution with Chemical and create two empty paragraphs before it.
Before:
After:
Have you noticed that the word Pollution has been replaced with Chemical, and that two empty paragraphs precede it? This is what I want in Python.
My Code so far:
import docx
from docx import Document
document = Document('Example.docx')
for paragraph in document.paragraphs:
    if 'Pollution' in paragraph.text:
        replace('Pollution', 'Chemical')
        document.add_paragraph(before('Chemical'))
        document.add_paragraph(before('Chemical'))
I want to open a word document to find the word, replace it with another word, and create two empty paragraphs before the replaced word.
You can search through each paragraph to find the word of interest, and call insert_paragraph_before to add the new elements:
def replace(doc, target, replacement):
    for par in list(doc.paragraphs):
        text = par.text
        while (index := text.find(target)) != -1:
            par.insert_paragraph_before(text[:index].rstrip())
            par.insert_paragraph_before('')
            par.text = text = replacement + text[index + len(target):]
list(doc.paragraphs) makes a copy of the list, so that the iteration is not thrown off when you insert elements.
Call this function as many times as you need to replace whatever words you have.
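The paragraph surgery is easier to follow on plain strings. This hypothetical helper applies the same find/slice logic and returns the chunks that would become separate paragraphs:

```python
def split_on(text, target, replacement):
    # Pure-string version of the paragraph surgery above:
    # returns the pieces that would become separate paragraphs.
    chunks = []
    while (index := text.find(target)) != -1:
        chunks.append(text[:index].rstrip())  # text before the match
        chunks.append('')                     # the empty paragraph inserted between
        text = replacement + text[index + len(target):]
    chunks.append(text)
    return chunks
```

The slice text[index + len(target):] is the tail after the matched word, so each occurrence is cut out and replaced exactly once.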
This will take the text from your document, replace the instances of the word Pollution with Chemical, and add paragraphs in between. It doesn't change the original document; instead it creates a copy, which is probably the safer route anyway.
import re
from docx import Document

ref = {"Pollution": "Chemicals", "Ocean": "Sea", "Speaker": "Magnet"}

def get_old_text():
    doc1 = Document('demo.docx')
    fullText = []
    for para in doc1.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

def create_new_document(ref, text):
    doc2 = Document()
    for line in text.split('\n'):
        for k in ref:
            if k.lower() in line.lower():
                parts = re.split(k, line, flags=re.I)
                doc2.add_paragraph(parts[0])
                for part in parts[1:]:
                    doc2.add_paragraph('')
                    doc2.add_paragraph('')
                    doc2.add_paragraph(ref[k] + " " + part)
                break
        else:
            doc2.add_paragraph(line)  # keep lines that contain no keyword
    doc2.save('demo_new.docx')  # save as a copy, leaving the original intact

text = get_old_text()
create_new_document(ref, text)
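The key move above is re.split with flags=re.I, which carves each line at every case-insensitive occurrence of the keyword; the replacement text is then prepended to each tail piece. A quick demonstration on an invented line:

```python
import re

line = "Air pollution and water Pollution are both pollution."
# Split at every occurrence of the keyword, ignoring case
parts = re.split("Pollution", line, flags=re.I)
```

parts holds only the text that surrounded the keyword, so the rebuild loop can interleave the replacement word and the blank paragraphs between those pieces.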
You need to use \n for a new line. Using re, it works like so:
import re
before = "The term Pollution means the manifestation of any unsolicited foregin substance in something. When we talk about pollution on earth, we refer to the contamination that is happening of the natural resources by various pollutants"
pattern = re.compile("pollution", re.IGNORECASE)
after = pattern.sub("\n\nChemical", before)
print(after)
Which will output:
The term
Chemical means the manifestation of any unsolicited foregin substance in something. When we talk about
Chemical on earth, we refer to the contamination that is happening of the natural resources by various pollutants
So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would randomly be periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the regex that I'm using to detect the book's title and author:
titleRegex = re.compile(r'(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: I like apples because they are green (they are sometimes red as well).
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal, because it deletes the text I highlighted.
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any way to make my title regex more specific, so that it only picks up titles and authors and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re

titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]

string = open("/Users/devinnagami/myclippings.txt").read()
for rgx in regexList:
    string = rgx.sub(' ', string)

finalText = string.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)
There isn't enough information to tell whether "Book title (Book Author)" is different from something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression that encodes that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)

for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This pulls out each entry of the text and parses it, knowing that the book title is the first line after the equals signs and that the quote itself sits below the metadata line.
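Here is a self-contained run on an invented clipping in the same "My Clippings.txt" layout, showing what the named groups capture:

```python
import re

quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)

# Invented sample entry in the Kindle clippings layout
data = ("==========\n"
        "Example Book (Jane Doe)\n"
        "- Your Highlight on page 12 | Location 180-181 | Added on Monday, May 1, 2023\n"
        "\n"
        "A highlighted sentence.\n")

m = quoteInfoRegex.search(data)
```

Each named group lands in its own slot, so the title, author, and quote never get confused with one another, regardless of what punctuation the quote contains.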
I have written a pronunciation guide for a foreign language. It has information about each word in a list. I want to use docx to show the pronunciation guide above the original words and the part of speech below the words.
The length of the text between each word varies considerably, as does the difference in length between the pronunciation, word, and part of speech as shown below.
docx forces me to specify a column width, e.g. Inches(.25), when I add a column using table.add_column(Inches(.25)).cells. I tried playing with the autofit object listed here and could not figure out what it actually does.
The desired result looks like this:
pronunciation_1 | pronunciation_2 | pronunciation_3
---------------------------------------------------
word_1 | word_2 | word_3
---------------------------------------------------
part_of_speech_1 | part_of_speech_2|part_of_speech_3
Here's a code example of my attempt to get this to work. I have tried several times to solve the questions below, but the code I've written either crashes or returns results worse than this:
from docx import Document
from docx.shared import Inches
document = Document()
table = document.add_table(rows=3,cols=0)
word_1 = ['This', "th is", 'pronoun']
word_2 = ['is', ' iz', 'verb']
word_3 = ['an', 'uh n', 'indefinite article']
word_4 = ['apple.','ap-uh l', 'noun']
my_word_collection = [word_1,word_2,word_3,word_4]
for word in my_word_collection:
    my_word = word[0]
    pronunciation = word[1]
    part_of_speech = word[2]
    column_cells = table.add_column(Inches(.25)).cells
    column_cells[0].text = pronunciation
    column_cells[1].text = my_word
    column_cells[2].text = part_of_speech
document.save('my_word_demo.docx')
Here's what the results look like. This is a screenshot of how my code renders the text in MS Word:
My specific question is:
How can you make the column widths dynamic?
I want the column width to adjust to the longest of the three data points for each word (my_word, pronunciation, or part_of_speech), so that none of the text needs to wrap and there's no unnecessary space around each word. The goal is that the document is created and the reader doesn't have to adjust the cell widths for it to be readable.
If it helps, the Q/A here says cell width must be set individually. I tried Googling around how to calculate character widths but it seems like there would be a way to handle this in docx that I don't know about.
I'm sorry to ask another question about the same text file.
Below is the string from my working text file:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
This string consists of "word/tag" pairs, as you can see. From this string, I want to filter only the sequences of "adjective + noun" and turn them into bigrams. For example, "Grand/jj-tl Jury/nn-tl" is exactly the word sequence I want. (nn means noun, jj means adjective, and adjuncts such as "-tl" are additional information about the tag.)
Maybe this will be an easy job. I first used a regex for filtering; below is my code.
import re

f = open(textfile)
raw = f.read()
tag_list = re.findall(r"\w+/jj-?\w* \w+/nn-?\w*", raw)
print(tag_list)
This code gives me the exact word list I want. However, what I need is bigram data; the code above only gives me a list of strings, like this:
['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn', 'Executive/jj-tl Committee/nn-tl']
I want this data to be converted such as below.
[('Grand/jj-tl, Jury/nn-tl'), ('recent/jj ,primary/nn'), ('Executive/jj-tl , Committee/nn-tl')]
i.e. the list of bigram data. I need your advice.
I think once you have found the tag_list it should be an easy job afterwards just using the list comprehension:
>>> tag_list = ['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn', 'Executive/jj-tl Committee/nn-tl']
>>> [tag.replace(' ', ', ') for tag in tag_list]
['Grand/jj-tl, Jury/nn-tl', 'recent/jj, primary/nn', 'Executive/jj-tl, Committee/nn-tl']
In your original demonstration, I am not sure why you have ('Grand/jj-tl, Jury/nn-tl'), and I am also not sure why you would like to join these bigrams using a comma.
I think it would be better to have a list of lists, where each inner list holds the bigram data:
>>> [tag.split() for tag in tag_list]
[['Grand/jj-tl', 'Jury/nn-tl'], ['recent/jj', 'primary/nn'], ['Executive/jj-tl', 'Committee/nn-tl']]
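Putting the regex and the split together, tuples give you genuine bigram pairs rather than strings (sample string taken from the question):

```python
import re

raw = ("The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
       "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
       "primary/nn election/nn produced/vbd")

# Find each "adjective noun" pair, then split each match into a 2-tuple
tag_list = re.findall(r"\w+/jj-?\w* \w+/nn-?\w*", raw)
bigrams = [tuple(tag.split()) for tag in tag_list]
```

Tuples are hashable, so this form also drops straight into collections.Counter or NLTK frequency tools if you need counts later.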
My goal is to create a system that will be able to take any random text, extract sentences, remove punctuation, and then, on one of the bare sentences, randomly replace NN- or VB-tagged words with their meronym, holonym, or synonym, as well as with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want, but when I try to extract e.g. a hyponym from this djidja variable, it proves to be impossible, since it is a Synset object and I can't manipulate it in any way.
Any idea how to extract the actual word reported in the hyponyms list? (E.g. print(djidja[2]) displays Synset(u'bowler'), so how do I extract only 'bowler' from this?)
Recall that a synset is just a list of words marked as synonyms. Given a synset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0] # a word can belong to many synsets, let's just use one for the sake of argument
print(s.synonyms)
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernyms and hyponyms:
print(s.hypernyms())
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
print(s.hypernyms()[0].synonyms)
Out[17]: [u'canine', u'canid']