Extracting headings' text from word doc

I am trying to extract the text of headings (of any level) from an MS Word document (.docx file). I am currently trying to do this with python-docx, but even after reading its documentation I have not been able to work out whether it is feasible (maybe I am mistaken).
I looked for solutions online but found nothing specific to my task. It would be great if someone could guide me here.

The fundamental challenge is identifying heading paragraphs. There's nothing stopping an author from formatting a "regular" paragraph to look like (and serve as) a heading as far as a reader is concerned.
However, it's not uncommon for authors to reliably use styles to create headings, because doing so makes it possible to automatically compile those headings into a table of contents.
In that case, you can just iterate over the paragraphs, and pick out those with one of the heading styles.
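from docx import Document

document = Document('your-file.docx')  # hypothetical path

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        # Keep only paragraphs whose style is named 'Heading ...'
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

for heading in iter_headings(document.paragraphs):
    print(heading.text)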
Heading levels may be parsed from the full style name if they've kept the defaults (like 'Heading 1', 'Heading 2', ...).
This may need to be adjusted if the author has renamed the heading styles.
There are more sophisticated approaches that are more reliable (in that they don't depend on style names), but they have no API support, so I expect you'd need to dig into the internal code and work with some of the style XML directly.
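If the default style names are still in place, the level can be read straight off the style name. A minimal sketch, assuming the names follow the 'Heading N' pattern; heading_level, iter_headings_with_level and print_outline are names made up for this example, not part of python-docx:

```python
import re

# Assumption: the author kept the default names 'Heading 1', 'Heading 2', ...
_HEADING_RE = re.compile(r"^Heading (\d+)$")


def heading_level(style_name):
    """Return the numeric level for a default heading style name, else None."""
    match = _HEADING_RE.match(style_name or "")
    return int(match.group(1)) if match else None


def iter_headings_with_level(paragraphs):
    """Yield (level, text) for every paragraph styled 'Heading N'."""
    for paragraph in paragraphs:
        level = heading_level(paragraph.style.name)
        if level is not None:
            yield level, paragraph.text


def print_outline(path):
    """Print the headings of the .docx file at `path`, indented by level."""
    from docx import Document  # python-docx, imported only when needed

    for level, text in iter_headings_with_level(Document(path).paragraphs):
        print("  " * (level - 1) + text)
```

Calling print_outline('your-file.docx') (a placeholder path) would print one line per heading. Renamed or custom heading styles will not match this pattern, so the regular expression would need adjusting for them.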

Related

Can I tag a PDF programmatically?

Can an unstructured PDF be tagged using any tools or libraries?
The only ways I have found to tag a PDF are Adobe Acrobat and the Auto-Tag APIs (not something I am looking forward to, and the results are not so great, in my opinion).
I know the bounding boxes and semantics of the elements (i.e. paragraphs, lists, headings, tables).
So, is there a way to manipulate PDF trees/objects, preferably in Python or JavaScript?
Any thoughts on the topic are appreciated!
The PDF spec talks about "StructTreeRoot" for Tagged PDFs. Going deep inside to build these objects would be nerve-racking, so is there any high-level library for manipulating them?

Tagging a PDF, with all that entails, needs to be done by the PDF writer, so as an example here is this page as tagged by Chromium/Foxit/Skia in MS Edge.
Consider how impossible this may be to do retrospectively, a word, a sentence or even a paragraph at a time, as PDF does not inherently have such constructs.
Things like H1 are discarded by the print-output generator as superfluous bloat that a printer does not need.
The prime reason for tagging is readers with disabilities, so let's see how a tagged PDF fares. Here we are dealing with only one simple page, without images or tables (the two most common reasons for checking tags).
So how, programmatically, would an iterative application driven by Python resolve the remaining requirements that are missing?
Language: as a human I know the language is English (that should have been obvious to a browser that can read aloud).
Title: it is missing, but again it should be obvious. Is "TAGGING PDFS" suitable as a working title, pending approval by another person?
Let's temporarily ignore the major errors, namely that the tagging and the tab order are wrong. A human with eyes and a brain can analyse why and fix those as they work through the human aspects of every page. Asking "can a human read and navigate this logically?" will itself resolve the tag order, and at the same time lets them check whether the page has suitable contrast for visually impaired readers.
So tagging a PDF is best done when a human carries out their retrospective review of the page, and that is best done using "pre-flight" / "post-flight" GUI applications such as Acrobat.

Is there any function or module in NLP that would find specific paragraph headings?

I have a text file. I need to identify specific paragraph headings and, when one is found, extract the relevant tables and paragraphs under that heading using Python. Can we do this with NLP or machine learning? If so, please help me gather the basics, as I am new to this field. I was thinking of using a rule like:
if (capitalized) and heading_length < 50:
    return heading_text
How do I parse the entire document and pick out only the heading names? This is like automating the human task of opening a document, scrolling to the relevant subject and picking it out.
Please help me out with this.

You probably don't need NLP or machine learning to detect these headings. Figure out the rule you actually want and if indeed it is such a simple rule as the one you wrote, a regexp will be sufficient. If your text is formatted (e.g. using HTML) it might be even simpler.
If, however, you can't find a rule and your text isn't formatted consistently, your problem will be hard to solve.
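To make the regexp idea concrete, here is a rough sketch. It assumes that "capitalized" means the whole line is upper case and that a heading is shorter than 50 characters; both are guesses about the rule in the question, and HEADING_RE and split_sections are names invented for this example:

```python
import re

# Assumed rule: a heading is a line of fewer than 50 characters made up of
# upper-case letters, digits, spaces and a little punctuation, with at least
# one letter in it.
HEADING_RE = re.compile(r"^(?=.*[A-Z])[A-Z0-9 ,:&'/-]{1,49}$", re.MULTILINE)


def split_sections(text):
    """Map each detected heading to the text between it and the next heading."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    # Pair every heading with the one after it (None for the last heading).
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        sections[current.group().strip()] = text[current.end():end].strip()
    return sections
```

With the text loaded via open(path).read(), split_sections(text).get("RESULTS") would return whatever sits under a heading called RESULTS (a made-up heading name). If the real headings follow a different convention, only the pattern needs to change.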

I agree with lorg. You could use NLP, but that might just complicate the problem. This could become an optimization problem if performance is a concern.

Edit Header and Footer using python-pptx

Can I edit the Header & Footer of an existing Presentation using python-pptx? The values I want to set are as shown in the attached image. Thanks.

I asked this a long time ago, but I can't remember where and couldn't find it on SO. Scanny answered the question, so I'm relaying his answer here (probably poorly).
By default, python-pptx doesn't include footer or page number placeholders when listing slide placeholders. A common recommendation is to insert text boxes instead when these are needed, but that's not useful when dealing with multiple templates or layouts.
The first thing you'll need to add somewhere is a patch so that the placeholders are included:
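from pptx.slide import SlideLayout

def footer_patch(self):
    # Yield every layout placeholder, including footer, date and slide number
    for ph in self.placeholders:
        yield ph

SlideLayout.iter_cloneable_placeholders = footer_patch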
You should then be able to grab the footer from the placeholders with simple means:
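footer_copy = "Hi, it's me, the footer"
# `slide` is assumed to be a slide created after the patch above was applied
for shape in slide.placeholders:
    if "FOOTER" in str(shape.placeholder_format.type):
        footer = slide.placeholders[shape.placeholder_format.idx]
        footer_text_frame = footer.text_frame
        # insert_text is my own helper (not shown here)
        insert_text(footer_copy, footer_text_frame)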
The above is old code, and probably a poor example of how to do this, but I hope it gives a starting point. A similar approach should work for the other values you listed there. Some values, like the page number, may require additional XML editing, which you can read about in another post where Scanny was my savior.
Please note, if you're using placeholders for other tasks, adding the Footer placeholder to the list of placeholders may have unforeseen consequences.

python-docx - replacing characters

I am trying to build a small program that opens a docx document and replaces characters with others, to do some old-school Caesar-style encryption. After checking the documentation [ https://python-docx.readthedocs.io ], I am afraid I can't find the object methods and attributes. The documentation just kind of explains how to do certain things, like creating paragraphs and sections, but I can't find anything on retrieving and parsing document data. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
    paragraph.append(i)
for i in paragraph:
    for y in i:
        y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.

The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
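For the Caesar-style replacement itself, something along these lines might work. It is only a sketch: caesar_shift, encrypt_paragraphs and encrypt_docx are names invented here, and it relies on the paragraphs and runs that python-docx exposes (document.paragraphs, paragraph.runs, run.text):

```python
import string


def caesar_shift(text, shift=3):
    """Shift ASCII letters by `shift` places; leave everything else untouched."""
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    k = shift % 26
    table = str.maketrans(
        lower + upper,
        lower[k:] + lower[:k] + upper[k:] + upper[:k],
    )
    return text.translate(table)


def encrypt_paragraphs(paragraphs, shift=3):
    """Rewrite each run's text in place so run-level formatting is kept."""
    for paragraph in paragraphs:
        for run in paragraph.runs:
            run.text = caesar_shift(run.text, shift)


def encrypt_docx(source, target, shift=3):
    """Read `source`, shift the body text, and save the result to `target`."""
    from docx import Document  # python-docx, imported only when needed

    document = Document(source)
    encrypt_paragraphs(document.paragraphs, shift)
    document.save(target)
```

For example, encrypt_docx('essay.docx', 'essay-encrypted.docx') would write a shifted copy (the output name is made up). Note that document.paragraphs only covers the body paragraphs; text in tables, headers and footers has to be reached separately.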

I think I found something useful, in case future readers are interested. The problem with python-docx is that I could only get paragraphs individually, which would take a lot of time. I don't even know if titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files. It integrates with python-docx, or at least that's what the short documentation says. What I can do, though, is save my docx file as a PDF and use:
import textract

text = textract.process(
    'path/to/norwegian.pdf',
    method='pdftotext',  # 'pdftofile' is not a method textract knows
    language='nor',  # only matters for the OCR method, 'tesseract'
)
This gives you all the text of the PDF as a single (byte) string, which you can then save. I haven't tested it yet; I will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package

How can/should I break an html document into parts using Python? (Techno- and logically)

I have an HTML document that I'm trying to break into separate, smaller chunks. Say, take each <h3> header and turn it into its own separate file, using only the HTML encoded within that chunk (along with the html, head and body tags).
I am using Python's Beautiful Soup, which I am new to, but it seems easy to use for simple tasks such as this (any better suggestions, like lxml or minidom?). So:
1) How do I go about 'parse all <h3>s and turn each into a separate doc'? Anything from pointers in the right direction to code snippets to online documentation (I found quite little for Soup) will be appreciated.
2) Logically, finding the tag won't be enough: I need to physically 'cut it out' and put it in a separate file (and remove it from the original). Perhaps parsing the text lines instead of nodes would be easier (albeit super ugly, parsing raw text from a formed structure...?)
3) On a related note, suppose I want to delete a certain attribute from all tags of a type (like deleting the alignment attribute of all images). This seems easy but I've failed. Any help will be appreciated!
Thanks for any help!

Yes, you use BeautifulSoup or lxml. Both have methods to find the nodes you want to extract. You can then also recreate HTML from the node objects, and hence save that HTML to new files.
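A rough sketch of how that could look with BeautifulSoup. It assumes a "chunk" is an <h3> plus everything that follows it at the same nesting level up to the next <h3>; split_by_heading and strip_attribute are names made up for this example:

```python
from bs4 import BeautifulSoup


def split_by_heading(html, tag="h3"):
    """Cut each <tag> section out of the document.

    Returns (chunks, remainder): one standalone HTML string per section,
    plus the original document with those sections removed.
    """
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for heading in soup.find_all(tag):
        # Gather the heading and its following siblings up to the next heading.
        section = [heading]
        for sibling in heading.next_siblings:
            if getattr(sibling, "name", None) == tag:
                break
            section.append(sibling)
        # extract() detaches each node from the original tree.
        body = "".join(str(node.extract()) for node in section)
        chunks.append(f"<html><head></head><body>{body}</body></html>")
    return chunks, str(soup)


def strip_attribute(html, tag, attribute):
    """Remove `attribute` from every <tag> element and return the new HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(tag):
        node.attrs.pop(attribute, None)
    return str(soup)
```

Each string in chunks can then be written to its own file with an ordinary open(..., "w"), and strip_attribute(html, "img", "align") covers the attribute question. The wrapper written around each chunk is deliberately bare; copy the original head across if the pieces need it.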