extract text from first page of a word document Using python

extract text from first page of a word document Using python - python

I'm trying to look for Python script that could extract text from the first page of a word document. I found functions that could do paragraphs but not pages, which is not what I need.

The problem is, pages in docx format are purely virtual. MS Word decides by itself where and when to put page limiters, based on the text size and another parameters.
It's a little bit easier when user did explicitly set page breaks, as they can be found like it's described there, for example.
As a workaround, you can just calculate the amount of lines per page and trim it by yourself, but as long as I know, there's no "easy" method to do everything with 1 code line.

Related

Is there any solution to find the page breakpoint in python (for docx doccument)?

PLEASE REFER THE IMAGE FOR A BETTER UNDERSTANDNG I am developing an application which check the page break point of a DOCX file and if there is a page break point, then that page is ending at that point. I am literally fed up with many logics.
Hope i can get one.
Thanks and regards

.DOCX files seem to do something called a soft breakpoint between pages.
However, this seems to say that you can use from docx import nsprefixes to replace occurrences of page breaks with something more identifiable.

python-docx - replacing characters

I am trying to build a small program in which I open a docx document and replace characters by others, to do some old school caesar-style encrypting, after checking the documentation: [ https://python-docx.readthedocs.io ] I am afraid I can't find the object methods and attributes, the documentation just kind-of explains how to do certain stuff like creating paragraphs and sections but I can't find anything on retrieving document data and parsing. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
paragraph.append(i)
for i in paragraph:
for y in i:
y.replace("a", "y")
...
Can python-docx do something like this? If so where would I find the documentation that could show me how to do it?
If maybe I am using the incorrect library I would also appreciate it if you could point it out.

The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation

I think I found something useful in case future readers might be interested. The problem with python-docx is I could get paragraphs individually and it would take a lot of time. I don't even know if titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files, it integrates with python-docx, or at least that's what the short documentation says. But what I can do, is save my docx file to PDF and use:
text = textract.process(
'path/to/norwegian.pdf',
method='pdftofile',
language='nor',
)
This allows you to get all the text as a string and save it preserving the layout of the pdf. Haven't tested it yet, will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package

How to iterate over everything in a python-docx document?

I am using python-docx to convert a Word docx to a custom HTML equivalent. The document that I need to convert has images and tables, but I haven't been able to figure out how to access the images and the tables within a given run. Here is what I am thinking...
for para in doc.paragraphs:
for run in para.runs:
# How to tell if this run has images or tables?
...but I don't see anything on the Run that has info on the InlineShape or Table. Do I have to fall back to the XML directly or is there a better, cleaner way to iterate over everything in the document?
Thanks!

There are actually two problems to solve for what you're trying to do. The first is iterating over all the block-level elements in the document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.
python-docx doesn't yet have the features you would need to do this directly. However, for the first problem there is some example code here that will likely work for you:
https://github.com/python-openxml/python-docx/issues/40
There is no exact counterpart I know of to deal with inline items, but I expect you could get pretty far with paragraph.runs. All inline content will be within a paragraph. If you got most of the way there and were just hung up on getting pictures or something you could go down the the lxml level and decode some of the XML to get what you needed. If you get that far along and are still keen, if you post a feature request on the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" I can probably provide you with some similar code to get you what you need.
This requirement comes up from time to time so we'll definitely want to add it at some point.
Note that block-level items (paragraphs and tables primarily) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So theoretically it can get pretty deep. A recursive function/method is the right approach for getting to all of those.

Assuming doc is of type Document, then what you want to do is have 3 separate iterations:
One for the paragraphs, as you have in your code
One for the tables, via doc.tables
One for the shapes, via doc.inline_shapes
The reason your code wasn't working was that paragraphs don't have references to the tables and or shapes within the document, as that is stored within the Document object.
Here is the documentation for more info: python-docx

Word count statistics on a web page

I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python based solution.
While it is easy to parse a specific website using, say BautifulSoup and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitraty tag but gets more complicated when determining where the rest of the data (not enclosed in well defined markers) is.
If I understand correctly, robots used by the likes of Google (GoogleBot?) are able to extract data from any website to determine the keyword density. My scenario is similar, obtain the info related to the words that define what the website is about (i.e. after removing js, links and fillers).
My question is, are there any libraries or web APIs that would allow me to get statistics of meaningful words from any given page?

There is no APIs but there could be few libraries that you can use it as a tool.
you should count the meaningful words and record them by the time.
you can also Start from something like this:
string Link= "http://www.website.com/news/Default.asp";
string itemToSearch= "Word";
int count = new Regex(itemToSearch).Matches(Link).Count;
MessageBox.Show(count.ToString());

There are multiple libraries that deal with more advanced processing of web articles, this question should be a duplicate of this one.

Python/Javascript: WYSIWYG html editor - Handle large documents fast and/or design theory

Background:
I am writing an ebook editing program in python. Currently it utilizes a source-code view for editing, and I would like to port it over to a wysiwyg view for editing. The best (only?) html renderer I could find for python was webkit (I am using the PyQt version).
Question:
How do I accomplish wysiwyg editing? The requirements/issues are as follows:
An ebook may be up to 10,000 paragraphs / 1,000,000
characters.
PyQt Webkit (ContentEditable): No problem.
PyQt Webkit (TinyMce, etc): Takes forever to open them!
The format is <body><p>...</p><p>...</p>...</body>. The body element contains only paragraphs, there are no divs, etc (but in the paragraph there may be spans, links, etc.). Editing must take place with no significant delays as far as the user is concerned.
PyQt Webkit (ContentEditable): If you try deleting text across multiple paragraphs, it takes forever!! My understanding is that this is because it resets the common-parent of the elements being changed - i.e. the entire body element, since two different paragraphs are being deleted/merged. But, there should be no need for this - it should need only delete/merge/change those individual paragraphs!
I am open to implementing my own wysiwyg editing, but for the life of me I can't figure out how to delete/cut/paste/merge/change the html code correctly. I searched online for articles about html wysiwyg design theory, and came up dry.
Thanks!

Can i suggest a complete another approach ? Since your ebook is only <p></p>:
Split the text on <p></p> to get an indexed array of all your paragraphs
Make your own pagination system, and fill the screen with N paragraphs, that automatically get enough text to show from the indexed array
When you are doing selection, you can use [paragraph index + character index in the paragraph] for selection start / end
Then implement cut/copy/paste/delete/undo/redo based on thoses assumptions.
(Note: when you'll do a selection, since the start point is saved, you can safely change the text on the screen / pagination, until the selection end.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.