Macro to get document contents preserving hyphenation in LibreOffice Writer

I need to access the text in a LibreOffice document.
The document has automatic hyphenation,
and I need to know the hyphen positions as they are displayed on screen.
The following code returns clear text without automatic hyphens:
XSCRIPTCONTEXT.getDocument().getText().getString()
This is the documentation I read:
https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Working_with_Text_Documents
Also I looked at this extension: https://github.com/voikko/libreoffice-voikko
I also ran the Capitalise.py example under pyCharm remote debugger, but couldn't find any hints.

Automatic hyphens do not actually occur in the text in LibreOffice. Instead, they are displayed as needed. When a format such as PDF is exported, or if the document is printed, then hyphens are shown in the output.
The Hyphenator service is fairly easy to use in macros, and allows a word to be split up according to possible hyphenation positions.
To really determine where hyphens are getting displayed on screen, the following may work:
Traverse the document with a word cursor. Andrew Pitonyak's Macro Document section 7.3.8.5 gives an example of this in Basic.
Move the view cursor to the beginning of each word and check the Y position. For example, if self.oVC is the view cursor, then check the value of self.oVC.getPosition().Y.
Move the cursor to the end of the word, and see if the Y position changed.
If it did, then presumably the word was hyphenated.
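
The steps above can be sketched in Python roughly as follows. This is an untested outline under several assumptions: the function name is made up for illustration, only the main body text is traversed (not tables or frames), the document is open in a visible window so the view cursor has real layout positions, and a word whose start and end have different Y values is assumed to be hyphenated across a line break.

```python
def words_wrapped_across_lines(doc):
    """Return words whose start and end are displayed at different Y positions.

    `doc` is expected to be a Writer document, e.g. XSCRIPTCONTEXT.getDocument().
    """
    view_cursor = doc.getCurrentController().getViewCursor()
    word_cursor = doc.getText().createTextCursor()
    word_cursor.gotoStart(False)

    wrapped = []
    while True:
        # Select from the current position to the end of the current word.
        word_cursor.gotoEndOfWord(True)
        word = word_cursor.getString()
        if word:
            # Compare the on-screen Y position of the word's start and end.
            view_cursor.gotoRange(word_cursor.getStart(), False)
            y_start = view_cursor.getPosition().Y
            view_cursor.gotoRange(word_cursor.getEnd(), False)
            y_end = view_cursor.getPosition().Y
            if y_start != y_end:
                wrapped.append(word)
        # Collapse the selection and move on; stop at the end of the text.
        if not word_cursor.gotoNextWord(False):
            break
    return wrapped
```

Inside a macro this would be called as words_wrapped_across_lines(XSCRIPTCONTEXT.getDocument()). The Hyphenator service could then be used on each returned word to find which of its possible break points was taken.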

Related

RTL (Arabic) ligatures problem when extracting text from PDF

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in backward order, which is normal behavior for RTL languages, and you need to use the bidi algorithm to display the text correctly in UIs/GUIs.
The problem arises with ligature characters that are composed of two characters: these ligatures are not reversed, which makes the extracted text inaccurate.
Here's an example :
Let's say we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The pdf is rendered like this:
When you extract the pdf text using the library text extraction method, you get this:
كارتــــــشلاا
Notice how all chars are in reverse order except "لا".
And here's the final result after applying bidi algorithm:
االشــــــتراك
Am I missing something? Is there any workaround to fix this without detecting false positives and breaking them, or should I write my own implementation that correctly handles ligatures decomposition in bidirectional text?

Most likely, the actual text on the PDF page isn't Unicode but font CIDs (identifying the glyphs used), and the program converting the CIDs to Unicode doesn't take RTL into account.
As an example using RTL with English (sorry), suppose the word "fire" is rendered RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003).
If the CIDs are used to get the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.
In this case, there's no way of knowing that the 'f' and 'i' characters should actually compose a ligature and be flipped around. I'm assuming that's the case for these Arabic characters.
It's unlikely that the tools you're using know anything about RTL or are going to be much help here. You'll need different tools, or an approach that gets you the CIDs directly so you can recompose the Unicode in the correct order.
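
The "fire" example can be reproduced in a few lines of Python. This is only a toy illustration: the CID table is hypothetical, and real PDF libraries expose glyph data in their own ways. It shows why reversing characters after de-ligaturing goes wrong, while reversing the glyph order before expanding each glyph gives the intended text.

```python
# Hypothetical glyph table: CID -> the Unicode text that glyph stands for.
CID_TO_UNICODE = {1: "e", 2: "r", 3: "fi"}

def naive_extract(cids):
    """Expand glyphs in visual order, then reverse the characters."""
    visual = "".join(CID_TO_UNICODE[cid] for cid in cids)  # "erfi"
    return visual[::-1]

def glyph_aware_extract(cids):
    """Reverse the glyph order first, then expand each glyph."""
    return "".join(CID_TO_UNICODE[cid] for cid in reversed(cids))

# Glyphs as drawn for an RTL rendering of "fire": e, r, fi.
visual_cids = [1, 2, 3]

print(naive_extract(visual_cids))        # ifre (ligature characters swapped)
print(glyph_aware_extract(visual_cids))  # fire
```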

pywinauto escaping special characters

I am using type_keys() on a combobox to upload files via a file dialog. As mentioned in similar SO posts, this function omits certain special characters in the text that it actually types into the combobox. I'm resolving this by simply replacing every "specialchar" in that string with "{specialchar}". So far I've found the need to replace the following chars: + ^ % ( ).
I'm wondering where I can find the complete list of characters that require this treatment. I don't think it's this list because I'm not seeing, for example, the % symbol there. I also tried checking keyboard.py from the keyboard library but I don't know if it can be found there.
PS. I realize that instead of using type_keys(), for example, using send_keys() or set_edit_text(), the escaping of special characters might be done for me automatically. However, for various reasons, it looks like type_keys() works the best for my particular file dialog/situation.
Thanks

This is the full documentation: https://pywinauto.readthedocs.io/en/latest/code/pywinauto.keyboard.html
All special characters can be escaped by wrapping them in {}.
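
A small helper along these lines could do the wrapping before the string is passed to type_keys(). The helper name and the exact character set are assumptions: the set below covers the modifier and grouping characters described in the pywinauto keyboard documentation (+ ^ % ~ ( ) { }), so it is worth checking against the linked page for the installed version.

```python
# Assumed set of characters the key-sequence parser treats specially.
SPECIAL_KEY_CHARS = "+^%~(){}"

def escape_for_type_keys(text):
    """Wrap each special character in braces so it is typed literally."""
    return "".join(
        "{" + ch + "}" if ch in SPECIAL_KEY_CHARS else ch
        for ch in text
    )

print(escape_for_type_keys(r"C:\data\100% (final)+v2.txt"))
# C:\data\100{%} {(}final{)}{+}v2.txt
```

Note that type_keys() also drops spaces unless with_spaces=True is passed, which matters for file paths that contain them.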

How to parse and preserve text formatting (Python-Docx)?

I'm using Python-Docx to export all the data from a 500-page Docx file into a spreadsheet using pandas. So far so good except that the process is removing all character styles. I have written the following to preserve superscript, but I can't seem to get it working.
for para in document.paragraphs:
    content = para.text
    for run in para.runs:
        if run.font.superscript:
            r.font.superscript = True
        r = para.add_run(run.text)
        scripture += r.text
My input text might be, for example:
Genesis 1:1 1 In the beginning God created the heavens and the earth.
But my output into the Xlsx file is:
Genesis 1:1 1 In the beginning God created the heavens and the earth. (Still losing the superscript formatting).
How do I preserve the font.style of each run for export? Perhaps more specifically, how do I get the text formatting from each run to be encoded into the "scripture" string?
Any help is greatly appreciated!

You cannot encode font information in a str object. A str object is a sequence of characters and that's that. It cannot indicate "make these five characters bold and the following three characters italic". There's just no place to put that sort of thing, and the str data type is not made for that job.
Font (character-formatting) information must be stored in a container object of some sort. In Word, that's a run. In HTML it can be a <span> element. If you want character formatting in your spreadsheet, you'll need to know how character formatting is stored in the target format (Excel maybe) and then apply it to text in that export format on a run-by-run basis.
There are some other problems with your code you should be aware of:
the r in r.font.superscript = True is being used before being defined. The r = para.add_run(run.text) line would need to appear prior to that line to avoid problems. I wouldn't bother here because it's not actually doing anything here it turns out, but names need to be defined before use.
You are doubling the size of the source paragraph by adding runs to it. This part actually contributes nothing because you then call run.text which as we mentioned cannot contain any character-formatting information and so it gets stripped back out.
The same result as your current code can be achieved by this:
scripture = "".join(p.text for p in document.paragraphs)
but I think you'll want an approach like:
Parse out bits that go in separate cells
Within the text that goes into a single cell, write a "rich-text" cell something like that described here for XlsxWriter: https://xlsxwriter.readthedocs.io/example_rich_strings.html
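
As a rough sketch of that run-by-run idea, the helpers below collect (text, is_superscript) pairs from a paragraph's runs and turn them into the alternating format/string argument list that a rich-string writer expects. The function names are invented for illustration, and the demo uses stand-in objects rather than a real document; with python-docx the input would be para.runs.

```python
from types import SimpleNamespace

def run_segments(runs):
    """Collapse runs into (text, is_superscript) pairs, merging equal neighbours."""
    segments = []
    for run in runs:
        if not run.text:
            continue
        # font.superscript can be True, False or None (inherited).
        is_sup = bool(run.font.superscript)
        if segments and segments[-1][1] == is_sup:
            segments[-1] = (segments[-1][0] + run.text, is_sup)
        else:
            segments.append((run.text, is_sup))
    return segments

def rich_string_args(segments, superscript_format):
    """Build a list where each superscript fragment is preceded by its format."""
    args = []
    for text, is_sup in segments:
        if is_sup:
            args.append(superscript_format)
        args.append(text)
    return args

# Stand-in runs that mimic the shape of python-docx Run objects.
def fake_run(text, sup=None):
    return SimpleNamespace(text=text, font=SimpleNamespace(superscript=sup))

demo = [fake_run("Genesis 1:1 "), fake_run("1", True), fake_run(" In the beginning")]
print(run_segments(demo))
```

With XlsxWriter, the format would come from something like workbook.add_format({"font_script": 1}) and the list would be passed as worksheet.write_rich_string(row, col, *args). XlsxWriter places some constraints on rich strings, so cells with no formatted fragment are better written with the ordinary write call.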

How to find and replace all tabs with spaces in IDLE

I have an invisible "expected an indented block" error, which is likely caused by me using tabs instead of spaces.
When I open the "find/replace" window and try to enter TAB in the find field IDLE unsurprisingly just skips to the replace field.
How do I do this?
Final update:
Thank you for all answers, much appreciated. My problem was that python wants the function comments to be indented too, that is
def imdheladumb():
"""
I'm dum as hel
"""
does not work, it needs to be
def imdheladumb():
    """
    I'm dum as hel
    """

IDLE doesn't let you search for literal tab characters. You can paste one into the search box (as suggested by will), but it will never match anything.
However, it does let you do regular expression searches, and the regular expression \t will match a literal tab. So, turn on the Regular expression checkbox, and put \t in the Find: box, and 4 or 8 spaces (as appropriate) in the Replace: box.
But, as will suggested, it's better to use IDLE's features instead of trying to do things manually: Select the block of code with tabbed (or inconsistent) indentation, go to the Format menu, and select Untabify Region. (Or just hit control-6.) If the tabs were inserted with an editor that uses 4-space tabs, you may need to first use New Indent Width and change it to 4, then Untabify Region.
IDLE doesn't have any code to guess what your tab size was when you wrote the inconsistent code. The only editor I know of that does is emacs. If you just open the file in emacs, it will try to guess your settings, and then you can select the whole buffer and untabify-region. If it guessed right, you're golden; if it guessed wrong, don't save the buffer, because now it'll be even harder to fix. (If you're one of the 3 people in the world who knows how to read emacs lisp but doesn't like emacs, you could look through the python-mode.el source and see how it does its magic.)

A generic way to do this is to just copy a tab character from the document (or just do one in any random text editor and copy it) and then put that into the field.
You could try putting \t in there, but that only works in some editors.
Most IDEs have a function to automatically replace tabs with a predefined number of spaces. I suggest turning that on...
EDIT: doing a standard find and replace could be dangerous if you're using tabs somewhere else for any reason.
If you look here, there's an option called "tabify region". That might be more interesting for you.

There should be an app for that. ;)
If you enjoy overkill, what about running your code through a regular expression like:
re.sub('\t', '    ', yourCode)
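
If tabs may also appear inside string literals or elsewhere in the code, a slightly safer variant is to replace only the tabs at the start of each line. The sketch below assumes indentation made purely of tabs and a 4-space width; the function name is made up, and lines that mix leading spaces and tabs are left for the editor's untabify feature.

```python
import re

def untabify_indentation(source, width=4):
    """Replace each leading tab with `width` spaces, leaving other tabs alone."""
    return re.sub(
        r"^\t+",
        lambda match: " " * (width * len(match.group())),
        source,
        flags=re.MULTILINE,
    )

code = "def f():\n\tif True:\n\t\treturn 'a\tb'\n"
print(untabify_indentation(code))
```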

For those people having the problem, the new version of VSCode can solve this easily.
Click on Tab Sizes at the bottom of the page.
Select Convert Indentation to Spaces from the "Select Action" menu that pops up.

error parsing XML file using ElementTree.parse

I am using Python's elementtree library to parse an .XML file that I exported from MySQL query browser. When I export the result set to a .XML it includes this really weird character that shows up as the letters "BS" highlighted in a green rounded rectangle in my editor. (see screen shot) Anyway I iterate through the file and try to manually replace the character, but it must not be matching because after I do this:
for lines in file:
    lines.replace("<Weird Char>", "").strip();
I get an error from the parse method. However if I replace the character manually in wordpad/notepad etc... the parse call works correctly. I am looking for a way to parse out the character without having to do it manually.
any help would be great: I included two screen shots, one of how the character appears in my editor, and another how it appears in Chrome.
Thanks
EDIT: You will probably have to zoom in on the images, sorry.

The backspace character is not a valid XML character and needs to be escaped (). I'm surprised MySQL is not doing that here, but I'm not familiar with MySQL. You can also check your data and clean it up with an update statement to get rid of that character if it is not valid data for the table.
As far as parsing it out in python, this should work:
lines.replace("\b", "")
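Keep in mind that str.replace returns a new string rather than changing the original, so the result has to be kept and passed on to the parser. A minimal sketch of cleaning the whole file before parsing might look like the following; it assumes the export uses an ASCII-compatible encoding such as UTF-8, and it strips every control character that XML 1.0 disallows, not only the backspace. The function name is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0 (tab, LF and CR are fine).
_INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_invalid_xml_chars(data):
    """Return the raw XML bytes with disallowed control characters removed."""
    return _INVALID_XML_BYTES.sub(b"", data)

raw = b"<rows><value>ab\x08c</value></rows>"
root = ET.fromstring(strip_invalid_xml_chars(raw))
print(root.find("value").text)  # abc
```

For a file on disk, the bytes would come from open(path, "rb").read() before being cleaned and parsed.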
