How to parse and preserve text formatting (Python-Docx)?

I'm using Python-Docx to export all the data from a 500-page Docx file into a spreadsheet using pandas. So far so good except that the process is removing all character styles. I have written the following to preserve superscript, but I can't seem to get it working.
    for para in document.paragraphs:
        content = para.text
        for run in para.runs:
            if run.font.superscript:
                r.font.superscript = True
            r = para.add_run(run.text)
            scripture += r.text
My input text might be, for example:
Genesis 1:1 1 In the beginning God created the heavens and the earth.
But my output into the Xlsx file is:
Genesis 1:1 1 In the beginning God created the heavens and the earth. (Still losing the superscript formatting).
How do I preserve the font.style of each run for export? Perhaps more specifically, how do I get the text formatting from each run to be encoded into the "scripture" string?
Any help is greatly appreciated!

You cannot encode font information in a str object. A str object is a sequence of characters and that's that. It cannot indicate "make these five characters bold and the following three characters italic". There's just no place to put that sort of thing, and the str data type is not made for that job.
Font (character-formatting) information must be stored in a container object of some sort. In Word, that's a run. In HTML it can be a <span> element. If you want character formatting in your spreadsheet, you'll need to know how character formatting is stored in the target format (Excel maybe) and then apply it to text in that export format on a run-by-run basis.
There are some other problems with your code you should be aware of:
The r in r.font.superscript = True is used before it is defined. The r = para.add_run(run.text) line would need to appear before that line to avoid problems. I wouldn't bother fixing it, because it turns out that line doesn't accomplish anything here, but names need to be defined before use.
You are doubling the size of the source paragraph by adding runs to it. This part contributes nothing, because you then read the new run's .text, which as mentioned cannot contain any character-formatting information, so the formatting gets stripped back out.
The same result as your current code can be achieved by this:
scripture = "".join(p.text for p in document.paragraphs)
but I think you'll want an approach like:
Parse out bits that go in separate cells
Within the text that goes into a single cell, write a "rich-text" cell something like that described here for XlsxWriter: https://xlsxwriter.readthedocs.io/example_rich_strings.html
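As a rough sketch of that second step, the following shows one way the runs of a paragraph could be turned into the fragment list that XlsxWriter's write_rich_string() expects. The helper names rich_fragments and export_paragraphs are hypothetical, one paragraph per cell in column A is an assumption, and only superscript is handled. Note that this writes the workbook with XlsxWriter directly rather than through pandas.

```python
def rich_fragments(pieces, superscript_format):
    """Build the argument list for XlsxWriter's write_rich_string().

    pieces: iterable of (text, is_superscript) pairs, one per Word run.
    superscript_format: the format object to place before superscript text.
    """
    fragments = []
    for text, is_superscript in pieces:
        if not text:
            continue  # write_rich_string() rejects empty string fragments
        if is_superscript:
            fragments.append(superscript_format)  # a format applies to the next string
        fragments.append(text)
    return fragments


def export_paragraphs(docx_path, xlsx_path):
    """Write each paragraph of a .docx file to one cell, keeping superscripts."""
    # Third-party imports are local so rich_fragments() can be used on its own.
    from docx import Document
    import xlsxwriter

    workbook = xlsxwriter.Workbook(xlsx_path)
    worksheet = workbook.add_worksheet()
    superscript = workbook.add_format({"font_script": 1})  # 1 = superscript

    for row, para in enumerate(Document(docx_path).paragraphs):
        pieces = [(run.text, bool(run.font.superscript)) for run in para.runs]
        fragments = rich_fragments(pieces, superscript)
        if len(fragments) > 2:
            worksheet.write_rich_string(row, 0, *fragments)
        else:
            # Too few fragments for a rich string: fall back to plain text.
            worksheet.write_string(row, 0, para.text)
    workbook.close()
```

Calling export_paragraphs("input.docx", "output.xlsx") (both file names are placeholders) would produce one row per paragraph. Other properties such as bold or italic could be handled the same way with additional formats.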

Related

Python - How to read multiple lines from text file as a string and remove all encoding?

I have a list of 77 items. I have placed all 77 items in a text file (one per line).
I am trying to read this into my python script (where I will then compare each item in a list, to another list pulled via API).
Problem: for some reason, 2/77 of the items on the list have encoding, giving me characters of "u00c2" and "u00a2" which means they are not comparing correctly and being missed. I have no idea why these 2/77 have this encoding, but the other 75 are fine, and I don't know how to get rid of the encoding, in python.
Question:
In Python, How can I get rid of the encoding to ensure none of them have any special/weird characters and are just plain text?
Is there a method I can use to do this upon reading the file in?
Here is how I am reading the text file into python:
    with open("name_list2.txt", "r") as myfile:
        policy_match_list = myfile.readlines()
    policy_match_list = [x.strip() for x in policy_match_list]
Note - "policy_match_list" is the list of 77 policies read in from the text file.
Here is how I am comparing my two lists:
    for policy_name in policy_match_list:
        for us_policy in us_policies:
            if policy_name == us_policy["name"]:
                print(f"Match #{match} | {policy_name}")
                match += 1
Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to
Which is resulting in 75/77 expected matches, due to the other 2 policies comparing e.g. "text12 - text" to "text12u00c2-u00a2text" rather than "text12 - text" to "text12 - text"
I hope this makes sense, let me know if I can add any further info
Cheers!

Did you try opening the file with an explicit UTF-8 encoding? Since I can't see the file, I can't be sure this is the problem, but it may contain characters that the default decoding (which depends on the platform, and is often a Latin-style code page such as cp1252 on Windows) can't process correctly.
Try doing:
with open("name_list2.txt", "r", encoding="utf-8") as myfile:
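A fuller sketch of that read, with a small demonstration of what the wrong codec does to a non-ASCII character. The helper name read_names and the throwaway demo file are assumptions made for illustration; whether the real file is actually UTF-8 is not known from the question.

```python
import os
import tempfile


def read_names(path, encoding="utf-8"):
    """Return the non-empty, stripped lines of a text file decoded with `encoding`."""
    with open(path, "r", encoding=encoding) as myfile:
        return [line.strip() for line in myfile if line.strip()]


# Demo on a throwaway UTF-8 file containing one non-ASCII character (e-acute).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9 policy\nplain policy\n".encode("utf-8"))
    demo_path = tmp.name

as_utf8 = read_names(demo_path)                        # decoded with the right codec
as_latin1 = read_names(demo_path, encoding="latin-1")  # wrong codec: extra characters appear
os.remove(demo_path)
```

With the wrong codec, the two UTF-8 bytes of the accented letter come back as two separate characters, which is the kind of stray \u00c2-style character described in the question.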
Also, see this question about how to treat control characters: Python - how to delete hidden signs from string?
Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.

Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 and \u00a2 caused the issue. As of now, I see two fixes:
Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370)
Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.
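A minimal sketch of the first option, assuming the stray characters stand in for what should have been plain spaces (as in the "text12 - text" example from the question). The helper name normalize_name is hypothetical, and this approach would also discard legitimate non-ASCII letters, so it only suits names that are expected to be pure ASCII.

```python
import re


def normalize_name(name):
    """Replace every non-ASCII character with a space and collapse whitespace."""
    ascii_only = re.sub(r"[^\x00-\x7f]", " ", name)
    return " ".join(ascii_only.split())


dirty = "text12\u00c2-\u00a2text"
clean = normalize_name(dirty)
```

Applying normalize_name to both sides of the comparison (the names from the file and us_policy["name"]) keeps the match independent of which list carries the stray characters.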

I want to read in strings to the new line character in Python 2.7

I have a long text file that I am trying to pull certain strings out of. The length of these strings varies within the text file, but they are always located after certain identifiers. So for example, say my text file looks like this:
junk text...
Name:
Age:
Robert
twenty
four.
junk text...
I always know that the "Robert" string is located after "Age:\n\n", but I am not sure how long it is, only that it will end at a "\n\n". The same principle applies to the "twenty four." string. I have tried using
namepos1 = string.find("Age:")
namepos2 = namepos1 + 6
This gives the starting location of the string I want, but I do not know how to save it into a variable such that it always holds the whole string up to the two newline characters. If it were a set length and not variable, I think I could use:
name = string[namepos2:length]
but any help would be greatly appreciated. I may have to go about doing it completely different, but this is the first way I have thought about it and tried to do it.
Thanks!

You could do this by finding the age label and then moving your cursor forward two lines. If you want the entire section of text after the "junk", and you know how long that text is, something like this would also work:
    lookup = 'Age:'  # match the label exactly; 'age' would not match 'Age:'
    lines = []
    with open('C:/Users/Luke/Desktop/Summer 2016/Programs/untitled5.txt') as myFile:
        for num, line in enumerate(myFile, 1):
            if lookup in line:
                # num is 1-based, so num + 1 is the 0-based index two lines below the match
                lines.append(num + 1)
    with open('C:/Users/Luke/Desktop/Summer 2016/Programs/untitled5.txt') as ofile:
        line = ofile.readlines()
    interestinglines = ''
    for i in range(len(lines)):
        interestinglines += line[lines[i]] + '\n'
You may need to tinker with it a bit, but I believe this should reproduce most of what you're looking for. The '\n' is appended to line[lines[i]] so that you can save the result to a new file.

After you have found the location in the string, you can split the remainder on \n\n and take the first item.
s = file_str[namepos2 :]
name = s.split('\n\n')[0]
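Put together, the find-then-split idea might look like the sketch below. The helper name extract_after is hypothetical, and the sample string assumes the layout described in the question (fields separated by blank lines), since the blank lines are not visible in the sample as posted.

```python
def extract_after(text, label, terminator="\n\n"):
    """Return the text that follows `label`, up to the next blank line."""
    start = text.find(label)
    if start == -1:
        return None  # label not present
    # Skip the label itself and the newlines that separate it from the value.
    rest = text[start + len(label):].lstrip("\n")
    return rest.split(terminator)[0]


# Assumed layout: each field is followed by a blank line.
file_str = "junk text...\n\nName:\n\nAge:\n\nRobert\n\ntwenty\nfour.\n\njunk text..."
name = extract_after(file_str, "Age:")
```

Using len(label) instead of a hard-coded offset such as + 6 keeps the helper working for labels of any length.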

Why is the text formatting different in the Python 3 shell compared to text file produced?

I'm trying to read a webpage and output the formatted text to a text file. The code below prints to the shell with formatting, but when I write it to the file it puts everything on one line (with the line breaks shown as a literal \n in the text).
I have tried a variety of things such as not converting it to a string, using prettify from beautiful soup but none seem to produce a text file with formatting. I am presuming I am missing something fairly basic. Any help or guidance would be much appreciated.
# Import
from urllib.request import urlopen
from bs4 import BeautifulSoup
#The actual code
URL = "https://simple.wikipedia.org/wiki/castle" #The target URL
html = urlopen(URL).read() # Reads the url to variable html
soup = BeautifulSoup(html, "lxml") # Uses BS4 to create the soup using the lxml parser
soup = soup.get_text() # Extracts the text
print(soup) # Prints to python 3.5.1 shell, formatted as I would expect
# Now writing what I have extracted to a text file
file = open("TextOutput.txt", 'w') # Creates the file and opens as write (w)
file.writelines(str(soup.encode('UTF-8'))) # Tried file.write/lines(soup), conversion to string and encoding as UTF-8 needed to avoid errors
file.close()
A sample of the file output looks like:
b'\n\n\nCastle - Simple English Wikipedia, the free encyclopedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Castle","wgTitle":"Castle","wgCurRevisionId":5333370,"wgRevisionId":5333370,"wgArticleId":15933,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":[""],"wgCategories":["Castles"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Castle","wgRelevantArticleId":15933,"wgRequestId":"VxUR5gpAIDAAAEXY6FMAAACC","wgIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgWikiEditorEnabledModules":{"toolbar":true,"dialogs":true,"preview":false,"publish":false},"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","usePageImages":true,"usePageDescriptions":true},"wgPreferredVariant":"en","wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgULSAcceptLanguageList":[],"wgULSCurrentAutonym":"English","wgCategoryTreePageCategoryOptions":"{\"mode\":0,\"hideprefix\":20,\"showcount\":true,\"namespaces\":false}","wgNoticeProject":"wikipedia","wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgCentralAuthMobileDomain":false,"wgWikibaseItemId":"Q23413","wgVisualEditorToolbarScrollOffset":0});mw.loader.implement("user.options",function(
$,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function ( $, jQuery ) {\nmw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});/#nomin*/;\n\n});mw.loader.load(["mw.MediaWikiPlayer.loader","mw.PopUpMediaTransform","mw.TMHGalleryHook.js","mediawiki.page.startup","mediawiki.legacy.wikibits","ext.centralauth.centralautologin","mmv.head","ext.visualEditor.desktopArticleTarget.init","ext.uls.init","ext.uls.interface","ext.centralNotice.bannerController","skins.vector.js"]);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCastle\n\nFrom Wikipedia, the free encyclopedia\n\n\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\n\n\n\n\nBodiam Castle in England surrounded by a water-filled moat.\n\n\n\n\n\n\nLichtenstein Castle\n\n\nA castle (from the Latin word castellum) is a fortified structure made in Europe and the Middle East during the Middle Ages. People argue about what the word castle means. However, it usually means a private structure of a lord or noble. This is different from a fortress, which is not a home, and from a fortified town, which was a public defence. For about 900\xc2\xa0years that castles were built they had many different shapes and different details.\nCastles began in Europe in the 9th and 10th centuries. They controlled the places surrounding them, and could both help in attacking and defending. Weapons could be fired from castles, or people could be protected from enemies in castles. However, castles were also a symbol of power. They could be used to control the people and roads around it.\nMany castles were built with earth and wood at first often using manual labour, and then had their defences replaced by stone instead. Early castles often used nature for protection, and did not have towers. By the late 12th and early 13th centuries, though, castles became longer and more complex.\n

file.writelines(str(soup.encode('UTF-8'))) is kind of insane, it's:
1. Encoding text (str) to binary (bytes)
2. Getting the text representation of that by wrapping in str (so it's what you'd type to recreate the binary bytes, but it's not the raw binary)
3. Writing that result one character at a time (writelines iterates what you give it, and strs iterate by character)
Step #3 is silly and inefficient, but mostly harmless. Step #1 would be fine if you then wrote the raw binary to a file opened for binary write and actually wrote the bytes object. But #1 and #2 together mean that stuff like a new line gets converted to a literal \n in the output, rather than actually breaking a line. Non-ASCII stuff like é is output as \xc3\xa9, and the whole thing is wrapped in b'' (or b"").
You want something like:
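    # Open with UTF-8 encoding (in case your system defaults to something else)
    with open("TextOutput.txt", 'w', encoding='utf-8') as file:
        # Write the text as a single block. Here soup must still be the BeautifulSoup
        # object; if you already ran soup = soup.get_text(), use file.write(soup) instead.
        file.write(soup.get_text())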
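To see why the original line produced a single line wrapped in b'', the three steps can be reproduced on a tiny string. This is only an illustration; the sample text is made up.

```python
text = "caf\u00e9\nbar"          # an accented letter, a real newline, then 'bar'

encoded = text.encode("utf-8")   # step 1: str -> bytes
as_repr = str(encoded)           # step 2: the repr of those bytes, as text
pieces = list(as_repr)           # step 3: what writelines() iterates over

# The real newline is gone: as_repr contains a backslash followed by 'n' instead.
has_real_newline = "\n" in as_repr
```

Here as_repr starts with b' and spells out the newline and the non-ASCII bytes as escape sequences, which matches the file output shown in the question.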

Macro to get document contents preserving hyphenation in libreoffice writer

I need to access the text in a LibreOffice document.
The document has automatic hyphenation,
and I need to know the hyphen positions as they are displayed on screen.
The following code returns clear text without automatic hyphens:
XSCRIPTCONTEXT.getDocument().getText().getString()
This is the documentation I read:
https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Working_with_Text_Documents
Also I looked at this extension: https://github.com/voikko/libreoffice-voikko
I also ran the Capitalise.py example under pyCharm remote debugger, but couldn't find any hints.

Automatic hyphens do not actually occur in the text in LibreOffice. Instead, they are displayed as needed. When a format such as PDF is exported, or if the document is printed, then hyphens are shown in the output.
The Hyphenator service is fairly easy to use in macros, and allows a word to be split up according to possible hyphenation positions.
To really determine where hyphens are getting displayed on screen, the following may work:
Traverse the document with a word cursor. Andrew Pitonyak's Macro Document section 7.3.8.5 gives an example of this in Basic.
Move the view cursor to the beginning of each word and check the Y position. For example, if self.oVC is the view cursor, then check the value of self.oVC.getPosition().Y.
Move the cursor to the end of the word, and see if the Y position changed.
If it did, then presumably the word was hyphenated.

error parsing XML file using ElementTree.parse

I am using Python's elementtree library to parse an .XML file that I exported from MySQL query browser. When I export the result set to a .XML it includes this really weird character that shows up as the letters "BS" highlighted in a green rounded rectangle in my editor. (see screen shot) Anyway I iterate through the file and try to manually replace the character, but it must not be matching because after I do this:
    for lines in file:
        lines.replace("<Weird Char>", "").strip();
I get an error from the parse method. However if I replace the character manually in wordpad/notepad etc... the parse call works correctly. I am looking for a way to parse out the character without having to do it manually.
any help would be great: I included two screen shots, one of how the character appears in my editor, and another how it appears in Chrome.
Thanks
EDIT: You will probably have to zoom in on the images, sorry.

The backspace character (U+0008) is not a valid XML 1.0 character, even when written as a character reference, so the exported file is not well-formed while it contains one. I'm surprised MySQL is not handling that here, but I'm not familiar with MySQL. You can also check your data and clean it up with an update statement to get rid of that character if it is not valid data for the table.
As far as stripping it out in Python, this should work (str.replace returns a new string, so keep the result):
    lines = lines.replace("\b", "")
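A slightly more general sketch, assuming the goal is to drop every control character that XML 1.0 forbids (not only backspace) before handing the text to ElementTree. The helper names and the UTF-8 default are assumptions, not part of the original export.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that XML 1.0 forbids: everything below 0x20 except tab, LF and CR.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(text):
    """Return `text` with the forbidden control characters removed."""
    return _INVALID_XML_CHARS.sub("", text)


def parse_dirty_xml(path, encoding="utf-8"):
    """Read an XML file, remove forbidden control characters, and parse it."""
    with open(path, encoding=encoding) as f:
        return ET.fromstring(strip_invalid_xml_chars(f.read()))


# Small in-memory check: a backspace inside an element's text.
root = ET.fromstring(strip_invalid_xml_chars("<row>ab\x08c</row>"))
```

Cleaning the whole file content once, rather than line by line, also avoids the pitfall of discarding the result of replace() inside a loop.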
