error parsing XML file using ElementTree.parse - python

I am using Python's elementtree library to parse an .XML file that I exported from MySQL query browser. When I export the result set to a .XML it includes this really weird character that shows up as the letters "BS" highlighted in a green rounded rectangle in my editor. (see screen shot) Anyway I iterate through the file and try to manually replace the character, but it must not be matching because after I do this:
for lines in file:
    lines.replace("<Weird Char>", "").strip();
I get an error from the parse method. However if I replace the character manually in wordpad/notepad etc... the parse call works correctly. I am looking for a way to parse out the character without having to do it manually.
any help would be great: I included two screen shots, one of how the character appears in my editor, and another how it appears in Chrome.
Thanks
EDIT: You will probably have to zoom in on the images, sorry.

The backspace character (U+0008) is not a valid XML character and needs to be escaped or removed. I'm surprised MySQL is not doing that here, but I'm not familiar with MySQL. You can also check your data and clean it up with an update statement to get rid of that character if it is not valid data for the table.
As far as stripping it out in Python, this should work (replace returns a new string, so keep the result):
lines = lines.replace("\b", "")
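A minimal sketch of the whole round trip, assuming the export fits in memory as one string. The helper name parse_without_backspaces and the sample document are made up for illustration. Python strings are immutable, so the loop in the question throws away the result of replace; here the cleaned text is kept and handed to the parser.

```python
import xml.etree.ElementTree as ET


def parse_without_backspaces(xml_text):
    """Strip backspace characters (U+0008) and return the parsed root element."""
    cleaned = xml_text.replace("\b", "")  # replace() returns a new string
    return ET.fromstring(cleaned)


# Hypothetical sample standing in for the MySQL export.
sample = "<rows><row>abc\bdef</row></rows>"

try:
    ET.fromstring(sample)
    raw_parses = True
except ET.ParseError:
    # The parser rejects the raw control character.
    raw_parses = False

root = parse_without_backspaces(sample)
print(raw_parses, root.find("row").text)
```

For a file on disk, the same idea applies: read the text, clean it, then parse the cleaned string instead of calling ElementTree.parse on the original file.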

Related

pywinauto escaping special characters

I am using type_keys() on a combobox to upload files via a file dialog. As mentioned in similar SO posts, this function omits certain special characters in the text that it actually types into the combobox. I'm resolving this by simply replacing every "specialchar" in that string with "{specialchar}". So far I've found the need to replace the following chars: + ^ % ( ).
I'm wondering where I can find the complete list of characters that require this treatment. I don't think it's this list because I'm not seeing, for example, the % symbol there. I also tried checking keyboard.py from the keyboard library but I don't know if it can be found there.
PS. I realize that instead of using type_keys(), for example, using send_keys() or set_edit_text(), the escaping of special characters might be done for me automatically. However, for various reasons, it looks like type_keys() works the best for my particular file dialog/situation.
Thanks
This is the full documentation: https://pywinauto.readthedocs.io/en/latest/code/pywinauto.keyboard.html All special characters can be wrapped by {}
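A small helper along these lines can do the wrapping. This is only a sketch: the function name escape_for_type_keys is made up, and the default character set is just the list from the question (+ ^ % ( )). The linked documentation is the authority on which other characters, such as ~ or the braces themselves, need the same treatment.

```python
def escape_for_type_keys(text, special="+^%()"):
    """Wrap each special character in braces so it is typed literally.

    `special` defaults to the characters listed in the question; extend it
    after checking the pywinauto keyboard documentation.
    """
    return "".join("{" + ch + "}" if ch in special else ch for ch in text)


escaped = escape_for_type_keys("report(50%)+final.txt")
print(escaped)
```

The escaped string can then be passed to type_keys(). By default type_keys() also skips spaces unless with_spaces=True is given, which is worth checking if the file names contain spaces.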

How to parse and preserve text formatting (Python-Docx)?

I'm using Python-Docx to export all the data from a 500-page Docx file into a spreadsheet using pandas. So far so good except that the process is removing all character styles. I have written the following to preserve superscript, but I can't seem to get it working.
for para in document.paragraphs:
    content = para.text
    for run in para.runs:
        if run.font.superscript:
            r.font.superscript = True
        r = para.add_run(run.text)
        scripture += r.text
My input text might be, for example:
Genesis 1:1 1 In the beginning God created the heavens and the earth.
But my output into the Xlsx file is:
Genesis 1:1 1 In the beginning God created the heavens and the earth. (Still losing the superscript formatting).
How do I preserve the font.style of each run for export? Perhaps more specifically, how do I get the text formatting from each run to be encoded into the "scripture" string?
Any help is greatly appreciated!
You cannot encode font information in a str object. A str object is a sequence of characters and that's that. It cannot indicate "make these five characters bold and the following three characters italic". There's just no place to put that sort of thing and the str data type is not made for that job.
Font (character-formatting) information must be stored in a container object of some sort. In Word, that's a run. In HTML it can be a <span> element. If you want character formatting in your spreadsheet, you'll need to know how character formatting is stored in the target format (Excel maybe) and then apply it to text in that export format on a run-by-run basis.
There are some other problems with your code you should be aware of:
the r in r.font.superscript = True is being used before being defined. The r = para.add_run(run.text) line would need to appear prior to that line to avoid problems. I wouldn't bother here because it's not actually doing anything here it turns out, but names need to be defined before use.
You are doubling the size of the source paragraph by adding runs to it. This part actually contributes nothing because you then call run.text which as we mentioned cannot contain any character-formatting information and so it gets stripped back out.
The same result as your current code can be achieved by this:
scripture = "".join(p.text for p in document.paragraphs)
but I think you'll want an approach like:
Parse out bits that go in separate cells
Within the text that goes into a single cell, write a "rich-text" cell something like that described here for XlsxWriter: https://xlsxwriter.readthedocs.io/example_rich_strings.html
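The second step could be sketched as follows, assuming XlsxWriter is the export library and that superscript is the only formatting to keep. The names runs_to_fragments and write_paragraph are hypothetical. The worksheet and the format object are passed in, so nothing here depends on how the workbook is created.

```python
def runs_to_fragments(runs, superscript_format):
    """Turn (text, is_superscript) pairs into write_rich_string() arguments."""
    fragments = []
    for text, is_superscript in runs:
        if not text:
            continue  # skip empty runs; rich strings should not hold empty pieces
        if is_superscript:
            # In a rich string, a format applies to the text fragment after it.
            fragments.append(superscript_format)
        fragments.append(text)
    return fragments


def write_paragraph(worksheet, row, col, para, superscript_format):
    """Write one python-docx paragraph into one cell, keeping superscript runs."""
    runs = [(run.text, bool(run.font.superscript)) for run in para.runs]
    fragments = runs_to_fragments(runs, superscript_format)
    has_format = any(f is superscript_format for f in fragments)
    if has_format and len(fragments) >= 3:
        worksheet.write_rich_string(row, col, *fragments)
    else:
        # Nothing to format, or too few pieces for a rich string: write plain
        # text (a paragraph that is one lone superscript run loses its format).
        worksheet.write_string(row, col, para.text)
```

With XlsxWriter, the superscript format would be created once per workbook, for example superscript_format = workbook.add_format({"font_script": 1}), and reused for every cell.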

Macro to get document contents preserving hyphenation in libreoffice writer

I need to access the text in a LibreOffice document.
The document has automatic hyphenation,
and I need to know the hyphen positions as they are displayed on screen.
The following code returns clear text without automatic hyphens:
XSCRIPTCONTEXT.getDocument().getText().getString()
This is the documentation I read:
https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Working_with_Text_Documents
Also I looked at this extension: https://github.com/voikko/libreoffice-voikko
I also ran the Capitalise.py example under pyCharm remote debugger, but couldn't find any hints.
Automatic hyphens do not actually occur in the text in LibreOffice. Instead, they are displayed as needed. When a format such as PDF is exported, or if the document is printed, then hyphens are shown in the output.
The Hyphenator service is fairly easy to use in macros, and allows a word to be split up according to possible hyphenation positions.
To really determine where hyphens are getting displayed on screen, the following may work:
Traverse the document with a word cursor. Andrew Pitonyak's Macro Document section 7.3.8.5 gives an example of this in Basic.
Move the view cursor to the beginning of each word and check the Y position. For example, if self.oVC is the view cursor, then check the value of self.oVC.getPosition().Y.
Move the cursor to the end of the word, and see if the Y position changed.
If it did, then presumably the word was hyphenated.
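The steps above might look like this as a Python macro. It is an untested sketch: it assumes the document's text cursor supports the word-cursor methods (gotoEndOfWord, gotoNextWord) and that the view cursor reports a different Y coordinate on each line. The function name is made up, and word boundaries in Writer can be quirky around punctuation.

```python
def find_words_split_across_lines(doc):
    """Return the words whose start and end are drawn on different lines."""
    word_cursor = doc.getText().createTextCursor()
    view_cursor = doc.getCurrentController().getViewCursor()
    split_words = []

    word_cursor.gotoStart(False)
    while True:
        word_cursor.gotoEndOfWord(True)  # select from word start to word end
        word = word_cursor.getString()
        if word.strip():
            # Compare the on-screen Y position at both ends of the word.
            view_cursor.gotoRange(word_cursor.getStart(), False)
            start_y = view_cursor.getPosition().Y
            view_cursor.gotoRange(word_cursor.getEnd(), False)
            end_y = view_cursor.getPosition().Y
            if start_y != end_y:
                split_words.append(word)
        if not word_cursor.gotoNextWord(False):
            break  # no further word in the document
    return split_words
```

Inside LibreOffice it would be called as find_words_split_across_lines(XSCRIPTCONTEXT.getDocument()). A word that is split across two lines is only presumed to be hyphenated, as noted above.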

Problems writing a regex in testcases.xml of pylot

I have to verify that a list of strings is present in the response to a SOAP request. I am using the Pylot testing tool. I know that if I use a plain string inside a <verify>abcd</verify> element it works fine. I have to use a regex though, and I am running into problems since I am not good with regex.
I have to verify if <TestName>Abcd Hijk</TestName> is present in my response for the request sent.
Following is my attempt to write the regex inside testcases.xml
<verify>[.TestName.][\w][./TestName.]</verify>
Is this the correct way to write a regex in testcases.xml file? I want to exactly verify the tagnames and its values mentioned above.
When I run the tool, it gives me no errors. But if I change the characters to <verify>[.TesttttName.][\w][./TestttttName.]</verify> and run the tool, it still runs without errors, although this should be a failed run since no such tag is present in the response!
Could someone please tell me what I am doing wrong in the regex here?
Any help would be appreciated. Thanks!
The regex used should be like the following.
<verify>&lt;TestName&gt;[\w\s]+&lt;/TestName&gt;</verify>
The reason is that Pylot holds the response content as plain text, so the relevant part of the response looks like this:
.......<TestName>ABCd Hijk</TestName>.....
When Pylot parses an element in testcases.xml, it takes the value of the element as text. It then searches the response it received for that 'verify text'.
Hence, whenever we want to verify anything in Pylot using a regex, we need to write the regex in the above format so that it gives the required results.
Note: Be careful about the format of the response you receive. To view the response, enable log messages in the tool, or, to see it on the console, edit the tool's engine.py module and insert print statements.
The raw regular expression (no XML escape). I assume you want to accept English alphabet a-zA-Z, digits 0-9, underscore _ and space characters (space, new line, carriage return, and a few others - check documentation for details).
<TestName>[\w\s]+</TestName>
You need to escape the < and > to put it inside the <verify> tag:
&lt;TestName&gt;[\w\s]+&lt;/TestName&gt;
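The effect of the escaping can be checked outside Pylot with the standard library. This is a sketch that assumes Pylot hands the element text to a regex search, as described above. The sample response is invented. It also shows why the original attempt never failed: each [...] is a character class that matches a single character, so the pattern matches any three suitable characters in a row.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical response body.
response = "<Result><TestName>Abcd Hijk</TestName></Result>"

# The XML parser turns &lt; and &gt; back into < and > when reading the element.
verify_xml = r"<verify>&lt;TestName&gt;[\w\s]+&lt;/TestName&gt;</verify>"
pattern = ET.fromstring(verify_xml).text
matched = re.search(pattern, response) is not None

# The original attempt: three character classes, i.e. any three characters
# drawn from those sets, such as "Tes" inside "TestName".
loose_pattern = r"[.TesttttName.][\w][./TestttttName.]"
loose_matched = re.search(loose_pattern, response) is not None

print(pattern, matched, loose_matched)
```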

Cleaning an XML file in Python before parsing

I'm using minidom to parse an XML file and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like ไอเฟล &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or one of the </> characters, but it isn't quite working.
Try
xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)
It will get rid of everything except 0x20-0x7F range.
You may start from \x01 if you want to keep control characters like tabs and line breaks. Note that this also keeps the other control characters, which XML 1.0 does not allow.
xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)
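For comparison, here is a sketch of the ASCII-only substitution next to a narrower variant that is not part of the answer above: it removes only the control characters XML 1.0 forbids (everything below 0x20 except tab, line feed and carriage return) and leaves other text alone. It assumes the input is already a decoded str; the sample string is made up.

```python
import re

# From the answer: drop everything outside printable ASCII.
ASCII_PRINTABLE_ONLY = re.compile(r"[^\x20-\x7f]+")

# Narrower alternative (an assumption, not the answer's method): drop only
# the C0 control characters that XML 1.0 does not allow.
INVALID_XML_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

sample = "caf\u00e9\x08 ok\n"

ascii_only = ASCII_PRINTABLE_ONLY.sub("", sample)
controls_removed = INVALID_XML_CONTROLS.sub("", sample)

print(repr(ascii_only), repr(controls_removed))
```

The first result loses the accented letter and the newline; the second keeps both and removes only the backspace.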
Take a look at µTidyLib, a Python wrapper to TidyLib.
If you do need the data with the strange characters, you could, instead of just stripping them, convert them to codes the XML parser can understand.
You could have a look at the unicodedata package, especially the normalize method.
I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.
>>> import unicodedata
>>> unicodedata.normalize("NFKD" , u"ไภเฟล &")
u'a\u03001\u201ea\u0300 \u0327 a\u03001\u20aca\u0300 \u0327Y\u0308a\u0300 \u0327\xa5 &'
It looks like you're dealing with data which are saved with some kind of encoding "as if" they were ASCII. XML file should normally be UTF8, and SAX (the underlying parser used by minidom) should handle that, so it looks like something's wrong in that part of the processing chain. Instead of focusing on "cleaning up" I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML directive? Can you edit your Q to show the first few lines of the file, especially the <?xml ... directive at the very start?
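A quick way to test this theory is to feed minidom bytes whose encoding matches the XML declaration. In the sketch below the sample document is invented: the non-ASCII text parses without any cleaning once the bytes really are UTF-8 and the declaration says so. The & is written as &amp;, since a bare ampersand is not well-formed XML whatever the encoding.

```python
from xml.dom import minidom

# Hypothetical page: non-ASCII text, UTF-8 bytes, matching XML declaration.
raw = '<?xml version="1.0" encoding="UTF-8"?><page>caf\u00e9 &amp; cr\u00e8me</page>'.encode("utf-8")

doc = minidom.parseString(raw)

# Join the child text nodes in case the parser delivers the text in pieces.
text = "".join(node.data for node in doc.documentElement.childNodes)
print(text)
```

If the real file fails where this succeeds, the first few lines of the file, and the bytes actually stored in it, are the place to look.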
I'd throw out all non-ASCII characters which can be identified by having the 8th bit (0x80) set (128 .. 255 respectively 0x80 .. 0xff).
You could read in the file into a Python string named old_str
Then perform a filter call in conjunction with a lambda statement:
new_str = "".join(filter(lambda ch: ord(ch) < 128, old_str))
Parse new_str
Many ways exist to accomplish stripping non-ASCII characters from a string.
This question might be related: How to check if a string in Python is in ASCII?
