python: escaping non-ascii characters in XML

I got my test XML file to print using the following source file, but it doesn't handle non-ASCII characters appropriately:
xmltest.py:
import xml.sax.xmlreader
import xml.sax.saxutils
def testJunk(file, e2content):
attr0 = xml.sax.xmlreader.AttributesImpl({})
x = xml.sax.saxutils.XMLGenerator(file)
x.startDocument()
x.startElement("document", attr0)
x.startElement("element1", attr0)
x.characters("bingo")
x.endElement("element1")
x.startElement("element2", attr0)
x.characters(e2content)
x.endElement("element2")
x.endElement("document")
x.endDocument()
If I do
>>> import xmltest
>>> xmltest.testJunk(open("test.xml","w"), "ascii 001: \001")
then I get an xml file with character code 001 in it. I can't figure out how to escape this character. Firefox tells me it's not well formed XML and complains about that character. How can I fix this?
clarification: I'm trying to log the output of a function I do not have control over, which outputs non-ASCII characters.
update: OK, so now I know characters outside one of the accepted ranges can't be encoded in the form &#x01;. (Or rather, they can be encoded, but that doesn't help any w/r/t XML not being well-formed.) But they can be escaped if I define a way of doing so.
(for future reference: W3C has a useful page outside the XML standard itself which says "Control codes should be replaced with appropriate markup" but doesn't really suggest any examples for doing so.)
If I wanted to escape characters outside the accepted range in the following way:
before escaping: (&#x0001; represents one character, not the literal 8-character string)
abcd&#x0001;efgh&#x0002;ijkl
after escaping:
abcd<u>0001</u>efgh<u>0002</u>ijkl
How could I do this in python?
def escapeXML(src):
    dest = ??????
    return dest

"\001" aka "\x01" is an ASCII control code. It is not however one of the permitted XML characters. The only ASCII control codes which qualify are "\t", "\n" and "\r".
Examples:
>>> import xml.etree.cElementTree as ET
# Raw newline works
>>> t = ET.fromstring("<e>\n</e>")
>>> t.text
'\n'
# Hex escaping of a newline works
>>> t = ET.fromstring("<e>&#x0A;</e>")
>>> t.text
'\n'
# Hex escaping of "\x01" doesn't work; it's not a valid XML character
>>> t = ET.fromstring("<e>&#x01;</e>")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 106, in XML
cElementTree.ParseError: reference to invalid character number: line 1, column 3
If you want to include invalid XML characters somehow in an XML document, they must be hidden from the XML parser by an extra level of escaping. The mechanism needs to be documented, published, and understood by readers of your document.
For example, in Microsoft Excel 2007+ XLSX files, Unicode code points which are not valid XML characters are smuggled past the parser by representing them as _xhhhh_ where hhhh is the hex representation of the codepoint. In your example this would be the 7 bytes _x0001_. Note that it is necessary to escape any _ characters in the text that would otherwise be falsely interpreted as introducing an _xhhhh_ sequence.
This is ugly, painful, inefficient, etc. You may wish to consider other methods. Is use of XML necessary? Would a CSV file (shock, horror!) do a better job in your application?
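For concreteness, here is a minimal sketch of such a scheme (Python 2). The mechanism follows the _xhhhh_ description above, but the exact set of characters Excel escapes, and its corner cases, are assumptions here:
import re

# Characters that are not valid in XML 1.0 (BMP controls only, for brevity).
ILLEGAL = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')
ESCAPE_SEQ = re.compile(r'_x([0-9A-Fa-f]{4})_')

def x_escape(s):
    # First hide any literal _xhhhh_ by escaping its leading underscore,
    # then smuggle the illegal characters past the parser.
    s = ESCAPE_SEQ.sub(lambda m: '_x005F_x%s_' % m.group(1), s)
    return ILLEGAL.sub(lambda m: '_x%04X_' % ord(m.group(0)), s)

def x_unescape(s):
    return ESCAPE_SEQ.sub(lambda m: unichr(int(m.group(1), 16)), s)

assert x_escape(u'ascii 001: \x01') == u'ascii 001: _x0001_'
assert x_unescape(x_escape(u'literal _x0001_ and real \x01')) == u'literal _x0001_ and real \x01'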
Edit Some notes on the OP's encoding proposal:
A. Although \r is a valid XML 1.0 input character, it is subject to mandatory immediate transmogrification, so you should escape it as well.
B. This scheme assumes/hopes that the <u>hhhh</u> cannot be confused with any other markup.
C. I take back what I said above about the Microsoft escaping scheme. It is relatively beautiful, painfree, and efficient. To complete the picture of your scheme for your gentle readers, you should show the code that is required to unescape the nasty bits and glue the pieces back together (a sketch follows these notes). Bear in mind that the MS scheme requires somebody to write one escaping function and one unescaping function, whereas your scheme requires different treatment for each tool (SAX, DOM, ElementTree).
D. At the detailed level, the code is a little bit whiffy:
if (len(g1) > 0): should be if g1:
if (not foo == None): has a record THREE deviations from the commonly accepted idiom: (1) the parentheses (2) not x == y instead of x != y (3) != None instead of is not None
Don't use list (and names of other built-in objects) as a name for your own variable.
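As a sketch of what that unescaping code might look like (my illustration, not the OP's code; it assumes the <u>hhhh</u> elements are direct children of the element being read):
import xml.etree.cElementTree as ET

def unescape_u_scheme(elem):
    # Stitch together the element's text, each decoded <u>hhhh</u> child,
    # and the text that trails each child.
    parts = [elem.text or u'']
    for u in elem:
        parts.append(unichr(int(u.text, 16)))
        parts.append(u.tail or u'')
    return u''.join(parts)

doc = ET.fromstring('<document>this is a <u>0001</u> test <u>000B</u> of something</document>')
assert unescape_u_scheme(doc) == u'this is a \x01 test \x0b of something'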
Edit 2 You want to split up a string using a regex. Why not use re.split?
splitInvalidXML2 = re.compile(
    ur'([^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD])'
    ).split

def submitCharacters2(x, string):
    badchar = True
    for fragment in splitInvalidXML2(string):
        badchar = not badchar
        if badchar:
            x.startElement("u", attr0)
            x.characters('%04X' % ord(fragment))
            x.endElement("u")
        elif fragment:
            x.characters(fragment)

This seems to work for me.
import re

r = re.compile(ur'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF' \
    + ur'\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]')

def escapeInvalidXML(string):
    def replacer(m):
        return "<u>" + ('%04X' % ord(m.group(0))) + "</u>"
    return re.sub(r, replacer, string)
example:
>>> s='this is a \x01 test \x0B of something'
>>> escapeInvalidXML(s)
'this is a <u>0001</u> test <u>000B</u> of something'
>>> s2 = u'this is a \x01 test \x0B of \uFDD0'
>>> escapeInvalidXML(s2)
u'this is a <u>0001</u> test <u>000B</u> of <u>FDD0</u>'
Character ranges from http://www.w3.org/TR/2006/REC-xml-20060816/#charsets, and I haven't escaped everything, just the ones below \uFFFF.
Update: Oops, forgot to adapt to the startElement/characters methods of SAX, & deal properly with multiple lines:
import re
import xml.sax.xmlreader
import xml.sax.saxutils

r = re.compile(ur'(.*?)(?:([^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF' \
    + ur'\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD])|([\n])|$)')
attr0 = xml.sax.xmlreader.AttributesImpl({})

def splitInvalidXML(string):
    list = []
    def replacer(m):
        g1 = m.group(1)
        if (len(g1) > 0):
            list.append(g1)
        g2 = m.group(2)
        if (not g2 == None):
            list.append(ord(g2))
        g3 = m.group(3)
        if (not g3 == None):
            list.append(g3)
        return ""
    re.sub(r, replacer, string)
    return list

def submitCharacters(x, string):
    for fragment in splitInvalidXML(string):
        if (isinstance(fragment, int)):
            x.startElement("u", attr0)
            x.characters('%04X' % fragment)
            x.endElement("u")
        else:
            x.characters(fragment)

def test1(fname):
    with open(fname, 'w') as f:
        x = xml.sax.saxutils.XMLGenerator(f)
        x.startDocument()
        x.startElement('document', attr0)
        submitCharacters(x, 'this is a \x01 test\nof the \x02\x0b xml system.')
        x.endElement('document')
        x.endDocument()

test1('test.xml')
This produces:
<?xml version="1.0" encoding="iso-8859-1"?>
<document>this is a <u>0001</u> test
of the <u>0002</u><u>000B</u> xml system.</document>

There's an open Python bug for this: https://bugs.python.org/issue5166. It's not yet clear what the resolution will be, or whether it will be fixed at all, as it's been open for a while now, but it's worth checking periodically in case a proper way of handling invalid XML characters gets built into Python itself.

Related

python 3 regex not finding confirmed matches

So I'm trying to parse a bunch of citations from a text file using the re module in python 3.4 (on, if it matters, a mac running mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works)
def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- it matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that python has appended the bytecode flag (is the little b called a "flag?") to the string. This is an artifact of my attempt to convert the text from utf-8 to ascii, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken, regex expression is puking on the combination of newline characters and the string being treated as a byte object, even though a) I know the regex is correct for newlines (confirmed by the linked regex101), and b) I know it's matching the strings (confirmed by the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (how?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert to ascii, replacing non-ascii characters:
This function gets called on utf-8 .txt files saved by TextWrangler in Mavericks:
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
import codecs

def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call some kind of encode method on the decoded thing. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer here (How to make new line commands work in a .txt file opened from the internet?) which suggests that it does.
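For what it's worth, here's a minimal illustration (Python 3) of why str() on a bytes object is not the same as decoding it:
b = 'line1\nline2'.encode('ascii')
print(str(b))             # b'line1\nline2' -- the repr: one line, with a literal backslash-n
print(b.decode('ascii'))  # two real lines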

Unexpected behaviour of t.unicode('utf-8')

I have a json file with several keys. I want to use one of the keys and write that string to a file. The string is originally unicode, so I do s.encode('utf-8').
Now, there is another key in that json which I write to another file (this is a machine learning task; I am writing the original string in one file and features in another). The problem is that at the end, the file with the unicode string turns out to have a greater number of lines (when counted using "wc -l"), and this misguides my tool, which crashes saying the sizes are not the same.
Code for reference:
for line in input_file:
    j = json.loads(line)
    text = j['text']
    label = j[t]
    output_file.write(str(label) + '\t' + text.encode('utf-8') + '\n')
    norm_file.write(j['normalized'].encode('utf-8') + '\n')
The difference shows up when using "wc -l". The number of lines I expect is
16862965
but what I get is
16878681
which is actually higher. So I write a script to see how many output labels are actually there:
import sys

c = 0
with open(sys.argv[1]) as input_file:
    for line in input_file:
        p = line.split('\t')
        if p[0] not in ("good", "bad"):
            print p
        else:
            c += 1
print c
And, lo and behold, I have 16862965 lines, which means some are wrong. I print them out and I get a bunch of empty new line chars ('\n'). So I guess my question is, "what am i missing when dealing with unicode like this?"
Should I have stripped all leading and trailing spaces (not that there are any in the string)?
JSON strings can't contain literal newlines in them e.g.,
not_a_json_string = '"\n"' # in Python source
json.loads(not_a_json_string) # raises ValueError
but they can contain escaped newlines:
json_string = r'"\n"' # raw-string literal (== '"\\n"')
s = json.loads(json_string)
i.e., the original text (json_string) has no newlines in it (it has the backslash followed by n character -- two characters) but the parsed result does contain the newline: '\n' in s.
That is why the example:
for line in file:
    d = json.loads(line)
    print(d['key'])
may print more lines than the file contains.
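A minimal demonstration of the effect:
import json

line = '{"text": "one\\ntwo"}'    # a single physical line of JSON
value = json.loads(line)['text']  # the parsed value contains a real newline
print value                       # prints two lines, though the input was one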
It is unrelated to utf-8.
In general, there could also be an issue with non-native newlines e.g., b'\r\r\n\n', or an issue with Unicode newlines such as u'"\u2028"' (U+2028 LINE SEPARATOR).
Do the same check you were doing on the files written but before you write them, to see how many values get flagged. And make sure those values don't have '\\n' in them. That may be skewing your count.
For better details, see J.F.'s answer below.
Unrelated-to-your-error notes:
(a) When JSON is loads()ed, str objects are automatically unicode already:
>>> a = '{"b":1}'
>>> json.loads(a)['b']
1
>>> json.loads(a).keys()
[u'b']
>>> type(json.loads(a).keys()[0])
<type 'unicode'>
So str(label) in the file write should be either just label or unicode(label). You shouldn't need to encode text and j['normalized'] when you write them to file. Instead, set the file encoding to 'utf-8' when you open it (see the sketch after these notes).
(b) Btw, use format() or join() in the write operations - if any of label, text or j['normalized'] is None, the + operator will give an error.
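Putting notes (a) and (b) together, a sketch of the write loop (the filenames here are made up; input_file, t and j are as in the question; Python 2):
import codecs
import json

# Opening with an explicit encoding lets write() accept unicode directly.
output_file = codecs.open('labels.tsv', 'w', encoding='utf-8')
norm_file = codecs.open('normalized.txt', 'w', encoding='utf-8')

for line in input_file:
    j = json.loads(line)
    output_file.write(u'{0}\t{1}\n'.format(j[t], j['text']))
    norm_file.write(u'{0}\n'.format(j['normalized']))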

Extract Text from a Binary File (using Python 2.7 on Windows 7)

I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..
This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).
The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.
for example: assume ^# ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.
^#^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
^#^#^#^#^#^#^#^#^#^#^#TTBBTT^X^X^X^X^X^X^X^X^X
^X^X^X blah blah blah... of control characters.. and then the message comes..
MVT MESSAGE 2
ED1123
etc.
and so on.. for several messages.
Using Perl.. it is easy to do:
while (<>) {
    # use regular expression to split messages
    m/ /
}
How would one do this in python easily..
How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.
In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.
Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.
Thanks.
Update 02Jan after reading your answers/comments
After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.
import fileinput
import sys
import re

strfile = r'C:\Users\' \
    r'\Learn\python\mvt\sitatex_test.msgs'
f = open(strfile, 'rb')
contents = f.read()  # read whole file in contents

# extract the string between two \xaaU.. multiline pattern match
# with look ahead assertion
# and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I | re.DOTALL | re.M)
for msg in msgs:
    # loop through msgs.. to find the first msg then next and so on.
    print "## NEW MESSAGE STARTS HERE ##"
    # for each msg split the lines.. to read line by line
    # stored as list in msglines
    msglines = msg.splitlines()
    line = 0

# then process each msgline with a message
for msgline in msglines:
    line += 1
    #msgline = re.sub(r'[\x00]+', r' ', msgline)
    mystr = msgline
    print mystr
    textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)
So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pickup (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.
Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.
For example.. say I have this: assume ^# and ^X are some control characters
line1 = '^#UG^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).
When I print the line as it is on IDLE.. print line1.. it prints only say the first 2 or 3 characters.. and ignores the rest due to the control characters get choked.
However, when I print with this: print re.findall(r'.*', line1)
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11', '']
It prints nicely with all the control characters converted to \xHH format.. and ascii text intact.. (just as I want it)..with one catch.. the list has two items.. with '' in the end.
What is the explanation for the empty string in the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.
Is using re.findall(r'.*', line1) is the only easy solution.. to do this conversion.. or are there any other straightforward method.. to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace).
Also.. any other useful comments to get the lines out nicely.
Thanks.
Update 2Jan2011 - Part 2
I have found out that re.findall(r'.+', line1) strips to
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11']
without the extra blank '' item in the list. This finding after numerous trial and errors.
Still I will need assistance to eliminate the list altogether but return just a string.
like this:
'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'
Added Info on 05Jan:
#John Machin
1) \xaaU is the delimiter between messages.. In the example.. I may have just left out in the samples. Please see below for one actual message that ends with \xaaU (but left out).
Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')
I am trying to understand the binary format.. this is a typical message which is sent out
the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.
The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.
As soon as all the.. \x00000000 finishes..
the sita header and subject starts
In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested all the stuff between \x00.
and the body comes next.. after the final \x00 or any other control character... In this case... it is:
COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE PROBLEM
once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.
This is one complete message.
"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00
\x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02
\x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001
\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE
PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
2) I am not using .+ anymore after the 'repr' is known.
3) each Message is multiline.. and i need to preserve all the control characters to make some sense of this proprietary format.. that is why i needed repr to see it up close.
Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.
Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.
Thank you.
Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:
import re
with open('yourfile.pst') as f:
contents = f.read()
textstrings = re.findall(r'[\x20-\x7E]+', contents)
That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.
Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.
Q: How to read the file? binary and text interspersed
A: Don't bother, just read it as normal text and you'll be able to keep your binary/text dichotomy (otherwise you won't be able to regex it as easily)
fh = open('/path/to/my/file.ext', 'r')
fh.read()
Just in case you want to read binary later for some reason, you just add a b to the second input of the open:
fh = open('/path/to/my/file.ext', 'rb')
Q: Eliminate unnecessary control characters
A: Use the python re module. Your next question sorta asks how.
Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
A: re module has a findall function that works as you (mostly) expect.
import re
mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
Q: print out the required stuff
A: There's a print function...
print usefultext
Q: Loop through all the lines.. and more files.
fh = open('/some/file.ext', 'r')
for lines in fh.readlines():
    # do stuff
I'll let you figure out the os module to figure out what files exist/how to iterate through them.
You say:
Still I will need assistance to eliminate the list altogether but return just a string. like this
In other words, you have foo = [some_string] and you are doing print foo which as a side does repr(some_string) but encloses it in square brackets which you don't want. So just do print repr(foo[0]).
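For example, with a made-up fragment:
>>> foo = ['\xaaUG\x02\x05\x00..BLLBBCC']
>>> print foo
['\xaaUG\x02\x05\x00..BLLBBCC']
>>> print repr(foo[0])
'\xaaUG\x02\x05\x00..BLLBBCC'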
There seem to be several things unexplained:
You say the useful text is bracketed by \xaaU but in the sample file instead of 2 occurrences of that delimiter there is only \xaa (missing the U) near the start, and nothing else.
You say
I have found out that re.findall(r'.+', line1) strips to ...
That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What have you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write
I need to parse the text line by line and word by word
but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.
Is it possible to clarify all of the above?

How can I determine a Unicode character from its name in Python, even if that character is a control character?

I'd like to create an array of the Unicode code points which constitute white space in JavaScript (minus the Unicode-white-space code points, which I address separately). These characters are horizontal tab, vertical tab, form feed, space, non-breaking space, and BOM. I could do this with magic numbers:
whitespace = [0x9, 0xb, 0xc, 0x20, 0xa0, 0xfeff]
That's a little bit obscure; names would be better. The unicodedata.lookup method passed through ord helps some:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
But this doesn't work for 0x9, 0xb, or 0xc -- I think because they're control characters, and the "names" FORM FEED and such are just alias names. Is there any way to map these "names" to the characters, or their code points, in standard Python? Or am I out of luck?
Kerrek SB's comment is a good one: just put the names in a comment.
BTW, Python also supports a named unicode literal:
>>> u"\N{NO-BREAK SPACE}"
u'\xa0'
But it uses the same unicode name database, and the control characters are not in it.
You could roll your own "database" for the control characters by parsing a few lines of the UCD files in the Unicode public directory. In particular, see the UnicodeData-6.1.0d3 file (or see the parent directory for earlier versions).
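A rough sketch of that approach, assuming a local copy of UnicodeData.txt (the field positions follow the published UCD format: field 1 is the name, '<control>' for controls, and field 10 is the old Unicode 1.0 name such as FORM FEED):
def control_char_names(path='UnicodeData.txt'):
    # Map the old Unicode 1.0 names of control characters to code points.
    names = {}
    for line in open(path):
        fields = line.split(';')
        if fields[1] == '<control>' and fields[10]:
            names[fields[10]] = int(fields[0], 16)
    return names

# e.g. control_char_names()['FORM FEED'] == 0x0C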
I don't think it can be done in standard Python. The unicodedata module uses the UnicodeData.txt v5.2.0 Unicode database. Notice that the control characters are all assigned the name <control> (the second field, semicolon-delimited).
The script Tools/unicode/makeunicodedata.py in the Python source distribution is used to generate the table used by the Python runtime. The makeunicodename function looks like this:
def makeunicodename(unicode, trace):
    FILE = "Modules/unicodename_db.h"
    print "--- Preparing", FILE, "..."
    # collect names
    names = [None] * len(unicode.chars)
    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)
    ...
Notice that it skips over entries whose name begins with "<". Hence, there is no name that can be passed to unicodedata.lookup that will give you back one of those control characters.
Just hardcode the code points for horizontal tab, line feed, and carriage return, and leave a descriptive comment. As the Zen of Python goes, "practicality beats purity".
A few points:
(1) "BOM" is not a character. BOM is a byte sequence that appears at the start of a file to indicate the byte order of a file that is encoded in UTF-nn. BOM is u'\uFEFF'.encode('UTF-nn'). Reading a file with the appropriate codec will slurp up the BOM; you don't see it as a Unicode character. A BOM is not data. If you do see u'\uFEFF' in your data, treat it as a (deprecated) ZERO-WIDTH NO-BREAK SPACE.
(2) "minus the Unicode-white-space code points, which I address separately"?? Isn't NO-BREAK SPACE a "Unicode-white-space" code point?
(3) Your Python appears to be broken; mine does this:
>>> ord(unicodedata.lookup("NO-BREAK SPACE"))
160
(4) You could use escape sequences for the first three.
>>> map(hex, map(ord, "\t\v\f"))
['0x9', '0xb', '0xc']
(5) You could use " " for the fourth one.
(6) Even if you could use names, the readers of your code would still be applying blind faith that e.g. "FORM FEED" is a whitespace character.
(7) What happened to \r and \n?
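To make point (1) concrete (Python 2):
>>> u'\ufeff'.encode('utf-8')
'\xef\xbb\xbf'
>>> u'\ufeff'.encode('utf-16-be')
'\xfe\xff'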
Assuming you're working with Unicode strings, the first five items in your list, plus all other Unicode space characters, will be matched by the \s option when using a regular expression. Using Python 3.1.2:
>>> import re
>>> s = '\u0009,\u000b,\u000c,\u0020,\u00a0,\ufeff'
>>> s
'\t,\x0b,\x0c, ,\xa0,\ufeff'
>>> re.findall(r'\s', s)
['\t', '\x0b', '\x0c', ' ', '\xa0']
And as for the byte-order mark, the one given can be referred to as codecs.BOM_BE or codecs.BOM_UTF16_BE (though in Python 3+, it's returned as a bytes object rather than str).
The official Unicode recommendation for newlines may or may not be at odds with the way the Python codecs module handles newlines. Since u'\n' is often said to mean "new line", one might expect based on this recommendation for the Python string u'\n' to represent character U+2028 LINE SEPARATOR and to be encoded as such, rather than as the semantic-less control character U+000A. But I can only imagine the confusion that would result if the codecs module actually implemented that policy, and there are valid counter-arguments besides. Ditto for horizontal/vertical tab and form feed, which are probably not really characters but controls anyway. (I would certainly consider backspace to be a control, not a character.)
Your question seems to assume that treating U+000A as a control character (instead of a line separator) is wrong; but that is not at all certain. Perhaps it is more wrong for text processing applications everywhere to assume that a legacy printer-platen-scrolling control signal is really a true "line separator".
You can extend the lookup function to handle the characters that aren't included.
def unicode_lookup(x):
    try:
        ch = unicodedata.lookup(x)
    except KeyError:
        control_chars = {'LINE FEED': unichr(0x0a), 'CARRIAGE RETURN': unichr(0x0d)}
        if x in control_chars:
            ch = control_chars[x]
        else:
            raise
    return ch
>>> unicode_lookup('SPACE')
u' '
>>> unicode_lookup('LINE FEED')
u'\n'
>>> unicode_lookup('FORM FEED')
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
unicode_lookup('FORM FEED')
File "<pyshell#13>", line 3, in unicode_lookup
ch = unicodedata.lookup(x)
KeyError: "undefined character name 'FORM FEED'"

How can I disable 'output escaping' in minidom

I'm trying to build an xml document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (Registered Trademark) symbol in. My objective is for when I finally hit print mydoc.toxml() this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.or
g/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought
import xml.dom.minidom as mdom
data = '&#174;' #Both accepted xml encodings for registered trademark
data = '&reg;'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because when I print text.toxml(), the ampersands are being escaped, I get this output:
&reg;
&amp;reg;
My question is, does anybody know of a way that I can force the ampersands not to be escaped in the output, so that I can have my special character reference carry through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of &reg; or &#174; in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using the # -*- coding: utf-8 -*- Python source tag). This almost helped in the sense that it actually caused the ® symbol itself to be output to my xml. This is actually not the result I am looking for. As you may have guessed by now (and perhaps I should have specified earlier) this xml document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (Â® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii','xmlcharrefreplace')
print text.toxml()
But of course this lead to the same origional problem where all that happens is the ampersand gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf for the document (in other words, achieving my original goal of having &reg; or &#174; appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. Seems like getting my html meta information correct <META http-equiv="Content-Type" Content="text/html; charset=UTF-8"> could help, but I'm not sure yet how this fits in with the xml structure...
Two options that work, one with the escaping &#174; and the other without. It's not really obvious why you want escaping ... it's 6 bytes instead of the 2 or 3 bytes for non-CJK characters.
import xml.dom.minidom as mdom
text = mdom.Text()
# Start with unicode
text.data = u'\xae'
f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()
f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output unicode characters encoded as entities (e.g. &#174;). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output except for the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text &#174;.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output from minidom's toxml(), which converts any non-ASCII characters to their equivalent XML numeric character references.
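For instance:
>>> import xml.dom.minidom as mdom
>>> text = mdom.Text()
>>> text.data = u'\xae'
>>> text.toxml().encode('ascii', 'xmlcharrefreplace')
'&#174;'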
Default unescape:
from xml.sax.saxutils import unescape
unescape("&lt; &amp; &gt;")
The result is:
'< & >'
And, unescape more:
unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
Check details here, https://wiki.python.org/moin/EscapingXml
