Unexpected behaviour of t.unicode('utf-8') - Python

I have a JSON file with several keys. I want to take one of the keys and write that string to a file. The string is originally unicode, so I do s.unicode('utf-8').
Now, there is another key in that JSON which I write to another file (this is a machine learning task: I write the original string to one file and the features to another). The problem is that, at the end, the file with the unicode string turns out to have more lines (when counted with "wc -l"), which misleads my tool and it crashes complaining the sizes are not the same.
Code for reference:
for line in input_file:
    j = json.loads(line)
    text = j['text']
    label = j[t]
    output_file.write(str(label) + '\t' + text.encode('utf-8') + '\n')
    norm_file.write(j['normalized'].encode('utf-8') + '\n')
The difference when using "wc -l":
16862965
This is the number of lines I expect, and what I get is
16878681
which is actually higher. So I wrote a script to see how many output labels are actually there:
import sys

c = 0
with open(sys.argv[1]) as input_file:
    for line in input_file:
        p = line.split('\t')
        if p[0] not in ("good", "bad"):
            print p
        else:
            c += 1
print c
And, lo and behold, c comes out to 16862965, which means the remaining lines are wrong. I print them out and get a bunch of empty lines (just '\n'). So I guess my question is, "what am I missing when dealing with unicode like this?"
Should I have stripped all leading and trailing spaces (not that there are any in the string)?

JSON strings can't contain literal newlines in them e.g.,
not_a_json_string = '"\n"' # in Python source
json.loads(not_a_json_string) # raises ValueError
but they can contain escaped newlines:
json_string = r'"\n"' # raw-string literal (== '"\\n"')
s = json.loads(json_string)
i.e., the original text (json_string) has no newlines in it (it has the backslash followed by n character -- two characters) but the parsed result does contain the newline: '\n' in s.
That is why the example:
for line in file:
    d = json.loads(line)
    print(d['key'])
may print more lines than the file contains.
It is unrelated to utf-8.
In general, there could also be an issue with non-native newlines e.g., b'\r\r\n\n', or an issue with Unicode newlines such as u'"\u2028"' (U+2028 LINE SEPARATOR).

Do the same check you were doing, but on the values before you write them to the files, to see how many get flagged. And make sure those values don't have '\n' in them; that may be skewing your count.
For better details, see J.F.'s answer above.
Unrelated-to-your-error notes:
(a) When JSON is loads()ed, string values are already unicode:
>>> a = '{"b":1}'
>>> json.loads(a)['b']
1
>>> json.loads(a).keys()
[u'b']
>>> type(json.loads(a).keys()[0])
<type 'unicode'>
So str(label) in the file write should be either just label or unicode(label). You shouldn't need to encode text and j['normalized'] when you write them to file. Instead, set the file encoding to 'utf-8' when you open it.
(b) Btw, use format() or join() in the write operations - if any of label, text or j['normalized'] is None, the + operator will give an error.
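Putting notes (a) and (b) together, a rough sketch of the write loop, assuming Python 2 with io.open, that input_file and t are defined as in the question, and that replacing embedded newlines with spaces is acceptable (the output filenames here are placeholders):
import io
import json

with io.open('labels.tsv', 'w', encoding='utf-8') as output_file, \
     io.open('normalized.txt', 'w', encoding='utf-8') as norm_file:
    for line in input_file:
        j = json.loads(line)
        # format() copes with None and numeric labels; the file objects handle the utf-8 encoding
        output_file.write(u'{}\t{}\n'.format(j[t], j['text'].replace(u'\n', u' ')))
        norm_file.write(u'{}\n'.format(j['normalized'].replace(u'\n', u' ')))
Replacing the embedded newlines keeps both files' line counts in sync with the number of JSON records.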

Related

When writing a dictionary to a YAML file in Python, how do I make sure the string in the YAML file is split based on '\n'?

I have a long string in a dictionary which I will dump to a YAML file.
As an example:
d = {'test': {'long_string': "this is a long string that does not successfully split when it sees the character '\n' which is an issue"}}
ff = open('./test.yaml', 'w+')
yaml.safe_dump(d, ff)
Which produces the following output in the YAML file
test:
  long_string: "this is a long string that does not successfully split when it sees\
    \ the character '\n' which is an issue"
I want the string inside the YAML file to be split onto a new line only where the "\n" is; also, I don't want any characters indicating that it's a newline. I want the output as follows:
test:
  long_string: "this is a long string that does not successfully split when it sees the character ''
    which is an issue"
What do I need to do to make the yaml.dump or yaml.safe_dump fulfill this?
There is no general solution. YAML is a format intentionally designed in a way that lets the implementation decide on the exact representation of values.
What you can do is to suggest a format. The dumper will honor this suggestion if possible. The one scalar format that breaks at literal newlines in the value and nowhere else is a literal block scalar. This code will dump your string as such if possible:
import yaml, sys

class as_block(str):
    @staticmethod
    def represent(dumper, data):
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')

yaml.SafeDumper.add_representer(as_block, as_block.represent)

d = {'test': {'long_string': as_block('this is a long string that does not succesfully split when it sees the character\n which is an issue')}}
yaml.safe_dump(d, sys.stdout)
Output:
test:
  long_string: |-
    this is a long string that does not succesfully split when it sees the character
     which is an issue
I use as_block for the string that should be written as block scalar.
You can theoretically use this for all strings, but be aware that long_string and test would then also be written as block scalars, which is most probably not what you want.
This will not work when there is space before the line break, because YAML ignores space at the end of a line of a block scalar, so the serializer will choose another format to not lose the space character(s).
You can also take a step back and ask yourself why this is an issue in the first place. A YAML implementation is perfectly able to load the generated YAML and reconstruct your string.
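As a quick check of that last point, a sketch assuming the block-scalar dump above was written to test.yaml; loading it back reconstructs the original string, newline included:
import yaml

with open('test.yaml') as f:
    loaded = yaml.safe_load(f)

# the block-scalar layout is purely presentational; the value round-trips intact
print(repr(loaded['test']['long_string']))
# 'this is a long string that does not succesfully split when it sees the character\n which is an issue'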

Python TypeError: expected a string or other character buffer object when importing text file

I am pretty new to Python. For this task, I am trying to import a text file, add <s> and </s> tags to it, and remove punctuation from the text. I tried this method: How to strip punctuation from a text file.
import string

def readFile():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w') as out_file:
        with open('moviereview.txt') as file:
            for line in file:
                line = ' '.join(line.split(' '))
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')
    return out_file
However, I get an error saying:
TypeError: expected a string or other character buffer object
My thought is that after I split and join the line, I get a list of strings, so I cannot use str.translate() to process it. But it seems like everyone else does the same thing and it works,
e.g. https://appliedmachinelearning.blog/2017/04/30/language-identification-from-texts-using-bi-gram-model-pythonnltk/ in the example code at line 13.
So I am really confused, can anyone help? Thanks!
On Python 2, only unicode types have a translate method that takes a dict. If you intend to work with arbitrary text, the simplest solution here is to just use the Python 3 version of open on Py2; it will seamlessly decode your inputs and produce unicode instead of str.
As of Python 2.6+, replacing the normal built-in open with the Python 3 version is simple. Just add:
from io import open
to the imports at the top of your file. You can also remove line = ' '.join(line.split(' ')); that's definitionally a no-op (it splits on single spaces to make a list, then rejoins on single spaces). You may also want to add:
from __future__ import unicode_literals
to the very top of your file (before all of your code); that will make all of your uses of plain quotes automatically unicode literals, not str literals (prefix actual binary data with b to make it a str literal on Py2, bytes literal on Py3).
The above solution is best if you can swing it, because it will make your code work correctly on both Python 2 and Python 3. If you can't do it for whatever reason, then you need to change your translate call to use the API Python 2's str.translate expects, which means removing the definition of translate_table entirely (it's not needed) and just doing:
line = line.translate(None, string.punctuation)
For Python 2's str.translate, the arguments are a one-to-one mapping table for all values from 0 to 255 inclusive as the first argument (None if no mapping needed), and the second argument is a string of characters to delete (which string.punctuation already provides).
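A hedged sketch of the question's function with those changes applied (Python 2; the utf-8 encoding of the input file is an assumption, adjust as needed):
from __future__ import unicode_literals
from io import open  # Python-3-style open: text files yield unicode on Python 2
import string

def readFile():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    with open('out_file.txt', 'w', encoding='utf-8') as out_file:
        with open('moviereview.txt', encoding='utf-8') as in_file:
            for line in in_file:
                # unicode.translate accepts the {codepoint: None} mapping
                line = line.translate(translate_table)
                out_file.write("<s>" + line.rstrip('\n') + "</s>" + '\n')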
Answering here because a comment doesn't let me format code properly:
import string

def r():
    translate_table = dict((ord(char), None) for char in string.punctuation)
    a = []
    with open('out.txt', 'w') as of:
        with open('test.txt', 'r') as f:
            for l in f:
                l = l.translate(translate_table)
                a.append(l)
                of.write(l)
    return a
This code runs fine for me with no errors. Can you try running that, and responding with a screenshot of the code you ran?

python 3.2.3 ignore escape sequence in string

I'm new to Python and I'm using one of the examples I found here to read lines from a file and print them. What I don't understand is why the interpreter ignores the \n escape sequence:
Text file:
Which of the following are components you might find inside a PC? (Select all correct answers.)
A. CPU
B. Motherboard
C. Keyboard
Answers: A, B, and E. \nCommon components inside a PC include \nthe CPU,motherboard, and \nRAM
Python code:
questions_fname = "Test.txt"

with open(questions_fname, 'r') as f:
    questions = [line.strip() for line in f]
    for line in questions:
        print(line)
    f.close()
The result I get is strings like:
Answers: A, B, and E. \nCommon components inside a PC include \nthe CPU,motherboard, and \nRAM
I was just looking for a simple way of formatting long lines to fit the screen.
You don't have "\n" in the string, you have "\\n" since you're reading it from a file. If you want to have "\n" then you need to decode the string. Note that 3.x doesn't have str.decode(), so you can't use that mechanism from 2.x.
3>> codecs.getdecoder('unicode-escape')('foo\\nbar')[0]
'foo\nbar'
Try the following code to get the wanted behaviour...
questions_fname = "Test.txt"

with open(questions_fname) as f:
    for line in f:
        line = line.rstrip().replace('\\n', '\n')
        print(line)
The .rstrip() removes the trailing whitespaces, including the binary form of \n. The .replace() causes explicit, user defined interpretation of your \n sequences in the file content -- captured as the printable character \ followed by the n.
When using the with construct, the f.close() is done automatically.
\ is an escape character only in Python source code, not in text files. When reading a text file, Python does not interpret backslashes, so the two characters \ and n come in as the literal text \n (written '\\n' as a Python literal), which is not a newline character.
Sorry - this isn't valid for Python 3.x (I was looking at the tags), but I'll leave here as reference - please see @Ignacio's answer: https://stackoverflow.com/a/14193673/1252759
If you've effectively got a raw string that contains the literal characters '\n', then you can re-interpret the string to make it an escape sequence again:
>>> a = r"Answers: A, B, and E. \nCommon components inside a PC include \nthe CPU,motherboard, and \nRAM"
>>> print a
Answers: A, B, and E. \nCommon components inside a PC include \nthe CPU,motherboard, and \nRAM
>>> print a.decode('string_escape')
Answers: A, B, and E.
Common components inside a PC include
the CPU,motherboard, and
RAM
You may also want to look at the textwrap module if you want to wrap lines to a certain width for certain displays...
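For completeness on Python 3 (the version the question is tagged with), a hedged sketch of the same reinterpretation without string_escape; note that every backslash escape in the file gets interpreted, not just \n:
questions_fname = "Test.txt"

with open(questions_fname) as f:
    for line in f:
        line = line.rstrip('\n')
        # round-trip through ASCII with backslashreplace, then decode the escapes;
        # this also preserves any non-ASCII characters in the line
        print(line.encode('ascii', 'backslashreplace').decode('unicode_escape'))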

Extract Text from a Binary File (using Python 2.7 on Windows 7)

I have a binary file of size about 5MB.. which has lots of interspersed text.. and control characters..
This is actually an equivalent of an outlook .pst file for SITATEX Application (from SITA).
The file contains all the TEXT MESSAGES sent and received to and from outside world...(but the text has to be extracted through the binary control characters).. all the text messages are clearly available... with line ending ^M characters... etc.
for example: assume ^# ^X are control characters... \xaa with HEX aa, etc. loads of them around my required text extraction.
^#^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X
^X^X^X
MVT^M
EA1123 TEXT TEXT TEXT^M
END^M
\xaa^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
^#^#^#^#^#^#^#^#^#^#^#TTBBTT^X^X^X^X^X^X^X^X^X
^X^X^X blah blah blah... of control characters.. and then the message comes..
MVT MESSAGE 2
ED1123
etc.
and so on.. for several messages.
Using Perl.. it is easy to do:
while (<>) {
    # use regular expression to split messages
    m/ /
}
How would one do this in python easily..
How to read the file? binary and text interspersed
Eliminate unnecessary control characters
parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
print out the required stuff
Loop through all the lines.. and more files.
In the text file sample... I am interested in seeing.. BLLBBCC... and MVT and EA1123 and so on.
Please assist... If it is going to be very difficult in python.. I will have to think through the logic in perl itself.. as it (perl) doesn't throw lots of errors at me at least for the looping part of binary and text stuff.. and the regex.
Thanks.
Update 02Jan after reading your answers/comments
After going through S.Lott's comments and others... This is where I am at.. and it is working 80% ok.
import fileinput
import sys
import re

strfile = r'C:\Users\' \
          r'\Learn\python\mvt\sitatex_test.msgs'

f = open(strfile, 'rb')
contents = f.read() # read whole file in contents

#extract the string between two \xaaU.. multiline pattern match
#with look ahead assertion
#and this is stored in a list with all msgs
msgs = re.findall(r'\xaaU.*?(?=\xaaU)', contents, re.I|re.DOTALL|re.M)

for msg in msgs:
    #loop through msgs.. to find the first msg then next and so on.
    print "## NEW MESSAGE STARTS HERE ##"
    #for each msg split the lines.. to read line by line
    # stored as list in msglines
    msglines = msg.splitlines()
    line = 0
#then process each msgline with a message
for msgline in msglines:
    line += 1
    #msgline = re.sub(r'[\x00]+', r' ', msgline)
    mystr = msgline
    print mystr
    textstrings = re.findall(r'[\x00\x20-\x7E]+', msgline)
So far so good.. still I am not completely done.. because I need to parse the text line by line and word by word.. to pick up (as an example) the origin address and headers, subject line, message body... by parsing the message through the control characters.
Now I am stuck with... how to print line by line with the control characters converted to \x00\x02.. etc (using the \xHH format).. but leave the normal readable text alone.
For example.. say I have this: assume ^# and ^X are some control characters
line1 = '^#UG^#^#^#^#^#^#^#^#^#^#BLLBBCC^X^X^X^X^X^X^X^X^X' (on the first line).
When I print the line as it is in IDLE.. print line1.. it prints only, say, the first 2 or 3 characters.. and ignores the rest because the control characters choke the output.
However, when I print with this: print re.findall(r'.*', line1)
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11', '']
It prints nicely with all the control characters converted to \xHH format.. and the ASCII text intact.. (just as I want it).. with one catch.. the list has two items.. with '' at the end.
What is the explanation for the empty string at the end?
How to avoid it... I just want the line converted nicely to a string (not a list). i.e. one line of binary/text to be converted to a string with \xHH codes.. leave the ASCII TEXT alone.
Is using re.findall(r'.*', line1) the only easy solution to do this conversion.. or is there any other straightforward method.. to convert a '\x00string' to \xHH and TEXT (where it is a printable character or whitespace).
Also.. any other useful comments to get the lines out nicely.
Thanks.
Update 2Jan2011 - Part 2
I have found out that re.findall(r'.+', line1) strips to
['\xaaUG\x02\x05\x00\x04\x00\x00\x00\x05\x00\x00\x00....
x00\x00\x00..BLLBBCC\x00\x00N\x00N\\x00
002 010 180000 DEC 11']
without the extra blank '' item in the list. This finding came after numerous trials and errors.
Still I will need assistance to eliminate the list altogether and return just a string,
like this:
'\xaaUG\x02\x05\x00\x04..BLLBBCC..002 010 180000 DEC 11'
Added Info on 05Jan:
#John Machin
1) \xaaU is the delimiter between messages.. In the example.. I may have just left it out of the samples. Please see below for one actual message that ends with \xaaU (but left out).
Following text is obtained from repr(msg between r'\xaaU.*?(?=\xaaU)')
I am trying to understand the binary format.. this is a typical message which is sent out
the first 'JJJOWXH' is the sender address.. anything that follows that has 7 alphanumeric is the receiver addresses.. Based on the sender address.. I can know whether this is a 'SND' or 'RCV'.. as the source is 'JJJOWXH'... This msg is a 'SND' as we are 'JJJOWXH'.
The message is addressed to: JJJKLXH.... JJJKRXH.... and so on.
As soon as all the.. \x00000000 finishes..
the sita header and subject starts
In this particular case... "\x00QN\x00HX\x00180001 \x00" this is the header.. and I am only interested all the stuff between \x00.
and the body comes next.. after the final \x00 or any other control character... In this case... it is:
COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE PROBLEM
once the readable text ends... the first control character that appears until the end \xaaU is to be ignored... In above cases.. "SPACE PROBLEM".. is the last one.. then control characters starts... so to be ignored... sometimes the control characters are not there till the next \xaaU.
This is one complete message.
"\xaaU\x1c\x04\x02\x00\x05\x06\x1f\x00\x19\x00\x00\x00\xc4\x9d\xedN\x1a\x00?\x02\x02\x00B\x02\x02\x00E\x02\x07\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00M\x02\xec\x00\xff\xff\x00\x00\x00\x00?\x02M\x02\xec\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00\xff\xff\x00\x00:\x03\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7f\x00JJJOWXH\x00\x05w\x01x\x01\x00\x01JJJKLXH\x00\x00\x7f\x01\x80\x01\x00\x01JJJKRXH\x00F\x87\x01\x88\x01\x00\x01JJJFFXH\x00\xff\x8f\x01\x90\x01\x00\x01JJJFCXH\x00\xff\x97\x01\x98\x01\x00\x01JJJFAXH\x00\x00\x9f\x01\xa0\x01\x00\x01JJJKPXH\x00\x00\xa7\x01\xa8\x01\x00\x01HAKUOHU\x00\x00\xaf\x01\xb0\x01\x00\x01BBBHRXH\x00\x00\xb7\x01\xb8\x01\x00\x01BBBFFHX\x00\x00\xbf\x01\xc0\x01\x00\x01BBBOMHX\x00\x00\xc7\x01\xc8\x01\x00\x01BBBFMXH\x00\x00\xcf\x01\xd0\x01\x00\x01JJJHBER\x00\x00\xd7\x01\xd8\x01\x00\x01BBBFRUO\x00\x00\xdf\x01\xe0\x01\x00\x01BBBKKHX\x00\x00\xe7\x01\xe8\x01\x00\x01JJJLOTG\x00\x01\xef\x01\xf0\x01\x00\x01JJJLCTG\x00\x00\xf7\x01\xf8\x01\x00\x01HDQOMTG\x005\xff\x01\x00\x02\x00\x01CHACSHX\x00K\x07\x02\x08\x02\x00\x01JJJKZXH\x00F\x0f\x02\x10\x02\x00\x01BBBOMUO\x00
\x17\x02\x18\x02\x00\x01BBBORXH\x00 \x1f\x02
\x02\x00\x01BBBOPXH\x00W'\x02(\x02\x00\x01CHACSHX\x00
/\x020\x02\x00\x01JJJDBXH\x0007\x028\x02\x00010000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00\x00000000\x00QN\x00HX\x00180001
\x00COR\r\nMVT \r\nHX9136/17.BLNZ.JJJ\r\nAD2309/2314 EA0128
BBB\r\nDLRA/CI/0032/0022\r\nSI EET 02:14 HRS\r\n RA / 0032 DUE TO
LATE ARVL ACFT\r\n CI / 0022 OFFLOAD OVERHANG PALLET DUE INADEQUATE
PACKING LEADING TO \r\n SPACE
PROBLEM\x00D-\xedN\x00\x04\x1a\x00t<\x93\x01x\x00M_\x00"
2) I am not using .+ anymore after the 'repr' is known.
3) each message is multiline.. and I need to preserve all the control characters to make some sense of this proprietary format.. that is why I needed repr to see it up close.
Hope this explains... This is just 1 message out of 1000s with in the file... and some are 'SND' and some are 'RCV'... and for 'RCV' there will not be '000000'.. and occasionally there are minor exceptions to the rule... but usually that is okay.
Any further suggestions anyone.. I am still working with the file.. to retrieve the text out intact... with sender and receiver addresses.
Thank you.
Python supports regexes too. I don't speak Perl, so I don't know exactly what your Perl code does, but this Python program might help you:
import re

with open('yourfile.pst') as f:
    contents = f.read()

textstrings = re.findall(r'[\x20-\x7E]+', contents)
That will get you a list of all strings of one or more ASCII printable characters in the file. That may not be exactly what you want, but you might be able to tweak it from there.
Note that if you're using Python 3, then you have to worry about the distinction between binary and textual data and it becomes a bit more complicated. I'm assuming you're in Python 2.
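If you do end up on Python 3, one hedged way to keep this byte-oriented regex approach is to decode the raw bytes as latin-1 so each byte maps to exactly one character (a sketch, not part of the original answer):
import re

with open('yourfile.pst', 'rb') as f:
    contents = f.read().decode('latin-1')  # 1 byte -> 1 character, never fails

# the same printable-ASCII pattern then works on the decoded str
textstrings = re.findall(r'[\x20-\x7E]+', contents)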
Q: How to read the file? binary and text interspersed
A: Don't bother, just read it as normal text and you'll be able to keep your binary/text dichotomy (otherwise you won't be able to regex it as easily)
fh = open('/path/to/my/file.ext', 'r')
fh.read()
Just in case you want to read binary later for some reason, you just add a b to the second input of the open:
fh = open('/path/to/my/file.ext', 'rb')
Q: Eliminate unnecessary control characters
A: Use the python re module. Your next question sorta ask how
Q: parse the messages in between two \xaa USEFUL TEXT INFORMATION \xaa (HEX 'aa')
A: re module has a findall function that works as you (mostly) expect.
import re
mytext = '\xaaUseful text that I want to keep\xaa^X^X^X\xaaOther text i like\xaa'
usefultext = re.findall('\xaa([a-zA-Z^!-~0-9 ]+)\xaa', mytext)
Q: print out the required stuff
A: There's a print function...
print usefultext
Q: Loop through all the lines.. and more files.
fh = open('/some/file.ext', 'r')
for lines in fh.readlines():
    # do stuff
I'll let you figure out the os module to figure out what files exist/how to iterate through them.
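For that last point, a small sketch using glob (the directory and the *.msgs pattern are placeholders, not taken from the question):
import glob

for path in glob.glob('/path/to/messages/*.msgs'):
    with open(path, 'rb') as fh:
        contents = fh.read()
    # run the findall / delimiter-splitting logic from above on contents
    print path, len(contents)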
You say:
Still I will need assistance to eliminate the list altogether but return just a string. like this
In other words, you have foo = [some_string] and you are doing print foo, which as a side effect does repr(some_string) but encloses it in square brackets, which you don't want. So just do print repr(foo[0]).
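For example (a quick sketch; the line1 value here is shortened, illustrative data rather than the actual file contents):
line1 = '\xaaUG\x02\x05\x00BLLBBCC\x00\x00 002 010 180000 DEC 11'
print repr(line1)  # one plain string: control bytes shown as \xHH, readable ASCII left alone
# compare: print re.findall(r'.+', line1) wraps the same thing in a one-element list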
There seem to be several things unexplained:
You say the useful text is bracketed by \xaaU, but in the sample file, instead of 2 occurrences of that delimiter there is only \xaa (missing the U) near the start, and nothing else.
You say
I have found out that re.findall(r'.+', line1) strips to ...
That in effect is stripping out \n (but not \r!!) -- I thought line breaks would be worth preserving when attempting to recover an email message.
>>> re.findall(r'.+', 'abc\r\ndef\r\n\r\n')
['abc\r', 'def\r', '\r']
What have you done with the \r characters? Have you tested a multi-line message? Have you tested a multi-message file?
One is left to guess who or what is intended to consume your output; you write
I need to parse the text line by line and word by word
but you seem overly concerned with printing the message "legibly" with e.g. \xab instead of gibberish.
It looks like the last 6 or so lines in your latest code (for msgline in msglines: etc etc) should be indented one level.
Is it possible to clarify all of the above?

python: escaping non-ascii characters in XML

I got my test XML file to print using the following source file, but it doesn't handle non-ASCII characters appropriately:
xmltest.py:
import xml.sax.xmlreader
import xml.sax.saxutils
def testJunk(file, e2content):
    attr0 = xml.sax.xmlreader.AttributesImpl({})
    x = xml.sax.saxutils.XMLGenerator(file)
    x.startDocument()
    x.startElement("document", attr0)
    x.startElement("element1", attr0)
    x.characters("bingo")
    x.endElement("element1")
    x.startElement("element2", attr0)
    x.characters(e2content)
    x.endElement("element2")
    x.endElement("document")
    x.endDocument()
If I do
>>> import xmltest
>>> xmltest.testJunk(open("test.xml","w"), "ascii 001: \001")
then I get an xml file with character code 001 in it. I can't figure out how to escape this character. Firefox tells me it's not well formed XML and complains about that character. How can I fix this?
clarification: I'm trying to log the output of a function I do not have control over, which outputs non-ASCII characters.
update: OK, so now I know characters outside one of the accepted ranges can't be encoded as numeric character references like &#x1;. (Or rather, they can be encoded, but that doesn't help any w/r/t XML not being well-formed.) But they can be escaped if I define a way of doing so.
(for future reference: W3C has a useful page outside the XML standard itself which says "Control codes should be replaced with appropriate markup" but doesn't really suggest any examples for doing so.)
If I wanted to escape characters outside the accepted range in the following way:
before escaping: (&#x0001; represents one character, not the literal 8-character string)
abcd&#x0001;efgh&#x0002;ijkl
after escaping:
abcd<u>0001</u>efgh<u>0002</u>ijkl
How could I do this in python?
def escapeXML(src):
    dest = ??????
    return dest
"\001" aka "\x01" is an ASCII control code. It is not however one of the permitted XML characters. The only ASCII control codes which qualify are "\t", "\n" and "\r".
Examples:
>>> import xml.etree.cElementTree as ET
# Raw newline works
>>> t = ET.fromstring("<e>\n</e>")
>>> t.text
'\n'
# Hex escaping of a newline works
>>> t = ET.fromstring("<e>
</e>")
>>> t.text
'\n'
# Hex escaping of "\x01" doesn't work; it's not a valid XML character
>>> t = ET.fromstring("<e></e>")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 106, in XML
cElementTree.ParseError: reference to invalid character number: line 1, column 3
If you want to include invalid XML characters somehow in an XML document, they must be hidden from the XML parser by an extra level of escaping. The mechanism needs to be documented, published, and understood by readers of your document.
For example, in Microsoft Excel 2007+ XLSX files, Unicode code points which are not valid XML characters are smuggled past the parser by representing them as _xhhhh_ where hhhh is the hex representation of the codepoint. In your example this would be the 7 bytes _x0001_. Note that it is necessary to escape any _ characters in the text that would otherwise be falsely interpreted as introducing an _xhhhh_ sequence.
This is ugly, painful, inefficient, etc. You may wish to consider other methods. Is use of XML necessary? Would a CSV file (shock, horror!) do a better job in your application?
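The _xhhhh_ idea above can be sketched in a few lines; this is just an illustration of the scheme as described (Python 2, BMP-only character class, function names are mine), not Microsoft's actual implementation:
import re

# characters that are not valid XML 1.0 characters (restricted to the BMP for brevity)
invalid_xml_char = re.compile(u'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
escape_seq = re.compile(u'_x[0-9A-Fa-f]{4}_')

def xlsx_style_escape(text):
    # protect any literal _xhhhh_ already present by escaping its leading underscore as _x005F_
    text = escape_seq.sub(lambda m: u'_x005F' + m.group(0), text)
    # then smuggle each invalid character past the XML parser as _xhhhh_
    return invalid_xml_char.sub(lambda m: u'_x%04X_' % ord(m.group(0)), text)

def xlsx_style_unescape(text):
    return escape_seq.sub(lambda m: unichr(int(m.group(0)[2:6], 16)), text)

print repr(xlsx_style_escape(u'ab\x01cd'))       # u'ab_x0001_cd'
print repr(xlsx_style_unescape(u'ab_x0001_cd'))  # u'ab\x01cd'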
Edit Some notes on the OP's encoding proposal:
A. Although \r is a valid XML 1.0 input character, it is subject to mandatory immediate transmogrification, so you should escape it as well.
B. This scheme assumes/hopes that the <u>hhhh</u> cannot be confused with any other markup.
C. I take back what I said above about the Microsoft escaping scheme. It is relatively beautiful, painfree, and efficient. To complete the picture of your scheme for your gentle readers, you should show the code that is required to unescape the nasty bits and glue the pieces back together. Bear in mind that the MS scheme requires somebody to write one escaping function and one unescaping function, whereas your scheme requires different treatment for each tool (SAX, DOM, ElementTree).
D. At the detailed level, the code is a little bit whiffy:
if (len(g1) > 0): should be if g1:
if (not foo == None): has a record THREE deviations from the commonly accepted idiom: (1) the parentheses (2) not x == y instead of x != y (3) != None instead of is not None
Don't use list (and names of other built-in objects) as a name for your own variable.
Edit 2 You want to split up a string using a regex. Why not use re.split?
splitInvalidXML2 = re.compile(
    ur'([^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD])'
    ).split

def submitCharacters2(x, string):
    badchar = True
    for fragment in splitInvalidXML2(string):
        badchar = not badchar
        if badchar:
            x.startElement("u", attr0)
            x.characters('%04X' % ord(fragment))
            x.endElement("u")
        elif fragment:
            x.characters(fragment)
This seems to work for me.
r = re.compile(ur'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF' \
               + ur'\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]')

def escapeInvalidXML(string):
    def replacer(m):
        return "<u>" + ('%04X' % ord(m.group(0))) + "</u>"
    return re.sub(r, replacer, string)
example:
>>> s='this is a \x01 test \x0B of something'
>>> escapeInvalidXML(s)
'this is a <u>0001</u> test <u>000B</u> of something'
>>> s2 = u'this is a \x01 test \x0B of \uFDD0'
>>> escapeInvalidXML(s2)
u'this is a <u>0001</u> test <u>000B</u> of <u>FDD0</u>'
Character ranges from http://www.w3.org/TR/2006/REC-xml-20060816/#charsets, and I haven't escaped everything, just the ones below \uFFFF.
Update: Oops, forgot to adapt to the startElement/characters methods of SAX, & deal properly with multiple lines:
import re
import xml.sax.xmlreader
import xml.sax.saxutils

r = re.compile(ur'(.*?)(?:([^\x09\x0A\x0D\x20-\x7E\x85\xA0-\xFF' \
               + ur'\u0100-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD])|([\n])|$)')
attr0 = xml.sax.xmlreader.AttributesImpl({})

def splitInvalidXML(string):
    list = []
    def replacer(m):
        g1 = m.group(1)
        if (len(g1) > 0):
            list.append(g1)
        g2 = m.group(2)
        if (not g2 == None):
            list.append(ord(g2))
        g3 = m.group(3)
        if (not g3 == None):
            list.append(g3)
        return ""
    re.sub(r, replacer, string)
    return list

def submitCharacters(x, string):
    for fragment in splitInvalidXML(string):
        if (isinstance(fragment, int)):
            x.startElement("u", attr0)
            x.characters('%04X' % fragment)
            x.endElement("u")
        else:
            x.characters(fragment)

def test1(fname):
    with open(fname, 'w') as f:
        x = xml.sax.saxutils.XMLGenerator(f)
        x.startDocument()
        x.startElement('document', attr0)
        submitCharacters(x, 'this is a \x01 test\nof the \x02\x0b xml system.')
        x.endElement('document')
        x.endDocument()

test1('test.xml')
This produces:
<?xml version="1.0" encoding="iso-8859-1"?>
<document>this is a <u>0001</u> test
of the <u>0002</u><u>000B</u> xml system.</document>
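To address point C above, a possible sketch of the matching unescape step for this <u>hhhh</u> scheme (my own illustration, assuming ElementTree and a single-element document such as the test.xml produced above):
import xml.etree.cElementTree as ET

def unescapeInvalidXML(elem):
    # rebuild the original text, replacing each <u>hhhh</u> child with its character
    parts = [elem.text or u'']
    for child in elem:
        if child.tag == 'u':
            parts.append(unichr(int(child.text, 16)))
        parts.append(child.tail or u'')
    return u''.join(parts)

doc = ET.parse('test.xml').getroot()
print repr(unescapeInvalidXML(doc))
# u'this is a \x01 test\nof the \x02\x0b xml system.'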
There's an open python bug for this https://bugs.python.org/issue5166 - not yet sure what the resolution will be/if it will be fixed as it's been open a little while now, but worth checking on that periodically in case we get a proper solution for handling invalid XML characters built into python itself.
