I'm trying to write an xml file with utf-8 encoded data using ElementTree like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import codecs
testtag = ET.Element('unicodetag')
testtag.text = u'Töreboda'  # The o is really ö (o with two dots over). No idea why SO doesn't display this
expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
expfile.close()
This blows up with the error
Traceback (most recent call last):
File "unicodetest.py", line 10, in <module>
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
write(_escape_cdata(text, encoding))
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Using the "us-ascii" encoding instead works fine, but don't preserve the unicode characters in the data. What is happening?
codecs.open expects Unicode strings to be written to the file object and it will handle encoding to UTF-8. ElementTree's write encodes the Unicode strings to UTF-8 byte strings before sending them to the file object. Since the file object wants Unicode strings, it is coercing the byte string back to Unicode using the default ascii codec and causing the UnicodeDecodeError.
Just do this:
#expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write('testunicode.xml',encoding="UTF-8",xml_declaration=True)
#expfile.close()
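Alternatively, if you do want to manage the file object yourself, a minimal sketch that also works is to open the file in binary mode, since ElementTree writes already-encoded bytes:
with open('testunicode.xml', 'wb') as expfile:
    # ElementTree encodes to UTF-8 itself, so the file must accept bytes
    ET.ElementTree(testtag).write(expfile, encoding="UTF-8", xml_declaration=True)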
Related
I am trying to read from the rockyou wordlist and write all words that are >= 8 characters to a new file.
Here is the code -
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file=out_file, end='')
    print("done")

if __name__ == '__main__':
    main()
Some words are not utf-8.
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 6, in main
print(line, file = out_file, end = '')
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>
Update
def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
    print("done")

if __name__ == '__main__':
    main()
Traceback (most recent call last):
File "wpa_rock.py", line 10, in <module>
main()
File "wpa_rock.py", line 3, in main
for line in in_file:
File "C:\Python\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
Your UnicodeEncodeError: 'charmap' error occurs while writing to out_file (in the print() call).
By default, open() uses locale.getpreferredencoding(), which on Windows is the ANSI code page (such as cp1252). cp1252 is a one-byte encoding that can represent at most 256 different characters, but there are over a million Unicode code points (the largest is 0x10FFFF = 1114111), so it can't represent them all; '\u0e45' in particular has no cp1252 mapping.
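You can reproduce the failure in isolation:
>>> '\u0e45'.encode('cp1252')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>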
Pass an encoding that can represent all the desired data, e.g. encoding='utf-8' (as @robyschek suggested). If your code reads UTF-8 data without any errors, it should be able to write that data as UTF-8 too.
Your UnicodeDecodeError: 'utf-8' error occurs while reading in_file (in the for line in in_file loop). Not all byte sequences are valid UTF-8, e.g. os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
If you expect the file to be encoded as UTF-8, you could pass the errors="ignore" parameter to open() to skip occasional invalid byte sequences, or use another error handler depending on your application.
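For example, a sketch of the loop above with undecodable bytes silently dropped (errors="ignore" loses data, so only use it if that is acceptable):
with open("rockyou.txt", encoding="utf8", errors="ignore") as in_file, \
        open("rockout.txt", 'w', encoding="utf8") as out_file:
    for line in in_file:
        if len(line.rstrip()) >= 8:  # keep only words of 8+ characters
            out_file.write(line)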
If the actual character encoding used in the file is different, then you should pass that encoding instead. Bytes by themselves do not have any encoding; that metadata has to come from another source (though some encodings are more likely than others: chardet can guess). E.g., if the file content is an HTTP body, see A good way to get the charset/encoding of an HTTP response in Python.
Sometimes broken software generates mostly-UTF-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some such special cases. You could also try the ftfy utility/library and see if it helps in your case, e.g. ftfy may fix some kinds of mangled UTF-8.
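A quick sketch of the ftfy route (assuming ftfy is installed via pip install ftfy; the string 'TÃ¶reboda' below is a made-up example of mojibake, not a word from the list):
import ftfy

print(ftfy.fix_text('TÃ¶reboda'))  # UTF-8 bytes mis-decoded as Latin-1 -> 'Töreboda'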
Hey, I was having a similar issue. In the case of the rockyou.txt wordlist, I tried a number of encodings that Python has to offer and found that encoding='koi8_u' worked to read the file.
I have a huge CSV file in UTF-8 encoding, but some columns use an encoding that differs from the main file encoding. It looks like:
input.txt in UTF-8 encoding:
a,b,c
d,"e?",f
g,h,"kü"
the same input.txt decoded as win-1252:
a,b,c
d,"eü",f
g,h,"kü
Code:
import csv

file = open("input.txt", encoding="...")
c = csv.reader(file, delimiter=';', quotechar='"')
for itm in c:
    print(itm)
The standard Python 3 csv reader raises an encoding error on such lines. I cannot simply skip those lines, because I need the "someOther" column, which is always correctly encoded.
Is it possible, using the standard csv reader, to somehow read the CSV data in a "bytes mode" and then convert each field to a normal Python Unicode string, or should I implement my own csv reader?
Traceback:
Traceback (most recent call last):
File "C:\Development\t.py", line 7, in <module>
for itm in c:
File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 11: invalid start byte
How sure are you that your file is UTF-8 encoded?
For the small sample that you've posted, UTF-8 decoding fails on the ü, which is "LATIN SMALL LETTER U WITH DIAERESIS". When encoded as ISO-8859-1, ü is '\xfc'. Two other possibilities are that the CSV file is UTF-16 encoded (UTF-16 little-endian is common on Windows), or even Windows-1252.
If your CSV file is encoded in one of the ISO-8859-X family of encodings, note that any of ISO 8859-1/3/4/9/10/14/15/16 encodes ü as 0xfc.
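You can check this at the REPL:
>>> b'\xfc'.decode('iso-8859-1')
'ü'
>>> b'\xfc'.decode('windows-1252')
'ü'
>>> b'\xfc'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte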
To solve, use the correct encoding and open the file like this:
file = open("input.txt", encoding="iso-8859-1")
or, for Windows 1252:
file = open("input.txt", encoding="windows-1252")
or, for UTF-16:
file = open("input.txt", encoding="utf-16") # or utf-16-le or utf-16-be as required
I'm trying to use the following code (within web2py) to read a CSV file and convert it into a JSON object:
import csv
import json
from StringIO import StringIO  # Python 2; needed for StringIO below

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()
csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])
This produces the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Apparently, there is a problem handling the spaces in the .csv file (0xa0 is a non-breaking space in Windows-1252). The problem appears to stem from the json.dumps() line. The traceback from that point on:
Traceback (most recent call last):
File ".../web2py/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File ".../controllers/default.py", line 2345, in <module>
File ".../web2py/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File ".../web2py/gluon/tools.py", line 3021, in f
return action(*a, **b)
File ".../controllers/default.py", line 697, in generate_vis
request.vars.json = json.dumps(list(csv_reader))
File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.
The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. JSON, however, is Unicode character-based, so there is an implicit conversion when you try to write the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8; it was probably Windows code page 1252 (Western European, like ISO-8859-1 only not quite).
A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but you probably don't really want to rely on json guessing a particular encoding.
Best would be to explicitly decode your strings at the point of reading them from CSV; then JSON will be able to cope with them. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:
class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding

    def next(self):
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }
csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
"it's not known in advance what sort of encoding will come up"
Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to mandate a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
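If you do want a best-effort default rather than failing outright, the chardet library (pip install chardet; a heuristic guess, not a guarantee, and not part of the code above) can propose an encoding to pass to the reader:
import chardet

guess = chardet.detect(file_contents)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.84, ...}
encoding = guess['encoding'] or 'windows-1252'  # detection can return None
csv_reader = UnicodeDictReader(StringIO(file_contents), encoding)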
Try decoding each field before dumping; each row from the DictReader is a dict of byte strings, so replace your final line with something like
json = json.dumps([{k.decode('utf-8'): v.decode('utf-8') for k, v in x.items()} for x in csv_reader])
Running unidecode over the file contents seems to do the trick:
from isounidecode import unidecode
...
file_contents = unidecode(file_stream.read())
...
Thanks, everyone!
I am using hfcca to calculate cyclomatic complexity for C++ code. hfcca is a simple Python script (https://code.google.com/p/headerfile-free-cyclomatic-complexity-analyzer/). When I try to run the script to generate the output in the form of an XML file, I get the following errors:
Traceback (most recent call last):
"./hfcca.py", line 802, in <module>
main(sys.argv[1:])
File "./hfcca.py", line 798, in main
print(xml_output([f for f in r], options))
File "./hfcca.py", line 798, in <listcomp>
print(xml_output([f for f in r], options))
File "/x/home06/smanchukonda/PREFIX/lib/python3.3/multiprocessing/pool.py", line 652, in next
raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte
Please help me with this.
It looks like the file has characters encoded as Latin-1 that aren't valid UTF-8. The file utility can be useful for figuring out what encoding a file should be treated as, e.g.:
monk@monk-VirtualBox:~$ file foo.txt
foo.txt: UTF-8 Unicode text
Here's what that byte means in Latin-1:
>>> b'\xe2'.decode('latin1')
'â'
Probably the easiest fix is to convert the files to UTF-8.
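A minimal sketch of that conversion (assuming the input really is Latin-1):
# Decode as Latin-1 and re-encode as UTF-8
with open('foo.txt', 'rb') as src:
    data = src.read().decode('latin1')
with open('foo.txt', 'wb') as dst:
    dst.write(data.encode('utf-8'))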
I also had the same problem rendering Markup("""yyyyyy"""), but I solved it using an online tool that removed the 'bad' characters: https://pteo.paranoiaworks.mobi/diacriticsremover/
It is a nice tool and works even offline.
I am having problems with DictWriter and non-ASCII characters. A short version of my problem:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
import csv
f = codecs.open("test.csv", 'w', 'utf-8')
writer = csv.DictWriter(f, ['field1'], delimiter='\t')
writer.writerow({'field1':u'å'.encode('utf-8')})
f.close()
Gives this Traceback:
Traceback (most recent call last):
File "test.py", line 10, in <module>writer.writerow({'field1':u'å'.encode('utf-8')})
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/csv.py", line 124, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/codecs.py", line 638, in write
return self.writer.write(data)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/codecs.py", line 303, in write data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I am a bit lost, as DictWriter ought to be able to work with UTF-8 from what I have read in the documentation.
The object you obtain with codecs.open wants a unicode string in its write method -- that's the whole point. csv.DictWriter of course is calling that method with a utf8-encoded byte string instead, whence the exception.
Change f's creation to f = open("test.csv", 'wb') (taking codecs out of the picture) and things should work just fine.
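Put together, a minimal version of the fixed script (Python 2: the csv module wants a byte-oriented file, and you keep encoding the values yourself):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv

f = open("test.csv", 'wb')  # binary mode; csv writes the UTF-8 bytes as-is
writer = csv.DictWriter(f, ['field1'], delimiter='\t')
writer.writerow({'field1': u'å'.encode('utf-8')})
f.close()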