I have a UTF-16 CSV file which I have to read. The Python csv module does not seem to support UTF-16.
I am using Python 2.7.2. The CSV files I need to parse are huge, running to several GB of data.
Answers to John Machin's questions below:
print repr(open('test.csv', 'rb').read(100))
Output with test.csv containing just abc:
'\xff\xfea\x00b\x00c\x00'
I think the CSV file was created on a Windows machine in the USA. I am using Mac OS X Lion.
If I use the code provided by phihag with a test.csv containing one record, this is the sample test.csv content used, shown as the output of print repr(open('test.csv', 'rb').read(1000)):
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag
import codecs
import csv
with open('test.csv', 'rb') as f:
    sr = codecs.StreamRecoder(f, codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
                              codecs.getreader('utf-16'), codecs.getwriter('utf-16'))
    for row in csv.reader(sr):
        print row
Output of the above code
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
The expected output is:
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:
# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
    for line in csv.reader(csvf):
        print(line)  # do something with the line
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv

class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line, eol, rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)

    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv', 'rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')
    for row in csv.reader(sr):
        print(row)
open and codecs.open with the utf-16 codec expect the file to start with a BOM so they can determine the byte order. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
try:
    from io import BytesIO
except ImportError:  # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
    print(line)  # do something with the line
The Python 2.x csv module documentation example shows how to handle other encodings.
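For reference, that documentation example is essentially the following recipe, which round-trips the data through UTF-8 because the Python 2.x csv module doesn't directly support Unicode input:
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')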
I would strongly suggest that you recode your file(s) to UTF-8. Provided that (as is very likely) you don't have any characters outside the BMP, UTF-16 is then effectively a fixed-length encoding (two bytes per character), so you can read fixed-length blocks from your input file without worrying about straddling block boundaries.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))
Four possible ways of encoding u'abc':
\xfe\xff\x00a\x00b\x00c -> utf_16 (big-endian, with BOM)
\xff\xfea\x00b\x00c\x00 -> utf_16 (little-endian, with BOM)
\x00a\x00b\x00c -> utf_16_be (no BOM)
a\x00b\x00c\x00 -> utf_16_le (no BOM)
If you have any trouble with this step, edit your question to include the results of the above print repr().
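If you would rather automate the check, here is a rough sketch (guess_utf16_codec is my own helper name, and the no-BOM cases are only reliable when the first character is ASCII):
def guess_utf16_codec(path):
    # Inspect the first two bytes: look for a BOM, otherwise check
    # which byte of the first pair is the NUL padding byte.
    head = open(path, 'rb').read(2)
    if head == '\xff\xfe':
        return 'utf_16_le'   # little-endian BOM
    if head == '\xfe\xff':
        return 'utf_16_be'   # big-endian BOM
    if head[1:2] == '\x00':
        return 'utf_16_le'   # ASCII char followed by NUL
    if head[0:1] == '\x00':
        return 'utf_16_be'   # NUL followed by ASCII char
    return None              # probably not UTF-16 at all

print guess_utf16_codec('thefile.csv')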
Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:
import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf: break
    if first and enc == 'utf_16':
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means file doesn't start with a valid BOM
    first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
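You would run it from the command line something like this (assuming you save it as, say, recode16.py; the script name is arbitrary):
python recode16.py thefile.csv thefile_utf8.csv utf_16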
Other matters:
You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on specific platforms such as OpenVMS. When they turn up in documents, Web pages, e-mail messages, etc., which are ostensibly in an ISO-8859-n encoding, their code positions generally refer instead to the characters at that position in a proprietary, system-specific encoding such as Windows-1252 or the Apple Macintosh ("MacRoman") character set that use the codes provided for representation of the C1 set with a single 8-bit byte to instead provide additional graphic characters
This code:
s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to: U+%04X %s" % (ord(c2), name(c2, "<no name>"))
s3 = u''.join(
    c.encode('latin1').decode('cp1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
    )
print 's3 repr:', repr(s3)
print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to: U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to: U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96:
SPA i.e. Start of Protected Area (used by block-oriented terminals),
or
EN DASH?
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
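In the meantime, if the cp1252 interpretation is confirmed, the same fix-up could be folded into the Step 2 recode script; a rough sketch (fix_c1 is my own helper name, and note that cp1252 leaves a few positions such as 0x81 and 0x8D undefined, which would raise an error):
def fix_c1(u):
    # Reinterpret C1 controls (U+0080..U+009F) as cp1252 graphic characters.
    # Positions undefined in cp1252 will raise UnicodeDecodeError.
    return u''.join(
        c.encode('latin1').decode('cp1252') if u'\x80' <= c < u'\xA0' else c
        for c in u)

# in the Step 2 script, replace the write line with:
# fo.write(fix_c1(buf.decode(enc)).encode('utf8'))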
Just open your file with codecs.open, as in:
import codecs, csv
stream = codecs.open(<yourfile.csv>, encoding="utf-16")
reader = csv.reader(stream)
And work with unicode strings throughout your program, as you should do anyway if you are processing text.
Related
Is there a simple way, in Python, to read a file's hexadecimal data into a list, say hex?
So hex would be this:
hex = ['AA','CD','FF','0F']
I don't want to have to read into a string and then split; that is memory-intensive for large files.
s = "Hello"
hex_list = ["{:02x}".format(ord(c)) for c in s]
Output
['48', '65', '6c', '6c', '6f']
Just change s to open(filename).read() and you should be good.
with open('/path/to/some/file', 'r') as fp:
    hex_list = ["{:02x}".format(ord(c)) for c in fp.read()]
Or, if you do not want to keep the whole list in memory at once for large files, use a generator expression:
hex_list = ("{:02x}".format(ord(c)) for c in fp.read())
and to get the values, keep calling next(hex_list); to get all the remaining values from the generator, call list(hex_list).
Using Python 3, let's assume the input file contains the sample bytes you show. For example, we can create it like this
>>> inp = bytes((170,12*16+13,255,15)) # i.e. b'\xaa\xcd\xff\x0f'
>>> with open(filename,'wb') as f:
...     f.write(inp)
Now, given we want the hex representation of each byte in the input file, it would be nice to open the file in binary mode, without trying to interpret its contents as characters/strings (or we might trip on the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte)
>>> with open(filename,'rb') as f:
...     buff = f.read()  # it reads the whole file into memory
...
>>> buff
b'\xaa\xcd\xff\x0f'
>>> out_hex = ['{:02X}'.format(b) for b in buff]
>>> out_hex
['AA', 'CD', 'FF', '0F']
If the file is large, we might want to read it one byte at a time or in chunks. For that purpose, I recommend reading this Q&A.
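For example, a chunked reader might look roughly like this (iter_hex and the chunk size are my own choices, not from the answer above):
def iter_hex(path, chunk_size=4096):
    # Yield two-character hex strings one byte at a time, without loading
    # the whole file into memory (in Python 3, iterating bytes gives ints).
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for b in chunk:
                yield '{:02X}'.format(b)
list(iter_hex(filename)) gives the same output as above, but you can also consume it lazily.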
Be aware that for viewing hexadecimal dumps of files, there are utilities available on most operating systems. If all you want to do is hex dump the file, consider one of these programs:
od (octal dump, which has a -x or -t x option)
hexdump
the xd utility, available under Windows
Online hex dump tools, such as this one.
I've read a few postings regarding Python writing to text files but I could not find a solution to my problem. Here it is in a nutshell.
The requirement: to write values delimited by thorn characters (u00FE; surrounding the text values) and the pilcrow character (u00B6; after each column) to a UTF-16LE text file with a BOM (FF FE).
The issue: The written-to text file has whitespace between each column that I did not script for. Also, it's not showing up right in UltraEdit. Only the first value ("mom") shows. I welcome any insight or advice.
The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):
import os
import codecs
import shutil
import sys

first = u''
textdel = u'\u00FE'.encode('utf_16_le')  # thorn
fielddel = u'\u00B6'.encode('utf_16_le')  # pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1)  # pretend this is from the metadata profile
f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
    mytext2 = u''
    i = 0
    i = i + 1
    mytext2 = mytext2 + item + textdel
    if i < (num - 1):
        mytext2 = mytext2 + fielddel
    f.write(mytext2 + u'\n')
f.close()
You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:
textdel = u'\u00FE'
len(textdel) # -> 1
type(textdel) # -> unicode
len(textdel.encode('utf-16-le')) # -> 2
type(textdel.encode('utf-16-le')) # -> str
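Putting that together, a minimal sketch of a corrected script might look like this (this is my reading of the thorn/pilcrow layout, so adjust it to your actual requirement):
import codecs

textdel = u'\u00FE'   # thorn, left as a unicode character
fielddel = u'\u00B6'  # pilcrow, left as a unicode character
list1 = ['mom', 'dad', 'son']

f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')    # BOM; the utf_16_le writer serializes it as FF FE
for item in list1:
    # thorn surrounds each value, pilcrow after each column
    f.write(textdel + item + textdel + fielddel)
f.write(u'\n')
f.close()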
I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hankaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison never returns true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
def main():
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print("Please specify a file to check.")
        return
    try:
        f = open(filename, 'r')
    except IOError as e:
        print("Sorry! The file doesn't exist.")
        return
    filecontent = f.read()
    f.close()
    #y = zk[29]
    #print y.decode('utf-8')
    for f in filecontent:
        for z in zk:
            if f == z.decode('utf-8'):
                print f
                print filename

if __name__ == "__main__":
    main()
Am I missing a step?
Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.
Make sure that the zk and hk lists contain Unicode strings. Either use Unicode literals, e.g. u'a', or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord, zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        print line.translate(translation_table),
You need to convert everything to the same form, and that form is Unicode strings. A Unicode string has no encoding in the .encode()/.decode() sense; a non-Unicode string is really a stream of bytes that expresses the value in some encoding. To convert bytes to Unicode, you .decode(); to store a Unicode string as a sequence of bytes, you .encode() the abstraction into concrete bytes.
So, to load Unicode strings from a UTF-8 encoded file, you either read it as old-style strings (non-Unicode, sequences of bytes) and then .decode('utf-8'), or you use codecs.open(..., encoding='utf-8') and get Unicode strings automatically.
The form # coding=utf-8 is not the usual one, but it is OK... provided the editor (the tool you use to write the source) treats the file the same way. Then the old-style string literals are displayed correctly by the editor, and they should be .decode('utf-8')d to get Unicode. Old-style strings containing only ASCII characters in the same source can also be converted to Unicode using .decode('utf-8').
To summarize: you decode from bytes to Unicode, and you encode Unicode strings into sequences of bytes. From the question, it seems you are doing the opposite.
The following is completely wrong:
for f in filecontent:
    for z in zk:
        if f == z.decode('utf-8'):
            print f
because filecontent is the result of f.read(), which makes it a sequence of bytes, so the f in the loop is a single byte. z.decode('utf-8') returns a single Unicode character, so the comparison can never be true. (By the way, f is a rather misleading name for a byte value.)
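A minimal corrected sketch of that loop, decoding once up front and comparing Unicode against Unicode (assuming zk holds UTF-8 byte strings as in the question):
ufilecontent = filecontent.decode('utf-8')   # bytes -> unicode, once
uzk = [z.decode('utf-8') for z in zk]        # decode the lookup table once too
for ch in ufilecontent:                      # ch is one unicode character
    if ch in uzk:
        print ch.encode('utf-8')             # re-encode only for printing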
Basically I have been having real fun with this today. I have this data file called test.csv which is encoded as UTF-8:
"Nguyễn", 0.500
"Trần", 0.250
"Lê", 0.250
Now I am attempting to read it with this code and it displays all funny like this: Trần
Now I have gone through all the Python docs for 2.6, which is the version I use, and I can't get the wrapper to work, along with all the ideas on the internet, which I am assuming are all very correct, just not being applied properly by yours truly. On the plus side, I have learnt that not all fonts will display those characters correctly anyway, something I hadn't even thought of previously, and I have learned a lot about Unicode etc., so it certainly was not wasted time.
If anyone could point out where I went wrong I would be most grateful.
Here is the code, updated as per the request below, which returns this error:
Traceback (most recent call last):
  File "surname_generator.py", line 39, in <module>
    probfamilynames = [(familyname,float(prob)) for familyname,prob in unicode_csv_reader(open(familynamelist))]
  File "surname_generator.py", line 27, in unicode_csv_reader
    for row in csv_reader:
  File "surname_generator.py", line 33, in utf_8_encoder
    yield line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
from random import random
import csv

class ChooseFamilyName(object):
    def __init__(self, probs):
        self._total_prob = 0.
        self._familyname_levels = []
        for familyname, prob in probs:
            self._total_prob += prob
            self._familyname_levels.append((self._total_prob, familyname))
        return

    def pickfamilyname(self):
        pickfamilyname = self._total_prob * random()
        for level, familyname in self._familyname_levels:
            if level >= pickfamilyname:
                return familyname
        print "pickfamilyname error"
        return

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

familynamelist = 'familyname_vietnam.csv'
a = 0
while a < 10:
    a = a + 1
    probfamilynames = [(familyname,float(prob)) for familyname,prob in unicode_csv_reader(open(familynamelist))]
    familynamepicker = ChooseFamilyName(probfamilynames)
    print(familynamepicker.pickfamilyname())
unicode_csv_reader(open(familynamelist)) is trying to pass non-unicode data (byte strings with utf-8 encoding) to a function you wrote expecting unicode data. You could solve the problem with codecs.open (from the standard library module codecs), but that's too roundabout: the codecs would be doing utf8->unicode for you, then your code would be doing unicode->utf8, so what's the point?
Instead, define a function more like this one...:
def encoded_csv_reader_to_unicode(encoded_csv_data,
                                  coding='utf-8',
                                  dialect=csv.excel,
                                  **kwargs):
    csv_reader = csv.reader(encoded_csv_data,
                            dialect=dialect,
                            **kwargs)
    for row in csv_reader:
        yield [unicode(cell, coding) for cell in row]
and use encoded_csv_reader_to_unicode(open(familynamelist)).
Your current problem is that you have been given a bum steer with the csv_unicode_reader thingy. As the name suggests, and as the documentation states explicitly:
"""(unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings). """
You don't have unicode strings, you have str strings encoded in UTF-8.
Suggestion: blow away the csv_unicode_reader stuff. Get each row plainly and simply as though it was encoded in ascii. Then convert each row to unicode:
unicode_row = [field.decode('utf8') for field in str_row]
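Applied to your file, that might look roughly like this (a sketch, keeping plain csv.reader on the byte strings; the 0xef in your traceback suggests the file may start with a UTF-8 BOM, which you would also want to strip from the first field):
import csv

probfamilynames = []
with open(familynamelist, 'rb') as binf:
    for str_row in csv.reader(binf):
        # decode each cell from UTF-8 bytes to unicode
        familyname, prob = [field.decode('utf8') for field in str_row]
        probfamilynames.append((familyname, float(prob)))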
Getting back to your original problem:
(1) To get help with fonts etc, you need to say what platform you are running on and what software you are using to display the unicode strings.
(2) If you want platform-independent ways of inspecting your data, look at the repr() built-in function, and the name function in the unicodedata module.
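For instance, a small sketch of inspecting a decoded string that way:
from unicodedata import name

u = u'Nguy\u1ec5n'
print repr(u)                 # platform-independent view of the code points
for c in u:
    if ord(c) > 127:
        print "U+%04X %s" % (ord(c), name(c, "<no name>"))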
There's the unicode_csv_reader demo in the python docs:
http://docs.python.org/library/csv.html