Writing to UTF-16-LE text file with BOM - python

I've read a few postings regarding Python writing to text files but I could not find a solution to my problem. Here it is in a nutshell.
The requirement: to write values delimited by thorn characters (u00FE; and surronding the text values) and the pilcrow character (u00B6; after each column) to a UTF-16LE text file with BOM (FF FE).
The issue: The written-to text file has whitespace between each column that I did not script for. Also, it's not showing up right in UltraEdit. Only the first value ("mom") shows. I welcome any insight or advice.
The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):
import os
import codecs
import shutil
import sys
import codecs
first = u''
textdel = u'\u00FE'.encode('utf_16_le') #thorn
fielddel = u'\u00B6'.encode('utf_16_le') #pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1) #pretend this is from the metadata profile
f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
mytext2 = u''
i = 0
i = i + 1
mytext2 = mytext2 + item + textdel
if i < (num - 1):
mytext2 = mytext2 + fielddel
f.write(mytext2 + u'\n')
f.close()

You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:
textdel = u'\u00FE'
len(textdel) # -> 1
type(textdel) # -> unicode
len(textdel.encode('utf-16-le')) # -> 2
type(textdel.encode('utf-16-le')) # -> str

Related

Python: CSV delimiter failing randomly

I have created a script which a number of random passwords are generated (see below)
import string
import secrets
import datetime
now = datetime.datetime.now()
T = now.strftime('%Y_%m_d')
entities = ['AA','BB','CC','DD','EE','FF','GG','HH']
masterpass = ('MasterPass' + '_' + T + '.csv')
f= open(masterpass,"w+")
def random_secure_string(stringLength):
secureStrMain = ''.join((secrets.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits + ('!'+'?'+'"'+'('+')'+'$'+'%'+'#'+'#'+'/'+':'+';'+'['+']'+'#')) for i in range(stringLength)))
return secureStrMain
def random_secure_string_lower(stringLength):
secureStrLower = ''.join((secrets.choice(string.ascii_lowercase)) for i in range(stringLength))
return secureStrLower
def random_secure_string_upper(stringLength):
secureStrUpper = ''.join((secrets.choice(string.ascii_uppercase)) for i in range(stringLength))
return secureStrUpper
def random_secure_string_digit(stringLength):
secureStrDigit = ''.join((secrets.choice(string.digits)) for i in range(stringLength))
return secureStrDigit
def random_secure_string_char(stringLength):
secureStrChar = ''.join((secrets.choice('!'+'?'+'"'+'('+')'+'$'+'%'+'#'+'#'+'/'+':'+';'+'['+']'+'#')) for i in range(stringLength))
return secureStrChar
for x in entities:
f.write(x + ',' + random_secure_string(6) + random_secure_string_lower(1) + random_secure_string_upper(1) + random_secure_string_digit(1) + random_secure_string_char(1) + ',' + T + "\n")
f.close()
I use pandas to get the code to import a list, so normally it is for 200-250 entities, not just the 8 in the example.
The issue comes every so often where it looks like the comma delimiter fails to be read (see row 6 of attached photo)
In all the cases I have had of this (multiple run throughs), it looks like the 10th character is a comma, the 4 before (characters 6-9) are as stated in the script, but then instead of generating 6 initial characters (from random_secure_string(6)), it is generating 5. Could this be causing the issue? If so, how do I fix this?
Thank you in advance
Wild guess, because the content of the csv file as text is required to make sure.
A csv is a Comma Separated Values text file. That means that it is a plain text files where fields are delimited with a separator, normally the comma (,). In order to allow text fields to contain commas or even new lines, they can be enclosed in quotes (normally ") or special characters can be escaped, normally with \.
That means that if a line contains abcdefg\,2020_05 the comma will not be interpreted as a separator.
How to fix:
CSV is a simple format, but with many corner cases. The rule is avoid to read or write it by hand. Just use the standard library csv module here:
...
import csv
...
with open(masterpass,"w+", newline='') as f:
wr = csv.writer(f)
for x in entities:
wr.writerow([x, random_secure_string(6) + random_secure_string_lower(1) + random_secure_string_upper(1) + random_secure_string_digit(1) + random_secure_string_char(1), T])
The writer will take care for special characters and ensure that appropriate encoding or escaping will be used

Determine ROT encoding

I want to determine which type of ROT encoding is used and based off that, do the correct decode.
Also, I have found the following code which will indeed decode rot13 "sbbone" to "foobart" correctly:
import codecs
codecs.decode('sbbone', 'rot_13')
The thing is I'd like to run this python file against an existing file which has rot13 encoding. (for example rot13.py encoded.txt).
Thank you!
To answer the second part of your first question, decode something in ROT-x, you can use the following code:
def encode(s, ROT_number=13):
"""Encodes a string (s) using ROT (ROT_number) encoding."""
ROT_number %= 26 # To avoid IndexErrors
alpha = "abcdefghijklmnopqrstuvwxyz" * 2
alpha += alpha.upper()
def get_i():
for i in range(26):
yield i # indexes of the lowercase letters
for i in range(53, 78):
yield i # indexes of the uppercase letters
ROT = {alpha[i]: alpha[i + ROT_number] for i in get_i()}
return "".join(ROT.get(i, i) for i in s)
def decode(s, ROT_number=13):
"""Decodes a string (s) using ROT (ROT_number) encoding."""
return encrypt(s, abs(ROT_number % 26 - 26))
To answer the first part of your first question, find the rot encoding of an arbitrarily encoded string, you probably want to brute-force. Uses all rot-encodings, and check which one makes the most sense. A quick(-ish) way to do this is to get a space-delimited (e.g. cat\ndog\nmouse\nsheep\nsay\nsaid\nquick\n... where \n is a newline) file containing most common words in the English language, and then check which encoding has the most words in it.
with open("words.txt") as f:
words = frozenset(f.read().lower().split("\n"))
# frozenset for speed
def get_most_likely_encoding(s, delimiter=" "):
alpha = "abcdefghijklmnopqrstuvwxyz" + delimiter
for punctuation in "\n\t,:; .()":
s.replace(punctuation, delimiter)
s = "".join(c for c in s if c.lower() in alpha)
word_count = [sum(w.lower() in words for w in encode(
s, enc).split(delimiter)) for enc in range(26)]
return word_count.index(max(word_count))
A file on Unix machines that you could use is /usr/dict/words, which can also be found here
Well, you can read the file line by line and decode it.
The output should go to an output file:
import codecs
import sys
def main(filename):
output_file = open('output_file.txt', 'w')
with open(filename) as f:
for line in f:
output_file.write(codecs.decode(line, 'rot_13'))
output_file.close()
if __name__ == "__main__":
_filename = sys.argv[1]
main(_filename)

Exporting to CSV Format In UTF-8 Format [duplicate]

I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.
I am using python 2.7.2. CSV files I need to parse are huge size running into several GBs of data.
Answers for John Machin questions below
print repr(open('test.csv', 'rb').read(100))
Output with test.csv having just abc as content
'\xff\xfea\x00b\x00c\x00'
I think csv file got created on windows machine in USA. I am using Mac OSX Lion.
If I use code provided by phihag and test.csv containing one record.
sample test.csv content used. Below is print repr(open('test.csv', 'rb').read(1000)) output
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag
import codecs
import csv
with open('test.csv','rb') as f:
sr = codecs.StreamRecoder(f,codecs.getencoder('utf-8'),codecs.getdecoder('utf-8'),codecs.getreader('utf-16'),codecs.getwriter('utf-16'))
for row in csv.reader(sr):
print row
Output of the above code
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
expected output is
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:
# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
for line in csv.reader(csvf):
print(line) # do something with the line
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv
class Recoder(object):
def __init__(self, stream, decoder, encoder, eol='\r\n'):
self._stream = stream
self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
self._buf = ''
self._eol = eol
self._reachedEof = False
def read(self, size=None):
r = self._stream.read(size)
raw = self._decoder.decode(r, size is None)
return self._encoder.encode(raw)
def __iter__(self):
return self
def __next__(self):
if self._reachedEof:
raise StopIteration()
while True:
line,eol,rest = self._buf.partition(self._eol)
if eol == self._eol:
self._buf = rest
return self._encoder.encode(line + eol)
raw = self._stream.read(1024)
if raw == '':
self._decoder.decode(b'', True)
self._reachedEof = True
return self._encoder.encode(self._buf)
self._buf += self._decoder.decode(raw)
next = __next__
def close(self):
return self._stream.close()
with open('test.csv','rb') as f:
sr = Recoder(f, 'utf-16', 'utf-8')
for row in csv.reader(sr):
print (row)
open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
try:
from io import BytesIO
except ImportError: # Python < 2.6
from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
print(line) # do something with the line
The Python 2.x csv module documentation example shows how to handle other encodings.
I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is a fixed-length encoding to read fixed-length blocks from your input file without worrying about straddling block boundaries.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))
Four possible ways of encoding u'abc'
\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le
If you have any trouble with this step, edit your question to include the results of the above print repr()
Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:
import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
buf = fi.read(BUFSIZ)
if not buf: break
if first and enc == 'utf_16':
bom = buf[:2]
buf = buf[2:]
enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
# KeyError means file doesn't start with a valid BOM
first = False
fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
Other matters:
You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on
specific platforms such as OpenVMS. When they turn up in documents,
Web pages, e-mail messages, etc., which are ostensibly in an
ISO-8859-n encoding, their code positions generally refer instead to
the characters at that position in a proprietary, system-specific
encoding such as Windows-1252 or the Apple Macintosh ("MacRoman")
character set that use the codes provided for representation of the C1
set with a single 8-bit byte to instead provide additional graphic
characters
This code:
s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
c2 = c.encode('latin1').decode('cp1252')
print "to: U+%04X %s" % (ord(c2), name(c2, "<no name>"))
s3 = u''.join(
c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
for c in s2
)
print 's3 repr:', repr(s3)
print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to: U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to: U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96:
SPA i.e. Start of Protected Area (Used by block-oriented terminals.)
or
EN DASH
?
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
Just open your file with codecs.open like in
import codecs, csv
stream = codecs.open(<yourfile.csv>, encoding="utf-16")
reader = csv.reader(stream)
And work through your program with unicode strings, as you should do anyway if you are processing text

python Incorrect formatting Cyrillic

def inp(text):
tmp = str()
arr = ['.' for x in range(1, 40 - len(text))]
tmp += text + ''.join(arr)
print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
inp(i)
for i in sr:
inp(i)
Output:
tester.................................
om.....................................
sup....................................
jope...................................
тестер...........................
ом...................................
суп.................................
жопа...............................
Why is Python not properly handling Cyrillic? End of the line is not straight and scrappy. Using the formatting goes the same. How can this be corrected? thanks
Read this:
http://docs.python.org/2/howto/unicode.html
Basically, what you have in text parameter to inp function is a string. In Python 2.7, strings are bytes by default. Cyrilic characters are not mapped 1-1 to bytes when encoded in e.g. utf-8 encoding, but require more than one byte (usually 2 in utf-8), so when you do len(text) you don't get the number of characters, but number of bytes.
In order to get the number of characters, you need to know your encoding. Assuming it's utf-8, you can decode text to that encoding and it will print right:
#!/usr/bin/python
# coding=utf-8
def inp(text):
tmp = str()
utext = text.decode('utf-8')
l = len(utext)
arr = ['.' for x in range(1, 40 - l)]
tmp += text + ''.join(arr)
print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
inp(i)
for i in sr:
inp(i)
The important lines are these two:
utext = text.decode('utf-8')
l = len(utext)
where you first decode the text, which results in an unicode string. After that, you can use the built in len to get the length in characters, which is what you want.
Hope this helps.

Python UTF-16 CSV reader

I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.
I am using python 2.7.2. CSV files I need to parse are huge size running into several GBs of data.
Answers for John Machin questions below
print repr(open('test.csv', 'rb').read(100))
Output with test.csv having just abc as content
'\xff\xfea\x00b\x00c\x00'
I think csv file got created on windows machine in USA. I am using Mac OSX Lion.
If I use code provided by phihag and test.csv containing one record.
sample test.csv content used. Below is print repr(open('test.csv', 'rb').read(1000)) output
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag
import codecs
import csv
with open('test.csv','rb') as f:
sr = codecs.StreamRecoder(f,codecs.getencoder('utf-8'),codecs.getdecoder('utf-8'),codecs.getreader('utf-16'),codecs.getwriter('utf-16'))
for row in csv.reader(sr):
print row
Output of the above code
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
expected output is
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:
# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
for line in csv.reader(csvf):
print(line) # do something with the line
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv
class Recoder(object):
def __init__(self, stream, decoder, encoder, eol='\r\n'):
self._stream = stream
self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
self._buf = ''
self._eol = eol
self._reachedEof = False
def read(self, size=None):
r = self._stream.read(size)
raw = self._decoder.decode(r, size is None)
return self._encoder.encode(raw)
def __iter__(self):
return self
def __next__(self):
if self._reachedEof:
raise StopIteration()
while True:
line,eol,rest = self._buf.partition(self._eol)
if eol == self._eol:
self._buf = rest
return self._encoder.encode(line + eol)
raw = self._stream.read(1024)
if raw == '':
self._decoder.decode(b'', True)
self._reachedEof = True
return self._encoder.encode(self._buf)
self._buf += self._decoder.decode(raw)
next = __next__
def close(self):
return self._stream.close()
with open('test.csv','rb') as f:
sr = Recoder(f, 'utf-16', 'utf-8')
for row in csv.reader(sr):
print (row)
open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
try:
from io import BytesIO
except ImportError: # Python < 2.6
from StringIO import StringIO as BytesIO
import csv
with open('utf16.csv', 'rb') as binf:
c = binf.read().decode('utf-16').encode('utf-8')
for line in csv.reader(BytesIO(c)):
print(line) # do something with the line
The Python 2.x csv module documentation example shows how to handle other encodings.
I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is a fixed-length encoding to read fixed-length blocks from your input file without worrying about straddling block boundaries.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))
Four possible ways of encoding u'abc'
\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le
If you have any trouble with this step, edit your question to include the results of the above print repr()
Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:
import sys
infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
buf = fi.read(BUFSIZ)
if not buf: break
if first and enc == 'utf_16':
bom = buf[:2]
buf = buf[2:]
enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
# KeyError means file doesn't start with a valid BOM
first = False
fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
Other matters:
You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on
specific platforms such as OpenVMS. When they turn up in documents,
Web pages, e-mail messages, etc., which are ostensibly in an
ISO-8859-n encoding, their code positions generally refer instead to
the characters at that position in a proprietary, system-specific
encoding such as Windows-1252 or the Apple Macintosh ("MacRoman")
character set that use the codes provided for representation of the C1
set with a single 8-bit byte to instead provide additional graphic
characters
This code:
s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
c2 = c.encode('latin1').decode('cp1252')
print "to: U+%04X %s" % (ord(c2), name(c2, "<no name>"))
s3 = u''.join(
c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
for c in s2
)
print 's3 repr:', repr(s3)
print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to: U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to: U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96:
SPA i.e. Start of Protected Area (Used by block-oriented terminals.)
or
EN DASH
?
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
Just open your file with codecs.open like in
import codecs, csv
stream = codecs.open(<yourfile.csv>, encoding="utf-16")
reader = csv.reader(stream)
And work through your program with unicode strings, as you should do anyway if you are processing text

Categories

Resources