Encoding special characters in Python 2 - Python

I have a piece of code that works well in Python 3:
def encode_test(filepath, char_to_int):
    with open(filepath, "r", encoding="latin-1") as f:
        dat = [line.rstrip() for line in f]
    string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
However, when I try to do this in Python 2.7, I first get the error
SyntaxError: Non-ASCII character '\xc3' in file languageIdentification.py on line 30, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I realized that I may need to add # coding=utf-8 at the top of the file. However, after doing this, I encountered another error:
UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
string_to_int = [[char_to_int[char] if char != 'ó' else char_to_int['ò'] for char in line] for line in dat]
Traceback (most recent call last):
  File "languageIdentification.py", line 190, in <module>
    test_string = encode_test(sys.argv[3], char_to_int)
  File "languageIdentification.py", line 32, in encode_test
    string_to_int = [[char_to_int[char] if char != 'ó' else
        char_to_int['ò'] for char in line] for line in dat]
KeyError: u'\xf3'
So could anyone tell me what I can do to solve this problem in Python 2.7?
Thank you!

The problem is that you are comparing a unicode string and a byte string:
char != 'ó'
where char is a unicode object and 'ó' is a byte string (i.e. a plain str).
When Python 2 encounters such a comparison, it tries to convert (decode):
byte-string -> unicode
The conversion uses the default encoding, which is ASCII in Python 2.
Since the byte value of 'ó' is higher than 127, the decoding fails and you get the UnicodeWarning.
By the way, for a literal whose byte value is in the ASCII range, the comparison succeeds.
Examples:
print u'ó' == 'ó' # UnicodeWarning: ...
print u'z' == 'z' # True
So, in the comparison you need to convert your byte string to unicode manually.
For example, you can do that with built-in unicode() function:
u = unicode('ó', 'utf-8') # note, that you can specify encoding
Or just with 'u'-literal:
u = u'ó'
But be aware: with this option the conversion is performed using the encoding you declared at the top of the source file.
So your actual source encoding and the encoding declared at the top should match.
As I can see from the SyntaxError message, in your source 'ó' starts with the '\xc3' byte. Therefore it should be '\xc3\xb3', which is UTF-8:
print '\xc3\xb3'.decode('utf-8') # ó
So, # coding: utf-8 + char != u'ó' should solve your problem.
Update:
As I can see from the traceback, there is a second problem: a KeyError.
This error occurs in the statement:
char_to_int[char]
because u'\xf3' (which actually is u'ó') is not a valid key.
This unicode comes from decoding your file (with latin-1).
And I suppose that there are no unicode keys in your char_to_int dict at all.
So try to encode such a key back to its byte value with:
char_to_int[char.encode('latin-1')]
Summarizing, try to change the last line of the provided code to:
string_to_int = [[char_to_int[char.encode('latin-1')] if char != u'ó' else char_to_int['ò'] for char in line] for line in dat]
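Putting it all together, here is a minimal Python 2 sketch of the whole function. This is an illustration only: it mirrors the summary line above and assumes, as the answer supposes, that char_to_int is keyed by latin-1 byte strings; io.open is used because, unlike Python 2's built-in open, it accepts an encoding argument the way Python 3's open does.
# -*- coding: utf-8 -*-
import io

def encode_test(filepath, char_to_int):
    # io.open decodes the file to unicode, like open() does in Python 3
    with io.open(filepath, "r", encoding="latin-1") as f:
        dat = [line.rstrip() for line in f]
    # compare unicode to unicode; encode back to a latin-1 byte string for the lookup
    return [[char_to_int[char.encode('latin-1')] if char != u'ó' else char_to_int['ò']
             for char in line] for line in dat]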

If you want to convert a character to its integer value you may use the ord function; it works for Unicode too.
line = u'some Unicode line with ò and ó'
string_to_int = [ord(char) if char != u'ó' else ord(u'ò') for char in line]

Related

Encoding a file with ord function

I'm trying to encode a file and output the encoded data into a new file, but I got this error:
TypeError: ord() expected string of length 1, but int found
My code:
from sys import argv, exit

def encode(data):
    encoded = ''
    while data:
        current = data[0]
        count = 1
        for i in data[1:]:
            if i == current:
                count += 1
            else:
                break
            if count == 255:
                break
        encoded += '{}{}'.format(chr(ord(current) & 255), chr(count & 255)) #error occurs here.
        data = data[count:]
    return encoded

if __name__ == '__main__':
    if len(argv) < 2:
        print('Please specify input file!')
        exit(0)
    with open(argv[1], 'rb') as (f):
        data = f.read()
    with open(argv[1] + '.out', 'wb') as (f):
        f.write(encode(data))
Additional question: How do I decode the encoded file?
You are reading bytes (open(..., 'rb')), so when you take one element of the byte string, you get a byte, i.e. a number. This number already is the character code, so just leave out the ord. Alternatively, you could open the file without the b modifier (open(..., 'r')), which returns a string; I would advise keeping it as a byte string, though, or you could run into encoding issues if you are parsing something non-ASCII.
You will run into a similar problem saving your file: you cannot write a string into a file opened with the b modifier. Since you have characters outside the ASCII range (>128), writing as a string is not a good idea, because Python will try to encode your characters (e.g. as UTF-8) and you will end up with completely different bytes. Therefore, the best solution is probably not to concatenate your data into a string in your loop (the part where you do '{}{}'.format(...)), but to keep a list (encoded = [], appending with encoded.append(current)) and convert it to a byte string with bytes(encoded) after your loop. You can then pass that to write without a problem.
As for how to decode your file: open the file as you do for encoding, read two bytes b1 and b2, and append [b1]*b2 to your output (again, as a list), then convert that to a byte string with bytes().
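A minimal Python 3 sketch along those lines (an illustration, not the poster's code; it assumes the same value-byte followed by count-byte output format as above):
def encode(data):
    # data is a bytes object, so each element is already an int in range(256)
    encoded = []
    while data:
        current = data[0]
        count = 1
        for b in data[1:]:
            if b != current or count == 255:
                break
            count += 1
        encoded += [current & 255, count & 255]
        data = data[count:]
    return bytes(encoded)

def decode(data):
    # read the value/count pairs back and expand each run
    decoded = []
    for value, count in zip(data[::2], data[1::2]):
        decoded += [value] * count
    return bytes(decoded)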

Printing out all unicode emojis to file

It's possible to print the hexcode of an emoji with the u'\uXXXX' pattern in Python, e.g.
>>> print(u'\u231B')
⌛
However, if I have a list of hex code like 231B, just "adding" the string won't work:
>>> print(u'\u' + ' 231B')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
The chr() fails too:
>>> chr('231B')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required (got type str)
The first part of my question is: given the hexcode, e.g. 231A, how do I get the str type of the emoji?
My goal is to get the list of emojis from https://unicode.org/Public/emoji/13.0/emoji-sequences.txt by reading the hexcode in the first column.
There are cases where the code is a range, e.g. 231A..231B. The second part of my question is: given a hexcode range, how do I iterate through it to get the emoji str? For e.g. 2648..2653 it is possible to do range(2648, 2653+1), but if there's a letter in the hex, e.g. 1F232..1F236, using range() is not possible.
Thanks @amadan for the solutions!!
TL;DR
To get a list of emojis from https://unicode.org/Public/emoji/13.0/emoji-sequences.txt into a file.
import requests

response = requests.get('https://unicode.org/Public/emoji/13.0/emoji-sequences.txt')

with open('emoji.txt', 'w') as fout:
    for line in response.content.decode('utf8').split('\n'):
        if line.strip() and not line.startswith('#'):
            hexa = line.split(';')[0]
            hexa = hexa.split('..')
            if len(hexa) == 1:
                ch = ''.join([chr(int(h, 16)) for h in hexa[0].strip().split(' ')])
                print(ch, end='\n', file=fout)
            else:
                start, end = hexa
                for ch in range(int(start, 16), int(end, 16)+1):
                    #ch = ''.join([chr(int(h, 16)) for h in ch.split(' ')])
                    print(chr(ch), end='\n', file=fout)
Convert hex string to number, then use chr:
chr(int('231B', 16))
# => '⌛'
or directly use a hex literal:
chr(0x231B)
To use a range, again, you need an int, either converted from a string or using a hex literal:
''.join(chr(c) for c in range(0x2648, 0x2654))
# => '♈♉♊♋♌♍♎♏♐♑♒♓'
or
''.join(chr(c) for c in range(int('2648', 16), int('2654', 16)))
(NOTE: you'd get something very different from range(2648, 2654)!)

"01"-string representing bytes to unicode conversion in python 2

If I have a byte such as 11001010 or 01001010, how can I convert it back to Unicode if it is a valid code point?
I can take inputs and do a regex check on the input, but that would be a crude way of doing it, and it would be limited to UTF-8. If I want to extend it in the future, how can I optimise the solution?
The input is a string of 0s and 1s:
11001010 - this is invalid
or 01001010 - this is valid
or 11010010 11001110 - this is invalid
If there is no other text, split the strings on whitespace, convert each to an integer and feed the result to a bytearray() object to decode:
as_binary = bytearray(int(b, 2) for b in inputtext.split())
as_unicode = as_binary.decode('utf8')
By putting the integer values into a bytearray() we avoid having to concatenate individual characters and get a convenient .decode() method as a bonus.
Note that this does expect the input to contain valid UTF-8. You could add an error handler to replace bad bytes rather than raise an exception, e.g. as_binary.decode('utf8', 'replace').
Wrapped up as a function that takes a codec and error handler:
def to_text(inputtext, encoding='utf8', errors='strict'):
    as_binary = bytearray(int(b, 2) for b in inputtext.split())
    return as_binary.decode(encoding, errors)
Most of your samples are not actually valid UTF-8, so the demo sets errors to 'replace':
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('01001010', errors='replace')
u'J'
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('11010010 11001110', errors='replace')
u'\ufffd\ufffd'
Leave errors to the default if you want to detect invalid data; just catch the UnicodeDecodeError exception thrown:
>>> to_text('11010010 11001110')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in to_text
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd2 in position 0: invalid continuation byte
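If you prefer a return value over an exception, a small wrapper like this (a sketch, not part of the original answer) does the catching for you:
def to_text_or_none(inputtext, encoding='utf8'):
    try:
        return to_text(inputtext, encoding)
    except UnicodeDecodeError:
        return None  # the bit pattern is not a valid sequence in the given encoding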

Python issue with different versions on local machine/server for json-csv conversion [duplicate]

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, as these spurious cp1252 characters are invalid UTF-8.
However, Python codecs allow you to register a callback to handle encoding/decoding errors with the codecs.register_error function; it gets the UnicodeDecodeError as a parameter. You can write such a handler that attempts to decode the data as "cp1252" and continues the decoding in UTF-8 for the rest of the string.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if the next character is also not UTF-8 and outside range(128), the error is raised again at the first out-of-range character. That means the decoding "walks back" if consecutive non-ASCII, non-UTF-8 chars are found.
The workaround for this is to keep a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it. In this short example, I implemented it as a global variable (it has to be manually reset to -1 before each call to the decoder):
import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno and a whack of other Google searches and other pounding I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs

replacement = {
    '85' : '...', # u'\u2026' ... character.
    '96' : '-',   # u'\u2013' en-dash
    '97' : '-',   # u'\u2014' em-dash
    '91' : "'",   # u'\u2018' left single quote
    '92' : "'",   # u'\u2019' right single quote
    '93' : '"',   # u'\u201C' left double quote
    '94' : '"',   # u'\u201D' right double quote
    '95' : "*"    # u'\u2022' bullet
}

#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn the text into UTF-8. For any characters that fail I just convert them to hex so I can display them or look them up in a table of my own.
This is not pretty, but it does allow me to make sense of messed-up data.
jsbueno's solution is good, but there is no need for the global variable last_position:
def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start: error.end]
    return bs.decode("cp1252"), error.start + 1
import codecs
codecs.register_error("mixed", mixed_decoder)
a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"
s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Just ran into this today, so here is my problem and my own solution:
original_string = 'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    for c in s:
        if ii <= len(s)-1:
            if s[ii] == '\\' and s[ii+1] == 'x':
                b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
                output = output+b
                ii += 3
            else:
                output = output+s[ii]
            ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.

Python: How do I compare unicode to ascii text?

I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
def main():
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print("Please specify a file to check.")
        return
    try:
        f = open(filename, 'r')
    except IOError as e:
        print("Sorry! The file doesn't exist.")
        return
    filecontent = f.read()
    f.close()
    #y = zk[29]
    #print y.decode('utf-8')
    for f in filecontent:
        for z in zk:
            if f == z.decode('utf-8'):
                print f
    print filename

if __name__ == "__main__":
    main()
Am I missing a step?
Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.
Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        print line.translate(translation_table),
You need to convert everything to the same form, and that form is Unicode strings. A Unicode string has no encoding in the .encode()/.decode() sense. A non-Unicode string is actually a stream of bytes that expresses the value in some encoding. When converting it to Unicode, you have to .decode() it; when storing a Unicode string as a sequence of bytes, you have to .encode() the abstraction to concrete bytes.
So, when loading Unicode strings from a UTF-8 encoded file, you either read it into old strings (non-Unicode, sequences of bytes) and then .decode('utf-8'), or you use codecs.open(..., encoding='utf-8'), which gives you Unicode strings automatically.
The form # coding=utf-8 is not the usual one, but it is OK... provided the editor (I mean the tool you use to write the text) also treats the file as UTF-8. Then the old string literals are displayed by the editor correctly. In that case they have to be .decode('utf-8')d to get Unicode. Old strings containing only ASCII characters in the same source can also be converted to Unicode using .decode('utf-8').
To summarize: you decode from bytes to Unicode, and you encode Unicode strings into sequences of bytes. It seems from the question that you are doing the opposite.
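A tiny illustration of the two directions in Python 2 (assuming a UTF-8 source encoding, as in the question):
b = 'ó'                  # byte string: the two bytes '\xc3\xb3' in UTF-8
u = b.decode('utf-8')    # unicode string: u'\xf3'
b2 = u.encode('utf-8')   # back to bytes: '\xc3\xb3'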
The following is completely wrong:
for f in filecontent:
    for z in zk:
        if f == z.decode('utf-8'):
            print f
because filecontent is the result of f.read(), so it is a sequence of bytes. The f in the loop is one byte. The z.decode('utf-8') returns one Unicode character. They cannot be compared. (By the way, f is a rather misleading name for a byte value.)
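A minimal Python 2 sketch of the corrected loop along those lines (an illustration only; zk and filename come from the question's code, and the input file is assumed to be UTF-8 encoded):
# -*- coding: utf-8 -*-
import codecs

zk_unicode = [z.decode('utf-8') for z in zk]   # decode the byte-string literals once

with codecs.open(filename, 'r', encoding='utf-8') as f:
    filecontent = f.read()                     # a unicode string

for ch in filecontent:                         # ch is one unicode character
    if ch in zk_unicode:
        print ch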
