I'm trying to write a custom Python codec. Here's a short example:
import codecs

class TestCodec(codecs.Codec):
    def encode(self, input_, errors='strict'):
        return codecs.charmap_encode(input_, errors, {
            'a': 0x01,
            'b': 0x02,
            'c': 0x03,
        })

    def decode(self, input_, errors='strict'):
        return codecs.charmap_decode(input_, errors, {
            0x01: 'a',
            0x02: 'b',
            0x03: 'c',
        })

def lookup(name):
    if name != 'test':
        return None
    return codecs.CodecInfo(
        name='test',
        encode=TestCodec().encode,
        decode=TestCodec().decode,
    )

codecs.register(lookup)
print(b'\x01\x02\x03'.decode('test'))
print('abc'.encode('test'))
Decoding works, but encoding throws an exception:
$ python3 codectest.py
abc
Traceback (most recent call last):
File "codectest.py", line 29, in <module>
print('abc'.encode('test'))
File "codectest.py", line 8, in encode
'c': 0x03,
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
Any ideas how to use charmap_encode properly?
Look at https://docs.python.org/3/library/codecs.html#encodings-and-unicode (third paragraph):
There’s another group of encodings (the so called charmap encodings) that choose a different subset of all Unicode code points and how these code points are mapped to the bytes 0x0-0xff. To see how this is done simply open e.g. encodings/cp1252.py (which is an encoding that is used primarily on Windows). There’s a string constant with 256 characters that shows you which character is mapped to which byte value.
Take the hint, look at encodings/cp1252.py, and check out the following code:
import codecs

class TestCodec(codecs.Codec):
    def encode(self, input_, errors='strict'):
        return codecs.charmap_encode(input_, errors, encoding_table)

    def decode(self, input_, errors='strict'):
        return codecs.charmap_decode(input_, errors, decoding_table)

def lookup(name):
    if name != 'test':
        return None
    return codecs.CodecInfo(
        name='test',
        encode=TestCodec().encode,
        decode=TestCodec().decode,
    )

decoding_table = (
    'z'
    'a'
    'b'
    'c'
)
encoding_table = codecs.charmap_build(decoding_table)
codecs.register(lookup)

### --- following is test/debug code
print(ascii(encoding_table))
print(b'\x01\x02\x03'.decode('test'))
foo = 'abc'.encode('test')
print(ascii(foo))
Output:
{97: 1, 122: 0, 99: 3, 98: 2}
abc
b'\x01\x02\x03'
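The original attempt fails because charmap_encode looks each character up by its integer code point, so a dict keyed by one-character strings never matches. A minimal sketch, assuming a plain dict is acceptable instead of the table built by charmap_build (the names ENCODING_MAP and DECODING_MAP are made up for this example):

```python
import codecs

# Hypothetical tables for illustration: the encoding map is keyed by
# integer code points (ord), not by one-character strings.
ENCODING_MAP = {ord('a'): 0x01, ord('b'): 0x02, ord('c'): 0x03}
DECODING_MAP = {0x01: 'a', 0x02: 'b', 0x03: 'c'}

# Both functions return a (result, length_consumed) tuple.
encoded, consumed = codecs.charmap_encode('abc', 'strict', ENCODING_MAP)
decoded, _ = codecs.charmap_decode(encoded, 'strict', DECODING_MAP)

print(ascii(encoded), consumed, decoded)
```

For a full 256-entry table, the decoding string plus charmap_build is likely the more convenient route, since the encoding side is derived automatically.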
Related
When I try to return a value from C code to Python code, I get an error.
Traceback (most recent call last):
File "python.py", line 54, in <module>
print("\n\n\n\RESULT: ", str(result, "utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 245: invalid start byte
In the C function I try to return a JSON string containing hex data, which I want to parse in Python and then use for another calculation.
An example of the returned string is "{"data":"0x123132"}"
In Python I use:
import ctypes
my_functions = ctypes.cdll.LoadLibrary("./my_functions.so")
my_functions.getJson.argtypes = (ctypes.c_char_p,)
my_functions.EthereumProcessor.restype = ctypes.c_char_p
result=my_functions.getJson()
print("\n\n\n\RESULT: ", str(result, "utf-8"))
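One way to narrow this down is to look at the raw bytes before decoding them. A minimal sketch, assuming the C function hands back bytes that contain something other than UTF-8 after the JSON (the sample buffer below is invented and stands in for the value of result):

```python
# Invented sample standing in for the bytes returned through ctypes.
raw = b'{"data":"0x123132"}\xba\xfe'

bad_offset = None
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that could not be decoded.
    bad_offset = exc.start
    print("bad byte", hex(raw[bad_offset]), "at offset", bad_offset)
    # Decode leniently so the valid part can still be inspected.
    text = raw.decode("utf-8", errors="replace")

print(text)
```

If the reported offset lies past the end of the expected JSON, the problem is probably on the C side (for example a buffer that is not NUL-terminated or is no longer valid when Python reads it), though that is a guess without seeing the C code.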
I have the string 'Leicht bewölkt'.
The 'ö' is causing the error
'ascii' codec can't encode character u'\xf6' in position 10: ordinal not in range(128)
I tried adding # -*- coding: utf-8 -*- to the start of the file, but it does not work. Could you help me out with that? I just want to print it to the command line and send it to an Arduino.
def removeThat(schasch):
    print(schasch)
    schasch = str(schasch).encode('utf8')
    schasch = str(schasch).encode('utf8').replace("ü","ue").replace("ä","ae").replace("ö","oe").replace("ß","sss")
    return schasch
Replace the characters before you encode the string to UTF-8:
replacements = {
    'ü': 'ue',
    'ä': 'ae',
    'ö': 'oe',
    'ß': 'ss',
}

def replace_umlauts(text: str) -> str:
    for find, replace in replacements.items():
        text = text.replace(find, replace)
    return text

def encode_text(text: str) -> bytes:
    fixed = replace_umlauts(text)
    return fixed.encode('utf-8')

if __name__ == '__main__':
    text = 'Leicht bewölkt'
    print(replace_umlauts(text))
    print(encode_text(text))
which prints
Leicht bewoelkt
b'Leicht bewoelkt'
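The same replacement can also be done in one pass with a translation table. A minimal sketch, assuming Python 3 (the names UMLAUT_TABLE and replace_umlauts_translate are made up for this example):

```python
# The same four replacements, expressed as a translation table.
# str.maketrans accepts a dict whose keys are single characters and
# whose values may be strings of any length.
UMLAUT_TABLE = str.maketrans({'ü': 'ue', 'ä': 'ae', 'ö': 'oe', 'ß': 'ss'})

def replace_umlauts_translate(text: str) -> str:
    return text.translate(UMLAUT_TABLE)

result = replace_umlauts_translate('Leicht bewölkt')
print(result)
# After the replacement the text is pure ASCII, so this cannot fail here.
print(result.encode('ascii'))
```

Uppercase umlauts are not covered by this table; they would need their own entries.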
How is it possible that with the same input I sometimes get an ASCII codec error and sometimes it works just fine? The code cleans the name and builds its Soundex and DMetaphone values. It works in roughly 1 out of 5 runs, sometimes more often :)
UPD: It looks like this is an issue with fuzzy.DMetaphone, at least on Python 2.7 with Unicode. I plan to integrate Metaphone instead for now. Any solutions for the fuzzy.DMetaphone problem are very welcome :)
UPD 2: The problem went away after updating fuzzy to 1.2.2. The same code works fine.
import re
import fuzzy
import sys

def make_namecard(full_name):
    soundex = fuzzy.Soundex(4)
    dmeta = fuzzy.DMetaphone(4)
    names = process_name(full_name)
    print names
    soundexes = map(soundex, names)
    dmetas = []
    for name in names:
        print name
        dmetas.extend(list(dmeta(name)))
    dmetas = filter(bool, dmetas)
    return {
        "full_name": full_name,
        "soundex": soundexes,
        "dmeta": dmetas,
        "names": names,
    }

def process_name(full_name):
    full_name = re.sub("[_-]", " ", full_name)
    full_name = re.sub(r'[^A-Za-z0-9 ]', "", full_name)
    names = full_name.split()
    names = filter(valid_name, names)
    return names

def valid_name(name):
    COMMON_WORDS = ["the", "of"]
    return len(name) >= 2 and name.lower() not in COMMON_WORDS

print make_namecard('Jerusalem Warriors')
Output:
➜ python2.7 make_namecard.py
['Jerusalem', 'Warriors']
Jerusalem
Warriors
{'soundex': [u'J624', u'W624'], 'dmeta': [u'\x00\x00\x00\x00', u'ARSL', u'ARRS', u'FRRS'], 'full_name': 'Jerusalem Warriors', 'names': ['Jerusalem', 'Warriors']}
➜ python2.7 make_namecard.py
['Jerusalem', 'Warriors']
Jerusalem
Traceback (most recent call last):
File "make_namecard.py", line 38, in <module>
print make_namecard('Jerusalem Warriors')
File "make_namecard.py", line 16, in make_namecard
dmetas.extend(list(dmeta(name)))
File "src/fuzzy.pyx", line 258, in fuzzy.DMetaphone.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)
I'm writing a script in Python 3.5.3 that takes username/password combos from a file and writes them to another file. The script was written on a machine with Windows 10 and worked. However, when I tried to run the script on a MacBook running Yosemite, I got an error that has something to do with ASCII encoding.
The relevant function is this:
def buildDatabase():
    print("Building database, this may take some time...")
    passwords = open("10-million-combos.txt", "r") #File with user/pword combos.
    hashWords = open("Hashed Combos.txt", "a") #File where user/SHA-256 encrypted pwords will be stored.
    j = 0
    hashTable = [[ None ] for x in range(60001)] #A hashtable with 30,000 elements, quadratic probing means size must = 2 x the final size + 1
    for line in passwords:
        toSearch = line
        i = q = toSearch.find("\t") #The username/pword combos are formatted: username\tpassword\n.
        n = toSearch.find("\n")
        password = line[i:n-1] #i is the start of the password, n is the end of it
        username = toSearch[ :q] + ":" #q is the end of the username
        byteWord = password.encode('UTF-8')
        sha.update(byteWord)
        toWrite = sha.hexdigest() #password is encoded to UTF-8, run thru SHA-256, and stored in toWrite
        skip = False
        if len(password) == 0: #if len(password) is 0, just skip it
            skip = True
        if len(password) == 1:
            doModulo = ord(password[0]) ** 4
        if len(password) == 2:
            doModulo = ord(password[0]) * ord(password[0]) * ord(password[1]) * ord(password[1])
        if len(password) == 3:
            doModulo = ord(password[0]) * ord(password[0]) * ord(password[1]) * ord(password[2])
        if len(password) > 3:
            doModulo = ord(password[0]) * ord(password[1]) * ord(password[2]) * ord(password[3])
        assignment = doModulo % 60001
        #The if block above gives each combo an assignment number for a hash table, indexed by password because they're more unique than usernames
        successful = False
        collision = 0
The error is as follows:
Traceback (most recent call last):
File "/Users/connerboehm/Documents/Conner B/PythonFinalProject.py", line 104, in <module>
buildDatabase()
File "/Users/connerboehm/Documents/Conner B/PythonFinalProject.py", line 12, in buildDatabase
for line in passwords:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaa in position 2370: ordinal not in range(128)
What's happening here? I haven't gotten this error before on Windows, and I can't see any problem with my attempt to encode into UTF-8.
Edit: Notepad encodes in ANSI. Changing the encoding (just copying and pasting the data to a new .txt file) to UTF-8 solved the problem.
Your program doesn't say which codec the file "10-million-combos.txt" uses, so open() falls back to the platform's default encoding, which on your Mac is ASCII (the traceback goes through encodings/ascii.py). 0xaa isn't a valid ASCII byte, so decoding fails. Identify which codec the file uses and pass it as the encoding parameter to open.
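A minimal sketch of passing the codec explicitly. It assumes the file was saved as cp1252 (what Notepad calls "ANSI" on a Western European Windows install) and uses a small temporary file as a stand-in for the real combos file:

```python
import os
import tempfile

# Build a small stand-in for the combos file; byte 0xaa is not ASCII.
sample = b"user1\tpa\xaass\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

# Assumption: the file is cp1252. Naming the codec makes the result
# independent of the platform's default encoding.
with open(path, "r", encoding="cp1252") as passwords:
    lines = passwords.readlines()

os.remove(path)
print(lines)
```

If the file's real encoding is unknown, decoding with the wrong single-byte codec will not raise an error but may silently produce the wrong characters, so it is worth checking a few non-ASCII lines by eye.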
Not sure what I'm doing wrong here, but with this:
# -*- coding: utf-8 -*-
class Foo(object):
    CURRENCY_SYMBOL_MAP = {"CAD":'$', "USD":'$', "GBP" : "£"}

    def bar(self, value, symbol="GBP"):
        result = u"%s%s" % (self.CURRENCY_SYMBOL_MAP[symbol], value)
        return result

if __name__ == "__main__":
    f = Foo()
    print f.bar(unicode("19.00"))
I get:
Traceback (most recent call last):
File "test.py", line 11, in <module>
print f.bar(unicode("19.00"))
File "test.py", line 7, in bar
result = u"%s%s" % (self.CURRENCY_SYMBOL_MAP[symbol], value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
This is with Python 2.7.6
PS - I get that there are libraries like Babel for formatting things as currency; my question is more about unicode strings and the % operator.
Make sure the strings you're inserting are Unicode too.
CURRENCY_SYMBOL_MAP = {"CAD":u'$', "USD":u'$', "GBP" : u"£"}
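You are attempting to insert a non-unicode string into a unicode string. Python 2 implicitly decodes the byte string as ASCII to do that, and the UTF-8 bytes of "£" (starting with 0xc2) are not ASCII. You just have to make the values in CURRENCY_SYMBOL_MAP unicode objects: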
# -*- coding: utf-8 -*-
class Foo(object):
    CURRENCY_SYMBOL_MAP = {"CAD":u'$', "USD":u'$', "GBP" : u"£"}  # this line is the difference

    def bar(self, value, symbol="GBP"):
        result = u"%s%s" % (self.CURRENCY_SYMBOL_MAP[symbol], value)
        return result

if __name__ == "__main__":
    f = Foo()
    print f.bar(unicode("19.00"))
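The decoding step that Python 2 performs behind the scenes when a byte string meets a unicode format string can be reproduced explicitly. A minimal sketch, written for Python 3 so it runs on current interpreters (the variable names are made up for this example):

```python
# In a UTF-8 source file, the Python 2 literal "£" (no u prefix) holds
# these two bytes.
pound_bytes = '£'.encode('utf-8')

failed_byte = None
failed_position = None
try:
    # This mirrors the implicit ASCII decode Python 2 attempts.
    pound_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    failed_byte = exc.object[exc.start]
    failed_position = exc.start
    print(hex(failed_byte), failed_position)

# Decoding with the codec the bytes were actually written in succeeds.
print(pound_bytes.decode('utf-8'))
```

The byte and position reported here match the ones in the traceback from the question.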