Unable to convert UTF-8 characters - Python

I fetch a block of data into a variable using mechanize and urllib in Python 2.7. However, certain characters are not decoded even though I call .decode("UTF-8"). The full code is as follows:
#!/usr/bin/python
import urllib
import mechanize
from time import time
total_time = 0
count = 0
def send_this(url):
    global count
    count = count + 1
    this_browser=mechanize.Browser()
    this_browser.set_handle_robots(False)
    this_browser.addheaders=[('User-agent','Chrome')]
    translated=this_browser.open(url).read().decode("UTF-8")
    return translated
def collect_this(my_ltarget,my_lhome,data):
    global total_time
    data = data.replace(" ","%20")
    get_url="http://mymemory.translated.net/api/ajaxfetch?q="+data+"&langpair="+my_lhome+"|"+my_ltarget+"&mtonly=1"
    return send_this(get_url)
ctr = 0
print collect_this("hi-IN","en-GB","This is my first proper computer program.")
The output of the print statement is:
{"responseData":{"translatedText":"\u092f\u0939 \u092e\u0947\u0930\u093e \u092a\u0939
\u0932\u093e \u0938\u092e\u0941\u091a\u093f\u0924 \u0915\u0902\u092a\u094d\u092f\u0942\u091f
\u0930 \u092a\u094d\u0930\u094b\u0917\u094d\u0930\u093e\u092e \u0939\u0948
\u0964"},"responseDetails":"","responseStatus":200,"matches":[{"id":0,"segment":"This is my
first proper computer program.","translation":"\u092f\u0939 \u092e\u0947\u0930\u093e \u092a
\u0939\u0932\u093e \u0938\u092e\u0941\u091a\u093f\u0924 \u0915\u0902\u092a\u094d\u092f\u0942
\u091f\u0930 \u092a\u094d\u0930\u094b\u0917\u094d\u0930\u093e\u092e \u0939\u0948
\u0964","quality":"70","reference":"Machine Translation provided by Google, Microsoft,
Worldlingo or MyMemory customized engine.","usage-count":0,"subject":"All","created-
by":"MT!","last-updated-by":"MT!","create-date":"2013-12-20","last-update-
date":"2013-12-20","match":0.85}]}
The \u... sequences are the characters that I expected to be converted.
Where have I gone wrong?

You don't have a UTF-8-encoded string. You have JSON with JSON unicode escapes in it. Decode it with a JSON decoder:
import json
json.loads(your_json_string)
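To illustrate with a short fragment of the response above, here is a minimal sketch. The name raw_bytes is a shortened, hypothetical stand-in for what this_browser.open(url).read() returns; the rest uses only the standard json module.

```python
import json

# Shortened stand-in for the bytes returned by the server (assumption).
raw_bytes = b'{"responseData":{"translatedText":"\\u092f\\u0939"},"responseStatus":200}'

# Decoding as UTF-8 only turns bytes into text. The \uXXXX escapes are
# plain ASCII characters at this point, so they are left untouched.
text = raw_bytes.decode("utf-8")

# The JSON parser is what interprets the \uXXXX escapes.
parsed = json.loads(text)
translated = parsed["responseData"]["translatedText"]

# translated now holds two real Devanagari characters, not escape sequences.
print(len(translated))
```

After json.loads, translated is a normal Unicode string that can be printed or written to a file in whatever encoding the output supports.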

Related

must be unicode not string on python 2

I'm trying to run Python 3 code on Python 2, but it gives me this error:
TypeError: must be unicode, not str
I've tried adding str() before chr(i) and "u" before "P", but I'm obviously doing it wrong.
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith("P"))
def remove_punctuation(text):
    return text.translate(tbl)
# initialize the stemmer
stemmer = LancasterStemmer()
# variable to hold the Json data read from the file
data = None
# read the json file and load the training data
with open('data.json') as json_data:
    data = json.load(json_data)
    print(data)

Use unichr, not chr, to create a Unicode character from an ordinal on Python 2:
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith("P"))
Switch to Python 3 if you can. Python 2 reached end of life on January 1, 2020.
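If the same file has to run on both versions, one possible approach is to alias unichr to chr when it is missing. This is a sketch under that assumption, not part of the original answer; it also uses sys.maxunicode + 1 so the last code point is included.

```python
import sys
import unicodedata

try:
    unichr  # exists on Python 2 only
except NameError:
    unichr = chr  # on Python 3, chr already returns a Unicode character

# Map every punctuation code point (Unicode category P*) to None,
# which tells translate() to delete that character.
tbl = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                    if unicodedata.category(unichr(i)).startswith("P"))

def remove_punctuation(text):
    """Return text (a Unicode string) with punctuation removed."""
    return text.translate(tbl)

print(remove_punctuation(u"Hello, world!"))
```

On Python 2 the argument must be a unicode object (note the u prefix), because the dict-based translate table only works with Unicode strings.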

Python 2 vs 3. Same inputs, different results. MD5 hash

Python 3 code:
def md5hex(data):
    """ return hex string of md5 of the given string """
    h = MD5.new()
    h.update(data.encode('utf-8'))
    return b2a_hex(h.digest()).decode('utf-8')
Python 2 code:
def md5hex(data):
    """ return hex string of md5 of the given string """
    h = MD5.new()
    h.update(data)
    return b2a_hex(h.digest())
Input python 3:
>>> md5hex('bf5¤7¤8¤3')
'61d91bafe643c282bd7d7af7083c14d6'
Input python 2:
>>> md5hex('bf5¤7¤8¤3')
'46440745dd89d0211de4a72c7cea3720'
What's going on?
EDIT:
def genurlkey(songid, md5origin, mediaver=4, fmt=1):
    """ Calculate the deezer download url given the songid, origin and media+format """
    data = b'\xa4'.join(_.encode("utf-8") for _ in [md5origin, str(fmt), str(songid), str(mediaver)])
    data = b'\xa4'.join([md5hex(data), data])+b'\xa4'
    if len(data)%16:
        data += b'\x00' * (16-len(data)%16)
    return hexaescrypt(data, "jo6aey6haid2Teih").decode('utf-8')
The whole problem started with this b'\xa4' in the Python 2 code of another function. This byte doesn't work in Python 3.
And with that one I get the correct MD5 hash...

Use hashlib and a version-agnostic implementation instead:
import hashlib
text = u'bf5¤7¤8¤3'
text = text.encode('utf-8')
print(hashlib.md5(text).hexdigest())
works in Python 2/3 with the same result:
Python2:
'61d91bafe643c282bd7d7af7083c14d6'
Python3 (via repl.it):
'61d91bafe643c282bd7d7af7083c14d6'
Your two versions disagree because they hash different bytes: only the Python 3 version encodes the text as UTF-8 before hashing.
If you need it to match the unencoded Python 2:
import hashlib
text = u'bf5¤7¤8¤3'
print(hashlib.md5(text.encode("latin1")).hexdigest())
works:
46440745dd89d0211de4a72c7cea3720
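This matches because the byte string in your Python 2 session was evidently encoded as Latin-1 (¤ is the single byte 0xA4), not UTF-8. Note that Python 2's default encoding itself is ASCII, not Latin-1.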
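The difference can be made visible by comparing the bytes each encoding produces. This is a small sketch that runs on both Python 2 and 3; the variable names are illustrative only.

```python
import hashlib

# Same text as in the question; \xa4 is the currency sign.
text = u'bf5\xa47\xa48\xa43'

utf8_bytes = text.encode('utf-8')      # the sign becomes two bytes: C2 A4
latin1_bytes = text.encode('latin-1')  # the sign becomes one byte: A4

# Nine characters give 12 bytes in UTF-8 but 9 bytes in Latin-1.
print(len(utf8_bytes), len(latin1_bytes))

# MD5 works on bytes, so different byte sequences give different digests.
utf8_digest = hashlib.md5(utf8_bytes).hexdigest()
latin1_digest = hashlib.md5(latin1_bytes).hexdigest()
print(utf8_digest != latin1_digest)
```

Whichever encoding the other side of the protocol uses is the one to pick; the hash itself has no notion of text.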

In Python 3, str is Unicode text and the default encoding is UTF-8. In Python 2, str is a byte string and the default encoding is ASCII. So even if the two strings look the same when read, they are represented differently.

Python url encode/decode - Convert % escaped hexadecimal digits into string

For example, if I have an encoded string as:
url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
The name parameter contains the characters %C3%A9, which represent the character é.
Hence, I would like the output to be:
new_url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067'
I tried the following steps on a Python terminal:
>>> import urllib2
>>> url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
>>> new_url=urllib2.unquote(url).decode('utf8')
>>> print new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
>>>
However, when I put the same code in a Python script and ran it as myscript.py, I got the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 88: ordinal not in range(128)
I am using Python 2.6.6 and cannot switch to other versions due to work reasons.
How can I overcome this error?
Any help is greatly appreciated. Thanks in advance!
######################################################
EDIT
I realized that I am getting the above expected output.
However, I would like to convert the parameters in the new_url into a dictionary as follows. While doing so, I am not able to retain the special character 'é' in my name parameter.
print new_url
params_list = new_url.split("&")
print(params_list)
params_dict={}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict)
Outputs:
new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
params_list
[u'locality=Norwood', u'address=138+The+Parade', u'region=SA', u'country=AU', u'name=Pav\xe9+cafe', u'postalCode=5067']
params_dict
{u'name': u'Pav\xe9+cafe', u'locality': u'Norwood', u'country': u'AU', u'region': u'SA', u'address': u'138+The+Parade', u'postalCode': u'5067'}
Basically ... the name is now 'Pav\xe9+cafe' as opposed to the required 'Pavé'.
How can I still retain the same special character in my params_dict?

This is actually due to the difference between __repr__ and __str__. When printing a unicode string, __str__ is used and results in the é you see when printing new_url. However, when a list or dict is printed, __repr__ is used, which uses __repr__ for each object within lists and dicts. If you print the items separately, they print as you desire.
# -*- coding: utf-8 -*-
new_url = u'name=Pavé+cafe&postalCode=5067'
print(new_url) # name=Pavé+cafe&postalCode=5067
params_list = [s for s in new_url.split("&")]
print(params_list) # [u'name=Pav\xe9+cafe', u'postalCode=5067']
print(params_list[0]) # name=Pavé+cafe
print(params_list[1]) # postalCode=5067
params_dict = {}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict) # {u'postalCode': u'5067', u'name': u'Pav\xe9+cafe'}
print(params_dict.values()[0]) # 5067
print(params_dict.values()[1]) # Pavé+cafe
One way to print the list and dict is to get their string representation, then decode them with unicode-escape:
print(str(params_list).decode('unicode-escape')) # [u'name=Pavé+cafe', u'postalCode=5067']
print(str(params_dict).decode('unicode-escape')) # {u'postalCode': u'5067', u'name': u'Pavé+cafe'}
Note: This is only an issue in Python 2. Python 3 prints the characters as you would expect. Also, you may want to look into urlparse for parsing your URL instead of doing it manually.
import urlparse
new_url = u'name=Pavé+cafe&postalCode=5067'
print dict(urlparse.parse_qsl(new_url)) # {u'postalCode': u'5067', u'name': u'Pav\xe9 cafe'}
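For comparison, here is a sketch of the same parsing on Python 3, where the function lives in urllib.parse. It starts from the original percent-encoded string, so no separate unquote step is needed. This assumes Python 3 is available, which the question rules out for the asker's environment.

```python
from urllib.parse import parse_qsl

url = ('locality=Norwood&address=138+The+Parade&region=SA&country=AU'
       '&name=Pav%C3%A9+cafe&postalCode=5067')

# parse_qsl splits on & and =, decodes %XX escapes as UTF-8 by default,
# and turns + into a space.
params = dict(parse_qsl(url))

# ascii() shows the value with escapes, so this prints safely on any terminal.
print(ascii(params['name']))
print(params['postalCode'])
```

Note that + is decoded to a space here, so the value is "Pavé cafe" rather than "Pavé+cafe".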

Python RSA Decrypt for class using vs2013

For my class assignment we need to decrypt a message that was encrypted with RSA. We were given code that should help us with the decryption, but it's not helping.
def block_decode(x):
    output = ""
    i = BLOCK_SIZE+1
    while i > 0:
        b1 = int(pow(95,i-1))
        y = int(x/b1)
        i = i - 1
        x = x - y*b1
        output = output + chr(y+32)
    return output
I'm not great with Python yet, but it looks like it processes one character at a time. What really has me stuck is the data we were given. I can't figure out where or how to store it, or whether it really is data encrypted using RSA. Below are just 3 of the 38 lines; some lines contain ' or ", or even several of them.
FWfk ?0oQ!#|eO Wgny 1>a^ 80*^!(l{4! 3lL qj'b!.9#'!/s2_
!BH+V YFKq _#:X &?A8 j_p< 7\[0 la.[ a%}b E`3# d3N? ;%FW
KyYM!"4Tz yuok J;b^!,V4) \JkT .E[i i-y* O~$? o*1u d3N?
How do I get this into a string list?

You are looking for ord, a built-in function that returns the integer ordinal of a one-character string.
So for instance, you can do:
my_file = open("file_containing_encrypted_message")
data = my_file.read()
to read in the encrypted contents.
Then, you can iterate over each character doing
char_val = ord(each_character)
block_decode(char_val)
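The reading and ord steps can be sketched as follows. The helper name to_ordinals and the shortened sample text are hypothetical, and whether per-character ordinals are the input block_decode actually expects is not established by the assignment code shown, so the sketch stops before calling it.

```python
def to_ordinals(text):
    """Return one list of character ordinals per non-empty line of text."""
    return [[ord(ch) for ch in line] for line in text.splitlines() if line]

# Shortened sample; in practice this would come from my_file.read().
sample = "FWfk ?0oQ\n!BH+V YFKq"

# splitlines() gives the list of strings the question asks for. Reading
# from a file avoids any trouble with ' or " inside the data.
lines = sample.splitlines()
ordinals = to_ordinals(sample)

print(lines)
print(ordinals[0][:2])
```

Reading the data from a file rather than pasting it into the source is what sidesteps the quoting problem with ' and ".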

UnicodeEncodeError: 'ascii'

Sorry guys, I'm really new at this. Here is the full Python script.
The purpose of the script is to read two different 1-Wire temperature sensors and then use an HTTP POST to write those values into a MySQL database.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import hashlib
import time
#Dont forget to fill in PASSWORD and URL TO saveTemp (twice) in this file
sensorids = ["28-00000", "28-000004"]
avgtemperatures = []
for sensor in range(len(sensorids)):
    temperatures = []
    for polltime in range(0,3):
        text = '';
        while text.split("\n")[0].find("YES") == -1:
            # Open the file that we viewed earlier so that python can see what is in it. Replace the serial number as before.
            tfile = open("/sys/bus/w1/devices/"+ sensorids[sensor] +"/w1_slave")
            # Read all of the text in the file.
            text = tfile.read()
            # Close the file now that the text has been read.
            tfile.close()
            time.sleep(1)
        # Split the text with new lines (\n) and select the second line.
        secondline = text.split("\n")[1]
        # Split the line into words, referring to the spaces, and select the 10th word (counting from 0).
        temperaturedata = secondline.split(" ")[9]
        # The first two characters are "t=", so get rid of those and convert the temperature from a string to a number.
        temperature = float(temperaturedata[2:])
        # Put the decimal point in the right place and display it.
        temperatures.append(temperature / 1000 * 9.0 / 5.0 + 32.0)
    avgtemperatures.append(sum(temperatures) / float(len(temperatures)))
print avgtemperatures[0]
print avgtemperatures[1]
session = requests.Session()
# Getting a fresh nonce which we will use in the authentication step.
nonce = session.get(url='http://127.0.0.1/temp/saveTemp.php?step=nonce').text
# Hashing the nonce, the password and the temperature values (to provide some integrity).
response = hashlib.sha256('{}PASSWORD{}{}'.format(nonce.encode('utf8'), *avgtemperatures).hexdigest())
# Post data of the two temperature values and the authentication response.
post_data = {'response':response, 'temp1':avgtemperatures[0], 'temp2': avgtemperatures[1]}
post_request = session.post(url='http://127.0.0.1/temp/saveTemp.php', data=post_data)
if post_request.status_code == 200 :
    print post_request.text
Below is the NEW error that I get.
Traceback (most recent call last):
  File "/var/www/pollSensors.py", line 42, in <module>
    response = hashlib.sha256('{}PASSWORD{}{}'.format(nonce.encode('utf8'), *avgtemperatures).hexdigest())
AttributeError: 'str' object has no attribute 'hexdigest'

nonce is a unicode value; session.get(..).text is always unicode.
You are trying to force that value into a string without explicitly providing an encoding. As a result Python is trying to encode it for you with the default ASCII codec. That encoding is failing.
Encode your Unicode values to strings explicitly instead. For a SHA 256 hash, UTF-8 is probably fine.
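response = hashlib.sha256(nonce.encode('utf8') + 'PASSWORD' +
                          str(avgtemperatures[0]) +
                          str(avgtemperatures[1])).hexdigest()
or use string formatting. Note that the closing parenthesis of sha256(...) must come before .hexdigest(); in your version .hexdigest() is called on the formatted string, which causes the AttributeError:
response = hashlib.sha256('{}PASSWORD{}{}'.format(
    nonce.encode('utf8'), *avgtemperatures)).hexdigest()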
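On Python 3, hashlib only accepts bytes, so a version that works on both 2 and 3 would build the whole message as text first and encode it once. The sketch below assumes that; the function name auth_response and the sample values are hypothetical, and the server must of course build the same string for the hashes to match.

```python
import hashlib

def auth_response(nonce, password, temperatures):
    """Hash nonce + password + both temperatures, as in the snippet above."""
    # Build the message as Unicode text first ...
    message = u'{}{}{}{}'.format(nonce, password, *temperatures)
    # ... then encode exactly once, and call hexdigest() on the hash
    # object, not on the formatted string.
    return hashlib.sha256(message.encode('utf8')).hexdigest()

# Sample values for illustration only.
response = auth_response(u'abc', u'PASSWORD', [70.5, 71.25])
print(len(response))
```

Encoding in a single place keeps the byte/text boundary explicit, which is what the original UnicodeEncodeError was about.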

I had a similar issue, except that the codec was decimal instead of ascii.
Remove the directory: your-profile.spyder2\spyder.lock
