Python url encode/decode - Convert % escaped hexadecimal digits into string - python

For example, if I have an encoded string as:
url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
The name parameter has the characters %C3%A9 which actually implies the character é.
Hence, I would like the output to be:
new_url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067'
I tried the following steps on a Python terminal:
>>> import urllib2
>>> url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
>>> new_url=urllib2.unquote(url).decode('utf8')
>>> print new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
>>>
However, when I tried the same thing within a Python script and run as myscript.py, I am getting the following stack trace:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 88: ordinal not in range(128)
I am using Python 2.6.6 and cannot switch to other versions due to work reasons.
How can I overcome this error?
Any help is greatly appreciated. Thanks in advance!
######################################################
EDIT
I realized that I am getting the above expected output.
However, I would like to convert the parameters in the new_url into a dictionary as follows. While doing so, I am not able to retain the special character 'é' in my name parameter.
print new_url
params_list = new_url.split("&")
print(params_list)
params_dict={}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict)
Outputs:
new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
params_list
[u'locality=Norwood', u'address=138+The+Parade', u'region=SA', u'country=AU', u'name=Pav\xe9+cafe', u'postalCode=5067']
params_dict
{u'name': u'Pav\xe9+cafe', u'locality': u'Norwood', u'country': u'AU', u'region': u'SA', u'address': u'138+The+Parade', u'postalCode': u'5067'}
Basically, the name is now shown as 'Pav\xe9+cafe' as opposed to the required 'Pavé+cafe'.
How can I still retain the same special character in my params_dict?

This is due to the difference between __repr__ and __str__. When you print a unicode string directly, __str__ is used, which gives the é you see when printing new_url. When you print a list or dict, however, the container's __repr__ is used, and it in turn calls __repr__ on each object it contains, so the é is shown as the escape \xe9. The data itself is unchanged: if you print the items separately, they print as you desire.
# -*- coding: utf-8 -*-
new_url = u'name=Pavé+cafe&postalCode=5067'
print(new_url) # name=Pavé+cafe&postalCode=5067
params_list = [s for s in new_url.split("&")]
print(params_list) # [u'name=Pav\xe9+cafe', u'postalCode=5067']
print(params_list[0]) # name=Pavé+cafe
print(params_list[1]) # postalCode=5067
params_dict = {}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict) # {u'postalCode': u'5067', u'name': u'Pav\xe9+cafe'}
print(params_dict.values()[0]) # 5067
print(params_dict.values()[1]) # Pavé+cafe
One way to print the list and dict is to get their string representation, then decode it with unicode-escape:
print(str(params_list).decode('unicode-escape')) # [u'name=Pavé+cafe', u'postalCode=5067']
print(str(params_dict).decode('unicode-escape')) # {u'postalCode': u'5067', u'name': u'Pavé+cafe'}
Note: This is only an issue in Python 2. Python 3 prints the characters as you would expect. Also, you may want to look into urlparse for parsing your URL instead of doing it manually.
import urlparse
new_url = u'name=Pavé+cafe&postalCode=5067'
print dict(urlparse.parse_qsl(new_url)) # {u'postalCode': u'5067', u'name': u'Pav\xe9 cafe'}
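For comparison, here is a minimal sketch of the same parsing in Python 3, where the functions live in urllib.parse instead of urlparse and urllib2. It assumes the query string from the question and is not an option on Python 2.6; it only illustrates that parse_qsl resolves both the %XX escapes and the '+' signs, while unquote leaves '+' untouched.

```python
# Python 3 sketch; on Python 2 the equivalents live in urlparse / urllib.
from urllib.parse import parse_qsl, unquote

url = ('locality=Norwood&address=138+The+Parade&region=SA&country=AU'
       '&name=Pav%C3%A9+cafe&postalCode=5067')

# unquote() resolves %XX escapes (decoded as UTF-8 by default) but keeps '+'.
new_url = unquote(url)

# parse_qsl() splits on '&' and '=', resolves %XX escapes
# and turns '+' into a space.
params_dict = dict(parse_qsl(url))

print(new_url)
print(params_dict['name'])
```

Printing params_dict['name'] goes through str, so the é is displayed as a character rather than as an escape.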

Related

Cyrillic encoding python 2.7 array

My code so far:
data = [(u'Rest', u'русский', u'фввв', u'vc'), (u'Rest', u'русский', u'фввв ', u'vc')]
print(data)
result:
[(u'Rest', u'\u0440\u0443\u0441\u0441\u043a\u0438\u0439', u'\u0444\u0432\u0432\u0432', u'vc'), (u'Rest', u'\u0440\u0443\u0441\u0441\u043a\u0438\u0439', u'\u0444\u0432\u0432\u0432 ', u'vc')]
I want the output to display the cyrillic characters, like so:
[('Rest', 'русский', 'фввв', 'vc'), ('Rest', 'русский', 'фввв ', 'vc')]
This is happening because when we print out a list or tuple, the representation of the elements within the list is defined by the element's __repr__ function, rather than its __str__ function. To fix this, you can use the following to encode the strings and then decode the repr() of the list.
Code:
# -*- coding: utf-8 -*-
import sys
data = [(u'Rest', u'русский', u'фввв', u'vc'), (u'Rest', u'русский', u'фввв ', u'vc')]
print repr([tuple(x.encode(sys.stdout.encoding) for x in sl) for sl in data]).decode('string-escape')
Out:
[('Rest', 'русский', 'фввв', 'vc'), ('Rest', 'русский', 'фввв ', 'vc')]
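An alternative that avoids the encode/decode round trip is to build the display string from the elements themselves instead of relying on the container's repr(). The helper name format_rows below is hypothetical, and the sketch assumes every cell is a text string; it only mimics the list-of-tuples layout for display.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def format_rows(rows):
    """Build a list-like display string from the cells' text, not their repr()."""
    formatted = []
    for row in rows:
        # Quote each cell by hand so non-ASCII characters stay readable.
        cells = ", ".join("'%s'" % cell for cell in row)
        formatted.append("(%s)" % cells)
    return "[%s]" % ", ".join(formatted)


data = [('Rest', 'русский', 'фввв', 'vc'), ('Rest', 'русский', 'фввв ', 'vc')]
print(format_rows(data))
```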

writing utf-8 encoded text to a file

I am trying to print Chinese text to a file. When I print it on the terminal, it looks correct to me. When I use print >> filename... I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 24: ordinal not in range(128)
I don't know what else I need to do. I already encoded all textual data to UTF-8 and used string formatting.
This is my code:
# -*- coding: utf-8 -*-
import codecs
import string
exclude = string.punctuation
def get_documents_cleaned(topics,filename):
    c = codecs.open(filename, "w", "utf-8")
    for top in topics:
        print >> c , "num" , "," , "text" , "," , "id" , "," , top
        document_results = proj.get('docs',text=top)['results']
        for doc in document_results:
            print "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None,exclude), doc[0]['document']['id'], top.encode('utf-8'))
            print >> c , "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None,exclude), doc[0]['document']['id'], top.encode('utf-8'))
get_documents_cleaned(my_topics,"export_c.csv")
print doc[0]['text'] looks like this:
u' \u3001 \u6010...'
Since your first print statement works, it's clear that it is not the format function raising the UnicodeDecodeError.
Instead, the problem is with the file writer. c seems to expect a unicode object but only gets a UTF-8 encoded str object (let's name it s). So c tries to decode s implicitly with the default ASCII codec, which results in the UnicodeDecodeError.
You could fix your problem by calling s.decode('utf-8') before printing, or by using the built-in open(filename, "w") function instead.
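The rule behind this fix is to hand text to an encoding-aware file object and let it do the encoding exactly once. A minimal sketch, assuming hypothetical sample rows and a temporary file (io.open behaves the same on Python 2.6+ and Python 3):

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_lines(path, lines):
    """Write text lines to path as UTF-8; the file object does the encoding."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            # Pass text (unicode on Python 2, str on Python 3),
            # never bytes that were already encoded.
            handle.write(line + u"\n")


# Hypothetical rows standing in for the documents in the question.
rows = [u"num, text, id, topic", u"1, \u3001 \u6010, 42, \u4e2d\u6587"]
path = os.path.join(tempfile.mkdtemp(), "export_c.csv")
write_lines(path, rows)

with io.open(path, encoding="utf-8") as handle:
    print(handle.read())
```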

Python: Reading from two array of tuples at a time and placing them side-by-side on CSV file

So I have two arrays of tuples that are arranged with Restaurant Name and an Int:
("Restaurant Name", 0)
One is called ArrayForInitialSpots, and the other is called ArrayForChosenSpots. What I want to do is to write the tuples from both rows in side-by-side order in a csv file like this:
"First Restaurant in ArrayForInitialSPots",0,"First Restaurant in ArrayForChosenSpots", 1
"Second Restaurant in ArrayForInitialSpots",0,"Second Restaurant in ArrayForChosenSpots",0
So far I've tried doing this:
with open('data.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Restaurant Name','Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow(x + y)
        #csv_out.writerow(y)
But I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-6: ordinal not in range(128)
If I remove the zip function, I get too many values to unpack. Any suggestions guys? Thank you very much in advance.
There are two things that you could use to handle non-ASCII characters while writing to files:
Set the default encoding to utf-8 (this changes implicit conversions for the whole process and is generally discouraged):
import sys
reload(sys).setdefaultencoding("utf-8")
Use the unicodecsv writer to write data to files:
import unicodecsv
With mhawke's help, here is my solution:
with open('data.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Restaurant Name','Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        # x + y is one tuple of four fields: name, change, name, change
        list_ = [str(word).decode("utf8") for word in (x+y)]
        # writerow() expects a sequence of fields; a single joined string
        # would be written one character per column
        csv_out.writerow([word.encode('utf-8') for word in list_])
The problem is not due to your use of zip() - that looks OK, but instead it is an encoding issue. Probably the restaurant names are unicode strings or in some encoding other than ASCII or UTF8? ISO-8859-1 perhaps?
The csv module does not handle unicode; other encodings might work, but it depends. The module does handle 8-bit values OK (except ASCII NUL), so you should be able to encode them as UTF8 like this:
import csv
ENCODING = 'iso-8859-1' # assume strings are encoded in this encoding
def to_utf8(item, from_encoding):
    if isinstance(item, str):
        # byte strings are first decoded to unicode
        item = unicode(item, from_encoding)
    return unicode(item).encode('utf8')
with open('data.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Restaurant Name', 'Change'] * 2)
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow([to_utf8(item, ENCODING) for item in x+y])
This works by converting each element of the tuple formed by x+y into a UTF-8 string. This includes byte strings in other encodings, as well as other objects such as integers that can be converted to a unicode string via unicode(). If your strings are unicode, just set ENCODING to None.
I'd suggest using numpy:
import numpy as np
IniSpots=[("Restaurant Name0a", 0),("Restaurant Name1a", 1)]
ChoSpots=[("Restaurant Name0b", 0),("Restaurant Name1b", 0)]
c=np.hstack((IniSpots,ChoSpots))
np.savetxt("data.csv", c, fmt='%s',delimiter=",")
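For comparison, a minimal Python 3 sketch of the side-by-side layout. There the csv module accepts text directly, so the only encoding decision is the one passed to open(). The sample tuples and the names initial_spots and chosen_spots are hypothetical stand-ins for the two arrays in the question.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the two arrays in the question.
initial_spots = [("Caf\u00e9 Uno", 0), ("Bistro Zwei", 0)]
chosen_spots = [("Pav\u00e9 cafe", 1), ("Trattoria Tre", 0)]

path = os.path.join(tempfile.mkdtemp(), "data.csv")

# newline="" lets the csv module control line endings; encoding makes the
# file UTF-8 regardless of the platform default.
with open(path, "w", newline="", encoding="utf-8") as out:
    csv_out = csv.writer(out)
    csv_out.writerow(["Restaurant Name", "Change"] * 2)
    for x, y in zip(initial_spots, chosen_spots):
        csv_out.writerow(x + y)  # tuple concatenation gives one four-field row

# Read the file back to check the layout.
with open(path, newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

Note that csv.reader returns every field as a string, so the integers come back as '0' and '1'.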

Using the python unicode function

I'm working on a project that compares text.
Here is the relevant piece of code:
def post(self):
    A = unicode(flask.request.form['A'])
    B = unicode(flask.request.form['B'])
I posted large pieces of text from project gutenberg and I get errors like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 6: ordinal not in range(128)
Based on this page I have tried errors='ignore' and errors='replace' and get the error:
TypeError: decoding Unicode is not supported
If possible I want to be able to take in the most robust set of characters possible. I was hoping there was a python library that would allow this.
Here is more of the code. I think the problem may occur when I try to turn my input into a string.
C = A.split()
D = B.split()
Both = []
for x in C:
    if x in D:
        Both.append(x)
for x in range(len(Both)):
    Both[x]=str(Both[x])
Final = []
for x in set(Both):
    Final.append(x)
MissingA = []
for x in C:
    if x not in Final and x not in MissingA:
        MissingA.append(x)
for x in range(len(MissingA)):
    MissingA[x]=str(MissingA[x])
MissingB = []
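Regarding "I think the problem may occur when I try to turn my input into a string": I think that's right. In Python 2, calling str() on a unicode object encodes it with the default ASCII codec, which fails on characters such as u'\xe9'. Try eliminating the str() calls.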
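As a rough sketch of the comparison without any str() conversion, the words can stay as text throughout and sets can do the membership work. The function name compare_words and the sample sentences are hypothetical, and the results are sorted only to make the output predictable.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def compare_words(text_a, text_b):
    """Return (words in both, words only in A, words only in B) as sorted lists."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    both = sorted(words_a & words_b)
    only_a = sorted(words_a - words_b)
    only_b = sorted(words_b - words_a)
    return both, only_a, only_b


both, only_a, only_b = compare_words("le café est bon", "le thé est bon")
print(both, only_a, only_b)
```

Because nothing is ever passed through str(), non-ASCII words such as café never hit the implicit ASCII codec.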

Replace Specialchars in Python

I need to replace special characters in filenames. I'm trying this at the moment with translate, but it's not working well, and I hope you have an idea how to do this. The goal is a clean playlist: the MP3 player in my car is poor and can't handle umlauts or special characters.
My code so far
# -*- coding: utf-8 -*-
import os
import sys
import id3reader
pfad = os.path.dirname(sys.argv[1])+"/"
ordner = ""
table = {
    0xe9: u'e',
    0xe4: u'ae',
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): u'ss',
    0xe1: u'ss',
    0xfc: u'ue',
}
def replace(s):
    return ''.join(c for c in s if (c.isalpha() or c == " " or c =="-") )
fobj_in = open(sys.argv[1])
fobj_out = open(sys.argv[1]+".new","w")
for line in fobj_in:
    if (line.rstrip()[0:1]=="#" or line.rstrip()[0:1] ==" "):
        print line.rstrip()[0:1]
    else:
        datei= pfad+line.rstrip()
        #print datei
        id3info = id3reader.Reader(datei)
        dateiname= str(id3info.getValue('performer'))+" - "+ str(id3info.getValue('title'))
        #print dateiname
        arrPfad = line.split('/')
        dateiname = replace(dateiname[0:60])
        print dateiname
        # dateiname = dateiname.translate(table)+".mp3"
        ordner = arrPfad[0]+"/"+dateiname
        # os.rename(datei,pfad+ordner)
        fobj_out.write(ordner+"\r\n")
fobj_in.close()
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 37: ordinal not in range(128)
If I try to use translate on the ID3 title, I get TypeError: expected a character buffer object
If I need to get rid of non-ASCII characters, I often use:
>>> import unicodedata
>>> unicodedata.normalize("NFKD", u"spëcïälchärs").encode('ascii', 'ignore')
'specialchars'
This decomposes each character into its normalized Unicode parts (a base letter plus combining marks) and keeps only the ASCII part.
The bad thing is that it throws away everything it does not know, and it is not smart enough to transliterate umlauts (to ue, ae, etc.).
But it might help you to at least play those MP3s.
Of course, you are free to do your own translate first and wrap the result in this, to eliminate every non-ASCII character still left. In fact, if your replace is correct, this will solve your problem. I'd suggest you take a look at translate on unicode objects, though: it accepts a dict like your table, whereas calling translate with a dict on a byte str raises the TypeError you saw.
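Putting the two steps together might look like the sketch below: a small translate table handles the German spellings first, and the NFKD step then strips whatever accents remain. The table contents and the name to_ascii are assumptions for illustration, not part of the original code; the sketch runs on Python 3 and on Python 2 when given unicode objects.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Code point -> replacement. These German spellings are applied before the
# generic accent stripping below (hypothetical table, extend as needed).
GERMAN = {
    0xe4: u"ae", 0xf6: u"oe", 0xfc: u"ue", 0xdf: u"ss",
    0xc4: u"Ae", 0xd6: u"Oe", 0xdc: u"Ue",
}


def to_ascii(text):
    """Transliterate umlauts, then drop any remaining non-ASCII characters."""
    # Compose first so that 'u' + combining diaeresis matches the table too.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(GERMAN)
    # Decompose what is left and keep only the ASCII base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"Gr\u00fc\u00dfe aus K\u00f6ln"))
```

Characters with no ASCII decomposition are still dropped silently, so the result is only suitable for display names, not for round-tripping.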
