Cyrillic encoding python 2.7 array - python

My code so far:
data = [(u'Rest', u'русский', u'фввв', u'vc'), (u'Rest', u'русский', u'фввв ', u'vc')]
print(data)
result:
[(u'Rest', u'\u0440\u0443\u0441\u0441\u043a\u0438\u0439', u'\u0444\u0432\u0432\u0432', u'vc'), (u'Rest', u'\u0440\u0443\u0441\u0441\u043a\u0438\u0439', u'\u0444\u0432\u0432\u0432 ', u'vc')]
I want the output to display the cyrillic characters, like so:
[('Rest', 'русский', 'фввв', 'vc'), ('Rest', 'русский', 'фввв ', 'vc')]

This is happening because when we print out a list or tuple, the representation of the elements within the list is defined by the element's __repr__ function, rather than its __str__ function. To fix this, you can use the following to encode the strings and then decode the repr() of the list.
Code:
# -*- coding: utf-8 -*-
import sys
data = [(u'Rest', u'русский', u'фввв', u'vc'), (u'Rest', u'русский', u'фввв ', u'vc')]
print repr([tuple(x.encode(sys.stdout.encoding) for x in sl) for sl in data]).decode('string-escape')
Out:
[('Rest', 'русский', 'фввв', 'vc'), ('Rest', 'русский', 'фввв ', 'vc')]
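An alternative that avoids the repr() round trip is to build the display string yourself, so each element is written out as text rather than as its escaped repr. This is only a minimal sketch, assuming the data is always a list of tuples of text strings; format_rows is a hypothetical helper name, not something from the answer above.

```python
# -*- coding: utf-8 -*-


def format_rows(rows):
    """Render a list of tuples of text, quoting each item without escaping it."""
    rendered = []
    for row in rows:
        # Quote every item by hand instead of relying on repr()
        items = u", ".join(u"'%s'" % item for item in row)
        rendered.append(u"(%s)" % items)
    return u"[%s]" % u", ".join(rendered)


data = [(u'Rest', u'русский', u'фввв', u'vc'), (u'Rest', u'русский', u'фввв ', u'vc')]
print(format_rows(data))
```

Note that this sketch does not escape quotes inside the items and does not add the trailing comma Python shows for one-element tuples, so it is for display only.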

Related

How to replace accented characters?

My output looks like 'àéêöhello!'. I need to change it to 'aeeohello', replacing each accented character (such as à) with its unaccented equivalent (a).
Please use the code below:
import unicodedata
def strip_accents(text):
    try:
        text = unicode(text, 'utf-8')
    except NameError:  # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)\
        .encode('ascii', 'ignore')\
        .decode("utf-8")
    return str(text)
s = strip_accents('àéêöhello')
print(s)
import unidecode
somestring = "àéêöhello"
# decode the UTF-8 byte string into a unicode object
u = unicode(somestring, "utf-8")
# transliterate the unicode text to plain ASCII
print unidecode.unidecode(u)
Output:
aeeohello
Alpesh Valaki's answer is the "nicest", but I had to make some adjustments for it to work on Python 3:
# I changed the import
from unidecode import unidecode
somestring = "àéêöhello"
# Python 3 strings are already Unicode, so the unicode() decoding step is dropped
# transliterate the text to plain ASCII
print(unidecode(somestring))
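If you would rather not depend on a third-party package, the standard library alone can handle simple cases like this one. The following is a sketch, not a replacement for unidecode: it decomposes each character with NFD and drops the combining marks, and unlike the encode('ascii', 'ignore') approach it leaves characters that have no decomposition (such as ø or ß) untouched instead of deleting them.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() is non-zero for accents and other combining marks
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(u'àéêöhello'))  # aeeohello
```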

Python url encode/decode - Convert % escaped hexadecimal digits into string

For example, if I have an encoded string as:
url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
The name parameter has the characters %C3%A9 which actually implies the character é.
Hence, I would like the output to be:
new_url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067'
I tried the following steps on a Python terminal:
>>> import urllib2
>>> url='locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'
>>> new_url=urllib2.unquote(url).decode('utf8')
>>> print new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
>>>
However, when I tried the same thing within a Python script and ran it as myscript.py, I got the following stack trace:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 88: ordinal not in range(128)
I am using Python 2.6.6 and cannot switch to other versions due to work reasons.
How can I overcome this error?
Any help is greatly appreciated. Thanks in advance!
######################################################
EDIT
I realized that I am getting the above expected output.
However, I would like to convert the parameters in the new_url into a dictionary as follows. While doing so, I am not able to retain the special character 'é' in my name parameter.
print new_url
params_list = new_url.split("&")
print(params_list)
params_dict={}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict)
Outputs:
new_url
locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pavé+cafe&postalCode=5067
params_list
[u'locality=Norwood', u'address=138+The+Parade', u'region=SA', u'country=AU', u'name=Pav\xe9+cafe', u'postalCode=5067']
params_dict
{u'name': u'Pav\xe9+cafe', u'locality': u'Norwood', u'country': u'AU', u'region': u'SA', u'address': u'138+The+Parade', u'postalCode': u'5067'}
Basically ... the name is now 'Pav\xe9+cafe' as opposed to the required 'Pavé+cafe'.
How can I still retain the same special character in my params_dict?
This is actually due to the difference between __repr__ and __str__. When printing a unicode string, __str__ is used and results in the é you see when printing new_url. However, when a list or dict is printed, __repr__ is used, which uses __repr__ for each object within lists and dicts. If you print the items separately, they print as you desire.
# -*- coding: utf-8 -*-
new_url = u'name=Pavé+cafe&postalCode=5067'
print(new_url) # name=Pavé+cafe&postalCode=5067
params_list = [s for s in new_url.split("&")]
print(params_list) # [u'name=Pav\xe9+cafe', u'postalCode=5067']
print(params_list[0]) # name=Pavé+cafe
print(params_list[1]) # postalCode=5067
params_dict = {}
for p in params_list:
    temp = p.split("=")
    params_dict[temp[0]] = temp[1]
print(params_dict) # {u'postalCode': u'5067', u'name': u'Pav\xe9+cafe'}
print(params_dict.values()[0]) # 5067
print(params_dict.values()[1]) # Pavé+cafe
One way to print the list and dict is to get their string representation, then decode it with unicode-escape:
print(str(params_list).decode('unicode-escape')) # [u'name=Pavé+cafe', u'postalCode=5067']
print(str(params_dict).decode('unicode-escape')) # {u'postalCode': u'5067', u'name': u'Pavé+cafe'}
Note: This is only an issue in Python 2. Python 3 prints the characters as you would expect. Also, you may want to look into urlparse for parsing your URL instead of doing it manually.
import urlparse
new_url = u'name=Pavé+cafe&postalCode=5067'
print dict(urlparse.parse_qsl(new_url)) # {u'postalCode': u'5067', u'name': u'Pav\xe9 cafe'}
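For comparison, here is roughly what the same parsing looks like on Python 3, where the function lives in urllib.parse. This is a sketch for readers who are not tied to Python 2.6: parse_qsl decodes the %C3%A9 escape as UTF-8 by default and turns each + into a space, so no manual unquote or split is needed.

```python
from urllib.parse import parse_qsl

url = 'locality=Norwood&address=138+The+Parade&region=SA&country=AU&name=Pav%C3%A9+cafe&postalCode=5067'

# parse_qsl returns (key, value) pairs with percent-escapes already decoded
params = dict(parse_qsl(url))
print(params['name'])  # Pavé cafe
```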

Error in the coding of the characters in reading a PDF

I need to read this PDF.
I am using the following code:
from PyPDF2 import PdfFileReader
f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())
print(content)
However, the encoding is incorrect, it prints:
Resultado da Prova de Sele“‰o do...
But I expected
Resultado da Prova de Seleção do...
How to solve it?
I'm using Python 3
The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly, for example into UTF-8:
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
You're on Python 3, so strings are Unicode under the hood and source files default to UTF-8. But I wonder if you need to specify a different encoding based on your locale.
# Show the locale aliases Python knows about
import locale
from pprint import pprint
pprint(locale.locale_alias)
If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'
print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
print("\n" * 2)
print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
Relevant Output:
b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240
b'\xc3\xa7' 231
b'\xc3\xa3' 227
If you're getting code point 8220 (>>> hex(8220)  # '0x201c') where you expect 231 (>>> hex(231)  # '0xe7'), then you're getting bad data back from PyPDF.
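A more compact way to run the same comparison is to list only the non-ASCII characters together with their Unicode names. This is a sketch for Python 3; describe_non_ascii is a hypothetical helper name, and the two sample strings are the ones posted in the question.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, ord(ch), unicodedata.name(ch, 'UNKNOWN'))
        for ch in text
        if ord(ch) > 127
    ]


# The string PyPDF2 returned, then the string that was expected
for row in describe_non_ascii('Resultado da Prova de Sele“‰o do'):
    print(row)
for row in describe_non_ascii('Resultado da Prova de Seleção do...'):
    print(row)
```

If the names that come back are punctuation or symbols where you expect letters, the text was most likely mis-decoded before it reached your code, and re-encoding it afterwards will not repair it.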
What I tried is replacing the specific " ' " character with "’", which solved this issue for me. Please let me know if you still fail to generate the PDF with this approach.
text = text.replace("'", "’")

writing utf-8 encoded text to a file

I am trying to print Chinese text to a file. When I print it on the terminal, it looks correct to me. When I type print >> filename... I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 24: ordinal not in range(128)
I don't know what else I need to do. I already encoded all textual data to utf-8 and used string formatting.
This is my code:
# -*- coding: utf-8 -*-
import codecs
import string
exclude = string.punctuation
def get_documents_cleaned(topics,filename):
    c = codecs.open(filename, "w", "utf-8")
    for top in topics:
        print >> c , "num" , "," , "text" , "," , "id" , "," , top
        document_results = proj.get('docs',text=top)['results']
        for doc in document_results:
            print "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None,exclude), doc[0]['document']['id'], top.encode('utf-8'))
            print >> c , "{0}, {1}, {2}, {3}".format(doc[1], (doc[0]['text']).encode('utf-8').translate(None,exclude), doc[0]['document']['id'], top.encode('utf-8'))
get_documents_cleaned(my_topics,"export_c.csv")
print doc[0]['text'] looks like this:
u' \u3001 \u6010...'
Since your first print statement works, it's clear that it's not the format function raising the UnicodeDecodeError.
Instead it's a problem with the file writer. c seems to expect a unicode object but only gets a UTF-8 encoded str object (let's name it s). So c tries to call s.decode() implicitly, which results in the UnicodeDecodeError.
You could fix your problem by simply calling s.decode('utf-8') before printing, or by using the Python default open(filename, "w") function instead.
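The general rule behind both fixes is to pick one side and stay on it: either hand text (unicode) to a file object that encodes for you, or hand already-encoded bytes to a plain file. Here is a minimal sketch of the first option using io.open, which behaves the same on Python 2 and 3. The write_rows helper and the sample rows are made up for illustration and are not the asker's data.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_rows(path, rows):
    """Write rows of text to a UTF-8 file; every field must be text, not bytes."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        for row in rows:
            # The file object does the encoding, so no .encode() calls here
            handle.write(u', '.join(row) + u'\n')


# Hypothetical sample data written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'export_c.csv')
write_rows(path, [(u'num', u'text', u'id'), (u'1', u'\u3001 \u6010', u'42')])

with io.open(path, encoding='utf-8') as handle:
    print(handle.read())
```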

Can't get a degree symbol into raw_input

The problem in my code looks something like this:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
deg = u'°'
print deg
print '40%s N, 100%s W' % (deg, deg)
codelim = raw_input('40%s N, 100%s W)? ' % (deg, deg))
I'm trying to generate a raw_input prompt for delimiter characters inside a latitude/longitude string, and the prompt should include an example of such a string. print deg and print '40%s N, 100%s W' % (deg, deg) both work fine -- they return "°" and "40° N, 100° W" respectively -- but the raw_input step fails every time. The error I get is as follows:
Traceback (most recent call last):
  File "C:\Users\[rest of the path]\scratch.py", line 5, in <module>
    x = raw_input(' %s W")? ' % (deg))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 1: ordinal not in range(128)
I thought I'd have solved the problem by adding the encoding header, as instructed here (and indeed that did make it possible to print the degree sign at all), but I'm still getting Unicode errors as soon as I add an otherwise-safe string to raw_input. What's going on here?
Try encoding the prompt string to stdout's encoding before passing it to raw_input:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
deg = u'°'
prompt = u'40%s N, 100%s W)? ' % (deg, deg)
codelim = raw_input(prompt.encode(sys.stdout.encoding))
40° N, 100° W)?
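One caveat: sys.stdout.encoding can be None when the output is piped or redirected, and the console encoding may not contain every character in the prompt. A slightly more defensive sketch could look like the following. encode_prompt is a hypothetical helper, and the fall-back to UTF-8 is an assumption on my part, not something the console guarantees.

```python
# -*- coding: utf-8 -*-
import sys


def encode_prompt(prompt, encoding=None):
    """Encode a text prompt for the console, replacing unencodable characters."""
    # Fall back to UTF-8 (an assumption) when stdout reports no encoding
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return prompt.encode(encoding, 'replace')


prompt = u'40\xb0 N, 100\xb0 W)? '
print(repr(encode_prompt(prompt, 'utf-8')))
print(repr(encode_prompt(prompt, 'ascii')))
```

On Python 2 you would then call raw_input(encode_prompt(prompt)). On Python 3, input() accepts the text prompt directly, so this step is not needed there.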
