Special characters display with Python

I have declared the encoding at the beginning of my file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
I have this list of words, which contain special characters:
wordswithspecialcharacters = "hors-d'œuvre régal bétail œil écœuré hameçon ex-æquo niño".split()
When I use a function to manipulate those words and print them, the characters are fine:
def plusS(word):
    word = word + "s"
    return word

for x in range(len(wordswithspecialcharacters)):
    print wordswithspecialcharacters[x]
This prints them correctly.
But if I simply print the list wordswithspecialcharacters I get:
["hors-d'\xc5\x93uvre", 'r\xc3\xa9gal', 'b\xc3\xa9tail', '\xc5\x93il',
'\xc3\xa9c\xc5\x93ur\xc3\xa9', 'hame\xc3\xa7on', 'ex-\xc3\xa6quo',
'ni\xc3\xb1o']
My questions:
Why doesn't it display them correctly in the list?
How could I fix that issue?
Thanks
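A minimal sketch of the usual explanation, assuming Python 2 and a UTF-8 terminal: printing a list shows the repr of each element as escaped bytes, while printing the elements themselves (or a joined string) displays the characters directly.
# -*- coding: utf-8 -*-
words = "hors-d'œuvre régal bétail œil écœuré hameçon ex-æquo niño".split()
print words            # list repr: '\xc5\x93'-style escapes
print ' '.join(words)  # the words display with their accents intact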

Related

Can I split a word according to a condition in Python?

I want to split a word that starts with the letter م into two words. For example, معجبني should split into ما عجبني. How can I do that? I'm using Python 2.7.
# -*- coding: utf-8 -*-
token = u'معجبني'
if token[0] == u'م':
    # I want the process here to split the word into ما عجبني
The output that I want:
ما عجبني
I hope someone can help me.
str.startswith() checks whether a string starts with a given prefix, optionally restricting the match to the slice between the given indices start and end.
You can do this:
# -*- coding: utf-8 -*-
token = u'معجبني'
new_t = token.replace(u'م', u'ما ', 1) if token.startswith(u'م') else token
print(new_t)
# ما عجبني
You could use re.sub() to replace the desired character with the new text.
The \b word boundary makes sure that م is the first character in the word. The word boundary doesn't work well with UTF-8 in Python 2.7, so there you can check for a space or the start of the string before your character instead.
# -*- coding: utf-8 -*-
import re
token = u'معجبني'
#pattern = re.compile(u'\\bم')    # <- For Python 3
pattern = re.compile(u'(\\s|^)م') # <- For Python 2.7
print(re.sub(pattern, u'\\1ما ', token))  # \1 keeps any matched space
It outputs :
ما عجبني
The English equivalent would be:
import re
pattern = re.compile(r'\bno')
text = 'nothing something nothing anode'
print(re.sub(pattern,'not ', text))
# not thing something not thing anode
Note that it automatically checks every word in the text.
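A small sketch of the Python 2.7 workaround pattern in English, with the captured group kept in the replacement so that mid-sentence matches keep their leading space:
import re
pattern = re.compile(r'(\s|^)no')
print(re.sub(pattern, r'\1not ', 'nothing anode nothing'))
# not thing anode not thing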
Use the split method:
x = 'blue,red,green'
x.split(',')
['blue', 'red', 'green']
Taken from http://www.pythonforbeginners.com/dictionary/python-split
EDIT: You can then join the array with " ".join(arr). Or you could replace the desired letter with itself plus the extra text.
Your example: "nothing".replace("no", "not ", 1) => "not thing"
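For completeness, a runnable sketch of the split-then-join route:
x = 'blue,red,green'
arr = x.split(',')
print(' '.join(arr))   # blue red green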

Treat all Unicode characters as single letters

I want to create a program that counts the "value" of a word by adding up values assigned to its letters, based on the position at which each letter first appears in the word (as an exercise; I'm new to Python).
E.g. "foo" would return 5 (as 'f' = 1, 'o' = 2) and "bar" would return 6 (as 'b' = 1, 'a' = 2, 'r' = 3).
Here's my code so far:
# -*- coding: utf-8 -*-
def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print ppn(str(raw_input()))
It works well; however, for words containing characters like 'ł', 'ą', etc. it doesn't return the correct value (I would guess it's because these letters are read in as multiple bytes). Is there a way to bypass that and make the interpreter treat all the letters as single letters?
Decode your input into unicode, then use unicode everywhere, then encode when you output.
Specifically you will need to change
print ppn(str(raw_input()))
To
print ppn(raw_input().decode(sys.stdin.encoding))  # requires import sys
This will decode your input. Then you will also need to change
''.join(word) + ": " + str(e)
To
u''.join(word) + u': ' + unicode(e)
This makes all your code use unicode objects internally.
Print will encode the unicode properly to whatever encoding your terminal is using, but you can also specify it if you need to.
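A minimal sketch of specifying the encoding explicitly; the 'utf-8' fallback is an assumption for when stdout is not a terminal (e.g. piping to a file), in which case sys.stdout.encoding is None:
import sys
result = u'\u0142\u0105ka'   # an example unicode value ("łąka")
print result.encode(sys.stdout.encoding or 'utf-8')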
Alternatively you can do exactly what you have already, but run it with Python 3.
For more information, please read this very useful talk on the subject
Decode with the encoding of your shell:
if __name__ == "__main__":
    import sys
    print ppn(raw_input().decode(sys.stdin.encoding))
For Unix systems UTF-8 typically works; on Windows things can be different. To be safe, use sys.stdin.encoding: you never know where your script is going to run.
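For instance, the reported value depends entirely on the environment:
import sys
print sys.stdin.encoding   # e.g. 'UTF-8' on many Unix terminals;
                           # a cp* codepage on Windows consoles, None for a pipe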
Or, even better, switch to Python 3:
# -*- coding: utf-8 -*-
import sys
assert sys.version_info.major > 2

def ppn(word):
    word = list(word)
    cipher = dict()
    i = 1
    e = 0
    for letter in word:
        if letter not in cipher:
            cipher[letter] = i
            e += i
            i += 1
        else:
            e += cipher[letter]
    return ''.join(word) + ": " + str(e)

if __name__ == "__main__":
    print(ppn(str(input())))
In Python 3, strings are Unicode by default, so there is no need for the decoding business.
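A quick check of that (Python 3; the word is just an example):
word = 'łąka'
print(len(word))                   # 4: one per character
print(len(word.encode('utf-8')))   # 6: the UTF-8 byte length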
All the answers so far have explained what to do, but not what's going on, so here are some hints.
When you use raw_input() with Python 2, you get back a string of bytes (input() in Python 3 behaves differently). Most Unicode characters cannot be represented as a single byte, for the simple reason that there are more Unicode characters than values that can be represented with a byte.
Characters like ł or ą, when encoded with utf-8 or other encodings, can take two bytes or more:
>>> 'ł'
'\xc5\x82'
>>> 'ą'
'\xc4\x85'
Your original program is interpreting those two bytes as distinct characters, leading to incorrect results.
Python offers an alternative to byte strings: unicode strings. In a unicode string, one character is exactly one character (the internal representation of the string is opaque), so the problem you are experiencing cannot occur.
Therefore decoding the bytestring into a unicode string is the way to go.
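A minimal sketch of the difference (Python 2, assuming a UTF-8 source file):
# -*- coding: utf-8 -*-
s = 'łą'                # byte string: 4 UTF-8 bytes
u = s.decode('utf-8')   # unicode string: 2 characters
print len(s), len(u)    # 4 2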

How to map an Arabic character to an English string using Python

I am trying to read a file that has Arabic characters like 'ع' and map them to English strings such as "AYN". I want to create such a mapping of all 28 Arabic letters to English strings in Python 3.4. I am still a beginner in Python and do not have much of a clue how to start. The file that has the Arabic characters is encoded as UTF-8.
Use unicodedata:
(note: This is Python 3. In Python 2 use u'ع' instead)
In [1]: import unicodedata
In [2]: unicodedata.name('a')
Out[2]: 'LATIN SMALL LETTER A'
In [6]: unicodedata.name('ع')
Out[6]: 'ARABIC LETTER AIN'
In [7]: unicodedata.name('ع').split()[-1]
Out[7]: 'AIN'
The last line works fine with simple letters, but not with all Arabic symbols. E.g. ڥ is ARABIC LETTER FEH WITH THREE DOTS BELOW.
So you could use:
In [26]: unicodedata.name('ڥ').lower().split()[2]
Out[26]: 'feh'
or
In [28]: unicodedata.name('ڥ').lower()[14:]
Out[28]: 'feh with three dots below'
For identifying characters, use something like this (Python 3):
c = 'ع'
id = unicodedata.name(c).lower()
if 'arabic letter' in id:
    print("{}: {}".format(c, id[14:]))
This would produce:
ع: ain
I'm filtering for the string 'arabic letter' because the Arabic Unicode block contains a lot of other symbols as well.
A complete dictionary can be made with:
arabicdict = {}
for n in range(0x600, 0x700):
    c = chr(n)
    try:
        id = unicodedata.name(c).lower()
        if 'arabic letter' in id:
            arabicdict[c] = id[14:]
    except ValueError:
        pass
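A short usage sketch for the resulting dictionary (Python 3; the word here is just an example):
word = 'علم'
print(' '.join(arabicdict.get(ch, ch) for ch in word))
# ain lam meem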
Refer to the Unicode code point for each character and then construct a dictionary as follows:
arabic = {'alif': u'\u0623', 'baa': u'\u0628', ...} # use unicode mappings like so
Use a simple dictionary in Python to do this properly. Make sure your file starts in the following way:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Here is code that should work for you (I added examples of how to get the values out of your dictionary as well, since you are a beginner):
exampledict = {'ا'.decode('utf-8'): 'ALIF', 'ع'.decode('utf-8'): 'AYN'}
keys = exampledict.keys()
values = exampledict.values()
print(keys)
print(values)
exit()
Output:
[u'\u0639', u'\u0627']
['AYN', 'ALIF']
Hope this helps you on your journey learning Python; it is fun!
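Since the question mentions Python 3.4, note that in Python 3 the decode calls are unnecessary, as source strings are already Unicode; a minimal sketch:
exampledict = {'ا': 'ALIF', 'ع': 'AYN'}
for char, name in exampledict.items():
    print(char, name)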

Python: re.split() display of Cyrillic results

I'm trying to write a function that simply splits a string on any symbol that is not a letter or a number. But I need it to work with Cyrillic, and when I try that I get an output list with elements like '\xd0' instead of the non-Latin words.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

class Syntax():
    def __init__(self, string):
        self.string = string.encode('utf-8')
        self.list = None

    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)

if __name__ == '__main__':
    string = ur"Привет, мой друг test words."
    a = Syntax(string)
    a.split()
    print a.string, a.list
Console output:
Привет, мой друг test words.
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xbc\xd0\xbe\xd0\xb9', '\xd0', '\xd1', '\xd1', '\xd0\xb3', 'test', 'words', '']
Thanks for your help.
There are two problems here:
You're coercing unicode to a byte string in your Syntax constructor. In general you should leave text values as unicode (self.string = string, no encoding).
When you print a Python list, repr is called on each element, which renders the strings as those escape sequences. If you do
for x in a.list:
    print x
after making the first change, it'll print Cyrillic.
Edit: printing a list calls repr on the elements, not str. However, printing a string doesn't repr it; print x and print repr(x) yield different values. For strings, the repr is always something you can evaluate in Python to recover the value.
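Putting both fixes together, a minimal corrected sketch (Python 2):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

class Syntax():
    def __init__(self, string):
        self.string = string   # keep the text as unicode
        self.list = None

    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)

if __name__ == '__main__':
    a = Syntax(u"Привет, мой друг test words.")
    a.split()
    for x in a.list:
        print x   # the Cyrillic words display correctly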

Splitting Japanese characters in Python

I have a list of Japanese kanji characters that are separated by a symbol that looks like a comma. I would like to use a split function to get the information stored in a list.
If the text were in English, I would do the following:
x = 'apple,pear,orange'
x.split(',')
However, this does not work for the following:
japanese = '東北カネカ売,フジヤ商店,橋谷,旭販売,東洋装'
I have set the encoding to
# -*- coding: utf-8 -*-
and I am able to read in the Japanese characters fine.
It's not actually a comma:
>>> u','
u'\uff0c'
If you make the string unicode, you can split it just fine:
>>> u'東北カネカ売,フジヤ商店,橋谷,旭販売,東洋装'.split(u',')
[u'\u6771\u5317\u30ab\u30cd\u30ab\u58f2',
u'\u30d5\u30b8\u30e4\u5546\u5e97',
u'\u6a4b\u8c37',
u'\u65ed\u8ca9\u58f2',
u'\u6771\u6d0b\u88c5']
Python 3 works as well:
>>> '東北カネカ売,フジヤ商店,橋谷,旭販売,東洋装'.split(',')
['東北カネカ売', 'フジヤ商店', '橋谷', '旭販売', '東洋装']
This works for me:
for j in japanese.split('\xef\xbc\x8c'): print j
The "comma" here is '\xef\xbc\x8c'.
