Splitting Japanese characters in Python

I have a list of Japanese kanji entries separated by a symbol that looks like a comma, and I would like to use a split function to get the information into a list.
If the text were in English, I would do the following:
x = 'apple,pear,orange'
x.split(',')
However, this does not work for the following:
japanese = '東北カネカ売,フジヤ商店,橋谷,旭販売,東洋装'
I have set the encoding to
# -*- coding: utf-8 -*-
and I am able to read in the Japanese characters fine.

It's not actually an ASCII comma; it's the fullwidth comma:
>>> u','
u'\uff0c'
If you make the string unicode, you can split it just fine:
>>> u'東北カネカ売,フジヤ商店,橋谷,旭販売,東洋装'.split(u',')
[u'\u6771\u5317\u30ab\u30cd\u30ab\u58f2',
u'\u30d5\u30b8\u30e4\u5546\u5e97',
u'\u6a4b\u8c37',
u'\u65ed\u8ca9\u58f2',
u'\u6771\u6d0b\u88c5']
Python 3 works as well:
>>> '東北カネカ売,フジヤ商店,橋谷,旭販売,東洋装'.split(',')
['東北カネカ売', 'フジヤ商店', '橋谷', '旭販売', '東洋装']

This works for me in Python 2, splitting the UTF-8 byte string directly:
for j in japanese.split('\xef\xbc\x8c'): print j
The "comma" here is '\xef\xbc\x8c', the UTF-8 encoding of U+FF0C.


JSON contains incorrect UTF-8 \u00ce\u00b2 instead of Unicode \u03b2, how to fix in Python?

First, note that the symbol β (Greek beta) has the hex representation CE B2 in UTF-8.
I have legacy source code in Python 2.7 that uses json strings:
u'{"something":"text \\u00ce\\u00b2 text..."}'
It then calls json.loads(string) or json.loads(string, 'utf-8'), but the result is a Unicode string containing UTF-8 bytes as characters:
u'text \xce\xb2 text'
What I want is normal Python Unicode (UTF-16?) string:
u'text β text'
If I call:
text = text.decode('unicode_escape')
before json.loads, then I get the correct Unicode β symbol, but it also breaks the JSON by replacing all newline escapes (\n) as well.
The question is: how do I convert only the "\\u00ce\\u00b2" part without affecting other JSON escape sequences?
(I am new to Python, and it is not my source code, so I have no idea how this is supposed to work. I suspect that the code only works correctly with ASCII characters.)
Something like this, perhaps. This is limited to 2-byte UTF-8 characters.
import re

j = u'{"something":"text \\u00ce\\u00b2 text..."}'

def decodeu(match):
    u = '%c%c' % (int(match.group(1), 16), int(match.group(2), 16))
    return repr(u.decode('utf-8'))[2:8]

j = re.sub(r'\\u00([cd][0-9a-f])\\u00([89ab][0-9a-f])', decodeu, j)
print(j)
returns {"something":"text \u03b2 text..."} for your sample. At this point, you can import it as regular JSON and get the final string you want.
result = json.loads(j)
Here's a string-fixer that works after loading the JSON. It handles UTF-8-like sequences of any length and ignores escape sequences that don't look like UTF-8.
Example:
import json
import re
def fix(bad):
    return re.sub(ur'[\xc2-\xf4][\x80-\xbf]+',
                  lambda m: m.group(0).encode('latin1').decode('utf8'),
                  bad)

# 2- and 3-byte UTF-8-like sequences plus one correct escape code.
json_text = '''\
{
"something":"text \\u00ce\\u00b2 text \\u00e4\\u00bd\\u00a0\\u597d..."
}
'''
data = json.loads(json_text)
bad_str = data[u'something']
good_str = fix(bad_str)
print bad_str
print good_str
Output:
text β text ä½ 好...
text β text 你好...
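In Python 3 the same repair needs no regex: re-encode the mis-decoded string through Latin-1 and decode it as UTF-8. This is a minimal sketch that assumes every character of the bad string falls in the Latin-1 range:

```python
def fix(bad):
    # The mojibake characters are really UTF-8 bytes that were decoded
    # one byte per character; Latin-1 maps them back to those bytes.
    return bad.encode('latin-1').decode('utf-8')

print(fix('text \u00ce\u00b2 text'))  # text β text
```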

Can I split the word according to condition in python

I want to split a word that starts with the letter م into two words. For example, معجبني should split into ما عجبني. How can I do that? I'm using Python 2.7.
# -*- coding: utf-8 -*-
token = u'معجبني'
if token[0] == u'م':
    # I want the process here to split the word into ما عجبني
The output that I want:
ما عجبني
I hope someone can help me.
str.startswith() checks whether a string starts with the given prefix, optionally restricting the match with the given start and end indices.
You can do this:
# -*- coding: utf-8 -*-
token = u'معجبني'
new_t = token.replace(u'م', u'ما ', 1) if token.startswith(u'م') else token
print(new_t)
#ما عجبني
You could use re.sub() to replace the desired character with its replacement plus a space.
The \\b word boundary makes sure that م is the first character in the word. The word boundary doesn't work well with Python 2.7 and UTF-8, so you can instead check for a space or the start of the string before your character.
# -*- coding: utf-8 -*-
import re
token = u'معجبني'
#pattern = re.compile(u'\\bم')    # <- For Python 3
pattern = re.compile(u'(\s|^)م')  # <- For Python 2.7
print(re.sub(pattern, u'\g<1>ما ', token))  # \g<1> keeps any captured space
It outputs :
ما عجبني
The english equivalent would be :
import re
pattern = re.compile(r'\bno')
text = 'nothing something nothing anode'
print(re.sub(pattern,'not ', text))
# not thing something not thing anode
Note that it automatically checks every word in the text.
Use the split method.
x = 'blue,red,green'
x.split(',')
['blue', 'red', 'green']
Taken from http://www.pythonforbeginners.com/dictionary/python-split
EDIT: You can then join the list with " ".join(arr). Or you could replace the desired letters with themselves plus a space.
Your example: "nothing".replace("no", "no ") => "no thing"
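Put together, the English analogue of the whole approach looks like this (names are illustrative):

```python
word = "nothing"
# Insert a space after a leading "no"; the count of 1 touches only
# the first occurrence.
if word.startswith("no"):
    word = word.replace("no", "no ", 1)
print(word)  # no thing
```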

Special characters display with Python

I have declared the encoding at the beginning of my file :
#!/usr/bin/env python
# -*- coding: utf-8 -*-
I have this list of words which have special characters
wordswithspecialcharacters = "hors-d'œuvre régal bétail œil écœuré hameçon ex-æquo niño".split()
When I use a function to manipulate those words and print them one at a time, the characters are fine:
def plusS(word):
    word = word + "s"
    return word

for x in range(len(wordswithspecialcharacters)):
    print wordswithspecialcharacters[x]
this will print them correctly.
But if I simply print the list wordswithspecialcharacters I get:
["hors-d'\xc5\x93uvre", 'r\xc3\xa9gal', 'b\xc3\xa9tail', '\xc5\x93il',
'\xc3\xa9c\xc5\x93ur\xc3\xa9', 'hame\xc3\xa7on', 'ex-\xc3\xa6quo',
'ni\xc3\xb1o']
My questions:
Why doesn't it display them correctly in the list?
How could I fix that issue?
Thanks
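The short explanation: displaying a list shows the repr() of each element, and in Python 2 the repr() of a UTF-8 byte string is its escape codes, while printing an element writes the raw bytes to the terminal. A sketch of that distinction in Python 3 terms, where bytes plays the role of the Python 2 str:

```python
word = "œil".encode("utf-8")   # a UTF-8 byte string, like a Python 2 str
print([word])                  # list display uses repr(): [b'\xc5\x93il']
print(word.decode("utf-8"))    # decoding back gives the readable text: œil
```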

How to map an Arabic character to an English string using Python

I am trying to read a file that has Arabic characters like 'ع' and map each of them to an English string such as "AYN". I want to create such a mapping of all 28 Arabic letters to English strings in Python 3.4. I am still a beginner in Python and do not have much of a clue how to start. The file that has the Arabic characters is encoded in UTF-8.
Use unicodedata;
(note: This is Python 3. In Python 2 use u'ع' instead)
In [1]: import unicodedata
In [2]: unicodedata.name('a')
Out[2]: 'LATIN SMALL LETTER A'
In [6]: unicodedata.name('ع')
Out[6]: 'ARABIC LETTER AIN'
In [7]: unicodedata.name('ع').split()[-1]
Out[7]: 'AIN'
The last line works fine with simple letters, but not with all Arabic symbols. E.g. ڥ is ARABIC LETTER FEH WITH THREE DOTS BELOW.
So you could use;
In [26]: unicodedata.name('ڥ').lower().split()[2]
Out[26]: 'feh'
or
In [28]: unicodedata.name('ڥ').lower()[14:]
Out[28]: 'feh with three dots below'
For identifying characters, use something like this (Python 3):
c = 'ع'
id = unicodedata.name(c).lower()
if 'arabic letter' in id:
    print("{}: {}".format(c, id[14:]))
This would produce;
ع: ain
I'm filtering for the string 'arabic letter' because the arabic unicode block has a lot of other symbols as well.
A complete dictionary can be made with:
arabicdict = {}
for n in range(0x600, 0x700):
    c = chr(n)
    try:
        id = unicodedata.name(c).lower()
        if 'arabic letter' in id:
            arabicdict[c] = id[14:]
    except ValueError:
        pass
Refer to the Unicode code point for each character and then construct a dictionary as follows:
arabic = {'alif': u'\u0623', 'baa': u'\u0628', ...} # use unicode mappings like so
Use a simple dictionary in Python to do this properly. Make sure your file starts with:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Here is Python 2 code that should work for you (I added examples of how to get the values out of your dictionary as well, since you are a beginner):
exampledict = {unicode(('ا').decode('utf-8')):'ALIF',unicode(('ع').decode('utf-8')):'AYN'}
keys = exampledict.keys()
values = exampledict.values()
print(keys)
print(values)
exit()
Output:
[u'\u0639', u'\u0627']
['AYN', 'ALIF']
Hope this helps you on your journey learning python, it is fun!
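In Python 3 the unicode()/decode() wrapping is unnecessary because string literals are already Unicode; a minimal equivalent of the example above:

```python
exampledict = {'\u0627': 'ALIF', '\u0639': 'AYN'}
print(list(exampledict.keys()))    # ['ا', 'ع']
print(list(exampledict.values()))  # ['ALIF', 'AYN']
```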

String.maketrans for English and Persian numbers

I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried to encode the strings as UTF-8, but I always got errors (sometimes the problem was the Arabic string instead). Do you know a better solution for this job?
EDIT:
It seems the problem is the length of the encoded characters: a digit like '۱' is two bytes in UTF-8 (which I found out by looking at the individual byte values). That length mismatch is where the problem starts :-(
See the unidecode library, which transliterates Unicode strings to ASCII. It is very useful for handling number input in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
EDIT:
I came up with a way to do your replacement using Python 2 regular expressions:
# coding: utf-8
import re

# Attention: while the characters in the strings below are
# displayed identically, they are represented internally
# by distinct unicode codepoints.
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'

persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)

def _sub(match_object, digits):
    return english_numbers[digits.find(match_object.group(0))]

def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)

def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)

def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)

def replace_persian(text):
    return re.sub(persian_regexp, _sub_persian, text)

Note that the "text" parameter must be unicode itself.
(This code could be shortened by using lambdas and by combining some expressions into single lines, but only at the cost of readability.)
That should cover your needs, but please also read my original answer below.
-- original answer
So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics of Unicode, for everyone, even people writing English-only programs who think they will never deal with any character outside the 26 Latin letters. When writing code that deals with different characters, it is vital: the program can't possibly work correctly, except by chance, without you knowing what you are doing.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters lose their meaning as single entities and as digits when encoded: the encoded object (the str type in Python 2.x) is just a string of bytes, which is nonetheless what is needed when sending these characters to any output from the program, be it the console, a GUI window, a database, HTML code, etc.
You can use persiantools package:
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'
unidecode converts all characters from Persian to English. If you want to change only the numbers, use the following instead.
In Python 3 you can use this code to convert any Persian or Arabic digit to its English equivalent while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)
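As a self-contained sketch of the same approach (the sample input here is made up for illustration):

```python
intab = '۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'   # Persian digits followed by Arabic digits
outtab = '12345678901234567890'
translation_table = str.maketrans(intab, outtab)
# Mixed Persian/Arabic digits; other characters pass through unchanged.
print('۱۲۳-٤٥٦'.translate(translation_table))  # 123-456
```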
Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.
With this you can easily do that:
def p2e(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    for i, j in number.items():
        persiannumber = persiannumber.replace(j, i)
    return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196
In Python 3 the easiest way is:
str(int('۱۲۳'))
#123
but if the number starts with 0, this loses the leading zero.
So we can use the zip() function instead, where number is the input string:
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
    number = number.replace(j, i)
def persian_number(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    for i, j in number.items():
        persiannumber = persiannumber.replace(i, j)
    return persiannumber
persiannumber must be a string
