Python: re.split() display cyrillic result

I am trying to write a function that simply splits a string on any character that is not a letter or a digit. But I need to handle Cyrillic, and when I do that I get an output list with elements like '\xd0' instead of the non-Latin words.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

class Syntax():
    def __init__(self, string):
        self.string = string.encode('utf-8')
        self.list = None

    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)

if __name__ == '__main__':
    string = ur"Привет, мой друг test words."
    a = Syntax(string)
    a.split()
    print a.string, a.list
Console output:
Привет, мой друг test words.
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xbc\xd0\xbe\xd0\xb9', '\xd0', '\xd1', '\xd1', '\xd0\xb3', 'test', 'words', '']
Thanks for your help.

There are two problems here:
You're encoding the unicode value to a byte string in your Syntax constructor. In general you should leave text values as unicode (self.string = string, no encoding).
When you print a Python list, it calls repr on the elements, which is why the values show up as escape sequences. If you do
for x in a.list:
    print x
after making the first change, it'll print Cyrillic.
Edit: printing a list calls repr on the elements, not str. However, printing a string doesn't repr it: print x and print repr(x) yield different output. For strings, the repr is always something you can evaluate in Python to recover the value.
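For comparison, here is a minimal sketch of the same split under Python 3, where str is already Unicode text and \W is Unicode-aware by default for str patterns. The helper name split_words is hypothetical, and dropping the empty pieces is an assumption about the desired output, not part of the original code:

```python
import re

def split_words(text):
    # In Python 3, str patterns are Unicode-aware by default,
    # so \W+ treats Cyrillic letters as word characters.
    parts = re.split(r"\W+", text)
    # Drop the empty strings produced by leading/trailing separators.
    return [p for p in parts if p]

words = split_words("Привет, мой друг test words.")
for w in words:
    print(w)  # printing each element avoids the repr escapes
```

Printing the elements one by one, rather than the list itself, is what keeps the escapes out of the output.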

Related

Print all characters in a string

Is there a way to print all characters in python, even ones which usually aren't printed?
For example
>>>print_all("skip
line")
skip\nline
Looks like you want repr()
>>> """skip
... line"""
'skip\nline'
>>>
>>> print(repr("""skip
... line"""))
'skip\nline'
>>> print(repr("skip\tline"))
'skip\tline'
So, your function could be
print_all = lambda s: print(repr(s))
And for Python 2, you need from __future__ import print_function
Alternatively, format the string with "%r", which applies repr() to the value:
print("%r" % """skip
line""")
'skip\nline'
Additionally, use !r in a format call:
print("{0!r}".format("""skip
line"""))
for similar results.
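If the surrounding quotes that repr() adds are unwanted, one possible variation for Python 3 is the unicode_escape codec. This is a sketch, not part of the answers above: the helper name escaped is hypothetical, and note that this codec also escapes backslashes and non-ASCII characters:

```python
def escaped(s):
    # unicode_escape turns control and non-ASCII characters into
    # backslash escapes; decoding as ASCII gives back a str.
    return s.encode("unicode_escape").decode("ascii")

print(escaped("skip\nline"))  # escaped text without surrounding quotes
print(repr("skip\nline"))     # repr keeps the quotes
```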

How to convert \\xhh into \xhh python

I have encountered a case where I need to convert a string of escaped characters into the actual characters in Python.
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
print s #gives: \x80\x78\x07\x00\x75\xb3
What I want is, given the string s, to get the real characters stored in s, which in this case are "\x80", "\x78", "\x07", "\x00", "\x75", and "\xb3".
You can use string-escape encoding (Python 2.x):
>>> s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
>>> s.decode('string-escape')
'\x80x\x07\x00u\xb3'
Use unicode-escape encoding (in Python 3.x, need to convert to bytes first):
>>> s.encode().decode('unicode-escape')
'\x80x\x07\x00u³'
You can simply write a function that takes the string and returns the converted form, something like this:
def str_to_chr(s):
    res = ""
    s = s.split("\\")[1:]  # "\\x33\\x45" -> ["x33", "x45"]
    for i in s:
        res += chr(int('0' + i, 16))  # parse "0x33" as hex, then take the chr
    return res
Remember to print the return value of the function.
To find out what each line does, run that line on its own; if you still have questions, leave a comment and I'll answer.
Or you can build a string from the byte values, though they might not all be "printable" depending on your encoding. Example:
# -*- coding: utf-8 -*-
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"
r = ''
for byte in s.split('\\x'):
    if byte:  # to get rid of empties
        r += chr(int(byte, 16))  # convert to int from hex string first
print (r)  # given the example, not all bytes are printable chars in utf-8
HTH, Edwin
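In Python 3, where bytes and text are separate types, the same conversion can be sketched with bytes.fromhex. This assumes every character in the input is written as a \xhh escape, as in the example string; mixed input would need one of the approaches above:

```python
s = "\\x80\\x78\\x07\\x00\\x75\\xb3"

# Strip the literal "\x" markers, leaving only hex digits: "8078070075b3".
hex_digits = s.replace("\\x", "")

# bytes.fromhex converts each pair of hex digits into one byte.
raw = bytes.fromhex(hex_digits)
print(raw)
```

The result is a bytes object, which sidesteps the question of whether each value is printable in a given text encoding.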

print python dictionary with utf8 values

I have a dictionary with utf8 string values. I need to print it without any \xd1, \u0441 or u'string' symbols.
# -*- coding: utf-8 -*-
a = u'lang=русский'
# prints: lang=русский
print(a)
mydict = {}
mydict['string'] = a
mydict2 = repr(mydict).decode("unicode-escape")
# prints: {'string': u'lang=русский'}
print mydict2
expected
{'string': 'lang=русский'}
Is it possible without parsing the dictionary?
This question is related to "Python print unicode strings in arrays as characters, not code points", but I need to get rid of that annoying u.
I can't see a reasonable use case for this, but if you want a custom representation of a dictionary (or better said, a custom representation of a unicode object within a dictionary), you can roll it yourself:
def repr_dict(d):
    return '{%s}' % ',\n'.join("'%s': '%s'" % pair for pair in d.iteritems())
and then
print repr_dict({u'string': u'lang=русский'})
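If JSON-style output is acceptable, a possible alternative that avoids hand-building the representation is json.dumps with ensure_ascii=False. This is a sketch for Python 3 and assumes the dictionary holds only JSON-serializable values; note that JSON uses double quotes, so the result is not character-for-character the format asked for:

```python
import json

mydict = {"string": "lang=русский"}

# ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes.
text = json.dumps(mydict, ensure_ascii=False)
print(text)
```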

String.maketrans for English and Persian numbers

I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried encoding the strings as UTF-8, but I always got errors! Sometimes the Arabic string is the problem instead! Do you know a better solution for this job?
EDIT:
It seems the problem is the length of the Unicode characters in a byte string. An Arabic digit like '۱' is two characters long, which I found out with ord(). And the length problem starts from here :-(
See the unidecode library, which transliterates Unicode text into plain ASCII. It is very useful when numbers are entered in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
EDIT -
I came out with a way to make your replacement using Python2 regular expressions:
# coding: utf-8
import re

# Attention: while the characters in the strings below are
# displayed identically, internally they are represented
# by distinct unicode codepoints
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'
persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)

def _sub(match_object, digits):
    return english_numbers[digits.find(match_object.group(0))]

def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)

def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)

def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)

def replace_persian(text):
    return re.sub(persian_regexp, _sub_persian, text)
Note that the "text" parameter must itself be unicode.
(This code could also be shortened by using lambdas and combining some expressions into a single line, but there is no point in doing so other than losing readability.)
This should work for you as is, but please read on to the original answer I had posted.
-- original answer
So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters lose their sense as "single entities" and as digits when encoded: the encoded object (str type in Python 2.x) is just a string of bytes, which nonetheless is needed when sending these characters to any output from the program, be it console, GUI window, database, HTML code, etc.
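The same distinction can be sketched in Python 3, where text is str and encoded data is bytes. The digits are built from their code points here only to avoid look-alike typos; the variable names are illustrative:

```python
# ARABIC-INDIC DIGIT ONE..NINE are U+0661..U+0669.
arabic_digits = "".join(chr(0x0660 + d) for d in range(1, 10))

encoded = arabic_digits.encode("utf-8")

print(len(arabic_digits))  # number of code points
print(len(encoded))        # number of bytes (two per digit in UTF-8)
print(int(arabic_digits))  # int() accepts Unicode decimal digits in a str

# The encoded bytes are no longer recognizable as digits.
try:
    int(encoded)
    bytes_parse_ok = True
except ValueError:
    bytes_parse_ok = False
print(bytes_parse_ok)
```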
You can use persiantools package:
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'
unidecode converts all characters from Persian to English. If you want to change only the numbers, follow the approach below:
In python3 you can use this code to convert any Persian|Arabic number to English number while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)
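The original question asked for the opposite direction (English and Arabic digits to Persian). Under the same Python 3 str.maketrans approach, that could be sketched as follows; the names to_persian and to_persian_table are hypothetical, and the digit strings are built from code points to avoid mixing up the two look-alike digit sets:

```python
# Build the digit strings from code points to avoid look-alike typos.
persian_digits = "".join(chr(0x06F0 + d) for d in range(10))  # U+06F0..U+06F9
arabic_digits = "".join(chr(0x0660 + d) for d in range(10))   # U+0660..U+0669
english_digits = "0123456789"

# One table mapping both source scripts onto Persian digits.
to_persian_table = str.maketrans(english_digits + arabic_digits,
                                 persian_digits * 2)

def to_persian(text):
    # Characters that are not in the table are left unchanged.
    return text.translate(to_persian_table)

print(to_persian("tel: 0912 " + arabic_digits[3:6]))
```

A single table handles both source scripts in one pass, so the text does not need to be translated twice.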
Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.
With this you can easily do that:
def p2e(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    for i, j in number.items():
        persiannumber = persiannumber.replace(j, i)
    return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196
In Python 3 the easiest way is:
str(int('۱۲۳'))
#123
But if the number starts with 0 there is an issue: the leading zeros are dropped.
So we can use the zip() function instead:
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
    number = number.replace(i, j)
def persian_number(persiannumber):
    number = {
        '0': '۰',
        '1': '۱',
        '2': '۲',
        '3': '۳',
        '4': '۴',
        '5': '۵',
        '6': '۶',
        '7': '۷',
        '8': '۸',
        '9': '۹',
    }
    for i, j in number.items():
        persiannumber = persiannumber.replace(i, j)
    return persiannumber
persiannumber must be a string.

How to print container object with unicode-containing values?

The following code
# -*- coding: utf-8 -*-
x = (u'abc/αβγ',)
print x
print x[0]
print unicode(x).encode('utf-8')
print x[0].encode('utf-8')
...produces:
(u'abc/\u03b1\u03b2\u03b3',)
abc/αβγ
(u'abc/\u03b1\u03b2\u03b3',)
abc/αβγ
Is there any way to get Python to print
('abc/αβγ',)
that does not require me to build the string representation of the tuple myself? (By this I mean stringing together the "(", "'", encoded value, "'", ",", and ")".)
BTW, I'm using Python 2.7.1.
Thanks!
You could decode the str representation of your tuple with 'raw_unicode_escape'.
In [25]: print str(x).decode('raw_unicode_escape')
(u'abc/αβγ',)
I don't think so - the tuple's __repr__() is built-in, and AFAIK will just call the __repr__ for each tuple item. In the case of unicode chars, you'll get the escape sequences.
(Unless Gandaro's solution works for you - I couldn't get it to work in a plain python shell, but that could be either my locale settings, or that it's something special in ipython.)
The following should be a good start:
>>> x = (u'abc/αβγ',)
>>> S = type('S', (unicode,), {'__repr__': lambda s: s.encode('utf-8')})
>>> tuple(map(S, x))
(abc/αβγ,)
The idea is to make a subclass of unicode which has a __repr__() more to your liking.
Still trying to figure out how best to surround the result in quotes, this works for your example:
>>> S = type('S', (unicode,), {'__repr__': lambda s: "'%s'" % s.encode('utf-8')})
>>> tuple(map(S, x))
('abc/αβγ',)
... but it will look odd if there is a single quote in the string:
>>> S("test'data")
'test'data'
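For readers on Python 3 (the question itself is about Python 2.7.1), no subclass is needed: repr() of a str leaves printable non-ASCII characters as they are and chooses the quote style itself. A small sketch:

```python
x = ("abc/αβγ",)

# In Python 3, repr() of a str leaves printable non-ASCII characters as-is,
# so the tuple's repr already shows the Greek letters.
shown = repr(x)
print(shown)

# repr() also picks the quote style, so an embedded single quote stays readable.
quoted = repr("test'data")
print(quoted)
```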
