The following code
# -*- coding: utf-8 -*-
x = (u'abc/αβγ',)
print x
print x[0]
print unicode(x).encode('utf-8')
print x[0].encode('utf-8')
...produces:
(u'abc/\u03b1\u03b2\u03b3',)
abc/αβγ
(u'abc/\u03b1\u03b2\u03b3',)
abc/αβγ
Is there any way to get Python to print
('abc/αβγ',)
that does not require me to build the string representation of the tuple myself? (By this I mean stringing together the "(", "'", encoded value, "'", ",", and ")".)
BTW, I'm using Python 2.7.1.
Thanks!
You could decode the str representation of your tuple with 'raw_unicode_escape'.
In [25]: print str(x).decode('raw_unicode_escape')
(u'abc/αβγ',)
I don't think so - the tuple's __repr__() is built-in, and AFAIK will just call the __repr__ for each tuple item. In the case of unicode chars, you'll get the escape sequences.
(Unless Gandaro's solution works for you - I couldn't get it to work in a plain python shell, but that could be either my locale settings, or that it's something special in ipython.)
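For comparison, here is a minimal sketch of the same tuple on Python 3. This is an assumption on my part, since the question is about 2.7.1, so it only helps if upgrading is an option. There, repr() keeps printable non-ASCII characters, and ascii() gives back the escaped form:

```python
# Python 3 only: repr() no longer escapes printable non-ASCII characters.
x = ('abc/\u03b1\u03b2\u03b3',)      # the same value as ('abc/αβγ',)

tuple_repr = repr(x)                 # the Greek letters stay readable
escaped_repr = ascii(x)              # the old, escaped style on request

print(tuple_repr == "('abc/\u03b1\u03b2\u03b3',)")
print(escaped_repr)
```

The variable names are only illustrative. On a UTF-8 terminal, print(x) then shows the tuple without any manual assembly.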
The following should be a good start:
>>> x = (u'abc/αβγ',)
>>> S = type('S', (unicode,), {'__repr__': lambda s: s.encode('utf-8')})
>>> tuple(map(S, x))
(abc/αβγ,)
The idea is to make a subclass of unicode which has a __repr__() more to your liking.
Still trying to figure out how best to surround the result in quotes, this works for your example:
>>> S = type('S', (unicode,), {'__repr__': lambda s: "'%s'" % s.encode('utf-8')})
>>> tuple(map(S, x))
('abc/αβγ',)
... but it will look odd if there is a single quote in the string:
>>> S("test'data")
'test'data'
Is there a way to print all characters in python, even ones which usually aren't printed?
For example
>>>print_all("skip
line")
skip\nline
Looks like you want repr()
>>> """skip
... line"""
'skip\nline'
>>>
>>> print(repr("""skip
... line"""))
'skip\nline'
>>> print(repr("skip\tline"))
'skip\tline'
So, your function could be
print_all = lambda s: print(repr(s))
And for Python 2, you need from __future__ import print_function
Alternatively, format the string with "%r", which applies repr() to the value (this is not a "raw string"; that term refers to r'...' literals):
print("%r" % """skip
line""")
'skip\nline'
Additionally, use !r in a format call:
print("{0!r}".format("""skip
line"""))
for similar results.
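Putting the three spellings side by side, as a small sketch (show_all is a hypothetical helper name, not something from the answers above). All of them go through repr(), so they should produce the same text:

```python
def show_all(s):
    """Return s with non-printing characters written as escape sequences."""
    return repr(s)

text = "skip\nline\twith tab"

via_repr = show_all(text)
via_percent = "%r" % text          # %r applies repr()
via_format = "{0!r}".format(text)  # !r applies repr() as well

print(via_repr)
```

Returning the string instead of printing it keeps the helper usable when the result needs to go to a log or a file.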
Well, character encoding and decoding sometimes frustrates me a lot.
So we know u'\u4f60\u597d' is the utf-8 encoding of 你好,
>>> print hellolist
[u'\u4f60\u597d']
>>> print hellolist[0]
你好
Now what I really want to get from the output or write to a file is [u'你好'], but it's [u'\u4f60\u597d'] all the time, so how do you do it?
When you print (or write to a file) a list, it internally calls the str() method of the list, but the list internally calls repr() on its elements. repr() returns the escaped unicode representation that you are seeing.
Example of repr -
>>> h = u'\u4f60\u597d'
>>> print h
你好
>>> print repr(h)
u'\u4f60\u597d'
You would need to manually take the elements of the list and print them for them to print correctly.
Example -
>>> h1 = [h,u'\u4f77\u587f']
>>> print u'[' + u','.join([u"'" + unicode(i) + u"'" for i in h1]) + u']'
For lists containing sublists that may have unicode characters, you would need a recursive function, for example -
>>> h1 = [h,(u'\u4f77\u587f',)]
>>> def listprinter(l):
...     if isinstance(l, list):
...         return u'[' + u','.join([listprinter(i) for i in l]) + u']'
...     elif isinstance(l, tuple):
...         return u'(' + u','.join([listprinter(i) for i in l]) + u')'
...     elif isinstance(l, (str, unicode)):
...         return u"'" + unicode(l) + u"'"
...
>>>
>>>
>>> print listprinter(h1)
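To save them to a file, use the same list comprehension or recursive function, and open the file with an explicit encoding so the unicode result can be written. Example -
import io
with io.open('<filename>', 'w', encoding='utf-8') as f:
    f.write(listprinter(l))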
You are misunderstanding.
u'' in python is not utf-8, it is simply Unicode (except on Windows in Python <= 3.2, where it is utf-16 instead).
utf-8 is an encoding of Unicode, which is necessarily a sequence of bytes.
Additionally, u'你' and u'\u4f60' are exactly the same thing. It's simply that in Python2 the repr of high characters uses escapes instead of raw values.
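A short sketch of that distinction, written for Python 3 (an assumption; on Python 2 the literals would need the u prefix). The escaped and the literal spelling are the same two code points, and UTF-8 only appears once you call .encode():

```python
escaped = '\u4f60\u597d'
literal = '你好'

same_text = (escaped == literal)       # two spellings of one string
code_points = len(escaped)             # counts characters, not bytes

utf8_bytes = escaped.encode('utf-8')   # this is the actual UTF-8 form
round_trip = utf8_bytes.decode('utf-8')

print(same_text, code_points, utf8_bytes)
```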
Since Python2 is heading for EOL very soon now, you should start to think seriously about switching to Python3. It is a lot easier to keep track of all this in Python3 since there's only one string type and it's much more clear when you .encode and .decode.
with open("some_file.txt","wb") as f:
    f.write(hellolist[0].encode("utf8"))
I think this will resolve your issue.
Most text editors use UTF-8 encoding :)
While the other answers are correct, none of them actually resolved your issue.
>>> u'\u4f60\u597d'.encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
if you want the brackets
>>> u'[u\u4f60\u597d,]'.encode("utf8")
one thing is the unicode character itself
hellolist = u'\u4f60\u597d'
and another is how you can represent it.
You can represent it in many many ways depending on where you are going to display.
Web: UTF-8
Database: maybe UTF-16 or UTF-8
Web in Japan: EUC-JP or Shift JIS
For example 本
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=672c
http://www.fileformat.info/info/unicode/char/672c/index.htm
I am trying to write a function that simply splits a string on any symbol that is not a letter or a number. But I need to use Cyrillic, and when I do that I get an output list with elements like '\xd0' instead of the non-Latin words.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
class Syntax():
    def __init__(self, string):
        self.string = string.encode('utf-8')
        self.list = None
    def split(self):
        self.list = re.split(ur"\W+", self.string, flags=re.U)
if __name__ == '__main__':
    string = ur"Привет, мой друг test words."
    a = Syntax(string)
    a.split()
    print a.string, a.list
Console output:
Привет, мой друг test words.
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xbc\xd0\xbe\xd0\xb9', '\xd0', '\xd1', '\xd1', '\xd0\xb3', 'test', 'words', '']
Thanks for your help.
There are two problems here:
You're coercing unicode to string in your Syntax constructor. In general you should leave text values as unicode. (self.string = string, no encoding).
When you print a Python list it's calling repr on the elements, causing the unicode to be coerced to those values. If you do
for x in a.list:
    print x
after making the first change, it'll print Cyrillic.
Edit: printing a list calls repr on the elements, not string. However, printing a string doesn't repr it - print x and print repr(x) yield different values. For strings, the repr is always something you can evaluate in Python to recover the value.
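For reference, a minimal Python 3 sketch of the same split (split_words is a hypothetical name; the question's code is Python 2). On Python 3, str patterns are Unicode-aware by default, so no encode step or re.U flag is needed:

```python
import re

def split_words(text):
    """Split on runs of non-word characters and drop empty pieces."""
    # For str patterns, \W already treats Cyrillic letters as word characters.
    return [word for word in re.split(r"\W+", text) if word]

words = split_words("Привет, мой друг test words.")
print(len(words))
```

Filtering out empty strings removes the trailing '' that re.split produces when the text ends with punctuation.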
For clarification purposes, I am rewriting from scratch with additional information.
Consider the following:
y = hex(1200)
y
'0x4b0'
I need to replace that first 0 of y with a '\' to make it look like '\x04b0'. I am communicating with an instrument over RS-232 serial which takes parameters strictly in that format ('\xSumCharsHere'). Python won't let me do the following.
z = '\x' + y[2:]
ValueError: invalid \x escape
The following is not acceptable, because it still has '\\' in the actual value assigned to z.
z = '\\' + y[1:]
z
'\\x4b0'
The end goal is to send a command like this to my serial port:
s.write(z) # s is a serial object
s.write('\x04b0') # This call is an equivalent of the call above
s.write('\\x04b0') # This command will not work
Your last bit of code doesn't do what you think it does:
>>> x = hex(1200)
>>> y = '\\' + x[1: len(x)]
>>> y
'\\x4b0'
>>> print y
\x4b0
When you type the name of a variable in the Python console, Python prints the string's representation as Python code, which is why you see two backslashes -- a literal backslash in a Python string is escaped by another leading backslash. This code does in fact work, the representation of the result is just throwing you off.
However, I would suggest you use this snippet instead, since yours is omitting leading zeroes:
>>> y = '\\x%04x' % 1200
>>> print y
\x04b0
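It may also help to compare what the different spellings actually contain. The sketch below is Python 3 and makes no claim about which form the instrument expects; that is an assumption to check against its manual. The text form is six characters, the literal '\x04b0' is three characters (a 0x04 control character followed by "b" and "0"), and struct.pack produces the two raw bytes 0x04 0xb0:

```python
import struct

value = 1200

text_form = '\\x%04x' % value           # backslash, x, 0, 4, b, 0
literal_form = '\x04b0'                 # \x04 is ONE character, then 'b', '0'
binary_form = struct.pack('>H', value)  # big-endian unsigned 16-bit integer

print(len(text_form), len(literal_form), binary_form)
```

If the device wants binary data, the struct.pack form is probably the one to send; if it parses text commands, the six-character form is.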
Your last code bit is correct, and it can be alternatively written using a raw string:
y = r'\x' + x[2: len(x)]
As cdhowie said in his answer:
When you type the name of a variable in the Python console, Python prints the string's representation as Python code. This code does in fact work, the representation of the result is just throwing you off.
This is an alternative for hand-writing escape sequences, however, and one I think is slightly better coding practice as it is much more readable.
The latter will work. In the console, Python uses repr() to print objects, which in this case will show the double slash. Do print y in the console and you'll see that it outputs properly.
You can also clean up your first example a bit:
y = "\\x" + x[2:]
Or the second:
y = "\\" + x[1:]
If you are just trying to get the string \0x4b0 as the representation at the console, you need to actually call print on it at the console:
>>> s='\\0{}'.format(hex(1200)[1:])
>>> s
'\\0x4b0'
>>> print s
\0x4b0
>>> s2='\\0'+hex(1200)[1:]
>>> s2
'\\0x4b0'
>>> print s2
\0x4b0
If you just FORM the string in the console (i.e., it does not go through print), Python is showing you its representation:
>>> '\\0{}'.format(hex(1200)[1:])
'\\0x4b0'
>>> repr(s2)
"'\\\\0x4b0'"
>>> s2
'\\0x4b0'
Edit (based on your comment):
I assume this is an old HP plotter?
Don't be confused by what the shell is showing as your string.
You state that you want to produce a string of \x<someNumGoesHere> (or is it \x0<someNumGoesHere> with a leading 0?)
Here is how:
>>> def angle_string(angle):
...     return '\\x0{}'.format(hex(angle)[2:])
...
>>> angle_string(1200)
'\\x04b0'
>>> print _
\x04b0
>>> angle_string(33)
'\\x021'
>>> print _
\x021
When you send the string to your device (through the OS file/print like service to the RS232 port), it will be as you format it.
Edit 2
Escape-sequence processing is the process where string literals like this:
>>> s1
'\n\n\t\tline'
Get translated to this (two blank lines, then two tabs before the word):
>>> print s1


		line
Logically, these literal characters are single characters:
>>> s1[0]
'\n'
>>> len('\\')
1
My guess is that the way you have opened the serial port s is using the strings in raw mode, so the string \\x0123 is being sent that way (raw mode) vs being interpreted as \x0123
You might try as a work around this:
>>> cmd=chr(92)+'0'+hex(1200)[1:]
>>> s.write(cmd)
I think you also need to open the serial port in FileLike mode so that the string literals are sent as proper single characters.
I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried to encode strings with Unicode utf-8 but I always got some errors! Sometimes the problem is Arabic string instead! Do you know a better solution for this job?
EDIT:
It seems the problem is Unicode characters length in ASCII. An Arabic number like '۱' is two character -- that I find out with ord(). And the length problem starts from here :-(
See the unidecode library, which transliterates Unicode text into ASCII. It is very useful when numbers are entered in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
EDIT -
I came out with a way to make your replacement using Python2 regular expressions:
# coding: utf-8
import re
# Attention: while the characters for the strings below are
# displayed identically, internally they are represented
# by distinct unicode codepoints
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'
persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)
def _sub(match_object, digits):
    return english_numbers[digits.find(match_object.group(0))]
def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)
def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)
def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)
def replace_persian(text):
    # Match against the Persian pattern here, not the Arabic one
    return re.sub(persian_regexp, _sub_persian, text)
Note that the "text" parameter must itself be unicode.
(This code could also be shortened
by using lambdas and combining some expressions into a single line, but there is no point in doing so other than losing readability.)
It should work for you up to here, but please read on to the original answer I had posted
-- original answer
So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters lose their meaning as "single entities" and as digits when encoded - the encoded object (str type in Python 2.x) is just a string of bytes - which nonetheless is needed when sending these characters to any output from the program - be it console, GUI window, database, HTML code, etc...
You can use persiantools package:
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'
unidecode converts all characters from Persian to English. If you want to change only the numbers, follow the approach below:
In python3 you can use this code to convert any Persian|Arabic number to English number while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)
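The same technique also covers the direction the question asked for (English and Arabic digits to Persian). This is a sketch for Python 3; the constant and function names are my own, and the digit strings are built from code points (U+06F0 to U+06F9 for Persian, U+0660 to U+0669 for Arabic-Indic) so that look-alike characters cannot be mixed up:

```python
ENGLISH_DIGITS = '0123456789'
PERSIAN_DIGITS = ''.join(chr(0x06F0 + i) for i in range(10))
ARABIC_DIGITS = ''.join(chr(0x0660 + i) for i in range(10))

# Both source alphabets map onto the same ten Persian digits.
TO_PERSIAN = str.maketrans(ENGLISH_DIGITS + ARABIC_DIGITS, PERSIAN_DIGITS * 2)

def to_persian(text):
    """Replace English and Arabic-Indic digits with Persian digits."""
    return text.translate(TO_PERSIAN)

converted = to_persian('Room 2024, \u0664\u0662')
print(ascii(converted))
```

Because str.maketrans works on characters rather than bytes, the "arguments must have same length" error from the question does not come up here.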
Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.
With this you can easily do that:
def p2e(persiannumber):
    number={
        '0':'۰',
        '1':'۱',
        '2':'۲',
        '3':'۳',
        '4':'۴',
        '5':'۵',
        '6':'۶',
        '7':'۷',
        '8':'۸',
        '9':'۹',
    }
    for i,j in number.items():
        persiannumber=persiannumber.replace(j,i)
    return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196
In Python 3 easiest way is:
str(int('۱۲۳'))
#123
but if the number starts with 0, the leading zeros are lost,
so we can use the zip() function instead:
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
    number = number.replace(i, j)
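If the goal is ASCII digits with leading zeros kept, one option is to convert character by character instead of going through int(). This is a Python 3 sketch with a hypothetical function name; it relies on unicodedata.decimal(), which returns the digit value for any Unicode decimal digit and the supplied default otherwise:

```python
import unicodedata

def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    pieces = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        pieces.append(ch if value is None else str(value))
    return ''.join(pieces)

# Persian 0, 1, 2, then Arabic-Indic 0, 9, then plain text
result = to_ascii_digits('\u06f0\u06f1\u06f2-\u0660\u0669 abc')
print(result)
```

Non-digit characters pass through unchanged, so the function can be applied to whole sentences.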
def persian_number(persiannumber):
    number={
        '0':'۰',
        '1':'۱',
        '2':'۲',
        '3':'۳',
        '4':'۴',
        '5':'۵',
        '6':'۶',
        '7':'۷',
        '8':'۸',
        '9':'۹',
    }
    # Replace each English digit with its Persian counterpart
    for i,j in number.items():
        persiannumber=persiannumber.replace(i,j)
    return persiannumber
persiannumber must be a string