For clarification purposes, I am rewriting from scratch with additional information.
Consider the following:
y = hex(1200)
y
'0x4b0'
I need to replace that first 0 of y with a '\' to make it look like '\x04b0'. I am communicating with an instrument over RS-232 serial which takes parameters strictly in that format ('\xSumCharsHere'). Python won't let me do the following.
z = '\x' + y[2:]
ValueError: invalid \x escape
The following is not acceptable, because it still has '\\' in the actual value assigned to z.
z = '\\' + y[1:]
z
'\\x4b0'
The end goal is to send a command like this to my serial port:
s.write(z) # s is a serial object
s.write('\x04b0') # This call is an equivalent of the call above
s.write('\\x04b0') # This command will not work
Your last bit of code doesn't do what you think it does:
>>> x = hex(1200)
>>> y = '\\' + x[1: len(x)]
>>> y
'\\x4b0'
>>> print y
\x4b0
When you type the name of a variable in the Python console, Python prints the string's representation as Python code, which is why you see two backslashes -- a literal backslash in a Python string is escaped by another leading backslash. This code does in fact work, the representation of the result is just throwing you off.
However, I would suggest you use this snippet instead, since yours is omitting leading zeroes:
>>> y = '\\x%04x' % 1200
>>> print y
\x04b0
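For readers on Python 3, the same point can be checked without relying on how the console displays the value. This is only a minimal sketch; the names value and text are illustrative and not from the question.

```python
# Python 3 sketch: the doubled backslash exists only in the repr.
value = 1200
text = '\\x%04x' % value       # one backslash, 'x', then four hex digits

print(text)         # \x04b0
print(repr(text))   # '\\x04b0'
print(len(text))    # 6
```

The length of 6 confirms that the string holds a single backslash.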
Your last code bit is correct, and it can be alternatively written using a raw string:
y = r'\x' + x[2: len(x)]
As cdhowie said in his answer:
When you type the name of a variable in the Python console, Python prints the string's representation as Python code. This code does in fact work, the representation of the result is just throwing you off.
The raw string is an alternative to hand-writing the escaped backslash, and one I think is slightly better coding practice because it is more readable.
The latter will work. In the console, Python uses repr() to print objects, which in this case will show the double slash. Do print y in the console and you'll see that it outputs properly.
You can also clean up your first example a bit:
y = "\\x" + x[2:]
Or the second:
y = "\\" + x[1:]
If you are just trying to get the string \0x4b0 as the representation at the console, you need to actually call print on it at the console:
>>> s='\\0{}'.format(hex(1200)[1:])
>>> s
'\\0x4b0'
>>> print s
\0x4b0
>>> s2='\\0'+hex(1200)[1:]
>>> s2
'\\0x4b0'
>>> print s2
\0x4b0
If you just FORM the string in the console (i.e., it does not go through print), Python is showing you its representation:
>>> '\\0{}'.format(hex(1200)[1:])
'\\0x4b0'
>>> repr(s2)
"'\\\\0x4b0'"
>>> s2
'\\0x4b0'
Edit (based on your comment):
I assume this is an old HP plotter?
Don't be confused by what the shell is showing as your string.
You state that you want to produce a string of \x<someNumGoesHere> (or is it \x0<someNumGoesHere> with a leading 0?)
Here is how:
>>> def angle_string(angle):
...     return '\\x0{}'.format(hex(angle)[2:])
...
>>> angle_string(1200)
'\\x04b0'
>>> print _
\x04b0
>>> angle_string(33)
'\\x021'
>>> print _
\x021
When you send the string to your device (through the OS file/print like service to the RS232 port), it will be as you format it.
Edit 2
Escape-sequence processing is what turns the escapes in a string literal like this:
>>> s1
'\n\n\t\tline'
into real control characters when the string is printed:
>>> print s1


		line
Logically, these literal characters are single characters:
>>> s1[0]
'\n'
>>> len('\\')
1
My guess is that the way you have opened the serial port s treats strings in raw mode, so the string \\x0123 is being sent as is (raw mode) instead of being interpreted as \x0123.
As a workaround, you might try this:
>>> cmd=chr(92)+'0'+hex(1200)[1:]
>>> s.write(cmd)
I think you also need to open the serial port in FileLike mode so that the string literals are sent as proper single characters.
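The thread never states whether the instrument expects the six literal characters \x04b0 or the value itself as raw bytes, so the sketch below builds both. It assumes Python 3, where pyserial's write() is generally given bytes, and it assumes a big-endian two-byte layout for the packed form, which is a guess and not something the question confirms.

```python
import struct

value = 1200

# Option A: the six literal ASCII characters  \ x 0 4 b 0
literal = ('\\x%04x' % value).encode('ascii')

# Option B (assumption): the number itself as two raw bytes, big-endian
packed = struct.pack('>H', value)

print(literal)        # b'\\x04b0'
print(len(literal))   # 6
print(packed.hex())   # 04b0
```

Either object could then be handed to the port's write call; which one the device accepts has to be checked against its manual.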
Related
I have an API that is demanding that the quotation marks in my XML attributes are escaped, so <cmd_id="1"> will not work, it requires <cmd_id=\"1\">.
I have tried iterating through my string, for example:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'
Each time that I encounter a " (ascii 34) I will replace it with an escape character (ascii 92) and another quote. Infuriatingly this results in:
b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'
where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.
temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'
i = 0
j = 0
payload = bytearray(len(temp) + 4)
for char in temp:
    if char == 34:
        payload[i] = 92
        i += 1
        payload[i] = 34
        i += 1
        j += 1
    else:
        payload[i] = temp[j]
        i += 1
        j += 1
print(bytes(payload))
I would assume that character 92 would appear once but something is escaping the escape!
Your problem is the result of a very common misunderstanding for programmers new to Python.
When showing the representation of a string (or bytes) on the console, Python escapes the escape character (\) so that the text shown, when used in Python as a literal, would give you the exact same value.
So:
s = 'abc\\abc'
print(s)
Prints abc\abc, but on the interpreter you get:
>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'
Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.
Compare:
>>> repr(s)
"'abc\\\\abc'"
repr here prints the representation of the representation of s.
For bytes, things are further complicated because print shows the representation as well: print works on strings, so a bytes object needs to be decoded first, i.e.:
>>> print(some_bytes.decode('utf-8')) # or whatever the encoding is
In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
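One way to convince yourself is to look at the byte values directly instead of the representation. The sketch below uses a shortened, hypothetical payload standing in for the XML from the question.

```python
# Shortened stand-in for the escaped XML attribute from the question.
payload = b'<cmd id=\\"1\\">'

print(payload)                   # b'<cmd id=\\"1\\">'  (representation)
print(payload.decode('ascii'))   # <cmd id=\"1\">       (actual content)
print(payload.count(92))         # 2
print(list(payload[8:10]))       # [92, 34]
```

Byte 92 (the backslash) appears exactly once before each quote, which is what the API asked for.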
By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values, you can simply:
>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
I won't pretend that b"\\\"" is intuitive, perhaps b'\\"' is better - but both require that you understand the difference between the representation of a string and its printed value.
So, finally:
>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
When I assign a Windows path as a value in a dictionary, extra backslashes get added.
I did try using a raw string.
p = "c:\windows\pat.exe"
print p
c:\windows\pat.exe
d = {"p": p}
print d
{'p': 'c:\\windows\\pat.exe'}
Tried it as raw string
d = {"p": r"%s" % p}
print d
{'p': 'c:\\windows\\pat.exe'}
I don't want the extra backslashes to be added when the path is assigned as a value in the dictionary.
This is a mistake that's very common among people new to Python.
TL;DR:
>>> print "c:\windows\pat.exe" == 'c:\\windows\\pat.exe'
True
Explanation:
In the first instance, where you're assigning a value to the string p and then printing p, Python gets the string to print itself and it does so by outputting its literal value. In your example:
>>> p = "c:\windows\pat.exe"
>>> print p
c:\windows\pat.exe
In Python 3, the same:
>>> p = "c:\windows\pat.exe"
>>> print(p)
c:\windows\pat.exe
In the second instance, since you're creating and then printing a dictionary, Python asks the dictionary to print itself. It does so by printing a short Python code representation of itself, since there is no standard simple way of printing a dictionary, like there is for variables with simple types like strings or numbers.
In your example (slightly modified to work by itself):
>>> d = {"p": "c:\windows\pat.exe"}
>>> print d
{'p': 'c:\\windows\\pat.exe'}
So, why does the value of p in the Python code representation have the double backslashes? Because a single backslash in a string literal has an ambiguous meaning. In your example, it just so happens that \w and \p don't have special meanings in Python. However, you've maybe seen things like \n and perhaps \t used in strings to represent a new line or a tab character.
For example:
>>> print "Hello\nworld!"
Hello
world!
So how does Python know when to print a new line and when to print \n literally, when you want to? It doesn't. It just assumes that if the character after the \ doesn't make for a special character, you probably wanted to write a \ and if it is, you wanted to write the special character. If you want to literally write a \, regardless of what follows, you need to follow up the escape character (that's what the \ is called in this context) with another one.
For example:
>>> print "I can see \\n"
I can see \n
That way, there is no ambiguity and Python knows exactly what is intended. You should always write backslashes as double backslashes in normal string literals, instead of relying on luck in avoiding control characters like \n or \t. And that's why Python, when printing its code version of your string "c:\windows\pat.exe", prefers to write it as 'c:\\windows\\pat.exe', using single quotes (which it prefers, even though double quotes are fine too) and double backslashes.
It's just how it is written in code, "really" your string has single backslashes and the quotes are of course not part of it at all.
If you don't like having to write double backslashes, you can consider using 'raw strings', which is prefixing a string with r or R, telling Python to ignore special characters and take the string exactly as written in code:
>>> print r"This won't have \n a line break"
This won't have \n a line break
But watch out! This doesn't work if you want your last characters in the string to be an odd number of \, for reasons not worth getting into. In that case, you have no other recourse than writing the string with double backslashes:
>>> print r"Too bad\"
File "<stdin>", line 1
print r"Too bad\"
^
SyntaxError: EOL while scanning string literal
>>> print r"Too bad\\"
Too bad\\
>>> print "Too bad\\"
Too bad\
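As a quick check in Python 3, the value stored in the dictionary can be compared with the original string and its backslashes counted. This is a small sketch that reuses the names p and d from the question and uses a raw string to avoid any escape ambiguity.

```python
p = r"c:\windows\pat.exe"
d = {"p": p}

print(d)                     # {'p': 'c:\\windows\\pat.exe'}
print(d["p"])                # c:\windows\pat.exe
print(d["p"] == p)           # True
print(d["p"].count("\\"))    # 2
```

The count of 2 shows that nothing was added to the stored value; only the dictionary's displayed form doubles the backslashes.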
Maybe it is not a problem, because when you print the values (not the whole dictionary), each string has a single backslash:
p = "c:\windows\pat.exe"
d = {"p": p}
print (d)
{'p': 'c:\\windows\\pat.exe'}
for i in d:
    print("key:", i, " value:", d[i])
Output
{'p': 'c:\\windows\\pat.exe'}
key: p value: c:\windows\pat.exe
Well, character encoding and decoding sometimes frustrates me a lot.
So we know u'\u4f60\u597d' is the utf-8 encoding of 你好,
>>> print hellolist
[u'\u4f60\u597d']
>>> print hellolist[0]
你好
Now what I really want to get from the output or write to a file is [u'你好'], but it's [u'\u4f60\u597d'] all the time, so how do you do it?
When you print (or write to a file) a list, it internally calls the str() method of the list, but the list internally calls repr() on its elements. repr() returns the ugly unicode representation that you are seeing.
Example of repr -
>>> h = u'\u4f60\u597d'
>>> print h
你好
>>> print repr(h)
u'\u4f60\u597d'
You would need to manually take the elements of the list and print them for them to print correctly.
Example -
>>> h1 = [h,u'\u4f77\u587f']
>>> print u'[' + u','.join([u"'" + unicode(i) + u"'" for i in h1]) + u']'
For lists containing sublists that may have unicode characters, you would need a recursive function, for example -
>>> h1 = [h,(u'\u4f77\u587f',)]
>>> def listprinter(l):
...     if isinstance(l, list):
...         return u'[' + u','.join([listprinter(i) for i in l]) + u']'
...     elif isinstance(l, tuple):
...         return u'(' + u','.join([listprinter(i) for i in l]) + u')'
...     elif isinstance(l, (str, unicode)):
...         return u"'" + unicode(l) + u"'"
...
>>>
>>>
>>> print listprinter(h1)
To save them to file, use the same list comprehension or recursive function. Example -
import io
with io.open('<filename>', 'w', encoding='utf-8') as f:
    f.write(listprinter(l))
You are misunderstanding.
u'' in python is not utf-8, it is simply Unicode (except on Windows in Python <= 3.2, where it is utf-16 instead).
utf-8 is an encoding of Unicode, which is necessarily a sequence of bytes.
Additionally, u'你' and u'\u4f60' are exactly the same thing. It's simply that in Python2 the repr of high characters uses escapes instead of raw values.
Since Python2 is heading for EOL very soon now, you should start to think seriously about switching to Python3. It is a lot easier to keep track of all this in Python3 since there's only one string type and it's much more clear when you .encode and .decode.
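To illustrate the Python 3 behaviour mentioned above, here is a minimal sketch. It reuses the name hellolist from the question; text and data are illustrative names. The results are compared instead of printed directly so the example does not depend on the terminal's encoding.

```python
hellolist = ['\u4f60\u597d']

# On Python 3, str() of a list keeps printable characters as they are,
# so there are no \u escapes in the result.
text = str(hellolist)

# Encoding gives bytes that could be written to a file opened in 'wb' mode.
data = text.encode('utf-8')

print(text == "['\u4f60\u597d']")   # True
print(len(text), len(data))         # 6 10
```

The six characters become ten bytes because each of the two Chinese characters takes three bytes in UTF-8.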
with open("some_file.txt","wb") as f:
    f.write(hellolist[0].encode("utf8"))
I think this will resolve your issue; most text editors use UTF-8 encoding. :)
While the other answers are correct, none of them actually resolved your issue.
>>> u'\u4f60\u597d'.encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
if you want the brackets
>>> u'[u\u4f60\u597d,]'.encode("utf8")
One thing is the Unicode character itself:
hellolist = u'\u4f60'
and another is how you can represent it.
You can represent it in many many ways depending on where you are going to display.
Web: UTF-8
Database: maybe UTF-16 or UTF-8
Web in Japan: EUC-JP or Shift JIS
For example 本
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=672c
http://www.fileformat.info/info/unicode/char/672c/index.htm
I have a function like this:
persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'
english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)
text.translate(english_trans)
text.translate(arabic_trans)
I want it to translate all Arabic and English numbers to Persian. But Python says:
english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length
I tried to encode strings with Unicode utf-8 but I always got some errors! Sometimes the problem is Arabic string instead! Do you know a better solution for this job?
EDIT:
It seems the problem is the length of Unicode characters when they are stored as byte strings. An Arabic number like '۱' is two characters long (I found that out with ord()), and the length problem starts from there :-(
See the unidecode library, which transliterates Unicode text into plain ASCII. It is very useful when numbers are entered in different languages.
In Python 2:
>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
In Python 3:
>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'
Unicode objects can interpret these digits (arabic and persian) as actual digits -
no need to translate them by using character substitution.
EDIT -
I came out with a way to make your replacement using Python2 regular expressions:
# coding: utf-8
import re

# Attention: while the characters in the strings below are
# displayed identically, internally they are represented
# by distinct Unicode code points
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'

persian_regexp = u"(%s)" % u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)

def _sub(match_object, digits):
    # Map the matched digit to the English digit at the same position
    return english_numbers[digits.find(match_object.group(0))]

def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)

def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)

def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)

def replace_persian(text):
    # Match Persian digits and map them to English ones
    return re.sub(persian_regexp, _sub_persian, text)
Note that the "text" parameter must itself be unicode.
(This code could also be shortened by using lambdas and combining some expressions into a single line, but there is no point in doing so other than losing readability.)
It should work for you up to here, but please read on to the original answer I had posted.
-- original answer
So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:
>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>>
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>
By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).
It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.
A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now.
You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.
>>> arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'
Thus, the characters lose their sense as "single entities" and as digits when encoded. The encoded object (str type in Python 2.x) is just a string of bytes, which is nonetheless needed when sending these characters to any output from the program, be it a console, GUI window, database, HTML code, etc.
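The same experiment can be repeated on Python 3, where plain str is already Unicode. This is a sketch only; the digit string is built from code points (U+0660 is ARABIC-INDIC DIGIT ZERO) so the example does not depend on how the source file is encoded.

```python
# Arabic-Indic digits 1..9 followed by 0, built from code points.
arabic_numbers = ''.join(chr(0x0660 + d) for d in (1, 2, 3, 4, 5, 6, 7, 8, 9, 0))
encoded = arabic_numbers.encode('utf-8')

print(len(arabic_numbers))   # 10
print(len(encoded))          # 20
print(int(arabic_numbers))   # 1234567890

try:
    int(encoded)             # the raw UTF-8 bytes are no longer digits
except ValueError:
    print('ValueError')      # ValueError
```

Ten characters turn into twenty bytes, and only the decoded text is still understood as a number.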
You can use persiantools package:
Examples:
>>> from persiantools import digits
>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'
>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١") # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'
unidecode converts all characters from Persian to English. If you want to change only the numbers, do the following:
In python3 you can use this code to convert any Persian|Arabic number to English number while keeping other characters unchanged:
intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)
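The original question asked for the opposite direction (English and Arabic digits to Persian), which the same str.maketrans technique also covers. The following is a hedged Python 3 sketch; the names PERSIAN, ARABIC, ENGLISH, TO_PERSIAN and to_persian_digits are made up for this example, and the digit strings are built from code points to avoid look-alike typos.

```python
# Digit strings in the order 0..9, built from their code points.
PERSIAN = ''.join(chr(0x06F0 + i) for i in range(10))   # U+06F0..U+06F9
ARABIC = ''.join(chr(0x0660 + i) for i in range(10))    # U+0660..U+0669
ENGLISH = '0123456789'

# Both source alphabets map onto the Persian digits.
TO_PERSIAN = str.maketrans(ENGLISH + ARABIC, PERSIAN * 2)

def to_persian_digits(text):
    """Replace English and Arabic-Indic digits with Persian digits."""
    return text.translate(TO_PERSIAN)

# ascii() keeps the output readable on any terminal.
print(ascii(to_persian_digits('Room 42')))   # 'Room \u06f4\u06f2'
```

Because both arguments to str.maketrans are Unicode strings of equal length, the "arguments must have same length" error from the question does not occur.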
Use Unicode Strings:
persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers = u'١٢٣٤٥٦٧٨٩٠'
And make sure the encoding of your Python file is correct.
With this you can easily do that:
def p2e(persiannumber):
    number={
        '0':'۰',
        '1':'۱',
        '2':'۲',
        '3':'۳',
        '4':'۴',
        '5':'۵',
        '6':'۶',
        '7':'۷',
        '8':'۸',
        '9':'۹',
    }
    for i,j in number.items():
        persiannumber=persiannumber.replace(j,i)
    return persiannumber
here is usage:
print(p2e('۳۱۹۶'))
#returns 3196
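In Python 3 the easiest way is:
str(int('۱۲۳'))
# '123'
but this has an issue if the number starts with 0, because the leading zeros are dropped.
So we can use the zip() function instead:
for i, j in zip('1234567890', '۱۲۳۴۵۶۷۸۹۰'):
    # str.replace returns a new string, so the result must be assigned.
    # This replaces each English digit i with the Persian digit j;
    # swap the arguments for the other direction.
    number = number.replace(i, j)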
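Another way to keep leading zeros when converting to English digits is to ask the Unicode database for each character's decimal value. This is a sketch under the assumption of Python 3; to_ascii_digits is a hypothetical helper name, while str.isdecimal() and unicodedata.decimal() are standard library calls.

```python
import unicodedata

def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    return ''.join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )

persian = '\u06f0\u06f1\u06f2'        # Persian digits 0, 1, 2
print(to_ascii_digits(persian))       # 012
print(to_ascii_digits('a\u0663b'))    # a3b
```

Unlike str(int(...)), this works character by character, so a leading zero survives and non-digit characters pass through unchanged.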
def persian_number(persiannumber):
    number={
        '0':'۰',
        '1':'۱',
        '2':'۲',
        '3':'۳',
        '4':'۴',
        '5':'۵',
        '6':'۶',
        '7':'۷',
        '8':'۸',
        '9':'۹',
    }
    # Replace each English digit with its Persian counterpart
    for i,j in number.items():
        persiannumber=persiannumber.replace(i,j)
    return persiannumber
persiannumber must be a string
I am interested in taking in a single character.
c = 'c' # for example
hex_val_string = char_to_hex_string(c)
print hex_val_string
output:
63
What is the simplest way of going about this? Any predefined string library stuff?
There are several ways of doing this:
>>> hex(ord("c"))
'0x63'
>>> format(ord("c"), "x")
'63'
>>> import codecs
>>> codecs.encode(b"c", "hex")
b'63'
On Python 2, you can also use the hex encoding like this (doesn't work on Python 3+):
>>> "c".encode("hex")
'63'
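Wrapped up as the function the question asks for, a minimal Python 3 sketch could look like this. The name char_to_hex_string comes from the question; the two-digit zero padding and the length check are assumptions added here.

```python
def char_to_hex_string(c):
    """Return the code point of a single character as lowercase hex."""
    if len(c) != 1:
        raise ValueError('expected exactly one character')
    # '02x' pads to at least two hex digits (an assumption, not required)
    return format(ord(c), '02x')

print(char_to_hex_string('c'))    # 63
print(char_to_hex_string('\n'))   # 0a
```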
This might help
import binascii
x = b'test'
x = binascii.hexlify(x)
y = str(x,'ascii')
print(x) # Outputs b'74657374' (hex encoding of "test")
print(y) # Outputs 74657374
x_unhexed = binascii.unhexlify(x)
print(x_unhexed) # Outputs b'test'
x_ascii = str(x_unhexed,'ascii')
print(x_ascii) # Outputs test
This code contains examples for converting ASCII characters to and from hexadecimal. In your situation, the line you'd want to use is str(binascii.hexlify(c),'ascii').
Considering your input string is in the inputString variable, you could simply apply .encode('utf-8').hex() function on top of this variable to achieve the result.
inputString = "Hello"
outputString = inputString.encode('utf-8').hex()
The result from this will be 48656c6c6f.
You can do this:
your_letter = input()
def ascii2hex(source):
    return hex(ord(source))
print(ascii2hex(your_letter))
For extra information, go to:
https://www.programiz.com/python-programming/methods/built-in/hex
to get ascii code use ord("a");
to convert ascii to character use chr(97)