escape a string which contains non-ascii - python

I have a string s = "\\u653e"
and I want to convert it into s = "\u653e".
To make it clear:
# this is what I want
>>s
>>'\u653e'
# this is not what I want, print will escape the string automatically
>>print s
>>\u653e
How can I do that?
The original problem is this:
I have a string s = u'\u653e', so [s] = [u'\u653e'].
I want to remove the u, that is, [s] = ['\u653e'],
so I used ast.literal_eval(json.dumps(r)), which gave me the string "\\u653e" shown above.
UPDATE
Thanks tdelaney
Creating a string from an entire list caused my problem. What I should do is start with a unicode string and build the list from its individual elements instead of from the entire list. For more details, see his answer.

s is a single unicode character. "\u653e" is a literal encoding that python uses to express unicode characters in ascii text. The unicode_escape codec converts between these types.
>>> s = u'\u653e'
>>> print type(s), len(s), s
<type 'unicode'> 1 放
>>> encoded = s.encode('unicode_escape')
>>> print type(encoded), len(encoded), encoded
<type 'str'> 6 \u653e
In your example just do
s = u'\u653e'
somelist = [s.encode('unicode_escape')]
>>> print somelist
['\\u653e']
>>> print somelist[0]
\u653e
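For readers on Python 3, the same round trip can be sketched as follows. This is a minimal sketch, assuming Python 3, where str is already Unicode and encode returns a bytes object; the variable names are illustrative only.

```python
# Python 3 sketch: unicode_escape converts between a character and its ASCII escape.
s = "\u653e"                                 # a single character
encoded = s.encode("unicode_escape")         # bytes holding backslash, u, 6, 5, 3, e
decoded = encoded.decode("unicode_escape")   # back to the single character

# Starting from text that holds a literal backslash, as in the question:
escaped_text = "\\u653e"
recovered = escaped_text.encode("ascii").decode("unicode_escape")

print(len(s), len(encoded), decoded == s, recovered == s)
```

The last two lines cover the situation in the question, where the backslash is an actual character in the string rather than part of an escape.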
update
From your comments, your problem may be how you create your command string. There seems to be confusion between the python representation of a string and the string itself. Use a unicode string to start with and build the list from its individual elements instead of the entire list.
>>> excel = [u'\u4e00', u'\u4e8c', u'\u4e09']
>>> cmd = u'create vertex v set s = [{}]'.format(u','.join(excel))
>>> cmd
u'create vertex v set s = [\u4e00,\u4e8c,\u4e09]'
>>> print cmd
create vertex v set s = [一,二,三]
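The difference between formatting the list's representation and joining its elements can be seen side by side in this small sketch. It assumes Python 3 and uses the built-in ascii() to mimic the escaped representation that Python 2 produces for a list of unicode strings; the variable names are made up for the illustration.

```python
# Python 3 sketch: representation of a list versus the joined elements.
excel = ["\u4e00", "\u4e8c", "\u4e09"]

# Formatting the representation of the whole list keeps quotes and escapes.
from_repr = "create vertex v set s = {}".format(ascii(excel))

# Joining the individual elements keeps the actual characters.
from_items = "create vertex v set s = [{}]".format(",".join(excel))

print(from_repr)
print(from_items)
```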

Related

Remove hex character from string in Python

I'm trying to remove one character from a string of hex values but I can't find a solution to make this work. If I print remove2 I get the string "\x42", but when I print buffer I get "ABCD". I don't understand why printing remove2 gives the hex format while printing buffer gives the ASCII format. I think this is the root of the problem. How could I fix this using Python 2?
>>> buffer = "\x41\x42\x43\x44"
>>> remove = raw_input("Enter hex value to remove: ")
Enter hex value to remove: 42
>>> remove2 = "\\x" + remove
>>> print buffer
ABCD
>>> print remove2
\x42
>>> buffer2 = buffer.replace(remove2, '')
>>> print buffer2
ABCD
I wish buffer2 = "\x41\x43\x44".
Here's the problem:
remove2 = "\\x" + remove
You can't programmatically build escape sequences like that. Instead, do this:
remove2 = chr(int(remove, 16))
Alternatively, you'd have to make buffer contain the literal backslashes instead of the escaped characters:
buffer = "\\x41\\x42\\x43\\x44"
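The same idea, turning the typed hex digits into the actual byte before calling replace, might look like this in Python 3. It is only a sketch: remove_hex_byte is a hypothetical helper name, and it relies on bytes.fromhex, which raises ValueError if the input is not valid hex.

```python
# Python 3 sketch: convert user-supplied hex digits to a byte, then remove it.
def remove_hex_byte(buffer, hex_value):
    """Return buffer without any occurrence of the byte named by hex_value."""
    target = bytes.fromhex(hex_value)   # "42" becomes the single byte 0x42
    return buffer.replace(target, b"")

buffer = b"\x41\x42\x43\x44"
buffer2 = remove_hex_byte(buffer, "42")
print(buffer2)
```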
The problem is that if you display remove2 without print, you'll see
>>> remove2
'\\x42'
The backslash stays in the string as a literal character instead of forming a hex escape. To get the actual character, decode the hex digits:
remove.decode('hex')
so the code becomes:
>>> buffer = "\x41\x42\x43\x44"
>>> remove = raw_input("Enter hex value to remove: ")
Enter hex value to remove: 42
>>> remove2=remove.decode('hex')
>>> buffer.replace(remove2, '')
'ACD'
Does that answer your question?
You will need to escape the \ in your buffer string, otherwise it will be treated as a hex escape. So,
>>> buffer = "\\x41\\x42\\x43"
>>> remove = "42"
>>> remove = "\\x" + remove
>>> buffer = buffer.replace(remove, '')
>>> print buffer  # prints \x41\x43
You can use filter() and construct a filtered bytes object using the user input of "42" and original bytes (just a string in Python2).
>>> inp = "42"
>>> filter(lambda x: x != chr(int(inp, 16)), 'ABCD')
'ACD'
Python 3
>>> inp = "42"
>>> bytes(filter(lambda x: x != int(inp, 16), b'ABCD'))
b'ACD'
Anyway, simpler to use replace(), this is just an alternative way to filter out specific values from a bytes object. It illustrates the basic idea other answers point out. The user input needs to be correctly converted to the value you intend to remove.
When the interpreter renders the output, values in bytes or str objects that correspond to printable ASCII characters are shown as those characters, without backslashes. If a value has no corresponding printable character, an escaped version of it is shown instead.

Python - How to convert utf literal such as '\xc3\xb6' to the letter ö

I am trying to convert an encoded url with german Umlaute into a string with those Umlaute.
Here is an example of an encoded string = 'K%C3%B6nnen'.
I would like to convert it to 'Können'
When I use urllib.unquote(a) I get this returned: 'K\xc3\xb6nnen'
\xc3\xb6 I found out is a utf literal.
How can I convert this to an ö ? I find that if I use the print function it converts it correctly, but I cannot figure out how to get a function to return this value? Any ideas?
With decode("utf-8")
print('K\xc3\xb6nnen'.decode("utf-8"))
OUTPUT
Können
One extra caveat: printing a list shows the representation of its elements, not the text itself.
>>> l = []
>>> l.append(s.decode("utf-8")) #s is the string
>>> l
[u'K\xf6nnen']
>>> print(l)
[u'K\xf6nnen']
>>> print(l[0])
Können
>>>
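Python stores the text as Unicode code points. Printing a list shows the repr of each element, with escapes such as \xf6, while printing the element itself shows the actual characters. Use repr(s) if you want to see the escaped representation of a single string.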
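On Python 3 the decoding step is usually folded into the unquote call itself. A minimal sketch, assuming Python 3 and the standard urllib.parse module (unquote decodes as UTF-8 by default, unquote_to_bytes returns the raw bytes):

```python
# Python 3 sketch: percent-decoding a URL component that contains an umlaut.
from urllib.parse import unquote, unquote_to_bytes

encoded = "K%C3%B6nnen"
text = unquote(encoded)            # percent-decoded and interpreted as UTF-8
raw = unquote_to_bytes(encoded)    # the undecoded UTF-8 bytes

print(text)
print(raw.decode("utf-8") == text)
```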

Python chinese characters

I have the following encoding in python 2.7:
["\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
I need to get the following (chinese characters) from that:
["做戏之说"]
Anyone knows how to decode the above to get that?
You need to decode your string:
>>> l = ["\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
>>> a = [l[0].decode('utf8')]
>>> print a[0]
做戏之说
If you want to show your Unicode inside the list you need to convert the standard representation of the list to unicode then print it:
>>> print unicode(repr([l[0].decode('utf8')]), 'unicode-escape')
[u'做戏之说']
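For comparison, a Python 3 version of the same decoding step could look like the sketch below. It assumes the data arrives as a bytes object; in Python 3 the list prints with the characters visible, so the unicode-escape trick is not needed there.

```python
# Python 3 sketch: decode UTF-8 bytes inside a list into text.
encoded_items = [b"\xe5\x81\x9a\xe6\x88\x8f\xe4\xb9\x8b\xe8\xaf\xb4"]
decoded_items = [item.decode("utf-8") for item in encoded_items]

print(decoded_items[0])
print(decoded_items)
```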

How do I lowercase a string in Python?

Is there a way to convert a string to lowercase?
"Kilometers" → "kilometers"
Use str.lower():
"Kilometer".lower()
The canonical Pythonic way of doing this is
>>> 'Kilometers'.lower()
'kilometers'
However, if the purpose is to do case insensitive matching, you should use case-folding:
>>> 'Kilometers'.casefold()
'kilometers'
Here's why:
>>> "Maße".casefold()
'masse'
>>> "Maße".lower()
'maße'
>>> "MASSE" == "Maße"
False
>>> "MASSE".lower() == "Maße".lower()
False
>>> "MASSE".casefold() == "Maße".casefold()
True
casefold() is a str method in Python 3. In Python 2, you'll want to look at PyICU or py2casefold; several other answers address this.
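Wrapped in a small helper, case-insensitive comparison with case folding might look like this. The function name equal_ignoring_case is hypothetical, and the sketch assumes Python 3, where str.casefold is available.

```python
# Python 3 sketch: compare two strings without regard to case.
def equal_ignoring_case(left, right):
    """Return True when the two strings match after case folding."""
    return left.casefold() == right.casefold()

print(equal_ignoring_case("MASSE", "Maße"))
print(equal_ignoring_case("Kilometers", "KILOMETERS"))
```

Case folding does not normalize composed and decomposed forms of the same character; if that matters, unicodedata.normalize can be applied before folding.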
Unicode Python 3
Python 3 handles plain string literals as unicode:
>>> string = 'Километр'
>>> string
'Километр'
>>> string.lower()
'километр'
Python 2, plain string literals are bytes
In Python 2, the below, pasted into a shell, encodes the literal as a string of bytes, using utf-8.
And lower doesn't map any changes that bytes would be aware of, so we get the same string.
>>> string = 'Километр'
>>> string
'\xd0\x9a\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> string.lower()
'\xd0\x9a\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> print string.lower()
Километр
In scripts, Python will object to non-ascii (as of Python 2.5, and warning in Python 2.4) bytes being in a string with no encoding given, since the intended coding would be ambiguous. For more on that, see the Unicode how-to in the docs and PEP 263
Use Unicode literals, not str literals
So we need a unicode string to handle this conversion, accomplished easily with a unicode string literal, which disambiguates with a u prefix (and note the u prefix also works in Python 3):
>>> unicode_literal = u'Километр'
>>> print(unicode_literal.lower())
километр
Note that the bytes are completely different from the str bytes - the escape character is '\u' followed by the 2-byte width, or 16 bit representation of these unicode letters:
>>> unicode_literal
u'\u041a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
>>> unicode_literal.lower()
u'\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
Now if we only have it in the form of a str, we need to convert it to unicode. Python's Unicode type is a universal encoding format that has many advantages relative to most other encodings. We can either use the unicode constructor or str.decode method with the codec to convert the str to unicode:
>>> unicode_from_string = unicode(string, 'utf-8') # "encoding" unicode from string
>>> print(unicode_from_string.lower())
километр
>>> string_to_unicode = string.decode('utf-8')
>>> print(string_to_unicode.lower())
километр
>>> unicode_from_string == string_to_unicode == unicode_literal
True
Both methods convert to the unicode type, and both results are equal to unicode_literal.
Best Practice, use Unicode
It is recommended that you always work with text in Unicode.
Software should only work with Unicode strings internally, converting to a particular encoding on output.
Can encode back when necessary
However, to get the lowercase back in type str, encode the python string to utf-8 again:
>>> print string
Километр
>>> string
'\xd0\x9a\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> string.decode('utf-8')
u'\u041a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
>>> string.decode('utf-8').lower()
u'\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
>>> string.decode('utf-8').lower().encode('utf-8')
'\xd0\xba\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> print string.decode('utf-8').lower().encode('utf-8')
километр
So in Python 2, Unicode can encode into Python strings, and Python strings can decode into the Unicode type.
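The decode, lower, encode round trip above carries over to Python 3 bytes objects. The following is a sketch under that assumption; lower_utf8 is a hypothetical helper name, and the input is assumed to be valid UTF-8.

```python
# Python 3 sketch: lowercase text that is stored as UTF-8 bytes.
def lower_utf8(data):
    """Decode UTF-8 bytes, lowercase the text, and encode it again."""
    return data.decode("utf-8").lower().encode("utf-8")

raw = "Километр".encode("utf-8")
lowered = lower_utf8(raw)

print(lowered.decode("utf-8"))
# bytes.lower() only touches ASCII letters, so it leaves this data unchanged.
print(raw.lower() == raw)
```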
With Python 2, str.lower() doesn't work for non-English words stored as UTF-8 bytes. In this case decode('utf-8') can help:
>>> s='Километр'
>>> print s.lower()
Километр
>>> print s.decode('utf-8').lower()
километр
Also, you can store the result in a variable:
s = input('UPPER CASE')
lower = s.lower()
Note that lower() returns a new string and leaves the original unchanged:
s = "Kilometer"
print(s.lower())  # kilometer
print(s)          # Kilometer
The lowercase form exists only in the value returned by the call.
Don't do this; it is not recommended:
import string
s='ABCD'
print(''.join([string.ascii_lowercase[string.ascii_uppercase.index(i)] for i in s]))
Output:
abcd
Since no one has mentioned it yet: you can use swapcase, which turns uppercase letters into lowercase and vice versa. Use it only when you really want both directions swapped:
s='ABCD'
print(s.swapcase())
Output:
abcd
Here is a summary of the possible methods:
.lower() method.
str.lower()
combination of str.translate() and str.maketrans()
.lower() method
original_string = "UPPERCASE"
lowercase_string = original_string.lower()
print(lowercase_string) # Output: "uppercase"
str.lower()
original_string = "UPPERCASE"
lowercase_string = str.lower(original_string)
print(lowercase_string) # Output: "uppercase"
combination of str.translate() and str.maketrans()
import string  # needed for ascii_uppercase and ascii_lowercase
original_string = "UPPERCASE"
lowercase_string = original_string.translate(str.maketrans(string.ascii_uppercase, string.ascii_lowercase))
print(lowercase_string) # Output: "uppercase"
lowercasing
This method not only converts all uppercase letters of the Latin alphabet into lowercase ones, but also shows how such logic is implemented. You can test this code in any online Python sandbox.
def turnIntoLowercase(string):
    lowercaseCharacters = ''
    abc = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',
           'A','B','C','D','E','F','G','H','I','J','K','L','M',
           'N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
    for character in string:
        if character not in abc:
            lowercaseCharacters += character
        elif abc.index(character) <= 25:
            lowercaseCharacters += character
        else:
            lowercaseCharacters += abc[abc.index(character) - 26]
    return lowercaseCharacters

string = str(input("Enter your string, please: " ))
print(turnIntoLowercase(string = string))
Performance check
Now, let's enter the following string (and press Enter) to make sure everything works as intended:
# Enter your string, please:
"PYTHON 3.11.2, 15TH FeB 2023"
Result:
"python 3.11.2, 15th feb 2023"
If you want to convert a list of strings to lowercase, you can map str.lower:
list_of_strings = ['CamelCase', 'in', 'Python']
list(map(str.lower, list_of_strings)) # ['camelcase', 'in', 'python']

python - problems with regular expression and unicode

Hi I have a problem in python. I try to explain my problem with an example.
I have this string:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
and I want, for example, to replace every character other than Ñ, Ã, ï with "".
I have tried:
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã
I got this � character.
I think it happens because these characters are represented in python by two positions in the string: for example \xc3\x91 = Ñ.
Because of this, when I apply the regular expression, the \xc3 bytes are not substituted. How can I do this kind of substitution?
Thanks
Franco
You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> rePat = re.compile(u'[^\xd1\xc3\xef]', re.UNICODE)
>>> print rePat.sub("", string)
ÑïÃ
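In Python 3, where str is Unicode and str patterns match Unicode by default, the original attempt works almost unchanged. A minimal sketch under that assumption, with the test string rebuilt from its code points so the example does not depend on the source file's encoding:

```python
# Python 3 sketch: keep only the characters Ñ, Ã and ï.
import re

# U+00D0..U+00FF followed by U+00C0..U+00C3, the same text as in the question.
code_points = list(range(0xD0, 0x100)) + list(range(0xC0, 0xC4))
text = "".join(chr(point) for point in code_points)

keep_only = re.compile("[^\u00d1\u00c3\u00ef]")   # anything except Ñ, Ã, ï
result = keep_only.sub("", text)

print(result)
```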
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode unicode streams (such as files) on-the-fly.
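Decoding while reading can be sketched as below. This assumes Python 3 and uses io.open with an explicit encoding argument; read_text and the temporary sample file are made up for the illustration, and "utf-8" stands in for whatever encoding the real file uses.

```python
# Python 3 sketch: read a file as text by naming its encoding up front.
import io
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Return the decoded text content of the file at path."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()

# Write a small sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with io.open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("\u00d1\u00ef\u00c3")
    content = read_text(sample_path)

print(len(content))
```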
Do a google search for python unicode. Python unicode handling can be a bit hard to grasp at first, it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html
