I should not expect any error here. I just want to take the string literaly and translate it into its bytes. I don't want to encode or decode anything.
I am taking here a stupid example:
>>> astring
u'\xb0'
Stupid enough to give me headache...
>>> bytes(astring)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position...
One horrible trick is to do this:
>>> bytes(repr(astring)[2:-1])
'\xb0'
One other bad solution is:
>>> bytes(astring.encode("utf-8"))
'\xc2\xb0'
It is a bad solution because my string is not composed of two chars. This is wrong.
Another awful solution would be:
>>> bytes(''.join(map(bytes, [chr(ord(c)) for c in astring])))
'\xb0'
I am using Python 2.7
Background
I would like to compare two columns on a database where the encoding is unknown and sometime conflicting. I don't care about wrong chars on my dump. I just want to get it to have a look at it.
If your Unicode strings are guaranteed to only contain codepoints < 256 then you can convert them to bytes using the Latin1 encoding. Here's some Python 2 code that performs this conversion on all codepoints in range(256).
r = range(256)
s = u''.join([unichr(i) for i in r])
print repr(s)
b = s.encode('latin1')
print repr(b)
a = [ord(c) for c in b]
print a == r
output
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
FWIW, here's the equivalent Python 3 code.
r = range(256)
s = u''.join([chr(i) for i in r])
print(repr(s))
b = s.encode('latin1')
print(repr(b))
print(list(b) == list(r))
output
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
True
Note that the Python 3 Unicode repr output is a little more human-friendly.
You cannot just 'take the string literally' because the actual, internal, bytes representation of your string is not fixed and is an implementation detail of the your python interpreter that your should not rely on (see PEP3993, on the same system different string can use different internal encoding).
That also means that to get a byte representation of you string, you really need to encode it, and thus specify the encoding.
By the way, astring.encode("utf-8") is not wrong (and already returns a bytes, you don't need the extra bytes(...) in your code), as in utf-8 a single character can be represented as several bytes.
You should be able to just add b before the quotes of the string.
>>> astring = b'\xb0'
>>> astring
b'\xb0'
>>> bytes(astring)
b'\xb0'
>>>
Putting b before the string makes it a bytes object. No more UnicodeEncodeError.
I want to encode string to bytes.
To convert to byes, I used byte.fromhex()
>>> byte.fromhex('7403073845')
b't\x03\x078E'
But it displayed some characters.
How can it be displayed as hex like following?
b't\x03\x078E' => '\x74\x03\x07\x38\x45'
I want to encode string to bytes.
bytes.fromhex() already transforms your hex string into bytes. Don't confuse an object and its text representation -- REPL uses sys.displayhook that uses repr() to display bytes in ascii printable range as the corresponding characters but it doesn't affect the value in any way:
>>> b't' == b'\x74'
True
Print bytes to hex
To convert bytes back into a hex string, you could use bytes.hex method since Python 3.5:
>>> b't\x03\x078E'.hex()
'7403073845'
On older Python version you could use binascii.hexlify():
>>> import binascii
>>> binascii.hexlify(b't\x03\x078E').decode('ascii')
'7403073845'
How can it be displayed as hex like following? b't\x03\x078E' => '\x74\x03\x07\x38\x45'
>>> print(''.join(['\\x%02x' % b for b in b't\x03\x078E']))
\x74\x03\x07\x38\x45
The Python repr can't be changed. If you want to do something like this, you'd need to do it yourself; bytes objects are trying to minimize spew, not format output for you.
If you want to print it like that, you can do:
from itertools import repeat
hexstring = '7403073845'
# Makes the individual \x## strings using iter reuse trick to pair up
# hex characters, and prefixing with \x as it goes
escapecodes = map(''.join, zip(repeat(r'\x'), *[iter(hexstring)]*2))
# Print them all with quotes around them (or omit the quotes, your choice)
print("'", *escapecodes, "'", sep='')
Output is exactly as you requested:
'\x74\x03\x07\x38\x45'
Given a character code as integer number in one encoding, how can you get the character code in, say, utf-8 and again as integer?
UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point.
>>> ord(chr(145).decode('koi8-r'))
9618
You can only map an "integer number" from one encoding to another if they are both single-byte encodings.
Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164
Note that ord is here being used to get the ordinal number of the encoded byte. Using ord on the original unicode string would give its unicode code point:
>>> ord(s)
8364
The reverse operation to ord can be done using either chr (for codes in the range 0 to 127) or unichr (for codes in the range 0 to sys.maxunicode):
>>> print chr(65)
A
>>> print unichr(8364)
€
For multi-byte encodings, a simple "integer number" mapping is usually not possible.
Here's the same example as above, but using "iso-8859-15" and "utf-8":
>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]
The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
Here's an example of how the encode/decode dance works:
>>> s = b'd\x06' # perhaps start with bytes encoded in utf-16
>>> map(ord, s) # show those bytes as integers
[100, 6]
>>> u = s.decode('utf-16') # turn the bytes into unicode
>>> print u # show what the character looks like
٤
>>> print ord(u) # show the unicode code point as an integer
1636
>>> t = u.encode('utf-8') # turn the unicode into bytes with a different encoding
>>> map(ord, t) # show that encoding as integers
[217, 164]
Hope this helps :-)
If you need to construct the unicode directly from an integer, use unichr:
>>> u = unichr(1636)
>>> print u
٤
Is there a way to convert a string to lowercase?
"Kilometers" → "kilometers"
Use str.lower():
"Kilometer".lower()
The canonical Pythonic way of doing this is
>>> 'Kilometers'.lower()
'kilometers'
However, if the purpose is to do case insensitive matching, you should use case-folding:
>>> 'Kilometers'.casefold()
'kilometers'
Here's why:
>>> "Maße".casefold()
'masse'
>>> "Maße".lower()
'maße'
>>> "MASSE" == "Maße"
False
>>> "MASSE".lower() == "Maße".lower()
False
>>> "MASSE".casefold() == "Maße".casefold()
True
This is a str method in Python 3, but in Python 2, you'll want to look at the PyICU or py2casefold - several answers address this here.
Unicode Python 3
Python 3 handles plain string literals as unicode:
>>> string = 'Километр'
>>> string
'Километр'
>>> string.lower()
'километр'
Python 2, plain string literals are bytes
In Python 2, the below, pasted into a shell, encodes the literal as a string of bytes, using utf-8.
And lower doesn't map any changes that bytes would be aware of, so we get the same string.
>>> string = 'Километр'
>>> string
'\xd0\x9a\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> string.lower()
'\xd0\x9a\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> print string.lower()
Километр
In scripts, Python will object to non-ascii (as of Python 2.5, and warning in Python 2.4) bytes being in a string with no encoding given, since the intended coding would be ambiguous. For more on that, see the Unicode how-to in the docs and PEP 263
Use Unicode literals, not str literals
So we need a unicode string to handle this conversion, accomplished easily with a unicode string literal, which disambiguates with a u prefix (and note the u prefix also works in Python 3):
>>> unicode_literal = u'Километр'
>>> print(unicode_literal.lower())
километр
Note that the bytes are completely different from the str bytes - the escape character is '\u' followed by the 2-byte width, or 16 bit representation of these unicode letters:
>>> unicode_literal
u'\u041a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
>>> unicode_literal.lower()
u'\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
Now if we only have it in the form of a str, we need to convert it to unicode. Python's Unicode type is a universal encoding format that has many advantages relative to most other encodings. We can either use the unicode constructor or str.decode method with the codec to convert the str to unicode:
>>> unicode_from_string = unicode(string, 'utf-8') # "encoding" unicode from string
>>> print(unicode_from_string.lower())
километр
>>> string_to_unicode = string.decode('utf-8')
>>> print(string_to_unicode.lower())
километр
>>> unicode_from_string == string_to_unicode == unicode_literal
True
Both methods convert to the unicode type - and same as the unicode_literal.
Best Practice, use Unicode
It is recommended that you always work with text in Unicode.
Software should only work with Unicode strings internally, converting to a particular encoding on output.
Can encode back when necessary
However, to get the lowercase back in type str, encode the python string to utf-8 again:
>>> print string
Километр
>>> string
'\xd0\x9a\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> string.decode('utf-8')
u'\u041a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
>>> string.decode('utf-8').lower()
u'\u043a\u0438\u043b\u043e\u043c\u0435\u0442\u0440'
>>> string.decode('utf-8').lower().encode('utf-8')
'\xd0\xba\xd0\xb8\xd0\xbb\xd0\xbe\xd0\xbc\xd0\xb5\xd1\x82\xd1\x80'
>>> print string.decode('utf-8').lower().encode('utf-8')
километр
So in Python 2, Unicode can encode into Python strings, and Python strings can decode into the Unicode type.
With Python 2, this doesn't work for non-English words in UTF-8. In this case decode('utf-8') can help:
>>> s='Километр'
>>> print s.lower()
Километр
>>> print s.decode('utf-8').lower()
километр
Also, you can overwrite some variables:
s = input('UPPER CASE')
lower = s.lower()
If you use like this:
s = "Kilometer"
print(s.lower()) - kilometer
print(s) - Kilometer
It will work just when called.
Don't try this, totally un-recommend, don't do this:
import string
s='ABCD'
print(''.join([string.ascii_lowercase[string.ascii_uppercase.index(i)] for i in s]))
Output:
abcd
Since no one wrote it yet you can use swapcase (so uppercase letters will become lowercase, and vice versa) (and this one you should use in cases where i just mentioned (convert upper to lower, lower to upper)):
s='ABCD'
print(s.swapcase())
Output:
abcd
I would like to provide the summary of all possible methods
.lower() method.
str.lower()
combination of str.translate() and str.maketrans()
.lower() method
original_string = "UPPERCASE"
lowercase_string = original_string.lower()
print(lowercase_string) # Output: "uppercase"
str.lower()
original_string = "UPPERCASE"
lowercase_string = str.lower(original_string)
print(lowercase_string) # Output: "uppercase"
combination of str.translate() and str.maketrans()
original_string = "UPPERCASE"
lowercase_string = original_string.translate(str.maketrans(string.ascii_uppercase, string.ascii_lowercase))
print(lowercase_string) # Output: "uppercase"
lowercasing
This method not only converts all uppercase letters of the Latin alphabet into lowercase ones, but also shows how such logic is implemented. You can test this code in any online Python sandbox.
def turnIntoLowercase(string):
lowercaseCharacters = ''
abc = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
'n','o','p','q','r','s','t','u','v','w','x','y','z',
'A','B','C','D','E','F','G','H','I','J','K','L','M',
'N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
for character in string:
if character not in abc:
lowercaseCharacters += character
elif abc.index(character) <= 25:
lowercaseCharacters += character
else:
lowercaseCharacters += abc[abc.index(character) - 26]
return lowercaseCharacters
string = str(input("Enter your string, please: " ))
print(turnIntoLowercase(string = string))
Performance check
Now, let's enter the following string (and press Enter) to make sure everything works as intended:
# Enter your string, please:
"PYTHON 3.11.2, 15TH FeB 2023"
Result:
"python 3.11.2, 15th feb 2023"
If you want to convert a list of strings to lowercase, you can map str.lower:
list_of_strings = ['CamelCase', 'in', 'Python']
list(map(str.lower, list_of_strings)) # ['camelcase', 'in', 'python']
Hi I have a problem in python. I try to explain my problem with an example.
I have this string:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
and i want, for example, replace charachters different from Ñ,Ã,ï with ""
i have tried:
>>> rePat = re.compile('[^ÑÃï]',re.UNICODE)
>>> print rePat.sub("",string)
�Ñ�����������������������������ï�������������������Ã
I obtained this �.
I think that it's happen because this type of characters in python are represented by two position in the vector: for example \xc3\x91 = Ñ.
For this, when i make the regolar expression, all the \xc3 are not substitued. How I can do this type of sub?????
Thanks
Franco
You need to make sure that your strings are unicode strings, not plain strings (plain strings are like byte arrays).
Example:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'
>>> type(string)
<type 'str'>
# do this instead:
# (note the u in front of the ', this marks the character sequence as a unicode literal)
>>> string = u'\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\xc0\xc1\xc2\xc3'
# or:
>>> string = 'ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ'.decode('utf-8')
# ... but be aware that the latter will only work if the terminal (or source file) has utf-8 encoding
# ... it is a best practice to use the \xNN form in unicode literals, as in the first example
>>> type(string)
<type 'unicode'>
>>> print string
ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿÀÁÂÃ
>>> rePat = re.compile(u'[^\xc3\x91\xc3\x83\xc3\xaf]',re.UNICODE)
>>> print rePat.sub("", string)
Ã
When reading from a file, string = open('filename.txt').read() reads a byte sequence.
To get the unicode content, do: string = unicode(open('filename.txt').read(), 'encoding'). Or: string = open('filename.txt').read().decode('encoding').
The codecs module can decode unicode streams (such as files) on-the-fly.
Do a google search for python unicode. Python unicode handling can be a bit hard to grasp at first, it pays to read up on it.
I live by this rule: "Software should only work with Unicode strings internally, converting to a particular encoding on output." (from http://www.amk.ca/python/howto/unicode)
I also recommend: http://www.joelonsoftware.com/articles/Unicode.html