I wanted to convert an ASCII string (well, just text, to be precise) to base64.
I know how to do that; I just use the following code:
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string)
Which gives me
b'c3RyaW5n'
However the problem is, I'd like it to just print
c3RyaW5n
Is it possible to print the string without the "b" and the '' quotation marks?
Thanks!
The b prefix denotes that it is a bytes object (a binary string). A bytes object is not a text string: it is a sequence of bytes (values in the 0 to 255 range). It is simply displayed like a string to make it more compact.
In the case of base64, however, all the output characters are valid ASCII characters, so you can simply decode it:
print(string.decode('ascii'))
Here we decode each byte to its ASCII equivalent. Since base64 guarantees that every byte it produces is in the ASCII range (the letters A-Z and a-z, the digits 0-9, and the characters '+', '/' and '='), this always produces a valid string. Mind, however, that this is not guaranteed for an arbitrary bytes object.
A simple .decode("utf-8") would do
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string.decode("utf-8"))
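As a minimal sketch of the approach in the answers above, the encode and decode steps can be wrapped in one small helper. The name to_base64_text is hypothetical; it is not part of the base64 module.

```python
import base64

def to_base64_text(text, encoding="utf-8"):
    """Hypothetical helper: return the base64 form of `text` as a str, not bytes."""
    raw = text.encode(encoding)       # str -> bytes
    encoded = base64.b64encode(raw)   # bytes -> base64-encoded bytes
    return encoded.decode("ascii")    # base64 output only contains ASCII bytes

result = to_base64_text("string")
print(result)  # c3RyaW5n
```

Decoding with 'ascii' rather than 'utf-8' gives the same result here; it just makes the assumption about the base64 alphabet explicit.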
Is there an elegant way to convert "test\207\128" into "testπ" in python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. The problem is that it outputs non-alphanumeric characters as escape sequences. So a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero and using chr(35) to recover the #. This solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First off turn this into a byte array, since it's not a unicode string
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ
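The same steps can be sketched as a reusable function, checked against both examples from the question. The name unescape_decimal is hypothetical, and the sketch assumes every escape is a backslash followed by exactly three decimal digits with a value of at most 255; anything larger would make bytes() raise a ValueError.

```python
import re

# A literal backslash followed by exactly three decimal digits, e.g. \035
_ESCAPE = re.compile(rb"\\([0-9]{3})")

def unescape_decimal(text):
    """Hypothetical helper: replace decimal byte escapes with the bytes they name."""
    raw = text.encode("utf-8")
    # Each escape becomes a single byte with that decimal value
    raw = _ESCAPE.sub(lambda m: bytes([int(m.group(1))]), raw)
    # The result is ordinary UTF-8, so multi-byte characters decode correctly
    return raw.decode("utf-8")

print(unescape_decimal(r"name\035id"))    # name#id
print(unescape_decimal(r"test\207\128"))  # testπ
```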
I have a string whose value is 'Opérations'. In my script I read a file and do some comparisons. The string that I copied from the source and placed in my Python script does not compare equal to the same string when I read it from the file in my script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8 I notice the difference:
b'Ope\xcc\x81rations'
b'Op\xc3\xa9rations'
My question is what do I do to ensure that the special character in my python script is the same as the file content's when comparing such strings.
Good to know:
You are dealing with two types of strings: byte strings and Unicode strings. Each has a method to convert it to the other type. Unicode strings have an .encode() method that produces bytes, and byte strings have a .decode() method that produces Unicode. That is:
unicode.encode() ----> bytes
and
bytes.decode() -----> unicode
UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point: the higher the code point value, the more bytes it needs in UTF-8.
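To make the variable-length point concrete, here is a small illustration, assuming Python 3 where str is the Unicode type. The four sample characters are arbitrary choices for this sketch, picked so that each needs one more byte than the previous one.

```python
# Sample code points chosen for illustration: A, é, €, and an emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")           # str -> bytes
    assert encoded.decode("utf-8") == ch   # bytes -> str round trip
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```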
Get to the point:
If you define your strings both as byte strings and as Unicode strings, as follows:
a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\xcc\x81rations'
and
b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xc3\xa9rations'
you will see:
print('a_byte length is:', len(a_byte.decode("utf-8")))
#print('a_unicode length is:', len(a_unicode.encode("utf-8")))
print('b_byte length is:', len(b_byte.decode("utf-8")))
#print('b_unicode length is:', len(b_unicode.encode("utf-8")))
output:
a_byte length is: 11
b_byte length is: 10
So you see they are not the same.
My solution:
If you don't want to be confused, you can use repr(). Printing the decoded strings shows Opérations for both, but:
print(repr(a_byte), repr(b_byte))
will print:
b'Ope\xcc\x81rations' b'Op\xc3\xa9rations'
You can also normalize the Unicode before comparison, as Daniel's answer suggests:
from unicodedata import normalize
from functools import partial
a_byte = b'Ope\xcc\x81rations'
norm = partial(normalize, 'NFC')
your_string = norm(a_byte.decode('utf8'))
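The effect of normalization on the two byte sequences from the question can be checked directly. This is a sketch assuming Python 3; the helper name same_text is hypothetical, and NFC is used only because it is the form mentioned above.

```python
from unicodedata import normalize

decomposed = b"Ope\xcc\x81rations".decode("utf-8")  # 'e' followed by a combining acute accent
precomposed = b"Op\xc3\xa9rations".decode("utf-8")  # single precomposed 'é' code point

print(decomposed == precomposed)  # False

def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after Unicode normalization."""
    return normalize(form, a) == normalize(form, b)

print(same_text(decomposed, precomposed))  # True
```

Normalizing both sides at the point of comparison avoids having to know which form the file or the editor happened to use.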
I have a binary object:
b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}'
and I want it to be printed in Unicode and not strictly using ASCII symbols.
There is a hacky way to do it:
import json
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)
# Outputs: {"node": "Обновление"}
however the text will not always be parseable as JSON, so I need a simpler way.
Is there a way to print out my binary object (or a decoded Unicode string) as a non-ASCII string without going through parsing/dumping JSON?
For example, how to print this b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?
A bytes string like
b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:
data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)
output
Обновление
However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.
data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
This should work so long as all the escapes are valid (no single \).
import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'
unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))
output:
'{"node": "Обновление"}}'
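The 'unicode-escape' approach described earlier in this thread can be collected into one function that accepts either bytes or str. This is only a sketch: the name decode_unicode_escapes is hypothetical, and it assumes a str input contains nothing but ASCII, otherwise the encode step raises UnicodeEncodeError.

```python
def decode_unicode_escapes(data):
    """Hypothetical helper: turn literal \\uXXXX escapes into real characters."""
    if isinstance(data, str):
        # Assumption: the text is pure ASCII (escapes included)
        data = data.encode("ascii")
    return data.decode("unicode-escape")

from_bytes = decode_unicode_escapes(
    b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
)
from_text = decode_unicode_escapes(
    '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
)
print(from_bytes)  # Обновление
print(from_text)   # Обновление
```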
I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
expected from string
'"my quote"'
Last point of note is that I'm working with Python 3 so raw is unicode, and thus is already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode anything: Python 3 strings are already Unicode. You just need the correct escape for the double quote character, which is \x22.
If, however, the data comes from a different character set (it appears to be Windows-1252 here), then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string, you have to encode it first; I found the latin-1 encoding to work:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
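Here is a short sketch of what that round trip produces, assuming the data really is Windows-1252. In that code page the byte 0x94 is the right curly quotation mark (U+201D), not the plain ASCII quote, so the last step, which maps curly quotes to ASCII quotes, is an optional extra added here to match the output the question asks for.

```python
raw = "\x94my quote\x94"

# latin-1 maps code points 0-255 straight to the same byte values
as_bytes = raw.encode("latin-1")

# Reinterpret those bytes as Windows-1252
fixed = as_bytes.decode("windows-1252")
print(fixed)                 # ”my quote”
print(hex(ord(fixed[0])))    # 0x201d

# Optional: replace curly double quotes with the ASCII double quote
plain = fixed.replace("\u201c", '"').replace("\u201d", '"')
print(plain)                 # "my quote"
```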
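I'm not sure this is what you mean, but decoding the bytes with an explicit codec works (a bare .decode() assumes UTF-8 and raises UnicodeDecodeError on \x94):
some_binary = b"\x94my quote\x94"
result = some_binary.decode("windows-1252")
That gives you the result.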
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
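Did you try it like this? I think you need to call decode as a method of the bytes object and pass utf-8 as the argument. Add b in front of the string too. Note that \x94 is not valid UTF-8, so with the 'ignore' error handler those bytes are silently dropped and the output is my quote without the quotation marks.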
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)
I received the following string. How can it be converted to hex?
value='(\xd2M\x00\x18\x00\x18\x80\x00\x80\x00\x00\x00\x00\x00\x00\xe0\xd2\xe0\xd2.\xd2\x00\x00\x00\x00\x00\x00\n\x00\x18\x00&\x00\x00\x00\x00\x00\x00\x00\x0f0\xfe/\x010\xff/\x000\xff/\x000\xff/\xff/\xff/\xff/\xff/\x000\xff/\xff/\xff/\x000\x000\xff/\x000\x000\x000\xff/\xff/\x000\x000\xff/\x000\xad\xff\x0c\x00\xdd\xff\xc2\xff\xd3\xff\xde\xff\xe9\xff\xca\xff\xd8\xff\xe6\xff\xb5\xff\xb2\xff\xe6\xff\x92\xff\xd0\xff\xa0\xff\xbd\xff\xb4\xff\x82\xff\x90\xfff\xff\xe1\xff\x9f\xff\x94\xff\xd4\xff\xa4\xff\xbb\xff\xe8\xff\x00\x00\x02\x00\xff\x7f\xff\x7f\x97\xff\xd0\xff\xb7\xff~\xffG\xff\xa1\xff\xa1\xff\xcd\xab\x00\x00A\n\x00\x00'
That's not a hex string. You are confusing the Python repr() output for a bytestring, which aims to make debugging easier, with the contents.
Each \xhh is a standard Python string literal escape sequence, and displaying the string like this makes it trivial to copy and paste into another Python session to reproduce the exact same value.
You don't need to hex decode this at all.
An actual hex string consists only of the digits 0 through to 9, and the letters a through to f (upper or lowercase). Your value, converted to hex, looks like this:
>>> value='(\xd2M\x00\x18\x00\x18\x80\x00\x80\x00\x00\x00\x00\x00\x00\xe0\xd2\xe0\xd2.\xd2\x00\x00\x00\x00\x00\x00\n\x00\x18\x00&\x00\x00\x00\x00\x00\x00\x00\x0f0\xfe/\x010\xff/\x000\xff/\x000\xff/\xff/\xff/\xff/\xff/\x000\xff/\xff/\xff/\x000\x000\xff/\x000\x000\x000\xff/\xff/\x000\x000\xff/\x000\xad\xff\x0c\x00\xdd\xff\xc2\xff\xd3\xff\xde\xff\xe9\xff\xca\xff\xd8\xff\xe6\xff\xb5\xff\xb2\xff\xe6\xff\x92\xff\xd0\xff\xa0\xff\xbd\xff\xb4\xff\x82\xff\x90\xfff\xff\xe1\xff\x9f\xff\x94\xff\xd4\xff\xa4\xff\xbb\xff\xe8\xff\x00\x00\x02\x00\xff\x7f\xff\x7f\x97\xff\xd0\xff\xb7\xff~\xffG\xff\xa1\xff\xa1\xff\xcd\xab\x00\x00A\n\x00\x00'
>>> import binascii
>>> binascii.hexlify(value)
'28d24d00180018800080000000000000e0d2e0d22ed20000000000000a00180026000000000000000f30fe2f0130ff2f0030ff2f0030ff2fff2fff2fff2fff2f0030ff2fff2fff2f00300030ff2f003000300030ff2fff2f00300030ff2f0030adff0c00ddffc2ffd3ffdeffe9ffcaffd8ffe6ffb5ffb2ffe6ff92ffd0ffa0ffbdffb4ff82ff90ff66ffe1ff9fff94ffd4ffa4ffbbffe8ff00000200ff7fff7f97ffd0ffb7ff7eff47ffa1ffa1ffcdab0000410a0000'
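The session above passes a str to hexlify, which is Python 2 behavior. For comparison, here is a sketch of the same conversion in Python 3, where the value has to be a bytes object. Only the first five bytes of the value are used to keep the example short.

```python
import binascii

value = b"(\xd2M\x00\x18"  # first five bytes of the value above

hex_text = value.hex()     # bytes -> str of hex digits
print(hex_text)            # 28d24d0018

# binascii.hexlify gives the same digits, but as bytes
assert binascii.hexlify(value).decode("ascii") == hex_text

# bytes.fromhex reverses the conversion
assert bytes.fromhex(hex_text) == value
```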