Use string as bytes [duplicate]

Use string as bytes [duplicate] - python

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
My problem is as follows:
I'm reading a .csv generated by some software and to read it I'm using Pandas. Pandas read the .csv properly but one of the columns stores bytes sequences representing vectors and Pandas stores them as a string.
So I have data (string) and I want to use np.frombuffer() to get the proper vector. The problem is, data is a string so its already encoded so when I use .encode() to turn it into bytes, the sequence is not the original one.
Example: The .csv contains \x00\x00 representing the vector [0,0] with dtype=np.uint8. Pandas stores it as a string and when I try to process it something like this happens:
data = df.data[x] # With x any row.
type(data)
<class 'str'>
print(data)
\x00\x00
e_data = data.encode("latin1")
print(e_data)
b'\\x00\\x00'
v = np.frombuffer(e_data, np.uint8)
print(v)
array([ 92 120 48 48 92 120 48 48], dtype=uint8)
I just want to get b'\x00\x00' from data instead of b'\\x00\\x00' which I understand is a little encoding mess I have not been able to fix yet.
Any way to do this?
Thanks!

Issue: you (apparently) have a string that contains literal backslash escape sequences, such as:
>>> x = r'\x00' # note the use of a raw string literal
>>> x # Python's representation of the string escapes the backslash
'\\x00'
>>> print(x) # but it looks right when printing
\x00
From this, you wish to create a corresponding bytes object, wherein the backslash-escape sequences are translated into the corresponding byte.
Handling these kinds of escape sequences is done using the unicode-escape string encoding. As you may be aware, string encodings convert between bytes and str objects, specifying the rules for which byte sequences correspond to what Unicode code points.
However, the unicode-escape codec assumes that the escape sequences are on the bytes side of the equation and that the str side will have the corresponding Unicode characters:
>>> rb'\x00'.decode('unicode-escape') # create a string with a NUL char
'\x00'
Applying .encode to the string will reverse that process; so if you start with the backslash-escape sequence, it will re-escape the backslash:
>>> r'\x00'.encode('unicode-escape') # the result contains two backslashes, represented as four
b'\\\\x00'
>>> list(r'\x00'.encode('unicode-escape')) # let's look at the numeric values of the bytes
[92, 92, 120, 48, 48]
As you can see, that is clearly not what we want.
We want to convert from bytes to str to do the backslash-escaping. But we have a str to start, so we need to change that to bytes; and we want bytes at the end, so we need to change the str that we get from the backslash-escaping. In both cases, we need to make each Unicode code point from 0-255 inclusive, correspond to a single byte with the same value.
The encoding we need for that task is called latin-1, also known as iso-8859-1.
For example:
>>> r'\x00'.encode('latin-1')
b'\\x00'
Thus, we can reason out the overall conversion:
>>> r'\x00'.encode('latin-1').decode('unicode-escape').encode('latin-1')
b'\x00'
As desired: our str with a literal backslash, lowercase x and two zeros, is converted to a bytes object containing a single zero byte.
Alternately: we can request that backslash-escapes are processed while decoding, by using escape_decode from the codecs standard library module. However, this isn't documented and isn't really meant to be used that way - it's internal stuff used to implement the unicode-escape codec and possibly some other things.
If you want to expose yourself to the risk of that breaking in the future, it looks like:
>>> import codecs
>>> codecs.escape_decode(r'\x00\x00')
(b'\x00\x00', 8)
We get a 2-tuple, with the desired bytes and what I assume is the number of Unicode code points that were decoded (i.e. the length of the string). From my testing, it appears that it can only use UTF-8 encoding for the non-backslash sequences (but this could be specific to how Python is configured), and you can't change this; there is no actual parameter to specify the encoding, for a decode method. Like I said - not meant for general use.
Yes, all of that is as awkward as it seems. The reason you don't get easy support for this kind of thing is that it isn't really how you're intended to design your system. Fundamentally, all data is bytes; text is an abstraction that is encoded by that byte data. Using a single byte (with value 0) to represent four characters of text (the symbols \, x, 0 and 0) is not a normal encoding, and not a reversible one (how do I know whether to decode the byte as those four characters, or as a single NUL character?). Instead, you should strongly consider using some other friendly string representation of your data (perhaps a plain hex dump) and a non-text-encoding-related way to parse it. For example:
>>> data = '41 42' # a string in a simple hex dump format
>>> bytes.fromhex(data) # support is built-in, and works simply
b'AB'
>>> list(bytes.fromhex(data))
[65, 66]

Related

concatenate a string with \x in python 3

Say I have someString = "00". basically I want to convert someString to \x00
I tried multiple ways to achieve my goal, but couldn't find a successful one.
tried:
HexString = '\x'+someString
This method throws this error:
ValueError: invalid \x escape
Unless I do HexString = r'\x'+someString, but then HexString value is set to \\x00 which is not the same as I want.
I also tried using hex() function, which had few issues. But the big issue I had with it was that it returns 0x0. It expects int and etc...
Can anyone help me with converting a string("11") to \x11?

If I am understanding you correctly, the actual goal is to take a string that contains a bunch of pairs of hex digits, and translate each pair of hex digits into the corresponding byte and have a result of type bytes.
In 3.x, this is built directly into the bytes type itself:
>>> bytes.fromhex('11abcdef')
b'\x11\xab\xcd\xef'
You can also instead use the standard library:
>>> import binascii
>>> binascii.unhexlify('11abcdef')
b'\x11\xab\xcd\xef'
You will not necessarily see a \x escape sequence for every byte value. This is normal and expected; it has to do with how the bytes object is represented as text for display purposes.
'\x'+someString
No approach of this general form can work, because it fundamentally misunderstands the problem. The output that you want is not a string, and a string literal like '\x00' does not have a backslash in it, nor a lowercase x - again, what you are seeing is how the string is represented as text, because not every character is printable.

int lets you set the base. For base 16
>>> someString = "00"
>>> int(someString, 16)
0
Of course, 0 is kinda boring because it works for all bases.
If you wanted a byte in a bytes object, you could
>>> import struct
>>> struct.pack("b", int(someString, 16))
b'\x00'
If you want a string (and I'm switching to 0x41 here) you could
>>> chr(int("41", 16))
'A'

You can get ord of the character by using int, then convert it to a character. Then you can encode it to bytes object without any import.
>>> chr(int("11", 16)) # a character
'\x11'
>>> chr(int("11", 16)).encode() # bytes object
b'\x11'

Convert a byte array to a literal string [duplicate]

In a python source code I stumbled upon I've seen a small b before a string like in:
b"abcdef"
I know about the u prefix signifying a unicode string, and the r prefix for a raw string literal.
What does the b stand for and in which kind of source code is it useful as it seems to be exactly like a plain string without any prefix?

The b prefix signifies a bytes string literal.
If you see it used in Python 3 source code, the expression creates a bytes object, not a regular Unicode str object. If you see it echoed in your Python shell or as part of a list, dict or other container contents, then you see a bytes object represented using this notation.
bytes objects basically contain a sequence of integers in the range 0-255, but when represented, Python displays these bytes as ASCII codepoints to make it easier to read their contents. Any bytes outside the printable range of ASCII characters are shown as escape sequences (e.g. \n, \x82, etc.). Inversely, you can use both ASCII characters and escape sequences to define byte values; for ASCII values their numeric value is used (e.g. b'A' == b'\x41')
Because a bytes object consist of a sequence of integers, you can construct a bytes object from any other sequence of integers with values in the 0-255 range, like a list:
bytes([72, 101, 108, 108, 111])
and indexing gives you back the integers (but slicing produces a new bytes value; for the above example, value[0] gives you 72, but value[:1] is b'H' as 72 is the ASCII code point for the capital letter H).
bytes model binary data, including encoded text. If your bytes value does contain text, you need to first decode it, using the correct codec. If the data is encoded as UTF-8, for example, you can obtain a Unicode str value with:
strvalue = bytesvalue.decode('utf-8')
Conversely, to go from text in a str object to bytes you need to encode. You need to decide on an encoding to use; the default is to use UTF-8, but what you will need is highly dependent on your use case:
bytesvalue = strvalue.encode('utf-8')
You can also use the constructor, bytes(strvalue, encoding) to do the same.
Both the decoding and encoding methods take an extra argument to specify how errors should be handled.
Python 2, versions 2.6 and 2.7 also support creating string literals using b'..' string literal syntax, to ease code that works on both Python 2 and 3.
bytes objects are immutable, just like str strings are. Use a bytearray() object if you need to have a mutable bytes value.

This is Python3 bytes literal. This prefix is absent in Python 2.5 and older (it is equivalent to a plain string of 2.x, while plain string of 3.x is equivalent to a literal with u prefix in 2.x). In Python 2.6+ it is equivalent to a plain string, for compatibility with 3.x.

How to convert int to hex (\x) not (0x) [duplicate]

In a python source code I stumbled upon I've seen a small b before a string like in:
b"abcdef"
I know about the u prefix signifying a unicode string, and the r prefix for a raw string literal.
What does the b stand for and in which kind of source code is it useful as it seems to be exactly like a plain string without any prefix?

This is Python3 bytes literal. This prefix is absent in Python 2.5 and older (it is equivalent to a plain string of 2.x, while plain string of 3.x is equivalent to a literal with u prefix in 2.x). In Python 2.6+ it is equivalent to a plain string, for compatibility with 3.x.

What is the difference between a string and a byte string?

I am working with a library which returns a "byte string" (bytes) and I need to convert this to a string.
Is there actually a difference between those two things? How are they related, and how can I do the conversion?

The only thing that a computer can store is bytes.
To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:
If you want to store music, you must first encode it using MP3, WAV, etc.
If you want to store a picture, you must first encode it using PNG, JPEG, etc.
If you want to store text, you must first encode it using ASCII, UTF-8, etc.
MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.
In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.
On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer, it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.
'I am a string'.encode('ASCII')
The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren't human-readable, it's just that Python decodes them from ASCII when you print them. In Python, a byte string is represented by a b, followed by the byte string's ASCII representation.
A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.
b'I am a string'.decode('ASCII')
The above code will return the original string 'I am a string'.
Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.

Assuming Python 3 (in Python 2, this difference is a little less well-defined) - a string is a sequence of characters, ie unicode codepoints; these are an abstract concept, and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes - things that can be stored on disk. The mapping between them is an encoding - there are quite a lot of these (and infinitely many are possible) - and you need to know which applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'
Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:
>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'

Note: I will elaborate more my answer for Python 3 since the end of life of Python 2 is very close.
In Python 3
bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.
>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = 'nai\u0308ve'
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve
Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e, bytes and str instances can't be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False even when they contain exactly the same characters.
>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red'> 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False
Another issue when dealing with bytes and str is present when working with files that are returned using the open built-in function. On one hand, if you want ot read or write binary data to/from a file, always open the file using a binary mode like 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data to/from a file, be aware of the default encoding of your computer, so if necessary pass the encoding parameter to avoid surprises.
In Python 2
str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if str only consists of 7-bit ASCI characters.
It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.

Let's have a simple one-character string 'š' and encode it into a sequence of bytes:
>>> 'š'.encode('utf-8')
b'\xc5\xa1'
For the purpose of this example, let's display the sequence of bytes in its binary form:
>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'
Now it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the UTF-8 text encoding was used, you can follow the algorithm for decoding UTF-8 and acquire the original string:
11000101 10100001
^^^^^ ^^^^^^
00101 100001
You can display the binary number 101100001 back as a string:
>>> chr(int('101100001', 2))
'š'

From What is Unicode?:
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.
......
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
So when a computer represents a string, it finds characters stored in the computer of the string through their unique Unicode number and these figures are stored in memory. But you can't directly write the string to disk or transmit the string on network through their unique Unicode number because these figures are just simple decimal number. You should encode the string to byte string, such as UTF-8. UTF-8 is a character encoding capable of encoding all possible characters and it stores characters as bytes (it looks like this). So the encoded string can be used everywhere because UTF-8 is nearly supported everywhere. When you open a text file encoded in UTF-8 from other systems, your computer will decode it and display characters in it through their unique Unicode number.
When a browser receive string data encoded UTF-8 from the network, it will decode the data to string (assume the browser in UTF-8 encoding) and display the string.
In Python 3, you can transform string and byte string to each other:
>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文
In a word, string is for displaying to humans to read on a computer and byte string is for storing to disk and data transmission.

Unicode is an agreed-upon format for the binary representation of characters and various kinds of formatting (e.g., lower case/upper case, new line, and carriage return), and other "things" (e.g., emojis). A computer is no less capable of storing a Unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ASCII representation (a different series of bits), or any other representation (series of bits).
For communication to take place, the parties to the communication must agree on what representation will be used.
Because Unicode seeks to represent all the possible characters (and other "things") used in inter-human and inter-computer communication, it requires a greater number of bits for the representation of many characters (or things) than other systems of representation that seek to represent a more limited set of characters/things. To "simplify," and perhaps to accommodate historical usage, Unicode representation is almost exclusively converted to some other system of representation (e.g., ASCII) for the purpose of storing characters in files.
It is not the case that Unicode cannot be used for storing characters in files, or transmitting them through any communications channel. It is simply that it is not.
The term "string," is not precisely defined. "String," in its common usage, refers to a set of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A "byte string" is a set of characters stored using a representation that uses eight bits (eight bits being referred to as a byte). Since, these days, computers use the Unicode system (characters represented by a variable number of bytes) to store characters in memory, and byte strings (characters represented by single bytes) to store characters to files, a conversion must be used before characters represented in memory will be moved into storage in files.

Putting it simple, think of our natural languages like - English, Bengali, Chinese, etc. While talking, all of these languages make sound. But do we understand all of them even if we hear them? -
The answer is generally no. So, if I say I understand English, it means that I know how those sounds are encoded to some meaningful English words and I just decode these sounds in the same way to understand them. So, the same goes for any other language. If you know it, you have the encoder-decoder pack for that language in your mind, and again if you don't know it, you just don't have this.
The same goes for digital systems. Just like ourselves, as we can only listen sounds with our ears and make sound with mouth, computers can only store bytes and read bytes. So, the certain application knows how to read bytes and interpret them (like how many bytes to consider to understand any information) and also write in the same way such that its fellow applications also understand it. But without the understanding (encoder-decoder) all data written to a disk are just strings of bytes.

A string is a bunch of items strung together. A byte string is a sequence of bytes, like b'\xce\xb1\xce\xac' which represents "αά". A character string is a bunch of characters, like "αά". Synonymous to a sequence.
A byte string can be directly stored to the disk directly, while a string (character string) cannot be directly stored on the disk. The mapping between them is an encoding.

The Python languages includes str and bytes as standard "built-in types". In other words, they are both classes. I don't think it's worthwhile trying to rationalize why Python has been implemented this way.
Having said that, str and bytes are very similar to one another. Both share most of the same methods. The following methods are unique to the str class:
casefold
encode
format
format_map
isdecimal
isidentifier
isnumeric
isprintable
The following methods are unique to the bytes class:
decode
fromhex
hex

What does a b prefix before a python string mean?

In a python source code I stumbled upon I've seen a small b before a string like in:
b"abcdef"
I know about the u prefix signifying a unicode string, and the r prefix for a raw string literal.
What does the b stand for and in which kind of source code is it useful as it seems to be exactly like a plain string without any prefix?

This is Python3 bytes literal. This prefix is absent in Python 2.5 and older (it is equivalent to a plain string of 2.x, while plain string of 3.x is equivalent to a literal with u prefix in 2.x). In Python 2.6+ it is equivalent to a plain string, for compatibility with 3.x.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.