Convert zero-padded bytes to UTF-8 string

I'm unpacking several structs that contain 's' type fields from C. The fields contain zero-padded UTF-8 strings handled by strncpy in the C code (note this function's vestigial behaviour). If I decode the bytes I get a unicode string with lots of NUL characters on the end.
>>> b'hiya\0\0\0'.decode('utf8')
'hiya\x00\x00\x00'
I was under the impression that trailing zero bytes were part of UTF-8 and would be dropped automatically.
What's the proper way to drop the zero bytes?

Use str.rstrip() to remove the trailing NULs:
>>> 'hiya\0\0\0'.rstrip('\0')
'hiya'

Either rstrip or replace will only work if the string is padded out to the end of the buffer with nulls. In practice the buffer may not have been initialised to null to begin with so you might get something like b'hiya\0x\0'.
If you know for certain that the C code starts with a null-initialised buffer and never re-uses it, then you might find rstrip to be simpler; otherwise I'd go for the slightly messier but much safer:
>>> b'hiya\0x\0'.split(b'\0',1)[0]
b'hiya'
which treats the first null as a terminator.
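A minimal sketch of how this might fit into the struct-unpacking workflow from the question. The record layout ("<8sI"), the sample values, and the helper name c_string are assumptions made up for illustration; they are not taken from the original code.

```python
import struct

def c_string(field: bytes, encoding: str = "utf-8") -> str:
    """Decode a fixed-width C string field, stopping at the first NUL byte."""
    # partition() returns (head, separator, tail); head is the whole input
    # when no NUL is present, so unterminated fields still work.
    head, _, _ = field.partition(b"\0")
    return head.decode(encoding)

# Hypothetical record: an 8-byte name field followed by a 32-bit unsigned int.
# struct.pack pads the 's' field with NUL bytes, much like strncpy does.
record = struct.pack("<8sI", b"hiya", 42)
raw_name, value = struct.unpack("<8sI", record)
name = c_string(raw_name)
print(repr(name), value)
```

If the C side truncated a string in the middle of a multi-byte UTF-8 sequence, decode() would raise UnicodeDecodeError; passing errors='replace' to decode() is one way to tolerate that.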

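Unlike the split/partition solution, this does not copy several strings and might be faster for long byte strings.
def strip_at_first_nul(data):
    # Return everything before the first NUL byte (all of data if there is none).
    i = data.find(b'\x00')
    if i == -1:
        return data
    return data[:i]

strip_at_first_nul(b'hiya\0\0\0')  # b'hiya'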

Related

Decode individual octal characters in string variable

A string variable sometimes includes octal characters that need to be un-octaled. Example: oct_var = "String\302\240with\302\240octals", the value of oct_var should be "String with octals" with non-breaking spaces.
Codecs doesn't support octal, and I failed to find a working solution with encode(). The strings originate upstream outside my control.
Python 3.9.8
Edited to add:
It doesn't have to scale or be ultra fast, so maybe the idea from here (#6) can work (not tested yet):
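import re

def decode(encoded):
    # Replace each literal \NNN octal escape with the character of that value.
    for octc in set(re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    # In Python 3 a str has no decode(); map the chars back to bytes first.
    return encoded.encode('latin-1').decode('utf8')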

You forgot to indicate that oct_var should be given as bytes:
>>> oct_var = b"String\302\240with\302\240octals"
>>> oct_var.decode()
'String\xa0with\xa0octals'
>>> print(oct_var.decode())
String with octals
Note: if your value is already as a string (beyond your control), you can try to convert it to bytes:
>>> oct_str = "String\302\240with\302\240octals" # as a string
>>> oct_var = bytes([ord(c) for c in oct_str])
# often equivalent to:
>>> oct_var = oct_str.encode('Latin1')
and then proceed as above.
Note: if the string also contains chars beyond ASCII (e.g., with Latin1, accented chars like 'é'), the subsequent .decode() will fail, as in UTF-8 those are represented as multibyte chars (e.g. 'é'.encode() == b'\xc3\xa9', but 'é'.encode('Latin1') == b'\xe9'). If the string contains Unicode chars beyond Latin1 (e.g. '你好'), you will get a ValueError or a UnicodeEncodeError, depending on which of the two conversion methods you choose.
In short: don't fly anything expensive, heavy, or with people inside with that -- this is hacky. At the very least, surround your code with try ... except (ValueError, UnicodeEncodeError, UnicodeDecodeError) and handle these exceptions accordingly.

Putting your ideas and pointers together, and accepting the risks that come with the use of an undocumented function[*], i.e. codecs.escape_decode, this line works:
value = codecs.escape_decode(bytes(oct_var, "latin-1"))[0].decode("utf-8")
[*] "Internal function means: you can use it on your risk but the function can be changed or even removed in any Python release."
Explanations for codecs.escape_decode:
https://stackoverflow.com/a/37059682/5309571
Examples for its use:
https://www.programcreek.com/python/example/8498/codecs.escape_decode
Other approaches that may turn out to be more future-proof than codecs.escape_decode (no warranty, I have not tried them):
https://stackoverflow.com/a/58829514/5309571
https://bytes.com/topic/python/answers/743965-converting-octal-escaped-utf-8-a
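For comparison, here is a sketch that avoids the undocumented function by substituting the escapes with a regular expression in a single pass. It assumes the input holds literal backslash-plus-three-octal-digit sequences (or characters that are already in the 0-255 range); the function name decode_octal_escapes is hypothetical.

```python
import re

# Matches a literal backslash followed by three octal digits (\000 to \377).
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")

def decode_octal_escapes(text: str) -> str:
    """Turn literal backslash-octal escapes into bytes, then decode as UTF-8."""
    # Latin-1 maps code points 0-255 to identical byte values; this raises
    # UnicodeEncodeError if the text has characters above U+00FF.
    raw = text.encode("latin-1")
    # Replace each escape with the single byte it denotes.
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    # Raises UnicodeDecodeError if the resulting bytes are not valid UTF-8.
    return raw.decode("utf-8")

result = decode_octal_escapes(r"String\302\240with\302\240octals")
print(repr(result))
```

The same caveats as above apply: input that is not Latin-1-representable, or that does not form valid UTF-8 after substitution, raises an exception that the caller has to handle.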

Strings are too long [duplicate]

How to remove those "\x00\x00" in a string ?
I have many of those strings (example shown below). I can use re.sub to replace those "\x00". But I am wondering whether there is a better way to do that? Converting between unicode, bytes and string is always confusing.
'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.

Use rstrip
>>> text = 'Hello\x00\x00\x00\x00'
>>> text.rstrip('\x00')
'Hello'
It removes all \x00 characters at the end of the string.

>>> a = 'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> a.replace('\x00','')
'Hello'

I think the more general solution is to use:
cleanstring = nullterminatedstring.split('\x00',1)[0]
This splits the string at most once, using \x00 as the delimiter. When a null is present, split(...) returns a two-element list: everything before the first null and everything after it (the delimiter itself is removed); with no null, it returns a one-element list containing the whole string. Appending [0] returns only the portion of the string before the first null (\x00) character, which I believe is what you're looking for.
The convention in some languages, specifically C-like, is that a single null character marks the end of the string. For example, you should also expect to see strings that look like:
'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'
The answer supplied here will handle that situation as well as the other examples.
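To make the difference between the three suggestions concrete, here is a small comparison sketch. The helper name first_c_string and the variable names are made up for illustration.

```python
padded = 'Hello\x00\x00\x00'
dirty = 'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'

def first_c_string(s: str) -> str:
    """Return the text before the first NUL, or all of s if there is none."""
    return s.split('\x00', 1)[0]

for sample in (padded, dirty):
    # rstrip only trims trailing NULs, replace removes every NUL,
    # and the split approach stops at the first NUL.
    print(repr(sample.rstrip('\x00')),
          repr(sample.replace('\x00', '')),
          repr(first_c_string(sample)))
```

All three agree on the cleanly padded string; only the split approach discards the leftover buffer contents in the second one.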

Building on the answers supplied, I suggest that strip() is more generic than rstrip() for cleaning up a data packet, as strip() removes chars from the beginning and the end of the supplied string, whereas rstrip() simply removes chars from the end of the string.
However, strip() does not treat NUL chars as whitespace by default, so you need to specify them explicitly. This can catch you out, as print() will of course not show the NUL chars. The solution I used was to clean the string using ".strip().strip('\x00')":
>>> arbBytesFromSocket = b'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii')
>>> print(arbBytesAsString)
hello
>>> str(arbBytesAsString)
'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii').strip().strip('\x00')
>>> str(arbBytesAsString)
'hello'
>>>
This gives you the string/byte array required, without the NUL chars on each end, and also preserves any NUL chars inside the "data packet", which is useful for received byte data that may contain valid NUL chars (eg. a C-type structure). NB. In this case the packet must be "wrapped", i.e. surrounded by non-NUL chars (prefix and suffix), to allow correct detection, and thus only strip unwanted NUL chars.

Neil wrote, '...you might want to put some thought into why you have them in the first place.'
For my own issue with this error, this led me to the problem. The saved file that I was reading from was in unicode. Once I re-saved the file as plain ASCII text, the problem was solved.

I tried strip and rstrip and they didn't work, but this one did.
Use split and then join the result list:
if '\x00' in name:
    name = ' '.join(name.split('\x00'))

I ran into this problem copying lists out of Excel. The process was:
Copy a list of ID numbers sent to me in Excel
Run a set of Python code that:
    Read the clipboard as text
    txt.split('\n') to give a list
    Processed each element in the list
    (updating the production system as required)
The problem was that reading the clipboard intermittently returned multiple '\x00' at the end of the text.
I have changed from using win32clipboard to using pyperclip to read the clipboard, and it seems to have resolved the problem.

Python convert strings of bytes to byte array

For example given an arbitrary string. Could be chars or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course, just typing the b followed by the string will do. But I want to do this at runtime, or from a variable containing the string of bytes.
If the given string was AAAA or some known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to just be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding

The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) does not avoid an encoding either: in Python 3 it defaults to UTF-8 regardless of platform, so it gives the same unsuitable multi-byte result.
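A quick way to check the Latin-1 claim is to round-trip every possible byte value. This is only an illustrative sketch; the variable names are made up.

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# Latin-1 decodes byte N to code point N and encodes it back unchanged.
as_text = all_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')
print(round_tripped == all_bytes)

# UTF-8, by contrast, expands each code point above 0x7f to two bytes here.
utf8_len = len(as_text.encode('utf-8'))
print(len(all_bytes), utf8_len)
```

Code points above 0xff cannot be encoded as Latin-1 at all and raise UnicodeEncodeError, which is arguably the right outcome for a string that is supposed to hold only byte values.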

I found a working solution:
import struct

def convert_string_to_bytes(string):
    # Pack each character's code point (must be 0-255) as one unsigned byte.
    result = b''
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'
print(convert_string_to_bytes(string))
output:
b'\xf0\x9f\xa4\xb1'

What is the fool proof way to convert some string (utf-8 or else) to a simple ASCII string in python

Inside my Python script, I get some string back from a function which I didn't write. The encoding of it varies. I need to convert it to ASCII format. Is there some fool-proof way of doing this? I don't mind replacing the non-ASCII chars with blanks or something else...

If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:
Don't muck about with encode/decode, use the repr() function (Python 2.X) or the ascii() function (Python 3.x).
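As a small illustration of the Python 3 side of this advice, the sketch below shows what ascii() produces for a string with accented, control, and CJK characters. The sample text is made up.

```python
text = 'h\xe9llo\x00 \u4f60\u597d'  # 'héllo', a NUL, a space, then '你好'

# ascii() returns a printable, ASCII-only representation (quotes included)
# in which non-ASCII and control characters appear as escape sequences.
safe = ascii(text)
print(safe)
```

The result is meant for logging or debugging rather than for display to end users, since it keeps the surrounding quotes and the escape sequences.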

You say "the encoding of it varies". I guess that by "it" you mean a Python 2.x "string", which is really a sequence of bytes.
Answer part one: if you do not know the encoding of that encoded string, then no, there is no way at all to do anything meaningful with it*. If you do know the encoding, then step one is to convert your str into a unicode:
encoded_string = i_have_no_control()
the_encoding = 'utf-8' # for the sake of example
text = unicode(encoded_string, the_encoding)
Then you can re-encode your unicode object as ASCII, if you like.
ascii_garbage = text.encode('ascii', 'replace')
* There are heuristic methods for guessing encodings, but they are slow and unreliable. Here's one excellent attempt in Python.

I'd try to normalize the string and then encode it. What about:
import unicodedata
s = u"éèêàùçÇ"
print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'))
This works only if you have unicode as input. Therefore, you must know what kind of encoding the function outputs and decode it. If you don't, there are encoding detection heuristics, but on short strings they are not reliable.
Of course, you could get lucky: the function's outputs may rely on various unknown encodings that all use ASCII as a base, and therefore allocate the same values to the bytes from 0 to 127 (like utf-8).
In that case, you can just get rid of the unwanted chars by filtering them using OrderedSets (OrderedSet is not in the standard library and has to come from a third-party package or recipe; note that a set intersection also drops repeated characters):
import string  # string.printable holds the printable ASCII chars
print("".join(OrderedSet(string.printable) & OrderedSet(s)))
Or if you want blanks instead:
print("".join((char if char in string.printable else " ") for char in s))
"translate" can help you to do the same.
The only way to know if you are this lucky is to try it out... Sometimes, a big fat lucky day is what any dev needs :-)
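The two ideas in this answer could look roughly like this in Python 3, using only the standard library. The function names are hypothetical, and the sketch assumes the input is already a str, not bytes.

```python
import string
import unicodedata

def to_ascii_letters(text: str) -> str:
    """Strip accents by decomposing (NFKD) and dropping the non-ASCII marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def blank_non_printable(text: str) -> str:
    """Replace every character outside string.printable with a space."""
    return ''.join(ch if ch in string.printable else ' ' for ch in text)

s = 'éèêàùçÇ'
print(to_ascii_letters(s))
print(repr(blank_non_printable('caf\xe9 ok')))
```

The first function keeps the base letters of accented characters; the second keeps the length of the string but loses the non-ASCII characters entirely.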

What's meant by "foolproof" is that the function does not fail with even the most obscure, impossible input -- meaning, you could feed the function random binary data and IT WOULD NEVER FAIL, NO MATTER WHAT. That's what "foolproof" means.
The function should then proceed do its best to convert to the destination encoding. If it has to throw away all the trash it does not understand, then that is perfectly fine and is in fact the most desirable result. Why try to salvage all the junk? Just discard the junk. Tell the user he's not merely a moron for using Microsoft anything, but a non-standard moron for using non-standard Microsoft anything...or for attempting to send in binary data!
I have just precisely this same need (though my need is in PHP), and I also have users who are at least as moronic as I am, sometimes moreso; however, they are definitely nicer and no doubt more patient.
The best, bottom-line thing I've found so far is (in PHP 5.3):
$fixed_string = iconv( 'ISO-8859-1', 'UTF-8//IGNORE//TRANSLATE', $in_string );
This attempts to translate whatever it can and simply throws away all the junk, resulting in a legal UTF-8 string output. I've also not been able to break it or cause it to fail or reject any incoming text or data, even by feeding it gobs of binary junk data.
Finding the iconv() and getting it to work is easy; what's so maddening and wasteful is reading through all the total garbage and bend-over-backwards idiocy that so many programmers seem to espouse when dealing with this encoding fiasco. What's become of the enviable (and respectable) "Flail and Burn The Idiots" mentality of old school programming? Let's get back to basics. Use iconv() and throw away their garbage, and don't be bashful when telling them you threw away their garbage -- in short, don't fail to flail the morons who feed you garbage. And you can tell them I told you so.

If all you want to do is preserve ASCII-compatible characters and throw away the rest, then in most encodings that boils down to removing all characters that have the high bit set -- i.e., characters with value over 127. This works because nearly all character sets are extensions of 7-bit ASCII.
If it's a normal string (i.e., not unicode), you need to decode it in an arbitrary character set (such as iso-8859-1 because it accepts any byte values) and then encode in ascii, using the ignore or replace option for errors:
>>> orig = '1ä2äö3öü4ü'
>>> orig.decode('iso-8859-1').encode('ascii', 'ignore')
'1234'
>>> orig.decode('iso-8859-1').encode('ascii', 'replace')
'1??2????3????4??'
The decode step is necessary because you need a unicode string in order to use encode. If you already have a Unicode string, it's simpler:
>>> orig = u'1ä2äö3öü4ü'
>>> orig.encode('ascii', 'ignore')
'1234'
>>> orig.encode('ascii', 'replace')
'1??2????3????4??'
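The session above is Python 2. A rough Python 3 equivalent for byte input might look like the sketch below; the function name bytes_to_ascii is hypothetical, and the sample is encoded as UTF-8 to mimic the original terminal session.

```python
def bytes_to_ascii(data: bytes, errors: str = 'ignore') -> str:
    """Reduce arbitrary bytes to ASCII text."""
    # ISO-8859-1 (Latin-1) accepts every byte value, so decode cannot fail.
    text = data.decode('iso-8859-1')
    # errors='ignore' drops non-ASCII characters; 'replace' yields '?'.
    return text.encode('ascii', errors).decode('ascii')

orig = '1ä2äö3öü4ü'.encode('utf-8')
print(bytes_to_ascii(orig))
print(bytes_to_ascii(orig, 'replace'))
```

With 'replace', each non-ASCII byte becomes its own '?', which is why a character that occupies two bytes in UTF-8 shows up as '??'.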
