How to remove those "\x00\x00" - python

How to remove those "\x00\x00" in a string ?
I have many of those strings (example shown below). I can use re.sub to replace those "\x00". But I am wondering whether there is a better way to do that? Converting between unicode, bytes and string is always confusing.
'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.

Use rstrip
>>> text = 'Hello\x00\x00\x00\x00'
>>> text.rstrip('\x00')
'Hello'
It removes all \x00 characters at the end of the string.

>>> a = 'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> a.replace('\x00','')
'Hello'

I think the more general solution is to use:
cleanstring = nullterminatedstring.split('\x00',1)[0]
Which will split the string using \x00 as the delimeter 1 time. The split(...) returns a 2 element list: everything before the null in addition to everything after the null (it removes the delimeter). Appending [0] only returns the portion of the string before the first null (\x00) character, which I believe is what you're looking for.
The convention in some languages, specifically C-like, is that a single null character marks the end of the string. For example, you should also expect to see strings that look like:
'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'
The answer supplied here will handle that situation as well as the other examples.

Building on the answers supplied, I suggest that strip() is more generic than rstrip() for cleaning up a data packet, as strip() removes chars from the beginning and the end of the supplied string, whereas rstrip() simply removes chars from the end of the string.
However, NUL chars are not treated as whitespace by default by strip(), and as such you need to specify explicitly. This can catch you out, as print() will of course not show the NUL chars. My solution that I used was to clean the string using ".strip().strip('\x00')":
>>> arbBytesFromSocket = b'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii')
>>> print(arbBytesAsString)
hello
>>> str(arbBytesAsString)
'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii').strip().strip('\x00')
>>> str(arbBytesAsString)
'hello'
>>>
This gives you the string/byte array required, without the NUL chars on each end, and also preserves any NUL chars inside the "data packet", which is useful for received byte data that may contain valid NUL chars (eg. a C-type structure). NB. In this case the packet must be "wrapped", i.e. surrounded by non-NUL chars (prefix and suffix), to allow correct detection, and thus only strip unwanted NUL chars.

Neil wrote, '...you might want to put some thought into why you have them in the first place.'
For my own issue with this error code, this led me to the problem. My saved file that I was reading from was in unicode. Once I re-saved the file as a plain ASCII text, the problem was solved

I tried strip and rstrip and they didn't work, but this one did;
Use split and then join the result list:
if '\x00' in name:
name=' '.join(name.split('\x00'))

I ran into this problem copy lists out of Excel. Process was:
Copy a list of ID numbers sent to me in Excel
Run set of pyton code that:
Read the clipboard as text
txt.Split('\n') to give a list
Processed each element in the list
(updating the production system as requird)
Problem was intermitently was returning multiple '\x00' at the end of the text when reading the clipboard.
Have changed from using win32clipboard to using pyperclip to read the clipboard, and it seems to have resolved the problem.

Related

Strings are too long [duplicate]

How to remove those "\x00\x00" in a string ?
I have many of those strings (example shown below). I can use re.sub to replace those "\x00". But I am wondering whether there is a better way to do that? Converting between unicode, bytes and string is always confusing.
'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
Use rstrip
>>> text = 'Hello\x00\x00\x00\x00'
>>> text.rstrip('\x00')
'Hello'
It removes all \x00 characters at the end of the string.
>>> a = 'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> a.replace('\x00','')
'Hello'
I think the more general solution is to use:
cleanstring = nullterminatedstring.split('\x00',1)[0]
Which will split the string using \x00 as the delimeter 1 time. The split(...) returns a 2 element list: everything before the null in addition to everything after the null (it removes the delimeter). Appending [0] only returns the portion of the string before the first null (\x00) character, which I believe is what you're looking for.
The convention in some languages, specifically C-like, is that a single null character marks the end of the string. For example, you should also expect to see strings that look like:
'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'
The answer supplied here will handle that situation as well as the other examples.
Building on the answers supplied, I suggest that strip() is more generic than rstrip() for cleaning up a data packet, as strip() removes chars from the beginning and the end of the supplied string, whereas rstrip() simply removes chars from the end of the string.
However, NUL chars are not treated as whitespace by default by strip(), and as such you need to specify explicitly. This can catch you out, as print() will of course not show the NUL chars. My solution that I used was to clean the string using ".strip().strip('\x00')":
>>> arbBytesFromSocket = b'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii')
>>> print(arbBytesAsString)
hello
>>> str(arbBytesAsString)
'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii').strip().strip('\x00')
>>> str(arbBytesAsString)
'hello'
>>>
This gives you the string/byte array required, without the NUL chars on each end, and also preserves any NUL chars inside the "data packet", which is useful for received byte data that may contain valid NUL chars (eg. a C-type structure). NB. In this case the packet must be "wrapped", i.e. surrounded by non-NUL chars (prefix and suffix), to allow correct detection, and thus only strip unwanted NUL chars.
Neil wrote, '...you might want to put some thought into why you have them in the first place.'
For my own issue with this error code, this led me to the problem. My saved file that I was reading from was in unicode. Once I re-saved the file as a plain ASCII text, the problem was solved
I tried strip and rstrip and they didn't work, but this one did;
Use split and then join the result list:
if '\x00' in name:
name=' '.join(name.split('\x00'))
I ran into this problem copy lists out of Excel. Process was:
Copy a list of ID numbers sent to me in Excel
Run set of pyton code that:
Read the clipboard as text
txt.Split('\n') to give a list
Processed each element in the list
(updating the production system as requird)
Problem was intermitently was returning multiple '\x00' at the end of the text when reading the clipboard.
Have changed from using win32clipboard to using pyperclip to read the clipboard, and it seems to have resolved the problem.

replace or delete specific unicode characters in python

There seem to be a lot of posts about doing this in other languages, but I can't seem to figure out how in Python (I'm using 2.7).
To be clear, I would ideally like to keep the string in unicode, just be able to replace certain specific characters.
For instance:
thisToken = u'tandh\u2013bm'
print(thisToken)
prints the word with the m-dash in the middle. I would just like to delete the m-dash. (but not using indexing, because I want to be able to do this anywhere I find these specific characters.)
I try using replace like you would with any other character:
newToke = thisToken.replace('\u2013','')
print(newToke)
but it just doesn't work. Any help is much appreciated.
Seth
The string you're searching for to replace must also be a Unicode string. Try:
newToke = thisToken.replace(u'\u2013','')
You can see the answer in this post: How to replace unicode characters in string with something else python?
Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "")
Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "").encode("utf-8")

Convert escape characters in strings (like "\\n" - two characters) into ASCII character (newline)

My program got a string from the command line arguments, that contains many escapes characters.
./myprog.py "\x41\x42\n"
When I print "sys.argv[1]".
I got on the screen:
\x41\x42\n
Is there a simple way to do that the program print instead:
AB[newline]
The string passed to your program is '\\x41\\x42\\n'.
I don't think there is a simple way to revert it back into 'AB\n'.
You'll have to split the string by '\\', and treat each element separately.
If your string is always of the form '\\x..\\x..\\x..\\n', then you can do this:
print ''.join([chr(int('0'+k,16)) for k in sys.argv[1].split('\\')[1:-1]])
Try passing the argument in the following way:
./myprog.py $'\x41\x42\n'
The $'...' notation is allowed to be used together with \x00-like escape sequences for constructing arbitrary byte sequences from the hexadecimal notation.
Another way to fix this is to do what #Barak suggested here -- that is converting the hex characters.
It just depends on what you find easy for you.

Operating on character '№'

I want to search and replace the character '№' in a string.
I am not sure if it's actually a single character or two.
How do I do it? What is its unicode?
If it's any help, I am using Python3.
EDIT: The sentence "I am not sure if it's actually a single character or two" kind of deformed my question. I actually wanted to know its unicode so that I could use the code instead of pasting the character in my python script.
In Python 3 it is always a single character.
3>> 'foo№bar'.replace('№', '#')
'foo#bar'
That character is U+2116 ɴᴜᴍᴇʀᴏ sɪɢɴ.
You can just type it directly in your source file, taking care to to specify the source file encoding as per PEP-236.
Alternatively, you can use either the numeric Unicode escapes, or the more readable named Unicode escapes:
>>> 'foo\u2116'
'foo№'
>>> 'foo\N{NUMERO SIGN}'
'foo№'

Convert zero-padded bytes to UTF-8 string

I'm unpacking several structs that contain 's' type fields from C. The fields contain zero-padded UTF-8 strings handled by strncpy in the C code (note this function's vestigial behaviour). If I decode the bytes I get a unicode string with lots of NUL characters on the end.
>>> b'hiya\0\0\0'.decode('utf8')
'hiya\x00\x00\x00'
I was under the impression that trailing zero bytes were part of UTF-8 and would be dropped automatically.
What's the proper way to drop the zero bytes?
Use str.rstrip() to remove the trailing NULs:
>>> 'hiya\0\0\0'.rstrip('\0')
'hiya'
Either rstrip or replace will only work if the string is padded out to the end of the buffer with nulls. In practice the buffer may not have been initialised to null to begin with so you might get something like b'hiya\0x\0'.
If you know categorically 100% that the C code starts with a null initialised buffer and never never re-uses it, then you might find rstrip to be simpler, otherwise I'd go for the slightly messier but much safer:
>>> b'hiya\0x\0'.split(b'\0',1)[0]
b'hiya'
which treats the first null as a terminator.
Unlike the split/partition-solution this does not copy several strings and might be faster for long bytearrays.
data = b'hiya\0\0\0'
i = data.find(b'\x00')
if i == -1:
return data
return data[:i]

Categories

Resources