How to remove those "\x00\x00" in a string?
I have many strings like the example shown below. I can use re.sub to replace the "\x00" characters, but I am wondering whether there is a better way. Converting between unicode, bytes and strings is always confusing.
'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
Use rstrip
>>> text = 'Hello\x00\x00\x00\x00'
>>> text.rstrip('\x00')
'Hello'
It removes all \x00 characters at the end of the string.
>>> a = 'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> a.replace('\x00','')
'Hello'
I think the more general solution is to use:
cleanstring = nullterminatedstring.split('\x00', 1)[0]
This splits the string on \x00 at most once. split(...) returns a two-element list: everything before the first null and everything after it (the delimiter itself is removed). Indexing with [0] keeps only the portion before the first null (\x00) character, which I believe is what you're looking for.
The convention in some languages, C in particular, is that a single null character marks the end of the string. For example, you should also expect to see strings that look like:
'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'
The answer supplied here will handle that situation as well as the other examples.
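A quick REPL check (my illustration) shows it handles both the padded and the C-style case:
>>> 'Hello\x00\x00\x00'.split('\x00', 1)[0]
'Hello'
>>> 'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'.split('\x00', 1)[0]
'Hello'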
Building on the answers supplied, I suggest that strip() is more general than rstrip() for cleaning up a data packet: strip() removes characters from both the beginning and the end of the string, whereas rstrip() only removes them from the end.
However, strip() does not treat NUL characters as whitespace by default, so you need to specify them explicitly. This can catch you out, since print() does not visibly show the NUL characters. The solution I used was to clean the string with .strip().strip('\x00'):
>>> arbBytesFromSocket = b'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii')
>>> print(arbBytesAsString)
hello
>>> str(arbBytesAsString)
'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii').strip().strip('\x00')
>>> str(arbBytesAsString)
'hello'
>>>
This gives you the string required, without the NUL characters on either end, while preserving any NUL characters inside the "data packet", which is useful for received byte data that may contain valid NULs (e.g. a C-style structure). NB: for this to work, the packet must be "wrapped", i.e. the payload must begin and end with non-NUL characters, so that only the unwanted padding NULs are stripped.
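A quick REPL check of that behaviour (my illustration, with NULs embedded in the payload):
>>> b'\x00\x00a\x00b\x00\x00'.decode('ascii').strip('\x00')
'a\x00b'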
Neil wrote, '...you might want to put some thought into why you have them in the first place.'
For my own issue, this led me to the problem: the file I was reading from had been saved as Unicode (most likely UTF-16, which stores each ASCII character alongside a NUL byte). Once I re-saved the file as plain ASCII text, the problem was solved.
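If re-saving isn't an option, a minimal sketch (assuming the file really is UTF-16; the filename is hypothetical):
# Decode the file as UTF-16 instead of reading it as ASCII/UTF-8;
# the NUL bytes disappear because they belong to the encoding, not the text.
with open('data.txt', encoding='utf-16') as f:
    text = f.read()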
I tried strip and rstrip and they didn't work, but this one did.
Use split and then join the result list:
if '\x00' in name:
    name = ' '.join(name.split('\x00'))
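Note (my addition): joining with ' ' replaces each NUL with a space rather than deleting it; join with '' if you want them removed outright:
>>> ''.join('He\x00llo\x00\x00'.split('\x00'))
'Hello'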
I ran into this problem copying lists out of Excel. The process was:
Copy a list of ID numbers sent to me in Excel
Run a set of Python code that:
Read the clipboard as text
txt.split('\n') to give a list
Processed each element in the list
(updating the production system as required)
The problem was that reading the clipboard intermittently returned multiple '\x00' characters at the end of the text.
I changed from using win32clipboard to pyperclip to read the clipboard, and that seems to have resolved the problem; a sketch follows.
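A minimal sketch of that approach (my illustration; the cleanup mirrors the strip techniques above):
import pyperclip

# pyperclip.paste() returns the clipboard contents as a string
ids = [line.strip('\x00 \t\r')               # drop padding NULs and whitespace
       for line in pyperclip.paste().split('\n')
       if line.strip('\x00 \t\r')]           # skip empty lines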
Related
There seem to be a lot of posts about doing this in other languages, but I can't seem to figure out how in Python (I'm using 2.7).
To be clear, I would ideally like to keep the string in unicode, just be able to replace certain specific characters.
For instance:
thisToken = u'tandh\u2013bm'
print(thisToken)
prints the word with the en dash (U+2013) in the middle. I would just like to delete the dash (but not by indexing, because I want to be able to do this anywhere I find these specific characters).
I try using replace like you would with any other character:
newToke = thisToken.replace('\u2013','')
print(newToke)
but it just doesn't work. Any help is much appreciated.
Seth
The string you're searching for must also be a Unicode string: in Python 2, a plain byte-string literal does not interpret the \u escape, so '\u2013' is actually the six literal characters \u2013. Try:
newToke = thisToken.replace(u'\u2013','')
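A quick check in the Python 2 REPL (my illustration):
>>> thisToken = u'tandh\u2013bm'
>>> thisToken.replace(u'\u2013', u'')
u'tandhbm'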
You can see the answer in this post: How to replace unicode characters in string with something else python?
Decode the byte string to Unicode (Python 2; here s is the byte string). Assuming it's UTF-8-encoded:
s.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
s.decode("utf-8").replace(u"\u2022", u"")
Encode back to UTF-8, if needed:
s.decode("utf-8").replace(u"\u2022", u"").encode("utf-8")
My program gets a string from the command-line arguments that contains many escape characters:
./myprog.py "\x41\x42\n"
When I print sys.argv[1], I get on the screen:
\x41\x42\n
Is there a simple way to make the program print this instead?
AB[newline]
The string passed to your program is '\\x41\\x42\\n'.
I don't think there is a simple way to turn it back into 'AB\n'.
You'll have to split the string on '\\' and treat each element separately.
If your string is always of the form '\\x..\\x..\\x..\\n', then you can do this:
print ''.join([chr(int('0'+k,16)) for k in sys.argv[1].split('\\')[1:-1]])
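As a more general alternative (my addition, not part of the original answer), the unicode_escape codec interprets such backslash escapes for you, at least for ASCII-only input (Python 3 shown):
>>> '\\x41\\x42\\n'.encode('ascii').decode('unicode_escape')
'AB\n'
In Python 2 the equivalent is sys.argv[1].decode('unicode_escape').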
Try passing the argument in the following way:
./myprog.py $'\x41\x42\n'
The shell's $'...' quoting interprets \xNN-style escape sequences, so the program receives the actual bytes A, B and a newline rather than the literal backslash text.
Another way to fix this is to do what @Barak suggested here, i.e. converting the hex characters yourself. It just depends on what you find easier.
How do I get rid of non-ASCII characters like "^L", "¢" and "â" in Perl and Python? While parsing PDF files in Python and Perl I get these special characters. I now have text versions of these PDF files, but with the special characters still in them. Is there a function that will ensure a file or a variable does not contain any non-ASCII characters?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', errors='ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
But this possibly won't help you, because I suspect you don't just want to delete these characters; you actually want to resolve the problems that put them there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions, with specific examples of what you expect and what you are actually seeing.
For the sake of completeness, some Perl solutions. Both return ",,". Unlike the accepted Python answer, I have used no magic numbers like 32 or 128; the constants here can be looked up much more easily in the documentation.
use 5.014;
use Encode qw(encode);
encode('ANSI_X3.4-1968', "\cL,¢,â", sub { q() }) =~ s/\p{PosixCntrl}//gr;
use 5.014;
use Unicode::UCD qw(charinfo);
join q(), grep {
    my $u = charinfo ord $_;
    'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category}
} split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since I have the errors flag on "ignore", it just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the example above, using Python 3, I get a bytes object back; I would need to decode it again should I need a proper string object.
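A minimal sketch of such a handler (my illustration; the handler name is arbitrary):
import codecs

def drop_and_report(err):
    # err is a UnicodeEncodeError covering the offending span; an encode
    # error handler must return a (replacement, resume_position) pair.
    # Here we drop the span, but report what was dropped.
    print('dropping %r at position %d' % (err.object[err.start:err.end], err.start))
    return (u'', err.end)

codecs.register_error('drop_and_report', drop_and_report)
data = "hello swede åäö".encode("ascii", "drop_and_report")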
I'm unpacking several structs that contain 's'-type fields from C. The fields contain zero-padded UTF-8 strings handled by strncpy in the C code (note that function's vestigial padding behaviour). If I decode the bytes, I get a Unicode string with lots of NUL characters on the end.
>>> b'hiya\0\0\0'.decode('utf8')
'hiya\x00\x00\x00'
I was under the impression that trailing zero bytes were part of UTF-8 and would be dropped automatically.
What's the proper way to drop the zero bytes?
Use str.rstrip() to remove the trailing NULs:
>>> 'hiya\0\0\0'.rstrip('\0')
'hiya'
Either rstrip or replace will only work if the string is padded out to the end of the buffer with NULs. In practice the buffer may not have been initialised to zero to begin with, so you might get something like b'hiya\0x\0'.
If you know categorically, 100%, that the C code starts with a zero-initialised buffer and never reuses it, then you might find rstrip simpler; otherwise I'd go for the slightly messier but much safer:
>>> b'hiya\0x\0'.split(b'\0',1)[0]
b'hiya'
which treats the first null as a terminator.
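(partition works the same way here: b'hiya\0x\0'.partition(b'\0')[0] also gives b'hiya'.)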
Unlike the split/partition solutions, this does not create several intermediate strings, and so might be faster for long byte strings:
def before_null(data):
    # Return everything up to the first NUL; at most one copy is made.
    i = data.find(b'\x00')
    if i == -1:
        return data
    return data[:i]
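Tying it back to the struct question (my sketch; the '8sI' record layout is hypothetical):
import struct

record = struct.pack('8sI', b'hiya', 42)        # '8s' zero-pads the name field
name_field, number = struct.unpack('8sI', record)
name = before_null(name_field).decode('utf8')   # 'hiya'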