Howto Remove Garbage Data from String

I'm in a situation where I have to use Python to read and write to an EEPROM on an embedded device. The first page (256 bytes) is used for non-volatile data storage. My problem is that the variables can vary in length, and I need to read a fixed amount.
For example, a string is stored at address 30 and can be anywhere from 6 to 10 bytes long. Because I don't know where it ends, I have to read the maximum possible length, which leaves excess garbage at the end of the string.
data_str = ee_read(bytecount)
dbgmsg("Reading from EEPROM: addr = " + str(addr_low) + " value = " + str(data_str))
> Reading from EEPROM: addr = 30 value = h11c13����
I am fairly new to Python. Is there a way to automatically chop off that data in the string after it's been read in?

Do you mean something like:
>>> s = 'Reading from EEPROM: addr = 30 value = h11c13����'
>>> s
'Reading from EEPROM: addr = 30 value = h11c13\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd'
>>> filter(lambda x: ord(x)<128,s)
'Reading from EEPROM: addr = 30 value = h11c13'
On Python 3, filter returns an iterator, so you'll need to join the result back into a string:
''.join(filter(lambda x: ord(x)<128,s))
A version which works for both Python 2 and Python 3 would be:
''.join(x for x in s if ord(x) < 128)
Finally, it is conceivable that the excess garbage could contain printing characters. In that case you might want to take only the characters up to the first non-printing one; itertools.takewhile could be helpful:
import string  # string.printable is available on both Python 2 and Python 3
from itertools import takewhile
printable = set(string.printable)
''.join(takewhile(lambda x: x in printable, s))
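If the EEPROM read returns raw bytes (as it would on Python 3), the same idea can be applied before the data is turned into text. The following is a minimal sketch, assuming the stored value is printable ASCII and that the first non-printable byte marks the end of the useful data; strip_garbage is a hypothetical helper name, not part of any EEPROM library.

```python
import string

# Byte values of all printable ASCII characters (includes whitespace such as \t and \n).
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))


def strip_garbage(raw):
    """Return the leading run of printable ASCII bytes in raw, decoded as text.

    Hypothetical helper: assumes the stored value is printable ASCII and that
    the first non-printable byte marks the end of the real data.
    """
    end = len(raw)
    for index, byte in enumerate(raw):  # iterating bytes yields ints on Python 3
        if byte not in PRINTABLE_BYTES:
            end = index
            break
    return raw[:end].decode("ascii")


print(strip_garbage(b"h11c13\xff\xff\xff\xff"))  # h11c13
```

As with takewhile, garbage that happens to consist of printable bytes cannot be detected this way; if the EEPROM layout can be changed, storing a length byte or a terminator next to the value would likely be more robust.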

Related

Convert a 8bit list to a 32 bit integer array in python

what I have :
textdata = "this is my test data"
DataArray = [ord(c) for c in textdata]
Now I want to transform this into 32-bit integers by combining 4 elements of the list at a time.
For example, DataArray[0:4] would become one 32-bit integer, then the next 4 elements would become the next one, and so on. In the end, I would have an array of 32-bit integers with all my results in it.
How can I do this in Python without iterating over the whole string? Is there a simple way to do this?

Using numpy:
>>> import numpy as np
>>> a = np.frombuffer(b'this is my test data', dtype=np.int32)
>>> a
array([1936287860, 544434464, 1948285293, 544502629, 1635017060], dtype=int32)
>>> a.tobytes()
b'this is my test data'
Use '<i4' or similar as dtype for endianness that's portable between machines.
I'm assuming that you can keep your initial data as bytes rather than unicode, because you really should try hard to do that.

As long as the length of your string is a multiple of 4, you can use NumPy in a very efficient way:
import numpy as np
data = np.fromstring(textdata, dtype='>u4')
# array([1952999795, 543781664, 1836654708, 1702065184, 1684108385])
'>u4' means 'big-endian unsigned 4-byte integer'.
Edit: If you use NumPy >= 1.14, then np.fromstring is deprecated, and the right way to process your text is by calling np.frombuffer(textdata.encode(), dtype='>u4').

You can use the built-in struct module:
from struct import unpack
textdata = b"this is my test data"  # struct needs bytes, not str, on Python 3
# '<' selects little-endian with no padding; 'i' is a signed 4-byte integer
data = list(unpack('<' + 'i' * (len(textdata) // 4), textdata))
Result:
[1936287860, 544434464, 1948285293, 544502629, 1635017060]
You won't need to iterate over the string yourself, and other format characters are available if you want, for example, unsigned integers.

You could use something like the following, which uses bit manipulation (big-endian):
def chunk2int(chunk):
    """ Converts a chunk (string) into an int, 8 bits per character """
    val = 0
    for c in chunk:
        val = (val << 8) | (ord(c) & 0xFF)
    return val
def int2chunk(val):
    """ Converts an int into a chunk, consuming 8 bits per character """
    rchunk = []
    while val:
        rchunk.append(val & 0xFF)
        val >>= 8
    return ''.join(chr(c) for c in reversed(rchunk))
textdata = "this is my test data"
chunks = [textdata[i:i + 4] for i in range(0, len(textdata), 4)]
print(chunks)
data = [chunk2int(c) for c in chunks]
print(data)
chunks = [int2chunk(d) for d in data]
print(chunks)
Produces:
['this', ' is ', 'my t', 'est ', 'data']
[1952999795, 543781664, 1836654708, 1702065184, 1684108385]
['this', ' is ', 'my t', 'est ', 'data']
This works as long as every character in your input text satisfies 1 <= ord(c) <= 255. If a chunk starts with null bytes, int2chunk stops early (its loop ends as soon as the remaining value is 0) and returns a shorter string, in which case you'd have to pad the chunks.
There's also the struct module, which may be worth looking into, and where you can change the endianness much more simply.
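To illustrate the point about endianness, here is a small sketch using struct on the same test data. It assumes the text is held as bytes; the variable names big, little and restored are only illustrative. Only the first character of the format string changes between the two byte orders.

```python
import struct

textdata = b"this is my test data"
count = len(textdata) // 4  # number of complete 4-byte groups

# ">" selects big-endian, "<" little-endian; "I" is an unsigned 32-bit integer.
big = list(struct.unpack(">%dI" % count, textdata))
little = list(struct.unpack("<%dI" % count, textdata))

# Packing with the same format restores the original bytes,
# including any null bytes inside a group.
restored = struct.pack(">%dI" % count, *big)

print(big)
print(little)
print(restored)
```

The big-endian list should agree with the output of chunk2int above, and the little-endian list with the NumPy int32 result earlier on this page.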

Read string up to a certain size in Python

I have a string stored in a variable. Is there a way to read the string up to a certain size, the way file objects have f.read(size) to read up to a certain size?

Check out this post for finding object sizes in python.
If you want to read the string from the start until a certain size MAX is reached, and then return that new (possibly shorter) string, you might want to try something like this:
import sys
MAX = 176  # bytes
totalSize = 0
newString = ""
s = "MyStringLength"
for c in s:
    totalSize = totalSize + sys.getsizeof(c)
    if totalSize <= MAX:
        newString = newString + str(c)
    elif totalSize > MAX:
        # adding this character would take the total past MAX, so stop here
        print(newString)
        break
Here this prints 'MyString', which is less than (or equal to) 176 bytes. The exact cut-off can differ between Python versions and platforms, because sys.getsizeof(c) measures the whole one-character string object, not just its character data.
Hope this helps.

message = 'a long string which contains a lot of valuable information.'
bite = 10
while message:
    # bite off a chunk of the string
    chunk = message[:bite]
    # set message to be the remaining portion
    message = message[bite:]
    do_something_with(chunk)
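If what you want is literally the f.read(size) interface, the standard library's io.StringIO wraps a string in a file-like object. A minimal sketch, reusing the message string from the loop above (the names reader, first and second are only illustrative):

```python
import io

message = "a long string which contains a lot of valuable information."

# Wrap the string in an in-memory text stream that supports read(size).
reader = io.StringIO(message)

first = reader.read(10)   # the first 10 characters
second = reader.read(10)  # the next 10 characters, continuing where the last read stopped

print(repr(first))
print(repr(second))
```

Here size counts characters, not bytes. For a one-off prefix, plain slicing such as message[:10] gives the same result without creating a stream object.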

Python3 print in hex representation

I can find lots of threads that tell me how to convert values to and from hex. I do not want to convert anything. Rather, I want to print the bytes I already have in hex representation, e.g.
byteval = '\x60'.encode('ASCII')
print(byteval) # b'\x60'
Instead when I do this I get:
byteval = '\x60'.encode('ASCII')
print(byteval) # b'`'
Because ` is the ASCII character that my byte corresponds to.
To clarify: type(byteval) is bytes, not string.

>>> print("b'" + ''.join('\\x{:02x}'.format(x) for x in byteval) + "'")
b'\x60'

See this:
hexify = lambda s: [hex(ord(i)) for i in list(str(s))]
And
print(hexify("abcde"))
# ['0x61', '0x62', '0x63', '0x64', '0x65']
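Another example, with a bytes object:
byteval='\x60'.encode('ASCII')
hexify = lambda s: [hex(ord(i)) for i in list(str(s))]
print(hexify(byteval))
# ['0x62', '0x27', '0x60', '0x27']
Note that on Python 3, str(byteval) is the text b'`', so this list also contains the codes for the leading b and the two quote characters; only 0x60 is the byte itself.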
Taken from https://helloacm.com/one-line-python-lambda-function-to-hexify-a-string-data-converting-ascii-code-to-hexadecimal/
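For a bytes object on Python 3, iterating already yields integers, so no ord() or str() call is needed. The sketch below wraps the formatting approach from the first answer in a function (escape_bytes is a hypothetical name) and also shows the built-in bytes.hex() method, which is available from Python 3.5 onwards, for cases where the bare hex digits are enough.

```python
def escape_bytes(data):
    """Format every byte as a \\xNN escape, wrapped like a bytes literal."""
    return "b'" + "".join("\\x{:02x}".format(byte) for byte in data) + "'"


byteval = "\x60".encode("ASCII")

print(escape_bytes(byteval))  # b'\x60'
print(byteval.hex())          # 60
```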

python Incorrect formatting Cyrillic

def inp(text):
    tmp = str()
    arr = ['.' for x in range(1, 40 - len(text))]
    tmp += text + ''.join(arr)
    print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
Output:
tester.................................
om.....................................
sup....................................
jope...................................
тестер...........................
ом...................................
суп.................................
жопа...............................
Why is Python not handling Cyrillic properly? The ends of the lines are ragged instead of lining up. Using string formatting gives the same result. How can this be corrected? Thanks.

Read this:
http://docs.python.org/2/howto/unicode.html
Basically, what you have in the text parameter of the inp function is a string. In Python 2.7, strings are bytes by default. Cyrillic characters are not mapped 1-1 to bytes when encoded in e.g. UTF-8; each one needs more than one byte (usually 2 in UTF-8), so len(text) gives you the number of bytes, not the number of characters.
In order to get the number of characters, you need to know your encoding. Assuming it's UTF-8, you can decode text from that encoding and the output will line up:
#!/usr/bin/python
# coding=utf-8
def inp(text):
    tmp = str()
    utext = text.decode('utf-8')
    l = len(utext)
    arr = ['.' for x in range(1, 40 - l)]
    tmp += text + ''.join(arr)
    print tmp
s=['tester', 'om', 'sup', 'jope']
sr=['тестер', 'ом', 'суп', 'жопа']
for i in s:
    inp(i)
for i in sr:
    inp(i)
The important lines are these two:
utext = text.decode('utf-8')
l = len(utext)
where you first decode the text, which results in a unicode string. After that, you can use the built-in len to get the length in characters, which is what you want.
Hope this helps.
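On Python 3 the problem does not arise in this form, because str is already a sequence of Unicode characters and len counts characters. A minimal sketch under that assumption, using str.ljust for the padding; pad_with_dots is a hypothetical name, and the default width of 39 matches the total line length produced by the original loop.

```python
def pad_with_dots(text, width=39):
    """Return text padded on the right with dots to a total of width characters."""
    return text.ljust(width, ".")


for word in ["tester", "om", "тестер", "ом"]:
    print(pad_with_dots(word))
```

This counts code points, so text containing combining marks or double-width characters could still end up misaligned on screen.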

Unpacking a struct ending with an ASCIIZ string

I am trying to use struct.unpack() to take apart a data record that ends with an ASCII string.
The record (it happens to be a TomTom ov2 record) has this format (stored little-endian):
1 byte
4 byte int for total record size (including this field)
4 byte int
4 byte int
variable-length string, null-terminated
unpack() requires that the string's length be included in the format you pass it. I can use the second field and the known size of the rest of the record -- 13 bytes -- to get the string length:
str_len = struct.unpack("<xi", record[:5])[0] - 13
fmt = "<biii{0}s".format(str_len)
then proceed with the full unpacking, but since the string is null-terminated, I really wish unpack() would do it for me. It'd also be nice to have this should I run across a struct that doesn't include its own size.
How can I make that happen?

I made two new functions that should be useable as drop-in replacements for the standard pack and unpack functions. They both support the 'z' character to pack/unpack an ASCIIZ string. There are no restrictions to the location or number of occurrences of the 'z' character in the format string:
import struct
def unpack(format, buffer):
    while True:
        pos = format.find('z')
        if pos < 0:
            break
        asciiz_start = struct.calcsize(format[:pos])
        asciiz_len = buffer[asciiz_start:].find('\0')
        format = '%s%dsx%s' % (format[:pos], asciiz_len, format[pos+1:])
    return struct.unpack(format, buffer)
def pack(format, *args):
    new_format = ''
    arg_number = 0
    for c in format:
        if c == 'z':
            new_format += '%ds' % (len(args[arg_number])+1)
            arg_number += 1
        else:
            new_format += c
            if c in 'cbB?hHiIlLqQfdspP':
                arg_number += 1
    return struct.pack(new_format, *args)
Here's an example of how to use them (a Python 2 session, with the two functions saved in a module named struct_z):
>>> from struct_z import pack, unpack
>>> line = pack ('<izizi', 1, 'Hello', 2, ' world!', 3)
>>> print line.encode('hex')
0100000048656c6c6f000200000020776f726c64210003000000
>>> print unpack ('<izizi',line)
(1, 'Hello', 2, ' world!', 3)
>>>

The size-less record is fairly easy to handle, actually, since struct.calcsize() will tell you the length it expects. You can use that and the actual length of the data to construct a new format string for unpack() that includes the correct string length.
This function is just a wrapper for unpack(), allowing a new format character in the last position that will drop the terminal NUL:
import struct
def unpack_with_final_asciiz(fmt, dat):
    """
    Unpack binary data, handling a null-terminated string at the end
    (and only at the end) automatically.
    The first argument, fmt, is a struct.unpack() format string with the
    following modifications:
    If fmt's last character is 'z', the returned string will drop the NUL.
    If it is 's' with no length, the string including NUL will be returned.
    If it is 's' with a length, behavior is identical to normal unpack().
    """
    # Just pass on if no special behavior is required
    if fmt[-1] not in ('z', 's') or (fmt[-1] == 's' and fmt[-2].isdigit()):
        return struct.unpack(fmt, dat)
    # Use format string to get size of contained string and rest of record
    non_str_len = struct.calcsize(fmt[:-1])
    str_len = len(dat) - non_str_len
    # Set up new format string
    # If passed 'z', treat terminating NUL as a "pad byte"
    if fmt[-1] == 'z':
        str_fmt = "{0}sx".format(str_len - 1)
    else:
        str_fmt = "{0}s".format(str_len)
    new_fmt = fmt[:-1] + str_fmt
    return struct.unpack(new_fmt, dat)
>>> dat = b'\x02\x1e\x00\x00\x00z\x8eJ\x00\xb1\x7f\x03\x00Down by the river\x00'
>>> unpack_with_final_asciiz("<biiiz", dat)
(2, 30, 4886138, 229297, b'Down by the river')
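Another way to handle a record whose string length is not known in advance is to unpack only the fixed-size header with struct and then cut the remainder at the first NUL byte. This is a sketch of that alternative on Python 3, not a replacement for the wrappers above; HEADER and unpack_record are hypothetical names, and the "<biii" layout is taken from the record description in the question.

```python
import struct

# Fixed-size part of the record: 1 byte followed by three little-endian 4-byte ints.
HEADER = struct.Struct("<biii")


def unpack_record(record):
    """Unpack the fixed header, then take the bytes up to the first NUL as the string."""
    fields = HEADER.unpack_from(record)
    # partition returns (before, separator, after); only the part before the NUL is kept.
    text, _nul, _rest = record[HEADER.size:].partition(b"\x00")
    return fields + (text,)


dat = b'\x02\x1e\x00\x00\x00z\x8eJ\x00\xb1\x7f\x03\x00Down by the river\x00'
print(unpack_record(dat))  # (2, 30, 4886138, 229297, b'Down by the river')
```

If the data contains no NUL at all, partition returns everything after the header as the string, so a caller that needs strict validation would have to check the separator explicitly.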
