I'm trying to use Google's protocol buffers (protobuf) with Python in a networked program, using bare sockets. My question is: after the transmitting side sends a message, how does the receiving side know what kind of message was transmitted? For example, say I have the message definitions:
message StrMessage {
  required string str = 1;
}

message IntMessage {
  required int32 num = 1;
}
Now the transmitter makes a StrMessage, serializes it, and sends the serialized bytes over the network. How does the receiver know to deserialize the bytes with StrMessage rather than IntMessage? I've tried doing two things:
// Proposal 1: send one byte header to indicate type
enum MessageType {
  STR_MESSAGE = 1;
  INT_MESSAGE = 2;
}

// Proposal 2: use a wrapper message
message Packet {
  optional StrMessage m_str = 1;
  optional IntMessage m_int = 2;
}
Neither of these seems very clean, though, and both require me to list all the message types by hand. Is there a canonical/better way to handle this problem?
Thanks!
This has been discussed before, for example in this thread on the protobuf list, but put simply: there is no canonical / de facto way of doing this.
Personally, I like the Packet approach, as it keeps everything self-contained (and indeed, in protobuf-net I have specific methods to process data in that format, returning just the StrMessage / IntMessage and leaving the Packet layer as an unimportant implementation detail that never actually gets used). But since that earlier proposal by Kenton never got implemented (AFAIK), it is entirely a matter of personal taste.
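For the record, here is a minimal sketch of the Packet approach in Python, assuming the messages above were compiled into a module named messages_pb2 (the module name is an assumption):

import messages_pb2

# Sender: wrap the concrete message in a Packet before putting it on the wire.
packet = messages_pb2.Packet()
packet.m_str.str = "hello"            # touching the sub-message marks m_str as present
payload = packet.SerializeToString()  # over a bare socket you still need to length-prefix this

# Receiver: parse the Packet, then ask which field is actually set.
incoming = messages_pb2.Packet()
incoming.ParseFromString(payload)
if incoming.HasField("m_str"):
    print("StrMessage:", incoming.m_str.str)
elif incoming.HasField("m_int"):
    print("IntMessage:", incoming.m_int.num)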
Here is the Java code that publishes data to Redis:
import com.streambase.sb.util.ByteOrderedDataOutput;

// Serialize the tuple into a byte buffer using StreamBase's internal output class.
byte[] valuebuffer = null;
ByteOrderedDataOutput boutput = new ByteOrderedDataOutput(0, tuple.getByteOrder());
tuple.serialize(boutput, 0);
valuebuffer = boutput.getBuffer();

// Build the Redis key from the stream name and the key field.
byte[] keybuffer = null;
String keyvalue = redisStream + "." + keyFieldStr;
keybuffer = keyvalue.getBytes();

// Append a SET command to the Lua script and queue the key/value pair.
strLuaCommands += "redis.call('set',KEYS[" + (++keyCount) + "],ARGV[" + (++argCount) + "])";
keys.add(keybuffer);
args.add(valuebuffer);
I was able to get the data back with Python's struct module, but it is not in the correct format.
import redis, struct
redis_client = redis.StrictRedis(host="abc.com", port=6379, db=0)
temp = redis_client.get('samplekey')
struct.unpack("<" + ("s" * (len(temp))), temp)
Tuple.serialize() uses the com.streambase.sb.util.ByteOrderedDataOutput class, which has never been part of the StreamBase public API. Therefore the Tuple.serialize() methods shouldn't be considered part of the public API, either.
Also, there's no particular reason to believe that the Python struct.unpack() method knows how to understand StreamBase's ByteOrderedDataOutput, whatever that is. So it's not surprising that what you are unpacking is not what you want.
One workaround that comes to mind would be to use the StreamBase Python Operator to convert your StreamBase tuple into Python objects, and then use a Python script to write whatever you want into Redis. Since you would then be encoding and decoding the data with complementary functions from the same Python library, you'll have a much better chance of not mangling your data.
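A rough sketch of that idea, assuming the tuple has already been turned into a plain Python dict (the key name and field names here are made up for illustration):

import json
import redis

redis_client = redis.StrictRedis(host="abc.com", port=6379, db=0)

# Writer side: encode with a well-known, self-describing format instead of
# StreamBase's internal ByteOrderedDataOutput.
record = {"symbol": "IBM", "price": 125.3}   # hypothetical tuple contents
redis_client.set("samplekey", json.dumps(record))

# Reader side: decode with the complementary function.
decoded = json.loads(redis_client.get("samplekey"))
print(decoded["symbol"], decoded["price"])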
I want to (de)serialize simple objects in Python to a human-readable (e.g. JSON) format. The data may come from an untrusted source. I really like how the Rust library, serde, works:
#[derive(Serialize, Deserialize, Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let point = Point { x: 1, y: 2 };

    // Convert the Point to a JSON string.
    let serialized = serde_json::to_string(&point).unwrap();
    // Prints serialized = {"x":1,"y":2}
    println!("serialized = {}", serialized);

    // Convert the JSON string back to a Point.
    let deserialized: Point = serde_json::from_str(&serialized).unwrap();
    // Prints deserialized = Point { x: 1, y: 2 }
    println!("deserialized = {:?}", deserialized);
}
I'd like to achieve something like this in Python. Since Python is not statically typed, I'd expect the syntax to be something like:
deserialized = library.loads(data_str, ClassName)
where ClassName is the expected class.
jsonpickle is bad, bad, bad. It does absolutely no sanitization, and using it on untrusted input leads to arbitrary code execution.
There are serialization libraries such as lima, marshmallow, and kim, but all of them require manually defining serialization schemas. In practice that leads to code duplication, which is bad.
Is there anything I could use for simple, generic yet secure serialization in Python?
EDIT: other requirements, which were implicit before
Handle nested serialization (serde can do it: https://gist.github.com/63bcd00691b4bedee781c49435d0d729)
Handle built-in types, i.e. be able to serialize and deserialize everything that the built-in json module can, without special treatment of built-in types.
Since Python doesn't require type annotations, any such library would need to either
use its own classes
take advantage of type annotations.
The latter would be the perfect solution but I have not found any library doing that.
I did find a module, though, that requires defining only one class as a model: https://github.com/dimagi/jsonobject
Usage example:
import jsonobject

class Node(jsonobject.JsonObject):
    id = jsonobject.IntegerProperty(required=True)
    name = jsonobject.StringProperty(required=True)

class Transaction(jsonobject.JsonObject):
    provider = jsonobject.ObjectProperty(Node)
    requestor = jsonobject.ObjectProperty(Node)

req = Node(id=42, name="REQ")
prov = Node(id=24, name="PROV")
tx = Transaction(provider=prov, requestor=req)

js = tx.to_json()
tx2 = Transaction(js)

print(tx)
print(tx2)
For Python, I would start just by checking the size of the input. The only security risk in running json.load() is a DoS from someone sending an enormous file.
Once the JSON is parsed, consider running a schema validator such as PyKwalify.
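As a rough sketch of that combination (the MAX_BYTES limit and the schema file name are assumptions, and the PyKwalify call follows its documented Core API but should be double-checked against your version):

import json
from pykwalify.core import Core

MAX_BYTES = 1024 * 1024   # reject absurdly large payloads before parsing

def load_point(data_str):
    if len(data_str) > MAX_BYTES:
        raise ValueError("input too large")
    obj = json.loads(data_str)   # parse the untrusted JSON
    # Schema check against a PyKwalify schema file.
    Core(source_data=obj, schema_files=["point_schema.yaml"]).validate()
    return obj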
I have such .proto file
syntax = "proto3";
import "google/protobuf/any.proto";
message Request {
  google.protobuf.Any request_parameters = 1;
}
How can I create a Request object and populate its fields? I tried this:
import ma_pb2
from google.protobuf.any_pb2 import Any
parameters = {"a": 1, "b": 2}
Request = ma_pb2.Request()
some_any = Any()
some_any.CopyFrom(parameters)
Request.request_parameters = some_any
But I get an error:
TypeError: Parameter to CopyFrom() must be instance of same class: expected google.protobuf.Any got dict.
UPDATE
Following the suggestions of @Kevin, I added a new message to the .proto file:
message Small {
  string a = 1;
}
Now the code looks like this:
Request = ma_pb2.Request()
small = ma_pb2.Small()
small.a = "1"
some_any = Any()
some_any.Pack(small)
Request.request_parameters = small
But the last assignment raises an error:
Request.request_parameters = small
AttributeError: Assignment not allowed to field "request_parameters" in protocol message object.
What did I do wrong?
Any is not a magic box for storing arbitrary keys and values. The purpose of Any is to denote "any" message type, in cases where you might not know which message you want to use until runtime. But at runtime, you still need to have some specific message in mind. You can then use the .Pack() and .Unpack() methods to convert that message into an Any, and at that point you would do something like Request.request_parameters.CopyFrom(some_any).
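Using the Small message from the question's update, a minimal sketch of that flow looks something like this (ma_pb2 being the module generated from the .proto file above):

import ma_pb2
from google.protobuf.any_pb2 import Any

small = ma_pb2.Small()
small.a = "1"

some_any = Any()
some_any.Pack(small)                            # wrap the concrete message in an Any

request = ma_pb2.Request()
request.request_parameters.CopyFrom(some_any)   # message fields can't be assigned, only copied

# Receiving side: unpack back into a Small, if that is what the Any holds.
out = ma_pb2.Small()
if request.request_parameters.Unpack(out):
    print(out.a)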
So, if you want to store this specific dictionary:
{"a": 1, "b": 2}
...you'll need a .proto file which describes some message type that has integer fields named a and b. Personally, I'd see that as overkill; just throw your a and b fields directly into the Request message, unless you have a good reason for separating them out. If you "forget" one of these keys, you can always add it later, so don't worry too much about completeness.
If you really want a "magic box for storing arbitrary keys and values" rather than what I described above, you could use a map instead of Any. This has the advantage of not requiring you to declare all of your keys upfront, which helps when the set of keys might include arbitrary strings (for example, HTTP headers). It has the disadvantage of being harder to lint or type-check (especially in statically-typed languages), because you can misspell a string more easily than an attribute. As described in the protobuf documentation on maps, maps are basically syntactic sugar for a repeated field like the following (the on-wire representation is exactly the same as what you'd get from doing this, so it is backwards compatible with clients that don't support maps):
message MapFieldEntry {
  key_type key = 1;
  value_type value = 2;
}

repeated MapFieldEntry map_field = N;
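For completeness, a rough sketch of the map alternative in Python, assuming request_parameters were redeclared in the .proto file as map<string, int32> request_parameters = 1; and the module regenerated:

import ma_pb2

request = ma_pb2.Request()
request.request_parameters["a"] = 1   # map fields behave like Python dicts
request.request_parameters["b"] = 2

data = request.SerializeToString()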
I want to make a protobuf Event message that can contain several different event types. Here's an example:
message Event {
  required int32 event_id = 1;

  oneof EventType {
    FooEvent foo_event = 2;
    BarEvent bar_event = 3;
    BazEvent baz_event = 4;
  }
}
This works fine, but one thing that bugs me is that EventType is optional: I can encode an object with only an event_id and protobuf won't complain.
>>> e = test_pb2.Event()
>>> e.IsInitialized()
False
>>> e.event_id = 1234
>>> e.IsInitialized()
True
Is there any way to require the EventType to be set? I'm using Python, if that matters.
According to the Protocol Buffers documentation, the required field rule is not recommended, and it has been removed entirely in proto3.
Required Is Forever You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.
And as the above document says, you should consider using application-specific validation instead of marking the fields as required.
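A minimal sketch of such a validation routine for the Event message above, using WhichOneof (which returns the name of the field that is set within the oneof, or None):

import test_pb2

def validate_event(event):
    if event.WhichOneof("EventType") is None:
        raise ValueError("event %d has no event type set" % event.event_id)

e = test_pb2.Event()
e.event_id = 1234
validate_event(e)   # raises, because none of the oneof fields is set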
There is no way to mark a oneof as "required" (even in proto2) because at the time oneof was introduced, it was already widely accepted that fields probably should never be "required", and so the designers did not bother implementing a way to make a oneof required.
Using options syntax, there are ways to specify validation rules and auto-generate the code for the validation routines.
You can use https://github.com/envoyproxy/protoc-gen-validate like this:
import "validate/validate.proto";
message Event {
required int32 event_id = 1;
oneof EventType {
option (validate.required) = true;
FooEvent foo_event = 2;
BarEvent bar_event = 3;
BazEvent baz_event = 4;
}
}
"You should consider writing application-specific custom validation routines for your buffers instead." And here we are auto-generating such custom validation routines.
But wait, is this going against the spirit of protobuf spec? Why is required bad and validate good? My own answer is that the protobuf spec cares very much about "proxies", i.e. software which serializes/deserializes messages, but has almost no business logic on its own. Such software can simply omit the validation (it's an option), but it cannot omit required (it must render the message unparseable).
For business logic's side, all of this is not a big problem in my experience.
I'm processing data from a serial port in Python. The first byte indicates the start of a message, and the second byte indicates what type of message it is. Depending on that second byte we read the rest of the message differently (to account for the different message types; some carry only data, others strings, and so on).
I currently have the following structure: a general Message class that contains the basic functionality for every type of message, and derived classes that represent the different types of messages (for example DataMessage or StringMessage). These have their own specific read and print functions.
In my read_value_from_serial function I read in all the bytes. Right now I use the following code (which is bad) to determine whether a message will be a DataMessage or a StringMessage (there are around 6 different types of messages, but I'm simplifying a bit).
# Read the type byte and dispatch to the matching reader via a dict lookup.
msg_type = serial_port.read(size=1).encode("hex").upper()
msg_string = StringMessage()
msg_data = DataMessage()
processread = {"01": msg_string.read, "02": msg_data.read}
result = processread[msg_type]()
Now I want to simplify/improve this kind of code. I've read about "killing the switch", but I don't like that I have to create objects that I won't use in the end. Any suggestions for improving this specific problem?
Thanks
This is very close to what you have and I see nothing wrong with it.
from __future__ import print_function   # lets Python 2 define a method named print

class Message(object):
    def print(self):
        pass

class StringMessage(Message):
    def __init__(self, port):
        self.message = 'get a string from port'

def MessageFactory(port):
    readers = {'01': StringMessage, … }
    msg_type = port.read(size=1).encode("hex").upper()   # read the type byte from the given port
    return readers[msg_type](port)
You say "I don't like it that I have to create objects that I won't use in the end". How is it that you aren't using the objects? If I have a StringMessage msg, then
msg.print()
is using an object exactly how it is supposed to be used. Did it bother you that your one instance of msg_string only existed to call msg_string.read()? My example code makes a new Message instance for every message read; that's what objects are for. That's actually how Object Oriented Programming works.
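For what it's worth, usage of the factory above would look something like this (assuming serial_port is an open pyserial connection and the start-of-message byte has already been consumed):

msg = MessageFactory(serial_port)   # returns a StringMessage (or whichever subclass the type byte maps to)
msg.print()                         # each subclass knows how to print itself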