Deserializing a Streamed Protocol Buffer Message With Header and Repeated fields


I am working on deserializing a log file that has been serialized in C using protocol buffers (and NanoPB).
The log file has a short header composed of: entity, version, and identifier. After the header, the stream of data should be continuous and it should log the fields from the sensors but not the header values (these should only occur once, at the beginning). The same .proto file was used to serialize the file. I do not have separate .proto files for the header and for the streamed data.
After my implementation, I assume it should look like this:
firmware "1.0.0"
GUID "1231214211321" (example)
Timestamp 123123
Sens1 2343
Sens2 13123
Sens3 13443
Sens4 1231
Sens5 190
Timestamp 123124
Sens1 2345
Sens2 2312
...
I posted this question to figure out how to structure the .proto file initially, when I was implementing the serialization in C. In the end I used a similar approach but did not include the [(nanopb).max_count = 1] option.
Finally I opted for the following .proto in Python (there can be more than 5 sensors):
syntax = "proto3";
import "timestamp.proto";
message SessionLogs {
    int32 Entity = 1;
    string Version = 2;
    string GUID = 3;
    repeated SessionLogsDetail LogDetail = 4;
}
message SessionLogsDetail
{
    int32 DataTimestamp = 1; // internal counter to identify the order of session logs
    // Sensor data, there can be X amount of sensors.
    int32 sens1 = 2;
    int32 sens2 = 3;
    int32 sens3 = 4;
    int32 sens4 = 5;
}
At this point, I can serialize a message as I log with my device and, according to the file size, the log seems to work, but I have not been able to deserialize it in Python offline to check whether my implementation is correct. I can't do it in C since it's an embedded application, and I want to do the post-processing offline with Python.
Also, I have checked this online protobuf deserializer, where I can pass the serialized file and get it deserialized without the need for the .proto file. In it I can see the header values (field 3 is empty, so it's not shown) and the logged information. This makes me think that the serialization is correct but I am deserializing it wrongly in Python.
This is my current code used to deserialize the message in Python:
import PSessionLogs_pb2

with open('$PROTOBUF_LOG_FILENAME$', 'rb') as f:
    read_metric = PSessionLogs_pb2.PSessionLogs()
    read_metric.ParseFromString(f.read())
Besides this, I've used protoc to generate the .py equivalent of the .proto file to deserialize offline.

It looks like you've serialized a header and then serialized some other data immediately afterwards. In other words, instead of serializing a SessionLogs that has some SessionLogsDetail records, you've serialized a SessionLogs and then (separately) a SessionLogsDetail. Does that sound about right? If so: yes, that will not work correctly. There are ways to do what you're after, but it isn't quite as simple as serializing one message after the other, because the root protobuf object is never terminated; what actually happens is that later fields overwrite the root object's fields that have the same number.
There are two ways of addressing this, depending on the data volume. If the size (including all of the detail rows) is small, you can just change the code so that it is a true parent / child relationship, i.e. so that the rows are all inside the parent. When writing the data, this does not mean that you need to have all the rows before you start writing: there are ways of appending child rows so that you send data as it becomes available. However, when deserializing, it will want to load everything in one go, so this approach is only useful if you're OK with that, i.e. you don't have an obscene, open-ended number of rows.
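As a rough illustration of appending child rows as they become available: on the wire, a nested message field is a tag, a length, and the child's bytes, so each row can be written on its own after the header fields. The sketch below assumes the LogDetail field number 4 from the .proto in the question; encode_varint and frame_child are hypothetical helper names, and the payload stands in for an already-serialized SessionLogsDetail.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def frame_child(field_number, payload):
    """Wrap serialized child bytes as one length-delimited field of the parent."""
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Assumption: b"\x08\x01" is a SessionLogsDetail with DataTimestamp = 1.
row = frame_child(4, b"\x08\x01")
```

Written this way, the file is the header fields followed by one such chunk per row, and parsing the whole file as a single SessionLogs should yield every row in LogDetail.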
If you have large numbers of rows, you'll need to add your own framing. This is often done by adding a length prefix before each payload, so that you can read a single message at a time. Some of the libraries include helper methods for this; for example, in the Java API these are parseDelimitedFrom and writeDelimitedTo. However, my understanding is that the Python API does not currently support this utility, so you'd need to do the framing yourself :(
To summarize, you currently have:
{header - SessionLogs}
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
option 1 is:
{header - SessionLogs
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
}
option 2 is:
{length prefix of header}
{header - SessionLogs}
{length prefix of row0}
{row 0 - SessionLogsDetail}
{length prefix of row1}
{row 1 - SessionLogsDetail}
(where the length prefix is something simple like a raw varint, or just a 4-byte integer in some agreed endianness)
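A minimal sketch of option 2 on the Python side, assuming the 4-byte variant with little-endian byte order (the writer must use the same convention). write_frame and read_frames are hypothetical names; each yielded payload would then be handed to ParseFromString of the matching message type.

```python
import io
import struct


def write_frame(stream, payload):
    """Write one payload preceded by its length as a 4-byte little-endian integer."""
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield each payload from a stream of length-prefixed frames."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of stream
        if len(prefix) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<I", prefix)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated payload")
        yield payload


# Round trip with placeholder bytes standing in for serialized messages.
buffer = io.BytesIO()
for chunk in (b"header", b"row0", b"row1"):
    write_frame(buffer, chunk)
buffer.seek(0)
frames = list(read_frames(buffer))
```

The first frame would be parsed as SessionLogs and every following frame as SessionLogsDetail, so the reader never has to hold more than one row in memory.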

Related

python - asn1 parsed text to json

With text given in this link, need to extract data as follows
1) Each record starts with YYYY Mmm dd hh:mm:ss.ms, for example 2019 Aug 31 09:17:36.550
2) Each record has a header starting from line #1 above and ending with a blank line
3) The record data is contained in lines below Interpreted PDU:
4) The records of interest are the ones with record header first line having 0xB821 NR5G RRC OTA Packet -- RRC_RECONFIG
Is it possible to extract the selected record headers and the text described in #3 above as an array of nested JSON in the format below? It is snipped for brevity; I really need the entire text data as JSON.
data = [{"time": "2019 Aug 31 09:17:36.550", "PDU Number": "RRC_RECONFIG Message", "Physical Cell ID": 0, "rrc-TransactionIdentifier": 1, "criticalExtensions rrcReconfiguration": {"secondaryCellGroup": {"cellGroupId": 1, "rlc-BearerToAddModList": [{"logicalChannelIdentity": 1, "servedRadioBearer drb-Identity": 2, "rlc-Config am": {"ul-AM-RLC": {"sn-FieldLength": "size18", "t-PollRetransmit": "ms40", "pollPDU": "p32", "pollByte": "kB25", "maxRetxThreshold": "t32"}, "dl-AM-RLC": {"sn-FieldLength": "size18", "t-Reassembly": "ms40", "t-StatusProhibit": "ms20"}}}]}} }, next records data here]
Note that the input text is the parsed output of ASN.1 data specified in 3GPP 38.331 section 6.3.2. I'm not sure whether normal Python text parsing is the right way to handle this, or whether one should use something like the asn1tools library. If so, an example usage on this data would be helpful.

Unfortunately, it is unlikely that somebody will come up with a straight answer to your question (which is very similar to How to extract data from asn1 data file and load it into a dataframe?)
The text of your link is obviously a log file where ASN.1 value notation was used to make the messages human readable. So trying to decode these messages from their textual form is unusual and you will probably not find tooling for that.
In theory, the generic method would be this one:
1) Gather the ASN.1 DEFINITIONS (schema) that were used to create the ASN.1 messages
2) Compile these DEFINITIONS with an ASN.1 tool (aka compiler) to generate an object model in your favorite language (Python). The tool would provide the specific code to encode and decode; you would use the ASN.1 value decoders.
3) Add your custom code (either to the object model or plugged into the ASN.1 compiler) to encode your JSON objects
As you see, it is a very long shot (I can expand if this explanation is too short or unclear).
Unless your task is repetitive and/or the number of messages is big, try the methods you already know (manual search, regex) to search the log file.
If you want to see what it takes to create ASN.1 tools, you can find a few (not that many as ASN.1 is not particularly young and popular). Check out https://github.com/etingof/pyasn1 (python)
I created my own for fun in Java and I am adding the ASN.1 value decoders to illustrate my answer: https://github.com/yafred/asn1-tool (branch text-asn-value-support)
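For the regex route mentioned above, a first step could be to cut the log into records at each timestamp line and keep only the RRC_RECONFIG ones. This is only a sketch: SAMPLE is a made-up two-record excerpt that follows the layout described in the question, not the real log, and split_records / select_records are hypothetical helper names.

```python
import re

# A record starts with a line such as "2019 Aug 31 09:17:36.550".
TIMESTAMP = re.compile(
    r"^\d{4} [A-Z][a-z]{2} +\d{1,2} +\d{2}:\d{2}:\d{2}\.\d{3}", re.MULTILINE
)

# Hypothetical excerpt, shaped like the description in the question.
SAMPLE = (
    "2019 Aug 31 09:17:36.550 0xB821 NR5G RRC OTA Packet -- RRC_RECONFIG\n"
    "Physical Cell ID = 0\n"
    "\n"
    "Interpreted PDU:\n"
    "value ...\n"
    "2019 Aug 31 09:17:37.100 0xB0C0 Some Other Packet\n"
    "Physical Cell ID = 1\n"
)


def split_records(text):
    """Return the text of each record, cut at every timestamp line."""
    starts = [match.start() for match in TIMESTAMP.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def select_records(text, markers=("0xB821", "RRC_RECONFIG")):
    """Keep the records whose first header line contains every marker."""
    selected = []
    for record in split_records(text):
        first_line = record.splitlines()[0]
        if all(marker in first_line for marker in markers):
            selected.append(record)
    return selected
```

Turning the text under Interpreted PDU: into nested JSON is the hard part and is not attempted here.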

Given that you have a textual representation of the input data, you might take a look at the parse library. This allows you to find a pattern in a string and assign contents to variables.
Here is an example for extracting the time, PDU Number and Physical Cell ID data fields:
import parse

# Read the whole log into memory
with open('w9s2MJK4.txt', 'r') as f:
    text = f.read()

data = []
pattern = parse.compile('\n{year:d} {month:w} {day:d} {hour:d}:{min:d}:{sec:d}.{ms:d}{}Physical Cell ID = {pcid:d}{}PDU Number = {pdu:w} {pdutype:w}')
for s in pattern.findall(text):
    # Build one dictionary per matched record
    record = {}
    record['time'] = '{} {} {} {:02d}:{:02d}:{:02d}.{:03d}'.format(s.named['year'], s.named['month'], s.named['day'], s.named['hour'], s.named['min'], s.named['sec'], s.named['ms'])
    record['PDU Number'] = '{} {}'.format(s.named['pdu'], s.named['pdutype'])
    record['Physical Cell ID'] = s.named['pcid']
    data.append(record)
Since you have quite a complicated structure and a large number of data fields, this might become a bit cumbersome, but personally I would prefer this approach over regular expressions. Maybe there is also a smarter method to parse the date (which unfortunately seems not to have one of the standard formats supported by the library).
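On the date: one possibility, sketched here with the standard library rather than the parse library, is datetime.strptime with an explicit format string. parse_record_time is a hypothetical helper name, and %b matches English month abbreviations only under an English or C locale.

```python
from datetime import datetime


def parse_record_time(text):
    """Parse a record timestamp such as '2019 Aug 31 09:17:36.550'."""
    # %f accepts one to six fractional digits, so '.550' becomes 550000 microseconds.
    return datetime.strptime(text, "%Y %b %d %H:%M:%S.%f")


stamp = parse_record_time("2019 Aug 31 09:17:36.550")
```

The timestamp could then be captured as a single string field in the pattern and converted afterwards, instead of being reassembled from seven separate fields.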

JSON representation of the following nested data

I am writing code to empirically determine the state transition table from data generated by a natural process. I want to derive the states from the data, and then save the state data to HD, for later querying.
From the analysis I have done so far, the state information is nested, and the system has N (fixed at N=3 for simplicity) distinct states. Furthermore, each of these N states has its own fixed number of nested states (the number varies from one state to another).
This is the (pseudo YAML) schema I have come up with so far:
machine-state:
  frequency_1: state-info
  frequency_2: state-info
  frequency_3: state-info
state-info:
  classification_1:
    - classification_1_state_foo
    - classification_1_state_foobar
    - classification_1_state_foofoo
    - classification_1_state_foofoobar
    - classification_1_state_foobarfoo
  classification_2:
    - classification_2_state_name1
    - classification_2_state_name2
    - classification_2_state_name3
    - classification_2_state_name4
  classification_3:
    - classification_3_state_anothername
    - classification_3_state_anothername1
    - classification_3_state_anothername2
    - classification_3_state_anothername3
It seems the various classifications of the state machine (classification_*) can derive from an ABC. However, I'm not sure how to represent this tree structure in JSON, for simple querying etc.
I am using Python, and intend to store the JSON documents in PostgreSQL db as the backend - so I can query the JSON documents, so I can empirically build a state transition table from the stored data.
My question is, given the problem I'm trying to model (and the sample YAML above)- how may I best represent the data in a JSON model?

I don't see anything better than the most intuitive representation:
{
    "classification1": [
        "classification_1_state_foo",
        "classification_1_state_foobar",
        "classification_1_state_foofoo",
        ...
    ],
    "classification2": [
        ...
}
However, as we are talking about tree structure, maybe JSON is not the best choice here. If I may suggest a radical change in your approach, you might consider building this data in XML instead, saving the XML data in a file and using BeautifulSoup to query it. Example:
<classification>
<classification_state>classification_1_state_foo</classification_state>
<classification_state>classification_1_state_foobar</classification_state>
<classification_state>classification_1_state_foofoo</classification_state>
...
</classification>
...
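For the JSON layout suggested at the start of this answer, a small Python sketch of building, serializing, and querying such a document might look like the following. The frequency_* nesting is taken from the YAML in the question, only a few states are filled in, and states_for is a hypothetical helper.

```python
import json

# Partial document: one frequency, two classifications, a few states each.
machine_state = {
    "frequency_1": {
        "classification_1": [
            "classification_1_state_foo",
            "classification_1_state_foobar",
        ],
        "classification_2": [
            "classification_2_state_name1",
        ],
    },
}

# Serialize to JSON text (what would be stored) and load it back.
document = json.dumps(machine_state, indent=2)
restored = json.loads(document)


def states_for(doc, frequency, classification):
    """Return the list of states, or an empty list if either key is missing."""
    return doc.get(frequency, {}).get(classification, [])
```

Calling states_for(restored, "frequency_1", "classification_1") returns the two states listed above, and an unknown key simply gives an empty list.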

AMF serialization for python3

I am trying to write a python3 encoder/decoder for AMF.
The reason I'm doing it is because I didn't find a suitable library that works on python3 (I'm looking for a non-obtrusive library - one that will provide me with the methods and let me handle the gateway myself)
Available libraries I tested for Python are amfast, pyamf and amfy. While the first two are for Python 2 (several forks of pyamf suggest that they support Python 3, but I couldn't get them to work), amfy was designed for Python 3 but lacks some features that I need (specifically object serialization).
Reading through the specification of AMF0 and AMF3, I was able to add a package encoder/decoder but I stumbled on object serialization and the available documentation was not enough (would love to see some examples). Existing libraries were of no help either.
Using remoteObject (in flex), I managed to send the following request to my parser:
b'\x00\x03\x00\x00\x00\x01\x00\x04null\x00\x02/1\x00\x00\x00\xe0\n\x00\x00\x00\x01\x11
\n\x81\x13Mflex.messaging.messages.CommandMessage\x13operation\x1bcorrelationId\x13
timestamp\x11clientId\x15timeToLive\tbody\x0fheaders\x17destination\x13messageId\x04\x05
\x06\x01\x04\x00\x01\x04\x00\n\x0b\x01\x01\n\x05\tDSId\x06\x07nil%DSMessagingVersion\x04
\x01\x01\x06\x01\x06I03ACB769-9733-6A6C-0923-79F667AE8249'
(notice that newlines were introduced to make the request more readable)
The headers are parsed OK but when I get to the first object (\n near the end of the first line), it is marked as a reference (LSB = 0) while there is no other object it can reference to.
Am I reading this wrong? Is this a malformed bytes request?
Any help decoding these bytes will be welcomed.

From the AMF3 spec, section 4.1 NetConnection and AMF3:
The format of this messaging structure is AMF 0 (See [AMF0]). A context header value or message body can switch to AMF 3 encoding using the special avmplus-object-marker type.
What this means is that by default, the message body must be parsed as AMF0. Only when encountering an avmplus-object-marker (0x11) should you switch to AMF3. As a result, the 0x0a type marker in your value is not actually an AMF3 object-marker, but an AMF0 strict-array-marker.
Looking at section 2.12 Strict Array Type in the AMF0 spec, we can see that this type is simply defined as a u32 array-count, followed by that number of value-types.
In your data, the array-count is 0x00, 0x00, 0x00, 0x01 (i.e. 1), and the value following that has a type marker of 0x11 - which is the avmplus-object-marker mentioned above. Thus, only after starting to parse the AMF0 array contents should you actually switch to AMF3 to parse the following object.
In this case, the object then is an actual AMF3 object (type marker 0x0a), followed by a non-dynamic U29O-traits with 9 sealed members. But I'm sure you can take it from here. :)
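The walk through the bytes above can be checked with a short sketch. It only handles the start of the request quoted in the question (it assumes zero packet headers and stops after the U29O-traits), and read_utf8, read_u29 and describe_first_body are hypothetical helper names.

```python
import struct

AMF0_STRICT_ARRAY = 0x0A
AMF0_AVMPLUS_OBJECT = 0x11
AMF3_OBJECT = 0x0A

# Start of the request from the question, up to the first U29O-traits.
REQUEST_PREFIX = (
    b"\x00\x03\x00\x00\x00\x01\x00\x04null\x00\x02/1"
    b"\x00\x00\x00\xe0\n\x00\x00\x00\x01\x11\n\x81\x13"
)


def read_utf8(buf, pos):
    """Read an AMF0 UTF-8 string: u16 length followed by the bytes."""
    (length,) = struct.unpack_from(">H", buf, pos)
    start = pos + 2
    return buf[start:start + length].decode("utf-8"), start + length


def read_u29(buf, pos):
    """Read an AMF3 variable-length 29-bit unsigned integer."""
    value = 0
    for index in range(4):
        byte = buf[pos]
        pos += 1
        if index == 3:
            value = (value << 8) | byte  # the fourth byte carries 8 bits
            break
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            break
    return value, pos


def describe_first_body(buf):
    version, header_count, message_count = struct.unpack_from(">HHH", buf, 0)
    if header_count:
        raise NotImplementedError("this sketch does not skip packet headers")
    target_uri, pos = read_utf8(buf, 6)
    response_uri, pos = read_utf8(buf, pos)
    (body_length,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    # The body starts in AMF0: 0x0a here is a strict array, not an AMF3 object.
    if buf[pos] != AMF0_STRICT_ARRAY:
        raise ValueError("expected an AMF0 strict-array-marker")
    (array_count,) = struct.unpack_from(">I", buf, pos + 1)
    pos += 5
    # 0x11 switches to AMF3 for the value that follows.
    if buf[pos] != AMF0_AVMPLUS_OBJECT or buf[pos + 1] != AMF3_OBJECT:
        raise ValueError("expected avmplus-object-marker and an AMF3 object")
    traits, pos = read_u29(buf, pos + 2)
    return {
        "version": version,
        "target_uri": target_uri,
        "response_uri": response_uri,
        "body_length": body_length,
        "array_count": array_count,
        "dynamic": bool(traits & 0x08),
        "sealed_members": traits >> 4,
    }


info = describe_first_body(REQUEST_PREFIX)
```

The two traits bytes 0x81 0x13 decode to 147, whose low bits mark an inline, non-dynamic traits definition and whose remaining bits give the 9 sealed members.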

python can't make sense of c++ string sent over winsock

Goal:
I am writing a socket server/client program (c++ is the server, python is the client) to send xml strings that carry data. My goal is to be able to receive an xml message from c++ in Python via socket.
Method
VS2013 pro
Python 2.7.2 via Vizard 4.1
1) socket communication is created just fine, no problems. I can send/receive stuff
2) after communications are initialized, c++ begins creating xml objects using Cmarkup
3) c++ converts the xml object to std::string type
4) c++ sends the std::string over the stream to Python
Problem:
The "string" received in python from C++ is interpreted as garbage symbols (not trying to offend, someone may have strong feelings for them, I do not ;) that look like symbols you'd see in notepad if you opened a binary file. This is not surprising, since data sent over the stream is binary.
What I cannot figure out is how to get Python to make sense of the stream.
Failed Attempts to fix:
1) made sure that VS2013 project uses Unicode characters
2) tried converting stream to python string and decoding it string.decode()
3) tried using Unicode()
4) also tried using binascii() methods to get something useful, small improvement but still not the same characters I sent from c++
If anyone can lend some insight on why this is happening I'd be most grateful. I have read several forums about the way data is sent over sockets, but this aspect of encoding and decoding is still spam-mackerel-casserole to my mind.
Here's the server code that creates xml, converts to string, then sends
MCD_CSTR rootname("ROOT");//initializes name for root node
MCD_CSTR Framename("FRAME");//creates name for child node
CMarkup xml;//initializes xml object using Cmarkup method
xml.AddElem(rootname);//create the root node
xml.IntoElem();//move into it
xml.AddElem(Framename, MyClient.GetFrameNumber().FrameNumber);//create child node with data from elsewhere, FrameNumber is an int
CStringA strXML = xml.GetDoc();//convert the xml object to a string using Cmarkup method
std::string test(strXML);//convert the CstringA to a std::string type
std::cout << test << '\n';//verify that the xml as a string looks right
std::cout << typeid(test).name() << '\n';//make sure it is the right type
iSendResult = send(ClientSocket, (char *)&test, sizeof(test), 0);//send the string to the client
Here is the code to receive the xml string in Python:
while 1:
    data = s.recv(1024)  # receive the stream with a larger-than-required buffer
    print(data)  # see what is in there
    if not data:
        break  # if no data then stop listening for more

Since test is a string, this cannot work:
iSendResult = send(ClientSocket, (char *)&test, sizeof(test), 0);//send the string
The std::string is not a character array. It is an object, and all that line does is send nonsensical bytes to the socket. You want to send the data, not the object.
iSendResult = send(ClientSocket, test.c_str(), (int)test.length(), 0);//send the characters, not the object

You can't just write the memory at the location of a std::string and think that's serialization. Depending on how the C++ library implemented it, std::string is likely to be a structure containing a pointer to the actual character data. If you transmit the pointer, not only will you fail to send the character data, but the pointer value is meaningless in any other context than the current instance of the program.
Instead, serialize the important contents of the string. Send the length, then send the character data itself. Something like this.
uint32_t len = test.length();
send(..., &len, sizeof(uint32_t), ...);
send(..., test.c_str(), len, ...);
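On the Python side, the matching receiver could look roughly like this. It is a sketch that assumes the length arrives as a 4-byte little-endian unsigned integer (what an x86 Windows sender would produce for a raw uint32_t) and that the XML is UTF-8; recv_exact and recv_message are hypothetical names.

```python
import socket
import struct


def recv_exact(sock, count):
    """Read exactly count bytes; recv() may return fewer than requested."""
    chunks = []
    remaining = count
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read one length-prefixed message and decode it as text."""
    (length,) = struct.unpack("<I", recv_exact(sock, 4))
    return recv_exact(sock, length).decode("utf-8")
```

Looping on recv matters because TCP is a byte stream: a single recv(1024) may return only part of a message, or the end of one message together with the start of the next.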

In python construct library (for parsing binary data), how to group the rest of data as one field?

I am using Python construct library to parse Bluetooth protocols. The link of the library is here
As the protocols are really complex, I subdivided the parsing into multiple stages instead of building one gigantic construct. Right now I already parse the big raw data into this structure:
Container({'CRC': 'd\xcbT',
'CRC_OK': 1,
'Channel': 38,
'RSSI': 43,
'access_addr': 2391391958L,
'header': Container({'TxAdd': False, 'PDU_length': 34, 'PDU_Type': 'ADV_IND', 'RxAdd': False}),
'payload': '2\x15\x00a\x02\x00\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532'})
As you can see the length of the payload is indicated as PDU_length which is 34. The payload has the following structure:
[first 6 octets: AdvertAddress][the rest of data of 0-31 octets: AdvertData]
However, when I started to parse the payload as a standalone structure, I lost the length of 34 in the context of the construct of the payload. How can I make a construct that will parse the first 6 octets as AdvertAddress and group the rest of the data as AdvertData?
My current solution looks like this:
length = len(payload)  # I didn't use PDU_length, but len(payload) also gives me 34.
ADVERT_PAYLOAD = Struct("ADVERT_PAYLOAD",
    Field("AdvertAddress", 6),
    Field("AdvertData", length - 6),
)
print ADVERT_PAYLOAD.parse(payload)
This gives the correct output. But apparently not all payloads are of size 34. This method requires me to construct this ADVERT_PAYLOAD every time I need to parse a new payload.
I read the documentation many times but couldn't find anything related. There is no way for me to pass the length of the payload into the context of ADVERT_PAYLOAD, nor can it get the length of the argument passed into the parse method.
Maybe there is no solution to this problem. But then, how do most people parse such protocol data? As you go further into the payload, it subdivides into more types and you need more and smaller constructs to parse them. Should I build a parent construct, embedding smaller constructs which embed even smaller constructs? I can't imagine how to go about building such a big thing.
Thanks in advance.

GreedyRange will get a list of char, and JoinAdapter will join all the char together:
from construct import Adapter, Field, GreedyRange, Struct

class JoinAdapter(Adapter):
    # Join the list of single characters back into one string
    def _decode(self, obj, context):
        return "".join(obj)

ADVERT_PAYLOAD = Struct("ADVERT_PAYLOAD",
    Field("AdvertAddress", 6),
    JoinAdapter(GreedyRange(Field("AdvertData", 1)))
)
payload = '2\x15\x00a\x02\x00\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532'
print ADVERT_PAYLOAD.parse(payload)
output:
Container:
AdvertAddress = '2\x15\x00a\x02\x00'
AdvertData = '\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532'
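To make the intended split explicit without depending on a particular construct version, here is a plain standard-library sketch of the same "fixed prefix, then everything else" idea. AdvertPayload and parse_advert_payload are hypothetical names; as far as I know, newer construct releases also offer GreedyBytes for the "rest of the data" part, but that is not used here.

```python
from collections import namedtuple

AdvertPayload = namedtuple("AdvertPayload", ["AdvertAddress", "AdvertData"])


def parse_advert_payload(payload):
    """Split a payload into a 6-octet address and the remaining 0-31 octets."""
    if len(payload) < 6:
        raise ValueError("payload is shorter than the 6-octet AdvertAddress")
    return AdvertPayload(payload[:6], payload[6:])


# The payload from the question, written as bytes.
payload = (
    b"2\x15\x00a\x02\x00\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04"
    b"\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532"
)
parsed = parse_advert_payload(payload)
```

Because the slice takes whatever is left, the same function works for every payload length and nothing has to be rebuilt per packet.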
