With text given in this link, need to extract data as follows
Each record starts with YYYY Mmm dd hh:mm:ss.ms, for example 2019 Aug 31 09:17:36.550
Each record has a header starting from line #1 above and ending with a blank line
The record data is contained in lines below Interpreted PDU:
The records of interest are the ones with record header first line having 0xB821 NR5G RRC OTA Packet -- RRC_RECONFIG
Is it possible to extract selected record headers and text below #3 above as an array of nested json in the format as below - snipped for brevity, really need to have the entire text data as JSON.
data = [{"time": "2019 Aug 31 09:17:36.550", "PDU Number": "RRC_RECONFIG Message", "Physical Cell ID": 0, "rrc-TransactionIdentifier": 1, "criticalExtensions rrcReconfiguration": {"secondaryCellGroup": {"cellGroupId": 1, "rlc-BearerToAddModList": [{"logicalChannelIdentity": 1, "servedRadioBearer drb-Identity": 2, "rlc-Config am": {"ul-AM-RLC": {"sn-FieldLength": "size18", "t-PollRetransmit": "ms40", "pollPDU": "p32", "pollByte": "kB25", "maxRetxThreshold": "t32"}, "dl-AM-RLC": {"sn-FieldLength": "size18", "t-Reassembly": "ms40", "t-StatusProhibit": "ms20"}}}]}} }, next records data here]
Note that the input text is parsed output of ASN1 data specifications in 3GPP 38.331 section 6.3.2. I'm not sure normal python text parsing is the right way to handle this or should one use something like asn1tools library ? If so an example usage on this data would be helpful.
Unfortunately, it is unlikely that somebody will come with a straight answer to your question (which is very similar to How to extract data from asn1 data file and load it into a dataframe?)
The text of your link is obviously a log file where ASN.1 value notation was used to make the messages human readable. So trying to decode these messages from their textual form is unusual and you will probably not find tooling for that.
In theory, the generic method would be this one:
Gather the ASN.1 DEFINITIONS (schema) that were used to create the ASN.1 messages
Compile these DEFINITIONS with an ASN.1 tool (aka compiler) to generate an object model in your favorite language (python). The tool would provide the specific code to encode and decode ... you would use ASN.1 values decoders.
Add your custom code (either to the object model or plugged in the ASN.1 compiler) to encode your JSON objects
As you see, it is a very long shot (I can expand if this explanation is too short or unclear)
Unless your task is repetivite and/or the number of messages is big, try the methods you already know (manual search, regex) to search the log file.
If you want to see what it takes to create ASN.1 tools, you can find a few (not that many as ASN.1 is not particularly young and popular). Check out https://github.com/etingof/pyasn1 (python)
I created my own for fun in Java and I am adding the ASN.1 value decoders to illustrate my answer: https://github.com/yafred/asn1-tool (branch text-asn-value-support)
Given that you have a textual representation of the input data, you might take a look at the parse library. This allows you to find a pattern in a string and assign contents to variables.
Here is an example for extracting the time, PDU Number and Physical Cell ID data fields:
import parse
with open('w9s2MJK4.txt', 'r') as f:
input = f.read()
data = []
pattern = parse.compile('\n{year:d} {month:w} {day:d} {hour:d}:{min:d}:{sec:d}.{ms:d}{}Physical Cell ID = {pcid:d}{}PDU Number = {pdu:w} {pdutype:w}')
for s in pattern.findall(input):
record = {}
record['time'] = '{} {} {} {:02d}:{:02d}:{:02d}.{:03d}'.format(s.named['year'], s.named['month'], s.named['day'], s.named['hour'], s.named['min'], s.named['sec'], s.named['ms'])
record['PDU Number'] = '{} {}'.format(s.named['pdu'], s.named['pdutype'])
record['Physical Cell ID'] = s.named['pcid']
data.append(record)
Since you have quite a complicated structure and a large number of data fields, this might become a bit cumbersome, but personally I would prefer this approach over regular expressions. Maybe there is also a smarter method to parse the date (which unfortunately seems not to have one of the standard formats supported by the library).
Related
I am working on deserializing a log file that has been serialized in C using protocol buffers (and NanoPB).
The log file has a short header composed of: entity, version, and identifier. After the header, the stream of data should be continuous and it should log the fields from the sensors but not the header values (this should only occur once and at the beginning).The same .proto file was used to serialize the file. I do not have separate .proto files for the header and for the streamed data.
After my implementation, I assume it should look like this:
firmware "1.0.0"
GUID "1231214211321" (example)
Timestamp 123123
Sens1 2343
Sens2 13123
Sens3 13443
Sens4 1231
Sens5 190
Timestamp 123124
Sens1 2345
Sens2 2312
...
I posted this question to figure out how to structure the .proto file initially, when I was implementing the serialization in C. And in the end I used a similar approach but did no include the: [(nanopb).max_count = 1];
Finally I opted with the following .proto in Python (There can be more sensors than 5):
syntax = "proto3";
import "timestamp.proto";
message SessionLogs {
int32 Entity = 1;
string Version = 2;
string GUID = 3;
repeated SessionLogsDetail LogDetail = 4;
}
message SessionLogsDetail
{
int32 DataTimestamp = 1; // internal counter to identify the order of session logs
// Sensor data, there can be X amount of sensors.
int32 sens1 = 2;
int32 sens2= 3;
int32 sens3= 4;
int32 sens4= 5;
}
At this point, I can serialize a message as I log with my device and according to the file size, the log seems to work, but I have not been able to deserialize it on Python offline to check if my implementation has been correct. And I can't do it in C since its an embedded application and I want to do the post-processing offline with Python.
Also, I have checked this online protobuf deserializer where I can pass the serialized file and get it deserialized without the need of the .proto file. In it I can see the header values (field 3 is empty so its not seen) and the logged information. So this makes me think that the serialization is correct but I am deserializing it wrongly on Python.
This is my current code used to deserialize the message in Python:
import PSessionLogs_pb2
with open('$PROTOBUF_LOG_FILENAME$', 'rb') as f:
read_metric = PSessionLogs_pb2.PSessionLogs()
read_metric.ParseFromString(f.read())
Besides this, I've used protoc to generate the .py equivalent of the .proto file to deserialize offline.
It looks like you've serialized a header, then serialized some other data immediately afterwards, meaning: instead of serializing a SessionLogs that has some SessionLogsDetail records, you've serialized a SessionLogs, and then you've serialized (separately) a SessionLogsDetail - does that sound about right? if so: yes, that will not work correctly; there are ways to do what you're after, but it isn't quite as simple as just serializing one after the other, because the root protobuf object is never terminated; so what actually happens is that it overwrites the root object with later fields by number.
There's two ways of addressing this, depending on the data volume. If the size (including all of the detail rows) is small, you can just change the code so that it is a true parent / child relationship, i.e. so that the rows are all inside the parent. When writing the data, this does not mean that you need to have all the rows before you start writing - there are ways of making appending child rows so that you are sending data as it becomes available; however, when deserializing, it will want to load everything in one go, so this approach is only useful if you're OK with that, i.e. you don't have obscene open-ended numbers of rows.
If you have large numbers of rows, you'll need to add your own framing, essentially. This is often done by adding a length-prefix between each payload, so that you can essentially read a single message at a time. Some of the libraries include helper methods for this; for example, in the java API this is parseDelimitedFrom and writeDelimitedTo. However, my understand is that the python API does not currently support this utility, so you'd need to do the framing yourself :(
To summarize, you currently have:
{header - SessionLogs}
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
option 1 is:
{header - SessionLogs
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
}
option 2 is:
{length prefix of header}
{header - SessionLogs}
{length prefix of row0}
{row 0 - SessionLogsDetail}
{length prefix of row1}
{row 1 - SessionLogsDetail}
(where the length prefix is something simple like a raw varint, or just a 4-byte integer in some agreed endianness)
I'm setting up integration between a webflow store and shippo to assist with creating labels and managing shipping. Webflow passes the data as one huge object for address information, however to create a new order in shippo, I need the information parsed, separated as individual line items. I have attempted to use formatter which allows one to extract text, split text, use regex to match data and more.
import re
details = re.search(r'(?<=city:\s).*$', input_data[All Addresses])
Regex in Python is my best option, yet the result will not find and/or display the data.
Please any experts in Zapier integrations, I need assistance in figuring out a way to parse the incoming data from webflow, pass it to the 'create a order' action with shippo.
Structure of Data:
addressee: string
city: string
country: string
more....
You can try this one:
Combine all the data in one whole string
import re
details = re.finall(r'(?<=city:\s).*$', all_addresses)
return details
It will you give the list of all matches in the text.
I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, lets take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so i can associate the IndexMap numbers to the variationValues one.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_frist()
import re
# capture everything between {}
data = re.findall(script, '(\{.+?\}_')
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is it there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimentionIndexMap? (maybe using some regular expressions in the middle to get some data out of a big string). Or explain a little bit what happens with the json part.
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is if there are any errors when accessing a particular "box" inside you json object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slash a little bit so you can better understand the structure of your particular json file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in an look at structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to json as use them combine as you wish.
I am using Python construct library to parse Bluetooth protocols. The link of the library is here
As the protocols are really complex, I subdivided the parsing into multiple stages instead of building one giganic construct. Right now I already parse the big raw data into this structure:
Container({'CRC': 'd\xcbT',
'CRC_OK': 1,
'Channel': 38,
'RSSI': 43,
'access_addr': 2391391958L,
'header': Container({'TxAdd': False, 'PDU_length': 34, 'PDU_Type': 'ADV_IND', 'RxAdd': False}),
'payload': '2\x15\x00a\x02\x00\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532'})
As you can see the length of the payload is indicated as PDU_length which is 34. The payload has the following structure:
[first 6 octets: AdvertAddress][the rest of data of 0-31 octets: AdvertData]
However, when I started to parse the payload as a standalone structure, I lost the length of 34 in the context of the construct of the payload. How can I make a construct that will parse the first 6 octects as AdvertAddress and group the rest of data as AdvertData?
My current solution looks like this:
length = len(payload) #I didn't use PDU_length but len(payload) gives me back 34 also.
ADVERT_PAYLOAD = Struct("ADVERT_PAYLOAD",
Field("AdvertAddress",6),
Field("AdvertData",length-6),
)
print ADVERT_PAYLOAD.parse(payload)
This gives the correct output. But apparently not all payloads are of size 34. This method requires me to construct this ADVERT_PAYLOAD eveytime I need to parse a new payload.
I read the documentations many times but couldn't find anything related. There is neither a way for me to pass the knowledge of the length of the payload into the context of ADVERT_PAYLOAD, nor is it able to get the length of the argument passed into the parse method.
Maybe there is no solutions to this problem. But then, how do most people parse such protocol data? As you go further into the payload, it subdivides into more types and you need more more smaller constructs to parse them. Should I build a parent construct, embedding smaller constructs which embed even smaller constructs? I can't imagine how to go about building such a big thing.
Thanks in advance.
GreedyRange will get a list of char, and JoinAdapter will join all the char together:
class JoinAdapter(Adapter):
def _decode(self, obj, context):
return "".join(obj)
ADVERT_PAYLOAD = Struct("ADVERT_PAYLOAD",
Field("AdvertAddress",6),
JoinAdapter(GreedyRange(Field("AdvertData", 1)))
)
payload = '2\x15\x00a\x02\x00\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532'
print ADVERT_PAYLOAD.parse(payload)
output:
Container:
AdvertAddress = '2\x15\x00a\x02\x00'
AdvertData = '\x02\x01\x06\x07\x03\x03\x18\x02\x18\x04\x18\x03\x19\x00\x02\x02\n\xfe\t\tAS-D1532'
Im working on a data packet retrieval system which will take a packet, and process the various parts of the packet, based on a system of tags [similar to HTML tags].
[text based files only, no binary files].
Each part of the packet is contained between two identical tags, and here is a sample packet:
"<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
The entire packet is contained within the <PACKET><PACKET> tags.
All meta-data is contained within the <HEAD><HEAD> tags and the filename from which the packet is part of is contained within the, you guessed it, the <FILENAME><FILENAME> tags.
Lets say, for example, a single packet is received and stored in a temporary string variable called sTemp.
How do you efficiently retrieve, for example, only the contents of a single pair of tags, for example the contents of the <FILENAME><FILENAME> tags?
I was hoping for such functionality as saying getTagFILENAME( packetX ), which would return the textual string contents of the <FILENAME><FILENAME> tags of the packet.
Is this possible using Python?
Any suggestions or comments appreciated.
If the packet format effectively uses XML-looking syntax (i.e., if the "closing tags" actually include a slash), the xml.etree.ElementTree could be used.
This libray is part of Python Standard Library, starting in Py2.5. I find it a very convenient one to deal with this kind of data. It provides many ways to read and to modify this kind of tree structure. Thanks to the generic nature of XML languages and to the XML awareness built-in the ElementTree library, the packet syntax could evolve easily for example to support repeating elements, element attributes.
Example:
>>> import xml.etree.ElementTree
>>> myPacket = '<PACKET><HEAD><ID>123</ID><SEQ>1</SEQ><FILENAME>Test99.txt</FILE
NAME></HEAD><DATA>spam and cheese</DATA></PACKET>'
>>> xt = xml.etree.ElementTree.fromstring(myPacket)
>>> wrk_ele = xt.find('HEAD/FILENAME')
>>> wrk_ele.text
'Test99.txt'
>>>
Something like this?
import re
def getPacketContent ( code, packetName ):
match = re.search( '<' + packetName + '>(.*?)<' + packetName + '>', code )
return match.group( 1 ) if match else ''
# usage
code = "<PACKET><HEAD><ID><ID><SEQ><SEQ><FILENAME><FILENAME><HEAD><DATA><DATA><PACKET>"
print( getPacketContent( code, 'HEAD' ) )
print( getPacketContent( code, 'SEQ' ) )
As mjv points out, there's not the least sense in inventing an XML-like format if you can just use XML.
But: If you're going to use XML for your packet format, you need to really use XML for it. You should use an XML library to create your packets, not just to parse them. Otherwise you will come to grief the first time one of your field values contains an XML markup character.
You can, of course, write your own code to do the necessary escaping, filter out illegal characters, guarantee well-formedness, etc. For a format this simple, that may be all you need to do. But going down that path is a way to learn things about XML that you perhaps would rather not have to learn.
If using an XML library to create your packets is a problem, you're probably better off defining a custom format (and I'd define one that didn't look anything like XML, to keep people from getting ideas) and building a parser for it using pyparsing.