JSON representation of the following nested data - python

I am writing code to empirically determine the state transition table from data generated by a natural process. I want to derive the states from the data, and then save the state data to disk for later querying.
From the analysis I have done so far, the state information is nested, and the system has N (fixed at N=3 for simplicity) distinct states. Furthermore, each of these N states has its own fixed set of nested states (the number of nested states varies from state to state).
This is the (pseudo YAML) schema I have come up with so far:
machine-state:
  frequency_1: state-info
  frequency_2: state-info
  frequency_3: state-info

state-info:
  classification_1:
    - classification_1_state_foo
    - classification_1_state_foobar
    - classification_1_state_foofoo
    - classification_1_state_foofoobar
    - classification_1_state_foobarfoo
  classification_2:
    - classification_2_state_name1
    - classification_2_state_name2
    - classification_2_state_name3
    - classification_2_state_name4
  classification_3:
    - classification_3_state_anothername
    - classification_3_state_anothername1
    - classification_3_state_anothername2
    - classification_3_state_anothername3
It seems the various classifications of the state machine (classification_*) can derive from an ABC. However, I'm not sure how to represent this tree structure in JSON, for simple querying etc.
I am using Python, and intend to store the JSON documents in a PostgreSQL database as the backend, so that I can query the documents and empirically build a state transition table from the stored data.
My question is: given the problem I'm trying to model (and the sample YAML above), how may I best represent the data as JSON?

I don't see anything better than the most intuitive representation:
{
    "classification1": [
        "classification_1_state_foo",
        "classification_1_state_foobar",
        "classification_1_state_foofoo",
        ...
    ],
    "classification2": [
        ...
    ]
}
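If you do keep JSON, PostgreSQL's jsonb type lets you query into the nested structure directly. Below is a minimal sketch, assuming psycopg2; the table name (machine_states), column name (doc) and connection string are made up for illustration, not taken from your setup:
import psycopg2
from psycopg2.extras import Json

# Connection parameters and table/column names below are illustrative only.
conn = psycopg2.connect("dbname=statedb user=postgres")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS machine_states (
        id  serial PRIMARY KEY,
        doc jsonb NOT NULL
    )
""")

# One observed machine-state document, keyed by frequency as in your schema.
state_doc = {
    "frequency_1": {
        "classification_1": ["classification_1_state_foo"],
        "classification_2": ["classification_2_state_name1"],
    },
    # frequency_2 / frequency_3 omitted for brevity
}
cur.execute("INSERT INTO machine_states (doc) VALUES (%s)", (Json(state_doc),))

# Find every stored document where frequency_1 contains a given classification_1 state.
cur.execute("""
    SELECT id, doc
    FROM machine_states
    WHERE doc -> 'frequency_1' -> 'classification_1' ? %s
""", ("classification_1_state_foo",))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()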
However, as we are talking about a tree structure, maybe JSON is not the best choice here. If I may suggest a radical change in your approach, you might consider building this data in XML instead, saving the XML data in a file and using BeautifulSoup to query it. Example:
<classification>
    <classification_state>classification_1_state_foo</classification_state>
    <classification_state>classification_1_state_foobar</classification_state>
    <classification_state>classification_1_state_foofoo</classification_state>
    ...
</classification>
...
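And a rough sketch of querying that XML with BeautifulSoup (this assumes the lxml parser is installed; the file name is made up):
from bs4 import BeautifulSoup

# "machine_state.xml" is a placeholder name for a file containing the markup above.
with open("machine_state.xml") as f:
    soup = BeautifulSoup(f.read(), "xml")

for classification in soup.find_all("classification"):
    # Collect the text of every <classification_state> under this element.
    states = [s.get_text() for s in classification.find_all("classification_state")]
    print(states)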

Related

How can I extract the data from these strings?

I am making a program that scrapes data from a job page, and I end up with this data:
{"job":{"ciphertext":"~01142b81f148312a7c","rid":225177647,"uid":"1416152499115024384","type":2,"access":4,"title":"Need app developers to handle our app upgrades","status":1,"category":{"name":"Mobile Development","urlSlug":"mobile-development"
,"contractorTier":2,"description":"We have an app currently built, we are looking for someone to \n\n1) Manage the app for bugs etc \n2) Provide feature upgrades \n3) Overall Management and optimization \n\nPlease get in touch and i will share more details. ","questions":null,"qualifications":{"type":0,"location":null,"minOdeskHours":0,"groupRecno":0,"shouldHavePortfolio":false,"tests":null,"minHoursWeek":40,"group":null,"prefEnglishSkill":0,"minJobSuccessScore":0,"risingTalent":true,"locationCheckRequired":false,"countries":null,"regions":null,"states":null,"timezones":null,"localMarket":false,"onSiteType":null,"locations":null,"localDescription":null,"localFlexibilityDescription":null,"earnings":null,"languages":null
],"clientActivity":{"lastBuyerActivity":null,"totalApplicants":0,"totalHired":0,"totalInvitedToInterview":0,"unansweredInvites":0,"invitationsSent":0
,"buyer":{"isPaymentMethodVerified":false,"location":{"offsetFromUtcMillis":14400000,"countryTimezone":"United Arab Emirates (UTC+04:00)","city":"Dubai","country":"United Arab Emirates"
,"stats":{"totalAssignments":31,"activeAssignmentsCount":3,"feedbackCount":27,"score":4.9258937139,"totalJobsWithHires":30,"hoursCount":7.16666667,"totalCharges":{"currencyCode":"USD","amount":19695.83
,"jobs":{"postedCount":59,"openCount":2
,"avgHourlyJobsRate":{"amount":19.999534874418824
But the problem is that the only data I need is:
- Title
- Description
- Customer activity (lastBuyerActivity, totalApplicants, totalHired, totalInvitedToInterview, unansweredInvites, invitationsSent)
- Buyer (isPaymentMethodVerified, location (country))
- stats (all items)
- jobs (all items)
- avgHourlyJobsRate
This sort of data is JSON, and Python represents it with the dictionary data type.
Suppose you have your data stored in a string. You could use di = eval(myData) to convert the string to a dictionary (exec does not return a value, so it cannot be used here, and even eval will choke on JSON literals such as true, false and null). Then you can access the structured data like di["job"], which returns the job section of the data.
di = eval(myData)
print(di["job"])
However, this is just a hack and it is not recommended, because it's messy and unpythonic.
The appropriate way is to use the json library to convert the data to a dictionary. Take a look at the code snippet below to get an idea of the appropriate way:
import json

myData = "Put your data here"
res = json.loads(myData)
print(res["job"])
Convert the data to a dictionary using json.loads; then you can easily use the dictionary keys you want to look up or filter the data.
This seems to be a dictionary, so you can extract something from it by doing, for example, dictionary["job"]["uid"]. If it is a JSON file, convert the data to a Python dictionary first.
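Putting it together for the specific fields you listed, a minimal sketch (this assumes the full, untruncated response is valid JSON held in myData; because the snippet in the question is cut off, the exact nesting of some keys is a guess):
import json

data = json.loads(myData)      # myData: the full JSON string scraped from the page
job = data["job"]

# NOTE: the paths for description, stats, jobs and avgHourlyJobsRate are assumptions
# based on the truncated snippet - adjust them if the real nesting differs.
wanted = {
    "title": job["title"],
    "description": job["description"],
    "clientActivity": job["clientActivity"],
    "isPaymentMethodVerified": job["buyer"]["isPaymentMethodVerified"],
    "country": job["buyer"]["location"]["country"],
    "stats": job["buyer"]["stats"],
    "jobs": job["buyer"]["jobs"],
    "avgHourlyJobsRate": job["buyer"]["avgHourlyJobsRate"],
}
print(wanted)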

Deserializing a Streamed Protocol Buffer Message With Header and Repeated fields

I am working on deserializing a log file that has been serialized in C using protocol buffers (and NanoPB).
The log file has a short header composed of: entity, version, and identifier. After the header, the stream of data should be continuous; it should log the fields from the sensors but not the header values (those should occur only once, at the beginning). The same .proto file was used to serialize the file. I do not have separate .proto files for the header and for the streamed data.
After my implementation, I assume it should look like this:
firmware "1.0.0"
GUID "1231214211321" (example)
Timestamp 123123
Sens1 2343
Sens2 13123
Sens3 13443
Sens4 1231
Sens5 190
Timestamp 123124
Sens1 2345
Sens2 2312
...
I posted this question to figure out how to structure the .proto file initially, when I was implementing the serialization in C. In the end I used a similar approach, but did not include the: [(nanopb).max_count = 1];
Finally I opted for the following .proto, from which I generate the Python code (there can be more sensors than the ones shown):
syntax = "proto3";

import "timestamp.proto";

message SessionLogs {
    int32 Entity = 1;
    string Version = 2;
    string GUID = 3;
    repeated SessionLogsDetail LogDetail = 4;
}

message SessionLogsDetail {
    int32 DataTimestamp = 1; // internal counter to identify the order of session logs

    // Sensor data; there can be X amount of sensors.
    int32 sens1 = 2;
    int32 sens2 = 3;
    int32 sens3 = 4;
    int32 sens4 = 5;
}
At this point, I can serialize messages as I log with my device, and judging by the file size the logging seems to work, but I have not been able to deserialize it offline in Python to check whether my implementation is correct. And I can't do it in C, since it's an embedded application and I want to do the post-processing offline with Python.
Also, I have checked this online protobuf deserializer, where I can pass the serialized file and get it deserialized without needing the .proto file. In it I can see the header values (field 3 is empty, so it's not shown) and the logged information. This makes me think that the serialization is correct but that I am deserializing it incorrectly in Python.
This is my current code used to deserialize the message in Python:
import PSessionLogs_pb2

with open('$PROTOBUF_LOG_FILENAME$', 'rb') as f:
    read_metric = PSessionLogs_pb2.PSessionLogs()
    read_metric.ParseFromString(f.read())
Besides this, I've used protoc to generate the .py equivalent of the .proto file to deserialize offline.
It looks like you've serialized a header, then serialized some other data immediately afterwards. That is: instead of serializing one SessionLogs that contains some SessionLogsDetail records, you've serialized a SessionLogs and then separately serialized a SessionLogsDetail - does that sound about right? If so: yes, that will not work correctly. There are ways to do what you're after, but it isn't quite as simple as serializing one message after the other, because the root protobuf object is never terminated; what actually happens is that the later fields overwrite the root object's fields by number.
There are two ways of addressing this, depending on the data volume. If the size (including all of the detail rows) is small, you can just change the code so that it is a true parent/child relationship, i.e. so that the rows are all inside the parent. When writing the data, this does not mean that you need to have all the rows before you start writing - there are ways of appending child rows so that you send data as it becomes available; however, when deserializing, it will want to load everything in one go, so this approach is only useful if you're OK with that, i.e. you don't have obscene open-ended numbers of rows.
If you have large numbers of rows, you'll essentially need to add your own framing. This is often done by adding a length prefix between each payload, so that you can read a single message at a time. Some of the libraries include helper methods for this; for example, in the Java API these are parseDelimitedFrom and writeDelimitedTo. However, my understanding is that the Python API does not currently support this utility, so you'd need to do the framing yourself :(
To summarize, you currently have:
{header - SessionLogs}
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
option 1 is:
{header - SessionLogs
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}
}
option 2 is:
{length prefix of header}
{header - SessionLogs}
{length prefix of row0}
{row 0 - SessionLogsDetail}
{length prefix of row1}
{row 1 - SessionLogsDetail}
(where the length prefix is something simple like a raw varint, or just a 4-byte integer in some agreed endianness)
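If you go with option 2 and, say, a 4-byte little-endian length prefix, the reading side of the framing in Python might look roughly like this. This is only a sketch: it assumes the generated module exposes SessionLogs and SessionLogsDetail as defined in the .proto above, and that the writer prefixes every message (header and rows) the same way:
import struct
import PSessionLogs_pb2

def read_message(f, msg_type):
    """Read one length-prefixed message, or return None at end of file."""
    prefix = f.read(4)
    if len(prefix) < 4:
        return None
    (length,) = struct.unpack("<I", prefix)   # 4-byte little-endian length prefix
    msg = msg_type()
    msg.ParseFromString(f.read(length))
    return msg

with open('$PROTOBUF_LOG_FILENAME$', 'rb') as f:
    header = read_message(f, PSessionLogs_pb2.SessionLogs)          # header first
    print(header)
    while True:
        row = read_message(f, PSessionLogs_pb2.SessionLogsDetail)   # then the rows
        if row is None:
            break
        print(row.DataTimestamp, row.sens1, row.sens2, row.sens3, row.sens4)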

python - asn1 parsed text to json

With the text given in this link, I need to extract data as follows:
1. Each record starts with YYYY Mmm dd hh:mm:ss.ms, for example 2019 Aug 31 09:17:36.550
2. Each record has a header starting from line #1 above and ending with a blank line
3. The record data is contained in the lines below "Interpreted PDU:"
4. The records of interest are the ones whose header's first line contains 0xB821 NR5G RRC OTA Packet -- RRC_RECONFIG
Is it possible to extract the selected record headers and the text below #3 above as an array of nested JSON in the format below? (Snipped for brevity - I really need to have the entire text data as JSON.)
data = [{"time": "2019 Aug 31 09:17:36.550", "PDU Number": "RRC_RECONFIG Message", "Physical Cell ID": 0, "rrc-TransactionIdentifier": 1, "criticalExtensions rrcReconfiguration": {"secondaryCellGroup": {"cellGroupId": 1, "rlc-BearerToAddModList": [{"logicalChannelIdentity": 1, "servedRadioBearer drb-Identity": 2, "rlc-Config am": {"ul-AM-RLC": {"sn-FieldLength": "size18", "t-PollRetransmit": "ms40", "pollPDU": "p32", "pollByte": "kB25", "maxRetxThreshold": "t32"}, "dl-AM-RLC": {"sn-FieldLength": "size18", "t-Reassembly": "ms40", "t-StatusProhibit": "ms20"}}}]}} }, next records data here]
Note that the input text is the parsed output of ASN.1 data following the specifications in 3GPP 38.331 section 6.3.2. I'm not sure whether normal Python text parsing is the right way to handle this, or whether one should use something like the asn1tools library? If so, an example usage on this data would be helpful.
Unfortunately, it is unlikely that somebody will come up with a straight answer to your question (which is very similar to How to extract data from asn1 data file and load it into a dataframe?).
The text of your link is obviously a log file where ASN.1 value notation was used to make the messages human readable. So trying to decode these messages from their textual form is unusual and you will probably not find tooling for that.
In theory, the generic method would be this one:
1. Gather the ASN.1 DEFINITIONS (schema) that were used to create the ASN.1 messages.
2. Compile these DEFINITIONS with an ASN.1 tool (aka compiler) to generate an object model in your favorite language (Python). The tool would provide the specific code to encode and decode; here you would use the ASN.1 value decoders.
3. Add your custom code (either to the object model or plugged into the ASN.1 compiler) to encode your JSON objects.
As you see, it is a very long shot (I can expand if this explanation is too short or unclear)
Unless your task is repetitive and/or the number of messages is big, try the methods you already know (manual search, regex) to search the log file.
If you want to see what it takes to create ASN.1 tools, you can find a few out there (not that many, as ASN.1 is neither particularly young nor particularly popular). Check out https://github.com/etingof/pyasn1 (Python).
I created my own for fun in Java and I am adding the ASN.1 value decoders to illustrate my answer: https://github.com/yafred/asn1-tool (branch text-asn-value-support)
Given that you have a textual representation of the input data, you might take a look at the parse library. This allows you to find a pattern in a string and assign contents to variables.
Here is an example for extracting the time, PDU Number and Physical Cell ID data fields:
import parse

with open('w9s2MJK4.txt', 'r') as f:
    input = f.read()

data = []
pattern = parse.compile('\n{year:d} {month:w} {day:d} {hour:d}:{min:d}:{sec:d}.{ms:d}{}Physical Cell ID = {pcid:d}{}PDU Number = {pdu:w} {pdutype:w}')
for s in pattern.findall(input):
    record = {}
    record['time'] = '{} {} {} {:02d}:{:02d}:{:02d}.{:03d}'.format(s.named['year'], s.named['month'], s.named['day'], s.named['hour'], s.named['min'], s.named['sec'], s.named['ms'])
    record['PDU Number'] = '{} {}'.format(s.named['pdu'], s.named['pdutype'])
    record['Physical Cell ID'] = s.named['pcid']
    data.append(record)
Since you have quite a complicated structure and a large number of data fields, this might become a bit cumbersome, but personally I would prefer this approach over regular expressions. Maybe there is also a smarter method to parse the date (which unfortunately seems not to have one of the standard formats supported by the library).
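To end up with the array of nested JSON the question asks for, the collected records can then be written out with the standard json module (the output file name here is arbitrary):
import json

with open('records.json', 'w') as out:
    json.dump(data, out, indent=2)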

How to get all text inside XML tags

[xml file snapshot]
From the above .xml file I am extracting article-id, article-title, abstract and keywords. For normal text inside a single tag I am getting correct results, but not for text spread across multiple tags, such as:
<title-group>
  <article-title>
    Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
    <italic>Rapidithrix thailandica</italic>
  </article-title>
</title-group>
...
and the same happens for the abstract.
I got output as:
OrderedDict([(u'italic', u'Rapidithrix thailandica'), ('#text', u'Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,')])
The code has treated the tag as if it were text, and the generated output is also not in the original order.
How can I simply extract the text from such an input document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform the above task:
import xmltodict
import os
from os.path import basename
import re

with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())

pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii', 'ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii', 'ignore') for g in z]
article_keywords = ",".join(zz).replace(",", " ")
fout = open(str(pmc_id) + ".txt", "w")
fout.write(str(pmc_id) + "\n" + str(article_title) + ". " + str(article_abstract) + ". " + str(article_keywords))
Can somebody please suggest corrections?
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or recent versions of BeautifulSoup. In many such libraries you just load the document with one call and then call some function like innerText() to get all of its text content. You could even just load the document into a browser and use the JavaScript innerText property to get what you want. If the tool you choose doesn't provide innertext() already, it is:
from xml.dom.minidom import Text, Element

def innertext(node):
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            t += innertext(curNode)
    return t
You could tweak that to put spaces between the text nodes, depending on your data.
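For example, applied to the title-group snippet from the question (a minimal usage sketch with xml.dom.minidom; the file name is the one from the question's own code):
from xml.dom.minidom import parse

doc = parse('2630847.nxml')
title = doc.getElementsByTagName('article-title')[0]
# Prints roughly: "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives
# from a Novel Marine Gliding Bacterium, Rapidithrix thailandica" (modulo whitespace)
print(innertext(title))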
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict:
<and>
  <many>elements</many>
  <many>more elements</many>
</and>
becomes:
"and": {
"many": [
"elements",
"more elements"
]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.
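To make the earlier DIV/P point concrete, here is how the shape of xmltodict's output changes with the number of same-named children (the element names are made up for illustration):
import xmltodict

one = xmltodict.parse('<div><p>one</p></div>')
two = xmltodict.parse('<div><p>one</p><p>two</p></div>')

print(type(one['div']['p']))   # <class 'str'>  -- a single child collapses to its text
print(type(two['div']['p']))   # <class 'list'> -- two children become a list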

What is the best way to search millions of JSON files?

I've very recently picked up programming in Python and am working on creating a database.
I've already worked out extracting all these files from their source so they are all in a directory on my computer.
All of these files are structured the same way and what I want to do is search these multidimensional dictionaries and locate the value for a specific set of keys.
These json files are all structured similarly,
{
    "userid": 34535367,
    "result": {
        "list": [
            {
                "name": 264,
                "age": 64,
                "id": 456345345
            },
            {
                "name": 263,
                "age": 42,
                "id": 364563463456
            }
        ]
    }
}
In my case, I would like to search for the "name" key and return the relevant data (quality, id and the original userid) for the thousands of names just like it from my millions of JSON files.
Basically I'm very new at this and the little programming knowledge I have is in Python. I'm happy to start learning whatever I need to, but I'm not sure which direction to go.
If your goal is to create a database, then you should look at how databases work and how they solve the same problem you are trying to solve right now :)
NoSQL databases (like MongoDB) also work with JSON documents and most likely implement a whole set of tools to search and filter documents.
Now to answer your question, there is no quick way to do so unless you do some preprocessing, meaning that you store different information about the data (called metadata).
This is a huge subject and I don't have enough expertise to give you all the answers, but I can give you a simple tip: Use indexes.
An index is a sorted key/value map where, for every value, we store the documents that contain that value (or the file + position of the JSON document). For example, an index for the name property would look like this:
{
    263: ('jsonfile10.json', '0'),
    264: ('jsonfile10.json', '30'),
    # the JSON document can be found in the file jsonfile10.json at line 30
}
By keeping an index for the most queried values, you can turn a linear-time search into a logarithmic-time search, not to mention that inserting a new document becomes much faster. In your case, you seem to only need an index on the name field.
Creating/updating the index is done when you insert, update or remove a document. Using a balanced binary tree can accelerate the updates on the index.
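To illustrate the idea, here is a rough sketch that builds an in-memory name index over a directory of such JSON files (the directory name is made up, and a real index over millions of files would be persisted, e.g. with shelve or a database, rather than rebuilt on every run):
import json
import os
from collections import defaultdict

def build_name_index(directory):
    """Map each 'name' value to the places it occurs: (filename, userid, id)."""
    index = defaultdict(list)
    for filename in os.listdir(directory):
        if not filename.endswith('.json'):
            continue
        with open(os.path.join(directory, filename)) as f:
            doc = json.load(f)
        for entry in doc.get('result', {}).get('list', []):
            index[entry['name']].append((filename, doc.get('userid'), entry.get('id')))
    return index

index = build_name_index('json_files')
print(index.get(264, []))   # every occurrence of name == 264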
As a suggestion, why don't you just process all the incoming files and insert the data into a database? You will have a toolset to query that database. SQLite for example will do (as well as any other more sophisticated database):
http://www.sqlite.org/
http://docs.python.org/2/library/sqlite3.html
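A minimal sketch of that approach with sqlite3 (the table layout, database file name and directory name are my own assumptions):
import glob
import json
import sqlite3

conn = sqlite3.connect('records.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        userid INTEGER,
        name   INTEGER,
        age    INTEGER,
        id     INTEGER
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_name ON records (name)")

# Load every JSON file once, then query the database instead of the files.
for path in glob.glob('json_files/*.json'):
    with open(path) as f:
        doc = json.load(f)
    for entry in doc['result']['list']:
        conn.execute("INSERT INTO records VALUES (?, ?, ?, ?)",
                     (doc['userid'], entry['name'], entry['age'], entry['id']))
conn.commit()

# Look up everything recorded for a given name.
for row in conn.execute("SELECT userid, age, id FROM records WHERE name = ?", (264,)):
    print(row)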
Another simple solution might be to build a file mapping name_id to /file/path. Then you can do a logarithmic binary search by the name id. But I'd still advise using a proper database, as maintaining the index yourself will be more cumbersome than doing some inserts/deletes.
