Parsing file into Parent/ Child format for a JSON file

Parsing file into Parent/ Child format for a JSON file - python

I would like some help/ advice on how to parse this file for Gene ontology (.obo)
I am working to create a visualisation in D3, and need to create a "tree" file, in the JSON format -
{
"name": "flare",
"description": "flare",
"children": [
{
"name": "analytic",
"description": "analytics",
"children": [
{
"name": "cluster",
"description": "cluster",
"children": [
{"name": "Agglomer", "description": "AgglomerativeCluster", "size": 3938},
{"name": "Communit", "description": "CommunityStructure", "size": 3812},
{"name": "Hierarch", "description": "HierarchicalCluster", "size": 6714},
{"name": "MergeEdg", "description": "MergeEdge", "size": 743}
]
}, etc..
This format seems fairly easy to replicate in a dictionary in python, with 3 fields for each entry: name, description, and children[].
My probelm here is actually HOW to extract the data. The file linked above has "objects" structured as:
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
Where I will need the id, is_a and name fields. I have tried using python to parse this, but I cant seem to find a way to locate each object.
Any ideas?

Here's a fairly simple way to parse the objects in your '.obo' file. It saves the object data into a dict with the id as the key and the name and is_a data saved in a list. Then it pretty-prints it using the standard json module's .dumps function.
For testing purposes, I used a truncated version of the file in your link that only includes up to id: GO:0000006.
This code ignores any objects that contain the is_obsolete field. It also removes the description info from the is_a fields; I figured you probably wanted that, but it's easy enough to disable that functionality.
#!/usr/bin/env python
''' Parse object data from a .obo file
From http://stackoverflow.com/q/32989776/4014959
Written by PM 2Ring 2015.10.07
'''
from __future__ import print_function, division
import json
from collections import defaultdict
fname = "go-basic.obo"
term_head = "[Term]"
#Keep the desired object data here
all_objects = {}
def add_object(d):
#print(json.dumps(d, indent = 4) + '\n')
#Ignore obsolete objects
if "is_obsolete" in d:
return
#Gather desired data into a single list,
# and store it in the main all_objects dict
key = d["id"][0]
is_a = d["is_a"]
#Remove the next line if you want to keep the is_a description info
is_a = [s.partition(' ! ')[0] for s in is_a]
all_objects[key] = d["name"] + is_a
#A temporary dict to hold object data
current = defaultdict(list)
with open(fname) as f:
#Skip header data
for line in f:
if line.rstrip() == term_head:
break
for line in f:
line = line.rstrip()
if not line:
#ignore blank lines
continue
if line == term_head:
#end of term
add_object(current)
current = defaultdict(list)
else:
#accumulate object data into current
key, _, val = line.partition(": ")
current[key].append(val)
if current:
add_object(current)
print("\nall_objects =")
print(json.dumps(all_objects, indent = 4, sort_keys=True))
output
all_objects =
{
"GO:0000001": [
"mitochondrion inheritance",
"GO:0048308",
"GO:0048311"
],
"GO:0000002": [
"mitochondrial genome maintenance",
"GO:0007005"
],
"GO:0000003": [
"reproduction",
"GO:0008150"
],
"GO:0000006": [
"high-affinity zinc uptake transmembrane transporter activity",
"GO:0005385"
]
}

Related

Adding a comma between JSON objects in a datafile with Python?

I have large file (about 3GB) which contains what looks like a JSON file but isn't because it lacks commas (,) between "observations" or JSON objects (I have about 2 million of these "objects" in my data file).
For example, this is what I have:
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"usernameid": "47284592942",
"username": "Alex",
"server": "475774810304151552",
"text": "Must watch",
"type": "462050823720009729",
"datetime": "2018-08-05T21:20:20.486000+00:00",
"type": {
"$numberLong": "0"
}
}
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "273342",
"usernameid": "Alice",
"server": "475774810304151552",
"text": "https://www.youtube.com/",
"type": "4620508237200097wd29",
"datetime": "2018-08-05T21:20:11.803000+00:00",
"type": {
"$numberLong": "0"
}
And this is what I want (the comma between "observations"):
{
"_id": {
"$id": "fh37fc3huc3"
},
"messageid": "4757724838492485088139042828",
"attachments": [],
"username": "Alex",
"server": "475774810304151552",
"type": {
"$numberLong": "0"
}
},
{
"_id": {
"$id": "23453532dwq"
},
"messageid": "232534",
"attachments": [],
"usernameid": "Alice",
"server": "475774810304151552",
"type": {
"$numberLong": "0"
}
This is what I tried but it doesn't give me a comma where I need it:
import re
with open('dataframe.txt', 'r') as input, open('out.txt', 'w') as output:
output.write("[")
for line in input:
line = re.sub('', '},{', line)
output.write(' '+line)
output.write("]")
What can I do so that I can add a comma between each JSON object in my datafile?

This solution presupposes that none of the fields in JSON contains neither { nor }.
If we assume that there is at least one blank line between JSON dictionaries, an idea: let's maintain unclosed curly brackets count ({) as unclosed_count; and if we meet an empty line, we add the coma once.
Like this:
with open('test.json', 'r') as input_f, open('out.json', 'w') as output_f:
output_f.write("[")
unclosed_count = 0
comma_after_zero_added = True
for line in input_f:
unclosed_count_change = line.count('{') - line.count('}')
unclosed_count += unclosed_count_change
if unclosed_count_change != 0:
comma_after_zero_added = False
if line.strip() == '' and unclosed_count == 0 and not comma_after_zero_added:
output_f.write(",\n")
comma_after_zero_added = True
else:
output_f.write(line)
output_f.write("]")

Assuming sufficient memory, you can parse such a stream one object at a time using json.JSONDecoder.raw_decode directly, instead of using json.loads.
>>> x = '{"a": 1}\n{"b": 2}\n' # Hypothetical output of open("dataframe.txt").read()
>>> decoder = json.JSONDecoder()
>>> x = '{"a": 1}\n{"b":2}\n'
>>> decoder.raw_decode(x)
({'a': 1}, 8)
>>> decoder.raw_decode(x, 9)
({'b': 2}, 16)
The output of raw_decode is a tuple containing the first JSON value decoded and the position in the string where the remaining data starts. (Note that json.loads just creates an instance of JSONDecoder, and calls the decode method, which just calls raw_decode and artificially raises an exception if the entire input isn't consumed by the first decoded value.)
A little extra work is involved; note that you can't start decoding with whitespace, so you'll have to use the returned index to detect where the next value starts, following any additional whitespace at the returned index.

Another way to view your data is that you have multiple json records separated by whitespace. You can use the stdlib JSONDecoder to read each record, then strip whitespace and repeat til done. The decoder reads a record from a string and tells you how far it got. Apply that iteratively to the data until all is consumed. This is far less risky than making a bunch of assumptions about what data is contained in the json itself.
import json
def json_record_reader(filename):
with open(filename, encoding="utf-8") as f:
txt = f.read().lstrip()
decoder = json.JSONDecoder()
result = []
while txt:
data, pos = decoder.raw_decode(txt)
result.append(data)
txt = txt[pos:].lstrip()
return result
print(json_record_reader("data.json"))
Considering the size of your file, a memory mapped text file may be the better option.

If you're sure that the only place you will find a blank line is between two dicts, then you can go ahead with your current idea, after you fix its execution. For every line, check if it's empty. If it isn't, write it as-is. If it is, write a comma instead
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
output_file.write("[")
for line in input_file:
if line.strip():
output_file.write(line)
else:
output_file.write(",")
output_file.write("]")
If you cannot guarantee that any blank line must be replaced by a comma, you need a different approach.
You want to replace a close-bracket, followed by an empty line (or multiple whitespace), followed by an open-bracket, with },{.
You can keep track of the previous two lines in addition to the current line, and if these are "}", "", and "{" in that order, then write a comma before writing the "{".
from collections import deque
with open('dataframe.txt', 'r') as input_file, open('out.txt', 'w') as output_file:
last_two_lines = deque(maxlen=2)
output_file.write("[")
for line in input_file:
line_s = line.strip()
if line_s == "{" and list(last_two_lines) == ["}", ""]:
output_file.write("," + line)
else:
output_file.write(line)
last_two_lines.append(line_s)
Alternatively, if you want to stick with regex, then you could do
with open('dataframe.txt') as input_file:
file_contents = input_file.read()
repl_contents = re.sub(r'\}(\s+)\{', r'},\1{', file_contents)
with open('out.txt', 'w') as output_file:
output_file.write(repl_contents)
Here, the regex r"\}(\s+)\{" matches the pattern we're looking for (\s+ matches multiple whitespace characters, and captures them in group 1, which we then use in the replacement string as \1.
Note that you will need to read and run re.sub on the entire file, which will be slow.

how to split large json file according to the value of a key?

I have a large json file that I would like to split according to the key "metadata". One example of record is
{"text": "The primary outcome of the study was hospital mortality; secondary outcomes included ICU mortality and lengths of stay for hospital and ICU. ICU mortality was defined as survival of a patient at ultimate discharge from the ICU and hospital mortality was defined as survival at discharge or transfer from our hospital.", "label": "conclusion", "metadata": "18982114"}
There are many records in the json file where the key "metadata" is "18982114". How can I extract all of these records and store them into a separate json file? Ideally, I'm looking for a solution that includes no loading and looping over the file, otherwise it would be very cumbersome every time I query it. I think by using shell command maybe it's doable, but unfortunately I'm not an expert in shell commands...so I would highly appreciate a non-looping fast query solution, thx!
==========================================================================
here are some samples of the file (contains 5 records):
{"text": "Finally, after an emergency laparotomy, patients who received i.v. vasoactive drugs within the first 24 h on ICU were 3.9 times more likely to die (OR 3.85; 95% CI, 1.64 -9.02; P\u00bc0.002). No significant prognostic factors were determined by the model on day 2.", "label": "conclusion", "metadata": "18982114"}
{"text": "Kinetics ofA TP Binding to Normal and Myopathic", "label": "conclusion", "metadata": "10700033"}
{"text": "Observed rate constants, k0b,, were obtained by fitting the equation I(t)=oe-kobs+C by the method of moments, where I is the observed fluorescence intensity, and I0 is the amplitude of fluorescence change. 38 ", "label": "conclusion", "metadata": "235564322"}
{"text": "The capabilities of modern angiographic platforms have recently improved substantially.", "label": "conclusion", "metadata": "2877272"}
{"text": "Few studies have concentrated specifically on the outcomes after surgery.", "label": "conclusion", "metadata": "18989842"}
The job is to fast retrieve the text for the record with metadata "18982114"

Use json package to convert the json object into a dictionary then use the data stored in the metadata key. here is an working example:
# importing the module
import json
# Opening JSON file
with open('data.json') as json_file:
data = json.load(json_file)
# Print the type of data variable
print("Type:", type(data))
# Print the data of dictionary
print("metadata: ", data['metadata'])

You can try this approach:
import json
with open('data.json') as data_json:
data = json.load(data_json)
MATCH_META_DATA = '18982114'
match_records = []
for part_data in data:
if part_data.get('metadata') == MATCH_META_DATA:
match_records.append(part_data)

Let us imagine we have the following JSON content in example.json:
{
"1":{"text": "Some text 1.", "label": "xxx", "metadata": "18982114"},
"2":{"text": "Some text 2.", "label": "yyy", "metadata": "18982114"},
"3":{"text": "Some text 3.", "label": "zzz", "metadata": "something else"}
}
You can do the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
# 1. read json content from file
my_json = None
with open("example.json", "r") as file:
my_json = json.load(file)
# 2. filter content
# you can use a list instead of a new dictionary if you don't want to create a new json file
new_json_data = {}
for record_id in my_json:
if my_json[record_id]["metadata"] == str(18982114):
new_json_data[record_id] = my_json[record_id]
# 3. write a new json with filtered data
with open("result.json"), "w") as file:
json.dump(new_json_data, file)
This will output the following result.json file:
{"1": {"text": "Some text 1.", "label": "", "metadata": "18982114"}, "2": {"text": "Some text 2.", "label": "", "metadata": "18982114"}}

How to search and copy an item given the ID in a large json file

I have two large files:
one is a text file with a lot of IDs: one ID per row;
the other one is a 6+ GB json file, containing many items.
I need to search for those IDs in a certain field of the json file and copy the whole item it refers to for later analysis (creating a new file).
I give an example:
IDs.txt
unique_id_1
unique_id_2
...
schema.json
[
{
"id": "unique_id_1",
"name": "",
"text": "",
"date": "",
},
{
"id": "unique_id_aaa",
"name": "",
"text": "",
"date": "",
},
{
"id": "unique_id_2",
"name": "",
"text": "",
"date": "",
},
...
]
I am doing these analysis with Python - Pandas but I am getting troubles due to the large dimension of the files. What is the best way to do this thing? I can also consider using other software / languages

I implemented my second suggestion: this only works if the schema is flat (there are no nested objects in the JSON file). I also did not check what happens if a value in the JSON file is a dictionary, but probably if would be handled more carefully, as I currently check for } in a line to decide if the object is over.
You still need to load the entire IDs file, you need to check somehow if the object is needed.
If the useful_objects list grows too large, you can easily save that periodically while parsing the file.
import json
from pathlib import Path
import re
from typing import Dict
schema_name = "schema.json"
schema_path = Path(schema_name)
ids_name = "IDs.txt"
ids_path = Path(ids_name)
# read the ids
useful_ids = set()
with ids_path.open() as id_f:
for line in id_f:
id_ = line.strip()
useful_ids.add(id_)
print(useful_ids)
useful_objects = []
temp: Dict[str, str] = {}
was_useful = False
with schema_path.open() as sc_f:
for line in sc_f:
# remove start/end whitespace
line = line.strip()
print(f"Parsing line {line}")
# an object is ending
if line[0] == "}":
# add it
if was_useful:
useful_objects.append(temp)
# reset the usefulness for the next object
was_useful = False
# reset the temp object
temp = {}
# parse the line
match = re.match(r'"(.*?)": "(.*)"', line)
# if this did not match, skip the line
if match is None:
continue
# extract the data from the regex match
key = match.group(1)
value = match.group(2)
print(f"\tMatched: {key} {value}")
# build the temp object incrementally
temp[key] = value
# check if this object is useful
if key == "id" and value in useful_ids:
was_useful = True
useful_json = json.dumps(useful_objects, indent=4)
print(useful_json)
Again, not very elegant and not very robust, but as long as you are aware of the limitations, it does the job.
Cheers!

How can I parse nested JSON to CSV

I have a new project where I obtain JSON data back from a REST API - I'm trying to parse this data to csv pipe delimited to import to our legacy software
I can't seem to get all the value pairs parsed properly - this is my first exposure to JSON and I've tried so many things but only getting a little right at a time
I have used Python and can get some items that I need but not the whole JSON tree - it comes across as a list and has some dictionaries and lists in it as well
I know my code is incomplete and just looking for someone to point me in the right direction on what tools in python can get the job done
import json
import csv
with open('tenants.json') as access_json:
read_content = json.load(access_json)
for rm_access in read_content:
rm_data = rm_access
print(rm_data)
contacts_data = rm_data['Contacts']
leases_data = rm_data['Leases']
udfs_data = rm_data['UserDefinedValues']
for contacts_access in contacts_data:
rm_contacts = contacts_access
UPDATED:
import pandas as pd
with open('tenants.json') as access_json:
read_content = json.load(access_json)
for rm_access in read_content:
rm_data = rm_access
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 150)
TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
rm_data = rm_access
TenantID.append(rm_data["TenantID"])
TenantDisplayID.append(rm_data["TenantDisplayID"])
Name.append(rm_data["Name"])
FirstName.append(rm_data["FirstName"])
LastName.append(rm_data["LastName"])
WebMessage.append(rm_data["WebMessage"])
Comment.append(rm_data["Comment"])
RentDueDay.append(rm_data["RentDueDay"])
RentPeriod.append(rm_data["RentPeriod"])
# FirstContact.append(rm_data["FirstContact"])
PropertyID.append(rm_data["PropertyID"])
PostingStartDate.append(rm_data["PostingStartDate"])
CreateDate.append(rm_data["CreateDate"])
CreateUserID.append(rm_data["CreateUserID"])
UpdateUserID.append(rm_data["UpdateUserID"])
Contacts.append(rm_data["Contacts"])
df = pd.DataFrame({"TenantID":TenantID,"TenantDisplayID":TenantDisplayID, "Name"
: Name,"FirstName":FirstName, "LastName": LastName,"WebMessage": WebMessage,"Com
ment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod, "PropertyID"
: PropertyID, "PostingStartDate": PostingStartDate,"CreateDate": CreateDate, "Cr
eateUserID": CreateUserID,"UpdateUserID": UpdateUserID,"Contacts": Contacts})
print(df)
Here is sample of the file
[
{
"TenantID": 115,
"TenantDisplayID": 115,
"Name": "Jane Doe",
"FirstName": "Jane",
"LastName": "Doe",
"WebMessage": "",
"Comment": "",
"RentDueDay": 1,
"RentPeriod": "Monthly",
"FirstContact": "2015-11-01T15:30:00",
"PropertyID": 17,
"PostingStartDate": "2010-10-01T00:00:00",
"CreateDate": "2014-04-16T13:35:37",
"CreateUserID": 1,
"UpdateDate": "2017-03-22T11:31:48",
"UpdateUserID": 1,
"Contacts": [
{
"ContactID": 128,
"FirstName": "Jane",
"LastName": "Doe",
"MiddleName": "",
"IsPrimary": true,
"DateOfBirth": "1975-02-27T00:00:00",
"FederalTaxID": "111-11-1111",
"Comment": "",
"Email": "jane.doe#mail.com",
"License": "ZZT4532",
"Vehicle": "BMW 3 Series",
"IsShowOnBill": true,
"Employer": "REW",
"ApplicantType": "Applicant",
"CreateDate": "2014-04-16T13:35:37",
"CreateUserID": 1,
"UpdateDate": "2017-03-22T11:31:48",
"AnnualIncome": 0.0,
"UpdateUserID": 1,
"ParentID": 115,
"ParentType": "Tenant",
"PhoneNumbers": [
{
"PhoneNumberID": 286,
"PhoneNumberTypeID": 2,
"PhoneNumber": "703-555-5610",
"Extension": "",
"StrippedPhoneNumber": "7035555610",
"IsPrimary": true,
"ParentID": 128,
"ParentType": "Contact"
}
]
}
],
"UserDefinedValues": [
{
"UserDefinedValueID": 1,
"UserDefinedFieldID": 4,
"ParentID": 115,
"Name": "Emerg Contact Name",
"Value": "Terry Harper",
"UpdateDate": "2016-01-22T15:41:53",
"FieldType": "Text",
"UpdateUserID": 2,
"CreateUserID": 2
},
{
"UserDefinedValueID": 174,
"UserDefinedFieldID": 5,
"ParentID": 115,
"Name": "Emerg Contact Phone",
"Value": "703-555-3568",
"UpdateDate": "2016-01-22T15:42:03",
"FieldType": "Text",
"UpdateUserID": 2,
"CreateUserID": 2
}
],
"Leases": [
{
"LeaseID": 115,
"TenantID": 115,
"UnitID": 181,
"PropertyID": 17,
"MoveInDate": "2010-10-01T00:00:00",
"SortOrder": 1,
"CreateDate": "2014-04-16T13:35:37",
"UpdateDate": "2017-03-22T11:31:48",
"CreateUserID": 1,
"UpdateUserID": 1
}
],
"Addresses": [
{
"AddressID": 286,
"AddressTypeID": 1,
"Address": "14393 Montgomery Road Lot #102\r\nCincinnati, OH 45122",
"Street": "14393 Montgomery Road Lot #102",
"City": "Cincinnati",
"State": "OH",
"PostalCode": "45122",
"IsPrimary": true,
"ParentID": 115,
"ParentType": "Tenant"
}
],
"OpenReceivables": [],
"Status": "Current"
},
Not all tenants will have all elements which is also tricky
I need the data from the top where there is TenantID, TenantDisplayID, etc
I also need the data from the Contacts, PhoneNumbers, Leases, etc values
Each line should be static so if it doesn't have certain tags then I'd like a Null or None so it would look like
TentantID|TenantDisplayID|FirstName….etc so each line has same number of fields

Something like this should work:
import pandas as pd
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 100000)
TenantID = []
TenantDisplayID = []
Name = []
FirstName = []
LastName = []
WebMessage = []
Comment = []
RentDueDay = []
RentPeriod = []
FirstContact = []
PropertyID = []
PostingStartDate = []
CreateDate = []
CreateUserID = []
UpdateDate = []
UpdateUserID = []
Contacts = []
for rm_access in read_content:
rm_data = rm_access
print(rm_data)
TenantID.append(rm_data["TenantID"])
TenantDisplayID.append(rm_data["TenantDisplayID"])
Name.append(rm_data["Name"])
FirstName.append(rm_data["FirstName"])
LastName.append(rm_data["LastName"])
WebMessage.append(rm_data["WebMessage"])
Comment.append(rm_data["Comment"])
RentDueDay.append(rm_data["RentDueDay"])
RentPeriod.append(rm_data["RentPeriod"])
FirstContact.append(rm_data["FirstContact"])
PropertyID.append(rm_data["PropertyID"])
PostingStartDate.append(rm_data["PostingStartDate"])
CreateDate.append(rm_data["CreateDate"])
CreateUserID.append(rm_data["CreateUserID"])
UpdateUserID.append(rm_data["UpdateUserID"])
Contacts.append(rm_data["Contacts"])
df = pd.DataFrame({"TenantID":TenantID,"TenantDisplayID":TenantDisplayID, "Name": Name,
"FirstName":FirstName, "LastName": LastName,"WebMessage": WebMessage,
"Comment": Comment, "RentDueDay": RentDueDay, "RentPeriod": RentPeriod,
"FirstContact": FirstContact, "PropertyID": PropertyID, "PostingStartDate": PostingStartDate,
"CreateDate": CreateDate, "CreateUserID": CreateUserID,"UpdateUserID": UpdateUserID,
"Contacts": Contacts})
print(df)

The General Problem
The problem with this task (and other similar ones) is not just how to create an algorithm - I am sure you will theoretically be able to solve this with a (not so) nice amount of nested for-loops. The problem is to organise the code in a way that you don't get a headache - i.e. in a way that you can fix bugs easily, that you can write unittests, that you can understand the code easily from reading it (in six months from now) and that you can easily change your code in case you need to do so.
I do not know anybody who does not make mistakes when wrapping their head around a deeply nested structure. And chasing for bugs in a code which is heavily nested because it mirrors the nested structure of the data, can be quite frustrating.
The Quick (and most probably: Best) Solution
Rely on packages that are made for your exact usecase, such as
https://github.com/cwacek/python-jsonschema-objects
In case you have a formal definition of the API schema, you could use packages for that. If, for instance, your API has a Swagger schema definition, you cann use swagger-py (https://github.com/digium/swagger-py) to get your JSON response into Python objects.
The Principle Solution: Object Oriented Programming and Recursion
Even if there might be some libraries for your concrete use case, I would like to explain the principle of how to deal with "that kind" of tasks:
A good way to organise code for this kind of problem is using Object Oriented Programming. The nesting hassle can be laid out much clearer by making use of the principle of recursion. This also makes it easier to chabge the code, in case the JSON schema of your API response changes for any reasons (an update of the API, for instance). In your case I would suggest you create something like the following:
class JsonObject:
"""Parent Class for any Object that will be retrieved from the JSON
and potentially has nested JsonObjects inside.
This class takes care of parsing the json into python Objects and deals
with the recursion into the nested structures."""
primitives = []
json_objects = {
# For each class, this dict defines all the "embedded" classes which
# live directly "under" that class in the nested JSON. It will have the
# following structure:
# attribute_name : class
# In your case the JSON schema does not have any "single" objects
# in the nesting strcuture, but only lists of nested objects. I
# still , to demonstrate how you would do it in case, there would be
# single "embedded"
}
json_object_lists = {
# For each class, this dict defines all the "embedded" subclasses which
# are provided in a list "under" that class in the nested JSON.
# It will have the following structure:
# attribute_name : class
}
#classmethod
def from_dict(cls, d: dict) -> "JsonObject":
instance = cls()
for attribute in cls.primitives:
# Here we just parse all the primitives
instance.attribute = getattr(d, attribute, None)
for attribute, klass in cls.json_object_lists.items():
# Here we parse all lists of embedded JSON Objects
nested_objects = []
l = getattr(d, attribute, [])
for nested_dict in l:
nested_objects += klass.from_dict(nested_dict)
setattr(instance, attribute, nested_objects)
for attribute, klass in cls.json_objects.items():
# Here we parse all "single" embedded JSON Objects
setattr(
instance,
attribute,
klass.from_dict(getattr(d, attribute, None)
)
def to_csv(self) -> str:
pass
Since you didn't explain how exactly you want to create a csv from the JSON, I didn't implement that method and left this to you. It is also not necessary to explain the overall approach.
Now we have the general Parent class all our specific will inherit from, so that we can apply recursion to our problem. Now we only need to define these concrete structures, according to the JSON schema we want to parse. I got the following from your sample, but you can easily change the things you need to:
class Address(JsonObject):
primitives = [
"AddressID",
"AddressTypeID",
"Address",
"Street",
"City",
"State",
"PostalCode",
"IsPrimary",
"ParentID",
"ParentType",
]
json_objects = {}
json_object_lists = {}
class Lease(JsonObject):
primitives = [
"LeaseID",
"TenantID",
"UnitID",
"PropertyID",
"MoveInDate",
"SortOrder",
"CreateDate",
"UpdateDate",
"CreateUserID",
"UpdateUserID",
]
json_objects = {}
json_object_lists = {}
class UserDefinedValue(JsonObject):
primitives = [
"UserDefinedValueID",
"UserDefinedFieldID",
"ParentID",
"Name",
"Value",
"UpdateDate",
"FieldType",
"UpdateUserID",
"CreateUserID",
]
json_objects = {}
json_object_lists = {}
class PhoneNumber(JsonObject):
primitives = [
"PhoneNumberID",
"PhoneNumberTypeID",
"PhoneNumber",
"Extension",
"StrippedPhoneNumber",
"IsPrimary",
"ParentID",
"ParentType",
]
json_objects = {}
json_object_lists = {}
class Contact(JsonObject):
primitives = [
"ContactID",
"FirstName",
"LastName",
"MiddleName",
"IsPrimary",
"DateOfBirth",
"FederalTaxID",
"Comment",
"Email",
"License",
"Vehicle",
"IsShowOnBill",
"Employer",
"ApplicantType",
"CreateDate",
"CreateUserID",
"UpdateDate",
"AnnualIncome",
"UpdateUserID",
"ParentID",
"ParentType",
]
json_objects = {}
json_object_lists = {
"PhoneNumbers": PhoneNumber,
}
class Tenant(JsonObject):
primitives = [
"TenantID",
"TenantDisplayID",
"Name",
"FirstName",
"LastName",
"WebMessage",
"Comment",
"RentDueDay",
"RentPeriod",
"FirstContact",
"PropertyID",
"PostingStartDate",
"CreateDate",
"CreateUserID",
"UpdateDate",
"UpdateUserID",
"OpenReceivables", # Maybe this is also a nested Object? Not clear from your sample.
"Status",
]
json_object_lists = {
"Contacts": Contact,
"UserDefinedValues": UserDefinedValue,
"Leases": Lease,
"Addresses": Address,
}
json_objects = {}
You might imagine the "beauty" (at least: order) of that approach, which lies in the following: With this structure, we could tackle any level of nesting in the JSON response of your API without additional headache - our code would not deepen its indentation level, because we have separated the nasty nesting into the recursive definition of JsonObjects from_json method. That is why it is much easier now to identify bugs or apply changes to our code.
To finally parse the JSON now into our Objects, you would do something like the following:
import typing
import json
def tenants_from_json(json_string: str) -> typing.Iterable["Tenant"]:
tenants = [
Tenant.from_dict(tenant_dict)
for tenant_dict in json.loads(json_string)
]
return tenants
Important Final Side Note: This is just the basic Principle
My code example is just a very brief introduction into the idea of using objects and recursion to deal with an overwhelming (and nasty) nesting of a structure. The code has some flaws. For instance one should avoid define mutable class variables. And of course the whole code should validate the data it gets from the API. You also might want to add the type of each attribute and represent that correctly in the Python objects (Your sample has integers, datetimes and strings, for instance).
I really only wanted to show you the very principle of Object Oriented Programming here.
I didn't take the time to test my code. So there are probably bugs left. Again, I just wanted to demonstrate the principle.

Loop stuck to last entry python

I'm reading a file and I want to append data in an array then dumps it in json file. i'm using python 2.7.
The problem is that it only returns the last line of the file and populate the file with this.
Don't know if it's clear so I show the code
import re
import json
results = []
contact = {
"id":"",
"email":""
}
source = open('zen_id.txt')
output = open('zen_id_js.json', 'w')
for line in source:
email = re.search(r'[\w\.-]+#[\w\.-]+', line)
contact['email'] = email.group(0)
p = re.search(r'\d\d\d\d\d', line)
contact['id'] = p.group(0)
results.append(contact)
json.dump(results, output)
And the output is :
[
{
"id": "35148",
"email": "****#gmail.com"
},
{
"id": "35148",
"email": "****#gmail.com"
},
{
"id": "35148",
"email": "****#gmail.com"
},
{
"id": "35148",
"email": "****#gmail.com"
},
Anyone knows what s happening ?
Thanks in advance

by doing
contact = {
"id":"",
"email":""
}
outside the loop, you have one instance of the object. You just modify the same instance over and over again (result.append doesn't create a copy of the dictionary, only stores the reference)
One solution is to define it inside the loop or to create a copy
for line in source:
email = re.search(r'[\w\.-]+#[\w\.-]+', line)
contact = {} # create a new, empty instance
contact['email'] = email.group(0)
...
note that it is not necessary to define the dictionary with keys & empty values, since you're overwriting them anyway. Define it empty.
Another alternative is not to use contact at all and create the dictionary on-the-fly using a literal form when appending to the list:
results.append({"email":email.group(0), "id":p.group(0)})
you can also skip the loop altogether and write that in one line using list comprehension:
results = [{"email":re.search(r'[\w\.-]+#[\w\.-]+', line).group(0), "id":re.search(r'\d\d\d\d\d', line).group(0)} for line in source]
The only issue here is that you cannot handle the cases where there isn't a match, at least easily.

You have to add the contact dictionary inside the for loop.
import re
import json
results = []
source = open('zen_id.txt')
output = open('zen_id_js.json', 'w')
for line in source:
contact = {
"id": "",
"email": ""
}
email = re.search(r'[\w\.-]+#[\w\.-]+', line)
contact['email'] = email.group(0)
p = re.search(r'\d\d\d\d\d', line)
contact['id'] = p.group(0)
results.append(contact)
json.dump(results, output)

You can perform deepcopy.
import re
import json
import copy
results = []
contact = {
"id":"",
"email":""
}
source = open('zen_id.txt')
output = open('zen_id_js.json', 'w')
for line in source:
email = re.search(r'[\w\.-]+#[\w\.-]+', line)
contact['email'] = email.group(0)
p = re.search(r'\d\d\d\d\d', line)
contact['id'] = p.group(0)
results.append(copy.deepcopy(contact))
json.dump(results, output)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing file into Parent/ Child format for a JSON file - python

Related

Adding a comma between JSON objects in a datafile with Python?

how to split large json file according to the value of a key?

How to search and copy an item given the ID in a large json file

How can I parse nested JSON to CSV

Loop stuck to last entry python

Categories

Resources