Pandas - nested json file - python

I am new to Python and would like to extract data from JSON with pandas.
The nested JSON structure is as follows:
{
"idDriver": "100001",
"defaultTripType": "private",
"fleetManagerRole": null,
"identifications": [
{
"code": "90-00-00-77-20",
"from": "2019-08-08T10:38:15Z",
"rawId": "",
"vehicle": {
"isBusinessCar": "0",
"id": "10000",
"licensePlate": "ABCD",
"class": "Suziki 1.6 CDTI"
}
}
]
}
As output I need a single row containing 'idDriver' from level 0 and 'licensePlate' from the identifications/vehicle node.
What I have tried so far (after loading the data from the API, which works fine):
json_data = json.loads(myResponse.text)
#only unwrapping 'identifications' – works 100% fine
workdata = json_normalize(json_data, record_path= ['identifications'],
meta=['idDriver'])
#unwrapping 'identifications'\'vehicle' - is NOT working
workdata = json_normalize(json_data, record_path= ['identifications','vehicle'],
meta=['idDriver'])
I would appreciate any hint on that.
Kind Regards,
Arek

I would go for rebuilding your dictionary like this:
New_Data = {
"id" : [],
"licensePlate" : []
}
New_Data["id"].append(data["idDriver"])
New_Data["licensePlate"].append(data["identifications"][0]["vehicle"]["licensePlate"])
If you have many entries in data["identifications"], you can easily loop over them; if you have many drivers, you can do the same.
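The loop mentioned above could look roughly like this. It is a minimal sketch, assuming the API returns a list of driver objects shaped like the one in the question; the function name collect_plates and the second identification in the sample are made up for illustration.

```python
def collect_plates(drivers):
    """Return one row per (driver, identification) pair."""
    rows = []
    for driver in drivers:
        # A driver without identifications simply contributes no rows.
        for identification in driver.get("identifications", []):
            vehicle = identification.get("vehicle") or {}
            rows.append({
                "idDriver": driver["idDriver"],
                "licensePlate": vehicle.get("licensePlate"),
            })
    return rows


# Hypothetical sample: one driver with two identifications.
drivers = [{
    "idDriver": "100001",
    "identifications": [
        {"code": "90-00-00-77-20", "vehicle": {"licensePlate": "ABCD"}},
        {"code": "90-00-00-77-21", "vehicle": {"licensePlate": "EFGH"}},
    ],
}]
rows = collect_plates(drivers)
print(rows)
```

The resulting list of dictionaries can be passed directly to pd.DataFrame(rows) if a DataFrame is needed.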

For me your first code works fine; if necessary, just remove the vehicle. prefix from the column names:
from pandas import json_normalize
json_data = {
"idDriver": "100001",
"defaultTripType": "private",
"fleetManagerRole": None,
"identifications": [
{
"code": "90-00-00-77-20",
"from": "2019-08-08T10:38:15Z",
"rawId": "",
"vehicle": {
"isBusinessCar": "0",
"id": "10000",
"licensePlate": "ABCD",
"class": "Suziki 1.6 CDTI",
}
}
]
}
workdata = json_normalize(json_data, record_path= ['identifications'], meta=['idDriver'])
print (workdata)
code from rawId vehicle.isBusinessCar \
0 90-00-00-77-20 2019-08-08T10:38:15Z 0
vehicle.id vehicle.licensePlate vehicle.class idDriver
0 10000 ABCD Suziki 1.6 CDTI 100001
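# regex=False treats 'vehicle.' as a literal prefix (recent pandas versions no longer default to regex matching)
workdata.columns = workdata.columns.str.replace('vehicle.', '', regex=False)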
print (workdata)
code from rawId isBusinessCar id \
0 90-00-00-77-20 2019-08-08T10:38:15Z 0 10000
licensePlate class idDriver
0 ABCD Suziki 1.6 CDTI 100001

I recently wrote a package called cherrypicker to deal with tasks like this easily. I think the following snippet would achieve your task with CherryPicker:
import json
from cherrypicker import CherryPicker
json_data = json.loads(myResponse.text)
picker = CherryPicker(json_data)
flat_data = picker.flatten['idDriver', 'identifications_0_vehicle_licensePlate'].get()
flat_data would then look like this (I'm assuming that your data is actually a list of objects like the one you described above):
[['100001', 'ABCD'], ...]
You can then load this into a dataframe as follows:
import pandas as pd
df = pd.DataFrame(flat_data, columns=["idDriver", "licensePlate"])
If you want to flatten your data in slightly different ways (e.g. you want every license plate/driver ID combination, not just the first license plate for each driver), then you should be able to do this too, although it may require two or three lines rather than just one. Check out the docs for examples of other ways of using it: https://cherrypicker.readthedocs.io.
To install cherrypicker, it's just pip install --user cherrypicker.

Related

How can I check the value of a parameter?

I want to write a program that saves information from an API as a JSON file. The API response has an 'exchangeId' parameter. I want to save only those files that contain more than one distinct 'exchangeId' from my list. How can I do that? Please give me a hand.
My code:
exchangeIds = {102,311,200,302,521,433,482,406,42,400}
pairs = []
out_object = {}
for pair in json_data["marketPairs"]:
    if (exchange_id := pair.get("exchangeId")):
        if exchange_id in exchangeIds:
            # Removing the id keeps only the first pair per exchange
            exchangeIds.remove(exchange_id)
            pairs.append({
                "exchange_name": pair["exchangeName"],
                "market_url": pair["marketUrl"],
                "price": pair["price"],
                "last_update" : pair["lastUpdated"],
                "exchange_id": pair["exchangeId"]
            })
out_object["name_of_coin"] = json_data["name"]
out_object["marketPairs"] = pairs
out_object["pairs"] = json_data["numMarketPairs"]
name = json_data["name"]
Example of the exchange ids in an output that I don't need:
{200}  # only one id from exchangeIds
Example of JSON output:
{
"name_of_coin": "Pax Dollar",
"marketPairs": [
{
"exchange_name": "Bitrue",
"market_url": "https://www.bitrue.com/trade/usdp_usdt",
"price": 1.0000617355334473,
"last_update": "2021-12-24T16:39:09.000Z",
"exchange_id": 433
},
{
"exchange_name": "Hotbit",
"market_url": "https://www.hotbit.io/exchange?symbol=USDP_USDT",
"price": 0.964348817699553,
"last_update": "2021-12-24T16:39:08.000Z",
"exchange_id": 400
}
],
"pairs": 22
}
This is an example of the output that I need, because it contains two ids.
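One possible way to express the "more than one id" condition is to collect the matched ids in a set and check its size before saving. The following is only a sketch under that reading of the question; build_output, make_pair, the minimum parameter, and the sample data are illustrative names, not part of the original code or of any API.

```python
import json

ALLOWED_IDS = frozenset({102, 311, 200, 302, 521, 433, 482, 406, 42, 400})


def build_output(json_data, allowed_ids=ALLOWED_IDS, minimum=2):
    """Return the output dict, or None if fewer than `minimum` distinct allowed ids occur."""
    seen = set()
    pairs = []
    for pair in json_data["marketPairs"]:
        exchange_id = pair.get("exchangeId")
        # Keep only the first pair for each allowed exchange id.
        if exchange_id in allowed_ids and exchange_id not in seen:
            seen.add(exchange_id)
            pairs.append({
                "exchange_name": pair["exchangeName"],
                "market_url": pair["marketUrl"],
                "price": pair["price"],
                "last_update": pair["lastUpdated"],
                "exchange_id": exchange_id,
            })
    if len(seen) < minimum:
        return None
    return {
        "name_of_coin": json_data["name"],
        "marketPairs": pairs,
        "pairs": json_data["numMarketPairs"],
    }


def make_pair(name, exchange_id):
    """Hypothetical helper that builds one sample market pair."""
    return {
        "exchangeName": name,
        "marketUrl": "https://example.com/" + name,
        "price": 1.0,
        "lastUpdated": "2021-12-24T16:39:09.000Z",
        "exchangeId": exchange_id,
    }


sample = {
    "name": "Pax Dollar",
    "numMarketPairs": 22,
    "marketPairs": [make_pair("Bitrue", 433), make_pair("Hotbit", 400), make_pair("Other", 7)],
}
result = build_output(sample)
if result is not None:
    # Only in this case would the JSON file be written.
    print(json.dumps(result, indent=2))
```

Using a separate seen set instead of removing ids from exchangeIds leaves the original set untouched, so it can be reused for the next coin.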

Converting Excel to JSON using Python

This is my first attempt at something like this. How do I get the desired output?
The desired JSON looks like this:
{
"cluster": "ABC,DEF,GHI,SMU,RTR,DIP-OLM"
}
I have been following various answers here, and so far I only have this:
import pandas as pd
file = 'my_template.xlsx'
ans = pd.read_excel(file, sheet_name='clusters')
f = ans.to_json(indent=2)
print(f)
which gives
{
"service_name":{
"0":"ABC",
"1":"DEF",
"2":"GHI",
"3":"SMU",
"4":"MDS",
"5":"APS_OLM",
"6":"RTR",
"7":"DIP-OLM",
"8":"LMS-JSX"
}
}
Some help would give me a better understanding of this. Thanks in advance. Would xlrd be better suited for this? There are a couple more sheets that have to be nested inside this JSON.
INFO: The complete JSON is something like this:
{
"cluster": "ABC,DEF,GHI,SMU,RTR,DIP-OLM",
"version": "07.04.00",
"ABC": {
"microservices": "ABC - HA (07.04.00)",
"maps": {
"pam_bld_nr": "21",
"pam_bld_nr_frontend": "21",
"pam_bld_nr_backend": "21",
"pam_bt_dk_size": "200"
},
"new_block": {
"additional_param": "additional_param_value",
"additional_param": "additional_param_value2"
}
},
Rest of the values need to come from the other 2 sheets in the same file. Just FYI.
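For the "cluster" entry alone, one option is to join the column values into a single string and build the dictionary by hand, instead of calling to_json on the whole DataFrame. The sketch below uses a plain list in place of the column; the function name cluster_json and the wanted set are assumptions, since the question does not say why some services (MDS, APS_OLM, LMS-JSX) are missing from the desired output.

```python
import json


def cluster_json(service_names, wanted=None):
    """Join service names into one comma-separated "cluster" string.

    `wanted` is a hypothetical filter; pass None to keep every name.
    """
    names = [str(name) for name in service_names if wanted is None or name in wanted]
    return json.dumps({"cluster": ",".join(names)}, indent=2)


# Values as they appear in the "service_name" column of the question.
service_names = ["ABC", "DEF", "GHI", "SMU", "MDS", "APS_OLM", "RTR", "DIP-OLM", "LMS-JSX"]
wanted = {"ABC", "DEF", "GHI", "SMU", "RTR", "DIP-OLM"}
print(cluster_json(service_names, wanted))
```

With the DataFrame from the question, ans["service_name"] could be passed in place of the list, since iterating over a pandas column yields its values.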

The best way to transform a response into the JSON format in the example

I would appreciate help finding the best way to transform a result into JSON as shown below.
We have a result like the one below, containing information on employees and companies. For some reason, some property names carry an enum-like prefix T, but not all of them.
[ {
"T.id":"Employee_11",
"T.category":"Employee",
"node_id":["11"]
},
{
"T.id":"Company_12",
"T.category":"Company",
"node_id":["12"],
"employeecount":800
},
{
"T.id":"id~Employee_11_to_Company_12",
"T.category":"WorksIn",
},
{
"T.id":"Employee_13",
"T.category":"Employee",
"node_id":["13"]
},
{
"T.id":"Parent_Company_14",
"T.category":"ParentCompany",
"node_id":["14"],
"employeecount":900,
"childcompany":"Company_12"
},
{
"T.id":"id~Employee_13_to_Parent_Company_14",
"T.category":"Contractorin",
}]
We need to transform this result into a different structure, grouped by category. If the category is Employee, Company, or ParentCompany, the item should go under the node_properties object; otherwise it should go under edge_properties. Apart from the common properties (property_id, property_category, and node), additional properties should be added if the category is Company or ParentCompany. There is also some further logic: the from and to properties of each edge object have to be derived from the 'to' in its id. The expected response is:
"node_properties":[
{
"property_id":"Employee_11",
"property_category":"Employee",
"node":{node_id: "11"}
},
{
"property_id":"Company_12",
"property_category":"Company",
"node":{node_id: "12"},
"employeecount":800
},
{
"property_id":"Employee_13",
"property_category":"Employee",
"node":{node_id: "13"}
},
{
"property_id":"Company_14",
"property_category":"ParentCompany",
"node":{node_id: "14"},
"employeecount":900,
"childcompany":"Company_12"
}
],
"edge_properties":[
{
"from":"Employee_11",
"to":"Company_12",
"property_id":"Employee_11_to_Company_12",
},
{
"from":"Employee_13",
"to":"Parent_Company_14",
"property_id":"Employee_13_to_Parent_Company_14",
}
]
In Java, we have used the enhanced for loop, switch, etc. How can we write the code in Python to get the structure above from the initial result structure? (I am new to Python.) Thank you in advance.
Regards

Here is a method that I quickly made; you can adjust it to your requirements. You can use a regex or your own function to get the IDs for the edge_properties, then assign them to an object the way I did for the nodes. I am not sure of your full requirements, but if the list you gave contains all the categories, this will be sufficient.
def transform(input_list):
    node_properties = []
    edge_properties = []
    node_categories = ('Employee', 'Company', 'ParentCompany')
    for input_obj in input_list:
        new_obj = {}
        if input_obj['T.category'] in node_categories:
            new_obj['property_id'] = input_obj['T.id']
            new_obj['property_category'] = input_obj['T.category']
            # Build a dict (not a set) so the result matches {"node_id": "11"}
            new_obj['node'] = {'node_id': input_obj['node_id'][0]}
            if "employeecount" in input_obj:
                new_obj['employeecount'] = input_obj['employeecount']
            if "childcompany" in input_obj:
                new_obj['childcompany'] = input_obj['childcompany']
            node_properties.append(new_obj)
        else:  # You can use elif as well, based on your requirements, if there are other outliers
            # You can use regex or whichever method here to split the string and add the values like above
            edge_properties.append(new_obj)
    return [node_properties, edge_properties]
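The edge branch is left open above. One way to fill it, sketched here under the assumption that every edge id has the form "id~<from>_to_<to>" and that "_to_" never occurs inside a node id, is a small helper (edge_from_id is a made-up name) based on str.partition:

```python
def edge_from_id(raw_id):
    """Split an edge id such as 'id~Employee_11_to_Company_12' into from/to parts."""
    # Drop the 'id~' prefix if present (assumed format, taken from the sample data).
    property_id = raw_id[3:] if raw_id.startswith("id~") else raw_id
    # partition splits at the first '_to_' only.
    source, separator, target = property_id.partition("_to_")
    if not separator:
        raise ValueError("edge id has no '_to_' part: " + raw_id)
    return {"from": source, "to": target, "property_id": property_id}


print(edge_from_id("id~Employee_11_to_Company_12"))
print(edge_from_id("id~Employee_13_to_Parent_Company_14"))
```

Inside the else branch, edge_properties.append(edge_from_id(input_obj['T.id'])) would then replace the append of the empty new_obj.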

An efficient way to unpack nested json into a dataframe

I have a nested JSON that I want to transform into a pandas DataFrame. I was able to normalize it with json_normalize.
However, there are still JSON layers within the DataFrame, which I also want to unpack. What is the best way to do this? I will likely have to deal with this a few more times in my current project.
The JSON I have is the following:
{
"data": {
"allOpportunityApplication": {
"data": [
{
"id": "111111111",
"opportunity": {
"programme": {
"short_name": "XX"
}
},
"person": {
"home_lc": {
"name": "NAME"
}
},
"standards": [
{
"constant_name": "constant1",
"standard_option": {
"option": "true"
}
},
{
"constant_name": "constant2",
"standard_option": {
"option": "true"
}
}
]
}
]
}
}
}
I used json_normalize:
standards_df = json_normalize(
standard_json['allOpportunityApplication']['data'],
record_path=['standards'],
meta=['id','person','opportunity']
)
With that I get a DataFrame with the columns constant_name, standard_option, id, person, and opportunity. The problem is that the values in standard_option, person, and opportunity are still JSON objects, each with a single value inside.
The current output and expected output for each column are as follows:
Standard_option
Currently an item in the column "standard_option" looks like:
{'option': 'true'}
I want it to be just true
Person
Currently an item in the column "person" looks like:
{'home_lc': {'name': 'NAME'}}
I want it to look like: NAME
Opportunity
Currently an item in the column "opportunity" looks like:
{'programme': {'short_name': 'XX'}}
I want it to look like: XX

Might not be the best way, but I think it works:
standards_df['person'] = (standards_df.loc[:, 'person']
.apply(lambda x: x['home_lc']['name']))
standards_df['opportunity'] = (standards_df.loc[:, 'opportunity']
.apply(lambda x: x['programme']['short_name']))
constant_name standard_option.option id person opportunity
0 constant1 true 111111111 NAME XX
1 constant2 true 111111111 NAME XX
standard_option was already fine when I ran your code.
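Since the question mentions having to do this repeatedly, a generic helper may be easier to reuse than one lambda per column. The following is a sketch in plain Python, not a pandas feature; flatten and standards_rows are hypothetical names, and the dotted column names are a choice made here.

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as they are.
            flat[full_key] = value
    return flat


def standards_rows(application):
    """Return one flat row per entry in application['standards']."""
    meta = flatten({k: v for k, v in application.items() if k != "standards"})
    return [{**flatten(standard), **meta} for standard in application["standards"]]


application = {
    "id": "111111111",
    "opportunity": {"programme": {"short_name": "XX"}},
    "person": {"home_lc": {"name": "NAME"}},
    "standards": [
        {"constant_name": "constant1", "standard_option": {"option": "true"}},
        {"constant_name": "constant2", "standard_option": {"option": "true"}},
    ],
}
rows = standards_rows(application)
print(rows[0])
```

The rows can be handed to pd.DataFrame(rows), which gives columns such as person.home_lc.name and opportunity.programme.short_name that can be renamed afterwards.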

Count a particular value from list in Python mongodb

I am experimenting with Python and MongoDB, and I am a newbie with Python. Here I get records from one collection and, based on a particular value from each record, find the count for that record in a second collection. My problem is that I cannot append this count to my list.
Here is the code:
@gen.coroutine
def post(self):
    Sid = self.body['Sid']
    alpha = []
    test = db.student.find({"Sid": Sid})
    count = yield test.count()
    print(count)
    for document in (yield test.to_list(length=1000)):
        cursor = db.attendance.find({"StudentId": document.get('_id')})
        check = yield cursor.count()
        print(check)
        alpha.append(document)
    self.write(bson.json_util.dumps({"data": alpha}))
The displayed output alpha is from the first collection (student); the count value is from the attendance collection.
When I try to extend the document with check, I end up with an error:
alpha.append(document.extend(check))
I get the correct count value in the Python terminal, but I am unable to write it along with the output.
My output is like:
{"data": [{"Sid": "1", "Student Name": "Alex","_id": {"$oid": "..."}}, {"Sid": "1", "Student Name": "Alex","_id": {"$oid": "..."}}]}
My output should be like
{"data": [{"Sid": "1", "Student Name": "Alex","_id": {"$oid": "..."},"count": "5"}, {"Sid": "1", "Student Name": "Alex","_id": {"$oid": "..."},"count": "3"}]}
Please guide me on how I can get my desired output.
Thank you.

A better approach to this is to use the MongoDB .aggregate() method from the python driver you are using rather than repeated .find() and .count() operations:
db.attendance.aggregate([
{ "$group": {
"_id": "$StudentId",
"name": { "$first": "$Student Name" },
"count": { "$sum": 1 }
}}
])
Then it is already done for you.
What your current code is doing is looking up the current student and returning a "count" of how many occurrences there are, and judging by your output you are doing that for every student.
Rather than do that, the data is "aggregated" to return both the values from the document along with a "count" within the returned results, and it is aggregated per student.
This means you don't need to run a query for each student just to get the count. Instead you just call the database "once" and make it count all the students you need in one result.
If you need more than one student but not all students, then you filter that with query conditions:
db.attendance.aggregate([
{ "$match": { "StudentId": { "$in": list_of_student_ids } } },
{ "$group": {
"_id": "$StudentId",
"name": { "$first": "$Student Name" },
"count": { "$sum": 1 }
}}
])
And the selection along with the aggregation is done for you.
No need for looping code and lots of database requests. The .aggregate() method and pipeline will do it for you.
Read the core documentation on the Aggregation Pipeline.
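To get the output shape from the question, the aggregated counts still have to be attached to the student documents. A rough sketch of that step follows, with plain dictionaries standing in for the database results; attach_counts and the sample ids are made up, and it assumes, as in the question's code, that attendance.StudentId holds the student's _id.

```python
def attach_counts(students, aggregated):
    """Return copies of the student documents with a 'count' field added."""
    # Lookup table from student _id to attendance count.
    counts = {row["_id"]: row["count"] for row in aggregated}
    # Students without any attendance record get a count of 0.
    return [{**student, "count": counts.get(student["_id"], 0)} for student in students]


# Stand-ins for the results of the student query and the aggregation pipeline.
students = [
    {"_id": "a1", "Sid": "1", "Student Name": "Alex"},
    {"_id": "b2", "Sid": "1", "Student Name": "Alex"},
    {"_id": "c3", "Sid": "1", "Student Name": "Alex"},
]
aggregated = [{"_id": "a1", "count": 5}, {"_id": "b2", "count": 3}]
alpha = attach_counts(students, aggregated)
print(alpha)
```

This keeps the database work at two requests in total (one for the students, one for the pipeline) regardless of how many students match.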

Add count entry to the dictionary document and append the dictionary:
for document in (yield test.to_list(length=1000)):
    cursor = db.attendance.find({"StudentId": document.get('_id')})
    check = yield cursor.count()
    document['count'] = check
    alpha.append(document)
