I have a requirement to create a nested dictionary from a Pandas DataFrame.
Below is an example dataset in CSV format:
hostname,nic,vlan,status
server1,eth0,100,enabled
server1,eth2,200,enabled
server2,eth0,100
server2,eth1,100,enabled
server2,eth2,200
server1,eth1,100,disabled
Once the CSV is imported as a DataFrame I have:
>>> import pandas as pd
>>>
>>> df = pd.read_csv('test.csv')
>>>
>>> df
  hostname   nic  vlan    status
0  server1  eth0   100   enabled
1  server1  eth2   200   enabled
2  server2  eth0   100       NaN
3  server2  eth1   100   enabled
4  server2  eth2   200       NaN
5  server1  eth1   100  disabled
The output nested dictionary/JSON needs to group by the first two columns (hostname and nic), for example:
{
    "hostname": {
        "server1": {
            "nic": {
                "eth0": {
                    "vlan": 100,
                    "status": "enabled"
                },
                "eth1": {
                    "vlan": 100,
                    "status": "disabled"
                },
                "eth2": {
                    "vlan": 200,
                    "status": "enabled"
                }
            }
        },
        "server2": {
            "nic": {
                "eth0": {
                    "vlan": 100
                },
                "eth1": {
                    "vlan": 100,
                    "status": "enabled"
                },
                "eth2": {
                    "vlan": 200
                }
            }
        }
    }
}
I need to account for:
Missing data: for example, not all rows will include 'status'. If this happens we just skip it in the output dictionary.
Hostnames in the first column may be listed out of order. For example, rows 0, 1 and 5 must be correctly grouped under server1 in the output dictionary.
Extra columns beyond vlan and status may be added in the future. These must be correctly grouped under hostname and nic.
I have looked at groupby and MultiIndex in the Pandas documentation, but as a newcomer I have got stuck.
Any help is appreciated on the best method to achieve this.
It may help to group the df first: df_new = df.groupby(["hostname", "nic"], as_index=False) - note, as_index=False preserves the dataframe format.
You can then use df_new.to_json(orient='records', lines=True) to convert your df to JSON format (as jtweeder mentions in the comments). Once you get the desired format and would like to write it out, you can do something like:
with open('temp.json', 'w') as f:
    f.write(df_new.to_json(orient='records', lines=True))
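For the exact nested shape in the question, a rough sketch (one possible approach, not the only one) is to group on hostname and build the per-NIC dicts yourself, dropping NaN values so rows without 'status' simply omit it:

import pandas as pd

df = pd.read_csv('test.csv')

result = {"hostname": {}}
for hostname, host_rows in df.groupby("hostname"):
    nics = {}
    for _, row in host_rows.iterrows():
        # keep every column except the two grouping keys,
        # skipping NaN so a missing 'status' is left out of the dict
        details = {col: row[col]
                   for col in df.columns.drop(["hostname", "nic"])
                   if pd.notna(row[col])}
        nics[row["nic"]] = details
    result["hostname"][hostname] = {"nic": nics}

groupby sorts the hostnames, so rows listed out of order still land under the right server, and any extra columns added later are picked up automatically because the inner dict is built from df.columns.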
Is there any way to add values via aggregation, like db.insert_one?
x = db.aggregate([{
    "$addFields": {
        "chat_id": -10013345566,
    }
}])
I tried this, but the code returns nothing and the values are not updated.
I want to add the values via aggregation because aggregation is much faster than the alternatives.
Sample documents:
{"_id": 123 , "chat_id" : 125}
{"_id": 234, "chat_id" : 1325}
{"_id": 1323 , "chat_id" : 335}
Expected output:
alternative to db.insert_one() in mongodb aggregation
You have to make use of the $merge stage to save the output of the aggregation back to the collection.
Note: Be very careful when you use the $merge stage, as you can accidentally replace entire documents in your collection. Go through the complete documentation for this stage before using it.
db.collection.aggregate([
    {
        "$match": {
            "_id": 123
        }
    },
    {
        "$addFields": {
            "chat_id": -10013345566,
        }
    },
    {
        "$merge": {
            "into": "collection",      // <- Collection Name
            "on": "_id",               // <- Merge operation match key
            "whenMatched": "merge"     // <- Operation to perform when matched
        }
    },
])
Mongo Playground Sample Execution
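For reference, the same pipeline can be run from Python with pymongo (a minimal sketch; the connection string, database and collection names below are placeholders, and $merge requires MongoDB 4.2+):

from pymongo import MongoClient

# placeholder connection string and names; adjust for your deployment
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["collection"]

pipeline = [
    {"$match": {"_id": 123}},
    {"$addFields": {"chat_id": -10013345566}},
    # $merge writes the pipeline output back into the named collection
    {"$merge": {"into": "collection", "on": "_id", "whenMatched": "merge"}},
]

# A pipeline ending in $merge returns an empty cursor, so "no output" is
# expected; the effect is the updated document in the target collection.
collection.aggregate(pipeline)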
I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
    "general_info": {
        "name": "xxx",
        "description": "xxx",
        "language": "xxx",
        "prefix": "xxx",
        "version": "xxx"
    },
    "element_count": {
        "folders": 23,
        "conditions": 72,
        "listeners": 1,
        "outputs": 47
    },
    "external_resource_count": {
        "total": 9,
        "extensions": {
            "jar": 8,
            "json": 1
        },
        "paths": {
            "/lib": 9
        }
    },
    "complexity": {
        "over_1_transition": {
            "number": 4,
            "percentage": 30.769
        },
        "over_1_trigger": {
            "number": 2,
            "percentage": 15.385
        },
        "over_1_output": {
            "number": 4,
            "percentage": 30.769
        }
    }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to a proper column name.
# this function extracts all the requested keys from the nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series into a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2).rename(columns={'level_2': 'somecol'}))

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
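As an alternative, a rough sketch (assuming the dictionary is at most three levels deep, as in the example) is to flatten the nested dict into fixed-depth records first and build the frame from those, so no cell ends up holding a raw dict:

import pandas as pd

def flatten(d, path=()):
    # yield (key_path_tuple, leaf_value) for every leaf of a nested dict
    for key, value in d.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value

# extracted_metrics is the nested dict from the question
records = [{"category": p[0],
            "key": p[1],
            "subkey": p[2] if len(p) > 2 else "",
            "value": v}
           for p, v in flatten(extracted_metrics)]
table = pd.DataFrame(records).set_index(["category", "key", "subkey"])

Setting the three key columns as the index also removes the default integer index that otherwise shows up as the first column when rendering the table.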
I have imported a CSV dataset using Python and did some clean-ups.
Download the dataset here
# importing pandas
import pandas as pd
# reading csv and assigning to 'data'
data = pd.read_csv('co-emissions-per-capita.csv')
# dropping all rows before 2016 (2016 - 2017 remain)
data.drop(data[data.Year < 2016].index, inplace=True)
# dropping rows where every value is null
data.dropna(how="all", inplace=True)
# dropping columns where every value is null
data.dropna(axis="columns", how="all", inplace=True)
# filling NA values
data["Entity"].fillna("No Country", inplace=True)
data["Code"].fillna("No Code", inplace=True)
data["Year"].fillna("No Year", inplace=True)
data["Per capita CO2 emissions (tonnes per capita)"].fillna(0, inplace=True)
# Sort by Year && Country
data.sort_values(["Year", "Entity"], inplace=True)
# renaming columns
data.rename(columns={"Entity": "Country",
                     "Per capita CO2 emissions (tonnes per capita)": "CO2 emissions (metric tons)"}, inplace=True)
My current dataset has data for 2 years and 197 countries, which is 394 rows.
I want to insert the data into mongodb in the following format.
[
    {
        "_id": ObjectId("5dfasdc2f7c4b0174c5d01bc"),
        "year": 2016,
        "countries": [
            {
                "name": "Afghanistan",
                "code": "AFG",
                "CO2 emissions (metric tons)": 0.366302
            },
            {
                "name": "Albania",
                "code": "ALB",
                "CO2 emissions (metric tons)": 0.366302
            }
        ]
    },
    {
        "_id": ObjectId("5dfasdc2f7c4b0174c5d01bc"),
        "year": 2017,
        "countries": [
            {
                "name": "Afghanistan",
                "code": "AFG",
                "CO2 emissions (metric tons)": 0.366302
            },
            {
                "name": "Albania",
                "code": "ALB",
                "CO2 emissions (metric tons)": 0.366302
            }
        ]
    }
]
I want one object for each year.
Inside that I want to nest all the countries and their related information.
To be precise, I want my database to have 2 (max) objects and 197 nested objects inside each main object. So each year will only be listed once in the database, whereas each country will appear twice, once for each year.
Is there a better structure to store this data? Please specify the steps to store this data into MongoDB, and I'd really appreciate it if you can suggest a good ODM driver for Python, like Mongoose for Node.js.
Use the groupby function to split values from your dataframe into separate groups per year.
Use the to_dict function with the orient parameter set to 'records' to convert the results into JSON arrays.
Use the pymongo API to connect to the DB and insert the values, as sketched below.
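A minimal sketch of those three steps, assuming pymongo is installed; the connection string, database and collection names are placeholders, and data is the cleaned DataFrame from the question:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
collection = client["emissions_db"]["per_capita"]   # placeholder db/collection names

docs = []
for year, group in data.groupby("Year"):
    countries = (group[["Country", "Code", "CO2 emissions (metric tons)"]]
                 .rename(columns={"Country": "name", "Code": "code"})
                 .to_dict(orient="records"))
    # int() because pymongo cannot encode numpy integer types
    docs.append({"year": int(year), "countries": countries})

collection.insert_many(docs)   # one document per year, countries nested inside

For a Mongoose-style ODM in Python, MongoEngine is a commonly used option.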
I am not one for hyperbole, but I am really stumped by this and I am sure you will be too.
Here is a simple JSON object:
[
    {
        "id": "7012104767417052471",
        "session": -1332751885,
        "transactionId": "515934477",
        "ts": "2019-10-30 12:15:40 AM (+0000)",
        "timestamp": 1572394540564,
        "sku": "1234",
        "price": 39.99,
        "qty": 1,
        "ex": [
            {
                "expId": 1007519,
                "versionId": 100042440,
                "variationId": 100076318,
                "value": 1
            }
        ]
    }
]
Now I saved the file into ex.json and then executed the following python code:
import pandas as pd
df = pd.read_json('ex.json')
When I look at the dataframe, the value of my id has changed from "7012104767417052471" to "7012104767417052160".
Does anyone understand why Python does this? I tried it in Node.js and even Excel, and it looks fine everywhere else.
If I do this I get the right id:
import json
from pandas import json_normalize  # pandas.io.json.json_normalize in older pandas versions

with open('Siva.json') as data_file:
    data = json.load(data_file)
df = json_normalize(data)
But I want to understand why pandas doesn't handle this JSON properly.
This is a known issue. It has been an open issue since 2018-04-04: "read_json reads large integers as strings incorrectly if dtype not explicitly mentioned" (#20608).
As stated in the issue, explicitly designate the dtype to get the correct number.
import pandas as pd
df = pd.read_json('test.json', dtype={'id': 'int64'})
id session transactionId ts timestamp sku price qty ex
7012104767417052471 -1332751885 515934477 2019-10-30 12:15:40 AM (+0000) 2019-10-30 00:15:40.564 1234 39.99 1 [{'expId': 1007519, 'versionId': 100042440, 'variationId': 100076318, 'value': 1}]
I have the following json object:
{
    "Name": "David",
    "Gender": "M",
    "Date": "2014-01-01",
    "Address": {
        "Street": "429 Ford",
        "City": "Oxford",
        "State": "DE",
        "Zip": 1009
    }
}
How would I load this into a pandas dataframe so that it orients itself as:
name   gender  date        address
David  M       2014-01-01  {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.
You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
If your JSON file is composed of one JSON object per line (not an array, not a pretty-printed JSON object), then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want. If the file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on 1 line, then you get:
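The result is roughly a one-row frame along these lines (the nested Address object stays as a dict in its cell):

    Name Gender        Date                                            Address
0  David      M  2014-01-01  {'Street': '429 Ford', 'City': 'Oxford', 'St...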
If you use
df = pd.read_json(file, orient='records')
you can load as 1 key per column, but the sub-keys will be split up into multiple rows.