How to convert a python pandas dataframe into mongodb objects - python

I have imported a CSV dataset using Python and did some clean-up.
Download the dataset here
# importing pandas
import pandas as pd
# reading csv and assigning to 'data'
data = pd.read_csv('co-emissions-per-capita.csv')
# dropping all columns before 2016 (2016 - 2017 remains)
data.drop(data[data.Year < 2016].index, inplace=True)
# dropping rows where all values are null
data.dropna(how="all", inplace=True)
# dropping columns where all values are null
data.dropna(axis="columns", how="all", inplace=True)
# filling NA values
data["Entity"].fillna("No Country", inplace=True)
data["Code"].fillna("No Code", inplace=True)
data["Year"].fillna("No Year", inplace=True)
data["Per capita CO2 emissions (tonnes per capita)"].fillna(0, inplace=True)
# Sort by Year && Country
data.sort_values(["Year", "Entity"], inplace=True)
# renaming columns
data.rename(columns={"Entity": "Country",
"Per capita CO2 emissions (tonnes per capita)": "CO2 emissions (metric tons)"}, inplace=True)
My current dataset has data for 2 years and 197 countries, which is 394 rows.
I want to insert the data into mongodb in the following format.
[
    {
        "_id": ObjectId("5dfasdc2f7c4b0174c5d01bc"),
        "year": 2016,
        "countries": [
            {
                "name": "Afghanistan",
                "code": "AFG",
                "CO2 emissions (metric tons)": 0.366302
            },
            {
                "name": "Albania",
                "code": "ALB",
                "CO2 emissions (metric tons)": 0.366302
            }
        ]
    },
    {
        "_id": ObjectId("5dfasdc2f7c4b0174c5d01bc"),
        "year": 2017,
        "countries": [
            {
                "name": "Afghanistan",
                "code": "AFG",
                "CO2 emissions (metric tons)": 0.366302
            },
            {
                "name": "Albania",
                "code": "ALB",
                "CO2 emissions (metric tons)": 0.366302
            }
        ]
    }
]
I want one object for each year.
Inside that I want to nest all the countries and their related information.
To be precise, I want my database to have 2 (max) objects and 197 nested objects inside each main object, so each year is listed only once in the database, whereas each country appears twice, once for each year.
Is there a better structure to store this data? Please specify the steps to store it in MongoDB, and I'd really appreciate it if you can suggest a good 'mongoose for NodeJS'-like ODM driver for Python.

Use the groupby function to split the values in your dataframe into separate groups per year.
Use the to_dict function with the orient parameter set to 'records' to convert each group into a JSON array.
Use the pymongo API to connect to the DB and insert the values.
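A minimal sketch of those three steps, assuming a local MongoDB instance; the database and collection names ('emissions', 'co2_by_year') are placeholders:
# a minimal sketch, assuming a local MongoDB instance;
# the database and collection names below are placeholders
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["emissions"]["co2_by_year"]

documents = []
for year, group in data.groupby("Year"):
    countries = (group.rename(columns={"Country": "name", "Code": "code"})
                      [["name", "code", "CO2 emissions (metric tons)"]]
                      .to_dict(orient="records"))
    # one document per year, with all of that year's countries nested inside it
    documents.append({"year": int(year), "countries": countries})

collection.insert_many(documents)
As for a Mongoose-like ODM for Python, MongoEngine is the option most often suggested; pymongo itself is the low-level driver it builds on.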

Related

How to use pandas groupby on one column with agg - max one col, min another col - without producing multi-level columns

I have the following pandas DataFrame:
account_number = [1234, 5678, 9012, 1234.0, 5678, 9012, 1234.0, 5678, 9012, 1234.0, 5678, 9012]
client_name = ["Ford", "GM", "Honda", "Ford", "GM", "Honda", "Ford", "GM", "Honda", "Ford", "GM", "Honda"]
database = ["DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda", "DB_Ford", "DB_GM", "DB_Honda"]
server = ["L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12", "L01SQL04", "L01SQL08", "L01SQL12"]
order_num = [2145479, 2145506, 2145534, 2145603, 2145658, 2429513, 2145489, 2145516, 2145544, 2145499, 2145526, 2145554]
customer_dob = ["1967-12-01", "1963-07-09", "1986-12-05", "1967-11-01", None, "1986-12-05", "1967-12-01", "1963-07-09", "1986-12-05", "1967-12-01", "1963-07-09", "1986-12-04"]
purchase_date = ["2022-06-18", "2022-04-11", "2021-01-18", "2022-06-20", "2022-04-11", "2021-01-18", "2022-06-22", "2022-04-13", "2021-01-18", "2022-06-24", "2022-04-18", "2021-01-18"]
d = {
    "account_number": account_number,
    "client_name": client_name,
    "database": database,
    "server": server,
    "order_num": order_num,
    "customer_dob": customer_dob,
    "purchase_date": purchase_date,
}
df = pd.DataFrame(data=d)
dates = ["customer_dob", "purchase_date"]
for date in dates:
    df[date] = pd.to_datetime(df[date])
The customer's date of birth (DOB) and purchase date (PD) should be the same per account_number, but since there can be a data entry error on either one, I want to perform a groupby on the account_number and get the max on the DOB, and the min on the PD. This is easy to do if all I want are those two columns in addition to the account_number:
result = df.groupby("account_number").agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
However, I want the other columns as well, as they are guaranteed to be the same for each account_number. The problem is that when I attempt to include the other columns, I get multi-level columns, which I don't want. This first attempt not only produced multi-level columns, but it doesn't even show the actual values for DOB and PD:
result = df.groupby("account_number")["client_name", "database", "server", "order_num"].agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
The second attempt included the DOB and PD, but now twice for each account number, while still producing multi-level columns:
result = df.groupby("account_number")["client_name", "database", "server", "order_num", "customer_dob", "purchase_date"].agg(
    {"customer_dob": "max", "purchase_date": "min"}).reset_index()
result
I just want the end result to look like this:
So, that's my question to all you Python experts: what do I need to do to accomplish this?
Leaving the order number out, per your comment above. If the order numbers are the same for an account, then add order_num to the column list in the merge:
result = df.groupby("account_number").agg({"customer_dob": "max", "purchase_date": "min"}).reset_index()
result.merge(df[['account_number', 'client_name', 'database', 'server']],
             how='left',
             on='account_number').drop_duplicates()
account_number customer_dob purchase_date client_name database server
0 1234.0 1967-12-01 2022-06-18 Ford DB_Ford L01SQL04
4 5678.0 1963-07-09 2022-04-11 GM DB_GM L01SQL08
8 9012.0 1986-12-05 2021-01-18 Honda DB_Honda L01SQL12
You can also use named aggregation, which keeps a single level of columns:
agg(new_col_1=(col_1, 'sum'), new_col_2=(col_2, 'min'))
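Applied to this question's frame, a sketch with named aggregation could look like the following ('first' is used for the columns that the question says are constant per account_number):
# a sketch using named aggregation, which avoids multi-level columns
result = (df.groupby("account_number", as_index=False)
            .agg(client_name=("client_name", "first"),
                 database=("database", "first"),
                 server=("server", "first"),
                 customer_dob=("customer_dob", "max"),
                 purchase_date=("purchase_date", "min")))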

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful column name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')
# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})
# here we concat the exploded series to a frame
exploded_df = (pd.concat(expp, axis=1).stack().to_frame(name='somecol2')
               .reset_index(level=2).rename(columns={'level_2': 'somecol'}))
# and now we concat the rows with dict elements with the rows with non dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
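Separately, a possible shortcut worth sketching: pd.json_normalize flattens every nested key into a dotted column name in one pass, which can then be transposed into a row-per-metric table:
# an alternative sketch using json_normalize; the resulting index entries
# look like "external_resource_count.extensions.jar"
flat = pd.json_normalize(extracted_metrics, sep='.')
table = flat.T.rename(columns={0: 'value'})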

Custom Histogram aggregation in Elasticsearch

I have a index of following structure
item_id: unique item id
sale_date: date of the sale
price: price of the sale wrt the date
I want to create a histogram of the latest sale price per item: a terms aggregation on item_id, with a histogram over each item's last/latest price.
My first choice was to do a terms aggregation on item_id, pick the price from a top_hits aggregation with size 1 ordered by sale_date desc, and build the histogram on the Python side.
But since the data runs to tens of millions of records for one month, it is not viable to download all the source documents in time to build the histogram.
Note: some items sell daily and some at other intervals, which makes it tricky to just pick the latest sale_date.
Updated:
Input: item-based sales time series data.
Output: histogram of the count of items that fall into certain price buckets, based on the latest information.
I have a workaround that I used in a similar case: you can use a max aggregation on the date field, and order the terms aggregation by that nested aggregation's value, like so:
"aggs": {
"item ID": {
"terms": {
"field": "item_id",
"size": 10000
},
"aggs": {
"price": {
"terms": {
"field": "price",
"size": 1,
"order": {
"sale_date": "desc"
}
},
"aggs": {
"sale_date": {
"max": {
"field": "sale_date"
}
}
}
}
}
}
}
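For completeness, a minimal sketch of running that aggregation from Python, assuming the 8.x elasticsearch client; the host URL and the index name "sales" are placeholders:
# a minimal sketch; host URL and index name are placeholders
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="sales",
    size=0,  # only the aggregations are needed, not the hits
    aggs={
        "item ID": {
            "terms": {"field": "item_id", "size": 10000},
            "aggs": {
                "price": {
                    "terms": {"field": "price", "size": 1,
                              "order": {"sale_date": "desc"}},
                    "aggs": {"sale_date": {"max": {"field": "sale_date"}}}
                }
            }
        }
    },
)

# latest price per item, ready to histogram on the Python side
latest_prices = [b["price"]["buckets"][0]["key"]
                 for b in resp["aggregations"]["item ID"]["buckets"]
                 if b["price"]["buckets"]]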
I hope that helps; please let me know if it works for you.

Create nested dictionary from Pandas DataFrame

I have a requirement to create a nested dictionary from a Pandas DataFrame.
Below is an example dataset in CSV format:
hostname,nic,vlan,status
server1,eth0,100,enabled
server1,eth2,200,enabled
server2,eth0,100
server2,eth1,100,enabled
server2,eth2,200
server1,eth1,100,disabled
Once the CSV is imported as a DataFrame I have:
>>> import pandas as pd
>>>
>>> df = pd.read_csv('test.csv')
>>>
>>> df
hostname nic vlan status
0 server1 eth0 100 enabled
1 server1 eth2 200 enabled
2 server2 eth0 100 NaN
3 server2 eth1 100 enabled
4 server2 eth2 200 NaN
5 server1 eth1 100 disabled
The output nested dictionary/JSON needs to group by the first two columns (hostname and nic), for example:
{
"hostname": {
"server1": {
"nic": {
"eth0": {
"vlan": 100,
"status": "enabled"
},
"eth1": {
"vlan": 100,
"status": "disabled"
},
"eth2": {
"vlan": 200,
"status": "enabled"
}
}
},
"server2": {
"nic": {
"eth0": {
"vlan": 100
},
"eth1": {
"vlan": 100,
"status": "enabled"
},
"eth2": {
"vlan": 200
}
}
}
}
}
I need to account for:
Missing data, for example not all rows will include 'status'. If this happens we just skip it in the output dictionary
hostnames in the first column may be listed out of order. For example, rows 0, 1 and 5 must be correctly grouped under server1 in the output dictionary
Extra columns beyond vlan and status may be added in future. These must be correctly grouped under hostname and nic
I have looked at groupby and multiindex in the Pandas documentation, but as a newcomer I have got stuck.
Any help is appreciated on the best method to achieve this.
It may help to group the df first: df_new = df.groupby(["hostname", "nic"], as_index=False) - note that as_index=False preserves the dataframe format.
You can then use df_new.to_json(orient = 'records', lines=True) to convert your df to json format (as jtweeder mentions in comments). Once you get desired format and would like to write out, you can do something like:
with open('temp.json', 'w') as f:
f.write(df_new.to_json(orient='records', lines=True))
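Since the grouped object on its own will not produce the nested shape shown in the question, here is one sketch that builds that dictionary explicitly; NaN fields (such as a missing status) are skipped, and any extra columns added later are picked up automatically:
# a sketch that builds the nested dict from the question;
# NaN fields (e.g. a missing status) are simply dropped
import pandas as pd

df = pd.read_csv('test.csv')

result = {"hostname": {}}
for (hostname, nic), group in df.groupby(["hostname", "nic"]):
    row = group.iloc[0].drop(["hostname", "nic"]).dropna().to_dict()
    result["hostname"].setdefault(hostname, {"nic": {}})["nic"][nic] = row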

Load a dataframe from a single json object

I have the following json object:
{
"Name": "David",
"Gender": "M",
"Date": "2014-01-01",
"Address": {
"Street": "429 Ford",
"City": "Oxford",
"State": "DE",
"Zip": 1009
}
}
How would I load this into a pandas dataframe so that it orients itself as:
name gender date address
David M 2014-01-01 {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.
You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
if your JSON file is composed of 1 JSON object per line (not an array, not a pretty printed JSON object)
then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want
if file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on 1 line, then you get:
If you use
df = pd.read_json(file, orient='records')
you can load as 1 key per column, but the sub-keys will be split up into multiple rows.
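Another option, if you are happy to read the file yourself, is to load the object with the json module and wrap it in a one-element list; pd.json_normalize would instead flatten the Address dict into separate columns:
# a sketch assuming the file contains exactly the single JSON object shown above;
# "data.json" is a placeholder filename
import json
import pandas as pd

with open("data.json") as f:
    obj = json.load(f)

df = pd.DataFrame([obj])        # one row; Address stays as a dict in its cell
# df = pd.json_normalize(obj)   # alternative: flattens Address into Address.Street, etc.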
