Merge remaining columns after groupBy and remove NaT/NaNs - python

Input

id   name   country  lost_item  year  status    resolved_date  closed_date  refunded_date
123  John   US       Bike       2020  Resolved  2021-12-25
125  Mike   CAN      Car        2021  Refunded                              2021-11-22
123  John   US       Car        2019  Resolved  2021-12-25
563  Steve  CAN      Battery    2022  Closed                   2019-02-03
Desired output
{
  "items": {
    "item": [
      {
        "id": "123",
        "name": "John",
        "categories": {
          "category": [
            {
              "lost_item": "Bike",
              "year": "2020"
            },
            {
              "lost_item": "Car",
              "year": "2019"
            }
          ]
        },
        "country": "US",
        "status": "Resolved",
        "resolved_date": "2021-12-25"
      },
      {
        "id": "125",
        "name": "Mike",
        "categories": {
          "category": [
            {
              "lost_item": "Car",
              "year": "2021"
            }
          ]
        },
        "country": "CAN",
        "status": "Refunded",
        "refunded_date": "2021-11-22"
      },
      {
        "id": "563",
        "name": "Steve",
        "categories": {
          "category": [
            {
              "lost_item": "Battery",
              "year": "2022"
            }
          ]
        },
        "country": "CAN",
        "status": "Closed",
        "closed_date": "2019-02-03"
      }
    ]
  }
}
My code:
import json
import pandas as pd

df = pd.read_excel('C:/Users/hero/Desktop/sample.xlsx', sheet_name='catalog')
df["closed_date"] = df["closed_date"].astype(str)
df["resolved_date"] = df["resolved_date"].astype(str)
df["refunded_date"] = df["refunded_date"].astype(str)

partial = (
    df.groupby(['id', 'name', 'country', 'status',
                'closed_date', 'resolved_date', 'refunded_date'], dropna=False)
      .apply(lambda x: {"category": x[['lost_item', 'year']].to_dict('records')})
      .reset_index(name="categories")
      .to_dict(orient="records")
)

res = []
for record in partial:  # renamed from `dict` to avoid shadowing the built-in
    clean = {key: value for (key, value) in record.items() if value != "NaT"}
    res.append(clean)

print(json.dumps(res, indent=2))  # I will be writing the final payload to a JSON file.
In my input the fields id, name, country and status are mandatory. The fields resolved_date, closed_date and refunded_date are optional and may be empty.
My questions:

1. Does including columns that contain NaN values in the groupby have side effects on large datasets? I didn't notice any problem with the sample input above.
2. Can I remove the fields resolved_date, closed_date and refunded_date from the groupby and append these columns after grouping?
3. What's the best way to handle the NaNs in the dataset? For my use case, if a NaN is present I have to drop that particular key, not the entire row.

Please let me know if there is any room for improvement in my existing code. Any help is appreciated.
Thanks
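
For questions 2 and 3, one possible direction — a minimal sketch, assuming the .astype(str) casts above are skipped so empty dates stay as NaT, and that each group of mandatory fields carries at most one non-null value per date column — is to group on the mandatory fields only and attach the optional date keys afterwards:

import json
import pandas as pd

mandatory = ['id', 'name', 'country', 'status']
date_cols = ['resolved_date', 'closed_date', 'refunded_date']  # optional fields

def to_item(group):
    first = group.iloc[0]
    item = {k: first[k] for k in mandatory}
    item["categories"] = {"category": group[['lost_item', 'year']].to_dict('records')}
    # keep a date key only when a value is present; NaT/NaN keys are dropped
    for col in date_cols:
        if pd.notna(first[col]):
            item[col] = str(first[col])[:10]  # 'YYYY-MM-DD'
    return item

items = [to_item(g) for _, g in df.groupby(mandatory, sort=False)]
payload = {"items": {"item": items}}
print(json.dumps(payload, indent=2, default=str))  # default=str covers numpy scalars

This sidesteps NaNs in the groupby keys entirely, and pd.notna drops missing keys without comparing against the string "NaT".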

Related

Best Way to convert the data in the format

I have a data structure that is something like this:
my_data = [
    ('Continent1', 'Country1', 'State1'),
    ('Continent1', 'Country1', 'State2'),
    ('Continent1', 'Country2', 'State1'),
    ('Continent1', 'Country2', 'State2'),
    ('Continent1', 'Country2', 'State3', 'City1', 11111)
]
The input is not limited to State; it can be narrowed down further, e.g.
Continent ==> Country ==> State ==> City ==> Zip, with State, City and Zip being optional fields.
I wish to convert it to a JSON payload like the following, keyed on the fields present in each row:
{
    "Regions": [{
        "Continent": "Continent1",
        "Country": "Country1",
        "State": "State1"
    }, {
        "Continent": "Continent1",
        "Country": "Country1",
        "State": "State2"
    }, {
        "Continent": "Continent1",
        "Country": "Country2",
        "State": "State1"
    }, {
        "Continent": "Continent1",
        "Country": "Country2",
        "State": "State2"
    }, {
        "Continent": "Continent1",
        "Country": "Country2",
        "State": "State3",
        "City": "City1",
        "Zip": "11111"
    }]
}
Any pseudocode or approach that would produce this output for inputs of varying length would be appreciated.
keys = ["Continent", "Country", "State", "City", "Zip"]
transformed_data = {
    "Regions": [dict(zip(keys, row)) for row in my_data]
}
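
This works because zip stops at the shorter of its inputs, so rows with fewer fields simply omit the trailing keys:

# zip pairs keys with row values and stops at the shorter sequence
keys = ["Continent", "Country", "State", "City", "Zip"]
print(dict(zip(keys, ('Continent1', 'Country1', 'State1'))))
# {'Continent': 'Continent1', 'Country': 'Country1', 'State': 'State1'}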

Python, Avoid ugly nested for loop

I'm new to python programming.
I have tried a lot to avoid these nested for loops, but without success.
My data input like:
[
    {
        "province_id": "1",
        "name": "HCM",
        "districts": [
            {
                "district_id": "1",
                "name": "Thu Duc",
                "wards": [
                    {"ward_id": "1", "name": "Linh Trung"},
                    {"ward_id": "2", "name": "Linh Chieu"}
                ]
            },
            {
                "district_id": "2",
                "name": "Quan 9",
                "wards": [
                    {"ward_id": "3", "name": "Hiep Phu"},
                    {"ward_id": "4", "name": "Long Binh"}
                ]
            }
        ]
    },
    {
        "province_id": "2",
        "name": "Binh Duong",
        "districts": [
            {
                "district_id": "3",
                "name": "Di An",
                "wards": [
                    {"ward_id": "5", "name": "Dong Hoa"},
                    {"ward_id": "6", "name": "Di An"}
                ]
            },
            {
                "district_id": "4",
                "name": "Thu Dau Mot",
                "wards": [
                    {"ward_id": "7", "name": "Hiep Thanh"},
                    {"ward_id": "8", "name": "Hiep An"}
                ]
            }
        ]
    }
]
And my code is:
for province in data:
    for district in province['districts']:
        for ward in district['wards']:
            # Execute my function
            print('{}, {}, {}'.format(ward['name'], district['name'], province['name']))
Output
Linh Trung, Thu Duc, HCM
Linh Chieu, Thu Duc, HCM
Hiep Phu, Quan 9, HCM
...
Even though my code is working it looks pretty ugly.
How can I avoid these nested for loops?
Your data structure is naturally nested, but one option you have for neatening your code is to write a generator function for iterating over it:
def all_wards(data):
    for province in data:
        for district in province['districts']:
            for ward in district['wards']:
                yield province, district, ward
This function has the same triply-nested loop in it as you currently have, but everywhere else in your code, you can now iterate over the data structure with a single non-nested loop:
for province, district, ward in all_wards(data):
    print('{}, {}, {}'.format(ward['name'], district['name'], province['name']))
If you prefer to avoid having too much indentation, here's an equivalent way to write the function, similar to @adarian's answer but without creating a temporary list:
def all_wards(data):
    return (
        (province, district, ward)  # the tuple needs parentheses inside a genexp
        for province in data
        for district in province['districts']
        for ward in district['wards']
    )
Here is a one-liner version:
[
    print("{}, {}, {}".format(ward["name"], district["name"], province["name"]))
    for province in data
    for district in province["districts"]
    for ward in district["wards"]
]
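
Note that a comprehension used only for its side effects builds and throws away a list of None values, so a plain loop or the generator approach above is usually preferred.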
You could do something like this:
def print_district(district, province):
    for ward in district['wards']:
        print('{}, {}, {}'.format(ward['name'], district['name'], province['name']))

def print_province(province):
    for district in province['districts']:
        print_district(district, province)

for province in data:
    print_province(province)

Python: most efficient way to categorize transactions

I have a large list of transactions that I want to categorize.
It looks like this:
transactions: [
    {
        "id": "20200117-16045-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "SuperB Vesterbro T 74637",
        "originalText": "SuperB Vesterbro T 74637",
        "details": null,
        "category": null,
        "amount": {"value": -160.45, "currency": "DKK"},
        "balance": {"value": 12572.68, "currency": "DKK"},
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200117-4800-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "Rent 45228",
        "originalText": "Rent 45228",
        "details": null,
        "category": null,
        "amount": {"value": -48.00, "currency": "DKK"},
        "balance": {"value": 12733.13, "currency": "DKK"},
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200114-1200-0",
        "date": "2020-01-14",
        "creationTime": null,
        "text": "Superbest 86125",
        "originalText": "SUPERBEST 86125",
        "details": null,
        "category": null,
        "amount": {"value": -12.00, "currency": "DKK"},
        "balance": {"value": 12781.13, "currency": "DKK"},
        "type": "Card",
        "state": "Booked"
    }
]
I loaded in the data like this:
import json
import pandas as pd
from pandas.io.json import json_normalize

with open('transactions.json') as transactions:
    file = json.load(transactions)
data = json_normalize(file)['transactions'][0]
df = pd.DataFrame(data)  # the original had a stray `return` here, outside any function
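
If the file really is just {"transactions": [...]} (an assumption on my part), newer pandas can flatten the list directly with pd.json_normalize:

# assumes transactions.json has the shape {"transactions": [...]}
import json
import pandas as pd

with open('transactions.json') as f:
    df = pd.json_normalize(json.load(f)['transactions'])
# nested dicts become dotted columns, e.g. 'amount.value' and 'amount.currency'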
And I have the following categories so far that I want to group the transactions by:
CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent']
}
Now I would like to loop through each row in the DataFrame and group each transaction.
I would like to do this, by checking if text contains one of the values from the CATEGORIES dictionary.
If so, that transaction should get categorized as the key of the CATEGORIES dictionary - for instance Groceries.
How do I do this most efficiently?
IIUC,
we can build a pipe-delimited pattern from your dictionary and do some assignment with .loc:
print(df)
for k, v in CATEGORIES.items():
    pat = '|'.join(v)
    df.loc[df['text'].str.contains(pat), 'category'] = k

print(df[['text', 'category']])

                       text   category
0  SuperB Vesterbro T 74637  Groceries
1                Rent 45228    Housing
2           Superbest 86125  Groceries
A more efficient solution:
We create a single list of all your values and extract them with str.extract; at the same time we re-create your dictionary, so each value is now a key that we map onto your target dataframe.
words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        mapping_dict[item] = k

ext = df['text'].str.extract(f"({'|'.join(words)})")
df['category'] = ext[0].map(mapping_dict)
print(df)

                       text   category
0  SuperB Vesterbro T 74637  Groceries
1                Rent 45228    Housing
2           Superbest 86125  Groceries
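
One caveat worth adding: str.contains and str.extract treat the pattern as a regular expression by default, so if a category keyword could contain regex metacharacters it should be escaped first:

import re

# escape each keyword in case it contains metacharacters such as '.' or '+'
pat = '|'.join(re.escape(w) for w in words)
ext = df['text'].str.extract(f"({pat})")
df['category'] = ext[0].map(mapping_dict)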

How to convert DataFrame into nested JSON

I'm trying to export a DataFrame into nested (hierarchical) JSON for D3.js, using a solution that only handles one level (parent, children).
Any help would be appreciated. I'm new to python
My DataFrame contains 7 levels
Here is the expected solution
JSON Example:
{
    "name": "World",
    "children": [
        {
            "name": "Europe",
            "children": [
                {
                    "name": "France",
                    "children": [
                        {
                            "name": "Paris",
                            "population": 1000000
                        }
                    ]
                }
            ]
        }
    ]
}
and here is the python method:
def to_flare_json(df, filename):
    """Convert dataframe into nested JSON as in flare files used for D3.js"""
    flare = dict()
    d = {"name": "World", "children": []}
    for index, row in df.iterrows():
        parent = row[0]
        child = row[1]
        child1 = row[2]
        child2 = row[3]
        child3 = row[4]
        child4 = row[5]
        child5 = row[6]
        child_value = row[7]
        # Make a list of keys
        key_list = []
        for item in d['children']:
            key_list.append(item['name'])
        # if 'parent' is NOT a key in flare.json, append it
        if parent not in key_list:
            d['children'].append({"name": parent, "children": [{"value": child_value, "name1": child}]})
        # if parent IS a key in flare.json, add a new child to it
        else:
            d['children'][key_list.index(parent)]['children'].append({"value": child_value, "name11": child})
    flare = d
    # export the final result to a json file
    with open(filename + '.json', 'w') as outfile:
        json.dump(flare, outfile, indent=4, ensure_ascii=False)
    return "Done"
[EDIT]
Here is a sample of my df:

World  Continent  Region          Country  State          City   Boroughs  Population
1      Europe     Western Europe  France   Ile de France  Paris  17        821964
1      Europe     Western Europe  France   Ile de France  Paris  19        821964
1      Europe     Western Europe  France   Ile de France  Paris  20        821964
The structure you want is clearly recursive so I made a recursive function to fill it:
def create_entries(df):
    entries = []
    # Stopping case
    if df.shape[1] == 2:  # only 2 columns left
        for i in range(df.shape[0]):  # iterating on rows
            entries.append(
                {"Name": df.iloc[i, 0],
                 df.columns[-1]: df.iloc[i, 1]}
            )
    # Iterating case
    else:
        values = set(df.iloc[:, 0])  # Getting the set of unique values
        for v in values:
            entries.append(
                {"Name": v,
                 # reiterating the process but without the first column
                 # and only the rows with the current value
                 "Children": create_entries(
                     df.loc[df.iloc[:, 0] == v].iloc[:, 1:]
                 )}
            )
    return entries
All that's left is to create the dictionary and call the function:
mydict = {"Name": "World",
"Children": create_entries(data.iloc[:, 1:])}
Then you just write your dict to a JSON file.
I hope my comments are explicit enough, the idea is to recursively use the first column of the dataset as the "Name" and the rest as the "Children".
Thank you Syncrossus for the answer, but this results in a different branch for each borough or city.
The result is this:
"Name": "World",
"Children": [
{
"Name": "Western Europe",
"Children": [
{
"Name": "France",
"Children": [
{
"Name": "Ile de France",
"Children": [
{
"Name": "Paris",
"Children": [
{
"Name": "17ème",
"Population": 821964
}
]
}
]
}
]
}
]
},{
"Name": "Western Europe",
"Children": [
{
"Name": "France",
"Children": [
{
"Name": "Ile de France",
"Children": [
{
"Name": "Paris",
"Children": [
{
"Name": "10ème",
"Population": 154623
}
]
}
]
}
]
}
]
}
But the desired result is this:
{
    "Name": "World",
    "Children": [
        {
            "Continent": "Europe",
            "Children": [
                {
                    "Region": "Western Europe",
                    "Children": [
                        {
                            "Country": "France",
                            "Children": [
                                {
                                    "State": "Ile De France",
                                    "Children": [
                                        {
                                            "City": "Paris",
                                            "Children": [
                                                {"Boroughs": "17ème", "Population": 82194},
                                                {"Boroughs": "16ème", "Population": 99194}
                                            ]
                                        },
                                        {
                                            "City": "Saint-Denis",
                                            "Children": [
                                                {"Boroughs": "10ème", "Population": 1294},
                                                {"Boroughs": "11ème", "Population": 45367}
                                            ]
                                        }
                                    ]
                                }
                            ]
                        },
                        {
                            "Country": "Belgium",
                            "Children": [
                                {
                                    "State": "Oost-Vlaanderen",
                                    "Children": [
                                        {
                                            "City": "Gent",
                                            "Children": [
                                                {"Boroughs": "2ème", "Population": 1234},
                                                {"Boroughs": "4ème", "Population": 7456}
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

parsing nested JSON into multiple dataframe using pandas python

I have a nested JSON as shown below and want to parse it into multiple dataframes in Python. Please help.
{
    "tableName": "cases",
    "url": "EndpointVoid",
    "tableDataList": [{
        "_id": "100017252700",
        "title": "Test",
        "type": "TECH",
        "created": "2016-09-06T19:00:17.071Z",
        "createdBy": "193164275",
        "lastModified": "2016-10-04T21:50:49.539Z",
        "lastModifiedBy": "1074113719",
        "notes": [{
            "id": "30",
            "title": "Multiple devices",
            "type": "INCCL",
            "origin": "D",
            "componentCode": "PD17A",
            "issueCode": "IP321",
            "affectedProduct": "134322",
            "summary": "testing the json",
            "caller": {
                "email": "katie.slabiak@spps.org",
                "phone": "651-744-4522"
            }
        }, {
            "id": "50",
            "title": "EDU: Multiple Devices - Lightning-to-USB Cable",
            "type": "INCCL",
            "origin": "D",
            "componentCode": "PD17A",
            "issueCode": "IP321",
            "affectedProduct": "134322",
            "summary": "parsing json 2",
            "caller": {
                "email": "testing1@test.org",
                "phone": "123-345-1111"
            }
        }],
        "syncCount": 2316,
        "repair": [{
            "id": "D208491610",
            "created": "2016-09-06T19:02:48.000Z",
            "createdBy": "193164275",
            "lastModified": "2016-09-21T12:49:47.000Z"
        }, {
            "id": "D208491610"
        }, {
            "id": "D208491628",
            "created": "2016-09-06T19:03:37.000Z",
            "createdBy": "193164275",
            "lastModified": "2016-09-21T12:49:47.000Z"
        }],
        "enterpriseStatus": "8"
    }],
    "dateTime": 1475617849,
    "primaryKeys": ["$._id"],
    "primaryKeyVals": ["100017252700"],
    "operation": "UPDATE"
}
I want to parse this and create three tables/dataframes/CSVs as shown below. Please help.
Output table in this format
I don't think this is the best way, but I wanted to show you one possibility.
import pandas as pd
from pandas.io.json import json_normalize
import json

with open('your_sample.json') as f:
    dt = json.load(f)
Table 1
df1 = json_normalize(dt, 'tableDataList', 'dateTime')[['_id', 'title', 'type', 'created', 'createdBy', 'lastModified', 'lastModifiedBy', 'dateTime']]
print(df1)

            _id title  type                   created  createdBy  \
0  100017252700  Test  TECH  2016-09-06T19:00:17.071Z  193164275

               lastModified lastModifiedBy    dateTime
0  2016-10-04T21:50:49.539Z     1074113719  1475617849
Table 2
df2 = json_normalize(dt['tableDataList'], 'notes', '_id')
df2['phone'] = df2['caller'].map(lambda x: x['phone'])
df2['email'] = df2['caller'].map(lambda x: x['email'])
df2 = df2[['_id', 'id', 'title', 'email', 'phone']]
print(df2)

            _id  id                                           title  \
0  100017252700  30                                Multiple devices
1  100017252700  50  EDU: Multiple Devices - Lightning-to-USB Cable

                    email         phone
0  katie.slabiak@spps.org  651-744-4522
1       testing1@test.org  123-345-1111
Table 3
df3 = json_normalize(dt['tableDataList'], 'repair', '_id').dropna()
print(df3)

                    created  createdBy          id              lastModified  \
0  2016-09-06T19:02:48.000Z  193164275  D208491610  2016-09-21T12:49:47.000Z
2  2016-09-06T19:03:37.000Z  193164275  D208491628  2016-09-21T12:49:47.000Z

            _id
0  100017252700
2  100017252700
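
Since the question also asks for CSVs, each frame can then be written out (the filenames here are my own choice):

# write each dataframe to its own CSV file (hypothetical filenames)
df1.to_csv('cases.csv', index=False)
df2.to_csv('notes.csv', index=False)
df3.to_csv('repairs.csv', index=False)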
