I have a problem with a pandas DataFrame: I don't understand how to create new rows by merging it with a lookup table.
For instance, I have this dataframe:
shops = [{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Rexona', 'Value': 10},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'AXE', 'Value': 20},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Old Spice', 'Value': 30},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Camel', 'Value': 40},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Dove', 'Value': 50},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Rexona', 'Value': 10},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'CIF', 'Value': 20},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Old Spice', 'Value': 30},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Camel', 'Value': 40}]
I also have a lookup ("dictionary") dataframe that maps each Chain to its Brands:
chain_brands = [{'Chain': 'SeQu', 'Brand': 'Rexona'},
{'Chain': 'SeQu', 'Brand': 'Axe'},
{'Chain': 'SeQu', 'Brand': 'Old Spice'},
{'Chain': 'SeQu', 'Brand': 'Camel'},
{'Chain': 'SeQu', 'Brand': 'Dove'},
{'Chain': 'SeQu', 'Brand': 'CIF'}]
So, for every shop, I need to add a row with Value 0 for each brand of the chain that the shop does not have yet. It should look like this:
output = [{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Rexona', 'Value': 10},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'AXE', 'Value': 20},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Old Spice', 'Value': 30},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Camel', 'Value': 40},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Dove', 'Value': 50},
{'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'CIF', 'Value': 0},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Rexona', 'Value': 10},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'CIF', 'Value': 20},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Old Spice', 'Value': 30},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Axe', 'Value': 0},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Camel', 'Value': 40},
{'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Dove', 'Value': 0}]
Thanks!
You can create a multi-index from the chain_brands dataframe and then use groupby together with reindex to solve this:
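import pandas as pd

# shops and chain_brands are the question's lists, converted to DataFrames
shops = pd.DataFrame(shops)
chain_brands = pd.DataFrame(chain_brands)

mi = pd.MultiIndex.from_arrays(chain_brands.values.T, names=['Chain', 'Brand'])
# selecting [['Value']] keeps the grouping columns out of each group
s = shops.set_index(['Chain', 'Brand']).\
    groupby(['Location', 'Shop'])[['Value']].\
    apply(lambda x: x.reindex(mi, fill_value=0)).\
    reset_index()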
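Result (Rimme's 'AXE' row does not match 'Axe' in chain_brands, so it is replaced by an 'Axe' row with 0; make the spelling consistent first if both are meant to be the same brand):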
    Location  Shop   Chain  Brand      Value
0   UK        Rimme  SeQu   Rexona     10
1   UK        Rimme  SeQu   Axe        0
2   UK        Rimme  SeQu   Old Spice  30
3   UK        Rimme  SeQu   Camel      40
4   UK        Rimme  SeQu   Dove       50
5   UK        Rimme  SeQu   CIF        0
6   USA       Rum    SeQu   Rexona     10
7   USA       Rum    SeQu   Axe        0
8   USA       Rum    SeQu   Old Spice  30
9   USA       Rum    SeQu   Camel      40
10  USA       Rum    SeQu   Dove       0
11  USA       Rum    SeQu   CIF        20
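Another way to get the same filled-in rows is two merges instead of groupby/reindex. This is only a sketch under assumptions: the frames below are trimmed-down stand-ins for the question's data, the brand spellings are assumed to match exactly between the two frames, and shop_keys, full and result are names made up for the illustration.

```python
import pandas as pd

# Trimmed-down stand-ins for the question's data (assumption: consistent brand spelling)
shops = pd.DataFrame([
    {'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Rexona', 'Value': 10},
    {'Chain': 'SeQu', 'Shop': 'Rimme', 'Location': 'UK', 'Brand': 'Dove', 'Value': 50},
    {'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'Rexona', 'Value': 10},
    {'Chain': 'SeQu', 'Shop': 'Rum', 'Location': 'USA', 'Brand': 'CIF', 'Value': 20},
])
chain_brands = pd.DataFrame([
    {'Chain': 'SeQu', 'Brand': 'Rexona'},
    {'Chain': 'SeQu', 'Brand': 'Dove'},
    {'Chain': 'SeQu', 'Brand': 'CIF'},
])

# One row per shop that actually occurs in the data
shop_keys = shops[['Chain', 'Shop', 'Location']].drop_duplicates()

# Pair every shop with every brand its chain carries
full = shop_keys.merge(chain_brands, on='Chain', how='left')

# Attach the known values; brands a shop lacks come back as NaN and become 0
result = (
    full.merge(shops, on=['Chain', 'Shop', 'Location', 'Brand'], how='left')
        .fillna({'Value': 0})
        .astype({'Value': int})
)
print(result)
```

The rows come out in shop order first and then in the brand order of chain_brands, so no sorting step is needed for this layout.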
Related
I'm having trouble completely unnesting this JSON from an API.
[{'id': 1,
'name': 'Buzz',
'tagline': 'A Real Bitter Experience.',
'first_brewed': '09/2007',
'description': 'A light, crisp and bitter IPA brewed with English and American hops. A small batch brewed only once.',
'image_url': 'https://images.punkapi.com/v2/keg.png',
'abv': 4.5,
'ibu': 60,
'target_fg': 1010,
'target_og': 1044,
'ebc': 20,
'srm': 10,
'ph': 4.4,
'attenuation_level': 75,
'volume': {'value': 20, 'unit': 'litres'},
'boil_volume': {'value': 25, 'unit': 'litres'},
'method': {'mash_temp': [{'temp': {'value': 64, 'unit': 'celsius'},
'duration': 75}],
'fermentation': {'temp': {'value': 19, 'unit': 'celsius'}},
'twist': None},
'ingredients': {'malt': [{'name': 'Maris Otter Extra Pale',
'amount': {'value': 3.3, 'unit': 'kilograms'}},
{'name': 'Caramalt', 'amount': {'value': 0.2, 'unit': 'kilograms'}},
{'name': 'Munich', 'amount': {'value': 0.4, 'unit': 'kilograms'}}],
'hops': [{'name': 'Fuggles',
'amount': {'value': 25, 'unit': 'grams'},
'add': 'start',
'attribute': 'bitter'},
{'name': 'First Gold',
'amount': {'value': 25, 'unit': 'grams'},
'add': 'start',
'attribute': 'bitter'},
{'name': 'Fuggles',
'amount': {'value': 37.5, 'unit': 'grams'},
'add': 'middle',
'attribute': 'flavour'},
{'name': 'First Gold',
'amount': {'value': 37.5, 'unit': 'grams'},
'add': 'middle',
'attribute': 'flavour'},
{'name': 'Cascade',
'amount': {'value': 37.5, 'unit': 'grams'},
'add': 'end',
'attribute': 'flavour'}],
'yeast': 'Wyeast 1056 - American Ale™'},
'food_pairing': ['Spicy chicken tikka masala',
'Grilled chicken quesadilla',
'Caramel toffee cake'],
'brewers_tips': 'The earthy and floral aromas from the hops can be overpowering. Drop a little Cascade in at the end of the boil to lift the profile with a bit of citrus.',
'contributed_by': 'Sam Mason <samjbmason>'},
{'id': 2,
'name': 'Trashy Blonde',
'tagline': "You Know You Shouldn't",
'first_brewed': '04/2008',
'description': 'A titillating, neurotic, peroxide punk of a Pale Ale. Combining attitude, style, substance, and a little bit of low self esteem for good measure; what would your mother say? The seductive lure of the sassy passion fruit hop proves too much to resist. All that is even before we get onto the fact that there are no additives, preservatives, pasteurization or strings attached. All wrapped up with the customary BrewDog bite and imaginative twist.',
'image_url': 'https://images.punkapi.com/v2/2.png',
'abv': 4.1,
'ibu': 41.5,
'target_fg': 1010,
'target_og': 1041.7,
'ebc': 15,
'srm': 15,
'ph': 4.4,
'attenuation_level': 76,
'volume': {'value': 20, 'unit': 'litres'},
'boil_volume': {'value': 25, 'unit': 'litres'},
'method': {'mash_temp': [{'temp': {'value': 69, 'unit': 'celsius'},
'duration': None}],
'fermentation': {'temp': {'value': 18, 'unit': 'celsius'}},
'twist': None},
'ingredients': {'malt': [{'name': 'Maris Otter Extra Pale',
'amount': {'value': 3.25, 'unit': 'kilograms'}},
{'name': 'Caramalt', 'amount': {'value': 0.2, 'unit': 'kilograms'}},
{'name': 'Munich', 'amount': {'value': 0.4, 'unit': 'kilograms'}}],
'hops': [{'name': 'Amarillo',
'amount': {'value': 13.8, 'unit': 'grams'},
'add': 'start',
'attribute': 'bitter'},
{'name': 'Simcoe',
'amount': {'value': 13.8, 'unit': 'grams'},
'add': 'start',
'attribute': 'bitter'},
{'name': 'Amarillo',
'amount': {'value': 26.3, 'unit': 'grams'},
'add': 'end',
'attribute': 'flavour'},
{'name': 'Motueka',
'amount': {'value': 18.8, 'unit': 'grams'},
'add': 'end',
'attribute': 'flavour'}],
'yeast': 'Wyeast 1056 - American Ale™'},
'food_pairing': ['Fresh crab with lemon',
'Garlic butter dipping sauce',
'Goats cheese salad',
'Creamy lemon bar doused in powdered sugar'],
'brewers_tips': 'Be careful not to collect too much wort from the mash. Once the sugars are all washed out there are some very unpleasant grainy tasting compounds that can be extracted into the wort.',
'contributed_by': 'Sam Mason <samjbmason>'}]
I was able to unnest it by one level using json_normalize:
import requests
import pandas as pd
url = "https://api.punkapi.com/v2/beers"
data = requests.get(url).json()  # fetch once and keep the parsed JSON
pd.json_normalize(data)
[Image: output after using json_normalize]
Now, to unnest the column 'method.mash_temp', I included record_path:
pd.json_normalize(
data,
record_path =['method', 'mash_temp'],
meta=['id', 'name']
)
But I am having trouble adding the other columns that hold lists of dictionaries ('ingredients.malt', 'ingredients.hops') to the record_path argument.
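One possible approach, sketched here rather than offered as a definitive answer: json_normalize takes a single record_path per call, so each list of dictionaries can be normalized in its own call and linked back to the beer through the shared id. The trimmed data below stands in for the API response, the helper explode_path is hypothetical, and record_prefix is used because both the records and the metadata have a 'name' key, which json_normalize would otherwise reject as a conflicting name.

```python
import pandas as pd

# Trimmed stand-in for the API response (assumption: same nesting as the real data)
data = [
    {'id': 1, 'name': 'Buzz',
     'ingredients': {
         'malt': [{'name': 'Caramalt', 'amount': {'value': 0.2, 'unit': 'kilograms'}},
                  {'name': 'Munich', 'amount': {'value': 0.4, 'unit': 'kilograms'}}],
         'hops': [{'name': 'Fuggles', 'amount': {'value': 25, 'unit': 'grams'},
                   'add': 'start', 'attribute': 'bitter'}]}},
]

def explode_path(records, path, prefix):
    """Hypothetical helper: one row per element of the list found at `path`."""
    return pd.json_normalize(
        records,
        record_path=path,
        meta=['id', 'name'],      # columns copied from the parent record
        record_prefix=prefix,     # avoids the clash between record 'name' and meta 'name'
    )

malt = explode_path(data, ['ingredients', 'malt'], 'malt.')
hops = explode_path(data, ['ingredients', 'hops'], 'hops.')

print(malt)
print(hops)
```

Because malt, hops and mash_temp have different lengths per beer, keeping them as separate tables keyed by id avoids the row multiplication a single flat frame would cause.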
We have the following dataframe:
import pandas as pd
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})
and we are looking to convert the metrics column into 12 new columns in our dataframe, using the metric's name field as the column name, and value field as the field in the dataframe. Something like this:
rank title_name title_id artist_id artist_name label Rank Song ...
1 'And It's Still Alright' 120034150 222521 'Nathaniel Rateliff' 'CNCO' 1 "And It's Still Alright"
Here's what the value in the metrics column looks like for row 1:
our_df['metrics'][0]
[{'name': 'Rank', 'value': 1},
{'name': 'Song', 'value': "And It's Still Alright"},
{'name': 'Artist', 'value': 'Nathaniel Rateliff'},
{'name': 'TP Spins', 'value': 933},
{'name': '+/- Chg. Spins', 'value': -32},
{'name': 'LP Spins', 'value': 965},
{'name': 'Stations', 'value': '44/46'},
{'name': 'Adds', 'value': 0},
{'name': 'TP Audience', 'value': 1260000},
{'name': '+/- Chg. Audience', 'value': -40600},
{'name': 'LP Audience', 'value': 1300600},
{'name': 'TP Stream', 'value': 413101}]
The +/- in the column names may be problematic though, along with the . in Chg. This dataframe would be best if all the column names were snake_case, if the +/- was replaced with plus_minus, and if the . in Chg. was simply dropped.
Edit: we can assume that the metric names will be the same in every row in the dataframe. However, there may be other dataframes with different metric names, so it would be preferable if the names 'Rank', 'Song', 'Artist', etc. were not hardcoded. Here is the original list before it was converted into a pandas dataframe:
raw_data = [{'rank': 1,
'title_name': 'BUTTER',
'title_id': '',
'artist_id': '',
'artist_name': 'BTS',
'label': '',
'peak_position': 1,
'last_week_rank': 7,
'last_2week_rank': 8,
'metrics': [{'name': 'Rank', 'value': 1},
{'name': 'Song', 'value': 'BUTTER'},
{'name': 'Artist', 'value': 'BTS'},
{'name': 'Label Description', 'value': None},
{'name': 'Label', 'value': ' '},
{'name': 'Last Week Rank', 'value': 7},
{'name': 'Last 2 Week Rank', 'value': 8},
{'name': 'Weeks On Chart', 'value': 15}]},
{'rank': 2,
'title_name': 'STAY',
'title_id': '',
'artist_id': '',
'artist_name': 'THE KID LAROI & JUS',
'label': '',
'peak_position': 1,
'last_week_rank': 1,
'last_2week_rank': 1,
'metrics': [{'name': 'Rank', 'value': 2},
{'name': 'Song', 'value': 'STAY'},
{'name': 'Artist', 'value': 'THE KID LAROI & JUS'},
{'name': 'Label Description', 'value': None},
{'name': 'Label', 'value': ' '},
{'name': 'Last Week Rank', 'value': 1},
{'name': 'Last 2 Week Rank', 'value': 1},
{'name': 'Weeks On Chart', 'value': 8}]}]
Most likely, the fastest way is to process raw_data while it is still a list of plain dictionaries and only then construct a DataFrame from it.
import pandas as pd

records = []
for rec in raw_data:
    rec = dict(rec)  # shallow copy, so raw_data itself is left untouched
    for metric in rec.pop('metrics'):  # take the nested list out of the record
        # process name: snake_case > drop '.' > '+/-' to 'plus_minus'
        name = metric['name'].lower().replace(' ', '_').replace('.', '').replace('+/-', 'plus_minus')
        rec[name] = metric['value']
    records.append(rec)
df = pd.DataFrame(records)
Output
Resulting df
   rank  title_name  title_id  artist_id  artist_name          label  peak_position  last_week_rank  last_2week_rank  song    artist               label_description  last_2_week_rank  weeks_on_chart
0  1     BUTTER                           BTS                         1              7               8                BUTTER  BTS                                     8                 15
1  2     STAY                             THE KID LAROI & JUS         1              1               1                STAY    THE KID LAROI & JUS                     1                 8
Setup: raw_data is the list given in the question above.
Using the example's data as raw_data, i.e.
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})
raw_data = our_df.to_dict(orient='records')
Output
Resulting df from the solution above
   rank  title_name              title_id   artist_id  artist_name         label  song                    artist              tp_spins  plus_minus_chg_spins  lp_spins  stations  adds  tp_audience  plus_minus_chg_audience  lp_audience  tp_stream
0  1     And It's Still Alright  120034150  222521     Nathaniel Rateliff  CNCO   And It's Still Alright  Nathaniel Rateliff  933       -32                   965       44/46     0     1260000      -40600                   1300600      413101
1  2     Black Madonna           106938609  200160     Cage The Elephant   RCA    Black Madonna           Cage The Elephant   814       38                    776       38/46     0     1283400      -21600                   1305000      362366
Let's start decomposing your issue. After defining our_df we can generate a new dataframe based on the column metrics with:
pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']])
Which outputs:
Rank Song ... LP Audience TP Stream
0 1 And It's Still Alright ... 1300600 413101
0 2 Black Madonna ... 1305000 362366
Next it's just a question of joining them together with pd.concat() or merge. I assume the common key is the column Rank therefore I'll use merge:
our_df.drop(columns=['metrics']).merge(pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']]),left_on='rank',right_on='Rank')
Outputting the full dataframe
rank title_name ... LP Audience TP Stream
0 1 And It's Still Alright ... 1300600 413101
1 2 Black Madonna ... 1305000 362366
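The merged frame still carries the original metric names, including '+/-' and '.'. A minimal sketch of cleaning them up afterwards, assuming the naming rules from the question (snake_case, '+/-' becomes plus_minus, '.' is dropped); to_snake is a hypothetical helper and the tiny frame below only stands in for the merged result:

```python
import pandas as pd

def to_snake(name):
    """Hypothetical helper implementing the question's naming rules."""
    return (
        name.lower()
            .replace('+/-', 'plus_minus')  # spell out the sign marker
            .replace('.', '')              # drop the dot in "Chg."
            .replace(' ', '_')             # spaces to underscores
    )

# Stand-in for the merged frame, with a few of the original metric names
wide = pd.DataFrame({'Rank': [1], '+/- Chg. Spins': [-32], 'TP Audience': [1260000]})

# rename accepts a function and applies it to every column label
wide = wide.rename(columns=to_snake)
print(list(wide.columns))
```

Note that 'Rank' becomes 'rank', so on the full merged frame it would collide with the existing rank column; dropping one of the two before renaming avoids duplicate labels.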
Alternative that might be robust against missing names
metric_df = our_df.apply(
lambda r:
pd.Series(
index=list(map(lambda d: d['name'], r['metrics']))+['rank'],
data=list(map(lambda d: d['value'], r['metrics']))+[r['rank']],
),
axis=1,
)
our_df.merge(metric_df, on='rank')
box = pd.concat({index : pd.DataFrame(ent)
for index, ent in
zip( our_df.index, our_df.metrics)})
( our_df
.drop(columns = 'metrics')
.join(box.droplevel(-1))
.pivot(index=['rank', 'title_name', 'title_id', 'artist_id', 'artist_name', 'label'],
columns='name',
values='value')
.reset_index()
)
name rank title_name title_id artist_id artist_name label +/- Chg. Audience +/- Chg. Spins Adds Artist LP Audience LP Spins Rank Song Stations TP Audience TP Spins TP Stream
0 1 And It's Still Alright 120034150 222521 Nathaniel Rateliff CNCO -40600 -32 0 Nathaniel Rateliff 1300600 965 1 And It's Still Alright 44/46 1260000 933 413101
1 2 Black Madonna 106938609 200160 Cage The Elephant RCA -21600 38 0 Cage The Elephant 1305000 776 2 Black Madonna 38/46 1283400 814 362366
Working on the raw_data:
from itertools import chain, product
metrics = [ent['metrics'] for ent in raw_data]
non_metrics = [{key : value
for key, value
in ent.items()
if key != 'metrics'}
for ent in raw_data]
combo = zip(metrics, non_metrics)
combo = (product(metrics, [non_metrics])
for metrics, non_metrics in combo)
combo = chain.from_iterable(combo)
combo = [{**left, **right} for left, right in combo]
pd.DataFrame(combo)
    name               value                rank  title_name  title_id  artist_id  artist_name          label  peak_position  last_week_rank  last_2week_rank
0   Rank               1                    1     BUTTER                           BTS                         1              7               8
1   Song               BUTTER               1     BUTTER                           BTS                         1              7               8
2   Artist             BTS                  1     BUTTER                           BTS                         1              7               8
3   Label Description  None                 1     BUTTER                           BTS                         1              7               8
4   Label                                   1     BUTTER                           BTS                         1              7               8
5   Last Week Rank     7                    1     BUTTER                           BTS                         1              7               8
6   Last 2 Week Rank   8                    1     BUTTER                           BTS                         1              7               8
7   Weeks On Chart     15                   1     BUTTER                           BTS                         1              7               8
8   Rank               2                    2     STAY                             THE KID LAROI & JUS         1              1               1
9   Song               STAY                 2     STAY                             THE KID LAROI & JUS         1              1               1
10  Artist             THE KID LAROI & JUS  2     STAY                             THE KID LAROI & JUS         1              1               1
11  Label Description  None                 2     STAY                             THE KID LAROI & JUS         1              1               1
12  Label                                   2     STAY                             THE KID LAROI & JUS         1              1               1
13  Last Week Rank     1                    2     STAY                             THE KID LAROI & JUS         1              1               1
14  Last 2 Week Rank   1                    2     STAY                             THE KID LAROI & JUS         1              1               1
15  Weeks On Chart     8                    2     STAY                             THE KID LAROI & JUS         1              1               1
You can then reshape/transform into whatever you desire.
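As one example of such a reshape, the long frame can be pivoted back to one row per song. This is a sketch on a cut-down stand-in for the frame above (only two metrics and two identifying columns are kept, which is an assumption made for brevity); long_df and wide are illustrative names:

```python
import pandas as pd

# Cut-down stand-in for the long frame built above
long_df = pd.DataFrame([
    {'name': 'Rank', 'value': 1, 'rank': 1, 'title_name': 'BUTTER'},
    {'name': 'Weeks On Chart', 'value': 15, 'rank': 1, 'title_name': 'BUTTER'},
    {'name': 'Rank', 'value': 2, 'rank': 2, 'title_name': 'STAY'},
    {'name': 'Weeks On Chart', 'value': 8, 'rank': 2, 'title_name': 'STAY'},
])

wide = (
    long_df
    # identifying columns stay as rows, metric names become columns
    .pivot(index=['rank', 'title_name'], columns='name', values='value')
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'name' label on the column axis
print(wide)
```

pivot raises an error if the same (index, name) pair occurs twice, so it only fits data where each metric appears once per song.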
I have hierarchical data (more than 10 generations) that records who each person's parent/children are. I would like to represent it as a dict of dicts. Is there any way to achieve this?
Sample input (list of dicts / DataFrame):
[{'Name': 'Oli Bob', 'Location': 'United Kingdom', 'Parent': nan}, {'Name': 'Mary May', 'Location': 'Germany', 'Parent': 'Oli Bob'}, {'Name': 'Christine Lobowski', 'Location': 'France', 'Parent': 'Oli Bob'}, {'Name': 'Brendon Philips', 'Location': 'USA', 'Parent': 'Oli Bob'}, {'Name': 'Margret Marmajuke', 'Location': 'Canada', 'Parent': 'Brendon Philips'}, {'Name': 'Frank Harbours', 'Location': 'Russia', 'Parent': 'Brendon Philips'}, {'Name': 'Todd Philips', 'Location': 'United Kingdom', 'Parent': 'Frank Harbours'}, {'Name': 'Jamie Newhart', 'Location': 'India', 'Parent': nan}, {'Name': 'Gemma Jane', 'Location': 'China', 'Parent': nan}, {'Name': 'Emily Sykes', 'Location': 'South Korea', 'Parent': 'Gemma Jane'}, {'Name': 'James Newman', 'Location': 'Japan', 'Parent': nan}]
[Image: the same data in table form]
Desired Output
[
{name:"Oli Bob", location:"United Kingdom", _children:[
{name:"Mary May", location:"Germany"},
{name:"Christine Lobowski", location:"France"},
{name:"Brendon Philips", location:"USA",_children:[
{name:"Margret Marmajuke", location:"Canada"},
{name:"Frank Harbours", location:"Russia",_children:[{name:"Todd Philips", location:"United Kingdom"}]},
]},
]},
{name:"Jamie Newhart", location:"India"},
{name:"Gemma Jane", location:"China", _children:[
{name:"Emily Sykes", location:"South Korea"},
]},
{name:"James Newman", location:"Japan"},
];
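A minimal sketch of one way to build such a nested structure in plain Python, under a few assumptions: names are unique, every Parent refers to a Name that exists in the data, and Emily Sykes's parent is meant to be Gemma Jane, as the desired output shows. The rows are a shortened version of the sample, and is_missing and build_tree are hypothetical helper names.

```python
import math

# Shortened sample; NaN marks a person without a parent
rows = [
    {'Name': 'Oli Bob', 'Location': 'United Kingdom', 'Parent': float('nan')},
    {'Name': 'Mary May', 'Location': 'Germany', 'Parent': 'Oli Bob'},
    {'Name': 'Brendon Philips', 'Location': 'USA', 'Parent': 'Oli Bob'},
    {'Name': 'Frank Harbours', 'Location': 'Russia', 'Parent': 'Brendon Philips'},
    {'Name': 'Gemma Jane', 'Location': 'China', 'Parent': float('nan')},
    {'Name': 'Emily Sykes', 'Location': 'South Korea', 'Parent': 'Gemma Jane'},
]

def is_missing(value):
    """True for None or NaN, the two ways a missing parent is assumed to appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_tree(rows):
    # First pass: one output node per person, looked up by name
    nodes = {r['Name']: {'name': r['Name'], 'location': r['Location']} for r in rows}
    roots = []
    # Second pass: hang each node under its parent, or keep it as a root
    for r in rows:
        node = nodes[r['Name']]
        if is_missing(r['Parent']):
            roots.append(node)
        else:
            nodes[r['Parent']].setdefault('_children', []).append(node)
    return roots

tree = build_tree(rows)
print(tree)
```

The two passes keep the work linear in the number of rows and do not depend on the depth of the hierarchy, so more than 10 generations is not a problem. With a DataFrame as input, df.to_dict('records') would give the list of rows.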
I have a list of dictionaries and I need to count duplicates by specific keys.
For example:
[
{'name': 'John', 'age': 10, 'country': 'USA', 'height': 185},
{'name': 'John', 'age': 10, 'country': 'Canada', 'height': 185},
{'name': 'Mark', 'age': 10, 'country': 'USA', 'height': 180},
{'name': 'Mark', 'age': 10, 'country': 'Canada', 'height': 180},
{'name': 'Doe', 'age': 15, 'country': 'Canada', 'height': 185}
]
If I specify 'age' and 'country', it should return:
[
{
'age': 10,
'country': 'USA',
'count': 2
},
{
'age': 10,
'country': 'Canada',
'count': 2
},
{
'age': 15,
'country': 'Canada',
'count': 1
}
]
Or, if I specify 'name' and 'height':
[
{
'name': 'John',
'height': 185,
'count': 2
},
{
'name': 'Mark',
'height': 180,
'count': 2
},
{
'name': 'Doe',
'height': 185,
'count': 1
}
]
Maybe there is a way to implement this with Counter?
You can use itertools.groupby with a sorted list:
>>> data = [
...     {'name': 'John', 'age': 10, 'country': 'USA', 'height': 185},
...     {'name': 'John', 'age': 10, 'country': 'Canada', 'height': 185},
...     {'name': 'Mark', 'age': 10, 'country': 'USA', 'height': 180},
...     {'name': 'Mark', 'age': 10, 'country': 'Canada', 'height': 180},
...     {'name': 'Doe', 'age': 15, 'country': 'Canada', 'height': 185}
... ]
>>> from itertools import groupby
>>> key = 'age', 'country'
>>> list_sorter = lambda x: tuple(x[k] for k in key)
>>> grouper = lambda x: tuple(x[k] for k in key)
>>> result = [
...     {**dict(zip(key, k)), 'count': len([*g])}
...     for k, g in
...     groupby(sorted(data, key=list_sorter), grouper)
... ]
>>> result
[{'age': 10, 'country': 'Canada', 'count': 2},
{'age': 10, 'country': 'USA', 'count': 2},
{'age': 15, 'country': 'Canada', 'count': 1}]
>>> key = 'name', 'height'
>>> result = [
...     {**dict(zip(key, k)), 'count': len([*g])}
...     for k, g in
...     groupby(sorted(data, key=list_sorter), grouper)
... ]
>>> result
[{'name': 'Doe', 'height': 185, 'count': 1},
{'name': 'John', 'height': 185, 'count': 2},
{'name': 'Mark', 'height': 180, 'count': 2}]
If you use pandas, you can chain pandas.DataFrame.groupby, DataFrameGroupBy.size, pandas.Series.to_frame, pandas.DataFrame.reset_index and finally pandas.DataFrame.to_dict with orient='records':
>>> import pandas as pd
>>> df = pd.DataFrame(data)
>>> df.groupby(list(key)).size().to_frame('count').reset_index().to_dict('records')
[{'name': 'Doe', 'height': 185, 'count': 1},
{'name': 'John', 'height': 185, 'count': 2},
{'name': 'Mark', 'height': 180, 'count': 2}]
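The Counter route the question asks about can be sketched as follows. count_by is a hypothetical helper name, and the result order relies on Counter keeping first-seen order, which holds on Python 3.7 and later:

```python
from collections import Counter

data = [
    {'name': 'John', 'age': 10, 'country': 'USA', 'height': 185},
    {'name': 'John', 'age': 10, 'country': 'Canada', 'height': 185},
    {'name': 'Mark', 'age': 10, 'country': 'USA', 'height': 180},
    {'name': 'Mark', 'age': 10, 'country': 'Canada', 'height': 180},
    {'name': 'Doe', 'age': 15, 'country': 'Canada', 'height': 185},
]

def count_by(records, keys):
    """Hypothetical helper: count records sharing the same values for `keys`."""
    # Each record is reduced to a hashable tuple of the selected values
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    # Rebuild one dict per distinct combination and attach its count
    return [{**dict(zip(keys, combo)), 'count': n} for combo, n in counts.items()]

print(count_by(data, ('age', 'country')))
print(count_by(data, ('name', 'height')))
```

Unlike the groupby version, no sorting is needed, so the groups appear in the order they are first encountered rather than in sorted order.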
In the following example, I would like to sort the animals by the alphabetical order of their category name, which is stored in a separate list of dictionaries.
category = [{'uid': 0, 'name': 'mammals'},
{'uid': 1, 'name': 'birds'},
{'uid': 2, 'name': 'fish'},
{'uid': 3, 'name': 'reptiles'},
{'uid': 4, 'name': 'invertebrates'},
{'uid': 5, 'name': 'amphibians'}]
animals = [{'name': 'horse', 'category': 0},
{'name': 'whale', 'category': 2},
{'name': 'mollusk', 'category': 4},
{'name': 'tuna ', 'category': 2},
{'name': 'worms', 'category': 4},
{'name': 'frog', 'category': 5},
{'name': 'dog', 'category': 0},
{'name': 'salamander', 'category': 5},
{'name': 'horse', 'category': 0},
{'name': 'octopus', 'category': 4},
{'name': 'alligator', 'category': 3},
{'name': 'monkey', 'category': 0},
{'name': 'kangaroos', 'category': 0},
{'name': 'salmon', 'category': 2}]
sorted_animals = sorted(animals, key=lambda k: k['category'])
How could I achieve this?
Thanks.
You are now sorting on the category id. All you need to do is map that id to a lookup for a given category name.
Create a dictionary for the categories first so you can directly map the numeric id to the associated name from the category list, then use that mapping when sorting:
catuid_to_name = {c['uid']: c['name'] for c in category}
sorted_animals = sorted(animals, key=lambda k: catuid_to_name[k['category']])
Demo:
>>> from pprint import pprint
>>> category = [{'uid': 0, 'name': 'mammals'},
... {'uid': 1, 'name': 'birds'},
... {'uid': 2, 'name': 'fish'},
... {'uid': 3, 'name': 'reptiles'},
... {'uid': 4, 'name': 'invertebrates'},
... {'uid': 5, 'name': 'amphibians'}]
>>> animals = [{'name': 'horse', 'category': 0},
... {'name': 'whale', 'category': 2},
... {'name': 'mollusk', 'category': 4},
... {'name': 'tuna ', 'category': 2},
... {'name': 'worms', 'category': 4},
... {'name': 'frog', 'category': 5},
... {'name': 'dog', 'category': 0},
... {'name': 'salamander', 'category': 5},
... {'name': 'horse', 'category': 0},
... {'name': 'octopus', 'category': 4},
... {'name': 'alligator', 'category': 3},
... {'name': 'monkey', 'category': 0},
... {'name': 'kangaroos', 'category': 0},
... {'name': 'salmon', 'category': 2}]
>>> catuid_to_name = {c['uid']: c['name'] for c in category}
>>> pprint(catuid_to_name)
{0: 'mammals',
1: 'birds',
2: 'fish',
3: 'reptiles',
4: 'invertebrates',
5: 'amphibians'}
>>> sorted_animals = sorted(animals, key=lambda k: catuid_to_name[k['category']])
>>> pprint(sorted_animals)
[{'category': 5, 'name': 'frog'},
{'category': 5, 'name': 'salamander'},
{'category': 2, 'name': 'whale'},
{'category': 2, 'name': 'tuna '},
{'category': 2, 'name': 'salmon'},
{'category': 4, 'name': 'mollusk'},
{'category': 4, 'name': 'worms'},
{'category': 4, 'name': 'octopus'},
{'category': 0, 'name': 'horse'},
{'category': 0, 'name': 'dog'},
{'category': 0, 'name': 'horse'},
{'category': 0, 'name': 'monkey'},
{'category': 0, 'name': 'kangaroos'},
{'category': 3, 'name': 'alligator'}]
Note that within each category, the dictionaries have been left in relative input order. You could return a tuple of values from the sorting key to further apply a sorting order within each category, e.g.:
sorted_animals = sorted(
animals,
key=lambda k: (catuid_to_name[k['category']], k['name'])
)
would sort by animal name within each category, producing:
>>> pprint(sorted(animals, key=lambda k: (catuid_to_name[k['category']], k['name'])))
[{'category': 5, 'name': 'frog'},
{'category': 5, 'name': 'salamander'},
{'category': 2, 'name': 'salmon'},
{'category': 2, 'name': 'tuna '},
{'category': 2, 'name': 'whale'},
{'category': 4, 'name': 'mollusk'},
{'category': 4, 'name': 'octopus'},
{'category': 4, 'name': 'worms'},
{'category': 0, 'name': 'dog'},
{'category': 0, 'name': 'horse'},
{'category': 0, 'name': 'horse'},
{'category': 0, 'name': 'kangaroos'},
{'category': 0, 'name': 'monkey'},
{'category': 3, 'name': 'alligator'}]
imo your category structure is far too complicated - at least as long as the uid is nothing but the index, you could simply use a list for that:
category = [c['name'] for c in category]
# ['mammals', 'birds', 'fish', 'reptiles', 'invertebrates', 'amphibians']
sorted_animals = sorted(animals, key=lambda k: category[k['category']])
#[{'name': 'frog', 'category': 5}, {'name': 'salamander', 'category': 5}, {'name': 'whale', 'category': 2}, {'name': 'tuna ', 'category': 2}, {'name': 'salmon', 'category': 2}, {'name': 'mollusk', 'category': 4}, {'name': 'worms', 'category': 4}, {'name': 'octopus', 'category': 4}, {'name': 'horse', 'category': 0}, {'name': 'dog', 'category': 0}, {'name': 'horse', 'category': 0}, {'name': 'monkey', 'category': 0}, {'name': 'kangaroos', 'category': 0}, {'name': 'alligator', 'category': 3}]