I have the following list:
a = [{'cluster_id': 0, 'points': [{'id': 1, 'name': 'Alice', 'lat': 52.523955, 'lon': 13.442362}, {'id': 2, 'name': 'Bob', 'lat': 52.526659, 'lon': 13.448097}]}, {'cluster_id': 0, 'points': [{'id': 1, 'name': 'Alice', 'lat': 52.523955, 'lon': 13.442362}, {'id': 2, 'name': 'Bob', 'lat': 52.526659, 'lon': 13.448097}]}, {'cluster_id': 1, 'points': [{'id': 3, 'name': 'Carol', 'lat': 52.525626, 'lon': 13.419246}, {'id': 4, 'name': 'Dan', 'lat': 52.52443559865125, 'lon': 13.41261723049818}]}, {'cluster_id': 1, 'points': [{'id': 3, 'name': 'Carol', 'lat': 52.525626, 'lon': 13.419246}, {'id': 4, 'name': 'Dan', 'lat': 52.52443559865125, 'lon': 13.41261723049818}]}]
I would like to convert this list into a dataframe with the following columns:
cluster_id
id
name
lat
lon
so that I can save it as a CSV. I tried a couple of solutions I found, such as:
pd.concat([pd.DataFrame(l) for l in a],axis=1).T
But it didn't work as I expected.
What mistake am I making?
Thanks
You can use pd.json_normalize, with record_path pointing at the nested list and meta naming the parent field to carry along:
df = pd.json_normalize(a, record_path='points', meta='cluster_id')
print(df)
   id  name   lat        lon        cluster_id
0  1   Alice  52.523955  13.442362  0
1  2   Bob    52.526659  13.448097  0
2  1   Alice  52.523955  13.442362  0
3  2   Bob    52.526659  13.448097  0
4  3   Carol  52.525626  13.419246  1
5  4   Dan    52.524436  13.412617  1
6  3   Carol  52.525626  13.419246  1
7  4   Dan    52.524436  13.412617  1
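Since the goal was a CSV with cluster_id as the first column, the remaining steps could be sketched roughly as follows. The data is a shortened version of the list from the question, and the in-memory buffer stands in for a real output path, which is an assumption here.

```python
import io

import pandas as pd

# Shortened version of the list from the question (one point per cluster).
a = [
    {"cluster_id": 0, "points": [{"id": 1, "name": "Alice", "lat": 52.523955, "lon": 13.442362}]},
    {"cluster_id": 1, "points": [{"id": 3, "name": "Carol", "lat": 52.525626, "lon": 13.419246}]},
]

# One row per point; cluster_id is carried along from the parent record.
df = pd.json_normalize(a, record_path="points", meta="cluster_id")

# Put the columns in the order requested in the question.
df = df[["cluster_id", "id", "name", "lat", "lon"]]

# Write the CSV without the index column. An in-memory buffer is used here;
# passing a path such as "clusters.csv" (hypothetical name) writes a file instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The list in the question contains each cluster twice, so calling df.drop_duplicates() before writing may be worthwhile if those repeats are unintended.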
We have the following dataframe:
import pandas as pd
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})
We would like to convert the metrics column into 12 new columns in our dataframe, using each metric's name field as the column name and its value field as the cell value. Something like this:
rank title_name title_id artist_id artist_name label Rank Song ...
1 'And It's Still Alright' 120034150 222521 'Nathaniel Rateliff' 'CNCO' 1 "And It's Still Alright"
Here's what the value in the metrics column looks like for row 1:
our_df['metrics'][0]
[{'name': 'Rank', 'value': 1},
{'name': 'Song', 'value': "And It's Still Alright"},
{'name': 'Artist', 'value': 'Nathaniel Rateliff'},
{'name': 'TP Spins', 'value': 933},
{'name': '+/- Chg. Spins', 'value': -32},
{'name': 'LP Spins', 'value': 965},
{'name': 'Stations', 'value': '44/46'},
{'name': 'Adds', 'value': 0},
{'name': 'TP Audience', 'value': 1260000},
{'name': '+/- Chg. Audience', 'value': -40600},
{'name': 'LP Audience', 'value': 1300600},
{'name': 'TP Stream', 'value': 413101}]
The +/- in the column names may be problematic though, along with the . in Chg. This dataframe would be best if all the column names were snake_case, if the +/- was replaced with plus_minus, and if the . in Chg. was simply dropped.
Edit: we can assume that the metric names will be the same in every row in the dataframe. However, there may be other dataframes with different metric names, so it would be preferable if the names 'Rank', 'Song', 'Artist', etc. were not hardcoded. Here is the original list before it was converted into a pandas dataframe:
raw_data = [{'rank': 1,
'title_name': 'BUTTER',
'title_id': '',
'artist_id': '',
'artist_name': 'BTS',
'label': '',
'peak_position': 1,
'last_week_rank': 7,
'last_2week_rank': 8,
'metrics': [{'name': 'Rank', 'value': 1},
{'name': 'Song', 'value': 'BUTTER'},
{'name': 'Artist', 'value': 'BTS'},
{'name': 'Label Description', 'value': None},
{'name': 'Label', 'value': ' '},
{'name': 'Last Week Rank', 'value': 7},
{'name': 'Last 2 Week Rank', 'value': 8},
{'name': 'Weeks On Chart', 'value': 15}]},
{'rank': 2,
'title_name': 'STAY',
'title_id': '',
'artist_id': '',
'artist_name': 'THE KID LAROI & JUS',
'label': '',
'peak_position': 1,
'last_week_rank': 1,
'last_2week_rank': 1,
'metrics': [{'name': 'Rank', 'value': 2},
{'name': 'Song', 'value': 'STAY'},
{'name': 'Artist', 'value': 'THE KID LAROI & JUS'},
{'name': 'Label Description', 'value': None},
{'name': 'Label', 'value': ' '},
{'name': 'Last Week Rank', 'value': 1},
{'name': 'Last 2 Week Rank', 'value': 1},
{'name': 'Weeks On Chart', 'value': 8}]}]
The fastest way is most likely to process raw_data as plain Python dictionaries first and only then construct a DataFrame from the result.
import pandas as pd

records = []
for rec in raw_data:
    for metric in rec['metrics']:
        # process name: snake_case > drop '.' > '+/-' to 'plus_minus'
        name = metric['name'].lower().replace(' ','_').replace('.','').replace('+/-','plus_minus')
        rec[name] = metric['value']
    rec.pop('metrics') # drop metric records (note: this modifies raw_data in place)
    records.append(rec)
df = pd.DataFrame(records)
Output
Resulting df
   rank  title_name  title_id  artist_id  artist_name          label  peak_position  last_week_rank  last_2week_rank  song    artist               label_description  last_2_week_rank  weeks_on_chart
0  1     BUTTER                           BTS                         1              7               8                BUTTER  BTS                                     8                 15
1  2     STAY                             THE KID LAROI & JUS         1              1               1                STAY    THE KID LAROI & JUS                     1                 8
Setup
Using the example's data as raw_data, i.e.
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})
raw_data = our_df.to_dict(orient='records')
Output
Resulting df from the solution above
   rank  title_name              title_id   artist_id  artist_name         label  song                    artist              tp_spins  plus_minus_chg_spins  lp_spins  stations  adds  tp_audience  plus_minus_chg_audience  lp_audience  tp_stream
0  1     And It's Still Alright  120034150  222521     Nathaniel Rateliff  CNCO   And It's Still Alright  Nathaniel Rateliff  933       -32                   965       44/46     0     1260000      -40600                   1300600      413101
1  2     Black Madonna           106938609  200160     Cage The Elephant   RCA    Black Madonna           Cage The Elephant   814       38                    776       38/46     0     1283400      -21600                   1305000      362366
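The loop in this answer adds keys to the dictionaries in raw_data and pops 'metrics' from them, so the input is changed as a side effect. If that matters, a variant that builds fresh row dictionaries could look like the sketch below. The helper names to_snake_case and flatten_metrics are made up for illustration, and the sample data is a trimmed version of the question's.

```python
import pandas as pd


def to_snake_case(name):
    """Apply the renaming rules from the question to one metric name."""
    return (
        name.lower()
        .replace("+/-", "plus_minus")  # '+/-' becomes 'plus_minus'
        .replace(".", "")              # drop the '.' in 'Chg.'
        .replace(" ", "_")             # spaces become underscores
    )


def flatten_metrics(records):
    """Return new row dicts with each metric promoted to its own key."""
    rows = []
    for rec in records:
        # Copy everything except 'metrics' so the input stays untouched.
        row = {key: value for key, value in rec.items() if key != "metrics"}
        for metric in rec.get("metrics", []):
            row[to_snake_case(metric["name"])] = metric["value"]
        rows.append(row)
    return rows


raw_data = [
    {"rank": 1, "title_name": "And It's Still Alright",
     "metrics": [{"name": "TP Spins", "value": 933},
                 {"name": "+/- Chg. Spins", "value": -32}]},
    {"rank": 2, "title_name": "Black Madonna",
     "metrics": [{"name": "TP Spins", "value": 814},
                 {"name": "+/- Chg. Spins", "value": 38}]},
]

df = pd.DataFrame(flatten_metrics(raw_data))
```

As in the original loop, a metric whose cleaned name matches an existing key (for example 'Rank' and 'rank') overwrites that key.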
Let's start by breaking the problem down. After defining our_df, we can generate a new dataframe from the metrics column with:
pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']])
Which outputs:
Rank Song ... LP Audience TP Stream
0 1 And It's Still Alright ... 1300600 413101
0 2 Black Madonna ... 1305000 362366
Next, it's just a question of joining the two together with pd.concat() or merge. I assume the common key is the Rank column, so I'll use merge:
our_df.drop(columns=['metrics']).merge(pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']]),left_on='rank',right_on='Rank')
This outputs the full dataframe:
rank title_name ... LP Audience TP Stream
0 1 And It's Still Alright ... 1300600 413101
1 2 Black Madonna ... 1305000 362366
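Merging on rank relies on rank being unique; with tied ranks the merge would pair every tied row with every other one. A sketch that avoids the key altogether by reusing our_df's index is shown below. The tiny frame and its song titles are invented purely for illustration.

```python
import pandas as pd

# Invented sample with deliberately tied ranks.
our_df = pd.DataFrame({
    "rank": [1, 1],
    "title_name": ["Song A", "Song B"],
    "metrics": [
        [{"name": "TP Spins", "value": 933}, {"name": "Adds", "value": 0}],
        [{"name": "TP Spins", "value": 814}, {"name": "Adds", "value": 2}],
    ],
})

# One dict per row, keyed by metric name, aligned to the original index.
metrics_wide = pd.DataFrame(
    [{m["name"]: m["value"] for m in row} for row in our_df["metrics"]],
    index=our_df.index,
)

# join() matches on the index, so no shared key column is required.
result = our_df.drop(columns="metrics").join(metrics_wide)
```

This assumes no metric name is spelled exactly like an existing column; join() raises an error on overlapping column names unless a suffix is supplied.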
Alternative that might be robust against missing names:
metric_df = our_df.apply(
    lambda r:
        pd.Series(
            index=list(map(lambda d: d['name'], r['metrics']))+['rank'],
            data=list(map(lambda d: d['value'], r['metrics']))+[r['rank']],
        ),
    axis=1,
)
our_df.merge(metric_df, on='rank')
box = pd.concat({index: pd.DataFrame(ent)
                 for index, ent in zip(our_df.index, our_df.metrics)})
(our_df
 .drop(columns='metrics')
 .join(box.droplevel(-1))
 # pivot arguments must be passed by keyword in pandas >= 2.0
 .pivot(index=['rank', 'title_name', 'title_id', 'artist_id', 'artist_name', 'label'],
        columns='name',
        values='value')
 .reset_index()
)
name rank title_name title_id artist_id artist_name label +/- Chg. Audience +/- Chg. Spins Adds Artist LP Audience LP Spins Rank Song Stations TP Audience TP Spins TP Stream
0 1 And It's Still Alright 120034150 222521 Nathaniel Rateliff CNCO -40600 -32 0 Nathaniel Rateliff 1300600 965 1 And It's Still Alright 44/46 1260000 933 413101
1 2 Black Madonna 106938609 200160 Cage The Elephant RCA -21600 38 0 Cage The Elephant 1305000 776 2 Black Madonna 38/46 1283400 814 362366
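The pivoted frame still carries the original metric names as column labels. Assuming the renaming rules from the question (snake_case, '+/-' to 'plus_minus', '.' dropped), a sketch using the vectorised string methods on the column index might look like this. The small frame named wide is a stand-in for the pivot result.

```python
import pandas as pd

# Stand-in for the pivoted result, with a few of its column labels.
wide = pd.DataFrame({
    "rank": [1, 2],
    "+/- Chg. Audience": [-40600, -21600],
    "TP Spins": [933, 814],
})

# regex=False treats '+/-' and '.' as literal text rather than patterns.
wide.columns = (
    wide.columns
    .str.lower()
    .str.replace("+/-", "plus_minus", regex=False)
    .str.replace(".", "", regex=False)
    .str.replace(" ", "_", regex=False)
)

# pivot() leaves the columns axis named 'name'; clearing it is optional.
wide.columns.name = None
```

Because nothing is hardcoded, the same chain should work for dataframes with different metric names.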
Working on the raw_data:
from itertools import chain, product

metrics = [ent['metrics'] for ent in raw_data]
non_metrics = [{key: value
                for key, value in ent.items()
                if key != 'metrics'}
               for ent in raw_data]
combo = zip(metrics, non_metrics)
combo = (product(metrics, [non_metrics])
         for metrics, non_metrics in combo)
combo = chain.from_iterable(combo)
combo = [{**left, **right} for left, right in combo]
pd.DataFrame(combo)
    name               value                rank  title_name  title_id  artist_id  artist_name          label  peak_position  last_week_rank  last_2week_rank
0   Rank               1                    1     BUTTER                           BTS                         1              7               8
1   Song               BUTTER               1     BUTTER                           BTS                         1              7               8
2   Artist             BTS                  1     BUTTER                           BTS                         1              7               8
3   Label Description  None                 1     BUTTER                           BTS                         1              7               8
4   Label                                   1     BUTTER                           BTS                         1              7               8
5   Last Week Rank     7                    1     BUTTER                           BTS                         1              7               8
6   Last 2 Week Rank   8                    1     BUTTER                           BTS                         1              7               8
7   Weeks On Chart     15                   1     BUTTER                           BTS                         1              7               8
8   Rank               2                    2     STAY                             THE KID LAROI & JUS         1              1               1
9   Song               STAY                 2     STAY                             THE KID LAROI & JUS         1              1               1
10  Artist             THE KID LAROI & JUS  2     STAY                             THE KID LAROI & JUS         1              1               1
11  Label Description  None                 2     STAY                             THE KID LAROI & JUS         1              1               1
12  Label                                   2     STAY                             THE KID LAROI & JUS         1              1               1
13  Last Week Rank     1                    2     STAY                             THE KID LAROI & JUS         1              1               1
14  Last 2 Week Rank   1                    2     STAY                             THE KID LAROI & JUS         1              1               1
15  Weeks On Chart     8                    2     STAY                             THE KID LAROI & JUS         1              1               1
You can then reshape/transform into whatever you desire.
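One such reshape is pivoting the long name/value form back to one row per song. The sketch below uses a cut-down long frame with only two metrics and two identifying columns; passing a list as index assumes pandas 1.1 or later.

```python
import pandas as pd

# Cut-down long form: one row per (song, metric) pair.
long_df = pd.DataFrame({
    "name": ["Song", "Weeks On Chart", "Song", "Weeks On Chart"],
    "value": ["BUTTER", 15, "STAY", 8],
    "rank": [1, 1, 2, 2],
    "artist_name": ["BTS", "BTS", "THE KID LAROI & JUS", "THE KID LAROI & JUS"],
})

# Identifying columns become the index, metric names become columns.
wide = (
    long_df
    .pivot(index=["rank", "artist_name"], columns="name", values="value")
    .reset_index()
)
```

pivot() raises an error if the same metric name appears twice for one song; pivot_table() with an explicit aggregation is the usual fallback in that case.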
I have a pandas dataframe in which one column, custom, holds a list of dictionaries. The list may be empty or contain one or more dictionaries. For example:
id custom
1 []
2 [{'key': 'impact', 'name': 'Impact', 'value': 'abc', 'type': 'string'}, {'key': 'proposed_status','name': 'PROPOSED Status [temporary]', 'value': 'pqr', 'type': 'string'}]
3 [{'key': 'impact', 'name': 'Impact', 'value': 'xyz', 'type': 'string'}]
I'm interested in extracting the data from the JSON into separate columns, based on the dict entries named 'key' and 'value'.
For example, here the output df would have the additional columns impact and proposed_status:
id custom impact proposed_status
1 ... NA NA
2 ... abc pqr
3 ... xyz NA
Could the smart people of StackOverflow please guide me on the right way to solve this? Thanks!
The approach is explained in the code comments:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'custom': [[],
                              [{'key': 'impact', 'name': 'Impact', 'value': 'abc', 'type': 'string'},
                               {'key': 'proposed_status',
                                'name': 'PROPOSED Status [temporary]',
                                'value': 'pqr',
                                'type': 'string'}],
                              [{'key': 'impact', 'name': 'Impact', 'value': 'xyz', 'type': 'string'}]]})
# expand out lists, reset_index() so join() will work
df2 = df.explode("custom").reset_index(drop=True)
# join to keep "id"
df2 = (df2.join(df2["custom"]
                # expand embedded dict
                .apply(pd.Series))
       .loc[:, ["id", "key", "value"]]
       # empty lists generate spurious NaN rows, remove them
       .dropna()
       # turn key attribute into column
       .set_index(["id", "key"]).unstack(1)
       # cleanup multi index columns
       .droplevel(0, axis=1)
      )
df.merge(df2, on="id", how="left")
   id  custom  impact  proposed_status
0  1  []  nan  nan
1  2  [{'key': 'impact', 'name': 'Impact', 'value': 'abc', 'type': 'string'}, {'key': 'proposed_status', 'name': 'PROPOSED Status [temporary]', 'value': 'pqr', 'type': 'string'}]  abc  pqr
2  3  [{'key': 'impact', 'name': 'Impact', 'value': 'xyz', 'type': 'string'}]  xyz  nan
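For comparison, the same result can probably be reached without explode by turning each list into a single {key: value} dictionary first. This is a sketch, not the answer's method, and the sample data is trimmed to the 'key' and 'value' entries.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "custom": [
        [],
        [{"key": "impact", "value": "abc"}, {"key": "proposed_status", "value": "pqr"}],
        [{"key": "impact", "value": "xyz"}],
    ],
})

# Each list collapses to one dict; an empty list gives an empty dict,
# which becomes a row of NaN once the frame is built.
extracted = pd.DataFrame(
    [{item["key"]: item["value"] for item in items} for items in df["custom"]],
    index=df.index,
)

# Index-aligned join keeps the original id and custom columns.
result = df.join(extracted)
```

If a list repeats the same key, the dictionary comprehension keeps only the last value, which may or may not be the desired behaviour.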
Stack Overflow, please do your magic.
I have a pandas dataframe like this:
Column_one \
{{'name': 'Marfon ', 'email': '', 'phone': '123454333', 'address': 'San Jose', 'estimated_date': 2019-10-01 00:00:00, 'estimated_time': {'minimum': 1000, 'maximum': 1200, 'min': 0, 'max': 0}}
{{'name': 'Joe Doe ', 'email': 'joe#gmail.com', 'phone': '987655444', 'address': 'Carolina', 'estimated_date': 2019-10-01 00:00:00, 'estimated_time': {'minimum': 1000, 'maximum': 1200, 'min': 0, 'max': 0}}
Column_two
[{'status': False, 'item_code': 'JSK', 'price': 15000, 'note': [], 'sub_total_price': 50}]
[{'status': False, 'item_code': 'HSO', 'price': 15000, 'note': [], 'sub_total_price': 100}]
How can I create a new dataframe like this?
name     email          phone      address   item_code
Marfon                  123454333  San Jose  JSK
Joe Doe  joe#gmail.com  987655444  Carolina  HSO
Solved with:
column_one = pd.DataFrame(main_df['Column_one'].values.tolist(), index=main_df.index)
column_two = main_df['Column_two'].apply(lambda x: ', '.join(y['item_code'] for y in x))
data_con = pd.concat([column_one, column_two], axis=1)
print(data_con)
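If the nested estimated_time dictionary should also be flattened, pd.json_normalize may be a convenient alternative to the DataFrame constructor. The sketch below rebuilds a small main_df from the question's values (with the date field left out) and keeps only the columns the question asked for.

```python
import pandas as pd

# Reduced reconstruction of the question's frame; estimated_date is omitted.
main_df = pd.DataFrame({
    "Column_one": [
        {"name": "Marfon ", "email": "", "phone": "123454333", "address": "San Jose",
         "estimated_time": {"minimum": 1000, "maximum": 1200}},
        {"name": "Joe Doe ", "email": "joe#gmail.com", "phone": "987655444", "address": "Carolina",
         "estimated_time": {"minimum": 1000, "maximum": 1200}},
    ],
    "Column_two": [
        [{"status": False, "item_code": "JSK", "price": 15000}],
        [{"status": False, "item_code": "HSO", "price": 15000}],
    ],
})

# Nested dicts become dotted columns such as 'estimated_time.minimum'.
people = pd.json_normalize(main_df["Column_one"].tolist())
people.index = main_df.index

# Join the item codes of each row's list into one string.
item_code = (
    main_df["Column_two"]
    .apply(lambda items: ", ".join(item["item_code"] for item in items))
    .rename("item_code")
)

result = pd.concat([people[["name", "email", "phone", "address"]], item_code], axis=1)
```

Renaming the Series to item_code gives the column header shown in the question; without it the column keeps the name Column_two.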
Your input data is somewhat malformed. But if this is what you meant, then:
Column_one =\
[{'name': 'Marfon ', 'email': '', 'phone': '123454333', 'address': 'San Jose', 'estimated_date': '2019-10-01 00:00:00'},
{'name': 'Joe Doe ', 'email': 'joe#gmail.com', 'phone': '987655444', 'address': 'Carolina', 'estimated_date': '2019-10-01 00:00:00'}]
Column_two=\
[{'status': False, 'item_code': 'JSK', 'price': 15000, 'note': [], 'sub_total_price': 50},
{'status': False, 'item_code': 'HSO', 'price': 15000, 'note': [], 'sub_total_price': 100}]
pd.concat([pd.DataFrame(Column_one), pd.DataFrame(Column_two)], axis=1)
Output:
name     email          phone      address   estimated_date       status  item_code  price  note  sub_total_price
Marfon                  123454333  San Jose  2019-10-01 00:00:00  False   JSK        15000  []    50
Joe Doe  joe#gmail.com  987655444  Carolina  2019-10-01 00:00:00  False   HSO        15000  []    100