How to flatten a dataframe within dataframe - python

I would like to flatten a dataframe that is inside the dataframe. In this example, the column account has a dataframe as value. I would like to flatten this into a single dataframe.
Example: (Updated)
import pandas as pd
account1 = pd.DataFrame([{'nr': '123', 'balance': 56}, {'nr': '230', 'balance': 55}])
account2 = pd.DataFrame([{'nr': '456', 'balance': 575}])
account3 = pd.DataFrame([{'nr': '350', 'balance': 59}])
df = pd.DataFrame([{'id': 1, 'age': 23, 'name': 'anna', 'account': account1},
{'id': 2, 'age': 71, 'name': 'mary', 'account': account2},
{'id': 3, 'age': 42, 'name': 'bob', 'account': account3}])
print(df)
gives the dataframe:
id age name account
0 1 23 anna nr balance
0 123 56
1 230 55
1 2 71 mary nr balance
0 456 575
2 3 42 bob nr balance
0 350 59
And I would like to get:
id name age account|nr|0 account|balance|0 account|nr|1 account|balance|1
0 1 anna 23 123 56 230 55
1 2 mary 71 456 575
2 3 bob 42 350 59
How can I flatten a dataframe inside a dataframe into a single dataframe? Is this type of structure called a hierarchical DataFrame?

This is the solution that I have found.
list_accounts = []
for index_j, row_j in df.iterrows():
    account = row_j["account"]
    # stack the nested frame into a single row with (row, column) labels
    account = pd.DataFrame(account).stack().to_frame().T
    account.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in account.columns]
    list_accounts.append(account)
df = pd.concat([df, pd.concat(list_accounts).reset_index(drop=True)], axis=1)
df.drop(columns="account", inplace=True)
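As a hedged alternative to the loop above, the nested frames can also be flattened with a small helper that builds the `account|nr|0`-style column names from the question directly. The helper name `flatten_account` and its `prefix` parameter are made up for this sketch, and it assumes every nested frame uses a default integer index.

```python
import pandas as pd

account1 = pd.DataFrame([{'nr': '123', 'balance': 56}, {'nr': '230', 'balance': 55}])
account2 = pd.DataFrame([{'nr': '456', 'balance': 575}])
account3 = pd.DataFrame([{'nr': '350', 'balance': 59}])
df = pd.DataFrame([{'id': 1, 'age': 23, 'name': 'anna', 'account': account1},
                   {'id': 2, 'age': 71, 'name': 'mary', 'account': account2},
                   {'id': 3, 'age': 42, 'name': 'bob', 'account': account3}])


def flatten_account(acc, prefix='account'):
    # Hypothetical helper: one entry per (row, column) cell of the nested frame,
    # keyed like 'account|nr|0'.
    return {f'{prefix}|{col}|{i}': acc.at[i, col]
            for i in acc.index for col in acc.columns}


# One flat dict per outer row; accounts missing in a row become NaN.
flat = pd.DataFrame([flatten_account(a) for a in df['account']], index=df.index)
result = pd.concat([df.drop(columns='account'), flat], axis=1)
print(result)
```

Rows with fewer accounts than the longest one simply get missing values in the extra columns, which matches the blank cells in the desired output.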

Related

Pandas Duplicating Data but pivoting columns

I want to convert this DF
Location  Date      F1_ID  F1_Name  F1_Height  F1_Status  F2_ID  F2_Name  F2_Height  F2_Status
USA       12/31/19  1      Jon      67         W          2      Anthony  68         L
To this DF by duplicating the rows but switching the data around.
Location  Date      F1_ID  F1_Name  F1_Height  F1_Status  F2_ID  F2_Name  F2_Height  F2_Status
USA       12/31/19  1      Jon      67         W          2      Anthony  68         L
USA       12/31/19  2      Anthony  68         L          1      Jon      67         W
How can I achieve this in pandas? I tried creating a copy of the df and renaming the columns, but got an error because of unique indexing.
Let's try a concat and sort_index:
import re
import pandas as pd
df = pd.DataFrame(
{'Location': {0: 'USA'}, 'Date': {0: '12/31/19'},
'F1_ID': {0: 1}, 'F1_Name': {0: 'Jon'}, 'F1_Height': {0: 67},
'F1_Status': {0: 'W'}, 'F2_ID': {0: 2},
'F2_Name': {0: 'Anthony'}, 'F2_Height': {0: 68},
'F2_Status': {0: 'L'}})
# Columns Not To Swap
keep_columns = ['Location', 'Date']
# Get F1 and F2 Column Names
f1_columns = list(filter(re.compile(r'F1_').search, df.columns))
f2_columns = list(filter(re.compile(r'F2_').search, df.columns))
# Create Inverse DataFrame
inverse_df = df[[*keep_columns, *f2_columns, *f1_columns]]
# Set Columns so they match df (prevents concat from un-inverting)
inverse_df.columns = df.columns
# Concat and sort index
new_df = pd.concat((df, inverse_df)).sort_index().reset_index(drop=True)
print(new_df.to_string())
Src:
Location Date F1_ID F1_Name F1_Height F1_Status F2_ID F2_Name F2_Height F2_Status
0 USA 12/31/19 1 Jon 67 W 2 Anthony 68 L
Output:
Location Date F1_ID F1_Name F1_Height F1_Status F2_ID F2_Name F2_Height F2_Status
0 USA 12/31/19 1 Jon 67 W 2 Anthony 68 L
1 USA 12/31/19 2 Anthony 68 L 1 Jon 67 W
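The same swap can be sketched without regular expressions by renaming the columns through a mapping. This is only an illustration under the assumption that every swappable column starts with `F1_` or `F2_`; the names `swap` and `inverse_df` are chosen here for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'Location': ['USA'], 'Date': ['12/31/19'],
     'F1_ID': [1], 'F1_Name': ['Jon'], 'F1_Height': [67], 'F1_Status': ['W'],
     'F2_ID': [2], 'F2_Name': ['Anthony'], 'F2_Height': [68], 'F2_Status': ['L']})

# Map every F1_* column to its F2_* counterpart and vice versa.
swap = {}
for col in df.columns:
    if col.startswith('F1_'):
        swap[col] = 'F2_' + col[3:]
    elif col.startswith('F2_'):
        swap[col] = 'F1_' + col[3:]

# Rename, then restore the original column order so both frames line up.
inverse_df = df.rename(columns=swap)[df.columns]

# A stable sort keeps each original row ahead of its swapped copy.
new_df = (pd.concat([df, inverse_df])
          .sort_index(kind='mergesort')
          .reset_index(drop=True))
print(new_df.to_string())
```

Because the rename is a one-to-one mapping, the column labels stay unique, which avoids the duplicate-label error mentioned in the question.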

Pandas - Create column with difference in values

I have the dataset below. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want. You can see that it is the difference in money at each expiry point for the person. I highlighted the other rows in colors to make it clearer.
Thanks a lot.
Example
import pandas as pd
import numpy as np
example = pd.DataFrame( data = {'Day': ['2020-08-30', '2020-08-30','2020-08-30','2020-08-30',
'2020-08-29', '2020-08-29','2020-08-29','2020-08-29'],
'Name': ['John', 'Mike', 'John', 'Mike','John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': ['1Y', '1Y', '2Y','2Y','1Y','1Y','2Y','2Y']})
example_0830 = example[ example['Day']=='2020-08-30' ].reset_index()
example_0829 = example[ example['Day']=='2020-08-29' ].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame( example_0829, columns = ['key','Money'])
example_0830 = pd.merge(example_0830, example_0829, on = 'key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y','index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always taken against the previous date, you can define date variables at the start for today (t) and the previous day (t-1) and use them to filter the original dataframe.
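A minimal sketch of that idea, merging on the real key columns instead of a concatenated string key. The variable names `today`, `previous`, `cur`, and `prev` are assumptions made for this example, and it assumes the `Day` strings sort chronologically (ISO format).

```python
import pandas as pd

example = pd.DataFrame(data={
    'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
            '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
    'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
    'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
    'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})

# Pick today (t) and the previous day (t-1) from the data itself.
days = sorted(example['Day'].unique())
today, previous = days[-1], days[-2]

cur = example[example['Day'] == today]
prev = example.loc[example['Day'] == previous, ['Name', 'Expiry', 'Money']]

# Join each (Name, Expiry) pair to its value on the previous day.
merged = cur.merge(prev, on=['Name', 'Expiry'], suffixes=('', '_prev'))
merged['Difference'] = merged['Money'] - merged['Money_prev']
merged = merged.drop(columns='Money_prev')
print(merged)
```

Merging on `['Name', 'Expiry']` keeps the `Money` column name intact and avoids building a temporary `key` column.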
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure we have dates in the order we want
# (sort_values returns a new frame, so assign it; mergesort is stable and keeps ties in place)
df = df.sort_values('Day', ascending=False, kind='mergesort')
# groupby and get the difference from the next row in each group
# diff(1) calculates the difference from the previous row, so -1 will point to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
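For reference, here is the same groupby.diff idea applied to the date strings and expiry labels from the question. This is a sketch that assumes `Day` can be parsed with `pd.to_datetime`; nothing else is changed from the approach above.

```python
import pandas as pd

example = pd.DataFrame(data={
    'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
            '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
    'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
    'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
    'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})

# Parse the dates so sorting is chronological, newest first (stable sort).
example['Day'] = pd.to_datetime(example['Day'])
example = example.sort_values('Day', ascending=False, kind='mergesort')

# Within each (Name, Expiry) group, subtract the next (older) row's Money.
example['Difference'] = example.groupby(['Name', 'Expiry'])['Money'].diff(-1)
print(example)
```

Rows for the oldest day have no older row to compare against, so their `Difference` is NaN.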

How to turn dictionary keys in one column of a pandas DataFrame into columns?

I have a dataframe in which one column contains a stringified list of dictionaries. I was wondering how I can make new columns from these dictionary keys.
I am looking for a solution that uses pandas methods like apply, stack, etc., and avoids a for loop as far as possible.
Here is the problem:
import ast
import pandas as pd
speakers = ['Einstein','Newton']
views = [1000,2000]
ratings0 = ("[{'id': 7, 'name': 'Funny', 'count': 100}, {'id': 1, 'name': 'Sad', "
"'count': 110}, {'id': 9, 'name': 'Happy', 'count': 120}]")
ratings1 = ("[{'id': 7, 'name': 'Happy', 'count': 200}, {'id': 3, 'name': 'Funny', "
"'count': 210}, {'id': 2, 'name': 'Sad', 'count': 220}]")
ratings = [ratings0, ratings1]
df = pd.DataFrame({'speaker': speakers, 'ratings': ratings,'views':views})
print(df)
speaker ratings views
0 Einstein [{'id': 7, 'name': 'Funny', 'count': 100}, {'i... 1000
1 Newton [{'id': 7, 'name': 'Happy', 'count': 200}, {'i... 2000
My attempt so far,
# new dataframe only for ratings
dfr = df['ratings'].apply(ast.literal_eval)
dfr = dfr.apply(pd.DataFrame)
dfr = dfr.apply(lambda x: x.sort_values(by='name'))
dfr = dfr.apply(pd.DataFrame.stack)
print(dfr)
0 1 2
count id name count id name count id name
0 100 7 Funny 110 1 Sad 120 9 Happy
1 200 7 Happy 210 3 Funny 220 2 Sad
This gives multi-index dataframe. I tried sorting the dictionary, but still it is not sorted and the column name does not have the same values. Also, I am unsure how to move the values of column name to replace column count and remove other unwanted columns.
Final Wanted Solution
speaker views Funny Sad Happy
Einstein 1000 100 110 120
Newton 2000 210 220 200
Update
I am using Pandas 0.20 and the method .explode() is absent in my workplace and I am not permitted to update Pandas.
For pandas >= 0.25.0 you can use ast.literal_eval + explode + pivot
ii = df.set_index('speaker')['ratings'].apply(ast.literal_eval).explode()
u = pd.DataFrame(ii.tolist(), index=ii.index).reset_index()
u.pivot(index='speaker', columns='name', values='count')  # keyword arguments are required in pandas 2.0+
name Funny Happy Sad
speaker
Einstein 100 120 110
Newton 210 200 220
For older versions of pandas
a = df['speaker']
b = df['ratings']
ii = [
{**{'speaker': name}, **row}
for name, element in zip(a, b) for row in ast.literal_eval(element)
]
pd.DataFrame(ii).pivot(index='speaker', columns='name', values='count')
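The pivots above index only on speaker, so the views column from the wanted output is dropped. One possible way to keep it is sketched below; it assumes a pandas version where `DataFrame.explode` exists and `pivot` accepts a list for `index` (roughly 1.1 or later), and the names `exploded`, `detail`, `long_df`, and `wide` are made up for the example.

```python
import ast
import pandas as pd

speakers = ['Einstein', 'Newton']
views = [1000, 2000]
ratings0 = ("[{'id': 7, 'name': 'Funny', 'count': 100}, {'id': 1, 'name': 'Sad', "
            "'count': 110}, {'id': 9, 'name': 'Happy', 'count': 120}]")
ratings1 = ("[{'id': 7, 'name': 'Happy', 'count': 200}, {'id': 3, 'name': 'Funny', "
            "'count': 210}, {'id': 2, 'name': 'Sad', 'count': 220}]")
df = pd.DataFrame({'speaker': speakers, 'ratings': [ratings0, ratings1], 'views': views})

# Parse the strings into lists of dicts, then give every dict its own row.
exploded = (df.assign(ratings=df['ratings'].apply(ast.literal_eval))
              .explode('ratings')
              .reset_index(drop=True))

# Expand each dict into id / name / count columns.
detail = pd.DataFrame(exploded['ratings'].tolist())

# Long table: one row per (speaker, views, rating name).
long_df = pd.concat([exploded[['speaker', 'views']], detail[['name', 'count']]], axis=1)

# Wide table: one column per rating name, keeping speaker and views.
wide = (long_df.pivot(index=['speaker', 'views'], columns='name', values='count')
               .reset_index()
               .rename_axis(None, axis=1))
print(wide)
```

The rating columns come out in alphabetical order (Funny, Happy, Sad); reorder them with a column selection if the order from the question matters.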
You may use sum and index.repeat to construct a new dataframe, join it to df[['speaker', 'views']], and assign the result to df1. Next, chain set_index, unstack, and reset_index:
df['ratings'] = df['ratings'].apply(ast.literal_eval)
df1 = (pd.DataFrame(df.ratings.sum(), index=df.index.repeat(df.ratings.str.len()))
       .drop(columns='id').join(df[['speaker', 'views']]))
df1.set_index(['speaker', 'views', 'name'])['count'].unstack().reset_index()
Out[213]:
name speaker views Funny Happy Sad
0 Einstein 1000 100 120 110
1 Newton 2000 210 200 220
Note: name in the final output is the label of the columns axis. If you don't want to see it, chain an additional rename_axis as follows:
df1.set_index(['speaker', 'views', 'name'])['count'].unstack().reset_index() \
.rename_axis([None], axis=1)
Out[214]:
speaker views Funny Happy Sad
0 Einstein 1000 100 120 110
1 Newton 2000 210 200 220
For loops are not always bad. You can give it a try:
frames = []
for i in range(len(df)):
    x = pd.DataFrame(ast.literal_eval(df['ratings'][i]))
    x.index = [i]*len(x)
    frames.append(x)
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
dfr = pd.concat(frames).reset_index()
dfr = (dfr.drop('id',axis=1)
.pivot_table(index=['index'], columns='name',
values='count',aggfunc='sum')
.rename_axis(None, axis=1).reset_index())
df_final = df.join(dfr)
df_final.drop(['index','ratings'],axis=1,inplace=True)
df_final
Gives:
speaker views Funny Happy Sad
0 Einstein 1000 100 120 110
1 Newton 2000 210 200 220

Convert List of dictionaries with same keys into a tall dataframe

The dataframe has a column holding lists of dictionaries that all share the same key names. How can I convert it into a tall dataframe? The dataframe is shown below.
A B
1 [{"name":"john","age":"28","salary":"50000"},{"name":"Todd","age":"36","salary":"54000"}]
2 [{"name":"Alex","age":"48","salary":"70000"},{"name":"Mark","age":"89","salary":"150000"}]
3 [{"name":"jane","age":"36","salary":"20000"},{"name":"Rose","age":"28","salary":"90000"}]
How can I convert the dataframe above into the one below?
A name age salary
1 john 28 50000
1 Todd 36 54000
2 Alex 48 70000
2 Mark 89 150000
3 jane 36 20000
3 Rose 28 90000
You need to unnest the column first, then use the same method I provided before.
newdf=unnesting(df,['B'])
pd.concat([newdf,pd.DataFrame(newdf.pop('B').tolist(),index=newdf.index)],axis=1)
A age name salary
0 1 28 john 50000
0 1 36 Todd 54000
1 2 48 Alex 70000
1 2 89 Mark 150000
2 3 36 jane 20000
2 3 28 Rose 90000
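More info: my self-defined function is attached below; you can also find it on the page I linked.
import numpy as np
import pandas as pd
def unnesting(df, explode):
    # repeat each index label once per element in the first list column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # re-attach the untouched columns (drop's positional axis argument was removed in pandas 2.0)
    return df1.join(df.drop(columns=explode), how='left')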
Data Input
df.B.to_dict()
{0: [{'name': 'john', 'age': '28', 'salary': '50000'}, {'name': 'Todd', 'age': '36', 'salary': '54000'}], 1: [{'name': 'Alex', 'age': '48', 'salary': '70000'}, {'name': 'Mark', 'age': '89', 'salary': '150000'}], 2: [{'name': 'jane', 'age': '36', 'salary': '20000'}, {'name': 'Rose', 'age': '28', 'salary': '90000'}]}
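On pandas 0.25 or later, the built-in `DataFrame.explode` can stand in for the custom unnesting helper. The sketch below assumes column B already holds parsed Python lists of dicts, as in the data input above; if it held JSON strings, they would need to be parsed first. The names `exploded` and `tall` are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [
        [{'name': 'john', 'age': '28', 'salary': '50000'},
         {'name': 'Todd', 'age': '36', 'salary': '54000'}],
        [{'name': 'Alex', 'age': '48', 'salary': '70000'},
         {'name': 'Mark', 'age': '89', 'salary': '150000'}],
        [{'name': 'jane', 'age': '36', 'salary': '20000'},
         {'name': 'Rose', 'age': '28', 'salary': '90000'}],
    ]})

# One row per dictionary; A is repeated for each element of its list.
exploded = df.explode('B').reset_index(drop=True)

# Expand the dictionaries into columns and put them next to A.
tall = pd.concat([exploded[['A']], pd.DataFrame(exploded['B'].tolist())], axis=1)
print(tall)
```

Resetting the index before the concat keeps both pieces aligned row by row.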

List of dict of dict in Pandas

I have list of dict of dicts in the following form:
[{0:{'city':'newyork', 'name':'John', 'age':'30'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}},]
I want to create pandas DataFrame in the following form:
city name age
newyork John 30
newyork John 30
I have tried a lot without any success.
Can you help me?
Use list comprehension with concat and DataFrame.from_dict:
L = [{0:{'city':'newyork', 'name':'John', 'age':'30'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}}]
df = pd.concat([pd.DataFrame.from_dict(x, orient='index') for x in L])
print (df)
name age city
0 John 30 newyork
0 John 30 newyork
With multiple keys per dictionary, a solution that adds a new id column could be:
L = [{0:{'city':'newyork', 'name':'John', 'age':'30'},
1:{'city':'newyork1', 'name':'John1', 'age':'40'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}}]
L1 = [dict(v, id=k) for x in L for k, v in x.items()]
print (L1)
[{'name': 'John', 'age': '30', 'city': 'newyork', 'id': 0},
{'name': 'John1', 'age': '40', 'city': 'newyork1', 'id': 1},
{'name': 'John', 'age': '30', 'city': 'newyork', 'id': 0}]
df = pd.DataFrame(L1)
print (df)
age city id name
0 30 newyork 0 John
1 40 newyork1 1 John1
2 30 newyork 0 John
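If the position of each outer dictionary in the list also matters, one option is to pass `keys` to `pd.concat` so that both the list position and the inner key end up in a MultiIndex. This is a sketch only; the level names `item` and `id` are assumptions picked for the example.

```python
import pandas as pd

L = [{0: {'city': 'newyork', 'name': 'John', 'age': '30'},
      1: {'city': 'newyork1', 'name': 'John1', 'age': '40'}},
     {0: {'city': 'newyork', 'name': 'John', 'age': '30'}}]

# One small frame per outer dict, indexed by the inner keys (0, 1, ...).
frames = [pd.DataFrame.from_dict(x, orient='index') for x in L]

# keys adds an outer index level holding the position in the list.
df = pd.concat(frames, keys=list(range(len(L))), names=['item', 'id'])
print(df)

# Flatten back to ordinary columns if a MultiIndex is not wanted.
flat = df.reset_index()
print(flat)
```

Calling `reset_index()` turns both levels into regular `item` and `id` columns, which gives a layout close to the one above plus the list position.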
from pandas import DataFrame
ldata = [{0: {'city': 'newyork', 'name': 'John', 'age': '30'}},
{0: {'city': 'newyork', 'name': 'John', 'age': '30'}}, ]
# Create a DataFrame from the ldata above
df = DataFrame(d[0] for d in ldata)
print(df)
"""
The answer is:
age city name
0 30 newyork John
1 30 newyork John
"""
import pandas as pd
d = [{0:{'city':'newyork', 'name':'John', 'age':'30'}},{0:{'city':'newyork', 'name':'John', 'age':'30'}},]
df = pd.DataFrame([list(i.values())[0] for i in d])
print(df)
Output:
age city name
0 30 newyork John
1 30 newyork John
You can use:
In [41]: df = pd.DataFrame(next(iter(e.values())) for e in l)
In [42]: df
Out[42]:
age city name
0 30 newyork John
1 30 newyork John
I came up with another solution. It is not as straightforward as the ones posted here, but it works properly:
L = [{0:{'city':'newyork', 'name':'John', 'age':'30'}},
{0:{'city':'newyork', 'name':'John', 'age':'30'}}]
df = [L[i][0] for i in range(len(L))]
df = pd.DataFrame.from_records(df)
