Pivot_table from lists in a column value - Python

I have a dataframe like:
ID  Sim    Items
1   0.345  [7,7]
2   0.604  [2,7,3,8,5]
3   0.082  [9,1,9,1]
I want to form a pivot table with:
df.pivot_table(index="ID", columns="Items", values="Sim")
To do that, I have to extract the list elements in the Items column and repeat the ID and Sim values for each unique element in the list, like this:
ID  Sim    Items
1   0.345  7
2   0.604  2
2   0.604  7
2   0.604  3
2   0.604  8
2   0.604  5
3   0.082  9
3   0.082  1
The pivot table:
ID  7      2      3      8      5      1      9
1   0.345  -      -      -      -      -      -
2   0.604  0.604  0.604  0.604  0.604  -      -
3   -      -      -      -      -      0.082  0.082
Is there any pythonic approach for that? Or any suggestions?

Use explode (new in pandas 0.25+) before pivot_table:
df.explode('Items').pivot_table(index="ID", columns="Items", values="Sim")
Items 1 2 3 5 7 8 9
ID
1 NaN NaN NaN NaN 0.345 NaN NaN
2 NaN 0.604 0.604 0.604 0.604 0.604 NaN
3 0.082 NaN NaN NaN NaN NaN 0.082
For lower versions of pandas, you can try:
(df.drop('Items', axis=1)
   .join(pd.DataFrame(df['Items'].tolist())
           .stack(dropna=False).droplevel(1).rename('Items'))
   .pivot_table(index="ID", columns="Items", values="Sim"))
Items 1 2 3 5 7 8 9
ID
1 NaN NaN NaN NaN 0.345 NaN NaN
2 NaN 0.604 0.604 0.604 0.604 0.604 NaN
3 0.082 NaN NaN NaN NaN NaN 0.082
If the exact column ordering matters, reindex with the unique Items after the explode:
(df.explode('Items').pivot_table(index="ID", columns="Items", values="Sim")
   .reindex(df.explode('Items')['Items'].unique(), axis=1))
Items 7 2 3 8 5 9 1
ID
1 0.345 NaN NaN NaN NaN NaN NaN
2 0.604 0.604 0.604 0.604 0.604 NaN NaN
3 NaN NaN NaN NaN NaN 0.082 0.082
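One nuance worth noting: explode keeps duplicate list entries (ID 1's [7,7] becomes two identical rows), and pivot_table then collapses them with its default aggfunc='mean'. The duplicates share the same Sim, so the result is unchanged here, but you can make the aggregation explicit (a minimal sketch on the same df):
df.explode('Items').pivot_table(index="ID", columns="Items", values="Sim", aggfunc='mean')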

Related

Incomprehensible Pandas groupby results

Coming from R, where I have mostly worked with the tidyverse, I wonder how pandas groupby and aggregations work. I have this code, and the results are heartbreaking to me.
import pandas as pd
df = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
df.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
Now I would like to calculate the average displacement (disp) by cylinders (cyl), like this:
df['avg_disp'] = df.groupby('cyl').disp.mean()
Which results in something like:
cyl disp avg_disp
31 4 121.0 NaN
2 4 108.0 NaN
27 4 95.1 NaN
26 4 120.3 NaN
25 4 79.0 NaN
20 4 120.1 NaN
7 4 146.7 NaN
8 4 140.8 353.100000
19 4 71.1 NaN
18 4 75.7 NaN
17 4 78.7 NaN
29 6 145.0 NaN
0 6 160.0 NaN
1 6 160.0 NaN
3 6 258.0 NaN
10 6 167.6 NaN
9 6 167.6 NaN
5 6 225.0 NaN
13 8 275.8 NaN
28 8 351.0 NaN
4 8 360.0 105.136364
24 8 400.0 NaN
23 8 350.0 NaN
22 8 304.0 NaN
21 8 318.0 NaN
6 8 360.0 183.314286
11 8 275.8 NaN
16 8 440.0 NaN
30 8 301.0 NaN
14 8 472.0 NaN
12 8 275.8 NaN
15 8 460.0 NaN
After searching for a while, I discovered the transform function, which produces the correct avg_disp by assigning the group mean to each row according to the grouping variable cyl.
My point is... why can't it be done easily with the mean function instead of using .transform('mean') on the grouped data frame?
If you want to add the results back to the ungrouped dataframe, you can use .transform, which, per the docs, will
... return a DataFrame having the same indexes as the original object filled with the transformed values.
df['avg_disp'] = df.groupby('cyl').disp.transform('mean')
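The reason the plain mean() misbehaves: df.groupby('cyl').disp.mean() returns a Series indexed by the group keys (4, 6, 8) rather than by the original row labels, so the assignment aligns on index, and only the rows labelled 4, 6 and 8 receive a value; those are exactly the three non-NaN entries in the output above. A quick look:
means = df.groupby('cyl').disp.mean()
print(means)
# cyl
# 4    105.136364
# 6    183.314286
# 8    353.100000
# Name: disp, dtype: float64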

How to create new summary rows in a pandas DataFrame based on two columns

I have the below pandas dataframe.
import numpy as np
import pandas as pd

d = {'id1': ['85643', '85644', '8564312', '8564314', '85645', '8564316', '85646', '8564318', '85647', '85648', '85649', '85655'],
     'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00001', 'G-00002', 'G-00001', 'G-00002', 'G-00001', 'G-00001', 'G-00001', 'G-00001'],
     'col1': [1, 2, 3, 4, 5, 60, 0, 0, 6, 3, 2, 4],
     'Goal': [np.nan, 56, np.nan, 89, 73, np.nan, np.nan, np.nan, np.nan, np.nan, 34, np.nan],
     'col2': [3, 4, 32, 43, 55, 610, 0, 0, 16, 23, 72, 48],
     'col3': [1, 22, 33, 44, 55, 60, 1, 5, 6, 3, 2, 4],
     'Name': ['a1asd', 'a2asd', 'aabsd', 'aabsd', 'a3asd', 'aabsd', 'aasd', 'aabsd', 'aasd', 'aasd', 'aasd', 'aasd'],
     'Date': ['2021-06-13', '2021-06-13', '2021-06-13', '2021-06-14', '2021-06-15', '2021-06-15', '2021-06-13', '2021-06-16', '2021-06-13', '2021-06-13', '2021-06-13', '2021-06-16']}
dff = pd.DataFrame(data=d)
dff
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 1 NaN 3 1 a1asd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 a2asd 2021-06-13
2 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
3 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
4 85645 G-00001 5 73.0000 55 55 a3asd 2021-06-15
5 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
6 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
7 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
8 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
9 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
10 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
11 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
I want to summarize some of the columns based on sets of ids in the id1 column together with the Name column, add the resulting rows back to the same dataframe, and give each added row a new name in the ID column.
For example, I have some slices of the id1 column. Based on the ids below, I want to summarize only the col1, col2, col3 and Name columns, then append the resulting rows to the same dataframe with a new id in the ID column.
b65 = ['85643','85645', '85655','85646']
b66 = ['85643','85645','85647','85648','85649','85644']
b67 = ['8564312','8564314','8564316','8564318']
I want to aggregate col1 and col2 with sum, and col3 with the average.
When I tried to do it using a dictionary comprehension, I was able to create a dataframe like the one below.
# create a dictionary
d_map = {'b65': b65, 'b66': b66, 'b67': b67}
# dictionary comprehension
df = pd.DataFrame({k: dff[dff['id1'].isin(v)].agg({'col1': sum, 'col2': sum,
                                                   'col3': 'mean', 'Name': 'unique'})
                   for k, v in d_map.items()}).T.reset_index()
# rename the columns
df = df.rename(columns={'index': 'ID'})
# concat the two frames
pd.concat([dff, df]).reset_index(drop=True)
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 1 NaN 3 1 a1asd 2021-06-13
1 85644 G-00001 2 56.00 4 22 a2asd 2021-06-13
2 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
3 8564314 G-00002 4 89.00 43 44 aabsd 2021-06-14
4 85645 G-00001 5 73.00 55 55 a3asd 2021-06-15
5 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
6 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
7 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
8 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
9 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
10 85649 G-00001 2 34.00 72 2 aasd 2021-06-13
11 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
12 NaN b65 10 NaN 106 15.25 [a1asd, a3asd, aasd] NaN
13 NaN b66 19 NaN 173 14.83 [a1asd, a2asd, a3asd, aasd] NaN
14 NaN b67 67 NaN 685 35.50 [aabsd] NaN
However, I want to expand the Name column lists and make a new summarized row for each name in each list, so the dataframe would look like the one below. Is it possible to do that?
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 1 NaN 3 1 a1asd 2021-06-13
1 85644 G-00001 2 56.00 4 22 a2asd 2021-06-13
2 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
3 8564314 G-00002 4 89.00 43 44 aabsd 2021-06-14
4 85645 G-00001 5 73.00 55 55 a3asd 2021-06-15
5 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
6 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
7 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
8 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
9 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
10 85649 G-00001 2 34.00 72 2 aasd 2021-06-13
11 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
12 NaN b65 1 NaN 3 1 a1asd NaN
13 NaN b65 5 NaN 55 55 a3asd NaN
14 NaN b65 4 NaN 48 2.5 aasd NaN
15 NaN b66 1 NaN 3 1 a1asd NaN
16 NaN b66 2 NaN 4 22 a2asd NaN
17 NaN b66 5 NaN 55 55 a3asd NaN
18 NaN b66 11 NaN 111 3.6 aasd NaN
19 NaN b67 67 NaN 685 35.50 aabsd NaN
Thanks in advance!
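One possible approach (a sketch, assuming the d_map defined above): group each id1 slice by Name before aggregating, so each name gets its own summary row, then concat the pieces back onto dff:
rows = []
for k, v in d_map.items():
    g = (dff[dff['id1'].isin(v)]
         .groupby('Name', as_index=False)
         .agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean'}))
    g['ID'] = k  # the new name for the ID column
    rows.append(g)

out = pd.concat([dff] + rows, ignore_index=True)
The per-name sums and means match the desired rows above (for b65, for instance, aasd gets col1 = 4, col2 = 48 and col3 = 2.5).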

How to create dataframe by randomly selecting from another dataframe?

DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
(0.519) (1.117) (1.152) 0.772 1.490 (0.850) (1.189) (0.759)
0.030 0.047 0.632 (0.608) (0.322) 0.939 0.346 0.651
1.290 (0.179) 0.006 0.850 (1.141) 0.758 0.682
1.500 (1.228) 1.840 (1.594) (0.282) (0.907)
(1.540) 0.689 (0.683) 0.005 0.543
(0.197) (0.664) (0.636) 0.878
(0.942) 0.764 (0.137)
0.693 1.647
0.197
I have the above dataframe. I need the below dataframe, filled with random values drawn from the above dataframe:
DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
(0.664) 1.290 0.682 0.030 (0.683) (0.636) (0.683) 1.840 (1.540)
1.490 (0.907) (0.850) (0.197) (1.228) 0.682 1.290 0.939
0.047 0.682 0.346 0.689 (0.137) 1.490 0.197
0.047 0.878 0.651 0.047 0.047 (0.197)
(1.141) 0.758 0.878 1.490 0.651
1.647 1.490 0.772 1.490
(0.519) 0.693 0.346
(0.137) 0.850
0.197
I've tried this code:
df2 = df1.sample(len(df1))
print(df2)
But the output is:
DP1 DP2 DP3 DP4 DP5 DP6 DP7 DP8 DP9
OP8 0.735590 1.762630 NaN NaN NaN NaN NaN NaN NaN
OP7 -0.999665 0.817949 -0.147698 NaN NaN NaN NaN NaN NaN
OP2 0.031430 0.049994 0.682040 -0.667445 -0.360034 1.089516 0.426642 0.916619 NaN
OP3 1.368955 -0.191781 0.006623 0.932736 -1.277548 0.880056 0.841018 NaN NaN
OP1 -0.551065 -1.195305 -1.243199 0.847178 1.668630 -0.986300 -1.465904 -1.069986 NaN
OP4 1.592201 -1.314628 1.985683 -1.749389 -0.315828 -1.052629 NaN NaN NaN
OP6 -0.208647 -0.710424 -0.686654 0.963221 NaN NaN NaN NaN NaN
OP10 NaN NaN NaN NaN NaN NaN NaN NaN NaN
OP9 0.209244 NaN NaN NaN NaN NaN NaN NaN NaN
OP5 -1.635306 0.737937 -0.736907 0.005545 0.607974 NaN NaN NaN NaN
You can use np.random.choice() for the sampling.
Assuming df is something like this:
df = pd.DataFrame({'DP 1': ['(0.519)', '0.030', '1.290', '1.500', '(1.540)', '(0.197)', '(0.942)', '0.693', '0.197'],
                   'DP 2': ['(1.117)', '0.047', '(0.179)', '(1.228)', '0.689', '(0.664)', '0.764', '1.647', np.nan],
                   'DP 3': ['(1.152)', '0.632', '0.006', '1.840', '(0.683)', '(0.636)', '(0.137)', np.nan, np.nan],
                   'DP 4': ['0.772', '(0.608)', '0.850', '(1.594)', '0.005', '0.878', np.nan, np.nan, np.nan],
                   'DP 5': ['1.490', '(0.322)', '(1.141)', '(0.282)', '0.543', np.nan, np.nan, np.nan, np.nan],
                   'DP 6': ['(0.850)', '0.939', '0.758', '(0.907)', np.nan, np.nan, np.nan, np.nan, np.nan],
                   'DP 7': ['(1.189)', '0.346', '0.682', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'DP 8': ['(0.759)', '0.651', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'DP 9': [np.nan] * 9,
                   'DP 10': [np.nan] * 9})
# DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
# 0 (0.519) (1.117) (1.152) 0.772 1.490 (0.850) (1.189) (0.759) NaN NaN
# 1 0.030 0.047 0.632 (0.608) (0.322) 0.939 0.346 0.651 NaN NaN
# 2 1.290 (0.179) 0.006 0.850 (1.141) 0.758 0.682 NaN NaN NaN
# 3 1.500 (1.228) 1.840 (1.594) (0.282) (0.907) NaN NaN NaN NaN
# 4 (1.540) 0.689 (0.683) 0.005 0.543 NaN NaN NaN NaN NaN
# 5 (0.197) (0.664) (0.636) 0.878 NaN NaN NaN NaN NaN NaN
# 6 (0.942) 0.764 (0.137) NaN NaN NaN NaN NaN NaN NaN
# 7 0.693 1.647 NaN NaN NaN NaN NaN NaN NaN NaN
# 8 0.197 NaN NaN NaN NaN NaN NaN NaN NaN NaN
First extract the choices from all non-null values of df:
choices = df.values[~pd.isnull(df.values)]
# array(['(0.519)', '(1.117)', '(1.152)', '0.772', '1.490', '(0.850)',
# '(1.189)', '(0.759)', '0.030', '0.047', '0.632', '(0.608)',
# '(0.322)', '0.939', '0.346', '0.651', '1.290', '(0.179)', '0.006',
# '0.850', '(1.141)', '0.758', '0.682', '1.500', '(1.228)', '1.840',
# '(1.594)', '(0.282)', '(0.907)', '(1.540)', '0.689', '(0.683)',
# '0.005', '0.543', '(0.197)', '(0.664)', '(0.636)', '0.878',
# '(0.942)', '0.764', '(0.137)', '0.693', '1.647', '0.197'],
# dtype=object)
Then take an np.random.choice() from choices for each non-null cell:
df = df.applymap(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
# DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
# 0 (0.179) 0.682 0.758 (1.152) (0.137) (1.152) 0.939 (0.759) NaN NaN
# 1 1.500 (1.152) (0.197) 0.772 1.840 1.840 0.772 (0.850) NaN NaN
# 2 0.878 0.005 (1.540) 0.764 (0.519) 0.682 (1.152) NaN NaN NaN
# 3 0.758 (0.137) 1.840 1.647 1.647 (0.942) NaN NaN NaN NaN
# 4 0.693 (0.683) (0.759) 1.500 (0.197) NaN NaN NaN NaN NaN
# 5 0.006 (0.137) 0.764 (1.117) NaN NaN NaN NaN NaN NaN
# 6 (0.664) 0.632 (1.141) NaN NaN NaN NaN NaN NaN NaN
# 7 0.543 (0.664) NaN NaN NaN NaN NaN NaN NaN NaN
# 8 (0.137) NaN NaN NaN NaN NaN NaN NaN NaN NaN
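One caveat with this last step: on pandas 2.1+, applymap is deprecated in favour of the elementwise DataFrame.map, so the equivalent call there would be:
df = df.map(lambda x: np.random.choice(choices) if not pd.isnull(x) else x)
For reproducible draws, seed NumPy first (e.g. np.random.seed(0)) before sampling.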

Nuances of merging multiple pandas dataframes (3+) on a key column

First question here and a long one - there are a couple of things I am struggling with regarding merging and formatting my dataframes. I have some half-working solutions, but I am unsure whether they are the best possible for what I want.
Here are the standard formats of the dataframes I am merging with pandas.
df1 =
RT %Area RRT
0 4.83 5.257 0.509
1 6.76 0.424 0.712
2 7.27 0.495 0.766
3 7.70 0.257 0.811
4 7.79 0.122 0.821
5 9.49 92.763 1.000
6 11.40 0.681 1.201
df2=
RT %Area RRT
0 4.83 0.731 0.508
1 6.74 1.243 0.709
2 7.28 0.109 0.766
3 7.71 0.287 0.812
4 7.79 0.177 0.820
5 9.50 95.824 1.000
6 11.31 0.348 1.191
7 11.40 1.166 1.200
8 12.09 0.113 1.273
df3 = ...
Currently I am using a reduce operation on pd.merge_ordered() like below to merge my dataframes (3+). This kind of yields what I want and was from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and want the indexes with the same RRT values to be placed on the same row - and if the RRT values are unique for that dataset I want a NaN for missing data from other datasets.
# The for loop I use to generate the list of formatted dataframes prior to merging
dfs = []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed
        dfs.append(entry.round(2))
dfs = [df1ar,df2ar,df3ar]
df_final = reduce(lambda left,right: pd.merge_ordered(left,right,on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)
RRT RT_x %Area_x RT_y %Area_y RT %Area
0 0.508 NaN NaN 4.83 0.731 NaN NaN
1 0.509 4.83 5.257 NaN NaN 4.83 5.257
2 0.709 NaN NaN 6.74 1.243 NaN NaN
3 0.712 6.76 0.424 NaN NaN 6.76 0.424
4 0.766 7.27 0.495 7.28 0.109 7.27 0.495
5 0.811 7.70 0.257 NaN NaN 7.70 0.257
6 0.812 NaN NaN 7.71 0.287 NaN NaN
7 0.820 NaN NaN 7.79 0.177 NaN NaN
8 0.821 7.79 0.122 NaN NaN 7.79 0.122
9 1.000 9.49 92.763 9.50 95.824 9.49 92.763
10 1.191 NaN NaN 11.31 0.348 NaN NaN
11 1.200 NaN NaN 11.40 1.166 NaN NaN
12 1.201 11.40 0.681 NaN NaN 11.40 0.681
13 1.273 NaN NaN 12.09 0.113 NaN NaN
This works, but:
Can I insert a MultiIndex, based on the filename of the dataframe the data came from, above the corresponding columns? Like the suffixes option, but tied back to the filename and working for more than two sets of data. Is this better done prior to merging, and if so, how do I do it? (I've included the for loop I use to create the list of tables prior to merging.)
Is this reduced merge_ordered the simplest way of doing this?
Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine-tune the merging based on the similarity between RRT values? That is, can it be done without cutting off data from the longer dataframes?
I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.
concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)
A B C
RT %Area RRT RT %Area RRT RT %Area RRT
0 4.83 5.257 0.509 4.83 0.731 0.508 4.83 5.257 0.509
1 6.76 0.424 0.712 6.74 1.243 0.709 6.76 0.424 0.712
2 7.27 0.495 0.766 7.28 0.109 0.766 7.27 0.495 0.766
3 7.70 0.257 0.811 7.71 0.287 0.812 7.70 0.257 0.811
4 7.79 0.122 0.821 7.79 0.177 0.820 7.79 0.122 0.821
5 9.49 92.763 1.000 9.50 95.824 1.000 9.49 92.763 1.000
6 11.40 0.681 1.201 11.31 0.348 1.191 11.40 0.681 1.201
7 NaN NaN NaN 11.40 1.166 1.200 NaN NaN NaN
8 NaN NaN NaN 12.09 0.113 1.273 NaN NaN NaN
I have also tried this - and I get the multiindex to denote which file (A,B,C, just as placeholders) it came from. However, it has obviously not merged based on the RRT value like I want.
Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?
Thanks!
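For the first question, one route (a sketch, assuming each CSV reduces to the RT / %Area / RRT layout shown above) is to set RRT as the index of each frame and let pd.concat outer-join on it, passing the filenames as keys; that reproduces the merge_ordered-style alignment while putting a filename level above each file's columns:
import os
import pandas as pd

frames = {}
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        name = os.path.splitext(os.path.basename(entry.path))[0]
        entry = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed, as in the question
        frames[name] = entry.round(2).set_index('RRT')

# axis=1 concat aligns on the shared RRT index with an outer join;
# the dict keys (filenames) become the top level of the column MultiIndex
df_final = pd.concat(frames, axis=1).reset_index()
On the merge_asof idea: it performs a left join on a sorted key, so rows of the right frame that fall outside the tolerance of every left key are dropped. To avoid losing data from the longer dataframes you would need the longest frame on the left, which makes it a poor fit for a symmetric many-file merge.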

pandas: select first and last valid floats in columns

With a DataFrame that looks like this:
tra98 tra99 tra100 tra101 tra102
0 0.1880 0.345 0.1980 0.2090 0.2190
1 0.2510 0.585 0.2710 0.3240 0.2920
2 0.3240 0.741 0.2190 0.2090 0.2820
3 0.2820 0.825 0.1040 0.1880 0.2400
4 0.2190 1.150 0.0940 0.1360 0.1770
5 0.2300 1.210 0.0522 0.0209 0.0731
6 0.1670 1.290 0.0626 0.0104 0.0104
7 0.0835 1.400 0.0104 NaN NaN
8 0.0418 1.580 NaN NaN NaN
9 0.0209 NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
How can I select the first and last valid values in each column?
Thank you for your help.
The following shows how you can iterate over the columns, call dropna(), and then access the first and last values with iloc:
for col in df:
    valid_col = df[col].dropna()
    print("column:", col, " first:", valid_col.iloc[0], " last:", valid_col.iloc[-1])
column: tra98 first: 0.188 last: 0.0209
column: tra99 first: 0.345 last: 1.58
column: tra100 first: 0.198 last: 0.0104
column: tra101 first: 0.209 last: 0.0104
column: tra102 first: 0.219 last: 0.0104
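An alternative sketch without the explicit loop uses Series.first_valid_index() and last_valid_index(), assuming every column has at least one valid value (both return None for an all-NaN column, so guard for that case if it can occur):
firsts = df.apply(lambda s: s.loc[s.first_valid_index()])
lasts = df.apply(lambda s: s.loc[s.last_valid_index()])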
