Pandas: create dataframe using value_counts - python

I have data
age
32
16
39
39
23
36
29
26
43
34
35
50
29
29
31
42
53
I need to get smth like this
I can get
df.age.value_counts()
and
100. * df.age.value_counts() / len(df.age)
But how can I union this and give name to columns?

You can use cut with agg:
#helper df with min and max ages, necessary add category Total
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
'35-39','40-44','45-49','50-54','55-59','60-64','65+','Total'],
'Min':[0, 15,20,25,30,35,40,45,50,55,60,65,np.nan],
'Max':[14,19,24,29,34,39,44,49,54,59,64,120, np.nan]})
print (df1)
G Max Min
0 14 yo and younger 14.0 0.0
1 15-19 19.0 15.0
2 20-24 24.0 20.0
3 25-29 29.0 25.0
4 30-34 34.0 30.0
5 35-39 39.0 35.0
6 40-44 44.0 40.0
7 45-49 49.0 45.0
8 50-54 54.0 50.0
9 55-59 59.0 55.0
10 60-64 64.0 60.0
11 65+ 120.0 65.0
12 Total NaN NaN
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
df['Groups'] = pd.cut(df.age, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (df)
age Groups
0 32 30-34
1 16 15-19
2 39 35-39
3 39 35-39
4 23 20-24
5 36 35-39
6 29 25-29
7 26 25-29
8 43 40-44
9 34 30-34
10 35 35-39
11 50 50-54
12 29 25-29
13 29 25-29
14 31 30-34
15 42 40-44
16 53 50-54
df = df.groupby('Groups')['Groups']
.agg({'Total':[len, lambda x: len(x)/df.shape[0] * 100 ]})
.rename(columns={'len':'N', '<lambda>':'%'})
#last Total row
df.ix['Total'] = df.sum()
print (df)
Total
N %
Groups
14 yo and younger 0.0 0.000000
15-19 1.0 5.882353
20-24 1.0 5.882353
25-29 4.0 23.529412
30-34 3.0 17.647059
35-39 4.0 23.529412
40-44 2.0 11.764706
45-49 0.0 0.000000
50-54 2.0 11.764706
55-59 0.0 0.000000
60-64 0.0 0.000000
65+ 0.0 0.000000
Total 17.0 100.000000
EDIT1:
Solution with size scale better:
df1 = df.groupby('Groups').size().to_frame()
df1.columns = pd.MultiIndex.from_arrays(('Total','N'))
df1.ix[:,('Total','%')] = 100 * df1.ix[:,('Total','N')] / df.shape[0]
df1.ix['Total'] = df1.sum()
print (df1)
Total
N %
Groups
14 yo and younger 0.0 0.000000
15-19 1.0 5.882353
20-24 1.0 5.882353
25-29 4.0 23.529412
30-34 3.0 17.647059
35-39 4.0 23.529412
40-44 2.0 11.764706
45-49 0.0 0.000000
50-54 2.0 11.764706
55-59 0.0 0.000000
60-64 0.0 0.000000
65+ 0.0 0.000000
Total 17.0 100.000000

Related

How to do similar to conditional countifs on a dataframe

I am trying to replicate countifs in excel to get a rank between two unique values that are listed in my dataframe. I have attached the expected output calculated in excel using countif and let/rank functions.
I am trying to generate "average rank of gas and coal plants" that takes the number from the "average rank column" and then ranks the two unique types from technology (CCGT or COAL) into two new ranks (Gas or Coal) so then I can get the relavant quantiles for this. In case you are wondering why I would need to do this seeing as there are only two coal plants, well when I run this model on a larger dataset it will be useful to know how to do this in code and not manually on my dataset.
Ideally the output will return two ranks 1-47 for all units with technology == CCGT and 1-2 for all units with technology == COAL.
This is the column I am looking to make
Unit ID
Technology
03/01/2022
04/01/2022
05/01/2022
06/01/2022
07/01/2022
08/01/2022
Average Rank
Unit Rank
Avg Rank of Gas & Coal plants
Gas Quintiles
Coal Quintiles
Quintiles
FAWN-1
CCGT
1.0
5.0
1.0
5.0
2.0
1.0
2.5
1
1
1
0
Gas_1
GRAI-6
CCGT
4.0
18.0
2.0
4.0
3.0
3.0
5.7
2
2
1
0
Gas_1
EECL-1
CCGT
5.0
29.0
4.0
1.0
1.0
2.0
7.0
3
3
1
0
Gas_1
PEMB-21
CCGT
7.0
1.0
6.0
13.0
8.0
8.0
7.2
4
4
1
0
Gas_1
PEMB-51
CCGT
3.0
3.0
3.0
11.0
16.0
7.2
5
5
1
0
Gas_1
PEMB-41
CCGT
9.0
4.0
7.0
7.0
10.0
13.0
8.3
6
6
1
0
Gas_1
WBURB-1
CCGT
6.0
9.0
22.0
2.0
7.0
5.0
8.5
7
7
1
0
Gas_1
PEMB-31
CCGT
14.0
6.0
13.0
6.0
4.0
9.0
8.7
8
8
1
0
Gas_1
GRMO-1
CCGT
2.0
7.0
10.0
24.0
11.0
6.0
10.0
9
9
1
0
Gas_1
PEMB-11
CCGT
21.0
2.0
9.0
10.0
9.0
14.0
10.8
10
10
2
0
Gas_2
STAY-1
CCGT
19.0
12.0
5.0
23.0
6.0
7.0
12.0
11
11
2
0
Gas_2
GRAI-7
CCGT
10.0
27.0
15.0
9.0
15.0
11.0
14.5
12
12
2
0
Gas_2
DIDCB6
CCGT
28.0
11.0
11.0
8.0
19.0
15.0
15.3
13
13
2
0
Gas_2
SCCL-3
CCGT
17.0
16.0
31.0
3.0
18.0
10.0
15.8
14
14
2
0
Gas_2
STAY-4
CCGT
12.0
8.0
20.0
18.0
14.0
23.0
15.8
14
14
2
0
Gas_2
CDCL-1
CCGT
13.0
22.0
8.0
25.0
12.0
16.0
16.0
16
16
2
0
Gas_2
STAY-3
CCGT
8.0
17.0
17.0
20.0
13.0
22.0
16.2
17
17
2
0
Gas_2
MRWD-1
CCGT
19.0
26.0
5.0
19.0
17.3
18
18
2
0
Gas_2
WBURB-3
CCGT
24.0
14.0
17.0
17.0
18.0
19
19
3
0
Gas_3
WBURB-2
CCGT
14.0
21.0
12.0
31.0
18.0
19.2
20
20
3
0
Gas_3
GYAR-1
CCGT
26.0
14.0
17.0
20.0
21.0
19.6
21
21
3
0
Gas_3
STAY-2
CCGT
18.0
20.0
18.0
21.0
24.0
20.0
20.2
22
22
3
0
Gas_3
KLYN-A-1
CCGT
24.0
12.0
19.0
27.0
20.5
23
23
3
0
Gas_3
SHOS-1
CCGT
16.0
15.0
28.0
15.0
29.0
27.0
21.7
24
24
3
0
Gas_3
DIDCB5
CCGT
10.0
35.0
22.0
22.3
25
25
3
0
Gas_3
CARR-1
CCGT
33.0
26.0
27.0
22.0
4.0
22.4
26
26
3
0
Gas_3
LAGA-1
CCGT
15.0
13.0
29.0
32.0
23.0
24.0
22.7
27
27
3
0
Gas_3
CARR-2
CCGT
24.0
25.0
27.0
29.0
21.0
12.0
23.0
28
28
3
0
Gas_3
GRAI-8
CCGT
11.0
28.0
36.0
16.0
26.0
25.0
23.7
29
29
4
0
Gas_4
SCCL-2
CCGT
29.0
16.0
28.0
25.0
24.5
30
30
4
0
Gas_4
LBAR-1
CCGT
19.0
25.0
31.0
28.0
25.8
31
31
4
0
Gas_4
CNQPS-2
CCGT
20.0
32.0
32.0
26.0
27.5
32
32
4
0
Gas_4
SPLN-1
CCGT
23.0
30.0
30.0
27.7
33
33
4
0
Gas_4
DAMC-1
CCGT
23.0
21.0
38.0
34.0
29.0
34
34
4
0
Gas_4
KEAD-2
CCGT
30.0
30.0
35
35
4
0
Gas_4
SHBA-1
CCGT
26.0
23.0
35.0
37.0
30.3
36
36
4
0
Gas_4
HUMR-1
CCGT
22.0
30.0
37.0
37.0
33.0
28.0
31.2
37
37
4
0
Gas_4
CNQPS-4
CCGT
27.0
33.0
35.0
30.0
31.3
38
38
5
0
Gas_5
CNQPS-1
CCGT
25.0
40.0
33.0
32.7
39
39
5
0
Gas_5
SEAB-1
CCGT
32.0
34.0
36.0
29.0
32.8
40
40
5
0
Gas_5
PETEM1
CCGT
35.0
35.0
41
41
5
0
Gas_5
ROCK-1
CCGT
31.0
34.0
38.0
38.0
35.3
42
42
5
0
Gas_5
SEAB-2
CCGT
31.0
39.0
39.0
34.0
35.8
43
43
5
0
Gas_5
WBURB-43
COAL
32.0
37.0
40.0
39.0
31.0
35.8
44
1
0
1
Coal_1
FDUNT-1
CCGT
36.0
36.0
45
44
5
0
Gas_5
COSO-1
CCGT
30.0
42.0
36.0
36.0
45
44
5
0
Gas_5
WBURB-41
COAL
33.0
38.0
41.0
40.0
32.0
36.8
47
2
0
1
Coal_1
FELL-1
CCGT
34.0
39.0
43.0
41.0
33.0
38.0
48
46
5
0
Gas_5
KEAD-1
CCGT
43.0
43.0
49
47
5
0
Gas_5
I have tried to do it the same way I got average rank, which is a rank of the average of inputs in the dataframe but it doesn't seem to work with additional conditions.
Thank you!!
import pandas as pd
df = pd.read_csv("gas.csv")
display(df['Technology'].value_counts())
print('------')
display(df['Technology'].value_counts()[0]) # This is how you access count of CCGT
display(df['Technology'].value_counts()[1])
Output:
CCGT 47
COAL 2
Name: Technology, dtype: int64
------
47
2
By the way: pd.cut or pd.qcut can be used to calculate quantiles. You don't have to manually define what a quantile is.
Refer to the documentation and other websites:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
https://www.geeksforgeeks.org/how-to-use-pandas-cut-and-qcut/
There are many methods you can pass to rank. Refer to documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
df['rank'] = df.groupby("Technology")["Average Rank"].rank(method = "dense", ascending = True)
df
method{‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups.

Python: Pandas merge three dataframes on date, keeping all dates [duplicate]

This question already has answers here:
Merge multiple DataFrames Pandas
(5 answers)
Pandas Merging 101
(8 answers)
Closed 7 months ago.
I have three dataframes
Dataframe df1:
date A
0 2022-04-11 1
1 2022-04-12 2
2 2022-04-14 26
3 2022-04-16 2
4 2022-04-17 1
5 2022-04-20 17
6 2022-04-21 14
7 2022-04-22 1
8 2022-04-23 9
9 2022-04-24 1
10 2022-04-25 5
11 2022-04-26 2
12 2022-04-27 21
13 2022-04-28 9
14 2022-04-29 17
15 2022-04-30 5
16 2022-05-01 8
17 2022-05-07 1241217
18 2022-05-08 211
19 2022-05-09 1002521
20 2022-05-10 488739
21 2022-05-11 12925
22 2022-05-12 57
23 2022-05-13 8515098
24 2022-05-14 1134576
Dateframe df2:
date B
0 2022-04-12 8
1 2022-04-14 7
2 2022-04-16 2
3 2022-04-19 2
4 2022-04-23 2
5 2022-05-07 2
6 2022-05-08 5
7 2022-05-09 2
8 2022-05-14 1
Dataframe df3:
date C
0 2022-04-12 6
1 2022-04-13 1
2 2022-04-14 2
3 2022-04-20 3
4 2022-04-21 9
5 2022-04-22 25
6 2022-04-23 56
7 2022-04-24 49
8 2022-04-25 68
9 2022-04-26 71
10 2022-04-27 40
11 2022-04-28 44
12 2022-04-29 27
13 2022-04-30 34
14 2022-05-01 28
15 2022-05-07 9
16 2022-05-08 20
17 2022-05-09 24
18 2022-05-10 21
19 2022-05-11 8
20 2022-05-12 8
21 2022-05-13 14
22 2022-05-14 25
23 2022-05-15 43
24 2022-05-16 36
25 2022-05-17 29
26 2022-05-18 28
27 2022-05-19 17
28 2022-05-20 6
I would like to merge df1, df2, df3 in a single dataframe with columns date, A, B, C, in such a way that date contains all dates which appeared in df1 and/or df2 and/or df3 (without repetition), and if a particular date was not in any of the dataframes, then for the respective column I put value 0.0. So, I would like to have something like that:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-08-12 2.0 8.0 6.0
2 2022-08-13 0.0 0.0 1.0
...
I tried to use this method
merge1 = pd.merge(df1, df2, how='outer')
sorted_merge1 = merge1.sort_values(by=['date'], ascending=False)
full_merge = pd.merge(sorted_merg1, df3, how='outer')
However, it seems it skips the dates which are not common for all three dataframes.
Try this,
print(pd.merge(df1, df2, on='date', how='outer').merge(df3, on='date', how='outer').fillna(0))
O/P:
date A B C
0 2022-04-11 1.0 0.0 0.0
1 2022-04-12 2.0 8.0 6.0
2 2022-04-14 26.0 7.0 2.0
3 2022-04-16 2.0 2.0 0.0
4 2022-04-17 1.0 0.0 0.0
5 2022-04-20 17.0 0.0 3.0
6 2022-04-21 14.0 0.0 9.0
7 2022-04-22 1.0 0.0 25.0
8 2022-04-23 9.0 2.0 56.0
9 2022-04-24 1.0 0.0 49.0
10 2022-04-25 5.0 0.0 68.0
11 2022-04-26 2.0 0.0 71.0
12 2022-04-27 21.0 0.0 40.0
13 2022-04-28 9.0 0.0 44.0
14 2022-04-29 17.0 0.0 27.0
15 2022-04-30 5.0 0.0 34.0
16 2022-05-01 8.0 0.0 28.0
17 2022-05-07 1241217.0 2.0 9.0
18 2022-05-08 211.0 5.0 20.0
19 2022-05-09 1002521.0 2.0 24.0
20 2022-05-10 488739.0 0.0 21.0
21 2022-05-11 12925.0 0.0 8.0
22 2022-05-12 57.0 0.0 8.0
23 2022-05-13 8515098.0 0.0 14.0
24 2022-05-14 1134576.0 1.0 25.0
25 2022-04-19 0.0 2.0 0.0
26 2022-04-13 0.0 0.0 1.0
27 2022-05-15 0.0 0.0 43.0
28 2022-05-16 0.0 0.0 36.0
29 2022-05-17 0.0 0.0 29.0
30 2022-05-18 0.0 0.0 28.0
31 2022-05-19 0.0 0.0 17.0
32 2022-05-20 0.0 0.0 6.0
​
perform merge chain and fill NaN with 0

Substituting values with conditions

I have a dataframe like this one below
Air Station Code Humidity Temperature Latitude Longitude
St.1 20 10 10.00 10.00
St.2 4 15 25.00 30.00
St.3 16 21 8.00 15.00
St.4 38 8 31.00 40.00
St.5 10 18 10.00 10.00
St.6 40 4 25.00 30.00
St.7 10 13 8.00 15.00
St.8 46 11 31.00 40.00
St.9 28 9 10.00 10.00
St.10 14 22 25.00 30.00
St.11 5 40 8.00 15.00
St.12 11 10 31.00 40.00
...
St.89 61 35 10.00 10.00
St.90 23 29 25.00 30.00
St.91 35 12 8.00 15.00
St.92 31 7 31.00 40.00
I want to change the station codes by matching the coordinates, substituing the codes by repeating the first 4 codes, obtaining this
Air Station Code Humidity Temperature Latitude Longitude
St.1 20 10 10.00 10.00
St.2 4 15 25.00 30.00
St.3 16 21 8.00 15.00
St.4 38 8 31.00 40.00
St.1 10 18 10.00 10.00
St.2 40 4 25.00 30.00
St.3 10 13 8.00 15.00
St.4 46 11 31.00 40.00
St.1 28 9 10.00 10.00
St.2 14 22 25.00 30.00
St.3 5 40 8.00 15.00
St.4 11 10 31.00 40.00
...
St.1 61 35 10.00 10.00
St.2 23 29 25.00 30.00
St.3 35 12 8.00 15.00
St.4 31 7 31.00 40.00
Is there some way to implement an "if/else" substitution on the whole dataframe without going manually over every observation in python?
df['Air Station Code'] = 'St.' + pd.Series(df[['Latitude','Longitude']].astype(str).agg(sum, axis=1).factorize()[0] + 1).astype(str)
df
Out[77]:
Air Station Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.1 10 18 10.0 10.0
5 St.2 40 4 25.0 30.0
6 St.3 10 13 8.0 15.0
7 St.4 46 11 31.0 40.0
8 St.1 28 9 10.0 10.0
9 St.2 14 22 25.0 30.0
10 St.3 5 40 8.0 15.0
11 St.4 11 10 31.0 40.0
There may be a better way to do this, but this solves the problem. I create a second dataframe without the duplicates, which keeps the first occurrence of each lat/long. I make lat/long the index and drop the other columns. I then do a "join", adding a new column with the matching lat/long. I then overwrite the original station code with the looked up one.
import pandas as pd
data = [
["St.1", 20, 10, 10.00, 10.00],
["St.2", 4, 15, 25.00, 30.00],
["St.3", 16, 21, 8.00, 15.00],
["St.4", 38, 8, 31.00, 40.00],
["St.5", 10, 18, 10.00, 10.00],
["St.6", 40, 4, 25.00, 30.00],
["St.7", 10, 13, 8.00, 15.00],
["St.8", 46, 11, 31.00, 40.00],
["St.9", 28, 9, 10.00, 10.00],
["St.10", 14, 22, 25.00, 30.00],
["St.11", 5, 40, 8.00, 15.00],
["St.12", 11, 10, 31.00, 40.00],
["St.89", 61, 35, 10.00, 10.00],
["St.90", 23, 29, 25.00, 30.00],
["St.91", 35, 12, 8.00, 15.00],
["St.92", 31, 7, 31.00, 40.00],
]
column = "Air_Station_Code Humidity Temperature Latitude Longitude".split()
df = pd.DataFrame(data,columns=column)
print(df)
df1 = df.drop_duplicates(['Latitude','Longitude'])
df1 = df1[['Air_Station_Code','Latitude','Longitude']]
df1.set_index(['Latitude','Longitude'], inplace=True)
print(df1)
df2 = df.join( df1, on=['Latitude','Longitude'], rsuffix='R' )
print(df2)
df['Air_Station_Code'] = df2['Air_Station_CodeR']
print(df)
Output:
Air_Station_Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.5 10 18 10.0 10.0
5 St.6 40 4 25.0 30.0
6 St.7 10 13 8.0 15.0
7 St.8 46 11 31.0 40.0
8 St.9 28 9 10.0 10.0
9 St.10 14 22 25.0 30.0
10 St.11 5 40 8.0 15.0
11 St.12 11 10 31.0 40.0
12 St.89 61 35 10.0 10.0
13 St.90 23 29 25.0 30.0
14 St.91 35 12 8.0 15.0
15 St.92 31 7 31.0 40.0
Air_Station_Code
Latitude Longitude
10.0 10.0 St.1
25.0 30.0 St.2
8.0 15.0 St.3
31.0 40.0 St.4
Air_Station_Code Humidity ... Longitude Air_Station_CodeR
0 St.1 20 ... 10.0 St.1
1 St.2 4 ... 30.0 St.2
2 St.3 16 ... 15.0 St.3
3 St.4 38 ... 40.0 St.4
4 St.5 10 ... 10.0 St.1
5 St.6 40 ... 30.0 St.2
6 St.7 10 ... 15.0 St.3
7 St.8 46 ... 40.0 St.4
8 St.9 28 ... 10.0 St.1
9 St.10 14 ... 30.0 St.2
10 St.11 5 ... 15.0 St.3
11 St.12 11 ... 40.0 St.4
12 St.89 61 ... 10.0 St.1
13 St.90 23 ... 30.0 St.2
14 St.91 35 ... 15.0 St.3
15 St.92 31 ... 40.0 St.4
[16 rows x 6 columns]
Air_Station_Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.1 10 18 10.0 10.0
5 St.2 40 4 25.0 30.0
6 St.3 10 13 8.0 15.0
7 St.4 46 11 31.0 40.0
8 St.1 28 9 10.0 10.0
9 St.2 14 22 25.0 30.0
10 St.3 5 40 8.0 15.0
11 St.4 11 10 31.0 40.0
12 St.1 61 35 10.0 10.0
13 St.2 23 29 25.0 30.0
14 St.3 35 12 8.0 15.0
15 St.4 31 7 31.0 40.0

Reading multiple DataFrames from a given input

I have a couple of data frames given this way :
38 47 7 20 35
45 76 63 96 24
98 53 2 87 80
83 86 92 48 1
73 60 26 94 6
80 50 29 53 92
66 90 79 98 46
40 21 58 38 60
35 13 72 28 6
48 76 51 96 12
79 80 24 37 51
86 70 1 22 71
52 69 10 83 13
12 40 3 0 30
46 50 48 76 5
Could you please tell me how it is possible to add them to a list of dataframes?
Thanks a lot!
First convert values to one DataFrame with separator misisng values (converted from blank lines):
df = pd.read_csv(file, header=None, skip_blank_lines=False)
print (df)
0 1 2 3 4
0 38.0 47.0 7.0 20.0 35.0
1 45.0 76.0 63.0 96.0 24.0
2 98.0 53.0 2.0 87.0 80.0
3 83.0 86.0 92.0 48.0 1.0
4 73.0 60.0 26.0 94.0 6.0
5 NaN NaN NaN NaN NaN
6 80.0 50.0 29.0 53.0 92.0
7 66.0 90.0 79.0 98.0 46.0
8 40.0 21.0 58.0 38.0 60.0
9 35.0 13.0 72.0 28.0 6.0
10 48.0 76.0 51.0 96.0 12.0
11 NaN NaN NaN NaN NaN
12 79.0 80.0 24.0 37.0 51.0
13 86.0 70.0 1.0 22.0 71.0
14 52.0 69.0 10.0 83.0 13.0
15 12.0 40.0 3.0 0.0 30.0
16 46.0 50.0 48.0 76.0 5.0
And then in list comprehension create smaller DataFrames in list:
dfs = [g.iloc[1:].astype(int).reset_index(drop=True)
for _, g in df.groupby(df[0].isna().cumsum())]
print (dfs[1])
0 1 2 3 4
0 80 50 29 53 92
1 66 90 79 98 46
2 40 21 58 38 60
3 35 13 72 28 6
4 48 76 51 96 12

Filling missing rows in groups after groupby

I've got some SQL data that I'm grouping and performing some aggregation on. It works nicely:
grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
c: [numpy.sum, numpy.mean, numpy.size],
d: [numpy.sum, numpy.mean, numpy.size]
})
and
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
but I want to fill all of the rows that are in a=25 but not in a=26 with zeros. In other words, something like:
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 0 0 0 0 0 0
25 0 0 0 0 0 0
How can I do this?
Consider the dataframe df
df = pd.DataFrame(
np.random.randint(10, size=(6, 6)),
pd.MultiIndex.from_tuples(
[(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
names=['a', 'b']
),
pd.MultiIndex.from_product(
[['c', 'd'], ['sum', 'mean', 'size']]
)
)
c d
sum mean size sum mean size
a b
25 20 8 3 5 5 0 2
21 3 7 8 9 2 7
23 2 1 3 2 5 4
24 9 0 1 7 1 6
25 1 9 3 5 8 8
26 23 8 8 4 8 0 5
You can quickly recover all missing rows from the cartesian product with unstack(fill_value=0) followed by stack
df.unstack(fill_value=0).stack()
c d
mean size sum mean size sum
a b
25 20 3 5 8 0 2 5
21 7 8 3 2 7 9
23 1 3 2 5 4 2
24 0 1 9 1 6 7
25 9 3 1 8 8 5
26 20 0 0 0 0 0 0
21 0 0 0 0 0 0
23 8 4 8 0 5 8
24 0 0 0 0 0 0
25 0 0 0 0 0 0
Note: Using fill_value=0 preserves the dtype int. Without it, when unstacked, the gaps get filled with NaN and dtypes get converted to float
print(df)
c d
sum mean size sum mean size
a b
25 20 107.0 0.804511 133.0 5328000 40060.150376 133
21 110.0 0.774648 142.0 6031000 42471.830986 142
23 126.0 0.792453 159.0 8795000 55314.465409 159
24 72.0 0.947368 76.0 2920000 38421.052632 76
25 54.0 0.818182 66.0 2570000 38939.393939 66
26 23 126.0 0.792453 159.0 8795000 55314.465409 159
I like:
df = df.unstack().replace(np.nan,0).stack(-1)
print(df)
c d
mean size sum mean size sum
a b
25 20 0.804511 133.0 107.0 40060.150376 133.0 5328000.0
21 0.774648 142.0 110.0 42471.830986 142.0 6031000.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.947368 76.0 72.0 38421.052632 76.0 2920000.0
25 0.818182 66.0 54.0 38939.393939 66.0 2570000.0
26 20 0.000000 0.0 0.0 0.000000 0.0 0.0
21 0.000000 0.0 0.0 0.000000 0.0 0.0
23 0.792453 159.0 126.0 55314.465409 159.0 8795000.0
24 0.000000 0.0 0.0 0.000000 0.0 0.0
25 0.000000 0.0 0.0 0.000000 0.0 0.0

Categories

Resources