I have input data as shown:
df = pd.DataFrame({"colony" : [22, 22, 22, 33, 33, 33],
                   "measure" : [np.nan, 7, 11, 13, np.nan, 9],
                   "Length" : [14, 17, 13, 10, 19, 16],
                   "net/gross" : [np.nan, "gross", "net", "gross", np.nan, "net"]})
df
   colony  measure  Length net/gross
0      22      NaN      14       NaN
1      22        7      17     gross
2      22       11      13       net
3      33       13      10     gross
4      33      NaN      19       NaN
5      33        9      16       net
I want to fill the NaN in the measure column with the maximum value of each colony group, then fill the NaN in the net/gross column with the net/gross value of the row where that group's measure is at its maximum (e.g. the NaN at index 0 gets "net", the value on the row where measure is max, and 13 in the length_adj column). Finally I want to create a remarks column that documents each filled row as "max_filled" and every other row as "unchanged", to arrive at the output below:
   colony  measure net/gross  length_adj     remarks
0      22       11       net          13  max_filled
1      22        7     gross          17   unchanged
2      22       11       net          13   unchanged
3      33       13     gross          10   unchanged
4      33       13     gross          10  max_filled
5      33        9       net          16   unchanged
One approach that allows maximum control over each step (but may be less efficient than more direct pandas methods) is to use apply (with axis=1 to iterate over rows) with a custom function, passing the dataframe in as an argument as well.
You can use np.isnan to check whether a given value of a row is NaN.
Without using groupby, you can directly retrieve for each row the dataframe of the corresponding colony group, and then retrieve the index of the maximum measure with idxmax():
def my_func(row, df):
    if np.isnan(row.measure):
        # index label of the row holding this colony's maximum measure
        max_index_location = df[df.colony == row.colony]['measure'].idxmax()
        # idxmax returns an index label, so look the row up with .loc
        row.measure = df.loc[max_index_location].measure
        row['Length'] = df.loc[max_index_location]['Length']
        row['net/gross'] = df.loc[max_index_location]['net/gross']
        row['remarks'] = 'max_filled'
    else:
        row['remarks'] = 'unchanged'
    return row
df = df.apply(lambda x: my_func(x, df), axis=1)
The dataframe will be:
   colony  measure  Length net/gross     remarks
0      22       11      13       net  max_filled
1      22        7      17     gross   unchanged
2      22       11      13       net   unchanged
3      33       13      10     gross   unchanged
4      33       13      10     gross  max_filled
5      33        9      16       net   unchanged
Here you go:
df['measure'] = df['measure'].fillna(df.groupby('colony')['measure'].transform('max'))
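That one-liner fills only the measure column, though. A sketch of one way to finish the remaining requirements in the same vectorized style (the filled mask and the idxmax transform are assumptions built on the question's dataframe, not part of the original answer):
import numpy as np

filled = df['measure'].isna()  # remember which rows started as NaN
# broadcast the index label of each colony's max measure to every row
idx = df.groupby('colony')['measure'].transform(lambda s: s.idxmax())
df['measure'] = df['measure'].fillna(df.groupby('colony')['measure'].transform('max'))
# copy net/gross and Length from each group's max-measure row into the filled rows
df.loc[filled, 'net/gross'] = df.loc[idx[filled], 'net/gross'].to_numpy()
df.loc[filled, 'Length'] = df.loc[idx[filled], 'Length'].to_numpy()
df['remarks'] = np.where(filled, 'max_filled', 'unchanged')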
Step 1: fill the NaN in the measure column with the group max
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
s
0 11.0
1 7.0
2 11.0
3 13.0
4 13.0
5 9.0
Name: measure, dtype: float64
Assign s as the measure column:
df[['colony']].assign(measure=s)
Result A:
colony measure
0 22 11.0
1 22 7.0
2 22 11.0
3 33 13.0
4 33 13.0
5 33 9.0
Step 2: drop the rows containing NaN (reversing the columns first so that, after the later merge, net/gross and Length follow colony and measure)
df1 = df[df.columns[::-1]].dropna()
df1
net/gross Length measure colony
1 gross 17 7.0 22
2 net 13 11.0 22
3 gross 10 13.0 33
5 net 16 9.0 33
Step 3: merge result A and df1
df[['colony']].assign(measure=s).merge(df1, how='left')
Result B:
colony measure net/gross Length
0 22 11.0 net 13
1 22 7.0 gross 17
2 22 11.0 net 13
3 33 13.0 gross 10
4 33 13.0 gross 10
5 33 9.0 net 16
Step 4: turn result B into the desired output (full code below)
import pandas as pd
import numpy as np

s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
df1 = df[df.columns[::-1]].dropna()
s2 = np.where(df['measure'].isna(), 'max_filled', 'unchanged')
(df[['colony']].assign(measure=s).merge(df1, how='left')
   .assign(remarks=s2).rename(columns={'Length': 'length_adj'}))
Output:
   colony  measure net/gross  length_adj     remarks
0      22     11.0       net          13  max_filled
1      22      7.0     gross          17   unchanged
2      22     11.0       net          13   unchanged
3      33     13.0     gross          10   unchanged
4      33     13.0     gross          10  max_filled
5      33      9.0       net          16   unchanged
I want to create a Python script that calculates the row and column totals (like a matrix).
content of "budget.txt" file:
Budget Jan Feb Mar Apr May Jun Sum
Milk 10 20 31 52 7 11
Eggs 1 5 1 16 4 58
Bread 22 36 17 8 21 16
Butter 4 5 8 11 36 2
Total
The script should process the budget.txt file and fill in the "Sum" column and the "Total" row.
Here is my code:
import sys
budget_file = sys.argv[1]
df = open(budget_file).read()
print(df)
This prints the file, so I can read it. Now my question is: how do I sum the values row- and column-wise?
Given your dataframe, I would approach the calculations in two steps (a loading sketch follows the list):
Compute the sum of each row of the dataframe (see How To Sum DataFrame Rows).
Compute the sum of each column.
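To get from budget.txt to a dataframe in the first place, one option is pd.read_csv (a sketch, assuming the file is whitespace-delimited exactly as shown; the header names a "Sum" column that has no data yet, and the trailing "Total" line is empty, so both are skipped and the column names are supplied explicitly):
import pandas as pd

# skipfooter requires the python engine
df = pd.read_csv('budget.txt', sep=r'\s+', skiprows=1, skipfooter=1,
                 engine='python',
                 names=['Budget', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'])
print(df)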
Given the following dataframe:
Budget Jan Feb Mar Apr May Jun
0 Milk 10 20 31 52 7 11
1 Eggs 1 5 1 16 4 58
2 Bread 22 36 17 8 21 16
3 Butter 4 5 8 11 36 2
To compute the sum across the rows of the dataframe:
#list columns to sum across
df['Sum'] = df[list(df.columns)[1:]].sum(axis=1)
This produces the following df:
Budget Jan Feb Mar Apr May Jun Sum
0 Milk 10.0 20.0 31.0 52.0 7.0 11.0 131.0
1 Eggs 1.0 5.0 1.0 16.0 4.0 58.0 85.0
2 Bread 22.0 36.0 17.0 8.0 21.0 16.0 120.0
3 Butter 4.0 5.0 8.0 11.0 36.0 2.0 66.0
To add a row with the sum of all columns (note: DataFrame.append was removed in pandas 2.0; a pd.concat variant is sketched after the output below):
df.append(pd.Series(df.sum(), name='Total'))
This will yield a dataframe like:
                    Budget  Jan  Feb  Mar  Apr  May  Jun    Sum
0                     Milk   10   20   31   52    7   11  131.0
1                     Eggs    1    5    1   16    4   58   85.0
2                    Bread   22   36   17    8   21   16  120.0
3                   Butter    4    5    8   11   36    2   66.0
Total  MilkEggsBreadButter   37   66   57   87   68   87  402.0
If the string concatenation under Budget is bothersome, you can reset that specific cell to an empty string.
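On pandas 2.0 and later, where DataFrame.append no longer exists, the same Total row can be added with pd.concat; a sketch under the same assumptions:
# Build the Total row explicitly: sum the numeric columns and blank out
# Budget instead of concatenating the product names.
total = df.drop(columns='Budget').sum()
total['Budget'] = ''
df = pd.concat([df, total.to_frame('Total').T])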
I am analyzing the time-series data of one stock, looking for the highest price for further analysis. Here is the sample dataframe df:
date close high_3days
2021-05-01 20 20
2021-05-02 23 23
2021-05-03 26 26
2021-05-04 24 26
2021-05-05 20 26
2021-05-06 26 26
2021-05-07 22 26
2021-05-08 30 30
2021-05-09 20 30
2021-05-10 20 30
I want to add a new column with the number of days since the previous 3-day high. My logic is to find the index of the row of the previous high, and then subtract it from the index of the current row.
Here is the desired output:
date close high_3days days_previous_high
2021-05-01 20 20 0
2021-05-02 23 23 0
2021-05-03 26 26 0
2021-05-04 24 26 1
2021-05-05 20 26 2
2021-05-06 22 26 3
2021-05-07 20 26 4
2021-05-08 30 30 0
2021-05-09 20 30 1
2021-05-10 20 30 2
Could you help me figure the way out? Thanks guys!
Try creating a boolean index with expanding max, then enumerate each group with groupby cumcount:
df['days_previous_high'] = df.groupby(
df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
df:
date close high_3days days_previous_high
0 2021-05-01 20 20 0
1 2021-05-02 23 23 0
2 2021-05-03 26 26 0
3 2021-05-04 24 26 1
4 2021-05-05 20 26 2
5 2021-05-06 22 26 3
6 2021-05-07 20 26 4
7 2021-05-08 30 30 0
8 2021-05-09 20 30 1
9 2021-05-10 20 30 2
Explanation:
expanding max is used to determine the current maximum value at each row.
df['high_3days'].expanding().max()
diff can be used to see where the running max increases, i.e. where a new high is set.
df['high_3days'].expanding().max().diff()
groups can be created by taking the cumsum of where the diff is greater than 0:
df['high_3days'].expanding().max().diff().gt(0).cumsum()
expanding_max expanding_max_diff expanding_max_gt_0 expanding_max_gt_0_cs
20.0 NaN False 0
23.0 3.0 True 1
26.0 3.0 True 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
26.0 0.0 False 2
30.0 4.0 True 3
30.0 0.0 False 3
30.0 0.0 False 3
Now that rows are grouped, groupby cumcount can be used to enumerate each group:
df.groupby(df['high_3days'].expanding().max().diff().gt(0).cumsum()).cumcount()
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
I'm working with a large dataset. The following is an example, calculated with a smaller dataset.
In this example I have the measurements of the pollution of 3 rivers for different timespans. Each year, the amount of pollution of a river is measured at a measuring station downstream ("pollution"). It has already been calculated in which year the river water was polluted upstream ("year_of_upstream_pollution"). My goal is to create a new column ["result_of_upstream_pollution"] which contains the amount of pollution connected to the "year_of_upstream_pollution". For this, the data from the "pollution" column has to be reassigned.
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
y1 = [2002,2002,2003,2005,2005,np.nan,1991,1992,1993,1994,np.nan,np.nan,2002,2002,2003,2004,2005,np.nan]
poll = [10,14,20,11,8,11,
20,22,20,25,18,21,
30,19,15,10,26,28]
dictr1 ={"river_id":ids,"year":year,"pollution": poll,"year_of_upstream_pollution":y1}
dfr1 = pd.DataFrame(dictr1)
print(dfr1)
river_id year pollution year_of_upstream_pollution
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
Example: river_id = 1, year = 2000, year_of_upstream_pollution = 2002
value of the pollution-column in year 2002 = 20
Therefore: result_of_upstream_pollution = 20
The resulting column should look like this:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
My own approach:
### My approach
# Split dfr1 in two
dfr3 = pd.DataFrame(dfr1, columns = ["river_id","year","pollution"])
dfr4 = pd.DataFrame(dfr1, columns = ["river_id","year_of_upstream_pollution"])
# Merge the two dataframes on the "year" and "year_of_upstream_pollution"-column
arrayr= dfr4.merge(dfr3, left_on = "year_of_upstream_pollution", right_on = "year", how = "left").pollution.values
listr = arrayr.tolist()
dfr1["result_of_upstream_pollution"] = listr
print(dfr1)
len(listr) # = 28
This results in the following ValueError:
"Length of values does not match length of index"
My explanation for this is that the values in the "year" column of "dfr3" are not unique across rivers, so the merge matches several rows for some years, which explains why len(listr) = 28.
I haven't been able to find a way around this error yet. Please keep in mind that the real dataset is much larger than this one. Any help would be much appreciated!
As you said in the title, this is a merge on two columns:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
Output:
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 NaN
6 22.0
7 20.0
8 25.0
9 18.0
10 NaN
11 NaN
12 15.0
13 15.0
14 10.0
15 26.0
16 28.0
17 NaN
I just realized that this solution doesn't seem to be working for me.
When i execute the code, this is what happens:
dfr1['result_of_upstream_pollution'] = dfr1.merge(dfr1, left_on=['river_id','year'],
right_on=['river_id','year_of_upstream_pollution'],
how='right')['pollution_x']
print(dfr1)
river_id year pollution year_of_upstream_pollution \
0 1 2000 10 2002.0
1 1 2001 14 2002.0
2 1 2002 20 2003.0
3 1 2003 11 2005.0
4 1 2004 8 2005.0
5 1 2005 11 NaN
6 2 1990 20 1991.0
7 2 1991 22 1992.0
8 2 1992 20 1993.0
9 2 1993 25 1994.0
10 2 1994 18 NaN
11 2 1995 21 NaN
12 3 2000 30 2002.0
13 3 2001 19 2002.0
14 3 2002 15 2003.0
15 3 2003 10 2004.0
16 3 2004 26 2005.0
17 3 2005 28 NaN
result_of_upstream_pollution
0 20.0
1 20.0
2 11.0
3 11.0
4 11.0
5 22.0
6 20.0
7 25.0
8 18.0
9 15.0
10 15.0
11 10.0
12 26.0
13 28.0
14 NaN
15 NaN
16 NaN
17 NaN
For some reason, this code doesn't seem to handle the "NaN" values in the right way.
If there is a "NaN" value in the column "year_of_upstream_pollution", there shouldn't be a value in "result_of_upstream_pollution".
Equally, rows 14, 15 and 16 all have values for "year_of_upstream_pollution" with matching data in the "pollution" column, and therefore should also have values in the result column.
On top of that, it seems that all values after the first "NaN" (at index 5) are assigned the wrong values.
@Quang Hoang Thank you very much for trying to solve my problem! Could you maybe explain why my results differ from yours?
Does anyone know how i can get this code to work?
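The misalignment comes from how='right': the merged frame's rows come back in a different order than dfr1 (rows whose year_of_upstream_pollution is NaN sort to the end), and the column assignment then pastes pollution_x back purely by position. A left merge against a renamed lookup keeps dfr1's row order and length; here is a sketch, assuming the (river_id, year) pairs are unique as in the sample data:
# Build a lookup of pollution keyed by (river_id, year), renamed so it can
# be joined on (river_id, year_of_upstream_pollution).
lookup = dfr1[['river_id', 'year', 'pollution']].rename(
    columns={'year': 'year_of_upstream_pollution',
             'pollution': 'result_of_upstream_pollution'})
# match dtypes: the key column in dfr1 is float because of its NaNs
lookup['year_of_upstream_pollution'] = lookup['year_of_upstream_pollution'].astype(float)
# a left merge preserves row order; NaN keys simply find no match
dfr1 = dfr1.merge(lookup, on=['river_id', 'year_of_upstream_pollution'], how='left')
print(dfr1)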
I have two dataframes, each with one column. I'm pasting them exactly as they print below.
Top (it has no name, as it is the result of Top = Df1.groupby('col1')['att1'].diff().dropna()):
1 15.566667
3 5.066667
5 57.266667
7 -10.366667
9 18.966667
11 50.966667
13 -5.633333
15 -14.266667
17 18.933333
19 3.100000
21 35.966667
23 -17.566667
25 -8.066667
27 -6.366667
29 7.133333
31 -2.633333
33 3.333333
35 -23.800000
37 2.333333
39 -53.533333
41 -17.300000
dtype: float64
Bottom, which is the result of Bottom = np.sqrt(Df2.groupby('ID')['Col2'].sum()/n):
ID
12868123 1.029001
757E13D7 1.432014
79731492 2.912770
799EFB29 1.826576
7D44062A 1.736757
7D4C0E2F 1.943503
7DBA169D 0.650023
7E558E2B 1.256287
7E8B3815 1.491974
7EB80123 0.558717
7FFB607D 1.505221
8065A321 1.809937
80EFE91B 2.064825
811F1B1E 0.992645
82B67C94 0.980618
833C27AE 0.969195
83957B28 0.469914
8447B85D 1.477168
84877498 0.872973
8569499D 2.215307
8617B7D9 1.033294
Name: Col2, dtype: float64
I want to divide those two columns' values by each other:
Top/Bottom
I get the following:
1 NaN
3 NaN
5 NaN
7 NaN
9 NaN
11 NaN
13 NaN
15 NaN
17 NaN
19 NaN
21 NaN
23 NaN
25 NaN
27 NaN
29 NaN
31 NaN
33 NaN
35 NaN
37 NaN
39 NaN
41 NaN
12868123 NaN
757E13D7 NaN
79731492 NaN
799EFB29 NaN
7D44062A NaN
7D4C0E2F NaN
7DBA169D NaN
7E558E2B NaN
7E8B3815 NaN
7EB80123 NaN
7FFB607D NaN
8065A321 NaN
80EFE91B NaN
811F1B1E NaN
82B67C94 NaN
833C27AE NaN
83957B28 NaN
8447B85D NaN
84877498 NaN
8569499D NaN
8617B7D9 NaN
dtype: float64
I tried resetting the index; it didn't help. Not sure why it's not working.
The problem is the different index values: arithmetic operations align Series by index, so you need to cast one side to a NumPy array with .values:
print (Top/Bottom.values)
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
dtype: float64
Solution with div:
print (Top.div(Bottom.values))
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
dtype: float64
But if you assign one Series' index to the other, you can use:
Top.index = Bottom.index
print (Top/Bottom)
ID
12868123 15.127942
757E13D7 3.538141
79731492 19.660552
799EFB29 -5.675464
7D44062A 10.920737
7D4C0E2F 26.224126
7DBA169D -8.666359
7E558E2B -11.356216
7E8B3815 12.690123
7EB80123 5.548426
7FFB607D 23.894609
8065A321 -9.705679
80EFE91B -3.906707
811F1B1E -6.413841
82B67C94 7.274324
833C27AE -2.717031
83957B28 7.093496
8447B85D -16.111911
84877498 2.672858
8569499D -24.165198
8617B7D9 -16.742573
dtype: float64
And if you get an error like:
ValueError: operands could not be broadcast together with shapes (20,) (21,)
the problem is that the Series have different lengths.
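A quick sanity check before dropping the index (a sketch using the names above):
# The positional division only makes sense if the Series pair up one-to-one.
assert len(Top) == len(Bottom), (len(Top), len(Bottom))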
I arrived here because I was looking for how to divide a column by a subset of itself, and I found a solution that is not reported here.
Suppose you have a df like:
d = {'mycol1':[0,0,1,1,2,2],'mycol2':[1,2,3,6,4,8]}
df = pd.DataFrame(data=d)
i.e.
mycol1 mycol2
0 0 1
1 0 2
2 1 3
3 1 6
4 2 4
5 2 8
And now you want to divide mycol2 by a subset composed of its first two values (the rows where mycol1 == 0):
df['mycol2'].div(df[df['mycol1']==0.0]['mycol2'])
will result in
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 NaN
because of the index-alignment problem reported by jezrael.
The solution is to simply use concat to concatenate the subset to match the length of the original df.
Nrows = df[df['mycol1']==0.0]['mycol2'].shape[0]
Nrows_tot = df['mycol2'].shape[0]
times_longer = int(Nrows_tot/Nrows)
df['mycol3'] = df['mycol2'].div(
    pd.concat([df[df['mycol1']==0.0]['mycol2']]*times_longer, ignore_index=True))
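The concat can also be avoided by tiling the subset as a plain NumPy array, which sidesteps index alignment entirely; a sketch under the same assumption that the subset length divides the total length evenly:
import numpy as np

# repeat the subset's values so they line up positionally with df['mycol2']
subset = df.loc[df['mycol1'] == 0.0, 'mycol2'].to_numpy()
df['mycol3'] = df['mycol2'] / np.tile(subset, len(df) // len(subset))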