Sum if column name is higher than row value - python

I am trying to sum the row values of each of the columns below wherever the row's "date" value is equal to or lower than the column's name:
   01-01-2020  01-01-2021  01-01-2022        date
1           1           3           6  01-01-2020
2           4           4           2  01-10-2021
3           5           1           9  01-12-2021
For instance, for the first column, the only row whose date is equal to or lower than that column's name (01-01-2020) is the first row, so the sum for that column is 1.
Likewise, since all dates in the "date" column are lower than the last column's name (01-01-2022), its total is 6+2+9=17, which gives this:
       01-01-2020  01-01-2021  01-01-2022        date
1               1           3           6  01-01-2020
2               4           4           2  01-10-2021
3               5           1           9  01-12-2021
Total           1           3          17
Is there any way to do this more elegantly than looping over each column and then each row?

We can build a boolean mask with np.greater_equal.outer, then use it to blank out the unwanted cells as NaN before summing:
import numpy as np
import pandas as pd

s = pd.to_datetime(df.date).values
m = np.greater_equal.outer(pd.to_datetime(df.columns[:-1]).values, s).T
# DataFrame.append was removed in pandas 2.0, so build the Total row with pd.concat
df = pd.concat([df, df.iloc[:, :-1].where(m).sum().to_frame('Total').T])
df
Output:
       01-01-2020  01-01-2021  01-01-2022        date
1             1.0         3.0         6.0  01-01-2020
2             4.0         4.0         2.0  01-10-2021
3             5.0         1.0         9.0  01-12-2021
Total         1.0         3.0        17.0         NaN
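For a self-contained run, here is a minimal sketch that first rebuilds the example frame from the question's tables:
import numpy as np
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame(
    {'01-01-2020': [1, 4, 5],
     '01-01-2021': [3, 4, 1],
     '01-01-2022': [6, 2, 9],
     'date': ['01-01-2020', '01-10-2021', '01-12-2021']},
    index=[1, 2, 3])

# m[i, j] is True when column j's label is on or after row i's date
s = pd.to_datetime(df.date).values
m = np.greater_equal.outer(pd.to_datetime(df.columns[:-1]).values, s).T

# keep only the cells whose column label covers the row's date, then sum
total = df.iloc[:, :-1].where(m).sum().to_frame('Total').T
print(pd.concat([df, total]))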

Related

Closest non equal row in a column in Pandas dataframe

I have this df
d={}
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
I would like to create a column that holds the next non-equal value of column qty. Meaning that if qty is equal to 5 and the next row is also 5, I skip it and keep looking until I find the next value not equal to 5; in my case that is 6. All of this should be grouped by id.
Here is the desired dataframe.
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
Output:
    id  qty  qty2
0    1    5   6.0
1    1    5   6.0
2    1    5   6.0
3    1    5   6.0
4    1    5   6.0
5    1    6   5.0
6    1    5   NaN
7    1    5   NaN
8    2    1   2.0
9    2    1   2.0
10   2    2   3.0
11   2    2   3.0
12   2    2   3.0
13   2    3   5.0
14   2    5   8.0
15   2    8   NaN
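For a quick end-to-end check, here is a minimal sketch that builds the frame from the question's dict and applies the same steps:
import pandas as pd

d = {'id': ['1'] * 8 + ['2'] * 8,
     'qty': [5, 5, 5, 5, 5, 6, 5, 5, 1, 1, 2, 2, 2, 3, 5, 8]}
df = pd.DataFrame(d)

# next value per group, kept only where it differs, then backfilled per group
s = df.groupby('id')['qty'].shift(-1)
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
print(df)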

Create a column that contains the sum of rows above within group

Here's the dataset I've got. Basically, I would like to create a column containing the sum of the values before the date (that is, the sum of the values above the row) within the same group, so the first row of each group should always be 0.
group  date        value
1      10/04/2022  2
1      12/04/2022  3
1      17/04/2022  5
1      22/04/2022  1
2      11/04/2022  3
2      15/04/2022  2
2      17/04/2022  4
The column I want would look like this.
Could you give me an idea how to create such a column?
group  date        value  sum
1      10/04/2022  2      0
1      12/04/2022  3      2
1      17/04/2022  5      5
1      22/04/2022  1      10
2      11/04/2022  3      0
2      15/04/2022  2      3
2      17/04/2022  4      5
You can try groupby.transform and call Series.cumsum().shift():
df['sum'] = (df
             # sort the dataframe if needed
             .assign(date=pd.to_datetime(df['date'], dayfirst=True))
             .sort_values(['group', 'date'])
             .groupby('group')['value']
             .transform(lambda col: col.cumsum().shift())
             .fillna(0))
print(df)
   group        date  value   sum
0      1  10/04/2022      2   0.0
1      1  12/04/2022      3   2.0
2      1  17/04/2022      5   5.0
3      1  22/04/2022      1  10.0
4      2  11/04/2022      3   0.0
5      2  15/04/2022      2   3.0
6      2  17/04/2022      4   5.0
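If the frame is already sorted by date within each group (as in the example output above), an equivalent sketch is to take the running total per group and subtract the current row's value:
import pandas as pd

# rebuild the example; rows are already sorted by date within each group
df = pd.DataFrame({
    'group': [1, 1, 1, 1, 2, 2, 2],
    'date': ['10/04/2022', '12/04/2022', '17/04/2022', '22/04/2022',
             '11/04/2022', '15/04/2022', '17/04/2022'],
    'value': [2, 3, 5, 1, 3, 2, 4]})

# the cumulative sum per group includes the current row, so subtract it back out
df['sum'] = df.groupby('group')['value'].cumsum() - df['value']
print(df)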

How do you fill NaN with mean of a subset of a group?

I have a data frame with some values by year and type. I want to replace all NaN values in each year with the mean of values in that year with a specific type. I would like to do this in the most elegant way possible. I'm dealing with a lot of data so less computation would be good as well.
Example:
df = pd.DataFrame({'year': [1, 1, 1, 2, 2, 2],
                   'type': [1, 1, 2, 1, 1, 2],
                   'val': [np.nan, 5, 10, 100, 200, np.nan]})
I want ALL NaNs, regardless of their type, to be replaced with their year's mean of the type-1 values.
In this example, the first row's NaN should be replaced with 5 and the last row's with 150.
My attempt below only fills in values that are missing for type 1, not type 2:
df['val'] = df['val'].fillna(df.query('type==1').groupby('year')['val'].transform('mean'))
You want map:
# calculate the mean val of type-1 rows, by year
s = df[df['type'].eq(1)].groupby('year')['val'].mean()
# map each row's year to that mean and use it to fill the NaNs
df['val'] = df['val'].fillna(df['year'].map(s))
Output:
   year  type    val
0     1     1    5.0
1     1     1    5.0
2     1     2   10.0
3     2     1  100.0
4     2     1  200.0
5     2     2  150.0
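To see what the intermediate s looks like, here is a minimal check using the frame from the question; it is simply the per-year mean of the type-1 values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [1, 1, 1, 2, 2, 2],
                   'type': [1, 1, 2, 1, 1, 2],
                   'val': [np.nan, 5, 10, 100, 200, np.nan]})

# Series indexed by year: {1: 5.0, 2: 150.0}
s = df[df['type'].eq(1)].groupby('year')['val'].mean()
print(s)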
Using fillna and matching indexes
df['val'] = (df.set_index('year').val
               .fillna(df.query('type == 1').groupby(['year']).val.mean())
               .values)
   year  type    val
0     1     1    5.0
1     1     1    5.0
2     1     2   10.0
3     2     1  100.0
4     2     1  200.0
5     2     2  150.0
mask and transform
# NaN-out the non-type-1 values, then take the per-year mean of what is left
df.fillna({'val': df.val.mask(df.type.ne(1)).groupby(df.year).transform('mean')})
   year  type    val
0     1     1    5.0
1     1     1    5.0
2     1     2   10.0
3     2     1  100.0
4     2     1  200.0
5     2     2  150.0

Pandas - Replace NaNs in a column with the mean of specific group

I am working with data like the following. The dataframe is sorted by the date:
category  value       Date
       0      1  24/5/2019
       1    NaN  24/5/2019
       1      1  26/5/2019
       2      2   1/6/2019
       1      2  23/7/2019
       2    NaN  18/8/2019
       2      3  20/8/2019
       7      3   1/9/2019
       1    NaN  12/9/2019
       2    NaN  13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace value with a new Series built from shift + expanding + mean. The first value of a group is not replaced, because no previous values exist:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print(df)
   category  value       Date
0         0    1.0 2019-05-24
1         1    NaN 2019-05-24
2         1    1.0 2019-05-26
3         2    2.0 2019-01-06
4         1    2.0 2019-07-23
5         2    2.0 2019-08-18
6         2    3.0 2019-08-20
7         7    3.0 2019-01-09
8         1    1.5 2019-12-09
9         2    2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value'] = df['value'].fillna(
    df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
   category  value       Date
0         0    1.0  24/5/2019
1         1    NaN  24/5/2019
2         1    1.0  26/5/2019
3         2    2.0   1/6/2019
4         1    2.0  23/7/2019
5         2    2.0  18/8/2019
6         2    3.0  20/8/2019
7         7    3.0   1/9/2019
8         1    1.5  12/9/2019
9         2    2.5  13/9/2019
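Here is a self-contained sketch of the same transform-based approach, with the frame reconstructed from the question's table:
import numpy as np
import pandas as pd

# frame reconstructed from the question's table
df = pd.DataFrame({
    'category': [0, 1, 1, 2, 1, 2, 2, 7, 1, 2],
    'value': [1, np.nan, 1, 2, 2, np.nan, 3, 3, np.nan, np.nan],
    'Date': ['24/5/2019', '24/5/2019', '26/5/2019', '1/6/2019', '23/7/2019',
             '18/8/2019', '20/8/2019', '1/9/2019', '12/9/2019', '13/9/2019']})

# expanding mean of all *previous* values within each category
prev_mean = (df.groupby('category')['value']
               .transform(lambda x: x.shift().expanding().mean()))
df['value'] = df['value'].fillna(prev_mean)
print(df)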

pandas take average on odd rows

I want to fill in data between each pair of rows in a dataframe with the average of the current and next row (where columns are numeric).
starting data:
   time  value  value_1  value-2
0     0      0        4        3
1     2      1        6        6
intermediate df:
   time  value  value_1  value-2
0     0      0        4        3
1     1      0        4        3   # duplicate of row 0
2     2      1        6        6
3     3      1        6        6   # duplicate of row 2
I would like to create df_1:
   time  value  value_1  value-2
0     0    0.0        4      3.0
1     1    0.5        5      4.5   # average of rows 0 and 2
2     2    1.0        6      6.0
3     3    2.0        8      8.0   # average of rows 2 and 4
To do this I appended a copy of the starting dataframe to create the intermediate dataframe shown above:
df = df_0.append(df_0)
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value-2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(numeric_val):
    # average every odd row
    if int(row.name) % 2 != 0:
        # take the average of value and value_shift for each value,
        # but this way I need to create 3 separate functions
        ...
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method using DataFrame.reindex and DataFrame.interpolate?
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex in half steps: reindex(np.arange(len(df.index) * 2) / 2).
This gives a DataFrame like this:
     time  value  value_1  value-2
0.0   0.0    0.0      4.0      3.0
0.5   NaN    NaN      NaN      NaN
1.0   2.0    1.0      6.0      6.0
1.5   NaN    NaN      NaN      NaN
Then use DataFrame.interpolate to fill in the NaN values; the default is linear interpolation, so the mean of the two neighbours in this case.
Finally, use .reset_index(drop=True) to fix your index.
Should give
   time  value  value_1  value-2
0   0.0    0.0      4.0      3.0
1   1.0    0.5      5.0      4.5
2   2.0    1.0      6.0      6.0
3   2.0    1.0      6.0      6.0
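A self-contained sketch built from the starting data above. Note that linear interpolation only fills between existing rows: the trailing half-step row is padded with the last valid values rather than extrapolated, which is why row 3 repeats row 2 instead of matching the asker's desired df_1:
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 2],
                   'value': [0, 1],
                   'value_1': [4, 6],
                   'value-2': [3, 6]})

# insert a NaN row between every pair of rows, then interpolate linearly
out = (df.reindex(np.arange(len(df.index) * 2) / 2)
         .interpolate()
         .reset_index(drop=True))
print(out)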
