Create a column that contains the sum of rows above within group - python

Here's the dataset I've got. Basically I would like to create a column containing the sum of the values before each date (i.e. the sum of the values above the row) within the same group, so the first row of each group should always be 0.
group  date        value
1      10/04/2022  2
1      12/04/2022  3
1      17/04/2022  5
1      22/04/2022  1
2      11/04/2022  3
2      15/04/2022  2
2      17/04/2022  4
The column I want would look like this.
Could you give me an idea how to create such a column?
group  date        value  sum
1      10/04/2022  2      0
1      12/04/2022  3      2
1      17/04/2022  5      5
1      22/04/2022  1      10
2      11/04/2022  3      0
2      15/04/2022  2      3
2      17/04/2022  4      5

You can use groupby.transform and call Series.cumsum().shift() within each group:
df['sum'] = (df
             # sort the dataframe if needed
             .assign(date=pd.to_datetime(df['date'], dayfirst=True))
             .sort_values(['group', 'date'])
             .groupby('group')['value']
             .transform(lambda col: col.cumsum().shift())
             .fillna(0))
print(df)
group date value sum
0 1 10/04/2022 2 0.0
1 1 12/04/2022 3 2.0
2 1 17/04/2022 5 5.0
3 1 22/04/2022 1 10.0
4 2 11/04/2022 3 0.0
5 2 15/04/2022 2 3.0
6 2 17/04/2022 4 5.0
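Equivalently, because the running total includes the current row, you can subtract it back out. A minimal sketch of that variant, assuming 'date' has already been converted with pd.to_datetime and the frame is sorted by ['group', 'date']:
# running total per group minus the current value = sum of the rows above;
# no fillna needed: the first row of each group gives value - value = 0
df['sum'] = df.groupby('group')['value'].cumsum() - df['value']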

Related

Closest non equal row in a column in Pandas dataframe

I have this df
d = {}
d['id'] = ['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty'] = [5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
df = pd.DataFrame(d)
I would like to create a column that holds the next non-equal value of the qty column. Meaning that if qty is 5 and the next row is also 5, I skip it and keep looking until I find the next value not equal to 5; in my case that is 6. All of this should be grouped by id.
Here is the desired dataframe.
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
output:
id qty qty2
0 1 5 6.0
1 1 5 6.0
2 1 5 6.0
3 1 5 6.0
4 1 5 6.0
5 1 6 5.0
6 1 5 NaN
7 1 5 NaN
8 2 1 2.0
9 2 1 2.0
10 2 2 3.0
11 2 2 3.0
12 2 2 3.0
13 2 3 5.0
14 2 5 8.0
15 2 8 NaN
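To see why the mask-then-bfill works, here is a sketch of the intermediate values for id '1' (qty = 5 5 5 5 5 6 5 5):
# qty           : 5    5    5    5    5    6    5    5
# s = shift(-1) : 5    5    5    5    6    5    5    NaN
# masked        : NaN  NaN  NaN  NaN  6    5    NaN  NaN   <- keep only rows where the next qty differs
# bfill per id  : 6    6    6    6    6    5    NaN  NaN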

Find the time difference between consecutive rows of two columns for a given value in third column

Let's say we want to compute the variable D in the dataframe below based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes; the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value for the first row whenever the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
import numpy as np

# Compute delta in minutes
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
             .dt.total_seconds().div(60))
# Set NaN wherever the group in A changes
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
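A sketch of the same idea that folds the masking into the shift itself: shifting B within each group makes the first row of every group NaT, so D is NaN there without the separate df.loc step (the B_td/C_td column names are just illustrative):
df['B_td'] = pd.to_timedelta(df['B'])
df['C_td'] = pd.to_timedelta(df['C'])
# grouped shift: the first row of each A-group gets NaT, hence NaN in D
df['D'] = (df['C_td'] - df.groupby('A')['B_td'].shift()).dt.total_seconds() / 60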
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
makes the following possible:
>>> df['D'] = (df.groupby('A')
               .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
               .reset_index(drop=True))
You can always drop these new columns later.
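One caveat with this variant: .dt.seconds is the (always non-negative) seconds component of the timedelta, not the signed total, so it differs from .dt.total_seconds() whenever the difference is negative, as in the -183 row above. A quick illustration:
td = pd.to_timedelta('-3min')   # Timedelta('-1 days +23:57:00')
td.seconds                      # 86220  -> seconds component of the normalized timedelta
td.total_seconds()              # -180.0 -> signed total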

Replacing values with the next unique one

In my pandas dataframe I have a column of non-unique values.
I want to add a second column that contains the next unique value, i.e.:
col
1
5
5
2
2
4
col addedCol
1 5
5 2
5 2
2 4
2 4
4 (last value doesn't matter)
How can I achieve this using pandas?
To clarify: I want each row to contain the next value that is different from that row's value.
I hope I explained myself better now.
IIUC, you need the next value which is different from the current value.
df.loc[:, 'col2'] = df.drop_duplicates().shift(-1).col
df['col2'].ffill(inplace=True)
col col2
0 1 5.0
1 5 2.0
2 5 2.0
3 2 2.0
(Notice that the last 2.0 value doesn't matter.) As suggested by @MartijnPieters,
df['col2'] = df['col2'].astype(int)
can convert the values back to the original integers if needed.
Adding another good solution from @piRSquared:
df.assign(addedcol=df.index.to_series().shift(-1).map(df.col.drop_duplicates()).bfill())
col addedcol
0 1 5.0
1 5 2.0
2 5 2.0
3 2 NaN
Another example, if df is
col
0 1
1 5
2 5
3 2
4 3
5 3
6 10
7 9
Then
df.loc[:, 'col2'] = df.drop_duplicates().shift(-1).col
df = df.ffill()
yields
col col2
0 1 5.0
1 5 2.0
2 5 2.0
3 2 3.0
4 3 10.0
5 3 10.0
6 10 9.0
7 9 9.0
Using factorize
# integer codes for df.col, in order of first appearance
s = pd.factorize(df.col)[0]
# map each row's next code (s + 1) to the value that code represents
pd.Series(s + 1).map(dict(zip(s, df.col)))
Out[242]:
0 5.0
1 2.0
2 2.0
3 NaN
dtype: float64
#df['newadd']=pd.Series(s+1).map(dict(zip(s,df.col))).values
Under @MartijnPieters' condition (only consecutive duplicates count as the same value):
# label each run of consecutive equal values
s = df.col.diff().ne(0).cumsum()
# map each run's next label (s + 1) to that run's value
(s + 1).map(dict(zip(s, df.col)))
Out[260]:
0 5.0
1 2.0
2 2.0
3 4.0
4 4.0
5 5.0
6 NaN
7 NaN
Name: col, dtype: float64
Setup
Added additional data with multiple clusters
df = pd.DataFrame({'col': [*map(int, '1552554442')]})
Two interpretations
We have to consider the case where non-contiguous clusters exist.
df
col
0 1 # First instance of `1` Next unique is `5`
1 5 # First instance of `5` Next unique is `2`
2 5 # Next unique is `2`
3 2 # First instance of `2` Next unique is `4` because `5` is not new
4 5 # Next unique is `4`
5 5 # Next unique is `4`
6 4 # First instance of `4` Next unique is null
7 4 # Next unique is null
8 4 # Next unique is null
9 2 # Second time we see `2` Should the next unique be null, or `4` as it was before?
Allowed to look back
Use factorize and add 1. This is very much in the spirit of @Wen's answer.
i, u = df.col.factorize()
u_ = np.append(u, -1) # Append an integer value to represent null
df.assign(addedcol=u_[i + 1])
col addedcol
0 1 5
1 5 2
2 5 2
3 2 4
4 5 2
5 5 2
6 4 -1
7 4 -1
8 4 -1
9 2 4
Only Forward
Similar to before except we'll track the cumulative maximum factorized value
i, u = df.col.factorize()
u_ = np.append(u, -1) # Append an integer value to represent null
x = np.maximum.accumulate(i)
df.assign(addedcol=u_[x + 1])
col addedcol
0 1 5
1 5 2
2 5 2
3 2 4
4 5 4
5 5 4
6 4 -1
7 4 -1
8 4 -1
9 2 -1
You'll notice that the difference is in the last value. When we can only look forward, we see that there is no next unique value.
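If you would rather have a real missing value than the -1 sentinel, a small follow-up sketch (reusing u_ and x from above):
import numpy as np
out = df.assign(addedcol=u_[x + 1])
out['addedcol'] = out['addedcol'].replace(-1, np.nan)   # swap the sentinel for NaN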

Python - Adding rows to timeseries dataset

I have a pandas dataframe containing retail sales data which shows the total number of a product sold each week and the stock left at the end of the week. Unfortunately, the dataset only shows a row when a product has been sold and the stock left changes.
I would like to bulk out the dataset so that for each week there is a line for each product being sold. I've shown an example of this below - how can this be done?
As-Is:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 3 3 7
To-Be:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 2 0 10
2 3 3 7
Create a dataframe using product from itertools with all the combinations of both columns 'Week' and 'Product' and use merge with your original data. Let's say your dataframe is called dfp:
from itertools import product
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),
                        columns=['Week', 'Product'])
           .merge(dfp, how='left'))
You get the missing row in new_dfp:
Week Product Sold Stock
0 1 1 1.0 10.0
1 1 2 1.0 10.0
2 1 3 1.0 10.0
3 2 1 2.0 8.0
4 2 2 NaN NaN
5 2 3 3.0 7.0
Now you fillna on both columns with different values:
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int) # because no sold in missing rows
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].fillna(method='ffill').astype(int)
To fill 'Stock', you need to groupby Product and use the 'ffill' method to carry over the value from the previous week. In the end, you get:
Week Product Sold Stock
0 1 1 1 10
1 1 2 1 10
2 1 3 1 10
3 2 1 2 8
4 2 2 0 10
5 2 3 3 7
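A variant of the same idea, sketched with reindex on a full Week x Product MultiIndex instead of building the grid with itertools.product (assuming each Week/Product pair occurs at most once in dfp, and each product has a Stock value before its first gap):
full_idx = pd.MultiIndex.from_product(
    [dfp['Week'].unique(), dfp['Product'].unique()], names=['Week', 'Product'])
new_dfp = (dfp.set_index(['Week', 'Product'])
              .reindex(full_idx)        # the missing Week/Product rows appear as NaN
              .reset_index())
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].ffill().astype(int)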

Conditional sum from rows into a new column in pandas

I am looking to create a new column in pandas based on the value in the row. My sample data:
df = pd.DataFrame({"A": ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
The logic I thought of:
Check the week number in each row, then sum up the data from weeks w-1, w-2 and w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
                         .apply(lambda x: x.shift(1)
                                           .rolling(3, min_periods=1)
                                           .sum())\
                         .fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 6 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
you can use a rolling sum over the last 3 values, and shift(n) to shift your column by n rows (1 in your case).
Supposing your sales column is called "Sales" as above, the code would be:
df["Last3WeekSales"] = df.groupby("A")["Sales"].apply(lambda x: x.shift(1).rolling(3).sum())
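Both answers above use a row-based window. If the window should instead be defined over week numbers, so that weeks with no sales count as zero (as in the desired output, where week 11 of group a gets 0), here is a sketch assuming Week is an integer and each (A, Week) pair occurs once:
parts = []
for _, g in df.groupby('A'):
    # fill the missing weeks with zero sales before rolling
    weeks = range(g['Week'].min(), g['Week'].max() + 1)
    weekly = g.set_index('Week')['Sales'].reindex(weeks, fill_value=0)
    prev3 = weekly.shift(1).rolling(3, min_periods=1).sum().fillna(0)
    parts.append(pd.Series(prev3.reindex(g['Week']).to_numpy(), index=g.index))
df['Last3WeekSales'] = pd.concat(parts)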
