Python GroupBy Cumsum

I have this DataFrame:
Value Month
0 1
1 2
8 3
11 4
12 5
17 6
0 7
0 8
0 9
0 10
1 11
2 12
7 1
3 2
1 3
0 4
0 5
And I want to create a new variable "Cumsum" like this:
Value Month Cumsum
0 1 0
1 2 1
8 3 9
11 4 20
12 5 32
17 6
0 7
0 8 ...
0 9
0 10
1 11
2 12
7 1 7
3 2 10
1 3 11
0 4 11
0 5 11
Sorry if my code is not clean; I failed to include my dataframe ...
My problem is that I do not have only 12 lines (one line per month) but many more.
However, I know that my table is tidy, and I want the cumulative amount up to the 12th month, restarting each time month 1 appears again.
Thank you for your help.

Try:
df['Cumsum'] = df.groupby((df.Month == 1).cumsum())['Value'].cumsum()
print(df)
Value Month Cumsum
0 0 1 0
1 1 2 1
2 8 3 9
3 11 4 20
4 12 5 32
5 17 6 49
6 0 7 49
7 0 8 49
8 0 9 49
9 0 10 49
10 1 11 50
11 2 12 52
12 7 1 7
13 3 2 10
14 1 3 11
15 0 4 11
16 0 5 11
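The trick is the grouping key: (df.Month == 1).cumsum() is True exactly on the rows where a new year starts, so its cumulative sum assigns the same label to every row up to the next month 1. You can inspect the intermediate key directly:
print((df.Month == 1).cumsum().tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]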

code:
import pandas as pd

df = pd.DataFrame({'value': [0, 1, 8, 11, 12, 17, 0, 0, 0, 0, 1, 2, 7, 3, 1, 0, 0],
                   'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5]})
temp = int(len(df) / 12)  # number of complete 12-month blocks
for i in range(temp + 1):
    start = i * 12
    if i < temp:
        end = (i + 1) * 12 - 1  # .loc slicing is inclusive of the end label
        df.loc[start:end, 'cumsum'] = df.loc[start:end, 'value'].cumsum()
    else:
        df.loc[start:, 'cumsum'] = df.loc[start:, 'value'].cumsum()
print(df)
output:
value month cumsum
0 0 1 0.0
1 1 2 1.0
2 8 3 9.0
3 11 4 20.0
4 12 5 32.0
5 17 6 49.0
6 0 7 49.0
7 0 8 49.0
8 0 9 49.0
9 0 10 49.0
10 1 11 50.0
11 2 12 52.0
12 7 1 7.0
13 3 2 10.0
14 1 3 11.0
15 0 4 11.0
16 0 5 11.0
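For reference, the same fixed-block logic can be written without the explicit loop by grouping on the integer division of the row position. This is only a sketch and assumes, as the loop does, a default 0..n-1 index and data arriving in consecutive 12-row blocks:
df['cumsum'] = df.groupby(df.index // 12)['value'].cumsum()
print(df)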

Related

Merging dataframes on two columns alternative solution

I have been trying to find an alternative (possibly more elegant) solution for the following code but without any luck. Here is my code:
import os
import pandas as pd

os.chdir(os.getcwd())
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'Temp': [0, 1, 2, 3, 4, 5]*2,
                    'Place': [12, 53, 6, 11, 9, 10, 0, 0, 0, 0, 0, 0],
                    'Place2': [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
                    'Place3': [2, 64, 24, 66, 14, 21, 0, 0, 0, 0, 0, 0]})
df2 = pd.DataFrame({'Month': [13] * 6,
                    'Temp': [0, 1, 2, 3, 4, 5],
                    'Place': [1, 22, 333, 444, 55, 6]})
# Here it creates new columns "Place_y" and "Place_x".
# I want to avoid this if possible.
df_merge = pd.merge(df1, df2, how='left',
                    left_on=['Temp', 'Month'],
                    right_on=['Temp', 'Month'])
df_merge.fillna(0, inplace=True)
add_not_nan = lambda x: x['Place_x'] if pd.isnull(x['Place_y']) else x['Place_y']
df_merge['Place'] = df_merge.apply(add_not_nan, axis=1)
df_merge.drop(['Place_x', 'Place_y'], axis=1, inplace=True)
print(df_merge)
What I am trying to accomplish is to merge the two dataframes based on the "Month" and "Temp" columns, while keeping 0s for missing values. I would like to know if there is any way to merge the dataframes without creating the _x and _y columns (basically, a way to skip the creation and deletion of those columns).
Inputs:
First dataframe
Month Temp Place Place2 Place3
0 1 0 12 1 2
1 1 1 53 0 64
2 1 2 6 23 24
3 1 3 11 14 66
4 1 4 9 9 14
5 1 5 10 8 21
6 13 0 0 0 0
7 13 1 0 0 0
8 13 2 0 0 0
9 13 3 0 0 0
10 13 4 0 0 0
11 13 5 0 0 0
Second dataframe
Month Temp Place
0 13 0 1
1 13 1 22
2 13 2 333
3 13 3 444
4 13 4 55
5 13 5 6
Outputs:
After merge
Month Temp Place_x Place2 Place3 Place_y
0 1 0 12 1 2 NaN
1 1 1 53 0 64 NaN
2 1 2 6 23 24 NaN
3 1 3 11 14 66 NaN
4 1 4 9 9 14 NaN
5 1 5 10 8 21 NaN
6 13 0 0 0 0 1.0
7 13 1 0 0 0 22.0
8 13 2 0 0 0 333.0
9 13 3 0 0 0 444.0
10 13 4 0 0 0 55.0
11 13 5 0 0 0 6.0
Final (desired)
Month Temp Place2 Place3 Place
0 1 0 1 2 0.0
1 1 1 0 64 0.0
2 1 2 23 24 0.0
3 1 3 14 66 0.0
4 1 4 9 14 0.0
5 1 5 8 21 0.0
6 13 0 0 0 1.0
7 13 1 0 0 22.0
8 13 2 0 0 333.0
9 13 3 0 0 444.0
10 13 4 0 0 55.0
11 13 5 0 0 6.0
It seems you don't need the Place column from df1, so you can just drop it before merging:
(df1.drop('Place', axis=1)
    .merge(df2, how='left', on=['Temp', 'Month'])
    .fillna({'Place': 0}))
# Month Temp Place2 Place3 Place
#0 1 0 1 2 0.0
#1 1 1 0 64 0.0
#2 1 2 23 24 0.0
#3 1 3 14 66 0.0
#4 1 4 9 14 0.0
#5 1 5 8 21 0.0
#6 13 0 0 0 1.0
#7 13 1 0 0 22.0
#8 13 2 0 0 333.0
#9 13 3 0 0 444.0
#10 13 4 0 0 55.0
#11 13 5 0 0 6.0
If you don't know in advance how many such overlapping columns there are, and you always want to keep the second dataframe's version of any non-key column that appears in both, you can mark the left-hand duplicates using the suffixes parameter of pd.merge and then drop them with pandas.DataFrame.filter. The regex .*(?<!###)$ keeps only the columns whose names do not end with the ### marker:
df1.merge(df2,
          how='left',
          left_on=['Temp', 'Month'],
          right_on=['Temp', 'Month'],
          suffixes=('###', '')).fillna(0).filter(regex='.*(?<!###)$')
OUTPUT:
Month Temp Place2 Place3 Place
0 1 0 1 2 0.0
1 1 1 0 64 0.0
2 1 2 23 24 0.0
3 1 3 14 66 0.0
4 1 4 9 14 0.0
5 1 5 8 21 0.0
6 13 0 0 0 1.0
7 13 1 0 0 22.0
8 13 2 0 0 333.0
9 13 3 0 0 444.0
10 13 4 0 0 55.0
11 13 5 0 0 6.0
Alternatively, you can filter the columns out before merging by checking which columns of df1 also exist in df2 (keeping the key columns):
cols = [col for col in df1.columns if col in ('Temp', 'Month') or col not in df2.columns]
df1[cols].merge(df2, how='left',
                left_on=['Temp', 'Month'],
                right_on=['Temp', 'Month']).fillna(0)
Month Temp Place2 Place3 Place
0 1 0 1 2 0.0
1 1 1 0 64 0.0
2 1 2 23 24 0.0
3 1 3 14 66 0.0
4 1 4 9 14 0.0
5 1 5 8 21 0.0
6 13 0 0 0 1.0
7 13 1 0 0 22.0
8 13 2 0 0 333.0
9 13 3 0 0 444.0
10 13 4 0 0 55.0
11 13 5 0 0 6.0
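If this pattern comes up repeatedly, the column selection can be wrapped in a small helper. This is just a sketch; merge_prefer_right is an illustrative name, not a pandas function:
def merge_prefer_right(left, right, keys, fill=0):
    # keep the key columns plus the columns that exist only in `left`;
    # any overlapping non-key column is taken from `right`
    cols = [c for c in left.columns if c in keys or c not in right.columns]
    return left[cols].merge(right, how='left', on=keys).fillna(fill)

print(merge_prefer_right(df1, df2, keys=['Temp', 'Month']))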

Find the days lag and replace 0 with last day lag pandas

I have a df containing employee, worked_days and sold columns.
Some employees sold only on the first day and then made another sale several days later.
My data looks like this:
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        'days': [1, 3, 3, 8, 8, 8, 3, 8, 8, 9, 9, 12],
        'sold': [1, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0, 1]}
df = pd.DataFrame(data)
df['days_lag'] = df.groupby('id')['days'].diff().fillna(0).astype('int16')
Gives me this
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 0
3 1 8 1 5
4 1 8 1 0
5 1 8 0 0
6 2 3 0 0
7 2 8 1 5
8 2 8 1 0
9 2 9 2 1
10 2 9 0 0
11 2 12 1 3
I want the results to be like below
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
How can I achieve this?
Thanks
Use GroupBy.transform to broadcast the maximum lag within each (id, days) group back to every row of that group:
In [92]: df['days_lag'] = df.groupby('id')['days'].diff().fillna(0).astype('int16')
In [96]: df['days_lag'] = df.groupby(['id', 'days'])['days_lag'].transform('max')
In [97]: df
Out[97]:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
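The two steps can also be chained into a single assignment, grouping the intermediate diff Series by the frame's id and days columns:
df['days_lag'] = (df.groupby('id')['days'].diff().fillna(0)
                    .groupby([df['id'], df['days']]).transform('max')
                    .astype('int16'))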

Pandas: join dataframe composed of different iterations

I have a DataFrame containing multiple data series in two columns (0 and 1). The data consists of different iterations of a measurement and is structured like so:
df = pd.DataFrame({
    0: ['user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10],
    1: ['iteration=0', 'y', 5, 7, 9, 12, 'iteration=1', 'y', 20, 8, 12, 12, 'iteration=2', 'y', 3, 17, 19, 112]
})
0 user iteration=0
1 x y
2 1 5
3 4 7
4 7 9
5 10 12
6 user iteration=1
7 x y
8 1 20
9 4 8
10 7 12
11 10 12
12 user iteration=2
13 x y
14 1 3
15 4 17
16 7 19
17 10 112
I want to plot x vs y grouped by iteration.
I am trying to do this by first creating a single dataframe with the iteration as a column to perform the groupby on:
1 x y iteration
2 1 5 0
3 4 7 0
4 7 9 0
5 10 12 0
8 1 20 1
9 4 8 1
10 7 12 1
11 10 12 1
14 1 3 2
15 4 17 2
16 7 19 2
17 10 112 2
To create this joined dataframe, I implemented this code:
meta = df.loc[df[0] == 'user']
lst = []
ind = 0
for index, row in meta.iterrows():
    if index == 0:  # continue to start loop from second value
        continue
    splitvalue = meta.loc[ind][1].split('=')[1]
    print(splitvalue)
    temp = df.iloc[ind:index]  # presumably a slice of df; the original `temp.iloc[...]` references temp before assignment
    temp['iteration'] = splitvalue
    ind = index
    lst.append(temp)
pd.concat(lst)
Is there a way to create this joined dataframe without creating lists of sub-dataframes? Or is there a way to plot directly from the original dataframe?
You can use:
numeric = ~pd.Series([isinstance(key, str) for key in df[0]])
iterations = df[1].where(df[1].str.contains('=').fillna(False)).ffill()
iterations = [int(key.replace('iteration=', '')) for key in iterations]
df['iterations'] = iterations
df = df.loc[numeric]
df.columns = ['x', 'y', 'iteration']
df.reset_index(drop=True, inplace=True)
print(df)
x y iteration
0 1 5 0
1 4 7 0
2 7 9 0
3 10 12 0
4 1 20 1
5 4 8 1
6 7 12 1
7 10 12 1
8 1 3 2
9 4 17 2
10 7 19 2
11 10 112 2
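To address the plotting part of the question: once the frame is tidy, you can plot x vs y per iteration directly. A minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

for it, grp in df.groupby('iteration'):
    plt.plot(grp['x'], grp['y'], marker='o', label=f'iteration={it}')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()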

Incrementing add under condition in pandas

For the following pandas dataframe
servo_in_position second_servo_in_position Expected output
0 0 1 0
1 0 1 0
2 1 2 1
3 0 3 0
4 1 4 2
5 1 4 2
6 0 5 0
7 0 5 0
8 1 6 3
9 0 7 0
10 1 8 4
11 0 9 0
12 1 10 5
13 1 10 5
14 1 10 5
15 0 11 0
16 0 11 0
17 0 11 0
18 1 12 6
19 1 12 6
20 0 13 0
21 0 13 0
22 0 13 0
I want to increment the column "Expected output" only when "servo_in_position" changes from 0 to 1. "Expected output" should also be 0 (null) whenever "servo_in_position" equals 0.
I tried
input_data['second_servo_in_position']=(input_data.servo_in_position.diff()!=0).cumsum()
but it gives the output shown in the "second_servo_in_position" column, which is not what I wanted.
After that I would like to group and calculate mean using:
print("Mean=\n\n",input_data.groupby('second_servo_in_position').mean())
Using cumsum and arithmetic.
u = df['servo_in_position']
(u.eq(1) & u.shift().ne(1)).cumsum() * u
0 0
1 0
2 1
3 0
4 2
5 2
6 0
7 0
8 3
9 0
10 4
11 0
12 5
13 5
14 5
15 0
16 0
17 0
18 6
19 6
20 0
21 0
22 0
Name: servo_in_position, dtype: int64
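The expression detects rising edges: u.eq(1) & u.shift().ne(1) is True exactly where a run of 1s starts (including a leading 1, since the shifted first value is NaN), cumsum numbers those runs, and multiplying by u zeroes the rows where the servo is not in position:
u = df['servo_in_position']
edges = u.eq(1) & u.shift().ne(1)  # True at the start of each run of 1s
print(edges.astype(int).tolist())
# [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]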
Use cumsum and mask:
df['E_output'] = df['servo_in_position'].diff().eq(1).cumsum()\
                   .mask(df['servo_in_position'] == 0, 0)
Output:
servo_in_position second_servo_in_position Expected output E_output
0 0 1 0 0
1 0 1 0 0
2 1 2 1 1
3 0 3 0 0
4 1 4 2 2
5 1 4 2 2
6 0 5 0 0
7 0 5 0 0
8 1 6 3 3
9 0 7 0 0
10 1 8 4 4
11 0 9 0 0
12 1 10 5 5
13 1 10 5 5
14 1 10 5 5
15 0 11 0 0
16 0 11 0 0
17 0 11 0 0
18 1 12 6 6
19 1 12 6 6
20 0 13 0 0
21 0 13 0 0
22 0 13 0 0
Update for the case where the first position already equals 1 (fillna makes a leading 1 count as a transition):
df['servo_in_position'].diff().fillna(df['servo_in_position']).eq(1).cumsum()\
    .mask(df['servo_in_position'] == 0, 0)
Try np.where:
import numpy as np

df['Expected_output'] = np.where(df.servo_in_position.eq(1),
                                 df.servo_in_position.diff().eq(1).cumsum(),
                                 0)
Or the same with cumsum and mul:
df.servo_in_position.diff().eq(1).cumsum().mul(df.servo_in_position.eq(1),axis=0)
Fast with Numba:
import numpy as np
from numba import njit

@njit
def f(u):
    out = np.zeros(len(u), np.int64)
    a = out[0] = u[0]
    for i in range(1, len(u)):
        if u[i] == 1:
            if u[i - 1] == 0:
                a += 1
            out[i] = a
    return out
f(df.servo_in_position.to_numpy())
array([0, 0, 1, 0, 2, 2, 0, 0, 3, 0, 4, 0, 5, 5, 5, 0, 0, 0, 6, 6, 0, 0, 0])
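For the grouped mean the asker mentions as the next step, any of these outputs can feed straight into groupby. A sketch using the first answer's result, assuming the mean should only cover rows where the servo is in position:
u = df['servo_in_position']
df['Expected_output'] = (u.eq(1) & u.shift().ne(1)).cumsum() * u
# mean per positioning event, ignoring the zero rows
print(df[df['Expected_output'] > 0].groupby('Expected_output').mean())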

Split dataframe by certain values in first column?

I have a dataframe like this one:
A C1 C2 Total
PRODUCT1 8 11 19
rs1 5 9 14
rs2 2 2 4
rs3 1 0 1
PRODUCT2 21 12 33
rs7 11 7 18
rs2 7 3 10
rs1 3 1 4
rs9 0 1 1
PRODUCT3 2 11 13
rs9 1 6 7
rs5 1 5 6
Column A is made of strings. I want to split my dataframe by the values in this column, specifically at every uppercase word. Like this:
df1 =
PRODUCT1 8 11 19
rs1 5 9 14
rs2 2 2 4
rs3 1 0 1
df2 =
PRODUCT2 21 12 33
rs7 11 7 18
rs2 7 3 10
rs1 3 1 4
rs9 0 1 1
df3 =
PRODUCT3 2 11 13
rs9 1 6 7
rs5 1 5 6
Is there an easy way to achieve this?
import pandas as pd
df = pd.DataFrame({'A': ['PRODUCT1', 'rs1', 'rs2', 'rs3', 'PRODUCT2', 'rs7', 'rs2', 'rs1', 'rs9', 'PRODUCT3', 'rs9', 'rs5'],
                   'C1': [8, 5, 2, 1, 21, 11, 7, 3, 0, 2, 1, 1],
                   'C2': [11, 9, 2, 0, 12, 7, 3, 1, 1, 11, 6, 5],
                   'Total': [19, 14, 4, 1, 33, 18, 10, 4, 1, 13, 7, 6]})
for key, group in df.groupby(df['A'].str.isupper().cumsum()):
    print(group)
prints
A C1 C2 Total
0 PRODUCT1 8 11 19
1 rs1 5 9 14
2 rs2 2 2 4
3 rs3 1 0 1
A C1 C2 Total
4 PRODUCT2 21 12 33
5 rs7 11 7 18
6 rs2 7 3 10
7 rs1 3 1 4
8 rs9 0 1 1
A C1 C2 Total
9 PRODUCT3 2 11 13
10 rs9 1 6 7
11 rs5 1 5 6
The idea here is to identify rows which are uppercase:
In [95]: df['A'].str.isupper()
Out[95]:
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
Name: A, dtype: bool
then use cumsum to take a cumulative sum, where True is treated as 1 and False is treated as 0:
In [96]: df['A'].str.isupper().cumsum()
Out[96]:
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
Name: A, dtype: int64
These values can be used as group numbers. Pass them to df.groupby to group the DataFrame according to these group numbers. df.groupby(...) returns an iterable, which lets you loop through the sub-groups.
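If you actually need separate df1, df2, df3 objects rather than just looping, you can collect the groups into a dict. A small sketch; the df1/df2/df3 key names are just illustrative:
dfs = {f'df{key}': group.reset_index(drop=True)
       for key, group in df.groupby(df['A'].str.isupper().cumsum())}
print(dfs['df1'])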
