I just want to know how to get the sum of the last 5th values based on id from every rows.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm trying to find the sum of values from the last 5th rows from every rows with id a only.
I tried to use code df.apply(lambda x: x.tail(5)), but that only showed me last 5 rows from the very last row of the entire df. I want to get the sum of last nth rows from every and each rows. Basically it's like rolling_sum for time series data.
you can calculate the sum of the last 5 as like this:
df["rolling As"] = df[df['id'] == 'a'].rolling(window=5).sum()["values"]
(this includes the current row as one of the 5. not sure if that is what you want)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included. you can shift
df["rolling"] = df[df['id'] == 'a'].rolling(window=5).sum()["values"].shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
.transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0
Related
I have the following dataframe, which the value should be increasing. Originally the dataframe has some unknown values.
index
value
0
1
1
2
3
2
4
5
6
7
4
8
9
10
3
11
3
12
13
14
15
5
Based on the assumsion that the value should be increasing, I would like to remove the value at index 10 and 11. This would be the desired dataframe:
index
value
0
1
1
2
3
2
4
5
6
7
4
8
9
12
13
14
15
5
Thank you very much
Assuming NaN in the empty cells (if not, temporarily replace them with NaN), use boolean indexing:
# if not NaNs uncomment below
# and use s in place of df['value'] afterwards
# s = pd.to_numeric(df['value'], errors='coerce')
# is the cell empty?
m1 = df['value'].isna()
# are the values strictly increasing?
m2 = df['value'].ge(df['value'].cummax())
out = df[m1|m2]
Output:
index value
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
Try this:
def del_df(df):
df_no_na = df.dropna().reset_index(drop = True)
num_tmp = df_no_na['value'][0] # First value which is not NaN.
del_index_list = [] # indicies to delete
for row_index in range(1, len(df_no_na)):
if df_no_na['value'][row_index] > num_tmp : #Increasing
num_tmp = df_no_na['value'][row_index] # to compare following two values.
else : # Not increasing(same or decreasing)
del_index_list.append(df_no_na['index'][row_index]) # index to delete
df_goal = df.drop([df.index[i] for i in del_index_list])
return df_goal
output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
I've following data.
Date Item_1
15-03-2021 10
16-03-2021 20
17-03-2021 NaN
18-03-2021 NaN
19-03-2021 NaN
20-03-2021 NaN
21-03-2021 NaN
22-03-2021 10
23-03-2021 30
24-03-2021 NaN
I'm trying to calculate moving avergae while ignoring the NaN values. To do that I followed below approach.
df.rolling(3,on='Date',min_periods=1).mean()
With this I'm getting partially desired result.
Date Item_1
15-03-2021 10
16-03-2021 15
17-03-2021 15
18-03-2021 20
19-03-2021 NaN
20-03-2021 NaN
21-03-2021 NaN
22-03-2021 10
23-03-2021 20
24-03-2021 20
But as window size is 3 the result I want is :
Date Item_1
17-03-2021 15
18-03-2021 20
19-03-2021 NaN
20-03-2021 NaN
21-03-2021 NaN
22-03-2021 10
23-03-2021 20
24-03-2021 20
is there any way to achieve this?
You can filter after rolling by DataFrame.iloc:
N = 3
df1 = df.rolling(N,on='Date',min_periods=1).mean().iloc[N-1:]
print (df1)
Date Item_1
2 17-03-2021 15.0
3 18-03-2021 20.0
4 19-03-2021 NaN
5 20-03-2021 NaN
6 21-03-2021 NaN
7 22-03-2021 10.0
8 23-03-2021 20.0
9 24-03-2021 20.0
Try with window = '3d' instead:
>>> df.rolling('3d', on = 'Date').mean().iloc[2:]
Date Item_1
2 2021-03-17 15.0
3 2021-03-18 20.0
4 2021-03-19 NaN
5 2021-03-20 NaN
6 2021-03-21 NaN
7 2021-03-22 10.0
8 2021-03-23 20.0
9 2021-03-24 20.0
c = 0
number = 0
a = input()
for i in range(10):
b = input().split()
if b[1] != "NaN":
c = c + int(b[1])
number += 1
min = c / number
print(min)
I have a dataframe with thousand records as:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that i want is :
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A Nan Nan Nan Nan Nan Nan
2 11 12 2/2020 5 A Nan Nan Nan Nan Nan Nan
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A Nan Nan Nan Nan Nan Nan
Nan Nan Nan Nan Nan Nan 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A Nan Nan Nan Nan Nan Nan
The idea is to iterate over row , if the type is B , put the row next to the first record with type A and from = TO ,
if the price are equals its ok , if its not split the row with higher price , and the new price will be soustracted.
i divise the dataframe in type A and B , and im trying to iterate both of them
grp = df.groupby('type')
transformed_df_list = []
for idx, frame in grp:
frame.reset_index(drop=True, inplace=True)
transformed_df_list.append(frame.copy())
A = pd.DataFrame([transformed_df_list[0])
B= pd.DataFrame([transformed_df_list[1])
for i , row in A.iterrows():
for i, row1 in B.iterrows():
if row['to'] == row1['from']:
if row['price'] == row1['price']:
row_df = pd.DataFrame([row1])
output = pd.merge(A ,B, how='left' , left_on =['to'] , right_on =['from'] )
The problem is that with merge function a get several duplicate rows and i cant check the price to split the row ?
There is way to insert B row in A dataframe witout merge function ?
I have dataframe , i want to change loc 5 rows before and 5 rows after flag value is 1.
df=pd.DataFrame({'A':[2,1,3,4,7,8,11,1,15,20,15,16,87],
'flag':[0,0,0,0,0,1,1,1,0,0,0,0,0]})
expect_output
df1_before =pd.DataFrame({'A':[1,3,4,7,8],
'flag':[0,0,0,0,1]})
df1_after =pd.DataFrame({'A':[8,11,1,15,20],
'flag':[1,1,1,0,0]})
do same process for all three flag 1
I think one easy way is to loop over the index where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
dfb = df.loc[max(idx-4,0):idx]
dfa = df.loc[idx:min(idx+4,l)]
#do stuff
the min and max function are to ensure the boundary are not passed in case you have a flag=1 within the first or last 5 rows. Note also that with loc, if you want 5 rows, you need to use +/-4 on idx to get the right segment.
That said, depending on what your actual #do stuff is, you might want to change tactic. Let's say for example, you want to calculate the difference between the sum of A over the 5 rows after and the 5 rows before. you could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print (df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0 #=55-23, 55 is the sum of A of df_after and 23 df_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN
I am trying to join a fragment of a dataframe with another one. The structure of the dataframe to join is simplified below:
left:
ID f1 TIME
1 10 1
3 10 1
7 10 1
9 10 2
2 10 2
1 10 2
3 10 2
right:
ID f2 f3
1 0 11
7 9 11
I need to select the left dataset by time, and I need to attached the right one, the result I would like to have is the following:
left:
ID f1 TIME f2 f3
1 10 1 0 11
3 10 1 nan nan
7 10 1 9 11
9 10 2 nan nan
2 10 2 nan nan
1 10 2 nan nan
3 10 2 nan nan
Currently I am usually joining dataframes in this way:
left = left.join(right.set_index('ID'), on='ID')
In this case I am using:
left[left.TIME == 1] = left[left.TIME == 1].join(right.set_index('ID'), on='ID')
I have also tried with merge, but the result is the left dataframe without any of the other columns.
Finally the structure of my script need to do this for every unique TIME in the dataframe, thus:
for t in numpy.unique(left.TIME):
#do join on the fragment left.TIME == t
If I save the returned value from the join function in a new dataframe everything works fine, but trying to add the value at the left dataframe does not work.
EDIT: The IDs of the left dataset can be present multiple times, but not inside the same TIME value.
You can filter first by boolean indexing, merge and concat last:
df1 = left[left['TIME']==1]
#alternative
#df1 = left.query('TIME == 1')
df2 = left[left['TIME']!=1]
#alternative
#df2 = left.query('TIME != 1')
df = pd.concat([df1.merge(right, how='left'), df2])
print (df)
ID TIME f1 f2 f3
0 1 1 10 0.0 11.0
1 3 1 10 NaN NaN
2 7 1 10 9.0 11.0
3 9 2 10 NaN NaN
4 2 2 10 NaN NaN
5 1 2 10 NaN NaN
6 3 2 10 NaN NaN
EDIT: merge create default indices, so possible solution is create column first and then set to index:
print (left)
ID f1 TIME
10 1 10 1
11 3 10 1
12 7 10 1
13 9 10 2
14 2 10 2
15 1 10 2
16 3 10 2
#df = left.merge(right, how='left')
df1 = left[left['TIME']==1]
df2 = left[left['TIME']!=1]
df = pd.concat([df1.reset_index().merge(right, how='left').set_index('index'), df2])
print (df)
ID TIME f1 f2 f3
10 1 1 10 0.0 11.0
11 3 1 10 NaN NaN
12 7 1 10 9.0 11.0
13 9 2 10 NaN NaN
14 2 2 10 NaN NaN
15 1 2 10 NaN NaN
16 3 2 10 NaN NaN
EDIT:
After discussion after modify input data is possible use:
df = left.merge(right, how='left', on=['ID','TIME'])
This is one way:
res = left.drop_duplicates('ID')\
.merge(right, how='left')\
.append(left[left.duplicated(subset=['ID'])])
# ID TIME f1 f2 f3
# 0 1 1 10 0.0 11.0
# 1 3 1 10 NaN NaN
# 2 7 1 10 9.0 11.0
# 3 9 2 10 NaN NaN
# 4 2 2 10 NaN NaN
# 5 1 2 10 NaN NaN
# 6 3 2 10 NaN NaN
Note that columns f2 and f3 become float since NaN is considered a float.