I have the following data:
Date Item_1
15-03-2021 10
16-03-2021 20
17-03-2021 NaN
18-03-2021 NaN
19-03-2021 NaN
20-03-2021 NaN
21-03-2021 NaN
22-03-2021 10
23-03-2021 30
24-03-2021 NaN
I'm trying to calculate a moving average while ignoring the NaN values. To do that I followed the approach below:
df.rolling(3,on='Date',min_periods=1).mean()
With this I'm getting a partially desired result:
Date Item_1
15-03-2021 10
16-03-2021 15
17-03-2021 15
18-03-2021 20
19-03-2021 NaN
20-03-2021 NaN
21-03-2021 NaN
22-03-2021 10
23-03-2021 20
24-03-2021 20
But as the window size is 3, the result I want is:
Date Item_1
17-03-2021 15
18-03-2021 20
19-03-2021 NaN
20-03-2021 NaN
21-03-2021 NaN
22-03-2021 10
23-03-2021 20
24-03-2021 20
Is there any way to achieve this?
You can filter the result after rolling with DataFrame.iloc:
N = 3
df1 = df.rolling(N,on='Date',min_periods=1).mean().iloc[N-1:]
print (df1)
Date Item_1
2 17-03-2021 15.0
3 18-03-2021 20.0
4 19-03-2021 NaN
5 20-03-2021 NaN
6 21-03-2021 NaN
7 22-03-2021 10.0
8 23-03-2021 20.0
9 24-03-2021 20.0
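Put together on the question's data, a runnable sketch of this approach (the Date strings are parsed with pd.to_datetime here, which `on=` accepts in any pandas version):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["15-03-2021", "16-03-2021", "17-03-2021", "18-03-2021", "19-03-2021",
         "20-03-2021", "21-03-2021", "22-03-2021", "23-03-2021", "24-03-2021"],
        format="%d-%m-%Y"),
    "Item_1": [10, 20, np.nan, np.nan, np.nan, np.nan, np.nan, 10, 30, np.nan],
})
N = 3
# min_periods=1 keeps any window containing at least one non-NaN value;
# iloc[N-1:] then drops the first N-1 partially-filled windows
df1 = df.rolling(N, on="Date", min_periods=1).mean().iloc[N - 1:]
```

Rows whose entire window is NaN stay NaN, which matches the desired output.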
Try with window = '3d' instead:
>>> df.rolling('3d', on = 'Date').mean().iloc[2:]
Date Item_1
2 2021-03-17 15.0
3 2021-03-18 20.0
4 2021-03-19 NaN
5 2021-03-20 NaN
6 2021-03-21 NaN
7 2021-03-22 10.0
8 2021-03-23 20.0
9 2021-03-24 20.0
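Note that an offset window like '3d' requires the on= column to be datetime; with the dd-mm-yyyy strings from the question, a minimal sketch would parse them first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["15-03-2021", "16-03-2021", "17-03-2021", "18-03-2021", "19-03-2021",
             "20-03-2021", "21-03-2021", "22-03-2021", "23-03-2021", "24-03-2021"],
    "Item_1": [10, 20, np.nan, np.nan, np.nan, np.nan, np.nan, 10, 30, np.nan],
})
# an offset window only works on a real datetime column
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
# '3d' covers the current day plus the two days before it;
# min_periods defaults to 1 for offset windows
out = df.rolling("3d", on="Date").mean().iloc[2:]
```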
c = 0
number = 0
a = input()  # skip the header line
for i in range(10):
    b = input().split()
    if b[1] != "NaN":
        c = c + int(b[1])
        number += 1
mean = c / number  # overall mean of the non-NaN values, not a moving average
print(mean)
I have a multiline string (and not a text file) like this:
x = '''
Index Value Max Min State
0 10 nan nan nan
1 20 nan nan nan
2 15 nan nan nan
3 25 20 10 1
4 15 25 15 2
5 10 25 15 4
6 15 20 10 3
'''
The whitespace between the columns is unequal.
I want to replace the whitespace with a comma, but keep the end-of-line.
So the result would look like this:
Index,Value,Max,Min,State
0,10,nan,nan,nan
1,20,nan,nan,nan
2,15,nan,nan,nan
3,25,20,10,1
4,15,25,15,2
5,10,25,15,4
6,15,20,10,3
...or alternatively as a pandas dataframe.
What I have tried:
I can use replace() with different numbers of spaces, but then I need to count the whitespace in each column.
I can use the re module (from this re.sub question), but it converts the whole string to one line, whereas I need to keep the multiple lines (the end-of-line characters).
Try with StringIO
from io import StringIO
import pandas as pd
x = '''
Index Value Max Min State
0 10 nan nan nan
1 20 nan nan nan
2 15 nan nan nan
3 25 20 10 1
4 15 25 15 2
5 10 25 15 4
6 15 20 10 3
'''
df = pd.read_csv(StringIO(x), sep=r'\s\s+', engine='python')
Index Value Max Min State
0 0 10 NaN NaN NaN
1 1 20 NaN NaN NaN
2 2 15 NaN NaN NaN
3 3 25 20.0 10.0 1.0
4 4 15 25.0 15.0 2.0
5 5 10 25.0 15.0 4.0
6 6 15 20.0 10.0 3.0
Since you tagged pandas, you can try:
out = ('\n'.join(pd.Series(x.split('\n')).str.strip().str.replace(r'\s+', ',', regex=True)))
Output (note that there are leading and trailing blank lines because your x has them):
Index,Value,Max,Min,State
0,10,nan,nan,nan
1,20,nan,nan,nan
2,15,nan,nan,nan
3,25,20,10,1
4,15,25,15,2
5,10,25,15,4
6,15,20,10,3
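To address the concern about re collapsing everything into one line: a character class like [ \t]+ matches runs of spaces and tabs but not newlines, so substituting per line keeps the line structure (csv_text is a hypothetical name):

```python
import re

x = '''
Index Value Max Min State
0 10 nan nan nan
1 20 nan nan nan
2 15 nan nan nan
3 25 20 10 1
4 15 25 15 2
5 10 25 15 4
6 15 20 10 3
'''

# [ \t]+ matches runs of spaces/tabs but never a newline, so the
# per-line structure survives; strip() drops the blank framing lines
csv_text = "\n".join(
    re.sub(r"[ \t]+", ",", line.strip()) for line in x.strip().splitlines()
)
```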
Below is an example of my dataset:
[index] [pressure] [flow rate]
0 Nan 0
1 Nan 0
2 3 25
3 5 35
4 6 42
5 Nan 44
6 Nan 46
7 Nan 0
8 5 33
9 4 26
10 3 19
11 Nan 0
12 Nan 0
13 Nan 39
14 Nan 36
15 Nan 41
I would like to find a polynomial relationship between the pressure and the flow rate over the rows where both are present (in this example, index 2 to index 4). Then I need to fill in the NaN pressure values from that relationship, up to the point where both columns have data again (here, index 8 to index 10). At that point I need to fit a new polynomial relationship between pressure and flow rate and use it to extend the pressure values up to the next available data, and so on.
I appreciate any advice on how best to accomplish that.
You can interpolate:
df['[pressure 2]'] = df.set_index('[flow rate]')['[pressure]'].interpolate('polynomial', order=2).values
Output
[index] [pressure] [flow rate] [pressure 2]
0 0 2.0 21 2.000000
1 1 4.0 29 4.000000
2 2 3.0 25 3.000000
3 3 5.0 35 5.000000
4 4 6.0 42 6.000000
5 5 NaN 44 6.000000
6 6 NaN 46 NaN
7 7 NaN 50 NaN
8 8 5.0 33 5.000000
9 9 4.0 26 4.000000
10 10 3.0 19 3.000000
11 11 6.0 44 6.000000
12 12 NaN 41 5.915690
13 13 NaN 39 5.578449
14 14 NaN 36 5.044156
15 15 NaN 40 5.775173
NB: the remaining NaNs cannot be interpolated unambiguously; you can ffill them if needed.
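If you specifically want the explicit polynomial-fit route the question describes, a minimal sketch on the question's own table fits one global quadratic pressure = f(flow rate) over the rows where both are present and fills the gaps from it. The per-segment refits you describe would apply the same polyfit/polyval step to each contiguous run of known values:

```python
import numpy as np
import pandas as pd

# the question's table: pressure is only known at indices 2-4 and 8-10
df = pd.DataFrame({
    "pressure": [np.nan, np.nan, 3, 5, 6, np.nan, np.nan, np.nan,
                 5, 4, 3, np.nan, np.nan, np.nan, np.nan, np.nan],
    "flow": [0, 0, 25, 35, 42, 44, 46, 0, 33, 26, 19, 0, 0, 39, 36, 41],
})
known = df["pressure"].notna()
# quadratic least-squares fit on the rows where both columns are present
coeffs = np.polyfit(df.loc[known, "flow"], df.loc[known, "pressure"], deg=2)
# predict pressure from flow rate on the remaining rows
df.loc[~known, "pressure"] = np.polyval(coeffs, df.loc[~known, "flow"])
```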
I have a dataframe with thousands of records:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that I want is:
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A Nan Nan Nan Nan Nan Nan
2 11 12 2/2020 5 A Nan Nan Nan Nan Nan Nan
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A Nan Nan Nan Nan Nan Nan
Nan Nan Nan Nan Nan Nan 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A Nan Nan Nan Nan Nan Nan
The idea is to iterate over the rows: if the Type is B, put the row next to the first record with Type A whose from equals its to. If the prices are equal, that's fine; if not, split the row with the higher price, and the new price is the difference. I split the dataframe into Type A and Type B, and I'm trying to iterate over both of them:
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())
A = pd.DataFrame(transformed_df_list[0])
B = pd.DataFrame(transformed_df_list[1])
for i, row in A.iterrows():
    for j, row1 in B.iterrows():
        if row['to'] == row1['from']:
            if row['price'] == row1['price']:
                row_df = pd.DataFrame([row1])
output = pd.merge(A, B, how='left', left_on=['to'], right_on=['from'])
The problem is that with the merge function I get several duplicate rows, and I can't check the price to split the row. Is there a way to insert the B rows into the A dataframe without the merge function?
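One merge-free direction is to iterate the B rows, find the first A row with a matching to/from, and split the A row when its price is higher. A minimal sketch on the ID 1 / ID 5 pair from the example (out_rows and the single-row frames are hypothetical names, and only the A side is split here; splitting a higher-priced B row would be the mirror image):

```python
import pandas as pd

# one A row and one B row from the example (ID 1 pairs with ID 5)
A = pd.DataFrame({"ID": [1], "to": [69], "from": [18], "price": [10], "Type": ["A"]})
B = pd.DataFrame({"ID": [5], "to": [12], "from": [69], "price": [4], "Type": ["B"]})

out_rows = []  # (A-side dict or None, B-side dict or None) pairs
for _, b in B.iterrows():
    cand = A.index[A["to"] == b["from"]]
    if len(cand) == 0:
        out_rows.append((None, b.to_dict()))  # B row with no partner
        continue
    a = A.loc[cand[0]]
    matched = a.to_dict()
    matched["price"] = min(a["price"], b["price"])  # matched part at the common price
    out_rows.append((matched, b.to_dict()))
    diff = a["price"] - b["price"]
    if diff > 0:  # leftover of the split A row, on its own line
        rest = a.to_dict()
        rest["price"] = diff
        out_rows.append((rest, None))
```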
I have a dataframe, and I want to use loc to get the 5 rows before and the 5 rows after each row where the flag value is 1.
df=pd.DataFrame({'A':[2,1,3,4,7,8,11,1,15,20,15,16,87],
'flag':[0,0,0,0,0,1,1,1,0,0,0,0,0]})
Expected output:
df1_before =pd.DataFrame({'A':[1,3,4,7,8],
'flag':[0,0,0,0,1]})
df1_after =pd.DataFrame({'A':[8,11,1,15,20],
'flag':[1,1,1,0,0]})
Do the same for all three rows where flag is 1.
I think one easy way is to loop over the indices where the flag is 1 and select the rows you want with loc:
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    dfb = df.loc[max(idx-4, 0):idx]
    dfa = df.loc[idx:min(idx+4, l)]
    # do stuff
The min and max functions ensure the boundaries are not overrun in case you have a flag=1 within the first or last 5 rows. Note also that with loc, which slices inclusively at both ends, you need +/-4 on idx to get exactly 5 rows.
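Run against the sample frame, collecting the slices reproduces the expected df1_before and df1_after (before/after are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
before, after = [], []
l = len(df)
for idx in df[df.flag.astype(bool)].index:
    before.append(df.loc[max(idx-4, 0):idx])  # 5 rows ending at the flag
    after.append(df.loc[idx:min(idx+4, l)])   # 5 rows starting at the flag
```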
That said, depending on what your actual # do stuff is, you might want to change tactics. Say, for example, you want to calculate the difference between the sum of A over the 5 rows after and the 5 rows before; you could use rolling and shift:
df['roll'] = df.rolling(5)['A'].sum()
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
print (df)
A flag roll diff_roll
0 2 0 NaN NaN
1 1 0 NaN NaN
2 3 0 NaN NaN
3 4 0 NaN NaN
4 7 0 17.0 NaN
5 8 1 23.0 32.0 # = 55 - 23; 55 is the sum of A over df_after, 23 over df_before
6 11 1 33.0 29.0
7 1 1 31.0 36.0
8 15 0 42.0 NaN
9 20 0 55.0 NaN
10 15 0 62.0 NaN
11 16 0 67.0 NaN
12 87 0 153.0 NaN
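A self-contained run of the rolling/shift version, checked against the table above:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
# sum of A over each 5-row window ending at the current row
df['roll'] = df.rolling(5)['A'].sum()
# shift(-4) lines up the window that ends 4 rows later, i.e. the 5 rows after
df.loc[df.flag.astype(bool), 'diff_roll'] = df['roll'].shift(-4) - df['roll']
```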
I just want to know how to get, for every row, the sum of the last 5 values with the same id.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm trying to find the sum of the values over the last 5 rows, for every row, with id a only.
I tried df.apply(lambda x: x.tail(5)), but that only gave me the last 5 rows of the entire df. I want the sum of the last n rows for each and every row. Basically it's like a rolling sum for time-series data.
You can calculate the sum of the last 5 like this:
df["rolling As"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum()
(This includes the current row as one of the 5; not sure if that is what you want.)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included, you can shift:
df["rolling"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum().shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
.transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0
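Reconstructed end to end on the question's data, this reproduces the expected column:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "c", "d", "a", "a", "a", "a", "a", "a"],
    "values": [5, 10, 10, 2, 2, 2, 5, 10, 20, 10, 15, 20],
})
# per id: sum over a 5-row window, then shift so each row sees the
# sum of the 5 previous rows of its own group, not including itself
df["sum(x.tail(5))"] = df.groupby("id")["values"].transform(
    lambda x: x.rolling(5, min_periods=5).sum().shift()
)
```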