Below is an example of my dataset:
[index] [pressure] [flow rate]
0 Nan 0
1 Nan 0
2 3 25
3 5 35
4 6 42
5 Nan 44
6 Nan 46
7 Nan 0
8 5 33
9 4 26
10 3 19
11 Nan 0
12 Nan 0
13 Nan 39
14 Nan 36
15 Nan 41
I would like to fit a polynomial relationship between pressure and flow rate over the rows where both are present (in this example, indices 2 to 4). I then need to fill the NaN pressure values from that relationship, using the flow rate, up to the point where both columns are present again (here, indices 8 to 10). At that point I need to fit a new polynomial relationship between pressure and flow rate and use it to extend the pressure values up to the next complete block of data, and so on.
I would appreciate any advice on how best to accomplish this.
You can interpolate:
df['[pressure 2]'] = df.set_index('[flow rate]')['[pressure]'].interpolate('polynomial', order=2).values
Output
[index] [pressure] [flow rate] [pressure 2]
0 0 2.0 21 2.000000
1 1 4.0 29 4.000000
2 2 3.0 25 3.000000
3 3 5.0 35 5.000000
4 4 6.0 42 6.000000
5 5 NaN 44 6.000000
6 6 NaN 46 NaN
7 7 NaN 50 NaN
8 8 5.0 33 5.000000
9 9 4.0 26 4.000000
10 10 3.0 19 3.000000
11 11 6.0 44 6.000000
12 12 NaN 41 5.915690
13 13 NaN 39 5.578449
14 14 NaN 36 5.044156
15 15 NaN 40 5.775173
NB: the remaining NaNs cannot be interpolated without ambiguity; you can ffill them if needed.
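If you specifically need the per-segment fits described in the question (fit pressure against flow rate on each block of rows where both are present, then predict pressure from flow rate for the NaN rows that follow, until the next complete block), here is a rough sketch using numpy.polyfit. The column names 'pressure' and 'flow rate' and the helper fill_pressure are assumptions, not taken from your code:
import numpy as np
import pandas as pd

def fill_pressure(df, order=2):
    df = df.copy()
    known = df['pressure'].notna()
    # label consecutive runs of rows with / without pressure data
    run_id = (known != known.shift()).cumsum()
    coeffs = None
    for _, block in df.groupby(run_id, sort=False):
        if block['pressure'].notna().all():
            # refit on every complete block; lower the degree if the
            # block has too few points for the requested order
            deg = min(order, len(block) - 1)
            coeffs = np.polyfit(block['flow rate'], block['pressure'], deg)
        elif coeffs is not None:
            # predict pressure from flow rate with the last fitted polynomial
            df.loc[block.index, 'pressure'] = np.polyval(coeffs, block['flow rate'])
    return df
Leading NaN rows (before the first complete block) are left untouched, since there is no fit to extrapolate from yet.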
I've tried merging two dataframes, but I can't seem to get it to work: each time I merge, the rows where I expect values are all 0. Dataframe df1 already has some data in it, with some rows left blank. Dataframe df2 should populate those blank rows in df1 where the column names match, at each value of "TempBin" and "Month" in df1.
EDIT:
Both dataframes are built inside a for loop. df1 acts as my "storage"; df2 changes for each location iteration. So if df2 contained the results for LocationZP, I would also want that data inserted into the matching df1 rows. If I use df1 = df1.append(df2) in the for loop, all of the rows from df2 keep getting appended at the very end of df1 on each iteration.
df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 0 0 0
13 1 0 0 0
13 2 0 0 0
13 3 0 0 0
13 4 0 0 0
13 5 0 0 0
df2:
Month TempBin LocationAA
13 0 11
13 1 22
13 2 33
13 3 44
13 4 55
13 5 66
desired output in df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 11 0 0
13 1 22 0 0
13 2 33 0 0
13 3 44 0 0
13 4 55 0 0
13 5 66 0 0
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})
df1 = pd.merge(df1, df2, on=["Month","TempBin","LocationAA"], how="left")
result:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 NaN NaN NaN
13 1 NaN NaN NaN
13 2 NaN NaN NaN
13 3 NaN NaN NaN
13 4 NaN NaN NaN
13 5 NaN NaN NaN
Here's some code that worked for me:
# Merge the two dataframes on the columns "TempBin" and "Month", filling NaN values with 0.
import pandas as pd

df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]})
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})
df_merge = pd.merge(df1, df2, how='left',
                    left_on=['TempBin', 'Month'],
                    right_on=['TempBin', 'Month'])
# add column LocationAA, taking the value from LocationAA_y where it exists
# and falling back to LocationAA_x otherwise
df_merge['LocationAA'] = df_merge.apply(
    lambda x: x['LocationAA_x'] if pd.isnull(x['LocationAA_y']) else x['LocationAA_y'],
    axis=1)
# remove the intermediate columns LocationAA_x and LocationAA_y
df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1, inplace=True)
# fill any remaining NaN values with 0 (do this after the apply,
# otherwise the null check above never triggers)
df_merge.fillna(0, inplace=True)
print(df_merge)
Output:
Month TempBin LocationXA LocationZP LocationAA
0 1 0 1.0 2.0 7.0
1 1 1 0.0 89.0 98.0
2 1 2 23.0 38.0 12.0
3 1 3 14.0 17.0 3.0
4 1 4 9.0 14.0 7.0
5 1 5 8.0 99.0 1.0
6 13 0 0.0 0.0 11.0
7 13 1 0.0 0.0 22.0
8 13 2 0.0 0.0 33.0
9 13 3 0.0 0.0 44.0
10 13 4 0.0 0.0 55.0
11 13 5 0.0 0.0 66.0
Let me know in the comments if there's something you don't understand :)
PS: Sorry for the extra comments, but I left them in for additional explanation.
You need to use append to get the desired output:
df1 = df1.append(df2)
and if you want to replace the nulls with zeros, add:
df1 = df1.fillna(0)
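Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the equivalent of the two lines above using concat would be:
df1 = pd.concat([df1, df2]).fillna(0)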
Here is another way using combine_first()
i = ['Month','TempBin']
df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
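For the loop scenario in the edit, where each iteration's df2 should only overwrite the matching cells of df1, DataFrame.update is another option. A sketch, where per_location_frames is just a placeholder for however you produce each df2:
i = ['Month', 'TempBin']
df1 = df1.set_index(i)
for df2 in per_location_frames:
    # overwrite matching rows/columns of df1 in place, leave the rest untouched
    df1.update(df2.set_index(i))
df1 = df1.reset_index()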
I need to apply a rolling mean to a column, as shown in pic1 (column s3). After I apply the rolling mean with window=5, I get the correct answer, but the first 4 rows are left empty, as shown in pic2 (column sa3).
I want to fill the first 4 empty cells in pic2 (sa3) with the mean of all the data in pic1 (s3) up to the current row, as shown in pic3 (a3).
How can I do this with a simple function, besides the rolling mean method?
I think you need the parameter min_periods=1 in rolling:
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1.
df = df.rolling(5, min_periods=1).mean()
Sample:
import numpy as np
import pandas as pd

np.random.seed(1256)
df = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('abcde'))
print (df)
a b c d e
0 1 5 8 8 9
1 3 6 3 0 6
2 7 0 1 5 1
3 6 6 5 0 4
4 4 9 4 6 1
5 7 7 5 8 3
6 0 7 2 8 2
7 4 8 3 5 5
8 8 2 0 9 2
9 4 7 1 5 1
df = df.rolling(5, min_periods=1).mean()
print (df)
a b c d e
0 1.000000 5.000000 8.00 8.000000 9.000000
1 2.000000 5.500000 5.50 4.000000 7.500000
2 3.666667 3.666667 4.00 4.333333 5.333333
3 4.250000 4.250000 4.25 3.250000 5.000000
4 4.200000 5.200000 4.20 3.800000 4.200000
5 5.400000 5.600000 3.60 3.800000 3.000000
6 4.800000 5.800000 3.40 5.400000 2.200000
7 4.200000 7.400000 3.80 5.400000 3.000000
8 4.600000 6.600000 2.80 7.200000 2.600000
9 4.600000 6.200000 2.20 7.000000 2.600000
So you want to add:
df['sa3'].fillna(df['s3'].mean(), inplace=True)
Hopefully I used correct column names.
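If you want the mean of all rows up to the current one (as described in the question) rather than the overall mean, expanding() gives a running mean. A sketch using the column names from the question:
df['sa3'] = df['sa3'].fillna(df['s3'].expanding().mean())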
You can use pandas to find the rolling mean and then fill the NaN with zero.
Use something like the following:
import pandas as pd

col = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(col)
# rolling mean over a 5-row window; the first 4 rows have no full window,
# so replace their NaN with 0
df['rm'] = df[0].rolling(5).mean().fillna(0)
print(df)
0 rm
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 3.0
5 6 4.0
6 7 5.0
7 8 6.0
8 9 7.0
I see that some of the answers deal with nulls by replacing them with the mean, and others compute a rolling mean but don't use it to replace the nulls. So I figured out the code myself and am posting it here.
df['Col'] = df['Col'].fillna(df['Col'].rolling(4, center=True, min_periods=1).mean())
4 is the length of the rolling window.
center=True means the window is centred on the null value, so the replacement uses values both above and below it.
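For illustration, a small made-up example (the column name Col and the values are just for the demo):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0]})
# each NaN is replaced by the mean of the non-NaN values inside a
# 4-row window around it
df['Col'] = df['Col'].fillna(df['Col'].rolling(4, center=True, min_periods=1).mean())
print(df)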
Given the following dataframe df, where df['B']=df['M1']+df['M2']:
A M1 M2 B
1 1 2 3
1 2 NaN NaN
1 3 6 9
1 4 8 12
1 NaN 10 NaN
1 6 12 18
I want the NaN in column B to equal the corresponding value in M1 or M2 provided that the latter is not NaN:
A M1 M2 B
1 1 2 3
1 2 NaN 2
1 3 6 9
1 4 8 12
1 NaN 10 10
1 6 12 18
This answer suggested using:
df.loc[df['B'].isnull(),'B'] = df['M1']
but this line can only take values from one column (M1 or M2), not both at the same time.
Any ideas on how I should change it to consider both columns?
EDIT
Not a duplicate question! For ease of understanding, I claimed that df['B']=df['M1']+df['M2'], but in my real case, df['B'] is not a sum and comes from a rather complicated computation. So I cannot apply a simple formula to df['B']: all I can do is change the NaN values to match the corresponding value in either M1 or M2.
Based on our discussion above in the comments:
df.B = df.B.fillna(df[['M1','M2']].max(1))
df
Out[52]:
A M1 M2 B
0 1 1.0 2.0 3.0
1 1 2.0 NaN 2.0
2 1 3.0 6.0 9.0
3 1 4.0 8.0 12.0
4 1 NaN 10.0 10.0
5 1 6.0 12.0 18.0
From jezrael
df['B'] = (df['M1'] + df['M2']).fillna(df[['M2','M1']].sum(1))
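Another option along the same lines: chained fillna takes M1 where it is present and falls back to M2 (a sketch with the same column names):
df['B'] = df['B'].fillna(df['M1']).fillna(df['M2'])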
It's really annoying that I can't find a way to combine several rows or columns by computing their means, standard deviations, or some other statistic. Could someone give me an idea? Thanks!
I think you can group by the index floor-divided by 10 and aggregate with mean or std:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(10, size=(5,5)),index=[1971,1972,1981,1982,1991])
print (df)
0 1 2 3 4
1971 5 8 9 5 0
1972 0 1 7 6 9
1981 2 4 5 2 4
1982 2 4 7 7 9
1991 1 7 0 6 9
print (df.index // 10)
Int64Index([197, 197, 198, 198, 199], dtype='int64')
df1 = df.groupby([df.index // 10]).mean()
df1.index = df1.index.astype(str) + '0s'
print (df1)
0 1 2 3 4
1970s 2.5 4.5 8.0 5.5 4.5
1980s 2.0 4.0 6.0 4.5 6.5
1990s 1.0 7.0 0.0 6.0 9.0
df1 = df.groupby([df.index // 10]).std()
df1.index = df1.index.astype(str) + '0s'
print (df1)
0 1 2 3 4
1970s 3.535534 4.949747 1.414214 0.707107 6.363961
1980s 0.000000 0.000000 1.414214 3.535534 3.535534
1990s NaN NaN NaN NaN NaN
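If you want both statistics at once, you can compute them in a single aggregation (a sketch on the same df; the result has a two-level column index):
df1 = df.groupby(df.index // 10).agg(['mean', 'std'])
df1.index = df1.index.astype(str) + '0s'
print(df1)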