Setting subset of a pandas DataFrame by a DataFrame - python

I feel like this question has been asked a million times before, but I just can't get it to work or find an SO post that answers it.
I am selecting a subset of a pandas DataFrame and want to change these values individually.
I am subselecting my DataFrame like this:
df.loc[df[key].isnull(), [keys]]
which works perfectly. If I try and set all values to the same value such as
df.loc[df[key].isnull(), [keys]] = 5
it works as well. But if I try to assign a DataFrame instead, it does not work, and no error is raised either.
So for example I have a DataFrame:
data = [['Alex',10,0,0,2],['Bob',12,0,0,1],['Clarke',13,0,0,4],['Dennis',64,2],['Jennifer',56,1],['Tom',95,5],['Ellen',42,2],['Heather',31,3]]
df1 = pd.DataFrame(data,columns=['Name','Age','Amount_of_cars','cars_per_year','some_other_value'])
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.0 2.0
1 Bob 12 0 0.0 1.0
2 Clarke 13 0 0.0 4.0
3 Dennis 64 2 NaN NaN
4 Jennifer 56 1 NaN NaN
5 Tom 95 5 NaN NaN
6 Ellen 42 2 NaN NaN
7 Heather 31 3 NaN NaN
and a second DataFrame:
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5],[3/31,7]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
cars_per_year some_other_value
0 0.031250 5
1 0.017857 1
2 0.052632 7
3 0.047619 5
4 0.096774 7
and I would like to replace those NaNs with the values from the second DataFrame:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
Unfortunately this does not work as the index does not match. So how do I ignore the index, when setting values?
Any help would be appreciated. Sorry if this has been posted before.

This is possible only if the number of missing values equals the number of rows in df2. In that case, assign a NumPy array to prevent index alignment:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
Otherwise you get an error like:
# 4 rows assigned to 5 rows
data = [[2/64,5],[1/56,1],[5/95,7],[2/42,5]]
df2 = pd.DataFrame(data,columns=['cars_per_year','some_other_value'])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (5,)
Another idea is to set the index of df2 to the index of the filtered rows in df1:
df2 = df2.set_index(df1.index[df1['cars_per_year'].isnull()])
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2
print (df1)
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0

Just add .values, or .to_numpy() if you are using pandas 0.24+:
df1.loc[df1['cars_per_year'].isnull(),['cars_per_year','some_other_value']] = df2.values
Name Age Amount_of_cars cars_per_year some_other_value
0 Alex 10 0 0.000000 2.0
1 Bob 12 0 0.000000 1.0
2 Clarke 13 0 0.000000 4.0
3 Dennis 64 2 0.031250 5.0
4 Jennifer 56 1 0.017857 1.0
5 Tom 95 5 0.052632 7.0
6 Ellen 42 2 0.047619 5.0
7 Heather 31 3 0.096774 7.0
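The length requirement mentioned in the first answer can be checked explicitly before assigning. A minimal sketch, assuming a shortened version of the frames above (the `cols` and `mask` names are illustrative and not from the original posts):

```python
import numpy as np
import pandas as pd

# Shortened, made-up version of the frames from the question.
df1 = pd.DataFrame({
    "Name": ["Alex", "Bob", "Dennis", "Tom"],
    "cars_per_year": [0.0, 0.0, np.nan, np.nan],
    "some_other_value": [2.0, 1.0, np.nan, np.nan],
})
df2 = pd.DataFrame({
    "cars_per_year": [2 / 64, 5 / 95],
    "some_other_value": [5, 7],
})

cols = ["cars_per_year", "some_other_value"]
mask = df1["cars_per_year"].isna()

# Positional assignment only makes sense if the row counts agree.
if int(mask.sum()) != len(df2):
    raise ValueError(f"{int(mask.sum())} rows to fill, but df2 has {len(df2)} rows")

# to_numpy() drops df2's index, so values are assigned by position.
df1.loc[mask, cols] = df2[cols].to_numpy()
print(df1)
```

Selecting `df2[cols]` before converting also makes the column order explicit, which matters because a plain array carries no column labels.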

Find polynomial relationship between two pandas df columns and extend it to the rest of the dataset

Below is an example of my dataset:
[index] [pressure] [flow rate]
0 Nan 0
1 Nan 0
2 3 25
3 5 35
4 6 42
5 Nan 44
6 Nan 46
7 Nan 0
8 5 33
9 4 26
10 3 19
11 Nan 0
12 Nan 0
13 Nan 39
14 Nan 36
15 Nan 41
I would like to find a polynomial relationship between pressure and flow rate wherever data for both are present (in this example, there are data points for both from index 0 to index 4). I then need to fill in the NaN pressure values using that relationship, up to the point where data for both are present again (in this case, from index 8 to index 11). There I need to find a new polynomial relationship between pressure and flow rate, extend the pressure values with it up to the next available data, and so on.
I appreciate any advice on how best to accomplish that.
You can interpolate:
df['[pressure 2]'] = df.set_index('[flow rate]')['[pressure]'].interpolate('polynomial', order=2).values
Output
[index] [pressure] [flow rate] [pressure 2]
0 0 2.0 21 2.000000
1 1 4.0 29 4.000000
2 2 3.0 25 3.000000
3 3 5.0 35 5.000000
4 4 6.0 42 6.000000
5 5 NaN 44 6.000000
6 6 NaN 46 NaN
7 7 NaN 50 NaN
8 8 5.0 33 5.000000
9 9 4.0 26 4.000000
10 10 3.0 19 3.000000
11 11 6.0 44 6.000000
12 12 NaN 41 5.915690
13 13 NaN 39 5.578449
14 14 NaN 36 5.044156
15 15 NaN 40 5.775173
NB. The remaining NaNs cannot be interpolated without ambiguity; you can ffill them if needed.
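If an explicit polynomial fit is wanted rather than interpolation, one option is np.polyfit on the rows where both columns are present, followed by np.polyval to fill the gaps. This is a different technique from the interpolate call above, and only a sketch: the data are made up (pressure = flow_rate**2 + 1), the column names are simplified, and a single fit is used for the whole frame rather than one fit per block of known data.

```python
import numpy as np
import pandas as pd

# Made-up data: the known pressures follow flow_rate**2 + 1 exactly.
df = pd.DataFrame({
    "flow_rate": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "pressure": [1.0, 2.0, 5.0, 10.0, np.nan, np.nan],
})

# Fit a degree-2 polynomial only on rows where pressure is known.
known = df["pressure"].notna()
coeffs = np.polyfit(
    df.loc[known, "flow_rate"].to_numpy(),
    df.loc[known, "pressure"].to_numpy(),
    deg=2,
)

# Evaluate the fit for every row, then use it only where pressure is missing.
predicted = pd.Series(np.polyval(coeffs, df["flow_rate"].to_numpy()), index=df.index)
df["pressure_filled"] = df["pressure"].fillna(predicted)
print(df)
```

To get one fit per block, the same steps could be applied separately to each slice of the frame, but how to delimit the blocks depends on the real data.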

Combining two dataframes

I've tried merging two dataframes, but I can't seem to get it to work. Each time I merge, the rows where I expect values are all 0. Dataframe df1 already has some data in it, with some rows left blank. Dataframe df2 should populate those blank rows in df1 where the column names match, for each value of "TempBin" and "Month" in df1.
EDIT:
Both dataframes are in a for loop. df1 acts as my "storage", df2 changes for each location iteration. So if df2 contained the results for LocationZP, I would also want that data inserted in the matching df1 rows. If I use df1 = df1.append(df2) in the for loop, all of the rows from df2 keep inserting at the very end of df1 for each iteration.
df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 0 0 0
13 1 0 0 0
13 2 0 0 0
13 3 0 0 0
13 4 0 0 0
13 5 0 0 0
df2:
Month TempBin LocationAA
13 0 11
13 1 22
13 2 33
13 3 44
13 4 55
13 5 66
desired output in df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 11 0 0
13 1 22 0 0
13 2 33 0 0
13 3 44 0 0
13 4 55 0 0
13 5 66 0 0
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
'TempBin': [0,1,2,3,4,5]*2,
'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]}
)
df2 = pd.DataFrame({'Month': [13]*6,
'TempBin': [0,1,2,3,4,5],
'LocationAA': [11,22,33,44,55,66]}
)
df1 = pd.merge(df1, df2, on=["Month","TempBin","LocationAA"], how="left")
result:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 NaN NaN NaN
13 1 NaN NaN NaN
13 2 NaN NaN NaN
13 3 NaN NaN NaN
13 4 NaN NaN NaN
13 5 NaN NaN NaN
Here's some code that worked for me:
# Merge two df into one dataframe on the columns "TempBin" and "Month" filling nan values with 0.
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
'TempBin': [0,1,2,3,4,5]*2,
'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]}
)
df2 = pd.DataFrame({'Month': [13]*6,
'TempBin': [0,1,2,3,4,5],
'LocationAA': [11,22,33,44,55,66]})
df_merge = pd.merge(df1, df2, how='left',
left_on=['TempBin', 'Month'],
right_on=['TempBin', 'Month'])
df_merge.fillna(0, inplace=True)
# add column LocationAA and fill it with the not null value from column LocationAA_x and LocationAA_y
df_merge['LocationAA'] = df_merge.apply(lambda x: x['LocationAA_x'] if pd.isnull(x['LocationAA_y']) else x['LocationAA_y'], axis=1)
# remove column LocationAA_x and LocationAA_y
df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1, inplace=True)
print(df_merge)
Output:
Month TempBin LocationXA LocationZP LocationAA
0 1 0 1.0 2.0 0.0
1 1 1 0.0 89.0 0.0
2 1 2 23.0 38.0 0.0
3 1 3 14.0 17.0 0.0
4 1 4 9.0 14.0 0.0
5 1 5 8.0 99.0 0.0
6 13 0 0.0 0.0 11.0
7 13 1 0.0 0.0 22.0
8 13 2 0.0 0.0 33.0
9 13 3 0.0 0.0 44.0
10 13 4 0.0 0.0 55.0
11 13 5 0.0 0.0 66.0
Let me know if there's something you don't understand in the comments :)
PS: Sorry for the extra comments. But I left them there for some more explanations.
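Note that in the code above fillna(0) runs before the pd.isnull check, so LocationAA_y is never null at that point and the Month 1 rows end up with 0.0 in LocationAA, as the output shows. A minimal sketch that keeps the existing df1 values instead, assuming the same two frames (the "_new" suffix and the `keys` name are arbitrary choices):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Month": [1] * 6 + [13] * 6,
    "TempBin": [0, 1, 2, 3, 4, 5] * 2,
    "LocationAA": [7, 98, 12, 3, 7, 1, 0, 0, 0, 0, 0, 0],
    "LocationXA": [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
    "LocationZP": [2, 89, 38, 17, 14, 99, 0, 0, 0, 0, 0, 0],
})
df2 = pd.DataFrame({
    "Month": [13] * 6,
    "TempBin": [0, 1, 2, 3, 4, 5],
    "LocationAA": [11, 22, 33, 44, 55, 66],
})

keys = ["Month", "TempBin"]

# Left merge on the keys only; df2's value column gets the "_new" suffix.
merged = df1.merge(df2, on=keys, how="left", suffixes=("", "_new"))

# Prefer the new value where one exists, otherwise keep what df1 had.
merged["LocationAA"] = merged["LocationAA_new"].fillna(merged["LocationAA"])
merged = merged.drop(columns="LocationAA_new")
print(merged)
```

Because the null check happens column-wise through fillna, no row-wise apply is needed.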
You need to use append to get the desired output (DataFrame.append was removed in pandas 2.0, so use pd.concat there):
df1 = pd.concat([df1, df2])
and if you want to replace the nulls with zeros, add:
df1 = df1.fillna(0)
Here is another way using combine_first()
i = ['Month','TempBin']
df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
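Since df1 acts as storage inside a loop, DataFrame.update may also fit: it overwrites values in place wherever the other frame has a non-NaN value at a matching index label and column. A sketch, assuming each df2 carries the same two key columns (the `stored` name is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Month": [1] * 6 + [13] * 6,
    "TempBin": [0, 1, 2, 3, 4, 5] * 2,
    "LocationAA": [7, 98, 12, 3, 7, 1, 0, 0, 0, 0, 0, 0],
    "LocationXA": [1, 0, 23, 14, 9, 8, 0, 0, 0, 0, 0, 0],
    "LocationZP": [2, 89, 38, 17, 14, 99, 0, 0, 0, 0, 0, 0],
})
df2 = pd.DataFrame({
    "Month": [13] * 6,
    "TempBin": [0, 1, 2, 3, 4, 5],
    "LocationAA": [11, 22, 33, 44, 55, 66],
})

keys = ["Month", "TempBin"]

# Index both frames by the keys so update() can align rows by label.
stored = df1.set_index(keys)
stored.update(df2.set_index(keys))  # modifies `stored` in place, returns None

df1 = stored.reset_index()
print(df1)
```

Unlike combine_first, update never adds rows or columns, so key pairs in df2 that are absent from df1 are silently ignored.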

How to fill first N/A cell when apply rolling mean to a column -python

I need to apply a rolling mean to a column, as shown in pic1 (s3). After I apply the rolling mean with window = 5, I get the correct answer, but the first 4 rows are left empty, as shown in pic2 (sa3).
I want to fill the first 4 empty cells in pic2 (sa3) with the mean of all data in pic1 (s3) up to the current row, as shown in pic3 (a3).
How can I do this with a simple function in addition to the rolling mean method?
I think you need the parameter min_periods=1 in rolling:
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1.
df = df.rolling(5, min_periods=1).mean()
Sample:
np.random.seed(1256)
df = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('abcde'))
print (df)
a b c d e
0 1 5 8 8 9
1 3 6 3 0 6
2 7 0 1 5 1
3 6 6 5 0 4
4 4 9 4 6 1
5 7 7 5 8 3
6 0 7 2 8 2
7 4 8 3 5 5
8 8 2 0 9 2
9 4 7 1 5 1
df = df.rolling(5, min_periods=1).mean()
print (df)
a b c d e
0 1.000000 5.000000 8.00 8.000000 9.000000
1 2.000000 5.500000 5.50 4.000000 7.500000
2 3.666667 3.666667 4.00 4.333333 5.333333
3 4.250000 4.250000 4.25 3.250000 5.000000
4 4.200000 5.200000 4.20 3.800000 4.200000
5 5.400000 5.600000 3.60 3.800000 3.000000
6 4.800000 5.800000 3.40 5.400000 2.200000
7 4.200000 7.400000 3.80 5.400000 3.000000
8 4.600000 6.600000 2.80 7.200000 2.600000
9 4.600000 6.200000 2.20 7.000000 2.600000
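With min_periods=1, the first window - 1 rows are averaged over however many values are available so far, which is the "mean of all data up to the current row" the question asks for. A small sketch checking this against an expanding mean, using a made-up series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# Rolling mean that starts producing values from the very first row.
rolled = s.rolling(5, min_periods=1).mean()

# Cumulative ("expanding") mean of everything up to the current row.
expanding = s.expanding().mean()

print(pd.DataFrame({"s": s, "rolled": rolled, "expanding": expanding}))
```

The two columns agree for the first four rows and diverge once the window of 5 is full.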
So you want to add:
df['sa3'] = df['sa3'].fillna(df['s3'].mean())
Hopefully I used the correct column names.
You can use pandas to find the rolling mean and then fill the NaN with zero.
Use something like the following:
col = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(col)
df['rm'] = df.rolling(5).mean().fillna(value =0, inplace=False)
print(df)
0 rm
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 3.0
5 6 4.0
6 7 5.0
7 8 6.0
8 9 7.0
I see that some of the answers deal with nulls by replacing them with the mean, while others compute a rolling mean without using it to replace the nulls. So I worked out the code myself and am posting it here:
df['Col'] = df['Col'].fillna(df['Col'].rolling(4, center=True, min_periods=1).mean())
4 is the length of the rolling window.
center=True means the window is centered on each row, so a replaced value is computed from the values both above and below the null.

Pandas: replace Nan with values from one of two columns

Given the following dataframe df, where df['B']=df['M1']+df['M2']:
A M1 M2 B
1 1 2 3
1 2 NaN NaN
1 3 6 9
1 4 8 12
1 NaN 10 NaN
1 6 12 18
I want the NaN in column B to equal the corresponding value in M1 or M2 provided that the latter is not NaN:
A M1 M2 B
1 1 2 3
1 2 NaN 2
1 3 6 9
1 4 8 12
1 NaN 10 10
1 6 12 18
This answer suggested using:
df.loc[df['B'].isnull(),'B'] = df['M1'], but this line can only consider either M1 or M2, not both at the same time.
Ideas on how I should change it to consider both columns?
EDIT
Not a duplicate question! For ease of understanding, I claimed that df['B']=df['M1']+df['M2'], but in my real case, df['B'] is not a sum and comes from a rather complicated computation. So I cannot apply a simple formula to df['B']: all I can do is change the NaN values to match the corresponding value in either M1 or M2.
Based on our discussion in the comments above:
df['B'] = df['B'].fillna(df[['M1','M2']].max(axis=1))
df
Out[52]:
A M1 M2 B
0 1 1.0 2.0 3.0
1 1 2.0 NaN 2.0
2 1 3.0 6.0 9.0
3 1 4.0 8.0 12.0
4 1 NaN 10.0 10.0
5 1 6.0 12.0 18.0
From jezrael
df['B'] = (df['M1'] + df['M2']).fillna(df[['M2','M1']].sum(axis=1))
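Because B in the real case is not a sum, another option is to leave the existing B values alone and chain two fillna calls, trying M1 first and then M2. A minimal sketch on the question's data (the M1-before-M2 order is an assumption; swap the calls to prefer M2):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1] * 6,
    "M1": [1, 2, 3, 4, np.nan, 6],
    "M2": [2, np.nan, 6, 8, 10, 12],
    "B": [3, np.nan, 9, 12, np.nan, 18],
})

# Fill missing B from M1; where M1 is missing too, fall back to M2.
df["B"] = df["B"].fillna(df["M1"]).fillna(df["M2"])
print(df)
```

Rows where B, M1, and M2 are all NaN would stay NaN.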

How to combine the rows in data frame?

It's really annoying that I cannot find a way to combine several rows or columns by taking their means, standard deviations, or something else. Could someone give me an idea? Thanks!
I think you can group by the index floor-divided by 10 and aggregate with mean or std:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(10, size=(5,5)),index=[1971,1972,1981,1982,1991])
print (df)
0 1 2 3 4
1971 5 8 9 5 0
1972 0 1 7 6 9
1981 2 4 5 2 4
1982 2 4 7 7 9
1991 1 7 0 6 9
print (df.index // 10)
Int64Index([197, 197, 198, 198, 199], dtype='int64')
df1 = df.groupby([df.index // 10]).mean()
df1.index = df1.index.astype(str) + '0s'
print (df1)
0 1 2 3 4
1970s 2.5 4.5 8.0 5.5 4.5
1980s 2.0 4.0 6.0 4.5 6.5
1990s 1.0 7.0 0.0 6.0 9.0
df1 = df.groupby([df.index // 10]).std()
df1.index = df1.index.astype(str) + '0s'
print (df1)
0 1 2 3 4
1970s 3.535534 4.949747 1.414214 0.707107 6.363961
1980s 0.000000 0.000000 1.414214 3.535534 3.535534
1990s NaN NaN NaN NaN NaN
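The NaN row for the 1990s appears because that group contains a single row and std defaults to the sample standard deviation (ddof=1). A small sketch comparing it with ddof=0, using only column 0 of the frame above under the hypothetical name `value`:

```python
import pandas as pd

# Column 0 of the frame above, renamed for readability.
df = pd.DataFrame({"value": [5, 0, 2, 2, 1]},
                  index=[1971, 1972, 1981, 1982, 1991])

# Floor each year to its decade: 1971 -> 1970, 1982 -> 1980, ...
decade = df.index // 10 * 10

sample_std = df.groupby(decade)["value"].std()            # ddof=1 (default)
population_std = df.groupby(decade)["value"].std(ddof=0)  # divides by n

print(pd.DataFrame({"ddof=1": sample_std, "ddof=0": population_std}))
```

Which of the two is appropriate depends on whether the rows are treated as a sample or as the whole population.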
