I have a dataframe with 4 columns (A, B, C, D). D has some NaN entries. I want to fill each NaN with the average of D over the rows that share the same values of A, B, C.
For example, if the values of A, B, C, D are x, y, z and NaN respectively, then I want the NaN to be replaced by the average of D over the rows where A, B, C are x, y, z.
df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) would be faster than apply
In [2400]: df
Out[2400]:
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0 1.0
1 2.0
2 3.0
3 5.0
Name: D, dtype: float64
In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
In [2403]: df
Out[2403]:
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Details
In [2396]: df.shape
Out[2396]: (10000, 4)
In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop
In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
I think you need:
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,1,1,3],
                   'B':[1,1,1,3],
                   'C':[1,1,1,3],
                   'D':[1,np.nan,3,5]})
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 NaN
2 1 1 1 3.0
3 3 3 3 5.0
df.D = df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
A B C D
0 1 1 1 1.0
1 1 1 1 2.0
2 1 1 1 3.0
3 3 3 3 5.0
Link to duplicate of this question for further information:
Pandas Dataframe: Replacing NaN with row average
Another suggested way of doing it mentioned in the link is using a simple fillna on the transpose:
df.T.fillna(df.mean(axis=1)).T
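For reference, a minimal sketch of that transpose trick (note it fills with the row average, which is the linked question's problem, rather than the group average used above; the frame here is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 4.0], 'y': [np.nan, 5.0], 'z': [3.0, np.nan]})
# df.mean(axis=1) is the per-row mean ignoring NaN; transposing lets fillna
# align it against the transposed columns, i.e. the original rows
print(df.T.fillna(df.mean(axis=1)).T)
# row 0: y filled with (1+3)/2 = 2.0; row 1: z filled with (4+5)/2 = 4.5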
Related
I have this dataframe.
import pandas as pd

df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
                   'id': [1,1,1,2,2,3,3,3,3,5],
                   'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean of every 'id'.
2) Give the number of ids (the length) whose mean is >= 3.
3) Give back all rows of the dataframe where the mean of the row's id is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> df (only the rows whose id has mean >= 3)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Use GroupBy.transform to compute per-group means broadcast to the size of the original DataFrame, which makes it possible to filter by boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
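The second part of the question, the number of ids whose mean is >= 3, is not covered above; a minimal sketch reusing the same transform mask, computed on the original df before the filtering above:
mask = df.groupby('id')['rate'].transform('mean') >= 3
print('Number of ids (length) where mean >= 3:', df.loc[mask, 'id'].nunique())
# -> 3 (ids 1, 2 and 5)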
I have a dataframe built with the following code:
for i in range(int(tower1_base), int(tower1_top)):
    if i not in tower1_not_included_int:
        df = pd.concat([df, pd.DataFrame({"Tower": 1, "Floor": i, "Unit": list("ABCDEFG")})], ignore_index=True)
Result:
Tower Floor Unit
0 1 1.0 A
1 1 1.0 B
2 1 1.0 C
3 1 1.0 D
4 1 1.0 E
5 1 1.0 F
6 1 1.0 G
How can I create another Index column like this?
Tower Floor Unit Index
0 1 1.0 A 1A1
1 1 2.0 B 1B2
2 1 3.0 C 1C3
3 1 4.0 D 1D4
4 1 5.0 E 1E5
5 1 6.0 F 1F6
6 1 7.0 G 1G7
You can simply add the columns together as strings:
df['Index'] = df['Tower'].astype(str)+df['Unit']+df['Floor'].astype(int).astype(str)
Outputs this for the first version of your dataframe:
Tower Floor Unit Index
0 1 1.0 A 1A1
1 1 1.0 B 1B1
2 1 1.0 C 1C1
3 1 1.0 D 1D1
4 1 1.0 E 1E1
5 1 1.0 F 1F1
6 1 1.0 G 1G1
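One caveat: .astype(int) raises if Floor contains NaN. A guarded sketch, assuming missing floors should simply be left out of the string (this assumption goes beyond the original answer):
# nullable Int64 keeps NaN through the int cast; the '<NA>' text is then blanked out
floor_str = df['Floor'].astype('Int64').astype(str).replace('<NA>', '')
df['Index'] = df['Tower'].astype(str) + df['Unit'] + floor_str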
Another approach.
I've created a copy of the dataframe and reordered the columns, to make the "melting" easier.
dfAl = df.reindex(columns=['Tower','Unit','Floor'])
to_load = []                        # list to load the new column
vals = pd.DataFrame.to_numpy(dfAl)  # all values extracted
for sublist in vals:
    # str(i).strip('.0') would mangle values such as 10.0 (strip removes the
    # characters '.' and '0' from both ends, not a '.0' suffix), so cast
    # floats to int explicitly instead
    combs = ''.join(str(int(i)) if isinstance(i, float) else str(i) for i in sublist)  # melting values
    to_load.append(combs)
df['Index'] = to_load
If you really want the 'Index' column to be a real index, the last step:
df = df.set_index('Index')
print(df)
Tower Floor Unit
Index
1A1 1 1.0 A
1B2 1 2.0 B
1C3 1 3.0 C
1D4 1 4.0 D
1E5 1 5.0 E
1F6 1 6.0 F
1G7 1 7.0 G
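For reference, the loop above can also be written without explicit iteration (a sketch; it assumes Floor holds whole numbers so the int cast is safe):
# cast Floor to int, stringify everything, then join each row left to right
df['Index'] = dfAl.astype({'Floor': int}).astype(str).agg(''.join, axis=1)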
After merging two data frames:
output = pd.merge(df1, df2, on='ID', how='outer')
I have data frame like this:
index x y z
0 2 NaN 3
0 NaN 3 3
1 2 NaN 4
1 NaN 3 4
...
How to merge rows with the same index?
Expected output:
index x y z
0 2 3 3
1 2 3 4
Perhaps you could take the mean of them.
In [418]: output.groupby('index', as_index=False).mean()
Out[418]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
We can group the DataFrame by 'index' and then just get the first values with .first(), the minimum with .min(), etc., depending on the case of course. What do you want to get if the values in z differ?
In [28]: gr = df.groupby('index', as_index=False)
In [29]: gr.first()
Out[29]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [30]: gr.max()
Out[30]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [31]: gr.min()
Out[31]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
In [32]: gr.mean()
Out[32]:
index x y z
0 0 2.0 3.0 3
1 1 2.0 3.0 4
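If the columns need different treatment, for example taking the mean of x and y but the first value of z when they differ, groupby.agg accepts a per-column mapping; a sketch:
out = df.groupby('index', as_index=False).agg({'x': 'mean', 'y': 'mean', 'z': 'first'})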
I am working with a DataFrame in pandas / Python. Each row has an ID (not unique), and I would like to modify the dataframe to add a column with the second name for each row that shares its ID with another row.
Starting with:
ID Name Rate
0 1 A 65.5
1 2 B 67.3
2 2 C 78.8
3 3 D 65.0
4 4 E 45.3
5 5 F 52.0
6 5 G 66.0
7 6 H 34.0
8 7 I 2.0
Trying to get to:
ID Name Rate Secondname
0 1 A 65.5 None
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 None
4 4 E 45.3 None
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 None
8 7 I 2.0 None
My code:
import numpy as np
import pandas as pd
mydict = {'ID':[1,2,2,3,4,5,5,6,7],
          'Name':['A','B','C','D','E','F','G','H','I'],
          'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}
df = pd.DataFrame(mydict)
df['Newname'] = 'None'
for i in range(0, df.shape[0]-1):
    if df.irow(i)['ID'] == df.irow(i+1)['ID']:
        df.irow(i)['Newname'] = df.irow(i+1)['Name']
Which results in the following error:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.irow(i)['Newname']=df.irow(i+1)['Secondname']
C:\Users\L\Anaconda3\lib\site-packages\pandas\core\series.py:664: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.loc[key] = value
Any help would be much appreciated.
You can use groupby with a custom function f, which uses shift and combine_first:
def f(x):
    # take the previous Name within the group, else the next one
    x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
    return x
print(df.groupby('ID').apply(f))
ID Name Rate Secondname
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
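For reference, the same shift/combine_first idea fits in one line with transform, avoiding the helper function (a sketch of an equivalent, not part of the original answer):
df['Secondname'] = df.groupby('ID')['Name'].transform(
    lambda s: s.shift().combine_first(s.shift(-1)))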
You can avoid groupby: find the duplicated rows, fill the helper columns from column Name with loc, then shift and combine_first, and finally drop the helper columns:
print(df.duplicated('ID', keep='first'))
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print(df.duplicated('ID', keep='last'))
0 False
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 False
dtype: bool
df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
print(df)
ID Name Rate first last
0 1 A 65.5 NaN NaN
1 2 B 67.3 NaN B
2 2 C 78.8 C NaN
3 3 D 65.0 NaN NaN
4 4 E 45.3 NaN NaN
5 5 F 52.0 NaN F
6 5 G 66.0 G NaN
7 6 H 34.0 NaN NaN
8 7 I 2.0 NaN NaN
df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
df = df.drop(['first', 'last'], axis=1)
print(df)
ID Name Rate SecondName
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
TESTING (at the time of testing, the solution of Roman Kh produced wrong output):
len(df) = 9:
In [154]: %timeit jez(df1)
100 loops, best of 3: 15 ms per loop
In [155]: %timeit jez2(df2)
100 loops, best of 3: 3.45 ms per loop
In [156]: %timeit rom(df)
100 loops, best of 3: 3.55 ms per loop
len(df) = 90k:
In [158]: %timeit jez(df1)
10 loops, best of 3: 57.1 ms per loop
In [159]: %timeit jez2(df2)
10 loops, best of 3: 36.4 ms per loop
In [160]: %timeit rom(df)
10 loops, best of 3: 40.4 ms per loop
import pandas as pd

mydict = {'ID':[1,2,2,3,4,5,5,6,7],
          'Name':['A','B','C','D','E','F','G','H','I'],
          'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}
df = pd.DataFrame(mydict)
print(df)

df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def jez(df):
    def f(x):
        x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
        return x
    return df.groupby('ID').apply(f)

def jez2(df):
    df.loc[df.duplicated('ID', keep='first'), 'first'] = df['Name']
    df.loc[df.duplicated('ID', keep='last'), 'last'] = df['Name']
    df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
    df = df.drop(['first', 'last'], axis=1)
    return df

def rom(df):
    # cpIDs = True if the next row has the same ID
    df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
    # fill in the last row (get rid of NaN)
    df.iloc[-1, df.columns.get_loc('cpIDs')] = False
    # ShiftName == Name of the next row
    df['ShiftName'] = df['Name'].shift(-1)
    # fill in SecondName
    df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
    # remove columns
    del df['cpIDs']
    del df['ShiftName']
    return df

print(jez(df1))
print(jez2(df2))
print(rom(df))
print(jez(df1))
ID Name Rate Secondname
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
print(jez2(df2))
ID Name Rate SecondName
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 B
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 F
7 6 H 34.0 NaN
8 7 I 2.0 NaN
print(rom(df))
ID Name Rate SecondName
0 1 A 65.5 NaN
1 2 B 67.3 C
2 2 C 78.8 NaN
3 3 D 65.0 NaN
4 4 E 45.3 NaN
5 5 F 52.0 G
6 5 G 66.0 NaN
7 6 H 34.0 NaN
8 7 I 2.0 NaN
EDIT:
If there are more duplicated pairs with the same names, use shift to create the first and last columns:
df.loc[ df['ID'] == df['ID'].shift(), 'first'] = df['Name']
df.loc[ df['ID'] == df['ID'].shift(-1), 'last'] = df['Name']
If your dataframe is sorted by ID, you might add a new column which compares ID of the current row with ID of the next row:
# cpIDs = True if the next row has the same ID
df['cpIDs'] = df['ID'][:-1] == df['ID'][1:]
# fill in the last row (get rid of NaN)
df.iloc[-1,df.columns.get_loc('cpIDs')] = False
# ShiftName == Name of the next row
df['ShiftName'] = df['Name'].shift(-1)
# fill in SecondName
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
# remove columns
del df['cpIDs']
del df['ShiftName']
Of course, you can shorten the code above; I intentionally made it longer but easier to comprehend.
Depending on your dataframe size it might be pretty fast (perhaps the fastest), as it does not use any complicated operations.
P.S. As a side note, try to avoid loops when dealing with dataframes and numpy arrays. You can almost always find a so-called vectorized solution which operates on the whole array or on large ranges, not on individual cells and rows.
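To illustrate, a minimal vectorized sketch for this question's Secondname task, with no Python-level loop (it assumes rows with equal IDs are adjacent, as in the sample data):
import pandas as pd

df = pd.DataFrame({'ID': [1,2,2,3,4,5,5,6,7],
                   'Name': list('ABCDEFGHI')})
prev_match = df['ID'].eq(df['ID'].shift())    # same ID as the row above
next_match = df['ID'].eq(df['ID'].shift(-1))  # same ID as the row below
# take the neighbour's Name where the IDs match, NaN otherwise
df['Secondname'] = df['Name'].shift().where(prev_match,
                                            df['Name'].shift(-1).where(next_match))
print(df)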
This question already has answers here:
Pandas: filling missing values by mean in each group
(12 answers)
Closed last year.
I know that the fillna() method can be used to fill NaN in a whole dataframe.
df.fillna(df.mean()) # fill with the mean of each column
How can I limit the mean calculation to the group (and the column) where the NaN is?
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': pd.Series([1,1,1,2,2,2]),
    'b': pd.Series([1,2,np.nan,1,np.nan,4])
})
print(df)
Input
a b
0 1 1
1 1 2
2 1 NaN
3 2 1
4 2 NaN
5 2 4
Output (after groupby('a') and replacing NaN by the mean of the group)
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
IIUC then you can call fillna with the result of groupby on 'a' and transform on 'b':
In [44]:
df['b'] = df['b'].fillna(df.groupby('a')['b'].transform('mean'))
df
Out[44]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
If you have NaN values in multiple columns, then I think the following should work:
In [47]:
df.fillna(df.groupby('a').transform('mean'))
Out[47]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
EDIT
In [49]:
df = pd.DataFrame({
    'a': pd.Series([1,1,1,2,2,2]),
    'b': pd.Series([1,2,np.nan,1,np.nan,4]),
    'c': pd.Series([1,np.nan,np.nan,1,np.nan,4]),
    'd': pd.Series([np.nan,np.nan,np.nan,1,np.nan,4])
})
df
Out[49]:
a b c d
0 1 1 1 NaN
1 1 2 NaN NaN
2 1 NaN NaN NaN
3 2 1 1 1
4 2 NaN NaN NaN
5 2 4 4 4
In [50]:
df.fillna(df.groupby('a').transform('mean'))
Out[50]:
a b c d
0 1 1.0 1.0 NaN
1 1 2.0 1.0 NaN
2 1 1.5 1.0 NaN
3 2 1.0 1.0 1.0
4 2 2.5 2.5 2.5
5 2 4.0 4.0 4.0
You get all NaN for 'd' in group 1 because every value of 'd' in that group is NaN.
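If those remaining NaN should still be filled, one option (an assumption about the desired behaviour, not part of the original answer) is to chain a second fillna with the overall column means:
# group mean first, then fall back to the overall column mean for all-NaN groups
filled = df.fillna(df.groupby('a').transform('mean')).fillna(df.mean(numeric_only=True))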
We first compute the group means, ignoring the missing values:
group_means = df.groupby('a')['b'].agg(lambda v: np.nanmean(v))
Next, we use groupby again, this time fetching the corresponding values:
df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))
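Note that groupby means skip NaN by default, so the agg(lambda v: np.nanmean(v)) step is equivalent to a plain .mean(). A quick self-contained check of the whole approach, using the question's example frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1,1,1,2,2,2],
                   'b': [1,2,np.nan,1,np.nan,4]})
group_means = df.groupby('a')['b'].mean()  # same result as np.nanmean per group
df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))
print(df_new)  # b becomes [1.0, 2.0, 1.5, 1.0, 2.5, 4.0]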