How to use .bfill() with pandas groupby without dropping the grouping variable - python

I want to use bfill and groupby but have not figured out a way to do so without dropping the grouping variable. I know I can just concatenate back the ID column but there's gotta be another way of doing this.
import pandas as pd
import numpy as np

test = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                     'dd': [0, 0, 0, 0, 0, 0],
                     'nu': np.array([0, 1, np.nan, np.nan, 10, 20])})
In [11]: test.groupby('ID').bfill()
Out[11]:
     nu
0   0.0
1   1.0
2   NaN
3  10.0
4  10.0
5  20.0
Desired output
ID dd nu
0 A 0 0.0
1 A 0 1.0
2 A 0 NaN
3 B 0 10.0
4 B 0 10.0
5 B 0 20.0

Try df.assign:
>>> test.assign(nu=test.groupby('ID').bfill()['nu'])
ID dd nu
0 A 0 0.0
1 A 0 1.0
2 A 0 NaN
3 B 0 10.0
4 B 0 10.0
5 B 0 20.0
Or df.groupby.apply:
>>> test.groupby('ID').apply(lambda x: x.bfill())
ID dd nu
0 A 0 0.0
1 A 0 1.0
2 A 0 NaN
3 B 0 10.0
4 B 0 10.0
5 B 0 20.0
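If only one column needs filling, a lighter sketch (assuming pandas >= 1.0, where SeriesGroupBy.bfill returns a Series aligned to the original index) is to backfill just that column and assign it back, leaving ID and dd untouched:
# Backfill 'nu' within each ID group; the grouping column is never dropped.
test['nu'] = test.groupby('ID')['nu'].bfill()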

Related

How to insert the values of a dictionary into the null values of a dataframe in pandas?

I am new to pandas and am facing an issue with null values. I have a dict whose values have to be inserted into the missing values of certain columns; how do I do that? The last word of each key is the name of the target column.
In [57]: df
Out[57]:
   a    b  c    d
0  0    1  2    3
1  0  NaN  0    1
2  0  NaN  3  NaN
3  0    1  2    5
4  0  NaN  2  NaN
In [58]: d = {'df_b': [11, 22, 44], 'df_d': [33, 54]}
The output I want is below.
Out[57]:
   a   b  c   d
0  0   1  2   3
1  0  11  0   1
2  0  22  3  33
3  0   1  2   5
4  0  44  2  54
Given your data
import pandas as pd
import numpy as np

d = [[0, 1,      2, 3],
     [0, np.nan, 0, 1],
     [0, np.nan, 3, np.nan],
     [0, 1,      2, 5],
     [0, np.nan, 2, np.nan]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
d = {'df_b': [11, 22, 44], 'df_d': [33, 54]}
Try pandas.isna():
for key in d:
    column_name = key.split('_')[-1]
    val = d[key]
    for i, v in zip(df[df[column_name].isna()].index, val):
        df.loc[i, column_name] = v
Output:
   a     b  c     d
0  0   1.0  2   3.0
1  0  11.0  0   1.0
2  0  22.0  3  33.0
3  0   1.0  2   5.0
4  0  44.0  2  54.0
You can use df.loc with isnull() to select the NaN values and replace them with the items in your list.
import pandas as pd
import numpy as np

mydict = {'b': [11, 22, 44], 'd': [33, 54]}
df = pd.DataFrame({'a': [0, 0, 0, 0, 0],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [2, 0, 3, 2, 2],
                   'd': [3, 1, np.nan, 5, np.nan]})
for key in mydict:
    df.loc[df[key].isnull(), key] = mydict[key]
#    a     b  c     d
# 0  0   1.0  2   3.0
# 1  0  11.0  0   1.0
# 2  0  22.0  3  33.0
# 3  0   1.0  2   5.0
# 4  0  44.0  2  54.0
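A hedged variation on the same idea, not from the original answers: wrap the replacement values in a Series indexed by the NaN positions and let fillna do the alignment, which also tolerates lists shorter than the number of NaNs:
for key, vals in mydict.items():
    nan_idx = df.index[df[key].isnull()]
    # Pair the values with the first len(vals) NaN positions; any extras stay NaN.
    df[key] = df[key].fillna(pd.Series(vals, index=nan_idx[:len(vals)]))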

Count NaN windows (and their sizes) in DataFrame columns

I have huge dataframes (millions, tens of millions of rows) with a lot of missing (NaN) values along the columns.
I need to count the windows of NaNs and their sizes, for every column, in the fastest way possible (my code is too slow).
Something like this: from here
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})
df
Out[65]:
a b c
0 1.0 NaN NaN
1 2.0 2.0 2.0
2 NaN 1.0 1.0
3 NaN 1.0 NaN
4 3.0 3.0 3.0
5 3.0 3.0 3.0
6 NaN NaN NaN
7 4.0 NaN NaN
8 NaN 2.0 2.0
9 NaN NaN 8.0
To here:
result
Out[61]:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
Here's one way to do it:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})
df_n = pd.DataFrame({'a': df['a'].isnull().values,
                     'b': df['b'].isnull().values,
                     'c': df['c'].isnull().values})
pr = {}
for column_name, _ in df_n.items():  # iteritems() was removed in pandas 2.0
    fst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(1).fillna(False)]
    lst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(-1).fillna(False)]
    pr[column_name] = [j - i + 1 for i, j in zip(fst, lst)]
df_new = pd.DataFrame(pr)
Output:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
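As an aside, a more vectorized sketch (not from the answer above) labels each run of consecutive NaNs with a cumulative-sum trick and then counts the size of each run:
def nan_run_lengths(s):
    # True where the value is missing.
    mask = s.isna()
    # The id increments whenever the mask flips, so each run gets its own label.
    run_id = mask.ne(mask.shift()).cumsum()
    # Keep only the NaN positions and count how many fall in each run.
    return mask[mask].groupby(run_id[mask]).size().tolist()

nan_run_lengths(df['a'])  # [2, 1, 2]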
Try this one (shown only for a; do the same for the other columns):
>>> df = df.assign(a_count_sum=0)
>>> df.loc[np.isnan(df["a"]), "a_count_sum"] = df.groupby(np.isnan(df.a)).cumcount() + 1
>>> df
a b c a_count_sum
0 1.0 NaN NaN 0
1 2.0 2.0 2.0 0
2 NaN 1.0 1.0 1
3 NaN 1.0 NaN 2
4 3.0 3.0 3.0 0
5 3.0 3.0 3.0 0
6 NaN NaN NaN 3
7 4.0 NaN NaN 0
8 NaN 2.0 2.0 4
9 NaN NaN 8.0 5
>>> res_1 = df["a_count_sum"][((df["a_count_sum"].shift(-1) == 0) | (np.isnan(df["a_count_sum"].shift(-1)))) & (df["a_count_sum"]!=0)]
>>> res_1
3 2
6 3
9 5
Name: a_count_sum, dtype: int64
>>> res_2 = (-res_1.shift(1).fillna(0)).astype(np.int64)
>>> res_2
3 0
6 -2
9 -3
Name: a_count_sum, dtype: int64
>>> res=res_1+res_2
>>> res
3 2
6 1
9 2
Name: a_count_sum, dtype: int64
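For completeness, the nan_run_lengths helper sketched after the first answer can be applied to every column at once; wrapping each result in pd.Series pads columns with fewer runs using NaN:
result = pd.DataFrame({c: pd.Series(nan_run_lengths(df[c])) for c in df.columns})
print(result)
#    a  b  c
# 0  2  1  1
# 1  1  2  1
# 2  2  1  2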

Move Null rows to the bottom of the dataframe

I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
                    'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset=['a'], inplace=True)
df1 = df1.append(na)  # DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after the OP's edit)
You were close, but you can clean up your code a bit with the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer (written against the pre-edit data, where row 2 held 3.0)
Use sort_values with the na_position='last' argument; note that this also sorts the non-NaN rows by value:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
2 3.0 c
4 5.0 e
5 6.0 f
7 8.0 h
3 NaN d
6 NaN g
This does not exist as a built-in in pandas yet; use Series.isna with Series.argsort to get the positions and change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp', kind='mergesort')  # mergesort is stable, preserving the original order
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
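On pandas >= 1.1 there is also a one-liner sketch using the key argument of sort_values; kind='mergesort' makes the sort stable, so the original order within each group is preserved:
df1 = df1.sort_values('a', key=lambda s: s.isna(), kind='mergesort', ignore_index=True)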

Fill empty values in a dataframe based on columns in another dataframe

I have a dataframe df1 like this.
I want to fill the NaN values and the zeros in column score with multiple values from another dataframe df2, according to the different names.
How could I do this?
Option 1
Short version
df1.score = df1.score.mask(df1.score.eq(0)).fillna(
    df1.name.map(df2.set_index('name').score)
)
df1
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
Option 2
Interesting version using searchsorted. df2 must be sorted by 'name'.
i = np.where(np.isnan(df1.score.mask(df1.score.values == 0).values))[0]
j = df2.name.values.searchsorted(df1.name.values[i])
df1.score.values[i] = df2.score.values[j]
df1
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
If df1 and df2 are your dataframes, you can create a mapping and then call pd.Series.replace:
df1 = pd.DataFrame({'name': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'A'],
                    'score': [0, 32, 0, np.nan, np.nan, 45, np.nan, np.nan]})
df2 = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [10, 20, 30]})
print(df1)
name score
0 A 0.0
1 B 32.0
2 A 0.0
3 C NaN
4 B NaN
5 A 45.0
6 A NaN
7 A NaN
print(df2)
name score
0 A 10
1 B 20
2 C 30
mapping = dict(df2.values)
mask = df1.score.isnull() | (df1.score == 0)
df1.loc[mask, 'score'] = df1.loc[mask, 'name'].replace(mapping)
print(df1)
name score
0 A 10.0
1 B 32.0
2 A 10.0
3 C 30.0
4 B 20.0
5 A 45.0
6 A 10.0
7 A 10.0
Or using merge and fillna:
import pandas as pd
import numpy as np

df1.loc[df1.score == 0, 'score'] = np.nan
df1.merge(df2, on='name', how='left').fillna(method='bfill', axis=1)[['name', 'score_x']]\
   .rename(columns={'score_x': 'score'})
This method changes the order (the result will be sorted by name).
df1.set_index('name').replace(0, np.nan).combine_first(df2.set_index('name')).reset_index()
name score
0 A 10
1 A 10
2 A 45
3 A 10
4 A 10
5 B 32
6 B 20
7 C 30
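Yet another sketch, using numpy.where to choose between the existing score and the looked-up one (assumes df1 and df2 as constructed above):
import numpy as np

# Per-row replacement value, looked up from df2 by name.
lookup = df1['name'].map(df2.set_index('name')['score'])
need_fill = df1['score'].isnull() | df1['score'].eq(0)
df1['score'] = np.where(need_fill, lookup, df1['score'])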

Fill NaN with mean of a group for each column [duplicate]

This question already has answers here:
Pandas: filling missing values by mean in each group
(12 answers)
Closed last year.
I know that the fillna() method can be used to fill NaN in a whole dataframe:
df.fillna(df.mean())  # fill with the mean of each column
How can I limit the mean calculation to the group (and the column) where the NaN is?
Example:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': pd.Series([1, 1, 1, 2, 2, 2]),
    'b': pd.Series([1, 2, np.nan, 1, np.nan, 4])
})
print(df)
Input
a b
0 1 1
1 1 2
2 1 NaN
3 2 1
4 2 NaN
5 2 4
Output (after groupby('a'), replacing each NaN with the mean of its group)
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
IIUC then you can call fillna with the result of groupby on 'a' and transform on 'b':
In [44]:
df['b'] = df['b'].fillna(df.groupby('a')['b'].transform('mean'))
df
Out[44]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
If you have multiple NaN values then I think the following should work:
In [47]:
df.fillna(df.groupby('a').transform('mean'))
Out[47]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
EDIT
In [49]:
df = pd.DataFrame({
    'a': pd.Series([1, 1, 1, 2, 2, 2]),
    'b': pd.Series([1, 2, np.nan, 1, np.nan, 4]),
    'c': pd.Series([1, np.nan, np.nan, 1, np.nan, 4]),
    'd': pd.Series([np.nan, np.nan, np.nan, 1, np.nan, 4])
})
df
df
Out[49]:
a b c d
0 1 1 1 NaN
1 1 2 NaN NaN
2 1 NaN NaN NaN
3 2 1 1 1
4 2 NaN NaN NaN
5 2 4 4 4
In [50]:
df.fillna(df.groupby('a').transform('mean'))
Out[50]:
a b c d
0 1 1.0 1.0 NaN
1 1 2.0 1.0 NaN
2 1 1.5 1.0 NaN
3 2 1.0 1.0 1.0
4 2 2.5 2.5 2.5
5 2 4.0 4.0 4.0
You get all NaN for 'd' in group 1 because every value of 'd' in that group is NaN, so there is no group mean to fill with.
We first compute the group means, ignoring the missing values:
group_means = df.groupby('a')['b'].agg(lambda v: np.nanmean(v))
Next, we use groupby again, this time fetching the corresponding values:
df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))
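An equivalent hedged one-liner pushes the fill inside transform, so the per-group mean and the fill happen in one step:
df['b'] = df.groupby('a')['b'].transform(lambda s: s.fillna(s.mean()))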
