I have huge dataframes (millions, tens of millions of rows) with a lot of missing (NaN) values in the columns.
I need to count the windows of consecutive NaNs and their sizes, for every column, as fast as possible (my code is too slow).
Something like this: from here
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})
df
Out[65]:
a b c
0 1.0 NaN NaN
1 2.0 2.0 2.0
2 NaN 1.0 1.0
3 NaN 1.0 NaN
4 3.0 3.0 3.0
5 3.0 3.0 3.0
6 NaN NaN NaN
7 4.0 NaN NaN
8 NaN 2.0 2.0
9 NaN NaN 8.0
To here:
result
Out[61]:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
Here's one way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})
df_n = pd.DataFrame({'a':df['a'].isnull().values,
'b':df['b'].isnull().values,
'c':df['c'].isnull().values})
pr = {}
for column_name, _ in df_n.items():
    fst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(1).fillna(False)]
    lst = df_n.index[df_n[column_name] & ~df_n[column_name].shift(-1).fillna(False)]
    pr[column_name] = [j - i + 1 for i, j in zip(fst, lst)]
df_new = pd.DataFrame(pr)
Output:
a b c
0 2 1 1
1 1 2 1
2 2 1 2
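Since speed is the main concern, a NumPy-only sketch over the NaN mask may be worth trying on very large frames; nan_run_lengths is a hypothetical helper, not something from the answer above:
import numpy as np
import pandas as pd

def nan_run_lengths(col):
    # Pad the NaN mask with False on both sides so every run has a
    # detectable start and end, then diff to find the run boundaries.
    m = np.concatenate(([False], col.isna().to_numpy(), [False]))
    d = np.diff(m.astype(np.int8))
    starts = np.flatnonzero(d == 1)   # first index of each NaN run
    ends = np.flatnonzero(d == -1)    # one past the last index of each run
    return ends - starts              # run lengths

# Columns can have different numbers of runs, so build one Series per column.
result = pd.DataFrame({c: pd.Series(nan_run_lengths(df[c])) for c in df.columns})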
Try this one (example only for column a; do the same for the other columns):
>>> df = df.assign(a_count_sum=0)
>>> df.loc[df["a"].isna(), "a_count_sum"] = df.groupby(np.isnan(df.a)).cumcount() + 1
>>> df
a b c a_count_sum
0 1.0 NaN NaN 0
1 2.0 2.0 2.0 0
2 NaN 1.0 1.0 1
3 NaN 1.0 NaN 2
4 3.0 3.0 3.0 0
5 3.0 3.0 3.0 0
6 NaN NaN NaN 3
7 4.0 NaN NaN 0
8 NaN 2.0 2.0 4
9 NaN NaN 8.0 5
>>> res_1 = df["a_count_sum"][((df["a_count_sum"].shift(-1) == 0) | (np.isnan(df["a_count_sum"].shift(-1)))) & (df["a_count_sum"]!=0)]
>>> res_1
3 2
6 3
9 5
Name: a_count_sum, dtype: int64
>>> res_2 = (-res_1.shift(1).fillna(0)).astype(np.int64)
>>> res_2
3 0
6 -2
9 -3
Name: a_count_sum, dtype: int64
>>> res=res_1+res_2
>>> res
3 2
6 1
9 2
Name: a_count_sum, dtype: int64
This is what I have:
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[6,np.nan,np.nan,3,np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend non-missing values of B to missing values of B underneath, so I have:
A B C
0 1 6.0 6.0
1 2 NaN 6.0
2 3 NaN 6.0
3 4 3.0 3.0
4 5 NaN 3.0
I tried something like this, and it worked last night:
for i in df.index:
    df['C'][i] = np.where(pd.isnull(df['B'].iloc[i]), df['C'][i-1], df.B.iloc[i])
But when I woke up this morning it said it didn't recognize 'C.' I couldn't identify the conditions in which it worked and didn't work.
Thanks!
You could use the pandas fillna() method to forward-fill the missing values with the last non-null value. See the pandas documentation for more details.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [6, np.nan, np.nan, 3, np.nan]
})
df['C'] = df['B'].fillna(method='ffill')
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0
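Note that in recent pandas releases the method argument of fillna() is deprecated; on those versions the equivalent spelling would be:
df['C'] = df['B'].ffill()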
I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after the OP's edit)
You were close but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer (written against the dataframe in the original question, before the edit)
Use sort_values with the na_position='last' argument:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
2 3.0 c
4 5.0 e
5 6.0 f
7 8.0 h
3 NaN d
6 NaN g
This doesn't exist as a built-in in pandas yet; use Series.isna with Series.argsort to get the positions and reorder with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
.sort_values('tmp')
.drop('tmp', axis=1)
.reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
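On pandas 1.1 or later you could also push the NaN rows to the bottom directly with the key argument of sort_values; a small sketch (kind='stable' keeps the original order inside the NaN and non-NaN groups):
df1 = df1.sort_values('a', key=lambda s: s.isna(), kind='stable', ignore_index=True)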
I have a Pandas dataframe that looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Dummy_Var": [1]*12,
"B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
"C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7]})
B C Dummy_Var
0 6.0 4.1 1
1 143.3 23.2 1
2 143.3 23.2 1
3 143.3 23.2 1
4 3.0 4.3 1
5 4.0 2.5 1
6 93.9 7.8 1
7 93.9 7.8 1
8 93.9 2.0 1
9 2.0 7.0 1
10 2.0 7.0 1
11 7.0 7.0 1
Whenever the same number shows up three or more times in a row, that data should be replaced with NaN. So the result should be:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
I have written a function that does that:
def non_sense_remover(df, examined_columns, allowed_repeating):
    def count_each_group(grp, column):
        grp['Count'] = grp[column].count()
        return grp
    for col in examined_columns:
        sel = df.groupby((df[col] != df[col].shift(1)).cumsum()).apply(count_each_group, column=col)["Count"] > allowed_repeating
        df.loc[sel, col] = np.nan
    return df
df = non_sense_remover(df, ["B", "C"], 2)
However, my real dataframe has 2M rows and 18 columns! Running this function on 2M rows is very, very slow. Is there a more efficient way to do this? Am I missing something? Thanks in advance.
Constructing a boolean mask in this situation will be far more efficient than a solution based on apply(), particularly for large datasets. Here is an approach:
cols = df[['B', 'C']]
mask = (cols.shift(-1) == cols) & (cols.shift(1) == cols)
df[mask | mask.shift(1).fillna(False) | mask.shift(-1).fillna(False)] = np.nan
Edit:
For a more general approach that replaces runs of length N or more with NaN, you could do something like this:
from functools import reduce
from operator import or_, and_
def replace_sequential_duplicates_with_nan(df, N):
    mask = reduce(and_, [cols.shift(i) == cols.shift(i + 1)
                         for i in range(N - 1)])
    full_mask = reduce(or_, [mask.shift(-i).fillna(False)
                             for i in range(N)])
    df[full_mask] = np.nan
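A hypothetical usage sketch, assuming cols = df[['B', 'C']] is still defined as in the first snippet (the helper reads cols from the enclosing scope and modifies df in place):
cols = df[['B', 'C']]   # columns to examine, as in the mask example above
replace_sequential_duplicates_with_nan(df, N=3)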
We can use groupby + mask:
m=df[['B','C']]
df[['B','C']]=m.mask(m.apply(lambda x : x.groupby(x.diff().ne(0).cumsum()).transform('count'))>2)
df
Out[1245]:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
From this link, it appears that using apply/transform (in your case, apply) is causing the biggest bottleneck here. The link goes into much more detail about why this is and how to solve it.
I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply(), I cannot get it to ignore NaNs in the mean() operation:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do in order to both use .apply() and get the first output? My actual problem is more complicated and requires .apply(), but it boils down to this issue.
You can use np.nanmean()
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to process the NaNs explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
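As a side note, rolling().apply() also accepts raw=True (pandas 0.23+), which hands plain NumPy arrays to the function instead of Series and is usually faster; a small sketch:
xx.rolling(3, 1).apply(np.nanmean, raw=True)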
I know that the fillna() method can be used to fill NaNs in the whole dataframe.
df.fillna(df.mean()) # fill with mean of column.
How do I limit the mean calculation to the group (and the column) where the NaN is?
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': pd.Series([1,1,1,2,2,2]),
'b': pd.Series([1,2,np.NaN,1,np.NaN,4])
})
print(df)
Input
a b
0 1 1
1 1 2
2 1 NaN
3 2 1
4 2 NaN
5 2 4
Output (after grouping by 'a' and replacing NaN with the group mean)
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
IIUC, you can call fillna with the result of a groupby on 'a' and a transform on 'b':
In [44]:
df['b'] = df['b'].fillna(df.groupby('a')['b'].transform('mean'))
df
Out[44]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
If you have multiple NaN values then I think the following should work:
In [47]:
df.fillna(df.groupby('a').transform('mean'))
Out[47]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
EDIT
In [49]:
df = pd.DataFrame({
'a': pd.Series([1,1,1,2,2,2]),
'b': pd.Series([1,2,np.NaN,1,np.NaN,4]),
'c': pd.Series([1,np.NaN,np.NaN,1,np.NaN,4]),
'd': pd.Series([np.NaN,np.NaN,np.NaN,1,np.NaN,4])
})
df
Out[49]:
a b c d
0 1 1 1 NaN
1 1 2 NaN NaN
2 1 NaN NaN NaN
3 2 1 1 1
4 2 NaN NaN NaN
5 2 4 4 4
In [50]:
df.fillna(df.groupby('a').transform('mean'))
Out[50]:
a b c d
0 1 1.0 1.0 NaN
1 1 2.0 1.0 NaN
2 1 1.5 1.0 NaN
3 2 1.0 1.0 1.0
4 2 2.5 2.5 2.5
5 2 4.0 4.0 4.0
You get all NaN for 'd' in group 1 because every value of 'd' in that group is NaN, so there is no group mean to fill with.
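If you also want those all-NaN groups filled with something, one option (a sketch, not part of the answer above) is to chain a second fillna with the overall column means:
filled = df.fillna(df.groupby('a').transform('mean'))
# Fall back to the overall column mean where an entire group was NaN.
filled = filled.fillna(df.mean(numeric_only=True))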
We first compute the group means, ignoring the missing values:
group_means = df.groupby('a')['b'].agg(lambda v: np.nanmean(v))
Next, we use groupby again, this time filling each group's NaNs with its mean:
df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))
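Depending on your pandas version, groupby().apply() may prepend the group key to the result index here; if that happens, passing group_keys=False (a sketch under that assumption) keeps the original row index:
df_new = df.groupby('a', group_keys=False).apply(
    lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))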