How to remove consecutive bad data points in Pandas

How to remove consecutive bad data points in Pandas - python

I have a Pandas dataframe that looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Dummy_Var": [1]*12,
"B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
"C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7]})
B C Dummy_Var
0 6.0 4.1 1
1 143.3 23.2 1
2 143.3 23.2 1
3 143.3 23.2 1
4 3.0 4.3 1
5 4.0 2.5 1
6 93.9 7.8 1
7 93.9 7.8 1
8 93.9 2.0 1
9 2.0 7.0 1
10 2.0 7.0 1
11 7.0 7.0 1
Whenever the same numbers show up consecutively three times or more in a row, that data should be replaced with NAN. So the result should be:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1
I have written a function that does that:
def non_sense_remover(df, examined_columns, allowed_repeating):
def count_each_group(grp, column):
grp['Count'] = grp[column].count()
return grp
for col in examined_columns:
sel = df.groupby((df[col] != df[col].shift(1)).cumsum()).apply(count_each_group, column=col)["Count"] > allowed_repeating
df.loc[sel, col] = np.nan
return df
df = non_sense_remover(df, ["B", "C"], 2)
However, my real dataframe has 2M rows and 18 columns! It is very very slow to run this function on 2M rows. Is there a more efficient way to do this? Am I missing something? Thanks in advance.

Constructing a boolean mask in this situation will be far more efficient than a solution based on apply(), particularly for large datasets. Here is an approach:
cols = df[['B', 'C']]
mask = (cols.shift(-1) == cols) & (cols.shift(1) == cols)
df[mask | mask.shift(1).fillna(False) | mask.shift(-1).fillna(False)] = np.nan
Edit:
For a more general approach, replacing sequences of length N with NaN, you could do something like this:
from functools import reduce
from operator import or_, and_
def replace_sequential_duplicates_with_nan(df, N):
mask = reduce(and_, [cols.shift(i) == cols.shift(i + 1)
for i in range(N - 1)])
full_mask = reduce(or_, [mask.shift(-i).fillna(False)
for i in range(N)])
df[full_mask] = np.nan

We using groupby + mask
m=df[['B','C']]
df[['B','C']]=m.mask(m.apply(lambda x : x.groupby(x.diff().ne(0).cumsum()).transform('count'))>2)
df
Out[1245]:
B C Dummy_Var
0 6.0 4.1 1
1 NaN NaN 1
2 NaN NaN 1
3 NaN NaN 1
4 3.0 4.3 1
5 4.0 2.5 1
6 NaN 7.8 1
7 NaN 7.8 1
8 NaN 2.0 1
9 2.0 NaN 1
10 2.0 NaN 1
11 7.0 NaN 1

From this link, it appears that using apply/transform (in your case, apply) is causing the biggest bottleneck here. The link I referenced goes into much more detail about why this is and how to solve it

Related

Extending the Value of non-Missing Cells to Subsequent Rows in Pandas

This is what I have:
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[6,np.nan,np.nan,3,np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend non-missing values of B to missing values of B underneath, so I have:
A B C
0 1 6.0 6.0
1 2 NaN NaN
2 3 NaN NaN
3 4 3.0 3.0
4 5 NaN NaN
I tried something like this, and it worked last night:
for i in df.index:
df['C'][i]=np.where(pd.isnull(df['B'].iloc[i]),df['C'][i-1],df.B.iloc[i])
But when I woke up this morning it said it didn't recognize 'C.' I couldn't identify the conditions in which it worked and didn't work.
Thanks!

You could use pandas fillna() method to forward fill the missing values with the last non-null value. See the pandas documentation for more details.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [6, np.nan, np.nan, 3, np.nan]
})
df['C'] = df['B'].fillna(method='ffill')
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0

Count NaNs windows (and their size) in DataFrame columns

I have HUGE dataframes (milions, tens) and lot of missing (NaNs) values along columns.
I need to count the windows of NaNs and their size, for every column, in the fastest way possible (my code is too slow).
Something like this: frome here
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2, np.nan, np.nan,3,3,np.nan,4,np.nan,np.nan],\
'b':[np.nan, 2, 1, 1, 3, 3, np.nan, np.nan,2, np.nan],\
'c':[np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan,2, 8]})
df
Out[65]:
a b c
0 1.0 NaN NaN
1 2.0 2.0 2.0
2 NaN 1.0 1.0
3 NaN 1.0 NaN
4 3.0 3.0 3.0
5 3.0 3.0 3.0
6 NaN NaN NaN
7 4.0 NaN NaN
8 NaN 2.0 2.0
9 NaN NaN 8.0
To here:
result
Out[61]:
a b c
0 2 1 1
1 1 2 1
2 2 1 2

Here's one way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2, np.nan, np.nan,3,3,np.nan,4,np.nan,np.nan],\
'b':[np.nan, 2, 1, 1, 3, 3, np.nan, np.nan,2, np.nan],\
'c':[np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan,2, 8]})
df_n = pd.DataFrame({'a':df['a'].isnull().values,
'b':df['b'].isnull().values,
'c':df['c'].isnull().values})
pr={}
for column_name, _ in df_n.iteritems():
fst = df_n.index[df_n[column_name] & ~ df_n[column_name].shift(1).fillna(False)]
lst = df_n.index[df_n[column_name] & ~ df_n[column_name].shift(-1).fillna(False)]
pr[column_name] = [j-i+1 for i, j in zip(fst, lst)]
df_new=pd.DataFrame(pr)
Output:
a b c
0 2 1 1
1 1 2 1
2 2 1 2

Try this one (example only for a - do analogically for other columns):
>>> df=df.assign(a_count_sum=0)
>>> df["a_count_sum"][np.isnan(df["a"])]=df.groupby(np.isnan(df.a)).cumcount()+1
>>> df
a b c a_count_sum
0 1.0 NaN NaN 0
1 2.0 2.0 2.0 0
2 NaN 1.0 1.0 1
3 NaN 1.0 NaN 2
4 3.0 3.0 3.0 0
5 3.0 3.0 3.0 0
6 NaN NaN NaN 3
7 4.0 NaN NaN 0
8 NaN 2.0 2.0 4
9 NaN NaN 8.0 5
>>> res_1 = df["a_count_sum"][((df["a_count_sum"].shift(-1) == 0) | (np.isnan(df["a_count_sum"].shift(-1)))) & (df["a_count_sum"]!=0)]
>>> res_1
3 2
6 3
9 5
Name: a_count_sum, dtype: int64
>>> res_2 = (-res_1.shift(1).fillna(0)).astype(np.int64)
>>> res_2
3 0
6 -2
9 -3
Name: a_count_sum, dtype: int64
>>> res=res_1+res_2
>>> res
3 2
6 1
9 2
Name: a_count_sum, dtype: int64

pandas - add missing rows on the basis of column values to have linspace

I have a pandas dataframe like
a b c
0 0.5 10 7
1 1.0 6 6
2 2.0 1 7
3 2.5 6 -5
4 3.5 9 7
and I would like to fill the missing columns with respect to the column 'a' on the basis of a certain step. In this case, given the step of 0.5, I would like to fill the 'a' column with the missing values, that is 1.5 and 3.0, and set the other columns to null, in order to obtain the following result.
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
Which is the cleanest way to do this with pandas or other libraries like numpy or scipy?
Thanks!

Create array by numpy.arange, then create index by set_index and last reindex with reset_index:
step= .5
idx = np.arange(df['a'].min(), df['a'].max() + step, step)
df = df.set_index('a').reindex(idx).reset_index()
print (df)
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0

One simple way to achieve that is to first create the index you want and then merge the remaining of the information on it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.5, 1, 2, 2.5, 3.5],
'b': [10, 6, 1, 6, 9],
'c': [7, 6, 7, -5, 7]})
ls = np.arange(df.a.min(), df.a.max(), 0.5)
new_df = pd.DataFrame({'a':ls})
new_df = new_df.merge(df, on='a', how='left')

pandas rolling apply to allow nan

I have a very simple Pandas Series:
xx = pd.Series([1, 2, np.nan, np.nan, 3, 4, 5])
If I run this I get what I want:
>>> xx.rolling(3,1).mean()
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
But if I have to use .apply() I cannot get it to work by ignoring NaNs in the mean() operation:
>>> xx.rolling(3,1).apply(np.mean)
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
>>> xx.rolling(3,1).apply(lambda x : np.mean(x))
0 1.0
1 1.5
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
What should I do in order to both use .apply() and have the result in the first output? My actual problem is more complicated that I have to use .apply() to realize but it boils down to this issue.

You can use np.nanmean()
xx.rolling(3,1).apply(lambda x : np.nanmean(x))
Out[59]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64
If you have to process the nans explicitly, you can do:
xx.rolling(3,1).apply(lambda x : np.mean(x[~np.isnan(x)]))
Out[94]:
0 1.0
1 1.5
2 1.5
3 2.0
4 3.0
5 3.5
6 4.0
dtype: float64

Missing data, insert rows in Pandas and fill with NAN

I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NAN NAN
4 2.0 NAN NAN
5 2.5 NAN NAN
6 3.0 NAN NAN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is,that the gap in A varies from dataset to dataset in position and length...

set_index and reset_index are your friends.
df = DataFrame({"A":[0,0.5,1.0,3.5,4.0,4.5], "B":[1,4,6,2,4,3], "C":[3,2,1,0,5,3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index, here the missing data is filled in with nans. We use the Index object since we can name it; this will be used in the next step.
In [66]: new_index = Index(arange(0,5,0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3

Using the answer by EdChum above, I created the following function
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
return df\
.merge(how='right', on=field,
right = pd.DataFrame({field:np.arange(range_from, range_to, range_step)}))\
.sort_values(by=field).reset_index().fillna(fill_with).drop(['index'], axis=1)
Example usage:
fill_missing_range(df, 'A', 0.0, 4.5, 0.5, np.nan)

In this case I am overwriting your A column with a newly generated dataframe and merging this to your original df, I then resort it:
In [177]:
df.merge(how='right', on='A', right = pd.DataFrame({'A':np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)})).sort(columns='A').reset_index().drop(['index'], axis=1)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange function which takes a start and end value, note I added 0.5 to the end as ranges are open closed, and pass a step value.
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df.reset_index(inplace=True)
df['A'] = df['index']
df.drop(['A'], axis=1, inplace=True)
df.reset_index().drop(['level_0'], axis=1)
Out[197]:
index B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A but don't drop it and then reindex the df using the arange function.

This question was asked a long time ago, but I have a simple solution that's worth mentioning. You can simply use NumPy's NaN. For instance:
import numpy as np
df[i,j] = np.NaN
will do the trick.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to remove consecutive bad data points in Pandas - python

From this link, it appears that using apply/transform (in your case, apply) is causing the biggest bottleneck here. The link I referenced goes into much more detail about why this is and how to solve it

Related

Extending the Value of non-Missing Cells to Subsequent Rows in Pandas

Count NaNs windows (and their size) in DataFrame columns

pandas - add missing rows on the basis of column values to have linspace

pandas rolling apply to allow nan

Missing data, insert rows in Pandas and fill with NAN

Categories

Resources