Fill non-consecutive missing values with consecutive numbers - python

For a given data frame...
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5], [1., np.nan], [5, 3], [6.5, 3.], [2, np.nan]])
that looks like this...
     0    1
0  1.0  6.5
1  1.0  NaN
2  5.0  3.0
3  6.5  3.0
4  2.0  NaN
...I want to create a third column where all missing values in the second column are replaced with consecutive numbers. The result should look like this:
     0    1    2
0  1.0  6.5  NaN
1  1.0  NaN    1
2  5.0  3.0  NaN
3  6.5  3.0  NaN
4  2.0  NaN    2
(my data frame has many more rows; imagine 70 missing values in the second column, so that the last number in the third column would be 70)
How can I create the 3rd column?

You can do it this way. I took the liberty of renaming the columns to avoid confusion about what is being selected; you can do the same with your dataframe using:
data = data.rename(columns={0:'a',1:'b'})
In [41]:
data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
Out[41]:
     a    b    c
0  1.0  6.5  NaN
1  1.0  NaN    1
2  5.0  3.0  NaN
3  6.5  3.0  NaN
4  2.0  NaN    2
[5 rows x 3 columns]
Some explanation of the one-liner:
# we want just the rows where column 'b' is null:
data[data.b.isnull()]
# now construct a range of that length, starting from 1:
range(1, len(data[data.b.isnull()]) + 1)  # note we have to add 1 at the end
# construct a new dataframe from this and, crucially, use the index of the null rows:
pd.DataFrame({'c': range(1, len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index)
# now perform a left merge using both sides' indices; the verbose dataframe
# construction is replaced with new_df here, but you get the point:
data.merge(new_df, how='left', left_index=True, right_index=True)
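Written out as runnable steps, it looks like this (a sketch; null_idx, new_df and result are names introduced here for illustration):
# index of the rows where 'b' is null
null_idx = data[data.b.isnull()].index
# counter 1..n aligned to those rows only
new_df = pd.DataFrame({'c': range(1, len(null_idx) + 1)}, index=null_idx)
# left merge on the indices; rows without a null in 'b' get NaN in 'c'
result = data.merge(new_df, how='left', left_index=True, right_index=True)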
Edit
You can also do it another way, using @Karl.D's suggestion:
In [56]:
data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
data
Out[56]:
     a    b    c
0  1.0  6.5  NaN
1  1.0  NaN    1
2  5.0  3.0  NaN
3  6.5  3.0  NaN
4  2.0  NaN    2
[5 rows x 3 columns]
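To see why this works, here are the intermediate values on the sample frame (a sketch of the same idea, step by step):
mask = data['b'].isnull()        # True at rows 1 and 4
counter = mask.cumsum()          # running count of nulls: 0, 1, 1, 1, 2
data['c'] = counter.where(mask)  # keep the count only at the null rows, NaN elsewhere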
The timings also suggest that Karl's method would be faster for larger datasets, though I would profile this:
In [57]:
%timeit data.merge(pd.DataFrame({'c':range(1,len(data[data.b.isnull()]) + 1)}, index=data[data.b.isnull()].index),how='left', left_index=True, right_index=True)
%timeit data['c'] = data['b'].isnull().cumsum().where(data['b'].isnull())
1000 loops, best of 3: 1.31 ms per loop
1000 loops, best of 3: 501 µs per loop
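The sample frame here has only five rows, so these timings mostly measure fixed overhead; to compare at scale you could rerun both lines on a larger frame, e.g. (a sketch, with an arbitrary size and null fraction):
n = 10**6
big = pd.DataFrame({'a': np.random.rand(n), 'b': np.random.rand(n)})
big.loc[big.sample(frac=0.1, random_state=0).index, 'b'] = np.nan  # ~10% nulls
%timeit big['c'] = big['b'].isnull().cumsum().where(big['b'].isnull())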

Related

Forward fill on custom value in pandas dataframe

I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous filled value.
In my case, I would like to perform a forward fill not on NaN but on a specific value (say "*").
Here's an example:
import pandas as pd
import numpy as np

d = [{"a": 1, "b": 10},
     {"a": 2, "b": "*"},
     {"a": 3, "b": "*"},
     {"a": 4, "b": "*"},
     {"a": np.nan, "b": 50},
     {"a": 6, "b": 60},
     {"a": 7, "b": 70}]
df = pd.DataFrame(d)
with df being
     a   b
0  1.0  10
1  2.0   *
2  3.0   *
3  4.0   *
4  NaN  50
5  6.0  60
6  7.0  70
The expected result should be
     a   b
0  1.0  10
1  2.0  10
2  3.0  10
3  4.0  10
4  NaN  50
5  6.0  60
6  7.0  70
If I replace "*" with np.nan and then ffill, that would also apply the fill to column a.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each contains "*", then replacing and forward filling.
You can use df.mask together with df.isin and df.replace:
df.mask(df.isin(['*']), df.replace('*', np.nan).ffill())
     a   b
0  1.0  10
1  2.0  10
2  3.0  10
3  4.0  10
4  NaN  50
5  6.0  60
6  7.0  70
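Broken into pieces, each call contributes the following (a sketch; is_star and filled are names introduced here):
is_star = df.isin(['*'])                  # True wherever a cell holds "*"
filled = df.replace('*', np.nan).ffill()  # forward-filled copy (this also fills the original NaNs)
df.mask(is_star, filled)                  # take the filled value only at the "*" cells,
                                          # so the original NaN in column a survives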
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, applying ffill, and then putting the original NaN values back:
df = df.replace(np.NaN, "<special>").replace("*", np.NaN).ffill().replace("<special>", np.NaN)
output:
     a     b
0  1.0  10.0
1  2.0  10.0
2  3.0  10.0
3  4.0  10.0
4  NaN  50.0
5  6.0  60.0
6  7.0  70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.NaN).ffill()
df[original_nan] = np.NaN

Compare two pandas dataframes and replace value based on condition

I have the following two pandas dataframes:
df1

   A   B   C
0  1   2   1
1  7   3   6
2  3  10  11

df2

   A  B  C
0  2  0  2
1  8  4  7
Where A, B and C are the column headings of both dataframes.
I am trying to compare the columns of df1 to the columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any values in df1 outside the lower and upper bound (column-wise) need to be replaced with NaN.
So in this example the output should be:
     A    B    C
0  NaN    2  NaN
1    7    3    6
2    3  NaN  NaN
As a first attempt I tried df1[df1 < df2] = np.nan, but this does not work. I have also tried .where() but without success.
Would appreciate some help here, thanks.
IIUC:
df = df1.where(df1.ge(df2.iloc[0]) & df1.le(df2.iloc[1]))
     A    B    C
0  NaN  2.0  NaN
1  7.0  3.0  6.0
2  3.0  NaN  NaN
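Decomposed, the condition reads like this (a sketch; lower_ok and upper_ok are names introduced here):
lower_ok = df1.ge(df2.iloc[0])       # at or above the lower-bound row, column-wise
upper_ok = df1.le(df2.iloc[1])       # at or below the upper-bound row, column-wise
df = df1.where(lower_ok & upper_ok)  # NaN wherever either check fails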
You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
     A    B    C
0  NaN  2.0  NaN
1  7.0  3.0  6.0
2  3.0  NaN  NaN
Here is one with df.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
     A    B    C
0  NaN  2.0  NaN
1  7.0  3.0  6.0
2  3.0  NaN  NaN
A slightly different approach, using between:
df1.apply(lambda x:x.where(x.between(*df2.values, False)), axis=1)
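One caveat: the positional False here is the inclusive argument, which older pandas accepted as a bool; recent versions (1.3+) expect a string instead, so an equivalent modern spelling would be something like:
df1.apply(lambda x: x.where(x.between(*df2.values, inclusive='neither')), axis=1)
(or inclusive='both' to keep values that sit exactly on a bound, matching the other answers here).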

pandas - add missing rows on the basis of column values to have linspace

I have a pandas dataframe like
     a   b  c
0  0.5  10  7
1  1.0   6  6
2  2.0   1  7
3  2.5   6 -5
4  3.5   9  7
and I would like to fill in the missing rows with respect to column 'a' on the basis of a given step. In this case, with a step of 0.5, I would like to fill column 'a' with the missing values, 1.5 and 3.0, and set the other columns to null, obtaining the following result:
     a     b    c
0  0.5  10.0  7.0
1  1.0   6.0  6.0
2  1.5   NaN  NaN
3  2.0   1.0  7.0
4  2.5   6.0 -5.0
5  3.0   NaN  NaN
6  3.5   9.0  7.0
What is the cleanest way to do this with pandas or other libraries like numpy or scipy?
Thanks!
Create the new values with numpy.arange, move column 'a' into the index with set_index, reindex against the new values, and finally restore the column with reset_index:
step = 0.5
idx = np.arange(df['a'].min(), df['a'].max() + step, step)
df = df.set_index('a').reindex(idx).reset_index()
print (df)
     a     b    c
0  0.5  10.0  7.0
1  1.0   6.0  6.0
2  1.5   NaN  NaN
3  2.0   1.0  7.0
4  2.5   6.0 -5.0
5  3.0   NaN  NaN
6  3.5   9.0  7.0
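Note that reindex matches the generated labels against the existing 'a' values by exact equality. That is safe for a step of 0.5, which is exactly representable in binary floating point, but for a step like 0.1 accumulated error can make labels miss; one defensive option is to round the generated labels (and, if needed, the 'a' column) first (a sketch, assuming a hypothetical step of 0.1):
step = 0.1
idx = np.round(np.arange(df['a'].min(), df['a'].max() + step, step), 1)
df = df.set_index('a').reindex(idx).reset_index()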
One simple way to achieve this is to first create the index you want and then merge the remaining information onto it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.5, 1, 2, 2.5, 3.5],
                   'b': [10, 6, 1, 6, 9],
                   'c': [7, 6, 7, -5, 7]})
ls = np.arange(df.a.min(), df.a.max() + 0.5, 0.5)  # + 0.5 so the maximum value is included
new_df = pd.DataFrame({'a': ls})
new_df = new_df.merge(df, on='a', how='left')

Fill NaN values in dataframe with previous values in column

Hi, I have a dataframe with some missing values, for example:
The black numbers 40 and 50 are the values already entered, and the red ones are to be auto-filled from the previous values. Row 2 is blank, as there is no previous number to fill from.
Any idea how I can do this efficiently? I was trying loops, but maybe there is a better way.
It can be done easily with the ffill method in pandas fillna.
To illustrate how it works, consider the following sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['Vals'] = [1, 2, 3, np.nan, np.nan, 6, 7, np.nan, 8]
   Vals
0   1.0
1   2.0
2   3.0
3   NaN
4   NaN
5   6.0
6   7.0
7   NaN
8   8.0
To fill the missing values, do this:
df['Vals'].fillna(method='ffill', inplace=True)
   Vals
0   1.0
1   2.0
2   3.0
3   3.0
4   3.0
5   6.0
6   7.0
7   7.0
8   8.0
There is a direct synonym function for this, pandas.DataFrame.ffill:
df['Vals'].ffill(inplace=True)
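In recent pandas versions, chained in-place calls like this can operate on a copy (and fillna(method='ffill') is deprecated), so the assignment form is the more future-proof spelling:
df['Vals'] = df['Vals'].ffill()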

How to remove consecutive bad data points in Pandas

I have a Pandas dataframe that looks like:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Dummy_Var": [1] * 12,
                   "B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
                   "C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7]})
        B     C  Dummy_Var
0     6.0   4.1          1
1   143.3  23.2          1
2   143.3  23.2          1
3   143.3  23.2          1
4     3.0   4.3          1
5     4.0   2.5          1
6    93.9   7.8          1
7    93.9   7.8          1
8    93.9   2.0          1
9     2.0   7.0          1
10    2.0   7.0          1
11    7.0   7.0          1
Whenever the same number shows up three or more times in a row, that data should be replaced with NaN. So the result should be:
      B    C  Dummy_Var
0   6.0  4.1          1
1   NaN  NaN          1
2   NaN  NaN          1
3   NaN  NaN          1
4   3.0  4.3          1
5   4.0  2.5          1
6   NaN  7.8          1
7   NaN  7.8          1
8   NaN  2.0          1
9   2.0  NaN          1
10  2.0  NaN          1
11  7.0  NaN          1
I have written a function that does that:
def non_sense_remover(df, examined_columns, allowed_repeating):
    def count_each_group(grp, column):
        grp['Count'] = grp[column].count()
        return grp
    for col in examined_columns:
        sel = df.groupby((df[col] != df[col].shift(1)).cumsum()) \
                .apply(count_each_group, column=col)["Count"] > allowed_repeating
        df.loc[sel, col] = np.nan
    return df
df = non_sense_remover(df, ["B", "C"], 2)
However, my real dataframe has 2M rows and 18 columns, and running this function on it is very slow. Is there a more efficient way to do this? Am I missing something? Thanks in advance.
Constructing a boolean mask in this situation will be far more efficient than a solution based on apply(), particularly for large datasets. Here is an approach:
cols = df[['B', 'C']]
mask = (cols.shift(-1) == cols) & (cols.shift(1) == cols)
df[mask | mask.shift(1).fillna(False) | mask.shift(-1).fillna(False)] = np.nan
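Applied to the sample frame, the same logic can also be spelled with mask instead of boolean-indexed assignment (a sketch; the inner mask flags the middle elements of each run of three or more, and the shifted ORs extend it to the run endpoints):
cols = df[['B', 'C']]
middle = (cols.shift(-1) == cols) & (cols.shift(1) == cols)
full = middle | middle.shift(1).fillna(False) | middle.shift(-1).fillna(False)
df[['B', 'C']] = cols.mask(full)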
Edit:
For a more general approach, replacing runs of N or more identical values with NaN, you could do something like this:
from functools import reduce
from operator import or_, and_

def replace_sequential_duplicates_with_nan(df, columns, N):
    sub = df[columns]
    # True at the last element of every window of N equal consecutive values
    mask = reduce(and_, [sub.shift(i) == sub.shift(i + 1)
                         for i in range(N - 1)])
    # spread the flag backwards so every element of the run is covered
    full_mask = reduce(or_, [mask.shift(-i).fillna(False)
                             for i in range(N)])
    df[full_mask] = np.nan
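A quick usage check on the sample frame (assuming the signature above, with the column list passed in explicitly):
df = pd.DataFrame({"Dummy_Var": [1] * 12,
                   "B": [6, 143.3, 143.3, 143.3, 3, 4, 93.9, 93.9, 93.9, 2, 2, 7],
                   "C": [4.1, 23.2, 23.2, 23.2, 4.3, 2.5, 7.8, 7.8, 2, 7, 7, 7]})
replace_sequential_duplicates_with_nan(df, ['B', 'C'], 3)
print(df)  # runs of 3+ identical values in B and C are now NaN; Dummy_Var is untouched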
Using groupby + mask:
m = df[['B', 'C']]
df[['B', 'C']] = m.mask(m.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).transform('count')) > 2)
df
Out[1245]:
      B    C  Dummy_Var
0   6.0  4.1          1
1   NaN  NaN          1
2   NaN  NaN          1
3   NaN  NaN          1
4   3.0  4.3          1
5   4.0  2.5          1
6   NaN  7.8          1
7   NaN  7.8          1
8   NaN  2.0          1
9   2.0  NaN          1
10  2.0  NaN          1
11  7.0  NaN          1
From this link, it appears that using apply/transform (in your case, apply) is causing the biggest bottleneck here. The link I referenced goes into much more detail about why this is and how to solve it.
