Remove columns with more than N consecutive NaNs - python

I have a DataFrame with around 1000 columns; some columns have 0 NaNs, some have 3, some have 400.
I want to remove every column that contains a run of consecutive NaNs longer than some threshold N; the remaining columns I will impute by taking the mean of the nearest neighbors.
df
ColA | ColB | ColC | ColD | ColE
NaN 5 3 NaN NaN
NaN 6 NaN 4 4
NaN 7 4 4 4
NaN 5 5 NaN NaN
NaN 5 4 NaN 4
NaN 3 3 NaN 3
threshold = 2
remove_consecutive_nan(df,threshold)
Which would return
ColB | ColC | ColE
5 3 NaN
6 NaN 4
7 4 4
5 5 NaN
5 4 4
3 3 3
How would I write the remove_consecutive_nan function?

You can create per-column groups in which each run of consecutive missing values shares one group id, then count the size of each run separately and finally filter out the columns whose longest run exceeds the threshold:
def remove_consecutive_nan(df, threshold):
    m = df.notna()
    mask = m.cumsum().mask(m).apply(pd.Series.value_counts).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]

print(remove_consecutive_nan(df, 2))
ColB ColC ColE
0 5 3.0 NaN
1 6 NaN 4.0
2 7 4.0 4.0
3 5 5.0 NaN
4 5 4.0 4.0
5 3 3.0 3.0
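To see what the mask computes, here is a minimal single-column sketch (the Series is purely illustrative): m.cumsum().mask(m) gives every NaN position the id of the gap it belongs to, so value_counts then yields the length of each gap.
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 4, np.nan, np.nan, np.nan, 3])
m = s.notna()
groups = m.cumsum().mask(m)             # NaN positions share the id of their gap
print(groups.tolist())                  # [0.0, nan, 1.0, 1.0, 1.0, nan]
print(groups.value_counts().to_dict())  # {1.0: 3, 0.0: 1} -> longest gap is 3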
An alternative that counts the length of each run of missing values directly:
def remove_consecutive_nan(df, threshold):
    m = df.isna()
    b = m.cumsum()
    mask = b.sub(b.mask(m).ffill().fillna(0)).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]
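To see how the run lengths are computed, here is the same illustrative single-column sketch: b holds the running NaN count, and subtracting its value as of the last valid entry resets the counter to 0 after every non-NaN.
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 4, np.nan, np.nan, np.nan, 3])
m = s.isna()
b = m.cumsum()                             # total NaNs seen so far
runs = b.sub(b.mask(m).ffill().fillna(0))  # length of the current NaN run
print(runs.tolist())                       # [1.0, 0.0, 1.0, 2.0, 3.0, 0.0]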


Concatenate columns skipping pasted rows and columns

I hope I describe what I need well. I have a data frame in which several groups share the same column names, plus another column ('ID') that works as an index. The data frame looks as follows:
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df
Out[21]:
ID X Y
0 1 1 1
1 1 2 2
2 1 3 3
3 1 4 4
4 1 5 5
5 2 2 2
6 2 3 3
7 2 4 4
8 3 1 5
9 3 3 4
10 3 4 3
11 3 5 2
My intention is to keep X as an index or as a regular column (it doesn't matter) and append a Y column for each 'ID', as shown in the outputs below.
You can try:
out = pd.concat([group.rename(columns={'Y': f'Y{name}'}) for name, group in df.groupby('ID')])
out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
print(out)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
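Note that stripping the trailing digits leaves three columns that all share the label Y; that matches the desired display, but duplicate labels make out['Y'] return all three columns at once, so do any name-based selection before the rename. As an alternative sketch (not part of the original answer), DataFrame.pivot on the existing row index builds the same block-diagonal layout in one step:
import pandas as pd

df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                   'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                   'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})

# Each ID value becomes its own column holding that group's Y values.
out = df[['ID', 'X']].join(df.pivot(columns='ID', values='Y'))
out.columns = ['ID', 'X'] + ['Y'] * df['ID'].nunique()
print(out)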
Here's another way to do it:
df_org = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                       'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                       'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df = df_org.drop(columns='Y')
for i in set(df_org['ID']):
    col = 'Y' + str(i)
    df1 = df_org.loc[df_org['ID'] == i, ['Y']].rename(columns={'Y': col})
    df = pd.concat([df, df1], axis=1)
df.columns = df.columns.str.replace(r'\d+$', '', regex=True)
print(df)
Output:
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Another solution could be as follows.
Get the unique values of column ID (stored in array s).
Use np.transpose to repeat column ID n times (n == len(s)) and compare the resulting array with s.
Use np.where to fill the True positions with values from df.Y and the False positions with NaN.
Finally, drop the original df.Y and rename the new columns as required.
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                   'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                   'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})

s = df.ID.unique()
df[s] = np.where(np.transpose([df.ID]*len(s)) == s,
                 np.transpose([df.Y]*len(s)),
                 np.nan)
df.drop('Y', axis=1, inplace=True)
df.rename(columns={k: 'Y' for k in s}, inplace=True)
print(df)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
If performance is an issue, this method should be faster than the concat-based approaches above, especially as the number of unique values for ID increases.
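The original answer linked to a specific comparison target; as a rough, hedged sketch (the sizes and random data are arbitrary), one way to check the claim yourself is to time both approaches on a frame with many distinct IDs:
import timeit
import numpy as np
import pandas as pd

n_ids = 500
big = pd.DataFrame({'ID': np.repeat(np.arange(n_ids), 4),
                    'X': np.tile(np.arange(4), n_ids),
                    'Y': np.random.rand(n_ids * 4)})

def numpy_way(df):
    # The np.where broadcasting approach from this answer.
    s = df.ID.unique()
    out = df[['ID', 'X']].copy()
    out[list(s)] = np.where(np.transpose([df.ID] * len(s)) == s,
                            np.transpose([df.Y] * len(s)),
                            np.nan)
    return out

def concat_way(df):
    # The groupby/concat approach from the first answer.
    return pd.concat([g.rename(columns={'Y': f'Y{k}'})
                      for k, g in df.groupby('ID')])

print(timeit.timeit(lambda: numpy_way(big), number=10))
print(timeit.timeit(lambda: concat_way(big), number=10))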

Create dataframe with hierarchical indices and extra columns from non-hierarchically indexed dataframe

Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do this with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)]  # keep only the 'a' labels
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like @Ben.T I am using MultiIndex.from_product:
(x.assign(l='a')
   .set_index('l', append=True)
   .unstack()
   .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a', 'b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN

Pandas Matrix Addition/Division between index values and column header values

I have the following data frame and I want to add each index value to each column header and divide by 2.
The initial grid would like this:
grid = pd.DataFrame(columns=[1, 2, 3, 4, 5],
                    index=[1, 2, 3, 4, 5])
grid
This results in the following:
1 2 3 4 5
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
What I'm looking to get is shown below. For example, index 1 plus column header 3 gives 4, which divided by 2 is 2:
1 2 3 4 5
1 1 1.5 2 2.5 3
2 1.5 2 2.5 3 3.5
3 2 2.5 3 3.5 4
4 2.5 3 3.5 4 4.5
5 3 3.5 4 4.5 5
You can use broadcasting here:
grid[:] = (grid.index.values[:,None] + grid.columns.values)/2
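Equivalently, NumPy's outer addition makes the broadcast explicit; a minimal sketch of the same idea:
import numpy as np
import pandas as pd

grid = pd.DataFrame(columns=[1, 2, 3, 4, 5], index=[1, 2, 3, 4, 5])
# np.add.outer builds the 5x5 matrix of pairwise index+column sums.
grid[:] = np.add.outer(grid.index.values, grid.columns.values) / 2
print(grid)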

Function to determine window in a rolling function

I have a dataframe in which I want to apply a rolling mean over a column of numbers that come in repeated runs of three, where only 4 unique values should go into each mean.
Lets say my dataframe looks like:
Group Column to roll
1 9
2 5
2 5
2 4
2 4
2 4
2 3
2 3
2 3
2 6
2 6
2 6
2 8
Since I want 4 unique values to go into the mean, all with equal weight and all from within the same group, my expected output would be:
Group Output
1 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 nan
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (6+3+4+5)/4
2 (8+6+3+4)/4
Any ideas how to do this?
You could try something like this (written before the Group column was added to the question, so it ignores groups):
df['Column to roll'].drop_duplicates().rolling(4).mean().reindex(df.index).ffill()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 4.50
9 4.50
10 4.50
11 5.25
Name: Column to roll, dtype: float64
Edit: the question changed to include the Group column:
df_out = df.groupby('Group')['Column to roll']\
.apply(lambda x: x.drop_duplicates().rolling(4).mean()).rename('Output')
df.set_index('Group',append=True).swaplevel(0,1)\
.join(df_out, how='left').ffill().reset_index(level=1, drop=True)
Output:
Column to roll Output
Group
1 9 NaN
2 5 NaN
2 5 NaN
2 4 NaN
2 4 NaN
2 4 NaN
2 3 NaN
2 3 NaN
2 3 NaN
2 6 4.50
2 6 4.50
2 6 4.50
2 8 5.25
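To unpack how this works: the groupby/apply computes the rolling mean only at the first occurrence of each value within a group (the rows kept by drop_duplicates); joining on the (Group, row) MultiIndex places each mean at that first occurrence, and the trailing ffill propagates it across the duplicated rows that follow.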

Find observations in which both columns are NaN and replace them with 0 in pandas DataFrame

Here is a dataframe
a b c d
nan nan 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
nan nan 2 3
I want to replace the values in columns 'a' and 'b' with 0 only in the rows where both of them are NaN. In rows 1 and 5, both 'a' and 'b' are NaN, so only those rows should get 0s in those two columns.
so my output must be
a b c d
0 0 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
0 0 2 3
There might be an easier built-in function in Pandas, but this one should work (df.ix has been removed from pandas, so use df.loc and assign through it directly, which leaves the other rows untouched):
df.loc[np.isnan(df.a) & np.isnan(df.b), ['a', 'b']] = 0
Actually, the solution from @Psidom below is much easier to read.
You can create a boolean series based on the conditions on columns a/b, and then use loc to modify corresponding columns and rows:
df.loc[df[['a','b']].isnull().all(1), ['a','b']] = 0
df
# a b c d
#0 0.0 0.0 3 5
#1 NaN 1.0 2 3
#2 1.0 NaN 4 5
#3 2.0 3.0 7 9
#4 0.0 0.0 2 3
Or:
df.loc[df.a.isnull() & df.b.isnull(), ['a','b']] = 0
