In a DataFrame df, how can I find the columns that contain all NaN after grouping the rows?
In [97]: df
Out[97]:
a b group
0 NaN NaN a
1 0.0 NaN a
2 2.0 NaN a
3 1.0 7.0 b
4 1.0 3.0 b
5 7.0 4.0 b
6 2.0 6.0 c
7 9.0 6.0 c
8 3.0 0.0 c
9 9.0 0.0 c
In this case the desired output should be:
group: a - columns: b
First set_index by the grouping column, then find all NaNs with isnull.
Then groupby and aggregate with all. Last, reshape with stack and create a new DataFrame from the group and column names:
print (df.set_index('group').isnull().groupby('group').all())
a b
group
a False True
b False False
c False False
a = df.set_index('group').isnull().groupby('group').all().stack()
b = pd.DataFrame(a[a].index.values.tolist(), columns=['group','cols'])
print (b)
group cols
0 a b
Try this:
df.groupby('group').sum().unstack()[df.groupby('group').sum().unstack().isnull()].reset_index()
level_0 group 0
0 b a NaN
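A note not from the original answer: on newer pandas, sum() over an all-NaN group returns 0.0 rather than NaN, so the isnull() trick above only fires if NaN is preserved with min_count=1. A minimal sketch, assuming pandas >= 0.22:
# min_count=1 keeps the group sum as NaN when every value in the group is NaN
df.groupby('group').sum(min_count=1).isnull()
This is True only for group a, column b.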
Are you looking for this? i.e. get the group name and the columns that have all-NaN values:
vals = [(i['group'].iloc[0],i.columns[i.isnull().all()].tolist()) for _,i in df.groupby('group')]
Output:
[('a', ['b']), ('b', []), ('c', [])]
Related
I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, using the main dataframe's column as a reference. I have successfully arrived at my desired answer, except that I see duplicated columns from the main dataframe. Below are my present and expected answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to cleanup your dataframe
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
What about setting the 'Ref' column as the index while building the dataframe list (and resetting the index afterwards so that you get Ref back as a column)?
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
This is a reduction process. Instead of the list comprehension, use a for loop, or even reduce:
from functools import reduce
reduce(lambda x, y : x.merge(df1[y],left_on='Ref',right_on=y,how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y], left_on='Ref', right_on=y, how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Given two grouped dataframes (df_train & df_test), how do I fill the missing values of df_test using values derived from df_train? For this example, I used median.
df_train=pd.DataFrame({'col_1':['A','B','A','A','C','B','B','A','A','B','C'], 'col_2':[float('NaN'),2,1,3,1,float('NaN'),2,3,2,float('NaN'),1]})
df_test=pd.DataFrame({'col_1':['A','A','A','A','B','C','C','B','B','B','C'], 'col_2':[3,float('NaN'),1,2,2,float('NaN'),float('NaN'),3,2,float('NaN'),float('NaN')]})
# These are the median values derived from df_train which I would like to impute into df_test based on the column col_1.
values_used_in_df_train = df_train.groupby(by='col_1')['col_2'].median()
values_used_in_df_train
col_1
A 2.5
B 2.0
C 1.0
Name: col_2, dtype: float64
# For df_train, I can simply do the following:
df_train.groupby('col_1')['col_2'].transform(lambda x : x.fillna(x.median()))
I tried df_test.groupby('col_1')['col_2'].transform(lambda x : x.fillna(values_used_in_df_train)), but it does not work.
So I want:
df_test
col_1 col_2
0 A 3.0
1 A NaN
2 A 1.0
3 A 2.0
4 B 2.0
5 C NaN
6 C NaN
7 B 3.0
8 B 2.0
9 B NaN
10 C NaN
to become
df_test
col_1 col_2
0 A 3.0
1 A 2.5
2 A 1.0
3 A 2.0
4 B 2.0
5 C 1.0
6 C 1.0
7 B 3.0
8 B 2.0
9 B 2.0
10 C 1.0
Below are just my thoughts; you do not have to consider them, since they might be irrelevant/confusing.
I guess I could use an if-else method to match row-by-row against the index of values_used_in_df_train, but I am trying to achieve this within groupby.
Try merging df_test and values_used_in_df_train:
df_test=df_test.merge(values_used_in_df_train.reset_index(),on='col_1',how='left',suffixes=('','_y'))
Finally, fill the missing values using fillna():
df_test['col_2']=df_test['col_2'].fillna(df_test.pop('col_2_y'))
OR
Another way (if order is not important):
append df_test and values_used_in_df_train, then drop NaNs (note that DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions):
df_test=(df_test.append(values_used_in_df_train.reset_index())
.dropna(subset=['col_2'])
.reset_index(drop=True))
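A side note not taken from either answer: since values_used_in_df_train is a Series indexed by col_1, you can also map it onto df_test and fill in place, which keeps the original row order. A minimal sketch:
# Map each row's col_1 to its train median, then fill only the NaNs
df_test['col_2'] = df_test['col_2'].fillna(df_test['col_1'].map(values_used_in_df_train))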
I have the following two pandas dataframes:
df1
A B C
0 1 2 1
1 7 3 6
2 3 10 11
df2
A B C
0 2 0 2
1 8 4 7
A, B and C are the column headings of both dataframes.
I am trying to compare columns of df1 to columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any values in df1 outside the lower and upper bound (column wise) needs to be replaced with NaN.
So in this example the output should be:
A B C
0 nan 2 nan
1 7 3 6
2 3 nan nan
As a basic attempt I am trying df1[df1 < df2] = np.nan, but this does not work. I have also tried .where(), without any success.
Would appreciate some help here, thanks.
IIUC
df = df1.where(df1.ge(df2.iloc[0]) & df1.lt(df2.iloc[1]))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
Here is one with df.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
A slightly different approach using between:
df1.apply(lambda x: x.where(x.between(*df2.values, False)), axis=1)
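A caveat not in the original answer: since pandas 1.3, the inclusive argument of between takes a string rather than a boolean, so on newer versions an equivalent call would be something like:
df1.apply(lambda x: x.where(x.between(*df2.values, inclusive='neither')), axis=1)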
I'm trying to run what I think is simple code to eliminate any columns with all NaNs, but can't get this to work (the equivalent with axis=1 works just fine for eliminating rows):
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan,np.nan], 'b':[4,np.nan,6,np.nan], 'c':[np.nan, 8,9,np.nan], 'd':[np.nan,np.nan,np.nan,np.nan]})
df = df[df.notnull().any(axis = 0)]
print(df)
Full error:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
Expected output:
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
You need loc, because you are filtering by columns:
print (df.notnull().any(axis = 0))
a True
b True
c True
d False
dtype: bool
df = df.loc[:, df.notnull().any(axis = 0)]
print (df)
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
Or filter columns and then select by []:
print (df.columns[df.notnull().any(axis = 0)])
Index(['a', 'b', 'c'], dtype='object')
df = df[df.columns[df.notnull().any(axis = 0)]]
print (df)
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
Or use dropna with parameter how='all' to remove columns filled with NaNs only:
print (df.dropna(axis=1, how='all'))
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
You can use dropna with axis=1 and thresh=1:
In[19]:
df.dropna(axis=1, thresh=1)
Out[19]:
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
This will drop any column which doesn't have at least 1 non-NaN value, which means any column with all NaN will get dropped.
The reason what you tried failed is that the boolean mask:
In[20]:
df.notnull().any(axis = 0)
Out[20]:
a True
b True
c True
d False
dtype: bool
cannot be aligned on the index, which is what is used by default, as this produces a boolean mask on the columns.
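To see the mismatch concretely, a minimal sketch using the same df:
mask = df.notnull().any(axis=0)
print(mask.index)   # Index(['a', 'b', 'c', 'd'], dtype='object')
print(df.index)     # RangeIndex(start=0, stop=4, step=1); df[mask] tries to align these, hence the error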
I was facing the same issue while using a function in the fairlearn package. Resetting the index in place worked for me.
I came here because I tried to filter on the first 2 letters like this:
filtered = df[(df.Name[0:2] != 'xx')]
The fix was:
filtered = df[(df.Name.str[0:2] != 'xx')]
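For context, a minimal sketch (the column name Name comes from the snippet above; the sample data is made up) showing why the .str accessor matters:
df = pd.DataFrame({'Name': ['xxa', 'abc', 'xxy']})
print(df.Name[0:2])       # positional slice: the first two rows of the column
print(df.Name.str[0:2])   # .str slice: the first two characters of each value
filtered = df[df.Name.str[0:2] != 'xx']   # keeps only 'abc'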
I have a dictionary of pandas Series objects that I want to turn into a DataFrame. The key for each series should be the column heading. The individual series overlap, but each label is unique.
I thought I should be able to just do
df = pd.DataFrame(data)
But I keep getting the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I get the same error if I try to turn each series into a frame and use pd.concat(data, axis=1), which doesn't make sense if you take the column labels into account. What am I doing wrong, and how do I fix it?
I believe you need reset_index with parameter drop=True for each Series in a dict comprehension, because of duplicates in the index:
s = pd.Series([1,4,5,2,0], index=[1,2,2,3,5])
s1 = pd.Series([5,7,8,1],index=[1,2,3,4])
data = {'a':s, 'b': s1}
print (s.reset_index(drop=True))
0 1
1 4
2 5
3 2
4 0
dtype: int64
df = pd.concat({k:v.reset_index(drop=True) for k,v in data.items()}, axis=1)
print (df)
a b
0 1 5.0
1 4 7.0
2 5 8.0
3 2 1.0
4 0 NaN
If you need to drop rows with a duplicated index, use boolean indexing with duplicated:
print (s[~s.index.duplicated()])
1 1
2 4
3 2
5 0
dtype: int64
df = pd.concat({k:v[~v.index.duplicated()] for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.0 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
Another solution, aggregating duplicated index labels (here with mean):
print (s.groupby(level=0).mean())
1 1.0
2 4.5
3 2.0
5 0.0
dtype: float64
df = pd.concat({k:v.groupby(level=0).mean() for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.5 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN