Remove NaN/NULL columns in a Pandas dataframe? - python

I have a DataFrame in pandas and several of the columns have all null values. Is there a built-in function which will let me remove those columns?

Yes, dropna. See http://pandas.pydata.org/pandas-docs/stable/missing_data.html and the DataFrame.dropna docstring:
Definition: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing
Parameters
----------
axis : {0, 1}
how : {'any', 'all'}
any : if any NA values are present, drop that label
all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
Returns
-------
dropped : DataFrame
The specific command to run would be:
df = df.dropna(axis=1, how='all')
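As a quick sanity check, here is a minimal sketch on made-up data showing that only the all-NaN column is dropped:
import numpy as np
import pandas as pd

# Hypothetical frame: column "b" is entirely NaN, column "a" only partially
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, np.nan]})

df = df.dropna(axis=1, how='all')
print(df.columns.tolist())  # ['a'] -- the partially-filled column survives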

Another solution is to create a boolean dataframe with True at the non-null positions and then keep the columns that have at least one True value. This removes the columns whose values are all NaN.
df = df.loc[:, df.notna().any(axis=0)]
If you want to remove columns having at least one missing (NaN) value:
df = df.loc[:, df.notna().all(axis=0)]
This approach is particularly useful for removing columns containing empty strings, zeros, or basically any given value. For example,
df = df.loc[:, (df != '').all(axis=0)]
removes columns having at least one empty string.
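For illustration, a small sketch on made-up data showing the any/all masks side by side:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan], "c": [3, np.nan]})
print(df.loc[:, df.notna().any(axis=0)].columns.tolist())  # ['a', 'c'] -- drops the all-NaN column
print(df.loc[:, df.notna().all(axis=0)].columns.tolist())  # ['a'] -- drops every column with any NaN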

Here is a simple function which you can use directly by passing a dataframe and a threshold:
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep the columns whose missing percentage is below the threshold
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having more than %s percent missing values:" % threshold, dff.shape[1] - len(l))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l

rmissingvaluecol(df, 1)  # threshold is 1%: drop columns with 1% or more missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create a new dataframe excluding these columns:
l = rmissingvaluecol(df, 1)
df1 = df[l]
PS: you can change the threshold as per your requirement.
Bonus step
You can find the percentage of missing values for each column (optional):
def missing(dff):
    # Percentage of missing values per column, highest first
    print(round(dff.isnull().sum() * 100 / len(dff), 2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''

Function for removing all null columns from the data frame:
def Remove_Null_Columns(df):
    dff = pd.DataFrame()
    for cl in df.columns:  # the original iterated over an undefined name, fbinst
        if df[cl].isnull().sum() == len(df[cl]):
            pass  # every value is null: skip this column
        else:
            dff[cl] = df[cl]
    return dff
This function will remove all Null columns from the df.

Related

How to shift row values left in a dataframe to replace NaN

I have a huge dataframe with 40 columns (10 groups of 4 columns), with values in some groups and NaN in others. I want each row's values shifted left, so that wherever values are present in a row, the final DataFrame is filled Group 1 -> Group 2 -> Group 3 and so on.
(A sample dataframe and the required output were shown as images.)
I have used the code below to shift the values left. However, if a value is missing within an available group, e.g. Item 2 type-1 or Item 3 cat-2, the code ignores that and pulls in the value to its right, and so on.
v = df1.values
# One row index per cell, so fancy indexing keeps each row intact
a = [[n] * v.shape[1] for n in range(v.shape[0])]
# Stable argsort of the NaN mask: non-NaN positions first, original order preserved
b = pd.isnull(v).argsort(axis=1, kind='mergesort')
df2 = pd.DataFrame(v[a, b], df1.index, df1.columns)
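For reference, a minimal sketch on made-up data showing what the stable argsort trick above does: every NaN is pushed to the right of its row while the remaining values keep their order:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[np.nan, 1, np.nan, 2],
                    [3, np.nan, 4, np.nan]], columns=list('abcd'))
v = df1.values
a = [[n] * v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind='mergesort')
print(pd.DataFrame(v[a, b], df1.index, df1.columns))
#      a    b    c    d
# 0  1.0  2.0  NaN  NaN
# 1  3.0  4.0  NaN  NaN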
How to achieve this?
Thanks.

How to extract column title from a dataframe and add it to another dataframe?

My goal is to have the column titles from the small df added to an existing large dataframe without manually typing the names in.
This is the small dataframe.
veddra_term_code veddra_version veddra_term_name number_of_animals_affected accuracy
335 11 Emesis NaN NaN
142 11 Anaemia NOS NaN NaN
The large dataframe is similar to the above but has forty columns.
This is the code I used to extract the small dataframe from the dict:
df = pd.DataFrame(reaction for result in d['results'] for reaction in result['reaction']) #get reaction data
df
You can pass dataframe.reindex a list of columns consisting of the existing columns plus new ones. If a column does not yet exist in the dataframe, it will be filled with NaN.
Assume that df is the big dataframe you want to extend with columns. You can then create a new list of column names (columns_to_add) from your small dataframe and combine them. Then you call reindex on the big dataframe.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
existing_columns = df.columns.tolist()
columns_to_add = ["C", "D"] # or use small_df.columns.tolist()
new_columns = existing_columns + columns_to_add
df = df.reindex(columns = new_columns)
This will produce:
A B C D
0 1 2 NaN NaN
1 2 3 NaN NaN
2 3 4 NaN NaN
If you do not like NaN you can use a different value by passing the keyword fill_value
(e.g. df.reindex(columns=new_columns, fill_value=0)).
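For instance, rebuilding the small frame (df2 is just an illustrative name) and reindexing with fill_value=0 fills the new columns with zeros:
df2 = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
print(df2.reindex(columns=new_columns, fill_value=0))
#    A  B  C  D
# 0  1  2  0  0
# 1  2  3  0  0
# 2  3  4  0  0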
df.columns will give you an array of the names of the columns
import numpy as np

# loop over the small dataframe's headers
for i in small_df.columns:
    # if the large df doesn't have the header, create the header
    if i not in large_df.columns:
        # creates a new column with no data
        large_df.loc[:, i] = np.nan
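As a self-contained sketch, with toy frames standing in for the real small_df and large_df:
import numpy as np
import pandas as pd

large_df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
small_df = pd.DataFrame(columns=["B", "C", "D"])

for i in small_df.columns:
    if i not in large_df.columns:
        large_df.loc[:, i] = np.nan

print(large_df.columns.tolist())  # ['A', 'B', 'C', 'D']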

Is there a way to get the mean value of previous two columns in pandas?

I want to calculate the mean of the previous two columns and use it to fill the NaNs in my dataframe. There are only a few rows with missing values in the 2010-19 columns.
I tried using bfill and ffill, but they only capture the single previous or next row/column value to fill the NaN.
My example data set has 7 columns as below:
X 1990-2000 2000-2010 2010-19 1990-2000 2000-2010 2010-19
Hyderabad 10 20 NaN 1 3 NaN
The output I want:
X 1990-2000 2000-2010 2010-19 1990-2000 2000-2010 2010-19
Hyderabad 10 20 15 1 3 2
To use fillna row-wise in this way, an easy solution is to provide a pandas Series as the argument to fillna. This replaces NaN values by matching on the index.
Since the column names contain duplicates, the code below uses column indices. Assuming a dataframe called df:
col_indices = [3, 6]
for i in col_indices:
    # Mean of the two columns immediately to the left
    means = df.iloc[:, [i - 1, i - 2]].mean(axis=1)
    # Assign back rather than calling fillna(inplace=True) on an iloc slice,
    # which can silently operate on a copy
    df.iloc[:, i] = df.iloc[:, i].fillna(means)
This will fill the NaN values with the mean of the two columns to the left of each column in col_indices.
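A runnable sketch of the same idea, using the question's numeric values (the 'X' label column is left out so the dtypes stay numeric, which shifts the target indices to [2, 5]):
import numpy as np
import pandas as pd

df = pd.DataFrame([[10, 20, np.nan, 1, 3, np.nan]],
                  columns=["1990-2000", "2000-2010", "2010-19",
                           "1990-2000", "2000-2010", "2010-19"],
                  dtype=float)

for i in [2, 5]:  # positions of the two "2010-19" columns
    means = df.iloc[:, [i - 1, i - 2]].mean(axis=1)
    df.iloc[:, i] = df.iloc[:, i].fillna(means)

print(df.iloc[0].tolist())  # [10.0, 20.0, 15.0, 1.0, 3.0, 2.0]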

Pandas - drop row containing Nan and then drop any associated rows

I have a dataframe with 2 columns, 'age' and 'name', which looks like this (when opened in Notepad):
,age,name
0,18,Bill
1,22,Harry
2,Nan,Bill
4,5,William
(the first column is an index)
I need to drop any rows with Nan in the age column and also drop any rows which share a name with such a row. For example, in the snippet of my data frame, I would want to drop both rows with Bill, since one of Bill's ages contains Nan.
Currently I have this:
df_no_dups = dp[dp.isfinite(dp['age'])]
This handles the first part, but I am stuck on removing the other rows with the same name as a row containing Nan.
Any help would be great.
Filter by boolean indexing, with the mask created by transform to test whether all values per group are non-missing:
df1 = df[df['age'].notnull().groupby(df['name']).transform('all')]
Or check for missing values, test whether there is at least one True per group, and finally invert the boolean mask with ~:
df1 = df[~df['age'].isnull().groupby(df['name']).transform('any')]
print (df1)
age name
1 22.0 Harry
3 5.0 William
Detail:
print (df['age'].notnull())
0 True
1 True
2 False
3 True
Name: age, dtype: bool
print (df['age'].notnull().groupby(df['name']).transform('all'))
0 False
1 True
2 False
3 True
Name: age, dtype: bool
Try this:
df = df.drop_duplicates(subset=['name'], keep=False)
df = df[df['age'].notnull()]  # or df[df['age'] != 'Nan'] (as your input contains Nan as a string)
Explanation:
First remove the duplicates, passing keep=False so that all duplicated rows are removed. Then filter out the NaN ages.
Output:
age name
1 22 Harry
4 5 William
This works for me:
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.drop_duplicates(subset='name', keep=False)
df = df.dropna(subset=['age'])
Edit: this works for real null values; if Nan is a string, as pointed out by @Mohamed, then use the answer provided by him.

Pandas select all columns without NaN

I have a DF with 200 columns. Most of them are full of NaN's. I would like to select all columns with no NaN's, or at least with the minimum number of NaN's. I've tried to drop all with a threshold or with notnull(), but without success. Any ideas?
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1 col2 col3
23 45 NaN
54 39 NaN
NaN 45 76
87 32 NaN
The output should look like:
col1 col2
23 45
54 39
NaN 45
87 32
You can get this with:
df.dropna(axis=1, thresh=2)
You can select only the columns that are not entirely NaN using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% of their data null:
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values, and leaves columns with even one non-null value.
df.isnull() returns a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any() returns True for all columns with even one null. This is where I diverge from the accepted answer, as df.isnull().all() will not flag columns that contain even one non-null value!
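A quick sketch of the difference on the question's example data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [23, 54, np.nan, 87],
                   'col2': [45, 39, 45, 32],
                   'col3': [np.nan, np.nan, 76, np.nan]})

print(df.columns[~df.isnull().any()].tolist())  # ['col2'] -- no nulls at all
print(df.columns[~df.isnull().all()].tolist())  # ['col1', 'col2', 'col3'] -- no column here is all-null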
I assume that you want to get all the columns without any NaN. If that's the case, you can first get the names of the columns without any NaN using ~col.isnull().any(), then use those to select your columns.
I can think of the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [23, 54, np.nan, 87],
    'col2': [45, 39, 45, 32],
    'col3': [np.nan, np.nan, 76, np.nan],
})

# This function checks whether the column has more null values than the threshold
def has_nan(col, threshold=0):
    return col.isnull().sum() > threshold

# Then you apply the "complement" of the function to get the columns with
# no NaN.
df.loc[:, ~df.apply(has_nan)]

# ... or pass the threshold as a parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
You should try df_notnull = df.dropna(how='all')
This will keep only the rows that contain at least one non-null value.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
This worked quite well for me and can probably be tailored to your needs as well!
def nan_weed(df, thresh):
    # Collect the positions of columns whose NaN count is within the threshold
    ind = []
    for j in range(df.shape[1]):  # range(0, i-1) in the original skipped the last column
        # iloc is positional, so this works even when columns are not labeled 0..n
        if df.iloc[:, j].isnull().sum() <= thresh:
            ind.append(j)
    return df.iloc[:, ind]
I see a lot of answers on this thread about how to get rid of null values, which for my dataframes is never an option. We do not delete data. Ever.
I took the question as: how do you get just your null values to show? In my case I had to find latitude and longitude values and fill them in.
What I did for one column's nulls was this:
df[df['Latitude'].isnull()]
or to explain it out
dataframe[dataframe['Column you want'].isnull()]
This pulled up my whole data frame and all the missing values of latitude.
What did not work is this, and I couldn't explain why at the time. Trying to do two columns at the same time:
df[df[['Latitude','Longitude']].isnull()]
gave me all NaNs in the entire data frame. The reason is that indexing with a boolean DataFrame (rather than a boolean Series) masks the frame cell by cell, replacing every False position with NaN, instead of filtering rows.
So to do this all at once, what I added was the ID (in my case the ID for each row is the APN) along with the two columns I needed at the end:
df[df['Latitude'].isnull()][['APN','Latitude','Longitude']]
With this little hack I was able to get every ID I needed to add data to, out of 600,000+ rows of data to filter through. Then I did it again for longitude just to be sure I did not miss anything.
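If you do want to filter rows on several columns at once, a small sketch: reduce the boolean DataFrame to a boolean Series with any(axis=1) (the column names and values here are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'APN': [1, 2, 3],
                   'Latitude': [32.7, np.nan, 33.0],
                   'Longitude': [-117.1, -117.2, np.nan]})

# Rows where Latitude OR Longitude is missing
print(df[df[['Latitude', 'Longitude']].isnull().any(axis=1)])
#    APN  Latitude  Longitude
# 1    2       NaN     -117.2
# 2    3      33.0        NaN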
