Unable to remove rows which have more than 2 NaN - python

I have a df which has NaN in it, I tried running df.dropna() the it drops all rows which have NaN value but when I try using df.dropna(thresh=2) then nothing happens ,no row gets deleted. Can someone please explain me why is this occurring.
This is how I have changed the values to NaN
:
for col in df.columns:
df.loc[df[col] == '?', col] = np.nan
It is the the pic with total number of rows and Columns
second pic when I used df.dropna()
third pic is after using df.dropna(thresh=2)

thresh=2 says the row must have at least 2 valid / non NaN values otherwise delete that row.
In given screen shots there are 13 column.
So, to remove rows which have more than 2 Nan the thresh should be thresh = 11.
In this way Pandas will move all the rows where it finds more than 2 NaN
Hope this helps!

As per my knowledge, thresh works for integer values. it will respond if you have only integer value is there and rest are NaN and thresh should be greater than 1. eq.
import pandas as pd
data = [{'id':1,'name' : 'John'},
{'id':2, 'name' : 'Aaron', 'phone':43242123213,'age':32},
{'id':3, 'name':'alan' }
]
df = pd.DataFrame(data)
OUTPUT:
age id name phone
0 NaN 1 John NaN
1 32.0 2 Aaron 4.324212e+10
2 NaN 3 Alan NaN
>> df.dropna(thresh=2)
It will not work but if I remove the name 'Alan' at index 2, It will respond and delete 3rd row 3 as only integer value id left.

Related

Pandas ffill on section of DataFrame

I am attempting to forward fill a filtered section of a DataFrame but it is not working the way I hoped.
I have df that look like this:
Col Col2
0 1 NaN
1 NaN NaN
2 3 string
3 NaN string
I want it to look like this:
Col Col2
0 1 NaN
1 NaN NaN
2 3 string
3 3 string
This my current code:
filter = (df["col2"] == "string")
df.loc[filter, "col"].fillna(method="ffill", inplace=True)
But my code does not change the df at all. Any feedback is greatly appreciated
We can use boolean indexing to filter the section of Col where Col2 = 'string' then forward fill and update the values only in that section
m = df['Col2'].eq('string')
df.loc[m, 'Col'] = df.loc[m, 'Col'].ffill()
Col Col2
0 1.0 NaN
1 NaN NaN
2 3.0 string
3 3.0 string
I am not sure I understand your question but if you want to fill the NAN values or any values you should use the Simple imputer
from sklearn.impute import SimpleImputer
Then you can define an imputer that fills these missing values/NAN with a specific strategy. For example if you want to fill these values with the mean of all the column you can write it as follows:
imputer=SimpleImputer(missing_values=np.nan, strategy= 'mean')
Or you can write it like this if you have the NaN as string
imputer=SimpleImputer(missing_values="NaN", strategy= 'mean')
and if you want to fill it with a specific values you can do this:
imputer=SimpleImputer(missing_values=np.nan, strategy= 'constant', fill_value = "YOUR VALUE")
Then you can use it like that
df[["Col"]]=imputer.fit_transform(df[["Col"]])

Adding a value at the end of a column in a multindex column dataframe

I have a simple problem that probably has a simple solution but I couldn't found it anywhere. I have the following multindex column Dataframe:
mux = pd.MultiIndex.from_product(['A','B','C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific subcolumn. For example add one value at the end of column A sub-column Datetime and leave the rest of the row as it is, then add another value to column B sub-column Str and again leave the rest of the values in the same row untouched and so on. So my questions are: Is it possible to target individual locations in this type of Dataframes? How? and also Is it possible to append not a full row but an individual value always at the end after the previous value without knowing where the end is?. Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean for example when I have different number of values in different columns (for example 6 values in column A-Str but only 4 in column B-Datetime) but I donĀ“t really know. In that case what I need is to add the next value in that column after the last one so I need to know the index of the last non Nan value of that particular column so I can use it in your answer because if I use len(dfr) while trying to add value to the column that only has 4 values it will end up in the 7th row instead of the 5th row, this is because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenient function append_to_col to append values inplace in your dataframe:
def append_to_col(col, val):
idx = dfr[col].last_valid_index()
dfr.loc[idx+1 if idx is not None else 0, col] = val
# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN

Add new column with column names of a table, based on conditions [duplicate]

I have a dataframe as below:
I want to get the name of the column if column of a particular row if it contains 1 in the that column.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there is multiple 1 per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
Firstly
Your question is very ambiguous and I recommend reading this link in #sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
(df == 1) # mask
.any(axis=0) # mask
]
What's happening? Lets work our way outward starting from within df.columns[**HERE**] :
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this example to automate for your solution below
Next:
Automate to get an output of (<row index> ,[<col name>, <col name>,..]) where there is 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and creates a dict with lists of df's as values
df_dict = dict(
list(
df.groupby(df.index)
)
)
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items(): # k: name of index, v: is a df
check = v.columns[(v == 1).any()]
if len(check) > 0:
print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting column name are dividing in 2 sections.
If you want in a new column name then condition should be unique because it will only give 1 col name for each row.
data = {'foo':[0,0,3,0], 'bar':[0, 5, 0, 0], 'baz':[0,0,2,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data)
df=df.replace(0,np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you were looking for min or maximum
max= df.idxmax(1)
min = df.idxmin(1)
out= df.assign(max=max , min=min)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
2nd case, If your condition is satisfied in multiple columns for example you are looking for columns that contain 1 and you are looking for list because its not possible to adjust in same dataframe.
str_con= df.astype(str).apply(lambda x:x.str.contains('1.0',case=False, na=False)).any()
df.column[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or you are looking for numerical condition columns contains value more than 1
num_con = df.apply(lambda x:x>1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these col has higher value than 1
Happy learning

pandas dropna() only if in first row NaN value

I have a dataframe like the following
df = [[1,'NaN',3],[4,5,'Nan'],[7,8,9]]
df = pd.DataFrame(df)
and I would like to remove all columns that have in their first row a NaN value.
So the output should be:
df = [[1,3],[4,'Nan'],[7,9]]
df = pd.DataFrame(df)
So in this case, only the second column is removed since the first element was a NaN value.
Hence, dropna() is based on a condition.. any idea how to handle this? Thx!
If values are np.nan and not string NaN(else replace them), you can do:
Input:
df = [[1,np.nan,3],[4,5,np.nan],[7,8,9]]
df = pd.DataFrame(df)
Solution:
df.loc[:,df.iloc[0].notna()] #assign back to your desired variable
0 2
0 1 3.0
1 4 NaN
2 7 9.0

Pandas select all columns without NaN

I have a DF with 200 columns. Most of them are with NaN's. I would like to select all columns with no NaN's or at least with the minimum NaN's. I've tried to drop all with a threshold or with notnull() but without success. Any ideas.
df.dropna(thresh=2, inplace=True)
df_notnull = df[df.notnull()]
DF for example:
col1 col2 col3
23 45 NaN
54 39 NaN
NaN 45 76
87 32 NaN
The output should look like:
df.dropna(axis=1, thresh=2)
col1 col2
23 45
54 39
NaN 45
87 32
You can create with non-NaN columns using
df = df[df.columns[~df.isnull().all()]]
Or
null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)
If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% data as null
cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
df[df.columns[~df.isnull().any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.
df[df.columns[~df.isnull().all()]] only removes the columns that have nothing but null values and leaves columns with even one non-null value.
df.isnull() will return a dataframe of booleans with the same shape as df. These bools will be True if the particular value is null and False if it isn't.
df.isnull().any() will return True for all columns with even one null. This is where I'm diverging from the accepted answer, as df.isnull().all() will not flag columns with even one value!
I assume that you wan't to get all the columns without any NaN. If that's the case, you can first get the name of the columns without any NaN using ~col.isnull.any(), then use that your columns.
I can think in the following code:
import pandas as pd
df = pd.DataFrame({
'col1': [23, 54, pd.np.nan, 87],
'col2': [45, 39, 45, 32],
'col3': [pd.np.nan, pd.np.nan, 76, pd.np.nan,]
})
# This function will check if there is a null value in the column
def has_nan(col, threshold=0):
return col.isnull().sum() > threshold
# Then you apply the "complement" of function to get the column with
# no NaN.
df.loc[:, ~df.apply(has_nan)]
# ... or pass the threshold as parameter, if needed
df.loc[:, ~df.apply(has_nan, args=(2,))]
Here is a simple function which you can use directly by passing dataframe and threshold
df
'''
pets location owner id
0 cat San_Diego Champ 123.0
1 dog NaN Ron NaN
2 cat NaN Brick NaN
3 monkey NaN Champ NaN
4 monkey NaN Veronica NaN
5 dog NaN John NaN
'''
def rmissingvaluecol(dff,threshold):
l = []
l = list(dff.drop(dff.loc[:,list((100*(dff.isnull().sum()/len(dff.index))>=threshold))].columns, 1).columns.values)
print("# Columns having more than %s percent missing values:"%threshold,(dff.shape[1] - len(l)))
print("Columns:\n",list(set(list((dff.columns.values))) - set(l)))
return l
rmissingvaluecol(df,1) #Here threshold is 1% which means we are going to drop columns having more than 1% of missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create new dataframe excluding these columns
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change threshold as per your requirement
Bonus step
You can find the percentage of missing values for each column (optional)
def missing(dff):
print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
you should try df_notnull = df.dropna(how='all')
This will get you only non null rows.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
null_series = df.isnull().sum() # The number of missing values from each column in your dataframe
full_col_series = null_series[null_series == 0] # Will keep only the columns with no missing values
df = df[full_col_series.index]
This worked for me quite well and probably tailored for your need as well!
def nan_weed(df,thresh):
ind = []
i = df.shape[1]
for j in range(0,i-1):
if df[j].isnull().sum() <= thresh:
ind.append(j)
return df[ind]
I see a lot of how to get rid of null values on this thread. Which for my dataframes is never the case. We do not delete data. Ever.
I took the question as how to get just your null values to show, and in my case I had to find latitude and longitude and fill them in.
What I did was this for one column nulls:
df[df['Latitude'].isnull()]
or to explain it out
dataframe[dataframe['Column you want'].isnull()]
This pulled up my whole data frame and all the missing values of latitude.
What did not work is this and I can't explain why. Trying to do two columns at the same time:
df[df[['Latitude','Longitude']].isnull()]
That will give me all NANs in the entire data frame.
So to do this all at once what I added was the ID, in my case my ID for each row is APNs, with the two columns I needed at the end
df[df['Latitude'].isnull()][['APN','Latitude','Longitude']]
By doing this little hack I was able to get every ID I needed to add data too for 600,000+ rows of data to filter for. Then did it again for longitude just to be sure I did not miss anything.

Categories

Resources