I want to remove rows that are empty in both the NymexPlus and NymexMinus columns.
Right now the code I have is:
df.dropna(subset=['NymexPlus'], inplace=True)
The problem is that this also removes rows that still have a value in NymexMinus, which I don't want.
Is there an if/and condition that only drops rows where both columns are empty?
Use a list as subset parameter and how='all':
df.dropna(subset=['NymexPlus', 'NymexMinus'], how='all', inplace=True)
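To make the difference between how='all' and the default how='any' concrete, here is a minimal sketch. The sample values are invented for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "NymexPlus": [1.0, np.nan, np.nan, 4.0],
    "NymexMinus": [0.5, 2.5, np.nan, np.nan],
})

# how="all": drop a row only when every column listed in subset is NaN.
both_missing_dropped = df.dropna(subset=["NymexPlus", "NymexMinus"], how="all")

# how="any" (the default): drop a row when at least one listed column is NaN.
any_missing_dropped = df.dropna(subset=["NymexPlus", "NymexMinus"], how="any")

print(both_missing_dropped.index.tolist())
print(any_missing_dropped.index.tolist())
```

With how='all', only the row where both columns are NaN is removed; rows with a value in either column are kept.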
df.dropna(subset=[column_name], inplace=True)
I have a dataframe with missing values in a column (column_name). For example, one value is the empty string, ''. But the code above doesn't drop the rows with such empty values.
Isn't this the right way to do it?
The code does not drop those rows because the values are not NA; they are empty strings. For those you would have to do something like:
rows_to_drop = df[df[column_name]==''].index
df.drop(rows_to_drop, inplace=True)
Alternative:
Something like this would also work:
df = df.loc[df[column_name]!='',:]
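If a column can hold real missing values, empty strings, and whitespace-only strings at the same time, one possible approach is to build a single boolean mask and filter on it. This is only a sketch: the column name "name" and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical data: one empty string, one whitespace-only string, one None.
df = pd.DataFrame({
    "name": ["a", "", "  ", None, "b"],
    "value": [1, 2, 3, 4, 5],
})
column_name = "name"

# True where the cell is a string that is empty after stripping whitespace.
is_blank = df[column_name].str.strip().eq("")
# True where the cell is an actual missing value (None/NaN).
is_missing = df[column_name].isna()

# Keep only rows that are neither blank nor missing.
cleaned = df[~(is_blank | is_missing)]
print(cleaned)
```

Only the rows containing "a" and "b" survive, which a plain dropna would not achieve for the two blank strings.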
I have a large dataset where I need to remove a sizeable chunk of columns, so I want to view the list of columns I have and their indices and then pass them in to a drop command with slice:
df.drop(df.columns[25:100], axis=1, inplace=True)
However I need to first see the indices for all the columns. On another dataset I was able to see this using df.info(), but on this occasion I just see a summary of the data.
Can someone advise how to do this or an alternative way?
Perhaps this could help:
{k:i for i,k in enumerate(df.columns)}
This will produce a dictionary of each column and its index.
Additionally, if you want to query the index of specific columns:
[list(df.columns).index(col) for col in COLUMN_NAMES]
Where COLUMN_NAMES are the columns you want to find.
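As a rough sketch of the whole workflow (list the positions, look one up by name, then drop a slice), assuming a small hypothetical dataframe with columns a to e:

```python
import pandas as pd

# Hypothetical empty dataframe; only the column labels matter here.
df = pd.DataFrame(columns=["a", "b", "c", "d", "e"])

# A Series whose index is the position and whose value is the column name.
positions = pd.Series(df.columns)
print(positions)

# Position of a single column, looked up by name.
idx_c = df.columns.get_loc("c")

# Drop the columns at positions 1, 2 and 3 (the slice end is exclusive).
trimmed = df.drop(columns=df.columns[1:4])
print(list(trimmed.columns))
```

df.drop returns a new dataframe here, so the original df is left untouched until you reassign it.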
A quick question. Given a dataframe, df, from the Boston Homes dataset, I'm trying to do EDA on a few of the columns, stored in a variable feature_cols, which I could then use to check for NA values. How would one go about this? I have the following, which is throwing an error:
This is what I was hoping to try to do after the above:
Any feedback would be greatly appreciated. Thanks in advance.
There are two problems in your pictures. The first is a KeyError: to access a subset of a dataframe's columns, you need to pass the column names in a list, not a tuple, so the first line should be
feature_cols = df[['RM','ZN','B']]
However, this returns a dataframe with three columns, and the for loop you want to run on it will not work. Looping is rarely needed for this in pandas; you can use this one-liner:
df.isna().sum()
This prints every column name in the dataframe along with the number of missing values in that column. If you only want to check a subset of columns, replace df with df[list_of_columns_names].
To access multiple columns, you only need to store the column names in a list, for example
feature_cols = ['RM','ZN','B']
and then access them as
x = df[feature_cols]
To iterate over the columns of df, you can use
for column in df[feature_cols]:
    print(df[column])  # or anything else
As per your updated comment, if your end goal is only to see the (non-)null counts, you can do that without looping, e.g.
df[feature_cols].info(verbose=True, show_counts=True)
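Putting the pieces together, here is a small sketch of counting missing values for a chosen subset of columns without a loop. The values are made up, and the tiny dataframe is only a stand-in for the Boston data; the column names RM, ZN and B come from the answers above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Boston housing data.
df = pd.DataFrame({
    "RM": [6.5, np.nan, 7.1],
    "ZN": [18.0, 0.0, np.nan],
    "B": [396.9, np.nan, np.nan],
    "AGE": [65.2, 78.9, 61.1],
})

# Store only the column names, as a list.
feature_cols = ["RM", "ZN", "B"]

# Select the subset, flag missing cells, and sum the flags per column.
null_counts = df[feature_cols].isna().sum()
print(null_counts)
```

The result is a Series indexed by column name, so null_counts["B"] gives the count for a single column.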
I want to iterate over each row and check, in each column, whether the value is NaN. If it is, I want to replace it with the previous non-null value in the same row.
I believe the preferred way would be a lambda function, but I still can't figure out how to code it.
Note: I have thousands of rows and 200 columns in each row
The following should do the work:
df = df.ffill(axis=1)
Can you please clarify what you want to be done with NaNs in first column(s)?
I think you can use this:
your_df.apply(lambda x: x.ffill(), axis=1)
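A minimal sketch of a row-wise forward fill, which also shows what happens to a NaN in the first column. The column names c0 to c2 and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 starts with a NaN, so there is nothing to its left.
df = pd.DataFrame({
    "c0": [1.0, np.nan],
    "c1": [np.nan, 5.0],
    "c2": [3.0, np.nan],
})

# axis=1 propagates the last valid value from left to right within each row.
filled = df.ffill(axis=1)
print(filled)
```

A NaN in the first column stays NaN because no earlier value exists in that row, so it needs separate handling if that matters for your data.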
I'm working on Python 3.x. I have a pandas dataframe with only one column, student. At row 501, student contains nan
df.at[501,'student'] returns nan
To remove this I used the following code
df.at['student'].replace('', np.nan, inplace=True)
But after that I'm still getting nan for df.at[501,'student']
I also tried this
df.at['student'].replace('', np.nan, inplace=True)
But I'm using df in a for loop to check the value of student and apply some business logic, and with inplace=True I get KeyError: 501
Can you suggest how I can remove the nan and still use df in a for loop to check the student value?
Adding another answer since it's a completely different case.
I think you are not looping over the dataframe correctly. It seems you are looping by relying on the dataframe's index, when you should probably loop over the items row by row or, preferably, use df.apply.
If you still want to loop on the items and you don't care about the previous index, you can reset the index with df.reset_index(drop=True)
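import numpy as np
# turn empty strings into NaN, assigning back instead of modifying a selected column in place
df['student'] = df['student'].replace('', np.nan)
# drop the rows of df (not just of the column) where student is NaN
df = df.dropna(subset=['student'])
df = df.reset_index(drop=True)
# do your loop here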
Your problem is that you drop the item at index 501 and then try to access it; when you drop rows, pandas does not renumber the index automatically.
The replace function replaces its first argument with its second.
If you want to replace np.nan with an empty string, you have to do
df['student'] = df['student'].replace(np.nan, '')
but this would not remove the row; it would just replace the value with an empty string. What you want is
df.dropna(subset=['student'], inplace=True)
but you have to do this before looping over the elements; don't call dropna inside the loop.
It would be helpful to know what exactly you are doing in the loop.
One way to remove the rows that contain NaN values in the "student" column is
df = df[~df['student'].isnull()]
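For completeness, a sketch of the full sequence: clean the column, drop the missing rows, renumber the index, and only then loop. The sample names and the upper-casing step are placeholders for the real business logic, which the question does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing one empty string and one NaN.
df = pd.DataFrame({"student": ["ann", "", np.nan, "bob"]})

# Treat empty strings as missing, then drop every row where student is missing.
df["student"] = df["student"].replace("", np.nan)
df = df.dropna(subset=["student"]).reset_index(drop=True)

# Iterate over rows without relying on the original index labels.
names = []
for row in df.itertuples(index=False):
    names.append(row.student.upper())  # placeholder for the real logic

print(names)
```

Because the loop walks the rows themselves instead of looking up labels such as 501, a dropped row can no longer cause a KeyError.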