Remove np.nan from a pd.DataFrame index - python

I have a dataframe, streets, with accident data for some streets.
I'd like to remove (or at least select) that first row that is indexed as np.nan. I tried streets.loc[np.nan,:] but that returns a KeyError: nan. I'm not sure how else to specifically select that record.
Other than using pd.DataFrame.iloc[0,:] (which is imprecise, as it relies on location rather than index name), how can I select that specific record?

I think there are two options.
You can fill the NaN in the index with a placeholder value and then select by it. Note that fillna with a value dict (e.g. value={'ON STREET NAME': 'random'}) targets columns, not the index, so fill the index directly:
streets.index = streets.index.fillna('random')
streets.loc['random', :]
Alternatively, you can assign another column as the index, but this can affect your dataframe later.

You can do df = df.dropna()
This will remove all rows with at least one NaN value in their columns. Note that dropna looks at the values, not the index, so for a NaN index label you still need one of the other approaches.
Optionally, you could also do df.dropna(inplace=True). The inplace parameter just means that you don't have to write df = df.dropna(); it modifies the original variable for you.
You can find more info on this here: pandas.DataFrame.dropna

I would do
df = df[df.index.notna()]
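For completeness, a minimal sketch with a made-up streets dataframe, showing both how to select the NaN-indexed row and how to drop it:

import numpy as np
import pandas as pd

streets = pd.DataFrame({'ACCIDENTS': [10, 3, 7]},
                       index=[np.nan, 'MAIN ST', 'BROADWAY'])

nan_row = streets[streets.index.isna()]    # select the row(s) indexed as NaN
streets = streets[streets.index.notna()]   # keep only rows with a real label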

Related

Access each row and check each column value in dataframe

I want to iterate over each row and check each column: if a value is NaN, I want to replace it with the previous non-null value in the same row.
I believe the preferred way would be a lambda function, but I still haven't figured out how to code it.
Note: I have thousands of rows and 200 columns.
The following should do the job:
df.fillna(method='ffill', axis=1, inplace=True)
(In recent pandas versions fillna(method=...) is deprecated; df = df.ffill(axis=1) is the equivalent.)
Can you please clarify what you want done with NaNs in the first column(s)?
I think you can use this:
your_df.apply(lambda x: x.fillna(method='ffill'), axis=1)
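A minimal sketch with made-up data showing the row-wise forward fill. Note the edge case raised above: a leading NaN in a row has no previous value and therefore stays NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1.0, np.nan, 3.0],
                   [4.0, np.nan, np.nan, 7.0]])

df = df.ffill(axis=1)  # fill each NaN with the previous value in the same row
# Row 0 becomes [NaN, 1.0, 1.0, 3.0] -- the leading NaN is left untouched.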

How to drop row from pandas data frame containing nan

I'm working in Python 3.x. I have a pandas data frame with only one column, student. At the 501st row, student contains NaN:
df.at[501, 'student'] returns nan
To remove this I used the following code:
df.at['student'].replace('', np.nan, inplace=True)
But after that I'm still getting NaN for df.at[501, 'student'].
I also tried dropping the row with inplace=True, but I'm using df in a for loop to check the value of student and apply some business logic, and now I'm getting KeyError: 501.
Can you suggest how I can remove the NaN and still use df in the for loop to check the student value?
Adding another answer since it's a completely different case.
I think you are not looping correctly over the dataframe: it seems you are relying on the dataframe's index when you should loop over the items row by row, or preferably use df.apply.
If you still want to loop over the items and you don't care about the previous index, you can reset the index with df.reset_index(drop=True):
df['student'] = df['student'].replace('', np.nan)
df = df.dropna(subset=['student'])
df = df.reset_index(drop=True)
# do your loop here
Your problem is that you are dropping the item at index 501 and then trying to access it; when you drop items, pandas doesn't automatically renumber the index.
The replace function you used replaces the first argument with the second.
If you want to replace the np.nan with an empty string, you have to do
df['student'].replace(np.nan, '', inplace=True)
but this would not remove the row, it would just replace the NaN with an empty string. What you want is
df = df.dropna(subset=['student'])
(note that df['student'].dropna(inplace=True) only drops from a copy of the column, not from the dataframe), and you have to do this before looping over the elements; don't call dropna in the loop.
It would be helpful to know what exactly you are doing in the loop.
One way to remove the rows that contain NaN values in the student column is
df = df[~df['student'].isnull()]

How do I dynamically update a column in pandas with the value of the column to its left?

I have a dataframe with a series of columns that contain boolean values, one column for each month of the year.
I'm trying to update the 2019.04_flag, 2019.05_flag, etc. columns with the last valid value. I know that I can use df['2019.04_flag'].fillna(df['2019.03_flag']), but I don't want to write 11 fillna lines. Is there a means of updating the values dynamically? I've tried fillna with the ffill method, but as you can see it doesn't propagate across the row.
I would look into the pandas fillna method; documentation is here: pandas.DataFrame.fillna. It has different methods for filling NaN -- I think "ffill" would suit your needs, since it fills each NaN with the last valid entry. Try the following:
df = df.fillna(method="ffill", axis=1)
Setting axis=1 performs the imputation across the columns, which I believe is the axis you want (a single row across columns).
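If only the monthly flag columns should be filled, here is a minimal sketch with made-up data, assuming the column names end in "_flag" as in the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   '2019.03_flag': [True, False],
                   '2019.04_flag': [np.nan, True],
                   '2019.05_flag': [np.nan, np.nan]})

flag_cols = [c for c in df.columns if c.endswith('_flag')]
# Forward-fill left to right within each row, but only across the flag
# columns, so other columns like 'id' never bleed into the flags.
df[flag_cols] = df[flag_cols].ffill(axis=1)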

Can't drop columns with pandas if index_col = 0 is used while reading CSVs [duplicate]

I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print efficiency
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, include the arguments index=False and index_col=False, respectively. An example follows.
To write:
df.to_csv(filename, index=False)
and to read from the csv (note that read_csv is a pandas function, not a DataFrame method):
pd.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
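A minimal round-trip sketch, with made-up data and filename:

import pandas as pd

df = pd.DataFrame({'Energy': [1.0, 2.0], 'Efficiency': [0.9, 0.8]})

df.to_csv('Efficiency_Data.csv', index=False)             # no index column written
df = pd.read_csv('Efficiency_Data.csv', index_col=False)  # none read back

energy = df['Energy']
efficiency = df['Efficiency']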
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index().
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
You can set one of the columns as the index, for example when it is an "id" column.
In this case the default index will be replaced by the column you have chosen.
df.set_index('id', inplace=True)
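A short usage sketch, with a made-up 'id' column as above:

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b'], 'value': [1, 2]})
df.set_index('id', inplace=True)  # 'id' becomes the index, replacing 0..n-1
print(df.loc['a', 'value'])       # rows can now be selected by id -> prints 1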
If your problem is the same as mine, where you just want to reset the column headers to 0 through the number of columns, do
df = pd.DataFrame(df.values)
EDIT:
This is not a good idea if you have heterogeneous data types. Better to just use
df.columns = range(len(df.columns))
You can specify which column is the index in your csv file by using the index_col parameter of the from_csv function.
If this doesn't solve your problem, please provide an example of your data.
One thing that I do is df = df.reset_index()
then df = df.drop(['index'], axis=1)
To avoid creating the default index column, you can set index_col to False and keep the header as zero. Here is an example of how you can do it:
recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)
The header=0 makes the first row of the sheet your column headers, which you can use later for calling the columns.
It works for me this way:
df = data.set_index("name of the column to use as the index")

Replace values in column based on condition, then return dataframe

I'd like to replace some values in the first column of a dataframe with a dummy value.
df[[0]].replace(["x"], ["dummy"])
The problem here is that the values in the first column are replaced, but not as part of the dataframe.
print(df)
yields the dataframe with the original data in column 1. I've tried
df[(df[[0]].replace(["x"], ["dummy"]))]
which doesn't work either..
replace returns a copy of the data by default, so you need to either assign the result back or pass inplace=True. Note also that df[[0]] (double brackets) returns a copy of the column, so calling inplace=True on it won't touch the original dataframe; use single brackets:
df[0].replace(["x"], ["dummy"], inplace=True)
or
df[0] = df[0].replace(["x"], ["dummy"])
See the docs: pandas.DataFrame.replace
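A minimal sketch with made-up data, showing the assign-back version actually updating the dataframe:

import pandas as pd

df = pd.DataFrame({0: ['x', 'y', 'x'], 1: [1, 2, 3]})
df[0] = df[0].replace(['x'], ['dummy'])  # assign back so df itself is updated
print(df)  # column 0 is now ['dummy', 'y', 'dummy']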
