I'm trying to remove a row from my data frame in which one of the columns has a value of null. Most of the help I can find relates to removing NaN values which hasn't worked for me so far.
Here I've created the data frame:
# successfully created data frame
df1 = ut.get_data(symbols, dates) # column heads are 'SPY', 'BBD'
# can't get rid of row containing null val in column BBD
# tried each of these with the others commented out, but I always got an
# error, or sometimes a new column of boolean values, when all I want
# is to drop the row
df1 = pd.notnull(df1['BBD']) # drops rows with null val, not working
df1 = df1.drop(2010-05-04, axis=0)
df1 = df1[df1.'BBD' != null]
df1 = df1.dropna(subset=['BBD'])
df1 = pd.notnull(df1.BBD)
# I know the date to drop but still wasn't able to drop the row
df1.drop([2015-10-30])
df1.drop(['2015-10-30'])
df1.drop([2015-10-30], axis=0)
df1.drop(['2015-10-30'], axis=0)
with pd.option_context('display.max_rows', None):
    print(df1)
Here is my output:
Can someone please tell me how I can drop this row, ideally both by identifying the row through its null value and by dropping it by date?
I haven't been working with pandas very long and I've been stuck on this for an hour. Any advice would be much appreciated.
This should do the work:
df = df.dropna(how='any',axis=0)
It will erase every row (axis=0) that has "any" Null value in it.
EXAMPLE:
# Recreate a random DataFrame with NaN values
import pandas as pd
import numpy as np

df = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-01-10', freq='1d'))
# Average speed in miles per hour
df['A'] = np.random.randint(low=198, high=205, size=len(df.index))
df['B'] = np.random.random(size=len(df.index))*2
#Create dummy NaN value on 2 cells
df.iloc[2,1]=None
df.iloc[5,0]=None
print(df)
A B
2017-01-01 203.0 1.175224
2017-01-02 199.0 1.338474
2017-01-03 198.0 NaN
2017-01-04 198.0 0.652318
2017-01-05 199.0 1.577577
2017-01-06 NaN 0.234882
2017-01-07 203.0 1.732908
2017-01-08 204.0 1.473146
2017-01-09 198.0 1.109261
2017-01-10 202.0 1.745309
#Delete row with dummy value
df = df.dropna(how='any',axis=0)
print(df)
A B
2017-01-01 203.0 1.175224
2017-01-02 199.0 1.338474
2017-01-04 198.0 0.652318
2017-01-05 199.0 1.577577
2017-01-07 203.0 1.732908
2017-01-08 204.0 1.473146
2017-01-09 198.0 1.109261
2017-01-10 202.0 1.745309
See the dropna documentation for further detail.
If everything is OK with your DataFrame, dropping NaNs should be as easy as that. If this is still not working, make sure you have the proper datatypes defined for your column (pd.to_numeric comes to mind...)
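For example, a minimal sketch, assuming 'BBD' was read in as object dtype with stray non-numeric entries:
# assumption: 'BBD' holds strings such as 'null' mixed in with the numbers
df1['BBD'] = pd.to_numeric(df1['BBD'], errors='coerce')  # bad entries become NaN
df1 = df1.dropna(subset=['BBD'])                          # now dropna can remove them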
To drop rows with a null in any column:
df = df.dropna(how='any',axis=0)
If you want to drop rows with a null in one particular column only:
df[~df['B'].isnull()]
A B
2017-01-01 203.0 1.175224
2017-01-02 199.0 1.338474
2017-01-04 198.0 0.652318
2017-01-05 199.0 1.577577
2017-01-06 NaN 0.234882
2017-01-07 203.0 1.732908
2017-01-08 204.0 1.473146
2017-01-09 198.0 1.109261
2017-01-10 202.0 1.745309
Only the 2017-01-03 row (NaN in column B) is removed; 2017-01-06 keeps its NaN in column A because only B was checked.
Please forgive any mistakes.
To remove all rows containing null values, the dropna() method will be helpful:
df.dropna(inplace=True)
To remove rows which contain a null value in a particular column, use this code:
df.dropna(subset=['column_name_to_remove'], inplace=True)
It appears that the value in your column is "null" and not a true NaN which is what dropna is meant for. So I would try:
df[df.BBD != 'null']
or, if the value is actually a NaN then,
df[pd.notnull(df.BBD)]
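If the column really does hold the literal string 'null', another hedged option is to convert it to a real NaN first and then drop:
import numpy as np

# assumption: 'null' is stored as a plain string in the BBD column
df['BBD'] = df['BBD'].replace('null', np.nan)
df = df.dropna(subset=['BBD'])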
I recommend giving one of these two lines a try:
df_clean = df1[~df1['BBD'].isnull()]
df_clean = df1[~df1['BBD'].isna()]
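For the drop-by-date half of the question, the label passed to drop has to match the index type. A minimal sketch, assuming df1 has a DatetimeIndex (on many pandas versions a bare string, or an unquoted 2015-10-30, which Python treats as integer subtraction, raises an error here):
import pandas as pd

df1 = df1.drop(pd.Timestamp('2015-10-30'))  # the label must be a Timestamp to match a DatetimeIndex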
I am surprised that my reindex is producing NaNs in the whole dataframe when the original dataframe does have numerical values in it. I don't know why.
Code:
df =
A ... D
Unnamed: 0 ...
2022-04-04 11:00:05 NaN ... 2419.0
2022-04-04 11:00:10 NaN ... 2419.0
## exp start and end times
exp_start, exp_end = '2022-04-04 11:00:00','2022-04-04 13:00:00'
## one second index
onesec_idx = pd.date_range(start=exp_start,end=exp_end,freq='1s')
## map new index to the df
df = df.reindex(onesec_idx)
Result:
df =
A ... D
2022-04-04 11:00:00 NaN ... NaN
2022-04-04 11:00:01 NaN ... NaN
2022-04-04 11:00:02 NaN ... NaN
2022-04-04 11:00:03 NaN ... NaN
2022-04-04 11:00:04 NaN ... NaN
2022-04-04 11:00:05 NaN ... NaN
From the documentation you can see that df.reindex() will place NA/NaN in locations that had no value in the previous index.
However, you can also provide a value to fill those missing locations with (it defaults to NaN):
df.reindex(onesec_idx, fill_value='')
If you want to replace the NaNs in a particular column, or in the whole dataframe, you can run something like the following after reindexing:
df.fillna('', inplace=True)  # for replacing NaN in the entire df with ''
df['D'] = df['D'].fillna(0)  # if you want to replace all NaN in the D column with 0
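Separately, note that in the output above even the 11:00:05 and 11:00:10 rows lost their values. A likely cause, assuming the CSV was read with its timestamps as plain strings (the Unnamed: 0 header hints at that), is that the original index and onesec_idx have different types, so no labels line up during the reindex. A hedged sketch of the fix:
# assumption: the original index holds timestamp strings rather than Timestamps
df.index = pd.to_datetime(df.index)  # align the index dtype with onesec_idx
df = df.reindex(onesec_idx)          # rows present in the old index now keep their values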
Sources:
Documentation for reindex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Documentation for fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
I have seen many methods like concat, join, and merge, but I am missing the technique for my simple dataset.
I have two datasets looks like mentioned below
dates.csv
2020-07-06
2020-07-07
2020-07-08
2020-07-09
2020-07-10
.....
...
...
mydata.csv
Expected,Predicted
12990,12797.578628473471
12990,12860.382061836583
12990,12994.159035827917
12890,13019.073929662367
12890,12940.34108357684
.............
.......
.....
I want to combine these two datasets, which have the same number of rows in both csv files. I tried the concat method but I see NaN's:
delete = pd.read_csv('dates.csv', header=None)  # dates DataFrame (no header row)
data1 = pd.read_csv('mydata.csv')               # Expected/Predicted DataFrame
result = pd.concat([delete, data1], axis=0, ignore_index=True)
print(result)
Output:
0 Expected Predicted
0 2020-07-06 NaN NaN
1 2020-07-07 NaN NaN
2 2020-07-08 NaN NaN
3 2020-07-09 NaN NaN
4 2020-07-10 NaN NaN
.. ... ... ...
307 NaN 10999.0 10526.433098
308 NaN 10999.0 10911.247147
309 NaN 10490.0 11038.685328
310 NaN 10490.0 10628.204624
311 NaN 10490.0 10632.495169
[312 rows x 3 columns]
I don't want all these NaN's.
Thanks for your help!
You could use the .join() method from pandas.
delete = pd.read_csv('dates.csv', header=None)  # dates DataFrame (no header row)
data1 = pd.read_csv('mydata.csv')               # Expected/Predicted DataFrame
result = delete.join(data1)
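For what it's worth, the original concat attempt only failed because axis=0 stacks the two frames vertically; assuming both frames keep their default RangeIndex, concatenating along columns gives the same side-by-side result as the join:
result = pd.concat([delete, data1], axis=1)  # columns side by side instead of stacked rows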
If your two dataframes are in the same order, you can use the join method mentioned by Nik; by default it joins on the index.
Otherwise, if you have a key that you can join your dataframes on, you can specify it like this:
joined_data = first_df.join(second_df, on=key)
Your first_df should then have a column named key; its values are matched against the index of second_df (if both frames hold the key as a regular column, pd.merge is the usual tool).
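A minimal sketch with made-up frames and a hypothetical 'key' column, to show how the keyed join lines up against the other frame's index:
import pandas as pd

first_df = pd.DataFrame({'key': ['a', 'b', 'c'],
                         'Date': ['2020-07-06', '2020-07-07', '2020-07-08']})
second_df = pd.DataFrame({'Expected': [12990, 12990, 12890]}, index=['a', 'b', 'c'])

# 'key' values in first_df are looked up in second_df's index
joined_data = first_df.join(second_df, on='key')
print(joined_data)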
I have a big dataframe. Some of the values in a column are NaN. I want to fill them with a value based on the other column's value.
Data:
df =
A B
2019-10-01 09:19:40 667.029710 10
2019-10-01 09:20:15 673.518030 20
2019-10-01 09:21:29 533.137144 30
2020-07-25 15:51:15 NaN 40
2020-07-25 17:20:20 NaN 50
2020-07-25 17:21:23 NaN 60
I want to fill NaN in A column based on the B column value.
My code:
sdf = df[df['A'].isnull()] # slice NaN and create a new dataframe
sdf['A'] = sdf['B']*sdf['B']
df = pd.concat([df,sdf])
Everything works fine, but my code feels lengthy. Is there a one-line way to do this?
For fillna we can do
df.A.fillna(df.B**2, inplace=True)
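Equivalently, and a bit safer on newer pandas versions where chained inplace calls on a single column can raise warnings, the assignment form does the same thing:
df['A'] = df['A'].fillna(df['B'] ** 2)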
I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select those column values which are > 20 and < 500 (that is, in the range 20 to 500) and put those values, along with the date, into another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them into the new dataframe in the appropriate columns. Something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but I am not sure how to find values based on a range. So how can this be achieved?
Given dataframes df1 and df2, you can achieve this via aligning column names, cleaning numeric data, and then using pd.DataFrame.append.
df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
.rename(columns={'date': 'Date'})\
.replace(np.inf, 0)\
.fillna(0)
print(df_app)
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)
df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = df2.append(df_app.loc[:, ['Date', 'percentage_change']])
print(res)
# Date location percentage_change
# 0 2018-02-14 BOM 23.440000
# 0 2018-03-15 NaN 100.000000
# 1 2018-03-16 NaN 90.000000
# 2 2018-03-17 NaN 143.219177
# 3 2018-03-18 NaN 100.000000
# 4 2018-03-20 NaN 100.000000
# 5 2018-03-22 NaN 119.053837
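One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the last step would need pd.concat instead, something along these lines:
res = pd.concat([df2, df_app[['Date', 'percentage_change']]])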
I am working on jupyter lab with pandas, version 0.20.1. I have a pivot table with a DatetimeIndex such as
In [1]:
pivot = df.pivot_table(index='Date', columns=['State'], values='B',
fill_value=0, aggfunc='count')
pivot
Out [1]:
State SAFE UNSAFE
Date
2017-11-18 1 0
2017-11-22 57 42
2017-11-23 155 223
The table counts all occurrences of events on a specific date, which can be either SAFE or UNSAFE. I need to resample the resulting table and sum the results.
Resampling the table with a daily frequency introduces NaNs on the days without data. Surprisingly, I cannot impute those NaNs with pandas' fillna().
In [2]:
pivot = pivot.resample('D').sum().fillna(0.)
pivot
Out [2]:
State SAFE UNSAFE
Date
2017-11-18 1.0 0.0
2017-11-19 NaN NaN
2017-11-20 NaN NaN
2017-11-21 NaN NaN
2017-11-22 57.0 42.0
2017-11-23 155.0 223.0
Can anyone explain why this happens, and how I can get rid of those NaNs? I could do something along the lines of
for col in ['SAFE', 'UNSAFE']:
    pivot.loc[pivot[col].isnull(), col] = 0
However that looks rather ugly, plus I'd like to understand why the first approach is not working.