Removing rows where any column contains NaN, NaT, or the string 'nan' - python

Currently I have data as below:
df_all.head()
Out[2]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 AA 2016-01-14 36.351784 0.000112
2 4063 AAC 2016-01-15 36.351784 (0.000004)
3 4064 AAL 2016-01-19 36.590483 0.000006
4 4065 AAMC 2016-01-20 35.934062 0.000002
df_all.tail()
Out[3]:
Unnamed: 0 Symbol Date Close Weight
1252498 26950320 nan NaT 9.84 NaN
1252499 26950321 nan NaT 10.26 NaN
1252500 26950322 nan NaT 9.99 NaN
1252501 26950323 nan NaT 9.11 NaN
1252502 26950324 nan NaT 9.18 NaN
df_all.dtypes
Out[4]:
Unnamed: 0 int64
Symbol object
Date datetime64[ns]
Close float64
Weight object
dtype: object
As can be seen, I am getting values of 'nan' in Symbol, NaT in Date, and NaN in Weight.
MY GOAL: I want to remove any row in which ANY column contains nan, NaT, or NaN, and have a new df_clean as the result.
I don't seem to be able to apply the appropriate filter. I am not sure whether I have to convert the datatypes first (although I tried this as well).

You can use
df_replaced = df_all.replace({'nan': np.nan})
df_clean = df_replaced[~pd.isnull(df_replaced).any(axis=1)]
Note that replace returns a new frame, so the null mask has to be computed on the replaced frame, not on the original df_all. This works because isnull recognizes both NaN and NaT as "null" values.

Since the string 'nan' is not caught by dropna() or isnull(), you need to cast the symbol 'nan' to np.nan.
Try this:
df["Symbol"] = np.where(df["Symbol"] == 'nan', np.nan, df["Symbol"])
df = df.dropna()
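Putting the two suggestions together, here is a minimal runnable sketch. The sample frame is hypothetical (three rows mimicking the question's data), but the pattern is the one described above: turn the literal string 'nan' into a real missing value, then drop any row with a missing cell.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the question: the string 'nan' in Symbol,
# NaT in Date, NaN in Weight.
df_all = pd.DataFrame({
    "Symbol": ["A", "nan", "AAL"],
    "Date": pd.to_datetime(["2016-01-13", None, "2016-01-19"]),
    "Close": [36.515889, 9.84, 36.590483],
    "Weight": [0.000002, np.nan, 0.000006],
})

# Replace the string 'nan' with a real NaN in Symbol, then drop every
# row that has any missing cell (NaN or NaT).
df_clean = df_all.replace({"Symbol": {"nan": np.nan}}).dropna()
```

After this, df_clean keeps only the rows for "A" and "AAL".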

Related

Pandas merge single column dataframe with another dataframe of multiple columns

I have one dataframe_1 as
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
and another dataframe_2 as
date source dest price
634647 2020-09-18 EUR USD 1.186317
634648 2020-09-19 EUR USD 1.183970
634649 2020-09-20 EUR USD 1.183970
I want to merge them on 'date', but the problem is that dataframe_1's last date is '2021-02-15' while dataframe_2's last date is '2021-02-01'.
I want the resulting dataframe as
date source dest price
634647 2021-02-01 EUR USD 1.186317
634648 2021-02-02 NaN NaN NaN
634649 2021-02-03 NaN NaN NaN
...
date source dest price
634647 2021-02-13 NaN NaN NaN
634648 2021-02-14 NaN NaN NaN
634649 2021-02-25 NaN NaN NaN
But I am not able to do it using pd.merge; please ignore the indices in the dataframes.
Thanks a lot in advance.
You can use join to do it:
df1.set_index('date').join(df2.set_index('date'))
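Equivalently, a left merge keeps every date from df1 and fills the missing columns with NaN. A small sketch with made-up rows (the dates and prices here are illustrative, not the asker's full data):

```python
import pandas as pd

# Hypothetical mini versions of the two frames.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-31", "2021-02-01", "2021-02-02"]),
})
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-31", "2021-02-01"]),
    "source": ["EUR", "EUR"],
    "dest": ["USD", "USD"],
    "price": [1.186317, 1.183970],
})

# how='left' keeps all rows of df1; dates absent from df2 get NaN.
result = pd.merge(df1, df2, on="date", how="left")
```

Here result has three rows, with NaN in source/dest/price for 2021-02-02.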

Filling missing dates by imputing on previous dates in Python

I have a time series that I want to lag and predict on for future data one year ahead that looks like:
Date Energy Pred Energy Lag Error
.
2017-09-01 9 8.4
2017-10-01 10 9
2017-11-01 11 10
2017-12-01 12 11.5
2018-01-01 1 1.3
NaT (pred-true)
NaT
NaT
NaT
.
.
All I want to do is impute dates into the NaT entries to continue from 2018-01-01 to 2019-01-01 (just fill them like we're in Excel drag and drop) because there are enough NaT positions to fill up to that point.
I've tried model['Date'].fillna() with various methods, and it either just repeats the same previous date or drops things I don't want to drop.
Any way to just fill these NaTs with 1 month increments like the previous data?
Make the df and set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd
df = pd.read_clipboard(sep=",", parse_dates=True)
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
Help from:
Pandas Set DatetimeIndex
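For anyone who can't use the clipboard, here is a self-contained variant of the same steps, reading the sample CSV from a string buffer instead (the numbers mirror the question's table; the Lag Error column is omitted since it is empty in the sample):

```python
import io
import pandas as pd

csv = """Date,Energy,Pred Energy
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""

# Parse Date straight into the index instead of setting it afterwards.
df = pd.read_csv(io.StringIO(csv), parse_dates=["Date"], index_col="Date")

# Reindex to a month-start range; the new months appear with NaN values.
idx = pd.date_range(start="2017-09-01", end="2019-01-01", freq="MS")
df = df.reindex(idx)
```

This gives the same 17-row output as above: the original five months, followed by twelve all-NaN months through 2019-01-01.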

dataframe merge with missing data

I have 2 dataframes:
df.head()
Out[2]:
Unnamed: 0 Symbol Date Close
0 4061 A 2016-01-13 36.515889
1 4062 A 2016-01-14 36.351784
2 4063 A 2016-01-15 36.351784
3 4064 A 2016-01-19 36.590483
4 4065 A 2016-01-20 35.934062
and
dfw.head()
Out[3]:
Symbol Weight
0 A (0.000002)
1 AA 0.000112
2 AAC (0.000004)
3 AAL 0.000006
4 AAMC 0.000002
ISSUE:
Not every symbol of df will have a weight in dfw. If it does not, I want to drop it (all dates of it) from my new dataframe. If the symbol is in dfw, I want to merge the weight into df so that each row has Symbol, Date, Close, and Weight. I have tried the following but get NaN values. I am also not sure how to remove all symbols with no weights, even if the merge were successful.
dfall = df.merge(dfw, on='Symbol', how='left')
dfall.head()
Out[14]:
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 NaN
1 4062 A 2016-01-14 36.351784 NaN
2 4063 A 2016-01-15 36.351784 NaN
3 4064 A 2016-01-19 36.590483 NaN
4 4065 A 2016-01-20 35.934062 NaN
df_all = df[df.Symbol.isin(dfw.Symbol.unique())].merge(dfw, how='left', on='Symbol')
I am not sure why you are getting NaN values. Perhaps you have spaces in your symbols? You can clean them via dfw['Symbol'] = dfw.Symbol.str.strip(). You would need to do the same for df.
>>> df_all
Unnamed: 0 Symbol Date Close Weight
0 4061 A 2016-01-13 36.515889 (0.000002)
1 4062 A 2016-01-14 36.351784 (0.000002)
2 4063 A 2016-01-15 36.351784 (0.000002)
3 4064 A 2016-01-19 36.590483 (0.000002)
4 4065 A 2016-01-20 35.934062 (0.000002)
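Since symbols without a weight should be dropped anyway, an inner merge does the filtering and the join in one step. A sketch with hypothetical rows ("ZZZ" stands in for a symbol that has no weight):

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A", "A", "ZZZ"],
    "Date": pd.to_datetime(["2016-01-13", "2016-01-14", "2016-01-13"]),
    "Close": [36.515889, 36.351784, 10.0],
})
dfw = pd.DataFrame({"Symbol": ["A", "AA"], "Weight": [0.000002, 0.000112]})

# Strip stray whitespace first (a common cause of the NaN weights), then
# an inner join keeps only symbols present in both frames.
df["Symbol"] = df["Symbol"].str.strip()
dfw["Symbol"] = dfw["Symbol"].str.strip()
df_all = df.merge(dfw, on="Symbol", how="inner")
```

Every row of df_all now has a non-null Weight, and "ZZZ" is gone.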

How to multiply two dataframes if they have the same index value along the corresponding row?

Suppose I have something like this (which may have the forecast_date index repeated):
df1:
forecast_date value
2015-04-11 18952
2015-04-12 18938
2015-04-13 18940
2015-04-14 18949
2015-04-15 18955
2015-04-16 18956
...
2015-04-02 18950
2015-04-03 18968
I also have another dataframe that is like this (indices here are never duplicated):
df2:
date value
2015-04-01 1.3
2015-04-02 1.35
2015-04-03 1.34
2015-04-04 1.45
....
I want to multiply the df1 row value by the df2 row value where their indices match. What is an elegant way to do this in pandas? This is probably really easy and I am just overlooking it.
Thanks.
If you set the index to be the dates for both df's then multiplication will align where the indices match:
In [46]:
df['value'] * df1['value']
Out[46]:
2015-04-01 NaN
2015-04-02 25582.50
2015-04-03 25417.12
2015-04-04 NaN
2015-04-11 NaN
2015-04-12 NaN
2015-04-13 NaN
2015-04-14 NaN
2015-04-15 NaN
2015-04-16 NaN
Name: value, dtype: float64
The question is whether you want NaN values where the rows are missing or not.
EDIT
If you have duplicate date values then what you could do is left merge the other df's value column and then multiply the 2 columns so the following should work:
In [58]:
df1.rename(columns={'value':'other_value'}, inplace=True)
merged = df.merge(df1, left_on='forecast_date', right_on='date', how='left')
merged['new_value'] = merged['value'] * merged['other_value']
merged
Out[58]:
forecast_date value date other_value new_value
0 2015-04-11 18952 NaN NaN NaN
1 2015-04-12 18938 NaN NaN NaN
2 2015-04-13 18940 NaN NaN NaN
3 2015-04-14 18949 NaN NaN NaN
4 2015-04-15 18955 NaN NaN NaN
5 2015-04-16 18956 NaN NaN NaN
6 2015-04-02 18950 2015-04-02 1.35 25582.50
7 2015-04-03 18968 2015-04-03 1.34 25417.12
The above assumes that the date columns have not been set as the index already.
You could also store the values in arrays and loop through arrayA, checking whether the same index occurs in arrayB; if it does, do the calculation.
You could use
df1.multiply(df2)
See pandas.DataFrame.multiply.
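A runnable sketch of the index-alignment approach from the first answer, assuming the dates have been set as the index and (for simplicity) are unique in both frames; the numbers are the ones from the question:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"value": [18950, 18968]},
    index=pd.to_datetime(["2015-04-02", "2015-04-03"]),
)
df2 = pd.DataFrame(
    {"value": [1.35, 1.34, 1.3]},
    index=pd.to_datetime(["2015-04-02", "2015-04-03", "2015-04-01"]),
)

# Series multiplication aligns on the index: matching dates multiply,
# dates present in only one frame come out as NaN.
product = df1["value"] * df2["value"]
```

Append .dropna() to the result if you don't want the NaN rows.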

slicing on a period index in pandas when slice start and end may be out of bounds

Is the following behavior expected or a bug?
I have a process where I need rows from a DataFrame, but at the boundary conditions the simple rule (all rows in the 5 days preceding) will generate selections partially or fully outside the index. I would like pandas to behave like Python and always return a frame, even if sometimes there are no rows.
The index is Period index and the data is sorted.
Configuration is pandas 0.12, NumPy 1.7, and Windows 64-bit.
In testing I found that df.loc raises an index error if the requested slice is not completely within the index
df[start:end] returned a frame but not always the rows I expected
import pandas as pd
october = pd.PeriodIndex( start = '20131001', end = '20131010', freq = 'D')
oct_sales =pd.DataFrame(dict(units=[100+ i for i in range(10)]), index =october)
#returns empty frame as desired
oct_sales['2013-09-01': '2013-09-30']
# empty dataframe -- I was expecting two rows
oct_sales['2013-09-30': '2013-10-02']
# works as expected
oct_sales['2013-10-01': '2013-10-02']
# same as oct_sales['2013-10-02':] -- expected no rows
oct_sales['2013-10-02': '2013-09-30']
This is as expected: slicing on labels (start:end) only works if the labels exist. To get what I think you are after, reindex for the entire period, select, then dropna. That said, the .loc behavior of raising is correct, while the [] indexing should work (maybe a bug).
In [23]: idx = pd.PeriodIndex( start = '20130901', end = '20131010', freq = 'D')
In [24]: oct_sales.reindex(idx)
Out[24]:
units
2013-09-01 NaN
2013-09-02 NaN
2013-09-03 NaN
2013-09-04 NaN
2013-09-05 NaN
2013-09-06 NaN
2013-09-07 NaN
2013-09-08 NaN
2013-09-09 NaN
2013-09-10 NaN
2013-09-11 NaN
2013-09-12 NaN
2013-09-13 NaN
2013-09-14 NaN
2013-09-15 NaN
2013-09-16 NaN
2013-09-17 NaN
2013-09-18 NaN
2013-09-19 NaN
2013-09-20 NaN
2013-09-21 NaN
2013-09-22 NaN
2013-09-23 NaN
2013-09-24 NaN
2013-09-25 NaN
2013-09-26 NaN
2013-09-27 NaN
2013-09-28 NaN
2013-09-29 NaN
2013-09-30 NaN
2013-10-01 100
2013-10-02 101
2013-10-03 102
2013-10-04 103
2013-10-05 104
2013-10-06 105
2013-10-07 106
2013-10-08 107
2013-10-09 108
2013-10-10 109
In [25]: oct_sales.reindex(idx)['2013-09-30':'2013-10-02']
Out[25]:
units
2013-09-30 NaN
2013-10-01 100
2013-10-02 101
In [26]: oct_sales.reindex(idx)['2013-09-30':'2013-10-02'].dropna()
Out[26]:
units
2013-10-01 100
2013-10-02 101
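For readers on recent pandas versions: constructing a PeriodIndex with start/end keywords was removed; pd.period_range is the current spelling. A sketch of the full reindex-slice-dropna pattern in that form:

```python
import pandas as pd

# Modern equivalent of pd.PeriodIndex(start=..., end=..., freq='D').
october = pd.period_range(start="2013-10-01", end="2013-10-10", freq="D")
oct_sales = pd.DataFrame({"units": [100 + i for i in range(10)]},
                         index=october)

# Reindex over the wider window so out-of-bounds labels exist (as NaN),
# slice on labels, then drop the all-NaN rows.
idx = pd.period_range(start="2013-09-01", end="2013-10-10", freq="D")
window = oct_sales.reindex(idx)["2013-09-30":"2013-10-02"].dropna()
```

This yields the same two-row result as Out[26] above, regardless of whether the slice endpoints fall inside the original index.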