Drop nan rows in pandas that are not in the middle - python

I have a pandas dataframe which is indexed by time.
For example:
Time Value
2010-01-01 nan
2010-01-02 nan
2010-01-03 3
2010-01-04 4
2010-01-05 5
2010-01-06 3
2010-01-07 nan
2010-01-08 nan
2010-01-09 3
2010-01-10 3
2010-01-11 4
2010-01-12 5
2010-01-13 3
2010-01-14 nan
2010-01-15 nan
In this example, I would like to drop the first two and the last two rows, but not the rows with NaN in the middle. Is there a way to do this?

You can use the index of the first valid value and the last valid value to slice the dataframe:
df.loc[df.Value.first_valid_index(): df.Value.last_valid_index()]
Result:
Value
Time
2010-01-03 3.0
2010-01-04 4.0
2010-01-05 5.0
2010-01-06 3.0
2010-01-07 NaN
2010-01-08 NaN
2010-01-09 3.0
2010-01-10 3.0
2010-01-11 4.0
2010-01-12 5.0
2010-01-13 3.0
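
For reference, a minimal self-contained sketch of this approach; the frame below is reconstructed from the question's sample data, so treat the setup as illustrative:
import numpy as np
import pandas as pd

idx = pd.date_range("2010-01-01", "2010-01-15", name="Time")
vals = [np.nan, np.nan, 3, 4, 5, 3, np.nan, np.nan,
        3, 3, 4, 5, 3, np.nan, np.nan]
df = pd.DataFrame({"Value": vals}, index=idx)

# slice from the first to the last non-NaN label (loc is inclusive),
# trimming the leading/trailing NaNs but keeping the interior ones
print(df.loc[df.Value.first_valid_index():df.Value.last_valid_index()])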

Supposing data is your dataframe:
a, b = data.dropna().index[[0, -1]]
You could also consider selecting a specific column, e.g. using data['Value'] instead of data.
This way you get the first and last index labels not containing NaN. Then you just have to take that slice (being careful to include that last row; data[a:b+1] assumes a default integer index, while with a DatetimeIndex you would use the inclusive data.loc[a:b] instead):
data[a:b+1]
Result:
Time Value
2010-01-03 3
2010-01-04 4
2010-01-05 5
2010-01-06 3
2010-01-07 nan
2010-01-08 nan
2010-01-09 3
2010-01-10 3
2010-01-11 4
2010-01-12 5
2010-01-13 3
Single-row solution following @unutbu's tip to use loc (slice(*...) just unpacks the two labels into a label-based slice, equivalent to data.loc[a:b]):
data.loc[slice(*data.dropna().index[[0, -1]])]

Using bfill and ffill:
df[df.Value.ffill().notnull() & df.Value.bfill().notnull()]
Out[464]:
Time Value
2 2010-01-03 3.0
3 2010-01-04 4.0
4 2010-01-05 5.0
5 2010-01-06 3.0
6 2010-01-07 NaN
7 2010-01-08 NaN
8 2010-01-09 3.0
9 2010-01-10 3.0
10 2010-01-11 4.0
11 2010-01-12 5.0
12 2010-01-13 3.0
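
Why this works: a forward fill leaves only the leading NaNs null, and a backward fill leaves only the trailing NaNs null, so the intersection of the two notnull masks keeps exactly the span from the first to the last valid value. Spelled out step by step (assuming a frame like the df built in the sketch above):
lead_ok = df.Value.ffill().notnull()   # False only for the leading NaNs
trail_ok = df.Value.bfill().notnull()  # False only for the trailing NaNs
print(df[lead_ok & trail_ok])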

Related

Python resample to only keep every 5th day by group

I have a dataframe, consisting of daily stock observations, date and PERMNO (Identifier). I want to resample the dataframe to only consist of observations for every 5th trading day for every stock. The dataframe looks something like the below:
[10610 rows x 3 columns]
PERMNO date RET gret cumret_5d
0 10001.0 2010-01-04 -0.004856 0.995144 NaN
1 10001.0 2010-01-05 -0.005856 0.994144 NaN
2 10001.0 2010-01-06 0.011780 1.011780 NaN
3 10001.0 2010-01-07 -0.033940 0.966060 NaN
4 10001.0 2010-01-08 0.038150 1.038150 3.888603e-03
5 10001.0 2010-01-11 0.015470 1.015470 2.439321e-02
6 10001.0 2010-01-12 -0.004760 0.995240 2.552256e-02
7 10001.0 2010-01-13 -0.003350 0.996650 1.018706e-02
8 10001.0 2010-01-14 -0.001928 0.998072 4.366128e-02
9 10001.0 2010-01-15 -0.007730 0.992270 -2.462285e-03
10 10002.0 2010-01-05 -0.011690 0.988310 NaN
11 10002.0 2010-01-06 0.011826 1.011826 NaN
12 10002.0 2010-01-07 -0.021420 0.978580 NaN
13 10002.0 2010-01-08 0.004974 1.004974 NaN
14 10002.0 2010-01-11 -0.023760 0.976240 -3.992141e-02
15 10002.0 2010-01-12 0.002028 1.002028 -2.659527e-02
16 10002.0 2010-01-13 0.009780 1.009780 -2.856358e-02
17 10002.0 2010-01-14 0.017380 1.017380 9.953183e-03
18 10002.0 2010-01-15 -0.008865 0.991135 -3.954383e-03
19 10002.0 2010-02-18 -0.006958 0.993042 1.318849e-02
The result I want to produce is:
[10610 rows x 3 columns]
PERMNO date RET gret cumret_5d
4 10001.0 2010-01-08 0.038150 1.038150 3.888603e-03
9 10001.0 2010-01-15 -0.007730 0.992270 -2.462285e-03
13 10002.0 2010-01-08 0.004974 1.004974 NaN
18 10002.0 2010-01-15 -0.008865 0.991135 -3.954383e-03
I.e. I want to keep observations for the dates (2010-01-08), (2010-01-15), (2010-01-22), and so on up until today. The problem is that not every stock contains the same dates (some may have their first trading day in the middle of a month). Further, every 5th trading day is not simply every 7th calendar day, because of holidays.
I have tried using
crsp_daily = crsp_daily.groupby('PERMNO').resample('5D',on='date')
which just resulted in a lazy resampler object rather than a usable dataframe:
Out:
DatetimeIndexResamplerGroupby [freq=<Day>, axis=0, closed=left, label=left, convention=e, origin=start_day]
Any ideas on how to solve this problem?
You could loop through the values of PERMNO and then for each subset use .iloc[::5] to get every 5th row. Then concat each resulting DataFrame together:
import pandas as pd

dfs = []
for val in crsp_daily['PERMNO'].unique():
    dfs.append(crsp_daily[crsp_daily['PERMNO'] == val].iloc[::5])
result = pd.concat(dfs)
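A loop-free equivalent (not from the original answer, just the same positional logic expressed through groupby; group_keys=False keeps the original flat index):
result = crsp_daily.groupby('PERMNO', group_keys=False).apply(lambda g: g.iloc[::5])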
For future reference, I solved it by:
import pandas as pd

def remove_nonrebalancing_dates(df, gap):
    # build a frame with one row per distinct trading date, in order
    count = pd.DataFrame(df.set_index('date').groupby('date'), columns=['date', 'tmp']).reset_index()
    del count['tmp']
    # turn the positional index into a 1-based rank and keep every gap-th date
    count['index'] = count['index'] + 1
    count = count[count['index'].isin(range(gap, len(count['index']) + 1, gap))]
    df = df[df['date'].isin(count['date'])]
    return df
A dataframe containing only every 5th trading day can then be obtained with:
df = remove_nonrebalancing_dates(df,5)
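
For comparison, a more direct sketch of the same idea, assuming (as the solution above does) that the goal is to keep every gap-th distinct trading date across the whole frame:
import pandas as pd

def keep_every_nth_date(df, gap):
    # rank the distinct trading dates and keep the gap-th, 2*gap-th, ... ones
    dates = pd.Series(sorted(df['date'].unique()))
    keep = dates.iloc[gap - 1::gap]
    return df[df['date'].isin(keep)]

df = keep_every_nth_date(df, 5)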

Comparing two Pandas dataframes row by row and inserting matching value into the other dataframe

I have two pandas DataFrames named complete_data and raw_data. My intention is to look up the date column (row by row) of the raw_data DataFrame in the complete_data DataFrame. For the rows of raw_data found in complete_data, I want to insert the corresponding values of P1 and P2 into complete_data.
Please note:
The unique key in both DataFrames is 'date', and the complete_data DataFrame has the complete set of dates; the other columns need to be fetched from the raw_data DataFrame.
I want the final DataFrame to be the complete_data DataFrame, with NaN values where a date does not exist in the raw_data DataFrame, and with the values of columns P1 and P2 inserted where the date does exist.
Here is my sample of code:
import pandas as pd
import numpy as np
complete_data = pd.DataFrame({'date':['2010-01-01','2010-01-02','2010-01-03','2010-01-04','2010-01-05','2010-01-06','2010-01-07','2010-01-08']})
raw_data = pd.DataFrame({'date': ['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-05', '2010-01-07', '2010-01-08'],
                         'P1': ['4.4', '5.2', '5.6', '6.2', '6.5', '7.2'],
                         'P2': ['200', '220', '230', '250', '270', '280']})
column_labels = list(raw_data.columns)
column_labels = column_labels[1:]
complete_data[column_labels] = np.nan
i = 0
while i < raw_data.shape[0]:
    if raw_data['date'].iloc[i] in complete_data['date'].iloc[i]:
        complete_data.iloc[[i], [1, 2]] = raw_data.iloc[[i], [1, 2]]
    else:
        complete_data.iloc[[i], [1, 2]] = raw_data.iloc[[i], [1, 2]]
    i += 1
My output is:
date P1 P2
0 2010-01-01 4.4 200
1 2010-01-02 5.2 220
2 2010-01-03 5.6 230
3 2010-01-04 6.2 250
4 2010-01-05 6.5 270
5 2010-01-06 7.2 280
6 2010-01-07 NaN NaN
7 2010-01-08 NaN NaN
My expected output should be:
date P1 P2
0 2010-01-01 4.4 200
1 2010-01-02 5.2 220
2 2010-01-03 5.6 230
3 2010-01-04 NaN NaN
4 2010-01-05 6.2 250
5 2010-01-06 6.5 270
6 2010-01-07 NaN NaN
7 2010-01-08 7.2 280
You could do this. For the DataFrames you gave:
date P1 P2
0 2010-01-01 4.4 200
1 2010-01-02 5.2 220
2 2010-01-03 5.6 230
3 2010-01-05 6.2 250
4 2010-01-07 6.5 270
5 2010-01-08 7.2 280
and
date P1 P2
0 2010-01-01 NaN NaN
1 2010-01-02 NaN NaN
2 2010-01-03 NaN NaN
3 2010-01-04 NaN NaN
4 2010-01-05 NaN NaN
5 2010-01-06 NaN NaN
6 2010-01-07 NaN NaN
7 2010-01-08 NaN NaN
df = complete_data.merge(raw_data, on=['date'], how='left').dropna(axis=1, how='all')
df = df.rename(columns={'P1_y': 'P1', 'P2_y': 'P2'})
which gives:
date P1 P2
0 2010-01-01 4.4 200
1 2010-01-02 5.2 220
2 2010-01-03 5.6 230
3 2010-01-04 NaN NaN
4 2010-01-05 6.2 250
5 2010-01-06 NaN NaN
6 2010-01-07 6.5 270
7 2010-01-08 7.2 280
Note that the expected output in your question does not match the definition of the dataframes you gave.
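As an aside, if complete_data starts out with only the date column (that is, the NaN P1/P2 columns are never pre-created as in the question's code), a plain left merge is enough and the dropna/rename steps are not needed; a minimal sketch:
df = complete_data[['date']].merge(raw_data, on='date', how='left')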

Mapping ranges of date in pandas dataframe

I would like to map values defined in a dictionary of date: value into a DataFrame of dates.
Consider the following example:
import pandas as pd
df = pd.DataFrame(range(19), index=pd.date_range(start="2010-01-01", end="2010-01-10", freq="12H"))
dct = {
    "2009-01-01": 1,
    "2010-01-05": 2,
    "2020-01-01": 3,
}
I would like to get something like this:
df
0 test
2010-01-01 00:00:00 0 1.0
2010-01-01 12:00:00 1 1.0
2010-01-02 00:00:00 2 1.0
2010-01-02 12:00:00 3 1.0
2010-01-03 00:00:00 4 1.0
2010-01-03 12:00:00 5 1.0
2010-01-04 00:00:00 6 1.0
2010-01-04 12:00:00 7 1.0
2010-01-05 00:00:00 8 2.0
2010-01-05 12:00:00 9 2.0
2010-01-06 00:00:00 10 2.0
2010-01-06 12:00:00 11 2.0
2010-01-07 00:00:00 12 2.0
2010-01-07 12:00:00 13 2.0
2010-01-08 00:00:00 14 2.0
2010-01-08 12:00:00 15 2.0
2010-01-09 00:00:00 16 2.0
2010-01-09 12:00:00 17 2.0
2010-01-10 00:00:00 18 2.0
I have tried the following, but I get a column of NaN:
df["test"] = pd.Series(df.index.map(dct), index=df.index).ffill()
Any suggestions?
There are missing values because the types do not match: the dict keys are strings, while the DataFrame has datetimes in a DatetimeIndex. The types must be the same, so build a helper Series from the dictionary with the keys converted to datetimes, and use Series.asfreq to add the dates in between:
dct = {
    "2009-01-01": 1,
    "2010-01-05": 2,
    "2020-01-01": 3,
}
s = pd.Series(dct).rename(lambda x: pd.to_datetime(x)).asfreq('d', method='ffill')
df["test"] = df.index.to_series().dt.normalize().map(s)
print(df)
0 test
2010-01-01 00:00:00 0 1
2010-01-01 12:00:00 1 1
2010-01-02 00:00:00 2 1
2010-01-02 12:00:00 3 1
2010-01-03 00:00:00 4 1
2010-01-03 12:00:00 5 1
2010-01-04 00:00:00 6 1
2010-01-04 12:00:00 7 1
2010-01-05 00:00:00 8 2
2010-01-05 12:00:00 9 2
2010-01-06 00:00:00 10 2
2010-01-06 12:00:00 11 2
2010-01-07 00:00:00 12 2
2010-01-07 12:00:00 13 2
2010-01-08 00:00:00 14 2
2010-01-08 12:00:00 15 2
2010-01-09 00:00:00 16 2
2010-01-09 12:00:00 17 2
2010-01-10 00:00:00 18 2
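
An alternative sketch using pd.merge_asof, which matches each timestamp to the most recent key at or before it; this assumes both sides are sorted ascending (true here) and starts from df as originally constructed, before any test column is added:
s = pd.Series(dct)
s.index = pd.to_datetime(s.index)

left = df.reset_index().rename(columns={'index': 'date'})
right = s.rename('test').rename_axis('date').reset_index()
df['test'] = pd.merge_asof(left, right, on='date')['test'].to_numpy()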

Take time points, and make labels against datetime object to correlate for things around points

I'm trying to use the usual times I take medication (so + 4 hours on top of that) to fill in a dataframe column with a label of 2, 1, or 0: 1 for the hours when I am on the medication, 2 for the hour just after coming off it, and 0 otherwise.
As an example of the dataframe I am trying to add this column too,
id sentiment magnitude angry disgusted fearful \
created
2020-05-21 12:00:00 23.0 -0.033333 0.5 NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:45:00 46022.0 -1.000000 1.0 NaN NaN NaN
happy neutral sad surprised
created
2020-05-21 12:00:00 NaN NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN
... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN
2021-04-20 01:45:00 NaN NaN NaN NaN
[32024 rows x 10 columns]
And the data for the timestamps for when i usually take my medication,
['09:00 AM', '12:00 PM', '03:00 PM']
How would I use those time stamps to get this sort of column information?
Update
So, building on the question: how would I make sure the label is only added where there is data available, and that the one-hour after-medication window is applied correctly?
Thanks
Use np.select() to choose the appropriate label for a given condition.
First dropna() if all values after created are null (subset=df.columns[1:]). You can change the subset depending on your needs (e.g., subset=['id'] if rows should be dropped just for having a null id).
Then generate datetime arrays for taken-, active-, and after-medication periods based on the duration of the medication. Check whether the created times match any of the times in active (label 1) or after (label 2), otherwise default to 0.
import numpy as np
import pandas as pd

# drop rows that are empty except for column 0 (i.e., except for df.created)
df.dropna(subset=df.columns[1:], inplace=True)
# convert times to datetime
df.created = pd.to_datetime(df.created)
taken = pd.to_datetime(['09:00:00', '12:00:00', '15:00:00'])
# generate time arrays
duration = 2 # hours
active = np.array([(taken + pd.Timedelta(f'{h}H')).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(f'{duration}H')).time
# define boolean masks by label
conditions = {
    1: df.created.dt.floor('H').dt.time.isin(active),
    2: df.created.dt.floor('H').dt.time.isin(after),
}
# create medication column with np.select()
df['medication'] = np.select(conditions.values(), conditions.keys(), default=0)
Here is the output with some slightly modified data that better demonstrate the active / after / nan scenarios:
created id sentiment magnitude medication
0 2020-05-21 12:00:00 23.0 -0.033333 0.5 1
3 2020-05-21 12:45:00 39.0 -0.500000 0.5 1
4 2020-05-21 13:00:00 90.0 -0.500000 0.5 1
5 2020-05-21 13:15:00 100.0 -0.033333 0.1 1
9 2020-05-21 14:15:00 1000.0 0.033333 0.5 2
10 2020-05-21 14:30:00 3.0 0.001000 1.0 2
17 2021-04-20 01:00:00 46022.0 -1.000000 1.0 0
20 2021-04-20 01:45:00 46022.0 -1.000000 1.0 0

Linearly interpolating Pandas Time Series

I am pulling data on exchange rates using Pandas. The data does not have values for every single day. I'd like to fill in the missing time series using Pandas' interpolate function so that all dates are included in the index. For example, 2010-01-09 and 2010-01-10 are both missing. The interpolate function seems not to be doing anything, but I can't figure out why.
from pandas_datareader import data
can = data.get_data_fred('DEXCAUS')
can = can.interpolate(method='linear')
can = can.dropna()
print(can.head(10))
Output:
DEXCAUS
DATE
2010-01-04 1.0377
2010-01-05 1.0371
2010-01-06 1.0333
2010-01-07 1.0351
2010-01-08 1.0345
2010-01-11 1.0317
2010-01-12 1.0374
2010-01-13 1.0319
2010-01-14 1.0260
2010-01-15 1.0287
Desired Output:
DEXCAUS
DATE
2010-01-04 1.0377
2010-01-05 1.0371
2010-01-06 1.0333
2010-01-07 1.0351
2010-01-08 1.0345
2010-01-09 some value..
2010-01-10 some value..
2010-01-11 1.0317
2010-01-12 1.0374
2010-01-13 1.0319
2010-01-14 1.0260
2010-01-15 1.0287
You need to resample first:
can.resample('D').interpolate(method='linear')
Out:
DEXCAUS
DATE
2010-01-04 1.037700
2010-01-05 1.037100
2010-01-06 1.033300
2010-01-07 1.035100
2010-01-08 1.034500
2010-01-09 1.033567
2010-01-10 1.032633
2010-01-11 1.031700
2010-01-12 1.037400
2010-01-13 1.031900
2010-01-14 1.026000
2010-01-15 1.028700
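
To see why the resample step matters, here is a minimal self-contained sketch of the resample-then-interpolate pattern on two made-up points (not the FRED data):
import pandas as pd

s = pd.Series([1.0345, 1.0317],
              index=pd.to_datetime(['2010-01-08', '2010-01-11']))
# resample('D') inserts the missing calendar days as NaN rows, which
# interpolate() then fills along the straight line between the known points
print(s.resample('D').interpolate(method='linear'))
# 2010-01-09 -> 1.033567 and 2010-01-10 -> 1.032633, matching the output above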
