I'm trying to substitute NaTs in a pandas dataframe.
orders.PAID_AT
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 2018-08-04 16:19:10
12 2018-08-04 16:19:10
13 NaT
14 NaT
15 2018-08-04 13:49:08
16 2018-08-04 13:49:08
18 NaT
19 NaT
20 NaT
21 2018-08-04 12:41:48
The rows 0..10 need to be filled with value of row 11 etc. Somehow I can't get it right with:
orders.PAID_AT.fillna(method='bfill', inplace=True)
I'm getting the same result as above. What am I missing here?
To avoid chained assignment, assign the result back to the column:
orders.PAID_AT = orders.PAID_AT.bfill()
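A minimal sketch of the assign-back pattern, using a small made-up orders frame (the data below is an assumption for illustration, not the asker's full dataset). An in-place call on a column selected from a frame may act on a temporary object, so the frame can stay unchanged. Recent pandas versions also deprecate fillna(method=...) in favour of bfill() and ffill().

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: gaps followed by a payment time.
orders = pd.DataFrame({
    "PAID_AT": pd.to_datetime([
        None, None, "2018-08-04 16:19:10", None, "2018-08-04 13:49:08",
    ])
})

# bfill() returns a new Series; assigning it back updates the frame itself.
orders["PAID_AT"] = orders["PAID_AT"].bfill()
print(orders)
```

Each NaT takes the next valid timestamp below it, which is what the question asks for.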
I am currently trying to find a way to merge specific rows of df2 to df1 based on their datetime indices in a way that avoids lookahead bias so that I can add external features (df2) to my main dataset (df1) for ML applications. The lengths of the dataframes are different, and the datetime indices aren't increasing at a constant rate. My current thought process is to do this by using nested loops and if statements, but this method would be too slow as the dataframes I am trying to do this on both have over 30000 rows each. Is there a faster way of doing this?
df1
index a b
2015-06-02 16:00:00 0 5
2015-06-05 16:00:00 1 6
2015-06-06 16:00:00 2 7
2015-06-11 16:00:00 3 8
2015-06-12 16:00:00 4 9
df2
index c d
2015-06-02 9:03:00 10 16
2015-06-02 15:12:00 11 17
2015-06-02 16:07:00 12 18
... ... ...
2015-06-12 15:29:00 13 19
2015-06-12 16:02:00 14 20
2015-06-12 17:33:00 15 21
df_combined
(because you can't see the rows at 06-05, 06-06, 06-11, I just have NaN as the row values to make it easier to interpret)
index a b c d
2015-06-02 16:00:00 0 5 11 17
2015-06-05 16:00:00 1 NaN NaN NaN
2015-06-06 16:00:00 2 NaN NaN NaN
2015-06-11 16:00:00 3 NaN NaN NaN
2015-06-12 16:00:00 4 9 13 19
df_combined.loc[0, ['c', 'd']] and df_combined.loc[4, ['c', 'd']] are 11,17 and 13,19 respectively instead of 12,18 and 14,20 to avoid lookahead bias because in a live scenario, those values haven't been observed yet.
IIUC, you need merge_asof. Assuming both indices are sorted in time, use direction='backward':
print(pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='backward'))
# a b c d
# 2015-06-02 16:00:00 0 5 11 17
# 2015-06-05 16:00:00 1 6 12 18
# 2015-06-06 16:00:00 2 7 12 18
# 2015-06-11 16:00:00 3 8 12 18
# 2015-06-12 16:00:00 4 9 13 19
Note that the rows for 06-05, 06-06 and 06-11 are not NaN: they hold the last values available in df2 before those dates (the row at 2015-06-02 16:07:00) in your given data.
Note: if your dates are actually in a column named index rather than in the index itself, then do:
print(pd.merge_asof(df1, df2, on='index', direction='backward'))
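For reference, here is a self-contained sketch that rebuilds the two frames from the question (leaving out the elided "..." rows) and runs the backward merge. The allow_exact_matches=False argument is an optional extra, not part of the answer above: it excludes df2 rows stamped at exactly the same time as a df1 row, which may or may not be what you want.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"a": [0, 1, 2, 3, 4], "b": [5, 6, 7, 8, 9]},
    index=pd.to_datetime([
        "2015-06-02 16:00", "2015-06-05 16:00", "2015-06-06 16:00",
        "2015-06-11 16:00", "2015-06-12 16:00",
    ]),
)
df2 = pd.DataFrame(
    {"c": [10, 11, 12, 13, 14, 15], "d": [16, 17, 18, 19, 20, 21]},
    index=pd.to_datetime([
        "2015-06-02 09:03", "2015-06-02 15:12", "2015-06-02 16:07",
        "2015-06-12 15:29", "2015-06-12 16:02", "2015-06-12 17:33",
    ]),
)

# merge_asof requires both sides to be sorted on the merge key.
combined = pd.merge_asof(
    df1.sort_index(),
    df2.sort_index(),
    left_index=True,
    right_index=True,
    direction="backward",       # only look at df2 rows at or before each df1 row
    allow_exact_matches=False,  # optional: require strictly earlier df2 rows
)
print(combined)
```

If stale values should not be carried across long gaps, merge_asof also accepts a tolerance argument (for example pd.Timedelta("1D")) that leaves NaN when no df2 row is recent enough.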
I am working with a pandas dataframe with date column. I have converted the dtype of this column from object to datetime using pandas pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, some outputs are wrong after the conversion to datetime (for example, row 11): the month is swapped with the day. This is affecting my further analysis. How can I sort this out?
Use the dayfirst=True parameter or specify the format, because by default pandas tries to parse the month first whenever that is possible:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')
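A short runnable sketch of both options on a few of the ambiguous strings from the question (the Series name raw is made up for the example). An explicit format is the stricter choice, since it raises an error on strings that do not match instead of guessing.

```python
import pandas as pd

# A few day-first strings from the question, including ambiguous ones.
raw = pd.Series(["30-11-2019", "01-02-2020", "03-02-2020", "12-03-2020"])

by_flag = pd.to_datetime(raw, dayfirst=True)         # hint: day comes first
by_format = pd.to_datetime(raw, format="%d-%m-%Y")   # exact, explicit format

print(by_format.dt.strftime("%Y-%m-%d").tolist())
```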
Method 1
Look into the to_datetime documentation: there is a parameter named dayfirst; set it to True.
Method 2
Use the format parameter of the to_datetime function.
I have a dataframe that contains a date column
Comp_date
0 2020-04-24
1 NaT
2 NaT
3 NaT
4 2020-08-06
5 NaT
6 NaT
7 NaT
8 2020-08-22
9 NaT
I am trying to fill each null with the previous date plus a constant number of days (10), but I am unable to do so. I tried the following:
df['Comp_date']=df['Comp_date'].fillna((df['Comp_date'].shift()+pd.to_timedelta(10, unit='D')), inplace=True)
Nothing happens and I get the same result. Any help?
expected outcome
Comp_date
0 2020-04-24
1 2020-05-04
2 2020-05-14
3 2020-05-24
4 2020-08-06
5 2020-08-16
6 2020-08-26
7 2020-09-05
8 2020-08-22
9 2020-09-01
The idea is to create groups that start at each non-missing value with Series.notna and Series.cumsum, build a counter within each group with GroupBy.cumcount, multiply it by the number of days with Series.mul, and convert the result to timedeltas with to_timedelta. These timedeltas are then added to the forward-filled values from ffill:
num_days = 10
g = df['Comp_date'].notna().cumsum()
days = pd.to_timedelta(df.groupby(g).cumcount().mul(num_days), unit='d')
df['Comp_date'] = df['Comp_date'].ffill().add(days)
print (df)
Comp_date
0 2020-04-24
1 2020-05-04
2 2020-05-14
3 2020-05-24
4 2020-08-06
5 2020-08-16
6 2020-08-26
7 2020-09-05
8 2020-08-22
9 2020-09-01
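The same steps as a self-contained sketch that builds the sample column first, so it can be run as is. The helper name fill_with_step is hypothetical and only wraps the logic from the answer above.

```python
import pandas as pd

def fill_with_step(dates: pd.Series, num_days: int = 10) -> pd.Series:
    """Fill each NaT with the last valid date plus num_days per missing step."""
    # Every valid date starts a new group; the NaTs after it share its group id.
    groups = dates.notna().cumsum()
    # Position inside the group: 0 for the valid date, 1, 2, ... for its NaTs.
    steps = dates.groupby(groups).cumcount()
    offsets = pd.to_timedelta(steps * num_days, unit="D")
    return dates.ffill() + offsets

df = pd.DataFrame({
    "Comp_date": pd.to_datetime([
        "2020-04-24", None, None, None,
        "2020-08-06", None, None, None,
        "2020-08-22", None,
    ])
})
df["Comp_date"] = fill_with_step(df["Comp_date"], num_days=10)
print(df)
```

Leading NaTs that have no earlier valid date stay NaT, because there is nothing to forward-fill from.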
I'm not clear on your question, but this adds a constant number of days to the last observed Comp_date.
constant_number_of_days = 2
df2 = df['Comp_date'].ffill().to_frame()
df2.loc[df['Comp_date'].isnull(), 'Comp_date'] += pd.Timedelta(days=constant_number_of_days)
>>> df2
Comp_date
0 2020-04-24
1 2020-04-26
2 2020-04-26
3 2020-04-26
4 2020-08-06
5 2020-08-08
6 2020-08-08
7 2020-08-08
8 2020-08-22
9 2020-08-24
I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from the other, with the result being the difference in number of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This returns the difference as int if there are no missing values (NaT) and as float if there are.
pandas has rich documentation on Time series / date functionality and Time deltas.
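A small sketch of that behaviour on three of the rows from the question. The cast to the nullable Int64 dtype is an addition that the answer does not mention: it keeps whole-number days while still allowing missing values.

```python
import pandas as pd

df_test = pd.DataFrame({
    "First_Date": pd.to_datetime(["2016-02-09", "2016-01-06", None]),
    "Second Date": pd.to_datetime(["2015-11-19", "2015-11-30", "2015-12-04"]),
})

# .dt.days gives floats here because the NaT row becomes NaN.
days = (df_test["First_Date"] - df_test["Second Date"]).dt.days

# Optional: nullable integers, with <NA> where a date was missing.
int_days = days.astype("Int64")
print(int_days)
```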
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is float rather than int because of the NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
See also the Frequency conversion section of the pandas Time deltas documentation.
You can use the datetime module to help here. Also, as a side note, a simple date subtraction works, as shown below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change the type to datetime.timedelta, and then read the .days attribute of the valid timedelta objects.
In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Diffference
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the answers above do not handle the case where the dates 'wrap' around a year. Handling it is useful when you want the proximity of two dates by day of year. To do this as a row operation, I did the following. (I used this in a business setting for renewing customer subscriptions.)
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the close date.
        # There's some tricky logic in here to calculate the date difference
        # the other way around (Dec -> Jan is 1 month rather than 11)
        sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
        close_date = int(row[y].strftime('%j'))  # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the next year (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
Then the function can be applied row by row:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
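For larger frames, the same wrap-around logic can be sketched without a row-wise apply by using the dt.dayofyear accessor. This is an alternative and not the code above: the function name day_of_year_distance and the sample dates are made up, and it keeps the same simplification of treating every year as 365 days.

```python
import numpy as np
import pandas as pd

def day_of_year_distance(start: pd.Series, end: pd.Series, year_length: int = 365) -> pd.Series:
    """Smallest distance in days between two dates, compared by day of year."""
    diff = (start.dt.dayofyear - end.dt.dayofyear).abs()
    # Going the other way around the year is year_length - diff.
    return np.minimum(diff, year_length - diff)

df = pd.DataFrame({
    "customer_since_date": pd.to_datetime(["2019-12-25", "2019-03-01", None]),
    "renewal_date": pd.to_datetime(["2020-01-05", "2020-03-11", "2020-06-01"]),
})
df["date_difference"] = day_of_year_distance(
    df["customer_since_date"], df["renewal_date"]
)
print(df)
```

Rows with a missing date come out as NaN rather than None.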
Create a vectorized method
import numpy as np
from pandas.tseries.frequencies import to_offset

def calc_xb_minus_xa(df):
    # Map the string form of the base offset to a numpy timedelta unit
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    # Infer the unit from the first row's duration
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This is intended to work for minutes, hours, days, weeks, months and years.
Change open_time and end_time to match the column names in your df.
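If the unit is known in advance, a simpler sketch is to divide the timedelta column by a fixed pd.Timedelta. The frame below is made up for illustration and reuses the open_time and end_time column names. Months and years are left out on purpose, since they do not have a fixed length.

```python
import pandas as pd

df = pd.DataFrame({
    "open_time": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 12:00"]),
    "end_time": pd.to_datetime(["2020-01-01 06:00", "2020-01-03 12:00"]),
})

elapsed = df["end_time"] - df["open_time"]

# Dividing by a fixed-length Timedelta gives a float count of that unit.
df["hours"] = elapsed / pd.Timedelta(hours=1)
df["days"] = elapsed / pd.Timedelta(days=1)
print(df)
```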
I have the following dataframe
var loyal_date
1 2017-01-17
1 2017-01-03
1 2017-01-11
1 NaT
1 NaT
2 2017-01-15
2 2017-01-07
2 NaT
2 NaT
2 NaT
I need to group by the var column and find the percentage of non-missing values in the loyal_date column for each group. Is there any way to do it using a lambda function?
Try this:
In [59]: df
Out[59]:
var loyal_date
0 1 2017-01-17
1 1 2017-01-03
2 1 2017-01-11
3 1 NaT
4 1 NaT
5 2 2017-01-15
6 2 2017-01-07
7 2 NaT
8 2 NaT
9 2 NaT
In [60]: df.groupby('var')['loyal_date'].apply(lambda x: x.notnull().sum()/len(x)*100)
Out[60]:
var
1 60.0
2 40.0
Name: loyal_date, dtype: float64
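An equivalent sketch without a lambda, shown as an alternative rather than a correction: the mean of a boolean mask is the share of True values, so grouping the notna() mask by var gives the same percentages. The frame is rebuilt from the question's data so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "var": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "loyal_date": pd.to_datetime([
        "2017-01-17", "2017-01-03", "2017-01-11", None, None,
        "2017-01-15", "2017-01-07", None, None, None,
    ]),
})

# True where a date is present; the group mean is the fraction present.
pct = df["loyal_date"].notna().groupby(df["var"]).mean() * 100
print(pct)
```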