Pandas to_datetime function giving erratic output - python

My data frame has a column 'Date' of type object, and I want to convert it to a pandas datetime column, so I am using the pd.to_datetime function. The function converts the datatype but gives erratic output.
code:
x1['TS'] = pd.to_datetime(x1['Date'])
x1['Day'] = x1['TS'].dt.dayofweek
x1[['Date', 'TS', 'Day']].iloc[::1430,:]
Now look closely at the Date and TS columns in the output. They should hold the same date, but in some rows they differ.
output :
Date TS Day
0 01-12-2017 2017-01-12 3
1430 01-12-2017 2017-01-12 3
2860 02-12-2017 2017-02-12 6
4290 03-12-2017 2017-03-12 6
5720 04-12-2017 2017-04-12 2
7150 05-12-2017 2017-05-12 4
8580 07-12-2017 2017-07-12 2
10010 08-12-2017 2017-08-12 5
11440 09-12-2017 2017-09-12 1
12870 09-12-2017 2017-09-12 1
14300 10-12-2017 2017-10-12 3
15730 11-12-2017 2017-11-12 6
17160 12-12-2017 2017-12-12 1
18590 13-12-2017 2017-12-13 2
20020 14-12-2017 2017-12-14 3
21450 15-12-2017 2017-12-15 4
22880 16-12-2017 2017-12-16 5
24310 17-12-2017 2017-12-17 6
25740 18-12-2017 2017-12-18 0
27170 19-12-2017 2017-12-19 1
28600 20-12-2017 2017-12-20 2
30030 21-12-2017 2017-12-21 3
31460 22-12-2017 2017-12-22 4
32890 23-12-2017 2017-12-23 5
34320 24-12-2017 2017-12-24 6
35750 25-12-2017 2017-12-25 0
37180 26-12-2017 2017-12-26 1
38610 27-12-2017 2017-12-27 2
40040 28-12-2017 2017-12-28 3
41470 29-12-2017 2017-12-29 4
42900 30-12-2017 2017-12-30 5
44330 31-12-2017 2017-12-31 6
45760 01-01-2018 2018-01-01 0
47190 02-01-2018 2018-02-01 3
48620 03-01-2018 2018-03-01 3
50050 04-01-2018 2018-04-01 6
51480 05-01-2018 2018-05-01 1
52910 06-01-2018 2018-06-01 4
54340 07-01-2018 2018-07-01 6
55770 08-01-2018 2018-08-01 2
57200 09-01-2018 2018-09-01 5
58630 10-01-2018 2018-10-01 0
60060 11-01-2018 2018-11-01 3
61490 12-01-2018 2018-12-01 5
62920 13-01-2018 2018-01-13 5
64350 14-01-2018 2018-01-14 6
65780 15-01-2018 2018-01-15 0
67210 16-01-2018 2018-01-16 1

Oops! Looks like your dates start with the day being first. You'll have to tell pandas to handle that accordingly. Set the dayfirst flag to True when calling to_datetime.
x1['TS'] = pd.to_datetime(x1['Date'], dayfirst=True)
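A minimal sketch of the dayfirst fix on a few made-up rows in the same DD-MM-YYYY layout as the question (the three sample strings are an assumption; the column names follow the question):

```python
import pandas as pd

# Hypothetical sample rows in the question's DD-MM-YYYY layout.
x1 = pd.DataFrame({'Date': ['01-12-2017', '13-12-2017', '02-01-2018']})

# dayfirst=True tells the parser to read the first field as the day.
x1['TS'] = pd.to_datetime(x1['Date'], dayfirst=True)
x1['Day'] = x1['TS'].dt.dayofweek  # Monday=0 ... Sunday=6

print(x1)
```

With the flag set, 01-12-2017 is read as 1 December 2017 rather than 12 January 2017.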

When you pass in a date without specifying the format, pandas tries to guess the format in a naive manner. It assumed that the first field, which is actually your day, was the month; when it then saw a "month" of 13, it realized that could not be a month and switched the two fields for that row.
To fix this, provide the correct format to the to_datetime function. This should work, but I like #cᴏʟᴅsᴘᴇᴇᴅ's solution better because it is simpler to just set the dayfirst flag.
The documentation gives the following example, which you can modify to fit your situation (for the dates above, the format string would be '%d-%m-%Y'):
pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
See details here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
Time format conventions (what %Y means and so on) are here: https://docs.python.org/3.2/library/time.html
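As a rough illustration of the explicit-format route, assuming the DD-MM-YYYY layout from the question (the sample values, including the deliberately bad one, are made up, and errors='coerce' is just one possible choice for handling unparseable values):

```python
import pandas as pd

# Hypothetical sample: two valid DD-MM-YYYY strings and one bad value.
dates = pd.Series(['01-12-2017', '13-12-2017', 'not a date'])

# An explicit format removes the guessing; errors='coerce' turns
# anything that does not match the format into NaT.
parsed = pd.to_datetime(dates, format='%d-%m-%Y', errors='coerce')

print(parsed)
```

An explicit format also makes every row follow the same rule, so a day of 12 or less can no longer be mistaken for a month.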

Related

pandas: find the first friday following the first monday of the month

I am currently trying to figure this out with code like the following, but I have not been able to get it to work yet:
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
This is what the data looks like (I've added the day_of_week column for convenience). Basically, I am trying to find the first day_of_week=0 (Monday) in the month, then filter for the first day_of_week=4 (Friday) after that.
close day_of_week
date
2022-07-01 3825.330078 4
2022-07-05 3831.389893 1
2022-07-06 3845.080078 2
2022-07-07 3902.620117 3
2022-07-08 3899.379883 4
2022-07-11 3854.429932 0
2022-07-12 3818.800049 1
2022-07-13 3801.780029 2
2022-07-14 3790.379883 3
2022-07-15 3863.159912 4
...
2022-08-01 4118.629883 0
2022-08-02 4091.189941 1
2022-08-03 4155.169922 2
2022-08-04 4151.939941 3
2022-08-05 4145.189941 4
2022-08-08 4140.060059 0
2022-08-09 4122.470215 1
2022-08-10 4210.240234 2
2022-08-11 4207.270020 3
2022-08-12 4280.149902 4
...
2022-09-01 3966.850098 3
2022-09-02 3924.260010 4
2022-09-06 3908.189941 1
2022-09-07 3979.870117 2
2022-09-08 4006.179932 3
2022-09-09 4067.360107 4
2022-09-12 4110.410156 0
2022-09-13 3932.689941 1
2022-09-14 3946.010010 2
2022-09-15 3901.350098 3
2022-09-16 3873.330078 4
...
2022-10-03 3678.429932 0
2022-10-04 3790.929932 1
2022-10-05 3783.280029 2
2022-10-06 3744.520020 3
2022-10-07 3639.659912 4
2022-10-10 3612.389893 0
2022-10-11 3588.840088 1
2022-10-12 3577.030029 2
...
2022-11-01 3856.100098 1
2022-11-02 3759.689941 2
2022-11-03 3719.889893 3
2022-11-04 3770.550049 4
2022-11-07 3806.800049 0
2022-11-08 3828.110107 1
This should return:
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
EDIT:
While I don't expect this to work, I am curious as to why it returns no results. Is shifting not supported when filtering in this manner?
df[(df.index.day_of_week==0) & (df.index.day<15) & (df.shift(-4).index.day_of_week==4)]
You can group by [year, month, day_of_week] and do a cumcount to assign to each row the number of times its day_of_week has appeared in this month.
Then, grab the rows corresponding to the first Monday of the month using the filter day_of_week == 0 & cumcount == 0, and shift their index by 4 days to get the following Fridays. Finally, do an intersection to filter out shifted dates that do not exist in the original frame.
wanted_indices = df.index[df.groupby([df.index.year, df.index.month, df['dow']]).cumcount().eq(0) & df['dow'].eq(0)].shift(4, 'D')
df.loc[wanted_indices.intersection(df.index)]
Result on your example dataframe
close dow
date
2022-07-15 3863.159912 4
2022-08-05 4145.189941 4
2022-09-16 3873.330078 4
2022-10-07 3639.659912 4
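A self-contained sketch of the cumcount idea on the July and August rows from the question. It is a variation under assumptions rather than the exact code above: it groups by a monthly period instead of separate year and month keys, and adds a Timedelta instead of calling shift on the index. It also checks one thing relevant to the EDIT: DataFrame.shift without a freq moves the values and leaves the index untouched, so the shifted frame's index can never be a Friday on a row that is a Monday.

```python
import pandas as pd

# July and August 2022 rows copied from the question; 2022-07-04 is absent.
dates = pd.to_datetime([
    '2022-07-01', '2022-07-05', '2022-07-06', '2022-07-07', '2022-07-08',
    '2022-07-11', '2022-07-12', '2022-07-13', '2022-07-14', '2022-07-15',
    '2022-08-01', '2022-08-02', '2022-08-03', '2022-08-04', '2022-08-05',
])
close = [3825.330078, 3831.389893, 3845.080078, 3902.620117, 3899.379883,
         3854.429932, 3818.800049, 3801.780029, 3790.379883, 3863.159912,
         4118.629883, 4091.189941, 4155.169922, 4151.939941, 4145.189941]
df = pd.DataFrame({'close': close}, index=dates)
df.index.name = 'date'
df['dow'] = df.index.dayofweek

# shift() without freq moves the data only; the index is unchanged.
index_unchanged = df.shift(-4).index.equals(df.index)

# Number each row by how often its weekday has already appeared this month.
seen = df.groupby([df.index.to_period('M'), 'dow']).cumcount()
first_mondays = df.index[(seen.eq(0) & df['dow'].eq(0)).to_numpy()]

# Move 4 calendar days ahead and keep only dates present in the frame.
wanted = (first_mondays + pd.Timedelta(days=4)).intersection(df.index)
result = df.loc[wanted]
print(result)
```

Because the first Monday is taken from the rows that exist, July resolves to 2022-07-11 (the 4th is missing from the data), which is why the July result is the 15th.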
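I feel like the simplest way to do that is to recognize that the Friday after the first Monday of each month can fall no earlier than day 5 (the Monday must be at least day 1). So first filter by that requirement, then filter by day_of_week (Friday is 4) and keep the first match in each month. Note that this works from calendar Mondays, so it gives a different row when the first Monday is missing from the data (2022-07-04 in the example, which yields 2022-07-08 instead of 2022-07-15).
# keep only rows on or after the 5th of the month
df_filtered = df[df.index.day >= 5]
# keep Fridays (day_of_week == 4)
fridays = df_filtered[df_filtered['day_of_week'] == 4]
# first remaining Friday in each (year, month)
first_fridays = fridays.groupby([fridays.index.year, fridays.index.month]).head(1)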

Number of active IDs in each period

I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was creating a new data frame c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
All these loops though are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
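A runnable sketch of the broadcasting approach, using the small four-row sample that appears elsewhere in this thread (the sample choice is an assumption; the logic follows the snippet above, with to_numpy() in place of .values):

```python
import pandas as pd

# Small sample frame; START and END must already be datetime64 columns.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'START': pd.to_datetime(['2016-12-31', '2017-01-20', '2016-12-28', '2017-01-20']),
    'END': pd.to_datetime(['2017-01-20', '2017-01-30', '2017-02-03', '2017-01-25']),
})

dates = pd.date_range(df['START'].min(), df['END'].max())

# Column vectors of shape (n_rows, 1) broadcast against (n_days,)
# to give a boolean matrix of shape (n_rows, n_days).
start = df['START'].to_numpy()[:, None]
end = df['END'].to_numpy()[:, None]
mask = (start <= dates.to_numpy()) & (dates.to_numpy() <= end)

# Summing down the rows counts the active IDs on each day.
result = pd.DataFrame({'Date': dates, 'Count': mask.sum(axis=0)})
print(result)
```

The mask holds one boolean per (row, day) pair, so memory grows with the number of rows times the number of days in the overall range.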
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval. (Note: I made a smaller sample to test this solution.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1

Pandas - Times series multiple slices of a dataframe groupby Id

What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with an Item recorded at a particular date and time (Timestamp). The second dataframe, df_ref, holds the date time ranges used for slicing df: the Start and the End for each subject Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date time range given for each Id (groupby Id) in df_ref and concatenate the sliced data into a new dataframe. However, a subject could have more than one date time range (in this example Id=3 has 2 date time ranges).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element which I need.
My code:
from datetime import datetime
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')
x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id']== pid) & (df['Timestamp']>= df_ref['Start']) & (df['Timestamp']<= df_ref['End'])]
    x = x.append(selection)
The above code gives this error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it also creates all combinations for a duplicated Id. Then filter the rows with between and select the original columns with df.columns, both in one DataFrame.loc step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
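A self-contained sketch of the merge-then-between approach on a handful of rows taken from the question (the choice of subset is an assumption made to keep the example short):

```python
import pandas as pd

# A few rows from the question's df; Id 4 has no entry in df_ref.
df = pd.DataFrame({
    'Id': [1, 1, 1, 3, 3, 3, 4],
    'Item': ['aaa', 'baa', 'aud', 'ige', 'loo', 'afr', 'gui'],
    'Timestamp': pd.to_datetime([
        '2011-03-15 14:21:00', '2013-05-08 22:21:29', '2017-05-10 11:58:02',
        '2011-06-30 13:14:21', '2014-06-10 15:15:42', '2015-04-15 00:15:23',
        '2016-01-05 09:19:50',
    ]),
})
df_ref = pd.DataFrame({
    'Id': [1, 3, 3],
    'Start': pd.to_datetime(['2013-03-12 00:00:00', '2011-05-14 10:11:28',
                             '2015-03-29 12:18:31']),
    'End': pd.to_datetime(['2016-05-30 15:20:36', '2013-12-31 17:04:55',
                           '2016-07-26 00:00:00']),
})

# Inner join: every df row is paired with every range for its Id,
# and Ids without a range (Id 4 here) drop out.
df1 = df.merge(df_ref, on='Id')

# Keep rows whose Timestamp lies inside its paired range (bounds inclusive),
# and only the original columns.
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print(df2)
```

If ranges for the same Id overlap, a row can match more than once and appear twice; in that case a drop_duplicates() on the result may be needed.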

Identify amount of due date between subsequent rows

I have a table that is grouped by id and sorted by transaction date, as shown below.
id transactions_date membership_expire_date
1 2016-11-16 2016-12-16
1 2016-12-15 2017-01-14
1 2017-01-15 2017-02-14
1 2017-02-15 2017-03-17
2 2015-01-31 2015-03-03
2 2015-02-28 2015-03-31
2 2015-04-05 2015-05-01
I want to calculate whether the users were late on the due date. For example, for user id 1, the second row's transactions_date falls before the membership_expire_date stated on the first row (paying before, on, or 1 day after membership_expire_date is considered punctual), so the amount due is 0. However, for user id 2, the last row shows that the user paid on 2015-04-05. Therefore, 2015-04-05 - 2015-03-31 - 1 day (one day after membership_expire_date is fine) = 4 days due.
How should I compute this? I am stuck after sorting the rows this way.
transactions_train = transactions_train.sort_values(by=['id','transactions_date', 'membership_expire_date'], ascending=True)
The expected result is something like below.
id transactions_date membership_expire_date late_count
1 2016-11-16 2016-12-16 0
1 2016-12-15 2017-01-14 0
1 2017-01-15 2017-02-14 0
1 2017-02-16 2017-03-17 1
2 2015-01-31 2015-03-03 0
2 2015-02-28 2015-03-31 0
2 2015-04-05 2015-05-01 4
You need to look at shift, indeed.
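def days_due(group):
    day = pd.Timedelta('1d')
    # days between this payment and the previous row's expiry, minus the 1-day grace period
    days_late = (group['transactions_date'] - group['membership_expire_date'].shift()) / day - 1
    # keep only positive delays; the first row of each group (NaN) and on-time payments become 0
    days_late = days_late.where(days_late > 0)
    return days_late.fillna(0).astype(int)

df['late_count'] = pd.concat(days_due(group) for idx, group in df.groupby('id'))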
id transactions_date membership_expire_date late_count
0 1 2016-11-16 2016-12-16 0
1 1 2016-12-15 2017-01-14 0
2 1 2017-01-15 2017-02-14 0
3 1 2017-02-16 2017-03-17 1
4 2 2015-01-31 2015-03-03 0
5 2 2015-02-28 2015-03-31 0
6 2 2015-04-05 2015-05-01 4
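For comparison, a sketch of the same calculation without the explicit loop over groups, using groupby(...).shift() so that the previous expiry never crosses from one id to the next. This is an alternative formulation, not the code above; the data is the expected-result table from the question.

```python
import pandas as pd

# Rows from the question's expected-result table.
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2],
    'transactions_date': pd.to_datetime([
        '2016-11-16', '2016-12-15', '2017-01-15', '2017-02-16',
        '2015-01-31', '2015-02-28', '2015-04-05']),
    'membership_expire_date': pd.to_datetime([
        '2016-12-16', '2017-01-14', '2017-02-14', '2017-03-17',
        '2015-03-03', '2015-03-31', '2015-05-01']),
})

# Previous row's expiry within the same id (NaT on each id's first row).
prev_expiry = df.groupby('id')['membership_expire_date'].shift()

# Days past the previous expiry, minus the 1-day grace period.
days_late = (df['transactions_date'] - prev_expiry).dt.days - 1

# Negative values mean early or on time; NaN marks the first row of an id.
df['late_count'] = days_late.clip(lower=0).fillna(0).astype(int)
print(df)
```

This assumes the frame is already sorted by id and transactions_date, as in the question.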

Convert list to datetime in pandas

I have the following list in pandas:
str = jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
Is there a way to convert it into datetime?
I tried:
pd.to_datetime(pd.Series(str))
You have to specify the format argument while calling pd.to_datetime. Try
pd.to_datetime(pd.Series(s), format='%b_%d')
this gives
0 1900-01-01
1 1900-01-15
2 1900-02-01
3 1900-02-15
4 1900-03-01
5 1900-03-15
6 1900-04-01
7 1900-04-15
8 1900-05-01
9 1900-05-15
For setting the current year, a hack may be required, like
pd.to_datetime(pd.Series(s) + '_2015', format='%b_%d_%Y')
to get
0 2015-01-01
1 2015-01-15
2 2015-02-01
3 2015-02-15
4 2015-03-01
5 2015-03-15
6 2015-04-01
7 2015-04-15
8 2015-05-01
9 2015-05-15
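A short runnable sketch of the year-suffix approach, assuming the labels are held in a Python list named s (only the first few labels are used here) and that month abbreviations are parsed under an English locale:

```python
import pandas as pd

# First few labels from the question, as a list of strings.
s = ['jan_1', 'jan_15', 'feb_1', 'feb_15', 'mar_1', 'mar_15']

# Hypothetical target year; append it so the format can include %Y.
year = 2015
parsed = pd.to_datetime(pd.Series(s) + f'_{year}', format='%b_%d_%Y')
print(parsed)
```

Without the year suffix the format has no year field, which is why the earlier output falls back to 1900.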
