datetime: subtracting date from itself yields 3288 days - python

I have a bunch of dates in a pandas dataframe, mostly observed for July of each year, of type datetime64[ns].
In [126]:
e6.To.head()
Out[126]:
14 1991-07-01
15 1992-07-01
16 1993-07-01
17 1994-07-01
18 1995-07-01
Name: To, dtype: datetime64[ns]
I ultimately want to store, in a separate variable, the rolling difference from one row to the next using shift(), but I found that subtracting dates produces odd results. Here, I subtract a series of dates from itself (reprinting the first five results). Some of them are 0, as expected, but others obviously are not.
In [127]:
(e6.To-e6.To).head()
Out[127]:
1 0 days
1 -3288 days
1 3288 days
1 0 days
2 0 days
Name: To, dtype: timedelta64[ns]
If I take just the top five observations and then subtract, I do not get this result, and get all 0's as expected:
In [128]:
e6.To.head()-e6.To.head()
Out[128]:
14 0 days
15 0 days
16 0 days
17 0 days
18 0 days
Name: To, dtype: timedelta64[ns]
I can't reproduce it if I 'enter' the data directly, like so:
In [129]:
test=pd.DataFrame(data=['1991-07-01','1992-07-01','1993-07-01','1994-07-01','1995-07-01','1996-07-01'],columns=['date'])
test['date']=test['date'].astype('datetime64[ns]')
test.date - test.date
Out[129]:
0 0 days
1 0 days
2 0 days
3 0 days
4 0 days
5 0 days
Name: date, dtype: timedelta64[ns]
Any ideas what I am doing wrong here?

Not quite an answer, but I need some space to show something. My guess is that something weird is going on with indexing (I have no idea why, though). Note my comment about indexing above, and also note @ASGM's comment about the difference being very close to 9 years.
I'm using your code to create the sample data above, but adding a few years and sticking with the names 'e6' for the dataframe and 'To' for the variable, in case that matters (I really doubt it, but you know...).
In [10]: e6
Out[10]:
To
0 1991-07-01
1 1992-07-01
2 1993-07-01
3 1994-07-01
4 1995-07-01
5 1996-07-01
6 1997-07-01
7 1998-07-01
8 1999-07-01
9 2000-07-01
10 2001-07-01
11 2002-07-01
In [11]: e6.To - e6.To[9]
Out[11]:
0 -3288 days
1 -2922 days
2 -2557 days
3 -2192 days
4 -1827 days
5 -1461 days
6 -1096 days
7 -731 days
8 -366 days
9 0 days
10 365 days
11 730 days
Name: To, dtype: timedelta64[ns]
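Not claiming this is what happened in the asker's frame, but duplicated index labels reproduce exactly this pattern: Series arithmetic aligns on the index, and when duplicate labels have to be aligned, every pairwise combination within a label is produced. A minimal sketch with made-up data (recent pandas skips alignment when both indexes compare equal, so the sketch re-orders one side to force it):
import pandas as pd

s = pd.Series(pd.to_datetime(['1991-07-01', '2000-07-01', '1992-07-01']),
              index=[1, 1, 2])   # note the duplicated label 1
t = s.sort_values()              # same labels, different order, so alignment kicks in

# label 1 yields every combination of its two rows: 0, -3288, 3288, and 0 days
# (label 2 just yields 0 days); 1991-07-01 to 2000-07-01 is exactly 3288 days,
# which is the "very close to 9 years" noted above
print(s - t)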

Related

How to filter only the first Friday in a column with one week of data (Pandas)

I have a column that contains Friday-to-Friday dates, e.g. Fri March 4 to Fri March 11. I only want to filter out the earliest Friday date. Any suggestions? I figured out a way using the min value, but I feel like there's a better method:
df['Submitted On'] = pd.to_datetime(df['Submitted On'])
early = df['Submitted On'].min()
df = df.loc[df['Submitted On'] != early]
Although I don't know the use case for your data, your method is a little brittle. If for some reason the range of dates in your column changes, then you're filtering out the earliest date regardless of whether it's a Friday or not.
You can use the .dt.dayofweek accessor for Series, which returns integers 0 through 6 for the day of the week (Monday is 0, so Friday is 4), and filter based on the first occurrence of a Friday. For example:
df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-04','2022-03-11'), 'value':range(8)})
df['Submitted On'] = pd.to_datetime(df['Submitted On'])
filtered_df = df.drop(labels=df[df['Submitted On'].dt.dayofweek == 4].index.values[0])
Result:
Submitted On value
1 2022-03-05 1
2 2022-03-06 2
3 2022-03-07 3
4 2022-03-08 4
5 2022-03-09 5
6 2022-03-10 6
7 2022-03-11 7
And note that if I change the date range slightly, it still drops the first Friday:
df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-03','2022-03-12'), 'value':range(10)})
filtered_df = df.drop(labels=df[df['Submitted On'].dt.dayofweek == 4].index.values[0])
Result:
Submitted On value
0 2022-03-03 0
2 2022-03-05 2
3 2022-03-06 3
4 2022-03-07 4
5 2022-03-08 5
6 2022-03-09 6
7 2022-03-10 7
8 2022-03-11 8
9 2022-03-12 9
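One caveat: .index.values[0] relies on the rows already being in chronological order. If that isn't guaranteed, a sketch of a sorting-independent variant of the same idea, using idxmin:
import pandas as pd

df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-03', '2022-03-12'),
                   'value': range(10)}).sample(frac=1)  # shuffle the rows

# select all Fridays, then drop the one with the earliest date
fridays = df[df['Submitted On'].dt.dayofweek == 4]
filtered_df = df.drop(labels=fridays['Submitted On'].idxmin())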

Calculating values from time series in pandas multi-indexed pivot tables

I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id': ['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
                   'Date': ['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
                   'Quality': [2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
               Quality
Id   Date
A4G8 2016-1-1        2
     2016-1-15       4
     2016-1-30       6
P9N3 2017-2-12       1
     2017-2-28       5
     2017-3-10      10
     2019-1-1       10
C7R5 2018-6-1        2
L4U7 2019-8-6        2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below
                Quality  Time From First  Time To Prev
Id   Date
A4G8 2016-1-1         2           0 days       NA days
     2016-1-15        4          14 days       14 days
     2016-1-30        6          29 days       14 days
P9N3 2017-2-12        1           0 days       NA days
     2017-2-28        5          15 days       15 days
     2017-3-10       10          24 days        9 days
The Id column is a string type, and I've converted the date column into datetime, and the Quality column into an integer.
The dataframe is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid for loops. I'm guessing the solution somehow uses pd.eval, but I'm stuck on how to apply it correctly.
Apologies, I'm a Python, pandas, & Stack Overflow noob and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetimes, then subtract the per-group minimum datetime, computed with GroupBy.transform, from the Date column; for the second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print(df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
The output is the same frame as the first answer's, now indexed by Id and Date, with the helper Date_first column kept alongside Time From First and Time To Prev. Note that groupby(...).first() matches the first answer's transform('min') here only because each Id's rows are already in date order.

So I have two date columns in pandas, and I need to find which orders were completed in less than a month

The two columns are close_date and created_on. I have done
products_pipeline_and_teams['Days_Taken_to_close']=products_pipeline_and_teams.close_date - products_pipeline_and_teams.created_on
and got results like this:
0 12 days
1 36 days
2 NaT
3 77 days
4 68 days
5 NaT
6 113 days
7 9 days
8 14 days
9 NaT
Name: Days_Taken_to_close, dtype: timedelta64[ns]
How do I find the rows where the number of days is more than 30?
I have tried this
products_pipeline_and_teams['Days_Taken_to_close'] < 30
It's giving an error like this:
Invalid comparison between dtype=timedelta64[ns] and int
try < timedelta(days = 30)
You have to import timedelta like from datetime import timedelta.
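Putting it together, a minimal sketch (pandas' own pd.Timedelta(days=30) works in the comparison too, without the extra import); note that NaT rows compare as False, so they drop out of the mask:
from datetime import timedelta

over_a_month = products_pipeline_and_teams[
    products_pipeline_and_teams['Days_Taken_to_close'] > timedelta(days=30)
]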

Get the mean of timedelta column

I have a column made of timedelta elements in a dataframe:
time_to_return_ask
0 0 days 00:00:00.046000
1 0 days 00:00:00.204000
2 0 days 00:00:00.336000
3 0 days 00:00:00.362000
4 0 days 00:00:00.109000
...
3240 0 days 00:00:00.158000
3241 0 days 00:00:00.028000
3242 0 days 00:00:00.130000
3243 0 days 00:00:00.035000
3244 0
Name: time_to_return_ask, Length: 3245, dtype: object
I tried to apply the solution of another question, by taking the values of the different elements, but I am already stuck. Any idea? Thanks!
What I tried:
df['time_to_return_ask'].values.astype(np.int64)  # needs import numpy as np
means = dropped.groupby('ts').mean()              # 'dropped' and 'ts' come from the other question
means['new'] = pd.to_timedelta(means['new'])      # as does 'new'
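No answer is shown here, and the snippet above borrows names (dropped, ts, new) from the other question, so it won't run as-is. A minimal sketch of one direct approach, assuming the stray integer 0 in the column should count as a zero-length duration: coerce the object column with pd.to_timedelta, after which .mean() works natively and skips missing values:
import pandas as pd

# coerce the mixed Timedelta / integer-0 entries to a proper timedelta64[ns] column
df['time_to_return_ask'] = pd.to_timedelta(df['time_to_return_ask'])
print(df['time_to_return_ask'].mean())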

Standard deviation of difference between dates in pandas group by

I have a dataframe of transactions. One of my columns is the date (datetime64[ns]). I'm making a group by of users (email as id). Something I'm interested in is the variability of time between orders for each user, so what I'm looking for in the group by is the standard deviation of the difference between dates (in days) for each user. If the user has two or fewer transactions, the answer should be 0. This is some of the dataframe (I changed some things manually):
df
email date
0 cuadros.paolo#gmail.com 2018-05-01 12:29:59
1 rlez_1202#hotmail.com 2018-07-11 13:43:22
2 cuadros.paolo#gmail.com 2018-09-21 12:29:23
3 paola.alvarado#rumah.com.pe 2018-09-01 09:21:43
4 luchosuito#gmail.com 2018-04-30 12:29:30
5 paola.alvarado#rumah.com.pe 2018-03-22 12:29:23
6 davida.alvarado.703#gmail.com 2018-07-21 12:29:17
7 cuadros.paolo#gmail.com 2018-08-11 12:29:41
8 rlez_1202#hotmail.com 2018-05-23 12:29:14
9 luchosuito#gmail.com 2018-06-01 12:29:17
10 jessica26011#hotmail.com 2018-07-18 12:29:20
11 cuadros.paolo#gmail.com 2018-08-21 12:29:40
12 rlez_1202#hotmail.com 2018-10-01 12:29:31
13 paola.alvarado#rumah.com.pe 2018-06-01 12:29:20
14 miluska-paico#hotmail.com 2018-05-21 12:29:18
15 cinthia_leon87#hotmail.com 2018-07-20 12:29:59
I've tried many ways, but still can't get it. Please help.
For sequential differences, which seems to make the most sense given your explanation:
df.sort_values('date').groupby('email').apply(lambda x: x.date.diff().std()).fillna(0)
Output:
email
cinthia_leon87#hotmail.com 0 days 00:00:00
cuadros.paolo#gmail.com 48 days 05:04:12.988006
davida.alvarado.703#gmail.com 0 days 00:00:00
jessica26011#hotmail.com 0 days 00:00:00
luchosuito#gmail.com 0 days 00:00:00
miluska-paico#hotmail.com 0 days 00:00:00
paola.alvarado#rumah.com.pe 14 days 18:10:16.764069
rlez_1202#hotmail.com 23 days 06:17:04.453408
dtype: timedelta64[ns]
.std() will be null for groups with only 1 non-null value, and since .diff() reduces the number of non-null observations by 1, this automatically returns NaN for any group with 2 or fewer measurements, which we fill with 0.
Also just be aware that the default for pandas is to use N-1 degrees of freedom.
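If population standard deviation (N degrees of freedom) is wanted instead, or the result should be plain float days rather than timedeltas, both are small tweaks; a sketch of illustrative variations, not part of the answer above:
import pandas as pd

out = (df.sort_values('date')
         .groupby('email')
         .apply(lambda x: x.date.diff().std(ddof=0))  # population std
         .fillna(pd.Timedelta(0)))

out_days = out / pd.Timedelta(days=1)  # convert timedeltas to float days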
