Get the mean of timedelta column - python

I have a column made of timedelta elements in a dataframe:
time_to_return_ask
0 0 days 00:00:00.046000
1 0 days 00:00:00.204000
2 0 days 00:00:00.336000
3 0 days 00:00:00.362000
4 0 days 00:00:00.109000
...
3240 0 days 00:00:00.158000
3241 0 days 00:00:00.028000
3242 0 days 00:00:00.130000
3243 0 days 00:00:00.035000
3244 0
Name: time_to_return_ask, Length: 3245, dtype: object
I tried to apply the solution of another question, by taking the values of the different elements, but I am already stuck. Any idea? Thanks!
What I tried:
df['time_to_return_ask'].values.astype(np.int64)
means = dropped.groupby('ts').mean()
means['new'] = pd.to_timedelta(means['new'])
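A minimal sketch of the usual fix, using made-up values mimicking the column above: since the dtype is object, convert with pd.to_timedelta first; a timedelta64 Series supports .mean() directly, with no need for the np.int64 detour.

```python
import pandas as pd

# Sample data mimicking the object-dtype column shown above.
s = pd.Series([
    "0 days 00:00:00.046000",
    "0 days 00:00:00.204000",
    "0 days 00:00:00.336000",
], name="time_to_return_ask")

# Convert to timedelta64[ns]; then .mean() works directly.
td = pd.to_timedelta(s)
mean = td.mean()
print(mean)  # roughly 195 milliseconds for these three values
```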

Related

Calculate standard deviation columns for timedelta elements

I have the following dataframe in Python:
ID  country_ID  visit_time
0   ESP         10 days 12:03:00
0   ESP         5 days 02:03:00
0   ENG         5 days 10:02:00
1   ENG         3 days 08:05:03
1   ESP         1 days 03:02:00
1   ENG         2 days 07:01:03
2   ENG         0 days 12:01:02
For each ID I want to calculate the standard deviation of each country_ID group, as std_visit_ESP and std_visit_ENG columns:
std_visit_ESP: the standard deviation of visit_time with country_ID = ESP for each ID.
std_visit_ENG: the standard deviation of visit_time with country_ID = ENG for each ID.
ID  std_visit_ESP    std_visit_ENG
0   2 days 17:00:00  0 days 00:00:00
1   0 days 00:00:00  0 days 12:32:00
2   NaT              0 days 00:00:00
With the groupby method for the mean, you can specify the parameter numeric_only = False, but the std method of groupby does not include this option.
My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:
from datetime import timedelta

import numpy as np
import pandas as pd

# timedelta(days, seconds, microseconds, milliseconds, minutes, hours, weeks)
td1 = timedelta(10, 0, 0, 0, 3, 12, 0).total_seconds()  # 10 days 12:03:00
td2 = timedelta(5, 0, 0, 0, 3, 2, 0).total_seconds()    # 5 days 02:03:00
arr = [td1, td2]
var = np.std(arr)  # population std (ddof=0)
show_s = pd.to_timedelta(var, unit='s')
print(show_s)
I don't know how to use this with groupby to get the desired result. I am grateful for your help.
Use GroupBy.std and pd.to_timedelta
total_seconds = (
    pd.to_timedelta(
        df['visit_time'].dt.total_seconds()
          .groupby([df['ID'], df['country_ID']]).std(),
        unit='s')
    .unstack()
    .fillna(pd.Timedelta(days=0))
)
print(total_seconds)
country_ID ENG ESP
ID
0 0 days 00:00:00 3 days 19:55:25.973595304
1 0 days 17:43:29.315934274 0 days 00:00:00
2 0 days 00:00:00 0 days 00:00:00
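A self-contained version of this approach, reconstructing the question's dataframe from the table above (note groupby std uses the sample standard deviation, ddof=1, which is why the ESP value for ID 0 differs from the population std in the expected output):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 0, 0, 1, 1, 1, 2],
    "country_ID": ["ESP", "ESP", "ENG", "ENG", "ESP", "ENG", "ENG"],
    "visit_time": pd.to_timedelta([
        "10 days 12:03:00", "5 days 02:03:00", "5 days 10:02:00",
        "3 days 08:05:03", "1 days 03:02:00", "2 days 07:01:03",
        "0 days 12:01:02",
    ]),
})

# Std of the total seconds per (ID, country), back to timedelta, then widen.
std_wide = (
    pd.to_timedelta(
        df["visit_time"].dt.total_seconds()
          .groupby([df["ID"], df["country_ID"]]).std(),
        unit="s",
    )
    .unstack()
    .fillna(pd.Timedelta(0))
    .add_prefix("std_visit_")
)
print(std_wide)
```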
If I understand correctly, this should work for you:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x))
Output:
>>> stddevs
country_ID
ENG 2 days 01:17:43.835702
ESP 4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]
Formatting:
stddevs = (
    df['visit_time'].dt.total_seconds()
      .groupby([df['country_ID']]).std()
      .apply(lambda x: pd.Timedelta(seconds=x))
      .to_frame().T
      .add_prefix('std_visit_')
      .reset_index(drop=True)
      .rename_axis(None, axis=1)
)
Output:
>>> stddevs
std_visit_ENG std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773

So I have two date columns in pandas, and I need to find which orders were completed in less than a month

The two columns are close_date and created_on. I have done
products_pipeline_and_teams['Days_Taken_to_close']=products_pipeline_and_teams.close_date - products_pipeline_and_teams.created_on
and got the results like this
0 12 days
1 36 days
2 NaT
3 77 days
4 68 days
5 NaT
6 113 days
7 9 days
8 14 days
9 NaT
Name: Days_Taken_to_close, dtype: timedelta64[ns]
how do I find the rows where the number of days is more than 30?
I have tried this
products_pipeline_and_teams['Days_Taken_to_close'] < 30
It's giving an error like this:
Invalid comparison between dtype=timedelta64[ns] and int
Try < timedelta(days=30) instead.
You have to import timedelta first, like from datetime import timedelta.
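Equivalently, pandas' own pd.Timedelta works without an extra import; NaT rows compare as False either way. A small sketch with made-up values from the output above:

```python
import pandas as pd

days_taken = pd.Series(
    pd.to_timedelta(["12 days", "36 days", None, "77 days"]),
    name="Days_Taken_to_close",
)

# Compare against a Timedelta, not a bare int; NaT rows come out False.
over_a_month = days_taken > pd.Timedelta(days=30)
print(over_a_month.tolist())  # [False, True, False, True]
```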

How to convert X min, Y sec string to timestamp

I have a dataframe with a duration column of strings in a format like:
index  duration
0      26 s
1      24 s
2      4 min, 37 s
3      7 s
4      1 min, 1 s
Is there a pandas or strftime() / strptime() way to convert the duration column to a min/sec timestamp?
I've attempted this way to convert strings, but I'll run into multiple scenarios after replacing strings:
for row in df['index']:
    if "min, " in df['duration'][row]:
        df['duration'][row] = df['duration'][row].replace(' min, ', ':').replace(' s', '')
Thanks in advance
Try:
pd.to_timedelta(df['duration'])
Output:
0 0 days 00:00:26
1 0 days 00:00:24
2 0 days 00:04:37
3 0 days 00:00:07
4 0 days 00:01:01
Name: duration, dtype: timedelta64[ns]
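If a variant of the string format ever trips up pd.to_timedelta, a regex-based fallback (a sketch, assuming only the "X min, Y s" and "Y s" shapes shown above) is to extract the parts and build the timedelta from total seconds:

```python
import pandas as pd

df = pd.DataFrame({"duration": ["26 s", "24 s", "4 min, 37 s", "7 s", "1 min, 1 s"]})

# Pull out optional minutes and the seconds, then combine into total seconds.
parts = df["duration"].str.extract(r"(?:(\d+) min, )?(\d+) s")
seconds = parts[0].fillna(0).astype(int) * 60 + parts[1].astype(int)
df["duration_td"] = pd.to_timedelta(seconds, unit="s")
```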

How to convert "0 days 00:09:06.633000000" object to minutes/hours/seconds?

I was working with some data in pandas, and after saving it to csv the format changed from 00:31:24.904000 (timedelta64[ns]) to 0 days 00:31:24.904000 (object).
0 0 days 00:25:20.835688000
1 0 days 00:01:44.004000000
2 0 days 00:18:29.023000000
3 0 days 00:09:06.633000000
4 0 days 00:02:16.826000000
...
6004 0 days 00:00:00.000000000
6005 0 days 00:31:24.904000000
6006 0 days 00:02:31.637000000
6007 0 days 00:03:40.214000000
6008 0 days 00:01:26.577000000
Name: Time, Length: 6009, dtype: object
How can I convert it back to timedelta or some other date/time related format?
How can I avoid such conversion when saving to csv?
How can I convert it back to timedelta or some other date/time related format?
df['Time'] = pd.to_timedelta(df['Time'])
How can I avoid such conversion when saving to csv?
It is not possible, because in CSV all data are stored as strings.
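A round-trip sketch: write the timedelta column to CSV (where it becomes text) and re-parse it on the way back in. An in-memory buffer stands in for a file here:

```python
import io

import pandas as pd

df = pd.DataFrame({"Time": pd.to_timedelta(["00:25:20.835688", "00:01:44.004"])})

# Writing: the timedelta values are serialized as plain strings.
buf = io.StringIO()
df.to_csv(buf, index=False)

# Reading back: the column arrives as object dtype, so re-parse it.
buf.seek(0)
df2 = pd.read_csv(buf)
df2["Time"] = pd.to_timedelta(df2["Time"])
```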

datetime: subtracting date from itself yields 3288 days

I have a bunch of dates in a pandas dataframe, mostly observed for July of each year, of type datetime64[ns].
In [126]:
e6.To.head()
Out[122]:
14 1991-07-01
15 1992-07-01
16 1993-07-01
17 1994-07-01
18 1995-07-01
Name: To, dtype: datetime64[ns]
I ultimately want to store in a separate variable the rolling difference from one row to the next using shift(), but I found subtracting dates produced odd results. Here, I subtract a series of dates from itself (reprinting the first five results). Some of them are, as expected, 0, but others obviously are not.
In [127]:
(e6.To-e6.To).head()
Out[127]:
1 0 days
1 -3288 days
1 3288 days
1 0 days
2 0 days
Name: To, dtype: timedelta64[ns]
If I take just the top five observations and then subtract, I do not get this result, and get all 0's as expected:
In [128]:
e6.To.head()-e6.To.head()
Out[119]:
14 0 days
15 0 days
16 0 days
17 0 days
18 0 days
Name: To, dtype: timedelta64[ns]
I can't reproduce it if I 'enter' the data directly, like so:
In [128]:
test=pd.DataFrame(data=['1991-07-01','1992-07-01','1993-07-01','1994-07-01','1995-07-01','1996-07-01'],columns=['date'])
test['date']=test['date'].astype('datetime64')
test.date - test.date
Out[128]:
0 0 days
1 0 days
2 0 days
3 0 days
4 0 days
5 0 days
Name: date, dtype: timedelta64[ns]
Any ideas what I am doing wrong here?
Not quite an answer, but I need some space to show something. My guess is that something weird is going on with indexing (I have no idea why, though). Note my comment about indexing above and also note @ASGM's comment about the difference being very close to 9 years.
I'm using your code to create the sample data above, but adding a few years and sticking to the name of 'e6' for the dataframe and 'To' for the variable in the event that matters (I really doubt it, but you know...)
In [10]: e6
Out[10]:
To
0 1991-07-01
1 1992-07-01
2 1993-07-01
3 1994-07-01
4 1995-07-01
5 1996-07-01
6 1997-07-01
7 1998-07-01
8 1999-07-01
9 2000-07-01
10 2001-07-01
11 2002-07-01
In [11]: e6.To - e6.To[9]
Out[11]:
0 -3288 days
1 -2922 days
2 -2557 days
3 -2192 days
4 -1827 days
5 -1461 days
6 -1096 days
7 -731 days
8 -366 days
9 0 days
10 365 days
11 730 days
Name: To, dtype: timedelta64[ns]
