Given the date column, I want to create another column, diff, that counts how many days each row is from the first date.
date diff
2011-01-01 00:00:10 0
2011-01-01 00:00:11 0.000011 days
2011-02-01 00:00:11 30.000011 days
2013-02-01 00:00:11 395.000011 days
2014-02-01 00:00:11 760.000011 days
Dates are in datetime. What I tried so far:
df = df.sort_values(['date'], ascending=True)
df.set_index('date', inplace = True)
first = df.index[0]
df['diff'] = (first - df.index.shift()).fillna(0)
You can try:
df['diff'] = df.date - df.date.min()
df
date diff
0 2011-01-01 00:00:10 0 days 00:00:00
1 2011-01-01 00:00:11 0 days 00:00:01
2 2011-02-01 00:00:11 31 days 00:00:01
3 2013-02-01 00:00:11 762 days 00:00:01
4 2014-02-01 00:00:11 1127 days 00:00:01
This is what you can try:
>>> df
date
0 2011-01-01 00:00:10
1 2011-01-01 00:00:11
2 2011-02-01 00:00:11
3 2013-02-01 00:00:11
4 2014-02-01 00:00:11
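First convert the values to timestamps so the data can be framed correctly. Once converted, simply difference the DataFrame (note that this gives the gap between consecutive rows, not the offset from the first date):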
>>> df2 = df.apply(lambda x: [pd.Timestamp(ts) for ts in x])
>>> df['diff'] = (df2 - df2.shift()).fillna(0)
>>> df
date diff
0 2011-01-01 00:00:10 0 days 00:00:00
1 2011-01-01 00:00:11 0 days 00:00:01
2 2011-02-01 00:00:11 31 days 00:00:00
3 2013-02-01 00:00:11 731 days 00:00:00
4 2014-02-01 00:00:11 365 days 00:00:00
Here's what I'd do to get days as float number values:
dates = pd.to_datetime(df.date) # make sure we are working with dates and not strings
df["diff"] = (dates - dates[0]).apply(lambda x: x.total_seconds() / 86400)
The resulting df:
date diff
0 2011-01-01 00:00:10 0.000000
1 2011-01-01 00:00:11 0.000012
2 2011-02-01 00:00:11 31.000012
3 2013-02-01 00:00:11 762.000012
4 2014-02-01 00:00:11 1127.000012
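As a minimal sketch of the same idea without a Python-level lambda, the Timedelta column can be divided by a one-day Timedelta. The sample dates are copied from the question; the names elapsed and diff_alt are only illustrative.

```python
import pandas as pd

# Sample dates copied from the question
df = pd.DataFrame({"date": pd.to_datetime([
    "2011-01-01 00:00:10",
    "2011-01-01 00:00:11",
    "2011-02-01 00:00:11",
    "2013-02-01 00:00:11",
    "2014-02-01 00:00:11",
])})

# Offset of every row from the earliest date, as a Timedelta column
elapsed = df["date"] - df["date"].min()

# Dividing by a one-day Timedelta gives the offset as float days
df["diff"] = elapsed / pd.Timedelta(days=1)

# Equivalent form that goes through seconds
diff_alt = elapsed.dt.total_seconds() / 86400

print(df)
```

Using min() rather than the first row means the result does not depend on the frame being sorted.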
You can use this approach without setting a new index
Raw dataframe
df
date diff
0 2011-01-01 00:00:10 0.000000
1 2011-01-01 00:00:11 0.000011
2 2011-02-01 00:00:11 30.000011
3 2013-02-01 00:00:11 395.000011
4 2014-02-01 00:00:11 760.000011
Possible answer
df['diff_new'] = df['date'] - df.loc[0,'date']
date diff diff_new
0 2011-01-01 00:00:10 0.000000 0 days 00:00:00
1 2011-01-01 00:00:11 0.000011 0 days 00:00:01
2 2011-02-01 00:00:11 30.000011 31 days 00:00:01
3 2013-02-01 00:00:11 395.000011 762 days 00:00:01
4 2014-02-01 00:00:11 760.000011 1127 days 00:00:01
BTW, I get different date differences than you show in the raw data from the 3rd row onward. You can compare manually with this online tool to calculate date differences in days.
Related
In the example dataframe below, how can I convert t_relative into hours? For example, the relative time in the first row would be 49 hours.
tstart tend t_relative
0 2131-05-16 23:00:00 2131-05-19 00:00:00 2 days 01:00:00
1 2131-05-16 23:00:00 2131-05-19 00:15:00 2 days 01:15:00
2 2131-05-16 23:00:00 2131-05-19 00:45:00 2 days 01:45:00
3 2131-05-16 23:00:00 2131-05-19 01:00:00 2 days 02:00:00
4 2131-05-16 23:00:00 2131-05-19 01:15:00 2 days 02:15:00
t_relative was calculated with the operation, df['t_relative'] = df['tend']-df['tstart'].
You can divide Timedelta:
df['t_relative']/pd.Timedelta('1h')
Output:
0 49.00
1 49.25
2 49.75
3 50.00
4 50.25
Name: t_relative, dtype: float64
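One thing that is easy to get wrong here: the components accessor returns only the hours field of each Timedelta, not the total elapsed hours. The sketch below rebuilds the first three rows of the question's data and contrasts the two; the variable names are illustrative.

```python
import pandas as pd

# First three rows of the question's data
df = pd.DataFrame({
    "tstart": pd.to_datetime(["2131-05-16 23:00:00"] * 3),
    "tend": pd.to_datetime([
        "2131-05-19 00:00:00",
        "2131-05-19 00:15:00",
        "2131-05-19 00:45:00",
    ]),
})
df["t_relative"] = df["tend"] - df["tstart"]

# Total elapsed hours: divide by a one-hour Timedelta
total_hours = df["t_relative"] / pd.Timedelta(hours=1)

# Only the hours field of "2 days 01:00:00", i.e. the days are ignored
hours_field = df["t_relative"].dt.components["hours"]

print(total_hours.tolist())
print(hours_field.tolist())
```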
I have this DataFrame, where each row records the time period in which a task happened:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome:
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 23:59:59 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?
IIUC you can discard the Date column, get the time difference between start and end, groupby 30 minutes and agg on first (assuming you always have one entry only per 30 minutes slot):
print (df.assign(Diff=df["End Time"]-df["Start Time"])
         .groupby(pd.Grouper(key="Start Time", freq="30min"))
         .agg({"Diff": "first"})
         .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
The idea is to create a Series of zeros with a per-minute DatetimeIndex running from the earliest Start Time to the latest End Time. Then add 1 at each Start Time and subtract 1 at each End Time. A cumsum then marks every minute between a start and an end with 1, so resample(...).sum() per 30 minutes counts the busy minutes in each slot, and reset_index turns the result into a DataFrame. The last line of code only puts the Hours column into the requested format.
# create a Series of zeros with a per-minute DatetimeIndex
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='min'),
                                       name='Dates'),
                name='Hours')
# add 1 at each start time and subtract 1 at each end time
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum flags every busy minute with 1, then sum the minutes per 30-minute slot
res = (res.cumsum()
          .resample('30min', label='right').sum()
          .reset_index('Dates')
       )
# change the format of the Hours column (optional)
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M') # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02
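For anyone who wants to run the +1/-1 and cumsum idea end to end, here is a self-contained sketch built on the three sample rows from the question. It is a variation rather than the exact code above: it uses value_counts plus reindex (my assumption, so that several tasks starting in the same minute are all counted), and the names minutes, active and per_slot are illustrative.

```python
import pandas as pd

# The three sample rows from the question
df = pd.DataFrame({
    "Start Time": pd.to_datetime(
        ["2016-01-01 00:10", "2016-01-01 01:17", "2016-01-02 00:30"]),
    "End Time": pd.to_datetime(
        ["2016-01-01 00:25", "2016-01-01 01:31", "2016-01-02 00:32"]),
})

# One entry per minute between the earliest start and the latest end
minutes = pd.date_range(df["Start Time"].min(), df["End Time"].max(), freq="min")

# Number of tasks starting / ending in each minute (0 where nothing happens)
starts = df["Start Time"].value_counts().reindex(minutes, fill_value=0)
ends = df["End Time"].value_counts().reindex(minutes, fill_value=0)

# Running total = number of tasks active during each minute
active = (starts - ends).cumsum()

# Busy minutes per 30-minute slot, labelled by the end of the slot
per_slot = active.resample("30min", label="right").sum()

# Show only the slots that contain some activity
print(per_slot[per_slot > 0])
```

If tasks overlap, a slot can hold more than 30 busy minutes, so the result is better read as task-minutes than as wall-clock time.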
I wonder whether it is possible to convert an irregular time series to a regular one without interpolating the values of the other column. My data looks like this:
Index count
2018-01-05 00:00:00 1
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-11 00:00:00 2
2018-01-14 00:00:00 5
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
And I expect the result to be something like this:
Index count
2018-01-01 00:00:00 0
2018-01-02 00:00:00 0
2018-01-03 00:00:00 0
2018-01-04 00:00:00 0
2018-01-05 00:00:00 1
2018-01-06 00:00:00 0
2018-01-07 00:00:00 4
2018-01-08 00:00:00 15
2018-01-09 00:00:00 0
2018-01-10 00:00:00 0
2018-01-11 00:00:00 2
2018-01-12 00:00:00 0
2018-01-13 00:00:00 0
2018-01-14 00:00:00 5
2018-01-15 00:00:00 0
2018-01-16 00:00:00 0
2018-01-17 00:00:00 0
2018-01-18 00:00:00 0
2018-01-19 00:00:00 5
....
2018-12-26 00:00:00 6
2018-12-27 00:00:00 0
2018-12-28 00:00:00 0
2018-12-29 00:00:00 7
2018-12-30 00:00:00 8
2018-12-31 00:00:00 0
So far I have only tried resample from pandas, but it only partially solved my problem.
Thanks in advance
Use DataFrame.reindex with date_range:
#if necessary
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range('2018-01-01','2018-12-31'), fill_value=0)
print (df)
count
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 1
...
2018-12-27 0
2018-12-28 0
2018-12-29 7
2018-12-30 8
2018-12-31 0
[365 rows x 1 columns]
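A possible reason why resample only partially worked (an assumption on my part, since the attempt is not shown) is that frequency-based methods only span the dates that already exist, whereas reindex accepts any target range. The sketch below uses a shortened version of the question's data to show the difference; regular and inner are illustrative names.

```python
import pandas as pd

# Shortened version of the question's irregular series
df = pd.DataFrame(
    {"count": [1, 4, 15, 2]},
    index=pd.to_datetime(["2018-01-05", "2018-01-07", "2018-01-08", "2018-01-11"]),
)

# reindex: the target range is chosen explicitly, so leading and
# trailing days without data are included and filled with 0
full_range = pd.date_range("2018-01-01", "2018-01-14", freq="D")
regular = df.reindex(full_range, fill_value=0)

# asfreq: fills gaps too, but only between the first and last existing date
inner = df.asfreq("D", fill_value=0)

print(regular)
print(inner)
```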
I have the above dataframe (snippet) and want to create a new dataframe by conditional selection, keeping only the rows whose timestamp has a time before 15:00:00.
I'm still somewhat new to Pandas / python and have been stuck on this for a while :(
You can use DataFrame.between_time:
start = pd.to_datetime('2015-02-24 11:00')
rng = pd.date_range(start, periods=10, freq='14h')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2015-02-24 11:00:00 0
1 2015-02-25 01:00:00 1
2 2015-02-25 15:00:00 2
3 2015-02-26 05:00:00 3
4 2015-02-26 19:00:00 4
5 2015-02-27 09:00:00 5
6 2015-02-27 23:00:00 6
7 2015-02-28 13:00:00 7
8 2015-03-01 03:00:00 8
9 2015-03-01 17:00:00 9
df = df.set_index('Date').between_time('00:00:00', '15:00:00')
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-25 15:00:00 2
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
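If you need to exclude 15:00:00, add the parameter inclusive='left' (pandas 1.4 and later; older versions used include_end=False instead):
# df already has 'Date' as its index after the previous step
df = df.between_time('00:00:00', '15:00:00', inclusive='left')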
print (df)
a
Date
2015-02-24 11:00:00 0
2015-02-25 01:00:00 1
2015-02-26 05:00:00 3
2015-02-27 09:00:00 5
2015-02-28 13:00:00 7
2015-03-01 03:00:00 8
You can check the hours of the date column and use it for subsetting:
df['date'] = pd.to_datetime(df['date']) # optional if the date column is of datetime type
df[df.date.dt.hour < 15]
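If the cutoff is not on a full hour, comparing the hour alone is not enough. A minimal sketch, reusing the sample frame from the between_time answer, compares the time-of-day of the column against a datetime.time value; cutoff and before_cutoff are illustrative names.

```python
import datetime

import pandas as pd

# Same sample frame as in the between_time answer
rng = pd.date_range("2015-02-24 11:00", periods=10, freq="14h")
df = pd.DataFrame({"Date": rng, "a": range(10)})

# Keep rows whose time-of-day is strictly before the cutoff;
# the column stays a regular column, no set_index needed
cutoff = datetime.time(15, 0)
before_cutoff = df[df["Date"].dt.time < cutoff]

print(before_cutoff)
```

Changing cutoff to, say, datetime.time(15, 30) works the same way, which the hour-based comparison cannot express.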
I'm new to pandas / python:
I have a dataframe (events.number) indexed by a datetime object.
I'm trying to extract an event count hourly, on every Monday (or other particular weekday). I wrote:
hour_tally_monday = events.number.groupby(lambda x: (x.hour & x.weekday==0) ).count()
but this does not work correctly.
I can drop the "& x.weekday==0" and it works, but then it presumably uses all the days in the frame. What's the right (simplest) syntax to just average over Mondays?
I think you first need to filter the dataframe with boolean indexing and then use groupby with size:
import pandas as pd
start = pd.to_datetime('2016-02-01')
end = pd.to_datetime('2016-02-25')
rng = pd.date_range(start, end, freq='12h')
events = pd.DataFrame({'number': [1] * 20 + [2] * 15 + [3] * 14}, index=rng)
print(events)
number
2016-02-01 00:00:00 1
2016-02-01 12:00:00 1
2016-02-02 00:00:00 1
2016-02-02 12:00:00 1
2016-02-03 00:00:00 1
2016-02-03 12:00:00 1
2016-02-04 00:00:00 1
2016-02-04 12:00:00 1
2016-02-05 00:00:00 1
2016-02-05 12:00:00 1
2016-02-06 00:00:00 1
2016-02-06 12:00:00 1
2016-02-07 00:00:00 1
...
...
filtered = events[events.index.weekday == 0]
print(filtered)
number
2016-02-01 00:00:00 1
2016-02-01 12:00:00 1
2016-02-08 00:00:00 1
2016-02-08 12:00:00 1
2016-02-15 00:00:00 2
2016-02-15 12:00:00 2
2016-02-22 00:00:00 3
2016-02-22 12:00:00 3
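In version 0.18.1 a new attribute, DatetimeIndex.weekday_name, was added; it was replaced by DatetimeIndex.day_name() in pandas 0.23 and removed in 1.0, so in current versions use:
filtered = events[events.index.day_name() == 'Monday']
print(filtered)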
number
2016-02-01 00:00:00 1
2016-02-01 12:00:00 1
2016-02-08 00:00:00 1
2016-02-08 12:00:00 1
2016-02-15 00:00:00 2
2016-02-15 12:00:00 2
2016-02-22 00:00:00 3
2016-02-22 12:00:00 3
print(filtered.groupby(filtered.index.hour).size())
0 4
12 4
dtype: int64
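Putting the pieces together for Python 3, here is a self-contained sketch on the same sample data. As for the original attempt: weekday on a Timestamp is a method and has to be called, and & binds more tightly than ==, so the lambda in the question does not test what it appears to. The names mondays and hour_tally_monday are illustrative.

```python
import pandas as pd

# Same sample data as above: one event every 12 hours
rng = pd.date_range("2016-02-01", "2016-02-25", freq="12h")
events = pd.DataFrame({"number": [1] * 20 + [2] * 15 + [3] * 14}, index=rng)

# Step 1: keep only the Mondays (dayofweek == 0)
mondays = events[events.index.dayofweek == 0]

# Step 2: count the events for each hour of the day
hour_tally_monday = mondays.groupby(mondays.index.hour)["number"].size()

print(hour_tally_monday)
```

Swapping 0 for another integer from 1 to 6 selects Tuesday through Sunday instead.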