split, groupby, combine in Pandas to find a difference in dates - python

I have a simple dataframe (construction code at the end of this question).
I would like to use groupby to group by id, difference each row's date from the earliest date in its group, and then bind the result back to the dataframe as a new column, so each row ends up with a days-since-earliest value.
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
import pandas as pd
from pandas import DataFrame

dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01',
                        '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'date': dates})
cols = ['id', 'date']
DF = DF[cols]
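For reference, DF then prints as:
   id       date
0   1 2015-01-01
1   1 2015-02-01
2   1 2015-03-01
3   1 2015-04-01
4   1 2015-05-01
5   2 2015-01-01
6   2 2015-01-02
7   2 2015-01-03
8   2 2015-01-04
9   2 2015-01-05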
EDIT:
Both answers below are awesome. I wish I could accept them both.

You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
    return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
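As a side note, the same integer result can be computed without a row-wise apply by mapping the per-id minima back onto the id column (a vectorized sketch reusing earliest_by_id from above):
DF['days_since_earliest'] = (DF['date'] - DF['id'].map(earliest_by_id)).dt.days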

FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it back up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
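On recent pandas versions, passing the string alias does the same thing and avoids handing the builtin min to transform:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform("min")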

Related

Pandas Difference Between Dates in Months

I have a dataframe date column with the values below:
2015-01-01
2015-02-01
2015-03-01
2015-07-01
2015-08-01
2015-10-01
2015-11-01
2016-02-01
I want to find the difference between these values in months, as below:
date_dt diff_mnts
2015-01-01 0
2015-02-01 1
2015-03-01 1
2015-07-01 4
2015-08-01 1
2015-10-01 2
2015-11-01 1
2016-02-01 3
I tried using the diff() method to calculate the days and then converting with astype('timedelta64[M]'), but when the gap is under 30 days it shows a month difference of 0. Please let me know if there is an easy built-in function I can try in this case.
Option 1
Change the period and call diff.
df
Date
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-07-01
4 2015-08-01
5 2015-10-01
6 2015-11-01
7 2016-02-01
df.Date.dtype
dtype('<M8[ns]')
df.Date.dt.to_period('M').diff().fillna(0)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
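Note that on newer pandas versions, subtracting periods yields DateOffset objects rather than plain integers, so you may need to extract the count yourself (a sketch; the exact behavior depends on your pandas version):
diffs = df.Date.dt.to_period('M').diff()
diffs.apply(lambda x: x.n if pd.notnull(x) else 0).astype(int)  # pull the integer out of each offset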
Option 2
Alternatively, call diff on dt.month, but you'll need to account for gaps over a year (solution improved thanks to @galaxyan!):
i = df.Date.dt.year.diff() * 12
j = df.Date.dt.month.diff()
(i + j).fillna(0).astype(int)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
Caveat (thanks to a commenter for spotting it): diff on dt.month alone wouldn't work for gaps over a year, which is why the year term is folded in above.
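A quick sanity check of the combined formula across a year boundary (a sketch):
s = pd.to_datetime(pd.Series(['2015-11-01', '2016-02-01']))
(s.dt.year.diff() * 12 + s.dt.month.diff()).fillna(0).astype(int)
0    0
1    3
dtype: int64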
Try the following steps:
1. Cast the column into datetime format.
2. Use the .dt.month accessor to get the month number.
3. Use the shift() method in pandas to calculate the difference.
Example code will look something like this:
df['diff_mnts'] = df['date_dt'].dt.month - df['date_dt'].shift().dt.month
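Fleshed out into something runnable (a sketch using a made-up date_dt column; this month-only subtraction carries the same year-boundary caveat noted above):
import pandas as pd

df = pd.DataFrame({'date_dt': ['2015-01-01', '2015-02-01', '2015-03-01']})
df['date_dt'] = pd.to_datetime(df['date_dt'])                      # step 1: cast to datetime
months = df['date_dt'].dt.month                                    # step 2: month numbers
df['diff_mnts'] = (months - months.shift()).fillna(0).astype(int)  # step 3: difference via shift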

Pandas pivoting/stacking/reshaping

I'm trying to import data to a pandas DataFrame with columns being date string, label, value. My data looks like the following (just with 4 dates and 5 labels)
from numpy import random
import numpy as np
import pandas as pd
# Creating the data
dates = ("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
values = [random.rand(5) for _ in range(4)]
data = dict(zip(dates,values))
So, the data is a dictionary where the keys are dates and the values are lists, with the list index serving as the label.
Loading this data structure into a DataFrame
df1 = pd.DataFrame(data)
gives me the dates as columns, the label as index, and the value as the value.
An alternative loading would be
df2 = pd.DataFrame.from_dict(data, orient='index')
where the dates are index, and columns are labels.
In neither case do I manage to pivot or stack my way to the view I want.
How should I approach the pivoting/stacking to get that view? Or should I change my data structure before loading it into a DataFrame? In particular, I'd like to avoid having to create all the rows of the table beforehand with a bunch of calls to zip.
IIUC:
Option 1
pd.DataFrame.stack
pd.DataFrame(data).stack() \
    .rename('value').rename_axis(['label', 'date']).reset_index()
label date value
0 0 2015-01-01 0.345109
1 0 2015-01-02 0.815948
2 0 2015-01-03 0.758709
3 0 2015-01-04 0.461838
4 1 2015-01-01 0.584527
5 1 2015-01-02 0.823529
6 1 2015-01-03 0.714700
7 1 2015-01-04 0.160735
8 2 2015-01-01 0.779006
9 2 2015-01-02 0.721576
10 2 2015-01-03 0.246975
11 2 2015-01-04 0.270491
12 3 2015-01-01 0.465495
13 3 2015-01-02 0.622024
14 3 2015-01-03 0.227865
15 3 2015-01-04 0.638772
16 4 2015-01-01 0.266322
17 4 2015-01-02 0.575298
18 4 2015-01-03 0.335095
19 4 2015-01-04 0.761181
Option 2
comprehension
pd.DataFrame(
    [[i, d, v] for d, l in data.items() for i, v in enumerate(l)],
    columns=['label', 'date', 'value']
)
label date value
0 0 2015-01-01 0.345109
1 1 2015-01-01 0.584527
2 2 2015-01-01 0.779006
3 3 2015-01-01 0.465495
4 4 2015-01-01 0.266322
5 0 2015-01-02 0.815948
6 1 2015-01-02 0.823529
7 2 2015-01-02 0.721576
8 3 2015-01-02 0.622024
9 4 2015-01-02 0.575298
10 0 2015-01-03 0.758709
11 1 2015-01-03 0.714700
12 2 2015-01-03 0.246975
13 3 2015-01-03 0.227865
14 4 2015-01-03 0.335095
15 0 2015-01-04 0.461838
16 1 2015-01-04 0.160735
17 2 2015-01-04 0.270491
18 3 2015-01-04 0.638772
19 4 2015-01-04 0.761181
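Another route to the same long format (a sketch) is DataFrame.melt, after promoting the label index to a column:
pd.DataFrame(data).rename_axis('label').reset_index() \
    .melt(id_vars='label', var_name='date', value_name='value')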

groupby DataFrame with new column representing the group

I have a DataFrame with a timestamp column
d1 = DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
                      datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
                      datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
                      datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
                      datetime(2015,1,1,20,59,21)],
                'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
a b
0 2015-01-01 20:02:01 2
1 2015-01-01 20:14:58 4
2 2015-01-01 20:17:05 26
3 2015-01-01 20:31:05 22
4 2015-01-01 20:34:28 45
5 2015-01-01 20:37:51 3
6 2015-01-01 20:41:19 8
7 2015-01-01 20:49:04 121
8 2015-01-01 20:59:21 34
I can group by 15-minute intervals with these operations:
d2=d1.set_index('a')
d3=d2.groupby(pd.TimeGrouper('15Min'))
The number of rows by group is found by
d3.size()
a
2015-01-01 20:00:00 2
2015-01-01 20:15:00 1
2015-01-01 20:30:00 4
2015-01-01 20:45:00 2
I want my original DataFrame to have a column identifying, for each row, the sequential number of the group it belongs to. For example, the first group
2015-01-01 20:00:00
has 2 rows, so the first two rows of my new column in d1 should have the number 1;
the second group
2015-01-01 20:15:00
has 1 row, so the third row of my new column in d1 should have the number 2;
the third group
2015-01-01 20:30:00
has 4 rows, so the fourth, fifth, sixth, and seventh rows of my new column in d1 should have the number 3.
I want my new DataFrame to look like this
a b c
0 2015-01-01 20:02:01 2 1
1 2015-01-01 20:14:58 4 1
2 2015-01-01 20:17:05 26 2
3 2015-01-01 20:31:05 22 3
4 2015-01-01 20:34:28 45 3
5 2015-01-01 20:37:51 3 3
6 2015-01-01 20:41:19 8 3
7 2015-01-01 20:49:04 121 4
8 2015-01-01 20:59:21 34 4
Use .transform() on your groupby object with an itertools.count iterator:
from datetime import datetime
from itertools import count
import pandas as pd
d1 = pd.DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
                         datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
                         datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
                         datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
                         datetime(2015,1,1,20,59,21)],
                   'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
d2 = d1.set_index('a')
counter = count(1)
d2['c'] = (d2.groupby(pd.TimeGrouper('15Min'))['b']
             .transform(lambda x: next(counter)))
print(d2)
Output:
b c
a
2015-01-01 20:02:01 2 1
2015-01-01 20:14:58 4 1
2015-01-01 20:17:05 26 2
2015-01-01 20:31:05 22 3
2015-01-01 20:34:28 45 3
2015-01-01 20:37:51 3 3
2015-01-01 20:41:19 8 3
2015-01-01 20:49:04 121 4
2015-01-01 20:59:21 34 4
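On newer pandas versions pd.TimeGrouper has been removed in favor of pd.Grouper, and GroupBy.ngroup gives the group number directly without the counter trick (a sketch, assuming no empty 15-minute bins in the range):
d2['c'] = d2.groupby(pd.Grouper(freq='15Min')).ngroup() + 1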

Convert list to datetime in pandas

I have the following list in pandas:
s = ['jan_1', 'jan_15', 'feb_1', 'feb_15', 'mar_1', 'mar_15', 'apr_1', 'apr_15',
     'may_1', 'may_15', 'jun_1', 'jun_15', 'jul_1', 'jul_15', 'aug_1', 'aug_15',
     'sep_1', 'sep_15', 'oct_1', 'oct_15', 'nov_1', 'nov_15', 'dec_1', 'dec_15']
Is there a way to convert it into datetime?
I tried:
pd.to_datetime(pd.Series(s))
You have to specify the format argument while calling pd.to_datetime. Try
pd.to_datetime(pd.Series(s), format='%b_%d')
this gives
0 1900-01-01
1 1900-01-15
2 1900-02-01
3 1900-02-15
4 1900-03-01
5 1900-03-15
6 1900-04-01
7 1900-04-15
8 1900-05-01
9 1900-05-15
For setting the current year, a hack may be required, like
pd.to_datetime(pd.Series(s) + '_2015', format='%b_%d_%Y')
to get
0 2015-01-01
1 2015-01-15
2 2015-02-01
3 2015-02-15
4 2015-03-01
5 2015-03-15
6 2015-04-01
7 2015-04-15
8 2015-05-01
9 2015-05-15

Pandas groupby date

I have a DataFrame with events. One or more events can occur on a given date (so the date can't be an index). The date range spans several years. I want to group by year and month and get a count of the Category values. Thanks.
In [12]: df = pd.read_excel('Pandas_Test.xls', 'sheet1')
In [13]: df
Out[13]:
EventRefNr DateOccurence Type Category
0 86596 2010-01-02 00:00:00 3 Small
1 86779 2010-01-09 00:00:00 13 Medium
2 86780 2010-02-10 00:00:00 6 Small
3 86781 2010-02-09 00:00:00 17 Small
4 86898 2010-02-10 00:00:00 6 Small
5 86898 2010-02-11 00:00:00 6 Small
6 86902 2010-02-17 00:00:00 9 Small
7 86908 2010-02-19 00:00:00 3 Medium
8 86908 2010-03-05 00:00:00 3 Medium
9 86909 2010-03-06 00:00:00 8 Small
10 86930 2010-03-12 00:00:00 29 Small
11 86934 2010-03-16 00:00:00 9 Small
12 86940 2010-04-08 00:00:00 9 High
13 86941 2010-04-09 00:00:00 17 Small
14 86946 2010-04-14 00:00:00 10 Small
15 86950 2011-01-19 00:00:00 12 Small
16 86956 2011-01-24 00:00:00 13 Small
17 86959 2011-01-27 00:00:00 17 Small
I tried:
df.groupby(df['DateOccurence'])
For the month and year break out I often add additional columns to the data frame that break out the dates into each piece:
df['year'] = [t.year for t in df.DateOccurence]
df['month'] = [t.month for t in df.DateOccurence]
df['day'] = [t.day for t in df.DateOccurence]
It adds space complexity (extra columns on the df) but is less time complex (less processing on the groupby) than a datetime index; it's really up to you, though a datetime index is the more pandas way to do things.
After breaking out by year, month, day you can do any groupby you need.
df.groupby(['year', 'month']).Category.apply(pd.value_counts)
To get months across multiple years:
df.groupby('month').Category.apply(pd.value_counts)
Or, with a DatetimeIndex di (per Andy Hayden's suggestion):
df.groupby(di.month).Category.apply(pd.value_counts)
You can simply pick which method fits your needs better.
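As an aside, on modern pandas the .dt accessor does the same break-out without a Python loop (a sketch):
df['year'] = df.DateOccurence.dt.year
df['month'] = df.DateOccurence.dt.month
df['day'] = df.DateOccurence.dt.day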
You can apply value_counts to the SeriesGroupBy (for the column):
In [11]: g = df.groupby('DateOccurence')
In [12]: g.Category.apply(pd.value_counts)
Out[12]:
DateOccurence
2010-01-02 Small 1
2010-01-09 Medium 1
2010-02-09 Small 1
2010-02-10 Small 2
2010-02-11 Small 1
2010-02-17 Small 1
2010-02-19 Medium 1
2010-03-05 Medium 1
2010-03-06 Small 1
2010-03-12 Small 1
2010-03-16 Small 1
2010-04-08 High 1
2010-04-09 Small 1
2010-04-14 Small 1
2011-01-19 Small 1
2011-01-24 Small 1
2011-01-27 Small 1
dtype: int64
I actually hoped this would return the following DataFrame, but you need to unstack it:
In [13]: g.Category.apply(pd.value_counts).unstack(-1).fillna(0)
Out[13]:
High Medium Small
DateOccurence
2010-01-02 0 0 1
2010-01-09 0 1 0
2010-02-09 0 0 1
2010-02-10 0 0 2
2010-02-11 0 0 1
2010-02-17 0 0 1
2010-02-19 0 1 0
2010-03-05 0 1 0
2010-03-06 0 0 1
2010-03-12 0 0 1
2010-03-16 0 0 1
2010-04-08 1 0 0
2010-04-09 0 0 1
2010-04-14 0 0 1
2011-01-19 0 0 1
2011-01-24 0 0 1
2011-01-27 0 0 1
If there were multiple different Categories with the same Date they would be on the same row...
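To get the per-year-and-month counts the question asks for without helper columns, you can also group on pieces of the datetime column directly (a sketch):
df.groupby([df.DateOccurence.dt.year, df.DateOccurence.dt.month]).Category.value_counts()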
