Pandas, add date column to a series - python

I have a time-series dataframe that is date agnostic and uses a period number instead of a date.
At some point I would like to add in dates, derived from the period.
My dataframe looks like
period  custid
1       1
2       1
3       1
1       2
2       2
1       3
2       3
3       3
4       3
I would like to be able to pick a random starting date, for example 1/1/2018, and that would be period 1 so you would end up with
period  custid  date
1       1       1/1/2018
2       1       2/1/2018
3       1       3/1/2018
1       2       1/1/2018
2       2       2/1/2018
1       3       1/1/2018
2       3       2/1/2018
3       3       3/1/2018
4       3       4/1/2018

You could create a column of timedeltas based on the period column, where each row is a timedelta of (period - 1) days, so the offsets start at 0. Then, starting from your start_date, which you can define as a datetime object, add the timedelta to the start date:
start_date = pd.to_datetime('1/1/2018')
df['date'] = pd.to_timedelta(df['period'] - 1, unit='D') + start_date
>>> df
   period  custid       date
0       1       1 2018-01-01
1       2       1 2018-01-02
2       3       1 2018-01-03
3       1       2 2018-01-01
4       2       2 2018-01-02
5       1       3 2018-01-01
6       2       3 2018-01-02
7       3       3 2018-01-03
8       4       3 2018-01-04
Edit: in your comment you said you were trying to add months, not days. For this you could use your method, or alternatively the following:
from pandas.tseries.offsets import MonthBegin
df['date'] = start_date + (df['period'] - 1) * MonthBegin()
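If you prefer a per-row offset to multiplying MonthBegin, pd.DateOffset works too; a minimal sketch reusing df and start_date from above (the element-wise apply is my own variation, not from the answer):

# shift start_date by (period - 1) whole months, row by row
df['date'] = df['period'].apply(lambda p: start_date + pd.DateOffset(months=p - 1))

Unlike MonthBegin, DateOffset(months=...) keeps the day of the month; the two coincide here because start_date falls on the 1st.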

Related

Get Week of month column from date

I want to extract a week-of-month column from the date.
Dummy Data:
data = pd.DataFrame(pd.date_range('1/1/2000', periods=100, freq='D'))
Code I tried:
import math

def add_week_of_month(df):
    df['monthweek'] = pd.to_numeric(df.index.day / 7)
    df['monthweek'] = df['monthweek'].apply(lambda x: math.ceil(x))
    return df
But this code counts 7-day blocks within a month: for the first 7 days of a month the column is 1, from day 8 to day 14 it is 2, and so on.
What I want instead is calendar weeks per month: on the first day of the month the feature is 1, and from the first Monday after that it is 2, etc.
Can anyone help me with this?
You can convert to weekly periods and subtract the week of the first day of the month, adding 1 if the month starts on a Monday.
If you want weeks starting on Sundays, use 'W-SAT' as the period and start.dt.dayofweek.eq(6).
# get the first day of the month
start = data[0] + pd.offsets.MonthBegin() + pd.offsets.MonthBegin(-1)
# or
# start = data[0].dt.to_period('M').dt.to_timestamp()

data['monthweek'] = ((data[0].dt.to_period('W') - start.dt.to_period('W'))
                     .apply(lambda x: x.n)
                     .add(start.dt.dayofweek.eq(0))
                     )
NB. in your input, column 0 is the date.
output:
0 monthweek
0 2000-01-01 0
1 2000-01-02 0
2 2000-01-03 1 # Monday
3 2000-01-04 1
4 2000-01-05 1
5 2000-01-06 1
6 2000-01-07 1
7 2000-01-08 1
8 2000-01-09 1
9 2000-01-10 2 # Monday
10 2000-01-11 2
.. ... ...
95 2000-04-05 1
96 2000-04-06 1
97 2000-04-07 1
98 2000-04-08 1
99 2000-04-09 1
[100 rows x 2 columns]
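As background, subtracting two weekly Periods returns an offset in recent pandas (older versions returned a plain int), and its .n attribute is the number of whole weeks between them; that is what the .apply(lambda x: x.n) step extracts. A minimal illustration:

import pandas as pd

a = pd.Period('2000-01-10', freq='W')  # the week containing Mon 2000-01-10
b = pd.Period('2000-01-01', freq='W')  # the week containing Sat 2000-01-01
print((a - b).n)  # 2: those weeks start on 2000-01-10 and 1999-12-27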
Example for 2001 (starts on a Monday):
0 monthweek
0 2001-01-01 1 # Monday
1 2001-01-02 1
2 2001-01-03 1
3 2001-01-04 1
4 2001-01-05 1
5 2001-01-06 1
6 2001-01-07 1
7 2001-01-08 2 # Monday
8 2001-01-09 2
9 2001-01-10 2
10 2001-01-11 2
11 2001-01-12 2
12 2001-01-13 2
13 2001-01-14 2
14 2001-01-15 3
Get the first day of the month, add its weekday to the day of the month, and divide by 7 (wrapped here in a week_of_month helper, a name not in the original, so the return statement has a home):
import math

def week_of_month(dt):
    first_day = dt.replace(day=1)              # first day of dt's month
    dom = dt.day                               # day of month
    adjusted_dom = dom + first_day.weekday()   # shift by the month's starting weekday
    return int(math.ceil(adjusted_dom / 7.0))
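Applied to the question's frame, where column 0 holds the dates, this would look like (my usage sketch, not part of the original answer):

data['monthweek'] = data[0].apply(week_of_month)

Note this counts a partial first week as week 1, so the count bumps to 2 at the first Monday, matching the asker's description.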

Count number of columns above a date

I have a pandas dataframe with several columns, and I would like to know the number of date columns whose value is after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create a boolean mask and call sum on axis=1:
date = pd.to_datetime('2016-12-31')
(df[['Date 1', 'Date 2', 'Date 3', 'Date 4']] > date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a DataFrame with the column named Count:
(df[['Date 1', 'Date 2', 'Date 3', 'Date 4']] > date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
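If other columns could also contain the substring 'Date', anchoring a regex is a safer filter (an assumption about the column naming, not part of the original answer):

(df.filter(regex=r'^Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')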
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are lots of such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
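Note that all of these comparisons assume the Date columns already hold datetime values (or ISO-formatted strings, which happen to compare correctly as text). If they arrive as strings in another format, convert first; a sketch assuming the column names above:

date_cols = df.filter(like='Date').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)   # one pd.to_datetime pass per column
df['Count'] = (df[date_cols] > pd.Timestamp('2016-12-31')).sum(axis=1)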

Grouping by date range with pandas

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Dates are in m-d-y format.
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping should combine rows with the same user_id whose dates are within +/- 3 days of each other, so grouping and summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Can anyone think of a way this could be done (somewhat) easily? I know there are some problematic aspects, for example what to do if the dates chain together endlessly, each within three days of the next, but the exact data I'm using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates)
        .set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).
I came up with a very ugly solution, but it still works...
df = df.sort_values(['user_id', 'date'])
# start a new key whenever the gap to the previous date within a user is 3 or more days
df['Key'] = (df.groupby('user_id')['date']
               .diff().dt.days.lt(3).ne(True).cumsum())
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
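The key step is the diff-and-cumsum idiom: a new key starts wherever the gap to the previous date is not under 3 days (the first row of each user diffs to NaT, whose day comparison is False, so it also starts a key). A standalone sketch of just that step:

import pandas as pd

s = pd.to_datetime(pd.Series(['2017-01-01', '2017-01-02', '2017-01-10']))
starts = s.diff().dt.days.lt(3).ne(True)  # True where a new group begins
print(starts.cumsum().tolist())           # [1, 1, 2]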

Dates to Durations in Pandas

I feel like this should be very easy to do, yet I can't figure out how. I have a pandas DataFrame with a date column:
0 2012-08-21
1 2013-02-17
2 2013-02-18
3 2013-03-03
4 2013-03-04
Name: date, dtype: datetime64[ns]
I want to have a column of durations, something like:
0      0 days
1    180 days
2      1 days
3     13 days
4      1 days
Name: date, dtype: timedelta64[ns]
My attempt yields a bunch of 0 days and NaT instead:
>>> df.date[1:] - df.date[:-1]
0 NaT
1 0 days
2 0 days
...
Any ideas?
Timedeltas are useful here (see the docs):
Starting in v0.15.0, we introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes.
Timedeltas are differences in times, expressed in difference units, e.g. days, hours, minutes, seconds. They can be both positive and negative.
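For example, Timedelta scalars can be built from strings or keyword arguments:

pd.Timedelta('1 days 2 hours')    # Timedelta('1 days 02:00:00')
pd.Timedelta(days=1, hours=2)     # the same value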
df
0
0 2012-08-21
1 2013-02-17
2 2013-02-18
3 2013-03-03
4 2013-03-04
You could use pd.to_timedelta to coerce other inputs into timedelta64 data; here, though, plain subtraction of the dates already yields timedeltas, which can be reduced to integer days as shown below.
Alternatively, you can calculate the difference between points in time using .shift() (or .diff(), as illustrated by @Andy Hayden):
res = df-df.shift()
to get:
res.fillna(0)
0
0 0 days
1 180 days
2 1 days
3 13 days
4 1 days
You can convert these from timedelta64 dtype to integer using:
res.fillna(0).squeeze().dt.days
0 0
1 180
2 1
3 13
4 1
You can use diff:
In [11]: s
Out[11]:
0 2012-08-21
1 2013-02-17
2 2013-02-18
3 2013-03-03
4 2013-03-04
Name: date, dtype: datetime64[ns]
In [12]: s.diff()
Out[12]:
0 NaT
1 180 days
2 1 days
3 13 days
4 1 days
Name: date, dtype: timedelta64[ns]
In [13]: s.diff().fillna(0)
Out[13]:
0 0 days
1 180 days
2 1 days
3 13 days
4 1 days
Name: date, dtype: timedelta64[ns]
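One portability note: depending on your pandas version, fillna(0) on a timedelta series may warn or refuse the plain integer; an explicit Timedelta scalar is unambiguous:

s.diff().fillna(pd.Timedelta(0))  # fill the leading NaT with a zero-length timedelta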
df.date[1:] - df.date[:-1] doesn't do what you think it does. Elements are subtracted after aligning on the series/dataframe index, not by their position in the series.
Calculating df.date[1:] - df.date[:-1] does:

  df.date[1:]      df.date[:-1]
                -  0 2012-08-21  = NaT     (no index 0 on the left)
  1 2013-02-17  -  1 2013-02-17  = 0 days
  2 2013-02-18  -  2 2013-02-18  = 0 days
  3 2013-03-03  -  3 2013-03-03  = 0 days
  4 2013-03-04  -                = NaT     (no index 4 on the right)
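If positional, element-by-element subtraction is really what you want, strip the index away by working on the underlying arrays; a small sketch:

import pandas as pd

s = pd.Series(pd.to_datetime(
    ['2012-08-21', '2013-02-17', '2013-02-18', '2013-03-03', '2013-03-04']))
# .to_numpy() drops the index, so the subtraction is by position
print(pd.Series(s[1:].to_numpy() - s[:-1].to_numpy()))
# 180 days, 1 days, 13 days, 1 days: the same as s.diff() without the leading NaT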

split, groupby, combine in Pandas to find a difference in dates

I have a simple dataframe (defined below). I would like to use groupby to group by id, then find some way to difference the dates, and then column-bind the result back to the dataframe, so that I end up with a column of days since each id's earliest date.
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
import pandas as pd
from pandas import DataFrame

dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01',
                        '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03',
                        '2015-01-04', '2015-01-05'])
DF = DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'date': dates})
cols = ['id', 'date']
DF = DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()

def since_earliest(row):
    return row.date - earliest_by_id[row.id]

DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
Edit: to get the number of days as a plain numeric column:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
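As a small usage note, passing the string alias to transform is equivalent and sidesteps any question of the builtin min being shadowed (my variation, not the original answer's):

df['dse_days'] = (df['date'] - df.groupby('id')['date'].transform('min')).dt.days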
