Add column with number of days between dates in DataFrame pandas - python

I want to subtract dates in 'A' from dates in 'B' and add a new column with the difference.
df
A B
one 2014-01-01 2014-02-28
two 2014-02-03 2014-03-01
I've tried the following, but get an error when I try to include this in a for loop...
import datetime
date1=df['A'][0]
date2=df['B'][0]
mdate1 = datetime.datetime.strptime(date1, "%Y-%m-%d").date()
rdate1 = datetime.datetime.strptime(date2, "%Y-%m-%d").date()
delta = (mdate1 - rdate1).days
print delta
What should I do?

To remove the 'days' text element, you can also make use of the .dt accessor for Series: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.html
So,
df[['A','B']] = df[['A','B']].apply(pd.to_datetime) #if conversion required
df['C'] = (df['B'] - df['A']).dt.days
which returns:
A B C
one 2014-01-01 2014-02-28 58
two 2014-02-03 2014-03-01 26

Assuming these were datetime columns (if they're not apply to_datetime) you can just subtract them:
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
In [11]: df.dtypes # if already datetime64 you don't need to use to_datetime
Out[11]:
A datetime64[ns]
B datetime64[ns]
dtype: object
In [12]: df['A'] - df['B']
Out[12]:
one -58 days
two -26 days
dtype: timedelta64[ns]
In [13]: df['C'] = df['A'] - df['B']
In [14]: df
Out[14]:
A B C
one 2014-01-01 2014-02-28 -58 days
two 2014-02-03 2014-03-01 -26 days
Note: make sure you're using a recent version of pandas (e.g. 0.13.1); this may not work in older versions.
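If you want plain integer days rather than a timedelta column, a short follow-up sketch (assuming df['C'] holds the timedelta result from above; the extra column names are illustrative):
df['C_days'] = df['C'].dt.days                      # -58, -26 as int64
# or divide by a one-day Timedelta to get float days
df['C_days_float'] = df['C'] / pd.Timedelta(days=1)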

A list comprehension is your best bet for the most Pythonic (and fastest) way to do this:
[int(i.days) for i in (df.B - df.A)]
i will return the timedelta (e.g. '-58 days')
i.days will return this value as a long integer (e.g. -58L)
int(i.days) will give you the -58 you seek.
If your columns aren't already in datetime format, the shorter conversion syntax would be: df.A = pd.to_datetime(df.A)
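A minimal runnable sketch of that approach, rebuilding the example frame from the question:
import pandas as pd

df = pd.DataFrame({'A': ['2014-01-01', '2014-02-03'],
                   'B': ['2014-02-28', '2014-03-01']},
                  index=['one', 'two'])
df.A = pd.to_datetime(df.A)
df.B = pd.to_datetime(df.B)

# each i is a Timedelta; i.days is the signed whole-day count
df['C'] = [int(i.days) for i in (df.B - df.A)]
print(df)   # C is 58 and 26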

How about this:
times['days_since'] = max(list(df.index.values))
times['days_since'] = times['days_since'] - times['months']
times

Related

Pandas: Access timestamp attributes after reindex

I am having trouble understanding what happens to a timestamp after you reindex a data frame using pd.date_range. If I have the following example where I am using pd.DataFrame.reindex to create a longer time series:
import pandas as pd
import numpy as np
idx_inital = pd.date_range('2004-03-01','2004-05-05')
df = pd.DataFrame(index = idx_inital, data={'data': np.random.randint(0,100,idx_inital.size)})
idx_new = pd.date_range('2004-01-01','2004-05-05')
df= df.reindex(idx_new, fill_value = 0)
which returns the expected result where all data are assigned 0:
data
2004-01-01 0
2004-01-02 0
2004-01-03 0
2004-01-04 0
2004-01-05 0
Now If I want to use apply to assign a new column using:
def year_attrib(row):
    if row.index.month > 2:
        result = row.index.year + 11
    else:
        result = row.index.year + 15
    return result
df['year_attrib'] = df.apply(lambda x: year_attrib(x), axis=1)
I am getting the error:
AttributeError: ("'Index' object has no attribute 'month'", 'occurred at index 2004-01-01 00:00:00')
If I inspect what each row is being passed to year_attrib with:
row = df.iloc[0]
row
Out[32]:
data 0
Name: 2004-01-01 00:00:00, dtype: int32
It looks like the timestamp is being passed to Name and I have no idea how to access it. When I look at row.index I get:
row.index
Out[34]: Index(['data'], dtype='object')
What is the cause of this behavior?
The problem is that when you use the apply function on a DataFrame with axis=1, each row of the DataFrame is passed to the function as a Series; see the pandas documentation.
So what actually happens inside year_attrib is that row.index returns the index of that row Series, which is made up of the DataFrame's columns.
In [5]: df.columns
Out[5]: Index(['data'], dtype='object')
so an AttributeError is raised when row.index.month is used.
If you really want to use this row-wise function, use row.name.month instead.
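A sketch of that row-wise fix, keeping the question's function and changing only how the timestamp is accessed:
def year_attrib(row):
    # row is a Series; row.name holds its label from the DatetimeIndex
    if row.name.month > 2:
        result = row.name.year + 11
    else:
        result = row.name.year + 15
    return result

df['year_attrib'] = df.apply(year_attrib, axis=1)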
However, a vectorized approach is still preferable, for example:
In [10]: df.loc[df.index.month>2, 'year_attrib'] = df[df.index.month>2].index.year + 11
In [11]: df.loc[df.index.month<=2, 'year_attrib'] = df[df.index.month<=2].index.year + 15
In [12]: df
Out[12]:
data year_attrib
2004-03-01 93 2015
2004-03-02 48 2015
2004-03-03 88 2015
2004-03-04 44 2015
2004-03-05 11 2015
2004-03-06 4 2015
2004-03-07 70 2015
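An equivalent single-expression form of that vectorized assignment (a sketch, not from the original answer) uses numpy.where to choose between the two branches:
import numpy as np

# year + 11 for months after February, year + 15 otherwise
df['year_attrib'] = np.where(df.index.month > 2,
                             df.index.year + 11,
                             df.index.year + 15)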

Dropping column values that don't meet a requirement

I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first 4 letters are always integers, but it fails on my data set as some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?
I think it would be better to just use to_datetime to convert to a datetime dtype; you can then drop the invalid rows using dropna and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
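To end up with integer years for just the rows that parsed, a short follow-up sketch (the subset argument is my addition, not part of the original answer):
# keep only rows whose date parsed, then take the year as an int
years = df.dropna(subset=['mod_dates'])['mod_dates'].dt.year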

How to Bin/Count based on dates in Python

I have a python series which contains datetime.date objects ranging from 1/2013 to 12/2015 which is the month a product was sold. What I would like to do is count and bin by month the number of products sold.
Is there an efficient way of doing this with pandas?
I recommend using datetime64: first apply pd.to_datetime to the dates, and if you set the result as the index you can use resample:
In [11]: s = pd.date_range('2015-01', '2015-03', freq='5D') # DatetimeIndex
In [12]: pd.Series(1, index=s).resample('M', how='count')
Out[12]:
2015-01-31 7
2015-02-28 5
2015-03-31 1
Freq: M, dtype: int64
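Applied to the question's setup (a plain Series of datetime.date objects rather than an index), a sketch along the same lines with hypothetical sale dates and the newer resample syntax:
import datetime
import pandas as pd

sold = pd.Series([datetime.date(2013, 1, 15), datetime.date(2013, 1, 20),
                  datetime.date(2013, 2, 3), datetime.date(2013, 4, 30)])

sold = pd.to_datetime(sold)                                  # -> datetime64[ns]
per_month = pd.Series(1, index=pd.DatetimeIndex(sold)).resample('M').count()
print(per_month)                                             # products sold per month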

Pandas DataFrame merge between two values instead of matching one

I have a Dataframe with a date column and I want to merge it with another but not on a match for the column but if the date column is BETWEEN two columns on the second dataframe.
I believe I can achieve this by using apply on the first to filter the second based on these criterion and then combining the results but apply has in practice been a horribly slow way to go about things.
Is there a way to merge with the match being a BETWEEN instead of an Exact match.
example dataframe:
,Code,Description,BeginDate,EndDate,RefSessionTypeId,OrganizationCalendarId
0,2014-2015,School Year: 2014-2015,2014-08-18 00:00:00.000,2015-08-01 00:00:00.000,1,3
1,2012-2013,School Year: 2012-2013,2012-09-01 00:00:00.000,2013-08-16 00:00:00.000,1,2
2,2013-2014,School Year: 2013-2014,2013-08-19 00:00:00.000,2014-08-17 00:00:00.000,1,1
instead of merge on date=BeginDate or date=EndDate I would want to match on date BETWEEN(BeginDate, EndDate)
You can use numpy.searchsorted() to simulate BETWEEN.
Say your data and lookup value look like this:
In [162]: data = pd.DataFrame({
.....: 'Date': pd.Series(pd.np.random.randint(1429449000, 1429649000, 1000) * 1E9).astype('datetime64[ns]'),
.....: 'Value': pd.np.random.randint(0, 100, 1000),
.....: })
In [163]: data.head()
Out[163]:
Date Value
0 2015-04-21 13:37:37 60
1 2015-04-20 06:27:43 76
2 2015-04-20 09:01:51 70
3 2015-04-21 10:47:31 5
4 2015-04-19 18:39:45 27
In [164]:
In [164]: lookup = pd.Series(
.....: pd.np.random.randint(0, 10, 5),
.....: index=pd.Series(pd.np.random.randint(1429449000, 1429649000, 5) * 1E9).astype('datetime64[ns]'),
.....: )
In [165]: lookup
Out[165]:
2015-04-21 11:10:39 4
2015-04-21 07:07:51 1
2015-04-20 08:27:19 1
2015-04-21 09:58:42 6
2015-04-20 06:46:12 7
dtype: int32
You'd first want to make sure that every date in data['Date'] falls within the range covered by lookup's index. Then, sort the lookup by date.
In [166]: lookup[data['Date'].max()] = lookup[data['Date'].min()] = None
In [167]: lookup = lookup.sort_index()
Now comes the important bit -- use NumPy's extremely fast searchsorted() method to get the indices:
In [168]: indices = pd.np.searchsorted(lookup.index.astype(long), data['Date'].astype(long).values, side='left')
In [169]: data['Lookup'] = lookup.iloc[indices].values
In [170]: data.head()
Out[170]:
Date Value Lookup
0 2015-04-21 13:37:37 60 None
1 2015-04-20 06:27:43 76 7
2 2015-04-20 09:01:51 70 1
3 2015-04-21 10:47:31 5 4
4 2015-04-19 18:39:45 27 7
EDIT: you might want to convert the Date range you have in your dataset to a single Series like lookup above. That's because in case of overlapping date ranges, it isn't always clear which value to look up.
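On Python 3 and current pandas the same idea works without the long cast or pd.np, because numpy compares datetime64 values directly; a sketch under that assumption:
import numpy as np

# lookup must be sorted and padded with the min/max dates, as above
indices = np.searchsorted(lookup.index.values, data['Date'].values, side='left')
data['Lookup'] = lookup.iloc[indices].values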
I ended up realizing I was overthinking this: I added a column called merge to both tables, set to all 1's.
Then I can merge on that column and apply regular boolean filters to the resulting merged table.
a["merge"] = 1
b["merge"] = 1
c = a.merge(b, on="merge")
then filter on c
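A minimal sketch of that cross-join-and-filter idea (frame and column names are illustrative, loosely following the example data above):
import pandas as pd

events = pd.DataFrame({'date': pd.to_datetime(['2014-09-01', '2013-01-15'])})
periods = pd.DataFrame({'Code': ['2014-2015', '2012-2013'],
                        'BeginDate': pd.to_datetime(['2014-08-18', '2012-09-01']),
                        'EndDate': pd.to_datetime(['2015-08-01', '2013-08-16'])})

events['merge'] = 1
periods['merge'] = 1
c = events.merge(periods, on='merge')                                 # cartesian product
c = c[(c['date'] >= c['BeginDate']) & (c['date'] <= c['EndDate'])]    # keep BETWEEN matches
c = c.drop(columns='merge')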

Calculating date_range over GroupBy object in pandas

I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so the max-min is 5 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
ptp
page
%2F 0
/ 13325984000000000
/-509606456 297697000000000
/-511484155 0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps we have just subtracted (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
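Equivalently, a short sketch using the column names from the question (page and date, assumed here) that computes each group's span in a single agg call:
# newest access minus oldest access per page, as a timedelta64
date_span = df.groupby('page')['date'].agg(lambda s: s.max() - s.min())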
