I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so the max minus min is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
                           ptp
page
%2F                          0
/            13325984000000000
/-509606456    297697000000000
/-511484155                  0
/-511616154                  0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
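For example, picking a single row out of the frame above shows what .name holds:
row = df.iloc[0]   # the first row, as a Series
row.name           # Timestamp('2013-08-22 00:00:00'), i.e. that row's index label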
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps, we have simply subtracted one from the other (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
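Putting this together for the question's frame (the 'page' and 'date' column names come from the question; the sample rows are invented for illustration):
df = pd.DataFrame({'page': ['/', '/', '/about'],
                   'date': pd.to_datetime(['2013-01-01', '2013-01-05', '2013-01-02'])})
pagegroup = df.groupby('page')
daterange = pagegroup['date'].apply(lambda x: x.max() - x.min())
# page
# /        4 days
# /about   0 days
# Name: date, dtype: timedelta64[ns]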
Related
I have a pandas dataframe with a column of Timedelta type. I used groupby with a separate month column to group these Timedeltas by month, then tried to use the agg function with min, max, and mean on the Timedelta column, which triggered DataError: No numeric types to aggregate
As a workaround, I tried to use the total_seconds() function together with apply() to get a numeric representation of the column. However, the behaviour seems strange to me: the NaT values in my Timedelta column were turned into -9.223372e+09, yet they produce NaN when total_seconds() is called on a scalar without apply()
A minimal example:
import numpy as np
import pandas as pd

test = pd.Series([np.datetime64('nat'), np.datetime64('nat')])
res = test.apply(pd.Timedelta.total_seconds)
print(res)
which produces:
0 -9.223372e+09
1 -9.223372e+09
dtype: float64
whereas:
res = test.iloc[0].total_seconds()
print(res)
yields:
nan
The behaviour of the second example is the desired one, as I wish to perform aggregations etc. and propagate missing/invalid values. Is this a bug?
You should use the .dt.total_seconds() method, instead of applying the pd.Timedelta.total_seconds function to a datetime64[ns] dtype column:
In [232]: test
Out[232]:
0 NaT
1 NaT
dtype: datetime64[ns] # <----
In [233]: pd.to_timedelta(test)
Out[233]:
0 NaT
1 NaT
dtype: timedelta64[ns] # <----
In [234]: pd.to_timedelta(test).dt.total_seconds()
Out[234]:
0 NaN
1 NaN
dtype: float64
Another demo:
In [228]: s = pd.Series(pd.to_timedelta(['03:33:33','1 day','aaa'], errors='coerce'))
In [229]: s
Out[229]:
0 0 days 03:33:33
1 1 days 00:00:00
2 NaT
dtype: timedelta64[ns]
In [230]: s.dt.total_seconds()
Out[230]:
0 12813.0
1 86400.0
2 NaN
dtype: float64
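As for why -9.223372e+09 appeared in the first place: NaT is stored internally as the minimum int64 value, in nanoseconds, and applying pd.Timedelta.total_seconds element-wise ends up treating that sentinel as a real duration. A sketch of the arithmetic (an illustration of the sentinel, not pandas internals verbatim; assumes numpy imported as np):
In [235]: np.iinfo(np.int64).min        # NaT's sentinel, in nanoseconds
Out[235]: -9223372036854775808
In [236]: np.iinfo(np.int64).min / 1e9  # as seconds: the value from the question
Out[236]: -9223372036.854776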
I have a few columns where some values are datetime objects and some are simply the year. I'd like to slice the datetime values so that, when looping through the column, I = I[:4] gives me the year for variable I instead of raising an error stating 'datetime.datetime' object is not subscriptable. Essentially, in a column that mixes ints which are already just the year with datetime instances, I'd like to drop everything from each datetime except the year.
I am not exactly sure what you are trying to do and unfortunately I cannot post comments (yet).
From what I guess, you have a DataFrame with a column of mixed types, and you want to convert all values with type datetime to int. As an example, here is a DataFrame:
>>> from datetime import datetime
>>> import pandas as pd
>>> data = [[1, 1990], [2, datetime(1991, 1, 1)]]
>>> df = pd.DataFrame(data, columns=['id', 'time'])
>>> df
id time
0 1 1990
1 2 1991-01-01 00:00:00
You can use map to convert the second column:
>>> df.loc[:, 'time'] = df.loc[:, 'time'].map(lambda x: x.year if isinstance(x, datetime) else x)
>>> df
id time
0 1 1990
1 2 1991
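Note that pd.Timestamp is a subclass of datetime.datetime, so the isinstance(x, datetime) check above also catches values that pandas has already converted to Timestamps; a quick check:
>>> isinstance(pd.Timestamp('1991-01-01'), datetime)
True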
I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first 4 letters are always integers, but it fails on my data set as some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?
I think it would be better to just use to_datetime to convert to datetime dtype. You can then drop the invalid rows using dropna, and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
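If the float years with NaN are not what you want, one option on recent pandas (where the nullable integer dtype is available) is to cast with astype('Int64'); a small sketch using the df built above:
In [61]: df['mod_dates'].dt.year.astype('Int64')
Out[61]:
0    1977
1    <NA>
2    <NA>
3    2016
Name: mod_dates, dtype: Int64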
I have a pandas Series which contains datetime.date objects ranging from 1/2013 to 12/2015, each being the month a product was sold. What I would like to do is count and bin by month the number of products sold.
Is there an efficient way of doing this with pandas?
I recommend using datetime64; that is, first apply pd.to_datetime to the dates. If you set the result as the index, then you can use resample:
In [11]: s = pd.date_range('2015-01', '2015-03-05', freq='5D') # DatetimeIndex
In [12]: pd.Series(1, index=s).resample('M').count()
Out[12]:
2015-01-31 7
2015-02-28 5
2015-03-31 1
Freq: M, dtype: int64
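To connect this to the question's setup, here is a sketch starting from plain datetime.date objects (the sample dates are invented for illustration):
In [13]: import datetime
In [14]: sold = pd.Series([datetime.date(2013, 1, 15),
                           datetime.date(2013, 1, 20),
                           datetime.date(2013, 2, 3)])
In [15]: pd.Series(1, index=pd.to_datetime(sold)).resample('M').count()
Out[15]:
2013-01-31    2
2013-02-28    1
Freq: M, dtype: int64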
I want to subtract dates in 'A' from dates in 'B' and add a new column with the difference.
df
A B
one 2014-01-01 2014-02-28
two 2014-02-03 2014-03-01
I've tried the following, but get an error when I try to include this in a for loop...
import datetime
date1=df['A'][0]
date2=df['B'][0]
mdate1 = datetime.datetime.strptime(date1, "%Y-%m-%d").date()
rdate1 = datetime.datetime.strptime(date2, "%Y-%m-%d").date()
delta = (mdate1 - rdate1).days
print(delta)
What should I do?
To remove the 'days' text element, you can also make use of the .dt accessor for Series: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.html
So,
df[['A','B']] = df[['A','B']].apply(pd.to_datetime) #if conversion required
df['C'] = (df['B'] - df['A']).dt.days
which returns:
A B C
one 2014-01-01 2014-02-28 58
two 2014-02-03 2014-03-01 26
Assuming these were datetime columns (if they're not apply to_datetime) you can just subtract them:
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
In [11]: df.dtypes # if already datetime64 you don't need to use to_datetime
Out[11]:
A datetime64[ns]
B datetime64[ns]
dtype: object
In [12]: df['A'] - df['B']
Out[12]:
one -58 days
two -26 days
dtype: timedelta64[ns]
In [13]: df['C'] = df['A'] - df['B']
In [14]: df
Out[14]:
A B C
one 2014-01-01 2014-02-28 -58 days
two 2014-02-03 2014-03-01 -26 days
Note: ensure you're using a recent version of pandas (e.g. 0.13.1); this may not work in older versions.
A list comprehension is your best bet for the most Pythonic (and fastest) way to do this:
[int(i.days) for i in (df.B - df.A)]
i will return the timedelta (e.g. '-58 days').
i.days will return this value as a long integer (e.g. -58L).
int(i.days) will give you the -58 you seek.
If your columns aren't in datetime format, the shorter syntax for conversion would be: df.A = pd.to_datetime(df.A)
How about this (assuming times is a DataFrame with a datetime column 'months', and df has a DatetimeIndex):
times['days_since'] = df.index.values.max()
times['days_since'] = times['days_since'] - times['months']
times
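A self-contained sketch of what the snippet above appears to be doing (the frame and column names here are invented for illustration):
import pandas as pd

# Hypothetical frame: each row has a 'months' date, indexed by datetime
times = pd.DataFrame(
    {'months': pd.to_datetime(['2014-01-01', '2014-02-03'])},
    index=pd.to_datetime(['2014-02-28', '2014-03-01']))

# Days elapsed between each row's 'months' date and the latest index date
times['days_since'] = times.index.max() - times['months']
print(times)
#                months days_since
# 2014-02-28 2014-01-01    59 days
# 2014-03-01 2014-02-03    26 days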