I have a pandas Series which contains datetime.date objects ranging from 1/2013 to 12/2015, each being the month a product was sold. What I would like to do is count the number of products sold, binned by month.
Is there an efficient way of doing this with pandas?
I recommend using datetime64: first apply pd.to_datetime to the dates. If you set the result as the index, then you can use resample:
In [11]: s = pd.date_range('2015-01', '2015-03', freq='5D') # DatetimeIndex
In [12]: pd.Series(1, index=s).resample('M').count()
Out[12]:
2015-01-31 7
2015-02-28 5
2015-03-31 1
Freq: M, dtype: int64
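Applied to the original question, a minimal sketch (assuming sold_dates is the Series of datetime.date objects; the sample values are made up):
import pandas as pd
sold_dates = pd.Series([pd.Timestamp('2013-01-15').date(),
                        pd.Timestamp('2013-01-20').date(),
                        pd.Timestamp('2013-02-03').date()])
# convert to datetime64, use it as the index, then count per calendar month
monthly_counts = pd.Series(1, index=pd.to_datetime(sold_dates)).resample('M').count()
print(monthly_counts)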
A pandas DataFrame (df3) contains two datetime columns, pickup_datetime and dropoff_datetime. How can I calculate the difference between them in seconds and store it in a new column? In other words, how do I get the difference in total seconds?
We can use .dt.total_seconds():
(df.dropoff_datetime-df.pickup_datetime).dt.total_seconds()
Out[514]:
0 1327.0
1 2040.0
2 1680.0
3 1975.0
4 3083.0
dtype: float64
df['diff'] = (df.dropoff_datetime - df.pickup_datetime).dt.total_seconds()
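An equivalent sketch, assuming the same column names: dividing the timedelta by a one-second timedelta64 also yields float seconds.
import numpy as np
df['diff'] = (df.dropoff_datetime - df.pickup_datetime) / np.timedelta64(1, 's')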
I have a pandas DataFrame with a column of Timedelta type. I used groupby with a separate month column to group these Timedeltas by month, and then tried to use the agg function with min, max, and mean on the Timedelta column, which triggered DataError: No numeric types to aggregate.
As a workaround I tried to use the total_seconds() function together with apply() to get a numeric representation of the column. However, the behaviour seems strange to me: the NaT values in my Timedelta column were turned into -9.223372e+09, yet they yield NaN when total_seconds() is called on a scalar without apply().
A minimal example:
import numpy as np
import pandas as pd

test = pd.Series([np.datetime64('nat'), np.datetime64('nat')])
res = test.apply(pd.Timedelta.total_seconds)
print(res)
which produces:
0 -9.223372e+09
1 -9.223372e+09
dtype: float64
whereas:
res = test.iloc[0].total_seconds()
print(res)
yields:
nan
The behaviour of the second example is the desired one, as I wish to perform aggregations etc. and propagate missing/invalid values. Is this a bug?
You should use the .dt.total_seconds() method instead of applying the pd.Timedelta.total_seconds function to a datetime64[ns] dtype column. Note that your test series is datetime64[ns], not timedelta64[ns], so convert it with pd.to_timedelta first:
In [232]: test
Out[232]:
0 NaT
1 NaT
dtype: datetime64[ns] # <----
In [233]: pd.to_timedelta(test)
Out[233]:
0 NaT
1 NaT
dtype: timedelta64[ns] # <----
In [234]: pd.to_timedelta(test).dt.total_seconds()
Out[234]:
0 NaN
1 NaN
dtype: float64
Another demo:
In [228]: s = pd.Series(pd.to_timedelta(['03:33:33','1 day','aaa'], errors='coerce'))
In [229]: s
Out[229]:
0 0 days 03:33:33
1 1 days 00:00:00
2 NaT
dtype: timedelta64[ns]
In [230]: s.dt.total_seconds()
Out[230]:
0 12813.0
1 86400.0
2 NaN
dtype: float64
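Going back to the original groupby aggregation, a small sketch (the month and delta column names are made up) shows that min/max/mean work once the column is converted to float seconds, with NaT becoming NaN (which the aggregations skip by default):
import pandas as pd
df = pd.DataFrame({'month': [1, 1, 2],
                   'delta': pd.to_timedelta(['03:33:33', pd.NaT, '1 day'])})
df['delta_s'] = df['delta'].dt.total_seconds()
print(df.groupby('month')['delta_s'].agg(['min', 'max', 'mean']))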
Given the following example DataFrame:
>>> df
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
3 05/11/2017 08:25:20 4
4 05/11/2017 08:30:14 5
5 05/11/2017 08:30:35 6
I want to subset this DataFrame by the 'Times' column, matching a partial string up to the hour. For example, I want to subset using the partial strings "05/10/2017 01:" and "05/11/2017 08:", which breaks the data up into two new DataFrames:
>>> df1
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
and
>>> df2
0 05/11/2017 08:25:20 4
1 05/11/2017 08:30:14 5
2 05/11/2017 08:30:35 6
Is it possible to do this subsetting iteratively in pandas, for multiple dates/times that similarly share the date and hour as the common identifier?
First, cast your Times column into a datetime format, and set it as the index:
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace = True)
Then use the groupby method with a time-based pd.Grouper (pd.TimeGrouper in older pandas versions):
g = df.groupby(pd.Grouper(freq='h'))
g is an iterable of (timestamp, sub-DataFrame) pairs. If you just want the sub-DataFrames, you can do [frame for _, frame in g].
A caveat: the sub-dfs are indexed by the timestamp, and a frequency-based Grouper without a key argument only works when the times are the index. If you want to keep the timestamp as a column, you could instead do:
df['Times'] = pd.to_datetime(df['Times'])
df['time_hour'] = df['Times'].dt.floor('1h')
g = df.groupby('time_hour')
Alternatively, you could just call .reset_index() on each of the dfs from the former method, but this will probably be much slower.
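Putting this together, a minimal sketch (assuming the Times/Values columns from the question) that keeps the timestamp as a column and collects the non-empty hourly sub-DataFrames into a dict:
import pandas as pd
df = pd.DataFrame({'Times': ['05/10/2017 01:01:03', '05/10/2017 01:05:00', '05/11/2017 08:25:20'],
                   'Values': [1, 2, 4]})
df['Times'] = pd.to_datetime(df['Times'])
# group on the column directly via the key argument; skip the empty hourly bins
frames = {hour: frame for hour, frame in df.groupby(pd.Grouper(key='Times', freq='h'))
          if not frame.empty}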
Convert Times to an hourly period, group by it, and then extract each group as a DataFrame:
df1, df2 = [g.drop(columns='hour') for n, g in
            df.assign(hour=pd.DatetimeIndex(df.Times).to_period('h')).groupby('hour')]
df1
Out[874]:
Times Values
0 2017-05-10 01:01:03 1
1 2017-05-10 01:05:00 2
2 2017-05-10 01:06:10 3
df2
Out[875]:
Times Values
3 2017-05-11 08:25:20 4
4 2017-05-11 08:30:14 5
5 2017-05-11 08:30:35 6
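A hedged generalization of the same idea, assuming the same df: group by an external hourly PeriodIndex (so there is no helper column to drop) and collect one sub-DataFrame per hour into a dict instead of hard-coding df1 and df2.
hour = pd.DatetimeIndex(df.Times).to_period('h')
frames_by_hour = {str(p): g for p, g in df.groupby(hour)}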
First, make sure that the Times column is of datetime type.
Second, set the Times column as the index.
Third, use the between_time method. Note that between_time filters by time of day only (ignoring the date); that is enough here because the two groups fall in different hours.
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace=True)
df1 = df.between_time('1:00:00', '1:59:59')
df2 = df.between_time('8:00:00', '8:59:59')
If you use the datetime type you can extract things like hours and days.
times = pd.to_datetime(df['Times'])
hours = times.dt.hour
df1 = df[hours == 1]
You can use the str[] accessor to truncate the string representation of your date (you might have to cast with astype(str) if your column is a datetime), and then use groupby.groups to access the groups as a dictionary whose keys are your truncated date values:
>>> df.groupby(df.Times.astype(str).str[0:13]).groups
{'2017-05-10 01': DatetimeIndex(['2017-05-10 01:01:03', '2017-05-10 01:05:00',
'2017-05-10 01:06:10'],
dtype='datetime64[ns]', name='time', freq=None),
'2017-05-11 08': DatetimeIndex(['2017-05-11 08:25:20', '2017-05-11 08:30:14',
'2017-05-11 08:30:35'],
dtype='datetime64[ns]', name='time', freq=None)}
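If you want the sub-DataFrames themselves rather than just their index labels, a small follow-up sketch (same df as above) reuses the GroupBy object:
grouped = df.groupby(df.Times.astype(str).str[0:13])
frames = {key: frame for key, frame in grouped}
df1 = grouped.get_group('2017-05-10 01')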
I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first 4 letters are always integers, but it fails on my data set as some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?
I think it would be better to just use to_datetime to convert to datetime dtype; you can drop the invalid rows using dropna and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
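To finish the original task, a short sketch (continuing the df above): drop the rows that failed to parse, then take integer years from the rest.
valid = df.dropna(subset=['mod_dates'])
years = valid['mod_dates'].dt.year.astype(int)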
I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13 and the last on 1-5-13, so max minus min is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
ptp
page
%2F 0
/ 13325984000000000
/-509606456 297697000000000
/-511484155 0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps we have just subtracted (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
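For the original pagegroup case, an equivalent sketch (the page/date column names follow the question; the sample data is made up):
import pandas as pd
df = pd.DataFrame({'page': ['/', '/', '/a'],
                   'date': pd.to_datetime(['2013-01-01', '2013-01-05', '2013-02-01'])})
date_range_per_page = df.groupby('page')['date'].agg(lambda s: s.max() - s.min())
print(date_range_per_page)  # timedelta64[ns] values, e.g. 4 days for '/'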