I have some consumer purchase data that looks like
CustomerID InvoiceDate
13654.0 2011-07-17 13:29:00
14841.0 2010-12-16 10:28:00
19543.0 2011-10-18 16:58:00
12877.0 2011-06-15 13:34:00
15073.0 2011-06-06 12:33:00
I'm interested in the rate at which customers purchase. I'd like to group by each customer and then determine how many purchases were made in each quarter (let's say each quarter is every 3 months starting in January).
I could just define when each quarter starts and ends and make another column. I'm wondering if I could instead use groupby to achieve the same thing.
Presently, this is how I do it:
r = data.groupby('CustomerID')
frames = []
for name, frame in r:
    f = frame.set_index('InvoiceDate').resample('QS').count()
    f['CustomerID'] = name
    frames.append(f)
g = pd.concat(frames)
UPDATE:
In [43]: df.groupby(['CustomerID', pd.Grouper(key='InvoiceDate', freq='QS')]) \
.size() \
.reset_index(name='Count')
Out[43]:
CustomerID InvoiceDate Count
0 12877.0 2011-04-01 1
1 13654.0 2011-07-01 1
2 14841.0 2010-10-01 1
3 15073.0 2011-04-01 1
4 19543.0 2011-10-01 1
Is that what you want?
In [39]: df.groupby(pd.Grouper(key='InvoiceDate', freq='QS')).count()
Out[39]:
CustomerID
InvoiceDate
2010-10-01 1
2011-01-01 0
2011-04-01 2
2011-07-01 1
2011-10-01 1
I think this is the best I will be able to do:
data.groupby('CustomerID').apply(lambda x: x.set_index('InvoiceDate').resample('QS').count())
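One difference worth noting: the per-customer resample fills in zero-count quarters, while the Grouper-plus-size version only returns quarters that actually contain purchases. A rough sketch (building on the Grouper approach above, not a drop-in replacement) that pivots customers against quarter starts and fills missing combinations with 0:
counts = (df.groupby(['CustomerID', pd.Grouper(key='InvoiceDate', freq='QS')])
            .size()
            .unstack(fill_value=0))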
Related
I have a crime dataset where every row is one recorded offence that is to be used in an ARIMA time series model.
Date
0 2015-09-05
1 2015-09-05
2 2015-07-08
3 2017-09-05
4 2018-09-05
5 2018-09-05
I would like to group the data by date, so that offences that occurred on the same day are aggregated.
Date Count
0 2015-09-05 2
1 2015-07-08 1
2 2017-09-05 1
3 2018-09-05 2
I'm struggling because I eventually want to group by week within each year, and because I'm not aggregating the contents of an existing column: I'm trying to count how many rows fall into each group.
Thank you.
If your dataset is a dataframe, you can use:
df.assign(Count=1).groupby('Date')['Count'].count()
If it's a series:
series.to_frame().assign(Count=1).groupby('Date')['Count'].count()
For example:
df = pd.DataFrame({'Date': ['2015-09-05',
                            '2015-09-05',
                            '2015-07-08',
                            '2017-09-05',
                            '2018-09-05',
                            '2018-09-05']})
df.assign(Count=1).groupby('Date')['Count'].count().reset_index()
Returns:
Date Count
0 2015-07-08 1
1 2015-09-05 2
2 2017-09-05 1
3 2018-09-05 2
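For what it's worth, the helper column isn't strictly needed; a roughly equivalent one-liner (assuming the same df as above) is:
df.groupby('Date').size().reset_index(name='Count')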
One way to do it is to use Python rather than pandas for the heavy lifting:
import datetime
from collections import Counter

import pandas as pd

dates = ['2015-09-05', '2015-09-05', '2015-07-08',
         '2017-09-05', '2018-09-05', '2018-09-05']
df = pd.DataFrame([datetime.datetime.strptime(x, "%Y-%m-%d").date() for x in dates],
                  columns=['Date'])

# count occurrences of each date, then build a new frame from the counter
c = Counter(df['Date'])
df2 = pd.DataFrame(list(c.items()), columns=['Date', 'Count'])
print(df2)
Output:
Date Count
0 2015-09-05 2
1 2015-07-08 1
2 2017-09-05 1
3 2018-09-05 2
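For the "group by weeks per year" part of the question, a rough sketch using pd.Grouper (assuming Date is converted to datetime first; the bins carry full dates, so weeks in different years stay separate):
weekly = (df.assign(Date=pd.to_datetime(df['Date']))
            .groupby(pd.Grouper(key='Date', freq='W'))
            .size()
            .reset_index(name='Count'))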
My data looks like this:
print(df)
DateTime, Status
'2021-09-01', 0
'2021-09-05', 1
'2021-09-07', 0
And I need it to look like this:
print(df_desired)
DateTime, Status
'2021-09-01', 0
'2021-09-02', 0
'2021-09-03', 0
'2021-09-04', 0
'2021-09-05', 1
'2021-09-06', 1
'2021-09-07', 0
Right now I accomplish this using pandas like this:
new_index = pd.DataFrame(index = pd.date_range(df.index[0], df.index[-1], freq='D'))
df = new_index.join(df).ffill()
Missing values before the first record in any column are imputed with the inverse of the first record in that column; because the data is binary and only records change points, this is guaranteed to be correct.
To my understanding, my desired dataframe contains "continuous" data, but I'm not sure what to call the structure of my source data.
The problem:
When I apply this to a dataframe with one record per second and try to load a year's worth of data, my memory overflows (92 GB required, ~60 GB available). I'm not sure whether there is a standard procedure for this that I simply don't know the name of and can't find on Google, or whether I'm using the join method wrong, but this seems horribly inefficient: the resulting dataframe is only a few hundred megabytes after the operation. Any feedback on this would be great!
Use DataFrame.asfreq, which works on a DatetimeIndex:
df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()
print (df)
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
You can use this pipeline:
(df.set_index('DateTime')
.reindex(pd.date_range(df['DateTime'].min(), df['DateTime'].max()))
.rename_axis('DateTime')
.ffill(downcast='infer')
.reset_index()
)
output:
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
input:
DateTime Status
0 2021-09-01 0
1 2021-09-05 1
2 2021-09-07 0
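On the memory concern in the question: both approaches above expand the frame to one row per period, so one record per second over a year is roughly 31.5 million rows, which should fit comfortably in memory if the value columns use compact dtypes. A rough sketch under that assumption (the 's' frequency and the int8 cast are illustrative, and DateTime is assumed to already be a datetime column):
df = (df.set_index('DateTime')
        .astype({'Status': 'int8'})     # keep the filled column small
        .asfreq('s', method='ffill')    # one row per second, forward-filled
        .reset_index())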
I'm a new pandas user and I'm trying to do something with my DataFrame.
I have DataFrame watchers with two columns: repo_id and created_at:
In: watchers.head()
Out:
repo_id created_at
0 1 2010-05-12 06:16:00
1 1 2009-02-16 12:51:54
2 2 2011-02-09 03:53:14
3 1 2010-09-01 09:05:21
4 2 2009-03-04 09:44:56
I want to create a new DataFrame grouped by the month of created_at and by repo_id, with the count of rows for each combination. The result should be similar to:
In: watchers_by_month()
Out:
repo_id month count
0 1 2009-02-28 32
1 1 2009-03-31 42
2 2 2009-05-31 3
3 2 2009-06-30 24
4 3 2013-04-30 23
The order doesn't matter; I just need to keep the repo_id for each count.
I tried a few things with my DataFrame, but I don't know how to achieve the above effect.
The only thing I could get is:
In: watchers.index = watchers['created_at']
watchers.groupby(['repo_id', pd.Grouper(freq='M')]).count()
Out:
created_at
repo_id created_at
1 2009-02-28 323
2009-03-31 56
2009-04-30 29
2009-05-31 24
2009-06-30 35
... ... ...
107672 2013-04-30 6
2013-05-31 3
2013-06-30 3
2013-07-31 6
2013-08-31 1
Assuming your watchers['created_at'] is of dtype datetime64[ns], create an additional month column:
watchers['month'] = watchers['created_at'].dt.month
watchers_by_month = (watchers.groupby(by=['repo_id', 'month'])['created_at']
                             .count()
                             .reset_index()
                             .rename(columns={'created_at': 'count'}))
If watchers['created_at'] is not of dtype datetime64[ns], first convert it with pd.to_datetime(), then create the month column and run the code above.
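Note that .dt.month keeps only the month number, so e.g. February 2009 and February 2010 would be counted together. If you want the year-month combinations from the expected output, a small variation (a sketch, assuming created_at is already datetime64[ns]) is:
watchers['month'] = watchers['created_at'].dt.to_period('M')
watchers_by_month = (watchers.groupby(['repo_id', 'month'])
                             .size()
                             .reset_index(name='count'))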
You are very close: take the size of each group and turn the result into a dataframe with reset_index:
(watchers.groupby(['repo_id', pd.Grouper(freq='M')])
         .size()
         .reset_index(name='count')
)
I have data that looks like this
date ticker x y
0 2018-01-31 ABC 1 5
1 2019-01-31 ABC 2 6
2 2018-01-31 XYZ 3 7
3 2019-01-31 XYZ 4 8
So it is a panel of yearly observations. I want to upsample to a monthly frequency and forward fill the new observations. So ABC would look like
date ticker x y
0 2018-01-31 ABC 1 5
1 2018-02-28 ABC 1 5
...
22 2019-11-30 ABC 2 6
23 2019-12-31 ABC 2 6
Notice that I want to fill through the last year, not just up until the last date.
Right now I am doing something like
from pandas.tseries.offsets import YearEnd

newidx = (df.groupby('ticker')['date']
            .apply(lambda x: pd.Series(pd.date_range(x.min(), x.max() + YearEnd(1), freq='M')))
            .reset_index())
newidx.drop('level_1', axis=1, inplace=True)
df = pd.merge(newidx, df, on=['date', 'ticker'], how='left')
This is obviously a terrible way to do this. It's really slow, but it works. What is the proper way to handle this?
Your approach might be slow because you need groupby, then merge. Let's try another option with reindex so you only need groupby:
(df.set_index('date')
.groupby('ticker')
.apply(lambda x: x.reindex(pd.date_range(x.index.min(),x.index.max()+YearEnd(1),freq='M'),
method='ffill'))
.reset_index('ticker', drop=True)
.reset_index()
)
This forloop will take 3 days to complete. How can I increase the speed?
for i in range(df.shape[0]):
    df.loc[df['Creation date'] >= pd.to_datetime(str(df['Original conf GI dte'].iloc[i])),
           'delivered'] += df['Sale order item'].iloc[i]
I think the for loop is clear enough to understand: if Creation date is on or after Original conf GI date, then add the Sale order item value to the delivered column.
Each row's date is "Date Accepted" (Date Delivered is a future date). The inputs are Order Quantity, Date Accepted and Date Delivered; the output is the Delivered column.
Order Quantity Date Accepted Date Delivered Delivered
20 01-05-2010 01-02-2011 0
10 01-11-2010 01-03-2011 0
300 01-12-2010 01-09-2011 0
5 01-03-2011 01-03-2012 30
20 01-04-2012 01-11-2013 335
10 01-07-2013 01-12-2014 335
Convert the values to numpy arrays with Series.to_numpy, compare them with broadcasting, select the matching order values with numpy.where, and finally sum along each row:
import numpy as np

date1 = df['Date Accepted'].to_numpy()
date2 = df['Date Delivered'].to_numpy()
order = df['Order Quantity'].to_numpy()

# older pandas versions:
# date1 = df['Date Accepted'].values
# date2 = df['Date Delivered'].values
# order = df['Order Quantity'].values

df['Delivered1'] = np.where(date1[:, None] >= date2, order, 0).sum(axis=1)
print (df)
Order Quantity Date Accepted Date Delivered Delivered Delivered1
0 20 2010-01-05 2011-01-02 0 0
1 10 2010-01-11 2011-01-03 0 0
2 300 2010-01-12 2011-01-09 0 0
3 5 2011-01-03 2012-01-03 30 30
4 20 2012-01-04 2013-01-11 335 335
5 10 2013-01-07 2014-01-12 335 335
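One caveat with the broadcasted comparison: date1[:, None] >= date2 builds an n x n boolean matrix, so on a very large frame it can itself exhaust memory. A rough sketch of the same logic processed in row chunks (reusing date1, date2 and order from above; the chunk size is arbitrary):
import numpy as np

chunk = 10_000
out = np.empty(len(df), dtype=order.dtype)
for start in range(0, len(df), chunk):
    stop = start + chunk
    # compare only a slice of accepted dates against all delivery dates
    out[start:stop] = np.where(date1[start:stop, None] >= date2, order, 0).sum(axis=1)
df['Delivered1'] = out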
If I understand correctly, you can use np.where() for speed. Currently you are looping on the dataframe rows whereas numpy operations are designed to operate on the entire column:
cond = df['Creation date'].ge(pd.to_datetime(df['Original conf GI dte']))
df['delivered'] = np.where(cond, df['delivered'] + df['Sale order item'], df['delivered'])