I have the following dataframe
import pandas as pd
from pandas import Timestamp
foo = pd.DataFrame.from_dict(data={
    'id': {0: '1', 1: '1', 2: '1', 3: '2', 4: '2'},
    'session': {0: 3, 1: 2, 2: 1, 3: 1, 4: 2},
    'start_time': {0: Timestamp('2021-09-02 19:49:19'),
                   1: Timestamp('2021-09-16 10:54:21'),
                   2: Timestamp('2021-07-12 17:11:54'),
                   3: Timestamp('2021-03-02 01:53:22'),
                   4: Timestamp('2021-01-09 11:38:35')}})
I would like to add a new column to foo, called diff_start_time, which would be the difference between the start_time of the current session and that of the previous one, grouped by id. I would like the difference to be in hours.
How could I do that in Python?
Use DataFrameGroupBy.diff with Series.dt.total_seconds:
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)
id session start_time diff_start_time
0 1 3 2021-09-02 19:49:19 NaN
1 1 2 2021-09-16 10:54:21 327.083889
2 1 1 2021-07-12 17:11:54 -1577.707500
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
If necessary, sort first by id and session:
foo = foo.sort_values(['id','session'])
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)
id session start_time diff_start_time
2 1 1 2021-07-12 17:11:54 NaN
1 1 2 2021-09-16 10:54:21 1577.707500
0 1 3 2021-09-02 19:49:19 -327.083889
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
You can use .groupby() + diff() + dt.total_seconds() to get the total difference in seconds, then divide by 3600 to get the differences in hours.
df_out = foo.sort_values(['id', 'session'])
df_out['diff_start_time'] = df_out.groupby('id')['start_time'].diff().dt.total_seconds() / 3600
Result:
print(df_out)
id session start_time diff_start_time
2 1 1 2021-07-12 17:11:54 NaN
1 1 2 2021-09-16 10:54:21 1577.707500
0 1 3 2021-09-02 19:49:19 -327.083889
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
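As a small variation (not from either answer above), dividing the timedelta result directly by pd.Timedelta(hours=1) gives the same hours without going through seconds:
# Dividing a timedelta Series by a one-hour Timedelta yields float hours.
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff() / pd.Timedelta(hours=1)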
I have a dataframe named grouped_train which lists the number of sales of each item in each store by month. Here's an example of the dataframe.
date_block_num  item_id  shop_id  item_cnt_month
             0       32        0               6
             0       32        5               1
             0       26        1               3
             0       26       18               9
             1       32       46               1
             1       26       50               2
There are 33 date_block_nums, which correspond to different months. I'd like to add two columns: one listing the sum of all sales by item_id in the previous month/date_block_num, and one listing the mean of sales for that particular item_id across all shop_ids in the previous month. Any rows where date_block_num == 0 should be None.
So, using the example df above, the output would look like:
date_block_num  item_id  shop_id  item_cnt_month  item_sales_prev_month  mean_item_sales_prev_month
             0       32        0               6                   None                        None
             0       32        5               1                   None                        None
             0       26        1               3                   None                        None
             0       26       18               9                   None                        None
             1       32       46               1                      7                         3.5
             1       26       50               2                     12                           6
I've written some code for just the item_sales_prev_month column which I believe works, and I could easily change it to create the mean sales column as well, but with over 2.9 million rows in my dataframe, my code takes multiple hours to run. Admittedly, I'm not well versed with pandas; there must be some vectorized approach I'm missing to speed up this computation. Here's the code I have so far.
sales_by_item_by_month = grouped_train.groupby(['date_block_num', 'item_id'], as_index=False).agg({'item_cnt_month': 'sum'})

date_block_nums = list(grouped_train['date_block_num'])
item_ids = list(grouped_train['item_id'])
sales_for_item_prev_month = []

for index in range(len(item_ids)):
    if date_block_nums[index] == 0:
        sales_for_item_prev_month.append(None)
    else:
        sales = sales_by_item_by_month[(sales_by_item_by_month['item_id'] == item_ids[index]) & (sales_by_item_by_month['date_block_num'] == date_block_nums[index] - 1)]
        if len(sales) == 0:
            sales_for_item_prev_month.append(0)
        else:
            sales_for_item_prev_month.append(int(sales['item_cnt_month'].values))

grouped_train['item_sales_prev_month'] = sales_for_item_prev_month
Any advice would be much appreciated!
Assuming date_block_num values are sequential.
Try calculating the sum and mean using groupby agg, then increment date_block_num by 1 to align each group with the following month:
sum_means = df.groupby(['date_block_num', 'item_id']).agg(
item_sales_prev_month=('item_cnt_month', 'sum'),
mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1
sum_means:
date_block_num item_id item_sales_prev_month mean_item_sales_prev_month
0 1 26 12 6.0
1 1 32 7 3.5
2 2 26 2 2.0
3 2 32 1 1.0
Then merge back to the original frame:
df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')
df:
date_block_num item_id shop_id item_cnt_month item_sales_prev_month mean_item_sales_prev_month
0 0 32 0 6 NaN NaN
1 0 32 5 1 NaN NaN
2 0 26 1 3 NaN NaN
3 0 26 18 9 NaN NaN
4 1 32 46 1 7.0 3.5
5 1 26 50 2 12.0 6.0
Complete Code:
import pandas as pd
df = pd.DataFrame({
'date_block_num': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
'item_id': {0: 32, 1: 32, 2: 26, 3: 26, 4: 32, 5: 26},
'shop_id': {0: 0, 1: 5, 2: 1, 3: 18, 4: 46, 5: 50},
'item_cnt_month': {0: 6, 1: 1, 2: 3, 3: 9, 4: 1, 5: 2}
})
sum_means = df.groupby(['date_block_num', 'item_id']).agg(
item_sales_prev_month=('item_cnt_month', 'sum'),
mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1
df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')
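If date_block_num values can have gaps, the += 1 shift above won't line up with the previous block actually present. Here is a sketch (a variation with its own assumptions, not part of the original answer) that starts from the original df, before the merge above, and maps each block to the block that follows it in the data:
# Aggregate per (block, item) as before, but map block numbers instead of adding 1.
sum_means = df.groupby(['date_block_num', 'item_id']).agg(
    item_sales_prev_month=('item_cnt_month', 'sum'),
    mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()

blocks = sorted(df['date_block_num'].unique())
next_block = dict(zip(blocks[:-1], blocks[1:]))   # block -> next block present in the data

sum_means['date_block_num'] = sum_means['date_block_num'].map(next_block)
sum_means = sum_means.dropna(subset=['date_block_num']).astype({'date_block_num': int})

df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')
Note that with gaps this aligns each row to the previous block present, which may not be the previous calendar month; if a missing block really means "no sales that month", you may want zeros there instead.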
Below are my Series and the code to obtain it. How would I obtain the number of days between a 1 and the next 0? For example, the number of days between the first 1 and the next 0 is 4 days (1st August to 5th August), and the number of days between the next 1 and 0 is also 4 days (8th August to 12th August).
values = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
dates =['2019-08-01', '2019-08-02', '2019-08-05', '2019-08-06',
'2019-08-07', '2019-08-08', '2019-08-09', '2019-08-12',
'2019-08-13', '2019-08-14', '2019-08-15', '2019-08-16',
'2019-08-19', '2019-08-20', '2019-08-21', '2019-08-22',
'2019-08-23', '2019-08-26', '2019-08-27', '2019-08-28',
'2019-08-29', '2019-08-30']
pd.Series(values, index = dates)
You can try this using groupby, similar to itertools.groupby. Then extract the first index of every group. Since you have to find the difference between pairs of groups, there has to be the same number of 1-groups and 0-groups; if that's not the case, drop the last group.
import numpy as np

s = pd.Series(values, index=pd.to_datetime(dates))
g = s.ne(s.shift()).cumsum()
vals = s.groupby(g).apply(lambda x:x.index[0])
# vals
1 2019-08-01
2 2019-08-05
3 2019-08-08
4 2019-08-12
5 2019-08-13
6 2019-08-14
7 2019-08-16
8 2019-08-23
9 2019-08-29
dtype: object
Here we don't have the same number of 1-groups and 0-groups, so drop the last group and discard the group labels. Then make chunks of size 2, so that each chunk holds the start indices of a 1-group and the following 0-group.
end = None if not len(vals)%2 else -1
vals = vals.iloc[:end].values.reshape((-1, 2))
# vals
array([['2019-08-01T00:00:00.000000000', '2019-08-05T00:00:00.000000000'],
['2019-08-08T00:00:00.000000000', '2019-08-12T00:00:00.000000000'],
['2019-08-13T00:00:00.000000000', '2019-08-14T00:00:00.000000000'],
['2019-08-16T00:00:00.000000000', '2019-08-23T00:00:00.000000000']],
dtype='datetime64[ns]')
Now, we have to find the difference using np.diff.
days = np.diff(vals, axis=1).squeeze()
out = pd.Series(days)
# out
0 4 days
1 4 days
2 1 days
3 7 days
dtype: timedelta64[ns]
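If plain integer day counts are preferred over timedeltas, .dt.days converts the result:
out.dt.days
# 0    4
# 1    4
# 2    1
# 3    7
# dtype: int64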
I think something like the below should work. First, build a Series with a date index:
ds = pd.Series(values, index = pd.to_datetime(dates))
Then you calculate the difference between consecutive values:
delta = ds - ds.shift(fill_value=ds[0]-1)
It looks like this:
pd.DataFrame({'value':ds,'delta':delta})
value delta
2019-08-01 1 1
2019-08-02 1 0
2019-08-05 0 -1
2019-08-06 0 0
2019-08-07 0 0
2019-08-08 1 1
2019-08-09 1 0
2019-08-12 0 -1
2019-08-13 1 1
2019-08-14 0 -1
So the start dates you need are where delta is 1, and the next zeros you need are where it is -1. So:
starts = ds.index[delta == 1]
ends = ds.index[delta == -1]
(ends - starts[:len(ends)]).days
Int64Index([4, 4, 1, 7], dtype='int64')
Note that there are some cases at the end of the data frame where you have 1s that never flip back to 0, so I ignore those.
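As a follow-up sketch (the function name is made up, not part of the answer), the steps above can be bundled into one helper:
def days_to_next_zero(s):
    # s: a 0/1 Series with a datetime index, like ds above.
    # Returns the days from the start of each run of 1s to the next 0,
    # ignoring a trailing run of 1s that never flips back to 0.
    delta = s - s.shift(fill_value=s.iloc[0] - 1)
    starts = s.index[delta == 1]
    ends = s.index[delta == -1]
    return (ends - starts[:len(ends)]).days

days_to_next_zero(ds)   # Int64Index([4, 4, 1, 7], dtype='int64')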
Start by creating a DataFrame with a date column composed of the dates converted to datetime, and a val column composed of the values:
df = pd.DataFrame({'date': pd.to_datetime(dates), 'val': values})
The idea to get the result is:
1. Get the dates where val == 0 (for other rows take NaT).
2. Perform "backwards filling".
3. Subtract date.
4. From the above result (a timedelta), get the number of days.
5. Fill outstanding NaT values (if any) with 0 (in your case this pertains to the 2 last rows, which are not followed by any "0 row").
6. Save the result in the dist column.
The code to do it is:
df['dist'] = (df.date.where(df.val == 0).bfill(0) - df.date)\
.dt.days.fillna(0, downcast='infer')
The result is:
date val dist
0 2019-08-01 1 4
1 2019-08-02 1 3
2 2019-08-05 0 0
3 2019-08-06 0 0
4 2019-08-07 0 0
5 2019-08-08 1 4
6 2019-08-09 1 3
7 2019-08-12 0 0
8 2019-08-13 1 1
9 2019-08-14 0 0
10 2019-08-15 0 0
11 2019-08-16 1 7
12 2019-08-19 1 4
13 2019-08-20 1 3
14 2019-08-21 1 2
15 2019-08-22 1 1
16 2019-08-23 0 0
17 2019-08-26 0 0
18 2019-08-27 0 0
19 2019-08-28 0 0
20 2019-08-29 1 0
21 2019-08-30 1 0
(dist column is the distance in days).
If you need, take from the above result only rows with val != 0.
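For example, to keep only the rows where val is 1:
# Optional, per the note above: drop the rows where val == 0.
df[df.val != 0]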
Suppose I have several records for each person, each with a certain date. I want to construct a column that indicates, per person, the number of other records that are less than 2 months old. That is, I focus just on the records of, say, individual 'A', and I loop over his/her records to see whether there are other records of individual 'A' that are less than two months old (compared to the current row/record).
Let's see some test data to make it clearer
import pandas as pd
testdf = pd.DataFrame({
'id_indiv': [1, 1, 1, 2, 2, 2],
'id_record': [12, 13, 14, 19, 20, 23],
'date': ['2017-04-28', '2017-04-05', '2017-08-05',
'2016-02-01', '2016-02-05', '2017-10-05'] })
testdf.date = pd.to_datetime(testdf.date)
I'll add the expected column of counts
testdf['expected'] = [1, 0, 0, 0, 1, 0]
#Gives:
date id_indiv id_record expected
0 2017-04-28 1 12 1
1 2017-04-05 1 13 0
2 2017-08-05 1 14 0
3 2016-02-01 2 19 0
4 2016-02-05 2 20 1
5 2017-10-05 2 23 0
My first thought was to group by id_indiv, then use apply or transform with a custom function. To make things easier, I'll first add a variable that subtracts two months from the record date, and then I'll write the count_months custom function for the apply or transform:
import numpy as np

testdf['2M_before'] = testdf['date'] - pd.Timedelta('{0}D'.format(30*2))

def count_months(chunk, month_var='2M_before'):
    counts = np.empty(len(chunk))
    for i, (ind, row) in enumerate(chunk.iterrows()):
        # Count records less than two months old
        # but not newer than the current one
        counts[i] = ((chunk.date > row[month_var])
                     & (chunk.date < row.date)).sum()
    return counts
I tried first with transform:
testdf.groupby('id_indiv').transform(count_months)
but it gives an AttributeError: ("'Series' object has no attribute 'iterrows'", 'occurred at index date') which I guess means that transform passes a Series object to the custom function, but I don't know how to fix that.
Then I tried with apply
testdf.groupby('id_indiv').apply(count_months)
#Gives
id_indiv
1 [1.0, 0.0, 0.0]
2 [0.0, 1.0, 0.0]
dtype: object
This almost works, but it gives the result as a list. To "unstack" that list, I followed an answer to another question:
#First sort, just in case the order gets messed up when pasting back:
testdf = testdf.sort_values(['id_indiv', 'id_record'])
counts = (testdf.groupby('id_indiv').apply(count_months)
.apply(pd.Series).stack()
.reset_index(level=1, drop=True))
#Now create the new column
testdf.set_index('id_indiv', inplace=True)
testdf['mycount'] = counts.astype('int')
assert (testdf.expected == testdf.mycount).all()
#df looks now likes this
date id_record expected 2M_before mycount
id_indiv
1 2017-04-28 12 1 2017-02-27 1
1 2017-04-05 13 0 2017-02-04 0
1 2017-08-05 14 0 2017-06-06 0
2 2016-02-01 19 0 2015-12-03 0
2 2016-02-05 20 1 2015-12-07 1
2 2017-10-05 23 0 2017-08-06 0
This seems to work, but it seems like there should be a much easier way (maybe using transform?). Besides, pasting back the column like that doesn't seem very robust.
Thanks for your time!
Edited to count recent records per person
Here's one way to count all records strictly newer than 2 months for each person, using a lookback window of exactly two calendar months minus 1 day (as opposed to an approximate 2-month window such as 60 days).
# imports and setup
import pandas as pd
testdf = pd.DataFrame({
'id_indiv': [1, 1, 1, 2, 2, 2],
'id_record': [12, 13, 14, 19, 20, 23],
'date': ['2017-04-28', '2017-04-05', '2017-08-05',
'2016-02-01', '2016-02-05', '2017-10-05'] })
# more setup
testdf['date'] = pd.to_datetime(testdf['date'])
testdf.set_index('date', inplace=True)
testdf.sort_index(inplace=True) # required for the index-slicing below
# solution
count_recent_records = lambda x: [x.loc[d - pd.DateOffset(months=2, days=-1):d].count() - 1 for d in x.index]
testdf['mycount'] = testdf.groupby('id_indiv').transform(count_recent_records)
# output
testdf
id_indiv id_record mycount
date
2016-02-01 2 19 0
2016-02-05 2 20 1
2017-04-05 1 13 0
2017-04-28 1 12 1
2017-08-05 1 14 0
2017-10-05 2 23 0
testdf = testdf.sort_values('date')
out_df = pd.DataFrame()

for i in testdf.id_indiv.unique():
    for d in testdf.date:
        date_diff = (d - testdf.loc[testdf.id_indiv == i, 'date']).dt.days
        out_dict = {'person': i,
                    'entry_date': d,
                    'count': sum((date_diff > 0) & (date_diff <= 60))}
        out_df = out_df.append(out_dict, ignore_index=True)

out_df
count entry_date person
0 0.0 2016-02-01 2.0
1 1.0 2016-02-05 2.0
2 0.0 2017-04-05 2.0
3 0.0 2017-04-28 2.0
4 0.0 2017-08-05 2.0
5 0.0 2017-10-05 2.0
6 0.0 2016-02-01 1.0
7 0.0 2016-02-05 1.0
8 0.0 2017-04-05 1.0
9 1.0 2017-04-28 1.0
10 0.0 2017-08-05 1.0
11 0.0 2017-10-05 1.0
I have this data frame:
import numpy as np

dict_data = {'id': [1, 1, 1, 2, 2, 2, 2, 2],
             'datetime': np.array(['2016-01-03T16:05:52.000000000', '2016-01-03T16:05:52.000000000',
                                   '2016-01-03T16:05:52.000000000', '2016-01-27T15:45:20.000000000',
                                   '2016-01-27T15:45:20.000000000', '2016-11-27T15:08:04.000000000',
                                   '2016-11-27T15:08:04.000000000', '2016-11-27T15:08:04.000000000'],
                                  dtype='datetime64[ns]')}
df_data = pd.DataFrame(dict_data)
I want to rank over customer id and date. I used this code:
(df_data.assign(rn=df_data.sort_values(['datetime'], ascending=True)
                        .groupby(['datetime', 'id'])
                        .cumcount() + 1)
        .sort_values(['datetime', 'rn'])
)
I get a different rank by ID for each date.
What I would like to see is a rank by ID, where rows with the same datetime within an ID get the same rank.
Here is how you can rank by datetime and id:
##### RANK BY datetime and id #####
In[]: df_data.rank(axis=0, ascending=1, method='dense')
Out[]:
datetime id
0 1 1
1 1 1
2 1 1
3 2 2
4 2 2
5 3 2
6 3 2
7 3 2
##### GROUPBY id AND USE APPLY TO GET VALUES FOR EACH GROUP #####
In[]: df_data.rank(axis=0, ascending=1, method='dense').groupby('id').apply(lambda x: x)
Out[]:
datetime id
0 1 1
1 1 1
2 1 1
3 2 2
4 2 2
5 3 2
6 3 2
7 3 2
##### THEN RANK INSIDE EACH GROUP #####
In[]: df_data.assign(rank=df_data.rank(axis=0, ascending=1, method='dense').groupby('id').apply(lambda x: x.rank(axis=0, ascending=1, method='dense'))['datetime'])
Out[]:
datetime id rank
0 2016-01-03 16:05:52 1 1
1 2016-01-03 16:05:52 1 1
2 2016-01-03 16:05:52 1 1
3 2016-01-27 15:45:20 2 1
4 2016-01-27 15:45:20 2 1
5 2016-11-27 15:08:04 2 2
6 2016-11-27 15:08:04 2 2
7 2016-11-27 15:08:04 2 2
If you want to change the ranking method, you'll find more details in the pandas documentation on rank.
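For what the question asks (identical datetimes within an id sharing one rank), a more direct sketch, not taken from the answer above, is a dense rank of datetime within each id group:
# Sketch: dense-rank datetime within each id, so equal timestamps share a rank.
df_data['rank'] = (df_data.groupby('id')['datetime']
                          .rank(method='dense')
                          .astype(int))
For the sample data this gives rank 1 for every row of id 1, and ranks 1 and 2 for the two distinct timestamps of id 2, matching the desired output.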
I have about 50 DataFrames in a list that have a form like this, where the particular dates included in each DataFrame are not necessarily the same.
>>> print(df1)
Unnamed: 0 df1_name
0 2004/04/27 2.2700
1 2004/04/28 2.2800
2 2004/04/29 2.2800
3 2004/04/30 2.2800
4 2004/05/04 2.2900
5 2004/05/05 2.3000
6 2004/05/06 2.3200
7 2004/05/07 2.3500
8 2004/05/10 2.3200
9 2004/05/11 2.3400
10 2004/05/12 2.3700
Now, I want to merge these 50 DataFrames together on the date column (unnamed first column in each DataFrame), and include all dates that are present in any of the DataFrames. Should a DataFrame not have a value for that date, it can just be NaN.
So a minimal example:
>>> print(sample1)
Unnamed: 0 sample_1
0 2004/04/27 1
1 2004/04/28 2
2 2004/04/29 3
3 2004/04/30 4
>>> print(sample2)
Unnamed: 0 sample_2
0 2004/04/28 5
1 2004/04/29 6
2 2004/05/01 7
3 2004/05/03 8
Then after the merge
>>> print(merged_df)
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1 NaN
1 2004/04/28 2 5
2 2004/04/29 3 6
3 2004/04/30 4 NaN
....
Is there an easy way to make use of the merge or join functions of Pandas to accomplish this? I have gotten awfully stuck trying to determine how to combine the dates like this.
All you need to do is pd.concat all your sample dataframes. But you have to set up a couple of things: set the index of each one to the column you want to merge on, and make sure that column is a date column. Below is an example of how to do it.
One liner
pd.concat([s.set_index('Unnamed: 0') for s in [sample1, sample2]], axis=1).rename_axis('Unnamed: 0').reset_index()
Unnamed: 0 sample_1 sample_2
0 2004/04/27 1.0 NaN
1 2004/04/28 2.0 5.0
2 2004/04/29 3.0 6.0
3 2004/04/30 4.0 NaN
4 2004/05/01 NaN 7.0
5 2004/05/03 NaN 8.0
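Applied to the real data, the same pattern works on the whole list of ~50 frames; the list name here is made up:
# list_of_dfs is assumed to hold the ~50 DataFrames, each with its date in 'Unnamed: 0'.
merged_df = (pd.concat([s.set_index('Unnamed: 0') for s in list_of_dfs], axis=1)
               .rename_axis('Unnamed: 0')
               .reset_index())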
I think this is more understandable
sample1 = pd.DataFrame([
    ['2004/04/27', 1],
    ['2004/04/28', 2],
    ['2004/04/29', 3],
    ['2004/04/30', 4],
], columns=['Unnamed: 0', 'sample_1'])

sample2 = pd.DataFrame([
    ['2004/04/28', 5],
    ['2004/04/29', 6],
    ['2004/05/01', 7],
    ['2004/05/03', 8],
], columns=['Unnamed: 0', 'sample_2'])

list_of_samples = [sample1, sample2]

for i, sample in enumerate(list_of_samples):
    s = list_of_samples[i].copy()
    cols = s.columns.tolist()
    cols[0] = 'Date'
    s.columns = cols
    s.Date = pd.to_datetime(s.Date)
    s.set_index('Date', inplace=True)
    list_of_samples[i] = s

pd.concat(list_of_samples, axis=1)
sample_1 sample_2
Date
2004-04-27 1.0 NaN
2004-04-28 2.0 5.0
2004-04-29 3.0 6.0
2004-04-30 4.0 NaN
2004-05-01 NaN 7.0
2004-05-03 NaN 8.0