I have a dataframe that looks like this:
Part  Date
1     9/1/2021
1     9/8/2021
1     9/15/2021
2     9/1/2020
2     9/1/2021
2     9/1/2022
The dataframe is already sorted by part, then by date.
I am trying to find the average number of days between consecutive dates, grouped by the Part column.
So the desired output would be:
Part  Avg Days
1     7
2     365
How would you go about processing this data to achieve the desired output?
You can group by "Part", use apply + diff on the "Date" column to get the time delta between consecutive rows, and take the mean:
(df.groupby('Part')['Date']
.apply(lambda s: s.diff().mean())
.to_frame()
.reset_index()
)
Output:
Part Date
1 7 days
2 365 days
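If you would rather have a numeric column than timedeltas, here is a minimal sketch (the Avg Days column name is just for illustration):
(df.groupby('Part')['Date']
   .apply(lambda s: s.diff().mean())
   .dt.days                        # timedelta -> whole days
   .reset_index(name='Avg Days')
)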
I am trying to calculate the time difference (in days) between a customer's previous visit out time and the same customer's latest visit in time.
time difference = latest in time - previous out time
Here is a sample of the input data and of the desired output table.
The approach I have tried so far is to group by customer ID and rank:
temp['RANK'] = temp.groupby('customer ID')['in time'].rank(ascending=True)
but I am unsure how to calculate the difference.
You can use GroupBy.shift() to get the previous out time within each group and subtract it from the current in time. Then use dt.days to get the number of days from the resulting timedelta, as follows:
# convert date strings to datetime format
df['out time'] = pd.to_datetime(df['out time'], dayfirst=True)
df['in time'] = pd.to_datetime(df['in time'], dayfirst=True)
df['Visit diff (in days)'] = (df['in time'] - df['out time'].groupby(df['customer ID']).shift()).dt.days
Data Input:
print(df)
customer ID out time in time
0 1 05-12-1999 15:20:07 05-12-1999 14:23:31
1 1 21-12-1999 09:59:34 21-12-1999 09:41:09
2 2 05-12-1999 11:53:34 05-12-1999 11:05:37
3 2 08-12-1999 19:55:00 08-12-1999 19:40:10
4 3 01-12-1999 15:15:26 01-12-1999 13:08:11
5 3 16-12-1999 17:10:09 16-12-1999 16:34:10
Result:
print(df)
customer ID out time in time Visit diff (in days)
0 1 1999-12-05 15:20:07 1999-12-05 14:23:31 NaN
1 1 1999-12-21 09:59:34 1999-12-21 09:41:09 15.0
2 2 1999-12-05 11:53:34 1999-12-05 11:05:37 NaN
3 2 1999-12-08 19:55:00 1999-12-08 19:40:10 3.0
4 3 1999-12-01 15:15:26 1999-12-01 13:08:11 NaN
5 3 1999-12-16 17:10:09 1999-12-16 16:34:10 15.0
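Note that dt.days truncates to whole days. If you want fractional days instead, one possible alternative (the new column name here is just illustrative) is to convert the timedelta to seconds:
delta = df['in time'] - df['out time'].groupby(df['customer ID']).shift()
df['Visit diff (fractional days)'] = delta.dt.total_seconds() / 86400  # 86400 seconds per day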
You may also try the following, which returns a single value per customer (the latest in time minus the earliest out time):
temp.groupby('customer ID').apply(lambda x: (x['in time'].max() - x['out time'].min()).days)
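With the sample data above, this yields:
customer ID
1    15
2     3
3    15
dtype: int64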
I have data with 3 columns: date, id, sales.
My first task was filtering for sales above 100, which I have done.
The second task is grouping ids by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/DataFrame, but right now I can't imagine from which side to attack this problem.
For what it's worth, I tried the suggested solution from count consecutive days python dataframe, but the ids were not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of those day counts in the "count" column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc. But that is not part of my question.
Thank you very much.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy format instead of the default mm/dd/yyyy, you have to specify dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 is parsed as 2018-02-01 instead of 2018-01-02 as expected, and the day diff between adjacent entries comes out around 30 rather than 1.
We added a sort step on columns id and date to simplify the later grouping when creating the series s.
In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we also do an extra .reset_index(name='count') to turn the Pandas series back into a dataframe and to name the new column count.
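For reference, here is the intermediate grouping series s computed from the sample data above; a day gap of exactly 1 extends a run, while anything else starts a new group:
print(s)
0    1
2    1
3    1
1    2
4    3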
I have a dataframe which looks like this
In []: df.head()
Out []:
DATE NAME AMOUNT CURRENCY
2018-07-27 John 100 USD
2018-06-25 Jane 150 GBP
...
The contents under the DATE column are of date type.
I want to aggregate the data so that I can see, for each day of the month, the count of transactions that happened on that day.
I also want to group it by year as well as by day.
The end result I want would look something like this:
YEAR DAY COUNT
2018 1 0
2 1
3 0
4 0
5 3
6 4
and so on
I used the following code, but the numbers are all wrong. Please help.
In []: df = pd.DataFrame({'DATE':pd.date_range(start=dt.datetime(2018,7,27),end=dt.datetime(2020,7,21))})
df.groupby([df['DATE'].dt.year, df['DATE'].dt.day]).agg({'count'})
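One possible way to get the desired shape (a sketch, not from the original thread, assuming DATE is already a datetime column): count the rows per (year, day) with size(), then reindex against every day 1-31 so that days with no transactions show a count of 0.
counts = (df.groupby([df['DATE'].dt.year.rename('YEAR'),
                      df['DATE'].dt.day.rename('DAY')])
            .size())
# build the full (year, day) grid so missing days appear with count 0
full = pd.MultiIndex.from_product(
    [counts.index.get_level_values('YEAR').unique(), range(1, 32)],
    names=['YEAR', 'DAY'])
result = counts.reindex(full, fill_value=0).reset_index(name='COUNT')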
How can I calculate, using pandas, the number of weeks between two dates, such as 2019-12-15 and 2019-12-28?
Data:
cw = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
So I could do something like:
cw["weeks_between"] = (cw["lead_date"] - cw["Received_date"]) / 7
The problem is:
For row 1: it will return 1.85, which is the wrong value, because one date falls at the beginning of its week and the other at the end.
For row 2: it will return 0.28, also wrong, because one date falls at the end of its week and the other at the beginning.
So how can I get the number of weeks between these two dates?
Method 1: Using list comprehension, dt.to_period & getattr
(provided by Jon Clements in the comments)
This method works even when the year changes between the compared dates. Subtracting two weekly periods yields an offset whose n attribute is the number of weeks between them, and getattr(el, 'n', 0) falls back to 0 when the difference is not a valid offset:
cw['weeks_diff'] = (
[getattr(el, 'n', 0)
for el in cw['lead_date'].dt.to_period('W') - cw['Received_date'].dt.to_period('W')]
)
Method 2: Using week numbers with dt.strftime('%W')
We can use pd.to_datetime to convert your dates to datetime. Then we use the dt.strftime accessor to get the week numbers with %W.
Finally we subtract the two week numbers:
weeks = (cw[['lead_date', 'Received_date']]
.apply(lambda x: pd.to_datetime(x).dt.strftime('%W'))
.replace('NaT', 0)
.astype(int)
)
cw['weeks_diff'] = weeks['lead_date'] - weeks['Received_date']
lead_date Received_date weeks_diff
0 2019-12-28 2019-12-15 2
1 2019-12-23 2019-12-21 1
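One caveat worth noting: %W restarts at 0 every January, so this method can give wrong results when the two dates straddle a year boundary, e.g.:
pd.Timestamp('2019-12-28').strftime('%W')   # '51'
pd.Timestamp('2020-01-02').strftime('%W')   # '00'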
You first need to convert the columns to datetime using pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({ "lead_date" : ["2019-12-28" , "2019-12-23"] ,
"Received_date" : ["2019-12-15" , "2019-12-21" ] })
df['lead_date']=pd.to_datetime(df['lead_date'])
df['Received_date']=pd.to_datetime(df['Received_date'])
Here is the difference in days between "lead_date" and "Received_date":
df['time_between'] =df['lead_date']-df['Received_date']
print(df.head())
lead_date Received_date time_between
0 2019-12-28 2019-12-15 13 days
1 2019-12-23 2019-12-21 2 days
Update: see the edits below to get the number of weeks. I have also added the pandas and numpy imports.
To get the 'time_between' column in weeks:
df['time_between']= df['time_between']/np.timedelta64(1,'W')
will yield
lead_date Received_date time_between
0 2019-12-28 2019-12-15 1.857143
1 2019-12-23 2019-12-21 0.285714
Update 2: If you want week-number subtraction rather than days between, then use:
df['lead_date']=pd.to_datetime(df['lead_date']).dt.week
df['Received_date']=pd.to_datetime(df['Received_date']).dt.week
df['time_between'] =df['lead_date']-df['Received_date']
yields,
lead_date Received_date time_between
0 52 50 2
1 52 51 1
.dt.week returns the week number within the year.
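Note that Series.dt.week was deprecated in pandas 1.1 and removed in pandas 2.0; the usual replacement is the isocalendar() accessor:
df['lead_date'] = pd.to_datetime(df['lead_date']).dt.isocalendar().week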
I have a data set of customers with their policies, and I am trying to find the number of months each customer has been with us (their tenure).
df
cust_no poly_no start_date end_date
1 1 2016-06-01 2016-08-31
1 2 2017-05-01 2018-05-31
1 3 2016-11-01 2018-05-31
The output should look like:
cust_no no_of_months
1 22
So basically, it should skip the months where there is no policy and count any overlapping period once, not twice. I have to do this for every customer, so I group by cust_no. How can I do this?
Thanks.
One way to do this is to create a date range for each record, then use stack to get all the months. Next, take only the unique values so that each month is counted once:
# one month-end timestamp per month covered by each policy
s = df.apply(lambda x: pd.Series(pd.date_range(x.start_date, x.end_date, freq='M').values), axis=1)
# flatten and de-duplicate so overlapping months are counted once
ss = s.stack().unique()
ss.shape[0]
Output:
22
For multiple customers you can use groupby. Continuing with #ScottBoston's answer:
df_range = df.apply(lambda r: pd.Series(
    pd.date_range(start=r.start_date, end=r.end_date, freq='M')
    .values), axis=1)
# df_range shares df's index but has no cust_no column,
# so group by the original column instead
df_range.groupby(df['cust_no']).apply(lambda x: x.stack().unique().shape[0])
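With the sample data above, this returns one tenure per customer:
cust_no
1    22
dtype: int64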