I have a data set of customers with their policies, and I am trying to find the number of months each customer has been with us (tenure).
df
cust_no poly_no start_date end_date
1 1 2016-06-01 2016-08-31
1 2 2017-05-01 2018-05-31
1 3 2016-11-01 2018-05-31
output should look like,
cust_no no_of_months
1 22
So basically, it should skip the months where there is no policy and count overlapping periods once, not twice. I have to do this for every customer, so it needs a group by cust_no. How can I do this?
Thanks.
One way to do this is to create a date range for each record, then use stack to collect all the months. Next, take only the unique values so that each month is counted once:
s = df.apply(lambda x: pd.Series(pd.date_range(x.start_date, x.end_date, freq='M').values), axis=1)
ss = s.stack().unique()
ss.shape[0]
Output:
22
For multiple customers you can use groupby. Continuing with @ScottBoston's answer:
df_range = df.apply(lambda r: pd.Series(
    pd.date_range(start=r.start_date, end=r.end_date, freq='M')
    .values), axis=1)
# df_range has no cust_no column, so group by the original column (aligned on the index)
df_range.groupby(df['cust_no']).apply(lambda x: x.stack().unique().shape[0])
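Note that date_range with freq='M' only yields the month-end dates that fall inside each range, so a policy that ends mid-month would lose its final month. A minimal sketch of an alternative, assuming a month counts as soon as any policy touches it, uses pd.period_range and counts the distinct monthly periods per customer (the names rows, long and tenure are illustrative, not from the answers above):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'cust_no': [1, 1, 1],
    'poly_no': [1, 2, 3],
    'start_date': pd.to_datetime(['2016-06-01', '2017-05-01', '2016-11-01']),
    'end_date': pd.to_datetime(['2016-08-31', '2018-05-31', '2018-05-31']),
})

# One (customer, month) pair for every month covered by every policy
rows = [
    (cust, month)
    for cust, start, end in zip(df.cust_no, df.start_date, df.end_date)
    for month in pd.period_range(start, end, freq='M')
]
long = pd.DataFrame(rows, columns=['cust_no', 'month'])

# Overlapping months collapse because only distinct periods are counted
tenure = (long.groupby('cust_no')['month']
              .nunique()
              .rename('no_of_months')
              .reset_index())
print(tenure)
```

For the sample data this gives 22 months for customer 1, matching the expected output in the question.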
I have looked for solutions but found none that point me in the right direction; hopefully someone here can help. I have a stock price data set with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month form the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately, I have no code to show: I have looked at for loops, groupby, etc. but can't seem to figure this one out.
You might want to split the date into month and year and to apply a pivot:
s = pd.to_datetime(df.index)
out = (df
.assign(year=s.year, month=s.month)
.pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1,2,3,4]},
index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999
I have a dataframe that looks like this:
Part Date
1 9/1/2021
1 9/8/2021
1 9/15/2021
2 9/1/2020
2 9/1/2021
2 9/1/2022
The dataframe is already sorted by part, then by date.
I am trying to find the average days between each date grouped by the Part column.
So the desired output would be:
Part Avg Days
1 7
2 365
How would you go about processing this data to achieve the desired output?
You can groupby "Part", use apply + diff to get the time delta between consecutive rows, and take the mean (this assumes Date is already a datetime column):
(df.groupby('Part')['Date']
.apply(lambda s: s.diff().mean())
.to_frame()
.reset_index()
)
output:
Part Date
1 7 days
2 365 days
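The result above is a timedelta column, while the desired output shows plain numbers. A minimal sketch, assuming the dates are month/day/year strings as in the question, converts the gaps to a number of days before averaging (the name avg_days is illustrative):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Part': [1, 1, 1, 2, 2, 2],
    'Date': ['9/1/2021', '9/8/2021', '9/15/2021',
             '9/1/2020', '9/1/2021', '9/1/2022'],
})
# Parse the strings so that diff() yields timedeltas
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Gap between consecutive dates within each Part, in days, then the mean
avg_days = (df.groupby('Part')['Date']
              .apply(lambda s: s.diff().dt.days.mean())
              .reset_index(name='Avg Days'))
print(avg_days)
```

The first row of each group has no previous date, so its gap is NaN and is ignored by mean().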
I have data with 3 columns: date, id, sales.
My first task is filtering sales above 100, which I have done.
The second task is grouping id by consecutive days.
index date id sales
0 01/01/2018 03 101
1 01/01/2018 07 178
2 02/01/2018 03 120
3 03/01/2018 03 150
4 05/01/2018 07 205
the result should be:
index id count
0 03 3
1 07 1
2 07 1
I need to do this task without using pandas/dataframe, but right now I can't see how to approach this problem.
As an attempt, I tried the solution suggested in "count consecutive days python dataframe",
but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of the values in the "count" column, e.g. the number of ids in the range of 0-7 days, 7-12 days, etc. That is not part of my question, though.
Thanks a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100].copy()  # copy to avoid SettingWithCopyWarning on the next line
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we add an extra .reset_index(name='count') to turn the Pandas series back into a dataframe and name the new column count.
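Since the question also asks for a way to do this without pandas, here is a minimal pure-Python sketch of the same idea, assuming the rows are available as (date, id, sales) tuples with dd/mm/yyyy dates (the names rows, parsed and runs are illustrative):

```python
from datetime import datetime, timedelta

# Sample data from the question: (date, id, sales)
rows = [
    ('01/01/2018', '03', 101),
    ('01/01/2018', '07', 178),
    ('02/01/2018', '03', 120),
    ('03/01/2018', '03', 150),
    ('05/01/2018', '07', 205),
]

# Keep sales of at least 100, parse the dates, and sort by (id, date)
parsed = sorted(
    (id_, datetime.strptime(date, '%d/%m/%Y').date())
    for date, id_, sales in rows
    if sales >= 100
)

runs = []    # each entry is [id, count] for one run of consecutive days
prev = None  # (id, date) of the previous row
for id_, day in parsed:
    same_run = (prev is not None
                and prev[0] == id_
                and day - prev[1] == timedelta(days=1))
    if same_run:
        runs[-1][1] += 1       # extend the current run
    else:
        runs.append([id_, 1])  # start a new run
    prev = (id_, day)

for index, (id_, count) in enumerate(runs):
    print(index, id_, count)
```

As in the pandas version, a new run starts whenever the id changes or the gap to the previous date is anything other than exactly one day.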
I found many questions similar to mine, but none of them answer it exactly (this one comes closest, but it focusses on ruby).
I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2014-10-03', '2015-10-02', freq='1D'), 'Variable': np.random.randn(365)})
df.head()
Out[272]:
Date Variable
0 2014-10-03 0.637167
1 2014-10-04 0.562135
2 2014-10-05 -1.069769
3 2014-10-06 0.556997
4 2014-10-07 0.253468
I want to sort the data from January 1st to December 31st, ignoring the year component of the Date column. The background is that I want to track changes in Variable over the year, but my period starts and ends in October.
I thought of creating separate columns for month and day and then sorting by those. But I am unsure how to do this in a "correct" and concise way.
Expected output:
Date Variable
0 01-01 0.637167 # (Placeholder-values)
1 01-02 0.562135
2 01-03 -1.069769
3 01-04 0.556997
4 01-05 0.253468
One way, using argsort:
yourdf=df.loc[df.Date.dt.strftime('%m%d').astype(int).argsort()]
You can create the day and month columns by simply doing the following:
df = pd.DataFrame(data=pd.date_range('2014-10-03', '2015-10-02', freq='1D'), columns=['date'])
# the .dt accessor extracts the components without a Python-level lambda
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
You could make it even more compact, but for a quick analysis the above is enough.
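To go from those helper columns to the ordering asked for in the question, a minimal sketch could sort on month and day and then format the date without the year. The frame below reuses the question's column names; the name out is illustrative:

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the question
df = pd.DataFrame({
    'Date': pd.date_range('2014-10-03', '2015-10-02', freq='D'),
    'Variable': np.random.randn(365),
})

# Sort on (month, day) so the year component is ignored, then drop the helpers
out = (df.assign(month=df['Date'].dt.month, day=df['Date'].dt.day)
         .sort_values(['month', 'day'])
         .drop(columns=['month', 'day'])
         .reset_index(drop=True))

# Show the date as MM-DD, as in the expected output
out['Date'] = out['Date'].dt.strftime('%m-%d')
print(out.head())
```

Formatting the date as a string is done last, so the sort itself still runs on integer month and day values.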
I have a Pandas dataframe which looks like this:
The customer number is unique to each customer, but repeats itself if the customer visits again.
I want to groupby customer number. Then in each groupby object, I want to find out the duration between visits.
So, I do it like this..
df['Date'] = pd.to_datetime(df['Date'], format='%d %b %y')
grouped = df.groupby('Customer no')
My question is:
how do I iterate over the grouped rows and find the time (in days) between subsequent visits?
I think you need groupby with diff:
print (df.groupby('Customer no')['Date'].diff())
13 NaT
22 0 days
26 0 days
Name: Date, dtype: timedelta64[ns]
# if you need to convert the timedeltas to a numeric number of days
print (df.groupby('Customer no')['Date'].diff() / np.timedelta64(1, 'D'))
13 NaN
22 0.0
26 0.0
Name: Date, dtype: float64