Python select data for top 3 values per group in dataframe

From the given dataframe sorted by ID and Date:
ID Date Value
1 12/10/1998 0
1 04/21/2002 21030
1 08/16/2013 56792
1 09/18/2014 56792
1 09/14/2016 66354
2 06/16/2015 46645
2 12/08/2015 47641
2 12/11/2015 47641
2 04/13/2017 47641
3 07/29/2009 28616
3 03/31/2011 42127
3 03/17/2013 56000
I would like to get the values for the top 3 Dates, grouped by ID:
56792
56792
66354
47641
47641
47641
28616
42127
56000
I need the values only.

You could sort_values both by ID and Date, and use GroupBy.tail to take the values for the top 3 dates:
df.Date = pd.to_datetime(df.Date)
df.sort_values(['ID','Date']).groupby('ID').Value.tail(3).to_numpy()
# array([56792, 56792, 66354, 47641, 47641, 47641, 28616, 42127, 56000])
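As a runnable sketch, rebuilding the sample frame above (the mm/dd/yyyy date format is assumed from the sample):

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    'Date':  ['12/10/1998', '04/21/2002', '08/16/2013', '09/18/2014',
              '09/14/2016', '06/16/2015', '12/08/2015', '12/11/2015',
              '04/13/2017', '07/29/2009', '03/31/2011', '03/17/2013'],
    'Value': [0, 21030, 56792, 56792, 66354, 46645, 47641, 47641,
              47641, 28616, 42127, 56000],
})

# Parse dates, sort within each ID, and keep the last 3 rows per group.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
out = df.sort_values(['ID', 'Date']).groupby('ID')['Value'].tail(3).to_numpy()
print(out)  # [56792 56792 66354 47641 47641 47641 28616 42127 56000]
```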


counting number of dates falling between a date range from a different dataframe [duplicate]

This question already has answers here:
How to join two dataframes for which column values are within a certain range?
Existing dataframe :
df_1
Id dates time(sec)_1 time(sec)_2
1 02/02/2022 15 20
1 04/02/2022 20 30
1 03/02/2022 30 40
1 06/02/2022 50 40
2 10/02/2022 10 10
2 11/02/2022 15 20
df_2
Id min_date action_date
1 02/02/2022 04/02/2022
2 06/02/2022 10/02/2022
Expected Dataframe :
df_2
Id min_date action_date count_of_dates avg_time_1 avg_time_2
1 02/02/2022 04/02/2022 3 21.67 30
2 06/02/2022 10/02/2022 1 10 10
count_of_dates, avg_time_1 and avg_time_2 are to be created from df_1.
count_of_dates is the number of rows in df_1 whose date falls between min_date and action_date for that Id.
avg_time_1 and avg_time_2 are the averages of time(sec)_1 and time(sec)_2 over those same rows.
I'm stuck on applying the condition for the dates :-( any leads?
If the data is small, it is possible to filter per row with a custom function:
df_1['dates'] = pd.to_datetime(df_1['dates'])
df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime)

def f(x):
    # rows of df_1 for this Id whose date falls inside the window
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()
    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())

df = df_2.join(df_2.apply(f, axis=1))
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3.0 21.666667 30.0
1 2 2022-06-02 2022-10-02 1.0 10.000000 10.0
If Id in df_2 is unique, it is possible to improve performance by merging df_1 and aggregating size and mean:
df = df_2.merge(df_1, on='Id')
d = {'count_of_dates':('Id','size'),
     'time(sec)_1':('time(sec)_1','mean'),
     'time(sec)_2':('time(sec)_2','mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
                 .groupby('Id').agg(**d), on='Id')
print (df)
Id min_date action_date count_of_dates time(sec)_1 time(sec)_2
0 1 2022-02-02 2022-04-02 3 21.666667 30
1 2 2022-06-02 2022-10-02 1 10.000000 10
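To make the merge version concrete, here is a self-contained sketch with the sample data (dates parsed month-first, matching the output above):

```python
import pandas as pd

df_1 = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2],
    'dates': ['02/02/2022', '04/02/2022', '03/02/2022',
              '06/02/2022', '10/02/2022', '11/02/2022'],
    'time(sec)_1': [15, 20, 30, 50, 10, 15],
    'time(sec)_2': [20, 30, 40, 40, 10, 20],
})
df_2 = pd.DataFrame({
    'Id': [1, 2],
    'min_date': ['02/02/2022', '06/02/2022'],
    'action_date': ['04/02/2022', '10/02/2022'],
})

df_1['dates'] = pd.to_datetime(df_1['dates'])
df_2[['min_date', 'action_date']] = df_2[['min_date', 'action_date']].apply(pd.to_datetime)

# Merge each window onto its rows, keep rows inside the window, then aggregate.
m = df_2.merge(df_1, on='Id')
d = {'count_of_dates': ('Id', 'size'),
     'time(sec)_1': ('time(sec)_1', 'mean'),
     'time(sec)_2': ('time(sec)_2', 'mean')}
out = df_2.join(m[m['dates'].between(m['min_date'], m['action_date'])]
                  .groupby('Id').agg(**d), on='Id')
print(out)
```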

Comparing one date column to another in a different row

I have a dataframe df_corp:
ID arrival_date leaving_date
1 01/02/20 05/02/20
2 01/03/20 07/03/20
1 12/02/20 20/02/20
1 07/03/20 10/03/20
2 10/03/20 15/03/20
I would like to find the difference between leaving_date of a row and arrival date of the next entry with respect to ID. Basically I want to know how long before they book again.
So it'll look something like this.
ID arrival_date leaving_date time_between
1 01/02/20 05/02/20 NaN
2 01/03/20 07/03/20 NaN
1 12/02/20 20/02/20 7
1 07/03/20 10/03/20 15
2 10/03/20 15/03/20 3
I've tried grouping by ID to do the sum but I'm seriously lost on how to get the value from the next row and a different column in one.
You need to convert the columns with to_datetime and perform a GroupBy.shift to get the previous departure date:
# arrival
a = pd.to_datetime(df_corp['arrival_date'], dayfirst=True)
# previous departure per ID
l = pd.to_datetime(df_corp['leaving_date'], dayfirst=True).groupby(df_corp['ID']).shift()
# difference in days
df_corp['time_between'] = (a-l).dt.days
output:
ID arrival_date leaving_date time_between
0 1 01/02/20 05/02/20 NaN
1 2 01/03/20 07/03/20 NaN
2 1 12/02/20 20/02/20 7.0
3 1 07/03/20 10/03/20 16.0
4 2 10/03/20 15/03/20 3.0
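End-to-end, with an explicit day-first format so the parsing is unambiguous:

```python
import pandas as pd

df_corp = pd.DataFrame({
    'ID': [1, 2, 1, 1, 2],
    'arrival_date': ['01/02/20', '01/03/20', '12/02/20', '07/03/20', '10/03/20'],
    'leaving_date': ['05/02/20', '07/03/20', '20/02/20', '10/03/20', '15/03/20'],
})

# arrival dates
a = pd.to_datetime(df_corp['arrival_date'], format='%d/%m/%y')
# previous departure per ID
l = pd.to_datetime(df_corp['leaving_date'], format='%d/%m/%y').groupby(df_corp['ID']).shift()
# difference in days (NaN on each ID's first booking)
df_corp['time_between'] = (a - l).dt.days
print(df_corp['time_between'].tolist())  # [nan, nan, 7.0, 16.0, 3.0]
```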

Pandas get the Month Ending Values from Series

I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward fill missing values with ffill and convert back to columns with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
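A version-stable variant of the same idea (sidestepping the 'm'/'M'/'ME' frequency-alias differences across pandas versions) groups on monthly Periods and reindexes over the full month range; the data is the sample from the question, which is assumed to be sorted by date:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2009-04-23', '2009-04-24', '2009-04-27', '2009-04-30',
             '2009-05-05', '2009-05-20', '2009-05-28', '2009-05-29',
             '2009-06-01', '2009-06-02', '2009-06-12', '2009-08-03'],
    'totalShrs': [10000, 20000, 30000, 40000, 50000, 60000,
                  70000, 80000, 90000, 100000, 110000, 120000],
})
df['date'] = pd.to_datetime(df['date'])

# Last balance seen in each month, then fill months with no entries (here July).
s = df.groupby(df['date'].dt.to_period('M'))['totalShrs'].last()
s = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='M')).ffill()
print(s)
```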
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, get the last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with the last row obtained
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08

Python Pandas: groupby date and count new records for each period

I'm trying to use Python Pandas to count daily returning visitors to my website over a time period.
Example data:
df1 = pd.DataFrame({'user_id':[1,2,3,1,3], 'date':['2012-09-29','2012-09-30','2012-09-30','2012-10-01','2012-10-01']})
print(df1)
date user_id
0 2012-09-29 1
1 2012-09-30 2
2 2012-09-30 3
3 2012-10-01 1
4 2012-10-01 3
What I'd like to have as final result:
df1_result = pd.DataFrame({'count_new':[1,2,0], 'date':['2012-09-29','2012-09-30','2012-10-01']})
print(df1_result)
count_new date
0 1 2012-09-29
1 2 2012-09-30
2 0 2012-10-01
In the first day there is 1 new user because user 1 appears for the first time.
In the second day there are 2 new users: user 2 and user 3 both appear for the first time.
Finally in the third day there are 0 new users: user 1 and user 3 have both already appeared in previous periods.
So far I have been looking into merging two copies of same dataframe and shifting one by a date, but without success:
pd.merge(df1, df1.user_id.shift(-date), on = 'date').groupby('date')['user_id_y'].nunique()
Any help would be much appreciated,
Thanks
>>> (df1
.groupby(['user_id'], as_index=False)['date'] # Group by `user_id` and get first date.
.first()
.groupby(['date']) # Group result on `date` and take counts.
.count()
.reindex(df1['date'].unique()) # Reindex on original dates.
.fillna(0)) # Fill null values with zero.
user_id
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
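An equivalent sketch, under the assumption (true for this sample) that df1 is sorted by date, keeps each user's first row with drop_duplicates:

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2, 3, 1, 3],
                    'date': ['2012-09-29', '2012-09-30', '2012-09-30',
                             '2012-10-01', '2012-10-01']})

# Keep each user's first (earliest) row, count first-timers per date,
# and reinstate dates that gained no new users with a zero count.
new_per_day = (df1.drop_duplicates('user_id')
                  .groupby('date')['user_id'].size()
                  .reindex(df1['date'].unique(), fill_value=0))
print(new_per_day.tolist())  # [1, 2, 0]
```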
It is better to add a new column Isreturning (in case you need to analyze returning customers in the future):
df['Isreturning']=df.groupby('user_id').cumcount()
Only show new customers:
df.loc[df.Isreturning==0,:].groupby('date')['user_id'].count()
Out[840]:
date
2012-09-29 1
2012-09-30 2
Name: user_id, dtype: int64
Or you can:
df.groupby('date')['Isreturning'].apply(lambda x : len(x[x==0]))
Out[843]:
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
Name: Isreturning, dtype: int64
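Putting the Isreturning approach together as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3, 1, 3],
                   'date': ['2012-09-29', '2012-09-30', '2012-09-30',
                            '2012-10-01', '2012-10-01']})

df['Isreturning'] = df.groupby('user_id').cumcount()  # 0 marks a user's first visit
# Counting zeros per date keeps dates with no new users (as 0), unlike filtering first.
counts = df.groupby('date')['Isreturning'].apply(lambda x: len(x[x == 0]))
print(counts.tolist())  # [1, 2, 0]
```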

pandas group dates to quarterly and sum sales column

I am learning python and at the moment I am playing with some sales data. The data is in csv format and is showing weekly sales.
I have below columns with some sample data as below:
store# dept# dates weeklysales
1 1 01/01/2005 50000
1 1 08/01/2005 120000
1 1 15/01/2005 75000
1 1 22/01/2005 25000
1 1 29/01/2005 18000
1 2 01/01/2005 15000
1 2 08/01/2005 12000
1 2 15/01/2005 75000
1 2 22/01/2005 35000
1 2 29/01/2005 28000
1 1 01/02/2005 50000
1 1 08/02/2005 120000
1 1 15/02/2005 75000
1 1 22/03/2005 25000
1 1 29/03/2005 18000
I want to sum the weeklysales up to a monthly total for each department and display the records.
I have tried to use groupby function in Pandas from below links:
how to convert monthly data to quarterly in pandas
Pandas group by and sum two columns
Pandas group-by and sum
But with the above I get the sum of all the columns, producing the following output where the store and dept numbers are added up as well:
store# dept# dates weeklysales
4 3 01/2005 28800
4 1 01/2005 165000
4 3 02/2005 245000
4 3 03/2005 43000
I do not want to add store and dept numbers but want to just add the weeklysales figure by each month and want the display like:
store# dept# dates weeklysales
1 1 01/2005 28800
1 2 01/2005 165000
1 1 02/2005 245000
1 1 03/2005 43000
I would be grateful for a solution to this.
Cheers,
Is this what you are after?
Convert dates to month/year format, then group and sum sales (this assumes dates is already a datetime column):
(df.assign(dates=df.dates.dt.strftime('%m/%Y'))
   .groupby(['store#','dept#','dates'])
   .sum()
   .reset_index()
)
Out[243]:
store# dept# dates weeklysales
0 1 1 01/2005 288000
1 1 1 02/2005 245000
2 1 1 03/2005 43000
3 1 2 01/2005 165000
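For a fully reproducible version, the dd/mm/yyyy strings must first be parsed with dayfirst=True (the snippet above assumes this has already happened):

```python
import pandas as pd

df = pd.DataFrame({
    'store#': [1] * 15,
    'dept#':  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1],
    'dates':  ['01/01/2005', '08/01/2005', '15/01/2005', '22/01/2005', '29/01/2005',
               '01/01/2005', '08/01/2005', '15/01/2005', '22/01/2005', '29/01/2005',
               '01/02/2005', '08/02/2005', '15/02/2005', '22/03/2005', '29/03/2005'],
    'weeklysales': [50000, 120000, 75000, 25000, 18000,
                    15000, 12000, 75000, 35000, 28000,
                    50000, 120000, 75000, 25000, 18000],
})

# Parse day-first dates, format as mm/YYYY, then sum only the sales per group.
df['dates'] = pd.to_datetime(df['dates'], dayfirst=True)
out = (df.assign(dates=df['dates'].dt.strftime('%m/%Y'))
         .groupby(['store#', 'dept#', 'dates'], as_index=False)['weeklysales']
         .sum())
print(out)
```

Selecting only ['weeklysales'] before summing avoids accidentally adding up other numeric columns, which was the problem in the question.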
