How many rides completed on average in a 4 hour span - python

I have a dataset:
ride_completion_time ride_id
0 2022-08-27 11:42:02 1
1 2022-08-24 05:59:26 2
2 2022-08-23 17:40:05 3
3 2022-08-28 23:06:01 4
4 2022-08-27 03:21:29 5
I would like to find out, on average, how many rides are completed in a 4 hour time span.
I run df3.dtypes to get my data types.
output:
dropoff_datetime datetime64[ns]
ride_id object
dtype: object
Then I've tried the following:
Option 1)
df3 = df3.groupby(df3.ride_completion_time.dt.floor('2H')).mean()
Result: Dataframe object has no attribute dropoff_date_time
Option 2)
df3.groupby(df3.index.floor('4H').time).sum()
Result: It gives me the right grouping (I can see it's flooring my times to every 4 hours), but then it doesn't really sum anything. I tried using average, but I don't think average is supported.
Can someone point me in the right direction?

ride_id is of object type (probably string), so sum and mean exclude that column. Since you want the number of rides, use size:
df3.groupby(df3.index.floor('4H').time).size()
As to why option 2 works but option 1 doesn't: you probably set ride_completion_time as the index somewhere earlier in your code.
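If the goal is the average number of rides per 4-hour window (rather than per time of day), one option is to count rides in each floored window and then average those counts. A minimal sketch, assuming df3 has ride_completion_time as its DatetimeIndex:
rides_per_window = df3.groupby(df3.index.floor('4H')).size()   # rides completed in each 4-hour window
avg_rides = rides_per_window.mean()                            # average rides per 4-hour span
# df3.resample('4H').size() would also count empty windows as zero
print(avg_rides)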

Related

Date frequency detection, but one that occurs most often

I am looking for a way to check the frequency of dates in a column. I have dates with a frequency of every week, but sometimes there is a gap of 2 or 3 weeks, and the pd.infer_freq method returns NaN.
My data:
2022-01-01
2022-01-08
2022-01-23
2022-01-30
Your sample data is too small for pd.infer_freq to infer the frequency. You could instead find the most common time difference between consecutive dates and use that as the inferred frequency:
s = pd.Series(dates)
print((s - s.shift(1)).mode())
Output
0 7 days
dtype: timedelta64[ns]
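A minimal, self-contained version of that approach, using the sample dates from the question:
import pandas as pd

dates = pd.to_datetime(['2022-01-01', '2022-01-08', '2022-01-23', '2022-01-30'])
s = pd.Series(dates)
most_common_gap = s.diff().mode()   # .diff() is equivalent to s - s.shift(1)
print(most_common_gap)              # 0   7 days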

Multiple math operations on timeseries data using groupby

I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be totally off on the wrong track too. Also not sure I've labelled the question very well.
I think you need the mean per year, i.e. per the first index level:
df.groupby([df.index.year, df.index.month]).sum().mean(level=0)
You can use groupby twice, once to get the monthly sum, once to get the mean of monthly sum:
(df.groupby(pd.Grouper(freq='M')).sum()
   .groupby(pd.Grouper(freq='Y')).mean())
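For reference, newer pandas versions removed the level argument from mean, so a sketch of the same idea with current syntax (the hourly data and the 'value' column below are made up) might look like this:
import numpy as np
import pandas as pd

idx = pd.date_range('2016-03-01', '2017-07-31 23:00', freq='H')
df = pd.DataFrame({'value': np.random.rand(len(idx))}, index=idx)

monthly = df.groupby([df.index.year, df.index.month]).sum()   # monthly totals, indexed by (year, month)
yearly_mean = monthly.groupby(level=0).mean()                 # replaces .sum().mean(level=0)

# or, keeping datetime-aware grouping throughout:
yearly_mean_alt = df.resample('M').sum().resample('Y').mean()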

Pandas: Group by, Cumsum + Shift with a "where clause"

I am attempting to learn some Pandas that I otherwise would be doing in SQL window functions.
Assume I have the following dataframe, which shows different players' previous matches and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
With the code below I managed to create a groupby that only shows the previously summed kills (the sum of the player's kills excluding the kills from the game in the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row should provide no summed kills, because player a played no games during the previous month. Is this possible somehow?
I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access the data of other columns.
groupby and apply does not seem to be the right way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
def killcount(x, data, timewin):
    """Count the player's kills in a time window before the time of the current row.

    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])                 # dates preceding the current row
                    & (data['date'] >= x['date'] - timewin)    # dates inside the time window
                    & (data['player'] == x['player']),         # rows for the same player
                    'kills'].sum()

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't already, remember to parse the 'date' column to datetime type using pandas.to_datetime, otherwise you cannot perform the date comparisons.
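For completeness, a small usage sketch with the sample data rebuilt by hand (it assumes the killcount function defined above):
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-03-04'],
    'player': ['a', 'b', 'a', 'a'],
    'kills': [15, 20, 10, 20],
})
df['date'] = pd.to_datetime(df['date'])   # parse strings to datetime64 before comparing dates

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
print(df)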

Traversing groups of group by object pandas

I need help with a big pandas issue.
Since a lot of people asked for the real input and the real desired output in order to answer the question, here it goes:
So I have the following dataframe
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01     1                          2                7        28,57              28,57
2017-01-01     2                          1                7        14,28              42,85
2017-01-01     4                          3                7        42,85              85,7
2017-01-01    10                          1                7        14,28             100
2017-02-02     1                          2               14        14,28              14,28
2017-02-02     2                          3               14        21,42              35,7
2017-02-02     4                          4               14        28,57              64,27
2017-02-02    10                          5               14        35,71             100
2017-03-03     1                          3               17        17,64              17,64
2017-03-03     2                          3               17        17,64              35,28
2017-03-03     4                          5               17        29,41              64,69
2017-03-03    10                          6               17        35,29             100
- The column %_exercises is (cumulative_num_exercises / total_exercises) * 100.
- The column %_exercises_accum is the running sum of %_exercises within each month. (Note that at the end of each month it reaches 100.)
- With this data, I need to calculate the % of users that contributed to 50%, 80% and 90% of the total exercises during each month.
- In order to do so, I thought of creating a new column, called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises on each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop with two main ifs:
if (df.iloc[i][date] == df.iloc[i][date].shift()):
    calculations to determine the percentage or percentages to which the user, from the second to the last row of the same month group, contributes (because the same user can contribute to all the percentages, or to more than one)
else:
    calculations to determine to which percentage of exercises the first member of each month group contributes.
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Doing while loops inside the for, because when a user suddenly reaches a big percentage, we need to go back over the users in the same month and change their category column value to 50, as they contributed to the 50% but didn't reach it themselves. For instance, in this situation:
Date %_exercises_accum
2017-01-01 1,24
2017-01-01 3,53
2017-01-01 20,25
2017-01-01 55,5
The desired output for the given dataframe at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with values like 508090 or 8090 mean that the user contributes to:
508090: the 50%, 80% and 90% marks of total exercises in a month.
8090: both the 80% and 90% marks in a month.
Does anyone know how I can simplify this for loop by traversing the groups of a groupby object?
Thank you very much!
Given no sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd reiterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
dates = ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
         '2017-02-02', '2017-02-02', '2017-02-02', '2017-02-02',
         '2017-03-03', '2017-03-03', '2017-03-03', '2017-03-03']
df = pd.DataFrame(
    {'date': pd.to_datetime(dates),
     'user': [1, 2, 4, 10, 1, 2, 4, 10, 1, 2, 4, 10],
     'cumulative_num_exercises': [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
     'total_exercises': [7, 7, 7, 7, 14, 14, 14, 14, 17, 17, 17, 17]}
)
df = df.set_index('date')

for idx in df.index.unique():
    hold = df.loc[idx]
    ### YOUR CODE GOES HERE ###
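If it helps, the two percentage columns themselves can be derived without a loop; a minimal sketch on the frame built above (column names follow the question):
df['%_exercises'] = df['cumulative_num_exercises'] / df['total_exercises'] * 100
df['%_exercises_accum'] = df.groupby(level=0)['%_exercises'].cumsum()   # running sum within each date group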

Pandas - How to get list of

(I am learning Pandas, so please explain the solution.)
My data looks like this:
Category currency sellerRating Duration endDay ClosePrice OpenPrice
0 Music/Movie/Game US 3249 5 Mon 0.01 0.01
1 Music/Movie/Game US 3249 5 Mon 0.01 0.01
2 Music/Movie/Game US 3249 5 Mon 0.01 0.01
3 Music/Movie/Game US 3249 5 Mon 0.01 0.01
4 Music/Movie/Game US 3249 5 Mon 0.01 0.01
Dtypes result is:
Category object
currency object
sellerRating int64
Duration int64
endDay object
ClosePrice float64
OpenPrice float64
PriceIncrease float64
dtype: object
I am trying to find out the top (e.g. top 10) items with the highest ClosePrice for EACH category.
Out of ideas, and about to give up and do it by hand for each category, I tried:
df[(df['ClosePrice']> 93) & ([df.Category == 'Automotive'])]
...but it did not work. The error I get is:
ValueError: operands could not be broadcast together with shapes (351550,) (1975,)
I have also explored Crosstab, but it's not what I am looking for.
There must be a way to do what I want automatically in one line of Pandas code. Any advice? Thanks!
I'd use nlargest method:
df.groupby('Category', group_keys=False).apply(lambda x: x.nlargest(10, 'ClosePrice'))
Use groupby and then apply a sort, keeping only the top k values:
top = 10
df.groupby('Category', group_keys=None).apply(lambda x: x.sort_values('ClosePrice', ascending=False)[:top])
Since you ask for an explanation of the solution, I'll try.
By using groupby you're creating groups of data based on the Category column; every group will have the same Category. For each group, the code then applies sort_values, which sorts the data by ClosePrice in descending order (without ascending=False you would get the lowest prices rather than the top ones), and keeps only the first top rows.
The code above may 'mess up' the indexes by keeping the original index. If you need to reset the index, use:
df.groupby('Category', group_keys=None).apply(lambda x: x.sort_values('ClosePrice', ascending=False)[:top]).reset_index(drop=True)
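For reference, a small self-contained sketch of both answers on toy data (the categories and prices below are made up):
import pandas as pd

df = pd.DataFrame({
    'Category': ['Automotive', 'Automotive', 'Automotive', 'Music/Movie/Game', 'Music/Movie/Game'],
    'ClosePrice': [95.0, 120.5, 87.3, 0.01, 3.5],
})

top = 2
print(df.groupby('Category', group_keys=False).apply(lambda x: x.nlargest(top, 'ClosePrice')))

# an equivalent without apply: sort once, then take the first rows of each group
print(df.sort_values('ClosePrice', ascending=False).groupby('Category').head(top))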
