Pandas group sum divided by unique items in group - python

I have data in Excel of employees and the number of hours worked in a week. I tagged each employee to the project he/she is working on. I can get the sum of hours worked on each project by doing a groupby as below:
util_breakup_sum = df[["Tag", "Bill. Hours"]].groupby("Tag").sum()
Bill. Hours
Tag
A61H 92.00
A63B 139.75
An 27.00
B32B 33.50
H 37.00
Manager 8.00
PP 23.00
RP0117 38.50
Se 37.50
However, when I try to calculate the average time spent on each project per person, it gives me (sum / total number of entries), whereas the correct average should be (sum / unique employees in the group).
Example of mean is given below:
util_breakup_mean = df[["Tag", "Bill. Hours"]].groupby("Tag").mean()
Bill. Hours
Tag
A61H 2.243902
A63B 1.486702
An 1.000000
B32B 0.712766
H 2.055556
Manager 0.296296
PP 1.095238
RP0117 1.425926
Se 3.750000
For example, group A61H has just two employees, so their average should be (92/2) = 46. However, the code is dividing by the total number of entries by these employees and hence giving an average of 2.24.
How to get the average from unique employee names in the group?

Try:
df.groupby("Tag")["Bill. Hours"].sum().div(df.groupby("Tag")["Employee"].nunique()
where Employee is the column identifying employees.

You can try nunique
util_breakup_mean = util_breakup_sum.div(df.groupby("Tag")["Employee"].nunique(), axis=0)
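A quick self-contained example of the idea (column names Tag, Employee and Bill. Hours assumed to match the question):
import pandas as pd

# Toy data: two employees logged hours on A61H, one on B32B
df = pd.DataFrame({
    "Tag": ["A61H", "A61H", "A61H", "B32B"],
    "Employee": ["Alice", "Alice", "Bob", "Carol"],
    "Bill. Hours": [40.0, 30.0, 22.0, 33.5],
})

# Total hours per project divided by the number of distinct employees on it
per_person = (df.groupby("Tag")["Bill. Hours"].sum()
                .div(df.groupby("Tag")["Employee"].nunique()))
print(per_person)
# A61H    46.0   (92 hours / 2 employees)
# B32B    33.5   (33.5 hours / 1 employee)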

Related

How to calculate best rating based on the combination of rating number and number of ratings

I'm building a movie recommendation system. The recommendation has been calculated at this point. I have a dataframe in pandas (using Python - machine learning problem) and this dataset has 3 columns: movie name, movie rating, and number of ratings. I can easily find the best rating by using the highest value, of course. I want to find the best rating based not only on the rating value but also on the number of ratings provided. For instance: I have a movie, Toy Story, which is rated 8.8 by 222 people. I have another movie called Coco rated 8.9 by 131 people. Based on this, despite Coco being rated higher, I need a calculation that will inform me that Toy Story is the highest-rated movie theoretically, as it has somewhat close to twice the number of ratings. Any help is always appreciated as I am a student and still learning.
import pandas as pd
#creating empty lists to form dataset
movie_names_list = []
movie_ratings_list = []
movie_number_of_ratings_list = []
#entry 1
movie_names_list.append("Toy story")
movie_ratings_list.append(8.8)
movie_number_of_ratings_list.append(222)
#entry 2
movie_names_list.append("Coco")
movie_ratings_list.append(8.9)
movie_number_of_ratings_list.append(131)
#entry 3
movie_names_list.append("Frozen")
movie_ratings_list.append(8.5)
movie_number_of_ratings_list.append(275)
movie_df = pd.DataFrame({
    'Movie_Name': movie_names_list,
    'Movie_Rating': movie_ratings_list,
    'Rated_By_Number_Of_Users': movie_number_of_ratings_list
})
movie_df.head(5)
I found the answer myself after trying many methods...
Step 1: find a weight value automatically that will be applied to both movies to compute a given movie's weight percentage based on the number of ratings the movie has. In this scenario the two movies from the above example, Toy Story and Coco, will be compared. The formula for the automatic calculation of the weight value is: weight = total_number_of_reviews_in_dataframe (from all movies being compared - here Toy Story and Coco) / 100.
Answer: weight = (222 + 131) / 100 = 3.53.
Step 2: for both movies, we will calculate the weight that the number of ratings will carry in the determination of finding the highest-rated movie. Importantly, the weight percentage from both movies, when combined, should equal 100. The formula for movie weight calculation is: movie_weight = number_of_ratings_for_movie / weight from step 1.
Answer:
Toy Story: 222 / 3.53 = 62.88.
Coco: 131 / 3.53 = 37.11.
Step 3: calculate a weight-based total for both movies. The formula for this is: movie_weight_based_total = movie_weight (from step 2) * rating_for_movie (the average rating).
Answer:
Toy Story: 62.88 * 8.8 = 553.344.
Coco: 37.11 * 8.9 = 330.279.
Final step: use a conditional statement to find what total is higher and the answer to this is the best-rated movie :)
Python code addition below (can easily make a function out of this):
#calculate the weight that will be used to compute and measure the best rated movie
weight_for_rating_calculation = movie_df['Rated_By_Number_Of_Users'].sum() / 100
#for both movies calculate the weight that the number of ratings will carry in the determination of finding the highest rated movie
movie_1_weight = movie_df.iloc[0]['Rated_By_Number_Of_Users'] / weight_for_rating_calculation # toy story
movie_2_weight = movie_df.iloc[1]['Rated_By_Number_Of_Users'] / weight_for_rating_calculation # coco
#calculate a weight-based total for both movies
movie_1_weight_based_total = movie_1_weight * movie_df.iloc[0]['Movie_Rating']
movie_2_weight_based_total = movie_2_weight * movie_df.iloc[1]['Movie_Rating']
#whichever total is higher is the best-rated movie, now based upon the combination of both rating value and number of ratings
if (movie_1_weight_based_total > movie_2_weight_based_total):
    print("Toy Story is the best rated movie")
else:
    print("Coco is the best rated movie")

Lifetimes package gives inconsistent results

I am using Lifetimes to compute CLV of some customers of mine.
I have transactional data and, by means of summary_data_from_transaction_data (the implementation can be found here), I would like to compute
the recency, the frequency and the time interval T of each customer.
Unfortunately, it seems that the method does not compute the frequency correctly.
Here is the code for testing my dataset:
df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_libray = summary_data_from_transaction_data(df_test,
                                                     'Customer',
                                                     'Transaction date',
                                                     observation_period_end='2020-02-12',
                                                     freq='D')
According to the code, the result is:
frequency recency T
Customer
1158624 18.0 389.0 401.0
1171970 67.0 396.0 406.0
1188564 12.0 105.0 401.0
The problem is that customers 1171970 and 1188564 made 69 and 14 transactions respectively, thus the frequency should have been 68 and 13.
Printing the size of each customer confirms that:
print(df_test.groupby('Customer').size())
Customer
1158624 19
1171970 69
1188564 14
I did try to natively use the underlying code from summary_data_from_transaction_data like this:
RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = (
pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
)
# subtract 1 from count, as we ignore their first order.
RFT_native ["frequency"] = RFT_native ["count"] - 1
RFT_native ["T"] = (observation_period_end - RFT_native ["min"]) / np.timedelta64(1, 'D') / 1
RFT_native ["recency"] = (RFT_native ["max"] - RFT_native ["min"]) / np.timedelta64(1, 'D') / 1
As you can see, the result is indeed correct.
min max count frequency T recency
Customer
1171970 2019-01-02 15:45:39 2020-02-02 13:40:18 69 68 405.343299 395.912951
1188564 2019-01-07 18:10:55 2019-04-22 14:27:08 14 13 400.242419 104.844595
1158624 2019-01-07 10:52:33 2020-01-31 13:50:36 19 18 400.546840 389.123646
Of course my dataset is much bigger, and a slight difference in frequency and/or recency greatly alters the computation of the BGF model.
What am I missing? Is there something that I should consider when using the method?
I might be a bit late to answer your query, but here it goes.
The documentation for the Lifetimes package defines frequency as:
frequency represents the number of repeat purchases the customer has made. This means that it’s one less than the total number of purchases. This is actually slightly wrong. It’s the count of time periods the customer had a purchase in. So if using days as units, then it’s the count of days the customer had a purchase on.
So, it's basically the number of time periods in which the customer made a repeat purchase, not the number of individual repeat purchases. A quick scan of your sample dataset confirmed that both 1188564 and 1171970 indeed made 2 purchases on a single day, 13 Jan 2019 and 15 Jun 2019 respectively. So these 2 transactions are considered as 1 when calculating frequency, which results in the frequency calculated by the summary_data_from_transaction_data function being 2 less than your manual count.
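You can verify this on your own data with something along these lines (a sketch, assuming the Transaction date column parses cleanly to datetime):
import pandas as pd

df_test['Transaction date'] = pd.to_datetime(df_test['Transaction date'])

# frequency = number of distinct purchase days minus the first one
distinct_days = df_test.groupby('Customer')['Transaction date'].apply(
    lambda s: s.dt.normalize().nunique())
print(distinct_days - 1)  # should match the library's frequency column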
According to the documentation, you need to set:
include_first_transaction = True
include_first_transaction (bool, optional) – Default: False By default
the first transaction is not included while calculating frequency and
monetary_value. Can be set to True to include it. Should be False if
you are going to use this data with any fitters in lifetimes package
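For reference, the call from the question with that flag set would look like:
RFT_incl_first = summary_data_from_transaction_data(df_test,
                                                    'Customer',
                                                    'Transaction date',
                                                    observation_period_end='2020-02-12',
                                                    freq='D',
                                                    include_first_transaction=True)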

How to find last 24 hours data from pandas data frame

I have data with two columns, one is description and the other is publishedAt. I applied the sort function on the publishedAt column and got the output in descending order of date. Here is a sample of my data frame:
description publishedAt
13 Bitcoin price has failed to secure momentum in... 2018-05-06T15:22:22Z
16 Brian Kelly, a long-time contributor to CNBC’s... 2018-05-05T15:56:48Z
2 The bitcoin price is less than $100 away from ... 2018-05-05T13:14:45Z
12 Mati Greenspan, a senior analyst at eToro and ... 2018-05-04T16:05:37Z
52 A Singaporean startup developing ‘smart bankno... 2018-05-04T14:02:30Z
75 Cryptocurrencies are set to make a comeback on... 2018-05-03T08:10:19Z
76 The bitcoin price is hovering near its best le... 2018-04-30T16:26:57Z
74 In today’s climate of ICOs with 100 billion to... 2018-04-30T12:03:31Z
27 Investment guru Warren Buffet remains unsold o... 2018-04-29T17:22:19Z
22 The bitcoin price has increased by around $400... 2018-04-28T12:28:35Z
68 Bitcoin futures volume reached an all-time hig... 2018-04-27T16:32:44Z
14 Biotech-company-turned-cryptocurrency-investme... 2018-04-27T14:25:15Z
67 The bitcoin price has rebounded to $9,200 afte... 2018-04-27T06:24:42Z
Now I want the descriptions from the last 3 hours, 6 hours, 12 hours and 24 hours.
How can I find them?
Thanks
As a simple solution within Pandas, you can use the DataFrame.last(offset) function. Be sure to set the publishedAt column as the dataframe's DatetimeIndex. A similar function to get rows at the start of a dataframe is the DataFrame.first(offset) function.
Here is an example using the provided offsets:
df.last('24h')
df.last('12h')
df.last('6h')
df.last('3h')
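If the index is not a DatetimeIndex yet, a quick sketch of the setup (assuming publishedAt starts as a plain string column):
import pandas as pd

df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df = df.set_index('publishedAt').sort_index()

df.last('24h')['description']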
Assuming that the dataframe is called df
import datetime as dt
df[df['publishedAt']>=(dt.datetime.now()-dt.timedelta(hours=3))]['description'] #hours = 6,12, 24
If you need the intervals exclusive, i.e. the descriptions within the last 6 hours but not the ones within the last 3 hours, you'll need to use array-like logical operators from numpy like numpy.logical_and(arr1, arr2) in the first bracket.
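For instance, a sketch of the exclusive 3-to-6-hour window (assuming publishedAt has already been parsed to datetime):
import datetime as dt
import numpy as np

now = dt.datetime.now()
# within the last 6 hours but NOT within the last 3 hours
mask = np.logical_and(df['publishedAt'] >= now - dt.timedelta(hours=6),
                      df['publishedAt'] < now - dt.timedelta(hours=3))
df[mask]['description']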

Pandas averaging selected rows and columns

I am working with some EPL stats. I have a CSV with all matches from one season in the following format.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
11.05.2014 Chelsea Swansea 0 0 1.50 3.00 5.00
What I would like to do is for each match calculate average stats of teams from N previous matches. The result should look something like this.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal avgNorwichSC avgArsenalSC 5.00 4.00 1.73
11.05.2014 Chelsea Swansea avgChelseaSC avgSwanseaSC 1.50 3.00 5.00
So the date, teams and odds remain untouched and the other stats are replaced with averages from the N previous matches. EDIT: The matches from the first N rounds should not be in the final table because there is not enough data to calculate averages.
The trickiest part for me is that the stats I am averaging have a different prefix (H_ or A_) depending on where the match was played.
All I have managed to do so far is create a dictionary, where the key is the club name and the value is a DataFrame containing all matches played by that club.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
04.05.2014 Arsenal West Brom 1 0 1.40 5.25 8.00
I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).
You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume/pretend you want to find averages over the whole table. If finding the "last N" is really important, there is a rough sketch at the end of this answer. This should get you going with pandas and groupby - I've left prints in so you can understand what's going on.
import pandas
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
#Find most recent date for each team
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')
#Following will give you dataframes, team against last game, home and away
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).reset_index().groupby('index')['D'].max()
print(teamLastGame)
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())
print(teamAveScore)
You now have average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns) and then select from the average score dataframes using the team names from that row.
e.g.
recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]
print(recentRows)
def insertAverages(s):
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))
finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis = 1)], axis=1)
print(finalTable)
finalTable has your original odds etc for the most recent games with two extra columns (H_AVSC and A_AVSC) for the average scores of home and away teams involved in those matches
Edit
Couple of gotchas
just noticed I didn't put a format string in to_datetime(). For your dates - they look like UK format with dots so you should do
EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')
You could use the minimum of the dates in teamLastGame instead of the hard coded 2015/01/10 in my example.
If you really need to replace column H_SC with H_AVSC in your finalTable, rather than add on the averages:
newCols = recentRows.apply(insertAverages, axis = 1)
recentRows['H_SC'] = newCols['H_AVSC']
recentRows['A_SC'] = newCols['A_AVSC']
print(recentRows)
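And since the "last N matches" part is what you actually asked about, here is a rough sketch of one way to do it (N, column names and prefixes assumed as in your sample): reshape to one row per team per match, take a rolling mean of each team's previous N scores, then merge those averages back onto the original rows.
N = 5  # number of previous matches to average over

# One row per team per match, with the score that team achieved
home = EPL_df[['D', 'H', 'H_SC']].rename(columns={'H': 'team', 'H_SC': 'score'})
away = EPL_df[['D', 'A', 'A_SC']].rename(columns={'A': 'team', 'A_SC': 'score'})
long_df = pandas.concat([home, away]).sort_values('D')

# Rolling mean of the previous N matches; shift(1) excludes the current match
long_df['avg_prev_N'] = (long_df.groupby('team')['score']
                                .transform(lambda s: s.shift(1).rolling(N).mean()))

# Merge the averages back on (date, team) for the home and away sides
result = (EPL_df
          .merge(long_df[['D', 'team', 'avg_prev_N']],
                 left_on=['D', 'H'], right_on=['D', 'team'])
          .rename(columns={'avg_prev_N': 'H_AVSC'}).drop(columns='team')
          .merge(long_df[['D', 'team', 'avg_prev_N']],
                 left_on=['D', 'A'], right_on=['D', 'team'])
          .rename(columns={'avg_prev_N': 'A_AVSC'}).drop(columns='team'))

# Matches from the first N rounds have no averages yet, so drop them
result = result.dropna(subset=['H_AVSC', 'A_AVSC'])
print(result)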

How to calculate cumulative moving average in Python/SQLAlchemy/Flask

I'll give some context so it makes sense. I'm capturing Customer Ratings for Products in a table (Rating) and want to be able to return a Cumulative Moving Average of the ratings based on time.
A basic example follows taking a rating per day:
02 FEB - Rating: 5 - Cum Avg: 5
03 FEB - Rating: 4 - Cum Avg: (5+4)/2 = 4.5
04 FEB - Rating: 1 - Cum Avg: (5+4+1)/3 = 3.3
05 FEB - Rating: 5 - Cum Avg: (5+4+1+5)/4 = 3.75
Etc...
I'm trying to think of an approach that won't scale horribly.
My current idea is to have a function that is triggered when a row is inserted into the Rating table and works out the Cum Avg based on the previous row for that product.
So the fields would be something like:
TABLE: Rating
| RatingId | DateTime | ProdId | RatingVal | RatingCnt | CumAvg |
But this seems like a fairly dodgy way to store the data.
What would be the (or any) way to accomplish this? If I were to use a 'trigger' of sorts, how do you go about doing that in SQLAlchemy?
Any and all advice appreciated!
I don't know about SQLAlchemy, but I might use an approach like this:
Store the cumulative average and rating count separately from individual ratings.
Every time you get a new rating, update the cumulative average and rating count:
new_count = old_count + 1
new_average = ((old_average * old_count) + new_rating) / new_count
Optionally, store a row for each new rating.
Updating the average and rating count could be done with a single SQL statement.
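A plain-Python sketch of that update step (the storage/ORM layer is up to you; the function name is made up):
def update_cumulative_average(old_average, old_count, new_rating):
    # Fold one new rating into the stored running average and count
    new_count = old_count + 1
    new_average = (old_average * old_count + new_rating) / new_count
    return new_average, new_count

# Ratings arriving as 5, 4, 1, 5 reproduce the example above: 5, 4.5, 3.33, 3.75
avg, count = 0.0, 0
for rating in (5, 4, 1, 5):
    avg, count = update_cumulative_average(avg, count, rating)
    print(avg, count)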
