How to calculate cumulative moving average in Python/SQLAlchemy/Flask - python

I'll give some context so it makes sense. I'm capturing Customer Ratings for Products in a table (Rating) and want to be able to return a Cumulative Moving Average of the ratings based on time.
A basic example follows taking a rating per day:
02 FEB - Rating: 5 - Cum Avg: 5
03 FEB - Rating: 4 - Cum Avg: (5+4)/2 = 4.5
04 FEB - Rating: 1 - Cum Avg: (5+4+1)/3 ≈ 3.33
05 FEB - Rating: 5 - Cum Avg: (5+4+1+5)/4 = 3.75
Etc...
I'm trying to think of an approach that won't scale horribly.
My current idea is a function that fires when a row is inserted into the Rating table and works out the cumulative average from the previous row for that product.
So the fields would be something like:
TABLE: Rating
| RatingId | DateTime | ProdId | RatingVal | RatingCnt | CumAvg |
But this seems like a fairly dodgy way to store the data.
What would be the best (or any) way to accomplish this? If I were to use a 'trigger' of sorts, how do you go about doing that in SQLAlchemy?
Any and all advice appreciated!

I don't know about SQLAlchemy, but I might use an approach like this:
Store the cumulative average and rating count separately from individual ratings.
Every time you get a new rating, update the cumulative average and rating count:
new_count = old_count + 1
new_average = ((old_average * old_count) + new_rating) / new_count
Optionally, store a row for each new rating.
Updating the average and rating count could be done with a single SQL statement.
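The update rule above can be sketched in a few lines of Python (a minimal sketch; the SQLAlchemy wiring, e.g. an `after_insert` event listener, is left out, and the function name is my own):

```python
def update_cumulative(old_average, old_count, new_rating):
    """Fold one new rating into the stored (average, count) pair."""
    new_count = old_count + 1
    new_average = (old_average * old_count + new_rating) / new_count
    return new_average, new_count

# Replaying the FEB example from the question:
avg, cnt = 0.0, 0
for rating in [5, 4, 1, 5]:
    avg, cnt = update_cumulative(avg, cnt, rating)
# avg is now approximately 3.75, cnt is 4
```

The same update can be issued as a single SQL UPDATE that recomputes the average from the old values and increments the count in one statement; the exact syntax depends on your database.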


Pandas group sum divided by unique items in group

I have data in Excel of employees and the number of hours each worked in a week. I tagged each employee to the project he/she is working on. I can get the sum of hours worked on each project with a groupby, as below:
util_breakup_sum = df[["Tag", "Bill. Hours"]].groupby("Tag").sum()
Bill. Hours
Tag
A61H 92.00
A63B 139.75
An 27.00
B32B 33.50
H 37.00
Manager 8.00
PP 23.00
RP0117 38.50
Se 37.50
However, when I try to calculate the average time spent on each project per person, it gives me (sum / total number of entries by employee), whereas the correct average should be (sum / unique employees in group).
Example of mean is given below:
util_breakup_mean = df[["Tag", "Bill. Hours"]].groupby("Tag").mean()
Bill. Hours
Tag
A61H 2.243902
A63B 1.486702
An 1.000000
B32B 0.712766
H 2.055556
Manager 0.296296
PP 1.095238
RP0117 1.425926
Se 3.750000
For example, group A61H has just two employees, so their average should be (92/2) = 46. However, the code is dividing by the total number of entries by these employees and hence giving an average of 2.24.
How to get the average from unique employee names in the group?
Try:
df.groupby("Tag")["Bill. Hours"].sum().div(df.groupby("Tag")["Employee"].nunique())
Where Employee is the column identifying employees.
You can try nunique:
util_breakup_mean = util_breakup_sum / df.groupby("Tag")['Employee'].nunique()
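A self-contained check of both suggestions, using a hypothetical frame with the column names `Tag`, `Employee`, and `Bill. Hours` (your real names may differ):

```python
import pandas as pd

# Hypothetical data in the shape described: repeated entries per employee and tag.
df = pd.DataFrame({
    "Tag": ["A61H", "A61H", "A61H", "B32B"],
    "Employee": ["Ann", "Ann", "Bob", "Cat"],
    "Bill. Hours": [50.0, 42.0, 0.0, 33.5],
})

# Sum of hours per project divided by the number of distinct employees on it.
per_person = df.groupby("Tag")["Bill. Hours"].sum().div(
    df.groupby("Tag")["Employee"].nunique()
)
# A61H -> 92.0 / 2 = 46.0, B32B -> 33.5 / 1 = 33.5
```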

Automatically calculating percentage and store into variables

I am currently on a demographic project.
I have data from 3 different countries and their birth statistics in each month.
My question:
I want to calculate the percentage of people born in each month and plot it for each country. (x = Month, y = Percentage born)
Therefore, I want to calculate the percentages first.
I want to do this by iterating over all months to improve my code. So far:
EU = df2["European Union - 28 countries (2013-2020)"]
CZE = df2["Czechia"]
GER = df2["Germany including former GDR"]
EU_1 = EU[1] / EU[0] *100
EU_2 = EU[2] / EU[0] *100
etc.
for each month and 3 countries.
How can I calculate all of them automatically by iterating over the countries, and store every value separately (a function, a for loop)?
Thank you very much!
You could do something like this:
EU = df2["European Union - 28 countries (2013-2020)"]
monthly_percentages = [num_born / EU[0] * 100 for num_born in EU[1:]]
This assumes the first element of EU is the total births that year and the rest are the births in each month. If you want to do all the countries automatically, you can loop through each country, calculate the birth percentage for each month, and store it somewhere. It would look something like:
country_birth_percentages = []
for country in countries:
    country_birth_percentages.append([num_born / country[0] * 100 for num_born in country[1:]])
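Extending the answer's idea to several countries at once, here is a sketch with plain lists standing in for the `df2` columns (the numbers are hypothetical; index 0 is assumed to hold the yearly total, the rest the monthly counts):

```python
# Hypothetical stand-ins for the df2 columns: index 0 is the yearly total,
# the remaining entries are the monthly birth counts.
countries = {
    "EU": [1600, 400, 800, 200],
    "CZE": [200, 25, 50, 100],
    "GER": [800, 100, 200, 400],
}

# One dict entry per country, holding its list of monthly percentages.
monthly_percentages = {
    name: [born / values[0] * 100 for born in values[1:]]
    for name, values in countries.items()
}
# monthly_percentages["CZE"] -> [12.5, 25.0, 50.0]
```

Keeping the results keyed by country name makes it easy to plot each series afterwards.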

How to append a value in excel based on cell values from multiple columns in Python and/or R

I'm new to openpyxl and other similar Excel packages in Python (even pandas). What I want to achieve is to append, for each product, the lowest possible price I can keep, based on the expense formula. The expense formula is in the code below; the data looks like this in Excel:
Product | Cost | Price | Lowest_Price
ABC     | 32   | 66    |
XYZ     | 15   | 32    |
DEF     | 22   | 44    |
JML     | 60   | 120   |
I have the code below in Python 3.5, which works; however, it might not be the most optimized solution. I need to know how to append the lowest value to the Lowest_Price column:
cost = 32 #Cost of product from cost column
price = 66 #Price of product from price column
Net = cost+5 #minimum profit required
lowest = 0 #lowest price that I can keep on the website for each product
#iterating over each row values and posting the lowest value adjacent to each product.
for i in range(Net, price):
    expense = (i * 0.15) + 15  # expense formula
    if i - expense >= Net:
        lowest = i
        break
print(lowest)  # this value should be printed adjacent to the price, in the Lowest_Price column
Now if someone can help me doing that in Python and/or R. The reason I want in both Python and R is because I want to compare the time complexity, as I have a huge set of data to deal with.
I'm fine with code that works with any of the Excel formats, i.e. xls or xlsx, as long as it is fast.
I worked it out this way.
import pandas as pd

df = pd.read_excel('path/file.xls')
# iterate over each row, posting the lowest value adjacent to each product
for x in df.index:
    cost = df.loc[x, 'Cost']
    price = df.loc[x, 'Price']
    Net = cost + 5
    df.loc[x, 'Lowest_Price'] = 0
    for i in range(Net, price):
        expense = (i * 0.15) + 15  # expense formula
        if i - expense >= Net:
            df.loc[x, 'Lowest_Price'] = i
            break
# If you want to save it back to an excel file
df.to_excel('path/file.xls', index=False)
It gave this output:
Product Cost Price Lowest_Price
0 ABC 32 66 62.0
1 XYZ 15 32 0.0
2 DEF 22 44 0.0
3 JML 60 120 95.0
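The inner loop just searches for the smallest integer i with i - (0.15*i + 15) >= cost + 5, and that inequality can be solved directly (0.85*i >= cost + 20), which avoids iterating at all. A sketch with the same cap-at-price behaviour as the loop (the function name is my own):

```python
import math

def lowest_price(cost, price):
    """Smallest integer i in [cost + 5, price) with i - (0.15*i + 15) >= cost + 5,
    or 0 when no such price exists, matching the row loop above."""
    candidate = math.ceil((cost + 20) / 0.85)  # solve 0.85*i >= cost + 20
    return candidate if candidate < price else 0
```

For the four sample rows this yields 62, 0, 0, and 95, matching the loop's output, and it runs in constant time per row instead of scanning the price range.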

Using 3 criteria for a Table Lookup Python

Backstory: I'm fairly new to python, and have only ever done things in MATLAB prior.
I am looking to take a specific value from a table based off of data I have.
The data I have is
Temperatures = [0.8,0.1,-0.8,-1.4,-1.7,-1.5,-2,-1.7,-1.7,-1.3,-0.7,-0.2,0.3,1.4,1.4,1.5,1.2,1,0.9,1.3,1.7,1.7,1.6,1.6]
Hour of the Day =
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
This is all data for a Monday.
My Monday table looks like this:
Temp | Hr0 | Hr1 | Hr2 ...
-15 < t <= -10 | 0.01 | 0.02 | 0.06 ...
-10 < t <= -5 | 0.04 | 0.03 | 0.2 ...
with the temperature bands increasing in steps of 5 up to 30, and the hours of the day up to 23. The values in the table are constants that I would like to look up based on the temperature and hour.
For example, I'd like to be able to say:
print(monday(1, 1))  # 0.01
I would also be doing this for every day of the week for a mass data analysis, thus the need for it to be efficient.
What I've done so far:
So I have stored all of my tables in dictionaries that look kind of like this:
monday_hr0 = [0.01,0.04, ... ]
So first by column then calling them by the temperature value.
What I have now is a bunch of loops that looks like this:
for i in range(0, 365):
    for j in range(0, 24):
        if Day[i] == monday:
            if hr[i + 24*j] == 0:
                if temp[i] == -15:
                    constant.append(monday_hr1[0])
                ...
            if hr[i + 24*j] == 1:
                if temp[i] == -15:
                    constant.append(monday_hr2[0])
                ...
            ...
        elif Day[i] == tuesday:
            if hr[i + 24*j] == 0:
                if temp[i] == -15:
                    constant.append(tuesday_hr1[0])
                ...
            if hr[i + 24*j] == 1:
                if temp[i] == -15:
                    constant.append(tuesday_hr2[0])
                ...
            ...
...
I'm basically saying here if it's a monday, use this table. Then if it's this hour use this column. Then if it's this temperature, use this cell. This is VERY VERY inefficient however.
I'm sure there's a quicker way but I can't wrap my head around it. Thank you very much for your help!
Okay, bear with me here, I'm on mobile. I'll try to write up a solution.
I am assuming the following:
you have a dictionary called day_data which contains the table of data for each day of the week.
you have a dictionary called days which maps 0-6 to a day of the week; 0 is Monday, 6 is Sunday.
you have a list of temperatures you want something done with
you have a time of the day you want to use to pick out the appropriate data from your day_data. You want to do this for each day of the year.
We should only have to iterate once through all 365 days and once through each hour of the day.
heat_load_days = {}
for day_index in range(365):
    day = days[day_index % 7]
    # day is now the day of the week
    data = day_data[day]
    heat_load = []
    for hour in range(24):
        # still unsure on how to select which temperature row from the data table
        heat_load.append(day_data_selected)
    heat_load_days[day] = heat_load
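For the open question of selecting the temperature row, one way to replace the chained ifs is to bin the temperature with `bisect` and index the hour directly. A sketch using a hypothetical two-row, three-hour slice of the Monday table from the question (rows are right-closed 5-degree bands; `bounds` and `lookup` are names I made up):

```python
import bisect

# Row r covers temperatures in (bounds[r], bounds[r + 1]];
# extend bounds up to 30 in steps of 5 for the full table.
bounds = [-15, -10, -5]
monday = [
    [0.01, 0.02, 0.06],   # -15 < t <= -10
    [0.04, 0.03, 0.20],   # -10 < t <= -5
]

def lookup(table, temp, hour):
    # bisect_left maps a temperature to its right-closed band;
    # clamping lo/hi keeps out-of-range temperatures in the edge bands
    row = bisect.bisect_left(bounds, temp, 1, len(bounds) - 1) - 1
    return table[row][hour]
```

With one such table per weekday stored in a dict, the whole year reduces to a single loop calling `lookup(tables[day], temp, hour)`, with no per-day branching.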

Generating a retention cohort from a pandas dataframe

I have a pandas dataframe that looks like this:
+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1 | 2015-01-25 | 0 | NaT |
| ACC2 | 2015-01-11 | 0 | NaT |
| ACC3 | 2015-01-18 | 0 | NaT |
| ACC4 | 2014-12-21 | 14 | 2015-02-12 |
| ACC5 | 2014-12-21 | 5 | 2015-02-15 |
| ACC6 | 2014-12-21 | 0 | 2015-02-22 |
+-----------+------------------+---------------+------------+
It's essentially a visit log of sorts, as it holds all the necessary data for creating a cohort analysis.
Each registration week is a cohort.
To know how many people are part of the cohort I can use:
visit_log.groupby('RegistrationWeek').AccountID.nunique()
What I want to do is create a pivot table with the registration weeks as keys. The columns should be the visit_weeks and the values should be the count of unique account ids who have more than 0 weekly visits.
Together with the total accounts in each cohort, I will then be able to show percentages instead of absolute values.
The end product would look something like this:
+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_week1 | Visit_Week2 | Visit_week3 |
+-------------------+-------------+-------------+-------------+
| week1 | 70% | 30% | 20% |
| week2 | 70% | 30% | |
| week3 | 40% | | |
+-------------------+-------------+-------------+-------------+
I tried pivoting the dataframe like this:
visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')
But I haven't nailed down the value part. I'll need to somehow count account IDs and divide that by the registration week aggregation from above.
I'm new to pandas so if this isn't the best way to do retention cohorts, please enlighten me!
Thanks
There are several aspects to your question.
What you can build with the data you have
There are several kinds of retention. For simplicity, we'll mention only two:
Day-N retention: if a user registered on day 0, did she log in on day N? (Logging in on day N+1 does not affect this metric.) To measure it, you need to keep track of all the logs of your users.
Rolling retention: if a user registered on day 0, did she log in on day N or any day after that? (Logging in on day N+1 affects this metric.) To measure it, you just need the last known logs of your users.
If I understand your table correctly, you have two relevant variables to build your cohort table: registration date and last log (visit week). The number of weekly visits seems irrelevant.
So with this you can only go with option 2, rolling retention.
How to build the table
First, let's build a dummy data set so that we have enough to work on and you can reproduce it:
import pandas as pd
import numpy as np
import math
import datetime as dt
from dateutil.relativedelta import relativedelta

np.random.seed(0)  # so that we all have the same results

def random_date(start, end, p=None):
    # Return a date randomly chosen between two dates
    if p is None:
        p = np.random.random()
    return start + dt.timedelta(seconds=math.ceil(p * (end - start).days * 24 * 3600))

n_samples = 1000  # How many users do we want?
index = range(1, n_samples + 1)
# A range of signup dates, say, one year.
end = dt.datetime.today()
start = end - relativedelta(years=1)
# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples), index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x: random_date(start, end, x))
# last logs randomly distributed within 10 weeks of signing up,
# so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x: random_date(x, x + relativedelta(weeks=10)))
So now we should have something that looks like this:
users.head()
Here is some code to build a cohort table:
### Some useful functions
def add_weeks(sourcedate, weeks):
    return sourcedate + dt.timedelta(days=7 * weeks)

def first_day_of_week(sourcedate):
    return sourcedate - dt.timedelta(days=sourcedate.weekday())

def last_day_of_week(sourcedate):
    return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))

def retained_in_interval(users, signup_week, n_weeks, end_date):
    '''
    For a given list of users, returns the number of users
    that signed up in the week of signup_week (the cohort)
    and that are retained after n_weeks.
    end_date is just here to ensure we do not unnecessarily fill
    the bottom right of the table.
    '''
    # Define the span of the given week
    cohort_start = first_day_of_week(signup_week)
    cohort_end = last_day_of_week(signup_week)
    if n_weeks == 0:
        # If this is our first week, we just take the number of users
        # that signed up in the given period of time
        return len(users[(users['signup_date'] >= cohort_start)
                         & (users['signup_date'] <= cohort_end)])
    elif pd.to_datetime(add_weeks(cohort_end, n_weeks)) > pd.to_datetime(end_date):
        # If adding n_weeks brings us past the end date of the table (the
        # bottom right), we return an easily recognizable value
        # (not 0, as that would cause confusion)
        return float("inf")
    else:
        # Otherwise, we count the users that signed up in the given period
        # and whose last known log is later than n_weeks after signup
        # (rolling retention)
        return len(users[(users['signup_date'] >= cohort_start)
                         & (users['signup_date'] <= cohort_end)
                         & (pd.to_datetime(users['last_log'])
                            >= pd.to_datetime(users['signup_date'].map(lambda x: add_weeks(x, n_weeks))))])
With this we can create the actual function:
def cohort_table(users, cohort_number=6, period_number=6, cohort_span='W', end_date=None):
    '''
    For a given dataframe of users, return a cohort table with the following parameters:
    cohort_number : the number of rows of the table
    period_number : the number of columns of the table
    cohort_span : the span of each period of time between cohorts (D, W, M)
    end_date : the date after which we stop counting the users
    '''
    # the last column of the table will end today:
    if end_date is None:
        end_date = dt.datetime.today()
    # The index of the dataframe will be a list of dates
    dates = pd.date_range(add_weeks(end_date, -cohort_number), periods=cohort_number, freq=cohort_span)
    cohort = pd.DataFrame(columns=['Sign up'])
    cohort['Sign up'] = dates
    # We will compute the number of retained users, column by column
    # (there probably is a more pythonesque way of doing it)
    for p in range(0, period_number + 1):
        # Name of the column
        s_p = 'Week ' + str(p)
        cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users, row['Sign up'], p, end_date), axis=1)
    cohort = cohort.set_index('Sign up')
    # absolute values to percentages, by dividing by the value of week 0:
    cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'), axis='index')
    return cohort
Now you can call it and see the result:
cohort_table(users)
Hope it helps
Using the same format of users data from rom_j's answer, this will be cleaner/faster, but only works assuming there is at least one signup/churn per week. Not a terrible assumption on large enough data.
users = users.applymap(lambda d: d.strftime('%Y-%m-%V') if pd.notnull(d) else d)
tab = pd.crosstab(users.signup_date, users.last_log)
totals = tab.T.sum()
retention_counts = ((tab.T.cumsum().T * -1)
                    .replace(0, pd.NaT)
                    .add(totals, axis=0))
retention = retention_counts.div(totals, axis=0)
realigned = [retention.loc[a].dropna().values for a in retention.index]
realigned_retention = pd.DataFrame(realigned, index=retention.index)
