Retain Data Labels in Statsmodels Likelihood Model - python

I am trying to use the statsmodels package to do Maximum Likelihood Estimation.
My goal is to compute a set of soccer team ratings given a bunch of game outcomes.
I'm using a logistic model for the probability that the home team wins:
P(home win) = 1 / (1 + e^-(HFA + HT - AT))
That has the following parameters:
HFA - home field advantage
HT - home team rating
AT - away team rating
The goal here is to compute each team's rating (there are 18 teams) plus a constant home field advantage factor.
The data I'm given is simply a list of game outcomes - home team, away team, 1 if the home team won the game, 0 if not.
My current thinking is to enrich this data to have a column for each team. I would then make a particular column 1 if that team is playing in that particular game and 0 if not, so there should be two 1's per row. Then add another column called 'hfa', which is always 1, which represents the home field advantage.
Because of how the model works, for any given game I need to know which team was home and which was away so I can compute the prediction properly. To do that, I believe I need the data labels to determine which of the two teams in the game was the home team. However, any time I include non-numeric data (e.g., the team name) in my X columns, I get an error from statsmodels.
Here is the gist of the code I'm trying to write. Any guidance on how to do what I'm trying to do would be much appreciated - thank you!
from statsmodels.base.model import GenericLikelihoodModel

class MyLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        """
        This function should return one evaluation of the negative
        log-likelihood function per observation in your dataset
        (i.e. per row of the endog/X matrix).
        """
        ???

X_cols = ['hfa'] + list(teams)
X = results[X_cols]
y = results['game_result']
model = MyLogit(y, X)  # note: statsmodels expects endog (y) first, then exog (X)
result = model.fit()
print(result.summary())
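One way to sidestep the label problem (a sketch, not the asker's confirmed setup: it assumes the results frame has home_team and away_team columns naming the teams) is to encode each game as a signed row: +1 in the home team's column, -1 in the away team's column, plus a constant hfa column. Then exog @ params equals HFA + HT - AT directly, and team names never enter the design matrix:

import numpy as np
import pandas as pd
from statsmodels.base.model import GenericLikelihoodModel

class MyLogit(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        # exog @ params = hfa + home_rating - away_rating for each game
        xb = self.exog @ params
        p = 1.0 / (1.0 + np.exp(-xb))
        # per-observation negative log-likelihood of a Bernoulli outcome
        return -(self.endog * np.log(p) + (1.0 - self.endog) * np.log(1.0 - p))

# Signed design matrix: +1 for the home team, -1 for the away team.
X = pd.DataFrame(0.0, index=results.index, columns=['hfa'] + list(teams))
X['hfa'] = 1.0
for i, row in results.iterrows():
    X.loc[i, row['home_team']] = 1.0   # assumed column name
    X.loc[i, row['away_team']] = -1.0  # assumed column name

model = MyLogit(results['game_result'], X)
result = model.fit()
print(result.summary())

Note that with all 18 team columns present, the ratings are only identified up to an additive constant, so you may want to drop one team's column (pinning that team's rating at zero).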

Related

Is there a way in pandas (python) to calculate the days of inventory grouped by material number

With the given data frame, the calculated measure would be the DOI (i.e., how many days into the future the inventory will last, based on the demand). Note: the DOI figures need to be programmatically calculated and grouped by the material.
Calculation of DOI: let us take the first row belonging to material A1. The dates are on a weekly basis.
Inventory = 1000
Days into the future until the inventory runs out: the full weeks covered by demands of 300 and 400, plus part of the 500 week. This means the DOI is 7 + 7 + (1000 - 300 - 400)/500 = 14.6 (the 7s being the week lengths: 26.01.2023 − 19.01.2023 and 09.02.2023 − 02.02.2023).
An important point to note is that the demand figure of the concerned row is NOT taken into account while calculating DOI.
I have tried to calculate the cumulative demand without taking the first row for each material (here A1 and B1).
inv_network['cum_demand'] = 0
for i in range(inv_network.shape[0] - 1):
    if inv_network.loc[i+1, 'Period'] > inv_network.loc[0, 'Period']:
        inv_network.loc[i+1, 'cum_demand'] = inv_network.loc[i, 'cum_demand'] + inv_network.loc[i+1, 'Primary Demand']
print(inv_network)
However, this piece of code takes a lot of time as the number of records increases.
As part of the next step when I am trying to calculate DOI, I am running into issues to get the right value.
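As an aside, the row-by-row loop above can usually be replaced with a vectorized groupby. A sketch, assuming a 'Material' column and rows already sorted by material and period (column names taken from the description):

import pandas as pd

# Cumulative demand per material, excluding each material's first row,
# computed in a single pass instead of a Python loop.
g = inv_network.groupby('Material')['Primary Demand']
inv_network['cum_demand'] = g.cumsum() - g.transform('first')

This reproduces the loop's intent: the first row of each material gets 0, and each subsequent row adds its own Primary Demand to the running total.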

Confusion matrix almost perfectly balanced

I'm trying to predict the winning team of a Destiny2 PvP match based on the player stats.
The dataset has multiple columns:
player_X_stat
X from 1 to 12 for each player, players from 1 to 6 are in team A and players from 7 to 12 are in team B
stat is a specific player stat, like average number of kills per game or average number of deaths per game, etc.
So there are 12 × (number of different stats) columns (around 100 in total)
winning_team, which is always 0 before shuffling (see note below): the winning team is always composed of players 1 to 6 (team A) and the losing team of players 7 to 12 (team B).
note: To set up a balanced dataset, I shuffle the teams so that the winning_team column has as many 0s as 1s. For example, player_1_kills becomes player_7_kills, player_7_kills becomes player_1_kills and winning_team becomes 1.
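For concreteness, that swap might look like the following sketch (df, the stats list, and the column naming are assumptions based on the description above):

import numpy as np

# Swap team A and team B columns on a random half of the rows and flip
# the label, so winning_team ends up balanced between 0s and 1s.
rng = np.random.default_rng(0)
flip = rng.random(len(df)) < 0.5
for stat in stats:  # e.g. stats = ['kills', 'deaths', ...]
    for i in range(1, 7):
        a, b = f'player_{i}_{stat}', f'player_{i + 6}_{stat}'
        df.loc[flip, [a, b]] = df.loc[flip, [b, a]].values
df.loc[flip, 'winning_team'] = 1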
However when training a catboost classifier or a simple SVC, the confusion matrix is almost perfectly balanced:
[[87486 31592]
[31883 87196]]
It seems to me that this is an anomaly but I have no idea where it could come from.
Edit: link to google colab notebook

Is there a way to assign probabilities to samples in a random number generator?

I have a financial dataset with monthly aggregates. I know the real world average for each measure.
I am trying to build some dummy transactional data using Python. I do not want the dummy transactional data to be entirely random. I want to model it around the real world averages I have.
E.g., if from the real data the monthly total profit is $1000 and the total transactions are 5, then the average profit per transaction is $200.
I want to create dummy transactions that are modelled around this real world average of $200.
This is how I did it :
import pandas as pd
from random import gauss

bucket = []
for _ in range(5):
    bucket.append(int(gauss(200, 50)))
transactions = pd.DataFrame({'Amount': bucket})
Now, the challenge for me is that I have to randomize the identifiers too.
For example, I know for a fact that there are three buyers in total. Let's call them A, B and C.
These three have done those 5 transactions, and I want to randomly assign a buyer to each transaction when I create the dummy transactional data. However, I also know that A is very likely to do a lot more transactions than B and C. To make my dummy data close to real-life scenarios, I want to assign probabilities to the occurrence of these buyers in my dummy transactional data.
Let's say I want it like this:
A : 60% appearance
B : 20% appearance
C : 20% appearance
How can I achieve this?
What you are asking for is a weighted random choice: each draw should pick A with 60% probability. One way is to take a dict as input that holds each user's buying probability, expand it into a list whose composition reflects those proportions, and then randomly pick a buyer from the list. Something like below:
import random

# Buy percentages of the users
buy_percentage = {'A': 0.6, 'B': 0.2, 'C': 0.2}
# Number of purchases
base = 100

# Build a list whose composition matches the desired percentages
buy_list = []
for buyer, percentage in buy_percentage.items():
    buy_list.extend([buyer] * int(percentage * base))

for _ in range(base):
    # Randomly picks a buyer; the list's makeup maintains your ratio
    buyer = random.choice(buy_list)
    # your code to get the buying price goes below
UPDATE:
Alternatively, the approach in the link below can be used; in my opinion it is the better solution.
A weighted version of random.choice
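For reference, the standard library can do the weighted pick directly; a minimal sketch:

import random

# random.choices (Python 3.6+) accepts per-item weights
buyers = random.choices(['A', 'B', 'C'], weights=[0.6, 0.2, 0.2], k=5)
print(buyers)  # e.g. ['A', 'A', 'C', 'A', 'B']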

Predicting user ratings

I have a prediction problem that I am working on and need some help on how to approach it. I have a CSV with two columns, user_id and ratings, where a user is giving a rating on something in the ratings column. A user can repeat in the user_id column with different unique ratings. For example:
user_id rating
1 5
4 6
1 6
7 6
2 7
4 7
Now the prediction data set has users who have already given previous ratings, similar to the one above:
user_id rating
11 6
12 10
13 8
13 9
14 4
14 5
Goal is to predict what these specific users will rate the next time. Secondly, let's say we add a user '15' with no rating history; how can we predict the first two ratings that user will provide, in order?
I'm not sure how to train a model, with just user_id and ratings, which also happens to be the target column. Any help is appreciated!
First and foremost, you should include what the user is rating, i.e., the category. For example, in a movie rating system, knowing that for movie A (an action movie) the user gives a rating of 1 tells you the user hates action, and that for movie B (a comedy) the user gives a 9 tells you the user loves comedy. The next time a movie of a similar category comes along, you can predict the user's rating easily. You can include many movie categories (thriller, romance, drama, etc.) and even additional features such as movie length, leading actor, director, and language, since all of these broadly govern a user's rating.
If you do not provide the basis on which the user is giving ratings, prediction is very hard and of little use. For example, if my past ratings are 1, 5, 2, 6, 8, 1, 9, 3, 4, 10, can you predict my next rating? No, because it looks like a random generator between 0 and 10. In the movie case, however, where my past ratings clearly show that I love comedy and hate action, you can easily predict my rating when a new comedy movie comes out.
Still, if this is all the data you have, you can use simple statistical methods: take the mean of a user's past ratings and round to the nearest integer, or take the mode.
You could also plot each user's ratings and look for a pattern, e.g., the rating first increases, peaks, decreases to a minimum, and then increases again (believe me, this is going to be very impractical given your constraints), and predict on that basis.
The best of these options is a simple statistical model that gives a high weight to the last rating, a lesser weight to the second-to-last, and so on, e.g.:
predict_rating = w1*(last_rating) + w2*(second_last_rating) + w3*(third_last_rating) + ...
with the weights normalized so they sum to 1. This will give you very good results, and it is indeed machine learning: the algorithm that finds the best-suited weights here is multivariate linear regression, and it is the best fit for the given constraints.
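A minimal sketch of that weighted-recency prediction (the weights here are illustrative assumptions, not fitted values):

import numpy as np

def predict_rating(past_ratings, weights=(0.5, 0.3, 0.2)):
    """Weighted average of the most recent ratings, newest weighted highest."""
    recent = list(past_ratings)[-len(weights):][::-1]  # newest first
    w = np.array(weights[:len(recent)])
    return float(np.dot(w / w.sum(), recent))

# A user whose last three ratings were 8, 9, 10 (oldest to newest):
print(predict_rating([8, 9, 10]))  # 0.5*10 + 0.3*9 + 0.2*8 = 9.3

Fitting w1, w2, w3 by regressing each rating on the user's previous ratings is the linear-regression version described above.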

Collapsing Categorical Feature Levels in Python StatsModels OLS output

I am trying to create a multiple linear regression model to predict the rating a guest gives to a hotel (Reviewer_Score) in Python using statsmodels.
Review_Total_Negative_Word_Counts is how long their negative comments about the hotel are
Total_Number_of_Reviews is how many reviews the hotel has
Review_Total_Positive_Word_Counts is how long their positive comments about the hotel are
Total_Number_of_Reviews_Reviewer_Has_Given is how many reviews the guest has given on the site
Attitude is a categorical variable: GOOD or BAD
Reason is reason for visit (Leisure or Business)
Continent is the continent which the guest came from (multiple levels)
Solo is whether the traveler is a solo traveler ('Yes' or 'No')
Season is during which season the guest stayed at the hotel ('Fall', 'Winter', 'Summer', 'Spring')
As you can see, I have some numeric and also categorical features.
My code so far is:
import statsmodels.formula.api as smf
lm = smf.ols(formula = 'Reviewer_Score ~ Review_Total_Negative_Word_Counts + Total_Number_of_Reviews + Review_Total_Positive_Word_Counts + Total_Number_of_Reviews_Reviewer_Has_Given + Attitude + Reason + Continent + Solo + Season', data = Hotel).fit()
lm.params
lm.summary()
My issue is that when I look at the parameters (slope and intercept estimates) and P-values, the output includes a separate slope and p-value for every level of each categorical feature. I just want an output that shows one slope and p-value per feature, numeric or categorical (NOT one for each level of the categorical features!).
Essentially, I want the slope output to look like:
Intercept
Total_Number_of_Reviews
Review_Total_Positive_Word_Counts
Total_Number_of_Reviews_Reviewer_Has_Given
Attitude
Reason
Continent
Solo
Season
How would I do something like this to collapse the levels and just show the significance and slope value for each of the variables?
Right now, each of your original inputs to your model is being converted into dummy variables.*
The reason this clashes with your expectations, I suspect, is that you have three types of variables you call categorical in your model:
Temporal ("Season")
Binary ("Attitude", "Reason", "Solo")
Categorical ("Continent")
Only Continent is truly non-binary categorical, as there is no way to order the continents in a hierarchy without any further information. For "Season" the model/program has no indication that there are only four seasons, or that they occur in a temporal order. With the binary variables, it similarly doesn't know that there are only two possible values.
I recommend converting binary variables to 1, 0, or NaN (you could first use a lambda function, followed by pd.fillna()).
For "Season" specifically, it sounds you want something more akin to "time of year, indicated by season/quarter." I'd map the seasons to 1,2,3 or 4.
For the "Continent" you could rank the continents by how many reviews you have from each, and convert each continent to its respective rank... but you'd be regressing on something more akin to a blend of "continent" + "population from originating continent." (This, of course, may be useful to do anyways). Or, you could keep the dummy variable encoding that was already utilized.
Alternatively, you could come up with a random mapping for the continent, but include some indicator of the relative population from each continent in addition.
*To make this explicit, you can use pd.get_dummies()
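Pulling those suggestions together, a sketch of the encodings (the category labels come from the question's description; the rank mapping for Continent is one possible choice):

import pandas as pd

# Binary variables -> 1/0 (values outside the mapping become NaN)
Hotel['Attitude'] = Hotel['Attitude'].map({'GOOD': 1, 'BAD': 0})
Hotel['Solo'] = Hotel['Solo'].map({'Yes': 1, 'No': 0})
Hotel['Reason'] = Hotel['Reason'].map({'Leisure': 1, 'Business': 0})

# Season -> time of year
Hotel['Season'] = Hotel['Season'].map(
    {'Spring': 1, 'Summer': 2, 'Fall': 3, 'Winter': 4})

# Continent -> rank by number of reviews from each continent
counts = Hotel['Continent'].value_counts()
Hotel['Continent'] = Hotel['Continent'].map(
    counts.rank(method='dense', ascending=False).astype(int))

With each of these now numeric, the formula in the question yields a single slope and p-value per feature.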
