Scipy - adjust National Team means based on sample size - python

I have a population of National Teams (32), and a parameter (a mean) that I want to measure for each team, aggregated per match.
For example: I get the mean scout score for all strikers of each team, per match, and then I take the mean (or median) over all of that team's matches.
Now, one group of teams has played 18 matches and another group has played only 8 matches in World Cup qualifying.
My hypothesis is that, for two teams with an equal mean value, the one with the larger sample size (18) should be ranked higher.
less_than_8 = all_stats[all_stats['games']<=8]
I get values:
3 0.610759
7 0.579832
14 0.537579
20 0.346510
25 0.403606
27 0.536443
and with:
sns.displot(less_than_8, x="avg_attack",kind='kde',bw_adjust=2)
I plot the density curve, which has a mean of 0.5024547681196802.
Now, for:
more_than_18 = all_stats[all_stats['games']>=18]
I get values:
0 0.148860
1 0.330585
4 0.097578
6 0.518595
8 0.220798
11 0.200142
12 0.297721
15 0.256037
17 0.195157
18 0.176994
19 0.267094
21 0.295228
22 0.248932
23 0.420940
24 0.148860
28 0.297721
30 0.350516
31 0.205128
and the plotted curve has a lower mean of 0.25982701104003497.
It seems clear that sample size does affect the mean, diminishing it as size increases.
Is there a way I can adjust the means of the larger-sample teams AS IF they had been calculated on a smaller sample size, or vice versa, using prior and posterior assumptions?
NOTE: I have the std for all teams.
There is a proposed solution for a similar problem, using Empirical Bayes estimation with a beta distribution, which can be seen here: Understanding empirical Bayes estimation (using baseball statistics), but I'm not sure how the prior means could be extrapolated from successful attempts.

Sample size does affect the mean, but it's not as simple as the mean going up or down whenever the sample size increases. Rather, as the sample grows, the sample mean and standard deviation get closer and closer to the population mean μ and standard deviation σ.
I cannot give an exact proposal without more information, such as how many data points there are per team and per match, and what the standard deviations of these values are. But just looking at the details, I have to presume the 6 teams that qualified with only 8 matches somehow smashed whatever stat you are measuring. (Probably this is why they only played 8 matches?)
I can make a few simple proposals based on the fact that you want to rank these teams:
Proposal 1:
Extend these stats and calculate a population mean and std for a whole season. (If you have prior seasons, use them as well.)
Use this mean value to rank the teams (without any sample-size adjustment) - this would likely put the 6 teams on top.
Proposal 2:
Calculate a per-game mean across all teams (call it mean_gt) [the mean for game 01, game 02, ... or the mean for Week 01, Week 02, ...] - I recommend basing it on weeks, since the 6 teams only have 8 games and a per-game index would bias the data points at the beginning or end.
Plot mean_gt and compare each team's mean for a given week with mean_gt [call this difference diff_gt].
diff_gt gives a better perspective on a team's performance in each week, so you can take the mean of this value to rank the teams (a rough sketch is given below).
When filling in the data points for the 6 teams with only 8 matches, I suggest using the population mean rather than extrapolating, to keep things simple.
But it's possible to get creative, for example by also using the difference from the aggregate total of all 32 teams, like [32*mean_gt_of_week_1 - total of the other [32-x] teams]/x.
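A minimal pandas sketch of this ranking procedure, assuming a hypothetical long-format table all_weeks with columns 'team', 'week' and 'avg_attack' (one row per team per week; these names are illustrative, not taken from your data):

import pandas as pd

# Hypothetical long-format data: one row per team per week.
# all_weeks = pd.DataFrame(columns=['team', 'week', 'avg_attack'])

# Cross-team mean of the stat for each week (mean_gt above)
mean_gt = all_weeks.groupby('week')['avg_attack'].transform('mean')

# Each team's deviation from that week's cross-team mean (diff_gt above)
all_weeks['diff_gt'] = all_weeks['avg_attack'] - mean_gt

# Rank teams by the mean of their weekly deviations
ranking = all_weeks.groupby('team')['diff_gt'].mean().sort_values(ascending=False)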
I do have another idea, but I'd rather wait for feedback, since I'm already far from the simple solution of adjusting a sample mean. :)

Related

How to solve constraint problem where variables must be one of a discrete set of values in OR-Tools?

My problem looks like this: a movie theatre is showing a set of n films over a 5-day period. Each movie has a corresponding IMDb score. I want to watch one movie per day over the 5-day period, maximising the cumulative IMDb score while making sure that I watch the best movies first (i.e. Monday's movie will have a higher score than Tuesday's movie, Tuesday's higher than Wednesday's, etc.). An extra constraint is that the theatre doesn't show every movie every day. For example:
Showings:
Monday showings: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
Tuesday showings: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher
Wednesday showings: The Hobbit, A Star is Born, Joker
etc.
Scores:
Sleepless in Seattle: 7.0
Multiplicity: 10
Jaws: 9.2
The Hobbit: 8.9
A Star is Born: 6.2
Joker: 5.8
Kramer vs Kramer: 8.7
etc.
The way I've thought about this is that each day is a variable: a, b, c, d, e, and we are maximising (a+b+c+d+e). To make sure I watch the movies in descending order of IMDb score, I would add the constraint a > b > c > d > e. However, as far as I can tell, with the linear solver you cannot restrict a variable to a discrete set of values, only to a continuous range (in an ideal world the problem would read: solve for a, b, c, d, e, maximising their cumulative sum while ensuring a > b > c > d > e, where a can take one of this set of possible values, b one of that set, etc.). I'm wondering if someone can point me towards the OR-Tools solver (or another library) best suited to this problem?
I tried to use the GLOP linear solver, but failed: I expected it to solve for a, b, c, d and e, but I couldn't express the necessary constraints in that paradigm.
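One possible direction is OR-Tools' CP-SAT solver, which works with integer variables over arbitrary finite domains rather than continuous ranges. A minimal sketch of such a model for the example above (scores are scaled to integers because CP-SAT only handles integers, and Jack Reacher is left out since no score is listed for it):

from ortools.sat.python import cp_model

# Example data from the question, scores multiplied by 10 to make them integers.
scores = {'Sleepless in Seattle': 70, 'Multiplicity': 100, 'Jaws': 92,
          'The Hobbit': 89, 'A Star is Born': 62, 'Joker': 58,
          'Kramer vs Kramer': 87}
showings = [['Sleepless in Seattle', 'Multiplicity', 'Jaws', 'The Hobbit'],  # Monday
            ['Sleepless in Seattle', 'Kramer vs Kramer'],                    # Tuesday
            ['The Hobbit', 'A Star is Born', 'Joker']]                       # Wednesday

model = cp_model.CpModel()

# One variable per day whose domain is exactly the set of scores shown that day.
day_vars = [model.NewIntVarFromDomain(
                cp_model.Domain.FromValues([scores[m] for m in day]),
                'day_%d' % d)
            for d, day in enumerate(showings)]

# Watch the best movies first: each day's score must beat the next day's.
for earlier, later in zip(day_vars, day_vars[1:]):
    model.Add(earlier > later)

model.Maximize(sum(day_vars))

solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    for d, (day, var) in enumerate(zip(showings, day_vars)):
        # Map the chosen score back to a title (scores are distinct within a day here;
        # with ties you would add explicit Boolean "pick" variables instead).
        chosen = [m for m in day if scores[m] == solver.Value(var)]
        print('Day %d: %s (score %d)' % (d + 1, chosen[0], solver.Value(var)))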

Pandas series function that shows the probability of the up and down moves of the stock price

Days  Adjusted stock price ('price' column)
0     100
1     50
2     200
3     210
4     220
5     34
6     35
7     36
8     89
Assuming this table is a pandas dataframe, can someone help me write a function that shows the probability of the up and down moves of the stock price? For example, what is the probability of the stock price having two up days in a row?
Thanks - I am new to Python and I have been trying to figure this out for a while!
Actual stock price movement prediction is both a broad and a deep subject usually associated with time series analysis which I would consider out of the scope of this question.
However, the naive approach would be to assume the Bernoulli model where each price move is considered independent both of any previous moves and of time.
In this case, the probability of the price moving up can be inferred by measuring all the up moves against all moves recorded.
# df is a single-column pandas DataFrame storing the price
((df['price'] - df['price'].shift(1)) > 0).sum()/(len(df) - 1)
which for the data you posted gives 0.75.
Given the above, the probability of the price going up on two consecutive days would be 0.75*0.75, which is approximately 0.56.
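If you want to compare that independence-based estimate against the sample itself, here is a small sketch (the hard-coded prices are the ones you posted):

import pandas as pd

df = pd.DataFrame({'price': [100, 50, 200, 210, 220, 34, 35, 36, 89]})

# True wherever the price went up compared to the previous day
up = df['price'].diff().dropna() > 0

p_up = up.mean()                    # 0.75
p_two_up_bernoulli = p_up ** 2      # ~0.5625, the independence estimate

# Empirical frequency of two up days in a row among consecutive pairs of moves
pairs_up = (up & up.shift(1, fill_value=False)).sum()
p_two_up_observed = pairs_up / (len(up) - 1)   # 4/7 ~ 0.57 for this data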

Estimating price with Linear Regression

I'm posting here because I couldn't find a solution to my problem anywhere else. We are learning linear regression in Python at school, and the professor wants us to estimate the price of each ingredient in a sandwich, as well as the fixed profit on each sandwich, based on a CSV table. So far we have only worked with one X variable and one Y variable, so I'm pretty confused about what I should do here. Thank you. Here is the table:
tomato,lettuce,cheese,pickles,palmetto,burger,corn,ham,price
0.05,1,0.05,0,0.05,0.2,0.05,0,18.4
0.05,0,0.05,0.05,0,0.2,0.05,0.05,16.15
0.05,1,0.05,0,0.05,0.4,0,0,22.15
0.05,1,0.05,0,0.05,0.2,0.05,0.05,19.4
0.05,1,0,0,0,0.2,0.05,0.05,18.4
0,0,0.05,0,0,0,0.05,0.05,11.75
0.05,1,0,0,0,0.2,0,0.05,18.15
0.05,1,0.05,0.05,0.05,0.2,0.05,0,18.65
0,0,0.05,0,0,0.2,0.05,0.05,15.75
0.05,1,0.05,0,0.05,0,0.05,0.05,15.4
0.05,1,0,0,0,0.2,0,0,17.15
0.05,1,0,0,0.05,0.2,0.05,0.05,18.9
0,1,0.05,0,0,0.2,0.05,0.05,18.75
You have 9 separate variables for regression (tomato ... price), and 13 samples for each of them (the 13 lines).
So the first approach could be doing a regression for "tomato" on the data points 0.05, 0.05, 0.05, 0.05, 0.05, 0, 0.05, 0.05, 0, 0.05, 0.05, 0.05, 0,
then doing another one for "lettuce" and the others, up to "price" with 18.4, 16.15, 22.15, 19.4, 18.4, 11.75, 18.15, 18.65, 15.75, 15.4, 17.15, 18.9, 18.75.
Online viewer for looking at your CSV data: http://www.convertcsv.com/csv-viewer-editor.htm, but Google SpreadSheet, Excel, etc. can display it nicely too.
SciPy can most likely do the task on whole vectors too (handling the 9 variables together), but the limitation of having only 13 samples in the 13 rows remains.
EDIT: bad news - I was tired and did not answer the full question, sorry about that.
While it is true that you can take the first 8 columns (tomato...ham) as individual series and run separate regressions on them (which is probably the first part of this assignment), the last column (price) is expected to be estimated from the first 8.
Using the notation in Wikipedia, https://en.wikipedia.org/wiki/Linear_regression#Introduction, your y vector is the last column (the prices), the X matrix is the first 8 columns of your data (tomato...ham), extended with a column of 1-s somewhere.
Then pick an estimation method (some are listed on that page too, https://en.wikipedia.org/wiki/Linear_regression#Estimation_methods, but you may want to pick one you have learned about in class). The actual math is there, and NumPy can do the matrix/vector calculations. If you go for "Ordinary least squares", numpy.linalg.lstsq does the same (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq - you may find adding that column of 1-s familiar), so it can be used to verify the results.
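A minimal NumPy sketch of that least-squares setup, using the rows from the question (the ingredient names are just the CSV headers; reading the intercept as the fixed profit is the modelling assumption described above):

import numpy as np

# Rows from the CSV in the question: 8 ingredient amounts, then the price.
data = np.array([
    [0.05, 1, 0.05, 0,    0.05, 0.2, 0.05, 0,    18.4 ],
    [0.05, 0, 0.05, 0.05, 0,    0.2, 0.05, 0.05, 16.15],
    [0.05, 1, 0.05, 0,    0.05, 0.4, 0,    0,    22.15],
    [0.05, 1, 0.05, 0,    0.05, 0.2, 0.05, 0.05, 19.4 ],
    [0.05, 1, 0,    0,    0,    0.2, 0.05, 0.05, 18.4 ],
    [0,    0, 0.05, 0,    0,    0,   0.05, 0.05, 11.75],
    [0.05, 1, 0,    0,    0,    0.2, 0,    0.05, 18.15],
    [0.05, 1, 0.05, 0.05, 0.05, 0.2, 0.05, 0,    18.65],
    [0,    0, 0.05, 0,    0,    0.2, 0.05, 0.05, 15.75],
    [0.05, 1, 0.05, 0,    0.05, 0,   0.05, 0.05, 15.4 ],
    [0.05, 1, 0,    0,    0,    0.2, 0,    0,    17.15],
    [0.05, 1, 0,    0,    0.05, 0.2, 0.05, 0.05, 18.9 ],
    [0,    1, 0.05, 0,    0,    0.2, 0.05, 0.05, 18.75],
])

X = np.column_stack([data[:, :8], np.ones(len(data))])  # amounts + column of 1-s
y = data[:, 8]                                           # observed sandwich prices

coef, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ['tomato', 'lettuce', 'cheese', 'pickles', 'palmetto', 'burger', 'corn', 'ham']
for name, price_per_unit in zip(names, coef[:8]):
    print('%s: %.2f' % (name, price_per_unit))
print('fixed profit (intercept): %.2f' % coef[8])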

Exponential Weighted Moving Average using Pandas

I need to confirm a few things related to the pandas exponentially weighted moving average function.
If I have a data set df for which I need to find a 12-day exponential moving average, would the method below be correct?
exp_12 = df.ewm(span=20, min_periods=12, adjust=False).mean()
Given that the data set contains 20 readings, the span (total number of values) should equal 20.
Since I need to find a 12-day moving average, I set min_periods=12.
I interpret span as the total number of values in the data set, or the total time covered.
Can someone confirm whether my interpretation above is correct?
I also can't grasp the significance of adjust.
I've attached the link to pandas.df.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from Pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 data points; pandas takes care of that. min_periods may not be required here.
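A minimal sketch, assuming df holds the 20 readings as in the question:

import pandas as pd

# span=12 sets alpha = 2 / (span + 1) = 2/13, i.e. a "12-day" EMA.
# adjust=False applies the recursive form y_t = (1 - alpha)*y_{t-1} + alpha*x_t
# (what charting tools usually call an EMA); adjust=True (the default) instead
# normalises the decaying weights over all observations seen so far.
exp_12 = df.ewm(span=12, adjust=False).mean()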

Python: calculating the probability a point will conform to curve

I have a situation where I am given a total ticket count and cumulative ticket sale data, as follows:
Total Tickets Available: 300
Day 1: 15 tickets sold to date
Day 2: 20 tickets sold to date
Day 3: 25 tickets sold to date
Day 4: 30 tickets sold to date
Day 5: 46 tickets sold to date
The number of tickets sold is nonlinear, and I'm asked: if someone plans to buy a ticket on Day 23, what is the probability he will get one?
I've been looking at quite a few libraries used for curve fitting, like NumPy, PyLab, and Sage, but I've been a bit overwhelmed since statistics is not in my background. How would I calculate such a probability given this set of data? If it helps, I also have ticket sale data for other locations, where the curve should be somewhat different.
The best answer to this question would require more information about the problem: are people more or less likely to buy a ticket as the date approaches (and by how much)? Are there advertising events that will transiently affect the rate of sales? And so on.
We don't have access to that information, though, so let's just assume, as a first approximation, that the rate of ticket sales is constant. Since sales occur essentially at random, they might be best modeled as a Poisson process. Note that this does not account for the fact that many people will buy more than one ticket, but I don't think that will make much difference to the results; perhaps a real statistician could chime in here. Also: I'm going to discuss the constant-rate Poisson process here, but since you mentioned the rate is decidedly NOT constant, you could look into variable-rate Poisson processes as a next step.
To model a Poisson process, all you need is the average rate of ticket sales. In your example data, sales-per-day are [15, 5, 5, 5, 16], so the average rate is about 9.2 tickets per day. We've already sold 46 tickets, so there are 254 remaining.
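For concreteness, those numbers can be computed directly from the cumulative figures in the question:

import numpy as np

total_tickets = 300
sold_to_date = [15, 20, 25, 30, 46]          # cumulative sales, days 1-5

daily_sales = np.diff([0] + sold_to_date)    # [15, 5, 5, 5, 16]
rate = daily_sales.mean()                    # 9.2 tickets per day
remaining = total_tickets - sold_to_date[-1] # 254 tickets left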
From here, it is simple to ask, "Given a rate of 9.2 tpd, what is the probability of selling fewer than 254 tickets in 23 days?" (ignoring the fact that you can't sell more than 300 tickets). The way to calculate this is with a cumulative distribution function (see here for the CDF of a Poisson distribution).
On average, we would expect to sell 23 * 9.2 = 211.6 tickets after 23 days, so in the language of probability distributions, the expectation value is 211.6. The CDF tells us, "given an expectation value λ, what is the probability of seeing a value <= x". You can do the math yourself or ask scipy to do it for you:
>>> import scipy.stats
>>> scipy.stats.poisson(9.2 * 23).cdf(254-1)
0.99747286634158705
So this tells us: IF ticket sales can be accurately represented as a Poisson process and IF the average rate of ticket sales really is 9.2 tpd, then the probability of at least one ticket being available after 23 more days is 99.7%.
Now let's say someone wants to bring a group of 50 friends and wants to know the probability of getting all 50 tickets if they buy them in 25 days (rephrase the question as "If we expect on average to sell 9.2 * 25 tickets, what is the probability of selling <= (254-50) tickets?"):
>>> scipy.stats.poisson(9.2 * 25).cdf(254-50)
0.044301801145630537
So the probability of having 50 tickets available after 25 days is about 4%.
