Python: calculating the probability a point will conform to curve

Have a situation where I am given a total ticket count, and cumulative ticket sale data as follows:
Total Tickets Available: 300
Day 1: 15 tickets sold to date
Day 2: 20 tickets sold to date
Day 3: 25 tickets sold to date
Day 4: 30 tickets sold to date
Day 5: 46 tickets sold to date
The number of tickets sold is nonlinear, and I'm asked if someone plans to buy a ticket on Day 23, what is the probability he will get a ticket?
I've been looking at quite a few libraries used for curve fitting, like numpy, PyLab, and sage, but I've been a bit overwhelmed since statistics is not in my background. How would I easily calculate a probability given this set of data? If it helps, I also have ticket sale data from other locations, though the curve should be somewhat different.

The best answer to this question would require more information about the problem: are people more or less likely to buy a ticket as the date approaches (and by how much)? Are there advertising events that will transiently affect the rate of sales? And so on.
We don't have access to that information, though, so let's just assume, as a first approximation, that the rate of ticket sales is constant. Since sales occur basically at random, they might be best modeled as a Poisson process. Note that this does not account for the fact that many people will buy more than one ticket, but I don't think that will make much difference to the results; perhaps a real statistician could chime in here. Also: I'm going to discuss the constant-rate Poisson process here, but since you mentioned the rate is decidedly NOT constant, you could look into variable-rate Poisson processes as a next step.
To model a Poisson process, all you need is the average rate of ticket sales. In your example data, sales-per-day are [15, 5, 5, 5, 16], so the average rate is about 9.2 tickets per day. We've already sold 46 tickets, so there are 254 remaining.
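For reference, here is a small sketch of how those numbers fall out of the cumulative totals:
import numpy as np

cumulative = np.array([15, 20, 25, 30, 46])   # tickets sold to date, days 1-5
daily_sales = np.diff(cumulative, prepend=0)  # [15, 5, 5, 5, 16] tickets sold each day
rate = daily_sales.mean()                     # 9.2 tickets per day on average
remaining = 300 - cumulative[-1]              # 254 tickets still available
print(rate, remaining)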
From here, it is simple to ask, "Given a rate of 9.2 tpd, what is the probability of selling fewer than 254 tickets in 23 days?" (ignoring the fact that you can't sell more than 300 tickets). The way to calculate this is with a cumulative distribution function (see here for the CDF of a Poisson distribution).
On average, we would expect to sell 23 * 9.2 = 211.6 tickets after 23 days, so in the language of probability distributions, the expectation value is 211.6. The CDF tells us, "given an expectation value λ, what is the probability of seeing a value <= x". You can do the math yourself or ask scipy to do it for you:
>>> import scipy.stats
>>> scipy.stats.poisson(9.2 * 23).cdf(254-1)
0.99747286634158705
So this tells us: IF ticket sales can be accurately represented as a Poisson process and IF the average rate of ticket sales really is 9.2 tpd, then the probability of at least one ticket being available after 23 more days is 99.7%.
Now let's say someone wants to bring a group of 50 friends and wants to know the probability of getting all 50 tickets if they buy them in 25 days (rephrase the question as "If we expect on average to sell 9.2 * 25 tickets, what is the probability of selling <= (254-50) tickets?"):
>>> scipy.stats.poisson(9.2 * 25).cdf(254-50)
0.044301801145630537
So the probability of having 50 tickets available after 25 days is about 4%.

Related

How to solve a constraint problem where variables must be one of a discrete set of values in OR-Tools?

My problem looks like this: a movie theatre is showing a set of n films over a 5-day period. Each movie has a corresponding IMDb score. I want to watch one movie per day over the 5-day period, maximising the cumulative IMDb score whilst making sure that I watch the best movies first (i.e. Monday's movie will have a higher score than Tuesday's movie, Tuesday's higher than Wednesday's, etc.). An extra constraint is that the theatre doesn't show every movie every day. For example:
Showings:
Monday showings: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
Tuesday showings: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher
Wednesday showings: The Hobbit, A Star is Born, Joker
etc.
Scores:
Sleepless in Seattle: 7.0
Multiplicity: 10
Jaws: 9.2
The Hobbit: 8.9
A Star is Born: 6.2
Joker: 5.8
Kramer vs Kramer: 8.7
etc.
The way I've thought about this is that each day is a variable a, b, c, d, e and we maximise (a + b + c + d + e), where each variable represents a day of the week. To make sure I watch the movies in descending order of IMDb rank, I would add the constraint a > b > c > d > e. However, as far as I can tell, with the linear solver you cannot restrict a variable to a set of discrete values, only to a continuous range (in an ideal world the problem would look like: solve for a, b, c, d, e maximising their cumulative sum, while ensuring a > b > c > d > e, where a can take this set of possible values, b that set of possible values, and so on). I'm wondering if someone can point me in the right direction as to which OR-Tools solver (or another library) would be best for this problem?
I tried to use the GLOP linear solver to solve this problem but failed: I was expecting it to solve for a, b, c, d, and e, but I couldn't write the necessary constraints with this paradigm.
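For what it's worth, here is a minimal sketch of how this could be modeled with OR-Tools' CP-SAT solver, which (unlike GLOP) supports integer variables over arbitrary discrete domains. The day-to-score mapping is taken from the example above, scores are scaled by 10 so they are integers, and Jack Reacher is omitted because its score isn't given:
from ortools.sat.python import cp_model

# Scores scaled by 10 so they are integers (CP-SAT only handles integers).
showings = {
    'Mon': [70, 100, 92, 89],  # Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
    'Tue': [70, 87],           # Sleepless in Seattle, Kramer vs Kramer
    'Wed': [89, 62, 58],       # The Hobbit, A Star is Born, Joker
}

model = cp_model.CpModel()

# One integer variable per day, restricted to that day's available scores.
day_vars = [
    model.NewIntVarFromDomain(cp_model.Domain.FromValues(scores), day)
    for day, scores in showings.items()
]

# Watch the best movies first: strictly decreasing scores across days.
for earlier, later in zip(day_vars, day_vars[1:]):
    model.Add(earlier > later)

model.Maximize(sum(day_vars))

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(v) / 10 for v in day_vars])  # [10.0, 8.7, 6.2] for this data
This only chooses the scores; mapping each chosen score back to a title, and forbidding watching the same movie on two days, would need extra bookkeeping (e.g. Boolean pick variables per movie per day), but CP-SAT handles the discrete-domain part that GLOP cannot.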

Scipy - adjust National Team means based on sample size

I have a population of national teams (32), and a parameter (a mean) that I want to measure for each team, aggregated per match.
For example: I get the mean scouts for all strikers of each team, per match, and then I take the mean (or median) over all of the team's matches.
Now, one group of teams have played 18 matches and another group has played only 8 matches for the World Cup Qualifying.
I have a hypothesis that, for two teams with equal mean value, the one with the larger sample size (18) should be ranked higher.
less_than_8 = all_stats[all_stats['games']<=8]
I get values:
3 0.610759
7 0.579832
14 0.537579
20 0.346510
25 0.403606
27 0.536443
and with:
sns.displot(less_than_8, x="avg_attack",kind='kde',bw_adjust=2)
The resulting plot (not shown here) has a mean of 0.5024547681196802.
Now, for:
more_than_18 = all_stats[all_stats['games']>=18]
I get values:
0 0.148860
1 0.330585
4 0.097578
6 0.518595
8 0.220798
11 0.200142
12 0.297721
15 0.256037
17 0.195157
18 0.176994
19 0.267094
21 0.295228
22 0.248932
23 0.420940
24 0.148860
28 0.297721
30 0.350516
31 0.205128
and plotting the curve (not shown here) gives a lower mean of 0.25982701104003497.
It seems clear that sample size does affect the mean, diminishing it as size increases.
Is there a way I can adjust the means of larger sample size AS IF they were being calculated on a smaller sample size, or vice versa, using prior and posteriori assumptions?
NOTE. I have std for all teams.
There is a proposed solution for a similar problem, using empirical Bayes estimation and a beta distribution, which can be seen here: Understanding empirical Bayes estimation (using baseball statistics), but I'm not sure how prior means could be extrapolated from successful attempts.
Sample size does affect the mean, but it's not as simple as the mean increasing or decreasing as the sample size grows. Rather, as the sample size increases, the sample mean and standard deviation get closer and closer to the population mean μ and standard deviation σ.
I cannot give an exact proposal without more information, such as how many data points there are per team and per match, and what the standard deviations of these values are. But just looking at the details, I have to presume the 6 teams that qualified with only 8 matches somehow smashed whatever stat you have measured. (Probably this is why they only played 8 matches?)
I can make a few simple proposals based on the fact that you want to rank these teams:
Proposal 1:
Extend these stats and calculate a population mean and std for a season. (If you have prior seasons, use them as well.)
Use this mean value to rank teams (without any sample adjustments). This would likely result in the 6 teams landing on top.
Proposal 2:
Calculate a per-game mean across all teams (call it mean_gt), i.e. a mean for game 01, a mean for game 02, ... or a mean per week (Week 01, Week 02, ...). I recommend grouping by week, since the 6 teams will only have 8 games and grouping by game number would bias data points at the beginning or end.
Plot mean_gt and compare each team's mean for a given week with mean_gt (call this difference diff_gt).
diff_gt gives a better perspective of a team's performance in each week, so you can take the mean of this value to rank teams (a rough pandas sketch of these steps follows below).
When filling in data points for the 6 teams with only 8 matches, I suggest using the population mean rather than extrapolating, to keep things simple.
But it's possible to get creative, for example by also using the difference of aggregate totals for the 32 teams, something like [32 * mean_gt_of_week_1 - total of the (32 - x) teams] / x.
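A rough pandas sketch of these two steps (the team/week columns and the numbers are placeholders; avg_attack is the stat column from the question):
import pandas as pd

# Hypothetical long-format data: one row per team per week.
weekly = pd.DataFrame({
    'team':       ['A', 'A', 'B', 'B', 'C', 'C'],
    'week':       [1, 2, 1, 2, 1, 2],
    'avg_attack': [0.61, 0.58, 0.15, 0.33, 0.30, 0.25],
})

# mean_gt: the across-team mean for each week
weekly['mean_gt'] = weekly.groupby('week')['avg_attack'].transform('mean')

# diff_gt: how far each team sits above or below that week's across-team mean
weekly['diff_gt'] = weekly['avg_attack'] - weekly['mean_gt']

# Rank teams by their average diff_gt (higher = consistently above the field)
ranking = weekly.groupby('team')['diff_gt'].mean().sort_values(ascending=False)
print(ranking)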
I do have another idea, but I'd rather wait for feedback, as I'm already well beyond a simple solution for adjusting a sample mean. :)

How to find the maximum affordable amount of bitcoin/cryptocurrencies?

I am currently building a trading bot for cryptocurrencies based on time series analysis in Python. While working on defining the buy and sell signals, I am confronted with the issue of finding the maximum affordable amount of coins to buy with a given stock of cash, such that the cash will not go negative. For simplicity, we can assume that the minimum purchasable amount is 0.0001 coins, i.e. each lot costs 0.0001 of the current crypto price. How can I implement this in Python to find the maximum number of 0.0001-coin units I can buy with a given cash stock, so that the cash is used as fully as possible without going negative?
It sounds like you're looking for ceiling and floor math functions.
https://docs.python.org/3/library/math.html
For example, if you have 10 dollars and each share costs $3, you can only buy 3 shares.
or in code:
import math
print(math.floor(10 / 3))  # prints 3
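Applying the same idea to the original question (a sketch only; the helper name and example prices are made up): divide the cash by the price of one 0.0001-coin lot, floor it, and multiply back.
import math

def max_affordable_coins(cash, price_per_coin, min_unit=0.0001):
    # Whole number of 0.0001-coin lots we can pay for without going negative.
    lots = math.floor(cash / (price_per_coin * min_unit))
    return lots * min_unit

# e.g. $500 of cash at a coin price of $27,000 buys 185 lots, i.e. 0.0185 coins
print(max_affordable_coins(500.0, 27000.0))
For real money you may prefer decimal.Decimal over floats to avoid rounding surprises.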

What kind of machine learning problem is it if we have to predict a customer's next spend category in Python?

I have a data set of shape -> (6210782, 5).
This has 200,000 unique customers and their transactions at different outlets. The time series spans a little over a year.
df.head()
customer_id TransactionDate TransactionTime Amount OutletCategory
514 22-04-2015 19:42:18 9445 M16
514 23-04-2015 16:29:28 2000 M23
514 02-05-2015 15:17:55 1398 M16
514 27-06-2015 13:51:29 1995 M7
514 07-08-2015 17:31:30 2000 M23
What kind of machine learning problem is this, and what approach and algorithms should be used to solve the following tasks?
1) Predict the customer's next transaction category?
(I am thinking of this as multinomial classification)
2) Predict the customer's next transaction category in the next 6 hrs?
3) Predict the customer's next transaction amount?
(Is this an LSTM task?)
4) Predict the customer's next transaction amount in the next 6 hrs?
As we have 200,000 unique customers, how should I prepare the data if I have to predict the next transaction amount? Should I pivot the customers to columns?
Data / time-series exploration that may help visualize the data (charts not shown here):
Transaction amount with respect to categories over the time series.
For the charts below I created a small data set with "Datetime" as the index and an "Amount" column, to understand transactional behaviour with respect to time.
Amount spent by transaction date.
Amount spent by week of transaction dates.
Mean amount spent within a day (hourly).
Expectations:
I am new to data science and Python, so I am just looking for the right steps to proceed with the task (I will manage the code myself).
There will never be an exactly right answer to this kind of problem.
To your problems:
Everything related to the 6-hour horizon (tasks 2 and 4) looks like a time-series problem. That can be handled with, e.g., ARIMA models.
3) is a regression: you basically have to predict an amount that can take a wide range of values. A starting point could be linear regression, but there are other algorithms for that as well.
1) should be a multiclass classification problem; for this you could use, e.g., a decision tree.
In general:
To give you more ideas: scikit-learn (https://scikit-learn.org/stable/) can be a good starting point for you.
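As an illustration only (not from the answer above), here is a minimal sketch of framing task 1 as multiclass classification with scikit-learn; the file name and the very simple feature set are placeholders:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 'transactions.csv' is a placeholder for the data shown in df.head() above.
df = pd.read_csv('transactions.csv')
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'], format='%d-%m-%Y')
df = df.sort_values(['customer_id', 'TransactionDate', 'TransactionTime'])

# Target: the category of the *next* transaction by the same customer.
df['next_category'] = df.groupby('customer_id')['OutletCategory'].shift(-1)
df = df.dropna(subset=['next_category'])

# Deliberately simple features: current category (one-hot) and amount.
X = pd.get_dummies(df[['OutletCategory']]).join(df[['Amount']])
y = df['next_category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=8, random_state=0)
clf.fit(X_train, y_train)
print('held-out accuracy:', clf.score(X_test, y_test))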

Calculating interest rates in numpy

Hopefully this is a quick and easy question that is not a repeat. I am looking for a built in numpy function (though it could also be a part of another library) which, given an original loan amount, monthly payment amount, and number of payments, can calculate the interest rate. I see numpy has the following function:
np.pmt(Interest_Rate/Payments_Year, Years*Payments_Year, Principal)
This is close to what I am looking for but would instead prefer a function which would provide the interest rate when given the three parameters I listed above. Thank you.
You want the rate function. It used to ship as numpy.rate, but it now lives in the numpy_financial package as numpy_financial.rate.
An example usage: suppose I'm making 10 monthly payments of $200 to pay off an initial loan of $1500. Then the monthly interest rate is approximately 5.6%:
>>> import numpy_financial as npf
>>> monthly_payment = 200.0
>>> number_of_payments = 10
>>> initial_loan_amount = 1500.0
>>> npf.rate(number_of_payments, -monthly_payment, initial_loan_amount, 0.0)
0.056044636451588969
Note the sign convention here: the payment is negative (it's money leaving my account), while the initial loan amount is positive.
You should also take a look at the when parameter: depending on whether interest is accrued after each payment or before, you'll want to select the value of when accordingly. The above example models the situation where the first round of interest is added before the first payment is made (when='end'). If instead the payment is made at the beginning of each month, and interest accrued at the end of the month (when='begin'), the effective interest rate ends up higher, a touch over 7%.
>>> npf.rate(number_of_payments, -monthly_payment, initial_loan_amount, 0.0, when='begin')
0.070550580696092852
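As a quick sanity check (my own sketch, not part of the original answer), plugging the solved rate back into the payment function should recover the $200 monthly payment:
import numpy_financial as npf

r = npf.rate(10, -200.0, 1500.0, 0.0)  # monthly rate solved above, about 0.056
print(npf.pmt(r, 10, 1500.0))          # roughly -200.0: the payment that clears the loan over 10 periods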
