My problem looks like this: a movie theatre is showing a set of n films over a 5-day period. Each movie has a corresponding IMDb score. I want to watch one movie per day over the 5-day period, maximising the cumulative IMDb score whilst making sure that I watch the best movies first (i.e. Monday's movie will have a higher score than Tuesday's, Tuesday's higher than Wednesday's, etc.). An extra constraint is that the theatre doesn't show every movie every day. For example:
Showings:
Monday showings: Sleepless in Seattle, Multiplicity, Jaws, The Hobbit
Tuesday showings: Sleepless in Seattle, Kramer vs Kramer, Jack Reacher
Wednesday showings: The Hobbit, A Star is Born, Joker
etc.
Scores:
Sleepless in Seattle: 7.0
Multiplicity: 10
Jaws: 9.2
The Hobbit: 8.9
A Star is Born: 6.2
Joker: 5.8
Kramer vs Kramer: 8.7
etc.
The way I've thought about this is that each day is a variable (a, b, c, d, e) and we are maximising a + b + c + d + e. To make sure that I watch the movies in descending order of IMDb score, I would add the constraint a > b > c > d > e. However, as far as I can tell, the linear solver doesn't let you restrict a variable to a set of discrete values, only to a continuous range. In an ideal world the problem would look like: solve for a, b, c, d, e maximising their cumulative sum, subject to a > b > c > d > e, where each variable is drawn from its own set of possible values (the scores of the movies showing that day). I'm wondering if someone can point me in the right direction as to which OR-Tools solver (or another library) would be best for this problem.
I tried to use the GLOP linear solver but failed: I was expecting it to solve for a, b, c, d, and e, but I couldn't express the necessary constraints in that paradigm.
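To make the intended model concrete, here is a brute-force baseline over the example data (a sketch, not a solver recommendation; it covers only the three listed days, and Jack Reacher is omitted because its score isn't given):

```python
from itertools import product

# 3-day version of the example (Jack Reacher omitted: no score listed)
showings = {
    'Mon': ['Sleepless in Seattle', 'Multiplicity', 'Jaws', 'The Hobbit'],
    'Tue': ['Sleepless in Seattle', 'Kramer vs Kramer'],
    'Wed': ['The Hobbit', 'A Star is Born', 'Joker'],
}
scores = {
    'Sleepless in Seattle': 7.0, 'Multiplicity': 10, 'Jaws': 9.2,
    'The Hobbit': 8.9, 'A Star is Born': 6.2, 'Joker': 5.8,
    'Kramer vs Kramer': 8.7,
}

best = max(
    (combo for combo in product(*showings.values())
     # strictly decreasing scores: Monday's pick beats Tuesday's, and so on
     if all(scores[a] > scores[b] for a, b in zip(combo, combo[1:]))),
    key=lambda combo: sum(scores[m] for m in combo),
)
# best == ('Multiplicity', 'Kramer vs Kramer', 'A Star is Born')
```

For a larger real instance, a constraint solver such as OR-Tools CP-SAT handles exactly this shape of model: integer variables with arbitrary discrete domains (scores scaled to integers), ordering constraints, and a linear objective.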
Day  Adjusted stock price
0 100
1 50
2 200
3 210
4 220
5 34
6 35
7 36
8 89
Assume this table is a pandas DataFrame. Can someone help me out with writing a function that shows the probability of up and down moves of the stock price? For example, what is the probability of the stock price having two up days in a row?
Thanks! I am new to Python and have been trying to figure this out for a while.
Actual stock price movement prediction is both a broad and a deep subject usually associated with time series analysis which I would consider out of the scope of this question.
However, the naive approach would be to assume a Bernoulli model, where each price move is considered independent both of any previous moves and of time.
In this case, the probability of the price moving up can be inferred by measuring all the up moves against all moves recorded.
# df is a single-column pandas DataFrame storing the price
(df['price'].diff() > 0).sum() / (len(df) - 1)
which for the data you posted gives 0.75.
Given the above, the probability of the price going up on two consecutive days would be 0.75 * 0.75 = 0.5625, i.e. approximately 0.56.
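Putting the two steps together on the posted data (a small self-contained sketch; the DataFrame is rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 50, 200, 210, 220, 34, 35, 36, 89]})

up = df['price'].diff() > 0          # True where the price rose vs the previous day
p_up = up.sum() / (len(df) - 1)      # 6 of the 8 moves are up -> 0.75
p_two_up = p_up ** 2                 # independence (Bernoulli) assumption -> 0.5625
```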
I'm posting here because I couldn't find any solution to my problem anywhere else. Basically we are learning linear regression using Python at school, and the professor wants us to estimate the price of each ingredient in a sandwich, as well as the fixed profit on each sandwich, based on a CSV table. So far we have only worked with one X variable and one Y variable, so I'm pretty confused about what I should do here. Thank you. Here is the table:
tomato,lettuce,cheese,pickles,palmetto,burger,corn,ham,price
0.05,1,0.05,0,0.05,0.2,0.05,0,18.4
0.05,0,0.05,0.05,0,0.2,0.05,0.05,16.15
0.05,1,0.05,0,0.05,0.4,0,0,22.15
0.05,1,0.05,0,0.05,0.2,0.05,0.05,19.4
0.05,1,0,0,0,0.2,0.05,0.05,18.4
0,0,0.05,0,0,0,0.05,0.05,11.75
0.05,1,0,0,0,0.2,0,0.05,18.15
0.05,1,0.05,0.05,0.05,0.2,0.05,0,18.65
0,0,0.05,0,0,0.2,0.05,0.05,15.75
0.05,1,0.05,0,0.05,0,0.05,0.05,15.4
0.05,1,0,0,0,0.2,0,0,17.15
0.05,1,0,0,0.05,0.2,0.05,0.05,18.9
0,1,0.05,0,0,0.2,0.05,0.05,18.75
You have 9 separate variables for regression (tomato ... price), with 13 samples of each (the 13 rows).
So the first approach could be doing a regression for "tomato" on data points
0.05
0.05
0.05
0.05
0.05
0
0.05
0.05
0
0.05
0.05
0.05
0
then doing another one for "lettuce" and the others, up to "price" with
18.4
16.15
22.15
19.4
18.4
11.75
18.15
18.65
15.75
15.4
17.15
18.9
18.75
An online viewer for looking at your CSV data: http://www.convertcsv.com/csv-viewer-editor.htm, but Google Sheets, Excel, etc. can display it nicely too.
SciPy can most likely do the task for you on vectors too (handling the 9 variables together), but the structure of 13 samples in 13 rows remains.
EDIT: bad news, I was tired and have not answered the full question, sorry about that.
While it is true that you can take the first 8 columns (tomato...ham) as time series, and make individual regressions for them (which is probably the first part of this assignment), the last column (price) is expected to be estimated from the first 8.
Using the notation in Wikipedia, https://en.wikipedia.org/wiki/Linear_regression#Introduction, your y vector is the last column (the prices), the X matrix is the first 8 columns of your data (tomato...ham), extended with a column of 1-s somewhere.
Then pick an estimation method (some are listed in that page too, https://en.wikipedia.org/wiki/Linear_regression#Estimation_methods, but you may want to pick one you have learned about at class). The actual math is there, and NumPy can do the matrix/vector calculations. If you go for "Ordinary least squares", numpy.linalg.lstsq does the same (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq - you may find adding that column of 1-s familiar), so it can be used for verifying the results.
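As a sketch of that last approach on the posted table (the CSV is read inline; `numpy.linalg.lstsq` with a column of 1-s appended to recover the fixed profit):

```python
import io
import numpy as np
import pandas as pd

csv = """tomato,lettuce,cheese,pickles,palmetto,burger,corn,ham,price
0.05,1,0.05,0,0.05,0.2,0.05,0,18.4
0.05,0,0.05,0.05,0,0.2,0.05,0.05,16.15
0.05,1,0.05,0,0.05,0.4,0,0,22.15
0.05,1,0.05,0,0.05,0.2,0.05,0.05,19.4
0.05,1,0,0,0,0.2,0.05,0.05,18.4
0,0,0.05,0,0,0,0.05,0.05,11.75
0.05,1,0,0,0,0.2,0,0.05,18.15
0.05,1,0.05,0.05,0.05,0.2,0.05,0,18.65
0,0,0.05,0,0,0.2,0.05,0.05,15.75
0.05,1,0.05,0,0.05,0,0.05,0.05,15.4
0.05,1,0,0,0,0.2,0,0,17.15
0.05,1,0,0,0.05,0.2,0.05,0.05,18.9
0,1,0.05,0,0,0.2,0.05,0.05,18.75"""

df = pd.read_csv(io.StringIO(csv))
X = np.column_stack([df.drop(columns='price').values,
                     np.ones(len(df))])   # trailing column of 1-s -> fixed profit
y = df['price'].values

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
unit_prices, fixed_profit = coef[:-1], coef[-1]
```

With this particular table the fit is exact (zero residual): the recovered per-unit ingredient prices are tomato 3, lettuce 3, cheese 10, pickles 5, palmetto 10, burger 20, corn 5, ham 20, with a fixed profit of 10 per sandwich.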
I need to confirm a few things about pandas' exponentially weighted moving average function.
If I have a data set df for which I need to find a 12-day exponential moving average, would the method below be correct?
exp_12 = df.ewm(span=20, min_periods=12, adjust=False).mean()
Given that the data set contains 20 readings, I set span (the total number of values) to 20.
Since I need a 12-day moving average, I set min_periods=12.
I interpret span as the total number of values in the data set, or the total time covered.
Can someone confirm whether my interpretation is correct?
I also can't work out the significance of adjust.
I've attached the link to the pandas.DataFrame.ewm documentation below.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html
Quoting from Pandas docs:
Span corresponds to what is commonly called an “N-day EW moving average”.
In your case, set span=12.
You do not need to specify that you have 20 data points; pandas takes care of that. min_periods is probably not required here.
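To illustrate on stand-in data (a sketch): with adjust=False, pandas uses the recursive form y[t] = (1 - alpha) * y[t-1] + alpha * x[t] with alpha = 2 / (span + 1); with adjust=True, it instead computes a weighted average over the whole history with weights (1 - alpha)^i, normalised, so early values are less biased toward the first observation.

```python
import pandas as pd

prices = pd.Series([float(x) for x in range(1, 21)])  # stand-in data: 20 readings

# 12-day EMA; min_periods=12 only makes the first 11 outputs NaN
ema_12 = prices.ewm(span=12, min_periods=12, adjust=False).mean()

# adjust=False recursion: y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
alpha = 2 / (12 + 1)
ema_raw = prices.ewm(span=12, adjust=False).mean()  # same values, no NaN prefix
```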
I have a situation where I am given a total ticket count and cumulative ticket sale data, as follows:
Total Tickets Available: 300
Day 1: 15 tickets sold to date
Day 2: 20 tickets sold to date
Day 3: 25 tickets sold to date
Day 4: 30 tickets sold to date
Day 5: 46 tickets sold to date
The number of tickets sold is nonlinear, and I'm asked if someone plans to buy a ticket on Day 23, what is the probability he will get a ticket?
I've been looking at quite a few libraries used for curve fitting, like NumPy, PyLab, and Sage, but I've been a bit overwhelmed since statistics is not in my background. How would I easily calculate a probability from this set of data? If it helps, I also have ticket sale data from other locations; the curve there should be somewhat different.
The best answer to this question would require more information about the problem: are people more or less likely to buy a ticket as the date approaches (and how much more or less)? Are there advertising events that will transiently affect the rate of sales? And so on.
We don't have access to that information, though, so let's just assume, as a first approximation, that the rate of ticket sales is constant. Since sales occur basically at random, they might be best modeled as a Poisson process. Note that this does not account for the fact that many people will buy more than one ticket, but I don't think that will make much difference for the results; perhaps a real statistician could chime in here. Also: I'm going to discuss the constant-rate Poisson process here, but since you mentioned the rate is decidedly NOT constant, you could look into variable-rate Poisson processes as a next step.
To model a Poisson process, all you need is the average rate of ticket sales. In your example data, sales-per-day are [15, 5, 5, 5, 16], so the average rate is about 9.2 tickets per day. We've already sold 46 tickets, so there are 254 remaining.
From here, it is simple to ask, "Given a rate of 9.2 tpd, what is the probability of selling fewer than 254 tickets in 23 days?" (ignoring the fact that you can't sell more than 300 tickets). The way to calculate this is with a cumulative distribution function (see here for the CDF of a Poisson distribution).
On average, we would expect to sell 23 * 9.2 = 211.6 tickets after 23 days, so in the language of probability distributions, the expectation value is 211.6. The CDF tells us, "given an expectation value λ, what is the probability of seeing a value <= x". You can do the math yourself or ask scipy to do it for you:
>>> import scipy.stats
>>> scipy.stats.poisson(9.2 * 23).cdf(254-1)
0.99747286634158705
So this tells us: IF ticket sales can be accurately represented as a Poisson process and IF the average rate of ticket sales really is 9.2 tpd, then the probability of at least one ticket being available after 23 more days is 99.7%.
Now let's say someone wants to bring a group of 50 friends and wants to know the probability of getting all 50 tickets if they buy them in 25 days (rephrase the question as "If we expect on average to sell 9.2 * 25 tickets, what is the probability of selling <= (254-50) tickets?"):
>>> scipy.stats.poisson(9.2 * 25).cdf(254-50)
0.044301801145630537
So the probability of having 50 tickets available after 25 days is about 4%.
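The two calls above can be wrapped into a small reusable helper (a sketch; the rate and remaining-ticket numbers come from the example above):

```python
from scipy.stats import poisson

def p_tickets_available(rate_per_day, days, remaining, group_size=1):
    """P(at least `group_size` tickets remain after `days` more days),
    under a constant-rate Poisson model of ticket sales."""
    # Tickets are available iff total sales over `days` <= remaining - group_size
    return poisson(rate_per_day * days).cdf(remaining - group_size)

p_tickets_available(9.2, 23, 254)       # ~0.997, one ticket on Day 23+5
p_tickets_available(9.2, 25, 254, 50)   # ~0.044, 50 tickets 25 days out
```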