Reinforcement Learning with MDP for revenue optimization - python

I want to model the service of selling seats on an airplane as an MDP (Markov decision process) and use reinforcement learning for airline revenue optimization. For that I needed to define the states, actions, policy, value, and reward. I have thought about it a bit, but I think there is still something missing.
I model my system this way:
States = (r, c), where r is the number of passengers and c the number of seats bought, so r >= c.
Actions = (p1, p2, p3), the 3 possible prices. The objective is to decide which one of them gives more revenue.
Reward: revenue.
Could you please tell me what you think and help me?
After the modelling, I have to implement all of that with reinforcement learning. Is there a package that does the work?

I think the biggest thing missing in your formulation is the sequential part. Reinforcement learning is useful when applied sequentially, where the next state depends on the current state (hence "Markovian"). In this formulation, you have not specified any Markovian behaviour at all. Also, the reward is a scalar that depends on either the current state or the combination of current state and action. In your case, the revenue depends on the price (the action) but has no relation to the state (the seats). These are the two big problems I see with your formulation; there are others as well. I suggest you go through the RL theory (online courses and such) and work a few sample problems before trying to formulate your own.
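To make the point concrete, here is a minimal sketch of what a Markovian formulation could look like. Everything in it is an assumption for illustration: the state is (booking steps remaining, seats remaining), the action is an index into the three prices, a made-up demand model decides whether a sale happens, and the reward is the revenue of that step. The next state depends on the current state, which is exactly the sequential structure missing above.

import random

PRICES = [100.0, 150.0, 200.0]   # the three candidate prices (assumed values)

def sale_probability(price):
    # Hypothetical demand model: higher price -> lower chance of a sale.
    return max(0.0, 1.0 - price / 250.0)

def step(state, action):
    # state = (steps_remaining, seats_remaining); action = index into PRICES.
    steps, seats = state
    price = PRICES[action]
    sold = seats > 0 and random.random() < sale_probability(price)
    reward = price if sold else 0.0            # revenue depends on state AND action
    next_state = (steps - 1, seats - 1 if sold else seats)
    done = next_state[0] == 0 or next_state[1] == 0
    return next_state, reward, done

state = (50, 30)   # e.g. 50 booking steps before departure, 30 seats left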

Related

How to set your own value function in Reinforcement Learning?

I am new to reinforcement learning; I have only read the first few chapters of Sutton & Barto (so I have a small theoretical background).
I am trying to solve a combinatorial optimization problem which can be broken down as follows:
I am looking for the optimal configuration of points (qubits) on a grid (quantum computer).
I already have a cost function to qualify a configuration. I also have a reward function.
Right now I am using simulated annealing, where I randomly move a qubit or swap two qubits.
However, this ansatz is not working well for more than 30 qubits.
That's why I thought to use a policy, which tells me which qubit to move/swap instead of doing it randomly.
Reading the gym documentation, I couldn't find which option I should use. As far as I understand, I don't need Q-learning or deep reinforcement learning, since I only need to learn a policy?
I would also be fine using PyTorch or whatever. With this little amount of information, what do you recommend choosing? More importantly, how can I set my own value function?
There are two categories of RL algorithms.
One category, like Q-learning and Deep Q-learning, learns a value function that, for a state and an action, predicts the estimated return you will get. Then, once you know for each state and each action what the return is, your policy is simply to select, for each state, the action that provides the biggest return. Thus, with these algorithms, even though you learn a value function, the policy is derived from this value function.
Then there are other deep RL algorithms where you learn a policy directly, like REINFORCE or Actor-Critic algorithms. You still learn a value function, but at the same time you also learn a policy with the help of the value function. The value function helps the system learn the policy during training, but at test time you no longer use the value function, only the policy.
Thus, in the first case, you learn a value function and act greedily on it, and in the second case, you learn both a value function and a policy and then use the policy to navigate the environment.
In the end, both of these families should work for your problem, and since you say you are new to RL, maybe you could start with the Deep Q-learning material in the gym documentation.
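To see the first category in miniature, here is a self-contained tabular Q-learning sketch on a toy chain environment. The environment, its size, and all hyperparameters are invented for illustration; the point is only to show how the learned value function induces a greedy policy.

import random

N_STATES, N_ACTIONS = 5, 2        # toy chain: action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(state, action):
    # Reaching the right end of the chain pays 1; every other move pays 0.
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection over the current value estimates
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: bootstrap from the best action in the next state
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# the greedy policy falls out of the value function
print([max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)])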

Financial Contagion (Epidemic Spread) Model Meets a Problem

Recently I have been trying to rebuild the financial contagion model from the paper Contagion in Financial Networks by Prasanna Gai.
Now I am stuck on the first figure (figure 3, actually).
What I Did
I used Python and networkx.
First, build an ER network with 1000 nodes, where the edge probability depends on which average degree I want to simulate. For example, if I want to simulate an average degree of 3, then the edge probability of the ER network is 3/(1000-1), where 1000 is the network size.
Then, for each node, I count how many nodes point towards it in order to calculate the weights. If node 1 has 3 nodes pointing towards it, then the weight on each of these edges is AiIB (0.2 in the paper) / 3 (the number of in-neighbours).
To simulate contagion, first randomly choose one node and remove all its assets. It then cannot pay back the liabilities to its neighbours, and a neighbour goes bankrupt if the lost liability exceeds its capital buffer (Ki, 0.04 in the paper). For banks that receive liabilities from more than one bank, even if each link's weight is less than Ki, the bank is still considered bankrupt if the sum of these lost liabilities exceeds Ki. The model is just like an epidemic spread, where the newly bankrupt banks influence the next batch of banks, and it ends when no more banks go bankrupt.
Contagion is defined as more than 5% of the banks in the network going bankrupt (50 in this case).
To plot the figure, each average degree needs to be tested 100 times:
probability = number of contagion events / number of simulations (100 here)
extent = (over the runs where contagion happened) sum of the fractions of bankrupt banks / number of contagion events
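For reference, here is a minimal sketch of the construction and cascade as described above, using networkx. The constants (AiIB = 0.2, Ki = 0.04, the 5% threshold) come from the description; the function names, the edge-direction convention, and the bookkeeping are my own guesses at one reasonable implementation, not the actual repository code.

import random
import networkx as nx

N, AIB, K_BUFFER = 1000, 0.2, 0.04

def simulate_once(avg_degree):
    # Directed ER graph; an edge u -> v means v holds an interbank asset against u.
    g = nx.erdos_renyi_graph(N, avg_degree / (N - 1), directed=True)
    # Each bank splits its interbank assets AIB evenly over its incoming edges.
    weight = {v: AIB / g.in_degree(v) for v in g if g.in_degree(v) > 0}
    shocked = random.randrange(N)          # initial shock: one bank loses its assets
    bankrupt, frontier = {shocked}, {shocked}
    loss = {v: 0.0 for v in g}
    while frontier:
        next_frontier = set()
        for u in frontier:                 # defaulted banks stop paying their creditors
            for _, v in g.out_edges(u):
                if v in bankrupt:
                    continue
                loss[v] += weight[v]       # losses accumulate across rounds
                if loss[v] > K_BUFFER:     # summed losses exceed the capital buffer
                    next_frontier.add(v)
        bankrupt |= next_frontier
        frontier = next_frontier
    return len(bankrupt) / N

fractions = [simulate_once(3.0) for _ in range(100)]
contagious = [f for f in fractions if f > 0.05]
probability = len(contagious) / len(fractions)
extent = sum(contagious) / len(contagious) if contagious else 0.0
print(probability, extent)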
The raw code is available on GitHub; by running er_100.py you can reproduce my version of the figure.
If you have any trouble with the code please let me know. (The code needs at least 1 hour to run on GCP with 8 vCPUs...)
I also tried a network with 60 nodes; its figure has a somewhat similar shape to figure 1, but this is still not good, and the small network is not what I want.
I don't know what is wrong with my code. In my view I have covered everything, and it should give similar results. I have even started to question the authority of the paper...
If you have any ideas please help me with this.
This is a "hard" question to answer.
I still could not find any clue in the code. So I rewrote the code in R and ran it, and the draft figure I got is just like the one in the paper, even though the algorithm and the structure were exactly the same as what I wrote in Python.
Maybe this is a case that Python just can't handle.
If anyone finds this question interesting and wants to dig further into the differences between Python and R, this is a great example, and I am very happy to offer any help.
By the way, the model code in R is available on GitHub, and it is still being updated.
For those who spent time reading my description, thanks for your time.
Update:
I also can't believe this, because in my view the code just counts and calculates, which is simple. I printed everything at every step and checked every node, from a 10-node toy network up to the 1000-node network; the log file reached over 50 GB. Everything looks normal, but the number of bankrupt banks just does not reach the threshold. Not so in R: with exactly the same structure, the results match the paper.
I really don't know why and have no idea.

Deep Q-learning modification

#Edit:
I'm trying to create an agent to play the game of Tetris, using a convolutional nnet that takes the board state + current piece as input. From what I've read, Deep Q-learning is not very good at this, which I just confirmed.
#end Edit
Suppose that an agent is learning a policy to play a game, where each game step can be represented as
s, a, r, s', done
representing
state, action, reward, next state, game over
In the Deep Q-learning algorithm, the agent is in state s and takes some action a (following an epsilon-greedy policy), observes a reward r and gets to the next state s'.
The agent acts like this:
import random
import numpy as np

# returns an action index
def get_action(state, epsilon):
    # epsilon-greedy: explore with probability epsilon
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # otherwise act greedily on the predicted Q-values
    return int(np.argmax(nnet.predict(state)))
The parameters are updated by greedily observing the max Q-value in state s', so we have
# action taken was 'a' in state 's' leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    # bootstrap greedily from the best Q-value in the next state
    target = reward + gamma * max(nnet.predict(s_))
prediction[a] = target
The [prediction, target] pair is fed to the nnet for a weight update. So this nnet takes a state s as input and outputs a vector of Q-values with dimension n_actions. This is all clear to me.
Now, suppose that my state-actions are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state quality" (how desirable that state is).
Now my agent acts like this:
# returns an action based on how good the next state is
def get_action(state, epsilon):
    actions = []
    for action in possible_actions(state):
        # evaluate each action on a copy, so the real game state is untouched
        sim = game.deepcopy()
        sim.apply(action)
        action.value = nnet.predict(sim.get_state())
        actions.append(action)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: a.value)
And my [prediction, target] is like this:
# action taken was 'a' in state 's' leading to 's_'
prediction = nnet.predict(s)
if done:
    target = reward
else:
    # bootstrap from the value of the next state (no max over actions)
    target = reward + gamma * nnet.predict(s_)
I have some questions regarding this second algorithm:
a) Does it make sense to act non-greedily sometimes?
Intuitively no, because if I land in a bad state, it was probably because of a bad random action, not because the previous state was 'bad'. The Q-learning update will adjust for the bad action, but this second algorithm will wrongly adjust the value of the previous state.
b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?
c) In the case of Tetris, the state almost never repeats, so what can I do in this case? Is that the reason Deep Q-learning fails here?
This may look confusing, but the algorithm actually works. I can provide extra details if necessary, thank you!
Now, suppose that my state-actions are so noisy that this approach simply will not work. So, instead of outputting a vector of dimension n_actions, my nnet outputs a single value, representing the "state quality" (how desirable that state is).
Now my agent acts like this:
# returns an action based on how good the next state is
def get_action(state, epsilon):
    actions = []
    for action in possible_actions(state):
        game.apply(action)               # note: this mutates the real game state
        val = nnet.predict(game.get_state())
        action.value = val
        actions.append(action)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: a.value)
First, a quick note on that pseudocode: I don't think it would work, because you don't simulate the effects of the different actions on copies of the game state, but on the game state directly. You would probably want to create separate copies of the game state first, and run each action once on a different copy.
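A minimal sketch of what I mean, using Python's copy module; the game, nnet, and possible_actions interfaces are just the ones your pseudocode assumes:

import copy
import random

def get_action(game, epsilon, nnet, possible_actions):
    scored = []
    for action in possible_actions(game.get_state()):
        sim = copy.deepcopy(game)        # each action is simulated on its own copy
        sim.apply(action)
        scored.append((nnet.predict(sim.get_state()), action))
    if random.random() < epsilon:
        return random.choice(scored)[1]
    return max(scored, key=lambda pair: pair[0])[1]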
Anyway, this kind of algorithm is generally assumed not to be possible in Reinforcement Learning settings. In RL, we generally operate under the assumption that we don't have a "simulator" or "forward model" or anything like that. We normally assume that we have an agent situated in a real environment in which we can generate experience that can be used to learn from. Under this assumption, we cannot implement that kind of for each action possible in state loop which simulates what would happen if we were to execute different actions in the same game state. The assumption is that we first have to pick one action, execute it, and then learn from that particular trajectory of experience; we can no longer "go back", imagine that we had selected a different action, and also learn from that trajectory.
Of course, in practice doing this is often possible, because we often do actually have a simulator (for example a robot simulator, or a game, etc.). The majority of research in RL still makes the assumption that we do not have such a simulator, because this leads to algorithms that may eventually be usable in "real-world" situations (e.g., real-world physical robots). Implementing the idea you describe above actually means you're moving more towards search algorithms (such as Monte-Carlo Tree Search) than reinforcement learning algorithms. It means that your approach is limited to scenarios in which you do have a simulator available (e.g., games).
a) Does it make sense to act non-greedily sometimes?
Even though you include the search-like process of looping through all actions and simulating their effects, I suspect you will still need some form of exploration if you want to converge to good policies, so you'll have to act non-greedily sometimes. Yes, it looks like this will cause your algorithm to converge to something different from the traditional interpretation of an "optimal policy". This is not too much of a problem if your epsilon is fairly low, though. In practice, the learned policy will likely tend to be a slightly "safer" one. See also my answer to this other question.
b) What kind of algorithm is this? Where does it fit in Reinforcement Learning?
Aside from my discussion above on how this actually moves a bit towards the domain of Search algorithms rather than RL algorithms, it also looks to me like this would be an on-policy algorithm rather than an off-policy algorithm (standard Q-learning is off-policy, because it learns about the greedy policy whilst generating experience through a non-greedy policy). This distinction is also discussed in detail by most of the answers to the question I linked to above.

Are shadow prices valid after an infeasible solve from COIN using PuLP

I am solving a minimization linear program using COIN-OR's CLP solver with PuLP in Python.
The variables included in the problem are a subset of the total number of possible variables, and sometimes my pricing heuristic picks a subset of variables that results in an infeasible solution. After that, I use shadow prices to price new variables in.
My question is: if the problem is infeasible, I still get values from calling prob.constraints[c].pi, but those values don't always seem to be "valid" or "good", per se.
By contrast, a solver like Gurobi won't even let me query the shadow prices after an infeasible solve.
Actually Stu, this might work! The "dummy var" in my case could be the source/sink node, where I can loosen the flow constraints, allowing infinite flow in/out but at a large cost. This makes the solution feasible with a very bad (high) optimal cost; the pricing of the new variables should then work and show me which variables to add to the problem on the next iteration. I'll give it a try and report back. My only concern is that the big-M cost coefficient on the source/sink node may skew the pricing of the variables, making all of them look relatively attractive. That would be counterproductive, because adding most of the variables back into the problem would defeat the purpose of my column generation in the first place. I'll test it...
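For anyone trying the same trick, here is a minimal PuLP sketch of the idea; the model, the constraint, and the big-M value are invented for illustration and are not the original problem:

from pulp import LpProblem, LpMinimize, LpVariable

BIG_M = 1e6                       # penalty large enough to discourage using the slack

prob = LpProblem("elastic_demo", LpMinimize)
x = LpVariable("x", lowBound=0)
y = LpVariable("y", lowBound=0)
slack = LpVariable("slack", lowBound=0)   # "dummy" flow that keeps the model feasible

# objective: real costs plus the big-M penalty on the artificial flow
prob += 2 * x + 3 * y + BIG_M * slack

# a requirement the current variable subset might not be able to meet on its own;
# the slack absorbs whatever the real variables cannot supply
prob += x + y + slack >= 10, "demand"

prob.solve()
print(prob.constraints["demand"].pi)      # dual is well defined after a feasible solve

As the comment above anticipates, if the slack ends up positive, the dual on that constraint gets pulled towards BIG_M, so the reduced costs of candidate columns can all look attractive; one mitigation could be shrinking the penalty, or re-pricing once the slack leaves the basis.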

Utilising a genetic algorithm to overcome different-size datasets in a model

So I realise the question I am asking here is large and complex.
A potential solution to variances in dataset sizes:
In all of my searching through statistical forums and posts I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.
The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following CSV (ultra-simplified, but hopefully it demonstrates the principle).
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset, which contains some good examples of where my current methods fall short and how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes. The ultimate objective of the algorithm is to create as high a correspondence as possible between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values compared with the average (note the average isn't calculated from this dataset), but it would be common sense to expect that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y to overcome these differences in dataset sizes, using a growth relationship defined by the equation:
adjusted_xy = (1 - exp(-α*y)) * (x/y) + (1 - (1 - exp(-α*y))) * average_xy
Where α is the only dynamic number
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: the graph below shows an exponential growth relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially, what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger. Whatever percentage is left is made up by the average xy. Hypothetically, it could be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y which clearly reflects the size of the dataset behind each value.
To help with understanding: lowering α would produce a relationship whereby a larger dataset would be required to achieve the same "% of xy contributed".
Conversely, increasing α would produce a relationship whereby a smaller dataset would be required to achieve the same "% of xy contributed".
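To make the blend concrete, here is a tiny sketch of the equation applied to the two UniqueClass1 rows; the α value is arbitrary, chosen only to show how the weighting shifts with dataset size:

from math import exp

def adjusted_xy(x, y, average_xy, alpha):
    # the weight on the observed x/y grows towards 1 as the dataset (y) grows
    w = 1 - exp(-alpha * y)
    return w * (x / y) + (1 - w) * average_xy

AVERAGE_XY = 0.08527
for x, y in [(300, 961), (3000, 9610)]:
    print(x, y, adjusted_xy(x, y, AVERAGE_XY, alpha=0.001))

With these numbers, the 3000/9610 row keeps almost all of its raw x/y, while the 300/961 row is pulled noticeably towards the average.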
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with, in order to help my understanding of how to utilise such abilities of Python. If additional detail or further clarification about the problem or methods is required, please do ask; I really want to be able to solve this problem and future problems of this nature.
After much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm to iterate α in order to maximise the homology/correspondence between the rank of an adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone were able to help in that department.
So to clarify, this post is no longer a discussion about methodology.
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted_xy = (1 - exp(-α*y)) * (x/y) + (1 - (1 - exp(-α*y))) * average_xy
where adjusted_xy applies to each row of the CSV. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is computed within each Unique class only) and the achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different-size datasets. If any more information is required please ask; I check this post about 20 times a day at the moment, so I should reply rather promptly. Many thanks, SMNALLY.
The problem you are facing sounds to me like the "bias-variance dilemma" from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GAs but looking at instance-based learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point.
And particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small (considered less representative).
As a consequence, you want to "weight" the examples with an adapted value, adjusted_xy.
You want adjusted_xy to be related to a third attribute R (rank), such that, per class, adjusted_xy is sorted like R.
To do so, you suggest posing it as an optimization problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy,
with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation. (I guess the dataset will later be used for supervised learning.)
One problem that I see in your approach (if I have understood it well) is that, in the end, rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.
Once this is said, I think you surely know how a GA works. You have to:
define the content of the chromosome: this appears to be your alpha parameter;
define an appropriate fitness function.
The fitness function for one individual can be a sum of distances over all examples of the dataset, as in the sketch below.
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be better adapted than a GA.
As solving optimization problems is CPU-intensive, you might eventually consider C or Java instead of Python (the fitness function, at least, will be interpreted and thus cost a lot).
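Here is a minimal sketch of that recipe, assuming the CSV above is saved as data.csv and using pandas for the per-class ranking. The population size, mutation width, and generation count are arbitrary, and since the chromosome is a single real α, this is closer to a (1+λ) evolution strategy than a full GA, in line with the remark above:

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")

def fitness(alpha):
    # adjusted xy per the question's equation
    w = 1 - np.exp(-alpha * df["y"])
    adj = w * df["x/y"] + (1 - w) * df["Average xy"]
    # rank within each Unique class: highest adjusted value gets rank 1
    ranks = adj.groupby(df["Unique class"]).rank(ascending=False, method="first")
    # sum of rank disagreements; 0 would mean perfect homology
    return (ranks - df["Achieved rank"]).abs().sum()

rng = np.random.default_rng(0)
best_alpha, best_fit = 0.001, fitness(0.001)
for generation in range(200):
    # mutate the single real-valued gene and keep the best child
    children = np.abs(best_alpha + rng.normal(0.0, 0.0005, size=20))
    for child in children:
        f = fitness(child)
        if f < best_fit:
            best_alpha, best_fit = child, f
print(best_alpha, best_fit)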
Alternatively, I would look at using Y as a weight in some supervised learning algorithm (if supervised learning is the target).
Let's start with the problem: you consider that some features lead some of your classes to a 'strike'. You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate in the first place. You also comment on the effect of some samples in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data; therefore you will, one way or another, need to collect more of it to account for that variation (that is, variation that is inherent to the problem).
The fact that in some cases the numbers end up in 'unusable cases' could also be down to outliers, that is, measurements that are 'out of bounds' for a number of reasons and which you would have to find a way to either exclude or re-adjust. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates coming from samples of different sizes, as you have found out. If your problem is purely to determine the sample size needed for your results to conform to some specific accuracy, then I would recommend that you have a look at statistical power and how sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features plus some more that do not appear all the time. The question now is which classifier you use to achieve this. Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayes classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a priori probabilities. When the classifier is 'running' it will use the features of new data to construct a likelihood that the patient who provided this data should be assigned to each class.
Perhaps the most common example of such a classifier is the spam-email detector, where the likelihood that an email is spam is judged on the existence of specific words in the email (with a suitable training dataset providing a good starting point, of course).
Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, which could potentially help with those differences in dataset sizes. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book, and community are very helpful.
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points, but the less this fit will mean. Especially since, for some groups, your sample sizes are small, and therefore the measurements have a high random component. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle of finding the best line through the dots. You are searching for a model that makes sense and brings understanding of the subject matter. A genetic algorithm is very creative at line-through-dot drawing, but it will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
A good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It will be able to explain, in hard probabilities, how likely the deviations on the left are, and tell you whether they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple of simulation runs for a population like your subjects. See whether the resulting data looks like the data you're observing, or whether it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
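A minimal sketch of that kind of null-model check, with every distribution and size invented for illustration:

import numpy as np

rng = np.random.default_rng(1)

# Null model: every subject shares one true strike rate; observed x is binomial noise.
TRUE_RATE = 0.085
sample_sizes = rng.integers(1, 400, size=1000)     # mimic very uneven dataset sizes
observed_rates = rng.binomial(sample_sizes, TRUE_RATE) / sample_sizes

# Small samples should scatter wildly; large samples should hug TRUE_RATE.
small = observed_rates[sample_sizes < 20]
large = observed_rates[sample_sizes > 200]
print("small-sample spread:", small.std(), "large-sample spread:", large.std())

If your real data spreads out the way the simulated small samples do, the deviations on the left are plausibly just noise.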
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variant, and therefore hard to model, you could create multiple forests from different random samples of your large dataset, put a control layer on top to classify data against all the forests, and then take the best score. I don't write Python, but I found this link, which may give you something to play with:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Following Occam's razor, you must select a simpler model for a small dataset, and you may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but you can never tell what an acceptable value of N is.
Thus, build several models and pick the one with the better trade-off between predictive power and simplicity, using the Akaike information criterion. It has useful properties and is not too hard to understand. :)
There are other tests of course, but AIC should get you started.
For a simple test, check out the p-value.
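To ground the AIC suggestion, here is a minimal sketch comparing polynomial fits with the Gaussian-likelihood form AIC = n·ln(RSS/n) + 2k; the data and the candidate models are invented:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0.0, 0.2, size=50)   # the truth is linear plus noise

def aic_polyfit(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 1                             # number of fitted parameters
    return len(x) * np.log(rss / len(x)) + 2 * k

for d in (1, 3, 9):
    print(d, aic_polyfit(d))                   # the lowest AIC should favour degree 1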
