Knight's Tour using a Neural Network

Knight's Tour using a Neural Network - python

I was looking at the knights tour problem and decided to have a go at implementing it in python using a neural network to find solutions.
The general explanation of the method can be found on Wikipedia
While I think I have implemented it correctly (I can't see anything else that is wrong), it doesn't work, it updates a few links, removing the edges where the connecting vertex has a degree more than two, but it doesn't converge on the solution.
I was wondering if anyone had any ideas on what I have implemented incorrectly (Sorry about the horrible code).
EDIT
Working code can be found at GitHub https://github.com/Yacoby/KnightsTour

You can't update the neurons in place. Since U[t+1] depends on U[t] and V[t], if you have already updated V the calculation for U will be wrong
I think you should split the update into two phases
update_state and update_output, so all the U are updated and then all the V
for n in neurons:
n.update_state()
for n in neurons:
n.update_output()

First impression is that you only have one buffer for the board. I'm basing this on the fact that I don't see any buffer swaps between iterations - I haven't looked that closely and may easily be wrong.
If you modify a single buffer in place, when you do the neighbour counts, you base them on a partly modified board - not the board you had at the start.

After looking over your code, I think your explanation for the formula you used may be incorrect. You say that when updating the state you add four rather than two and subtract the output of the neuron itself. It looks to me like you're subtracting the output of the neuron itself twice. Your code which finds neighbors does not appear to distinguish between neighbors of the neuron and the neuron itself, and you run this code twice- once for each vertex.
Tests on my own code seem to confirm this. Convergence rates drastically improve when I subtract the neuron's own output twice rather than once.

Related

Check edges in a graph

So given a graph that has a cycle from vertex S to E as it goes through every vertex then ends on E. My goal here is to remove all extra edges so that there is just a path from S to E. To help me with this I have a function called check(node) I'm not given the code for it but it returns True or False if there still exist a path from S to E such that all nodes were visited only once until we end at E.
Example:
The plan is to remove a edge from vertex a to b and then run check(node) on the mutated graph and see if it still returns True so we know its safe to remove that edge, and if it returns False then add it back. Do that for every edge so only the needed edge remains, however I have no idea how to iterate through the edges.
I stored the graphs in a dictionary

Usually the approach to an Algorithms problem like this is you first figure out what algorithmic tools you can use. Most basic problems can be solved with an existing algorithm. Your first objective is to see if you can modify the problem set (ie the given graph) in such a way that you don't need to modify the algorithm, because modifying the algorithm lends to difficulties in assessing Big-O for the problem. If the graph can't be modified in any way that makes running a black boxed Algorithm easy, then you modify the algorithm. The last resort is to come up wit your own algorithm to solve the problem.
If my Algorithms recollection is correct, in short this is the Travelling Salesman Problem. If I'm understanding your question correctly, you want the shortest path possible that visits every node. You don't even need to modify your given graph in order to use the algorithm. It should theoretically find you the desired path. Only after the algorithm has run do you need to reduce the graph to its desired state. So I suggest finding some way to implement TSP to your specifications, and remove all edges that aren't part of the solution.
Here is some sample code from GeeksForGeeks that could help you get started

Baum Welch (EM Algorithm) likelihood (P(X)) is not monotonically converging

So I am sort of an amateur when comes to machine learning and I am trying to program the Baum Welch algorithm, which is a derivation of the EM algorithm for Hidden Markov Models. Inside my program I am testing for convergence using the probability of each observation sequence in the new model and then terminating once the new model is less than or equal to the old model. However, when I run the algorithm it seems to converge somewhat and gives results that are far better than random but when converging it goes down on the last iteration. Is this a sign of a bug or am I doing something wrong?
It seems to me that I should have been using the summation of the log of each observation's probability for the comparison instead since it seems like the function I am maximizing. However, the paper I read said to use the log of the sum of probabilities(which I am pretty sure is the same as the sum of the probabilities) of the observations(https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf).
I fixed this on another project where I implemented backpropogation with feed-forward neural nets by implementing a for loop with pre-set number of epochs instead of a while loop with a condition for the new iteration to be strictly greater than but I am wondering if this is a bad practice.
My code is at https://github.com/icantrell/Natural-Language-Processing
inside the nlp.py file.
Any advice would be appreciated.
Thank You.

For EM iterations, or any other iteration proved to be non-decreasing, you should be seeing increases until the size of increases becomes small compared with floating point error, at which time floating point errors violate the assumptions in the proof, and you may see not only a failure to increase, but a very small decrease - but this should only be very small.
One good way to check these sorts of probability based calculations is to create a small test problem where the right answer is glaringly obvious - so obvious that you can see whether the answers from the code under test are obviously correct at all.
It might be worth comparing the paper you reference with https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness. I think equations such as (11) and (12) are not intended for you to actually calculate, but as arguments to motivate and prove the final result. I think the equation corresponding to the traditional EM step, which you do calculate, is equation (15) which says that you change the parameters at each step to increase the expected log-likelihood, which is the expectation under the distribution of hidden states calculated according to the old parameters, which is the standard EM step. In fact, turning over I see this is stated explicitly at the top of P 8.

How can I make my neural network emphasize that some data is more important than the rest?

I looked around online but couldn't find anything, but I may well have missed a piece of literature on this. I am running a basic neural net on a 289 component vector to produce a 285 component vector. In my input, the last 4 pieces of data are critical to change the rest of the input into the resultant 285 for the output. That is to say, the input is 285 + 4, such that the 4 morph the rest of the input into the output.
But when running a neural network on this, I am not sure how to reflect this. Would I need to use convolution on the rest of the input? I want my system to emphasize the 4 data points that critically affect the other 285. I am still new to all of this, so a few pointers would be great!
Again, if there is something already written on this, then that would be awesome too.

I don't think you have any reason doing this since the network will infer that on its own. The weights will be reduced or enhanced for each input according to their importance considering the output.
What you could do though, is to have a preliminary network that is going to have the 285 component as an input, and then a new network that is going to have the 4 critical components and the output of the preliminary network as an input.
[285 compo.]---[neural network]---+---[neural network]---[output 285 compo.]
|
[4 compo.]-+
For instance, you could treat a picture with convolution networks and then add some meta information later in a fully connected network to process everything.

The neural network should more or less learn this thing by itself. Especially with newer approaches like deep learning & friends, where the amount of hand-tuning is almost zero. However, this does assume that the function which you're trying to learn is learnable and that the system you use has enough power to learn it. That's a function of the complexity of the network involved (number of layers, nodes, types of activations etc.), the learning algorithms involved, as well as the data you supply.
It's really hard to tell without knowing more about the domain you're addressing? What sort of signals are we talking about (I assume they're signals since you speak of convolution)? What are the four inputs about? I assume they have a different modality than the other 285.
Perhaps this doc will help a little bit though.

Theoretically, you can let the network try to learn this relationship. However, there are good reasons to try to rethink the way you're formulating the problem. Also, the difficulty a neural network will have learning this function is going to depend strongly on your specific problem (and the best way to figure it out is probably just to try it and find out).
Let me try to help by making an analogy to a simpler problem: let's take your 289-element vector and assume that 285 elements take values from -1 to 1 and the remaining four take values from -1000 to 1000. This maintains your original premise: that the four variables are somehow far more important in determining the output than the 285. (I understand that this loses the coupled relationship between the variables, but let's run with the example anyways.)
This is a simpler example for two reasons:
it's easier to see why it's harder to learn
there are a bag of well-understood tricks to solve it
Compared to a scenario where all 289 inputs have the same input range, a gradient descent algorithm will be slower to converge on the heterogeneous case. (Extra credit: try this!) Geoff Hinton has a rather famous set of slides which describes this effect fairly well: Lecture 6. I believe this is also part of a Coursera course now.
Hinton's slides also touch on two ways to attack this simpler version of the problem. The first is just to pre-process your inputs. If you scale down the inputs to have the same mean and variance, your gradient descent optimizer will converge more quickly. The other is to use a more powerful optimization method, specifically one with per-parameter adaptive learning rates, which handles this case as well as trickier scenarios. Andrej Karpathy's fantastic notes from Stanford's CS231n class are a good intro.
But let's tie this back to your problem: that there are four "special" variables which transform the entire input. Given enough time and input, it's possible that a network can learn this function. But understand that if this transformation is complex and makes the optimization landscape rough, your network will likely have some trouble dealing with it.
If there's a way to transform your representation of the problem to avoid this link, I'd say try to pursue that. If not, then be prepared to resort to some bigger guns to solve the problem.
Without knowing the specifics of your problem, it's hard to give more concrete advice. Plus, ultimately, you're the one that will be solving it, so you're going to be the expert eventually!

To emphasize on any vector elements in your input vector you will have to give less information of the unimportant vector to your neural network.
Try to encode the first less important 285 numbers into one number or any vector size you like, with a a multiplayer neural network then use that number with other 4 number as a input to a neural network.
Example:
v1=[1,2,3,..........285]
v2=[286,287,288,289]
v_out= Neural_network(input_vector=v1,neurons=[100,1]) # 100 hidden unit with one outpt.
v_final=Neural_network(input_vector=v_out,neurons=[100,1]) # 100 hidden unit with one outpt.

Applying Hidden Markov Models in Python

I recently had a homework assignment in my computational biology class to which I had to apply a HMM. Although I think I understand HMMs, I couldn't manage to apply them to my code. Basically, I had to apply the Fair Bet Casino problem to CpG islands in DNA--using observed outcomes to predict hidden states using a transition matrix. I managed to come up with a solution that worked correctly...but unfortunately in exponential time. My solution psuedocode looked like this:
*notes: trans_matrix contains both the probabilities of going from one state to the other state, and the probabilities of going from one observation to the other observation.
HMM(trans_matrix, observations):
movements = {}
for i in range(0,len(observations):
movements[i] = #list of 4 probabilities, all possible state changes
poss_paths = []
for m in movements.keys():
#add two new paths for every path currently in poss_paths
# (state either changes or doesn't)
#store a running probability for each path
correct_path = poss_paths[index_of_highest_probability]
return correct_path
I realize that I go into exponential time when I look at all the possible paths #add two new paths for every path currently in poss_paths. I'm not sure how to find the path of highest probability without looking at all the paths, though.
Thank you!
EDIT:
I did find a lot of info about the Viterbi algorithm when I was doing the assignment but I was confused as to how it actually gives the best answer. It seems like Viterbi (I was looking at the forward algorithm specifically, I think) looks at a specific position, moves forward a position or two, and then decides the "correct" next path increment only having looked at a few subsequent probabilities. I may be understanding this wrong; is this how the Viterbi works? Pseudocode is helpful. Thank you!

One benefit of Hidden Markov Models is that you can generally do what you need without considering all possible paths one by one. What you are trying to do looks like an expensive way of finding the single most probable path, which you can do by dynamic programming under the name of the Viterbi algorithm - see e.g. http://cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html. There are other interesting things covered in documents like this which are not quite the same, such as working out the probabilities for the hidden state at a single position, or at all single positions. Very often this involves something called alpha and beta passes, which are a good search term, along with Hidden Markov Models.
There is a large description at http://en.wikipedia.org/wiki/Viterbi_algorithm with mathematical pseudo-code and what I think is python as well. Like most of these algorithms, it uses the Markov property that once you know the hidden state at a point you know everything you need to answer questions about that point in time - you don't need to know the past history. As in dynamic programming, you work from left to right along the data, using answers computed for output k-1 to work out answers for output k. What you want to work out at point k, for each state j, is the probability of the observed data up to and including that point, along the most likely path that ends up in state j at point k. That probability is the product of the probability of the observed data at k given state j, times the probability of the transition from some previous state at time k-1 to j times the probability of all of the observed data up to and including point k-1 given that you ended up at the previous state at time k-1 - this last bit is something you have just computed for time k-1. You consider all possible previous states and pick the one that gives you the highest combined probability. That gives you the answer for state j at time k, and you save the previous state that gave you the best answer. This may look like you are just fiddling around with outputs for k and k-1, but you now have an answer for time k that reflects all the data up to and including time k. You carry this on until k is the last point in your data, at which point you have answers for the probabilities of each final state given all the data. Pick the state at this time which gives you the highest probability and then trace all the way back using the info you saved about which previous state in time k-1 you used to compute the probability for data up to k and in state j at time k.

Applying machine learning to a guessing game?

I have a problem with a game I am making. I think I know the solution(or what solution to apply) but not sure how all the ‘pieces’ fit together.
How the game works:
(from How to approach number guessing game(with a twist) algorithm? )
users will be given items with a value(values change every day and the program is aware of the change in price). For example
Apple = 1
Pears = 2
Oranges = 3
They will then get a chance to choose any combo of them they like (i.e. 100 apples, 20 pears, and 1 oranges). The only output the computer gets is the total value(in this example, its currently $143). The computer will try to guess what they have. Which obviously it won’t be able to get correctly the first turn.
Value quantity(day1) value(day1)
Apple 1 100 100
Pears 2 20 40
Orange 3 1 3
Total 121 143
The next turn the user can modify their numbers but no more than 5% of the total quantity (or some other percent we may chose. I’ll use 5% for example.). The prices of fruit can change(at random) so the total value may change based on that also(for simplicity I am not changing fruit prices in this example). Using the above example, on day 2 of the game, the user returns a value of $152 and $164 on day 3. Here's an example.
quantity(day2) %change(day2) value(day2) quantity(day3) %change(day3) value(day3)
104 104 106 106
21 42 23 46
2 6 4 12
127 4.96% 152 133 4.72% 164
*(I hope the tables show up right, I had to manually space them so hopefully its not just doing it on my screen, if it doesn't work let me know and I'll try to upload a screenshot).
I am trying to see if I can figure out what the quantities are over time(assuming the user will have the patience to keep entering numbers). I know right now my only restriction is the total value cannot be more than 5% so I cannot be within 5% accuracy right now so the user will be entering it forever.
What I have done so far:
I have taken all the values of the fruit and total value of fruit basket that’s given to me and created a large table of all the possibilities. Once I have a list of all the possibilities I used graph theory and created nodes for each possible solution. I then create edges(links) between nodes from each day(for example day1 to day2) if its within 5% change. I then delete all nodes that do not have edges(links to other nodes), and as the user keeps playing I also delete entire paths when the path becomes a dead end.
This is great because it narrows the choices down, but now I’m stuck because I want to narrow these choices even more. I’ve been told this is a hidden markov problem but a trickier version because the states are changing(as you can see above new nodes are being added every turn and old/non-probable ones are being removed).
** if it helps, I got a amazing answer(with sample code) on a python implementation of the baum-welch model(its used to train the data) here: Example of implementation of Baum-Welch **
What I think needs to be done(this could be wrong):
Now that I narrowed the results down, I am basically trying to allow the program to try to predict the correct based the narrowed result base. I thought this was not possible but several people are suggesting this can be solved with a hidden markov model. I think I can run several iterations over the data(using a Baum-Welch model) until the probabilities stabilize(and should get better with more turns from the user).
The way hidden markov models are able to check spelling or handwriting and improve as they make errors(errors in this case is to pick a basket that is deleted upon the next turn as being improbable).
Two questions:
How do I figure out the transition and emission matrix if all states are at first equal? For example, as all states are equally likely something must be used to dedicate the probability of states changing. I was thinking of using the graph I made to weight the nodes with the highest number of edges as part of the calculation of transition/emission states? Does that make sense or is there a better approach?
How can I keep track of all the changes in states? As new baskets are added and old ones are removed, there becomes an issue of tracking the baskets. I though an Hierarchical Dirichlet Process hidden markov model(hdp-hmm) would be what I needed but not exactly sure how to apply it.
(sorry if I sound a bit frustrated..its a bit hard knowing a problem is solvable but not able to conceptually grasp what needs to be done).
As always, thanks for your time and any advice/suggestions would be greatly appreciated.

Like you've said, this problem can be described with a HMM. You are essentially interested in maintaining a distribution over latent, or hidden, states which would be the true quantities at each time point. However, it seems you are confusing the problem of learning the parameters for a HMM opposed to simply doing inference in a known HMM. You have the latter problem but propose employing a solution (Baum-Welch) designed to do the former. That is, you have the model already, you just have to use it.
Interestingly, if you go through coding a discrete HMM for your problem you get an algorithm very similar to what you describe in your graph-theory solution. The big difference is that your solution is tracking what is possible whereas a correct inference algorithm, like the Virterbi algorithm, will track what is likely. The difference is clear when there is overlap in the 5% range on a domain, that is, when multiple possible states could potentially transition to the same state. Your algorithm might add 2 edges to a point, but I doubt that when you compute the next day that has an effect (it should count twice, essentially).
Anyway, you could use the Viterbi algortihm, if you are only interested in the best guess at the most recent day I'll just give you a brief idea how you can just modify your graph-theory solution. Instead of maintaining edges between states maintain a fraction representing the probability that state is the correct one (this distribution is sometimes called the belief state). At each new day, propagate forward your belief state by incrementing each bucket by the probability of it's parent (instead of adding an edge your adding a floating point number). You also have to make sure your belief state is properly normalized (sums to 1) so just divide by its sum after each update. After that, you can weight each state by your observation, but since you don't have a noisy observation you can just go and set all the impossible states to being zero probability and then re-normalize. You now have a distribution over underlying quantities conditioned on your observations.
I'm skipping over a lot of statistical details here, just to give you the idea.
Edit (re: questions):
The answer to your question really depends on what you want, if you want only the distribution for the most recent day then you can get away with a one-pass algorithm like I've described. If, however, you want to have the correct distribution over the quantities at every single day you're going to have to do a backward pass as well. Hence, the aptly named forward-backward algorithm. I get the sense that since you are looking to go back a step and delete edges then you probably want the distribution for all days (unlike I originally assumed). Of course, you noticed there is information that can be used so that the "future can inform the past" so to speak, and this is exactly the reason why you need to do the backward pass as well, it's not really complicated you just have to run the exact same algorithm starting at the end of the chain. For a good overview check out Christopher Bishop's 6-piece tutorial on videolectures.net.
Because you mentioned adding/deleting edges let me just clarify the algorithm I described previously, keep in mind this is for a single forward pass. Let there be a total of N possible permutations of quantities, so you will have a belief state that is a sparse vector N elements long (called v_0). The first step you receive a observation of the sum, and you populate the vector by setting all the possible values to have probability 1.0, then re-normalize. The next step you create a new sparse vector (v_1) of all 0s, iterate over all non-zero entries in v_0 and increment (by the probability in v_0) all entries in v_1 that are within 5%. Then, zero out all the entries in v_1 that are not possible according to the new observation, then re-normalize v_1 and throw away v_0. repeat forever, v_1 will always be the correct distribution of possibilities.
By the way, things can get way more complex than this, if you have noisy observations or very large states or continuous states. For this reason it's pretty hard to read some of the literature on statistical inference; it's quite general.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.