Applying machine learning to a guessing game? - python
I have a problem with a game I am making. I think I know the solution (or what solution to apply), but I'm not sure how all the 'pieces' fit together.
How the game works:
(from How to approach number guessing game(with a twist) algorithm? )
Users will be given items with a value (values change every day, and the program is aware of the change in price). For example:
Apple = 1
Pears = 2
Oranges = 3
They will then get a chance to choose any combo of them they like (i.e. 100 apples, 20 pears, and 1 orange). The only output the computer gets is the total value (in this example, it's currently $143). The computer will try to guess what they have, which obviously it won't be able to get right on the first turn.
Item     Price   Quantity (day 1)   Value (day 1)
Apple    1       100                100
Pears    2       20                 40
Orange   3       1                  3
Total            121                143
On the next turn the user can modify their numbers, but by no more than 5% of the total quantity (or some other percentage we may choose; I'll use 5% for this example). The prices of fruit can change (at random), so the total value may change for that reason as well (for simplicity I am not changing fruit prices in this example). Using the above example, on day 2 of the game the user returns a value of $152, and $164 on day 3. Here's an example.
Item     Quantity (day 2)   % change (day 2)   Value (day 2)   Quantity (day 3)   % change (day 3)   Value (day 3)
Apple    104                                   104             106                                   106
Pears    21                                    42              23                                    46
Orange   2                                     6               4                                     12
Total    127                4.96%              152             133                4.72%              164
*(I hope the tables show up right; I had to manually space them, so hopefully it's not just working on my screen. If they don't display correctly let me know and I'll try to upload a screenshot.)
I am trying to see if I can figure out what the quantities are over time (assuming the user will have the patience to keep entering numbers). I know that right now my only restriction is that the total value cannot change by more than 5%, so I cannot be within 5% accuracy right now, and the user would be entering numbers forever.
What I have done so far:
I have taken all the values of the fruit and the total value of the fruit basket that's given to me and created a large table of all the possibilities. Once I have a list of all the possibilities, I use graph theory and create nodes for each possible solution. I then create edges (links) between nodes from each day (for example, day 1 to day 2) if the change is within 5%. I then delete all nodes that do not have edges (links to other nodes), and as the user keeps playing I also delete entire paths when a path becomes a dead end.
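For concreteness, here is a minimal sketch of that enumerate-and-link step (this assumes the 5% limit applies to the total quantity, as in the tables above; `candidate_baskets` and `within_5_percent` are hypothetical helper names, not anything from the actual program):

```python
from itertools import product

def candidate_baskets(prices, observed_total):
    """All quantity tuples whose total price equals the observed value (brute force)."""
    ranges = [range(observed_total // p + 1) for p in prices]
    return [qty for qty in product(*ranges)
            if sum(q * p for q, p in zip(qty, prices)) == observed_total]

def within_5_percent(basket_a, basket_b):
    """Link two baskets if the total quantity changed by no more than 5%."""
    total_a, total_b = sum(basket_a), sum(basket_b)
    return abs(total_b - total_a) <= 0.05 * total_a

prices = [1, 2, 3]                     # apple, pear, orange
day1 = candidate_baskets(prices, 143)  # all baskets consistent with day 1
day2 = candidate_baskets(prices, 152)  # all baskets consistent with day 2
edges = [(a, b) for a in day1 for b in day2 if within_5_percent(a, b)]
```

The pruning described above then amounts to dropping any day-1 basket with no outgoing edge and any day-2 basket with no incoming edge.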
This is great because it narrows the choices down, but now I'm stuck because I want to narrow these choices even further. I've been told this is a hidden Markov problem, but a trickier version because the states are changing (as you can see above, new nodes are being added every turn and old/improbable ones are being removed).
**If it helps, I got an amazing answer (with sample code) on a Python implementation of the Baum-Welch algorithm (it's used to train the data) here: Example of implementation of Baum-Welch**
What I think needs to be done (this could be wrong):
Now that I have narrowed the results down, I am basically trying to allow the program to predict the correct basket based on the narrowed result base. I thought this was not possible, but several people are suggesting it can be solved with a hidden Markov model. I think I can run several iterations over the data (using a Baum-Welch model) until the probabilities stabilize (and they should get better with more turns from the user).
This is the way hidden Markov models are able to check spelling or handwriting and improve as they make errors (an error in this case would be picking a basket that is deleted on the next turn as being improbable).
Two questions:
How do I figure out the transition and emission matrices if all states are at first equally likely? For example, since all states start out equally likely, something must be used to determine the probability of states changing. I was thinking of using the graph I made and weighting the nodes with the highest number of edges as part of the calculation of the transition/emission probabilities. Does that make sense, or is there a better approach?
How can I keep track of all the changes in states? As new baskets are added and old ones are removed, there is the issue of tracking the baskets. I thought a Hierarchical Dirichlet Process hidden Markov model (HDP-HMM) might be what I need, but I'm not exactly sure how to apply it.
(Sorry if I sound a bit frustrated... it's a bit hard knowing a problem is solvable but not being able to conceptually grasp what needs to be done.)
As always, thanks for your time and any advice/suggestions would be greatly appreciated.
Like you've said, this problem can be described with an HMM. You are essentially interested in maintaining a distribution over latent, or hidden, states, which here are the true quantities at each time point. However, it seems you are confusing the problem of learning the parameters of an HMM with simply doing inference in a known HMM. You have the latter problem but propose employing a solution (Baum-Welch) designed for the former. That is, you already have the model; you just have to use it.
Interestingly, if you go through coding a discrete HMM for your problem you get an algorithm very similar to what you describe in your graph-theory solution. The big difference is that your solution tracks what is possible, whereas a correct inference algorithm, like the Viterbi algorithm, tracks what is likely. The difference becomes clear when there is overlap in the 5% range, that is, when multiple possible states could transition to the same state. Your algorithm might add 2 edges to a point, but I doubt that has any effect when you compute the next day (it should essentially count twice).
Anyway, you could use the Viterbi algorithm, but if you are only interested in the best guess at the most recent day, I'll just give you a brief idea of how you can modify your graph-theory solution. Instead of maintaining edges between states, maintain a fraction representing the probability that each state is the correct one (this distribution is sometimes called the belief state). At each new day, propagate your belief state forward by incrementing each bucket by the probability of its parent (instead of adding an edge, you're adding a floating-point number). You also have to make sure your belief state is properly normalized (sums to 1), so just divide by its sum after each update. After that, you can weight each state by your observation; but since you don't have a noisy observation, you can simply set all the impossible states to zero probability and then re-normalize. You now have a distribution over the underlying quantities conditioned on your observations.
I'm skipping over a lot of statistical details here, just to give you the idea.
Edit (re: questions):
The answer to your question really depends on what you want. If you want only the distribution for the most recent day, then you can get away with a one-pass algorithm like I've described. If, however, you want the correct distribution over the quantities at every single day, you're going to have to do a backward pass as well; hence the aptly named forward-backward algorithm. I get the sense that since you are looking to go back a step and delete edges, you probably want the distribution for all days (unlike I originally assumed). Of course, you noticed there is information that can be used so that the "future can inform the past", so to speak, and this is exactly why you need the backward pass as well. It's not really complicated; you just have to run the exact same algorithm starting at the end of the chain. For a good overview check out Christopher Bishop's 6-piece tutorial on videolectures.net.
Because you mentioned adding/deleting edges, let me just clarify the algorithm I described previously; keep in mind this is for a single forward pass. Let there be a total of N possible permutations of quantities, so you will have a belief state that is a sparse vector N elements long (call it v_0). At the first step you receive an observation of the sum, and you populate the vector by setting all the possible values to probability 1.0, then re-normalize. At the next step you create a new sparse vector (v_1) of all 0s, iterate over all non-zero entries in v_0, and increment (by the probability in v_0) all entries in v_1 that are within 5%. Then zero out all the entries in v_1 that are not possible according to the new observation, re-normalize v_1, and throw away v_0. Repeat forever; v_1 will always be the correct distribution over possibilities.
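A minimal sketch of that single forward pass, using Python dicts as the sparse belief vectors (the helper names and the uniform within-5% transition weight are my own simplifying assumptions):

```python
def normalize(belief):
    total = sum(belief.values())
    return {state: p / total for state, p in belief.items()}

def forward_step(prev_belief, new_candidates, within_5_percent):
    """prev_belief: {basket: probability}. new_candidates: the baskets consistent
    with today's observed total, so impossible states are already 'zeroed out'."""
    belief = {basket: 0.0 for basket in new_candidates}
    for old_basket, p in prev_belief.items():
        for new_basket in new_candidates:
            if within_5_percent(old_basket, new_basket):
                # Uniform transition weight; a richer model could favour small changes.
                belief[new_basket] += p
    belief = {b: p for b, p in belief.items() if p > 0}   # drop unreachable states
    return normalize(belief)

# Day 1: every basket consistent with the first observation is equally likely:
#   belief = normalize({b: 1.0 for b in day1_candidates})
# Each later day:
#   belief = forward_step(belief, todays_candidates, within_5_percent)
```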
By the way, things can get far more complex than this if you have noisy observations, very large state spaces, or continuous states. For this reason it's pretty hard to read some of the literature on statistical inference; it's quite general.
Related
How to estimate the optimal cutpoint for a binary outcome in python
I have a dataset of diabetic patients which has been used to train an xgboost model on several outcomes such as stroke, amputation, and more. Originally we used the continuous numeric variables as-is, but we found ambiguity in the results, since, for example, age was giving us results where the older you get, the higher the risk of having a stroke. But for us as physicians we need a narrower range, so we divided those variables into bins, and indeed this gave us more insight. Nonetheless, we are seeing that some contiguous intervals appear pretty close together in our results. Continuing from the example above, bin(64-78) and bin(79-88) appear one after the other, and no other bin from the age variable appears. So we think that the best approach in this case is to find the optimal cutpoint at which age starts to become a risk factor for stroke. Then I came across this document (https://www.mayo.edu/research/documents/biostat-79pdf/doc-10027230) which explains how to find those cutpoints in SAS. I am not experienced enough to program this myself, so how can I find these cutpoints in Python? Please do restrict answers to that language; I have already seen R, SAS, and even SPSS examples, but none in Python. There must be some way to do this in Python.
It's very difficult to determine without seeing the data, but there are a few ways of doing it. One way is to fit a logistic regression to your data, which will give you a predicted probability for the binary class; you can then use the Receiver Operating Characteristic (ROC) curve to determine the optimal threshold, depending on how important it is to you to prioritize the true positive rate over the absence of false positives. You can find an article about this here.
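A rough sketch of that idea with scikit-learn (the data here is made up for illustration, and Youden's J is just one common criterion for the "optimal" cutpoint):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Made-up illustrative data where stroke risk rises with age.
rng = np.random.default_rng(0)
age = rng.integers(40, 90, size=500)
stroke = rng.binomial(1, 1 / (1 + np.exp(-(age - 70) / 5)))
df = pd.DataFrame({"age": age, "stroke": stroke})

model = LogisticRegression().fit(df[["age"]], df["stroke"])
probs = model.predict_proba(df[["age"]])[:, 1]

# ROC curve over all candidate probability thresholds; Youden's J (tpr - fpr)
# picks the threshold that best separates the classes.
fpr, tpr, thresholds = roc_curve(df["stroke"], probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]

# Because the fitted risk is monotone in age, the probability threshold maps
# back (approximately) to an age cutpoint.
ages = np.arange(40, 90).reshape(-1, 1)
cut_idx = np.argmin(np.abs(model.predict_proba(ages)[:, 1] - best_threshold))
print("probability threshold:", best_threshold, "age cutpoint:", int(ages[cut_idx, 0]))
```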
Financial Contagion (Epidemic spread) Model Meets Problem
Recently I wanted to rebuild the financial contagion model in the paper Contagion in Financial Networks by Prasanna Gai. Now I am stuck on the first figure (figure 3, actually).

What I did: I used Python and networkx. First, I build the ER network with 1000 nodes, with the probability depending on which average degree I want to simulate. For example, if I want to simulate an average degree of 3, then the probability for generating the ER network is 3/(1000-1), where 1000 is the network size. Then for each node I count how many nodes point towards it, to calculate the AiIB (weights). If node 1 has 3 nodes pointing towards it, then the weight on each of these edges is AiIB (0.2 in the paper) / 3 (the number of neighbours).

To simulate contagion, I first randomly choose one node and remove all its assets. It then cannot pay back the liabilities to its neighbours, and a neighbour goes bankrupt if the lost liability is more than its capital buffer (Ki, 0.04 in the paper). For banks that received liabilities from more than one bank, even if each link's weight is less than Ki, the bank is still considered bankrupt if the sum of these liabilities is more than Ki. The model works just like epidemic spread, where the newly bankrupt banks influence the next batch of banks, ending when no more banks go bankrupt in the system. Contagion is defined as more than 5% of the banks in the network going bankrupt (50 in this case).

To plot the figure, each average degree needs to be tested 100 times: probability = (number of times contagion happened) / (number of simulations, 100 here); extent = (sum of the fractions of bankrupt banks, over the runs where contagion happened) / (number of times contagion happened). The raw code is available on GitHub; by running er_100.py you can get my version of the figure. If you have any trouble with the code please let me know (the code needs at least 1 hour to run on GCP with 8 vCPUs...). I also tried a network with 60 nodes; it has a shape somewhat similar to figure 1, but this is still not good, and the small network is not what I want. I don't know what the problem with my code is. In my view I have considered everything and it should get similar results. I have even started to question the authority of the paper... If you have any ideas please help me with this.
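A condensed sketch of the setup described above, as I read it (the variable names, and treating an edge u -> v as an asset of v and a liability of u, are my assumptions, not the code from GitHub):

```python
import random
import networkx as nx

N, avg_degree = 1000, 3
A_IB, K = 0.2, 0.04          # interbank asset share and capital buffer from the paper

# Directed ER network; an edge u -> v is counted as "u points towards v".
G = nx.gnp_random_graph(N, avg_degree / (N - 1), directed=True)

# Split each bank's interbank assets evenly over its incoming edges.
for v in G.nodes:
    in_edges = list(G.in_edges(v))
    for u, _ in in_edges:
        G[u][v]["weight"] = A_IB / len(in_edges)

def simulate_contagion(G):
    """Default one random bank, then propagate losses until nothing changes."""
    defaulted = {random.choice(list(G.nodes))}
    new_defaults = set(defaulted)
    while new_defaults:
        new_defaults = set()
        for bank in set(G.nodes) - defaulted:
            # Loss = sum of exposures to banks that have already defaulted.
            loss = sum(G[u][bank]["weight"] for u, _ in G.in_edges(bank) if u in defaulted)
            if loss > K:
                new_defaults.add(bank)
        defaulted |= new_defaults
    return len(defaulted) / G.number_of_nodes()

fraction = simulate_contagion(G)
print("fraction bankrupt:", fraction, "contagion:", fraction > 0.05)
```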
This is a "hard" question to answer. I still do not find any clue to the code. Then I rewrite the code in R and run it, here is the draft figure I got: As you can see now the figure is just the one in the paper. But the algorithm and the structure were totally the same with I wrote it in Python. Maybe this is a case that shows Python just can't do. If anyone feels interesting of this question, want to divide more among the difference between Python and R, this is a great example. And I am very happy to offer any helps. By the way the model code in R is available in GitHub and it still under update. For those who spent time on reading my description, Thanks for your time. update: I am also can't believe this because in my view the code is doing count and calculate that's simple. I print out very thing at every step and check every node, from 10 nodes small network to 1000 nodes network, the log file reached over 50G. All looks normal and the number (bankrupt one) just not reach the threshold. Not like in R, with totally the same structure, the results is just same with the paper. I really don't know why and have no idea.
How to generate all legal state-action pairs of connect four?
Consider a standard 7*6 board. Suppose I want to apply the Q-learning algorithm. For applying it, I need a set of all possible states and actions. There can be 3^(7*6) = 3^42 ≈ 1.09 * 10^20 board configurations. Since it's not feasible to store these, I am only considering legal states. How can I generate Q(s,a) for all the legal states and actions? This is not my homework; I am trying to learn about reinforcement learning algorithms. I have been searching about this for two days. The closest thing I have come to is to consider only the legal states.
There are three processes you need to set up: one that generates the next move, one that applies the move and produces the resulting board, and lastly one that evaluates 4x4 blocks of the board through a series of checks to see if there is a winner. Numpy and scipy will help with this. Set up a Numpy array of zeroes; change an entry to 1 for player 1's moves and -1 for moves made by player 2. The 4x4 check sums over the x axis, then the y axis, and then the two diagonals; if abs(sum(axis)) == 4, yield the board before reaching the end of the game. This may create duplicates depending on the implementation, so put all of these in a set at the end.

Edit (due to comments and the question modification): You need to use generators and do a depth-first search. There is a maximum of 7 possible branches for any state, with up to 42 moves in a game. You are only looking for winning or losing states to store (don't save stalemates, as they take the most memory). The states can be two sets of locations, one for each player. When you step forward and find a winning/losing state, store the state with its value, then step backward to the previous move and update the value there, storing this as well. There are 69 possible four-in-a-row lines on the board, and I don't know how many states are associated with each, so I'm not sure how many steps away from winning you want to store.
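A small sketch of the board representation and the sliding 4x4 window check described above (function names are mine, not from any library):

```python
import numpy as np

ROWS, COLS = 6, 7
board = np.zeros((ROWS, COLS), dtype=int)   # 1 = player 1, -1 = player 2, 0 = empty

def block_has_winner(block):
    """A 4x4 block: any row, column, or diagonal summing to +/-4 is four in a row."""
    sums = np.concatenate([
        block.sum(axis=0),                               # 4 column sums
        block.sum(axis=1),                               # 4 row sums
        [np.trace(block), np.trace(np.fliplr(block))],   # both diagonals
    ])
    return np.any(np.abs(sums) == 4)

def has_winner(board):
    # Slide the 4x4 window over every position; any four-in-a-row fits in some window.
    return any(block_has_winner(board[r:r + 4, c:c + 4])
               for r in range(ROWS - 3) for c in range(COLS - 3))

# Example: four player-1 pieces in the bottom row.
board[5, 0:4] = 1
print(has_winner(board))   # True
```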
Applying Hidden Markov Models in Python
I recently had a homework assignment in my computational biology class in which I had to apply an HMM. Although I think I understand HMMs, I couldn't manage to apply them to my code. Basically, I had to apply the Fair Bet Casino problem to CpG islands in DNA, using observed outcomes to predict hidden states via a transition matrix. I managed to come up with a solution that worked correctly... but unfortunately in exponential time. My solution pseudocode looked like this (note: trans_matrix contains both the probabilities of going from one state to the other state, and the probabilities of going from one observation to the other observation):

    HMM(trans_matrix, observations):
        movements = {}
        for i in range(0, len(observations)):
            movements[i] = # list of 4 probabilities, all possible state changes
        poss_paths = []
        for m in movements.keys():
            # add two new paths for every path currently in poss_paths
            # (state either changes or doesn't)
            # store a running probability for each path
        correct_path = poss_paths[index_of_highest_probability]
        return correct_path

I realize that I go into exponential time when I look at all the possible paths (adding two new paths for every path currently in poss_paths). I'm not sure how to find the path of highest probability without looking at all the paths, though. Thank you!

EDIT: I did find a lot of info about the Viterbi algorithm when I was doing the assignment, but I was confused as to how it actually gives the best answer. It seems like Viterbi (I was looking at the forward algorithm specifically, I think) looks at a specific position, moves forward a position or two, and then decides the "correct" next path increment having looked at only a few subsequent probabilities. I may be understanding this wrong; is this how Viterbi works? Pseudocode is helpful. Thank you!
One benefit of Hidden Markov Models is that you can generally do what you need without considering all possible paths one by one. What you are trying to do looks like an expensive way of finding the single most probable path, which you can do by dynamic programming under the name of the Viterbi algorithm - see e.g. http://cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html. There are other interesting things covered in documents like this which are not quite the same, such as working out the probabilities for the hidden state at a single position, or at all single positions. Very often this involves something called alpha and beta passes, which are a good search term, along with Hidden Markov Models.

There is a long description at http://en.wikipedia.org/wiki/Viterbi_algorithm with mathematical pseudo-code and what I think is Python as well. Like most of these algorithms, it uses the Markov property that once you know the hidden state at a point you know everything you need to answer questions about that point in time - you don't need to know the past history. As in dynamic programming, you work from left to right along the data, using answers computed for output k-1 to work out answers for output k.

What you want to work out at point k, for each state j, is the probability of the observed data up to and including that point, along the most likely path that ends up in state j at point k. That probability is the product of the probability of the observed data at k given state j, times the probability of the transition from some previous state at time k-1 to j, times the probability of all of the observed data up to and including point k-1 given that you ended up at that previous state at time k-1 - this last bit is something you have just computed for time k-1. You consider all possible previous states and pick the one that gives you the highest combined probability. That gives you the answer for state j at time k, and you save the previous state that gave you the best answer.

This may look like you are just fiddling around with outputs for k and k-1, but you now have an answer for time k that reflects all the data up to and including time k. You carry this on until k is the last point in your data, at which point you have answers for the probabilities of each final state given all the data. Pick the state at this time which gives you the highest probability, and then trace all the way back using the info you saved about which previous state at time k-1 you used to compute the probability for the data up to k while in state j at time k.
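A compact sketch of that dynamic program for a generic discrete HMM (not tied to the Fair Bet Casino specifics; variable names and the toy coin probabilities are mine):

```python
import numpy as np

def viterbi(observations, start_p, trans_p, emit_p):
    """observations: list of observation indices.
    start_p[s], trans_p[s_prev, s], emit_p[s, obs]: model probabilities.
    Returns the most likely hidden state sequence."""
    n_states, T = len(start_p), len(observations)
    best = np.full((T, n_states), -np.inf)   # log-prob of best path ending in state s at time t
    back = np.zeros((T, n_states), dtype=int)

    best[0] = np.log(start_p) + np.log(emit_p[:, observations[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = best[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            best[t, s] = scores[back[t, s]] + np.log(emit_p[s, observations[t]])

    # Trace back from the best final state.
    path = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy fair/biased-coin example (probabilities are made up for illustration):
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.5, 0.5],      # fair coin:   P(tails), P(heads)
                 [0.25, 0.75]])   # biased coin
obs = [1, 1, 1, 0, 1, 1]          # 0 = tails, 1 = heads
print(viterbi(obs, start, trans, emit))
```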
Utilising Genetic algorithm to overcome different size datasets in model
So I realise the question I am asking here is large and complex: a potential solution to variances in the sizes of datasets. In all of my searching through statistical forums and posts I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.

The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following csv (ultra simplified, but hopefully it demonstrates the principle).

Example data:

    Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
    12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
    12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
    12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
    13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
    13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
    13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
    13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
    13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
    14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
    14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
    14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
    14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
    14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
    15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
    15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
    15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
    15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
    16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
    16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
    16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
    16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
    16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
    17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
    17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
    17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
    17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
    17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527

So I have created a tiny model dataset, which contains some good examples of where my current methods fall short and how I feel a genetic algorithm can be used to fix this. If we look at the dataset above, it contains 6 unique classes; the ultimate objective of the algorithm is to create as high a correspondence as possible between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values; now these are comparatively large x/y values if you compare with the average (note the average isn't calculated from this dataset), but it would be common sense to expect that the 3000/9610 is more significant and therefore more likely to have an achieved rank of 1 than the 300/961. So what I want to do is make an adjusted x/y to overcome these differences in dataset sizes, using a logarithmic growth relationship defined by the equation:

    adjusted_xy = (1 - exp(-y * α)) * (x/y) + (1 - (1 - exp(-y * α))) * Average_xy

where α is the only dynamic number.

Let me explain my logic a little and open myself up to (hopefully) constructive criticism.
The graph below shows the exponential growth relationship between the size of the dataset and the "% of x/y contributed" to the adjusted x/y. Essentially, what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger; whatever percentage is left is made up by the average xy. It could hypothetically be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y that clearly reflects the difference in dataset size.

To help with understanding: lowering α would produce a relationship whereby a larger dataset would be required to achieve the same "% of x/y contributed". Conversely, increasing α would produce a relationship whereby a smaller dataset would be required to achieve the same "% of x/y contributed".

So, I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of Python. If additional detail or further clarification about the problem or methods is required, please do ask; I really want to be able to solve this problem and future problems of this nature.

After much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm to iterate α in order to maximise the homology/correspondence between the rank of an adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone were able to help in that department.

So, to clarify, this post is no longer a discussion about methodology. I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation

    adjusted_xy = (1 - exp(-y * α)) * (x/y) + (1 - (1 - exp(-y * α))) * Average_xy

where adjusted_xy applies to each row of the csv. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is within each Unique class only) and the Achieved rank. Minimising this value would maximise the homology and essentially solve the problem presented to me of different-size datasets. If any more information is required please ask; I check this post about 20 times a day at the moment, so I should reply rather promptly. Many thanks, SMNALLY.
The problem you are facing sounds to me like the "bias-variance dilemma" from a general point of view: in a nutshell, a more precise model favours variance (sensitivity to change in a single training set), while a more general model favours bias (the model works for many training sets). May I suggest not focusing on a GA but looking at Instance-Based Learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point, and particularly those slides.

[EDIT] After a second reading, here is my second understanding: you have a set of example data with two related attributes X and Y. You do not want X/Y to dominate when Y is small (considered as less representative). As a consequence you want to "weight" the examples with an adapted value adjusted_xy. You want adjusted_xy to be related to a third attribute R (rank) - related such that, per class, adjusted_xy is sorted like R. To do so you suggest putting it as an optimization problem, searching for PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy, with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.

Your question, at least for me, is in the field of attribute selection/attribute adaptation (I guess the dataset will later be used for supervised learning). One problem that I see in your approach (if I have understood it well) is that, at the end, rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.

Once this is said, I think you surely know how a GA works. You have to define the content of the chromosome: this appears to be your alpha parameter. You also have to define an appropriate fitness function; the fitness function for one individual can be the sum of distances over all examples of the dataset. As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be better adapted than a GA. As solving optimization problems is CPU-intensive, you might eventually consider C or Java instead of Python (as the fitness function, at least, will be interpreted and thus cost a lot). Alternatively, I would look at using Y as a weight for some supervised learning algorithm (if supervised learning is the target).
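A rough sketch of that chromosome/fitness setup for the single parameter α (the CSV filename, the column renaming, and the deliberately tiny GA loop are my assumptions, not a polished implementation):

```python
import numpy as np
import pandas as pd

# Assumed to be the example CSV from the question, saved locally.
df = pd.read_csv("pupils.csv")
df.columns = ["date", "pupil", "uclass", "achieved_rank", "x", "y", "xy", "avg_xy"]

def fitness(alpha):
    """Lower is better: total rank disagreement between the adjusted x/y and
    the achieved rank, computed within each unique class."""
    w = 1 - np.exp(-df["y"] * alpha)
    adj = w * df["xy"] + (1 - w) * df["avg_xy"]
    pred_rank = adj.groupby(df["uclass"]).rank(ascending=False, method="first")
    return (pred_rank - df["achieved_rank"]).abs().sum()

# Tiny genetic algorithm over alpha: selection of the best, multiplicative mutation.
rng = np.random.default_rng(0)
pop = rng.uniform(0.0001, 1.0, size=30)             # initial population of alphas
for generation in range(100):
    scores = np.array([fitness(a) for a in pop])
    parents = pop[np.argsort(scores)][:10]          # keep the 10 fittest
    children = rng.choice(parents, size=20) * rng.normal(1.0, 0.2, size=20)
    pop = np.concatenate([parents, np.abs(children)])

best_alpha = pop[np.argmin([fitness(a) for a in pop])]
print(best_alpha, fitness(best_alpha))
```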
Let's start with the problem: you consider the fact that some features lead to some of your classes a 'strike'. You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate in the first place. You are also commenting on the effect of some samples biasing your 'strike' estimate.

The immediate answer is that it looks like you have a lot of variation in your data, and therefore you will, in one way or another, need to collect more data to account for that variation (that is, variation that is inherent to the problem). The fact that in some cases the numbers end up as 'unusable cases' could also be down to outliers, that is, measurements that are 'out of bounds' for a number of reasons and which you would have to find a way to either exclude or re-adjust. But this depends a lot on the context of the problem.

'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates if they come from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at Statistical Power and how the sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.

Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features with some more that do not appear all the time. The question now is which classifier you use to achieve this. Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayes classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a priori probabilities. When the classifier is 'running' it will use the features of new data to construct the likelihood that the patient who provided this data should be assigned to each class. Perhaps the most common example of such a classifier is the spam-email detector, where the likelihood that an email is spam is judged on the existence of specific words in the email (and a suitable training dataset that provides a good starting point, of course).

Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, which could potentially help you with the differences in the size of the datasets. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book and community are very helpful.
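If you would rather try the Naive Bayes idea directly in Python instead of going through Weka, a minimal scikit-learn sketch might look like this (the feature matrix and labels are random placeholders, not your data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Placeholder data: rows are items, columns are features, y holds class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)

clf = GaussianNB()
print(cross_val_score(clf, X, y, cv=5).mean())   # rough accuracy estimate

clf.fit(X, y)
print(clf.predict_proba(X[:3]))   # per-class likelihoods for new items
```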
No. Don't use a genetic algorithm. The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points, but the less this fit will mean, especially since for some groups your sample sizes are small and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.

You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles, and this is not a puzzle of finding the best line through the dots. You are searching for a model that makes sense and brings understanding of the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding. Take the problem back where it belongs and ask the statisticians instead, for a good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you correctly) most of the samples are. It will be able to explain, in hard probabilities, how likely the deviations on the left are, and tell you whether they are significant or not.

If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple of simulation runs for a population like your subjects. See if the data looks like the data you're looking at, or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
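For example, a quick simulation along those lines might look like the following (my own interpretation: a constant "true" rate with binomial sampling noise; the sample sizes and rate are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simplest model: one true strike rate, observed through finite samples of size y.
true_rate = 0.08527                       # roughly the average x/y in the question
sample_sizes = rng.integers(1, 400, 30)   # mimics the wide spread of y values
observed_rate = rng.binomial(sample_sizes, true_rate) / sample_sizes

for y, r in zip(sample_sizes, observed_rate):
    print(f"y={int(y):4d}  observed x/y={float(r):.3f}")
# Small y values scatter wildly around 0.085, large y values cluster near it;
# compare that pattern against the real data on the left of the graph.
```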
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variant, and therefore hard to model, you could create multiple forests from different random samples of your large dataset and put a control layer on top to classify data against all the forests, then take the best score. I don't write Python, but I found this link http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html which may give you something to play with.
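A minimal sketch of that idea with scikit-learn (the data is a random placeholder, and averaging class probabilities stands in for the "control layer"):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))        # placeholder features
y = rng.integers(0, 4, size=500)     # placeholder class labels

# Several forests trained on different random subsamples, as suggested above.
forests = []
for seed in range(3):
    idx = rng.choice(len(X), size=300, replace=False)
    forests.append(RandomForestClassifier(n_estimators=100, random_state=seed).fit(X[idx], y[idx]))

# "Control layer": average the predicted class probabilities across the forests.
probs = np.mean([f.predict_proba(X[:5]) for f in forests], axis=0)
print(probs.argmax(axis=1))
```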
Following Occam's razor, you should select a simpler model for a small dataset and may want to switch to a more complex model as your dataset grows. There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but you can never tell what the acceptable value of N is. Thus, build several models and pick the one with the better tradeoff of predictive power and simplicity, using the Akaike information criterion (AIC). It has useful properties and is not too hard to understand. :) There are other tests of course, but AIC should get you started. For a simple test, check out the p-value.
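As an illustration of comparing models by AIC (using statsmodels and made-up data; lower AIC wins, and the extra parameter has to "pay for itself"):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + rng.normal(0, 3, 100)          # made-up data that is truly linear

# Model 1: simple linear fit. Model 2: adds a quadratic term.
X1 = sm.add_constant(x)
X2 = sm.add_constant(np.column_stack([x, x ** 2]))
fit1 = sm.OLS(y, X1).fit()
fit2 = sm.OLS(y, X2).fit()

# AIC = 2k - 2 * log-likelihood; the simpler model should usually win here.
print("linear AIC:   ", fit1.aic)
print("quadratic AIC:", fit2.aic)
```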