How can I construct a likelihood function? [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I have a bunch of data about vampires and non-vampires. I have a matrix with 2000 subjects, which houses statistics about the subject.
#[height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]
if IS_VAMPIRE is 1, then the subject is a vampire, and 0 otherwise. I have a couple ideas about how I can construct a function to tell me if a new subject is a vampire or not, but I was wondering if anyone had any really good ideas I could pursue.

You can use one of the classifier algorithms in scikit-learn. If your bunch of data is already labeled, with you knowing who is and isn't a vampire, and you just want to classify the new ones, the easiest approach for a someone new to machine learning and scikit-learn is using the decision tree algorithm to build a classifier from your sample data and apply it to new ones.
http://scikit-learn.org/stable/modules/tree.html
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
Where X is a list (or a Numpy array) with all your data fields, except for the boolean is_vampire:
>>> X = [[v0_height, v0_weight, v0_stake_aversion, v0_garlic_aversion,
v0_reflectance, v0_shiny],
[v1_height, v1_weight, v1_stake_aversion, v1_garlic_aversion,
v1_reflectance, v1_shiny],
...
]
And Y is a list with the same length, matching the label for each one:
>>> Y = [v0_is_vampire, v1_is_vampire, ...]
After being fitted, the tree can be used to check if a new one is a vampire by the following call, where new is a sublist like those in X:
>>> clf.predict(new)
array([1])
Depending on how the range of values is distributed along your data, you may or may not need to feed all values you have to get a decent classification. You'll have to experiment a litttle with that.
Keep in mind that if your Y array provides only 1 and 0 values for the is_vampire label, then this approach will give you the same binary response. If your Y array has float values and you want to quantify the probability of a new one being a vampire with a value between 0 and 1, then just use the tree.DecisionTreeRegressor class instead of tree.DecisionTreeClassifier.
By the way, this probably isn't the best algorithm to do what you're asking, but it's quite straightforward and should get you started. If you get wrong results or performance problems, just get more information on what's a better approach for your case. This link can be very helpful: http://peekaboo-vision.blogspot.com.br/2013/01/machine-learning-cheat-sheet-for-scikit.html

I don't know if this will work, but maybe you could try using variables. so, for example, say hight is high(10), weight is low(1), stake aversion is high(10), garlic aversion is high(10), reflectance is high(10) and shininess is high(10). you then add up all those variables, then put the sum into another variable. If the end variable is, for example, 50 or higher, you are certain it is a vampire, making IS_VAMPIRE true/1. you would need some more states to account for the likeliness and it is a big block of code to write i would think, but if it works(i don't know if it will) then it would be nice. then again i am the most noobiest of the noobs when it comes to programming, maybe i am of no help here :/

Related

Finding combination of columns which provides best combination based on function return

I have a dataframe with daily returns 6 portfolios (PORT1, PORT2, PORT3, ... PORT6).
I have defined functions for compound annual returns and risk-adjusted returns. I can run this function for any one PORT.
I want to find a combination of portfolios (assume equal weighting) to obtain the highest returns. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included as I do not think it is necessary to show the computation used to determine risk adj return.
def returns(PORT):
val = ... [computation of return here for PORT]
return val
Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You've six dimensions, and presumably you want to allocate 1 unit of "stuff" across all those six, such that a vector of the allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of numbers, so maybe we only start off with increments of size 0.10. So 10 increments possible, across 6 dimensions, gives you 10^6 possibilities.
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values and pick the best one.
That may not be the answer you want, other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset with some uncertainty - and some potentially difficult conversations with your clients "What do you mean you did it randomly?!".
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
You can do
import itertools
best_return = 0
for r in range(len(PORTS)):
for PORT in itertools.combinations(PORTS,r):
cur_return = returns(PORT)
if cur_return > best_return :
best_return = cur_return
best_PORT = PORT
You can also do
max([max([PORT for PORT in itertools.combinations(PORTS,r)], key = returns)
for r in range(len(PORTS))], key = returns)
However, this is more of an economics question than a CS one. Given a set of positions and their returns and risk, there are explicit formulae to find the optimal portfolio without having to brute force it.

h2o python balance classes

I'm having problems implementing a simple balancing for an H2ORandomForestEstimator, I'm trying to reproduce a simple example found in Darren Cook's book written in R ('Practical Machine Learning with H2O - pag. 107).
Working on the Iris Dataset, firstly I artificially unbalance the target variable cutting out a good share of virginica keeping first 120 rows.
Then I build 3 models, a vanilla one, one where I set balance_classes as True, and a last one where I set balance_classes as True and I input a list for class_sampling_factors to oversample the virginica one. List is [1.0,1.0,2.5], referred to columns sorted alphabetically.
I train them, and then output confusion matrix for train for each one.
I'm expecting an unbalanced output for the first one, and a balanced one for the last two, while I have always the same result. I checked the documentation example in Python, and I can't see anything wrong (I may be tired as well).
This is my code:
data_unb = data[1:120,:] # messing up with target variable
train, valid = data_unb.split_frame([0.8], seed=12345)
m1 = h2o.estimators.random_forest.H2ORandomForestEstimator(seed=12345)
m2 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, seed=12345)
m3 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, class_sampling_factors=[1.0,1.0,2.5], seed=12345)
m1.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_defaults')
m2.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_balanced')
m3.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_class_sampling',)
m1.confusion_matrix(train)
m2.confusion_matrix(train)
m3.confusion_matrix(train)
This is my output:
my confusion matrices (wrong)
this is my expected output.
expected confusion matrices
What am I evidently missing? Thanks in advance.
You're not missing anything. The offset_column is available in H2O Random Forest, but it's not actually functional. The bug is documented here and should be fixed in the next stable release of H2O. Sorry about the confusion!
It should work for the rest of the H2O algos (except XGBoost). If you wanted to try on a GBM, for example, you'd see it working.

Linear regression with Python

I am studying "Building Machine Learning System With Python (2nd)".
I have a silly doubt in very first chapters' answer part.
According to the book and based on my observation I always get 2nd order polynomial as the best fitting curve.
whenever I train my system with training dataset, I get different Test error for different Polynomial Function.
Thus my parameters of the equation also differs.
But surprisingly, I get approximately same answer every time in the range 9.19-9.99 .
My final hypothesis function each time have different parameters but I get approximately same answer.
Can anyone tell me the reason behind it?
[FYI:I am finding answer for y=100000]
I am sharing the code sample and the output of each iteration.
Here are the errors and the corresponding answers with it:
https://i.stack.imgur.com/alVzU.png
https://i.stack.imgur.com/JVGSm.png
https://i.stack.imgur.com/RB53X.png
Thanks in advance!
def error(f, x, y):
return sp.sum((f(x)-y)**2)
import scipy as sp
import matplotlib.pyplot as mp
data=sp.genfromtxt("web_traffic.tsv",delimiter="\t")
x=data[:,0]
y=data[:,1]
x=x[~sp.isnan(y)]
y=y[~sp.isnan(y)]
mp.scatter(x,y,s=10)
mp.title("web traffic over the month")
mp.xlabel("week")
mp.ylabel("hits/hour")
mp.xticks([w*24*7 for w in range(10)],["week %i"%i for i in range(10)])
mp.autoscale(enable=True,tight=True)
mp.grid(color='b',linestyle='-',linewidth=1)
mp.show()
infletion=int(3.5*7*24)
xa=x[infletion:]
ya=y[infletion:]
f1=sp.poly1d(sp.polyfit(xa,ya,1))
f2=sp.poly1d(sp.polyfit(xa,ya,2))
f3=sp.poly1d(sp.polyfit(xa,ya,3))
print(error(f1,xa,ya))
print(error(f2,xa,ya))
print(error(f3,xa,ya))
fx=sp.linspace(0,xa[-1],1000)
mp.plot(fx,f1(fx),linewidth=1)
mp.plot(fx,f2(fx),linewidth=2)
mp.plot(fx,f3(fx),linewidth=3)
frac=0.3
partition=int(frac*len(xa))
shuffled=sp.random.permutation(list(range(len(xa))))
test=sorted(shuffled[:partition])
train=sorted(shuffled[partition:])
fbt1=sp.poly1d(sp.polyfit(xa[train],ya[train],1))
fbt2=sp.poly1d(sp.polyfit(xa[train],ya[train],2))
fbt3=sp.poly1d(sp.polyfit(xa[train],ya[train],3))
fbt4=sp.poly1d(sp.polyfit(xa[train],ya[train],4))
print ("error in fbt1:%f"%error(fbt1,xa[test],ya[test]))
print ("error in fbt2:%f"%error(fbt2,xa[test],ya[test]))
print ("error in fbt3:%f"%error(fbt3,xa[test],ya[test]))
from scipy.optimize import fsolve
print (fbt2)
print (fbt2-100000)
maxreach=fsolve(fbt2-100000,x0=800)/(7*24)
print ("ans:%f"%maxreach)
Don't do this like that.
Linear regression is more "up to you" than you think.
Start by getting the slope of the line, (#1) average((f(x2)-f(x))/(x2-x))
Then use that answer as M to (#2) average(f(x)-M*x).
Now you have (#1) and (#2) as your regression.
For any type of regression similar to this ex, Polynomial,
you need to subtract the A-Factor (First Factor), by using the n super-delta of f(x) with every one with respect to delta(x). Ex. delta(ax^2+bx+c)/delta(x) gives you a equation with a and b, and from there it works. When doing this take the average every time if there is more entries. Do It like a window on a paper sliding down. Ex. You select entries 1-10, then 2-11,3-12 etc for some crazy awesome regression. You may want to create a matrix API. The best way to handle it, is first create a API that takes a row and a column out first. THEN you fool around with that to automate it. The Ratios of the in-out entries left in only 2 cols, is averaged and is the solution to the coefficient. Then Make a program to take rows out but for example leave row 1 & row 5 (OUTPUT), then row 2,row 5... row 4 and row 5. I wouldn't recommend python for coding this. I recommend C programming, because It prevents you from making dirty arrays that you don't remember. Systems-Theory you need to understand. You must create system-by-system. It is insane to code matrices without building automated sub-systems that are carefully tested. I failed until I worked on it in C, so I already made a 1 time shrinking function that is carefully tested, then built systems to automate getting 1 coefficient, tested that, then automated the repetition of that program to solve it. You won't understand any of this by using python or similar shortcuts. You use them after you realize what they really are. That's how I learned. I still am like how did I code that? I still am amazed. Problem is though, it's unstable above 4x4 (actually 4x5) matrices.
Good Luck,
Misha Taylor

Python - Generating/iterating a combination

I'd like to start by saying that I'm fairly new to Python and am very interested in coding in general. I have some familiarity with basic concepts, but the functions used specifically in Python are largely unknown to me. Fwiw I'm a visual/experiential learner.
I'd also like to state right off the bat that I apologize if this question has been asked and already answered. I found alot of "similar" questions, but none that really helped me find a solution.
Problem
As stated in my topic title I'm interested in creating an "output" (correct terminology?) of an nCr into a list or whatever is the best method to view it.
It would be a 10 choose 5. The ten variables could be numbers, letters, names, words, etc. There are no repeats and no order to the combinations.
Research
I would like to say that I'd looked at similar topics like this question/answer and found the concepts helpful, but have discovered that in the example code:
from itertools import izip
reduce(lambda x, y: x * y[0] / y[1], izip(xrange(n - r + 1, n+1), xrange(1, r+1)), 1)
The reduce tool/function isn't used that way anymore. I think I read that it changed to functools as of Python 3.
Question
How would the above code (examples are helpful) or any other be updated/changed to accommodate for reduce? Are there any other ways to output the combination results?
***Edit***
I think I didn't clearly connect the content of the Problem and Question headings. Basically my main question is under the Problem heading. While it is helpful to see how a person can use the itertools to make combinations from a list, I don't have any idea how to output the 10 choose 5. :\
Are you trying to do:
>>> import itertools
>>> for combination in itertools.combinations('ABCDEFGHIJ', 5):
>>> print(''.join(combination))
ABCDE
ABCDF
ABCDG
ABCDH
ABCDI
ABCDJ
ABCEF
ABCEG
ABCEH
ABCEI
If so, it's done. If you want to learn how to implement the combinations function yourself, you're lucky because the itertools.combinations documentation gives an implementation.

How to Sort Arrays in Dictionary?

I'm currently writing a program in Python to track statistics on video games. An example of the dictionary I'm using to track the scores :
ten = 1
sec = 9
fir = 10
thi5 = 6
sec5 = 8
games = {
'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"],
'nethack': [fir+fir+fir+sec+thi5, "Nethack"]
}
Right now, I'm going about this the hard way, and making a big long list of nested ifs, but I don't think that's the proper way to go about it. I was trying to figure out a way to sort the dictionary, via the arrays, and then, finding a way to display the first ten that pop up... instead of having to work deep in the if statements.
So... basically, my question is : Do you have any ideas that I could use to about making this easier, instead of wayyyy, way harder?
===== EDIT ====
the ten+fir produces numbers. I want to find a way to go about sorting the lists (I lack the knowledge of proper terminology) to go by the number (basically, whichever ones have the highest number in the first part of the array go first.
Here's an example of my current way of going about it (though, it's incomplete, due to it being very tiresome : Example Nests (paste2) (let's try this one?)
==== SECOND EDIT ====
In case someone doesn't see my comment below :
ten, fir, et cetera - these are just variables for scores. Basically, it goes from a top ten list into a variable number.
ten = 1, nin = 2, fir = 10, fir5 = 10, sec5 = 8, sec = 9...
so : 'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"] actually registers as : 'adom': [1+10+9+8, "Ancient Domain of Mysteries"] , which ends up looking like :
'adom': [28, "Ancient Domain of Mysteries"]
So, basically, if I ended up doing the "top two" out of my example, it'd be :
((1)) Nethack (48)
((2)) ADOM (28)
I'd write an actual number, but I'm thinking of changing a few things up, so the numbers might be a touch different, and I wouldn't want to rewrite it.
== THIRD (AND HOPEFULLY THE FINAL) EDIT ==
Fixed my original code example.
How about something like this:
scores = games.items()
scores.sort(key = lambda key, value: value[0])
return scores[:10]
This will return the first 10 items, sorted by the first item in the array.
I'm not sure if this is what you want though, please update the question (and fix the example link) if you need something else...
import heapq
return heapq.nlargest(10, games.iteritems(), key=lambda k, v: v[0])
is the most direct way to get the top ten key / value pairs, sorted by the first item of each "value" list. If you can define more precisely what output you want (just the names, the name / value pairs, or what else?) and the sorting criterion, this is easy to adjust, of course.
Wim's solution is good, but I'd say that you should probably go the extra mile and push this work off onto a database, rather than relying on Python. Python interfaces well with most types of databases, where much of what you're exploring is already a solved problem.
For example, instead of worrying about shifting your dictionaries to various other data types in order to properly sort them, you can simply get all the data for each pertinent entry pre-sorted based on the criteria of your query. There goes the need for convoluted sorting and resorting right there.
While dictionaries are tempting to use, because they give the illusion of database-like abilities to access data based on its attributes, I still think they stumble quite a bit with respect to implementation. I don't really have any numbers to throw at you, but just from personal experience, anything you do on Python when it comes to manipulating large amounts of data, you can do much faster and more efficient both in code and computation with something like MySQL.
I'm not sure what you have planned as far as the structure of your data goes, but along with adding data, changing its structure is a lot easier using a database, too.

Categories

Resources