I'm having problems implementing simple class balancing for an H2ORandomForestEstimator. I'm trying to reproduce a simple example, written in R, from Darren Cook's book 'Practical Machine Learning with H2O' (p. 107).
Working on the Iris dataset, I first artificially unbalance the target variable by keeping only the first 120 rows, which cuts out a good share of the virginica class.
Then I build three models: a vanilla one, one where I set balance_classes to True, and a last one where I set balance_classes to True and pass a list for class_sampling_factors to oversample virginica. The list is [1.0,1.0,2.5], with the entries referring to the classes sorted alphabetically.
I train them, and then output the confusion matrix on the training data for each one.
I'm expecting an unbalanced output for the first one and a balanced one for the last two, but I always get the same result. I checked the documentation example in Python, and I can't see anything wrong (I may be tired as well).
This is my code:
data_unb = data[1:120,:] # messing up with target variable
train, valid = data_unb.split_frame([0.8], seed=12345)
m1 = h2o.estimators.random_forest.H2ORandomForestEstimator(seed=12345)
m2 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, seed=12345)
m3 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, class_sampling_factors=[1.0,1.0,2.5], seed=12345)
m1.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_defaults')
m2.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_balanced')
m3.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_class_sampling',)
m1.confusion_matrix(train)
m2.confusion_matrix(train)
m3.confusion_matrix(train)
This is my output:
[Image: my confusion matrices (wrong)]
This is my expected output:
[Image: expected confusion matrices]
What am I evidently missing? Thanks in advance.
You're not missing anything. The balance_classes argument is available in H2O Random Forest, but it's not actually functional. The bug is documented here and should be fixed in the next stable release of H2O. Sorry about the confusion!
It should work for the rest of the H2O algos (except XGBoost). If you wanted to try on a GBM, for example, you'd see it working.
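For reference, the arithmetic behind the class_sampling_factors in the question can be sketched in plain Python. This only illustrates the intended row counts per class and is not how H2O resamples internally. It assumes the standard Iris ordering (50 rows per species, virginica last), so that the first 120 rows hold 50 setosa, 50 versicolor and 20 virginica.

```python
from collections import Counter

# Assumption: standard Iris ordering, truncated to the first 120 rows.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 20

# Factors are matched to the class names in alphabetical order.
class_sampling_factors = [1.0, 1.0, 2.5]

counts = Counter(labels)
classes = sorted(counts)

# Row count each class would have after applying its sampling factor.
target_counts = {
    cls: round(counts[cls] * factor)
    for cls, factor in zip(classes, class_sampling_factors)
}
print(target_counts)
```

Under these assumptions a factor of 2.5 brings virginica from 20 rows up to 50, level with the other two classes, which is why a balanced confusion matrix is the expected outcome.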
I'm a new PyTorch user with moderate experience with TensorFlow/Keras. The PyTorch examples are fantastic. I've worked through the demand forecasting tutorial using the Temporal Fusion Transformer (https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html).
It all makes sense, but I haven't figured out how to save the predicted values in notebook section #20 to a NumPy array.
Section #20:
new_raw_predictions, new_x = best_tft.predict(new_prediction_data, mode="raw", return_x=True)
I can see the values in the tensors with print(new_raw_predictions), like this:
{'prediction': tensor([[[3.4951e+00, 1.7341e+01, 2.7446e+01, ..., 6.3175e+01,
9.0240e+01, 1.2589e+02],
[1.1698e+01, 2.3643e+01, 3.3291e+01, ..., 6.6374e+01,
9.1148e+01, 1.3173e+02],
I've seen some similar questions asked here, but none of the answers seem to work. All attempts result in a similar error, so I'm missing something fundamental about PyTorch and the output tensor; I always get 'AttributeError: 'dict' object has no attribute new_raw_predictions'
A few examples of what I've tried:
new_raw_predictions.cpu().numpy()
new_raw_predictions.detach().cpu().numpy()
new_raw_predictions.numpy()
The goal is to save the predicted output so I can compare changes to the model. Thanks in advance!
It all depends on how you've created your model, because PyTorch can return values however you specify. In your case, it looks like it returns a dictionary, of which 'prediction' is a key. You can convert to NumPy using the command you supplied above, but with one change: index the dictionary first.
preds = new_raw_predictions['prediction'].detach().cpu().numpy()
Of course, if the tensor is already on the CPU you can drop .cpu(), and if it does not require gradients you can drop .detach(); in that case .numpy() alone is enough.
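A minimal sketch of the whole round trip, assuming the raw output really is a plain dict with a 'prediction' tensor as shown in the question. The tensor here is a made-up stand-in (random values, arbitrary shape) and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np
import torch

# Made-up stand-in for the dict returned by predict(mode="raw");
# the real tensor shape depends on your dataset and model.
new_raw_predictions = {"prediction": torch.rand(2, 6, 7, requires_grad=True)}

# Pick the tensor out of the dict first, then convert it.
preds = new_raw_predictions["prediction"].detach().cpu().numpy()

# Save to disk and load it back, e.g. to compare model runs later.
path = os.path.join(tempfile.mkdtemp(), "tft_predictions.npy")
np.save(path, preds)
reloaded = np.load(path)
print(preds.shape, np.array_equal(preds, reloaded))
```

Calling .detach().cpu() is harmless when the tensor is already on the CPU and has no gradient attached, so keeping both calls makes the same line work in either setup.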
I have seen this topic come up many times on the Internet, but I have never seen a complete, comprehensive solution that works across the board for all use cases with the current versions of sklearn. Could somebody please explain how this should be done, using the following example?
In this example I'm using the following dataset
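import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('heart.csv')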
# Preparing individual pipelines for numerical and categorical features
pipe_numeric = Pipeline(steps=[
('impute_num', SimpleImputer(
missing_values = np.nan,
strategy = 'median',
copy = False,
add_indicator = True)
)
])
pipe_categorical = Pipeline(steps=[
('impute_cat', SimpleImputer(
missing_values = np.nan,
strategy = 'constant',
fill_value = 99999,
copy = False)
),
('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
# Combining them into a transformer
transformer_union = ColumnTransformer([
('feat_numeric', pipe_numeric, ['age']),
('feat_categorical', pipe_categorical, ['cp']),
], remainder = 'passthrough')
# Fitting the transformer
transformer_union.fit(data)
# We can then apply and get the data in the following way
transformer_union.transform(data)
# And it has the following shape
transformer_union.transform(data).shape
Now comes the main question: how can I efficiently combine the output NumPy array with the new column names that result from all the transformations? This example, even though it would require quite some work, is still relatively simple, but things can get much more complicated with bigger pipelines.
# Transformers object
transformers = transformer_union.named_transformers_
# Categorical features (from transformer)
transformers['feat_categorical'].named_steps['one_hot'].get_feature_names()
# Numerical features (from transformer) - no names are available?
transformers['feat_numeric'].named_steps['impute_num']
# All the other columns that were not transformed - no names are available?
transformers['remainder']
I've checked all kinds of different examples, and there doesn't seem to be any silver bullet for this:
sklearn doesn't support this natively - there's no way to get an aligned vector of column names that could be easily combined with the array into a new DF, but perhaps I'm mistaken - could anyone point me to a resource if that's the case?
Some people implement their own custom transformers/pipelines, but this gets a bit hectic when you want to build large pipelines.
Are there any other sklearn-related packages that alleviate that issue?
I'm a little bit surprised by how sklearn manages that - in R in the tidymodels ecosystem (it's still under development, but nevertheless), this is handled very easily with the prep and bake methods. I would imagine it could somehow be done similarly.
Inspecting the final output in its entirety is vital to the data science work - could anyone advise on the best path?
The sklearn devs are working on this; the discussion spans several SLEPs and many issues. There is already some progress, with some transformers implementing get_feature_names and others having internal attributes that track column names when the input was a pandas dataframe. ColumnTransformer does have a get_feature_names, but Pipeline does not, so it would fail on your example.
The most complete current solution seems to be sklearn-pandas:
https://github.com/scikit-learn-contrib/sklearn-pandas
Another interesting approach is hidden away inside eli5. In their explain_weights, they have a generic function transform_feature_names. It has a few specialized dispatches, but otherwise tries to call get_feature_names; most notably, there is a dispatch for Pipeline. Unfortunately, currently this will fail on a ColumnTransformer with a Pipeline as a transformer; see https://stackoverflow.com/a/62124484/10495893 for an example and a potential workaround.
I am studying "Building Machine Learning Systems with Python" (2nd edition).
I have a silly doubt about the answer part of the very first chapter.
According to the book, and based on my own observation, I always get the 2nd-order polynomial as the best-fitting curve.
Whenever I train my system with the training dataset, I get different test errors for the different polynomial functions.
Thus the parameters of my equation also differ.
But surprisingly, I get approximately the same answer every time, in the range 9.19-9.99.
My final hypothesis function has different parameters each time, but I get approximately the same answer.
Can anyone tell me the reason behind it?
[FYI: I am solving for y=100000.]
I am sharing the code sample and the output of each iteration.
Here are the errors and the corresponding answers with it:
https://i.stack.imgur.com/alVzU.png
https://i.stack.imgur.com/JVGSm.png
https://i.stack.imgur.com/RB53X.png
Thanks in advance!
import scipy as sp
import matplotlib.pyplot as mp

def error(f, x, y):
    # Sum of squared residuals between the model f and the data
    return sp.sum((f(x)-y)**2)
data=sp.genfromtxt("web_traffic.tsv",delimiter="\t")
x=data[:,0]
y=data[:,1]
x=x[~sp.isnan(y)]
y=y[~sp.isnan(y)]
mp.scatter(x,y,s=10)
mp.title("web traffic over the month")
mp.xlabel("week")
mp.ylabel("hits/hour")
mp.xticks([w*24*7 for w in range(10)],["week %i"%i for i in range(10)])
mp.autoscale(enable=True,tight=True)
mp.grid(color='b',linestyle='-',linewidth=1)
mp.show()
infletion=int(3.5*7*24)
xa=x[infletion:]
ya=y[infletion:]
f1=sp.poly1d(sp.polyfit(xa,ya,1))
f2=sp.poly1d(sp.polyfit(xa,ya,2))
f3=sp.poly1d(sp.polyfit(xa,ya,3))
print(error(f1,xa,ya))
print(error(f2,xa,ya))
print(error(f3,xa,ya))
fx=sp.linspace(0,xa[-1],1000)
mp.plot(fx,f1(fx),linewidth=1)
mp.plot(fx,f2(fx),linewidth=2)
mp.plot(fx,f3(fx),linewidth=3)
frac=0.3
partition=int(frac*len(xa))
shuffled=sp.random.permutation(list(range(len(xa))))
test=sorted(shuffled[:partition])
train=sorted(shuffled[partition:])
fbt1=sp.poly1d(sp.polyfit(xa[train],ya[train],1))
fbt2=sp.poly1d(sp.polyfit(xa[train],ya[train],2))
fbt3=sp.poly1d(sp.polyfit(xa[train],ya[train],3))
fbt4=sp.poly1d(sp.polyfit(xa[train],ya[train],4))
print ("error in fbt1:%f"%error(fbt1,xa[test],ya[test]))
print ("error in fbt2:%f"%error(fbt2,xa[test],ya[test]))
print ("error in fbt3:%f"%error(fbt3,xa[test],ya[test]))
from scipy.optimize import fsolve
print (fbt2)
print (fbt2-100000)
maxreach=fsolve(fbt2-100000,x0=800)/(7*24)
print ("ans:%f"%maxreach)
Don't do it like that.
Linear regression is more "up to you" than you think.
Start by getting the slope of the line: (#1) average((f(x2)-f(x))/(x2-x)).
Then use that answer as M in (#2) average(f(x)-M*x).
Now you have (#1) and (#2) as your regression.
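The two averages above can be sketched in plain Python as follows. Pairing each point with its neighbour is one reading of "x" and "x2", and the function name is made up for illustration. Note that this averaging heuristic is not the least-squares fit that sp.polyfit computes, so on noisy data the two will generally disagree.

```python
from statistics import mean


def averaged_line_fit(xs, ys):
    """Estimate slope M and intercept B using the two averages described above."""
    # (#1) average slope between neighbouring points (x values must be distinct)
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))
    ]
    m = mean(slopes)
    # (#2) average of f(x) - M*x gives the intercept
    b = mean(y - m * x for x, y in zip(xs, ys))
    return m, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]  # points lying exactly on y = 2x + 1
m, b = averaged_line_fit(xs, ys)
print(m, b)
```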
For any similar type of regression, e.g. polynomial, you need to subtract the A-factor (the first factor) by using the n-th super-delta of f(x), each one taken with respect to delta(x). For example, delta(ax^2+bx+c)/delta(x) gives you an equation with a and b, and from there it works. When doing this, take the average every time if there are more entries. Do it like a window sliding down a sheet of paper: select entries 1-10, then 2-11, 3-12, etc., for some crazy awesome regression.
You may want to create a matrix API. The best way to handle it is to first create an API that takes a row and a column out. THEN you fool around with that to automate it. The ratios of the in-out entries left in only 2 columns are averaged, and that is the solution for the coefficient. Then make a program that takes rows out but leaves, for example, row 1 and row 5 (OUTPUT), then row 2 and row 5, ... row 4 and row 5.
I wouldn't recommend Python for coding this. I recommend C, because it prevents you from making dirty arrays that you don't remember. You need to understand systems theory: you must build system by system. It is insane to code matrices without building automated sub-systems that are carefully tested. I failed until I worked on it in C, so I made a one-time shrinking function that is carefully tested, then built systems to automate getting one coefficient, tested that, then automated the repetition of that program to solve it. You won't understand any of this by using Python or similar shortcuts; you use them after you realize what they really are. That's how I learned. I'm still amazed that I coded it. The problem, though, is that it's unstable above 4x4 (actually 4x5) matrices.
Good Luck,
Misha Taylor
I've designed a model using PyMC3, and I'm having some trouble optimizing it with multiple data sets.
The model is a bit similar to the coal-mining disaster model (as in the PyMC3 tutorial, for those who know it), except that there are multiple switchpoints.
The output of the network is a series of real numbers, for instance:
[151,152,150,20,19,18,0,0,0]
with Model() as accrochage_model:
    time=np.linspace(0,n_cycles*data_length,n_cycles*data_length)
    poisson = [Normal('poisson_0',5,1), Normal('poisson_1',10,1)]
    variance=3
    t = [Normal('t_0',0.5,0.01), Normal('t_1',0.7,0.01)]
    taux = [Bernoulli('taux_{}'.format(i),t[i]) for i in range(n_peaks)]
    switchpoint = [Poisson('switchpoint_{}'.format(i),poisson[i])*taux[i] for i in range(n_peaks)]
    peak=[Normal('peak_0',150,2),Normal('peak_1',50,2),Normal('peak_2',0,2)]
    z_init=switch(switchpoint[0]>=time%n_cycles,0,peak[0])
    z_list=[switch(sum(switchpoint[j] for j in range(i))>=time%n_cycles,0,peak[i]-peak[i-1]) for i in range(1,n_peaks)]
    z=(sum(z_list[i] for i in range(len(z_list))))
    z+=z_init
    m =Normal('m', z, variance,observed=data)
I have multiple realisations of the true distribution, and I'd like to take all of them into account while optimizing the parameters of the system.
Right now, the "data" that appears in observed=data is just one list of results, such as:
[151,152,150,20,19,18,0,0,0]
What I would like to do is give not just one but several lists of results,
for instance:
data=([151,152,150,20,19,18,0,0,0],[145,152,150,21,17,19,1,0,0],[151,149,153,17,19,18,0,0,1])
I tried using the shape parameter and making data an array of results but none of it seemed to work.
Does anyone have an idea of how it's possible to do the inference so that the network is optimized for an entire dataset and not a single output?
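One direction that may be worth trying is to stack the realisations into a 2-D array of shape (n_realisations, n_timesteps) and let the 1-D mean z broadcast across the rows. The NumPy sketch below is not PyMC3 code; it only illustrates why that would be equivalent to summing the log-likelihoods of the separate lists. The mean vector z and the standard deviation are made-up values standing in for the model's z and variance.

```python
import numpy as np

# The three realisations from the question, stacked row by row.
data = np.array([
    [151, 152, 150, 20, 19, 18, 0, 0, 0],
    [145, 152, 150, 21, 17, 19, 1, 0, 0],
    [151, 149, 153, 17, 19, 18, 0, 0, 1],
], dtype=float)

# Hypothetical piecewise-constant mean, one value per time step.
z = np.array([150, 150, 150, 20, 20, 20, 0, 0, 0], dtype=float)
sd = 3.0


def normal_logpdf(x, mu, sd):
    """Elementwise log-density of a normal distribution."""
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)


# z has shape (9,), data has shape (3, 9): z is broadcast over the rows.
joint = normal_logpdf(data, z, sd).sum()

# Same quantity computed one realisation at a time.
per_run = sum(normal_logpdf(row, z, sd).sum() for row in data)
print(joint, per_run)
```

Whether passing such a 2-D array as observed=data works unchanged in the model above depends on the shapes of time and z, so treat this as a starting point rather than a confirmed fix.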
I have a bunch of data about vampires and non-vampires. I have a matrix with 2000 subjects, which houses statistics about the subject.
#[height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]
If IS_VAMPIRE is 1, the subject is a vampire; if it is 0, the subject is not. I have a couple of ideas about how to construct a function that tells me whether a new subject is a vampire, but I was wondering if anyone had any really good ideas I could pursue.
You can use one of the classifier algorithms in scikit-learn. If your data is already labeled, so that you know who is and isn't a vampire, and you just want to classify new subjects, the easiest approach for someone new to machine learning and scikit-learn is to use the decision tree algorithm to build a classifier from your sample data and apply it to the new ones.
http://scikit-learn.org/stable/modules/tree.html
>>> from sklearn import tree
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
Where X is a list (or a Numpy array) with all your data fields, except for the boolean is_vampire:
>>> X = [[v0_height, v0_weight, v0_stake_aversion, v0_garlic_aversion,
v0_reflectance, v0_shiny],
[v1_height, v1_weight, v1_stake_aversion, v1_garlic_aversion,
v1_reflectance, v1_shiny],
...
]
And Y is a list with the same length, matching the label for each one:
>>> Y = [v0_is_vampire, v1_is_vampire, ...]
After being fitted, the tree can be used to check whether a new subject is a vampire with the following call, where new is a sublist like those in X (predict expects a list of rows, so new is wrapped in a list):
>>> clf.predict([new])
array([1])
Depending on how the values are distributed across your data, you may or may not need to feed in all the values you have to get a decent classification. You'll have to experiment a little with that.
Keep in mind that if your Y array provides only 1 and 0 values for the is_vampire label, then this approach will give you the same binary response. If your Y array has float values and you want to quantify the likelihood of a new subject being a vampire with a value between 0 and 1, use the tree.DecisionTreeRegressor class instead of tree.DecisionTreeClassifier. With 0/1 labels, the classifier's predict_proba method also returns per-class probabilities.
By the way, this probably isn't the best algorithm to do what you're asking, but it's quite straightforward and should get you started. If you get wrong results or performance problems, just get more information on what's a better approach for your case. This link can be very helpful: http://peekaboo-vision.blogspot.com.br/2013/01/machine-learning-cheat-sheet-for-scikit.html
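Putting the pieces of this answer together, a small runnable sketch might look like the following. The six subjects and all of their values are invented purely for illustration; with real data, X and Y would come from your 2000-row matrix.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up rows: [height, weight, stake aversion, garlic aversion, reflectance, shiny]
X = [
    [180, 70, 9, 10, 0.0, 1],
    [165, 55, 10, 9, 0.1, 0],
    [175, 80, 8, 9, 0.0, 1],
    [170, 65, 2, 1, 0.9, 0],
    [182, 90, 1, 2, 1.0, 1],
    [160, 50, 3, 1, 0.8, 0],
]
Y = [1, 1, 1, 0, 0, 0]  # 1 = vampire, 0 = not a vampire

clf = DecisionTreeClassifier(random_state=0).fit(X, Y)

new = [178, 72, 9, 9, 0.05, 1]
label = clf.predict([new])        # 2-D input: one row per subject
proba = clf.predict_proba([new])  # columns follow the order of clf.classes_
print(label, proba)
```

On a data set this small the tree simply memorises the rows, so for the real matrix it would be sensible to hold out some subjects and check the predictions against their known labels.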
I don't know if this will work, but maybe you could try using variables. So, for example, say height is high (10), weight is low (1), stake aversion is high (10), garlic aversion is high (10), reflectance is high (10) and shininess is high (10). You then add up all those variables and put the sum into another variable. If the end variable is, for example, 50 or higher, you are certain it is a vampire, making IS_VAMPIRE true/1. You would need some more states to account for the likelihood, and I would think it is a big block of code to write, but if it works (I don't know if it will) then it would be nice. Then again, I am the noobiest of noobs when it comes to programming, so maybe I am of no help here :/
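The scoring idea above is short to write down. In this sketch the trait names, the 1-10 scale and the threshold of 50 are taken from the example in the answer, and the function names are made up. Unlike the decision tree, nothing here is learned from the data, so the scores and threshold would have to be tuned by hand.

```python
def vampire_score(traits):
    """Add up the per-trait scores (each assumed to be on a 1-10 scale)."""
    return sum(traits.values())


def looks_like_vampire(traits, threshold=50):
    """Flag a subject as a vampire when the total score reaches the threshold."""
    return vampire_score(traits) >= threshold


subject = {
    "height": 10,
    "weight": 1,
    "stake_aversion": 10,
    "garlic_aversion": 10,
    "reflectance": 10,
    "shininess": 10,
}
print(vampire_score(subject), looks_like_vampire(subject))
```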