I'm using RandomForestRegressor (from the great Scikit-Learn library in Python) for my project. It gives me good results, but I think I can do better. When I'm passing features to the fit(...) function, is it better to encode categorical features as binary features?
example:
instead of:

continent
=========
    1
    2
    3
    2
make something like:

is_europe | is_asia | ...
=========================
        1 |       0 | ...
        0 |       1 | ...
Because it works as a tree, maybe the second option is better, or will it work the same with the first option? Thanks a lot!
Binarizing categorical variables is highly recommended and is expected to outperform a model without the binarizing transform. If scikit-learn treats continent = [1, 2, 3, 2] as numeric values (a continuous [quantitative] variable instead of a categorical [qualitative] one), it imposes an artificial order constraint on that feature. For example, suppose continent=1 means is_europe, continent=2 means is_asia, and continent=3 means is_america; then it implies that is_asia always lies between is_europe and is_america when examining the relation of the continent feature to your response variable y, which is not necessarily true and may reduce the model's effectiveness. In contrast, converting it to dummy variables has no such problem, and scikit-learn will treat each binary feature separately.
To binarize your categorical variables in scikit-learn, you can use LabelBinarizer.
from sklearn.preprocessing import LabelBinarizer

# your data
# =========
continent = [1, 2, 3, 2]
continent_dict = {1: 'is_europe', 2: 'is_asia', 3: 'is_america'}
print(continent_dict)
# {1: 'is_europe', 2: 'is_asia', 3: 'is_america'}

# processing
# ==========
binarizer = LabelBinarizer()
# fit on the categorical feature and transform it into dummy columns
continent_dummy = binarizer.fit_transform(continent)
print(continent_dummy)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
If you process your data in pandas, then its top-level function pandas.get_dummies also helps.
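For illustration, here is a minimal sketch of the pandas route (the DataFrame and its continent column are made up for this example):

import pandas as pd

# hypothetical DataFrame holding the raw categorical codes
df = pd.DataFrame({'continent': [1, 2, 3, 2]})
dummies = pd.get_dummies(df['continent'], prefix='continent')
print(dummies)
# one indicator column per category:
#    continent_1  continent_2  continent_3
# 0            1            0            0
# 1            0            1            0
# 2            0            0            1
# 3            0            1            0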
Related
I have a dataset of 5 features. Two of these features are very similar but do not have the same min and max values.
... | feature 2  | feature 3  | ...
-----------------------------------
... | 208.429993 | 206.619995 | ...
... | 207.779999 | 205.050003 | ...
... | 206.029999 | 203.410004 | ...
... | 204.429993 | 202.600006 | ...
... | 206.429993 | 204.250000 | ...
feature 3 is always smaller than feature 2, and it is important that it stays that way after scaling. But since feature 2 and feature 3 do not have exactly the same min and max values, after scaling they will both end up having 0 and 1 as min and max by default. This will remove the relationship between the values. In fact, after scaling, the first sample becomes:
... | feature 2 | feature 3 | ...
---------------------------------
... |   0.00268 |   0.00279 | ...
This is something that I do not want. I cannot seem to find a way to manually change the min and max values of MinMaxScaler. There are other ugly hacks, such as combining feature 2 and feature 3 into a single column for the scaling and splitting them again afterward. But I would like to know first whether there is a solution handled by sklearn, such as using the same min and max for multiple features.
Otherwise, the simplest workaround would do.
You can fit the scaler on one column and transform both. Trying it with the data you posted:
   feature_1   feature_2
0  208.429993  206.619995
1  207.779999  205.050003
2  206.029999  203.410004
3  204.429993  202.600006
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit on feature_2 only; its min/max then gets applied to both columns
scaler.fit(df['feature_2'].values.reshape(-1, 1))
scaler.transform(df)
# array([[1.45024949, 1.        ],
#        [1.288559  , 0.60945366],
#        [0.85323442, 0.20149259],
#        [0.45522189, 0.        ]])
If you scale data that are outside of the range you used to fit the scaler, the scaled data will be outside of [0,1].
The only way to avoid it is to scale each column individually.
Whether or not this is a problem depends on what you want to do with the data after scaling.
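If staying inside [0, 1] matters for both columns, a variant of the "combine" hack from the question is to fit a single scaler on the values of both columns stacked together, so that they share one min and max. A minimal sketch, assuming the same df as above:

from sklearn.preprocessing import MinMaxScaler

vals = df[['feature_1', 'feature_2']].values
scaler = MinMaxScaler()
# stack both columns into one so they share a single fitted min/max
scaler.fit(vals.reshape(-1, 1))
scaled = scaler.transform(vals.reshape(-1, 1)).reshape(vals.shape)
# every value now lies in [0, 1], and feature_2 stays below feature_1 row-wise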
By default, a scikit-learn DecisionTreeRegressor returns the mean of all target values from the training set in a given leaf node.
However, I am interested in getting back the list of target values from my training set that fell into the predicted leaf node. This will allow me to quantify the distribution, and also calculate other metrics like standard deviation.
Is this possible using scikit-learn?
I think what you're looking for is the apply method of the tree object. See here for the source. Here's an example:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rs = np.random.RandomState(1234)
x = rs.randn(10, 2)
y = rs.randn(10)
md = rs.randint(1, 5)

dtr = DecisionTreeRegressor(max_depth=md)
dtr.fit(x, y)

# The `tree_` object's methods seem to complain if you don't use `float32`.
leaf_ids = dtr.tree_.apply(x.astype(np.float32))
print(leaf_ids)
# => [5 6 6 5 2 6 3 6 6 3]

# These should probably be equal for small depths.
print(2**md, np.unique(leaf_ids).shape[0])
# => 4, 4
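To get back to the original question: once you have the leaf ids, you can group the training targets by leaf and compute whatever statistics you like. A sketch continuing from the example above:

from collections import defaultdict

# map each leaf id to the training targets that landed in that leaf
leaf_targets = defaultdict(list)
for leaf, target in zip(leaf_ids, y):
    leaf_targets[leaf].append(target)

# e.g. per-leaf mean and standard deviation
for leaf, targets in sorted(leaf_targets.items()):
    print(leaf, np.mean(targets), np.std(targets))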
I need to do some multinomial regression in Julia. In R I get the following result:
library(nnet)
data <- read.table("Dropbox/scripts/timeseries.txt",header=TRUE)
multinom(y~X1+X2,data)
# weights: 12 (6 variable)
initial value 10985.024274
iter 10 value 10438.503738
final value 10438.503529
converged
Call:
multinom(formula = y ~ X1 + X2, data = data)
Coefficients:
(Intercept) X1 X2
2 0.4877087 0.2588725 0.2762119
3 0.4421524 0.5305649 0.3895339
Residual Deviance: 20877.01
AIC: 20889.01
Here is my data
My first attempt was using Regression.jl. The documentation is quite sparse for this package, so I am not sure which category is used as the baseline, which parameters the resulting output corresponds to, etc. I filed an issue to ask about these things here.
using DataFrames
using Regression
import Regression: solve, Options, predict
dat = readtable("timeseries.txt", separator='\t')
X = convert(Matrix{Float64},dat[:,2:3])
y = convert(Vector{Int64},dat[:,1])
ret = solve(mlogisticreg(X',y,3), reg=ZeroReg(), options=Options(verbosity=:iter))
the result is
julia> ret.sol
3x2 Array{Float64,2}:
-0.573027 -0.531819
0.173453 0.232029
0.399575 0.29979
but again, I am not sure what this corresponds to.
Next I tried the Julia wrapper to Python's SciKitLearn:
using ScikitLearn
@sk_import linear_model: LogisticRegression
model = ScikitLearn.fit!(LogisticRegression(multi_class="multinomial", solver = "lbfgs"), X, y)
model[:coef_]
3x2 Array{Float64,2}:
-0.261902 -0.220771
-0.00453731 0.0540354
0.266439 0.166735
Update: I have added the coefficients above (extracted via model[:coef_]). These also don't look like the R results.
Any help trying to replicate R's results would be appreciated (using whatever package!).
Note the response variables are just the discretized time-lagged response i.e.
julia> dat[1:3,:]
3x3 DataFrames.DataFrame
| Row | y | X1 | X2 |
|-----|---|----|----|
| 1 | 3 | 1 | 0 |
| 2 | 3 | 0 | 1 |
| 3 | 1 | 0 | 1 |
For row 2 you can see that the encoding (0, 1) means the previous observation was a 3. Similarly, (1, 0) means the previous observation was a 2, and (0, 0) means the previous observation was a 1.
Update:
It seems Regression.jl does not fit an intercept by default (and it calls it a "bias" instead of an intercept). By adding this term we get results very similar to Python's (not sure what the third column is, though).
julia> ret = solve(mlogisticreg(X',y,3, bias=1.0), reg=ZeroReg(), options=Options(verbosity=:iter))
julia> ret.sol
3x3 Array{Float64,2}:
-0.263149 -0.221923 -0.309949
-0.00427033 0.0543008 0.177753
0.267419 0.167622 0.132196
UPDATE:
Since the model coefficients are not identifiable, I should not expect them to be the same across these different implementations. However, the predicted probabilities should be the same, and in fact they are (using R, Regression.jl, or ScikitLearn).
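As a quick illustration of why the coefficients are not identifiable (a self-contained NumPy sketch with made-up data, not tied to any of the fitted models above): softmax probabilities are unchanged when the same vector is added to every class's coefficients, so different optimizers can land on different but equivalent parameterizations.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 2)   # toy data
W = rng.randn(3, 2)   # one coefficient row per class
b = rng.randn(3)      # one intercept per class

def softmax_probs(X, W, b):
    scores = X @ W.T + b
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# shift every class's coefficients and intercept by the same amount
shift_w, shift_b = rng.randn(2), rng.randn()
p1 = softmax_probs(X, W, b)
p2 = softmax_probs(X, W + shift_w, b + shift_b)
print(np.allclose(p1, p2))  # True: the predicted probabilities are identical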
For a multiclass problem I use Scikit-Learn. I can find very few examples of how to load a custom dataset with multiple classes. The sklearn.datasets.load_files method does not seem suitable, as files would need to be stored multiple times. I now have the following structure:
X => Python list with lists of features (in text).
y => Python list with lists of classes (in text).
How do I transform this to a structure Scikit-Learn can use in a classifier?
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
X = np.loadtxt('samples.csv', delimiter=",")
y_aux = np.loadtxt('targets.csv', delimiter=",")
y = MultiLabelBinarizer().fit_transform(y_aux)
Code explanation: Let's say you have all your features stored in a file called samples.csv and the multiclass labels in another file called targets.csv (they could, of course, be stored in the same file; you'd just need to split the columns). For clarity, in this example my files contain:
samples.csv
4.0,3.2,5.5
6.8,5.6,3.3
targets.csv
1,4 <-- sample one belongs to classes 1 and 4
2,3 <-- sample two belongs to classes 2,3
MultiLabelBinarizer encodes the output targets in such a way that the y variable is ready to be fed into multiclass classifiers. The output of the code is:
y = array([[1, 0, 0, 1],
           [0, 1, 1, 0]])
meaning sample one belongs to classes 1 and 4 and sample two belongs to 2 and 3.
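From there, a minimal sketch of feeding X and y into a classifier that supports this multilabel indicator format (the choice of OneVsRestClassifier with LogisticRegression is just one option, not something required by the code above):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# X and y as built above: X is (n_samples, n_features) and y is the
# binary indicator matrix produced by MultiLabelBinarizer
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)
print(clf.predict(X))  # predictions come back in the same indicator format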
I'm quite new to scikit-learn and was going through some of the examples of learning from and predicting on the samples in the iris dataset. But how do I load an external dataset for this purpose?
I downloaded a dataset that has data in the following form;
id attr1 attr2 .... label
123 0 0 ..... abc
234 0 0 ..... dsf
....
....
So how should I load this dataset in order to train and make predictions? Thanks.
One option is to use pandas. Assuming the data is space separated:
import pandas as pd
X = pd.read_csv('data.txt', sep=' ').values
where read_csv returns a DataFrame, and the values attribute returns a numpy array containing the data. You might want to separate out the last column of the above X as the labels, say into a one-dimensional array y:
X, y = X[:, :-1], X[:, -1]
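From there, a minimal sketch of fitting a classifier on the split arrays (dropping the id column and the float conversion are assumptions based on the data layout in the question; RandomForestClassifier is just one choice):

from sklearn.ensemble import RandomForestClassifier

# drop the id column (no predictive value) and force a numeric dtype,
# since .values on a mixed DataFrame yields an object array
X_num = X[:, 1:].astype(float)
clf = RandomForestClassifier()
clf.fit(X_num, y)  # scikit-learn accepts the string labels in y as-is
print(clf.predict(X_num[:5]))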