The preprocessing module further provides a utility class
StandardScaler that implements the Transformer API to compute the mean
and standard deviation on a training set so as to be able to later
reapply the same transformation on the testing set.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform
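For example, a minimal sketch of the fit-on-train, transform-on-test pattern (the numbers are made up for illustration):
from sklearn.preprocessing import StandardScaler

X_train = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]]
X_test = [[1.5, 11.0]]

scaler = StandardScaler().fit(X_train)    # learn mean and std from the training set
X_test_scaled = scaler.transform(X_test)  # reapply the same transformation to the test set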
When transforming the dataset you run an algorithm on, how do you link the results back to the original dataset?
E.g.
from sklearn.preprocessing import StandardScaler

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
print(data)
--> [[0, 0], [0, 0], [1, 1], [1, 1]]
myData = StandardScaler().fit_transform(data)
print(myData)
--> [[-1. -1.]
     [-1. -1.]
     [ 1.  1.]
     [ 1.  1.]]
When running an (unsupervised) algorithm on myData, how can you interpret the results, given that the data was transformed before running? E.g. when you run a clustering algorithm on myData, you are not clustering the original data.
Apply the inverse_transform to get back to the original data:
from sklearn.preprocessing import StandardScaler
import numpy as np
data = [[0, 0], [0, 1], [1, 0], [1, 1]]
scaler = StandardScaler()
myData = scaler.fit_transform(data)
restored = scaler.inverse_transform(myData)
assert np.allclose(restored, data) # check we got the original data back
Note how an instance of StandardScaler is stored in a variable for later use. After fitting, this instance contains all the information required to repeat or undo the transformation.
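Concretely, the fitted statistics live on attributes of the instance:
print(scaler.mean_)   # per-feature mean learned from the data
print(scaler.scale_)  # per-feature standard deviation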
Now, if you performed clustering on myData you can pass the cluster prototypes (centers, or whatever you get from the clustering algorithm) to scaler.inverse_transform to get the clusters in the original data space.
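For instance, a minimal sketch with KMeans (an arbitrary choice of clustering algorithm here), continuing from the snippet above:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10).fit(myData)                  # cluster the scaled data
centers_original = scaler.inverse_transform(km.cluster_centers_)  # centers back in the original space
print(centers_original)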
Related
When doing small tests on lightgbm, I found a case I could not understand.
I create a small dataset with categorical columns:
import pandas as pd
X = pd.DataFrame(
    [
        [0, 1, 1],
        [2, 2, 0],
        [3, 3, 2],
        [2, 3, 2],
        [0, -3, 0],
        [0, 0, 1],
        [0, 3, 0],
        [1, 2.5, 1],
        [2, 5, 0],
        [3, -1.5, 2],
    ],
    columns=["col1", "col2", "col3"],
)
X["col1"] = X["col1"].astype("category")
X["col3"] = X["col3"].astype("category")
y = pd.Series([0, 1, 1, 0, 0, 0, 0, 1, 0, 1])
and its associated one-hot encoded version:
from sklearn.preprocessing import OneHotEncoder
feats_to_encode = ["col1", "col3"]
enc = OneHotEncoder()
enc.fit(X[feats_to_encode])
X_one_hot_encoded = pd.DataFrame(
    enc.transform(X[feats_to_encode]).toarray(),
    columns=[
        feats_to_encode[feat_id] + str(cat)
        for feat_id in range(len(enc.categories_))
        for cat in enc.categories_[feat_id]
    ],
)
X_one_hot_encoded["col2"] = X["col2"]
I then trained 2 lightgbm models, one on each of the previous datasets. I know that lightgbm handles categorical columns with special algorithms: "When number of categories of one feature smaller than or equal to max_cat_to_onehot, one-vs-other split algorithm will be used" (see max_cat_to_onehot). I would then naively expect that, since my categorical columns have fewer categories (4 and 3) than the parameter max_cat_to_onehot, I would get the same result with both datasets, unless the "one-vs-other split algorithm" is not equivalent to one-hot encoding the categorical columns. I assumed this behaviour because of the names of the parameter, "max_cat_to_onehot", and the algorithm, "one-vs-other split algorithm".
from lightgbm import LGBMRegressor
params = {
    "n_estimators": 1,
    "max_depth": 2,
    "min_child_samples": 1,
    "importance_type": "gain",
    "max_cat_to_onehot": 10,
}
model = LGBMRegressor(**params)
model.fit(X, y)
print(model._Booster.dump_model()["tree_info"][0]["tree_structure"]["split_gain"]) # 0.30476200580596924
model = LGBMRegressor(**params)
model.fit(X_one_hot_encoded, y)
print(model._Booster.dump_model()["tree_info"][0]["tree_structure"]["split_gain"]) # 1.0666699409484863
The first split differs between the 2 models. The second model chose the best possible split, but the first one did not.
Does anyone know the reason for this? I assume that my guess about the behaviour of the "one-vs-other split algorithm" is wrong.
I updated the lightgbm version from 2.3.1 to 3.1.1. With the new version I got the expected results.
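To confirm which version you are running before and after the upgrade, something like:
import lightgbm
print(lightgbm.__version__)  # expect 3.1.1 or later for the behaviour described above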
Is there a way to extract the mapping procedure in sklearn.manifold.TSNE in Python, so that you can map new data into the reduced-dimensional space?
Importantly, I mean without having to retrain on the new data as well.
For example say you trained a TSNE map as follows:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
X_embedded = TSNE(n_components=2).fit_transform(X)
As seen in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Can you extract the transformation so that you can map new data into the same space:
Y = np.array([[0, 0.8, 0.8], [0.1, 0, 1], [1.2, 0.2, 1], [1, 1.1, 1]])
Any help on this matter would be greatly appreciated!
t-SNE is a non-linear, non-parametric embedding.
So there is no "closed form" way of updating it with new points. Even worse: adding new points may require existing points to move.
Because of this, making t-SNE apply to new data would require substantial changes to the method; it wouldn't be the original t-SNE anymore.
Parametric t-SNE does have the option of embedding test data, but it is not available in scikit-learn (see the reference issue). Having said that, it should be mentioned that parametric t-SNE is implemented elsewhere.
I wrote a very simple scikit-learn decision tree to implement XOR:
from sklearn import tree
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([0,1]))
print(clf.predict([0,0]))
print(clf.predict([1,1]))
print(clf.predict([1,0]))
The predict calls generate a warning like this:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
I don't have a clear idea of what needs to change, or why. Please enlighten me!
Thank you in advance!
The input to clf.predict should be a 2D array. Thus, instead of writing
print(clf.predict([0,1]))
you need to write
print(clf.predict([[0,1]]))
The method operates on matrices (2D arrays), rather than vectors (1D arrays). As a convenience, the older code accepted a vector as a 1xN matrix. This led to usage errors as some users forgot which way a vector was oriented (1xN vs Nx1).
The suggestion tells you how to reshape your vector to the proper matrix shape. For constant vectors, just write them as matrices:
clf.predict([[0, 1]])
The "other direction" (wrong for this application) would be
clf.predict([[0], [1]])
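If the sample comes from a NumPy array rather than a literal, the reshape suggested in the warning does the same thing:
import numpy as np

sample = np.array([0, 1])                  # 1D vector, shape (2,)
print(clf.predict(sample.reshape(1, -1)))  # 2D matrix with a single row, shape (1, 2)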
As the warning message points out, you have a single sample to test. Thus, you could either use reshape or fix the calls as follows:
from sklearn import tree
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
print(clf.predict([[0, 1]]))
print(clf.predict([[0, 0]]))
print(clf.predict([[1, 1]]))
print(clf.predict([[1, 0]]))
I am trying to use the t-SNE algorithm from scikit-learn:
import numpy as np
from sklearn.manifold import TSNE
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
model.fit_transform(X)
Output:
array([[ 0.00017599,  0.00003993],  #1
       [ 0.00009891,  0.00021913],
       [ 0.00018554, -0.00009357],
       [ 0.00009528, -0.00001407]])  #2
After that, I try to add some points with exactly the same coordinates as in the first array X to the existing model:
Y = np.array([[0, 0, 0], [1, 1, 1]])
model.fit_transform(Y)
Output:
array([[ 0.00017882,  0.00004002],  #1
       [ 0.00009546,  0.00022409]])  #2
But the coordinates in the second array are not equal to the first and last coordinates of the first array.
I understand that this is the correct behaviour, but how can I add new points to the model and get the same output coordinates for the same input coordinates?
Also, I still need to be able to find the closest points even after appending new points.
Quoting the author of t-SNE from here: https://lvdmaaten.github.io/tsne/
Once I have a t-SNE map, how can I embed incoming test points in that map?
t-SNE learns a non-parametric mapping, which means that it does not learn an explicit function that maps data from the input space to the map. Therefore, it is not possible to embed test points in an existing map (although you could re-run t-SNE on the full dataset). A potential approach to deal with this would be to train a multivariate regressor to predict the map location from the input data. Alternatively, you could also make such a regressor minimize the t-SNE loss directly, which is what I did in this paper.
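As a rough sketch of the regressor idea (KNeighborsRegressor is an arbitrary choice here; the paper quoted above minimizes the t-SNE loss directly instead):
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
X_embedded = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

# Learn an approximate input-space -> map-space function from the fitted embedding
mapper = KNeighborsRegressor(n_neighbors=2).fit(X, X_embedded)

# Project new points without re-running t-SNE (an approximation, not true t-SNE)
Y = np.array([[0, 0.8, 0.8], [1, 1.1, 1]])
print(mapper.predict(Y))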
Also, this answer on stats.stackexchange.com contains further ideas, a link to a very nice and very fast recent Python implementation of t-SNE, https://github.com/pavlin-policar/openTSNE, that allows embedding of new points out of the box, and a link to https://github.com/berenslab/rna-seq-tsne/.
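With openTSNE (a third-party package, not part of scikit-learn), embedding new points looks roughly like this, assuming the API from its documentation:
# pip install openTSNE
import numpy as np
from openTSNE import TSNE

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([[0, 0.8, 0.8], [1, 1.1, 1]], dtype=float)

embedding = TSNE(perplexity=2).fit(X)  # fit() returns a TSNEEmbedding
Y_embedded = embedding.transform(Y)    # project new points into the existing map
print(Y_embedded)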
Forgive my terminology, I'm not an ML pro. I might use the wrong terms below.
I'm trying to perform multivariable linear regression. Let's say I'm trying to work out user gender by analysing page views on a web site.
For each user whose gender I know, I have a feature matrix where each row represents a web site section: the first element is the section number, and the second is whether they visited it, e.g.:
male1 = [
    [1, 1],  # visited section 1
    [2, 0],  # didn't visit section 2
    [3, 1],  # visited section 3, etc
    [4, 0]
]
So in scikit, I am building xs and ys. I'm representing a male as 1, and female as 0.
The above would be represented as:
features = male1
gender = 1
Now, I'm obviously not just training a model for a single user, but instead I have tens of thousands of users whose data I'm using for training.
I would have thought I should create my xs and ys as follows:
xs = [
    [  # user1
        [1, 1],
        [2, 0],
        [3, 1],
        [4, 0]
    ],
    [  # user2
        [1, 0],
        [2, 1],
        [3, 1],
        [4, 0]
    ],
    ...
]
ys = [1, 0, ...]
scikit doesn't like this:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(xs, ys)
It complains:
ValueError: Found array with dim 3. Estimator expected <= 2.
How am I supposed to supply a feature matrix to the linear regression algorithm in scikit-learn?
You need to create xs in a different way. According to the docs:
fit(X, y, sample_weight=None)

Parameters:
    X : numpy array or sparse matrix of shape [n_samples, n_features]
        Training data
    y : numpy array of shape [n_samples, n_targets]
        Target values
    sample_weight : numpy array of shape [n_samples]
        Individual weights for each sample
Hence xs should be a 2D array with as many rows as users and as many columns as web site sections. You defined xs as a 3D array though. In order to reduce the number of dimensions by one you could get rid of the section numbers through a list comprehension:
xs = [[visit for section, visit in user] for user in xs]
If you do so, the data you provided as an example gets transformed into:
xs = [[1, 0, 1, 0],  # user1
      [0, 1, 1, 0],  # user2
      ...
]
and clf.fit(xs, ys) should work as expected.
A more efficient way to achieve the same reduction is to slice a NumPy array:
import numpy as np
xs = np.asarray(xs)[:, :, 1]
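Putting it together, a minimal end-to-end sketch with the toy data from the question:
import numpy as np
from sklearn import linear_model

xs = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
]
ys = [1, 0]

xs = np.asarray(xs)[:, :, 1]  # keep only the visited/not-visited column -> shape (2, 4)
clf = linear_model.LinearRegression()
clf.fit(xs, ys)
print(clf.predict(xs))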