I want to use the FunctionSampler class from imblearn to build a custom class for resampling my dataset. I have a one-dimensional feature Series containing a path for each subject and a label Series containing the label for each subject; both come from a pd.DataFrame. I know I have to reshape the feature array first because it is one-dimensional. Using RandomUnderSampler directly works fine. However, if I pass the features and labels to the fit_resample method of a FunctionSampler whose function creates a RandomUnderSampler and calls fit_resample on it, I get the following error:
ValueError: could not convert string to float: 'path_1'
Here's a minimal example producing the error:
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from imblearn import FunctionSampler
# create one dimensional feature and label arrays X and y
# X has to be converted to numpy array and then reshaped.
X = pd.Series(['path_1','path_2','path_3'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])
FIRST METHOD (works)
rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X,y)
SECOND METHOD (doesn't work)
def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)
sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)
Does anyone know what goes wrong here? It seems that the fit_resample method of FunctionSampler does not behave the same as the fit_resample method of RandomUnderSampler...
Your use of FunctionSampler is correct. The problem is with your dataset.
RandomUnderSampler seems to work for text data as well, because it does not apply the numeric check that check_X_y performs by default.
FunctionSampler, however, does run this check, and the check fails on strings:
from sklearn.utils import check_X_y
X = pd.Series(['path_1','path_2','path_2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])
check_X_y(X, y)
This will throw an error
ValueError: could not convert string to float: 'path_1'
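To see the validation difference in isolation, here is a minimal sketch that calls check_X_y twice: once with its default dtype="numeric" and once with dtype=None, which keeps the original dtype. That RandomUnderSampler validates its input in the second, more lenient way is an assumption here and is not confirmed above; the snippet only shows why string features pass one kind of check and fail the other.

```python
import numpy as np
from sklearn.utils import check_X_y

# Object-dtype column of strings, shaped (n_samples, 1) like the paths above.
X = np.array([["path_1"], ["path_2"], ["path_3"]], dtype=object)
y = np.array([1, 0, 0])

# Default validation (dtype="numeric") tries to cast object arrays to float.
try:
    check_X_y(X, y)
    numeric_check_failed = False
except ValueError as err:
    numeric_check_failed = True
    print(f"numeric check failed: {err}")

# With dtype=None the original dtype is preserved, so strings pass through.
X_checked, y_checked = check_X_y(X, y, dtype=None)
print(X_checked.dtype, X_checked.shape)
```

Recent imbalanced-learn releases also document a validate parameter on FunctionSampler. If your installed version has it, passing validate=False should skip this check and let the string paths through unchanged.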
The following example would work!
X = pd.Series(['1','2','2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])
def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)
sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)
X_res, y_res
# (array([[2.],
# [1.]]), array([0, 1], dtype=int64))
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('netflixprice.csv')
x = dataset.iloc[:,0].values
y = dataset.iloc[:, 1:6].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
IndexError Traceback (most recent call last)
Input In [8], in <cell line: 4>()
2 from sklearn.preprocessing import OneHotEncoder
3 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
----> 4 x = np.array(ct.fit_transform(x))
data structure
I'm new to this. Also, is there anywhere I can learn more about data processing?
It's hard to tell anything without knowing the structure of your data. However, it seems like you may want to reshape your x:
x = dataset.iloc[:, 0].values.reshape(-1, 1)
I found a dataset that might be similar to yours and tried this on it, and it worked.
As for learning how to process data: I personally refer to the documentation of the method I want to apply. In this case, though, the clue to where the problem was came from the error message:
def _get_column_indices(X, key):
    """Get feature column indices for input data X and key.
    For accepted values of `key`, see the docstring of
    :func:`_safe_indexing_column`.
    """
--> n_columns = X.shape[1] # this is where the problem is
    key_dtype = _determine_key_type(key)
    if isinstance(key, (list, tuple)) and not key:
        # we get an empty list
IndexError: tuple index out of range
That made me suspect that slicing x gave you an ndarray of shape (n,), which has no column axis, while the transformer needs one.
It also seems like you intended x to be the target rather than the only feature. With the five other columns (1:6) assigned to y, you may want to swap x and y. You can still encode your target as you planned.
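A minimal sketch of that shape problem, using a made-up array in place of dataset.iloc[:, 0].values (the values are hypothetical, since the real CSV is not shown):

```python
import numpy as np

# Hypothetical stand-in for dataset.iloc[:, 0].values: a 1-D array of labels.
x = np.array(["Basic", "Standard", "Premium"], dtype=object)
print(x.shape)  # a single axis only, so there is no shape[1]

try:
    n_columns = x.shape[1]
    has_columns = True
except IndexError:
    has_columns = False  # same failure mode as in the traceback

# reshape(-1, 1) keeps the number of rows and adds a single column axis.
x_2d = x.reshape(-1, 1)
print(x_2d.shape)
```

After the reshape, column index 0 exists, which is what a column list of [0] refers to.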
a= [-0.10266667,0.02666667,0.016 ,0.06666667,0.08266667]
b= [5.12,26.81,58.82,100.04,148.08]
The result of SLOPE(a,b) in Excel is 0.001062.
How can I get the same result in Python that I get from SLOPE in Excel?
Here you go.
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([5.12,26.81,58.82,100.04,148.08]).reshape((-1, 1))
y = np.array([-0.10266667,0.02666667,0.016 ,0.06666667,0.08266667])
model = LinearRegression().fit(x, y)
print(model.coef_)
# methods and attributes available
print(dir(model))
In Excel, the SLOPE arguments are in the order y, x. I used those names here to make that more obvious.
The reshape turns x into a 2-D array with a single column, which is what fit requires; y only needs to be one-dimensional. model has many other methods and attributes available; see dir(model).
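If pulling in scikit-learn feels heavy for a single slope, a sketch with NumPy alone gives the same number. It uses np.polyfit with degree 1 and, as a cross-check, the textbook least-squares formula; the variable names are just the x and y from above.

```python
import numpy as np

x = np.array([5.12, 26.81, 58.82, 100.04, 148.08])
y = np.array([-0.10266667, 0.02666667, 0.016, 0.06666667, 0.08266667])

# Degree-1 least-squares fit; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, 1)

# The same slope from the formula sum((x - mx) * (y - my)) / sum((x - mx)^2).
slope_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(round(slope, 6), round(slope_manual, 6))
```

Both values should agree with the 0.001062 reported by Excel, up to rounding.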
Python t-sne implementation from this resource: https://lvdmaaten.github.io/tsne/
Btw I'm a beginner to scRNA-seq.
What I am trying to do: Use a scRNA-seq data set and run t-SNE on it but with using previously calculated PCAs (I have PCA.score and PCA.load files)
Q1: I should be able to use my selected calculated PCAs in the tSNE, but which file do I use the pca.score or pca.load when running Y = tsne.tsne(X)?
Q2: I've tried removing/replacing parts of the PCA calculating code to attempt to remove PCA preprocessing but it always seems to give an error. What should I change for it to properly use my already PCA data and not calculate PCA from it again?
The piece of PCA processing code is this in its raw form:
def pca(X=np.array([]), no_dims=50):
    """
        Runs PCA on the NxD array X in order to reduce its dimensionality to
        no_dims dimensions.
    """
    print("Preprocessing the data using PCA...")
    (n, d) = X.shape
    X = X - np.tile(np.mean(X, 0), (n, 1))
    (l, M) = X #np.linalg.eig(np.dot(X.T, X))
    Y = np.dot(X, M[:, 0:no_dims])
    return Y
You should use the PCA score.
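To make the distinction concrete, here is a small sketch on random data. It assumes the usual convention, not confirmed in the question, that the score file holds one row per cell (its coordinates on each PC) and the load file holds one row per gene (its weight on each PC). The matrix sizes and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))      # hypothetical data: 100 cells x 30 genes
X_centered = X - X.mean(axis=0)

# SVD-based PCA: the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
n_pcs = 10
loadings = Vt[:n_pcs].T             # (n_genes, n_pcs): weight of each gene per PC
scores = X_centered @ loadings      # (n_cells, n_pcs): coordinates of each cell

print(loadings.shape, scores.shape)
```

t-SNE embeds samples, so it needs the matrix with one row per cell, which is the scores.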
As for not running pca, you can just comment out this line:
X = pca(X, initial_dims).real
What I did was add a parameter do_pca and edit the function as follows:
def tsne(X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0, do_pca=True):
    """
        Runs t-SNE on the dataset in the NxD array X to reduce its
        dimensionality to no_dims dimensions. The syntaxis of the function is
        `Y = tsne.tsne(X, no_dims, perplexity), where X is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array X should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # Initialize variables
    if do_pca:
        X = pca(X, initial_dims).real
    (n, d) = X.shape
    max_iter = 50
    # [.. rest stays the same..]
Using an example dataset, without commenting out that line:
import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import sys
import os
from tsne import *
X,y = load_digits(return_X_y=True,n_class=3)
If we run the default:
res = tsne(X=X,initial_dims=20,do_pca=True)
plt.scatter(res[:,0],res[:,1],c=y)
If we pass it a pca :
pc = pca(X)[:,:20]
res = tsne(X=pc,initial_dims=20,do_pca=False)
plt.scatter(res[:,0],res[:,1],c=y)
Goal
I am trying to build regressors that encapsulate the process of
1. transforming the target from a non-numeric to a numeric format,
2. using numbers internally for all calculations,
3. inverse-transforming numeric values back to the original format before presenting them to the user.
Ideally, the end user should be able to use the regressor without knowing the internals of the target conversions. The developer is expected to provide functions that implement the transform and inverse-transform logic.
Prototype Demo
With the help of sklearn.compose.TransformedTargetRegressor I was able to build a linear regression model that accepts timestamps as targets and internally converts them to seconds elapsed since 1970-01-01 00:00:00 (Unix epoch). The fit and predict methods already work as expected.
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
_check_inverse = False
# helper to convert a 2D numpy array of timestamps to a 2D array of seconds
def _to_float(timestamps):
    deltas = pd.DataFrame(timestamps).sub(pd.Timestamp(0))
    return deltas.apply(lambda s: s.dt.total_seconds()).values

# helper to convert a 2D numpy array of seconds to a 2D array of timestamps
def _to_timestamp(seconds):
    return pd.DataFrame(seconds).apply(pd.to_datetime, unit='s').values

# build transformer from helper functions
time_transformer = FunctionTransformer(
    func=_to_float,
    inverse_func=_to_timestamp,
    validate=True,
    check_inverse=_check_inverse
)

# build TransformedTargetRegressor
tt_reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=time_transformer,
    check_inverse=_check_inverse
)
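For readers who want to see the conversion itself without pandas, the round trip the two helpers perform can be sketched in plain NumPy. This is an illustration of the idea, not the helpers above; the names epoch, stamps, seconds and back are made up for the example.

```python
import numpy as np

epoch = np.datetime64("1970-01-01T00:00:00", "ns")
stamps = np.array(["1970-01-01T00:00:00", "1970-01-01T00:01:00"],
                  dtype="datetime64[ns]")

# Forward: timestamps -> float seconds since the Unix epoch.
seconds = (stamps - epoch) / np.timedelta64(1, "s")

# Inverse: float seconds -> timestamps (via integer nanoseconds).
back = epoch + (seconds * 1e9).astype("int64").astype("timedelta64[ns]")

print(seconds, back)
```

The regressor only ever sees the float seconds; the inverse step is what turns predictions back into timestamps.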
Usage:
>>> import numpy as np
>>> X = np.array([[1], [2], [3]], dtype=float)
>>> y = pd.date_range(start=0, periods=3, freq='min')
>>> tt_reg = tt_reg.fit(X, y)
>>> tt_reg.predict(X)
array(['1970-01-01T00:00:00.000000000', '1970-01-01T00:01:00.000000000',
'1970-01-01T00:02:00.000000000'], dtype='datetime64[ns]')
However, methods that use the result of predict internally such as score (and possibly other methods of more complex sklearn regressors) fail because they can't handle the output of _to_timestamp:
>>> tt_reg.score(X, y)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\actualpanda\.virtualenvs\SomeProject--3333Ox_\lib\site-packages\sklearn\base.py", line 435, in score
return r2_score(y, y_pred, sample_weight=sample_weight,
File "C:\Users\actualpanda\.virtualenvs\SomeProject--3333Ox_\lib\site-packages\sklearn\metrics\_regression.py", line 591, in r2_score
numerator = (weight * (y_true - y_pred) ** 2).sum(axis=0,
TypeError: ufunc 'square' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
In order to get the score, the user must know the internals of tt_reg.regressor_.
>>> tt_reg.regressor_.score(X, y.to_series().sub(pd.Timestamp(0)).dt.total_seconds())
1.0
Question
Is there a feasible way to build robust, user friendly sklearn regressors that can deal with non-numeric targets and don't leak their internals?
Updating the score method might solve your problem, as mentioned in comments.
from sklearn.utils import check_array
class MyTransformedTargetRegressor(TransformedTargetRegressor):

    def score(self, X, y):
        y = check_array(y, accept_sparse=False, force_all_finite=True,
                        ensure_2d=False)
        if y.ndim == 1:
            y_2d = y.reshape(-1, 1)
        else:
            y_2d = y
        y_trans = self.transformer_.transform(y_2d)
        if y_trans.ndim == 2 and y_trans.shape[1] == 1:
            y_trans = y_trans.squeeze(axis=1)
        return self.regressor_.score(X, y_trans)
Let us try with a different regressor
from sklearn.ensemble import BaggingRegressor
tt_reg = MyTransformedTargetRegressor(
    regressor=BaggingRegressor(),
    transformer=time_transformer,
    check_inverse=_check_inverse
)
import numpy as np
n_samples =10000
X = np.arange(n_samples).reshape(-1,1)
y = pd.date_range(start=0, periods=n_samples, freq='min')
tt_reg = tt_reg.fit(X, y)
tt_reg.predict(X)
print(tt_reg.score(X, y))
# 0.9999999891236799
I wanted to create my own transformer using scikit-learn's FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to invert the transformation just to see the end result. However, when I tried inverse_transform, it returned the same thing as the transformation. How do I get the original values back? I ask because I plan to use this transformation on a target variable and then make predictions. Those predictions will need to be inverse-transformed after I predict.
As a side note, should I fit on y_train and transform y_test? Or can I transform y all at once?
My transformer:
import numpy as np
import pandas as pd  # needed for pd.Series below
from sklearn.preprocessing import FunctionTransformer
import random
randomlist = []
for i in range(0,100):
    n = random.randint(1,100)
    randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
target_trans = FunctionTransformer(np.log, validate=True, check_inverse = True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1,1))
logy_test = target_trans.transform(y_test.values.reshape(-1,1))
target_trans.inverse_transform(y_train.values.reshape(-1,1))
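In FunctionTransformer() it is not enough to set check_inverse=True; you also need to pass the actual inverse function via inverse_func. If inverse_func is left unset, inverse_transform is the identity and simply returns its input.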
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                   validate=True, check_inverse=True)
which yields the desired result.
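A short round-trip sketch, using a few made-up positive targets instead of the random list, shows both behaviours side by side. Note that inverse_transform has to be given the transformed values, not the original ones.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

y = np.array([1.0, 10.0, 100.0]).reshape(-1, 1)  # hypothetical positive targets

log_trans = FunctionTransformer(np.log, inverse_func=np.exp,
                                validate=True, check_inverse=True)
log_y = log_trans.fit_transform(y)

# With inverse_func set, the round trip recovers the original values.
recovered = log_trans.inverse_transform(log_y)

# Without inverse_func, inverse_transform is the identity and returns its input.
no_inverse = FunctionTransformer(np.log, validate=True, check_inverse=False)
no_inverse.fit(y)
unchanged = no_inverse.inverse_transform(log_y)

print(recovered.ravel(), unchanged.ravel())
```

On the side question: FunctionTransformer is stateless, so fit learns nothing from the data. For a fixed function such as np.log, transforming y all at once or separately for train and test should give the same values.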