I have been using pandas DataFrames with scikit-learn models. Internally, the pandas DataFrame is converted to a numpy ndarray transparently (e.g. LogisticRegression, SVC, MultinomialNB, etc.).
However, sklearn.multiclass.OneVsOneClassifier fails when given a pandas DataFrame, whereas it behaves properly when given a numpy ndarray.
This can be reproduced with this code.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
df = pd.DataFrame({'a': range(15), 'b': range(20, 35)})
labels = pd.Series(['a', 'b', 'c'] * 5)
clf = OneVsOneClassifier(LogisticRegression())
# clf = LogisticRegression() # This works fine
clf.fit(df, labels)
Can someone confirm whether this is a bug?
If not, please point me to the documentation where this is explained.
Additional information:
I traced the root cause to sklearn/multiclass.py, in the last line of the function _fit_ovo_binary:
return _fit_binary(estimator, X[ind[cond]], y_binary, classes=[i, j])
Here it assumes X is a numpy ndarray (while in reality it is a pandas DataFrame) and tries to use integer array indexing, which results in the exception.
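As a workaround until this is resolved, converting the inputs to plain numpy arrays before fitting sidesteps the indexing problem. A minimal sketch based on the reproduction code above (the .values conversion is the only change):
clf = OneVsOneClassifier(LogisticRegression())
# passing numpy arrays instead of the DataFrame/Series avoids the integer-indexing failure
clf.fit(df.values, labels.values)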
Related
I'm trying to use scikit-learn's preprocessing to min-max scale a column of a pandas DataFrame. My solution works, but it gives me two warnings and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a dataframe and columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning 2 times:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
Probably you derived df from another dataframe somewhere in your code. I recommend finding the lines where you created df and making sure to take an explicit copy, like df = old_df.copy().
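For example, the two-warning snippet above becomes warning-free if the slice is replaced with an explicit copy (a small sketch reusing the iris example, with .copy() as the only change):
df = sns.load_dataset('iris')
df2 = df.copy()  # explicit copy instead of df[:], so df2 owns its own data
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])  # no SettingWithCopyWarning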
Using scikit-learn or another suitable library, I can perform univariate linear regression of one column against the dependent variable. If I want to run such a univariate regression between the dependent variable and each independent variable, I can do it with a for loop.
But is there a vectorised way to do so without the use of a for loop?
For loop implementation:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'x3': np.random.randn(20)})

for i in range(3):
    t = LinearRegression().fit(df[['x' + str(i + 1)]], df[['y']])
    print(t.coef_)
    print(t.intercept_)
Vectorized Implementation:
??
Related question with incorrect answer.
univariate regression in python
Scikit-learn APIs are not natively designed for your use case. I'm also not aware of a single high-level function that does what you're describing. However, this is a great problem for Dask, which will allow you to parallelize this quite nicely with very few changes to your code.
Dask's delayed API allows you to smoothly parallelize for-loops like in your example. I highly recommend checking out the introductory tutorial.
Original:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(12)
df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'x3': np.random.randn(20)})

for i in range(3):
    t = LinearRegression().fit(df[['x' + str(i + 1)]], df[['y']])
    print(t.coef_, t.intercept_)
[[0.12973704]] [0.11022991]
[[0.09058823]] [0.02903383]
[[0.12020571]] [0.02941156]
Dask Version (using delayed):
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from dask import delayed
from dask.distributed import Client
import dask
client = Client() # set up a local Dask cluster
def fit_ols(df, feature_col, target_col="y"):
    clf = LinearRegression()
    clf.fit(df[[feature_col]], df[target_col])
    return clf.coef_[0], clf.intercept_

results = []
for i in range(3):
    xcol = f"x{i+1}"
    res = delayed(fit_ols)(df, xcol)
    results.append(res)
dask.compute(results)
([(0.1297370368274274, 0.11022990787315554),
(0.09058823367556357, 0.029033826351961462),
(0.12020570872010947, 0.02941156049324152)],)
The key change is wrapping your logic into a function. With that, we can rely on delayed to add tasks to a task graph and then call compute on the graph to execute the tasks in parallel.
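In isolation, the delayed/compute pattern looks something like this (a toy sketch separate from the regression code, just to show that delayed only builds a task graph and compute executes it):
import dask
from dask import delayed

@delayed
def square(x):
    # any ordinary Python function can be wrapped with delayed
    return x ** 2

tasks = [square(i) for i in range(4)]  # builds the graph; nothing runs yet
print(dask.compute(tasks))             # executes the tasks -> ([0, 1, 4, 9],)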
Editing to include an example of how you might want to do this at a larger scale, since you have millions of columns. Note the scatter operation and passing of the future.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from dask import delayed
from dask.distributed import Client, LocalCluster
import dask
# set up a local cluster
# my laptop has limited CPU cores, so only using two to parallelize
client = Client(n_workers=2, threads_per_worker=1)
# Setup data
np.random.seed(12)
nrows = 10000
ncols = 1000
df = pd.DataFrame({f"x{i}": np.random.randn(nrows) for i in range(ncols)})
df['y'] = np.random.randn(nrows)
# scatter the data to the workers beforehand
data_future = client.scatter(df, broadcast=True)
# Many calls to linear regression, parallelized with Dask
def fit_ols(df, feature_col, target_col="y"):
    clf = LinearRegression()
    clf.fit(df[[feature_col]], df[target_col])
    return clf.coef_[0], clf.intercept_

results = []
for i in range(df.shape[1] - 1):
    xcol = f"x{i}"
    # note how I'm passing the scattered data future
    res = delayed(fit_ols)(data_future, xcol)
    results.append(res)
res = dask.compute(results)
res[0][:5]
[(0.01898470494963711, -0.0013691188314244067),
(-0.018210714678412274, -0.0015618953728564272),
(-0.013344320263479937, -0.001588615271127849),
(-0.006330810820098386, -0.0016927025190818918),
(0.007960791021720603, -0.0016594945951821688)]
I've attached a screenshot of the dashboard showing the computation in progress.
I have a pandas DataFrame created from the sklearn.datasets Boston house price data, and I am trying to convert it to a numpy array while keeping the column names. Here is the code I tried:
from sklearn import datasets ## imports datasets from scikit-learn
import numpy as np
import pandas as pd
data = datasets.load_boston() ## loads Boston dataset from datasets library
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()
print(X.dtype.names)
However, this returns None, and therefore the column names are not kept. Does anyone understand why?
Thanks
Try this:
w = data.feature_names.reshape(13, 1)
X = np.vstack((w.T, data.data))  # stacks the names as a first row; note the whole array becomes strings
print(X)
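If the goal is for X.dtype.names to actually hold the column names, an alternative sketch is to build a record array from the DataFrame with to_records (keeping the DataFrame construction from the question unchanged):
X = df.to_records(index=False)  # numpy record array with one named field per column
print(X.dtype.names)            # ('CRIM', 'ZN', 'INDUS', ...)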
I have some simple categorical data that I want to convert to a one-hot encoding in Python:
a,1,p
b,3,r
a,5,t
I tried to convert it with scikit-learn's OneHotEncoder:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
one_hot_encoder = OneHotEncoder(categorical_features=[0,2])
one_hot_encoder.fit(data.values)
This piece of code does not work and throws an error
ValueError: could not convert string to float: 't'
Can you please help me?
Try this:
import pandas as pd
from sklearn import preprocessing

# df is the DataFrame read from the CSV in the question
for c in df.columns:
    df[c] = df[c].apply(str)
    le = preprocessing.LabelEncoder().fit(df[c])
    df[c] = le.transform(df[c])
    df[c] = pd.to_numeric(df[c]).astype(float)
After this label-encoding every column is numeric, so OneHotEncoder no longer hits the could-not-convert-string-to-float error.
#user3104352,
I encountered the same behavior and found it frustrating.
Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
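For what it's worth, one way to sidestep the all-numeric requirement entirely is to let pandas do the encoding. A minimal sketch on the sample data from the question, using get_dummies on the two string columns (column indices 0 and 2 as in the question):
import pandas as pd
data = pd.read_csv("C:\\test.txt", sep=",", header=None)
# one-hot encode the string columns directly; the numeric column 1 is left untouched
encoded = pd.get_dummies(data, columns=[0, 2])
print(encoded.head())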
I'm trying to load a sklearn dataset into a pandas DataFrame and, according to the keys (target_names, target & DESCR), I am missing a column. I have tried various methods to include the last column, but with errors.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
the keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above, it only returns 30 columns, when I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?
Another option, a one-liner this time, to create the dataframe including both the features and the target variable:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
If you want a target column you will need to add it, because it's not in cancer.data. cancer.target has the column with values 0 or 1, and cancer.target_names has the labels. I hope the following is what you want:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because its values were converted to strings
This works too, also using pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)
Only the target column is missing, so you can just add it:
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
Mapping target names can be handled elegantly using map():
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
As of scikit-learn 0.23 you can do the following to get a DataFrame with the target column included.
data = load_breast_cancer(as_frame=True)  # returns a Bunch, not a DataFrame
data.frame  # the full DataFrame, including the target column
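As a related usage note, the same scikit-learn versions also let you combine as_frame with return_X_y if you'd rather get the features and target as separate pandas objects (a small sketch):
from sklearn.datasets import load_breast_cancer
# X is a DataFrame of features, y is a Series holding the target
X, y = load_breast_cancer(return_X_y=True, as_frame=True)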