Applying scikit-learn preprocessing to pandas without causing warnings - python

I'm trying to use scikit-learn's preprocessing to min-max scale a column of a pandas DataFrame. My solution works but gives me two warnings, and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a DataFrame and a list of columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it:
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning twice:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?

Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
You probably derived df from another DataFrame somewhere in your code. I recommend finding the lines where you created df and making an explicit copy, like df = old_df.copy().
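For illustration, a minimal sketch of that fix applied to the iris example above; the only change is the explicit .copy(), which gives the new frame its own data so the assignment no longer warns:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])

df = sns.load_dataset('iris')
# .copy() instead of df[:] -- the new frame owns its data, so the
# .loc assignment below does not raise SettingWithCopyWarning.
df2 = df.copy()
df2.loc[:, 'newcolumn'] = minMaxScale(df2, ['sepal_length'])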

Related

Problems implementing Dask MinMaxScaler

I am having problems normalizing a dask.dataframe.core.DataFrame using dask_ml.preprocessing.MinMaxScaler. I am able to use sklearn.preprocessing.MinMaxScaler, but I wish to use Dask to scale up.
Minimal, Reproducible Example:
import dask.dataframe as dd

# Get data
ddf = dd.read_csv('test.csv')  # See below
ddf = ddf.set_index('index')
# Pivot
ddf = ddf.categorize(columns=['item', 'name'])
ddf_p = ddf.pivot_table(index='item', columns='name', values='value', aggfunc='mean')
col = ddf_p.columns.to_list()
# sklearn version
from sklearn.preprocessing import MinMaxScaler
scaler_s = MinMaxScaler()
scaled_ddf_s = scaler_s.fit_transform(ddf_p[col]) # Works!
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Doesn't work
Error message:
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
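For context (this note is not from the original thread): this is pandas' standard complaint whenever min or max is requested on an unordered categorical, which is what a min-max scaler needs to compute. A minimal reproduction:
import pandas as pd

# An unordered Categorical has no defined ordering, so min/max fail.
s = pd.Series(pd.Categorical(['A', 'B', 'C'], ordered=False))
# s.min()  # raises TypeError: Categorical is not ordered for operation min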
I'm not sure what the 'Categorical' in the pivoted table is, but I have tried to .as_ordered() the index:
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p = ddf_p.index.cat.as_ordered()
scaled_values_d = scaler_d.fit_transform(ddf_p[col])
But I get the error message:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
Additional information
test.csv:
index,item,name,value
2015-01-01,item_1,A,1
2015-01-01,item_1,B,2
2015-01-01,item_1,C,3
2015-01-01,item_1,D,4
2015-01-01,item_1,E,5
2015-01-02,item_2,A,10
2015-01-02,item_2,B,20
2015-01-02,item_2,C,30
2015-01-02,item_2,D,40
2015-01-02,item_2,E,50
Looking at this answer:
pivot_table produces a column index which is categorical because you made the original column "Field" categorical. Writing the index to parquet calls reset_index on the data-frame, and pandas cannot add a new value to the columns index, because it is categorical. You can avoid this using ddf.columns = list(ddf.columns).
Therefore, adding ddf_p.columns = list(ddf_p.columns) solved the problem:
# dask version
from dask_ml.preprocessing import MinMaxScaler
scaler_d = MinMaxScaler()
ddf_p.columns = list(ddf_p.columns)
scaled_values_d = scaler_d.fit_transform(ddf_p[col]) # Works!

Changing values in a pandas.DataFrame

Hi everybody, I'm new to the Python world and I'm trying to learn pandas and TensorFlow.
At the moment I have a DataFrame with positive and negative values that I want to rescale.
For example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.read_excel ('/Users/dataset.xlsx')
print(df[:])
scaler = MinMaxScaler(feature_range=(0,1))
df_absolute = df.abs()
df_scaled = scaler.fit_transform(df_absolute)
#df_mod = df_scaled.loc[(df<0)] = df_scaled*-1
df_normalized = pd.DataFrame(df_mod)
print(df_normalized[:])
I get an error on the line marked with #, mentioning 'numpy.ndarray'.
How can I resolve this?
In the
df = pd.read_excel ('/Users/dataset.xlsx')
there is a whitespace ' ' that should be removed:
df = pd.read_excel('/Users/dataset.xlsx')
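The error on the commented-out line has a different cause, though: fit_transform returns a numpy ndarray, which has no .loc indexer. A hedged sketch of one way to do what that line seems to intend, assuming the goal is to scale the magnitudes into [0, 1] and then restore the original signs:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'x': [-4.0, -1.0, 0.0, 2.0, 8.0]})  # stand-in for the Excel data

scaler = MinMaxScaler(feature_range=(0, 1))
# fit_transform returns an ndarray; wrap it back into a DataFrame
# so the index and columns are preserved and .loc works again.
scaled_abs = pd.DataFrame(scaler.fit_transform(df.abs()),
                          index=df.index, columns=df.columns)
df_normalized = scaled_abs * np.sign(df)  # restore the original signs
print(df_normalized)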

Merge gives me much more rows in the dataframe

Update: As mentioned in the comments, my indices weren't unique. I worked around it via a pivot_table.
I have the following code to perform a clustering on a df. This df is approximately 80K rows (it is named 'Kmeans'). I then have another df with a column in common with 'Kmeans' (namely 'SKU_NR') and slightly fewer than 80K rows (this df is named 'Historie'). I want to merge df 'Kmeans' with df 'Historie', but when I do this, it gives me over 2M rows. I've done this before and it worked then. What's going wrong in the code?
#load in libraries
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#Load and prepare data
Historie = pd.read_excel("file.xlsx")
Kmeans = Historie[['SKU_NR','ORDER_ADV_CONS_UNITS_WK_PICK']]
Kmeans = Kmeans.dropna()
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(Kmeans)
km.predict(Kmeans)
labels = km.labels_
Kmeans["Classification"] = labels
Kmeans = Kmeans[["SKU_NR","Classification"]]
Historie = Historie[['SKU_NR', 'WEEKNR', 'ORDER_ADV_CONS_UNITS_WK_PICK',
                     'FORECAST_NEC_STOCK_BASE']]
Historie = Historie.merge(Kmeans, on = "SKU_NR")
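As the update notes, the blow-up is the usual many-to-many merge: when 'SKU_NR' values are duplicated in both frames, merge emits the cartesian product per key. A small sketch (illustrative data, not the original) of the effect and of the usual fix, deduplicating the classification table first:
import pandas as pd

historie = pd.DataFrame({'SKU_NR': [1, 1, 2, 2], 'WEEKNR': [1, 2, 1, 2]})
kmeans = pd.DataFrame({'SKU_NR': [1, 1, 2], 'Classification': [0, 0, 1]})

# Duplicated keys multiply: SKU 1 gives 2 x 2 rows, SKU 2 gives 2 x 1.
print(len(historie.merge(kmeans, on='SKU_NR')))  # 6, not 4

# Keep one row per SKU_NR on the lookup side before merging.
kmeans_unique = kmeans.drop_duplicates(subset='SKU_NR')
print(len(historie.merge(kmeans_unique, on='SKU_NR')))  # 4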

Loading SKLearn cancer dataset into Pandas DataFrame

I'm trying to load one of the sklearn datasets, and a column is missing according to the keys (target_names, target & DESCR). I have tried various methods to include the last column, but with errors.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
the keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above, it only returns 30 columns, when I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?
Another option, a one-liner, to create the DataFrame including both the features and the target variable:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
If you want to have a target column you will need to add it, because it's not in cancer.data. cancer.target has the column with values 0 or 1, and cancer.target_names has the labels. I hope the following is what you want:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because its values were converted to strings.
This works too, also using pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)
Only the target column is missing, so you can just add it:
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
Mapping target names can be handled elegantly using map():
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
As of scikit-learn 0.23 you can do the following to get a DataFrame with the target column included.
df = load_breast_cancer(as_frame=True)
df.frame
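A small usage note on the as_frame answer: load_breast_cancer(as_frame=True) still returns a Bunch, and its .frame attribute is the combined DataFrame, so binding the return value to a name like df can be misleading:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)  # a Bunch, not a DataFrame
df = data.frame                           # 30 feature columns plus 'target'
print(df.shape)  # (569, 31)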

OneVsOneClassifier fails with pandas DataFrame

I have been using pandas DataFrames with scikit-learn models. Internally, the pandas DataFrame is converted to a numpy ndarray transparently (e.g. LogisticRegression, SVC, MultinomialNB, etc.).
However, sklearn.multiclass.OneVsOneClassifier fails when given a pandas DataFrame, whereas it behaves properly when given a numpy ndarray.
This can be reproduced with this code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
df = pd.DataFrame({'a': range(15), 'b': range(20, 35)})
labels = pd.Series(['a', 'b', 'c'] * 5)
clf = OneVsOneClassifier(LogisticRegression())
# clf = LogisticRegression()  # This works fine
clf.fit(df, labels)
Can someone confirm whether this is a bug?
If not, please point me to the documentation where this is explained.
Additional information:
I traced the root cause to sklearn/multiclass.py, in the last line of the function _fit_ovo_binary:
return _fit_binary(estimator, X[ind[cond]], y_binary, classes=[i, j])
Here it assumes X is a numpy ndarray (while in reality it is a pandas DataFrame) and tries to use integer indexing, which results in the exception.
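As a workaround (my suggestion, not from the thread), passing the underlying ndarray instead of the DataFrame sidesteps the positional indexing:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

df = pd.DataFrame({'a': range(15), 'b': range(20, 35)})
labels = pd.Series(['a', 'b', 'c'] * 5)

clf = OneVsOneClassifier(LogisticRegression())
# .to_numpy() hands sklearn a plain ndarray, which supports the
# integer indexing used inside _fit_ovo_binary.
clf.fit(df.to_numpy(), labels.to_numpy())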
