Loading SKLearn cancer dataset into Pandas DataFrame - python

I'm trying to load a scikit-learn dataset into a DataFrame, but judging by the keys (target_names, target & DESCR) I'm missing a column. I have tried various ways to include the last column, but keep getting errors.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
The keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names'].
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above I only get 30 columns, but I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?

Another option, as a one-liner, to create the DataFrame including both the features and the target variable:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
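A quick sanity check: the breast cancer data has 569 samples and 30 features, so the combined frame should come out as 569 rows by 31 columns.
print(df.shape)  # (569, 31)
print(df.columns[-1])  # 'target'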

If you want a target column you will need to add it yourself, because it's not in cancer.data. cancer.target holds the values 0 or 1, and cancer.target_names holds their labels. I hope the following is what you want:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
# data.describe() won't show the "target" column below because its values are now strings.
print(data.shape)
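If you still want the string-valued target column summarized, describe can be asked to include non-numeric columns as well:
print(data.describe(include='all'))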

This works too, also using pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)

Only the target column is missing, so you can just add it:
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

Mapping the target names can be handled elegantly using map():
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))

As of scikit-learn 0.23 you can do the following to get a DataFrame with the target column included.
data = load_breast_cancer(as_frame=True)  # returns a Bunch, not a DataFrame
df = data.frame  # DataFrame with the 30 features plus the 'target' column
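If you'd rather have the features and target as separate pandas objects, as_frame also combines with return_X_y; a small sketch:
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# X is a DataFrame with the 30 feature columns, y is a Series named 'target'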

Related

Changing values in a pandas.DataFrame

Hi everybody, I'm new to the Python world and I'm trying to learn pandas and TensorFlow.
At the moment I have a DataFrame with positive and negative values that I want to rescale.
For example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.read_excel ('/Users/dataset.xlsx')
print(df[:])
scaler = MinMaxScaler(feature_range=(0,1))
df_absolute = df.abs()
df_scaled = scaler.fit_transform(df_absolute)
#df_mod = df_scaled.loc[(df<0)] = df_scaled*-1
df_normalized = pd.DataFrame(df_mod)
print(df_normalized[:])
I get an error on the line with the #, something about 'numpy.ndarray'.
How can I resolve this?
In
df = pd.read_excel ('/Users/dataset.xlsx')
there is a whitespace before the parenthesis; remove it:
df = pd.read_excel('/Users/dataset.xlsx')
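The 'numpy.ndarray' error itself comes from the commented-out line: fit_transform returns a NumPy array, which has no .loc. A minimal sketch of one way to scale the absolute values and then restore the original signs (assuming df holds only numeric columns):
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_excel('/Users/dataset.xlsx')
scaler = MinMaxScaler(feature_range=(0, 1))
df_absolute = df.abs()
# wrap the scaled array back into a DataFrame to keep row/column labels
df_scaled = pd.DataFrame(scaler.fit_transform(df_absolute),
                         columns=df.columns, index=df.index)
# multiply by the sign of the original values to restore the negatives
df_normalized = df_scaled * np.sign(df)
print(df_normalized)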

Applying scikit-learn preprocessing to pandas without causing warnings

I'm trying to use scikit-learn's preprocessing to min-max scale a column on pandas. My solution works but gives me two warnings, and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a DataFrame and columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning twice:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
You probably derived df from another DataFrame somewhere in your code. I recommend finding the lines where you used df and making sure to take an explicit copy, like df = old_df.copy().
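For the iris example above, a minimal sketch of the fix (reusing minMaxScale and the imports from the snippet):
df = sns.load_dataset('iris')
df2 = df.copy()  # explicit copy instead of the view df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df2, ['sepal_length'])  # no SettingWithCopyWarning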

Change pandas DataFrame to numpy array but keeping column names

I have a pandas DataFrame from the sklearn.datasets Boston house price data and am trying to convert this to a numpy array but keeping column names. Here is the code I tried:
from sklearn import datasets ## imports datasets from scikit-learn
import numpy as np
import pandas as pd
data = datasets.load_boston() ## loads Boston dataset from datasets library
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()
print(X.dtype.names)
However, this returns None, so the column names are not kept. Does anyone understand why?
Thanks
Try this:
w = data.feature_names.reshape(13, 1)
X = np.vstack((w.T, data.data))  # puts the names in the first row; the array's dtype becomes string
print(X)
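A plain ndarray has no named columns, which is why X.dtype.names is None. If the goal is a NumPy array whose dtype keeps the column names, one option is a structured array via pandas' own converter (using df from the question):
X = df.to_records(index=False)
print(X.dtype.names)  # ('CRIM', 'ZN', 'INDUS', ...)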

How to 'Write to a New .CSV File' or 'Save As a New .CSV File' in Python

I have a CSV file and I want to apply one-hot encoding, then save the new DataFrame (dataset) as a new CSV file. But when the new file is saved, it only writes 5 rows of dummies next to all rows of the original dataset!
I just want to save all rows and columns in the new file.csv; the final shape of the dataset should be (237124, 417).
My code contains:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import csv
dataset=pd.read_csv("C:/Users/User/Desktop/data.csv",encoding='cp1252')
dataset.shape
#output: (237124, 37)
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ]).head()
dummies.shape
#output : (5, 380)
dataset = pd.concat([dataset, dummies], axis=1)
dataset.shape
#output: (237124, 417)
# i want this shape(original+dummies)
dataset.to_csv('OneHotEncoding.csv', index=False)
You call .head() in this line:
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ]).head()
This is why you only get 5 dummy rows. Remove the .head() and you get all rows.
head() returns only the first 5 rows by default, so apply get_dummies() without head():
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import csv
dataset=pd.read_csv("C:/Users/User/Desktop/data.csv",encoding='cp1252')
dataset.shape
#output: (237124, 37)
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ])
dummies.shape
#output: (237124, 380)
dataset = pd.concat([dataset, dummies], axis=1)
dataset.shape
#output: (237124, 417)
# i want this shape(original+dummies)
dataset.to_csv('OneHotEncoding.csv', index=False)
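One thing to double-check: pd.get_dummies(dataset, columns=[...]) already returns the untouched columns alongside the new dummy columns, so the concat duplicates the original 37 columns. If that duplication is not actually wanted, the dummies frame can be written out directly (a sketch, keeping the abbreviated column list from above):
dummies = pd.get_dummies(dataset, columns=["name","mark",....... ])
dummies.to_csv('OneHotEncoding.csv', index=False)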

Merge gives me many more rows in the dataframe

Update: As mentioned in the comments, my indices weren't unique; I worked around it via pivot_table.
I have the following code to perform clustering on a df of approximately 80K rows (this df is named 'Kmeans'). I then have another df with a column in common with 'Kmeans' (namely 'SKU_NR') and slightly fewer than 80K rows (this df is named 'Historie'). I want to merge df 'Kmeans' with df 'Historie', but when I do, it gives me over 2M rows. I've done this before and it worked then. What's going wrong in the code?
#load in libraries
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#Load and prepare data
Historie = pd.read_excel("file.xlsx")
Kmeans = Historie[['SKU_NR','ORDER_ADV_CONS_UNITS_WK_PICK']]
Kmeans = Kmeans.dropna()
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(Kmeans)
km.predict(Kmeans)
labels = km.labels_
Kmeans["Classification"] = labels
Kmeans = Kmeans[["SKU_NR","Classification"]]
Historie = Historie[['SKU_NR', 'WEEKNR', 'ORDER_ADV_CONS_UNITS_WK_PICK',
                     'FORECAST_NEC_STOCK_BASE']]
Historie = Historie.merge(Kmeans, on = "SKU_NR")
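The row explosion is the classic many-to-many merge: 'SKU_NR' repeats in both frames (Kmeans was built from the week-level rows of Historie), so every matching pair of rows is produced. Since Kmeans should hold one classification per SKU, a sketch of a guard before merging (merge's validate argument makes the assumption explicit):
Kmeans = Kmeans.drop_duplicates(subset='SKU_NR')
Historie = Historie.merge(Kmeans, on='SKU_NR', validate='many_to_one')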
