Sklearn complains about one-column dataframes

Consider the following minimal example:
from time import sleep  # To (try to) get warnings printed at the right places
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier

df = pd.DataFrame([[1, 1, 1, 1], [0, 0, 0, 0]])
mlp = MLPClassifier(tol=10)
dummy = DummyClassifier(strategy='uniform')
for size in [1, 2]:
    input_columns = [0, 1]
    output_columns = [j + 2 for j in range(size)]
    print('Dimension of output: ', len(output_columns))  # Is 1 or 2
    X = df[input_columns]
    Y = df[output_columns]
    print('MLPClassifier')
    mlp.fit(X, Y)
    sleep(3)
    print('DummyClassifier')
    dummy.fit(X, Y)
    sleep(3)
    print('\n\n\n')
At the first iteration, during the training of the MLPClassifier, Sklearn complains:
lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:934: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
The second iteration runs fine. The DummyClassifier (dummy.fit) runs fine in both iterations.
The warning appears because I'm passing a one-column dataframe (Y) to mlp.fit. It doesn't happen on the second iteration, where Y is a two-column dataframe.
The question is: how can I properly pass the data to fit in the case of MLPClassifier? I've learned I can do Y = Y.values.ravel(), which works when the dataframe is one-column, but then it doesn't work for two-column dataframes. I'm looking for a consistent way to solve this generically for any number of columns.

One approach is checking beforehand whether the number of columns is 1:
if len(output_columns) == 1:
    mlp.fit(X, Y.values.ravel())
else:
    mlp.fit(X, Y)
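A more generic variant of the same idea is a small helper that flattens the target only when it has a single column. This is a sketch; the helper name as_target is invented here:
import numpy as np

def as_target(Y):
    """Return the target as a 1-D array if it has one column, unchanged otherwise."""
    arr = np.asarray(Y)
    if arr.ndim == 2 and arr.shape[1] == 1:
        return arr.ravel()
    return arr

mlp.fit(X, as_target(Y))  # works for any number of output columns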

Related

How to use the k means algorithm by importing data from JSON

I'm trying to use the k-means algorithm for some data contained within JSON files.
If, for example, I had multiple JSON files similar to this (example_file1.JSON), where each JSON file contained a different numeric value of "provides", how could I apply the k-means algorithm to the values of "provides" (with the value of k varying between 3 and 8)?
Example_file1.JSON
{
    "Accenture": {
        "platform": 0,
        "provides": 2,
        "government": 0,
        "through": 1,
        "clients": 0,
        "business": 1,
        "financial": 0,
        "services": 2,
        "information": 0,
        "research": 0,
        "health": 0
    },
    ...
}
Thanks
Use sklearn.cluster.KMeans. As already explained here, you need to parse all the JSON files and create a numpy array containing the features (in your case the value of "provides"), and then use it to fit the k-means.
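For the parsing step itself, here is a minimal sketch, assuming the files sit on disk with the structure shown above (the glob pattern below is hypothetical):
import json
import glob
import numpy as np

values = []
for path in glob.glob("example_file*.JSON"):  # hypothetical file name pattern
    with open(path) as f:
        data = json.load(f)
    for company in data.values():  # e.g. the "Accenture" entry
        values.append(company["provides"])

X = np.array(values, dtype=float).reshape(-1, 1)  # one sample per file, one feature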
EDIT:
Let's provide a more complete answer:
import random
import string
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create a test set of input json
N = 100
rng = np.random.default_rng()
numbers = rng.choice(N + 1, N // 2, replace=False)  # Let's create 2 clusters
json_files = list()
for idx in range(N):
    new_json = dict()
    for _ in range(4):
        new_json["".join(random.choice(string.ascii_uppercase) for _ in range(10))] = random.randint(0, 5)
    if idx in numbers:
        new_json["provides"] = random.uniform(0, 5)  # First cluster is between 0 and 5
    else:
        new_json["provides"] = random.uniform(10, 15)  # Second cluster is between 10 and 15
    json_files.append({"Accenture": new_json})
This way, we created a generic set of 100 input JSON files. Each of them has different entries with random keys. For half of them we inserted "provides" values between 0 and 5 (to get a clear first cluster), for the other half between 10 and 15.
# Let's visualize it
X = np.array([json["Accenture"]["provides"] for json in json_files])
Y = np.ones(len(X))
plt.scatter(X, Y)
plt.show()
The setup is pretty clean; now let's perform the actual k-means:
model = KMeans(n_clusters=2)
model.fit(X.reshape(-1, 1)) # Reshape is needed if you have only 1 feature
prediction = model.predict(X.reshape(-1, 1))
plt.scatter(X[prediction == 1], Y[prediction == 1], c="green")
plt.scatter(X[prediction == 0], Y[prediction == 0], c="red")
plt.show()
This is the expected result. The prediction variable now holds the cluster label for each JSON file.
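Since the question asks for k varying between 3 and 8, here is a minimal sketch of comparing several values of k with the silhouette score (one common heuristic for choosing k, not the only one):
from sklearn.metrics import silhouette_score

# Compare k = 3..8, as the question asks; higher silhouette is better
for k in range(3, 9):
    model = KMeans(n_clusters=k, n_init=10)
    labels = model.fit_predict(X.reshape(-1, 1))
    print(k, silhouette_score(X.reshape(-1, 1), labels))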

Converting target dataset into a classification dataset – Pandas

I'm trying to convert the dataset into a classification dataset by:
Step 1: Split the range of target values into three equal parts - low, mid, and high.
Step 2: Reassign the target values into three categorical values 0, 1, and 2, representing the low, mid and high ranges of values, respectively.
I tried a different approach using the method suggested in this post: How to automatically categorise data in panda dataframe? but didn't get the result I wanted. Any suggestions?
Dataset in question:
from sklearn.datasets import load_boston
data = load_boston()
X = data.data
y = data.target
Let's find the lowest value and replace it with a placeholder (100) higher than max(y) (50 in this dataset); we repeat this for the lowest 33% of y, then do the same twice more with other values higher than max(y) (200 and 300).
Then we use a function to map 100, 200 and 300 to 0, 1 and 2.
from sklearn.datasets import load_boston

data = load_boston()
X = data.data
y = data.target
y = list(y)
print(y)
for i in range(len(y)):
    index = y.index(min(y))  # position of the smallest remaining original value
    if i < len(y) / 3:
        y[index] = 100
    elif i < 2 * (len(y) / 3):
        y[index] = 200
    else:
        y[index] = 300

def split_in_3(y):
    if y == 100:
        return 0
    elif y == 200:
        return 1
    else:
        return 2

y2 = map(split_in_3, y)
print(list(y2))
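For reference, the same three-way split can be done more idiomatically in pandas. Here is a minimal sketch using pd.cut, which splits the range of y into three equal-width bins (use pd.qcut instead if you want three equal-sized groups, which is what the loop above produces):
import pandas as pd
from sklearn.datasets import load_boston

data = load_boston()
y = pd.Series(data.target)

# Three equal-width bins over the range of y, labelled 0 (low), 1 (mid), 2 (high)
y_class = pd.cut(y, bins=3, labels=[0, 1, 2])
print(y_class.value_counts())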

Clustering dataframe after concatenation of x and y

I have x and y arrays; x consists of three one-element arrays, and y consists of three arrays of seven values each.
x= [np.array([6.03437288]), np.array([6.39850922]), np.array([6.07835145])]
y= [np.array([[-1.06565856, -0.16222044, 7.85850477, -2.62498475, -0.46315498,
-0.33087472, -0.1394244 ]]),
np.array([[-1.41487104e+00, 5.81421750e-03, 7.92917001e+00,
-3.37987517e+00, 1.14685839e-01, -2.91779263e-01,
2.51753851e-01]]),
np.array([[-1.56496814, 0.2612637 , 7.60577761, -3.55727614, 0.18844392,
-0.75112678, -0.48055978]])]
I concatenate x and y into one dataframe
df = pd.DataFrame({'x': x,'y': y})
then I tried to cluster this dataframe by k-medoids
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(df)
cluster_labels = kmedoids.predict(df)
but I faced this error
ValueError: setting an array element with a sequence.
I tried to search for a solution to this problem but haven't found a concrete one. Any suggestions, even ones that modify the code, are welcome.
Given the arrays x and y as provided in the question:
import numpy as np
import pandas as pd
from sklearn_extra.cluster import KMedoids
df = pd.DataFrame({'x': x,'y': y})
First, concatenate the x and y of each dataframe row into one array:
df2 = df.apply(lambda r: np.append(r.x, r.y), axis = 1)
Then create one X array:
X = np.array(df2.values.tolist())
that can be passed to clustering method:
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(X)
cluster_labels = kmedoids.predict(X)
result of clustering:
array([2, 0, 1], dtype=int64)
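Equivalently, you can build X directly from the original lists without the intermediate dataframe (a sketch, assuming x and y as defined in the question):
import numpy as np

# Each row: the single x value followed by the seven y values
X = np.array([np.append(xi, yi) for xi, yi in zip(x, y)])
print(X.shape)  # (3, 8)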

IndexError when plotting sklearn manifold TSNE

I'm trying to run t-SNE, but Python shows me this error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
The data is provided at this link.
Here's the code:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
#Step 1 - Download the data
dataframe_all = pd.read_csv('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv')
num_rows = dataframe_all.shape[0]
#Step 2 - Clean the data
#count the number of missing elements (NaN) in each column
counter_nan = dataframe_all.isnull().sum()
counter_without_nan = counter_nan[counter_nan==0]
#remove the columns with missing elements
dataframe_all = dataframe_all[counter_without_nan.keys()]
#remove the first 7 columns which contain no discriminative information
dataframe_all = dataframe_all.iloc[:,7:]
#Step 3: Create feature vectors
x = dataframe_all.iloc[:,:-1].values
standard_scalar = StandardScaler()
x_std = standard_scalar.fit_transform(x)
# t distributed stochastic neighbour embedding (t-SNE) visualization
tsne = TSNE(n_components=2, random_state = 0)
x_test_2d = tsne.fit_transform(x_std)
#scatter plot the sample points among 5 classes
markers=('s','d','o','^','v')
color_map = {0:'red', 1:'blue', 2:'lightgreen', 3:'purple', 4:'cyan'}
plt.figure()
for idx, cl in enumerate(np.unique(x_test_2d)):
    plt.scatter(x=x_test_2d[cl, 0], y=x_test_2d[cl, 1], c=color_map[idx], marker=markers[idx], label=cl)
plt.show()
What do I have to change in order to make this work?
The error is due to the following line:
plt.scatter(x_test_2d[cl, 0], x_test_2d[cl, 1], c=color_map[idx], marker=markers[idx])
Here, cl takes non-integer values (from np.unique(x_test_2d)), and this is what raises the error: e.g. the last value cl takes is 99.46295, so x_test_2d[cl, 0] translates into x_test_2d[99.46295, 0].
Define a variable y that holds the class labels, then use:
# variable holding the classes
y = dataframe_all.classe.values
y = np.array([ord(i) for i in y])
#scatter plot the sample points among 5 classes
plt.figure()
plt.scatter(x_test_2d[:, 0], x_test_2d[:, 1], c = y)
plt.show()
FULL CODE:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
#Step 1 - Download the data
dataframe_all = pd.read_csv('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv')
num_rows = dataframe_all.shape[0]
#Step 2 - Clean the data
#count the number of missing elements (NaN) in each column
counter_nan = dataframe_all.isnull().sum()
counter_without_nan = counter_nan[counter_nan==0]
#remove the columns with missing elements
dataframe_all = dataframe_all[counter_without_nan.keys()]
#remove the first 7 columns which contain no discriminative information
dataframe_all = dataframe_all.iloc[:,7:]
#Step 3: Create feature vectors
x = dataframe_all.iloc[:,:-1].values
standard_scalar = StandardScaler()
x_std = standard_scalar.fit_transform(x)
# t distributed stochastic neighbour embedding (t-SNE) visualization
tsne = TSNE(n_components=2, random_state = 0)
x_test_2d = tsne.fit_transform(x_std)
# variable holding the classes
y = dataframe_all.classe.values # you need this for the colors
y = np.array([ord(i) for i in y]) # convert letters to numbers
#scatter plot the sample points among 5 classes
plt.figure()
plt.scatter(x_test_2d[:, 0], x_test_2d[:, 1], c = y)
plt.show()
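If you want the per-class markers and legend from the original snippet, loop over the unique class labels instead of over the embedding values. This is a sketch reusing the asker's markers and color_map; chr converts the ord codes back to letters:
markers = ('s', 'd', 'o', '^', 'v')
color_map = {0: 'red', 1: 'blue', 2: 'lightgreen', 3: 'purple', 4: 'cyan'}
plt.figure()
for idx, cl in enumerate(np.unique(y)):
    mask = (y == cl)  # boolean mask selecting the samples of this class
    plt.scatter(x_test_2d[mask, 0], x_test_2d[mask, 1],
                c=color_map[idx], marker=markers[idx], label=chr(cl))
plt.legend()
plt.show()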

roc_auc_score - Only one class present in y_true

I am doing k-fold cross-validation on an existing dataframe, and I need to get the AUC score.
The problem is - sometimes the test data only contains 0s, and not 1s!
I tried using this example, but with different numbers:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)
And I get this exception:
ValueError: Only one class present in y_true. ROC AUC score is not
defined in that case.
Is there any workaround that can make it work in such cases?
You could use try-except to prevent the error:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
try:
    roc_auc_score(y_true, y_scores)
except ValueError:
    pass
You could also set roc_auc_score to zero when only one class is present. However, I wouldn't do this. I guess your test data is highly unbalanced, so I would suggest using stratified k-fold instead, so that both classes are present in every fold.
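Here is a minimal sketch of the stratified approach, using synthetic data and a generic classifier (both invented for illustration):
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

# Synthetic data, invented for illustration
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    # Each test fold preserves the class ratio, so both classes are present
    print(roc_auc_score(y[test_idx], scores))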
As the error notes, if a class is not present in the ground truth of a batch,
ROC AUC score is not defined in that case.
I'm against either throwing an exception (about what? This is the expected behaviour) or returning another metric (e.g. accuracy). The metric is not broken per se.
I don't feel like solving a data imbalance "issue" with a metric "fix". It would probably be better to use another sampling strategy, if possible, or simply to join multiple batches that satisfy the class population requirement.
I am facing the same problem now, and using try-except does not solve my issue, so I developed the code below to deal with it.
import pandas as pd
import numpy as np

class KFold(object):
    def __init__(self, folds, random_state=None):
        self.folds = folds
        self.random_state = random_state

    def split(self, x, y):
        assert len(x) == len(y), 'x and y should have the same length'
        x_, y_ = pd.DataFrame(x), pd.DataFrame(y)
        y_ = y_.sample(frac=1, random_state=self.random_state)
        x_ = x_.loc[y_.index]
        # Build the masks from the shuffled frame itself, not the original y
        event_index = list(y_[y_.iloc[:, 0] == 1].index)
        non_event_index = list(y_[y_.iloc[:, 0] == 0].index)
        assert len(event_index) >= self.folds, 'number of folds should not exceed the number of event rows'
        assert len(non_event_index) >= self.folds, 'number of folds should not exceed the number of non-event rows'
        indexes = []
        # Distribute the non-event rows across the folds
        step = int(np.ceil(len(non_event_index) / self.folds))
        start, end = 0, step
        while start < len(non_event_index):
            train_fold = set(non_event_index[start:end])
            valid_fold = set([k for k in non_event_index if k not in train_fold])
            indexes.append([train_fold, valid_fold])
            start, end = end, min(step + end, len(non_event_index))
        # Distribute the event rows and merge them into the folds built above
        step = int(np.ceil(len(event_index) / self.folds))
        start, end, i = 0, step, 0
        while start < len(event_index):
            train_fold = set(event_index[start:end])
            valid_fold = set([k for k in event_index if k not in train_fold])
            indexes[i][0] = list(indexes[i][0].union(train_fold))
            indexes[i][1] = list(indexes[i][1].union(valid_fold))
            indexes[i] = tuple(indexes[i])
            start, end, i = end, min(step + end, len(event_index)), i + 1
        return indexes
I just wrote this code and haven't tested it exhaustively; it was tested only for binary categories. I hope it's useful nonetheless.
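A hypothetical usage sketch (the arrays below are invented for illustration; split returns a list of (train, valid) index tuples):
import numpy as np

x = np.random.rand(20, 3)
y = np.array([0, 1] * 10)  # both classes present

kf = KFold(folds=4, random_state=0)
for train_index, valid_index in kf.split(x, y):
    print(len(train_index), len(valid_index))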
You can increase the batch size, e.g. from 32 to 64, or use StratifiedKFold or StratifiedShuffleSplit. If the error still occurs, try shuffling your data, e.g. in your DataLoader.
Simply changing one of the 0s in y_true to a 1 makes it work:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 1, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)
As the error message suggests, there is only one class in y_true (all zeros); you need to provide two classes in y_true.
