How to reshape 3D data for sklearn classifiers - Python

In my project I am trying to use sklearn classifiers, but I can't feed the data into a model. Each cell of the data holds a list of 3 coordinates, and the error is ValueError: setting an array element with a sequence.
Shape of the data:
dataset1 = pd.read_csv(...)
dataset2 = pd.read_csv(...)
X = dataset1.iloc[:178, 2:35]
y = dataset2.iloc[:, 2:35]
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=32)
classifier.fit(X_train, y_train)
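No answer is shown for this question, but the "setting an array element with a sequence" error usually means a column whose cells hold Python lists, so the resulting array has dtype object. A minimal sketch of one way around it, with made-up data standing in for the question's CSV files, is to expand each 3-coordinate list into plain numeric values before fitting:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: each cell of 'coords' holds a list of three coordinates.
df = pd.DataFrame({
    "coords": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]],
    "label": [0, 1, 0, 1],
})

# Turn the per-row lists into one 2D numeric array of shape (n_samples, 3).
X = np.array(df["coords"].tolist())
y = df["label"].to_numpy()

# If the features are instead a 3D array (n_samples, n_points, 3), flatten
# everything after the first axis: X = X.reshape(X.shape[0], -1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)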

Related

ValueError: Expected 2D array, got scalar array instead: array=1.0. Reshape your data either using array.reshape(-1, 1)

While practicing RandomForest classification I got this error in Colab:
ValueError: Expected 2D array, got scalar array instead:
array=60.
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
This is my code (Colab):
#Reshape to a vector for Random Forest / SVM training
n_features = image_features.shape[1]
image_features = np.expand_dims(image_features, axis=0)
X_for_RF = any(np.reshape(image_features, (x_train.shape[0], -1))) #Reshape to #images, features
# #Define the classifier
# from sklearn.ensemble import RandomForestClassifier
# RF_model = RandomForestClassifier(n_estimators = 50, random_state = 42)
#Can also use SVM but RF is faster and may be more accurate.
from sklearn import svm
SVM_model = svm.SVC(decision_function_shape='ovo') #For multiclass classification
SVM_model.fit(X_for_RF, y_train)
#Fit the model on training data
#RF_model.fit(X_for_RF, y_train) #For sklearn no one hot encoding
This is the link to my Notebook:
https://colab.research.google.com/drive/1XCHkZKtLKBsdFAPgxvMA8Rh-36gWoxYU?usp=sharing
This is the link to my Data:
https://drive.google.com/drive/folders/15nnAi3jNx4uj-bAoqUsmvk06OYI0syGR?usp=sharing
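No answer is shown here either, but the any(...) wrapper in the reshape line almost certainly does not belong there: the classifier should be fitted on the reshaped feature array itself, with one row per image and shape (n_samples, n_features). A minimal runnable sketch with synthetic data standing in for the notebook's extracted image features (all shapes are made up for illustration):

import numpy as np
from sklearn import svm

# Synthetic stand-ins for the notebook's extracted features and labels.
n_images, h, w, c = 20, 4, 4, 8                      # 20 images, 4x4 feature maps, 8 channels
image_features = np.random.rand(n_images, h, w, c)   # placeholder for the real features
y_train = np.random.randint(0, 3, size=n_images)     # placeholder labels, 3 classes

# Flatten everything after the first axis so each image becomes one row:
# shape (n_images, h * w * c) == (n_samples, n_features).
X_for_RF = image_features.reshape(n_images, -1)

SVM_model = svm.SVC(decision_function_shape='ovo')   # one-vs-one multiclass
SVM_model.fit(X_for_RF, y_train)

# A single sample passed to predict() must also be 2D, hence reshape(1, -1):
print(SVM_model.predict(X_for_RF[0].reshape(1, -1)))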

ValueError: Expected 2D array, got 1D array instead: array=[-1]

Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train, y_train)
I get an error right here, and when I try reshaping my 1D data to 2D as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so if you could explain a bit it would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method applies to numpy arrays.
Do this instead:
x1_train= x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
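A small self-contained sketch of the same fix, with made-up numbers standing in for the California housing columns; an alternative that avoids reshaping altogether is to select the column with double brackets so it stays a one-column DataFrame:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data in place of the real median_income / median_house_value columns.
df = pd.DataFrame({
    "median_income": [2.5, 3.1, 4.7, 5.2, 1.9, 3.8, 6.0, 2.2],
    "median_house_value": [150, 180, 260, 300, 120, 210, 340, 140],
})
x_train, x_test, y_train, y_test = train_test_split(
    df["median_income"], df["median_house_value"], test_size=0.25, random_state=0)

# Option 1: reshape the Series' underlying array to (n_samples, 1).
model = LinearRegression().fit(x_train.to_numpy().reshape(-1, 1), y_train)
print(model.predict(x_test.to_numpy().reshape(-1, 1)))

# Option 2: select with double brackets up front so the data is already 2D.
x2 = df[["median_income"]]            # DataFrame of shape (n, 1), no reshape needed
model2 = LinearRegression().fit(x2, df["median_house_value"])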

Cosine similarity and SVC using scikit-learn

I am trying to use the cosine similarity kernel for text classification with an SVM, on a raw dataset of 1000 words:
# Libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
# Data
x_train, x_test, y_train, y_test = train_test_split(raw_data[:, 0], raw_data[:, 1], test_size=0.33, random_state=42)
# CountVectorizer
c = CountVectorizer(max_features=1000, analyzer = "char")
X_train = c.fit_transform(x_train).toarray()
X_test = c.transform(x_test).toarray()
# Kernel
cosine_X_tr = cosine_similarity(X_train)
cosine_X_tst = cosine_similarity(X_test)
# SVM
svm_model = SVC(kernel="precomputed")
svm_model.fit(cosine_X_tr, y_train)
y_pred = svm_model.predict(cosine_X_tst)
But that code throws the following error:
ValueError: X has 330 features, but SVC is expecting 670 features as input
I've tried the following, but I don't know whether it is mathematically correct, and I also want to implement other custom kernels that are not built into scikit-learn, such as histogram intersection:
cosine_X_tst = cosine_similarity(X_test, X_train)
So, basically, the main problem lies in the dimensions of the matrix SVC receives. Once CountVectorizer is applied to the train and test datasets, both have 1000 features because of the max_features parameter:
Train dataset of shape (670, 1000)
Test dataset of shape (330, 1000)
But after applying cosine similarity they are converted to square matrices:
Train dataset of shape (670, 670)
Test dataset of shape (330, 330)
When SVC is fitted to the training data it learns 670 features and cannot predict on the test dataset, because that has a different number of features (330). So, how can I solve this problem and still be able to use custom kernels with SVC?
So, how can I solve this problem and still be able to use custom kernels with SVC?
Define a function yourself, and pass that function to the kernel parameter in SVC(), like SVC(kernel=your_custom_function).
Also, you should use the cosine_similarity kernel in your code like this:
svm_model = SVC(kernel=cosine_similarity)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)
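A kernel callable just has to map (X, Y) to the pairwise similarity matrix of shape (len(X), len(Y)). As a sketch of the histogram-intersection kernel mentioned in the question (not part of scikit-learn; this particular implementation is my own and assumes non-negative count features):

import numpy as np
from sklearn.svm import SVC

def histogram_intersection(X, Y):
    # Pairwise histogram intersection: K[i, j] = sum_k min(X[i, k], Y[j, k]).
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

# Toy count data standing in for the vectorized text.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(20, 50)).astype(float)
X_test = rng.integers(0, 5, size=(8, 50)).astype(float)
y_train = rng.integers(0, 3, size=20)

svm_model = SVC(kernel=histogram_intersection)
svm_model.fit(X_train, y_train)
print(svm_model.predict(X_test))

If you prefer to keep kernel="precomputed", fit on cosine_similarity(X_train) and predict on cosine_similarity(X_test, X_train): the matrix passed to predict() must have one row per test sample and one column per training sample, which is exactly what the question's own attempt computes.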

Predict_proba function for Multi-target Classification

I am working on a multi-target (binary) classification problem. There are 11 targets and I am using sklearn's MultiOutputClassifier. I am having difficulty with the predict_proba function. See a snippet of the dataset and the code below:
import pandas as pd
import numpy as npy
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
data = pd.read_csv("123.csv")
# (snippet of the dataset shown in the original post)
target = ['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72']
train, test = train_test_split(data, test_size=0.2)
X_train = train.drop(['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72','FORMULA_NUMBER'],axis=1)
X_test = test.drop(['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72','FORMULA_NUMBER'],axis=1)
Y_train = train[target]
Y_test = test[target]
model = MultiOutputClassifier(GradientBoostingClassifier())
model.fit(X_train, Y_train)
target_probabilities = model.predict_proba(X_test)
print(target_probabilities)
The probabilities output does not seem to be in the correct form. I get 11 arrays of shape 565x2 (565 is the length of my test set). I'd like to save target_probabilities to a csv file, but I get the error ValueError: Expected 1D or 2D array, got 3D array instead. My question is essentially the same as this one -
https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier - but the answer there only explains why the output is a set of arrays.
EDIT: I have simplified the problem.
target_probabilities = npy.array(target_probabilities)
Now target_probabilities is an (11, 565, 2) array. I need to change it to shape (565, 11), where row i is target_probabilities[:, i][:, 1], for i in range(0, 565).
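A sketch of that conversion (assuming, as in the edit above, that column 1 of the last axis holds the probability of the positive class for each target): slice out the positive-class probabilities, transpose so rows are samples, and write the result out with the target names as columns.

import numpy as np
import pandas as pd

target = ['H67BC97', 'H67GC93', 'H67LC63', 'H67WC103', 'H67RC91', 'H67YC73',
          'H67RC92', 'H67GC94', 'H67LC64', 'H67NC60', 'H67YC72']

# Stand-in for np.array(model.predict_proba(X_test)): shape (11, 565, 2).
target_probabilities = np.random.rand(11, 565, 2)

# Keep only the probability of class 1 for each target, then transpose:
# result has shape (565, 11), one row per test sample, one column per target.
positive_probs = target_probabilities[:, :, 1].T

pd.DataFrame(positive_probs, columns=target).to_csv("target_probabilities.csv", index=False)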

Python - SKLearn Fit Array Error

I'm relatively new to using sklearn and python for data analysis and am trying to run some linear regression on a dataset that I loaded from a .csv file.
I have loaded my data into train_test_split without any issues, but when I try to fit my training data I receive the error ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error occurs at model = lm.fit(X_train, y_train).
Since I'm new to these packages, I'm trying to determine whether this is the result of not converting my imported CSV to a pandas DataFrame before running the regression, or whether it is something else.
My CSV is in the format of:
Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29
Here is how I set up the regression:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
organic = pd.read_csv("linear-regression.csv")
organic.columns
# Index(['Month', 'Date', 'Day of Week', 'Growth', 'Sunlight', 'Plants'], dtype='object')
# Set the depedent (Growth) and independent (Sunlight)
y = organic['Growth']
X = organic['Sunlight']
# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
# (192,) (49,)
# (192,) (49,)
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
# Error pointing to an array with values from Sunlight [611, 507, 994, ...]
You just need to adjust the last lines to
lm = linear_model.LinearRegression()
model = lm.fit(X_train.values.reshape(-1,1), y_train)
and the model will fit. The reason is that the linear model from sklearn expects
X : numpy array or sparse matrix of shape [n_samples, n_features]
so the training data must be 2D, of shape (n_samples, 1) in this single-feature case.
You are only using one feature, so the error message tells you what to do:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature.
The data always has to be 2D in scikit-learn.
(Also watch out for typos in the column name, e.g. 'Sunglight' instead of 'Sunlight'.)
Once you load the data into train_test_split(X, y, test_size=0.2), it returns pandas Series X_train and X_test with dimensions (192,) and (49,). As mentioned in the previous answers, sklearn expects matrices of shape [n_samples, n_features] as the X_train and X_test data, so you can simply convert the pandas Series X_train and X_test to pandas DataFrames to change their dimensions to (192, 1) and (49, 1).
lm = linear_model.LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)
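A related alternative (my own suggestion, continuing from the question's setup rather than taken from the answers above) is to keep X two-dimensional from the start by selecting the feature with double brackets, so no conversion is needed later:

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Assumes `organic` has been loaded from linear-regression.csv as in the question.
X = organic[['Sunlight']]   # double brackets -> one-column DataFrame, shape (n, 1)
y = organic['Growth']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = linear_model.LinearRegression().fit(X_train, y_train)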
