I am working with TensorFlow. I have a training dataset X_train and I want to create polynomial features from it, specifically with PolynomialFeatures from sklearn.preprocessing, but I'm getting this error:
ValueError: setting an array element with a sequence.
The code I am trying to run is:
import tensorflow as tf
from sklearn.preprocessing import PolynomialFeatures

n_features = 3
n_samples = 1000
X_train = tf.placeholder(tf.float32, shape=(n_samples, n_features))
poly = PolynomialFeatures(degree=3)
poly.fit_transform(X_train)
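For context, PolynomialFeatures needs concrete numeric values to work on, so a minimal sketch of the same call on a plain NumPy array (hypothetical random data, not the actual dataset) runs without the error:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_samples, n_features = 1000, 3
X_np = np.random.rand(n_samples, n_features).astype(np.float32)  # concrete values, not a symbolic placeholder
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X_np)
print(X_poly.shape)  # (1000, 20): all monomials of degree <= 3 over 3 features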
While practicing RandomForest classification I got this error in Colab:
ValueError: Expected 2D array, got scalar array instead:
array=60.
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
This is my code (Colab):
#Reshape to a vector for Random Forest / SVM training
n_features = image_features.shape[1]
image_features = np.expand_dims(image_features, axis=0)
X_for_RF = np.reshape(image_features, (x_train.shape[0], -1))  #Reshape to (n_images, n_features)
# #Define the classifier
# from sklearn.ensemble import RandomForestClassifier
# RF_model = RandomForestClassifier(n_estimators = 50, random_state = 42)
#Can also use SVM but RF is faster and may be more accurate.
from sklearn import svm
SVM_model = svm.SVC(decision_function_shape='ovo') #For multiclass classification
SVM_model.fit(X_for_RF, y_train)
#Fit the model on training data
#RF_model.fit(X_for_RF, y_train) #For sklearn no one hot encoding
This is the link to my Notebook:
https://colab.research.google.com/drive/1XCHkZKtLKBsdFAPgxvMA8Rh-36gWoxYU?usp=sharing
This is the link to my Data:
https://drive.google.com/drive/folders/15nnAi3jNx4uj-bAoqUsmvk06OYI0syGR?usp=sharing
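As an aside, the reshape hint in that traceback distinguishes two cases; a minimal illustration with a hypothetical 1-D array:
import numpy as np

a = np.array([60.0, 61.0, 62.0])
print(a.reshape(-1, 1).shape)  # (3, 1): three samples of a single feature
print(a.reshape(1, -1).shape)  # (1, 3): one sample with three features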
In my project I am trying to use sklearn classifiers, but I can't feed my data into a model. Each cell of the data holds a list of 3 coordinates, and the error is ValueError: setting an array element with a sequence. This is my code:
dataset1 = pd.read_csv(...)
dataset2 = pd.read_csv(...)
X = dataset1.iloc[:178, 2:35]
y = dataset2.iloc[:, 2:35]
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=32)
classifier.fit(X_train, y_train)
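Since sklearn expects a 2-D numeric matrix, here is a hedged sketch (using a hypothetical frame in which every cell holds an [x, y, z] list; the column names are made up) of flattening such columns before fitting:
import numpy as np
import pandas as pd

df = pd.DataFrame({"p1": [[1, 2, 3], [4, 5, 6]],
                   "p2": [[7, 8, 9], [0, 1, 2]]})  # each cell is a 3-coordinate list
flat = np.hstack([np.vstack(df[c].to_numpy()) for c in df.columns])
print(flat.shape)  # (2, 6): a plain 2-D array that classifiers can consume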
I'm facing a problem I can't solve: I'm trying to create an LSTM model with Keras, but I don't understand what the input data format should be.
My data train and my data test look like this:
date/value/value/value/value/value_i_want_to_predict
I've seen some people doing this:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

X_train = []
y_train = []
for i in range(60, len(training_set_scaled)):
    X_train.append(training_set_scaled[i-60: i, 0])  # 60-step lookback window
    y_train.append(training_set_scaled[i, 0])        # value at the next step
X_train, y_train = np.array(X_train), np.array(y_train)
But if I do that, how do I make predictions without modifying the test data set? I have a hard time understanding why we do this. Moreover, what I would like is to use the other values to predict the target in the last column. With this method I feel like I have to change the format of the test data, and it's important that I can test the model on different test data without having to change them.
Can someone help me?
EDIT
scaler.fit(df_train_x)
X_train = scaler.fit_transform(df_train_x)
X_test = scaler.transform(df_test_x)
y_train = np.array(df_train_y)
y_train = np.insert(y_train, 0, 0)
y_train = np.delete(y_train, -1)
The shape of the data is: (2420, 7)
That's what I did, but the shape still remained 2D, so I used:
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
And the input shape of the first layer is:
model.add(LSTM(150, activation='relu', return_sequences=True,input_shape=(2419, 7)))
But when I fit the generator to the model, I get:
ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (1, 1)
I really don't understand it.
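For reference, here is a minimal sketch (with hypothetical random data standing in for the real features, and n_input = 60) of the batch shapes TimeseriesGenerator actually yields, which is what input_shape has to match:
import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

X_train = np.random.rand(2420, 7)  # stand-in for the scaled features
y_train = np.random.rand(2420)     # stand-in for the targets
n_input = 60
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
batch_x, batch_y = generator[0]
print(batch_x.shape)  # (32, 60, 7) -> input_shape should be (n_input, 7), not (2419, 7)
print(batch_y.shape)  # (32,)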
I'm not sure I fully understand your question, but I will try my best.
I think the code you provided is problem specific, meaning it may not be suitable for your implementation.
For an LSTM (and for pretty much any neural network) you always want to scale your data before feeding it to the model. This helps avoid having completely different value ranges across your features. The MinMaxScaler scales your features to the range provided. For an explanation of why scaling is needed, you can have a look at this article.
Usually, you want to first split your dataset into training and testing sets, using for example the train_test_split function of sklearn, and then scale your features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.drop("feature_I_want_to_predict", axis=1)
y = data["feature_I_want_to_predict"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
That way, X_train represents your training data and y_train the corresponding labels (and similarly for the test data).
Here I used StandardScaler instead of MinMaxScaler. The standard scaler subtracts the mean of the feature and then divides by the standard deviation.
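A tiny numeric check of that formula, on a hypothetical one-feature column:
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0]])
print(StandardScaler().fit_transform(x).ravel())  # [-1.2247  0.  1.2247]
print(((x - x.mean()) / x.std()).ravel())         # same values: (x - mean) / std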
I am getting this error
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
while executing this code
# SVR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
# Load dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting the SVR to the data set
regressor = SVR(kernel = 'rbf', gamma = 'auto')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
You need to understand how SVM works. Your training data is a matrix of shape (n_samples, n_features). That means your SVM operates in a feature space of n_features dimensions, so it cannot predict a value from a scalar input: .predict always expects a 2-D array of shape (n_samples, n_features), even when n_features is 1. If your data set has 5 feature columns, you can predict values for an arbitrary row vector of 5 columns. See the example below.
import numpy as np
from sklearn.svm import SVR

# Data: 200 instances of 5 features each
X = np.random.randint(1, 100, size=(200, 5))
y = np.random.randint(0, 2, size=200)
reg = SVR()
reg.fit(X, y)
X_new = np.array([[0, 1, 2, 3, 4]])  # input to .predict must be 2-dimensional
reg.predict(X_new)
The same rule applies to the linear and polynomial regressors (lin_reg, lin_reg_2 and poly_reg are assumed to be already fitted):
# Predicting a new result with Linear Regression
X_test = np.array([[6.5]])  # 2-D: one sample, one feature
print(lin_reg.predict(X_test))
# Predicting a new result with Polynomial Regression
print(lin_reg_2.predict(poly_reg.fit_transform(X_test)))
I am creating a LogisticRegression classifier with the following code:
regressor = LogisticRegression()
regressor.fit(x_train, y_train)
Both x_train and y_train shapes are
<class 'tuple'>: (32383,)
x_train contains values around range [0..1], and y_train contains only 0s and 1s.
Unfortunately, fit fails with error
ValueError: Found input variables with inconsistent numbers of samples: [1, 32383]
Adding transpose to arguments doesn't help.
To continue the solution that I proposed in my comment: the problem is the shape of x_train, so we need to reshape it.
From the documentation:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
Example using scikit-learn and numpy:
from sklearn.linear_model import LogisticRegression
import numpy as np
# create tuple data like in the question
x_train = tuple(range(32383))
x_train = np.asarray(x_train)
# same for y_train
y_train = tuple(range(32383))
y_train = np.asarray(y_train)
# reshape x_train to (n_samples, n_features)
x_train = x_train.reshape(32383, 1)
# check that the labels keep shape (32383,)
y_train.shape
#create the model
lg = LogisticRegression()
#Fit the model
lg.fit(x_train, y_train)
This should work fine.
Hope it helps.
I guess a little reshaping is necessary. I tried it like this:
from sklearn.linear_model import LogisticRegression
import numpy as np
# x_train = np.random.randn(10, 1)
x_train = np.asarray(x_train).reshape(32383, 1)
con = np.ones_like(x_train)  # constant column for the statsmodels fit below
x_train = np.concatenate((con, x_train), axis=1)
# y = np.random.randn(10, 1)
# y_train = np.where(y < 0.5, 1, 0)
y_train = np.asarray(y_train).ravel()  # sklearn expects a 1-D label array
regressor = LogisticRegression()
regressor.fit(x_train, y_train)
The comments are just what I did to create some data. Note that sklearn's LogisticRegression fits an intercept by itself (fit_intercept=True by default), but statsmodels does not, so the constant column matters for the Logit fit below. Statsmodels can also be helpful if you are interested in statistical tests and a pretty print of the results:
from statsmodels.api import Logit
logit = Logit(y_train, x_train)
fit = logit.fit()
fit.summary()
That will give you a little more statistical intel without much effort.