Number of plants    Total kg
900                 7.565
1440                17.808
2340                25.373
900                 10.340
900                 10.610
I have a data set and I am trying to predict the "Total kg" based on the "Number of plants" using KNN.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
df = pd.read_csv(r'final2.csv')
X = df.drop('Total kg', axis=1)
Y = df[["Total kg"]]
X=X.values
Y=Y.values
knn = KNeighborsClassifier(n_neighbors=1)
prediction = knn.fit(X,Y)
I get the following error: ValueError: Unknown label type: 'continuous'
Is there a way to use KNN for continuous variables?
You should not use a classifier for a regression task. Try:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=1)
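As a minimal sketch of the regressor on the five rows posted in the question (the array names and the 1500-plant query are only illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Data copied from the table in the question.
X = np.array([[900], [1440], [2340], [900], [900]])    # Number of plants
y = np.array([7.565, 17.808, 25.373, 10.340, 10.610])  # Total kg

# A regressor accepts continuous targets, so no "Unknown label type" error.
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X, y)

# 1500 plants is closest to the 1440-plant row, so that row's yield is returned.
prediction = knn.predict([[1500]])
print(prediction)
```

With n_neighbors=1 the prediction is simply the target of the single closest row; a larger n_neighbors averages the targets of several rows instead.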
I'm doing a sentiment analysis project on a Twitter dataset. I used TF-IDF feature extraction and a logistic regression model for classification. So far I've trained the model with the following:
def get_tfidf_features(train_fit, ngrams=(1,1)):
    vector = TfidfVectorizer(ngram_range=ngrams, sublinear_tf=True)
    vector.fit(train_fit)
    return vector
tf_vector = get_tfidf_features(df['text'])
X = tf_vector.transform(df['text'])
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.01, random_state = 42)
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
This logistic regression model was trained on a dataset of about 1.5 million tweets. I have a set of about 18,000 tweets and I want to use this model to predict the sentiment scores for the tweets in this new dataset. I'm at a loss as to how to actually apply this trained model to new data. The head of this new dataframe df_chi looks like this:
[screenshot of df_chi.head()]
It has shape (18393, 7). I want to take the trained model I already have, apply it to the text column, and create a new sentiment column with those predicted scores in the df_chi dataframe. (Note: the image doesn't show cleaned text, but I'll do that.)
I'm a ML noob and I've never taken a trained model and applied it to new data. My confusion starts with extracting features from the df_chi text with TF-IDF. I attempted to do this (total guess):
tf_vector = get_tfidf_features(df_chi['text'])
X = tf_vector.transform(df_chi['text'])
df_chi['sentiment'] = LR_model.predict(X)
which gives the following ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-188-0cf1a4f34c8b> in <module>
1 tf_vector = get_tfidf_features(df_chi['text'])
2 X = tf_vector.transform(df_chi['text'])
----> 3 df_chi['sentiment'] = LR_model.predict(X)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
291 Predicted class label per sample.
292 """
--> 293 scores = self.decision_function(X)
294 if len(scores.shape) == 1:
295 indices = (scores > 0).astype(np.int)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
271 if X.shape[1] != n_features:
272 raise ValueError("X has %d features per sample; expecting %d"
--> 273 % (X.shape[1], n_features))
274
275 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 22806 features per sample; expecting 265054
Pretty sure my whole approach to applying the trained model on the new data is incorrect. What's the right way to do this?
Noodled around with this and came up with the following solution:
tfidf = TfidfVectorizer()
X_chi = tfidf.fit_transform(df_chi['text'])
X1 = pd.DataFrame.sparse.from_spmatrix(X)
X_chi1 = pd.DataFrame.sparse.from_spmatrix(X_chi)
not_existing_cols = [c for c in X1.columns.tolist() if c not in X_chi1]
X_chi1 = X_chi1.reindex(X_chi1.columns.tolist() + not_existing_cols, axis=1)
#X_chi.fillna(0, inplace=True)
X_chi1 = X_chi1[X1.columns.tolist()]
a = LR_model.predict(X_chi1)
df_chi['sentiment'] = a
Solution inspired by the question "Logistic regression: X has 667 features per sample; expecting 74869".
Looks a little clumsy, though. If it works it works, I guess. Though I suspect there might be a better way to do this, no?
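One commonly used alternative, sketched here under the assumption that the vectorizer fitted on the training text is still available, is to call only transform() on the new text instead of fitting a second vectorizer. The toy texts and labels below are placeholders, not the real tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for df['text'] / df['sentiment'] and df_chi['text'].
train_texts = ["i love this", "great day", "i hate this", "awful day"]
train_labels = [1, 1, 0, 0]
new_texts = ["love this great city", "hate the awful traffic"]

# Fit the vectorizer ONCE, on the training text only.
tf_vector = TfidfVectorizer(ngram_range=(1, 1), sublinear_tf=True)
X_train = tf_vector.fit_transform(train_texts)

LR_model = LogisticRegression(solver="lbfgs")
LR_model.fit(X_train, train_labels)

# Reuse the same fitted vectorizer: transform only, no second fit.
# Words unseen during training are ignored, so the column count matches.
X_new = tf_vector.transform(new_texts)
new_predictions = LR_model.predict(X_new)
print(X_train.shape[1], X_new.shape[1], new_predictions)
```

Because the vocabulary is fixed at fit time, each column of X_new refers to the same word as the corresponding column of X_train, which reindexing two independently fitted matrices by position does not guarantee.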
I'm trying to make predictions with logistic regression and to test accuracy with Python and the sklearn library. I'm using data that I downloaded from here:
http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength
It's an Excel file. I wrote some code, but I always get the same error:
ValueError: Unknown label type: 'continuous'
I have used the same logic when I made linear regression, and it works for linear regression.
This is the code:
import numpy as np
import pandas as pd
import xlrd
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Reading data from excel
data = pd.read_excel("DataSet.xls").round(2)
data_size = data.shape[0]
#print("Number of data:",data_size,"\n",data.head())
my_data = data[(data["Superpl"] == 0) & (data["FlyAsh"] == 0) & (data["BlastFurSlag"] == 0)].drop(columns=["Superpl","FlyAsh","BlastFurSlag"])
my_data = my_data[my_data["Days"]<=28]
my_data_size = my_data.shape[0]
#print("Size of dataset for 28 days or less:", my_data_size, "\n", my_data.head())
def logistic_regression(data_input, cement, water,
                        coarse_aggr, fine_aggr, days):
    variable_list = []
    result_list = []
    for column in data_input:
        variable_list.append(column)
        result_list.append(column)
    variable_list = variable_list[:-1]
    result_list = result_list[-1]
    variables = data_input[variable_list]
    results = data_input[result_list]
    #accuracy of prediction (splitting dataframe into train and test)
    var_train, var_test, res_train, res_test = train_test_split(variables, results, test_size = 0.3, random_state = 42)
    #making logistic model and fitting the data into logistic model
    log_regression = linear_model.LogisticRegression()
    model = log_regression.fit(var_train, res_train)
    input_values = [cement, water, coarse_aggr, fine_aggr, days]
    #predicting the outcome based on the input_values
    predicted_strength = log_regression.predict([input_values]) #adding values for prediction
    predicted_strength = round(predicted_strength[0], 2)
    #calculating accuracy score
    score = log_regression.score(var_test, res_test)
    score = round(score*100, 2)
    prediction_info = "\nPrediction of future strength: " + str(predicted_strength) + " MPa\n"
    accuracy_info = "Accuracy of prediction: " + str(score) + "%\n"
    full_info = prediction_info + accuracy_info
    return full_info
print(logistic_regression(my_data, 376.0, 214.6, 1003.5, 762.4, 3)) #true value after 3 days: 16.28 MPa
Although you don't provide details of your data, judging from the error and the comment in the last line of your code:
#true value after 3 days: 16.28 MPa
I conclude that you are in a regression (i.e. numeric prediction) setting. Linear regression is an appropriate model for this task, but logistic regression is not: logistic regression is for classification problems, so it expects binary (or categorical) data as target variables, not continuous values. Hence the error.
In short, you are trying to apply a model that is inappropriate for your problem.
UPDATE (after link to the data): Indeed, reading closely the dataset description, you'll see (emphasis added):
The concrete compressive strength is the regression problem
while from scikit-learn User's Guide for logistic regression (again, emphasis added):
Logistic regression, despite its name, is a linear model for classification rather than regression.
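To illustrate the difference, here is a minimal sketch on synthetic data (the features and coefficients are made up and are not the concrete dataset): the regressor fits a continuous target, while the classifier rejects it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in data: a continuous target that depends linearly on X.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0  # continuous values, e.g. MPa

# A regressor handles the continuous target; score() returns R^2, not accuracy.
lin = LinearRegression().fit(X, y)
r2 = lin.score(X, y)

# A classifier rejects the same target with a ValueError.
try:
    LogisticRegression().fit(X, y)
    classifier_error = None
except ValueError as exc:
    classifier_error = str(exc)

print(round(r2, 3), classifier_error)
```

Note that score() on a regressor is the coefficient of determination (R^2), so reporting it as an "accuracy" percentage can be misleading.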
I have data from 2012-2014 with some missing months in 2014. I would like to predict those months using a linear regression model trained on the 2012/2013 data.
The 2014 data is missing June-August and has a blank string as the value there, so I clean it up using the following code. I also cut 20 rows from 2012 and 2013 so that they have the same shape:
data2014NaN=data2014['mob'].replace(' ', np.nan)
data2014CleanNaN = data2014NaN[data2014NaN.notnull()]
data2012[0:300]
data2013[0:300]
Then I train a linear regression model using both years as a training set.
X = pd.concat([data2012[0:300], data2013[0:300]], axis=1, join='inner')
y = data2014CleanNaN.values
y = y.reshape(-1,1)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    random_state=4)
lm = LinearRegression()
lm.fit(X_train,y_train)
score = lm.score(X_test,y_test)
print("The prediction score on the test data is {:.2f}%".format(score*100))
However, the score I got is an abysmal 4.65%, and I'm not sure how to approach this problem. I assume I did something wrong when I cut down the data for 2012 and 2013.
Here I attached the data (this is just dummy data):
2014:
date value
29/01/2014 10
30/01/2014 20
31/01/2014 15
1/02/2014 ' '
2012:
date value
29/01/2014 15
30/01/2014 18
31/01/2014 19
1/02/2014 50
I'm only using the value data, not sure if I'm in the right direction
Best Regards
It seems that your R^2 is not so good.
Cubic Spline Interpolation might perform better than linear regression in this case.
In Python, this API is available through:
import scipy.interpolate as st
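A minimal sketch of cubic spline interpolation with scipy.interpolate.CubicSpline follows. The day indices and values are hypothetical dummy data, with days 4 to 6 treated as the missing stretch:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical daily series: day index and observed value, days 4-6 missing.
known_days = np.array([0, 1, 2, 3, 7, 8, 9])
known_values = np.array([10.0, 20.0, 15.0, 18.0, 19.0, 50.0, 42.0])

# Fit a cubic spline through the observed points (x must be strictly increasing).
spline = CubicSpline(known_days, known_values)

# Evaluate the spline on the missing days to fill the gap.
missing_days = np.array([4, 5, 6])
filled_values = spline(missing_days)
print(filled_values)
```

A spline only bridges the gap smoothly between the neighbouring observations, so over a gap of several months the filled values should be treated with caution.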
Also, if x is a timestamp and y is a value, you can try time series analysis such as AR or ARMA, or neural network methods such as RNN and LSTM.
An LSTM sample built with Keras:
from keras.models import Sequential
from keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(units=5, activation='tanh', input_shape=dataX[0].shape, return_sequences=False))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mae', metrics=['mse'])
# `times` is the number of training epochs
model.fit(dataX, dataY, epochs=times, batch_size=1, verbose=2, shuffle=False)
y_pred = model.predict(dataX)
I want to use logistic regression to predict and plot a curve from an Excel dataset and get its slope coefficients. However, when I run the code (see below), I get the error "ValueError: Unknown label type: 'continuous'".
I read in similar questions that the y values should be of type 'int', but I don't want to convert them because the y numbers are between 1.66 and 0.44...
Is there a solution for this kind of case, or should I try another regression model?
Thanks a lot in advance
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import seaborn as sns
from sklearn.linear_model import LogisticRegression
df = pd.read_excel('Fatigue2.xlsx',sheet_name='Sheet4')
X = df[['Strain1', 'Temperature1']]
y = df['Cycles1']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
#poly = PolynomialFeatures(degree=2)
#X_ = poly.fit_transform(X_train)
LR = LogisticRegression()
LR.fit(X_train,y_train)
g = sns.lmplot(x='Cycles1', y='Strain1', hue = 'Temperature1', data=df, fit_reg= False)
g.set(xscale='log', yscale ='log')
g.set_axis_labels("Cycles (log N)", "Strain")
print ('Coefficients : ', LR.coef_, 'Intercept :', LR.intercept_)
About the data, I have 97 values in total in an Excel sheet:
Cycles1 Strain1 Temperature1
27631 1.66 650
... ... 650
6496220 0.44 650
LogisticRegression from sklearn is a classifier, i.e. it expects that the response variable is categorical.
Your task is a regression. Moreover, the plot does not seem to have the asymptotic behavior of a logit on the right. You may get better results using a polynomial regression, as described here.
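A polynomial regression can be sketched as a pipeline of PolynomialFeatures and LinearRegression. The data below is synthetic (a made-up quadratic), not the fatigue measurements, and the degree is an assumption to be tuned:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the fatigue data: one feature, quadratic response.
x = np.linspace(0.4, 1.7, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 3.0 * x.ravel() ** 2

# Expand the feature to [1, x, x^2], then fit an ordinary linear model on it.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

# Coefficients and intercept live on the final pipeline step.
lin = poly_model.named_steps["linearregression"]
print("Coefficients:", lin.coef_, "Intercept:", lin.intercept_)
prediction = poly_model.predict([[1.0]])
```

Since the plot in the question uses log-log axes, fitting on log-transformed values may also be worth trying before raising the polynomial degree.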
Based on the docs for type_of_target(y):
Determine the type of data indicated by the target.
Note that this type is the most specific type that can be inferred.
For example:
    binary is more specific but compatible with multiclass.
    multiclass of integers is more specific but compatible with continuous.
    multilabel-indicator is more specific but compatible with multiclass-multioutput.
Parameters
y : array-like
Returns
target_type : string
One of:
'continuous': y is an array-like of floats that are not all integers, and is 1d or a column vector.
...
Change y to an integer type with y = y.astype(int). Note that this truncates any non-integer values.
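A short sketch of what the check reports before and after the cast, using made-up values in the 0.44 to 1.66 range mentioned in the question:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Non-integer floats are reported as 'continuous', which classifiers reject.
y = np.array([1.66, 1.20, 0.44])
print(type_of_target(y))  # continuous

# astype(int) truncates toward zero, so the cast changes the values.
y_int = y.astype(int)
print(y_int)  # [1 1 0]
print(type_of_target(y_int))  # binary
```

The cast silences the error, but it also collapses distinct measurements into a few integer classes, so it only makes sense if a classification into those classes is actually what is wanted.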
My fellow Team,
Having an issue
----------------------
   Avg. Session Length  Time on App  Time on Website  Length of Membership  Yearly Amount Spent
0            34.497268    12.655651        39.577668              4.082621           587.951054
1            31.926272    11.109461        37.268959              2.664034           392.204933
2            33.000915    11.330278        37.110597              4.104543           487.547505
3            34.305557    13.717514        36.721283              3.120179           581.852344
4            33.330673    12.795189        37.536653              4.446308           599.406092
5            33.871038    12.026925        34.476878              5.493507           637.102448
6            32.021596    11.366348        36.683776              4.685017           521.572175
Want to apply KNN
X = df[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
y = df['Yearly Amount Spent']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
ValueError: Unknown label type: 'continuous'
The values in Yearly Amount Spent column are real numbers, so they cannot serve as labels for a classification problem (see here):
When doing classification in scikit-learn, y is a vector of integers or strings.
Hence you get the error. If you want to build a classification model, you need to decide how you transform them into a finite set of labels.
Note that if you just want to avoid the error, you could do
import numpy as np
y = np.asarray(df['Yearly Amount Spent'], dtype="|S6")
This will transform the values in y into strings of the required format. Yet every label will appear in only one sample, so you cannot really build a meaningful model with such a set of labels.
I think you are actually trying to do a regression rather than a classification, since your code looks like you want to predict the yearly amount spent as a number. In this case, use
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=1)
instead. If you really have a classification task, for example you want to classify into classes like ('yearly amount spent is low', 'yearly amount spent is high', ...), you should discretize the labels and convert them into strings or integers (as explained by @Miriam Farber), according to thresholds that you need to set manually in this case.
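A minimal sketch of the discretization route, using the seven rows posted in the question and two of the four features for brevity. The thresholds 500 and 600 and the class names are hypothetical and would need to be chosen for the real problem:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# First rows of the table in the question (two of the four features).
df = pd.DataFrame({
    "Time on App": [12.655651, 11.109461, 11.330278, 13.717514,
                    12.795189, 12.026925, 11.366348],
    "Length of Membership": [4.082621, 2.664034, 4.104543, 3.120179,
                             4.446308, 5.493507, 4.685017],
    "Yearly Amount Spent": [587.951054, 392.204933, 487.547505, 581.852344,
                            599.406092, 637.102448, 521.572175],
})

# Hypothetical thresholds: up to 500 is "low", up to 600 "medium", above "high".
bins = [0, 500, 600, float("inf")]
y_class = pd.cut(df["Yearly Amount Spent"], bins=bins,
                 labels=["low", "medium", "high"]).astype(str)

X = df[["Time on App", "Length of Membership"]]
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y_class)  # discrete string labels are accepted
print(y_class.tolist())
```

pd.cut uses right-closed intervals by default, so a value of exactly 500 would fall into the "low" class here.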