I have a training data CSV with three columns (two for data and a third for targets) and I successfully predicted the target column for my test CSV. The problem is I need to inverse transform the results back to strings for further analysis. Below is the code and error.
from sklearn import datasets
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
df_train = pd.read_csv('/Users/justinchristensen/Documents/Python_Education/SKLearn/Path_Training_Data.csv')
df_test = pd.read_csv('/Users/justinchristensen/Documents/Python_Education/SKLearn/Path_Test_Data.csv')
#Separate columns in training data set
x_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:,-1:]
#Separate columns in test data set
x_test = df_test.iloc[:,:-1]
#Initiate classifier
clf = svm.SVC(gamma=0.001, C=100)
le = LabelEncoder()
#Transform strings into integers
x_train_encoded = x_train.apply(LabelEncoder().fit_transform)
y_train_encoded = y_train.apply(LabelEncoder().fit_transform)
x_test_encoded = x_test.apply(LabelEncoder().fit_transform)
#Fit the model into the classifier
clf.fit(x_train_encoded,y_train_encoded)
#Predict test values
y_pred = clf.predict(x_test_encoded)
The error
NotFittedError
Traceback (most recent call last)
<ipython-input-38-09840b0071d5> in <module>()
1
----> 2 y_pred_inverse = le.inverse_transform(y_pred)
~/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in inverse_transform(self, y)
146 y : numpy array of shape [n_samples]
147 """
--> 148 check_is_fitted(self, 'classes_')
149
150 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
766
767 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768 raise NotFittedError(msg % {'name': type(estimator).__name__})
769
770
NotFittedError: This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
You need to use the same LabelEncoder object that you used to transform your targets in order to get them back. Each time you used the LabelEncoder, you instantiated a new object, so the le you call inverse_transform on was never fitted. Use the same object.
Change the following line so that it uses the le instance you already created (le is an object, not a class, so it is not called with parentheses):
y_train_encoded = y_train.apply(le.fit_transform)
Then use the same object to reverse the transformation. You can check the first example here in the documentation for reference as well.
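As a minimal sketch of the idea (the toy labels and the y_pred array below are made up for illustration and are not from the question's CSV files), fitting one encoder and reusing it for the inverse transform might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the target column of the training CSV
y_train = pd.DataFrame({"target": ["cat", "dog", "cat", "bird"]})

# Fit ONE encoder on the targets and keep a reference to it
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train["target"])

# Hypothetical stand-in for the output of clf.predict(x_test_encoded)
y_pred = np.array([2, 0, 1])

# The same fitted object maps the integers back to the original strings
y_pred_labels = le.inverse_transform(y_pred)
print(y_pred_labels)
```

Because the encoder stores its mapping in classes_ when it is fitted, a fresh LabelEncoder() has nothing to invert, which is what the NotFittedError is reporting.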
Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to convert my 1-D data to a 2-D array as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so it would be really helpful if you could explain a bit.
x1_train and x1_test are pandas Series objects, whereas reshape() is a method of NumPy arrays.
Do this instead:
x1_train= x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
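To illustrate, here is a small sketch on made-up numbers (the toy DataFrame is an assumption and only borrows the column names from the question). It shows two ways to get the 2-D input that LinearRegression expects from a single column:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for the California housing sheet
df = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0],
    "median_house_value": [100.0, 200.0, 300.0, 400.0],
})

x1 = df["median_income"]             # a Series: 1-D, shape (4,)

# Option 1: convert to NumPy, then reshape to one column
X_2d_numpy = x1.to_numpy().reshape(-1, 1)

# Option 2: select with double brackets to keep a one-column DataFrame
X_2d_frame = df[["median_income"]]

print(x1.shape, X_2d_numpy.shape, X_2d_frame.shape)

model = LinearRegression().fit(X_2d_numpy, df["median_house_value"])
print(model.coef_, model.intercept_)
```

scikit-learn treats rows as samples and columns as features, so a single feature still has to arrive as a table with one column rather than as a flat sequence.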
I'm trying to train and test my data using the MLPRegressor, but I always end up with the same error on the line "classifier.fit(xtrain, ttrain)". I'm relatively new to Python, so this has had me stuck for a while now.
I've tried using ravel() and reshaping the array, but neither has worked.
Does anyone have any good advice?
data = pd.read_excel("data.xlsx")
print(data)
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
inputs = data.values[:,:8].astype(float)
#Normalize the inputs
scaler = MinMaxScaler()
scaled = scaler.fit_transform(inputs)
print(inputs.ptp(axis=0))
print(scaled.ptp(axis=0))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel("ENB2012_data.xlsx")
x = df.iloc[:,1:2].values
y = df.iloc[:,2].values
df
from sklearn.neural_network import MLPRegressor
regressor = MLPRegressor(max_iter=5000)
regressor.fit(inputs, scaled)
outputs = regressor.predict(inputs)
print("MLP Regressor: \n", outputs)
from sklearn.metrics import mean_absolute_error
regressor = MLPRegressor(max_iter=5000)
regressor.fit(inputs, scaled)
outputs = regressor.predict(inputs)
print(mean_absolute_error(outputs, scaled))
from numpy.lib.shape_base import split
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.base import ClassifierMixin
#split the data to training and testing
xtrain, xtest, ttrain, ttest = train_test_split(outputs, scaled)
#train the classifiers
classifier = SVC(gamma = "auto")
classifier.fit(xtrain, ttrain)
ytrain = classifier(xtrain)
ytest = classifier.predict(xtest)
Edit:
ValueError Traceback (most recent call last)
<ipython-input-229-62670b265a4b> in <module>()
4 #train the classifiers
5 classifier = SVC(gamma = "auto")
----> 6 classifier.fit(xtrain, ttrain)
7 ytrain = classifier(xtrain)
8 ytest = classifier.predict(xtest)
4 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
1037
1038 raise ValueError(
1039 "y should be a 1d array, got an array of shape {} instead.".format(shape)
1040 )
1041
ValueError: y should be a 1d array, got an array of shape (576, 8) instead.
While we need more context to answer your question, you should start by using np.shape(inputs) to check the shape of your inputs variable.
You should check after this line:
inputs = data.values[:,:8].astype(float)
# Check with np.shape
print(np.shape(inputs))
# Expected output:
...we need to know this
You want a 1-dimensional array for the target (the second argument to fit). Here are a few examples:
# This has shape (3,) (1-dimensional).
np.shape([0, 1, 2])
# This has shape (3, 3) (2-dimensional).
np.shape([[0, 1, 2],[3, 4, 5],[6, 7, 8]])
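As a rough sketch of what the traceback is complaining about, the snippet below uses a zero-filled array as a hypothetical stand-in for the question's scaled variable, with the shape taken from the error message. Which column is the real target is not known from the post, so column 0 is purely an assumption:

```python
import numpy as np

# Hypothetical stand-in with the shape reported in the error: (576, 8)
scaled = np.zeros((576, 8))
print(np.shape(scaled))        # (576, 8): 2-D, so fit() rejects it as y

# ravel() flattens everything, so the length no longer matches the 576 samples
flattened = scaled.ravel()
print(np.shape(flattened))     # (4608,)

# Selecting a single column gives one value per sample
# (column 0 is an assumption; pick whichever column is your target)
y_1d = scaled[:, 0]
print(np.shape(y_1d))          # (576,)
```

This may also explain why ravel() did not help: it produces a 1-D array, but with eight times as many entries as there are rows in xtrain. Separately, SVC is a classifier and expects discrete class labels, so continuous scaled values would need a regressor (or to be converted to classes) even once the shape is right.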
I am trying to do random oversampling on a small dataset for linear regression. However, it seems the scikit-learn sampling API doesn't work with float values as its target variable. Is there any way to solve this?
This is a sample of my y_train values, which are log transformed.
3.688879
3.828641
3.401197
3.091042
4.624973
from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_sample(X_train,y_train)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
1 from imblearn.over_sampling import RandomOverSampler
~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
73 The corresponding label of `X_resampled`.
74 """
---> 75 check_classification_targets(y)
76 arrays_transformer = ArraysTransformer(X, y)
77 X, y, binarize_y = self._check_X_y(X, y)
~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174
ValueError: Unknown label type: 'continuous'
Re-sampling strategies are not meant for regression problems. Hence, RandomOverSampler will not accept float targets. There are approaches to re-sample data with continuous targets, though. One example is reg_resampler, which can be used as follows:
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression
from reg_resampler import resampler
import numpy as np
# Create some dummy data for demonstration
X, y = make_regression(n_features=10)
df = np.append(X, y.reshape(100, 1), axis=1)
# Initialize the resampler object and generate pseudo-classes
rs = resampler()
y_classes = rs.fit(df, target=10)
# Now resample
X_res, y_res = rs.resample(
sampler_obj=RandomOverSampler(random_state=27),
trainX=df,
trainY=y_classes
)
The resampler object will generate pseudo-classes based on your target values and then use a classic re-sampling object from the imblearn package to re-sample your data. Note that the data you pass to the resampler object should contain all data, including the targets.
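To make the pseudo-class idea more concrete, here is a NumPy-only sketch. It is not how reg_resampler or imblearn are implemented; the synthetic data, the choice of two equal-width bins, and all variable names are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 15 "low" targets and 5 "high" targets
X = rng.normal(size=(20, 3))
y = np.concatenate([
    rng.uniform(3.0, 3.5, size=15),
    rng.uniform(4.5, 5.0, size=5),
])

# Step 1: bin the continuous target into pseudo-classes (two equal-width bins)
edges = np.linspace(y.min(), y.max(), num=3)[1:-1]   # the single interior edge
pseudo_classes = np.digitize(y, edges)               # 0 = low bin, 1 = high bin

# Step 2: randomly oversample each pseudo-class up to the largest class size
target_size = np.bincount(pseudo_classes).max()
selected = []
for cls in np.unique(pseudo_classes):
    members = np.flatnonzero(pseudo_classes == cls)
    extra = rng.choice(members, size=target_size - len(members), replace=True)
    selected.append(np.concatenate([members, extra]))
selected = np.concatenate(selected)

# Step 3: keep the original float targets for the regression model
X_over, y_over = X[selected], y[selected]
print(np.bincount(np.digitize(y_over, edges)))
```

The bins are only used to decide which rows get duplicated; the regression is still trained on the original continuous y values. How many bins to use, and where to put their edges, is a modelling choice that depends on the data.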
While doing some sentiment analysis, I am trying to get the feature importance using logistic regression. I found a reference here (How to get feature importance in logistic regression using weights?) on how to do it, but when I implement it, it gives me an error and I don't know why or how to solve it.
Can someone help me?
Here is my code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
## Creating Training data
Independent_var = df_final.tweet # the features
Dependent_var = df_final.sent_binary # the sentiment (positive, negative, neutral)
# Logistic regression
cv = CountVectorizer(min_df=2, max_df=0.50, ngram_range = (1,2), max_features=50)
text_count_vector = cv.fit_transform(Independent_var)
#standardized_data = StandardScaler(with_mean=False).fit_transform(text_count_vector)
feature_names = np.array(cv.get_feature_names())
#feature_names
## Splitting in the given training data for our training and testing
X_tr, X_test, y_tr, y_test = train_test_split(text_count_vector, Dependent_var, test_size=0.3, random_state=225)
LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
LogReg_clf = LogReg.fit(X_tr, y_tr)
#coefs = np.abs(LogReg_clf.coef_)
coefs = LogReg_clf.coef_
#get the sorting indices
sorted_index = np.argsort(coefs)[::-1]
# check if the sorting indices are correct
print(coefs[sorted_index])
#get the index of the top-20 features
top_20 = sorted_index[:20]
#get the names of the top 20 most important features
print(feature_names[top_20])
The error I get :
IndexError Traceback (most recent call last)
<ipython-input-103-b566f1c5a21c> in <module>
22 print(sorted_index)
23 # check if the sorting indices are correct
---> 24 print(coefs[sorted_index])
25
26 #get the index of the top-20 features
IndexError: index 23 is out of bounds for axis 0 with size 3
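One likely cause, judging from "axis 0 with size 3", is that coef_ for a multiclass LogisticRegression is a 2-D array of shape (n_classes, n_features). np.argsort then returns a 2-D array of column indices, and using it to index coefs directly tries to pick rows 0 to 49 from an array that only has 3 rows. A minimal sketch of sorting each class separately follows. The small coefficient matrix and feature names are invented stand-ins for LogReg_clf.coef_ and cv's vocabulary:

```python
import numpy as np

# Hypothetical stand-in for LogReg_clf.coef_: 3 classes x 5 features
coefs = np.array([
    [0.2, -1.5, 0.7, 0.0, 0.3],
    [-0.4, 0.9, -0.1, 1.2, 0.0],
    [0.5, 0.1, -0.8, 0.3, -0.6],
])
# Hypothetical stand-in for the CountVectorizer feature names
feature_names = np.array(["good", "bad", "love", "ok", "meh"])

top_k = 2
top_features = {}
for class_idx, class_coefs in enumerate(coefs):
    # argsort on ONE row gives valid 1-D indices; reverse for descending order
    order = np.argsort(class_coefs)[::-1]
    top = order[:top_k]
    top_features[class_idx] = list(feature_names[top])
    print(class_idx, feature_names[top], class_coefs[top])
```

With the real model, the same loop would run over LogReg_clf.coef_ with top_k = 20. Sorting by np.abs(class_coefs) instead would rank features by magnitude regardless of sign, which is what the commented-out line in the question appears to be aiming at.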
I'm new to machine learning and I am trying to predict the price of stocks 30 days ahead.
This is my code:
import pandas as pd
import matplotlib.pyplot as plt
import pymysql as MySQLdb
import numpy as np
import sqlalchemy
import datetime
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
forecast_out = int(30)
df['Prediction'] = df[['LastPrice']].shift(-forecast_out)
df['Prediction'].fillna(0)
X = np.array(df['Prediction'].fillna(0))
X = preprocessing.scale(X)
X_forecast = X[-forecast_out:]
X = X[:-forecast_out]
y = np.array(df['Prediction'].fillna(0))
y = y[:-forecast_out]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
X_train, X_test, y_train, y_test.reshape(-1,1)
# Training
clf = LinearRegression()
clf.fit(X_train,y_train)
# Testing
confidence = clf.score(X_test, y_test)
print("confidence: ", confidence)
forecast_prediction = clf.predict(X_forecast)
print(forecast_prediction)
I got this error:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.46939923 -0.47076913 -0.47004993 ... -0.42782272 3.07433019 -0.46573474].
Reshape your data either using
array.reshape(-1, 1) if your data has a single feature
or
array.reshape(1, -1) if it contains a single sample.
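It's expecting a 2D array, but you're only passing in a 1D array. You can solve this by putting another set of brackets around the array where you're getting the problem. For example:
x = [1,2,3,4]
Foo(x)
If that throws the error, you could just do
Foo([x])
Note that wrapping a flat list this way produces a single row, i.e. one sample with several features. If your data is a single feature with many samples, as in the question, use array.reshape(-1, 1) instead, as the error message suggests.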
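A short sketch of the two shapes, on made-up numbers (the arrays and variable names here are illustrative assumptions, not the question's stock data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])   # 1-D, shape (4,)
y = np.array([2.0, 4.0, 6.0, 8.0])

# Extra brackets: ONE sample with four features
as_one_sample = np.array([x])
print(as_one_sample.shape)            # (1, 4)

# reshape(-1, 1): FOUR samples with one feature each
as_one_feature = x.reshape(-1, 1)
print(as_one_feature.shape)           # (4, 1)

# With four targets in y, only the (4, 1) layout lines up sample-for-sample
model = LinearRegression().fit(as_one_feature, y)

# Inputs to predict() need the same 2-D layout
prediction = model.predict(np.array([[5.0]]))
print(prediction)
```

In the question's code, X, X_train, X_test and X_forecast all hold a single feature, so each of them would need the reshape(-1, 1) form before being passed to fit, score or predict.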