Simple question: I am not getting the output I expected (linear regression) - python

I am new to programming. Currently, I am learning machine learning from this video.
This is related to linear regression.
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv('homeprices.csv')
reg = linear_model.LinearRegression()
Problem 1
reg.fit(df[['area']],df.price)
The expected output (as shown in the video) should be:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
My output:
LinearRegression()
Problem 2
reg.predict(3300)
It gives an error when I call it this way, but when I use a 2D array with "[[]]" it gives the correct output. I want to know why it does not give the output shown in the video when I pass only the plain number in parentheses.

Problem 1:
This is how fitted models are displayed in newer versions of sklearn (0.23 and later): parameters left at their defaults are no longer printed, but they are still the same.
You can use reg.get_params() to view the parameters.
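For example, a quick check (the exact keys can vary by sklearn version; the ones shown below are what 0.23 reports for a default model):
print(reg.get_params())
# e.g. {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}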
Problem 2:
Newer versions of scikit-learn require a 2D input for predict; you can make 3300 two-dimensional by writing [[3300]]:
reg.predict( [[3300]] )
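The reason is that predict expects an input of shape (n_samples, n_features); [[3300]] is one sample with one feature. A quick sketch using numpy (already imported in your script):
X_new = np.array([[3300]])  # shape (1, 1): one sample, one feature
print(X_new.shape)          # (1, 1)
print(reg.predict(X_new))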

Problem 1:
The printed output depends on the default parameters, which you might have changed earlier for some reason. You can easily set your desired parameters explicitly when initializing the linear regression model, like this:
reg = linear_model.LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Problem 2:
reg.predict(3300) is not the correct way to pass the argument, and you can see that the instructor has also corrected his answer to reg.predict([3300]) in the description of the YouTube post.

Try this, but note that you still need to fit the model on your data to get the desired output:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df=pd.read_csv('homeprices.csv')
reg = LinearRegression()
reg.fit(df[['area']], df.price)
reg.predict([[3300]])

Related

Scikit-learn QuantileRegressor memory allocation error. No issue with statsmodels QuantReg on the same data

I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data with the statsmodels equivalent function is working fine.
The error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
It doesn't make any sense; my X and y have shapes (86636, 4) and (86636, 1) respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
"feature_1",
"feature_2",
"feature_3",
"feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and I'm pretty sure my inputs are fine as dataframes; I get the same issue with ndarrays, so I'm not sure what the problem is. Is it possible there's an issue with something under the hood?
[Here][1] is the scikit-learn documentation for QuantileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
The sklearn QuantileRegressor class uses linear programming to solve the quantile regression problem, which is much more computationally expensive than the iteratively reweighted least squares used by the statsmodels QuantReg class.
Here is a GitHub issue about the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922
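If your scikit-learn and scipy versions are recent enough, one workaround discussed in that issue is to pass a different solver; a hedged sketch (it assumes the "highs" solver is available, which requires scipy >= 1.6):
from sklearn.linear_model import QuantileRegressor
# "highs" is typically much more memory-efficient than the older
# interior-point default for large n_samples
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])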

How to run a non-linear autoregression with exogenous inputs with sysidentpy?

I am trying to run a nonlinear autoregression with exogenous inputs (NARX) in Python.
This is my code
Step 1: Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sysidentpy.model_structure_selection import FROLS
from sysidentpy.basis_function import Polynomial, Fourier
from sysidentpy.metrics import root_relative_squared_error
from sysidentpy.utils.generate_data import get_siso_data
from sysidentpy.utils.display_results import results
from sysidentpy.utils.plotting import plot_residues_correlation, plot_results
from sysidentpy.residues.residues_correlation import compute_residues_autocorrelation, compute_cross_correlation
from sklearn.model_selection import train_test_split
Step 2: Import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
Step 3: Organize the data
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
Step 4: Set up the training and testing data
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
Step 5: Create the NARX model
basis_function = Polynomial(degree=2)
model = FROLS(
basis_function=basis_function,
order_selection=True,
n_info_values=10,
extended_least_squares=False,
ylag=2, xlag=2,
info_criteria='aic',
estimator='least_squares',
)
Step 6: Fit the model
model.fit(X_train, y_train)
At step 6 I am experiencing an error:
TypeError: fit() takes 1 positional argument but 3 were given
Step 7: Prediction
yhat = model.predict(X_test, y_test)
I am also experiencing an error
AttributeError: 'FROLS' object has no attribute 'final_model'
Step 8: Compute the RRSE
rrse = root_relative_squared_error(y_test, yhat)
print(rrse)
I am experiencing the following error
NameError: name 'yhat' is not defined
Well, I realise that this error follows from the previous one; since predict failed, 'yhat' was never defined.
I would be grateful for any assistance.
I'm the developer of SysIdentPy and just found this question.
I hope you already solved it, but if not, here is the solution:
The first error you got
model.fit(X_train, y_train)
TypeError: fit() takes 1 positional argument but 3 were given
is due to the fact that you have to use keyword arguments instead of positional arguments. To fix it, just use:
model.fit(X=X_train, y=y_train)
All the other problems are consequences of the first one: without fitting the model you cannot predict, and you will not have a final_model to access, for example.
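With keyword arguments throughout, the remaining steps should run; a minimal sketch reusing the variables from your script (assuming predict follows the same keyword-argument convention as fit):
model.fit(X=X_train, y=y_train)
yhat = model.predict(X=X_test, y=y_test)
rrse = root_relative_squared_error(y_test, yhat)
print(rrse)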
I'll add a "check_fitted" method to give the users a more detailed message about this kind of error.
The use of keyword arguments instead of positional arguments was described in update v0.17.0, and the examples were adapted to follow this change in that same update, but this can be a common mistake and hard to understand without a proper error message if you haven't read the docs.
Note: it's not related to your question, but you used the train_test_split method from sklearn to split your data. In a time series scenario this is usually (not to say always) wrong. I don't know what you were trying to do, but it's worth checking this part too; take a look at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html and the sketch below.
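For reference, a minimal sketch of an order-preserving split with the X and y from your script:
from sklearn.model_selection import TimeSeriesSplit
# Each split trains on an initial segment of the series and tests on the
# segment that immediately follows it, preserving temporal order
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]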
Hope it helps you.

plot calibration curve for machine learning

I have the code below, and it works only for binary classification. How can I use it with three classes?
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import scikitplot as skp
orgnal_data = pd.read_excel("movie.xls")
# Extract the text (first column) and the labels (second column)
text = orgnal_data.iloc[:,0]
lable = orgnal_data.iloc[:,1]
# NOTE: 'fe' below is assumed to hold extracted features; the feature-extraction
# step that produces it from 'text' is not shown in the question
x_train,x_test,y_train,y_test=train_test_split(fe,lable,test_size=0.30,random_state=40)
DT = DecisionTreeClassifier()
DT_y = DT.fit(x_train,y_train).predict(x_test)
clf_names = ['Decision Tree']
skp.metrics.plot_calibration_curve(y_test,DT_y,clf_names)
plt.show()
Since you are using the scikit-plot module: there is no function for multiclass problems there.
Read the source code here:
This function currently only works for binary classification.
So you can either 1) modify the source code or 2) open a GitHub issue and request a function for multiclass problems.
EDIT 1:
Using scikit-learn you have some ML models that can handle multiclass problems out of the box. For example, for the LinearSVC class here, multiclass support is handled according to a one-vs-the-rest scheme.
So you can actually train models like this and then use the plot_calibration_curve function for each case (one vs. rest) separately, as sketched below.
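A hedged sketch of that idea, reusing the fitted DT from the question (this assumes plot_calibration_curve accepts a single-column probability array per classifier, as in the binary case):
import numpy as np
probas = DT.predict_proba(x_test)
for k, cls in enumerate(DT.classes_):
    y_binary = (np.asarray(y_test) == cls).astype(int)  # current class vs. rest
    skp.metrics.plot_calibration_curve(y_binary, [probas[:, k]],
                                       clf_names=['Decision Tree (%s vs rest)' % cls])
    plt.show()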

Find the sum of the residuals

I am doing a hands-on exercise on Poisson regression from Stats with Python on Fresco Play.
The problem statement is as follows:
Load the R dataset Insurance from the MASS package.
Capture the data as a pandas dataframe.
Build a Poisson regression model with a log of an independent variable
Holders, and dependent variable Claims.
Fit the model with data, and find the sum of the residuals.
I am stuck on the last step, i.e. the sum of the residuals.
I used np.sum(model.resid), but the answer is not accepted.
Here is my code
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
INS_data = sm.datasets.get_rdataset('Insurance','MASS').data
model = smf.poisson('Claims ~ np.log(Holders)', INS_data).fit()
print(np.sum(model.resid))
I was running the code in Python 2, which gave the wrong answer, but running it in Python 3 gave the correct answer. I don't know the reason, but the code works perfectly in Python 3.
For the residuals, you can use the basic definition of a residual, i.e. actual minus predicted.
Here is the code snippet.
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf
Insurance = sm.datasets.get_rdataset('Insurance','MASS')
data = Insurance.data
data['Holders_'] = np.log(data['Holders'])
model = smf.poisson('Claims ~ Holders_',data).fit()
y_predicted = model.predict(data['Holders_'])
residual = (data['Claims']-y_predicted)
print(sum(residual))
After much searching, I came to know that it expects the cumulative sum, so use:
np.cumsum(model.resid)
This will pass in Fresco Play.
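For clarity, np.sum collapses the residuals to a single number while np.cumsum returns the running total at each observation; a tiny illustration:
import numpy as np
r = np.array([1.0, -2.0, 3.0])
print(np.sum(r))     # 2.0
print(np.cumsum(r))  # [ 1. -1.  2.]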

How can I use libsvm on scikit learn?

I want to use libsvm as a classifier for prediction. I have used the following code:
import numpy as np
import sklearn
from sklearn.svm import libsvm
X = np.array([[0,1.22,45,2.111,9.344,0], [0,1.5,25,5,1,0]])
y = np.array([0.0,1.0])
clf=sklearn.svm.libsvm
clf.fit(X,y)
print(clf.predict([1,1.12,42,4.223,2.33,0]))
I got following error:
File "sklearn/svm/libsvm.pyx", line 270, in sklearn.svm.libsvm.predict (sklearn/svm/libsvm.c:3917)
TypeError: predict() takes at least 6 positional arguments (1 given)
Is this the correct way? How can I resolve the error?
Basically, use sklearn.svm.SVC since, as stated in the sklearn documentation, SVC is based on libsvm:
class SVC(BaseSVC):
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity
is more than quadratic with the number of samples which makes it hard
to scale to dataset with more than a couple of 10000 samples.
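A minimal sketch of that fix, reusing the data from the question (note that predict also expects a 2D array, one row per sample):
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 1.22, 45, 2.111, 9.344, 0],
              [0, 1.5, 25, 5, 1, 0]])
y = np.array([0.0, 1.0])

clf = SVC()
clf.fit(X, y)
print(clf.predict([[1, 1.12, 42, 4.223, 2.33, 0]]))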
