How can I use libsvm with scikit-learn? - python

I want to use libsvm as a classifier for prediction. I have used the following code:
import numpy as np
import sklearn
from sklearn.svm import libsvm
X = np.array([[0,1.22,45,2.111,9.344,0], [0,1.5,25,5,1,0]])
y = np.array([0.0,1.0])
clf=sklearn.svm.libsvm
clf.fit(X,y)
print(clf.predict([1,1.12,42,4.223,2.33,0]))
I got the following error:
File "sklearn/svm/libsvm.pyx", line 270, in sklearn.svm.libsvm.predict (sklearn/svm/libsvm.c:3917)
TypeError: predict() takes at least 6 positional arguments (1 given)
Is this the correct way? How can I resolve the error?

Basically, use sklearn.svm.SVC, since, as stated in the scikit-learn documentation, SVC is based on libsvm:
class SVC(BaseSVC):
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity
is more than quadratic with the number of samples which makes it hard
to scale to dataset with more than a couple of 10000 samples.
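For example, the code from the question could be rewritten along these lines (a minimal sketch using SVC's default hyperparameters; note that predict expects a 2-D array, one row per sample):
import numpy as np
from sklearn.svm import SVC
X = np.array([[0, 1.22, 45, 2.111, 9.344, 0], [0, 1.5, 25, 5, 1, 0]])
y = np.array([0.0, 1.0])
clf = SVC()  # libsvm-based C-Support Vector Classification
clf.fit(X, y)
print(clf.predict([[1, 1.12, 42, 4.223, 2.33, 0]]))  # note the 2-D input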

Related

How to pickle or otherwise save an RFECV model after fitting for rapid classification of novel data

I am generating a predictive model for cancer diagnosis from a moderately large dataset (>4500 features).
I have got RFECV to work, providing me with a model that I can evaluate nicely using ROC curves, confusion matrices, etc., and which performs acceptably when classifying novel data.
Please find a truncated version of my code below:
logo = LeaveOneGroupOut()
model = RFECV(LinearDiscriminantAnalysis(), step=1, cv=logo.split(X, y, groups=trial_number))
model.fit(X, y)
As I say, this works well and provides a model I'm happy with. The trouble is, I would like to be able to save this model so that I don't need to repeat the lengthy retraining every time I want to evaluate new data.
When I have tried to pickle a standard LDA or other model object, this has worked fine. When I try to pickle this RFECV object, however, I get the following error:
Traceback (most recent call last):
File "/rds/general/user/***/home/data_analysis/analysis_report_generator.py", line 56, in <module>
pickle.dump(key, file)
TypeError: cannot pickle 'generator' object
In trying to address this, I have spent a long time trying to RTFM, googling extensively, and digging as deep as I dared into Stack Overflow, without any luck.
I would be grateful if anyone could identify what I could do to pickle this model successfully for future extraction and re-use, or whether there is an equivalent way to save the parameters of the feature-extracted LDA model for rapid analysis of new data.
This occurs because LeaveOneGroupOut().split(X, y, groups=groups) returns a generator object, and generator objects cannot be pickled.
To pickle the model, you'd have to materialize the generator into a finite list of splits, as in the snippet below, or replace it with a splitter such as StratifiedKFold, which does not have this issue.
rfecv = RFECV(
# ...
cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
)
MRE putting all the pieces together (here I've assigned groups randomly):
import pickle
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut
from numpy.random import default_rng
rng = default_rng()
X, y = make_classification(
    n_samples=500,
    n_features=15,
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    n_classes=8,
    n_clusters_per_class=1,
    class_sep=0.8,
    random_state=0,
)
groups = rng.integers(0, 5, size=len(y))
rfecv = RFECV(
estimator=LinearDiscriminantAnalysis(),
step=1,
cv=list(LeaveOneGroupOut().split(X, y, groups=groups)),
scoring="accuracy",
min_features_to_select=1,
n_jobs=4,
)
rfecv.fit(X, y)
with open("rfecv_lda.pickle", "wb") as fh:
pickle.dump(rfecv, fh)
Side note: A better method would be to avoid pickling the RFECV in the first place. rfecv.transform(X) masks feature columns that the search deemed unnecessary. If you have >4500 features and only need 10, you might want to simplify your data pipeline elsewhere.
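A minimal sketch of that alternative (the file name here is just an example): after fitting, RFECV exposes a boolean support_ mask of the selected columns, which is far cheaper to persist than the whole search object.
import numpy as np
mask = rfecv.support_  # boolean mask of the selected features
np.save("selected_features.npy", mask)
X_reduced = X[:, mask]  # equivalent to rfecv.transform(X)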

Scikit-learn QuantileRegressor memory allocation error. No issue with statsmodels QuantReg on the same data

I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data with the statsmodels equivalent function is working fine.
The error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
It doesn't make any sense: my X and y have shapes (86636, 4) and (86636, 1) respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
"feature_1",
"feature_2",
"feature_3",
"feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and I'm pretty sure my inputs are fine as dataframes; I get the same issue with ndarrays, so I'm not sure what the problem is. Is it possible there's an issue with something under the hood?
[Here][1] is the scikit-learn documentation for QuantileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
The sklearn QuantileRegressor class uses linear programming to solve the quantile regression problem, which is much more computationally expensive than the iteratively reweighted least squares used by the statsmodels QuantReg class.
Here is a GitHub issue for the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922
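If upgrading is an option, a mitigation discussed around that issue is to avoid the dense interior-point solver. Here is a sketch, assuming a scikit-learn version whose QuantileRegressor accepts solver="highs" (which also requires SciPy >= 1.6):
from sklearn.linear_model import QuantileRegressor
# use the sparse HiGHS solver instead of the dense interior-point one
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0, solver="highs")
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])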

How to run a non-linear autoregression with exogenous inputs with sysidentpy?

I am trying to run a nonlinear autoregression with exogenous inputs (NARX) in Python.
This is my code:
Step 1: import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sysidentpy.model_structure_selection import FROLS
from sysidentpy.basis_function import Polynomial, Fourier
from sysidentpy.metrics import root_relative_squared_error
from sysidentpy.utils.generate_data import get_siso_data
from sysidentpy.utils.display_results import results
from sysidentpy.utils.plotting import plot_residues_correlation, plot_results
from sysidentpy.residues.residues_correlation import compute_residues_autocorrelation, compute_cross_correlation
from sklearn.model_selection import train_test_split
Step 2: import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
Step 3: Organize the data
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
Step 4: Step up the training and testing data
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
Step 5: Create the NARX model
basis_function = Polynomial(degree=2)
model = FROLS(
basis_function=basis_function,
order_selection=True,
n_info_values=10,
extended_least_squares=False,
ylag=2, xlag=2,
info_criteria='aic',
estimator='least_squares',
)
Step 6: Fit the model
model.fit(X_train, y_train)
From step 6 I am experiencing an error
TypeError: fit() takes 1 positional argument but 3 were given
Step 7: Prediction
yhat = model.predict(X_test, y_test)
I am also experiencing an error
AttributeError: 'FROLS' object has no attribute 'final_model'
Step 8: Compute the RRSE
rrse = root_relative_squared_error(y_test, yhat)
print(rrse)
I am experiencing the following error
NameError: name 'yhat' is not defined
Well, I realise that this error follows from the one before it, which is why yhat is not defined.
I would be grateful for any assistance.
I'm the developer of SysIdentPy and just found this question.
I hope you already solved it, but if not, here is the solution:
The first error you got
model.fit(X_train, y_train)
TypeError: fit() takes 1 positional argument but 3 were given
is due to the fact that you have to use keyword arguments instead of positional arguments. To fix it, just use:
model.fit(X=X_train, y=y_train)
All the other problems are consequences of the first one: without fitting the model you cannot predict, and you will not have a final_model to access, for example.
I'll add a "check_fitted" method to give the users a more detailed message about this kind of error.
The use of keyword arguments instead of positional arguments was described in update v0.17.0, and the examples were adapted to follow this change in the same update, but this can be a common mistake and hard to understand without a proper error message if you haven't read the docs.
Note: it's not related to your question, but you used the train_test_split method from sklearn to split your data. In a time-series scenario this is usually (not to say always) wrong. I don't know what you were trying to do, but it's worth checking this part too; take a look at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html and the sketch below.
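For instance, a minimal sketch of a time-ordered split with TimeSeriesSplit (assuming the rows of X and y are already in chronological order):
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    # the training folds always precede the test fold in time
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]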
Hope it helps you.

Import error. Can't import kmeans_plusplus

I'm trying to replicate an example of a clustering model with scikit-learn:
import sklearn
sklearn.__version__
Returns:
'0.23.2'
And:
from sklearn.cluster import kmeans_plusplus
Returns the Error message:
ImportError: cannot import name 'kmeans_plusplus' from 'sklearn.cluster' (C:\Users\sddss\anaconda3\lib\site-packages\sklearn\cluster\__init__.py)
According to the documentation, kmeans_plusplus is
New in version 0.24.
so it is not available for the version 0.23.2 you are using.
Nevertheless, this should not be a real issue; the only difference from the "good old" KMeans already available in scikit-learn is the initialization of the cluster centers according to the k-means++ algorithm, and this is already available in the standard KMeans. From the standard KMeans documentation regarding the init argument:
'k-means++' : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence
So, what you need to do instead is simply use the "vanilla" KMeans of scikit-learn with the argument init='k-means++':
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_clusters, init='k-means++')
There is no kmeans_plusplus function in version 0.23.2. You need to import KMeans and set the init keyword argument to 'k-means++' to obtain the behaviour you want:
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++')
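For completeness: after upgrading to scikit-learn >= 0.24, the standalone function itself can be imported. A small self-contained sketch (make_blobs is used here only to have some data):
from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=100, centers=4, random_state=0)
# returns the initial seeds and their indices, not a fitted clustering model
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)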

Simple question: I am not getting the output I expected (linear regression)

I am new to programming. Currently, I am learning machine learning from this video.
This is related to linear regression
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv('homeprices.csv')
reg = linear_model.LinearRegression()
Problem 1
reg.fit(df[['area']],df.price)
The expected output (as shown in the video) is:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
My output:
LinearRegression()
Problem 2
reg.predict(3300)
It gives an error when I pass just the number, but when I wrap it in a 2-D array with [[]] it gives the correct output. I want to know why it does not give the output shown in the video when I use only parentheses.
Problem 1 :
This is how fitted models are displayed in the newest versions of sklearn (0.23+): only parameters that differ from their defaults are shown. The parameters are the same; they are just not printed in the output.
You can use reg.get_params() to view the parameters.
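If you want the old verbose output back, version 0.23 made print_changed_only the default display setting, and you can switch it off globally:
import sklearn
sklearn.set_config(print_changed_only=False)  # repr prints all parameters again
print(reg)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)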
Problem 2 :
Newer versions of scikit-learn require a 2-D input for the predict function, and we can make 3300 two-dimensional with [[3300]]:
reg.predict( [[3300]] )
Problem 1:
The printed output depends on the default parameters, which you may have changed earlier, or on the display settings of your sklearn version; but you can easily set your desired parameters explicitly while initializing the linear regressor, in this way:
reg = linear_model.LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Problem 2:
reg.predict(3300) is not the correct way to pass the input; note that the instructor has also corrected his answer to reg.predict([3300]) in the description of the YouTube post.
Try this, but you should define your variables and fit the model to get the desired output:
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv('homeprices.csv')
reg = LinearRegression()
reg.fit(df[['area']], df.price)  # fit on the 'area' feature
print(reg.predict([[3300]]))  # predict expects a 2-D input
