Leave one out Cross validation using sklearn (Multiple CSV) - python

I have 52 CSV files in a folder and want to build a model from this data, so I want to run leave-one-out cross-validation on them. How can I do this using scikit-learn in Python?
I went through the scikit-learn documentation and searched many other resources, but I didn't find a solution. I have tried this code:
import glob
import numpy as np
import pandas as pd
from sklearn.cross_validation import LeaveOneOut
path=r'...................\Data\New design process data'
filelist=glob.glob(path + "/*.csv")
loo=LeaveOneOut()
for train,test in loo.split(filelist):
    print("%s %s" % (train, test))
But it raised an error:
__init__() missing 1 required positional argument: 'n'
I am new to Python as well as scikit-learn. Any help would be much appreciated.

You should use the newer version of the module, which is located in sklearn.model_selection instead of sklearn.cross_validation. (The cross_validation module was deprecated in 0.18 and removed in 0.20.) With this version you can instantiate the class without the positional argument, and calling split no longer fails.
import numpy as np
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut() # works without passing an argument
loo.get_n_splits(X) # returns 2
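To connect this back to the question, here is a minimal sketch of holding out one CSV file per fold. It assumes that every file has the same columns and that one file is the unit to leave out; the helper name leave_one_file_out and the tiny temporary files are made up for illustration and stand in for the 52 real files.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneOut


def leave_one_file_out(filelist):
    """Yield (train_df, test_df) pairs, holding out one CSV file per fold."""
    files = np.array(sorted(filelist))
    for train_idx, test_idx in LeaveOneOut().split(files):
        # Concatenate every training file into a single DataFrame.
        train_df = pd.concat((pd.read_csv(f) for f in files[train_idx]),
                             ignore_index=True)
        # Exactly one file is held out in each fold.
        test_df = pd.read_csv(files[test_idx][0])
        yield train_df, test_df


# Demo on three small, hypothetical CSV files written to a temp folder.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        frame = pd.DataFrame({"x": [i, i + 1], "y": [2 * i, 2 * i + 2]})
        frame.to_csv(os.path.join(tmp, f"part{i}.csv"), index=False)
    filelist = glob.glob(os.path.join(tmp, "*.csv"))
    folds = [(len(tr), len(te)) for tr, te in leave_one_file_out(filelist)]

print(folds)  # one (train rows, test rows) pair per held-out file
```

Inside the loop you would fit your model on train_df and score it on test_df. If you would rather leave out single rows than whole files, concatenate everything first and call split on the combined array instead.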

Related

Import Error: cannot import name 'tree' from 'sklearn.tree'

I am on my second day of re-taking Python for the gazillionth time!
I am doing a tutorial on ML in Python, using the following code:
import sklearn.tree
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import tree
music_data = pd.read_csv('music.csv')
x = music_data.drop(columns=['genre'])
y = music_data['genre']
model = DecisionTreeClassifier()
model.fit(x,y)
tree.export_graphviz(model, out_file='music-recommender.dot',
                     feature_names=['age','gender'],
                     class_names=sorted(y.unique()),
                     label='all',
                     rounded=True,
                     filled=True)
I keep getting the following error:
ImportError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13088/3820271611.py in <module>
2 import pandas as pd
3 from sklearn.tree import DecisionTreeClassifier
----> 4 from sklearn.tree import tree
5
6 music_data = pd.read_csv('music.csv')
ImportError: cannot import name 'tree' from 'sklearn.tree' (C:\Anaconda\lib\site-packages\sklearn\tree\__init__.py)
I've tried to find a solution online, but I don't think it's the version of Python/Anaconda because I literally just installed both. I also don't think the problem is sklearn.tree itself, since I was able to import DecisionTreeClassifier.
As this answer indicates, you're looking at some older code; this is always a risk with programming. But there's another thing you need to know about your code.
First off, scikit-learn contains several modules, and almost everything you need from it is in one of those. In my experience, most people import things like this:
from sklearn.tree import DecisionTreeRegressor # A regressor class.
from sklearn.tree import plot_tree # A helpful function.
from sklearn.metrics import mean_squared_error # An evaluation function.
It looks like the tutorial wants something similar to plot_tree(). This new-ish function is much easier to use than the older Graphviz visualization. So unless you really need the DOT file for some reason, you should be able to do this:
from sklearn.tree import plot_tree
plot_tree(model)
Bottom line: there will probably be more broken things in that material. So if I were you I'd either make a new environment with a version of sklearn matching whatever material you're using... or ditch that material and look for something newer.
from sklearn.tree import tree looks wrong. Did you mean from sklearn import tree?
According to the official Scikit Learn Decision Trees Documentation you really do not need too much of importing.
It can be done simply as follows:
from sklearn import tree
import pandas as pd
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
y = music_data['genre']
model = tree.DecisionTreeClassifier()
model.fit(X,y)
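If you do still want the Graphviz output from the tutorial, the following is a minimal sketch that imports export_graphviz directly from sklearn.tree. The small DataFrame is made-up data standing in for music.csv (the real file is not shown in the question), and out_file=None is used so the DOT source comes back as a string instead of being written to disk.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical rows standing in for music.csv.
music_data = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37],
    "gender": [1, 1, 1, 1, 1, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Jazz", "Classical",
              "Dance", "Dance", "Acoustic", "Classical"],
})

X = music_data.drop(columns=["genre"])
y = music_data["genre"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# With out_file=None the DOT source is returned as a string.
dot_text = export_graphviz(
    model,
    out_file=None,
    feature_names=["age", "gender"],
    class_names=sorted(y.unique()),
    label="all",
    rounded=True,
    filled=True,
)
print(dot_text[:60])
```

Passing out_file='music-recommender.dot' instead writes the file the tutorial expects.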

How to run a non-linear autoregression with exogenous inputs with sysidentpy?

I am trying to run a nonlinear autoregression with exogenous inputs (NARX) in Python.
This is my code
Step 1: import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sysidentpy.model_structure_selection import FROLS
from sysidentpy.basis_function import Polynomial, Fourier
from sysidentpy.metrics import root_relative_squared_error
from sysidentpy.utils.generate_data import get_siso_data
from sysidentpy.utils.display_results import results
from sysidentpy.utils.plotting import plot_residues_correlation, plot_results
from sysidentpy.residues.residues_correlation import compute_residues_autocorrelation, compute_cross_correlation
from sklearn.model_selection import train_test_split
Step 2: import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
Step 3: Organize the data
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
Step 4: Set up the training and testing data
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
Step 5: Create the NARX model
basis_function = Polynomial(degree=2)
model = FROLS(
    basis_function=basis_function,
    order_selection=True,
    n_info_values=10,
    extended_least_squares=False,
    ylag=2, xlag=2,
    info_criteria='aic',
    estimator='least_squares',
)
Step 6: Fit the model
model.fit(X_train, y_train)
From step 6 I am experiencing an error
TypeError: fit() takes 1 positional argument but 3 were given
Step 7: Prediction
yhat = model.predict(X_test, y_test)
I am also experiencing an error
AttributeError: 'FROLS' object has no attribute 'final_model'
Step 8: Compute the RRSE
rrse = root_relative_squared_error(y_test, yhat)
print(rrse)
I am experiencing the following error
NameError: name 'yhat' is not defined
Well, I realise that this error is due to the error before it, so 'yhat' is not defined.
I would be grateful for any assistance.
I'm the developer of SysIdentPy and just found this question.
I hope you already solved it, but if not, here is the solution:
The first error you got
model.fit(X_train, y_train)
TypeError: fit() takes 1 positional argument but 3 were given
is due to the fact that you have to use keyword arguments instead of positional arguments. To fix it, just use:
model.fit(X=X_train, y=y_train)
All the other problems are consequences of the first one: without fitting the model you cannot predict, and you will not have a final_model to access, for example.
I'll add a "check_fitted" method to give users a more detailed message about this kind of error.
The use of keyword arguments instead of positional arguments was described in update v0.17.0, and the examples were adapted to follow this change in the same update, but this can be a common mistake and hard to understand without a proper error message if you haven't read the docs.
Note: It's not related to your question, but you used the train_test_split method from sklearn to split your data. In a time series scenario this is usually (not to say always) wrong, because a shuffled split lets the model train on observations that come after the ones it is tested on. I don't know what you were trying to do, but it's worth checking this part too (take a look at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
Hope it helps you.
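As a rough illustration of the note on splitting, here is a small sketch of TimeSeriesSplit on toy data. The arrays are placeholders for the real predictors and target, and the sketch assumes the rows are already sorted in time order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 10 time-ordered samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Every training index comes before every test index.
    print("train:", train_idx, "test:", test_idx)
```

Unlike train_test_split with its default shuffling, each fold here trains only on observations that precede the test block.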

pyspark ML LabeledPoint not working with LinearRegression

I'm studying Spark 3.0.1 with pyspark, and have set up some data for simple OLS regression using
data = results.select('OrderMonthYear', 'SaleAmount').rdd.map(lambda row: LabeledPoint(row[1], [row[0]])).toDF()
The OrderMonthYear is my feature column (int), and SaleAmount is the response (float). The LabeledPoint method was imported from pyspark.mllib.regression. I then try to fit the regression model with
from pyspark.ml.regression import LinearRegression
lr = LinearRegression()
modelA = lr.fit(data, {lr.regParam:0.0})
to get this exception
IllegalArgumentException: requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
This is clearly not very helpful, as the required and passed feature types appear to be the same struct. I've searched online and only found answers to this problem for Java, or for someone building the struct themselves. The exception is raised from a utility function that just rethrows the Java exception (hiding where it originated and showing a non-Pythonic JVM message), so I can't debug further.
The RDD-based pyspark.mllib API is in maintenance mode, and its LabeledPoint produces pyspark.mllib.linalg vectors, whereas pyspark.ml estimators expect pyspark.ml.linalg vectors. The two vector types print as the same struct, which is why the message looks self-contradictory. I suggest using the VectorAssembler from pyspark.ml instead:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
data = spark.createDataFrame([[0,1],[1,2],[2,3]]).toDF('OrderMonthYear', 'SaleAmount')
va = VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features')  # feature column from the question
data2 = va.transform(data)
lr = LinearRegression(labelCol='SaleAmount')  # response column from the question
model = lr.fit(data2)
For anyone else following the same LI Learning course, based on some modifications to the accepted answer above to align more with what I was seeing in the course, here's what Cmd 4 cell should look like:
# convenience for specifying schema
from pyspark.ml.feature import VectorAssembler
data = VectorAssembler(inputCols=['OrderMonthYear'], outputCol='features').transform(results.select("OrderMonthYear", "SaleAmount")).drop('OrderMonthYear').withColumnRenamed('SaleAmount', 'label')
display(data)
Alternatively, you can use the following which also works:
from pyspark.ml.linalg import Vectors
data = results.rdd.map(lambda r: (Vectors.dense(r[0]), r[1])).toDF(["features","label"])
display(data)
Then you should be good to go. Note that you'll want to make the same changes to Cmd 4 in notebooks 4.4 and 4.5 as well. Hope this helps!

Simple question I am not getting output as expected.(Linear regression)

I am new to programming. Currently, I am learning machine learning from this video.
This is related to linear regression
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df=pd.read_csv('homeprices.csv')
reg = linear_model.LinearRegression()
Problem 1
reg.fit(df[['area']],df.price)
Expected output should be
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
My output:
LinearRegression()
Problem 2
reg.predict(3300)
It gives an error when I pass a scalar in plain parentheses "()", but when I pass a 2D array "[[]]" it gives the correct output. I want to know why it does not work with plain parentheses, as shown in the video.
Problem 1 :
This is how fitted estimators are displayed since scikit-learn 0.23: only parameters that differ from their defaults are printed. The parameters are the same as in the video; they are just not shown in the output.
You can use reg.get_params() to view the parameters.
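A quick sketch of both ways to see the full parameter list, assuming scikit-learn 0.23 or later: get_params() returns the values as a dict, and the print_changed_only option (used here through sklearn.config_context) restores the older, verbose display.

```python
import sklearn
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# All parameters, including the ones left at their defaults.
params = reg.get_params()
print(params)

# Temporarily switch back to the verbose repr used by older releases.
with sklearn.config_context(print_changed_only=False):
    full_repr = repr(reg)
print(full_repr)
```

The exact set of parameters listed depends on the installed version, so it will not necessarily match the video.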
Problem 2 :
Newer versions of Scikit-learn require 2D inputs for the predict function and we can make 3300 2D by [[3300]]:
reg.predict( [[3300]] )
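To see why the double brackets matter, here is a small sketch. The numbers are made up to stand in for homeprices.csv; predict expects an input of shape (n_samples, n_features), so a single area value has to be wrapped as one row with one column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for homeprices.csv (price = 100 * area + 1000).
df = pd.DataFrame({"area": [2600, 3000, 3200, 3600, 4000]})
df["price"] = 100 * df["area"] + 1000

reg = LinearRegression()
reg.fit(df[["area"]], df["price"])

# One sample with one feature: shape (1, 1).
single = np.array([3300]).reshape(-1, 1)
print(single.shape)

# Passing a DataFrame with the same column name keeps feature names consistent.
pred = reg.predict(pd.DataFrame({"area": [3300]}))
print(pred)

# A bare scalar has no sample/feature axes, so scikit-learn rejects it.
scalar_error = None
try:
    reg.predict(3300)
except ValueError as err:
    scalar_error = err
print(type(scalar_error).__name__)
```

reg.predict([[3300]]) works too; on recent versions it may warn about missing feature names because the model was fitted on a DataFrame.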
Problem 1:
The output depends on the default parameters and on how your scikit-learn version prints them. You can still set the parameters explicitly when you create the linear regression model:
reg = linear_model.LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)  # normalize was removed in scikit-learn 1.2
Problem 2:
reg.predict(3300) fails because scikit-learn expects a 2D array of shape (n_samples, n_features), not a scalar. The instructor has also posted a correction in the description of the YouTube video. Use reg.predict([[3300]]) instead.
Try this, but remember that you still need to define your variables and fit the model to get the desired output:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df=pd.read_csv('homeprices.csv')
reg =LinearRegression()

plot calibration curve for machine learning

I have the code below, but it only works for binary classification. How can I use it with three classes?
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import scikitplot as skp
orgnal_data = pd.read_excel("movie.xls")
# Program extracting first column
text = orgnal_data.iloc[:,0]
lable = orgnal_data.iloc[:,1]
x_train,x_test,y_train,y_test=train_test_split(fe,lable,test_size=0.30,random_state=40)
DT = DecisionTreeClassifier()
DT_y = DT.fit(x_train,y_train).predict(x_test)
clf_names = ['Decision Tree']
skp.metrics.plot_calibration_curve(y_test,DT_y,clf_names)
plt.show()
The scikit-plot module you are using has no calibration-curve function for multiclass problems.
The source code says so explicitly:
This function currently only works for binary classification.
So you can either 1) modify the source code or 2) open a GitHub issue and request a function for multiclass problems.
EDIT 1:
scikit-learn has several models that can handle multiclass problems. For example, LinearSVC handles multiclass problems with a one-vs-the-rest scheme.
So you can train a model like this and then use the plot_calibration_curve function for each case (one vs. rest) separately.
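One way to sketch the one-vs-rest idea without scikit-plot is to use calibration_curve from sklearn.calibration, binarizing the labels for one class at a time. The synthetic three-class data below is an assumption standing in for the features extracted from movie.xls, and the tree depth and bin count are arbitrary choices.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real features and labels.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=40)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40)

clf = DecisionTreeClassifier(max_depth=4, random_state=40)
clf.fit(x_train, y_train)

# Calibration needs predicted probabilities, not hard class labels.
proba = clf.predict_proba(x_test)

curves = {}
for k, cls in enumerate(clf.classes_):
    # One-vs-rest: the current class is "positive", all others "negative".
    y_binary = (y_test == cls).astype(int)
    frac_pos, mean_pred = calibration_curve(y_binary, proba[:, k], n_bins=5)
    curves[cls] = (mean_pred, frac_pos)

for cls, (mean_pred, frac_pos) in curves.items():
    print(cls, mean_pred.round(2), frac_pos.round(2))
```

Each entry in curves can then be drawn with plt.plot(mean_pred, frac_pos), giving one calibration curve per class on the same axes.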
