Scikit learn and data set analysis

I am new to scikit-learn and tried the first program from their website. The code is given below:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)
When I run the last line, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: fit() missing 1 required positional argument: 'y'
Please help me with this issue.

The code runs fine in a clean installation, so most likely you have multiple environments interfering with each other. Try running it in a new virtualenv.
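To check for that kind of interference, it can help to print which interpreter is running and where the package would be imported from. The following is a minimal sketch that uses only the standard library; the helper name describe_environment is made up for illustration.

```python
import importlib.util
import sys


def describe_environment(package="sklearn"):
    """Report the running interpreter and where `package` would be imported from."""
    # find_spec locates a top-level package without importing it;
    # it returns None when the package is not installed for this interpreter.
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "package_location": spec.origin if spec is not None else None,
    }


# Print both paths so they can be compared by eye.
for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the interpreter and the package location belong to different installations (for example, a system Python and a conda environment), that would be a hint that environments are being mixed.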

Related

Get OOB score within a pipeline for Random Forest

For a machine learning project, I was wondering: is it possible to use RandomForestRegressor inside a pipeline?
Specifically, I need to determine the OOB score from a RandomForestRegressor. But my data requires a lot of preprocessing.
I tried several things, and this is the closest so far:
# Creation of the pipeline
rand_piped = Pipeline([
('preprocessor', preprocessor),
('model', RandomForestRegressor(max_depth=3, random_state=0, oob_score=True))
])
# Fitting our model
rand_piped.fit(df_X_train,df_Y_train.values.ravel())
# Getting our metrics and predictions
oob_score = rand_piped.oob_score_
I think my problem is that I still do not fully understand this method, so feel free to correct me. The code above raises this error:
Traceback (most recent call last):
File "/home/user/my_rf.py", line 15, in <module>
oob_score = rand_piped.oob_score_
AttributeError: 'Pipeline' object has no attribute 'oob_score_'

Pipelines are subscriptable, so you can look up the oob_score_ in the model step:
>>> rand_piped["model"].oob_score_
0.9297212997034854
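The reason is that oob_score_ is set on the fitted forest, not on the Pipeline that wraps it. Below is a self-contained sketch of the same lookup. The synthetic data from make_regression and the StandardScaler step are assumptions standing in for the asker's df_X_train, df_Y_train, and preprocessor, which are not shown in the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for df_X_train / df_Y_train from the question.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rand_piped = Pipeline([
    ("preprocessor", StandardScaler()),  # assumption: placeholder transformer
    ("model", RandomForestRegressor(max_depth=3, random_state=0, oob_score=True)),
])
rand_piped.fit(X, y)

# The attribute lives on the fitted "model" step, not on the Pipeline itself.
oob_by_index = rand_piped["model"].oob_score_
oob_by_name = rand_piped.named_steps["model"].oob_score_
print(oob_by_index)
```

Both lookups return the same fitted estimator, so either spelling can be used.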

I'm trying to standardize my attribute but I'm getting standard scaler error because Python can not find it

I'm trying to use a standard scaler, but Python cannot find it.
here is the error:
z_scaler = Standardscaler()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-09f39beade2e> in <module>
----> 1 z_scaler = Standardscaler()
NameError: name 'Standardscaler' is not defined
Is there any particular package I need for Standard Scaler?

You're looking for StandardScaler, not Standardscaler. Python names are case-sensitive, and the class has to be imported first:
from sklearn.preprocessing import StandardScaler
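Once the import and the spelling are right, standardizing works as in this minimal sketch. The toy array is made up for illustration and is not from the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): three samples, two features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

z_scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales.
X_scaled = z_scaler.fit_transform(X)

# Each column should now have zero mean and unit standard deviation.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```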

'svd' has no "split" attribute

I am trying to build a recommender system using SVD from the Surprise Python package. I import a CSV file and then run the code below, but it raises an error. How can I solve this?
from surprise import SVD,Reader,Dataset
ratings = pd.read_csv("/content/ratings_small.csv")
data = Dataset.load_from_df(ratings[['userId','movieId','rating']],reader)
data.split(n_folds=5)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-f3bf344cf3e2> in <module>()
----> 1 data.split(n_folds=5)
AttributeError: 'DatasetAutoFolds' object has no attribute 'split'
It says there is no split attribute, but I came across a question where it was used.

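Dataset.split() appears to have existed in older versions of Surprise, but it is not available in the version you are running. Instead, import KFold from surprise.model_selection to split the data and perform cross-validation.
This works:
import pandas as pd
from surprise import SVD, Reader, Dataset
from surprise.model_selection import KFold
ratings = pd.read_csv("/content/ratings_small.csv")
reader = Reader()  # default rating scale is (1, 5); adjust it to match your data
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
kf = KFold(n_splits=5)
algo = SVD()
# kf.split(data) yields one (trainset, testset) pair per fold
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)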

Sklearn decision tree plot does not appear

I am trying to follow scikit learn example on decision trees:
from sklearn.datasets import load_iris
from sklearn import tree
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
When I try to plot the tree:
tree.plot_tree(clf.fit(iris.data, iris.target))
I get
NameError Traceback (most recent call last)
<ipython-input-2-e72b33a93ee6> in <module>
----> 1 tree.plot_tree(clf.fit(iris.data, iris.target))
NameError: name 'iris' is not defined

Your problem was different, but I ended up here while googling a related issue, and it also applies to your code.
At least on Windows, matplotlib (which tree.plot_tree uses to draw the tree) will not display anything unless you call plt.show() somewhere.
from sklearn import tree
import matplotlib.pyplot as plt
sometree = ....
tree.plot_tree(sometree)
plt.show() # mandatory on Windows

The name iris is never defined in your code, so it does not exist. Use this line to plot instead:
tree.plot_tree(clf.fit(X, y))
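You already assigned the X and y returned by load_iris() to variables, so you can use them directly.
Note that tree.plot_tree draws with matplotlib and does not need Graphviz. The Graphviz bin folder only has to be on PATH if you render the output of tree.export_graphviz instead.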
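Putting both answers together, a complete version might look like the sketch below. The figure size, filled=True, and random_state=0 are illustrative choices, not taken from the question. The plt.show() call is left as a comment so the snippet also runs in environments without a display.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Use the X and y returned by load_iris directly; no `iris` variable is needed.
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 8))
# plot_tree draws onto the given axes and returns the annotations it created.
annotations = tree.plot_tree(clf, ax=ax, filled=True)

# When running as a script rather than in a notebook, display the window with:
# plt.show()
```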

Python sklearn GaussianNB : "MemoryError" but no leads on how to fix

I am running the following code to create and fit a GaussianNB classifier:
features_train, features_test, labels_train, labels_test = preprocess()
### compute the accuracy of your Naive Bayes classifier
# import the sklearn module for GaussianNB
from sklearn.naive_bayes import GaussianNB
import numpy as np
### create classifier
clf = GaussianNB()
### fit the classifier on the training features and labels
clf.fit(features_train, labels_train)
Running the above locally:
>>> runfile('C:/.../naive_bayes')
no. of Chris training emails: 4406
no. of Sara training emails: 4383
>>> clf
GaussianNB()
I believe this shows that preprocess() works, because features_train, features_test, labels_train, and labels_test are loaded successfully.
When I call clf.score or clf.predict, I get a MemoryError:
>>> clf.predict(features_test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
jll = self._joint_log_likelihood(X)
File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 343, in _joint_log_likelihood
n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
MemoryError
>>> clf.score(features_test,labels_test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sklearn\base.py", line 295, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
jll = self._joint_log_likelihood(X)
File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 343, in _joint_log_likelihood
n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
MemoryError
I do not think my machine is running out of memory: I do not see a spike in RAM in Task Manager, and usage is nowhere near the machine's total memory.
I suspect it has something to do with the Python version or the library versions.
Any help in diagnosing this is appreciated. I can provide more info as needed.

I believe I answered my own question after reading some related posts online (not previously answered Stack Overflow posts).
The key for me was simply to move to 64-bit Python via Anaconda. All 'MemoryError' issues were resolved when the exact same code that had been run in 32-bit Python was retried in 64-bit. To the best of my understanding, this was the only variable that changed.
Perhaps this is not a very satisfying answer, but it would be nice if this question can remain for others in the future searching for the exact same sklearn MemoryError problem.

I'm also taking that same Udacity course and had the exact same problem. I installed 64-bit Anaconda and ran the script inside Spyder, and everything worked as expected.
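To confirm which build is in use, the interpreter can be asked directly. A 32-bit process can address only a few gigabytes at most, regardless of installed RAM, which would be consistent with a MemoryError appearing without a visible spike in Task Manager. Here is a minimal sketch using only the standard library; the function name python_bitness is made up for illustration.

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


print(f"Running {python_bitness()}-bit Python")
# Cross-check: sys.maxsize exceeds 2**32 only on 64-bit builds.
print("64-bit build:", sys.maxsize > 2**32)
```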
