PySpark logistic regression missing method - python

I am new to PySpark and am using the logistic regression API. Following some tutorials, I wrote this:
from pyspark.ml.classification import LogisticRegression
train, test = df.randomSplit([0.80, 0.20], seed = some_seed)
LR = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=some_iter)
LR_model = LR.fit(train)
When I call
trainingSummary = LR_model.summary
trainingSummary.roc
I get
--------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-319-bf79768ab64e> in <module>()
1 trainingSummary = LR_model.summary
2
----> 3 trainingSummary.roc
AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'roc'
Does anyone have an idea?

Related

Py4JJavaError while calling .fit() in pyspark RandomForestClassifier

I'm trying to fit a RandomForestClassifier model on my dataset and the error below pops up. Does anyone know a solution? I'm using Spark version 3.3.1 and Python version 3.8.
model_df = output.select(["features","OrderMonth"])
train_df, test_df = model_df.randomSplit([0.7,0.3])
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(numTrees=10, labelCol="OrderMonth").fit(train_df)
rf_pred = rfc.transform(test_df)
rf_pred.show()
Py4JJavaError Traceback (most recent call last)
<ipython-input-56-5ed675f09e07> in <module>
7 from pyspark.ml.classification import RandomForestClassifier
8
----> 9 rfc = RandomForestClassifier(numTrees=10, labelCol="OrderMonth").fit(train_df)#change n_e to a bigger number
10
11 #rfc.fit(X_train,y_train)

How to slice a XGBClassifier/XGBRegressor model into sub-models?

This document shows that a model trained with the native XGBoost API can be sliced with the following code:
from sklearn.datasets import make_classification
import xgboost as xgb
booster = xgb.train(
    {'num_parallel_tree': 4, 'subsample': 0.5, 'num_class': 3},
    num_boost_round=num_boost_round, dtrain=dtrain)
sliced: xgb.Booster = booster[3:7]
I tried it and it worked.
Since XGBoost provides a Scikit-Learn wrapper interface, I tried something like this:
from xgboost import XGBClassifier
clf_xgb = XGBClassifier().fit(X_train, y_train)
clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
But I got the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-84155815d877> in <module>
----> 1 clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
AttributeError: 'XGBClassifier' object has no attribute 'Booster'
Since XGBClassifier has no attribute 'Booster', is there any way to slice an XGBClassifier (or XGBRegressor) model trained through the Scikit-Learn wrapper interface?
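The problem is the type hint: clf_xgb.Booster does not name an existing attribute of XGBClassifier, and evaluating that annotation is what raises the AttributeError. The line also slices booster rather than the fitted wrapper. Annotate with xgb.Booster and get the underlying booster from the wrapper with get_booster(). Try: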
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]
instead.

'numpy.float64' object is not callable - hyperparameter tuning

I'm trying to do hyperparameter tuning, and every time I run this code:
from sklearn.model_selection import GridSearchCV
param_grid = {'C':[0,1,1,100,1000], 'kernel':['rbf','poly','sigmoid','linear'],'degree':[1,2,3,4,5,6]}
grid =GridSearchCV(svc.sc(),param_grid)
grid.fit(X_train,y_train)
I get this error
TypeError Traceback (most recent call last)
<ipython-input-64-74de9eeb3cae> in <module>
3
4 param_grid = {'C':[0,1,1,100,1000], 'kernel':['rbf','poly','sigmoid','linear'],'degree':[1,2,3,4,5,6]}
----> 5 grid =GridSearchCV(svc.sc(),param_grid)
6 grid.fit(X_train,y_train)
TypeError: 'numpy.float64' object is not callable
Any idea what to do? Also, svc.sc is how I defined the model.
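What is svc.sc()? The error says that svc.sc is a numpy.float64 value, so it cannot be called. Either way, you're probably not meant to call it at that point: GridSearchCV expects the estimator object itself as its first argument, not the result of a call. If svc.sc is your estimator, drop the parentheses:
grid = GridSearchCV(svc.sc, param_grid)
If svc.sc is a stored number, as the error message suggests, pass the estimator itself instead, for example an SVC instance.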
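A quick way to confirm what the error is pointing at is to check whether the attribute can be called at all. The sketch below uses only the standard library; `FittedModel` and its `sc` attribute are hypothetical stand-ins for the asker's `svc` object (whose real definition is not shown in the question), and a plain `float` stands in for `numpy.float64`.

```python
class FittedModel:
    """Hypothetical stand-in for the asker's `svc` object."""

    def __init__(self, score):
        # `sc` holds a number (for example a stored score), not a method.
        self.sc = score


svc = FittedModel(0.93)

# callable() tells you whether adding parentheses can work at all.
is_callable = callable(svc.sc)

try:
    svc.sc()  # the same mistake as GridSearchCV(svc.sc(), ...)
    message = None
except TypeError as exc:
    message = str(exc)

print(type(svc.sc).__name__, is_callable, message)
```

If `callable(...)` is False and the type is numeric, the name does not refer to a model, and the fix is to pass the estimator object to GridSearchCV rather than that value.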

AttributeError: 'KMeans' object has no attribute 'setK'

I got a strange error when I tested the Spark clustering example from https://runawayhorse001.github.io/LearningApacheSpark/clustering.html.
Example:
from sklearn.cluster import KMeans
import numpy as np
cost = np.zeros(20)
for k in range(2,20):
    kmeans = KMeans()\
        .setK(k)\
        .setSeed(1) \
        .setFeaturesCol("indexedFeatures")\
        .setPredictionCol("cluster")
    model = kmeans.fit(data)
    cost[k] = model.computeCost(data)
It raised an AttributeError on KMeans, even though fit is implemented.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-296a7d54514a> in <module>
2 cost = np.zeros(20)
3 for k in range(2,20):
----> 4 kmeans = KMeans()\
5 .setK(k)\
6 .setSeed(1) \
AttributeError: 'KMeans' object has no attribute 'setK'
I had similar issues in the past and .fit() solved them, but now it is not working.
You're importing the wrong KMeans. I believe the KMeans in that tutorial is the one in Spark ML, not the one in scikit-learn:
from pyspark.ml.clustering import KMeans
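When two libraries export a class with the same name, it can help to check where the imported name actually comes from and whether it has the methods the tutorial uses. The helpers `origin` and `missing_methods` below are hypothetical names made up for this sketch, and `collections.OrderedDict` is only a standard-library stand-in so the snippet runs without Spark or scikit-learn installed.

```python
from collections import OrderedDict


def origin(cls):
    """Return the dotted path of the module and class an imported name refers to."""
    return f"{cls.__module__}.{cls.__qualname__}"


def missing_methods(cls, names):
    """Return the method names from `names` that the class does not provide."""
    return [name for name in names if not hasattr(cls, name)]


# Stand-in demo: OrderedDict has move_to_end but no Spark-style setK.
where = origin(OrderedDict)
missing = missing_methods(OrderedDict, ["move_to_end", "setK"])
print(where)
print(missing)
```

Applied to the question, calling `origin(KMeans)` right after the import should show a module path under sklearn rather than pyspark, and `missing_methods(KMeans, ["setK", "setSeed"])` should list the setters that the scikit-learn class lacks.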

RANSAC algorithm using scikit-learn's RANSACRegressor

I tried to use the code below to fit a robust regression model using RANSAC:
from sklearn.linear_model import RANSACRegressor
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_metric=lambda x: np.sum(np.abs(x), axis=1),
                         residual_threshold=5.0,
                         random_state=0)
ransac.fit(X,y)
And I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-38-832d8b5d351b> in <module>
5 residual_metric=lambda x: np.sum(np.abs(x), axis=1),
6 residual_threshold=5.0,
----> 7 random_state=0)
8 ransac.fit(X,y)
TypeError: __init__() got an unexpected keyword argument 'residual_metric'
Can you help me understand what's wrong?
Most likely this code was written for an old version of scikit-learn. The residual_metric parameter was deprecated in favor of loss and has since been removed. If you run without it, it works OK:
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_threshold=5.0,
                         random_state=0)
ransac
RANSACRegressor(base_estimator=LinearRegression(), min_samples=50,
                random_state=0, residual_threshold=5.0)
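When code copied from an older tutorial fails with "unexpected keyword argument", one way to find the offending names before constructing the object is to compare them against the constructor's signature. This is a minimal standard-library sketch: `unsupported_kwargs` is a hypothetical helper, and `DemoRegressor` is a made-up stand-in whose parameter list is an assumption for illustration, not the real RANSACRegressor signature.

```python
import inspect


def unsupported_kwargs(cls, kwargs):
    """Return the keyword names that the constructor of `cls` does not declare."""
    params = inspect.signature(cls).parameters
    # A **kwargs parameter accepts any keyword, so nothing would be rejected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]


class DemoRegressor:
    """Hypothetical stand-in estimator (not the scikit-learn class)."""

    def __init__(self, estimator=None, max_trials=100, min_samples=None,
                 residual_threshold=None, random_state=None):
        pass


requested = {
    "max_trials": 100,
    "min_samples": 50,
    "residual_metric": None,  # the keyword taken from the old tutorial
    "residual_threshold": 5.0,
    "random_state": 0,
}
rejected = unsupported_kwargs(DemoRegressor, requested)
print(rejected)
```

The same helper can be pointed at the installed `RANSACRegressor` class; any name it returns is a parameter that the installed version does not declare.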
