I'm trying to run a RandomForestClassifier model on my dataset and the error below pops up. Does anyone know a solution? I'm using Spark version 3.3.1 and Python version 3.8.
model_df = output.select(["features","OrderMonth"])
train_df, test_df = model_df.randomSplit([0.7,0.3])
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(numTrees=10, labelCol="OrderMonth").fit(train_df)
rf_pred = rfc.transform(test_df)
rf_pred.show()
Py4JJavaError Traceback (most recent call last)
<ipython-input-56-5ed675f09e07> in <module>
7 from pyspark.ml.classification import RandomForestClassifier
8
----> 9 rfc = RandomForestClassifier(numTrees=10, labelCol="OrderMonth").fit(train_df)#change n_e to a bigger number
10
11 #rfc.fit(X_train,y_train)
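The Py4JJavaError message itself is cut off above, so the actual root cause is not visible. As a first debugging step (an assumption, not a confirmed diagnosis for this traceback), it is worth confirming that the label column is numeric and contains no nulls, since Spark ML classifiers require a numeric label:
# hedged debugging sketch, using the model_df defined above
model_df.printSchema()  # "OrderMonth" should be a numeric type
model_df.filter(model_df["OrderMonth"].isNull()).count()  # should be 0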
This document shows that an XGBoost-API-trained model can be sliced with the following code:
from sklearn.datasets import make_classification
import xgboost as xgb
# setup omitted in the original snippet; these values are illustrative
X, y = make_classification(n_samples=100, n_informative=5, n_classes=3)
dtrain = xgb.DMatrix(data=X, label=y)
num_boost_round = 10
booster = xgb.train({
    'num_parallel_tree': 4, 'subsample': 0.5, 'num_class': 3},
    num_boost_round=num_boost_round, dtrain=dtrain)
sliced: xgb.Booster = booster[3:7]
I tried it and it worked.
Since XGBoost provides a Scikit-Learn wrapper interface, I tried something like this:
from xgboost import XGBClassifier
clf_xgb = XGBClassifier().fit(X_train, y_train)
clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
But I got the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-84155815d877> in <module>
----> 1 clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
AttributeError: 'XGBClassifier' object has no attribute 'Booster'
Since XGBClassifier has no attribute 'Booster', is there any way to slice an XGBClassifier (or XGBRegressor) model trained through the Scikit-Learn wrapper interface?
The problem is with the type hint you are giving, clf_xgb.Booster, which does not refer to an existing attribute; the slice also needs to be taken from the underlying booster, which get_booster() returns. Try:
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]
instead.
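For reference, a minimal end-to-end sketch of the fix (the dataset and n_estimators value are made up for illustration; slicing requires enough boosting rounds to cover the slice):
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
X_train, y_train = make_classification(n_samples=100, n_classes=3,
                                       n_informative=5)
clf_xgb = XGBClassifier(n_estimators=10).fit(X_train, y_train)
# get_booster() exposes the underlying xgb.Booster, which supports slicing
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]
# the slice is a plain Booster, so prediction goes through a DMatrix
preds = clf_xgb_sliced.predict(xgb.DMatrix(X_train))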
The example from https://runawayhorse001.github.io/LearningApacheSpark/clustering.html caused a strange error when I decided to test the clustering example for Spark.
Example:
from sklearn.cluster import KMeans
import numpy as np
cost = np.zeros(20)
for k in range(2, 20):
    kmeans = KMeans()\
        .setK(k)\
        .setSeed(1)\
        .setFeaturesCol("indexedFeatures")\
        .setPredictionCol("cluster")
    model = kmeans.fit(data)
    cost[k] = model.computeCost(data)
And it caused an AttributeError on the KMeans attributes, even though .fit() is already implemented.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-296a7d54514a> in <module>
2 cost = np.zeros(20)
3 for k in range(2,20):
----> 4 kmeans = KMeans()\
5 .setK(k)\
6 .setSeed(1) \
AttributeError: 'KMeans' object has no attribute 'setK'
I had similar issues in the past and .fit() solved them, but now it is not working.
You're importing the wrong KMeans. I believe the KMeans in that example refers to the one in Spark ML, not the one in scikit-learn:
from pyspark.ml.clustering import KMeans
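With that import, the loop runs as written. A minimal sketch (it assumes data is a DataFrame with an "indexedFeatures" vector column, as in the tutorial; since computeCost is deprecated as of Spark 3.0, this reads the equivalent cost from the training summary instead):
from pyspark.ml.clustering import KMeans
import numpy as np
cost = np.zeros(20)
for k in range(2, 20):
    kmeans = KMeans()\
        .setK(k)\
        .setSeed(1)\
        .setFeaturesCol("indexedFeatures")\
        .setPredictionCol("cluster")
    model = kmeans.fit(data)
    # model.computeCost(data) also works but is deprecated since Spark 3.0
    cost[k] = model.summary.trainingCost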
When performing StandardScaler or MinMaxScaler using the PythonAdv kernel, the Jupyter notebook prints an error. However, when using the Python 3 environment, the same Jupyter notebook works fine:
from sklearn.preprocessing import MinMaxScaler
# Scale X values
X_scalar = MinMaxScaler().fit(X_train)
#print(X_scalar)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
Error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-e5dc00a586d3> in <module>
4 X_scalar = MinMaxScaler().fit(X_train)
5 #print(X_scalar)
----> 6 X_train_scaled = X_scaler.transform(X_train)
7 X_test_scaled = X_scaler.transform(X_test)
NameError: name 'X_scaler' is not defined
I have Anaconda 3, Python 3.6, and PythonAdv environments on Git Bash on Windows.
from sklearn.preprocessing import MinMaxScaler
# Scale X values
X_scaler = MinMaxScaler().fit(X_train)
#print(X_scaler)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
There is a small typo: you define X_scalar but then use X_scaler.
I tried to use the code below to fit a robust regression model using RANSAC:
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_metric=lambda x: np.sum(np.abs(x), axis=1),
                         residual_threshold=5.0,
                         random_state=0)
ransac.fit(X, y)
And I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-38-832d8b5d351b> in <module>
5 residual_metric=lambda x: np.sum(np.abs(x), axis=1),
6 residual_threshold=5.0,
----> 7 random_state=0)
8 ransac.fit(X,y)
TypeError: __init__() got an unexpected keyword argument 'residual_metric'
Can you help me figure out what's wrong?
Most likely you got this code from somewhere that was using an old version of scikit-learn. The residual_metric argument has since been removed. If you run without it, it works OK:
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_threshold=5.0,
                         random_state=0)
ransac
RANSACRegressor(base_estimator=LinearRegression(), min_samples=50,
                random_state=0, residual_threshold=5.0)
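If you actually need the old residual_metric behaviour, newer scikit-learn versions expose it through the loss parameter instead. A hedged sketch (the string spelling varies by version: "absolute_error" in recent releases, "absolute_loss" in older ones; a callable taking (y_true, y_pred) is also accepted):
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         loss="absolute_error",  # or a (y_true, y_pred) callable
                         residual_threshold=5.0,
                         random_state=0)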
I am new to PySpark. I am using the logistic regression API. I followed some tutorials and worked this way:
from pyspark.ml.classification import LogisticRegression
train, test = df.randomSplit([0.80, 0.20], seed = some_seed)
LR = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=some_iter)
LR_model = LR.fit(train)
When I call
trainingSummary = LR_model.summary
trainingSummary.roc
I get
--------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-319-bf79768ab64e> in <module>()
1 trainingSummary = LR_model.summary
2
----> 3 trainingSummary.roc
AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'roc'
Does anyone have an idea?
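One possibility, inferred from the class name in the traceback rather than confirmed: roc is only defined on BinaryLogisticRegressionTrainingSummary, and LR_model.summary is the plain LogisticRegressionTrainingSummary (without roc) when the label column has more than two classes. A quick check, assuming the train DataFrame and "label" column from the snippet above:
train.select("label").distinct().count()  # roc requires this to be 2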