I'm trying to run a RandomForestClassifier model on my dataset and the error below pops up. Does anyone know a solution? I'm using Spark version 3.3.1 and Python version 3.8.
model_df = output.select(["features","OrderMonth"])
train_df, test_df = model_df.randomSplit([0.7,0.3])
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(numTrees=10, labelCol="OrderMonth").fit(train_df)
rf_pred = rfc.transform(test_df)
rf_pred.show()
Py4JJavaError Traceback (most recent call last)
<ipython-input-56-5ed675f09e07> in <module>
7 from pyspark.ml.classification import RandomForestClassifier
8
----> 9 rfc = RandomForestClassifier(numTrees=10, labelCol="OrderMonth").fit(train_df)#change n_e to a bigger number
10
11 #rfc.fit(X_train,y_train)
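The Py4JJavaError message itself is cut off above, so the actual root cause is not visible. As a first debugging step (an assumption, not a confirmed diagnosis for this traceback), it is worth confirming that the label column is numeric and contains no nulls, since Spark ML classifiers require a numeric label:
# hedged debugging sketch, using the model_df defined above
model_df.printSchema()  # "OrderMonth" should be a numeric type
model_df.filter(model_df["OrderMonth"].isNull()).count()  # should be 0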
This document shows that an XGBoost-API-trained model can be sliced with the following code:
from sklearn.datasets import make_classification
import xgboost as xgb
# setup omitted in the original snippet; these values are illustrative
X, y = make_classification(n_samples=100, n_informative=5, n_classes=3)
dtrain = xgb.DMatrix(data=X, label=y)
num_boost_round = 10
booster = xgb.train({
    'num_parallel_tree': 4, 'subsample': 0.5, 'num_class': 3},
    num_boost_round=num_boost_round, dtrain=dtrain)
sliced: xgb.Booster = booster[3:7]
I tried it and it worked.
Since XGBoost provides a Scikit-Learn wrapper interface, I tried something like this:
from xgboost import XGBClassifier
clf_xgb = XGBClassifier().fit(X_train, y_train)
clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
But I got the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-18-84155815d877> in <module>
----> 1 clf_xgb_sliced: clf_xgb.Booster = booster[3:7]
AttributeError: 'XGBClassifier' object has no attribute 'Booster'
Since XGBClassifier has no attribute 'Booster', is there any way to slice an XGBClassifier (or XGBRegressor) model trained through the Scikit-Learn wrapper interface?
The problem is with the type hint you are giving, clf_xgb.Booster, which does not refer to an existing attribute; the slice also needs to be taken from the underlying booster, which get_booster() returns. Try:
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]
instead.
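For reference, a minimal end-to-end sketch of the fix (the dataset and n_estimators value are made up for illustration; slicing requires enough boosting rounds to cover the slice):
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
X_train, y_train = make_classification(n_samples=100, n_classes=3,
                                       n_informative=5)
clf_xgb = XGBClassifier(n_estimators=10).fit(X_train, y_train)
# get_booster() exposes the underlying xgb.Booster, which supports slicing
clf_xgb_sliced: xgb.Booster = clf_xgb.get_booster()[3:7]
# the slice is a plain Booster, so prediction goes through a DMatrix
preds = clf_xgb_sliced.predict(xgb.DMatrix(X_train))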
The example from https://runawayhorse001.github.io/LearningApacheSpark/clustering.html caused a strange error when I decided to test the clustering example for Spark.
Example:
from sklearn.cluster import KMeans
import numpy as np
cost = np.zeros(20)
for k in range(2, 20):
    kmeans = KMeans()\
        .setK(k)\
        .setSeed(1)\
        .setFeaturesCol("indexedFeatures")\
        .setPredictionCol("cluster")
    model = kmeans.fit(data)
    cost[k] = model.computeCost(data)
And it caused an AttributeError on the KMeans attributes, even though .fit() is already implemented.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-296a7d54514a> in <module>
2 cost = np.zeros(20)
3 for k in range(2,20):
----> 4 kmeans = KMeans()\
5 .setK(k)\
6 .setSeed(1) \
AttributeError: 'KMeans' object has no attribute 'setK'
I had similar issues in the past and .fit() solved them, but now it is not working.
You're importing the wrong KMeans. I believe the KMeans in that example refers to the one in Spark ML, not the one in scikit-learn:
from pyspark.ml.clustering import KMeans
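With that import, the loop runs as written. A minimal sketch (it assumes data is a DataFrame with an "indexedFeatures" vector column, as in the tutorial; since computeCost is deprecated as of Spark 3.0, this reads the equivalent cost from the training summary instead):
from pyspark.ml.clustering import KMeans
import numpy as np
cost = np.zeros(20)
for k in range(2, 20):
    kmeans = KMeans()\
        .setK(k)\
        .setSeed(1)\
        .setFeaturesCol("indexedFeatures")\
        .setPredictionCol("cluster")
    model = kmeans.fit(data)
    # model.computeCost(data) also works but is deprecated since Spark 3.0
    cost[k] = model.summary.trainingCost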
When performing StandardScaler or MinMaxScaler using the PythonAdv kernel, the Jupyter notebook prints an error. However, when using the Python 3 environment, the same Jupyter notebook works fine:
from sklearn.preprocessing import MinMaxScaler
# Scale X values
X_scalar = MinMaxScaler().fit(X_train)
#print(X_scalar)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
Error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-e5dc00a586d3> in <module>
4 X_scalar = MinMaxScaler().fit(X_train)
5 #print(X_scalar)
----> 6 X_train_scaled = X_scaler.transform(X_train)
7 X_test_scaled = X_scaler.transform(X_test)
NameError: name 'X_scaler' is not defined
I have Anaconda 3, Python 3.6, and PythonAdv environments on Git Bash on Windows.
from sklearn.preprocessing import MinMaxScaler
# Scale X values
X_scaler = MinMaxScaler().fit(X_train)
#print(X_scaler)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
There is a small typo: you define X_scalar but then use X_scaler.
I tried to use the code below to fit a robust regression model using RANSAC:
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_metric=lambda x: np.sum(np.abs(x), axis=1),
                         residual_threshold=5.0,
                         random_state=0)
ransac.fit(X, y)
And I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-38-832d8b5d351b> in <module>
5 residual_metric=lambda x: np.sum(np.abs(x), axis=1),
6 residual_threshold=5.0,
----> 7 random_state=0)
8 ransac.fit(X,y)
TypeError: __init__() got an unexpected keyword argument 'residual_metric'
Can you help me figure out what's wrong?
Most likely you got this code from somewhere that was using an old version of scikit-learn. The residual_metric argument has since been removed. If you run without it, it works OK:
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_threshold=5.0,
                         random_state=0)
ransac
RANSACRegressor(base_estimator=LinearRegression(), min_samples=50,
                random_state=0, residual_threshold=5.0)
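If you actually need the old residual_metric behaviour, newer scikit-learn versions expose it through the loss parameter instead. A hedged sketch (the string spelling varies by version: "absolute_error" in recent releases, "absolute_loss" in older ones; a callable taking (y_true, y_pred) is also accepted):
from sklearn.linear_model import RANSACRegressor, LinearRegression
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         loss="absolute_error",  # or a (y_true, y_pred) callable
                         residual_threshold=5.0,
                         random_state=0)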
I am new to PySpark. I am using the logistic regression API. I followed some tutorials and worked this way:
from pyspark.ml.classification import LogisticRegression
train, test = df.randomSplit([0.80, 0.20], seed = some_seed)
LR = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=some_iter)
LR_model = LR.fit(train)
When I call
trainingSummary = LR_model.summary
trainingSummary.roc
I get
--------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-319-bf79768ab64e> in <module>()
1 trainingSummary = LR_model.summary
2
----> 3 trainingSummary.roc
AttributeError: 'LogisticRegressionTrainingSummary' object has no attribute 'roc'
Does anyone have an idea?
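One possibility, inferred from the class name in the traceback rather than confirmed: roc is only defined on BinaryLogisticRegressionTrainingSummary, and LR_model.summary is the plain LogisticRegressionTrainingSummary (without roc) when the label column has more than two classes. A quick check, assuming the train DataFrame and "label" column from the snippet above:
train.select("label").distinct().count()  # roc requires this to be 2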