PySpark: Random forest featureSubsetStrategy not accepting int or float

I'm building a random forest classifier using PySpark. I want to set featureSubsetStrategy to a number rather than auto, sqrt, etc. The documentation states:
featureSubsetStrategy = Param(parent='undefined', name='featureSubsetStrategy', doc='The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n].')
However, when I choose a number such as 0.2, I get the following error:
TypeError: Invalid param value given for param "featureSubsetStrategy". Could not convert <class 'float'> to string type
The same happens if I use featureSubsetStrategy=5. How do you set it to an int or float?
Example:
from pyspark.ml.classification import RandomForestClassifier

# setting target label
label_col = 'veh_pref_Economy'

# random forest parameters
max_depth = 2
subset_strategy = 0.2037
impurity = 'gini'
min_instances_per_node = 41
num_trees = 1
seed = 1246

rf_econ_gen = (RandomForestClassifier()
               .setLabelCol(label_col)
               .setFeaturesCol("features")
               .setMaxDepth(max_depth)
               .setFeatureSubsetStrategy(subset_strategy)
               .setImpurity(impurity)
               .setMinInstancesPerNode(min_instances_per_node)
               .setNumTrees(num_trees)
               .setSeed(seed))
This returns:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/spark-2.2.1-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in _set(self, **kwargs)
418 try:
--> 419 value = p.typeConverter(value)
420 except TypeError as e:
~/spark-2.2.1-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in toString(value)
203 else:
--> 204 raise TypeError("Could not convert %s to string type" % type(value))
205
TypeError: Could not convert <class 'float'> to string type
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-28-71b9c2a0f1a0> in <module>()
3 .setFeaturesCol("features")
4 .setMaxDepth(max_depth)
----> 5 .setFeatureSubsetStrategy(subset_strategy)
6 .setImpurity(impurity)
7 .setMinInstancesPerNode(min_instances_per_node)
~/spark-2.2.1-bin-hadoop2.7/python/pyspark/ml/regression.py in setFeatureSubsetStrategy(self, value)
632 Sets the value of :py:attr:`featureSubsetStrategy`.
633 """
--> 634 return self._set(featureSubsetStrategy=value)
635
636 @since("1.4.0")
~/spark-2.2.1-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in _set(self, **kwargs)
419 value = p.typeConverter(value)
420 except TypeError as e:
--> 421 raise TypeError('Invalid param value given for param "%s". %s' % (p.name, e))
422 self._paramMap[p] = value
423 return self
TypeError: Invalid param value given for param "featureSubsetStrategy". Could not convert <class 'float'> to string type

Pass it as a string; the parameter's type converter only accepts strings, so the fraction (or feature count) must be quoted.
subset_strategy = "0.2037"
rf_econ_gen = (RandomForestClassifier()
               .setFeatureSubsetStrategy(subset_strategy))
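For completeness, a sketch of the full chain with the strategy quoted (same variables as in the question); an integer feature count would likewise be passed as a string such as "5":

from pyspark.ml.classification import RandomForestClassifier

# featureSubsetStrategy is a string-typed Param, so fractions in (0.0, 1.0]
# and counts in [1, n] must be given as their string representations
subset_strategy = "0.2037"

rf_econ_gen = (RandomForestClassifier()
               .setLabelCol(label_col)
               .setFeaturesCol("features")
               .setMaxDepth(max_depth)
               .setFeatureSubsetStrategy(subset_strategy)
               .setImpurity(impurity)
               .setMinInstancesPerNode(min_instances_per_node)
               .setNumTrees(num_trees)
               .setSeed(seed))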

Related

Out-of-sample forecasting

I have the following code to perform an out-of-sample assessment of a time series. The idea is to use recursive and rolling forecasts to calculate the MAPE and MSFE.
The code is as follows:
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

long = len(y)
n_estimation = 83
real = y[n_estimation:len(y)]
n_forecasting = long - n_estimation
horizontes = 2

predicc = np.zeros((horizontes, n_forecasting))
MSFE = np.zeros((horizontes, 1))
MAPE = np.zeros((horizontes, 1))

for Periods_ahead in range(horizontes):
    for i in range(0, n_forecasting):
        aux_y = y[0:(n_estimation - Periods_ahead + i)]
        model = SARIMAX(endog=aux_y, order=(1, 1, 0), seasonal_order=(1, 1, 0, 4))
        model_fit = model.fit(disp=0)
        y_pred = model_fit.forecast(Periods_ahead + 1)
        predicc[Periods_ahead][i] = y_pred[0][Periods_ahead]
    error = np.array(real) - predicc[Periods_ahead]
    MSFE[Periods_ahead] = np.mean(error**2)
    MAPE[Periods_ahead] = np.mean(np.abs(error / np.array(real))) * 100

df_pred = pd.DataFrame({"V1": predicc[0], "V2": predicc[1]})
print("MSFE", MSFE)
print("MAPE %", MAPE)
I am getting the following error, most likely related to using a newer version of SARIMAX.
ValueError Traceback (most recent call last)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexes\range.py:392, in RangeIndex.get_loc(self, key, method, tolerance)
391 try:
--> 392 return self._range.index(new_key)
393 except ValueError as err:
ValueError: 0 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
c:\Users\dianaf\OneDrive - Microsoft\Documents\GitHub\big_data_operations\Homework2.ipynb Cell 36 in <cell line: 13>()
17 model_fit=model.fit(disp=0)
18 y_pred = fit.forecast(Periods_ahead + 1)
---> 19 predicc[Periods_ahead][i] = y_pred[0][Periods_ahead]
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\series.py:982, in Series.__getitem__(self, key)
979 return self._values[key]
981 elif key_is_scalar:
--> 982 return self._get_value(key)
984 if is_hashable(key):
985 # Otherwise index.get_value will raise InvalidIndexError
986 try:
987 # For labels that don't resolve as scalars like tuples and frozensets
...
--> 394 raise KeyError(key) from err
395 self._check_indexing_error(key)
396 raise KeyError(key)
KeyError: 0
Any idea how to fix it without downgrading to a previous version of statsmodels?
Thank you!
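A likely fix without downgrading, assuming the newer statsmodels returns the forecast as a pandas Series rather than a plain array: y_pred[0] then performs a label lookup (hence KeyError: 0), so index the result positionally with .iloc instead:

y_pred = model_fit.forecast(Periods_ahead + 1)
# forecast(h) returns a Series of length h; take the last step by position,
# not by label, to get the (Periods_ahead + 1)-step-ahead value
predicc[Periods_ahead][i] = y_pred.iloc[Periods_ahead]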

Pandas - TypeError: Cannot perform 'rand_' with a dtyped [bool] array and scalar of type [bool]

I wanted to change the value of a cell based on the value of another cell, and used this code: dfT.loc[dfT.state == "CANCELLED" & (dfT.Activity != "created"), "Activity"] = "cancelled"
This is an example table:

ID  Activity   state
1   created    CANCELLED
1   completed  CANCELLED
2   created    FINISHED
2   completed  FINISHED
3   created    REJECTED
3   rejected   REJECTED
It raises a TypeError like this:
TypeError Traceback (most recent call last)
~\miniconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op)
264 # (xint or xbool) and (yint or bool)
--> 265 result = op(x, y)
266 except TypeError:
~\miniconda3\lib\site-packages\pandas\core\ops\roperator.py in rand_(left, right)
51 def rand_(left, right):
---> 52 return operator.and_(right, left)
53
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~\miniconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op)
278 try:
--> 279 result = libops.scalar_binop(x, y, op)
280 except (
pandas\_libs\ops.pyx in pandas._libs.ops.scalar_binop()
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'bool'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-6-350c55a06fa7> in <module>
4 # dfT2 = dfT1[dfT1.Activity != 'created']
5 # df.loc[(df.state == "CANCELLED") & (df.Activity != "created"), "Activity"] = "cancelled"
----> 6 dfT.loc[dfT.state == "CANCELLED" & (dfT.Activity != "created"), "Activity"] = "cancelled"
7 dfT
~\miniconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
63 other = item_from_zerodim(other)
64
---> 65 return method(self, other)
66
67 return new_method
~\miniconda3\lib\site-packages\pandas\core\arraylike.py in __rand__(self, other)
61 @unpack_zerodim_and_defer("__rand__")
62 def __rand__(self, other):
---> 63 return self._logical_method(other, roperator.rand_)
64
65 @unpack_zerodim_and_defer("__or__")
~\miniconda3\lib\site-packages\pandas\core\series.py in _logical_method(self, other, op)
4987 rvalues = extract_array(other, extract_numpy=True)
4988
-> 4989 res_values = ops.logical_op(lvalues, rvalues, op)
4990 return self._construct_result(res_values, name=res_name)
4991
~\miniconda3\lib\site-packages\pandas\core\ops\array_ops.py in logical_op(left, right, op)
353 filler = fill_int if is_self_int_dtype and is_other_int_dtype else fill_bool
354
--> 355 res_values = na_logical_op(lvalues, rvalues, op)
356 # error: Cannot call function of unknown type
357 res_values = filler(res_values) # type: ignore[operator]
~\miniconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op)
286 ) as err:
287 typ = type(y).__name__
--> 288 raise TypeError(
289 f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
290 f"and scalar of type [{typ}]"
If anyone understands what my mistake is, please help.
Thanks in advance.
-Alde
You need to wrap each condition in parentheses: & binds more tightly than ==, so without them the expression parses as dfT.state == ("CANCELLED" & (dfT.Activity != "created")).
Use:
dfT.loc[(dfT.state == "CANCELLED") & (dfT.Activity != "created"), "Activity"] = "cancelled"
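A quick check, rebuilding the example table from the question:

import pandas as pd

dfT = pd.DataFrame({
    "ID":       [1, 1, 2, 2, 3, 3],
    "Activity": ["created", "completed", "created", "completed", "created", "rejected"],
    "state":    ["CANCELLED", "CANCELLED", "FINISHED", "FINISHED", "REJECTED", "REJECTED"],
})

# each comparison is parenthesized so it evaluates before & combines the masks
dfT.loc[(dfT.state == "CANCELLED") & (dfT.Activity != "created"), "Activity"] = "cancelled"
print(dfT)  # the (ID=1, completed, CANCELLED) row now has Activity == "cancelled"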

ValueError: Object arrays cannot be loaded when allow_pickle=False

I'm hoping to get a solution for the error raised by this line:
much_data = np.load('muchdata-50-50-20.npy')
output:
ValueError Traceback (most recent call last)
<ipython-input-6-6710fe7f2bb7> in <module>
----> 1 much_data = np.load('muchdata-50-50-20.npy')
~\anaconda3\envs\tf-gpu-cuda8\lib\site-packages\numpy\lib\npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
437 return format.open_memmap(file, mode=mmap_mode)
438 else:
--> 439 return format.read_array(fid, allow_pickle=allow_pickle,
440 pickle_kwargs=pickle_kwargs)
441 else:
~\anaconda3\envs\tf-gpu-cuda8\lib\site-packages\numpy\lib\format.py in read_array(fp, allow_pickle, pickle_kwargs)
725 # The array contained Python objects. We need to unpickle the data.
726 if not allow_pickle:
--> 727 raise ValueError("Object arrays cannot be loaded when "
728 "allow_pickle=False")
729 if pickle_kwargs is None:
ValueError: Object arrays cannot be loaded when allow_pickle=False
Please let me know the solution for this.
Try:
much_data = np.load('muchdata-50-50-20.npy', allow_pickle=True)
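Note that allow_pickle defaults to False for a reason: unpickling can execute arbitrary code, so only re-enable it for .npy files you created yourself or otherwise trust.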

geoseries.to_crs() failed when use geopandas to calculate distance

I want to calculate the distance between two lat/lon points using geopandas series.distance, with the result measured in meters. I know I should define a CRS first, so I tried to_crs() several times, but it raises the error b'no arguments in initialization list'; it seems the function never works. Can anyone help me with this problem?
from fiona.crs import from_epsg  # assuming from_epsg comes from fiona.crs

def wgs84_to_CGCS2000(df, code):
    result = df.to_crs(from_epsg(code))
    return result

city = wgs84_to_CGCS2000(city, 4549)
kfc = wgs84_to_CGCS2000(kfc, 4549)
RuntimeError Traceback (most recent call last)
<ipython-input-42-c0d1c4e6af6a> in <module>
2 result=df.to_crs(from_epsg(code))
3 return result
----> 4 city=wgs84_to_CGCS2000(city,4549)
5 kfc=wgs84_to_CGCS2000(kfc,4549)
<ipython-input-42-c0d1c4e6af6a> in wgs84_to_CGCS2000(df, code)
1 def wgs84_to_CGCS2000(df,code):
----> 2 result=df.to_crs(from_epsg(code))
3 return result
4 city=wgs84_to_CGCS2000(city,4549)
5 kfc=wgs84_to_CGCS2000(kfc,4549)
C:\ProgramData\Anaconda3\lib\site-packages\geopandas\geodataframe.py in to_crs(self, crs, epsg, inplace)
441 else:
442 df = self.copy()
--> 443 geom = df.geometry.to_crs(crs=crs, epsg=epsg)
444 df.geometry = geom
445 df.crs = geom.crs
C:\ProgramData\Anaconda3\lib\site-packages\geopandas\geoseries.py in to_crs(self, crs, epsg)
302 except TypeError:
303 raise TypeError('Must set either crs or epsg for output.')
--> 304 proj_in = pyproj.Proj(self.crs, preserve_units=True)
305 proj_out = pyproj.Proj(crs, preserve_units=True)
306 project = partial(pyproj.transform, proj_in, proj_out)
C:\ProgramData\Anaconda3\lib\site-packages\pyproj\__init__.py in __new__(self, projparams, preserve_units, **kwargs)
360 # on case-insensitive filesystems).
361 projstring = projstring.replace('EPSG','epsg')
--> 362 return _proj.Proj.__new__(self, projstring)
363
364 def __call__(self, *args, **kw):
_proj.pyx in _proj.Proj.__cinit__()
RuntimeError: b'no arguments in initialization list'
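The traceback ends in pyproj.Proj(self.crs, ...), which fails when the GeoDataFrame's crs was never set: to_crs() needs a source CRS to reproject from. A minimal sketch of a fix, assuming the coordinates are WGS84 lat/lon (EPSG:4326) and a geopandas version recent enough to have set_crs():

# city and kfc are the GeoDataFrames from the question
city = city.set_crs(epsg=4326)  # declare the source CRS (assumed WGS84)
kfc = kfc.set_crs(epsg=4326)

city = city.to_crs(epsg=4549)   # reproject to a projected CGCS2000 CRS (meters)
kfc = kfc.to_crs(epsg=4549)

On the older geopandas shown in the traceback (which predates set_crs), assigning city.crs = {'init': 'epsg:4326'} before calling to_crs() serves the same purpose.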

ValueError invalid literal error with Python Multiprocessing Pool

I have a function that I'm trying to run in parallel. The function is of the form:
from multiprocessing import Pool

def parallelFunc(curUser):
    if curUser in neighbors.getUsers():  # neighbors is a global object of a class
        userData = createData.createData(inpMat1, inpMat2, inpMat3,
                                         neighbors.getNeighbors(curUser))
        # the inpMatX are numpy arrays / scipy sparse arrays / lists with global scope
        return userData  # tried returning a double value too, get the same error
    else:
        return 0

pp = Pool(4)  # tried with different values
ret = pp.map(parallelFunc, userList)
When I try running this, I get the following error:
ValueError: invalid literal for float(): 1.235443508738e
The error is raised in multiprocessing/pool.pyc. I'm doing this in an IPython notebook. Any ideas as to why this would happen?
Stack trace:
ValueError Traceback (most recent call last)
<ipython-input-99-2731048b72e2> in <module>()
3
4 #st = time.time()
----> 5 ret = pp.map(parallelFunc, userList)
6 #ft = time.time()
7
/opt/Anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
249 '''
250 assert self._state == RUN
--> 251 return self.map_async(func, iterable, chunksize).get()
252
253 def imap(self, func, iterable, chunksize=1):
/opt/Anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
565 return self._value
566 else:
--> 567 raise self._value
568
569 def _set(self, i, obj):
ValueError: invalid literal for float(): 1.34716296703.978260894942e+06
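The doubled-up number in the final message (1.34716296703.978260894942e+06 reads like two floats concatenated) suggests the real exception is raised inside a worker and garbled on the way back to the parent. A simple way to isolate it, assuming userList and the global objects are already defined in the notebook, is to call the function serially first so the failing call produces its own clean traceback:

# running outside the Pool surfaces the original exception directly
for curUser in userList[:5]:
    parallelFunc(curUser)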
