Is it possible to optimize hyperparameters for optional sklearn pipeline steps? - python

I tried to construct a pipeline that has some optional steps. However, I would like to optimize hyperparameters for those steps as I want to get the best option between not using them and using them with different configurations (in my case SelectFromModel - sfm).
clf = RandomForestRegressor(random_state=1)
stdscl = StandardScaler()
sfm = SelectFromModel(RandomForestRegressor(random_state=1))

p_grid_lr = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    "sfm": ['passthrough', sfm],
    "sfm__max_depth": [10, 50, 100, None],
    "sfm__n_estimators": [10, 50, 100, 200, 500, 800],
    "sfm__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
}

pipeline = Pipeline([
    ('scl', stdscl),
    ('sfm', sfm),
    ('clf', clf),
])

gs_clf = GridSearchCV(estimator=pipeline, param_grid=p_grid_lr,
                      cv=KFold(shuffle=True, n_splits=5, random_state=1),
                      scoring='r2', n_jobs=-1)
gs_clf.fit(X_train, y_train)
clf = gs_clf.best_estimator_
The error that I get is 'string' object has no attribute 'set_params', which is understandable. Is there a way to specify which combinations should be tried together, in my case only 'passthrough' by itself and sfm with its different hyperparameters?
Thanks!

As suggested by @Robin, you might define p_grid_lr as a list of dictionaries. Indeed, here is what the docs of GridSearchCV state on this point:
param_grid: dict or list of dictionaries
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
p_grid_lr = [
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm__estimator__max_depth": [10, 50, 100, None],
        "sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
        "sfm__estimator__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    },
    {
        "clf__max_depth": [10, 50, 100, None],
        "clf__n_estimators": [10, 50, 100, 200, 500, 800],
        "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
        "sfm": ['passthrough'],
    },
]
A less scalable alternative (for your case) might be the following:
p_grid_lr_ = {
    "clf__max_depth": [10, 50, 100, None],
    "clf__n_estimators": [10, 50, 100, 200, 500, 800],
    "clf__max_features": [0.1, 0.5, 1.0, 'sqrt', 'log2'],
    "sfm": ['passthrough',
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=10, max_features=0.1)),
            SelectFromModel(RandomForestRegressor(random_state=1, max_depth=10, n_estimators=50, max_features=0.1)),
            ...]
}
specifying all of the possible combinations for your parameters.
Moreover, be aware that to access the parameters max_depth, n_estimators and max_features of the RandomForestRegressor estimator within SelectFromModel, you should write the parameters as
"sfm__estimator__max_depth": [10, 50, 100, None],
"sfm__estimator__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__estimator__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
rather than as
"sfm__max_depth": [10, 50, 100, None],
"sfm__n_estimators": [10, 50, 100, 200, 500, 800],
"sfm__max_features": [0.1, 0.5, 1.0,'sqrt','log2']
because these parameters belong to the estimator itself (max_features may in principle also be a parameter of SelectFromModel, but in that case it may only take integer values, according to the docs).
In general, you can list all the parameters available for tuning via pipeline.get_params().keys() (or, for any estimator, estimator.get_params().keys()).
Finally, the user guide for Pipelines is a nice read on this topic.
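Putting the pieces together, here is a minimal runnable sketch of the list-of-dicts approach. The toy data X, y and the deliberately shrunken grids are invented purely for illustration:

```python
# One grid keeps the sfm step and tunes its inner estimator; the other
# replaces the step with 'passthrough'. Data and grid values are toy examples.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
X, y = rng.rand(40, 5), rng.rand(40)

pipeline = Pipeline([
    ('scl', StandardScaler()),
    ('sfm', SelectFromModel(RandomForestRegressor(n_estimators=10, random_state=1))),
    ('clf', RandomForestRegressor(n_estimators=10, random_state=1)),
])

p_grid_lr = [
    # Grid 1: keep the sfm step and tune its inner estimator.
    {'sfm__estimator__max_depth': [2, None]},
    # Grid 2: drop the sfm step entirely.
    {'sfm': ['passthrough']},
]

gs_clf = GridSearchCV(pipeline, p_grid_lr,
                      cv=KFold(shuffle=True, n_splits=3, random_state=1),
                      scoring='r2')
gs_clf.fit(X, y)
print(gs_clf.best_params_)
```

The search evaluates three candidates in total: two from the first grid and one from the second, so 'passthrough' competes directly with the tuned SelectFromModel configurations.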

Referring to this example, you could just make a list of dictionaries: one containing sfm and its related parameters, and the other one using 'passthrough' instead.


Not being able to run grid search for Prophet on AWS EC2 instance

EC2 instance used: d2.4xlarge on EU servers
As the question says, I tried to run a grid search with multiple configs for FBProphet on an AWS EC2 Ubuntu instance and was not able to.
On my laptop it runs fine, just slowly, and that's why I want to do the grid search on the VM.
Could you please help me out? I think the problem is that the VM apparently only uses a single vCPU (I do not know why this happens).
Moreover, I tried disabling parallel processing on Prophet's side, setting it to None/Threads/Processes, and nothing changed; I still got the same error.
Can somebody please help me out? I am stuck.
The error I get:
10:52:44 - cmdstanpy - INFO - CmdStan done processing.
Traceback (most recent call last):
  File "demo_prophet.py", line 72, in <module>
    m = Prophet(**params).fit(df_to_process)  # Fit model with given params
  File "/home/ubuntu/.local/lib/python3.8/site-packages/prophet/forecaster.py", line 1169, in fit
    self.params = self.stan_backend.sampling(stan_init, dat, self.mcmc_samples, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/prophet/models.py", line 140, in sampling
    self.stan_fit = self.model.sample(**args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/cmdstanpy/model.py", line 1188, in sample
    raise RuntimeError(msg)
RuntimeError: Error during sampling
Actual code snippet:
param_grid = {
    'growth': ["linear", "logistic", "flat"],
    'changepoint_range': [0.1, 0.5, 0.6, 0.8, 0.9],
    'changepoint_prior_scale': [0.1, 0.5, 0.6, 0.9, 1, 10, 20],
    'seasonality_prior_scale': [0.1, 0.5, 0.6, 1, 10.4, 20, 50],
    'holidays_prior_scale': [1, 3, 7, 8, 9, 10, 10.4, 15, 20, 50],
    'n_changepoints': [1, 5, 10, 25, 50, 70, 100, 500],
    "mcmc_samples": [0, 1, 6, 10, 20, 50, 70],
    "interval_width": [0.1, 0.5, 0.9],
    "uncertainty_samples": [0, 1, 5, 10, 50, 100],
}
name = "Amo"
df_to_process = dataframe_FKG79[dataframe_FKG79["Name"] == name]
if df_to_process.shape[0] >= 2:
    df_to_process.drop("Name", axis=1, inplace=True)
    df_to_process.reset_index(inplace=True)
    df_to_process.columns = ["ds", "y"]
    all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
    rmses = []  # Store the RMSEs for each params here
    for params in all_params:
        m = Prophet(**params).fit(df_to_process)  # Fit model with given params
        cutoffs = [pd.to_datetime('2021-12-10'), pd.to_datetime('2021-12-31'), pd.to_datetime('2022-01-10')]
        df_cv = cross_validation(m, initial='950 days', cutoffs=cutoffs, horizon='28 days')
        df_p = performance_metrics(df_cv, rolling_window=1)
        rmses.append(df_p['rmse'].values[0])
    # Find the best parameters
    tuning_results = pd.DataFrame(all_params)
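As an aside, the grid-expansion idiom used above can be seen on a tiny grid. The parameter names and values below are hypothetical, purely for illustration; itertools.product enumerates every combination:

```python
# itertools.product expands a dict of value lists into every combination.
import itertools

param_grid = {'growth': ['linear', 'flat'], 'n_changepoints': [1, 5, 10]}
all_params = [dict(zip(param_grid.keys(), v))
              for v in itertools.product(*param_grid.values())]
print(len(all_params))  # 2 * 3 = 6 combinations
print(all_params[0])    # {'growth': 'linear', 'n_changepoints': 1}
```

Note that the full grid in the question multiplies out to millions of combinations, which is worth keeping in mind regardless of the EC2 issue.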

How to compare int with list and receive index using np.where?

I am currently trying to get the index position in costShirts where the value in costShirts[0] is the largest number that's still smaller than the user input. I keep running into the error ">= not supported between instances of 'int' and 'list'".
I have been trying to convert orderQuantity into a list, but that has not been working either.
costShirts = [[1, 12, 48, 72], [9.19, 6.09, 5.75, 5.35]]
costPrints = [[1, 3, 6, 12, 24, 36, 48, 72, 144, 300, 500, 1000, 2500, 5000, 10000], [15, 10, 8, 5, 2.5, 2, 1.75, 1.5, 1.25, 1, 0.75, 0.65, 0.6, 0.55, 0.5]]
probDemand = [[0, 25, 50, 75, 100, 125, 150, 175], [0.05, 0.15, 0.23, 0.27, 0.15, 0.07, 0.05, 0.03]]
shirtPrice = [15.99]
afterPrice = [5.00]
orderSetup = [200]
orderQuantity = int(input("Enter the order quantity: "))
susDemand = input("Enter the suspected demand: ")
if susDemand == "":
    dq = int(np.random.choice(probDemand[0], 1, probDemand[1]))
else:
    dq = int(susDemand)
shirtPos = max(np.where(orderQuantity >= costShirts[0])[0])
The error message is actually very helpful in this case. Investigate both sides of that >= expression, and you will find that costShirts[0] is a list. You are trying to do a numpy-style vectorized comparison with a Python list, which is not supported.
Two suggestions to correct this: convert costShirts to an ndarray up front (fine if you don't mind the first row being converted to float along with the second), or convert costShirts[0] to an ndarray in place. Either way, the conversion to an ndarray is done with np.array(<list>).
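A minimal sketch of the second suggestion, converting costShirts[0] in place. The orderQuantity value of 50 is assumed here for illustration (the original reads it from input()):

```python
# Compare a scalar against an ndarray instead of a plain list.
import numpy as np

costShirts = [[1, 12, 48, 72], [9.19, 6.09, 5.75, 5.35]]
orderQuantity = 50  # assumed value; the question reads this from input()

qty_breaks = np.array(costShirts[0])  # ndarray supports vectorized >=
shirtPos = max(np.where(orderQuantity >= qty_breaks)[0])
print(shirtPos)  # 2 -> 48 is the largest quantity break not exceeding 50
```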

Merge sorting a 2d array

I'm stuck again on trying to make this merge sort work.
Currently, I have a 2d array with a Unix timecode (fig 1) and I am merge sorting it using (fig 2). I am trying to check the first value in each row, i.e. array[x][0], and then move the whole row depending on the value of array[x][0]. However, the merge sort duplicates some data and deletes other data (fig 3). My question is: what am I doing wrong? I know it's the merge sort, but I can't see the fix.
fig 1
[[1422403200 100]
[1462834800 150]
[1458000000 25]
[1540681200 150]
[1498863600 300]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]]
fig 2
import numpy as np

def sort(data):
    if len(data) > 1:
        Mid = len(data) // 2
        l = data[:Mid]
        r = data[Mid:]
        sort(l)
        sort(r)
        z = 0
        x = 0
        c = 0
        while z < len(l) and x < len(r):
            if l[z][0] < r[x][0]:
                data[c] = l[z]
                z += 1
            else:
                data[c] = r[x]
                x += 1
            c += 1
        while z < len(l):
            data[c] = l[z]
            z += 1
            c += 1
        while x < len(r):
            data[c] = r[x]
            x += 1
            c += 1
        print(data, 'done')

unixdate = [1422403200, 1462834800, 1458000000, 1540681200, 1498863600, 1540771200, 1540771200, 1540771200, 1540771200, 1540771200]
price = [100, 150, 25, 150, 300, 100, 100, 100, 100, 100]
array = np.column_stack((unixdate, price))
sort(array)
print(array, 'sorted')
fig 3
[[1422403200 100]
[1458000000 25]
[1458000000 25]
[1498863600 300]
[1498863600 300]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]
[1540771200 100]]
I couldn't spot any mistake in your code.
I tried your code and I can tell that the problem does not happen, at least with regular Python lists: the function doesn't change the number of occurrences of any element in the list.
data = [
    [1422403200, 100],
    [1462834800, 150],
    [1458000000, 25],
    [1540681200, 150],
    [1498863600, 300],
    [1540771200, 100],
    [1540771200, 100],
    [1540771200, 100],
    [1540771200, 100],
    [1540771200, 100],
]
sort(data)

from pprint import pprint
pprint(data)
Output:
[[1422403200, 100],
[1458000000, 25],
[1462834800, 150],
[1498863600, 300],
[1540681200, 150],
[1540771200, 100],
[1540771200, 100],
[1540771200, 100],
[1540771200, 100],
[1540771200, 100]]
Edit, taking into account the numpy context and the use of np.column_stack.
(This first guess turned out to be wrong; see Edit 2.) I expected that np.column_stack actually creates a view mapping over the two arrays, and that to get a real array rather than a link to your existing arrays, you should copy that array:
array = np.column_stack((unixdate, price)).copy()
Edit 2, taking into account the numpy context
This behavior has actually nothing to do with np.column_stack; np.column_stack already performs a copy.
The reason your code doesn't work is that slicing behaves differently with numpy than with plain Python: slicing a numpy array creates a view, which maps onto the original array's memory rather than copying it.
The erroneous lines are:
l = data[:Mid]
r = data[Mid:]
Since l and r just map to two pieces of the memory held by data, they are modified whenever data is. This is why the lines data[c] = l[z] and data[c] = r[x] overwrite values still waiting to be merged, creating duplicates and losing data.
If data is a numpy array, we want l and r to be copies of data, not just views. This can be achieved using the copy method.
l = data[:Mid]
r = data[Mid:]
if isinstance(data, np.ndarray):
    l = l.copy()
    r = r.copy()
This way (I tested it), the sort works as intended.
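For completeness, here is a self-contained sketch of the fix applied to a small numpy array (the timestamp values are illustrative, and the function is renamed merge_sort here to keep the sketch standalone):

```python
import numpy as np

def merge_sort(data):
    # Same algorithm as in the question, but the halves are copied so
    # that, for numpy arrays, they no longer alias `data`.
    if len(data) > 1:
        mid = len(data) // 2
        l = data[:mid]
        r = data[mid:]
        if isinstance(data, np.ndarray):
            l = l.copy()
            r = r.copy()
        merge_sort(l)
        merge_sort(r)
        z = x = c = 0
        while z < len(l) and x < len(r):
            if l[z][0] < r[x][0]:
                data[c] = l[z]
                z += 1
            else:
                data[c] = r[x]
                x += 1
            c += 1
        while z < len(l):
            data[c] = l[z]
            z += 1
            c += 1
        while x < len(r):
            data[c] = r[x]
            x += 1
            c += 1

array = np.column_stack(([1462834800, 1422403200, 1458000000],
                         [150, 100, 25]))
merge_sort(array)
print(array)  # rows ordered by the first column, no duplicates
```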
Note
If you wanted to sort the data using python lists rather than numpy arrays, the equivalent of np.column_stack in vanilla python is zip:
z = zip([10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000])
z
# <zip at 0x7f6ef80ce8c8>
# `zip` creates an iterator, which is ready to give us our entries.
# Iterators can only be walked once, which is not the case of lists.
list(z)
# [(10, 100, 1000), (20, 200, 2000), (30, 300, 3000), (40, 400, 4000)]
The entries are (non-mutable) tuples. If you need the entries to be editable, map list on them:
z = zip([10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000])
li = list(map(list, z))
# [[10, 100, 1000], [20, 200, 2000], [30, 300, 3000], [40, 400, 4000]]
To transpose a matrix, use zip(*matrix):
def transpose(matrix):
return list(map(list, zip(*matrix)))
transpose(l)
# [[10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
You can also sort a Python list li in place using li.sort(), or sort any iterable (lists are iterable) using sorted(li).
Here, I would use (tested):
sorted(zip(unixdate, price))
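For instance, with small illustrative values:

```python
# sorted(zip(...)) pairs up the columns and sorts by the first element.
unixdate = [1462834800, 1422403200, 1458000000]
price = [150, 100, 25]

pairs = sorted(zip(unixdate, price))
print(pairs)
# [(1422403200, 100), (1458000000, 25), (1462834800, 150)]
```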

How do I fix a tuple in list error

So I have this bit of code:
Points = [ [400,100],[600,100],[800,100] , [300,300],[400,300],[500,300],[600,300] , [200,500],[400,500],[600,500],[800,500],[1000,500] , [300,700],[500,700][700,700][900,700] , [200,900],[400,900],[600,900] ]
And it produces this Error:
line 43, in <module>
Points = [ [400,100],[600,100],[800,100] , [300,300],[400,300],[500,300],[600,300] , [200,500],[400,500],[600,500],[800,500],[1000,500] , [300,700],[500,700][700,700][900,700] , [200,900],[400,900],[600,900] ]
TypeError: list indices must be integers, not tuple
What can I do to fix it?
You forgot two commas:
[500,700][700,700][900,700]
Now Python sees an attempt to index the list on the left-hand side with a (700, 700) tuple:
>>> [500,700][700,700]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not tuple
The second [900, 700] 'list' would give you the same problem but doesn't yet come into play.
Fix it by adding commas between:
[500, 700], [700, 700], [900, 700]
or, as a complete list:
Points = [[400, 100], [600, 100], [800, 100], [300, 300], [400, 300], [500, 300], [600, 300], [200, 500], [400, 500], [600, 500], [800, 500], [1000, 500], [300, 700], [500, 700], [700, 700], [900, 700], [200, 900], [400, 900], [600, 900]]
You forgot to separate a few of them with commas. See the fix:
>>> Points = [[400,100], [600,100], [800,100], [300,300], [400,300], [500,300], [600,300] ,[200,500], [400,500], [600,500], [800,500], [1000,500], [300,700], [500,700], [700,700],[900,700], [200,900], [400,900], [600,900]]
Forgetting the commas leads Python to believe that you're trying to access the first list with the second, which throws an error.
You need to separate each of the lists (in the outer list) with a ,:
Points = [ [400,100],[600,100],[800,100] , [300,300],[400,300],[500,300],[600,300] ,[200,500],[400,500],[600,500],[800,500],[1000,500] , [300,700],[500,700],[700,700],[900,700] , [200,900],[400,900],[600,900] ]

Define variables with the same list data but different objects using python

This is my code:
attackUp = [10, 15, 10, 15, 10, 15]
defenceUp = [10, 15, 10, 15, 10, 15]
magicUp = [10, 15, 10, 15, 10, 15]
attType = [1, 1, 1, 1, 1, 1]
weightDown = [10, 15, 10, 15, 10, 15]
# accessory data
accAttackSword = [100, 100, 100, 100, 100, 100]
accAttackSaber = [100, 100, 100, 100, 100, 100]
accAttackAx = [100, 100, 100, 100, 100, 100]
accAttackHammer = [100, 100, 100, 100, 100, 100]
accAttackSpear = [100, 100, 100, 100, 100, 100]
accAttackFight = [100, 100, 100, 100, 100, 100]
accAttackBow = [100, 100, 100, 100, 100, 100]
accAttackMagicGun = [100, 100, 100, 100, 100, 100]
accAttackMagic = [100, 100, 100, 100, 100, 100]
mStrInstrument = [100, 100, 100, 100, 100, 100]
mStrCharms = [100, 100, 100, 100, 100, 100]
accDefencePhy = [100, 100, 100, 100, 100, 100]
accDefenceMag = [100, 100, 100, 100, 100, 100]
accWeight = [100, 90, 0, 0, 100, 90]
# tactics book data
bookTurn = [1, 1]
bookAttackPhy = [100, 100]
bookAttackMag = [100, 100]
bookStrInstrument = [100, 100]
bookStrCharms = [100, 100]
bookDefencePhy = [100, 100]
bookDefenceMag = [100, 100]
bookWeight = [100, 100]
You can see that many variables have the same value, but I can't define them like this:
bookAttackPhy = bookAttackMag = bookStrInstrument = bookStrCharms = bookDefencePhy = [100, 100]
because then all of them change if one of them changes.
What is the best and easiest way to define these variables?
Well, a step in the right direction would be to create a base list and then copy it using slice notation:
base = [100, 100, 100, 100]
value_a = base[:]
value_b = base[:]
and so on. This doesn't gain you much for the shorter lists, but it should be useful for the longer ones at least.
But I think more generally, a richer data structure would be better for something like this. Why not create a class? You could then use setattr to fill up class members in a fairly straightforward way.
class Weapons(object):
    def __init__(self, base):
        for weapon in ["saber", "sword", "axe"]:
            setattr(self, weapon, base[:])

w = Weapons([100, 100, 100])
print(w.__dict__)
# output: {'saber': [100, 100, 100],
#          'sword': [100, 100, 100],
#          'axe': [100, 100, 100]}
w.axe[0] = 10
print(w.axe)    # output: [10, 100, 100]
print(w.sword)  # output: [100, 100, 100]
Define them all as empty arrays, then group the ones that need the same values into a list and iterate through that list, assigning the common values to each variable.
You could do something like:
defaultAttack = [100, 100, 100, 100, 100, 100]
accAttackSword = list(defaultAttack)
accAttackSaber = list(defaultAttack)
The list() constructor makes a copy of the list, so they will be able to change independently.
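A quick sketch (with illustrative values) showing that list() really decouples the copies:

```python
# Mutating the copy leaves the original untouched.
defaultAttack = [100, 100, 100, 100, 100, 100]
accAttackSword = list(defaultAttack)

accAttackSword[0] = 50
print(accAttackSword)  # [50, 100, 100, 100, 100, 100]
print(defaultAttack)   # [100, 100, 100, 100, 100, 100] -- unchanged
```

Note that list() (like slicing) makes a shallow copy; that is enough for the flat lists here, but for nested lists you would need copy.deepcopy.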
You can use list multiplication:
accAttackSword = [100] * 6
....
bookWeight = [100] * 2
....
You might consider grouping all of the variables with similar prefixes either into dictionaries or nested lists (EDIT - or classes/objects). This could have benefits later for organization, and would allow you to iterate thru and set them all to the same initial values.
bookVars = ['AttackPhy', 'AttackMag', 'StrInstrument', 'StrCharms']
bookValues = dict()
for i in bookVars:
    bookValues[i] = [100] * 2
And to access...
bookValues
{'AttackMag': [100, 100], 'StrCharms': [100, 100], 'StrInstrument': [100, 100], 'AttackPhy': [100, 100]}
bookValues['AttackMag']
[100, 100]
EDIT - check out senderle's answer too; at a glance it seems a little better, but I'd definitely consider using one of our ideas. The point is to structure the data a little more: whenever you have groups of variables with similar prefixed names, consider grouping them together in a more meaningful way. You are already doing so in your mind, so make the code follow!
