Get actual feature names from XGBoost model - python

I know this question has been asked several times and I've read them but still haven't been able to figure it out.
Like other people, my feature names at the end are shown as f56, f234, f12 etc. and I want to have the actual names instead of f-somethings! This is the part of the code related to the model:
optimized_params, xgb_model = find_best_parameters()  # where fitting and GridSearchCV happen
xgdmat = xgb.DMatrix(X_train_scaled, y_train_scaled)
feature_names = xgdmat.feature_names
final_gb = xgb.train(optimized_params, xgdmat,
                     num_boost_round=find_optimal_num_trees(optimized_params, xgdmat))
final_gb.get_fscore()
mapper = {'f{0}'.format(i): v for i, v in enumerate(xgdmat.feature_names)}
mapped = {mapper[k]: v for k, v in final_gb.get_fscore().items()}
mapped
xgb.plot_importance(mapped, color='red')
I also tried this:
feature_important = final_gb.get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.plot(kind='barh')
but still the features are shown as f plus a number. I'd really appreciate any help.
What I'm doing at the moment is taking the number at the end, like 234 from f234, and looking up X_train.columns[234] to see what the actual name is. However, I'm having second thoughts about whether the name I get this way is really the feature that f234 represents.

First make a dictionary from your original features so you can map the default names back to the real ones.
# create a dict mapping index -> original feature name, to use later
myfeatures = X_train_scaled.columns
dict_features = dict(enumerate(myfeatures))

# feature importance plot with the default names f0, f1, ...
axsub = xgb.plot_importance(final_gb)

# get the original names back
text_yticklabels = list(axsub.get_yticklabels())
lst_yticklabels = [label.get_text().lstrip('f') for label in text_yticklabels]
lst_yticklabels = [dict_features[int(i)] for i in lst_yticklabels]
axsub.set_yticklabels(lst_yticklabels)
print(dict_features)
plt.show()
Here is an example of how it works (example plot omitted).

The problem can be solved by using the feature_names parameter when creating your xgb.DMatrix:
xgdmat = xgb.DMatrix(X_train_scaled, y_train_scaled, feature_names=feature_names)
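For instance, here's a minimal sketch of that approach. It assumes X_train is the original (pre-scaling) pandas DataFrame, so its column names can be re-attached to the scaled array; the num_boost_round value is just a placeholder:

import xgboost as xgb

# hypothetical: the pre-scaling DataFrame still holds the real column names
feature_names = list(X_train.columns)

xgdmat = xgb.DMatrix(X_train_scaled, y_train_scaled, feature_names=feature_names)
final_gb = xgb.train(optimized_params, xgdmat, num_boost_round=100)

# importance scores are now keyed by the real column names
print(final_gb.get_score(importance_type='weight'))
xgb.plot_importance(final_gb)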

Related

Encoding does not run in the function, but could run independently

I am working on a dataset with a number of categorical features that I want to encode by keeping the top categories and lumping the bottom ones together as "other". Since there is a large number of features, each with a different threshold value, I defined a function like this:
def enc(threshold, features):
    for i in features:
        j = i + "other"
        repl = train[features].value_counts()[train[features].value_counts() <= threshold].index
        sample = pd.get_dummies(train[features].replace(repl, j))
        train.drop(features, axis=1, inplace=True)
        n_data = pd.concat([sample, train], axis=1, join="inner")
    return n_data
It gives the error Cannot compare types 'ndarray(dtype=object)' and 'tuple' on the sample = ... line.
Here threshold is the threshold value for top categories, and features is the list of features I want to iterate over.
When I try to run this code outside of the function, it still gives me the error.
But the catch is that when I run just the two main lines from the for loop with concrete values:
repl = train["LandContour"].value_counts()[train["LandContour"].value_counts() <= 1000].index
a = pd.get_dummies(train["LandContour"].replace(repl, "LandContour_other"))
they run perfectly, just as I want.
I have tried my best at debugging and still was not able to rectify the error.
The data type of repl is <class 'pandas.core.indexes.multi.MultiIndex'>.
I have also looked at a similar question but it didn't solve my problem.
I think the problem lies in the for iteration itself, or repl is causing it.
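For what it's worth, the MultiIndex observation points at the likely culprit: inside the loop the code indexes with the whole list features instead of the loop variable i, so train[features] is a DataFrame and its value_counts() returns a MultiIndex. Here is a minimal per-column sketch, assuming that was the intent (untested against the real data):

import pandas as pd

def enc(threshold, features):
    n_data = train.copy()
    for i in features:
        j = i + "_other"
        counts = n_data[i].value_counts()          # a Series, not a DataFrame
        repl = counts[counts <= threshold].index   # plain Index of rare categories
        dummies = pd.get_dummies(n_data[i].replace(repl, j))
        n_data = pd.concat([dummies, n_data.drop(i, axis=1)], axis=1, join="inner")
    return n_data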

Subgroup ArcPy list Query

Morning, folks.
I have two equal sets of layers, arranged in subgroups in my ArcGIS Pro (2.9.0) project, as shown here.
It's important that they have the same names (Layer1, Layer2, ...) in both groups.
Now I'm writing ArcPy code that applies a definition query, but I want it to apply only to one specific sublayer (e.g. Compare\Layer1 or Compare\Layer2).
For now, I have this piece of code that, I hope, can help:
p = arcpy.mp.ArcGISProject('current')
m = p.listMaps()[0]
l = m.listLayers()

for row in l:
    print(row.name)

COD_QUERY = 123

for row in l:
    if row.name in ('Compare\Layer1'):
        row.definitionQuery = "CODIGO_EOL = {}".format(COD_QUERY)
        print('ok')
When I write 'Compare\Layer1', which is supposed to select only the Layer1 placed in the Compare group, the code doesn't work as expected and applies the query to both Compare\Layer1 and Base\Layer1. That's the exact problem I'm having.
Hope I can find some help from you guys. XD
The layer's name does not include the group layer's name.
Try using a wildcard (follow the link and search for listLayers) and filter for the particular group layer. A group layer object has a listLayers method too; you can leverage it again to get a specific layer.
import arcpy

COD_QUERY = 123

project = arcpy.mp.ArcGISProject("current")
map = project.listMaps()[0]

filtered_group_layers = map.listLayers("Compare")
if filtered_group_layers and filtered_group_layers[0].isGroupLayer:
    filtered_layers = filtered_group_layers[0].listLayers("Layer1")
    if filtered_layers:
        filtered_layers[0].definitionQuery = f"CODIGO_EOL = {COD_QUERY}"
Or you can use loops. The key here is to filter for group layers using the isGroupLayer property before calling each group's listLayers method.
import arcpy

COD_QUERY = 123

project = arcpy.mp.ArcGISProject("current")
map = project.listMaps()[0]

group_layers = (layer for layer in map.listLayers() if layer.isGroupLayer)

for group_layer in group_layers:
    # compare for equality; `name in "Compare"` would be a substring test
    if group_layer.name == "Compare":
        for layer in group_layer.listLayers():
            if layer.name == "Layer1":
                layer.definitionQuery = f"CODIGO_EOL = {COD_QUERY}"
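As an aside (hedged, since behavior can vary across Pro versions): arcpy.mp layers also expose a longName property that, per the documentation, does include group layer names. If that holds in your version, a sketch matching on the full path directly might look like:

import arcpy

COD_QUERY = 123

project = arcpy.mp.ArcGISProject("current")
for layer in project.listMaps()[0].listLayers():
    # longName should read like r"Compare\Layer1" for a grouped layer
    if not layer.isGroupLayer and layer.longName == r"Compare\Layer1":
        layer.definitionQuery = f"CODIGO_EOL = {COD_QUERY}"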

How do I reduce a set of columns along another set of columns, holding all other columns constant?

I think this is a simple operation, but for some reason I'm not finding immediate pointers in my quick perusal of the Pandas docs.
I have prototype working code below, but it seems kinda dumb IMO. I'm sure there are much better ways to do this, and better concepts to describe it.
Is there a better way? If not, is there at least a better way to describe it?
Abstract Problem
Basically, I have columns p0, p1, y0, y1, and some others. The other columns are just things I'd like held constant (they remain as separate columns in the table). p0, p1 are the columns I'd like to reduce along, and y0, y1 are the columns I'd like to have reduced (aggregated).
DataFrame.groupby didn't seem like what I wanted. When perusing the docs, I wasn't sure anything else was what I wanted either. Multi-indexing also seemed like a possible fit, but I didn't immediately see an example of what I was after.
Here's the code that does what I want:
from collections import defaultdict

import pandas as pd

def merge_into(*, from_, to_):
    for k, v in from_.items():
        to_[k] = v

def reduce_along(df, along_cols, reduce_cols, df_reduce=pd.DataFrame.mean):
    # a list, not a set: pandas indexing needs a list-like with stable order
    hold_cols = [c for c in df.columns if c not in set(along_cols) | set(reduce_cols)]
    # dumb way to remember Dict[HeldValues, ValuesToReduce]
    to_reduce_map = defaultdict(list)
    for i in range(len(df)):
        row = df.iloc[i]
        # can I instead use a series? is that hashable?
        key = tuple(row[hold_cols])
        to_reduce = row[reduce_cols]
        to_reduce_map[key].append(to_reduce)
    rows = []
    for key, to_reduce_list in to_reduce_map.items():
        # ... yuck?
        row = pd.Series({k: v for k, v in zip(hold_cols, key)})
        reduced = df_reduce(pd.DataFrame(to_reduce_list))
        merge_into(from_=reduced, to_=row)
        rows.append(row)
    return pd.DataFrame(rows)

reducto = reduce_along(summary, ["p0", "p1"], ["y0", "y1"])
display(reducto)
Background
I am running some sweeps for ML stuff; in them, I sweep over a model architecture parameter, as well as dataset size and the seed that controls random initialization of the model parameters.
I'd like to reduce along the seed to get a "feel" for which architectures are possibly more robust to initialization; for now, I'd also like to see which dataset size helps the most. In the future, I'd like to do (heuristic) reduction along dataset size as well.
Actually, it looks like DataFrame.groupby(hold_cols).agg({k: ["mean"] for k in reduce_cols}) is what I want. Source: https://jamesrledoux.com/code/group-by-aggregate-pandas
import functools

import numpy as np
import pandas as pd

# See: https://stackoverflow.com/a/47699378/7829525
std = functools.partial(np.std)

def reduce_along(df, along_cols, reduce_cols, agg=[np.mean, std]):
    hold_cols = list(set(df.columns) - set(along_cols) - set(reduce_cols))
    hold_cols = [x for x in df.columns if x in hold_cols]  # preserve column order
    # From: https://jamesrledoux.com/code/group-by-aggregate-pandas
    df = df.groupby(hold_cols).agg({k: agg for k in reduce_cols})
    df = df.reset_index()
    return df
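As a quick sanity check, here's the groupby one-liner on a tiny, made-up frame (hypothetical data, just to show the shape of the result):

import pandas as pd

summary = pd.DataFrame({
    "arch": ["a", "a", "b", "b"],   # held constant
    "p0":   [0, 1, 0, 1],           # reduced along (collapsed away)
    "y0":   [1.0, 3.0, 2.0, 4.0],   # aggregated
})

hold_cols, reduce_cols = ["arch"], ["y0"]
out = summary.groupby(hold_cols).agg({k: ["mean"] for k in reduce_cols}).reset_index()
print(out)
# columns come back as a MultiIndex:
#   arch   y0
#        mean
# 0    a  2.0
# 1    b  3.0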

Issue with Python list interaction with for loop

I am having a problem with the genetic feature optimization algorithm that I am attempting to build. The idea is that a specific combination of features is tested, and if the model accuracy using those features is higher than the previous maximum, that combination replaces the previous maximum combination. By running through the remaining potential features in this way, the final combination should be the optimal combination of features given the feature vector size. Currently, the code that looks to achieve this is:
def mutate_features(features, feature):
    new_features = features
    index = random.randint(0, len(features) - 1)
    new_features[index] = feature
    return new_features

def run_series(n, f_list, df):
    features_list = []
    results_list = []
    max_results_list = [[0, 0, 0, 0, 0]]
    max_feature_list = []
    features = [0, 0, 0, 0, 1]
    for i in range(0, 5):  # 5 has just been chosen as the range for testing purposes
        results = run_algorithm(df, f_list, features)
        features_list.append(features)
        results_list.append(results)
        if check_result_vector(max_results_list, results):
            max_results_list.append(results)
            max_feature_list.append(features)
        else:
            print("Revert to previous :" + str(max_feature_list[-1]))
            features = max_feature_list[-1]
        features = mutate_features(features, f_list[i])
    print("Feature List = " + str(features_list))
    print("Results List = " + str(results_list))
    print("Max Results List = " + str(max_results_list))
    print("Max Feature List = " + str(max_feature_list))
The output from this code was included as a screenshot (omitted here).
The part that I do not understand is the output of max_feature_list and features_list.
If anything is appended to max_feature_list or features_list inside the for loop, all items already in the list seem to change to match the latest addition. I may not fully understand the syntax/logic around this, and I would really appreciate any feedback on why the program is doing this.
It happens because you change the values of features inside the mutate_features function (new_features = features copies only the reference, not the list), and since the append to max_feature_list also stores a reference, the values already in max_feature_list change too when the underlying list changes.
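Here's a tiny illustration of the aliasing, using plain lists (nothing from the question's model code):

a = [0, 0, 0]
b = a          # b points at the same list object; nothing is copied
b[0] = 1
print(a)       # [1, 0, 0] -- a "changed" too, because a and b are one list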
One way to prevent this behaviour is to deepcopy features inside mutate_features, mutate the copy as you want, and return it.
For example:
import random
from copy import deepcopy

def mutate_features(features, feature):
    new_features = deepcopy(features)
    index = random.randint(0, len(features) - 1)
    new_features[index] = feature
    return new_features

features = [1, 2, 3]
res = []
res.append(features)
features = mutate_features(features, 9)  # 9 is an arbitrary example value
res.append(features)
print(res)
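With the deepcopy in place, res keeps the original [1, 2, 3] alongside the mutated copy instead of two references to the same list. For a flat list of ints like this, a shallow copy (features.copy() or list(features)) would suffice; deepcopy matters when the list contains nested mutable objects.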

How do you make a list of numpy.float64?

I am using Python. I made these numpy.float64 values, which show the Chicago Cubs' average wins by decade.
yr1874to1880 = np.mean(wonArray[137:143])
yr1881to1890 = np.mean(wonArray[127:136])
yr1891to1900 = np.mean(wonArray[117:126])
yr1901to1910 = np.mean(wonArray[107:116])
yr1911to1920 = np.mean(wonArray[97:106])
yr1921to1930 = np.mean(wonArray[87:96])
yr1931to1940 = np.mean(wonArray[77:86])
yr1941to1950 = np.mean(wonArray[67:76])
yr1951to1960 = np.mean(wonArray[57:66])
yr1961to1970 = np.mean(wonArray[47:56])
yr1971to1980 = np.mean(wonArray[37:46])
yr1981to1990 = np.mean(wonArray[27:36])
yr1991to2000 = np.mean(wonArray[17:26])
yr2001to2010 = np.mean(wonArray[7:16])
yr2011to2016 = np.mean(wonArray[0:6])
I want to put them together but I don't know how. I tried making a list but it did not work. Does anyone know how to put them together so I can use them in a graph? I want to make a scatter plot with matplotlib. Thank you.
So with what you've shown, each variable you're setting becomes a float value. You can make them into a list by declaring:
list_of_values = [yr1874to1880, yr1881to1890, ...]
Adding all of the declared values to this results in a list of floats. For example, with just the two values above added:
>>> print(list_of_values)
[139.5, 131.0]
So that should explain how to obtain a list with the data from np.mean(). However, I'm guessing the other question being asked is "how do I scatter plot this?" Using what is provided here, we have one axis of data, but to plot we need another (you can't have a graph without both x and y). Decide what the average wins should be compared against, and then iterate over that. For example, I'll use a simple integer decade counter as the x axis:
import matplotlib.pyplot as plt

decade = 1
for i in list_of_values:
    y = i
    x = decade
    decade += 1
    plt.scatter(x, y)

plt.show()
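A slightly more compact variant of the same idea: plt.scatter also accepts whole sequences, so the loop isn't strictly needed (list_of_values as defined above; the axis labels are just illustrative):

import matplotlib.pyplot as plt

decades = range(1, len(list_of_values) + 1)  # simple stand-in x-axis
plt.scatter(decades, list_of_values)
plt.xlabel("decade index")
plt.ylabel("average wins")
plt.show()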
