Feature importance with list attributes from a pd DataFrame - Python

I'm developing an ML algorithm to get feature importances with an ExtraTrees model.
The problem I'm trying to solve is that the variables are not scalars but lists of different lengths, or matrices; for now I will focus only on lists.
So far, the only thing I have been able to do is a feature importance over the flat lists concatenated with each other.
GOAL:
What I would like is one score per list instead of a score for each list element.
I present here a toy example of the dataset and the current code:
df = pd.DataFrame({"list1": [[10,15,12,14],[20,30,10,43]], "R": [2,2], "C": [2,2], "CLASS": [1,0], "C1": [1,2], "C2": [3,4]})
After PCA (below; pca_volatilities is a user-defined function, not shown):
df['new'] = pd.Series([np.array(a).reshape((c, r)) for (a, c, r) in zip(df.list1, df.C, df.R)])
df['pca'] = pd.Series([pca_volatilities(matrix) for matrix in df.new])
Becomes:
   list1             C  C1  C2  CLASS  R  new                   pca                                  flat_pca
0  [10, 15, 12, 14]  2   1   3      1  2  [[10, 15], [12, 14]]  [[-1.11803398875], [1.11803398875]]  [-1.11803398875, 1.11803398875]
1  [20, 30, 10, 43]  2   2   4      0  2  [[20, 30], [10, 43]]  [[-8.20060973343], [8.20060973343]]  [-8.20060973343, 8.20060973343]
Here I present the fit:
X = np.concatenate([np.stack(df.flat_pca,axis=0), [df.C1, df.C2]], axis=0).transpose()
Y = np.array(df.CLASS)
model = ExtraTreesRegressor()
model.fit(X,Y)
model.feature_importances_
This returns:
array([ 0.2, 0.3, 0.2, 0.3]).
What I need is one score each for list1, C1, C2, and flat_pca, and I don't know how to get that.
Hoping that someone is able to help me; thanks in advance!
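A possible direction (a sketch, assuming the column layout of X built above): since each element of a flattened attribute occupies one column of X, you can sum the fitted per-column importances over the columns belonging to each attribute. Note that list1 itself never enters X here, so it only gets a score if its raw elements are added as columns as well.
# Sketch: aggregate per-column importances into per-attribute scores.
# Assumes X's layout above: columns 0-1 come from flat_pca,
# column 2 is C1, column 3 is C2.
groups = {"flat_pca": [0, 1], "C1": [2], "C2": [3]}
importances = model.feature_importances_
scores = {name: importances[cols].sum() for name, cols in groups.items()}
print(scores)  # one aggregated score per attribute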

Related

Ignored_column is not working when using grid search in h2o library - Python

The parameter ignored_columns (see link) lets the user specify features that should be ignored when building a model.
When I build a simple ML model and analyze the feature importance, I can see that h2o ignores the column that I specified during the training process, which can be observed from the feature importance. As shown below, column c is not used during training.
import pandas as pd
import h2o
from h2o.estimators import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
x = pd.DataFrame([[0, 1, 4], [5, 1, 6], [15, 2, 0], [25, 5, 32],
                  [35, 11, 89], [45, 15, 1], [55, 34, 3], [60, 35, 4]],
                 columns=['a', 'b', 'c'])
y = pd.DataFrame([4, 5, 20, 14, 32, 22, 38, 43], columns = ['label'])
hf = h2o.H2OFrame(pd.concat([x, y], axis="columns"))
X = hf.col_names[:-1]
y = hf.col_names[-1]
model = H2ORandomForestEstimator(ignored_columns=['c'])
model.train(y = y, training_frame=hf)
model.varimp(use_pandas=True)
  variable  relative_importance  scaled_importance  percentage
0        b         33876.328125           1.000000    0.540893
1        a         28753.998047           0.848793    0.459107
However, when I turn on grid search for hyperparameter tuning, it does not seem to work.
params = {'max_depth': list(range(7, 16)), 'sample_rate': [0.8]}
criteria = {'strategy': 'RandomDiscrete', 'max_models': 4}
grid = H2OGridSearch(model=H2ORandomForestEstimator(ignored_columns=['c']),
                     search_criteria=criteria,
                     hyper_params=params)
grid.train(y=y, training_frame=hf)
best_model = grid.get_grid(sort_by='rmse', decreasing=False)[0]
best_model.varimp(use_pandas=True)
  variable  relative_importance  scaled_importance  percentage
0        a         33525.109375           1.000000    0.516545
1        b         23314.916016           0.695446    0.359230
2        c          8062.515137           0.240492    0.124225
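A common workaround (a sketch, assuming the setup above): instead of relying on ignored_columns being forwarded through the grid, pass the predictor list to grid.train explicitly with 'c' removed:
# Hypothetical workaround: exclude 'c' via the x argument instead of
# relying on ignored_columns inside the grid.
predictors = [col for col in hf.col_names if col not in ('c', y)]
grid = H2OGridSearch(model=H2ORandomForestEstimator(),
                     search_criteria=criteria,
                     hyper_params=params)
grid.train(x=predictors, y=y, training_frame=hf)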

Replace int values in 2D np.array with list of 3 values to make it 3D

I came across this problem while helping on this question, where the OP does some image processing. Regardless of whether there are other ways to do the whole thing, in one part I have a 2D np.array filled with integers. The integers are just mask values, each standing for an RGB color.
I have a dictionary with integers as keys and arrays of RGB colors as values. This is the mapping, and the goal is to replace each int in the array with its color.
Starting with this array, where all RGB arrays were already replaced by integers, so it now has shape (2, 3) (originally it was shape (2, 3, 3)):
import numpy as np
arr = np.array([0, 2, 4, 1, 3, 5]).reshape(2, 3)
arr
array([[0, 2, 4],
       [1, 3, 5]])
Here is the dictionary (chosen numbers are just random for the example):
dic = {0 : [10,20,30], 1 : [12,22,32], 2 : [15,25,35], 3 : [40,50,60], 4 : [100,200,300], 5 : [250,350,450]}
Replacing all these values with the arrays makes it an array of shape (2, 3, 3), like this:
array([[[ 10,  20,  30],
        [ 15,  25,  35],
        [100, 200, 300]],

       [[ 12,  22,  32],
        [ 40,  50,  60],
        [250, 350, 450]]])
I looked into np.where because it seemed the most obvious choice to me, but I always got an error that the shapes are incorrect.
I don't know where exactly I'm stuck. When googling, I came across np.dstack and np.concatenate, and read about changing the shape with np.newaxis / None, but I just can't get it done. Maybe create a new array with np.zeros_like and go from there?
Do I need to create some kind of placeholder before I'm able to insert an array holding the 3 RGB values?
Since every single key is in the array (because it was created from them), I thought about looping through the dict, checking for each key in the array, and replacing it with the dict value. Am I at least heading in the right direction, or does that lead to nothing?
Any help much appreciated!
We can create an array of the dictionary values by unpacking them, and then index it with arr itself, which picks out the rows in the order arr specifies. So:
np.array([*dic.values()])[arr]
If the dictionary keys are not in sorted order, we can first get a sorting permutation over the keys using np.argsort. After sorting the array of dictionary values with that permutation, we can index as before, e.g.:
dic = {0: [10, 20, 30], 2: [15, 25, 35], 3: [40, 50, 60], 1: [12, 22, 32], 4: [100, 200, 300], 5: [250, 350, 450]}
sort_mask = np.array([*dic.keys()]).argsort()
# [0 3 1 2 4 5]
np.array([*dic.values()])[sort_mask][arr]
# [[[ 10 20 30]
# [ 15 25 35]
# [100 200 300]]
#
# [[ 12 22 32]
# [ 40 50 60]
# [250 350 450]]]

How to bin a 2D array in numpy?

I'm new to numpy and I have a 2D array of objects that I need to bin into a smaller matrix and then get a count of the number of objects in each bin to make a heatmap. I followed the answer on this thread to create the bins and do the counts for a simple array but I'm not sure how to extend it to 2 dimensions. Here's what I have so far:
data_matrix = numpy.ndarray((500,500), dtype=float)
# fill array with values.
bins = numpy.linspace(0, 50, 50)
digitized = numpy.digitize(data_matrix, bins)
binned_data = numpy.ndarray((50,50))
for i in range(0, len(bins)):
    for j in range(0, len(bins)):
        k = len(data_matrix[digitized == i:digitized == j])  # <- does not work
        binned_data[i:j] = k
P.S. the [digitized == i] notation on an array returns an array of boolean values (a mask). I cannot find documentation on this notation anywhere; a link would be appreciated.
You can reshape the array to a four dimensional array that reflects the desired block structure, and then sum along both axes within each block. Example:
>>> a = np.arange(24).reshape(4, 6)
>>> a
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])
>>> a.reshape(2, 2, 2, 3).sum(3).sum(1)
array([[ 24,  42],
       [ 96, 114]])
If a has shape (m, n), the reshape should have the form
a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)
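Applied to the 500x500 data_matrix from the question with 50 bins per axis, this would read (a sketch, assuming the shape divides evenly):
# each of the 50x50 output bins sums one 10x10 block of the input
binned_data = data_matrix.reshape(50, 10, 50, 10).sum(axis=3).sum(axis=1)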
At first I was also going to suggest np.histogram2d rather than reinventing the wheel, but then I realized it would be overkill here and would still need some hacking.
If I understand correctly, you just want to sum over submatrices of your input. That's pretty easy to brute force: go over each entry of your output matrix and sum up the corresponding subblock of the input:
import numpy as np

def submatsum(data, n, m):
    # return a matrix of shape (n, m); each entry sums one block
    bs = data.shape[0] // n, data.shape[1] // m  # block size
    return np.reshape(
        np.array([np.sum(data[k1 * bs[0]:(k1 + 1) * bs[0],
                              k2 * bs[1]:(k2 + 1) * bs[1]])
                  for k1 in range(n) for k2 in range(m)]),
        (n, m))
# set up dummy data
N,M = 4,6
data_matrix = np.reshape(np.arange(N*M),(N,M))
# set up size of the 2x3-block-reduced matrix; assumes the shape divides evenly
n,m = N//2,M//3
reduced_matrix = submatsum(data_matrix,n,m)
# check output
print(data_matrix)
print(reduced_matrix)
This prints
print(data_matrix)
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
print(reduced_matrix)
[[ 24  42]
 [ 96 114]]
which is indeed the result for summing up submatrices of shape (2,3).
Note that I'm using // for integer division to keep it Python 3 compatible; in Python 2 you can just use /, since the numbers involved are integers.
Another solution is to have a look at the binArray function in the comments here:
Binning a numpy array
To use your example:
data_matrix = numpy.ndarray((500,500),dtype=float)
binned_data = binArray(data_matrix, 0, 10, 10, np.sum)
binned_data = binArray(binned_data, 1, 10, 10, np.sum)
The result sums each 10x10 square of data_matrix (of size 500x500) into a single value, giving binned_data (of size 50x50). A sketch of such a helper follows.
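For reference, here is a minimal sketch of what such a binArray helper could look like, reconstructed from the call signature above (hypothetical; the implementation in the linked answer may differ):
import numpy as np

def binArray(data, axis, binstep, binsize, func=np.nanmean):
    # Slide a window of length `binsize` along `axis` in steps of
    # `binstep` and reduce each window with `func`.
    data = np.asarray(data)
    # bring the target axis to the front
    order = np.concatenate(([axis], np.delete(np.arange(data.ndim), axis)))
    data = data.transpose(order)
    out = np.array([func(data[i * binstep:i * binstep + binsize], axis=0)
                    for i in range(data.shape[0] // binstep)])
    # restore the original axis order
    return out.transpose(np.argsort(order))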
Hope this helps!

scale two matrices with scipy or sklearn

I would like to scale a matrix X1 (by column), and then scale another matrix X2 using the means and standard deviations found when scaling X1.
As far as I know, sklearn does not return the mean/variance when scaling a matrix. Is there an alternative that saves me from implementing it myself?
For example:
X1
1 2 3 4
5 6 7 8
9 10 11 12
X2
12 13 14 15
16 17 18 19
replace X2[i][j] with (X2[i][j] - mean(X1[:, j])) / std(X1[:, j])
The scale function of sklearn preprocessing cannot be used because it does not return mean and variance.
The StandardScaler from scikit-learn handles this, and its corner cases, pretty well.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X1)
output = scaler.transform(X2)
If necessary, you can access the means and standard deviations of the feature columns using
scaler.mean_
scaler.scale_
(the latter was called scaler.std_ in older scikit-learn versions).
You can also use the StandardScaler in a pipeline as a preprocessing step preceding an estimator.
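For example, a minimal sketch (Ridge and y_train are illustrative stand-ins, not from the question):
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit on the training data inside the pipeline, and the
# same mean/std are reused whenever the pipeline sees new data.
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X1, y_train)           # y_train: hypothetical target values
predictions = pipe.predict(X2)  # X2 is scaled with X1's statistics first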
Both the .std() and .mean() methods accept an axis parameter to compute row-wise/column-wise statistics; the rest is taken care of by broadcasting:
In [170]:
X1
Out[170]:
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
In [171]:
X2
Out[171]:
array([[12, 13, 14, 15],
       [16, 17, 18, 19]])
In [172]:
(X2 - X1.mean(0)) / X1.std(0)
Out[172]:
array([[ 2.14330352,  2.14330352,  2.14330352,  2.14330352],
       [ 3.3680484 ,  3.3680484 ,  3.3680484 ,  3.3680484 ]])

Applying an operation to the rows of a matrix, except for some rows

I have a numpy matrix M and I need to apply some operation to all the rows of the matrix, except for certain rows.
For example, suppose rows [3, 5] should be excluded from an operation like M[:, 8] = 4. So I want every row of the 8th column set to 4, except rows 3 and 5. How can I do this in numpy?
Edit: basically I need this to avoid a division by zero when normalizing by the sum of the elements of a row. Some rows are all zeros, so their sum is zero, and dividing by that sum gives a division by zero. I find out which rows are all zeros, and then I want to skip the normalization for those specific rows.
Perhaps something like this?
>>> import numpy as np
>>> M = np.arange(32).reshape(8, 4)
>>> ignore = {3, 5}
>>> rest = [i for i in range(M.shape[0]) if i not in ignore]
>>> M[rest, 3] = 4
>>> M
array([[ 0,  1,  2,  4],
       [ 4,  5,  6,  4],
       [ 8,  9, 10,  4],
       [12, 13, 14, 15],
       [16, 17, 18,  4],
       [20, 21, 22, 23],
       [24, 25, 26,  4],
       [28, 29, 30,  4]])
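An equivalent variant (a sketch) that builds a boolean mask instead of a list of kept row indices:
>>> skip = np.zeros(M.shape[0], dtype=bool)
>>> skip[[3, 5]] = True
>>> M[~skip, 3] = 4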
Based on your edit, to solve your specific problem, where you seem to be manipulating a matrix with non-negative entries, you can exploit the following trick:
import numpy as np
rng = np.random.RandomState(42)
M = rng.randn(10, 10) ** 2
M[[0, 5]] = 0.  # set 2 rows to 0
M_norm = M / (M.sum(axis=1) + 1e-18)[:, np.newaxis]
Obviously this result is not exact, but close enough that you won't notice the difference. To make it slightly better, you can also write
M_norm = M / np.maximum(M.sum(axis=1), 1e-18)[:, np.newaxis]
If this still isn't sufficient and you want it exact, then for the general case (negative entries allowed) you can write
row_sums = M.sum(axis=1)
row_sums[row_sums == 0] = 1.
M_norm = M / row_sums[:, np.newaxis] # dividing the zeros by 1 still yields 0
To add some robustness, you could also do
tolerance = 1e-6
row_sums = M.sum(axis=1)
OK_rows = np.abs(row_sums) > tolerance
M_norm = np.zeros_like(M)
M_norm[OK_rows] = M[OK_rows] / row_sums[OK_rows][:, np.newaxis]
