I'm fitting a scikit-learn model (an ExtraTreesRegressor) with the aim of doing supervised feature selection.
I've made a toy example to be as clear as possible. This is the toy code:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from itertools import chain
# Original Dataframe
df = pd.DataFrame({"A": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2] , "CLASS":[1,0]})
X = np.array([np.array(df.A).reshape(1,4) , df.C , df.R])
Y = np.array(df.CLASS)
# prints
X = np.array([np.array(df.A), df.C , df.R])
Y = np.array(df.CLASS)
print("X",X)
print("Y",Y)
print(df)
df['A'].apply(lambda x: print("ORIGINAL SHAPE",np.array(x).shape,"field:",x))
df['A'] = df['A'].apply(lambda x: np.array(x).reshape(4,1))
df['A'].apply(lambda x: print("RESHAPED SHAPE",np.array(x).shape,"field:",x))
model = ExtraTreesRegressor()
model.fit(X,Y)
model.feature_importances_
X [[[10, 15, 12, 14] [20, 30, 10, 43]]
[2 2]
[2 2]]
Y [1 0]
A C CLASS R
0 [10, 15, 12, 14] 2 1 2
1 [20, 30, 10, 43] 2 0 2
ORIGINAL SHAPE (4,) field: [10, 15, 12, 14]
ORIGINAL SHAPE (4,) field: [20, 30, 10, 43]
---------------------------
This is the exception that is raised:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-5a36c4c17ea0> in <module>()
7 print(df)
8 model = ExtraTreesRegressor()
----> 9 model.fit(X,Y)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
371 force_all_finite)
372 else:
--> 373 array = np.array(array, dtype=dtype, order=order, copy=copy)
374
375 if ensure_2d:
ValueError: setting an array element with a sequence.
I've noticed that it involves np.arrays. So I tried to fit another toy dataframe, the most basic one, with only scalars, and no errors were raised. Then I kept the same code and only modified that toy dataframe by adding another field that contains one-dimensional arrays, and the same exception was raised.
I've looked around, but so far I've not found a solution, even after trying reshapes and conversions into lists, np.array, etc., and matrices in my real problem. I'm still trying along this direction.
I've also seen that this kind of problem usually arises when the arrays have different lengths between samples, but that is not the case in the toy example.
Does anyone know how to deal with these structures and this exception?
Thanks in advance for any help.
Have a closer look at your X:
>>> X
array([[[10, 15, 12, 14], [20, 30, 10, 43]],
[2, 2],
[2, 2]], dtype=object)
>>> type(X[0,0])
<class 'list'>
Notice that it's dtype=object, and one of these objects is a list, hence "setting an array element with a sequence". Part of the problem is that np.array(df.A) does not correctly create a 2D array:
>>> np.array(df.A)
array([[10, 15, 12, 14], [20, 30, 10, 43]], dtype=object)
>>> _.shape
(2,) # oops!
But using np.stack(df.A) fixes the problem.
Are you looking for:
>>> X = np.concatenate([
...     np.stack(df.A),                  # condense A to (N, 4)
...     np.expand_dims(df.C, axis=-1),   # expand C to (N, 1)
...     np.expand_dims(df.R, axis=-1),   # expand R to (N, 1)
... ], axis=-1)
>>> X
array([[10, 15, 12, 14, 2, 2],
[20, 30, 10, 43, 2, 2]], dtype=int64)
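For reference, the same idea can be sketched without pandas. This is a minimal example assuming the list-valued column and the two scalar columns are available as plain Python lists (the names A, C and R below are stand-ins for the DataFrame columns); np.column_stack turns the 1-D inputs into columns and leaves the 2-D block as it is:

```python
import numpy as np

# Stand-ins for the DataFrame columns (assumption: every list in A has the same length)
A = [[10, 15, 12, 14], [20, 30, 10, 43]]
C = [2, 2]
R = [2, 2]

A_2d = np.array(A)                 # regular nested lists become a (2, 4) numeric array
X = np.column_stack([A_2d, C, R])  # 1-D inputs are appended as extra columns -> (2, 6)
Y = np.array([1, 0])

print(X.shape, X.dtype.kind)       # a 2-D numeric array, which is what fit(X, Y) expects
```

If the lists in A had different lengths, np.array(A) would not produce a regular 2-D array, so this sketch only applies to the equal-length case described in the question.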
To convert a pandas DataFrame column to a NumPy matrix:
import numpy as np
import pandas as pd
def df2mat(df):
    a = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    n = a.shape[0]
    m = len(a[0])
    b = np.zeros((n,m))
    for i in range(n):
        for j in range(m):
            b[i,j]=a[i][j]
    return b
df = pd.DataFrame({"A":[[1,2],[3,4]]})
b = df2mat(df.A)
After that, concatenate.
I am trying to convert this matlab code to python:
#T2 = (sum((log(X(1:m,:)))'));
Here is my code in python:
T2 = sum(np.log(X[0:int(m),:]).T)
where m = 103 and X is a matrix:
f1 = np.float64(135)
f2 = np.float64(351)
X = np.float64(p[:, int(f1):int(f2)])
and p is a dictionary (loaded data).
The problem is that Python gives me exactly the same values with the same dimensions (216x103) as MATLAB before applying the sum function to np.log(X[0:int(m), :]).T. However, after applying the sum function it gives me the correct values but the wrong dimensions (103x1). The correct dimensions are (1x103). I have tried transposing after taking the sum, but it doesn't work. Any suggestions for how to get my desired dimensions?
A matrix in MATLAB consists of m rows and n columns, but a matrix in NumPy is an array of arrays. Each subarray is a flat vector with a single dimension whose length is its number of elements n. MATLAB doesn't have flat vectors at all: a row is a 1xn matrix, a column is an mx1 matrix, and a scalar is a 1x1 matrix.
So, back to the question: when you write T2 = sum(np.log(X[0:int(m),:]).T) in Python, the result is neither 103x1 nor 1x103; it's a flat vector of length 103. If you specifically want a 1x103 matrix as in MATLAB, just call reshape(1,-1). You don't have to transpose, since you can sum over the second axis.
import numpy as np
X = np.random.rand(216,103)
m = 103
T2 = np.sum(np.log(X[:m]), axis=1).reshape(1,-1)
T2.shape
# (1, 103)
Let's make a demo 2d array:
In [19]: x = np.arange(12).reshape(3,4)
In [20]: x
Out[20]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
And apply the base Python sum function (which isn't the same as numpy's own):
In [21]: sum(x)
Out[21]: array([12, 15, 18, 21])
The result is a (4,) shape array (not 4x1). Print sum(x).shape if you don't believe me.
The numpy.sum function adds all terms if no axis is given:
In [22]: np.sum(x)
Out[22]: 66
or with axis:
In [23]: np.sum(x, axis=0)
Out[23]: array([12, 15, 18, 21])
In [24]: np.sum(x, axis=1)
Out[24]: array([ 6, 22, 38])
The Python sum treats x as a list of arrays, and adds them together
In [25]: list(x)
Out[25]: [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])]
In [28]: x[0]+x[1]+x[2]
Out[28]: array([12, 15, 18, 21])
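A quick check of that equivalence, as a small sketch: the built-in sum iterates over the first axis, so it matches np.sum with axis=0, not the axis-free np.sum.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

builtin_total = sum(x)            # adds the rows: x[0] + x[1] + x[2]
numpy_axis0 = np.sum(x, axis=0)   # same reduction, done inside numpy
numpy_all = np.sum(x)             # no axis given: every element is added

print(builtin_total.shape)        # (4,)
print(np.array_equal(builtin_total, numpy_axis0))  # True
print(numpy_all)                  # 66
```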
Transpose, without parameters, switches the axes. It does not add any dimensions:
In [29]: x.T # (4,3) shape
Out[29]:
array([[ 0, 4, 8],
[ 1, 5, 9],
[ 2, 6, 10],
[ 3, 7, 11]])
In [30]: sum(x).T
Out[30]: array([12, 15, 18, 21]) # still (4,) shape
Octave
>> x=reshape(0:11,4,3)'
x =
0 1 2 3
4 5 6 7
8 9 10 11
>> sum(x)
ans =
12 15 18 21
>> sum(x,1)
ans =
12 15 18 21
>> sum(x,2)
ans =
6
22
38
edit
The np.sum function has a keepdims parameter:
In [32]: np.sum(x, axis=0, keepdims=True)
Out[32]: array([[12, 15, 18, 21]]) # (1,4) shape
In [33]: np.sum(x, axis=1, keepdims=True)
Out[33]:
array([[ 6], # (3,1) shape
[22],
[38]])
If I reshape the array to 3d, and sum, the result is 2d - unless I keepdims:
In [34]: np.sum(x.reshape(3,2,2), axis=0).shape
Out[34]: (2, 2)
In [36]: np.sum(x.reshape(3,2,2), axis=0,keepdims=True).shape
Out[36]: (1, 2, 2)
MATLAB/Octave on the other hand keeps the dims by default:
sum(reshape(x,3,2,2)) # (1,2,2)
unless I sum on that last, 3rd:
sum(reshape(x,3,2,2),3) # (3,2)
The key is that in MATLAB everything is 2d, with the option of additional trailing dimensions, which aren't handled the same way. In numpy any number of dimensions, from 0 on up, is handled the same way.
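Applied to the original question, a MATLAB-shaped (1, 103) result can be sketched as below. The random X is only placeholder data with the shape mentioned in the question; keepdims=True keeps the summed axis with length 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((216, 103)) + 0.1   # placeholder data, shifted away from 0 so log is finite
m = 103

# MATLAB: T2 = sum(log(X(1:m,:))');  i.e. sum over the first axis of the transposed block
T2 = np.sum(np.log(X[:m, :]).T, axis=0, keepdims=True)

print(T2.shape)                    # (1, 103)
```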
Can this for loop be written in a simpler way?
import itertools
import numpy as np
def f(a, b, c): # placeholder for a complex function
    print(a+b+c)
a = np.arange(12).reshape(3, 4)
for y, x in itertools.product(range(a.shape[0]-1), range(a.shape[1]-1)):
    f(a[y, x], a[y, x+1], a[y+1, x])
The other options I tried, look more convoluted, e.g.:
it = np.nditer(a[:-1, :-1], flags=['multi_index'])
for e in it:
    y, x = it.multi_index
    f(a[y, x], a[y, x+1], a[y+1, x])
Posting it as an answer, and sorry if this is too obvious, but isn't this simply
for y in range(a.shape[0]-1):
    for x in range(a.shape[1]-1):
        f(a[y, x], a[y, x+1], a[y+1, x])
With your method I get:
expected = [5, 8, 11, 17, 20, 23]
But you can vectorize the computation by first arranging the data in a more suitable array:
a_stacked = np.stack([a[:-1, :-1], a[:-1, 1:], a[1:, :-1]], axis=0)
From there multiple solutions:
If you already know the function will be the sum:
>>> a_stacked.sum(axis=0)
array([[ 5, 8, 11],
[17, 20, 23]])
If you know that your function is already vectorized:
>>> f(*a_stacked)
array([[ 5, 8, 11],
[17, 20, 23]])
If your function does not vectorize, you can use np.vectorize for convenience (no performance improvement):
>>> np.vectorize(f)(*a_stacked)
array([[ 5, 8, 11],
[17, 20, 23]])
Obviously you can flatten the array next.
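Putting the pieces together, here is a minimal sketch. It assumes f can accept whole arrays and returns its result; the f in the question only prints, so the version below is a hypothetical stand-in.

```python
import numpy as np

def f(a, b, c):
    # Hypothetical vectorizable stand-in for the "complex function"
    return a + b + c

a = np.arange(12).reshape(3, 4)

center = a[:-1, :-1]   # a[y, x]
right = a[:-1, 1:]     # a[y, x+1]
below = a[1:, :-1]     # a[y+1, x]

result = f(center, right, below)   # one call on (2, 3) arrays instead of 6 scalar calls
flat = result.ravel()              # same order as the nested loops visit (y, x)
print(flat)
```

The three slices are views of a, so no data is copied until f produces its result.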
I have a numpy array of size (15 x 200 x 3) called rap.
I would like to slice it based on a 2d list such as this one:
fragment = [0 93
7 102
6 43
11 167]
This is basically the list of the first two indices of the original 3d array, which I want to return.
It gives an error when I try to do it this way:
rap_sliced = rap[fragment, :]
or
rap_sliced = rap[list(fragment), :]
rap_sliced = rap[fragment]
What am I doing wrong?
Assuming input:
>>> fragment
[[0, 93], [7, 102], [6, 43], [11, 167]]
>>> fragment=np.array(fragment)
this will work:
rap[fragment[:, 0], fragment[:, 1], :]
So in general:
numpy_array[X, Y, Z]
where X, Y, Z can each be a single value, a one-dimensional list, or :
Alternatively, in numpy you can do:
numpy_array[boolean_array]
where numpy_array.shape == boolean_array.shape, and each True/False entry of boolean_array says whether the value at those coordinates in numpy_array is returned.
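Both forms of indexing can be tried on placeholder data. In this sketch rap is filled with np.arange purely for illustration (the real contents are not given in the question), and the mask condition is an arbitrary example.

```python
import numpy as np

# Placeholder array with the shape from the question
rap = np.arange(15 * 200 * 3).reshape(15, 200, 3)
fragment = np.array([[0, 93], [7, 102], [6, 43], [11, 167]])

# Integer-array indexing: pair up first-axis and second-axis indices
rows, cols = fragment.T
rap_sliced = rap[rows, cols]        # shape (4, 3): one length-3 vector per (row, col) pair

# Boolean indexing: the mask has the same shape as rap, the result is 1-D
mask = rap > 8990
picked = rap[mask]

print(rap_sliced.shape, picked.shape)
```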
I have the following array:
import numpy as np
a = np.array([[ 1, 2, 3],
[ 1, 2, 3],
[ 1, 2, 3]])
I understand that np.random.shuffle(a.T) will shuffle the array along the row, but what I need is for it to shuffle each row independently. How can this be done in numpy? Speed is critical as there will be several million rows.
For this specific problem, each row will contain the same starting population.
import numpy as np
np.random.seed(2018)
def scramble(a, axis=-1):
    """
    Return an array with the values of `a` shuffled along the given axis
    (the same random permutation is applied to every slice)
    """
    b = a.swapaxes(axis, -1)
    n = a.shape[axis]
    idx = np.random.choice(n, n, replace=False)
    b = b[..., idx]
    return b.swapaxes(axis, -1)
a = np.arange(4*9).reshape(4, 9)
# array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8],
# [ 9, 10, 11, 12, 13, 14, 15, 16, 17],
# [18, 19, 20, 21, 22, 23, 24, 25, 26],
# [27, 28, 29, 30, 31, 32, 33, 34, 35]])
print(scramble(a, axis=1))
yields
[[ 3 8 7 0 4 5 1 2 6]
[12 17 16 9 13 14 10 11 15]
[21 26 25 18 22 23 19 20 24]
[30 35 34 27 31 32 28 29 33]]
while scrambling along the 0-axis:
print(scramble(a, axis=0))
yields
[[18 19 20 21 22 23 24 25 26]
[ 0 1 2 3 4 5 6 7 8]
[27 28 29 30 31 32 33 34 35]
[ 9 10 11 12 13 14 15 16 17]]
This works by first swapping the target axis with the last axis:
b = a.swapaxes(axis, -1)
This is a common trick used to standardize code which deals with one axis.
It reduces the general case to the specific case of dealing with the last axis.
Since in NumPy version 1.10 or higher swapaxes returns a view, there is no copying involved and so calling swapaxes is very quick.
Now we can generate a new index order for the last axis:
n = a.shape[axis]
idx = np.random.choice(n, n, replace=False)
Now we can shuffle b along the last axis (note that the same index order idx is applied to every slice, as the outputs above show):
b = b[..., idx]
and then reverse the swapaxes to return an a-shaped result:
return b.swapaxes(axis, -1)
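Note that in the outputs above every row ends up in the same column order, because a single idx is applied to all slices. If a different permutation per row is needed, one possible sketch (not part of the answer above; scramble_independent is a hypothetical name) sorts random keys along the axis and gathers with np.take_along_axis, which requires NumPy 1.15 or later:

```python
import numpy as np

def scramble_independent(a, axis=-1, rng=None):
    """Shuffle `a` along `axis`, drawing a separate permutation for every slice."""
    rng = np.random.default_rng() if rng is None else rng
    # One random key per element; argsort along `axis` yields an
    # independent permutation of indices for each 1-D slice.
    idx = rng.random(a.shape).argsort(axis=axis)
    return np.take_along_axis(a, idx, axis=axis)

a = np.arange(4 * 9).reshape(4, 9)
out_rows = scramble_independent(a, axis=1, rng=np.random.default_rng(2018))
out_cols = scramble_independent(a, axis=0, rng=np.random.default_rng(2018))
print(out_rows)
```

The sort makes this O(n log n) per slice, which is usually acceptable in exchange for avoiding a Python-level loop over millions of rows.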
If you don't want a return value and want to operate on the array directly, you can specify the indices to shuffle.
>>> import numpy as np
>>>
>>>
>>> a = np.array([[1,2,3], [1,2,3], [1,2,3]])
>>>
>>> # Shuffle row `2` independently
>>> np.random.shuffle(a[2])
>>> a
array([[1, 2, 3],
[1, 2, 3],
[3, 2, 1]])
>>>
>>> # Shuffle column `0` independently
>>> np.random.shuffle(a[:,0])
>>> a
array([[3, 2, 3],
[1, 2, 3],
[1, 2, 1]])
If you want a return value as well, you can use numpy.random.permutation, in which case replace np.random.shuffle(a[n]) with a[n] = np.random.permutation(a[n]).
Warning: do not do a[n] = np.random.shuffle(a[n]). shuffle does not return anything (it returns None), so the row/column you end up "shuffling" will be filled with nan instead (for a float array; for an integer array the assignment raises a TypeError).
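The difference between the two functions can be checked directly. This is a small sketch; the seeded RandomState is only there to make the run repeatable.

```python
import numpy as np

state = np.random.RandomState(0)   # seeded only so the demo is repeatable
a = np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]])

returned = state.shuffle(a[2])     # shuffles the row view in place
print(returned)                    # None: there is nothing to assign back

a[0] = state.permutation(a[0])     # permutation returns a shuffled copy, safe to assign
print(a)
```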
Good answer above. But I will throw in a quick and dirty way:
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
ignore_list_output = [np.random.shuffle(x) for x in a]
Then a can be something like this:
array([[2, 1, 3],
[4, 6, 5],
[9, 7, 8]])
Not very elegant but you can get this job done with just one short line.
Building on my comment to @Hun's answer, here's the fastest way to do this:
def shuffle_along(X):
    """Minimal in place independent-row shuffler."""
    [np.random.shuffle(x) for x in X]
This works in-place and can only shuffle rows. If you need more options:
def shuffle_along(X, axis=0, inline=False):
    """More elaborate version of the above."""
    if not inline:
        X = X.copy()
    if axis == 0:
        [np.random.shuffle(x) for x in X]
    if axis == 1:
        [np.random.shuffle(x) for x in X.T]
    if not inline:
        return X
This, however, has the limitation of only working on 2d-arrays. For higher dimensional tensors, I would use:
def shuffle_along(X, axis=0, inline=True):
    """Shuffle along any axis of a tensor."""
    if not inline:
        X = X.copy()
    np.apply_along_axis(np.random.shuffle, axis, X) # <-- I just changed this
    if not inline:
        return X
You can do this in numpy without any loop or extra function, and much faster. E.g., we have an array of shape (6, 2) and we want a sub-array of shape (2, 2) with independent random indices for each column.
import numpy as np
test = np.array([[1, 1],
[2, 2],
[0.5, 0.5],
[0.3, 0.3],
[4, 4],
[7, 7]])
id_rnd = np.random.randint(6, size=(2, 2)) # select random indices; use choice and range if you don't want replacement
new = np.take_along_axis(test, id_rnd, axis=0)
Out:
array([[2. , 2. ],
[0.5, 2. ]])
It works for any number of dimensions.
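If the indices in each column must be distinct (sampling without replacement), one way to sketch it is to argsort random keys per column and keep the first k entries. This variant is an illustration added here, not part of the answer above; k and the seeded generator are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded only for repeatability
test = np.array([[1, 1],
                 [2, 2],
                 [0.5, 0.5],
                 [0.3, 0.3],
                 [4, 4],
                 [7, 7]])
k = 2                              # number of rows to draw per column

# argsort of random keys gives an independent permutation of 0..5 per column;
# keeping the first k entries yields k distinct row indices for each column.
id_rnd = rng.random(test.shape).argsort(axis=0)[:k]
new = np.take_along_axis(test, id_rnd, axis=0)
print(new.shape)                   # (2, 2)
```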
As of NumPy 1.20.0 released in January 2021 we have a permuted() method on the new Generator type (introduced with the new random API in NumPy 1.17.0, released in July 2019). This does exactly what you need:
import numpy as np
rng = np.random.default_rng()
a = np.array([
[1, 2, 3],
[1, 2, 3],
[1, 2, 3],
])
shuffled = rng.permuted(a, axis=1)
This gives you something like
>>> print(shuffled)
[[2 3 1]
[1 3 2]
[2 1 3]]
As you can see, the rows are permuted independently. This is in sharp contrast with both rng.permutation() and rng.shuffle().
If you want an in-place update you can pass the original array as the out keyword argument. And you can use the axis keyword argument to choose the direction along which to shuffle your array.
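A short sketch of the in-place form, assuming NumPy 1.20 or later; the seed is only there for repeatability.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded only for repeatability
a = np.array([
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
])

# Passing the input as `out` overwrites it; each row gets its own permutation.
rng.permuted(a, axis=1, out=a)
print(a)
```

With axis=0 the same call would shuffle each column independently instead.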
I have a numpy matrix M and I need to apply some operations to all the rows of the matrix, except for certain rows.
For example, suppose rows [3,5] should be excluded from an operation like M[:,8] = 4. So I want all the rows of the 8th column to be set to 4, except rows 3 and 5. How can I do this in numpy?
Edit: basically I need this to avoid a division by zero when normalizing each row by the sum of its elements. Some rows are all zeros, so computing the sum (which is zero) and then dividing by it gives a division by zero. I find out which rows are all zeros, and then I want to skip the normalization for those specific rows.
Perhaps something like this?
>>> import numpy as np
>>> M = np.arange(32).reshape(8, 4)
>>> ignore = {3, 5}
>>> rest = [i for i in range(M.shape[0]) if i not in ignore]
>>> M[rest, 3] = 4
>>> M
array([[ 0, 1, 2, 4],
[ 4, 5, 6, 4],
[ 8, 9, 10, 4],
[12, 13, 14, 15],
[16, 17, 18, 4],
[20, 21, 22, 23],
[24, 25, 26, 4],
[28, 29, 30, 4]])
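The same selection can also be written with a boolean row mask instead of a Python list comprehension. This is a minimal sketch of that alternative, using the same toy M:

```python
import numpy as np

M = np.arange(32).reshape(8, 4)
ignore = [3, 5]

keep = np.ones(M.shape[0], dtype=bool)  # start by keeping every row
keep[ignore] = False                    # switch off the rows to skip

M[keep, 3] = 4                          # assign only where keep is True
print(M[:, 3])
```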
Based on your edit, in order to solve your specific problem, where you seem to be manipulating a matrix with non-negative entries, you may exploit the following trick:
import numpy as np
rng = np.random.RandomState(42)
M = rng.randn(10, 10) ** 2
M[[0, 5]] = 0. # set 2 rows to 0
M_norm = M / (M.sum(axis=1) + 1e-18)[:, np.newaxis]
Obviously this result is not exact, but exact enough to not notice the difference. To make it slightly better, you can also write
M_norm = M / np.maximum(M.sum(axis=1), 1e-18)[:, np.newaxis]
If this still isn't sufficient, and you want it exact, for the general case (negativity allowed) you can write
row_sums = M.sum(axis=1)
row_sums[row_sums == 0] = 1.
M_norm = M / row_sums[:, np.newaxis] # dividing the zeros by 1 still yields 0
To add some robustness, you could also do
tolerance = 1e-6
row_sums = M.sum(axis=1)
OK_rows = np.abs(row_sums) > tolerance
M_norm = np.zeros_like(M)
M_norm[OK_rows] = M[OK_rows] / row_sums[OK_rows][:, np.newaxis]
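Another way to express the exact variant is np.divide with its where and out arguments, which skips the division for the masked rows. The small M below is made-up example data.

```python
import numpy as np

# Made-up example: the middle row is all zeros
M = np.array([[1.0, 3.0],
              [0.0, 0.0],
              [2.0, 2.0]])

row_sums = M.sum(axis=1, keepdims=True)     # shape (3, 1), broadcasts across columns

# Divide only where the row sum is non-zero; other entries keep the 0 from `out`.
M_norm = np.divide(M, row_sums, out=np.zeros_like(M), where=row_sums != 0)
print(M_norm)
```

Supplying out matters here: positions excluded by where are left untouched, so they need a defined starting value.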