Smoothing a 2-D Numpy Array with a Kernel - python

Suppose I have an (m x n) 2-D NumPy array containing only 0s and 1s. I want to "smooth" the array by running, for example, a 3x3 kernel over it and taking the majority value within that kernel. For values at the edges, I would just ignore the "missing" values.
For example, let's say the array looked like
import numpy as np
x = np.array([[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[0, 0, 1, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
Starting at the top-left "1": a 3 x 3 kernel centered on that element would be missing its first row and first column. I want to ignore the missing cells and consider only the remaining 2 x 2 matrix:
1 0
0 0
In this case, the majority value is 0, so set that element to 0. Repeating this for all elements, the resulting 2-d array I would want is:
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 0 0
0 0 0 0 0 0 0 0
How do I accomplish this?

You can use skimage.filters.rank.majority to assign to each value the most frequently occurring one within its neighborhood. The 3x3 kernel can be defined using skimage.morphology.square:
from skimage.filters.rank import majority
from skimage.morphology import square
majority(x.astype('uint8'), square(3))
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
Note: You'll need the latest stable version of scikit-image for majority.

I ended up doing something like this (which is based on "How do I use scipy.ndimage.filters.generic_filter?"):
import numpy as np
import scipy.ndimage.filters
import scipy.stats as scs
def filter_most_common_element(a, w_k=np.ones(shape=(3, 3))):
    """
    Creating a function for scipy.ndimage.generic_filter.
    See https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.generic_filter.html for more information
    on generic filters.
    This filter takes a kernel of np.ones() to find the most common element in the array.
    Based off of https://stackoverflow.com/questions/61197364/smoothing-a-2-d-numpy-array-with-a-kernel
    """
    a = a.reshape(w_k.shape)
    a = np.multiply(a, w_k)
    # See https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.mode.html
    # mode() returns an array in older SciPy and a scalar in newer releases; atleast_1d handles both.
    most_common_element = np.atleast_1d(scs.mode(a, axis=None)[0])[0]
    return most_common_element
x = np.array([[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[0, 0, 1, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
out = scipy.ndimage.filters.generic_filter(x, filter_most_common_element, footprint=np.ones((3,3)),mode='constant',cval=0.0)
out
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
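Note that mode='constant' with cval=0.0 pads the border with zeros rather than ignoring the missing cells, which is what the question asked for. Below is a minimal sketch of the "ignore the border" variant. It assumes a strict majority (ties resolve to 0), and majority_smooth is a hypothetical helper name, not a SciPy function. The idea is to count the ones in each window and compare that with the number of cells that actually fall inside the array.

```python
import numpy as np
from scipy import ndimage

def majority_smooth(x, size=3):
    """Majority filter for a 0/1 array that ignores out-of-bounds cells.

    Hypothetical helper (not part of SciPy). Ties resolve to 0.
    """
    x = np.asarray(x, dtype=int)
    kernel = np.ones((size, size), dtype=int)
    # Number of ones inside each window (the zero padding contributes nothing).
    ones = ndimage.convolve(x, kernel, mode="constant", cval=0)
    # Number of real cells inside each window: 9 inside, 6 on edges, 4 in corners.
    valid = ndimage.convolve(np.ones_like(x), kernel, mode="constant", cval=0)
    return (2 * ones > valid).astype(int)

x = np.array([[1, 0, 0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 1, 1, 0],
              [0, 0, 1, 1, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 1, 1, 0],
              [0, 1, 1, 1, 1, 0, 1, 0],
              [0, 0, 1, 1, 1, 1, 1, 0],
              [0, 0, 0, 0, 0, 0, 0, 0]])
smoothed = majority_smooth(x)
print(smoothed)
```

For this particular input the result equals the zero-padded output above, because no border window contains a strict majority of ones. The two approaches can differ on other inputs, for example when a corner window holds three ones out of four real cells.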

Related

How to make a lower triangle array of 10 but repeated across a diagonal n times?

I am trying to create an array of 10 for each item I have, but then put those arrays of 10 into a larger array diagonally with zeros filling the missing spaces.
Here is an example of what I am looking for, but only with arrays of 3.
import numpy as np
arr = np.tri(3,3)
arr
This creates an array that looks like this:
[[1,0,0],
[1,1,0],
[1,1,1]]
But I need an array of 10 * n that looks like this (using arrays of 3 as an example here, with n=2):
{1,0,0,0,0,0,
1,1,0,0,0,0,
1,1,1,0,0,0,
0,0,0,1,0,0,
0,0,0,1,1,0,
0,0,0,1,1,1}
Any help would be appreciated, thanks!
I have also tried
df_arr2 = pd.concat([df_arr] * (n), ignore_index=True)
df_arr3 = pd.concat([df_arr2] *(n), axis=1, ignore_index=True)
But this repeats the matrix across all rows and columns, when I only want the diagonal ones.
As far as I understand, the OP wants those np.tri triangles placed along the diagonal of a bigger square array whose side is a multiple of 3.
As per example, for n=2:
import numpy as np
n = 2
tri = np.tri(3)
arr = np.zeros((n*3, n*3))
for i in range(0, n*3, 3):
    arr[i:i+3,i:i+3] = tri
arr.astype(int)
# Out:
# array([[1, 0, 0, 0, 0, 0],
# [1, 1, 0, 0, 0, 0],
# [1, 1, 1, 0, 0, 0],
# [0, 0, 0, 1, 0, 0],
# [0, 0, 0, 1, 1, 0],
# [0, 0, 0, 1, 1, 1]])
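The same block-diagonal layout can also be built without an explicit loop. A small sketch using np.kron, where the variable names n and size are just illustrative: the Kronecker product places one copy of the triangle wherever the identity matrix has a 1.

```python
import numpy as np

n = 2      # number of triangles along the diagonal
size = 3   # side length of each triangle (10 in the original question)

# kron places a copy of the triangle wherever the identity matrix has a 1
# and a block of zeros everywhere else.
arr = np.kron(np.eye(n, dtype=int), np.tri(size, dtype=int))
print(arr)
```

With size = 10 this should give the 10 * n square array the question describes.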
I saw @brandt's solution, which is definitely the best. In case you want to construct them manually, you can use this method:
def custom_triangle_matrix(rows, rowlen, tsize):
    cm = []
    for i in range(rows):
        row = []
        for j in range(min((i//tsize)*tsize, rowlen)):
            row.append(0)
        for j in range((i//tsize)*tsize, min(((i//tsize)*tsize) + i%tsize + 1, rowlen)):
            row.append(1)
        for j in range(((i//tsize)*tsize) + i%tsize + 1, rowlen):
            row.append(0)
        cm.append(row)
    return cm
Here are some example executions and what they look like using pprint:
import pprint
matrix = custom_triangle_matrix(6, 6, 3)
pprint.pprint(matrix)
[[1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 1, 0],
[0, 0, 0, 1, 1, 1]]
matrix = custom_triangle_matrix(6, 9, 3)
pprint.pprint(matrix)
[[1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0, 0]]
matrix = custom_triangle_matrix(9, 6, 3)
pprint.pprint(matrix)
[[1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 1, 0],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]]
matrix = custom_triangle_matrix(10, 10, 5)
pprint.pprint(matrix)
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
Good Luck!

Is there a way to populate a full matrix given only an octant segment, with symmetry?

I have an octant of a symmetric matrix which looks like this:
arr_in = [[1],
[0, 0],
[0, 0, 1],
[0, 0, 0, 0],
[0, 1, 0, 1, 0]]
I need to convert this into the full array. Is there a way to do this with numpy? The full matrix end product should be:
0 1 0 1 0 1 0 1 0
1 0 0 0 0 0 0 0 1
0 0 1 0 0 0 1 0 0
1 0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 1
0 0 1 0 0 0 1 0 0
1 0 0 0 0 0 0 0 1
0 1 0 1 0 1 0 1 0
I used np.clip and np.rot90:
import numpy as np
arr_in = [[1],
[0, 0],
[0, 0, 1],
[0, 0, 0, 0],
[0, 1, 0, 1, 0]]
x = np.zeros((5, 5), dtype="uint8")
for idx, row in enumerate(arr_in):
    x[idx, :len(row)] = row
np.clip(x + x.T, 0, 1, out=x)
final = np.zeros((9, 9), dtype="uint8")
final[:5, :5] = np.rot90(x, 2) # NW corner
final[:5, 4:] = np.rot90(x, 1) # NE corner
final[4:, :5] = np.rot90(x, 3) # SW corner
final[4:, 4:] = x # SE corner
Output:
array([[0, 1, 0, 1, 0, 1, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 1, 0, 1, 0, 1, 0]], dtype=uint8)
Place the triangle in a square numpy array.
Reflect that in the diagonal.
Place the result in a bigger numpy array.
Reflect that horizontally then vertically.
import numpy as np
arr_in = [[1],
[0, 0],
[0, 0, 1],
[0, 0, 0, 0],
[0, 1, 0, 1, 0]]
l = len( arr_in )
arr = np.zeros( (l,l), dtype = np.int64 )
# Generate square numpy array. There may be a neater way to do this.
for row, a in enumerate(arr_in):
    arr[ row, :len(a) ] = a
arr
# array([[1, 0, 0, 0, 0],
# [0, 0, 0, 0, 0],
# [0, 0, 1, 0, 0],
# [0, 0, 0, 0, 0],
# [0, 1, 0, 1, 0]])
tr = np.tril( arr, -1 ) # Lower triangle, missing the diagonal ( -1 )
tr
# array([[0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0],
# [0, 1, 0, 1, 0]])
arr += tr.T # arr += transpose of tr
arr
# array([[1, 0, 0, 0, 0],
# [0, 0, 0, 0, 1],
# [0, 0, 1, 0, 0],
# [0, 0, 0, 0, 1],
# [0, 1, 0, 1, 0]])
result = np.zeros( (9,9), dtype = np.int64 ) # Create result array
result[ 4:, 4:] = arr # Fill the lower RH square
result
# array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 1, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 0, 0, 1, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 0, 1, 0, 1, 0]])
result[ :4 ] = result[ 5:][::-1] # Reflect in horizontal mirror
result
# array([[0, 0, 0, 0, 0, 1, 0, 1, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 0, 0, 1, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 1, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 0, 0, 1, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 0, 1, 0, 1, 0]])
result[ :, :4 ] = result[ :, 5: ][:, ::-1] # Reflect in vertical mirror
result
# array([[0, 1, 0, 1, 0, 1, 0, 1, 0],
# [1, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 1, 0, 0, 0, 1, 0, 0],
# [1, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 0, 0, 1, 0, 0, 0, 0],
# [1, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 0, 1, 0, 0, 0, 1, 0, 0],
# [1, 0, 0, 0, 0, 0, 0, 0, 1],
# [0, 1, 0, 1, 0, 1, 0, 1, 0]])
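The two mirror steps can probably be collapsed into a single np.pad call. A sketch under the assumption that the symmetric quadrant belongs in the lower-right corner, as in both answers: mode="reflect" mirrors the array about its first row and column without repeating them, which matches the shared center row and column of the full matrix.

```python
import numpy as np

arr_in = [[1],
          [0, 0],
          [0, 0, 1],
          [0, 0, 0, 0],
          [0, 1, 0, 1, 0]]

n = len(arr_in)
tri = np.zeros((n, n), dtype=int)
for i, row in enumerate(arr_in):
    tri[i, :len(row)] = row          # octant as a lower triangle

# Reflect in the diagonal; maximum() avoids doubling the diagonal entries.
quadrant = np.maximum(tri, tri.T)

# Mirror upwards and to the left; the first row/column is not duplicated.
full = np.pad(quadrant, ((n - 1, 0), (n - 1, 0)), mode="reflect")
print(full)
```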

Apply a function to series of list without apply in pandas

I have a dataframe
df = pd.DataFrame({'Binary_List': [[0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1]]})
df
Binary_List
0 [0, 0, 1, 0, 0, 0, 0]
1 [0, 1, 0, 0, 0, 0, 0]
2 [0, 0, 1, 1, 0, 0, 0]
3 [0, 0, 0, 0, 1, 1, 1]
I want to apply a function to each list without using apply, because apply is very slow on a large dataset:
def count_one(lst):
    index = [i for i, e in enumerate(lst) if e != 0]
    # some more steps
    return len(index)
df['Value'] = df['Binary_List'].apply(lambda x: count_one(x))
df
Binary_List Value
0 [0, 0, 1, 0, 0, 0, 0] 1
1 [0, 1, 0, 0, 0, 0, 0] 1
2 [0, 0, 1, 1, 0, 0, 0] 2
3 [0, 0, 0, 0, 1, 1, 1] 3
I tried using np.vectorize, but it gave no improvement:
vfunc = np.vectorize(count_one)
df['Value'] = vfunc(df['Binary_List'])
Calling the function directly on the column gives me an error:
df['Value'] = count_one(df['Binary_List'])
You can try DataFrame.explode:
df.explode('Binary_List').reset_index().groupby('index').sum()
Binary_List
index
0 1
1 1
2 2
3 3
Also you can do:
pd.Series([np.array(key).sum() for key in df['Binary_List']])
0 1
1 1
2 2
3 3
dtype: int64
To count the 1s in each list, you can convert the lists to strings and use the .str accessor, as below:
df = pd.DataFrame({'Binary_List': [[0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1]]})
df["Binary_List"].astype(str).str.count("1")
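If all lists have the same length, as in the example, another option is to stack them into one 2-D NumPy array and count along the rows. This is only a sketch under that equal-length assumption, and it covers just the counting step, not the "some more steps" elided in count_one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Binary_List': [[0, 0, 1, 0, 0, 0, 0],
                                   [0, 1, 0, 0, 0, 0, 0],
                                   [0, 0, 1, 1, 0, 0, 0],
                                   [0, 0, 0, 0, 1, 1, 1]]})

# One (rows x list_length) array; requires every list to have the same length.
stacked = np.array(df['Binary_List'].tolist())

# Count the non-zero entries of each row in a single vectorized call.
df['Value'] = np.count_nonzero(stacked, axis=1)
print(df)
```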

How to permutate rows in sparse (Numpy) matrix efficiently using permutation array?

I used SciPy's Reverse Cuthill-McKee implementation (scipy.sparse.csgraph.reverse_cuthill_mckee) to create a band matrix from a (high-dimensional) sparse csr_matrix.
As I understand it, the result of this method is a permutation array that gives the indices for permuting the rows of my matrix.
Is there an efficient way to apply this permutation to my sparse csr_matrix, or to any other sparse format (csr, lil_matrix, etc.)?
I tried a for-loop, but my matrix has dimensions of about 200,000 x 150,000 and it takes too much time.
A = csr_matrix((data,(rowind,columnind)), shape=(200000, 150000), dtype=np.uint8)
permutation_array = csgraph.reverse_cuthill_mckee(A, False)
result_matrix = lil_matrix((200000, 150000), dtype=np.uint8)
i=0
for x in np.nditer(permutation_array):
    result_matrix[x, :]=A[i, :]
    i+=1
The result of the reverse_cuthill_mckee call is an array containing the indices for my permutation, something like: [199999 54877 54873 ..., 12045 9191 0] (size = 200,000)
This means:
row with index 0 has now index 199999,
row with index 1 has now index 54877,
row with index 2 has now index 54873,
etc. see: https://en.wikipedia.org/wiki/Permutation#Definition_and_notations
(As I understood the return)
Thank you
I wonder if you are applying the permutation array correctly.
Make a random matrix (float) and convert it to a uint8 (beware, csr calculations might not work with this dtype):
In [963]: ran=sparse.random(10,10,.3, format='csr')
In [964]: A = sparse.csr_matrix((np.ones(ran.data.shape).astype(np.uint8),ran.indices, ran.indptr))
In [965]: A.A
Out[965]:
array([[1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 1, 0, 1],
[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1, 0, 0, 0]], dtype=uint8)
(oops, used the wrong matrix here):
In [994]: permutation_array = csgraph.reverse_cuthill_mckee(A, False)
In [995]: permutation_array
Out[995]: array([9, 7, 0, 4, 6, 3, 5, 1, 8, 2], dtype=int32)
My first inclination is to use such an array to simply index rows of the original matrix:
In [996]: A[permutation_array,:].A
Out[996]:
array([[0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
[1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
I see some clustering; maybe the best we can expect from a random matrix.
You on the other hand appear to be doing:
In [997]: res = sparse.lil_matrix(A.shape,dtype=A.dtype)
In [998]: res[permutation_array,:] = A
In [999]: res.A
Out[999]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
[1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 0, 1, 0],
[0, 1, 1, 1, 0, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 1, 0, 0, 0]], dtype=uint8)
I don't see any improvement in clustering of 1s in res.
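To make the difference between the two row permutations concrete, here is a small sketch on made-up data (the matrix and the permutation are random, not the OP's). Both forms can be written as a single fancy-indexing call on the csr matrix, so no Python loop is needed: A[perm, :] gathers rows, while the loop in the question scatters them, which is the same as indexing with the inverse permutation.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.integers(0, 2, size=(6, 4))   # small illustrative 0/1 matrix
A = sparse.csr_matrix(dense)
perm = rng.permutation(6)                  # stand-in for the RCM result

# Gather: row i of the result is row perm[i] of A.
gathered = A[perm, :]

# Scatter: row perm[i] of the result is row i of A (what the loop does).
inverse = np.argsort(perm)                 # inverse permutation
scattered = A[inverse, :]

print(gathered.toarray())
print(scattered.toarray())
```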
The docs for the MATLAB equivalent say
r = symrcm(S) returns the symmetric reverse Cuthill-McKee ordering of S. This is a permutation r such that S(r,r) tends to have its nonzero elements closer to the diagonal.
In numpy terms, that means:
In [1019]: I,J=np.ix_(permutation_array,permutation_array)
In [1020]: A[I,J].A
Out[1020]:
array([[0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
And indeed there are more 0 bands in the 2 off diagonal corners.
And using the bandwidth calculation on the MATLAB page, https://www.mathworks.com/help/matlab/ref/symrcm.html
In [1028]: i,j=A.nonzero()
In [1029]: np.max(i-j)
Out[1029]: 7
In [1030]: i,j=A[I,J].nonzero()
In [1031]: np.max(i-j)
Out[1031]: 5
The MATLAB docs say that with this permutation, the eigenvalues remain the same. Testing:
In [1032]: from scipy.sparse import linalg
In [1048]: linalg.eigs(A.astype('f'))[0]
Out[1048]:
array([ 3.14518213+0.j , -0.96188843+0.j ,
-0.58978939+0.62853903j, -0.58978939-0.62853903j,
1.09950364+0.54544497j, 1.09950364-0.54544497j], dtype=complex64)
In [1049]: linalg.eigs(A[I,J].astype('f'))[0]
Out[1049]:
array([ 3.14518023+0.j , 1.09950352+0.54544479j,
1.09950352-0.54544479j, -0.58978981+0.62853914j,
-0.58978981-0.62853914j, -0.96188819+0.j ], dtype=complex64)
Eigenvalues are not the same for the row permutations we tried earlier:
In [1050]: linalg.eigs(A[permutation_array,:].astype('f'))[0]
Out[1050]:
array([ 2.95226836+0.j , -1.60117996+0.52467293j,
-1.60117996-0.52467293j, -0.01723826+1.06249797j,
-0.01723826-1.06249797j, 0.90314150+0.j ], dtype=complex64)
In [1051]: linalg.eigs(res.astype('f'))[0]
Out[1051]:
array([-0.05822830-0.97881651j, -0.99999994+0.j ,
1.17350495+0.j , -0.91237622+0.8656373j ,
-0.91237622-0.8656373j , 2.26292515+0.j ], dtype=complex64)
This [I,J] permutation works with the example matrix in http://ciprian-zavoianu.blogspot.com/2009/01/project-bandwidth-reduction.html
In [1058]: B = np.matrix('1 0 0 0 1 0 0 0;0 1 1 0 0 1 0 1;0 1 1 0 1 0 0 0;0 0 0
...: 1 0 0 1 0;1 0 1 0 1 0 0 0; 0 1 0 0 0 1 0 1;0 0 0 1 0 0 1 0;0 1 0 0 0
...: 1 0 1')
In [1059]: B
Out[1059]:
matrix([[1, 0, 0, 0, 1, 0, 0, 0],
[0, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 0],
[1, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 1, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 1, 0, 1]])
In [1060]: Bm=sparse.csr_matrix(B)
In [1061]: Bm
Out[1061]:
<8x8 sparse matrix of type '<class 'numpy.int32'>'
with 22 stored elements in Compressed Sparse Row format>
In [1062]: permB = csgraph.reverse_cuthill_mckee(Bm, False)
In [1063]: permB
Out[1063]: array([6, 3, 7, 5, 1, 2, 4, 0], dtype=int32)
In [1064]: Bm[np.ix_(permB,permB)].A
Out[1064]:
array([[1, 1, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1]], dtype=int32)
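The whole check can be gathered into one script. This sketch reuses the 8x8 example matrix above; bandwidth is a hypothetical helper name, and it uses the absolute difference |i - j| rather than the one-sided i - j from the session above. Chained indexing Bm[perm, :][:, perm] is used here as an alternative to np.ix_.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csgraph

B = np.array([[1, 0, 0, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0, 1, 0, 1],
              [0, 1, 1, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 1, 0, 1]])
Bm = sparse.csr_matrix(B)

perm = csgraph.reverse_cuthill_mckee(Bm, symmetric_mode=False)

# Apply the same permutation to rows and columns.
permuted = Bm[perm, :][:, perm]

def bandwidth(m):
    """Largest distance of a stored non-zero from the diagonal (hypothetical helper)."""
    i, j = m.nonzero()
    return int(np.max(np.abs(i - j)))

print("bandwidth before:", bandwidth(Bm))
print("bandwidth after: ", bandwidth(permuted))
```

Because rows and columns are permuted together, the result is similar to the original matrix, which is why the eigenvalues are unchanged.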

How does pandas qcut() method select what bins to put extra items in?

I would like to understand how pd.qcut() selects where to put extra items when numItems % binSize != 0. For example, I wrote this code to check how 0-9 items are binned in a decile setting
for i in range(10):
    a = pd.qcut(pd.Series(range(i+10)), 10, labels=False).value_counts().reindex(range(10)).tolist()
    a = [x-1 for x in a]
    print(str(i),'extra:',a)
0 extra: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 extra: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 extra: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
3 extra: [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
4 extra: [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
5 extra: [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
6 extra: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
7 extra: [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
8 extra: [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
9 extra: [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
Of course, this will change as numItems and binSize change. Do you have any insight into how the algorithm selects where to put the extra items? It appears that it tries to balance them in some way.
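One way to investigate is to ask qcut for the bin edges it used. As far as I understand, qcut does not place the extra items explicitly: it computes the edges as interpolated quantiles of the data and then assigns each value to a right-closed interval, so the "extra" items land wherever two integers happen to fall between consecutive edges. A small sketch for the "2 extra" case (12 items, 10 bins):

```python
import pandas as pd

s = pd.Series(range(12))  # 10 bins, 2 "extra" items

# retbins=True also returns the quantile edges that define the bins.
codes, edges = pd.qcut(s, 10, labels=False, retbins=True)

counts = codes.value_counts().sort_index().tolist()
print(edges)    # edges are multiples of 1.1 here (0, 1.1, 2.2, ..., 9.9, 11)
print(counts)   # first and last bins each receive two values
```

Here the first bin covers [0, 1.1] and so holds 0 and 1, and the last covers (9.9, 11] and holds 10 and 11, which matches the "2 extra" row in the output above.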
