Odd behavior of np.argsort with Pandas

Here is np.argsort applied in four different ways:
import numpy as np
import pandas as pd

print(np.argsort([1, np.nan, 3, np.nan, 4]))
print(np.argsort(pd.DataFrame([[1, np.nan, 3, np.nan, 4]])).values)
print(np.argsort(pd.Series([1, np.nan, 3, np.nan, 4]).values))  # same as the first
print(np.argsort(pd.Series([1, np.nan, 3, np.nan, 4])).values)
Output:
[0 2 4 1 3]
[[0 2 4 1 3]]
[0 2 4 1 3]
[ 0 -1 1 -1 2]
This is very unexpected behavior. There is no mention of it in the NumPy documentation (which, of course, does not cover Pandas).
In the Pandas documentation you can find:
Returns: Series[np.intp]
Positions of values within the sort order with -1 indicating nan values.
Why? In what situation would we want this kind of behavior?
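(Not part of the original question, but a minimal sketch of a workaround:) if you want the plain NumPy semantics from a Series, you can apply np.argsort to the underlying array instead of the Series itself:
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, np.nan, 4])

# Operating on the underlying ndarray bypasses pandas' special NaN handling,
# so NaNs are sorted to the end instead of being marked with -1.
print(np.argsort(s.to_numpy()))  # [0 2 4 1 3]
print(s.argsort().values)        # [ 0 -1  1 -1  2]  (pandas' own argsort)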

Problem with understanding how np.argpartition works

I have a problem with the execution of np.argpartition.
I have an ndarray:
import numpy as np

example = np.array([[5,6,7,3,4],[1,2,3,7,5],[6,7,4,2,3],[1,2,3,5,9],[2,3,6,1,2]])
out: [[5 6 7 3 4]
[1 2 3 7 5]
[6 7 4 2 3]
[1 2 3 5 9]
[2 3 6 1 2]]
I can get the indices of the sorted array with np.argsort:
print(np.argsort(example))
out:
[[3 4 0 1 2]
[0 1 2 4 3]
[3 4 2 0 1]
[0 1 2 3 4]
[3 0 4 1 2]]
I want to avoid a full np.argsort to save some execution time, because I only need the 3 smallest elements in each row of this array. I use np.argpartition to do it:
print(np.argpartition(example, 3, axis=1))
out: [[3 4 0 1 2]
[1 0 2 4 3]
[3 4 2 0 1]
[1 0 2 3 4]
[3 4 0 1 2]]
I expect the first three indices of each row to match the indices in the sorted array, but this is not the case. I don't understand what I did wrong.
np.argpartition(example, k, axis=1) does not return the indices of a fully sorted row for the first k elements. It only guarantees that the element at position k (the (k+1)-th element) ends up in its sorted position, with smaller elements somewhere before it and larger elements somewhere after it. If you look at your output, only the 4th element matches argsort().
If you want the first three elements sorted, you have to pass a list for the kth parameter:
index_array = np.argpartition(example, [0, 1, 2], axis=1)
print(np.take_along_axis(example, index_array, axis=1))  # the first 3 columns now hold the 3 smallest values, in sorted order
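A possible refinement (a sketch, not from the original answer): partition once with a scalar kth and then sort only the three partitioned columns, which may be cheaper on wide rows than a list-valued kth:
import numpy as np

example = np.array([[5, 6, 7, 3, 4],
                    [1, 2, 3, 7, 5],
                    [6, 7, 4, 2, 3],
                    [1, 2, 3, 5, 9],
                    [2, 3, 6, 1, 2]])

k = 3
# After argpartition with kth=k, the first k columns contain the k smallest
# values of each row (in arbitrary order); sort just those k columns.
part_idx = np.argpartition(example, k, axis=1)[:, :k]
part_val = np.take_along_axis(example, part_idx, axis=1)
order = np.argsort(part_val, axis=1)
smallest_idx = np.take_along_axis(part_idx, order, axis=1)
print(np.take_along_axis(example, smallest_idx, axis=1))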

Numpy: value substitution according to neighbours

I need to change the value of items in a numpy array based on the values of their neighbours.
More specifically, suppose that each item in a numpy array representing an image can take only 3 possible values, and that my array is the following:
[[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,1,1,1,1],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3]
]
What I want is:
Since the group of contiguous items containing the value 2 in this example is smaller than a (3 x 3) block, I need to assign those items the value of the neighbouring items: in this case 1.
The resulting array has to be (the two 2 values replaced by 1):
[[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3]
]
What I would like is for the 'spurious' elements (here, only two cells containing the value 2 in an area dominated by the value 1) to be eliminated and made uniform with the area in which they appear. I hope I have explained it clearly. Thanks for any information you can give me.
In image processing these operations are called morphological filtering. In your case you can use an opening.
import numpy as np
from skimage.morphology import opening, square
a = np.array(
[[1,1,1,1,1,1,1],
[1,1,1,1,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,2,1,1,1],
[1,1,1,1,1,1,1],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3],
[3,3,3,3,3,3,3]
])
opening(a, square(3))
Out:
[[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[1 1 1 1 1 1 1]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]
[3 3 3 3 3 3 3]]
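A different way to express the same idea (a sketch, not from the original answer, assuming scipy is available and the array holds small non-negative integer labels): label connected regions with scipy.ndimage.label and overwrite any region smaller than 3 x 3 = 9 cells with the most common value among its neighbours:
import numpy as np
from scipy import ndimage

def replace_small_regions(a, min_size=9):
    out = a.copy()
    for value in np.unique(a):
        labeled, n_regions = ndimage.label(a == value)
        for region in range(1, n_regions + 1):
            mask = labeled == region
            if mask.sum() < min_size:
                # One-cell ring around the region: take the most frequent
                # neighbouring value and use it to overwrite the region.
                ring = ndimage.binary_dilation(mask) & ~mask
                out[mask] = np.bincount(out[ring]).argmax()
    return out

print(replace_small_regions(a))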

Assign large value to np.array

I am trying to replace all the 0 values inside the array with 1.0/875713. But my code does not work, so I am wondering: is this due to a type limitation, and how do I solve the problem?
import numpy as np

value = 1.0/875713
print(value)
arr = np.array([1,2,3,0,3,0,0,0,2,3,4,5])
arr[arr == 0] = value
print(arr)
1.14192663578e-06
[1 2 3 0 3 0 0 0 2 3 4 5]
Expected result:
[1 2 3 1.14192663578e-06 3 1.14192663578e-06 1.14192663578e-06 1.14192663578e-06 2 3 4 5]
A NumPy array has a dtype. You can learn more in the docs.
In your code, if you check arr.dtype, the result will be dtype('int32') (or dtype('int64'), depending on your platform), so assigning the small float value truncates it back to 0.
To reach your goal, run arr = arr.astype('float32') (or 'float64' for full precision) before running arr[arr == 0] = value; then you will get the expected output.
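A minimal sketch of that fix (float64 chosen here only to keep the full precision of the printed value):
import numpy as np

value = 1.0 / 875713
arr = np.array([1, 2, 3, 0, 3, 0, 0, 0, 2, 3, 4, 5])

arr = arr.astype('float64')   # an integer array cannot hold 1.14e-06, so cast first
arr[arr == 0] = value
print(arr)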

How to make fcluster to return the same output as cut_tree?

There are a couple of questions related to this; the most relevant, I think, is this question.
Let's say I have a dataset like this (highly simplified for demonstration purposes):
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.cluster import hierarchy
val = np.array([[0.20288834, 0.80406494, 4.59921579, 14.28184739],
[0.22477082, 1.43444223, 6.87992605, 12.90299896],
[0.22811485, 0.74509454, 3.85198421, 19.22564266],
[0.20374529, 0.73680174, 3.63178517, 17.82544951],
[0.22722696, 0.86113728, 3.00832186, 16.62306058],
[0.25577882, 0.85671779, 3.70655719, 17.49690061],
[0.23018219, 0.68039151, 2.50815837, 15.09039053],
[0.21638751, 1.12455083, 3.56246872, 18.82866991],
[0.26600895, 1.09415595, 2.85300018, 17.93139433],
[0.22369445, 0.73689845, 3.24919113, 18.60914745]])
df = pd.DataFrame(val, columns=["C{}".format(i) for i in range(val.shape[1])])
C0 C1 C2 C3
0 0.202888 0.804065 4.599216 14.281847
1 0.224771 1.434442 6.879926 12.902999
2 0.228115 0.745095 3.851984 19.225643
3 0.203745 0.736802 3.631785 17.825450
4 0.227227 0.861137 3.008322 16.623061
5 0.255779 0.856718 3.706557 17.496901
6 0.230182 0.680392 2.508158 15.090391
7 0.216388 1.124551 3.562469 18.828670
8 0.266009 1.094156 2.853000 17.931394
9 0.223694 0.736898 3.249191 18.609147
I want to cluster the columns of this dataframe while also specifying the number of clusters I obtain. Typically, this can be achieved with the cut_tree function.
However, cut_tree is currently broken, so I looked for alternatives, which led me to the link at the beginning of this post, where fcluster is suggested as an alternative.
The problem is that I don't see how to specify an exact number of clusters; I can only specify a maximum number using the maxclust criterion.
So for my simple example from above I can do:
# number of target clusters
n_clusters = range(1, 5)
for n_clust in n_clusters:
    Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
    print("--------\nValues from fcluster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
    print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
which prints
Values from fcluster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
--------
Values from fcluster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
--------
Values from fcluster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
--------
Values from fcluster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
As one can see, fcluster returns at most 2 distinct clusters, while cut_tree returns the desired number.
Is there a way to get the same output for fcluster for the time until the bug in cut_tree is fixed? If not, is there any other good alternative for this in another package?
Not sure how to get the right number of clusters out of fcluster here.
As an alternative, scikit-learn has AgglomerativeClustering:
from sklearn.cluster import AgglomerativeClustering
# number of target clusters
n_clusters = range(1, 5)
for n_clust in n_clusters:
    Z = hierarchy.linkage(distance.pdist(df.T.values), method='average', metric='euclidean')
    print("--------\nValues from fcluster:\n{}".format(hierarchy.fcluster(Z, n_clust, criterion='maxclust')))
    print("\nValues from cut_tree:\n{}".format(hierarchy.cut_tree(Z, n_clust).T))
    print("\nValues from AgglomerativeClustering:\n{}".format(
        AgglomerativeClustering(n_clusters=n_clust, affinity='euclidean', linkage='average').fit(df.T.values).labels_))
which returns the right number of clusters for the provided dataset (although with the labels in a different order):
Values from fcluster:
[1 1 1 1]
Values from cut_tree:
[[0 0 0 0]]
Values from AgglomerativeClustering:
[0 0 0 0]
--------
Values from fcluster:
[1 1 1 2]
Values from cut_tree:
[[0 0 0 1]]
Values from AgglomerativeClustering:
[0 0 0 1]
--------
Values from fcluster:
[1 1 1 2]
Values from cut_tree:
[[0 0 1 2]]
Values from AgglomerativeClustering:
[0 0 2 1]
--------
Values from fcluster:
[1 1 1 2]
Values from cut_tree:
[[0 1 2 3]]
Values from AgglomerativeClustering:
[3 1 2 0]
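If you need the labels to follow the same 0, 1, 2, ... numbering as the cut_tree output (not part of the original answer, just a sketch using the df from the question above): renumbering the labels by order of first appearance, e.g. with pd.factorize, does that:
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# metric defaults to euclidean, matching the usage above
labels = AgglomerativeClustering(n_clusters=4, linkage='average').fit(df.T.values).labels_
# pd.factorize assigns codes in order of first appearance,
# turning e.g. [3 1 2 0] into [0 1 2 3].
relabeled = pd.factorize(labels)[0]
print(relabeled)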

create matrix as having a subset of columns from another matrix

I need to get a new matrix generated by selecting a subset of columns from another matrix, given a list (or tuple) of column indices.
The following is the code I am working on (it does a bit more than just attempt to create a new matrix, but the context might be useful):
A = matrix(QQ, [
    [2, 1, 4, -1, 2],
    [1, -1, 5, 1, 1],
    [-1, 2, -7, 0, 1],
    [2, -1, 8, -1, 2]
])
print("A")
print(A)
print("A rref")
print(A.rref())
p = A.pivots()
print("A pivots", p)
with the following output:
A
[ 2 1 4 -1 2]
[ 1 -1 5 1 1]
[-1 2 -7 0 1]
[ 2 -1 8 -1 2]
A rref
[ 1 0 3 0 0]
[ 0 1 -2 0 0]
[ 0 0 0 1 0]
[ 0 0 0 0 1]
A pivots (0, 1, 3, 4)
Now, I expected to easily find a method on matrix objects that constructs a new matrix from a subset of columns simply by passing the tuple p as a parameter, but I could not find anything like that.
Any ideas on how to solve this elegantly in a Sage-friendly way (avoiding for loops and excess code)?
thanks!
You can use the matrix_from_columns method: A.matrix_from_columns(p).
I just found how to do this in the easiest and most concise way:
A[:,p]
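For the matrix above, both forms pick out the pivot columns 0, 1, 3 and 4, so a quick sanity check looks like this:
print(A.matrix_from_columns(p))
print(A[:, p])
# both print:
# [ 2  1 -1  2]
# [ 1 -1  1  1]
# [-1  2  0  1]
# [ 2 -1 -1  2]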
