Getting indices of both zero and nonzero elements in array - python

I need to find the indices of both the zero and nonzero elements of an array.
Put another way, I want to find the complementary indices from numpy.nonzero().
The way that I know to do this is as follows:
indices_zero = numpy.nonzero(array == 0)
indices_nonzero = numpy.nonzero(array != 0)
This however means searching the array twice, which for large arrays is not efficient. Is there an efficient way to do this using numpy?

Assuming you already have the index range numpy.arange(len(array)) available, just compute and store the logical (boolean) indices:
bindices_zero = (array == 0)
then when you actually need the integer indices you can do
indices_zero = numpy.arange(len(array))[bindices_zero]
or
indices_nonzero = numpy.arange(len(array))[~bindices_zero]
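As a rough sketch of the idea above, the comparison against zero can be done once and the resulting boolean mask reused for both index sets. The sample data and the name is_zero are illustrative assumptions, and np.flatnonzero is used here as a one-dimensional shorthand for np.nonzero. Each conversion to integer indices still scans the mask, so this saves the second comparison rather than every pass.

```python
import numpy as np

# Illustrative sample data (not from the question).
array = np.array([3, 0, -1, 0, 7, 0])

# Compare against zero only once and keep the boolean mask.
is_zero = (array == 0)

# Convert the mask and its complement to integer indices on demand.
indices_zero = np.flatnonzero(is_zero)
indices_nonzero = np.flatnonzero(~is_zero)

print(indices_zero.tolist())     # [1, 3, 5]
print(indices_nonzero.tolist())  # [0, 2, 4]
```

If only the values are needed, array[is_zero] and array[~is_zero] work directly without converting to integer indices.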

You can use boolean indexing:
In [82]: a = np.random.randint(-5, 5, 100)
In [83]: a
Out[83]:
array([-2, -1, 4, -3, 1, -2, 2, -1, 2, -1, -3, 3, -3, -4, 1, 2, 1,
3, 3, 0, 1, -3, -4, 3, -5, -1, 3, 2, 3, 0, -5, 4, 3, -5,
-3, 1, -1, 0, -4, 0, 1, -5, -5, -1, 3, -2, -5, -5, 1, 0, -1,
1, 1, -1, -2, -2, 1, 1, -4, -4, 1, -3, -3, -5, 3, 0, -5, -2,
-2, 4, 1, -4, -5, -1, 3, -3, 2, 4, -4, 4, 2, -2, -4, 3, 4,
-2, -4, 2, -4, -1, 0, -3, -1, 2, 3, 1, 1, 2, 1, 4])
In [84]: mask = a != 0
In [85]: a[mask]
Out[85]:
array([-2, -1, 4, -3, 1, -2, 2, -1, 2, -1, -3, 3, -3, -4, 1, 2, 1,
3, 3, 1, -3, -4, 3, -5, -1, 3, 2, 3, -5, 4, 3, -5, -3, 1,
-1, -4, 1, -5, -5, -1, 3, -2, -5, -5, 1, -1, 1, 1, -1, -2, -2,
1, 1, -4, -4, 1, -3, -3, -5, 3, -5, -2, -2, 4, 1, -4, -5, -1,
3, -3, 2, 4, -4, 4, 2, -2, -4, 3, 4, -2, -4, 2, -4, -1, -3,
-1, 2, 3, 1, 1, 2, 1, 4])
In [86]: a[~mask]
Out[86]: array([0, 0, 0, 0, 0, 0, 0])

I'm not sure about a built-in numpy method for accomplishing this, but you could use an old-fashioned for loop, I believe. Something like:
indices_zero = []
indices_nonzero = []
# A single pass over the array; use xrange instead of range on Python 2.
for index in range(len(array)):
    if array[index] == 0:
        indices_zero.append(index)
    else:
        indices_nonzero.append(index)
Something like this should accomplish what you want, by only looping once.

Related

matrix python numpy with positive and negative value

I want to generate a diagonal matrix of size n x n.
This is a Toeplitz matrix; you can use SciPy's linalg.toeplitz to construct such a pattern. You can look at its implementation code here, which uses np.lib.stride_tricks.as_strided under the hood.
>>> from scipy.linalg import toeplitz
>>> toeplitz(-np.arange(3), np.arange(3))
array([[ 0, 1, 2],
[-1, 0, 1],
[-2, -1, 0]])
>>> toeplitz(-np.arange(6), np.arange(6))
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2],
[-4, -3, -2, -1, 0, 1],
[-5, -4, -3, -2, -1, 0]])
It's quite easy to write as a custom function:
def diagonal(N):
    a = np.arange(N)
    return a-a[:,None]
diagonal(3)
array([[ 0, 1, 2],
[-1, 0, 1],
[-2, -1, 0]])
diagonal(6)
array([[ 0, 1, 2, 3, 4, 5],
[-1, 0, 1, 2, 3, 4],
[-2, -1, 0, 1, 2, 3],
[-3, -2, -1, 0, 1, 2],
[-4, -3, -2, -1, 0, 1],
[-5, -4, -3, -2, -1, 0]])
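To make the broadcasting step in the custom function explicit, here is a minimal sketch that names the two intermediate shapes. The variable names row and col are illustrative; the result is the same a - a[:, None] expression as above, where entry (i, j) equals j - i.

```python
import numpy as np

def diagonal(n):
    """Return an n x n matrix whose entry (i, j) equals j - i."""
    a = np.arange(n)
    row = a[None, :]   # shape (1, n): holds the column index j
    col = a[:, None]   # shape (n, 1): holds the row index i
    return row - col   # broadcasts to shape (n, n)

m = diagonal(4)
print(m.shape)        # (4, 4)
print(m[0].tolist())  # [0, 1, 2, 3]
print(m[3].tolist())  # [-3, -2, -1, 0]
```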

Interchanging rows in Numpy produces an embedded array

I'm trying to interchange the rows of np.array A using the following array:
A = np.array([[0,-3,-6,4,9],
[-1,-2,-1,3,1],
[-2,-3,0,3,-1],
[1,4,5,-9,-7]])
When I use the following code:
A = np.array([A[3],A[0],A[1],A[2]])
my array becomes
array([[ 1, 4, 5, -9, -7],
[ 0, -3, -6, 4, 9],
[-1, -2, -1, 3, 1],
[-2, -3, 0, 3, -1]])
like I hoped, wished and dreamed. When I try a broader slice, though (as I would need for larger matrices), it doesn't work quite as well:
A = np.array([A[3], A[0:3]])
A
array([array([-2, -3, 0, 3, -1]),
array([[ 1, 4, 5, -9, -7],
[ 0, -3, -6, 4, 9],
[-1, -2, -1, 3, 1]])], dtype=object)
Why is this happening/how can I correctly perform this slice?
The first expression can be written much more simply as
A = A[[3, 0, 1, 2], :]
The second can therefore be written as
A = A[[3, *range(3)], :]
This is more general than using roll, since you can move an arbitrary row with something like
A = A[[1, *range(1), *range(2, 4)], :]
You could use vstack:
In [5]: np.vstack([A[3], A[0:3]])
Out[5]:
array([[ 1, 4, 5, -9, -7],
[ 0, -3, -6, 4, 9],
[-1, -2, -1, 3, 1],
[-2, -3, 0, 3, -1]])
np.roll as commented is probably the best choice. You could also use np.r_:
A[np.r_[3,0:3]]
Out:
array([[ 1, 4, 5, -9, -7],
[ 0, -3, -6, 4, 9],
[-1, -2, -1, 3, 1],
[-2, -3, 0, 3, -1]])
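The approaches above can be compared side by side. This is only a sketch: move_row_to_front is a hypothetical helper written for illustration, not a NumPy function, and it builds the row order with np.r_ so that any row (not just the last) can be moved to the top.

```python
import numpy as np

A = np.array([[0, -3, -6, 4, 9],
              [-1, -2, -1, 3, 1],
              [-2, -3, 0, 3, -1],
              [1, 4, 5, -9, -7]])

def move_row_to_front(arr, k):
    """Hypothetical helper: return a copy of arr with row k moved to the top."""
    order = np.r_[k, 0:k, k + 1:arr.shape[0]]
    return arr[order]

rolled = np.roll(A, 1, axis=0)        # last row wraps around to the top
reindexed = move_row_to_front(A, 3)   # same result via integer-array indexing

print(np.array_equal(rolled, reindexed))    # True
print(move_row_to_front(A, 1)[0].tolist())  # [-1, -2, -1, 3, 1]
```

np.roll only covers cyclic shifts, while the index-array form handles an arbitrary row.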

Finding the distance to the next higher value in pandas dataframe

I have a data frame containing floating point values
my_df = pd.DataFrame([1,2,1,4,3,2,5,4,7])
For each number, I'm trying to find how many positions I need to move forward until I reach the next number larger than the current one. If there is no larger number, I mark it with some sentinel value (like 999999).
So for the example above, the correct answer should be
result = [1,2,1,3,2,1,2,1,999999]
Currently I've solved it with a very slow double loop using itertuples (meaning O(n^2)).
Is there a smarter way to do it?
Here's a numpy based one leveraging broadcasting:
a = my_df.squeeze().to_numpy() # use my_df.squeeze().values for pandas versions < 0.24.0
diff_mat = a - a[:,None]
result = (np.triu(diff_mat)>0).argmax(1) - np.arange(diff_mat.shape[1])
result[result <= 0] = 99999
print(result)
array([ 1, 2, 1, 3, 2, 1, 2, 1, 99999],
dtype=int64)
Where diff_mat is the distance matrix, and we're looking for the values from the main diagonal onwards, which are greater than 0:
array([[ 0, 1, 0, 3, 2, 1, 4, 3, 6],
[-1, 0, -1, 2, 1, 0, 3, 2, 5],
[ 0, 1, 0, 3, 2, 1, 4, 3, 6],
[-3, -2, -3, 0, -1, -2, 1, 0, 3],
[-2, -1, -2, 1, 0, -1, 2, 1, 4],
[-1, 0, -1, 2, 1, 0, 3, 2, 5],
[-4, -3, -4, -1, -2, -3, 0, -1, 2],
[-3, -2, -3, 0, -1, -2, 1, 0, 3],
[-6, -5, -6, -3, -4, -5, -2, -3, 0]], dtype=int64)
We have np.triu for that:
np.triu(diff_mat)
array([[ 0, 1, 0, 3, 2, 1, 4, 3, 6],
[ 0, 0, -1, 2, 1, 0, 3, 2, 5],
[ 0, 0, 0, 3, 2, 1, 4, 3, 6],
[ 0, 0, 0, 0, -1, -2, 1, 0, 3],
[ 0, 0, 0, 0, 0, -1, 2, 1, 4],
[ 0, 0, 0, 0, 0, 0, 3, 2, 5],
[ 0, 0, 0, 0, 0, 0, 0, -1, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 3],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
And by checking which are greater than 0, and taking the argmax of the boolean ndarray we'll find the first value greater than 0 in each row:
(np.triu(diff_mat)>0).argmax(1)
array([1, 3, 3, 6, 6, 6, 8, 8, 0], dtype=int64)
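We only need to subtract each row's index (the offset of the main diagonal from the beginning of the row) to turn these positions into distances. Rows with no larger value end up with a result <= 0, which is then replaced by the sentinel.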
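The broadcasting approach builds an n x n matrix, so it still needs O(n^2) time and memory. As an alternative that is not part of the answer above, a plain-Python monotonic stack resolves each element once and runs in O(n). The function name and the fill default are assumptions chosen to match the question's example.

```python
import numpy as np

def distance_to_next_larger(values, fill=999999):
    """For each element, count steps forward to the next strictly larger one."""
    result = np.full(len(values), fill, dtype=np.int64)
    stack = []  # indices still waiting for a larger value
    for i, v in enumerate(values):
        # Every waiting index whose value is smaller than v is resolved by i.
        while stack and values[stack[-1]] < v:
            j = stack.pop()
            result[j] = i - j
        stack.append(i)
    return result

print(distance_to_next_larger([1, 2, 1, 4, 3, 2, 5, 4, 7]).tolist())
# [1, 2, 1, 3, 2, 1, 2, 1, 999999]
```

With the question's data frame, the input could be passed as my_df.squeeze().to_numpy().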

Hamming distance (Simhash python) giving out unexpected value

I was checking out Simhash module ( https://github.com/leonsim/simhash ).
I presume that Simhash("String").distance(Simhash("Another string")) is the Hamming distance between the two strings. However, I am not sure I completely understand the get_features(string) method, as shown in (https://leons.im/posts/a-python-implementation-of-simhash-algorithm/).
import re

def get_features(s):
    width = 2
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
Now, when I try to compute distance between "aaaa" and "aaas" using the width 2, it gives out the distance as 0.
from simhash import Simhash
Simhash(get_features("aaas")).distance(Simhash(get_features("aaaa")))
I am not sure what I am missing here.
Dig into code
The width, in your case, is the key parameter in get_features(); it determines how the string is split into features. In your case, get_features() will output:
['aa', 'aa', 'aa']
['aa', 'aa', 'as']
Then Simhash treats these lists as unweighted features (which means the default weight of each feature is 1) and outputs:
86f24ba207a4912
86f24ba207a4912
They are the same!
The reason lies in the simhash algorithm itself. Let's look into the code:
def build_by_features(self, features):
    """
    `features` might be a list of unweighted tokens (a weight of 1
    will be assumed), a list of (token, weight) tuples or
    a token -> weight dict.
    """
    v = [0] * self.f
    masks = [1 << i for i in range(self.f)]
    if isinstance(features, dict):
        features = features.items()
    for f in features:
        if isinstance(f, basestring):
            h = self.hashfunc(f.encode('utf-8'))
            w = 1
        else:
            assert isinstance(f, collections.Iterable)
            h = self.hashfunc(f[0].encode('utf-8'))
            w = f[1]
        for i in range(self.f):
            v[i] += w if h & masks[i] else -w
    ans = 0
    for i in range(self.f):
        if v[i] >= 0:
            ans |= masks[i]
    self.value = ans
from: leonsim/simhash
The calculation process can be divided into 4 steps:
1) hash each split word (feature), to transform the string into binary numbers;
2) weight them;
3) assemble the weighted bits together;
4) convert the assembled numbers into binary and output the result as the value.
Now, in your case, step 3 will output:
[-3, 3, -3, -3, 3, -3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, -3, 3, 3, 3, 3, -3, -3, -3, -3, -3, -3, 3, -3, -3, -3, 3, -3, 3, 3, 3, -3, 3, -3, -3, 3, -3, -3, 3, -3, -3, 3, 3, 3, 3, -3, 3, 3, -3, -3, -3, -3, 3, -3, -3, -3, -3]
[-1, 3, -3, -1, 3, -3, -3, -1, 3, -3, -3, 1, -1, -1, 1, -3, -3, 3, -1, 3, 1, 3, 1, -3, -1, -3, -3, -1, -1, 3, -1, -1, -1, 3, -1, 1, 3, 1, -1, 1, -3, -3, 1, -1, -3, 3, -3, -1, 1, 3, 3, 3, -3, 3, 3, -3, -1, -1, -1, 1, -3, -3, -3, -1]
And after step 4, the two produce the same value, because only the sign of each entry is kept.
Other parameter
If you change the width from 2 to 1, 3, or 4, you will get different results from Simhash(get_features()).
Your case shows the limitation of simhash with short length text.
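The per-bit voting described above can be reproduced with a small standalone sketch. This is not the library's code: toy_simhash is a hypothetical simplified function, it handles only unweighted features, and the md5-based hash is an assumed stand-in for whatever hashfunc the library is configured with. The conclusion does not depend on the hash: two 'aa' votes always outweigh a single 'as' vote on every bit.

```python
import hashlib

def toy_simhash(features, f=64):
    """Simplified fingerprint: unweighted features, md5-based stand-in hash."""
    v = [0] * f
    for feature in features:
        h = int(hashlib.md5(feature.encode('utf-8')).hexdigest(), 16)
        for i in range(f):
            # Each feature votes +1 or -1 on every bit position.
            v[i] += 1 if h & (1 << i) else -1
    # Keep only the sign of each per-bit vote total.
    return sum(1 << i for i in range(f) if v[i] >= 0)

same = toy_simhash(['aa', 'aa', 'aa']) == toy_simhash(['aa', 'aa', 'as'])
print(same)  # True
```

With longer texts the features are more varied, so a single differing feature is less likely to be outvoted on every bit.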

Defining a multidimensional field with nonstandard domain

I have an array a in Python, let's say a=np.array([3, 4]), and would like to define an ndarray (or something like that) of type [-3:3, -4:4], in other words, a collection x of real numbers x[-3,-4], x[-3,-3],...,x[3,4], the i'th coordinate ranging over integers between -a[i] and a[i]. If the array length is given (2 in this example), I could use
np.mgrid[-a[0]:a[0]:1.0,-a[1]:a[1]:1.0][0].
But what should I do if the length of a is unknown?
You could generate a list of ranges with
[np.arange(-x,x+1) for x in a]
I'd have to play around with mgrid, or another function in index_tricks, to figure out how to use it. I may need to make it a tuple or pass it with a *.
mgrid wants slices, so this would replicate your first call
In [60]: np.mgrid[[slice(-x,x+1) for x in [3,4]]]
Out[60]:
array([[[-3, -3, -3, -3, -3, -3, -3, -3, -3],
[-2, -2, -2, -2, -2, -2, -2, -2, -2],
[-1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 2, 2, 2, 2, 2, 2, 2, 2, 2],
[ 3, 3, 3, 3, 3, 3, 3, 3, 3]],
[[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4]]])
which of course can be generalized to use a.
My initial arange approach works with meshgrid (producing a list of arrays):
In [71]: np.meshgrid(*[np.arange(-x,x+1) for x in [3,4]],indexing='ij')
Out[71]:
[array([[-3, -3, -3, -3, -3, -3, -3, -3, -3],
[-2, -2, -2, -2, -2, -2, -2, -2, -2],
...
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4],
[-4, -3, -2, -1, 0, 1, 2, 3, 4]])]
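Generalizing both calls to an array a of any length might look like the sketch below. It assumes a holds non-negative integer half-widths, and it passes a tuple of slices to np.mgrid rather than a list.

```python
import numpy as np

a = np.array([3, 4])   # half-widths per axis; any length works

# One slice per axis, passed to np.mgrid as a tuple.
grid = np.mgrid[tuple(slice(-x, x + 1) for x in a)]
print(grid.shape)      # (2, 7, 9)

# Equivalent coordinates from meshgrid, returned as one array per axis.
axes = np.meshgrid(*[np.arange(-x, x + 1) for x in a], indexing='ij')
print(all(np.array_equal(g, m) for g, m in zip(grid, axes)))  # True
```

The first axis of grid has length len(a), and each remaining axis i has length 2 * a[i] + 1.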
