R's order equivalent in Python

Any ideas what the Python equivalent of R's order is?
order(c(10,2,-1, 20), decreasing = F)
# 3 2 1 4

In numpy there is a function named argsort
import numpy as np
lst = [10,2,-1,20]
np.argsort(lst)
# array([2, 1, 0, 3])
Note that Python indices start at 0, while R indices start at 1.
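If you want output that matches R's 1-based result exactly, a minimal adjustment is to add 1 to the argsort result:
np.argsort(lst) + 1
# array([3, 2, 1, 4])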

It is numpy.argsort()
import numpy
a = numpy.array([10,2,-1, 20])
a.argsort()
# array([2, 1, 0, 3])
and if you want the equivalent of the decreasing = T option, you can try
(-a).argsort()
# array([3, 0, 1, 2])
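An alternative, assuming you don't care how ties between equal values are ordered, is simply to reverse the ascending result:
a.argsort()[::-1]
# array([3, 0, 1, 2])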

vaex: filter a dataframe using a mask from another series

I want to use a mask from series x to filter a vaex dataframe y.
I know how to do this in pandas and numpy. In pandas it's like:
import pandas as pd
a = [0,0,0,1,1,1,0,0,0]
b = [4,5,7,8,9,9,0,6,4]
x = pd.Series(a)
y = pd.Series(b)
print(y[x==1])
The result is like:
3 8
4 9
5 9
dtype: int64
But in vaex, the following code doesn't work.
import vaex
import numpy as np
a = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])
b = np.array([4, 5, 7, 8, 9, 9, 0, 6, 4])
x = vaex.from_arrays(x=a)
y = vaex.from_arrays(x=b)
print(y[x.x == 1].values)
The result is empty:
[]
It seems that vaex doesn't have the same index concept as pandas and numpy. Although the two dataframes have the same shape, dataframe y can't be filtered with the mask x.x == 1.
Is there a way to achieve the same result as in pandas?
Thanks
While Vaex has an API similar to that of Pandas (similarly named methods that do the same thing), the implementations of the two libraries are completely different, so it is not easy to "mix and match".
In order to work with any kind of data, that data needs to be part of the same Vaex dataframe.
So to achieve what you want, something like this is possible:
import vaex
import numpy as np
a = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])
b = np.array([4, 5, 7, 8, 9, 9, 0, 6, 4])
y = vaex.from_arrays(x1=b)
y.add_column(name='x2', f_or_array=a)
print(y[y.x2 == 1])
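To get just the filtered values as a numpy array, mirroring the pandas output, you can pull out the column's values (assuming the filtered column behaves as above):
print(y[y.x2 == 1].x1.values)
# [8 9 9]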

numpy bincount sequential slices of array

Given a numpy array containing numbers from range(n),
I want to apply the following transformation:
[1 0 1 2] --> [[0 1 0] [1 1 0] [1 2 0] [1 2 1]]
We go through the input array and bincount all elements up to and including the current one.
import numpy as np
n = 3
a = np.array([1, 0, 1, 2])
out = []
for i in range(a.shape[0]):
    out.append(np.bincount(a[:i+1], minlength=n))
out = np.array(out)
Is there any way to speed this up? I'm wondering if it's possible to get rid of that loop completely and use matrix magic only.
EDIT:
Thanks, lbragile, for mentioning list comprehensions, but that's not what I meant (I'm not sure it's even significant asymptotically). I was thinking about something more complex, such as rewriting this based on how the bincount operation works under the hood.
You can use cumsum: indexing into an identity matrix one-hot encodes each element, and a cumulative sum along axis 0 then accumulates the counts:
idx = [1,0,1,2]
np.identity(np.max(idx)+1,int)[idx].cumsum(0)
# array([[0, 1, 0],
# [1, 1, 0],
# [1, 2, 0],
# [1, 2, 1]])
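Note that np.max(idx)+1 only equals n when the largest possible value actually occurs in the input; to honor minlength=n as in the original code, a small variation is to index into np.eye(n) instead:
np.eye(n, dtype=int)[a].cumsum(0)
# same output as above for a = np.array([1, 0, 1, 2]), n = 3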
Using a list comprehension:
fast_out = [np.bincount(a[:i+1], minlength=n) for i in range(a.shape[0])]
print(fast_out)
Output:
[array([0, 1, 0]), array([1, 1, 0]), array([1, 2, 0]), array([1, 2, 1])]
To time the code use the following:
import timeit
def timer(code_to_test):
    elapsed_time = timeit.timeit(code_to_test, number=100)/100
    print(elapsed_time)
your_code = """
import numpy as np
n = 3
a = np.array([1, 0, 1, 2])
out = []
for i in range(a.shape[0]):
    out.append(np.bincount(a[:i+1], minlength=n))
out = np.array(out)
"""
list_comp_code = """
import numpy as np
n = 3
a = np.array([1, 0, 1, 2])
fast_out = [np.bincount(a[:i+1], minlength=n) for i in range(a.shape[0])]
"""
timer(your_code) # 0.001330663086846471
timer(list_comp_code) # 1.4601880684494972e-05
So the list comprehension method is about 91 times faster when averaged over 100 trials (note, though, that the list-comprehension snippet omits the final np.array(out) conversion, so the comparison slightly favors it).

Finding the indexed location of values in an unsorted numpy array from data in another unsorted numpy array [duplicate]

I have a numpy array A which contains unique IDs that can be in any order - e.g. A = [1, 3, 2]. I have a second numpy array B, which is a record of when the ID is used - e.g. B = [3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1]. Array B is always much longer than array A.
I need to find the indexed location of the ID in A for each time the ID is used in B. So in the example above my returned result would be: result = [1, 1, 0, 1, 2, 0, 2, 1, 0, 0, 2, 1, 1, 0].
I've already written a simple solution that gets the correct result using a for loop to append the result to a new list and using numpy.where, but I can't figure out the correct syntax to vectorize this.
import numpy as np
A = np.array([1, 3, 2])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])
IdIndxs = []
for ID in B:
    IdIndxs.append(np.where(A == ID)[0][0])
IdIndxs = np.array(IdIndxs)
Can someone come up with a simple vector-based solution that runs quickly? The for loop becomes very slow on a typical problem, where A has 10K-100K elements and B is some multiple, usually 5-10x, larger than A.
I'm sure the solution is simple, but I just can't see it today.
You can use np.searchsorted with a sorter: argsort gives the order that sorts A, searchsorted locates each element of B within that sorted order, and indexing back through sorted_keys recovers positions in the original, unsorted A:
import numpy as np
# test data
A = np.array([1, 3, 2])
B = np.array([3, 3, 1, 3, 2, 1, 2, 3, 1, 1, 2, 3, 3, 1])
# get indexes
sorted_keys = np.argsort(A)
indexes = sorted_keys[np.searchsorted(A, B, sorter=sorted_keys)]
Output:
[1 1 0 1 2 0 2 1 0 0 2 1 1 0]
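If, as in the example, the IDs in A are small non-negative integers, another option (a sketch under that assumption, not part of the answer above) is an explicit inverse lookup table:
lookup = np.empty(A.max() + 1, dtype=int)
lookup[A] = np.arange(len(A))
print(lookup[B])
# [1 1 0 1 2 0 2 1 0 0 2 1 1 0]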
The numpy-indexed library (disclaimer: I am its author) was designed to provide these types of vectorized operations where numpy for some reason does not. Frankly, given how useful this vectorized equivalent of list.index is, it ought to be in numpy itself; but numpy is a slow-moving project that takes backwards compatibility very seriously, and I don't think we will see this until numpy 2.0. Until then, numpy-indexed is pip- and conda-installable with the same ease.
import numpy_indexed as npi
idx = npi.indices(A, B)
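This should give the same result as the searchsorted approach above:
print(idx)
# [1 1 0 1 2 0 2 1 0 0 2 1 1 0]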
Reworking your logic, but using a list comprehension and numpy.fromiter, which should boost performance:
IdIndxs = np.fromiter([np.where(A == i)[0][0] for i in B], B.dtype)
About performance
I've done a quick test comparing fromiter with your solution, and I do not see such a boost in performance. Even using a B array of millions of elements, the two are of the same order.

Calculate the sum of every 5 elements in a python array

I have a python array in which I want to calculate the sum of every 5 elements. In my case I have the array c with ten elements. (In reality it has a lot more elements.)
c = [1, 0, 0, 0, 0, 2, 0, 0, 0, 0]
So finally I would like to have a new array (c_new) containing the sum of the first 5 elements and the sum of the second 5 elements.
So the result should be:
1+0+0+0+0 = 1
2+0+0+0+0 = 2
c_new = [1, 2]
Thank you for your help
Markus
You can use np.add.reduceat by passing indices where you want to split and sum:
import numpy as np
c = [1, 0, 0, 0, 0, 2, 0, 0, 0, 0]
np.add.reduceat(c, np.arange(0, len(c), 5))
# array([1, 2])
Here's one way of doing it:
c = [1, 0, 0, 0, 0, 2, 0, 0, 0, 0]
print([sum(c[i:i+5]) for i in range(0, len(c), 5)])
Result:
[1, 2]
If five divides the length of your vector and it is contiguous, then
np.reshape(c, (-1, 5)).sum(axis=-1)
It also works if it is non-contiguous, but then it is typically less efficient.
Benchmark:
from timeit import timeit

def aredat():
    return np.add.reduceat(c, np.arange(0, len(c), 5))

def reshp():
    return np.reshape(c, (-1, 5)).sum(axis=-1)

c = np.random.random(10_000_000)
timeit(aredat, number=100)
# 3.8516048429883085
timeit(reshp, number=100)
# 3.09542763303034
So where possible, reshaping seems a bit faster; reduceat has the advantage of gracefully handling vectors whose length is not a multiple of five.
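If the length is not a multiple of five and you still want the reshape approach, one possible workaround (a sketch, not from the answers above) is to sum the full blocks and append the remainder:
n_full = len(c) // 5 * 5
sums = np.reshape(c[:n_full], (-1, 5)).sum(axis=-1)
if n_full < len(c):
    sums = np.append(sums, np.sum(c[n_full:]))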
Why don't you use this?
np.array([np.sum(chunk) for chunk in np.reshape(c, (len(c) // 5, 5))])
There are various ways to achieve this. Below are two options using numpy built-in methods.
Option 1
numpy.sum and numpy.ndarray.reshape, as follows:
c_sum = np.sum(np.array(c).reshape(-1, 5), axis=1)
[Out]: array([1, 2])
Option 2
Using numpy.vectorize, a custom lambda function, and numpy.arange, as follows:
c_sum = np.vectorize(lambda x: sum(c[x:x+5]))(np.arange(0, len(c), 5))
[Out]: array([1, 2])

Iterate over numpy.ma array, ignoring masked values

I would like to iterate over only the unmasked values in an np.ma.ndarray.
With the following:
import numpy as np
a = np.ma.array([1, 2, 3], mask = [0, 1, 0])
for i in a:
    print(i)
I get:
1
--
3
I would like to get the following:
1
3
It seems like np.nditer() may be the way to go, but I can't find any flags that would specify this. How might I do this? Thanks!
You want to use a.compressed():
import numpy as np
a = np.ma.array([1, 2, 3], mask = [0, 1, 0])
for i in a.compressed():
    print(i)
which gives:
1
3
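An equivalent, if you prefer boolean indexing and the mask is set as above, is to index with the negated mask:
for i in a[~a.mask]:
    print(i)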
