Expand a numpy array based on a value in another array - python

I have the following numpy array a = np.array([1,1,2,1,3]) that should be transformed into the following array b = np.array([1,1,1,1,1,1,1,1]).
All the non-1 values in the a array should be expanded in the b array into that many ones. Put simply, the 2 should become 2 ones, and the 3 should become 3 ones.
Frankly, I couldn't find a numpy function that does this, but I'm sure one exists. Any advice would be very welcome! Thank you!

We can simply do -
np.ones(a.sum(),dtype=int)
This will accommodate all numbers, 1s and non-1s alike, because of the summing, and hence gives us the desired output.
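As a quick sanity check on the sample input from the question:
import numpy as np

a = np.array([1,1,2,1,3])
b = np.ones(a.sum(), dtype=int)   # each value contributes that many ones
print(b)                          # [1 1 1 1 1 1 1 1]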

In [71]: np.ones(len(a),int).repeat(a)
Out[71]: array([1, 1, 1, 1, 1, 1, 1, 1])
For this small example it is faster than np.ones(a.sum(), int), but it doesn't scale quite as well. Overall, though, both are fast.

Here's one possible way, based on the numbers you want repeated:
In [12]: a = np.array([1,1,2,1,3])
In [13]: mask = a != 1
In [14]: np.concatenate((a[~mask], np.repeat(1, a[mask].sum())))
Out[14]: array([1, 1, 1, 1, 1, 1, 1, 1])

Related

Creating an Array with an equal number of 0 and 1 in a random order?

Before I explain, here is my code for reference:
import numpy as np
arrayteam = [[3,3,3,3,3,3],[2,2,2,2,2,2],[1,1,1,1,1,1]]
#nteams = 3
#nsubteam = 2
#newnums = np.zeros((len(arrayteam),len(arrayteam[0])))
subteam = []
for i in range(len(arrayteam)):
    subteam.append(np.random.choice([0,1], size=len(arrayteam[i]), p=[0.5, 0.5]))
print(subteam)
Here is the output:
[array([0, 1, 0, 1, 1, 1]), array([1, 1, 0, 1, 0, 0]), array([1, 0, 1, 1, 1, 1])]
As you can see, it randomly chooses 0s and 1s which is what I want, however the number of 0s and 1s is unequal in each array, obviously because I have it as p=0.5, so there's a 50% chance it will choose 0 or 1. I want to have it so that there are 3 zeros and 3 ones in each array, but they occur in a random order. How can I do this?
Also, how can I end up changing exactly where I want the zeros and ones to occur? For example, what if I want the first 3 numbers in the array to be 0s, and the second 3 to be ones? Or what if I want them to alternate?
Use random.shuffle. Note that it shuffles in-place, and returns None:
import random
length = 3
subteam = [0] * length + [1] * length
random.shuffle(subteam)
print(subteam)
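The same idea extends to the per-team loop from the question, and a fixed or alternating layout is just a matter of building the list in the order you want instead of shuffling. A rough sketch (the variable names are only illustrative):
import random
import numpy as np

arrayteam = [[3,3,3,3,3,3],[2,2,2,2,2,2],[1,1,1,1,1,1]]
length = len(arrayteam[0]) // 2   # 3 zeros and 3 ones per team

subteam = []
for row in arrayteam:
    nums = [0] * length + [1] * length   # exactly three 0s and three 1s
    random.shuffle(nums)                 # random order
    subteam.append(np.array(nums))
print(subteam)

# fixed layout: first three 0s, then three 1s (no shuffle at all)
print(np.array([0] * length + [1] * length))   # [0 0 0 1 1 1]

# alternating 0s and 1s
print(np.array([0, 1] * length))               # [0 1 0 1 0 1]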

Selective deletion by value in numpy array

EDITED: Refined problem statement
I am still figuring out the fancy options offered by the numpy library. The following topic landed on my desk:
Purpose:
In a multi-dimensional array I select one column. This slicing works fine. But after that, values stored in another list need to be filtered out of the column values.
Current status:
array1 = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
print(array1)
array1woZero = np.nonzero(array1)
print(array1woZero)
toBeRemoved = []
toBeRemoved.append(1)
print(toBeRemoved)
column = array1[:,1]
result = np.delete(column,toBeRemoved)
The above code does not produce the expected result. In fact, the np.delete() command just removes the value at index 1, but I would need the value 1 to be filtered out instead. What I also do not understand is the shape change when applying nonzero to array1: while array1 is (3,3), array1woZero turns out to be a tuple of 2 arrays with 6 values each.
In the variable explorer it shows up as two int64 arrays of shape (6,):
(array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))
My feeling is that I would require something like slicing with an exclusion operator. Do you have any hints for me to solve that? Is it necessary to use different data structures?
In [18]: arr = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
In [19]: arr
Out[19]:
array([[0, 1, 2],
       [1, 0, 3],
       [2, 3, 0]])
nonzero gives the indices of all non-zero elements of its argument (arr):
In [20]: idx = np.nonzero(arr)
In [21]: idx
Out[21]: (array([0, 0, 1, 1, 2, 2]), array([1, 2, 0, 2, 0, 1]))
This is a tuple of arrays, one per dimension. That output can be confusing, but it is easily used to return all of those non-zero elements:
In [22]: arr[idx]
Out[22]: array([1, 2, 1, 3, 2, 3])
Indexing like this, with a pair of arrays, produces a 1d array. In your example there is just one 0 per row, but in general that's not the case.
This is the same indexing - with 2 lists of the same length:
In [24]: arr[[0,0,1,1,2,2], [1,2,0,2,0,1]]
Out[24]: array([1, 2, 1, 3, 2, 3])
idx[0] just selects one array of that tuple, the row indices. That probably isn't what you want. And I doubt if you want to apply np.delete to that tuple.
It's hard to tell from the description, and code, what you want. Maybe that's because you don't understand what nonzero is producing.
We can also select the nonzero elements with boolean masking:
In [25]: arr>0
Out[25]:
array([[False,  True,  True],
       [ True, False,  True],
       [ True,  True, False]])
In [26]: arr[ arr>0 ]
Out[26]: array([1, 2, 1, 3, 2, 3])
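To get back to the original goal (dropping the values listed in toBeRemoved from the column, rather than dropping by index), the same boolean-masking idea works with np.isin. A minimal sketch, reusing the names from the question:
import numpy as np

array1 = np.asarray([[0,1,2],[1,0,3],[2,3,0]])
toBeRemoved = [1]

column = array1[:, 1]                           # array([1, 0, 3])
result = column[~np.isin(column, toBeRemoved)]  # keep only values not in toBeRemoved
print(result)                                   # [0 3]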
The hint with the boolean masking was very good and helped me to develop my own solution. The symbolic names in the following code snippets are different, but the idea should become clear anyway.
At the beginning, I have my overall searchSpace.
searchSpace = relativeDistances[currentNode,:]
Assume its shape is (5,). My filter is defined on the indices, i.e. range 0..4. Then I define another numpy array, filter, of the same shape filled with 1, and set the values to be filtered out to 0.
filter = np.full(shape=nodeCount, fill_value=1, dtype=np.int32)
filter[0] = 0
filter[3] = 0
searchSpace = searchSpace * filter
minValue = searchSpace[searchSpace > 0].min()
neighborNode = np.where(searchSpace==minValue)
The filter array provides me the flexibility to adjust the filter later on as part of a loop. Using the element-wise multiplication with 0 and subsequent boolean masking, I can create my reduced searchSpace for minimum search. Compared to a separate array or list, I still have the original shape, which is required to get the correct index in the where-statement.
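Since relativeDistances, currentNode and nodeCount are not defined in the snippet above, here is a self-contained sketch of the same idea with made-up values (the names and numbers are only illustrative):
import numpy as np

searchSpace = np.array([0.0, 2.5, 1.2, 3.3, 0.7])   # stand-in for relativeDistances[currentNode,:]
nodeCount = searchSpace.shape[0]

mask = np.full(shape=nodeCount, fill_value=1, dtype=np.int32)
mask[0] = 0   # filter out index 0
mask[3] = 0   # filter out index 3

searchSpace = searchSpace * mask
minValue = searchSpace[searchSpace > 0].min()
neighborNode = np.where(searchSpace == minValue)
print(minValue, neighborNode)   # 0.7 (array([4]),)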

Efficient way: find row where nearly no zero appears in column

I have a problem that has to be solved as efficiently as possible. My current approach kind of works, but is extremely slow.
I have a dataframe with multiple columns; in this case I only care about one of them. It contains positive continuous numbers and some zeros.
My goal is to find the row where nearly no zeros appear in the following rows.
To make clear what I mean I wrote this example to replicate my problem:
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'))
There are some zeros at the beginning, but they get less after some time.
Here comes my unoptimized code to visualize the number of zeros:
zerosum = 0  # counter for all zeros that have appeared so far
for i in range(len(df)):
    if df[0][i] == 0.0:
        df.loc[df.index[i], 'zerosum'] = zerosum
        zerosum += 1
    else:
        df.loc[df.index[i], 'zerosum'] = zerosum
df['zerosum'].plot()
With that unoptimized code I can see the distribution of zeros over time.
My expected output in this example would be the date 01-Jan-2018 08:00, because no zeros appear after that date.
The problem I have when dealing with my real data is that some single zeros can appear later. Therefore I can't just pick the last row that contains a zero. I have to somehow inspect the distribution of zeros and ignore later outliers.
Note: The visualization is not necessary to solve my problem, I just included it to explain my problem as well as possible. Thanks
OK, second go:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'),
columns=['values'])
We create a column that contains the rank of each zero, and 0 where the value is non-zero:
df['zero_idx'] = np.where(df['values']==0,np.cumsum(np.where(df['values']==0,1,0)), 0)
We can use this column to get the location of any zero of any rank. I don't know what your criterion is for calling a zero an outlier, but let's say we want to make sure we are past at least 90% of all zeros...
# Total number of zeros
n_zeros = max(df['zero_idx'])
# Get past at least this percentage
tolerance = 0.9
# The rank of the abovementioned zero
rank_tolerance = math.ceil(tolerance * n_zeros)
df[df['zero_idx']==rank_tolerance].index
Out[44]: DatetimeIndex(['2018-01-01 07:30:00'], dtype='datetime64[ns]', freq='15T')
Okay, if you need to get the index after the last zero occurred, you can try this:
last = 0
for i in range(len(df)):
    if df[0][i] == 0:
        last = i
print(df.iloc[last+1])
or by filtering:
new = df.loc[df[0]==0]
last = df.index.get_loc(new.index[-1])
print(df.iloc[last+1])
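A vectorized variant of the same "row after the last zero" idea, sketched with the df defined in the question:
import numpy as np

zero_positions = np.flatnonzero(df[0].values == 0)
last = zero_positions[-1]      # positional index of the last zero
print(df.iloc[last + 1])       # first row after the last zero (2018-01-01 08:00:00)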
Here is my solution using a filter and cumsum:
df = pd.DataFrame([0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 2, 3, 4, 0, 4, 0, 5, 1, 0, 1, 2, 3, 4,
0, 0, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3, 6, 1, 1, 5, 1, 2, 3, 4, 4, 4, 3, 5, 1, 2, 1, 2, 3, 4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'))
a = df[0] == 0
df['zerosum'] = a.cumsum()
maxval = max(df['zerosum'])
firstdate = df[df['zerosum'] == maxval].index[1]
print(firstdate)
output:
2018-01-01 08:00:00

Perform ID based averaging of one array with IDs from another array - NumPy

I have two numpy arrays
A= np.array([1,1,1,1,0,0,0,0,0,1])
B= np.array([2,2,2,2,32,1,12,124,1,2])
C= #mean of B's elements where A is 1
D= #mean of B's elements where A is 0
How can I do this? I think it's some combination of np.mean and np.ma but I don't understand how you can calculate the mean with a mask?
You can use np.bincount for a generic case when you might be dealing with other such IDs/tags in A, like so -
np.bincount(A,B)/np.bincount(A)
Basically, np.bincount(A,B) gives us the ID based summations of B, where the IDs are from A. Then, we are dividing those summations by the count of each group of IDs to get the average values per ID group.
Sample run -
In [12]: A
Out[12]: array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
In [13]: B
Out[13]: array([ 2, 2, 2, 2, 32, 1, 12, 124, 1, 2])
In [14]: B[A==0].mean() # Using boolean indexing per ID and getting avg
Out[14]: 34.0
In [15]: B[A==1].mean()
Out[15]: 2.0
In [16]: np.bincount(A,B)/np.bincount(A)
Out[16]: array([ 34., 2.])

Product of array elements by group in numpy (Python)

I'm trying to build a function that returns the products of subsets of array elements. Basically I want to build a prod_by_group function that does this:
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
Vprods = prod_by_group(values, groups)
And the resulting Vprods should be:
Vprods
array([6, 4, 30])
There's a great answer here for sums of elements that I think it should be similar to:
https://stackoverflow.com/a/4387453/1085691
I tried taking the log first, then sum_by_group, then exp, but ran into numerical issues.
There are some other similar answers here for min and max of elements by group:
https://stackoverflow.com/a/8623168/1085691
Edit: Thanks for the quick answers! I'm trying them out. I should add that I want it to be as fast as possible (that's the reason I'm trying to get it in numpy in some vectorized way, like the examples I gave).
Edit: I evaluated all the answers given so far, and the best one is given by #seberg below. Here's the full function that I ended up using:
def prod_by_group(values, groups):
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
    return np.multiply.reduceat(values, group_changes)
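Called on the sample data from the question, it returns the expected products:
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
print(prod_by_group(values, groups))   # [ 6  4 30]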
If your groups are already sorted (if they are not, you can sort them with np.argsort), you can do this using the reduceat functionality of ufuncs:
# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)
Or use mgilson's answer if you have few groups. But if you have many groups, this is much more efficient, since you avoid boolean indexing over every element of the original array for every group, and reduceat avoids slicing in a Python loop.
Of course pandas does these operations conveniently.
Edit: Sorry, I had prod in there; the ufunc is multiply. You can use this method for any binary ufunc. This means it works for basically all numpy functions that can work element-wise on two input arrays (i.e. multiply normally multiplies two arrays elementwise, add adds them, maximum/minimum, etc.).
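For instance, swapping in a different ufunc gives per-group sums or maxima with the same group_changes indices (a small sketch on the question's data, assuming groups are already sorted):
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))

print(np.multiply.reduceat(values, group_changes))   # [ 6  4 30]  per-group products
print(np.add.reduceat(values, group_changes))        # [ 6  4 11]  per-group sums
print(np.maximum.reduceat(values, group_changes))    # [3 4 6]     per-group maxima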
First set up a mask for the groups, such that the groups are expanded along another dimension:
mask = (groups == np.unique(groups).reshape(-1, 1))
mask
array([[ True,  True,  True, False, False, False],
       [False, False, False,  True, False, False],
       [False, False, False, False,  True,  True]], dtype=bool)
Now we multiply with values:
mask * values
array([[1, 2, 3, 0, 0, 0],
       [0, 0, 0, 4, 0, 0],
       [0, 0, 0, 0, 5, 6]])
Now you can already take the product along axis 1, except for those zeros, which are easy to fix:
np.prod(np.where(mask * values, mask * values, 1), axis=1)
array([ 6,  4, 30])
As suggested in the comments, you can also use the Pandas module. Using the groupby() function, this task becomes a one-liner:
import numpy as np
import pandas as pd
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
df = pd.DataFrame({'values': values, 'groups': groups})
So df then looks as follows:
   groups  values
0       1       1
1       1       2
2       1       3
3       2       4
4       3       5
5       3       6
Now you can groupby() the groups column and apply numpy's prod() function to each of the groups like this
df.groupby('groups')['values'].apply(np.prod)
which gives you the desired output:
1     6
2     4
3    30
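Equivalently, the groupby object has a built-in prod() method, which avoids the apply call:
df.groupby('groups')['values'].prod()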
Well, I doubt this is a great answer, but it's the best I can come up with:
np.array([np.prod(values[np.flatnonzero(groups == x)]) for x in np.unique(groups)])
It's not a numpy solution, but it's fairly readable (I find that sometimes numpy solutions aren't!):
from functools import reduce
from operator import itemgetter, mul
from itertools import groupby

grouped = groupby(zip(groups, values), itemgetter(0))
groups = [reduce(mul, map(itemgetter(1), vals), 1) for key, vals in grouped]
print(groups)
# [6, 4, 30]
