Differences in an array based on groups defined by another array - python

I have two arrays of the same size. One, call it A, contains a series of repeated numbers; the other, B contains random numbers.
import numpy as np
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
I need to find the differences in B between the two extremes defined by the groups in A. More specifically, I need an output C such as
C = [2, -2, 2, 1]
where each term is the difference 3 - 1, 4 - 6, 9 - 7, and 11 - 10, i.e., the difference between the extremes in B identified by the groups of repeated numbers in A.
I tried to play around with itertools.groupby to isolate the groups in the first array, but it is not clear to me how to exploit the indexing to operate the differences in the second.

Edit: C is now sorted the same way as in the question
C = []
_, idx = np.unique(A, return_index=True)
for i in A[np.sort(idx)]:
bs = B[A==i]
C.append(bs[-1] - bs[0])
print(C) // [2, -2, 2, 1]
np.unique returns, for each unique value in A, the index of the first appearance of it.
i in A[np.sort(idx)] iterates over the unique values in the order of the indexes.
B[A==i] extracts the values from B at the same indexes as those values in A.

This is easily achieved using pandas' groupby:
A = np.array([1,1,1,2,2,2,0,0,0,3,3])
B = np.array([1,2,3,6,5,4,7,8,9,10,11])
import pandas as pd
pd.Series(B).groupby(A, sort=False).agg(lambda g: g.iloc[-1]-g.iloc[0]).to_numpy()
output: array([ 2, -2, 2, 1])
using itertools.groupby:
from itertools import groupby
[(x:=list(g))[-1][1]-x[0][1] for k, g in groupby(zip(A,B), lambda x: x[0])]
output: [2, -2, 2, 1]
NB. Note that the two solutions will behave differently if there are different non-consecutive groups

Related

Monte Carlo Simulation with multiple distributions in each loop

I have an array of NaNs 10 columns wide and 5 rows long.
I have a 5x3 array of poisson random number generations. This represents 5 runs of each A, B, and C, where each column has a different lambda value for the poisson distribution.
A B C
[1, 1, 2,
1, 2, 2,
2, 1, 4,
1, 2, 3,
0, 1, 2]
Each row represents the number of events. That is, the first row would produce one event of type A, one event of type B, and two events of type C.
I would like to loop through each row and produce a set of uniform random numbers. For A, it would between 1 and 100, for B it would be between 101 and 200, and for C it would be between 201 and 300.
The output of the first row would have four numbers, one number between 1 and 100, one number between 101 and 200, and two numbers between 201 and 300. So a sample output of the first row might be:
[34, 105, 287, 221]
The second output row would have five numbers in it, the third row would have seven, etc. I would like to store it in my array of NaNs by overwriting the NaNs that get replaced in each row. Can anyone please help with this? Thanks!
I've got a rather inefficient/unvectorised method which may or may not be what you're looking for, because one part of your question is unclear to me. Do you want the final array to have rows of different sizes, or to be the same size but padded with nans?
This solution assumes padding with nans, since you talked about the nans being overwritten and didn't mention the extra/unused nans being deleted. I'm also assuming that your ABC thing is structured into a numpy array of size (5,3), and I'm calling the array of nans results_arr.
import numpy as np
from random import randint
# Initializing the arrays
results_arr = np.full((5,10), np.nan)
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
# Loops through each row in ABC
for row_idx in range(len(abc)):
a, b, c = abc[row_idx]
# Here, I'm getting a number in the specified uniform distribution as many times as is specified in the A column. The other 2 loops do the same for the B and C columns.
for i in range(0, a):
results_arr[row_idx, i] = randint(1, 100)
for j in range(a, a+b):
results_arr[row_idx, j] = randint(101, 200)
for k in range(a+b, a+b+c):
results_arr[row_idx, k] = randint(201, 300)
Hope that helps!
P.S. Here's a solution with uneven rows. The result is stored in a list of lists because numpy doesn't support ragged arrays (i.e. rows of different lengths).
import numpy as np
from random import randint
# Initializations
results_arr = []
abc = np.array([[1, 1, 2], [1, 2, 2], [2, 1, 4], [1, 2, 3], [0, 1, 2]])
# Same code logic as before, just storing the results differently
for row_idx in range(len(abc)):
a, b, c = abc[row_idx]
results_this_row = []
for i in range(0, a):
results_this_row.append(randint(1, 100))
for j in range(a, a+b):
results_this_row.append(randint(101, 200))
for k in range(a+b, a+b+c):
results_this_row.append(randint(201, 300))
results_arr.append(results_this_row)
I hope these two solutions cover what you're looking for!

Tensor reduction based off index vector

As an example, I have 2 tensors: A = [1;2;3;4;5;6;7] and B = [2;3;2]. The idea is that I want to reduce A based off B - such that B's values represent how to sum A's values- such that B = [2;3;2] means the reduced A shall be the sum of the first 2 values, next 3, and last 2: A' = [(1+2);(3+4+5);(6+7)]. It is apparent that the sum of B shall always be equal to the length of A. I'm trying to do this as efficiently as possible - preferably specific functions or matrix operations contained within pytorch/python. Thanks!
Here is the solution.
First, we create an array of indices B_idx with the same size of A.
Then, accumulate (add) all elements in A based on the indices B_idx using index_add_.
A = torch.arange(1, 8)
B = torch.tensor([2, 3, 2])
B_idx = [idx.repeat(times) for idx, times in zip(torch.arange(len(B)), B)]
B_idx = torch.cat(B_idx) # tensor([0, 0, 1, 1, 1, 2, 2])
A_sum = torch.zeros_like(B)
A_sum.index_add_(dim=0, index=B_idx, source=A)
print(A_sum) # tensor([ 3, 12, 13])

Row-wise difference in two list in pandas

I am using pandas to incrementally find out new elements i.e. for every row, I'd see whether values in list have been seen before. If they are, we will ignore them. If not, we will select them.
I was able to do this using row.iterrows(), but I have >1M rows, so I believe vectorized apply might be better.
Here's sample data and code. Once you run this code, you will get expected output:
from numpy import nan as NA
import collections
df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index()
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)
def flatten(l):
for el in l:
if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
yield from flatten(el)
else:
yield el
Changed_df["Elements_s"]=Changed_df['Elements'].shift()
#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []
#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])
for idx,row in Changed_df.iterrows():
diff = list(set(row['Elements']) - set(Elements_so_far))
Changed_df.at[idx, "Diff"] = diff
Elements_so_far.append(row['Elements'])
Elements_so_far = flatten(Elements_so_far)
Elements_so_far = list(set(Elements_so_far)) #keep unique elements
Changed_df.loc[idx,"count"]=diff.__len__()
Commentary about the code:
I am not a fan of this code because it's clunky and inefficient.
I am saying inefficient because I have created Elements_s which holds shifted values. Another reason for inefficiency is for loop through rows.
Elements_so_far keeps track of all the elements we have discovered for every row. If there is a new element that shows up, we count that in Diff column.
We also keep track of the length of new elements discovered in count column.
I'd appreciate if an expert could help me with a vectorized version of the code.
I did try the vectorized version, but I couldn't go too far.
#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)
I was inspired from How to compare two columns both with list of strings and create a new column with unique items? to do above, but I couldn't do it. The linked SO thread does row-wise difference among columns.
I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4
You could using sort and then use numpy to get the unique indexes and then construct your groupings, e.g.:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)
Out[]:
ID
A [1, 2, 3, 4, 5]
D [7]
E [9]
Name: Value, dtype: object
Or to get close to your current output:
In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])
pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)
Out[]:
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1
One alternative using drop duplicates and groupby
# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')
# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)
# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)
# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else [])
Elements Diff Count
ID
A [1, 4, 5, 2, 3] [1, 2, 3, 4, 5] 5
B [2, 3] [] 0
C [3] [] 0
D [7] [7] 1
E [2, 3, 9] [9] 1

Python: turn single array of sorted, repeat values into an array of arrays?

I have a sorted array with some repeated values. How can this array be turned into an array of arrays with the subarrays grouped by value (see below)? In actuality, my_first_array has ~8 million entries, so the solution would preferably be as time efficient as possible.
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
itertools.groupby makes this trivial:
import itertools
wanted_array = [list(grp) for _, grp in itertools.groupby(my_first_array)]
With no key function, it just yields groups consisting of runs of identical values, so you list-ify each one in a list comprehension; easy-peasy. You can think of it as basically a within-Python API for doing the work of the GNU toolkit program, uniq, and related operations.
In CPython (the reference interpreter), groupby is implemented in C, and it operates lazily and linearly; the data must already appear in runs matching the key function, so sorting might make it too expensive, but for already sorted data like you have, there is nothing that will be more efficient.
Note: If the inputs might be value identical, but different objects, it may make sense for memory reasons to change list(grp) for _, grp to [k] * len(list(grp)) for k, grp. The former would retain the original (possibly value but not identity duplicate) objects in the final result, the latter would replicate the first object from each group instead, reducing the final cost per group to the cost of N references to a single object, instead of N references to between 1 and N objects.
I am assuming that the input is a NumPy array and you are looking for a list of arrays as output. Now, you can split the input array at indices where those shifts (groups of repeats have boundaries) with np.split. To find such indices, there are two ways - Using np.unique with its optional argument return_index set as True, and another with a combination of np.where and np.diff. Thus, we would have two approaches as listed next.
With np.unique -
import numpy as np
_,idx = np.unique(my_first_array, return_index=True)
out = np.split(my_first_array, idx)[1:]
With np.where and np.diff -
idx = np.where(np.diff(my_first_array)!=0)[0] + 1
out = np.split(my_first_array, idx)
Sample run -
In [28]: my_first_array
Out[28]: array([ 1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23])
In [29]: _,idx = np.unique(my_first_array, return_index=True)
...: out = np.split(my_first_array, idx)[1:]
...:
In [30]: out
Out[30]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
In [31]: idx = np.where(np.diff(my_first_array)!=0)[0] + 1
...: out = np.split(my_first_array, idx)
...:
In [32]: out
Out[32]:
[array([1, 1, 1]),
array([3]),
array([5, 5]),
array([9, 9, 9, 9, 9]),
array([10]),
array([23, 23])]
Here is a solution, although it might not be very efficient:
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
new_array = [ [my_first_array[0]] ]
count = 0
for i in range(1,len(my_first_array)):
a = my_first_array[i]
if a == my_first_array[i - 1]:
new_array[count].append(a)
else:
count += 1
new_array.append([])
new_array[count].append(a)
new_array == wanted_array
This is O(n):
a = [1,1,1,3,5,5,9,9,9,9,9,10,23,23,24]
res = []
s = 0
e = 0
length = len(a)
while s < length:
b = []
while e < length and a[s] == a[e]:
b.append(a[s])
e += 1
res.append(b)
s = e
print res

Product of array elements by group in numpy (Python)

I'm trying to build a function that returns the products of subsets of array elements. Basically I want to build a prod_by_group function that does this:
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
Vprods = prod_by_group(values, groups)
And the resulting Vprods should be:
Vprods
array([6, 4, 30])
There's a great answer here for sums of elements that I think it should be similar to:
https://stackoverflow.com/a/4387453/1085691
I tried taking the log first, then sum_by_group, then exp, but ran into numerical issues.
There are some other similar answers here for min and max of elements by group:
https://stackoverflow.com/a/8623168/1085691
Edit: Thanks for the quick answers! I'm trying them out. I should add that I want it to be as fast as possible (that's the reason I'm trying to get it in numpy in some vectorized way, like the examples I gave).
Edit: I evaluated all the answers given so far, and the best one is given by #seberg below. Here's the full function that I ended up using:
def prod_by_group(values, groups):
order = np.argsort(groups)
groups = groups[order]
values = values[order]
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
return np.multiply.reduceat(values, group_changes)
If you groups are already sorted (if they are not you can do that with np.argsort), you can do this using the reduceat functionality to ufuncs (if they are not sorted, you would have to sort them first to do it efficiently):
# you could do the group_changes somewhat faster if you care a lot
group_changes = np.concatenate(([0], np.where(groups[:-1] != groups[1:])[0] + 1))
Vprods = np.multiply.reduceat(values, group_changes)
Or mgilson answer if you have few groups. But if you have many groups, then this is much more efficient. Since you avoid boolean indices for every element in the original array for every group. Plus you avoid slicing in a python loop with reduceat.
Of course pandas does these operations conveniently.
Edit: Sorry had prod in there. The ufunc is multiply. You can use this method for any binary ufunc. This means it works for basically all numpy functions that can work element wise on two input arrays. (ie. multiply normally multiplies two arrays elementwise, add adds them, maximum/minimum, etc. etc.)
First set up a mask for the groups such that you expand the groups in another dimension
mask=(groups==unique(groups).reshape(-1,1))
mask
array([[ True, True, True, False, False, False],
[False, False, False, True, False, False],
[False, False, False, False, True, True]], dtype=bool)
now we multiply with val
mask*val
array([[1, 2, 3, 0, 0, 0],
[0, 0, 0, 4, 0, 0],
[0, 0, 0, 0, 5, 6]])
now you can already do prod along the axis 1 except for those zeros, which is easy to fix:
prod(where(mask*val,mask*val,1),axis=1)
array([ 6, 4, 30])
As suggested in the comments, you can also use the Pandas module. Using the grouby() function, this task becomes an one-liner:
import numpy as np
import pandas as pd
values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 1, 1, 2, 3, 3])
df = pd.DataFrame({'values': values, 'groups': groups})
So df then looks as follows:
groups values
0 1 1
1 1 2
2 1 3
3 2 4
4 3 5
5 3 6
Now you can groupby() the groups column and apply numpy's prod() function to each of the groups like this
df.groupby(groups)['values'].apply(np.prod)
which gives you the desired output:
1 6
2 4
3 30
Well, I doubt this is a great answer, but it's the best I can come up with:
np.array([np.product(values[np.flatnonzero(groups == x)]) for x in np.unique(groups)])
It's not a numpy solution, but it's fairly readable (I find that sometimes numpy solutions aren't!):
from operator import itemgetter, mul
from itertools import groupby
grouped = groupby(zip(groups, values), itemgetter(0))
groups = [reduce(mul, map(itemgetter(1), vals), 1) for key, vals in grouped]
print groups
# [6, 4, 30]

Categories

Resources