Get cumulative count per 2d array - python
I have general data, e.g. strings:
np.random.seed(343)
arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]
I need a cumulative counter that resets whenever the value changes down each column, so I used pandas.
First create DataFrame:
df = pd.DataFrame(arr)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 1 1 2 2 3 3 4 4 4
1 1 2 2 2 3 3 3 4 4 4
2 0 2 2 2 2 3 3 4 4 4
3 0 1 2 2 3 3 3 4 4 4
4 0 1 1 1 2 2 2 2 4 4
5 0 0 1 1 2 3 3 3 4 4
6 0 0 2 2 2 2 2 2 3 4
7 0 0 1 1 1 2 2 2 3 3
8 0 1 1 2 2 2 3 4 4 4
9 0 1 1 2 2 2 2 2 4 4
How it works for one column:
First compare shifted data and add cumulative sum:
a = (df[0] != df[0].shift()).cumsum()
print (a)
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
Name: 0, dtype: int32
And then call GroupBy.cumcount:
b = a.groupby(a).cumcount() + 1
print (b)
0 1
1 1
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
dtype: int64
To apply the solution to all columns, it is possible to use apply:
print (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
0 1 2 3 4 5 6 7 8 9
0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 2 1 2 2 2 2 2
2 1 2 2 3 1 3 3 3 3 3
3 2 1 3 4 1 4 4 4 4 4
4 3 2 1 1 1 1 1 1 5 5
5 4 1 2 2 2 1 1 1 6 6
6 5 2 1 1 3 1 1 1 1 7
7 6 3 1 1 1 2 2 2 2 1
8 7 1 2 1 1 3 1 1 1 1
9 8 2 3 2 2 4 1 1 2 2
But it is slow because of the large data. Is it possible to create some fast numpy solution?
I found solutions working only for 1d arrays.
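For context, a typical 1d version of such a counter (a sketch for illustration; the function name is made up) builds reset offsets at each group start and cumsums them:

```python
import numpy as np

def grp_range_1d(a):
    # True at positions where a new group starts (first element always does)
    starts = np.r_[True, a[1:] != a[:-1]]
    idx = np.flatnonzero(starts)
    # all-ones array; at each group start, subtract the previous group length
    # so the running cumsum resets to 1
    c = np.ones(len(a), dtype=int)
    c[idx[1:]] = idx[:-1] - idx[1:] + 1
    c[0] = 1
    return c.cumsum()

print(grp_range_1d(np.array(['0', '1', '1', '2', '2', '2'])))
# [1 1 2 1 2 3]
```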
General Idea
Consider the generic case where we perform this cumulative counting, or, if you think of them as ranges, we could call them - grouped ranges.
Now, the idea starts off simple - compare one-off slices along the respective axis to look for inequalities, and pad with True at the start of each row/col (depending on the axis of counting).
Then, it gets complicated - set up an ID array such that a final cumsum gives the desired output in its flattened order. The setup starts by initializing a 1s array with the same shape as the input array. At each group start in the input, offset the ID array by the previous group length. Follow the code (it should give more insight) on how we would do it for each row -
import numpy as np

def grp_range_2drow(a, start=0):
    # Get grouped ranges along each row with resetting at places where
    # consecutive elements differ
    # Input(s) : a is 2D input array

    # Store shape info
    m, n = a.shape

    # Compare one-off slices for each row and pad with True's at starts
    # Those True's indicate start of each group
    p = np.ones((m, 1), dtype=bool)
    a1 = np.concatenate((p, a[:, :-1] != a[:, 1:]), axis=1)

    # Get indices of group starts in flattened version
    d = np.flatnonzero(a1)

    # Setup ID array to be cumsumed finally for desired o/p
    # Assign into starts with offsets based on previous group lengths.
    # Thus, when cumsumed on flattened version, it would give us the flattened
    # desired output. Finally reshape back to 2D
    c = np.ones(m * n, dtype=int)
    c[d[1:]] = d[:-1] - d[1:] + 1
    c[0] = start
    return c.cumsum().reshape(m, n)
We would extend this to solve for a generic case of row and columns. For the columns case, we would simply transpose, feed to earlier row-solution and finally transpose back, like so -
def grp_range_2d(a, start=0, axis=1):
    # Get grouped ranges along specified axis with resetting at places where
    # consecutive elements differ
    # Input(s) : a is 2D input array
    if axis not in [0, 1]:
        raise Exception("Invalid axis")
    if axis == 1:
        return grp_range_2drow(a, start=start)
    else:
        return grp_range_2drow(a.T, start=start).T
Sample run
Let's consider a sample run that finds grouped ranges along each column, with each group starting at 1 -
In [330]: np.random.seed(0)
In [331]: a = np.random.randint(1,3,(10,10))
In [333]: a
Out[333]:
array([[1, 2, 2, 1, 2, 2, 2, 2, 2, 2],
[2, 1, 1, 2, 1, 1, 1, 1, 1, 2],
[1, 2, 2, 1, 1, 2, 2, 2, 2, 1],
[2, 1, 2, 1, 2, 2, 1, 2, 2, 1],
[1, 2, 1, 2, 2, 2, 2, 2, 1, 2],
[1, 2, 2, 2, 2, 1, 2, 1, 1, 2],
[2, 1, 2, 1, 2, 1, 1, 1, 1, 1],
[2, 2, 1, 1, 1, 2, 2, 1, 2, 1],
[1, 2, 1, 2, 2, 2, 2, 2, 2, 1],
[2, 2, 1, 1, 2, 1, 1, 2, 2, 1]])
In [334]: grp_range_2d(a, start=1, axis=0)
Out[334]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 2, 1, 1, 1, 1, 1],
[1, 1, 2, 2, 1, 2, 1, 2, 2, 2],
[1, 1, 1, 1, 2, 3, 1, 3, 1, 1],
[2, 2, 1, 2, 3, 1, 2, 1, 2, 2],
[1, 1, 2, 1, 4, 2, 1, 2, 3, 1],
[2, 1, 1, 2, 1, 1, 1, 3, 1, 2],
[1, 2, 2, 1, 1, 2, 2, 1, 2, 3],
[1, 3, 3, 1, 2, 1, 1, 2, 3, 4]])
Thus, to solve our case for dataframe input & output, it would be -
out = grp_range_2d(df.values, start=1, axis=0)
pd.DataFrame(out, columns=df.columns, index=df.index)
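For reference, the pieces above combined into one self-contained script (a sketch reproducing this answer's code), checked against the pandas output from the question:

```python
import numpy as np
import pandas as pd

def grp_range_2drow(a, start=0):
    # grouped ranges along each row, resetting where consecutive elements differ
    m, n = a.shape
    p = np.ones((m, 1), dtype=bool)
    a1 = np.concatenate((p, a[:, :-1] != a[:, 1:]), axis=1)
    d = np.flatnonzero(a1)
    c = np.ones(m * n, dtype=int)
    c[d[1:]] = d[:-1] - d[1:] + 1
    c[0] = start
    return c.cumsum().reshape(m, n)

def grp_range_2d(a, start=0, axis=1):
    if axis not in [0, 1]:
        raise Exception("Invalid axis")
    if axis == 1:
        return grp_range_2drow(a, start=start)
    return grp_range_2drow(a.T, start=start).T

np.random.seed(343)
arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
df = pd.DataFrame(arr)

expected = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
out = grp_range_2d(df.values, start=1, axis=0)
print((expected.values == out).all())  # True
```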
And the numba solution. For such a tricky problem, it always wins; here by a 7x factor vs numpy, since only one pass over res is done.
import numpy as np
from numba import njit

@njit
def thefunc(arrc):
    m, n = arrc.shape
    res = np.empty((m + 1, n), np.uint32)
    res[0] = 1
    for i in range(1, m + 1):
        for j in range(n):
            if arrc[i - 1, j]:
                res[i, j] = res[i - 1, j] + 1
            else:
                res[i, j] = 1
    return res

def numbering(arr):
    return thefunc(arr[1:] == arr[:-1])
I need to compute arr[1:] == arr[:-1] outside the jitted function, since numba doesn't support strings.
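As a quick correctness check (a sketch with the @njit decorator dropped, so it runs as plain Python without numba):

```python
import numpy as np

# same recurrence as thefunc above, undecorated for a plain-Python run
def thefunc(arrc):
    # res[i, j] counts consecutive equal values ending at row i of the
    # original array; arrc[i-1, j] is True when rows i-1 and i match
    m, n = arrc.shape
    res = np.empty((m + 1, n), np.uint32)
    res[0] = 1
    for i in range(1, m + 1):
        for j in range(n):
            res[i, j] = res[i - 1, j] + 1 if arrc[i - 1, j] else 1
    return res

arr = np.array([['0', '1'],
                ['0', '2'],
                ['0', '2']])
res = thefunc(arr[1:] == arr[:-1])
print(res)
# [[1 1]
#  [2 1]
#  [3 2]]
```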
In [75]: %timeit numbering(arr)
13.7 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [76]: %timeit grp_range_2dcol(arr)
111 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For bigger arrays (100,000 rows x 100 cols), the gap is not so wide:
In [168]: %timeit a=grp_range_2dcol(arr)
1.54 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [169]: %timeit a=numbering(arr)
625 ms ± 43.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If arr can be converted to 'S8', we can save a lot of time:
In [398]: %timeit arr[1:]==arr[:-1]
584 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [399]: %timeit arr.view(np.uint64)[1:]==arr.view(np.uint64)[:-1]
196 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
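The trick (a sketch, assuming all strings fit in 8 bytes): an 'S8' array has an 8-byte itemsize, so it can be reinterpreted as uint64 and compared as plain integers, which is much cheaper than string comparison:

```python
import numpy as np

arr = np.array([['0', '1', '1'],
                ['0', '2', '1']], dtype='S8')

# same elementwise comparison, but on 8-byte integers instead of strings;
# equal fixed-width byte strings map to equal uint64 values and vice versa
fast = arr.view(np.uint64)[1:] == arr.view(np.uint64)[:-1]
slow = arr[1:] == arr[:-1]
print(np.array_equal(fast, slow))  # True
```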
Using Divakar's method column-wise is quite a bit faster, even though there is probably a fully vectorized way.
# function of Divakar
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    return id_arr.cumsum()
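For illustration (a hypothetical input, not from the answer), grp_range takes the per-group counts and emits 0-based ranges, one run per group:

```python
import numpy as np

# repeated from above so this snippet runs standalone
def grp_range(a):
    idx = a.cumsum()
    id_arr = np.ones(idx[-1], dtype=int)
    id_arr[0] = 0
    id_arr[idx[:-1]] = -a[:-1] + 1
    return id_arr.cumsum()

counts = np.array([3, 2, 4])   # three groups of sizes 3, 2 and 4
print(grp_range(counts))
# [0 1 2 0 1 0 1 2 3]
```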
# create the equivalent of (df != df.shift()).cumsum() but faster
arr_sum = np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:], 0) + 1])

# use grp_range column-wise on arr_sum
arr_result = np.array([grp_range(np.unique(arr_sum[:, i], return_counts=1)[1])
                       for i in range(arr_sum.shape[1])]).T + 1
To check the equality:
# of the cumsum
print (((df != df.shift()).cumsum() ==
np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1]))
.all().all())
#True
print ((df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1) ==
np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1])
for i in range(arr_sum.shape[1])]).T+1)
.all().all())
#True
and the speed:
%timeit df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
#19.4 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
arr_sum = np.vstack([np.ones(10), np.cumsum((arr != np.roll(arr, 1, 0))[1:],0)+1])
arr_res = np.array([grp_range(np.unique(arr_sum[:,i],return_counts=1)[1])
for i in range(arr_sum.shape[1])]).T+1
#562 µs ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
EDIT: with Numpy, you can also use np.maximum.accumulate with np.arange.
def accumulate(arr):
    n, m = arr.shape
    arr_arange = np.arange(1, n + 1)[:, np.newaxis]
    return np.concatenate([np.ones((1, m)),
                           arr_arange[1:] - np.maximum.accumulate(
                               arr_arange[:-1] * (arr[:-1, :] != arr[1:, :]))],
                          axis=0)
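The idea in one dimension (a sketch, not part of the original answer): each position's count is its 1-based index minus the index of the last change before it, and np.maximum.accumulate carries that last-change index forward:

```python
import numpy as np

a = np.array([5, 5, 7, 7, 7, 5])
idx = np.arange(1, len(a) + 1)

# 1-based index of the most recent change, carried forward;
# zero where no change has occurred yet
last_change = np.maximum.accumulate(idx[:-1] * (a[:-1] != a[1:]))

out = np.concatenate([[1], idx[1:] - last_change])
print(out)  # [1 2 1 2 3 1]
```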
Some timings
arr_100 = np.sort(np.random.randint(50, size=(100000, 100)), axis=1).astype(str)
Solution with np.maximum.accumulate
%timeit accumulate(arr_100)
#520 ms ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution of Divakar
%timeit grp_range_2drow(arr_100.T, start=1).T
#1.15 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution with Numba of B. M.
%timeit numbering(arr_100)
#228 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related
How to structure vectorized function with pandas?
I am unsure how to structure a function I want to vectorize in pandas. I have two df's like such:

contents = pd.DataFrame({
    'Items': [1, 2, 3, 1, 1, 2],
})
cats = pd.DataFrame({
    'Cat1': ['1|2|4'],
    'Cat2': ['3|2|5'],
    'Cat3': ['6|9|11'],
})

My goal is to .insert a new column into contents that, per row, is 1 if contents['Items'] is an element of cats['Cat1'], or 0 otherwise. That is to be repeated per cat. Goal format:

contents = pd.DataFrame({
    'Items': [1, 2, 3, 1, 1, 2],
    'contains_Cat1': [1, 1, 0, 1, 1, 1],
    'contains_Cat2': [0, 1, 1, 0, 0, 1],
    'contains_Cat3': [0, 0, 0, 0, 0, 0],
})

As my contents df is big(!) I would like to vectorize this. My approach for each cat is to do something like this:

contents.insert(
    loc=len(contents.columns),
    column='contains_Cat1',
    value=has_content(contents, cats['Cat1'])
)

def has_content(contents: pd.DataFrame, cat: pd.Series) -> pd.Series:
    # Initialization of pd.Series here??
    if contents['Items'] in cat:
        return True
    else:
        return False

My question is: how do I structure my has_content(...)? Especially unclear to me is how I initialize that pd.Series to contain all False values. Do I even need to? After that I know how to check if something is contained in something else. But can I really do it column-wise like above and return immediately, without becoming cell-wise?
Try with str.get_dummies, then reshape with stack and unstack:

out = cats.stack().str.get_dummies().stack()\
          .unstack(level=1).reset_index(level=0, drop=True)\
          .reindex(contents.Items.astype(str))

Out[229]:
       Cat1  Cat2  Cat3
Items
1         1     0     0
2         1     1     0
3         0     1     0
1         1     0     0
1         1     0     0
2         1     1     0

Improvement:

out = cats.stack().str.get_dummies().droplevel(0).T\
          .add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()

Out[230]:
  Items  contains_Cat1  contains_Cat2  contains_Cat3
0     1              1              0              0
1     2              1              1              0
2     3              0              1              0
3     1              1              0              0
4     1              1              0              0
5     2              1              1              0
Simple method:

contents = contents.join([pd.Series(contents.Items.astype(str)
                                    .str.contains(cats[c][0]).astype(int),
                                    name="contains_" + c) for c in cats])

contents:

  Items  contains_Cat1  contains_Cat2  contains_Cat3
0     1              1              0              0
1     2              1              1              0
2     3              0              1              0
3     1              1              0              0
4     1              1              0              0
5     2              1              1              0

Time comparison:

%%timeit -n 2000
contents.join([pd.Series(contents.Items.astype(str)
                         .str.contains(cats[c][0]).astype(int),
                         name="contains_" + c) for c in cats])
3.01 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

%%timeit -n 2000
cats.stack().str.get_dummies().stack()\
    .unstack(level=1).reset_index(level=0, drop=True)\
    .reindex(contents.Items.astype(str))
5.13 ms ± 584 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)

%%timeit -n 2000
cats.stack().str.get_dummies().droplevel(0).T\
    .add_prefix('contains_').reindex(contents['Items'].astype(str)).reset_index()
4.58 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 2000 loops each)
Summing the different values in pandas data frame
I want to sum the distinct values for each column. I think that I should use a special aggregation using apply(), but I don't know the correct code.

   A  B  C  D  E  F  G
   1  2  3  4  5  6  7
   1  3  3  4  8  7  7
   2  2  3  5  8  1  1
   2  1  3  5  7  5  1

# I want to have this result for each value in column A
   A  B  C  D   E   F  G
   1  5  3  4  13  13  7
   2  3  3  5  15   6  1
You can vectorize this by dropping duplicates per index position. You can then re-create the original matrix conveniently using a sparse matrix. You could accomplish the same thing by creating a zero array and adding, but this way you avoid the large memory requirement if your A column is very sparse.

from scipy import sparse

def non_dupe_sums_2D(ids, values):
    v = np.unique(ids)
    x, y = values.shape
    r = np.arange(y)
    m = np.repeat(ids, y)
    n = np.tile(r, x)
    u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
    return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].A

a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()

non_dupe_sums_2D(a, b)
array([[ 5,  3,  4, 13, 13,  7],
       [ 3,  3,  5, 15,  6,  1]], dtype=int64)

Performance

df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()

%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v, g in df.groupby(0)], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Validation

>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:

def sum_unique(x):
    return x.apply(lambda x: x.unique().sum())

df.groupby('A')[df.columns ^ {'A'}].apply(sum_unique).reset_index()

which gives me:

   A  B  C  D   E   F  G
0  1  5  3  4  13  13  7
1  2  3  3  5  15   6  1

which seems to be what you're expecting.
Not so ideal, but here's one way with apply:

pd.concat([g.apply(lambda x: x.unique().sum()) for v, g in df.groupby('A')], axis=1)

Output:

    0   1
A   1   2
B   5   3
C   3   3
D   4   5
E  13  15
F  13   6
G   7   1

You can certainly transpose the dataframe to obtain the expected output.
How to create a new column containing the largest value in a list that is smaller than cell value in an existing column?
I have a pandas dataframe that looks like:

   a
0  0
1 -2
2  4
3  1
4  6

I also have a list:

A = [-1, 2, 5, 7]

I want to add a new column called 'b' that contains the largest value in A that is smaller than the cell value in column 'a'. If no such value exists, I want the value in 'b' to be 'X'. So the goal is to get:

   a  b
0  0 -1
1 -2  X
2  4  2
3  1 -1
4  6  5

How do I achieve this?
There is a built-in function, merge_asof:

s = pd.DataFrame({'a': A, 'b': A})
pd.merge_asof(df.assign(index=df.index).sort_values('a'), s, on='a')\
  .set_index('index').sort_index().fillna('X')

Out[284]:
       a  b
index
0      0 -1
1     -2  X
2      4  2
3      1 -1
4      6  5
def largest_min(x):
    less_than = list(filter(lambda l: l < x, A))
    if len(less_than):
        return max(less_than)
    return 'X'

df['b'] = df['a'].apply(largest_min)

Edited to fix an error and return 'X' when no smaller value is found.
Not sure of a pandas method, but numpy.searchsorted is a perfect fit here. It finds the indices where elements should be inserted to maintain order. Once you have the indices your elements would be inserted at to maintain the sort, you can look at the element to the left of these indices in your lookup array to find the closest smaller element. If the element would be inserted at the beginning of the list (index 0), we know that a smaller element does not exist in the lookup list, and we account for that scenario using np.where.

A = np.array([-1, 2, 5, 7])
r = np.searchsorted(A, df.a.values)
df.assign(b=np.where(r == 0, np.nan, A[r-1])).fillna('X')

   a  b
0  0 -1
1 -2  X
2  4  2
3  1 -1
4  6  5

This method will be much faster than apply here.

df = pd.concat([df]*10_000)

%%timeit
r = np.searchsorted(A, df.a.values)
df.assign(b=np.where(r == 0, np.nan, A[r-1])).fillna('X')
6.09 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df['a'].apply(largest_min)
196 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is another way to do it as well:

df1 = pd.Series(A)

def filler(val):
    v = df1[df1 < val.iloc[0]].max()
    return v

df.assign(b=df.apply(filler, axis=1).fillna('X'))

   a  b
0  0 -1
1 -2  X
2  4  2
3  1 -1
4  6  5
df = pd.DataFrame({'a': [0, 1, 4, 1, 6]})
A = [-1, 2, 5, 7]

new_list = []
for i in df.iterrows():
    for j in range(len(A)):
        if A[j] < i[1]['a']:
            pass
        elif j == 0:
            new_list.append(A[j])
            break
        else:
            new_list.append(A[j - 1])
            break

df['b'] = new_list
Matrix contains specific number per row
I want to know if there is at least one zero in each row of a matrix:

i = 0
for row in range(rows):
    if A[row].contains(0):
        i += 1
i == rows

Is this right, or is there a better way?
You can reproduce the effect of your whole code block in a single vectorized operation (with a as the matrix):

np.all((a == 0).sum(axis=1))

Alternatively (building off of Mateen Ulhaq's suggestion in the comments), you could do:

np.all(np.any(a == 0, axis=1))

Testing it out:

a = np.arange(5*5).reshape(5,5)
b = a.copy()
b[:, 3] = 0

print('a\n%s\n' % a)
print('b\n%s\n' % b)

print('method 1')
print(np.all((a == 0).sum(axis=1)))
print(np.all((b == 0).sum(axis=1)))
print()
print('method 2')
print(np.all(np.any(a == 0, axis=1)))
print(np.all(np.any(b == 0, axis=1)))

Output:

a
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

b
[[ 0  1  2  0  4]
 [ 5  6  7  0  9]
 [10 11 12  0 14]
 [15 16 17  0 19]
 [20 21 22  0 24]]

method 1
False
True

method 2
False
True

Timings:

%%timeit
np.all((a == 0).sum(axis=1))
8.73 µs ± 56.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
np.all(np.any(a == 0, axis=1))
7.87 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So the second method (which uses np.any) is slightly faster.
Zero out few columns of a numpy matrix [duplicate]
I have a numpy 10x10 matrix and want to zero values in some columns according to a vector [1,0,0,0,0,1,0,0,1,0]. How do I do it with the best performance? Using other python libraries is also acceptable, if they work better.
The simplest way to do this is multiplication. Multiplying a value by 0 zeroes it out, and multiplying a value by 1 has no effect, so multiplying your matrix with your vector will do exactly what you want:

m = np.random.randint(1, 10, (10, 10))
v = np.array([1,0,0,0,0,1,0,0,1,0])
print(m * v)

Output:

[[7 0 0 0 0 5 0 0 5 0]
 [8 0 0 0 0 5 0 0 6 0]
 [1 0 0 0 0 5 0 0 9 0]
 [1 0 0 0 0 6 0 0 1 0]
 [5 0 0 0 0 8 0 0 5 0]
 [5 0 0 0 0 4 0 0 9 0]
 [1 0 0 0 0 3 0 0 9 0]
 [1 0 0 0 0 9 0 0 8 0]
 [6 0 0 0 0 4 0 0 6 0]
 [1 0 0 0 0 6 0 0 1 0]]

You were concerned that multiplying might be too slow, and wanted to know how to do it by selecting. That's easy too:

bv = v.astype(np.bool)
m[:, bv] = 0
print(m)

Or, instead of astype, you could use bv = v == 1, but you end up with the exact same bool array, and I can't imagine that would make a difference.

So, which is fastest?

In [123]: %timeit m*v
2.87 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [124]: bv = v.astype(np.bool)
In [125]: %timeit m[:,v.astype(np.bool)]
5.02 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [127]: bv = v==1
In [128]: %timeit m[:,v.astype(np.bool)]
5.03 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So the "slow" way actually runs in less than two thirds the time. Also, it takes only a few microseconds no matter how you do it, which is what you should expect given how small the array is.