Conditional counting within groups - python

I wanted to do conditional counting after groupby; for example, group by values of column A, and then count within each group how often value 5 appears in column B.
If I were doing this for the entire DataFrame, it would just be len(df[df['B']==5]). So I hoped I could do df.groupby('A')[df['B']==5].size(). But I guess boolean indexing doesn't work within GroupBy objects.
Example:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 0, 4, 4, 6], 'B': [5, 10, 10, 5, 5, 10]})
groups = df.groupby('A')
# some more code
# in the end, I want to get pd.Series({0: 1, 4: 2, 6: 0})

Select all rows where B equals 5, and then apply groupby/size:
In [43]: df.loc[df['B']==5].groupby('A').size()
Out[43]:
A
0 1
4 2
dtype: int64
Alternatively, you could use groupby/agg with a custom function:
In [44]: df.groupby('A')['B'].agg(lambda ser: (ser==5).sum())
Out[44]:
A
0 1
4 2
Name: B, dtype: int64
Note that generally speaking, using agg with a custom function will be slower than using groupby with a builtin method such as size. So prefer the first option over the second.
In [45]: %timeit df.groupby('A')['B'].agg(lambda ser: (ser==5).sum())
1000 loops, best of 3: 927 µs per loop
In [46]: %timeit df.loc[df['B']==5].groupby('A').size()
1000 loops, best of 3: 649 µs per loop
To include A values where the size is zero, you could reindex the result:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 0, 4, 4, 6], 'B': [5, 10, 10, 5, 5, 10]})
result = df.loc[df['B'] == 5].groupby('A').size()
result = result.reindex(df['A'].unique())
yields
A
0 1.0
4 2.0
6 NaN
dtype: float64
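If you would rather get an actual 0 count (and keep the integer dtype) instead of NaN, reindex also accepts a fill_value; a minimal sketch of that variant:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 0, 4, 4, 6], 'B': [5, 10, 10, 5, 5, 10]})
# count rows with B == 5 per value of A, then fill in groups with no matches
result = df.loc[df['B'] == 5].groupby('A').size()
result = result.reindex(df['A'].unique(), fill_value=0)
print(result)
# A
# 0    1
# 4    2
# 6    0
# dtype: int64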

Related

Calculate "energy" of columns with pandas

I'm trying to calculate the signal energy of my pandas.DataFrame, following the formula for a discrete-time signal (the sum of the squared samples). I tried apply and applymap, and also reduce, as suggested in How do I columnwise reduce a pandas dataframe?. But everything I tried ended up applying the operation to each element, not to the whole column.
This is not a signal-processing-specific question; it's just an example of how to apply a "summarize" (I don't know the right term for this) function to columns.
My workaround was to get the raw numpy.array data and do my calculations on that. But I am pretty sure there is a pandatic way to do this (and surely a more numpyic way).
import pandas as pd
import numpy as np
d = np.array([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
              [0, -1, 2, -3, 4, -5, 6, -7, 8, -9],
              [0, 1, -2, 3, -4, 5, -6, 7, -8, 9]]).transpose()
df = pd.DataFrame(d)
energies = []
# a same as d (df.as_matrix() in older pandas versions)
a = df.values
assert np.array_equal(a, d)
for column in range(a.shape[1]):
    energies.append(sum(a[:, column] ** 2))
print(energies)  # [40, 285, 285]
Thanks in advance!
You could do the following for dataframe output -
(df**2).sum(axis=0) # Or (df**2).sum(0)
For performance, we could work with array extracted from the dataframe -
(df.values**2).sum(axis=0) # Or (df.values**2).sum(0)
For further performance boost, there's np.einsum -
a = df.values
out = np.einsum('ij,ij->j',a,a)
Runtime test -
In [31]: df = pd.DataFrame(np.random.randint(0,9,(1000,30)))
In [32]: %timeit (df**2).sum(0)
1000 loops, best of 3: 518 µs per loop
In [33]: %timeit (df.values**2).sum(0)
10000 loops, best of 3: 40.2 µs per loop
In [34]: def einsum_based(df):
    ...:     a = df.values
    ...:     return np.einsum('ij,ij->j',a,a)
    ...:
In [35]: %timeit einsum_based(df)
10000 loops, best of 3: 32.2 µs per loop
You can use DataFrame.pow with DataFrame.sum:
print (df.pow(2).sum())
0 40
1 285
2 285
dtype: int64
print (df.pow(2).sum().values.tolist())
[40, 285, 285]
There is also the df.var() method, which returns the variance of the columns. Depending on your definition this is related to the energy: you may need to multiply it by the number of elements (df.var(ddof=0) * df.shape[0]), and that product equals the sum of squares only when the columns have zero mean.
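As a quick sketch of that zero-mean caveat on the question's data, using the identity sum(x**2) = N*var(x, ddof=0) + N*mean(x)**2 (column 0 is a constant signal with nonzero mean, so the variance alone would miss its energy entirely):
import pandas as pd
import numpy as np
d = np.array([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
              [0, -1, 2, -3, 4, -5, 6, -7, 8, -9],
              [0, 1, -2, 3, -4, 5, -6, 7, -8, 9]]).transpose()
df = pd.DataFrame(d)
n = len(df)
energy = (df ** 2).sum()              # sum of squares: [40, 285, 285]
via_var = df.var(ddof=0) * n          # sum of squared deviations from the column mean
print(via_var + n * df.mean() ** 2)   # adding the mean term back recovers 40.0, 285.0, 285.0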

Pandas - assign histogram bucket to each row

Here is my dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0,3),(3,5),(5,9)]
I also have the histogram buckets stated above. Now I would like to assign each row of the dataframe to a bucket index. So I would like to get a new column with the following info:
df['buckets_index'] = [0,0,0,1,2,1,0,0,2]
Of course, I can do it with loops, but I have a fairly big dataframe (2.5 mil rows), so I need to get it done quickly.
Any thoughts?
You can use pd.cut, with labels=False if you only want the index:
buckets = [0,3,5,9]
df['bucket'] = pd.cut(df['A'], bins=buckets)
df['bucket_idx'] = pd.cut(df['A'], bins=buckets, labels=False)
The resulting output:
A bucket bucket_idx
0 1 (0, 3] 0
1 2 (0, 3] 0
2 3 (0, 3] 0
3 4 (3, 5] 1
4 6 (5, 9] 2
5 4 (3, 5] 1
6 3 (0, 3] 0
7 2 (0, 3] 0
8 7 (5, 9] 2
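One detail worth noting with pd.cut (relevant if out-of-range values can occur in your 2.5M rows): values that fall outside the outermost bin edges come back as NaN, which also forces the labels=False column to float. A small sketch:
s = pd.Series([1, 6, 12])                   # 12 is beyond the last edge (9)
pd.cut(s, bins=[0, 3, 5, 9], labels=False)
# 0    0.0
# 1    2.0
# 2    NaN
# dtype: float64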
You could use np.searchsorted -
df['buckets_index'] = np.asarray(buckets)[:,1].searchsorted(df.A.values)
Runtime test -
In [522]: df = pd.DataFrame({'A': np.random.randint(1,8,(10000))})
In [523]: buckets = [0,3,5,9]
In [524]: %timeit pd.cut(df['A'], bins=buckets, labels=False)
1000 loops, best of 3: 460 µs per loop  # @root's soln
In [525]: buckets = [(0,3),(3,5),(5,9)]
In [526]: %timeit np.asarray(buckets)[:,1].searchsorted(df.A.values)
10000 loops, best of 3: 166 µs per loop
Out-of-limits cases: for such cases, we need to use clipping, like so -
np.asarray(buckets)[:,1].searchsorted(df.A.values).clip(max=len(buckets)-1)
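For instance, a small sketch with an out-of-range value added to the question's data (unlike pd.cut, which would return NaN, this maps such values into the last bucket):
import numpy as np
import pandas as pd
buckets = [(0, 3), (3, 5), (5, 9)]
df = pd.DataFrame({'A': [1, 4, 12]})             # 12 lies beyond the last edge
raw = np.asarray(buckets)[:, 1].searchsorted(df.A.values)
print(raw)                                       # [0 1 3]  -> 3 is out of range
print(raw.clip(max=len(buckets) - 1))            # [0 1 2]  -> 12 lands in the last bucket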

python: mean of variable length 2D matrix

Consider the following variable length 2D array
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
How can I find the mean of the values along each column?
I want something like [(1+4+6)/3,(2+5+7)/3, (3+8)/2, 9/1]
So the end result would be [3.667, 4.667, 5.5, 9]
Is this possible using numpy?
I tried np.mean(x, axis=0), but numpy expects arrays of the same dimension.
Right now, I am popping the elements of each column and finding the mean. Is there a better way to achieve the result?
You could use pandas:
import pandas as pd
a = [[1, 2, 3],
     [4, 5],
     [6, 7, 8, 9]]
df = pd.DataFrame(a)
# 0 1 2 3
# 0 1 2 3 NaN
# 1 4 5 NaN NaN
# 2 6 7 8 9
df.mean()
# 0 3.666667
# 1 4.666667
# 2 5.500000
# 3 9.000000
# dtype: float64
Here is another solution that only uses numpy:
import numpy as np
nrows = len(a)
ncols = max(len(row) for row in a)
arr = np.zeros((nrows, ncols))
arr.fill(np.nan)
for jrow, row in enumerate(a):
    for jcol, col in enumerate(row):
        arr[jrow, jcol] = col
np.nanmean(arr, axis=0)
# array([ 3.66666667, 4.66666667, 5.5, 9. ])
Very simple alternative approach using itertools.izip_longest() (zip_longest() in Python 3) as:
>>> from itertools import izip_longest
>>> mean_list = []
>>> for sub_list in izip_longest(*my_list):
...     filtered_list = [x for x in sub_list if x is not None]
...     mean_list.append(sum(filtered_list) / (len(filtered_list) * 1.0))
...
>>> mean_list
[3.6666666666666665, 4.666666666666667, 5.5, 9.0]
where my_list is:
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
Listed in this post is an almost vectorized approach using NumPy. We assign each element in the input list an ID based on its position within its row. These IDs are then fed to np.bincount, which performs ID-based summations. Finally, we divide those sums by the count of elements for each ID to get the final average values.
Thus, we would have an implementation like so -
def variable_mean(a):
    vals = np.concatenate(a)
    lens = np.array([len(x) for x in a])
    id_arr = np.ones(vals.size, dtype=int)
    id_arr[0] = 0
    id_arr[lens.cumsum()[:-1]] = -lens[:-1] + 1
    IDs = id_arr.cumsum()
    return np.bincount(IDs, vals) / np.bincount(IDs)
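For the list from the question, a quick sanity check of the function (expected values taken from the question itself):
a = [[1, 2, 3],
     [4, 5],
     [6, 7, 8, 9]]
variable_mean(a)
# array([3.66666667, 4.66666667, 5.5, 9.]), i.e. the desired [3.667, 4.667, 5.5, 9]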
Runtime test -
In [298]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 10 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [299]: %timeit pd.DataFrame(a).mean()  # @Julien Spronck's pandas soln
100 loops, best of 3: 3.33 ms per loop
In [300]: %timeit variable_mean(a)
100 loops, best of 3: 2.36 ms per loop
In [301]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 100 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [302]: %timeit pd.DataFrame(a).mean()  # @Julien Spronck's pandas soln
10 loops, best of 3: 27.1 ms per loop
In [303]: %timeit variable_mean(a)
100 loops, best of 3: 9.58 ms per loop
If you want to do it manually, what I would do:
max_length = 0
Figure out the max array length:
for array in arrays:
    if len(array) > max_length:
        max_length = len(array)
Pad all arrays to that length with None:
for array in arrays:
    while len(array) < max_length:
        array.append(None)
Zip will group the columns:
columns = list(zip(*arrays))
columns == [(1, 4, 6), (2, 5, 7), (3, None, 8), (None, None, 9)]
Calculate the average as you would for any list:
for col in columns:
    count = 0
    total = 0.0          # avoid shadowing the builtin sum
    for num in col:
        if num is not None:
            count += 1
            total += float(num)
    print("%s: Avg %s" % (col, total / count))
Or as a list comprehension after padding the arrays:
[sum(v for v in col if v is not None) / float(len([v for v in col if v is not None]))
 for col in zip(*arrays)]
Output:
(1, 4, 6): Avg 3.6666666666666665
(2, 5, 7): Avg 4.666666666666667
(3, None, 8): Avg 5.5
(None, None, 9): Avg 9.0
In Py3, zip_longest takes a fillvalue parameter:
In [1208]: ll=[
...: [1, 2, 3],
...: [4, 5],
...: [6, 7, 8, 9]
...: ]
In [1209]: list(itertools.zip_longest(*ll, fillvalue=np.nan))
Out[1209]: [(1, 4, 6), (2, 5, 7), (3, nan, 8), (nan, nan, 9)]
By filling with nan, I can use np.nanmean to take the mean ignoring the nan. nanmean turns its input (here _ from the previous line) into an array:
In [1210]: np.nanmean(_, axis=1)
Out[1210]: array([ 3.66666667, 4.66666667, 5.5 , 9. ])

numpy.argmin for elements greater than a threshold

I'm interested in getting the location of the minimum value in an 1-d NumPy array that meets a certain condition (in my case, a medium threshold). For example:
import numpy as np
limit = 3
a = np.array([1, 2, 4, 5, 2, 5, 3, 6, 7, 9, 10])
I'd like to effectively mask all numbers in a that are under the limit, such that the result of np.argmin would be 6. Is there a computationally cheap way to mask values that don't meet a condition and then apply np.argmin?
You could store the valid indices, use them to select the valid elements from a, and then index back into them with the argmin() of the selected elements to get the final index. Thus, the implementation would look something like this -
valid_idx = np.where(a >= limit)[0]
out = valid_idx[a[valid_idx].argmin()]
Sample run -
In [32]: limit = 3
...: a = np.array([1, 2, 4, 5, 2, 5, 3, 6, 7, 9, 10])
...:
In [33]: valid_idx = np.where(a >= limit)[0]
In [34]: valid_idx[a[valid_idx].argmin()]
Out[34]: 6
Runtime test -
For performance benchmarking, in this section I am comparing the masked-array-based solution from the other answer against the regular array-based solution proposed earlier in this post, for various data sizes.
def masked_argmin(a, limit):  # Defining func for regular array based soln
    valid_idx = np.where(a >= limit)[0]
    return valid_idx[a[valid_idx].argmin()]
In [52]: # Inputs
...: a = np.random.randint(0,1000,(10000))
...: limit = 500
...:
In [53]: %timeit np.argmin(np.ma.MaskedArray(a, a<limit))
1000 loops, best of 3: 233 µs per loop
In [54]: %timeit masked_argmin(a,limit)
10000 loops, best of 3: 101 µs per loop
In [55]: # Inputs
...: a = np.random.randint(0,1000,(100000))
...: limit = 500
...:
In [56]: %timeit np.argmin(np.ma.MaskedArray(a, a<limit))
1000 loops, best of 3: 1.73 ms per loop
In [57]: %timeit masked_argmin(a,limit)
1000 loops, best of 3: 1.03 ms per loop
This can simply be accomplished using numpy's MaskedArray
import numpy as np
limit = 3
a = np.array([1, 2, 4, 5, 2, 5, 3, 6, 7, 9, 10])
b = np.ma.MaskedArray(a, a<limit)
np.ma.argmin(b) # == 6
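One edge case worth keeping in mind for the valid_idx approach above: if no element satisfies the condition, valid_idx is empty and argmin() raises a ValueError. A sketch of a guarded variant (what to return for the empty case is up to you):
import numpy as np
def masked_argmin(a, limit):
    valid_idx = np.where(a >= limit)[0]
    if valid_idx.size == 0:       # nothing meets the condition
        return None               # or raise, depending on what the caller expects
    return valid_idx[a[valid_idx].argmin()]
masked_argmin(np.array([1, 2]), limit=3)   # -> None instead of a ValueError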

Randomly selecting rows from numpy array

I want to randomly select rows from a numpy array. Say I have this array-
A = np.array([[1, 3, 0],
              [3, 2, 0],
              [0, 2, 1],
              [1, 1, 4],
              [3, 2, 2],
              [0, 1, 0],
              [1, 3, 1],
              [0, 4, 1],
              [2, 4, 2],
              [3, 3, 1]])
To randomly select say 6 rows, I am doing this:
B = A[np.random.choice(A.shape[0], size=6, replace=False), :]
I want another array C which has the rows which were not selected in B.
Is there some in-built method to do this or do I need to do a brute-force, checking rows of B with rows of A?
You can make any number of row-wise random partitions of A by slicing a shuffled sequence of row indices:
ind = numpy.arange( A.shape[ 0 ] )
numpy.random.shuffle( ind )
B = A[ ind[ :6 ], : ]
C = A[ ind[ 6: ], : ]
If you don't want to change the order of the rows in each subset, you can sort each slice of the indices:
B = A[ sorted( ind[ :6 ] ), : ]
C = A[ sorted( ind[ 6: ] ), : ]
(Note that the solution provided by @MaxNoe also preserves row order.)
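As a side note, numpy.random.permutation( A.shape[ 0 ] ) returns an already shuffled arange, so the first two lines can be collapsed into one call (same idea, slightly more compact):
ind = numpy.random.permutation( A.shape[ 0 ] )   # shuffled row indices in one step
B = A[ ind[ :6 ], : ]
C = A[ ind[ 6: ], : ]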
Solution
This gives you the indices for the selection:
sel = np.random.choice(A.shape[0], size=6, replace=False)
and this is B:
B = A[sel]
Get all non-selected indices:
unsel = list(set(range(A.shape[0])) - set(sel))
and use them for C:
C = A[unsel]
Variation with NumPy functions
Instead of using set and list, you can use this:
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)
For the example array the pure Python version:
%%timeit
unsel1 = list(set(range(A.shape[0])) - set(sel))
100000 loops, best of 3: 8.42 µs per loop
is faster than the NumPy version:
%%timeit
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)
10000 loops, best of 3: 77.5 µs per loop
For larger A the NumPy version is faster:
A = np.random.random((int(1e4), 3))
sel = np.random.choice(A.shape[0], size=6, replace=False)
%%timeit
unsel1 = list(set(range(A.shape[0])) - set(sel))
1000 loops, best of 3: 1.4 ms per loop
%%timeit
unsel2 = np.setdiff1d(np.arange(A.shape[0]), sel)
1000 loops, best of 3: 315 µs per loop
You can use boolean masks and draw random indices from an integer array that is as long as your array. The ~ is an elementwise not:
idx = np.arange(A.shape[0])
mask = np.zeros_like(idx, dtype=bool)
selected = np.random.choice(idx, 6, replace=False)
mask[selected] = True
B = A[mask]
C = A[~mask]
