Here is my dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0,3),(3,5),(5,9)]
I also have the histogram buckets stated above. Now I would like to assign each row of the dataframe to a bucket index, so I would like to get a new column with the following info:
df['buckets_index'] = [0,0,0,1,2,1,0,0,2]
Of course, I can do it with loops, but I have a fairly big dataframe (2.5 million rows), so I need to get it done quickly.
Any thoughts?
You can use pd.cut, with labels=False if you only want the index:
buckets = [0,3,5,9]
df['bucket'] = pd.cut(df['A'], bins=buckets)
df['bucket_idx'] = pd.cut(df['A'], bins=buckets, labels=False)
The resulting output:
A bucket bucket_idx
0 1 (0, 3] 0
1 2 (0, 3] 0
2 3 (0, 3] 0
3 4 (3, 5] 1
4 6 (5, 9] 2
5 4 (3, 5] 1
6 3 (0, 3] 0
7 2 (0, 3] 0
8 7 (5, 9] 2
You could use np.searchsorted -
import numpy as np
df['buckets_index'] = np.asarray(buckets)[:,1].searchsorted(df.A.values)
Runtime test -
In [522]: df = pd.DataFrame({'A': np.random.randint(1,8,(10000))})
In [523]: buckets = [0,3,5,9]
In [524]: %timeit pd.cut(df['A'], bins=buckets, labels=False)
1000 loops, best of 3: 460 µs per loop  # @root's soln
In [525]: buckets = [(0,3),(3,5),(5,9)]
In [526]: %timeit np.asarray(buckets)[:,1].searchsorted(df.A.values)
10000 loops, best of 3: 166 µs per loop
Out-of-range cases: for values that fall outside the bucket limits, we need to use clipping, like so -
np.asarray(buckets)[:,1].searchsorted(df.A.values).clip(max=len(buckets)-1)
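For reference, here is a minimal end-to-end sketch of the searchsorted approach on the question's data, with the clip applied so out-of-range values stay at the last bucket index (my own recap, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0, 3), (3, 5), (5, 9)]

right_edges = np.asarray(buckets)[:, 1]   # array([3, 5, 9])
df['buckets_index'] = right_edges.searchsorted(df['A'].values).clip(max=len(buckets) - 1)
# df['buckets_index'] -> [0, 0, 0, 1, 2, 1, 0, 0, 2]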
Related
I have a data frame column with numeric values:
df['percentage'].head()
46.5
44.2
100.0
42.12
I want to see the column as bin counts:
bins = [0, 1, 5, 10, 25, 50, 100]
How can I get the result as bins with their value counts?
[0, 1] bin amount
[1, 5] etc
[5, 10] etc
...
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
If you want numeric labels for the bins instead of the intervals themselves, pass labels:
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Or numpy.searchsorted:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default, cut returns a Categorical.
Series methods like Series.value_counts() will include all categories, even if some categories are not present in the data (see the pandas documentation on operations with categorical data).
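If you prefer the counts listed in bin order rather than sorted by frequency, you can sort the result by its categorical index (a small addition on top of the answer above):
s = pd.cut(df['percentage'], bins=bins).value_counts().sort_index()
print (s)
(0, 1]       0
(1, 5]       0
(5, 10]      0
(10, 25]     0
(25, 50]     3
(50, 100]    1
Name: percentage, dtype: int64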
Using the Numba module for a speed-up.
On big datasets (more than 500k rows), pd.cut can be quite slow for binning data.
I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:
import numpy as np
from numba import njit

@njit
def cut(arr):
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 5):
            bins[idx] = 2
        elif (x >= 5) & (x < 10):
            bins[idx] = 3
        elif (x >= 10) & (x < 25):
            bins[idx] = 4
        elif (x >= 25) & (x < 50):
            bins[idx] = 5
        elif (x >= 50) & (x < 100):
            bins[idx] = 6
        else:
            bins[idx] = 7
    return bins
cut(df['percentage'].to_numpy())
# array([5., 5., 7., 5.])
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())
conversion_dict = {1: 'bin1',
                   2: 'bin2',
                   3: 'bin3',
                   4: 'bin4',
                   5: 'bin5',
                   6: 'bin6',
                   7: 'bin7'}
bins = list(map(conversion_dict.get, a))
# ['bin5', 'bin5', 'bin7', 'bin5']
Speed comparison:
# Create a dataframe of 8 million rows for testing
dfbig = pd.concat([df]*2000000, ignore_index=True)
dfbig.shape
# (8000000, 1)
%%timeit
cut(dfbig['percentage'].to_numpy())
# 38 ms ± 616 µs per loop (mean ± standard deviation of 7 runs, 10 loops each)
%%timeit
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
pd.cut(dfbig['percentage'], bins=bins, labels=labels)
# 215 ms ± 9.76 ms per loop (mean ± standard deviation of 7 runs, 10 loops each)
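If the bin edges change often, the same idea can be generalized to take the edges as an argument instead of hard-coding them. This is a sketch of my own (not part of the original answer), keeping the same right-open bins and the same fallback bin for out-of-range values:
import numpy as np
from numba import njit

@njit
def cut_edges(arr, edges):
    # bins are [edges[i], edges[i+1]); anything outside gets len(edges), like the else branch above
    out = np.empty(arr.shape[0], dtype=np.int64)
    for idx in range(arr.shape[0]):
        x = arr[idx]
        out[idx] = len(edges)
        for i in range(len(edges) - 1):
            if x >= edges[i] and x < edges[i + 1]:
                out[idx] = i + 1
                break
    return out

edges = np.array([0., 1., 5., 10., 25., 50., 100.])
cut_edges(df['percentage'].to_numpy(), edges)
# array([5, 5, 7, 5])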
We could also use np.select:
bins = [0, 1, 5, 10, 25, 50, 100]
df['groups'] = np.select([df['percentage'].between(i, j, inclusive='right')
                          for i, j in zip(bins, bins[1:])],
                         [1, 2, 3, 4, 5, 6])
Output:
percentage groups
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
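Note that np.select falls back to its default value (0) for rows that match none of the conditions; if you want out-of-range values flagged explicitly, one option (a small addition, not part of the original answer) is to pass a sentinel default:
df['groups'] = np.select([df['percentage'].between(i, j, inclusive='right')
                          for i, j in zip(bins, bins[1:])],
                         [1, 2, 3, 4, 5, 6], default=-1)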
Convenient and fast version using Numpy
np.digitize is a convenient and fast option:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,3,4,5]})
df['y'] = np.digitize(df['x'], bins=[3,5])
print(df)
returns
x y
0 1 0
1 2 0
2 3 1
3 4 1
4 5 2
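By default, np.digitize uses half-open bins of the form bins[i-1] <= x < bins[i]; if you want the right edge to be inclusive instead (matching pd.cut's default), you can pass right=True. A small illustrative variation, not part of the original answer:
df['y'] = np.digitize(df['x'], bins=[3, 5], right=True)
# y becomes [0, 0, 0, 1, 1] for x = [1, 2, 3, 4, 5]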
I have a list object, and I want to know how many numbers fall into each interval. The code is as follows:
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
a = list(map(lambda x:int(x/interval),a))
for i in range(min(a), max(a)+1):
    print(i*interval, (i+1)*interval, ':', a.count(i))
Output
0 3 : 2
3 6 : 4
6 9 : 5
9 12 : 1
12 15 : 1
15 18 : 1
18 21 : 0
21 24 : 0
24 27 : 0
27 30 : 1
Is there a simple way to get this information? The simpler the better.
Now that we're talking about performance, I'd like to offer my numpy solution using bincount:
import numpy as np
interval = 3
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
l = max(a) // interval + 1
b = np.bincount(a, minlength=l*interval).reshape((l,interval)).sum(axis=1)
(minlength is necessary just to be able to reshape if max(a) isn't a multiple of interval)
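(One more caveat, not stated in the original answer: np.bincount only accepts non-negative integers, so this works here because a holds small non-negative ints; for floats or negative values you would first have to map them to integer bin indices, e.g. with np.digitize.)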
With the labels taken from Erfan's answer we get:
rnge = range(0, max(a) + interval + 1, interval)
labels = [f'[{i}-{j})' for i, j in zip(rnge[:-1], rnge[1:])]
for label, count in zip(labels, b):
    print(label, count)
[0-3) 2
[3-6) 4
[6-9) 5
[9-12) 1
[12-15) 1
[15-18) 1
[18-21) 0
[21-24) 0
[24-27) 0
[27-30) 1
This is much faster than the pandas solution.
Performance and scaling comparison
In order to assess the scaling capability, I replaced a with [1, ..., 28] * n and timed the execution (without imports and printing) for n = 1, 10, 100, 1K, 10K and 100K.
(python 3.7.3 on win32 / pandas 0.24.2 / numpy 1.16.2)
Pandas solution with pd.cut and groupby
s = pd.Series(a)
bins = pd.cut(s, range(0, s.max() + interval, interval), right=False)
s.groupby(bins).count()
[0, 3) 2
[3, 6) 4
[6, 9) 5
[9, 12) 1
[12, 15) 1
[15, 18) 1
[18, 21) 0
[21, 24) 0
[24, 27) 0
[27, 30) 1
dtype: int64
To get cleaner bin labels, we can use the method from the linked answer:
s = pd.Series(a)
rnge = range(0, s.max() + interval, interval)
labels = [f'{i}-{j}' for i, j in zip(rnge[:-1], rnge[1:])]
bins = pd.cut(s, range(0, s.max() + interval, interval), right=False, labels=labels)
s.groupby(bins).count()
0-3 2
3-6 4
6-9 5
9-12 1
12-15 1
15-18 1
18-21 0
21-24 0
24-27 0
27-30 1
dtype: int64
You can do it in one line using a dictionary comprehension:
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
{"[{};{}[".format(x, x+3) : len( [y for y in a if y >= x and y < x+3] )
for x in range(min(a), max(a), 3)}
Output:
{'[1;4[': 3,
'[4;7[': 4,
'[7;10[': 5,
'[10;13[': 1,
'[13;16[': 0,
'[16;19[': 1,
'[19;22[': 0,
'[22;25[': 0,
'[25;28[': 0}
Performance comparison:
Pandas solution with pd.cut and groupby : 8.51 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Dictionary comprehension : 19.7 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using np.bincount : 22.4 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I have a pandas dataframe in which the column values exist as lists. Each list has several elements and one element can exist in several rows. An example dataframe is:
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
X =
A B
0 1 [a, b, c]
1 2 [a, b]
2 3 [c, d]
I want to find all the rows, i.e. dataframe indexes, corresponding to elements in the lists, and create a dictionary out of it. Disregard column A here, as column B is the one of interest! So element 'a' occurs in index 0,1, which gives {'a':[0,1]}. The solution for this example dataframe is:
Y = {'a':[0,1],'b':[0,1],'c':[0,2],'d':[2]}
I have written a code that works fine, and I can get a result. My problem is more to do with the speed of computation. My actual dataframe has about 350,000 rows and the lists in the column 'B' can contain up to 1,000 elements. But at present the code is running for several hours! I was wondering whether my solution is very inefficient.
Any help with a faster more efficient way will be really appreciated!
Here is my solution code:
import itertools
import pandas as pd
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
B_dict = []
for idx, val in X.iterrows():
    B = val['B']
    B_dict.append(dict(zip(B, [[idx]]*len(B))))

B_dict = [{k: list(itertools.chain.from_iterable(list(filter(None.__ne__, [d.get(k) for d in B_dict])))) for k in set().union(*B_dict)}]
print ('Result:',B_dict[0])
Output
Result: {'d': [2], 'c': [0, 2], 'b': [0, 1], 'a': [0, 1]}
The code for the final line (after the loop) was borrowed from here: Combine values of same keys in a list of dicts, and remove None value from a list without removing the 0 value.
I think a defaultdict will work here in about 1 minute:
from collections import defaultdict
from itertools import chain
dd = defaultdict(list)
for k, v in zip(chain.from_iterable(df.B.ravel()), df.index.repeat(df.B.str.len()).tolist()):
    dd[k].append(v)
Output:
defaultdict(list, {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]})
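If a plain dict is wanted (to match the Y in the question exactly), the defaultdict can simply be converted:
Y = dict(dd)
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}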
X = pd.DataFrame([(1, ['a', 'b', 'c']*300), (2, ['a', 'b']*50),
                  (3, ['c', 'd']*34)], columns=['A', 'B'])
df = pd.concat([X]*150000).reset_index(drop=True)
%%timeit
dd = defaultdict(list)
for k, v in zip(chain.from_iterable(df.B.ravel()), df.index.repeat(df.B.str.len()).tolist()):
    dd[k].append(v)
#38.1 s ± 238 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
idx = np.arange(len(df)).repeat(df['B'].str.len(), 0)
s = df.iloc[idx, ].assign(B=np.concatenate(df['B'].values))['B']
d = s.to_frame().reset_index().groupby('B')['index'].apply(list).to_dict()
#1min 24s ± 458 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Explode your list with this method: https://stackoverflow.com/a/46740682/9177877, then groupby and apply list:
idx = np.arange(len(X)).repeat(X['B'].str.len(), 0)
s = X.iloc[idx, ].assign(B=np.concatenate(X['B'].values))['B']
d = s.to_frame().reset_index().groupby('B')['index'].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}
It's pretty quick on 150,000 rows:
# sample data
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
df = pd.concat([X]*50000).reset_index(drop=True)
%%timeit
idx = np.arange(len(df)).repeat(df['B'].str.len(), 0)
s = df.iloc[idx, ].assign(B=np.concatenate(df['B'].values))['B']
d = s.to_frame().reset_index().groupby('B')['index'].apply(list).to_dict()
# 530 ms ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
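On pandas 0.25+ there is also Series.explode, which does the repeat-and-flatten step in one call; a sketch of my own, not part of the original answer:
d = X['B'].explode().reset_index().groupby('B')['index'].apply(list).to_dict()
# {'a': [0, 1], 'b': [0, 1], 'c': [0, 2], 'd': [2]}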
X = pd.DataFrame([(1,['a','b','c']),(2,['a','b']),(3,['c','d'])],columns=['A','B'])
df = X['B'].apply(pd.Series).T.unstack().reset_index().drop(columns = ['level_1']).dropna()
df.groupby(0)['level_0'].apply(list).to_dict()
I make your B column its own DataFrame, transpose it so the index values become the columns, unstack it, and then finish cleaning it up. It looks like:
df
level_0 0
0 0 a
1 0 b
2 0 c
3 1 a
4 1 b
6 2 c
7 2 d
Then I group by column 0, make it a list, then a dict.
Consider the following variable-length 2D array:
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
How can I find the mean of the values along each column?
I want something like [(1+4+6)/3,(2+5+7)/3, (3+8)/2, 9/1]
So the end result would be [3.667, 4.667, 5.5, 9]
Is this possible using numpy?
I tried np.mean(x, axis=0), but numpy expects arrays of the same length.
Right now, I am popping the elements of each column and finding the mean. Is there a better way to achieve the result?
You could use pandas:
import pandas as pd
a = [[1, 2, 3],
     [4, 5],
     [6, 7, 8, 9]]
df = pd.DataFrame(a)
# 0 1 2 3
# 0 1 2 3 NaN
# 1 4 5 NaN NaN
# 2 6 7 8 9
df.mean()
# 0 3.666667
# 1 4.666667
# 2 5.500000
# 3 9.000000
# dtype: float64
Here is another solution that only uses numpy:
import numpy as np

nrows = len(a)
ncols = max(len(row) for row in a)
arr = np.zeros((nrows, ncols))
arr.fill(np.nan)
for jrow, row in enumerate(a):
    for jcol, col in enumerate(row):
        arr[jrow, jcol] = col
print(np.nanmean(arr, axis=0))
# [3.66666667 4.66666667 5.5        9.        ]
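A slightly more compact way to build the same NaN-padded array, using np.full and row slicing instead of the explicit double loop (my own variation, reusing the nrows/ncols defined above):
arr = np.full((nrows, ncols), np.nan)
for jrow, row in enumerate(a):
    arr[jrow, :len(row)] = row
print(np.nanmean(arr, axis=0))
# [3.66666667 4.66666667 5.5        9.        ]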
Very simple alternative approach using itertools.izip_longest() (in Python 3 it is itertools.zip_longest, and the filter() calls would need to be wrapped in list()):
>>> mean_list = []
>>> for sub_list in izip_longest(*my_list):
...     filtered_list = filter(None, sub_list)
...     mean_list.append(sum(filtered_list)/(len(filtered_list)*1.0))
...
>>> mean_list
[3.6666666666666665, 4.666666666666667, 5.5, 9.0]
where my_list is:
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
Listed in this post is an almost vectorized approach using NumPy. We assign each element of the input list an ID based on its position within its sublist. These IDs are then fed to np.bincount, which performs ID-based summations. Finally, we divide the summations by the number of elements carrying each ID to get the final average values.
Thus, we would have an implementation like so -
def variable_mean(a):
    vals = np.concatenate(a)
    lens = np.array([len(x) for x in a])
    id_arr = np.ones(vals.size, dtype=int)
    id_arr[0] = 0
    id_arr[lens.cumsum()[:-1]] = -lens[:-1] + 1
    IDs = id_arr.cumsum()
    return np.bincount(IDs, vals)/np.bincount(IDs)
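As a quick check on the sample input from the question:
a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
variable_mean(a)
# array([ 3.66666667, 4.66666667, 5.5, 9. ])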
Runtime test -
In [298]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 10 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [299]: %timeit pd.DataFrame(a).mean()  # @Julien Spronck's pandas soln
100 loops, best of 3: 3.33 ms per loop
In [300]: %timeit variable_mean(a)
100 loops, best of 3: 2.36 ms per loop
In [301]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 100 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [302]: %timeit pd.DataFrame(a).mean()  # @Julien Spronck's pandas soln
10 loops, best of 3: 27.1 ms per loop
In [303]: %timeit variable_mean(a)
100 loops, best of 3: 9.58 ms per loop
If you want to do it manually, here is what I would do. Figure out the max array length:
max_length = 0
for array in arrays:
    if len(array) > max_length:
        max_length = len(array)
Pad all arrays to that length with None:
for array in arrays:
    while len(array) < max_length:
        array.append(None)
Zip will group the columns:
columns = list(zip(*arrays))
# columns == [(1, 4, 6), (2, 5, 7), (3, None, 8), (None, None, 9)]
Calculate the average as you would for any list:
for col in columns:
    count = 0
    total = 0.0
    for num in col:
        if num is not None:
            count += 1
            total += float(num)
    print("%s: Avg %s" % (col, total/count))
Or as a list comprehension after padding the arrays:
[sum(x for x in col if x is not None) / len([x for x in col if x is not None]) for col in zip(*arrays)]
Output:
(1, 4, 6): Avg 3.6666666666666665
(2, 5, 7): Avg 4.666666666666667
(3, None, 8): Avg 5.5
(None, None, 9): Avg 9.0
In Py3, zip_longest takes a fillvalue parameter:
In [1208]: ll=[
...: [1, 2, 3],
...: [4, 5],
...: [6, 7, 8, 9]
...: ]
In [1209]: list(itertools.zip_longest(*ll, fillvalue=np.nan))
Out[1209]: [(1, 4, 6), (2, 5, 7), (3, nan, 8), (nan, nan, 9)]
By filling with nan, I can use np.nanmean to take the mean ignoring the nan. nanmean turns its input (here _ from the previous line) into an array:
In [1210]: np.nanmean(_, axis=1)
Out[1210]: array([ 3.66666667, 4.66666667, 5.5 , 9. ])
I wanted to do conditional counting after groupby; for example, group by values of column A, and then count within each group how often value 5 appears in column B.
If I was doing this for the entire DataFrame, it's just len(df[df['B']==5]). So I hoped I could do df.groupby('A')[df['B']==5].size(). But I guess boolean indexing doesn't work within GroupBy objects.
Example:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 0, 4, 4, 6], 'B': [5, 10, 10, 5, 5, 10]})
groups = df.groupby('A')
# some more code
# in the end, I want to get pd.Series({0: 1, 4: 2, 6: 0})
Select all rows where B equals 5, and then apply groupby/size:
In [43]: df.loc[df['B']==5].groupby('A').size()
Out[43]:
A
0 1
4 2
dtype: int64
Alternatively, you could use groupby/agg with a custom function:
In [44]: df.groupby('A')['B'].agg(lambda ser: (ser==5).sum())
Out[44]:
A
0 1
4 2
Name: B, dtype: int64
Note that generally speaking, using agg with a custom function will be slower than using groupby with a builtin method such as size. So prefer the first option over the second.
In [45]: %timeit df.groupby('A')['B'].agg(lambda ser: (ser==5).sum())
1000 loops, best of 3: 927 µs per loop
In [46]: %timeit df.loc[df['B']==5].groupby('A').size()
1000 loops, best of 3: 649 µs per loop
To include A values where the size is zero, you could reindex the result:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 0, 4, 4, 6], 'B': [5, 10, 10, 5, 5, 10]})
result = df.loc[df['B'] == 5].groupby('A').size()
result = result.reindex(df['A'].unique())
yields
A
0 1.0
4 2.0
6 NaN
dtype: float64
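If you also want the 6: 0 entry from the question rather than NaN, you can reindex with a fill value instead (a small variation, assuming the same df):
result = df.loc[df['B'] == 5].groupby('A').size().reindex(df['A'].unique(), fill_value=0)
which yields
A
0    1
4    2
6    0
dtype: int64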