In pandas I have a dataframe of the form:
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,1,0]})
>>> df
   ID  x
0  51  0
1  51  1
2  51  0
3  24  0
4  24  1
5  24  1
6  31  0
For every 'ID', the value of 'x' is recorded several times; it is either 0 or 1. I want to select those rows from df that contain an 'ID' for which 'x' is 1 at least twice.
For every 'ID' I can count the number of times 'x' is 1 with
>>> df.groupby('ID')['x'].sum()
ID
24    2
31    0
51    1
Name: x, dtype: int64
But I don't know how to proceed from here. I would like the following output:
ID x
24 0
24 1
24 1
Use groupby and filter: filter keeps all rows of the groups for which the lambda returns True (here, groups whose 'x' values sum to at least 2).
df.groupby('ID').filter(lambda s: s.x.sum()>=2)
Output:
ID x
3 24 0
4 24 1
5 24 1
df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,1,0]})
df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2,:]
Output:
ID x
3 24 0
4 24 1
5 24 1
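A note on this approach (a small sketch, not part of the original answer): passing the string 'sum' instead of the Python builtin sum lets pandas use its optimized implementation, and the transformed series is aligned to df's index, so it works directly as a boolean mask. This is the variant timed as transform('sum') further down.
df.loc[df.groupby('ID')['x'].transform('sum') >= 2]
# same three rows (index 3, 4, 5) as above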
Using np.bincount and pd.factorize
An alternative, more advanced technique for better performance:
import numpy as np

f, u = df.ID.factorize()                    # integer label per row, plus the unique IDs
df[np.bincount(f, df.x.values)[f] >= 2]     # per-ID sums of x, broadcast back to the rows
ID x
3 24 0
4 24 1
5 24 1
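A quick walk-through of the intermediates for the example df above, to show what factorize and bincount are doing (a sketch added for illustration; sums is just a local name):
f, u = df.ID.factorize()
f                                       # [0 0 0 1 1 1 2] -> integer label per row
list(u)                                 # [51, 24, 31]    -> unique IDs, in order of appearance
sums = np.bincount(f, weights=df.x.values)
sums                                    # [1. 2. 0.]      -> sum of x for each label
df[sums[f] >= 2]                        # the three ID == 24 rows (index 3, 4, 5)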
In obnoxious one-liner form
df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]
ID x
3 24 0
4 24 1
5 24 1
np.bincount and np.unique
I could've used np.unique with the return_inverse parameter to accomplish the exact same thing. But np.unique sorts the array, which changes the time complexity of the solution.
u, f = np.unique(df.ID.values, return_inverse=True)
df[np.bincount(f, df.x.values)[f] >= 2]
One-liner
df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
Timing
%timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]
%timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
%timeit df.groupby('ID').filter(lambda s: s.x.sum()>=2)
%timeit df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2]
%timeit df.loc[df.groupby(['ID'])['x'].transform('sum')>=2]
small data
1000 loops, best of 3: 302 µs per loop
1000 loops, best of 3: 241 µs per loop
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.2 ms per loop
1000 loops, best of 3: 1.21 ms per loop
large data
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    ID=np.random.randint(100, size=10000),
    x=np.random.randint(2, size=10000)
))
1000 loops, best of 3: 528 µs per loop
1000 loops, best of 3: 847 µs per loop
10 loops, best of 3: 20.9 ms per loop
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 1.55 ms per loop
larger data
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    ID=np.random.randint(100, size=100000),
    x=np.random.randint(2, size=100000)
))
1000 loops, best of 3: 2.01 ms per loop
100 loops, best of 3: 6.44 ms per loop
10 loops, best of 3: 29.4 ms per loop
100 loops, best of 3: 3.84 ms per loop
100 loops, best of 3: 3.74 ms per loop
I have a dataframe df:
Name
Sri
Sri,Ram
Sri,Ram,kumar
Ram
I am trying to count how many times each individual name occurs.
I am not getting my desired output when using
df["Name"].value_counts()
My desired output is:
Sri 3
Ram 3
Kumar 1
Split the column, stack to long format, then count:
df.Name.str.split(',', expand=True).stack().value_counts()
#Sri 3
#Ram 3
#kumar 1
#dtype: int64
Or maybe:
df.Name.str.get_dummies(',').sum()
#Ram 3
#Sri 3
#kumar 1
#dtype: int64
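To see why the get_dummies option works (a quick check using the df above): str.get_dummies(',') first builds one indicator column per name, and the column-wise sum then gives the counts.
df.Name.str.get_dummies(',')
#   Ram  Sri  kumar
# 0    0    1      0
# 1    1    1      0
# 2    1    1      1
# 3    1    0      0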
Or concatenate before value_counts:
pd.value_counts(pd.np.concatenate(df.Name.str.split(',')))
#Sri 3
#Ram 3
#kumar 1
#dtype: int64
Timing:
%timeit df.Name.str.split(',', expand=True).stack().value_counts()
#1000 loops, best of 3: 1.02 ms per loop
%timeit df.Name.str.get_dummies(',').sum()
#1000 loops, best of 3: 1.18 ms per loop
%timeit pd.value_counts(pd.np.concatenate(df.Name.str.split(',')))
#1000 loops, best of 3: 573 µs per loop
# option from #Bharathshetty
from collections import Counter
%timeit pd.Series(Counter((df['Name'].str.strip() + ',').sum().rstrip(',').split(',')))
# 1000 loops, best of 3: 498 µs per loop
# option inspired by #Bharathshetty
%timeit pd.value_counts(df.Name.str.cat(sep=',').split(','))
# 1000 loops, best of 3: 483 µs per loop
I have 2 numpy arrays of shape (5,1) say:
a=[1,2,3,4,5]
b=[2,4,2,3,6]
How can I make a matrix by multiplying each i-th element of one array with each j-th element of the other? Like:
       a =  1   2   3   4   5
b
2           2   4   6   8  10
4           4   8  12  16  20
2           2   4   6   8  10
3           3   6   9  12  15
6           6  12  18  24  30
Can I do this without using for loops? Is there any combination of reshape, reductions or multiplications that I can use?
Right now I create an a*b tiling of each array along rows and along columns and then multiply element-wise, but it seems to me there must be an easier way.
With numpy.outer() and numpy.transpose() routines:
import numpy as np
a = [1,2,3,4,5]
b = [2,4,2,3,6]
c = np.outer(a,b).transpose()
print(c)
Or just with swapped array order:
c = np.outer(b, a)
The output:
[[ 2 4 6 8 10]
[ 4 8 12 16 20]
[ 2 4 6 8 10]
[ 3 6 9 12 15]
[ 6 12 18 24 30]]
For some reason np.multiply.outer seems to be faster than np.outer for small inputs. And broadcasting is faster still - but for bigger arrays they are all pretty much equal.
%timeit np.outer(a,b)
%timeit np.multiply.outer(a,b)
%timeit a[:, None]*b
100000 loops, best of 3: 5.97 µs per loop
100000 loops, best of 3: 3.27 µs per loop
1000000 loops, best of 3: 1.38 µs per loop
a = np.random.randint(0,10,100)
b = np.random.randint(0,10,100)
%timeit np.outer(a,b)
%timeit np.multiply.outer(a,b)
%timeit a[:, None]*b
100000 loops, best of 3: 15.5 µs per loop
100000 loops, best of 3: 14 µs per loop
100000 loops, best of 3: 13.5 µs per loop
a = np.random.randint(0,10,10000)
b = np.random.randint(0,10,10000)
%timeit np.outer(a,b)
%timeit np.multiply.outer(a,b)
%timeit a[:, None]*b
10 loops, best of 3: 154 ms per loop
10 loops, best of 3: 154 ms per loop
10 loops, best of 3: 152 ms per loop
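Since the question describes arrays that are already shape (5, 1), here is a minimal broadcasting sketch equivalent to np.outer(b, a), assuming a and b are column vectors:
import numpy as np

a = np.array([1, 2, 3, 4, 5]).reshape(5, 1)   # shape (5, 1)
b = np.array([2, 4, 2, 3, 6]).reshape(5, 1)   # shape (5, 1)
c = b * a.T                                   # (5, 1) * (1, 5) broadcasts to (5, 5)
# c[i, j] == b[i] * a[j], the same matrix as np.outer(b, a)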
After experimenting with timing various types of lookups on a Pandas (0.17.1) DataFrame I am left with a few questions.
Here is the setup...
import pandas as pd
import numpy as np
import itertools
letters = [chr(x) for x in range(ord('a'), ord('z'))]
letter_combinations = [''.join(x) for x in itertools.combinations(letters, 3)]
df1 = pd.DataFrame({
    'value': np.random.normal(size=(1000000)),
    'letter': np.random.choice(letter_combinations, 1000000)
})
df2 = df1.sort_values('letter')
df3 = df1.set_index('letter')
df4 = df3.sort_index()
So df1 looks something like this...
print(df1.head(5))
>>>
letter value
0 bdh 0.253778
1 cem -1.915726
2 mru -0.434007
3 lnw -1.286693
4 fjv 0.245523
Here is the code to test differences in lookup performance...
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df1[df1.letter == 'ben']
%timeit df1[df1.letter == 'amy']
%timeit df1[df1.letter == 'abe']
print('~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df2[df2.letter == 'ben']
%timeit df2[df2.letter == 'amy']
%timeit df2[df2.letter == 'abe']
print('~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df3.loc['ben']
%timeit df3.loc['amy']
%timeit df3.loc['abe']
print('~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
%timeit df4.loc['ben']
%timeit df4.loc['amy']
%timeit df4.loc['abe']
And the results...
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / UNSORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
10 loops, best of 3: 59.7 ms per loop
~~~~~~~~~~~~~~~~~NON-INDEXED LOOKUPS / SORTED DATASET~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 193 ms per loop
~~~~~~~~~~~~~~~~~~~~~INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached
10 loops, best of 3: 40.9 ms per loop
10 loops, best of 3: 41 ms per loop
10 loops, best of 3: 40.9 ms per loop
~~~~~~~~~~~~~~~~~~~~~SORTED INDEXED LOOKUPS~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slowest run took 1621.00 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 242 µs per loop
1000 loops, best of 3: 243 µs per loop
Questions...
It's pretty clear why the lookup on the sorted index is so much faster: binary search gives O(log(n)) performance vs O(n) for a full array scan. But why is the lookup on the sorted, non-indexed df2 column SLOWER than the lookup on the unsorted, non-indexed column of df1?
What is up with the warning "The slowest run took x times longer than the fastest. This could mean that an intermediate result is being cached"? Surely the results aren't being cached. Is it because the created index is lazy and isn't actually built until needed? That would explain why the warning only appears on the first call to .loc[].
Why isn't an index sorted by default? Is the fixed cost of the sort too high?
The disparity in these %timeit results
In [273]: %timeit df1[df1['letter'] == 'ben']
10 loops, best of 3: 36.1 ms per loop
In [274]: %timeit df2[df2['letter'] == 'ben']
10 loops, best of 3: 108 ms per loop
also shows up in the pure NumPy equality comparisons:
In [275]: %timeit df1['letter'].values == 'ben'
10 loops, best of 3: 24.1 ms per loop
In [276]: %timeit df2['letter'].values == 'ben'
10 loops, best of 3: 96.5 ms per loop
Under the hood, Pandas' df1['letter'] == 'ben' calls a Cython
function
which loops through the values of the underlying NumPy array,
df1['letter'].values. It is essentially doing the same thing as
df1['letter'].values == 'ben' but with different handling of NaNs.
Moreover, notice that simply accessing the items in df1['letter'] in
sequential order can be done more quickly than doing the same for df2['letter']:
In [11]: %timeit [item for item in df1['letter']]
10 loops, best of 3: 49.4 ms per loop
In [12]: %timeit [item for item in df2['letter']]
10 loops, best of 3: 124 ms per loop
The differences in times within each of these three sets of %timeit tests are roughly the same. I think that is because they all share the same cause.
Since the letter column holds strings, the NumPy arrays df1['letter'].values and
df2['letter'].values have dtype object and therefore they hold
pointers to the memory location of the arbitrary Python objects (in this case strings).
Consider the memory location of the strings stored in the DataFrames, df1 and
df2. In CPython the id returns the memory location of the object:
memloc = pd.DataFrame({'df1': list(map(id, df1['letter'])),
                       'df2': list(map(id, df2['letter']))})
memloc.head()
df1 df2
0 140226328244040 140226299303840
1 140226328243088 140226308389048
2 140226328243872 140226317328936
3 140226328243760 140226230086600
4 140226328243368 140226285885624
The strings in df1 (after the first dozen or so) tend to appear sequentially
in memory, while sorting causes the strings in df2 (taken in order) to be
scattered in memory:
In [272]: diffs = memloc.diff(); diffs.head(30)
Out[272]:
df1 df2
0 NaN NaN
1 -952.0 9085208.0
2 784.0 8939888.0
3 -112.0 -87242336.0
4 -392.0 55799024.0
5 -392.0 5436736.0
6 952.0 22687184.0
7 56.0 -26436984.0
8 -448.0 24264592.0
9 -56.0 -4092072.0
10 -168.0 -10421232.0
11 -363584.0 5512088.0
12 56.0 -17433416.0
13 56.0 40042552.0
14 56.0 -18859440.0
15 56.0 -76535224.0
16 56.0 94092360.0
17 56.0 -4189368.0
18 56.0 73840.0
19 56.0 -5807616.0
20 56.0 -9211680.0
21 56.0 20571736.0
22 56.0 -27142288.0
23 56.0 5615112.0
24 56.0 -5616568.0
25 56.0 5743152.0
26 56.0 -73057432.0
27 56.0 -4988200.0
28 56.0 85630584.0
29 56.0 -4706136.0
Most of the strings in df1 are 56 bytes apart:
In [16]: diffs['df1'].value_counts()
Out[16]:
56.0 986109
120.0 13671
-524168.0 215
-56.0 1
-12664712.0 1
41136.0 1
-231731080.0 1
Name: df1, dtype: int64
In [20]: len(diffs['df1'].value_counts())
Out[20]: 7
In contrast the strings in df2 are scattered all over the place:
In [17]: diffs['df2'].value_counts().head()
Out[17]:
-56.0 46
56.0 44
168.0 39
-112.0 37
-392.0 35
Name: df2, dtype: int64
In [19]: len(diffs['df2'].value_counts())
Out[19]: 837764
When these objects (strings) are located sequentially in memory, their values
can be retrieved more quickly. This is why the equality comparisons performed by
df1['letter'].values == 'ben' can be done faster than those in df2['letter'].values
== 'ben'. The lookup time is smaller.
This memory accessing issue also explains why there is no disparity in the
%timeit results for the value column.
In [5]: %timeit df1[df1['value'] == 0]
1000 loops, best of 3: 1.8 ms per loop
In [6]: %timeit df2[df2['value'] == 0]
1000 loops, best of 3: 1.78 ms per loop
df1['value'] and df2['value'] are NumPy arrays of dtype float64. Unlike object
arrays, their values are packed together contiguously in memory. Sorting df1
with df2 = df1.sort_values('letter') causes the values in df2['value'] to be
reordered, but since the values are copied into a new NumPy array, the values
are located sequentially in memory. So accessing the values in df2['value'] can
be done just as quickly as those in df1['value'].
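A quick way to check the layout claims above (the dtypes are the point; exact addresses and timings will vary):
df1['letter'].values.dtype    # dtype('O')       -> an array of pointers to Python str objects
df2['value'].values.dtype     # dtype('float64') -> values stored inline, contiguously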
(1) pandas currently has no knowledge of the sortedness of a column.
If you want to take advantage of sorted data, you could use df2.letter.searchsorted. See @unutbu's answer for an explanation of what's actually causing the difference in time here.
(2) The hash table that sits underneath the index is lazily created, then cached.
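A rough sketch of the searchsorted idea from point (1), added for illustration (variable names are mine; it assumes df2 is sorted by 'letter' as above and is not a drop-in replacement):
letters = df2['letter'].values                   # sorted object array
lo = letters.searchsorted('ben', side='left')    # first position of 'ben'
hi = letters.searchsorted('ben', side='right')   # one past the last position
subset = df2.iloc[lo:hi]                         # all rows with letter == 'ben'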
Suppose we have a simple DataFrame:
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']
How can I calculate the number of words in each row, to get something like:
1 word: 2
2 words: 2
3 words: 1
4 words: 1
IIUC then you can do the following:
In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[89]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
Here we use the vectorised str.split to split on spaces, then apply len to get the number of elements per row; we can then call value_counts to aggregate the frequency counts.
We then rename the index and sort it to get the desired output.
UPDATE
This can also be done using str.len on the split result rather than apply, which should scale better:
In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count
Out[41]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
Timings
In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop
For a 6K df:
In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()
100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop
You could use str.count with space ' ' as delimiter.
In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)
In [1717]: count.index = count.index.astype('str') + ' words:'
In [1718]: count
Out[1718]:
1 words: 2
2 words: 2
3 words: 1
4 words: 1
Name: fruits, dtype: int64
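One caveat (added as a note, not from the original answers): str.count(' ') assumes single spaces between words, while str.split() handles runs of whitespace. A quick check on a hypothetical value:
s = pd.Series(['one  apple'])     # note the double space
s.str.count(' ').add(1)           # 3 -> overcounts
s.str.split().str.len()           # 2 -> whitespace-aware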
Timings
str.count is marginally faster
Small
In [1724]: df.shape
Out[1724]: (6, 1)
In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop
In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop
Medium
In [1728]: df.shape
Out[1728]: (6000, 1)
In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop
In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop
Large
In [1732]: df.shape
Out[1732]: (60000, 1)
In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop
In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
I have a very large file (5 GB), and I need to count the number of occurrences of each combination of two columns.
a b c d e
0 2 3 1 5 4
1 2 3 2 5 4
2 1 3 2 5 4
3 2 4 1 5 3
4 2 4 1 5 3
so obviously I have to find
(2,3):2
(1,3):1
(2,4):2
How can I do that in a very fast way?
I used:
df.groupby(['a','b']).count().to_dict()
Let's say that the final result would be
a b freq
2 3 2
1 3 1
2 4 2
Approach for the first version of the question - dictionary as result
If you have high frequencies, i.e. few combinations of a and b, the final dictionary will be small. If you have many different combinations, you will need lots of RAM.
If you have low frequencies and enough RAM, your approach looks good.
Some timings for 5e6 rows and numbers from 0 to 19:
>>> df = pd.DataFrame(np.random.randint(0, 19, size=(5000000, 5)), columns=list('abcde'))
>>> df.shape
(5000000, 5)
%timeit df.groupby(['a','b']).count().to_dict()
1 loops, best of 3: 552 ms per loop
%timeit df.groupby(['a','b']).size()
1 loops, best of 3: 619 ms per loop
%timeit df.groupby(['a','b']).count()
1 loops, best of 3: 588 ms per loop
Using a different range of integers, here up to sys.maxsize (9223372036854775807), changes the timings considerably:
import sys
df = pd.DataFrame(np.random.randint(0, high=sys.maxsize, size=(5000000, 5)),
                  columns=list('abcde'))
%timeit df.groupby(['a','b']).count().to_dict()
1 loops, best of 3: 41.3 s per loop
%timeit df.groupby(['a','b']).size()
1 loops, best of 3: 11.4 s per loop
%timeit df.groupby(['a','b']).count()
1 loops, best of 3: 12.9 s per loop
Solution for the updated question
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'freq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
print(g)
a b freq
0 1 3 1
1 2 3 2
2 2 4 2
It is not much faster though.
For range 0 to 19:
%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'freq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
1 loops, best of 3: 564 ms per loop
For range 0 to sys.maxsize:
%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'freq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
1 loops, best of 3: 10.2 s per loop
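As a closing sketch for the updated question (not from the original answer): groupby(...).size() gives the frequency frame directly and, per the timings above, performs in the same ballpark as the other options. Shown here with the small example frame from the question:
g = df.groupby(['a', 'b']).size().reset_index(name='freq')
print(g)
#    a  b  freq
# 0  1  3     1
# 1  2  3     2
# 2  2  4     2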