What's the simplest way of selecting all rows from a pandas DataFrame whose sym occurs exactly twice in the entire table? For example, in the table below, I would like to select all rows with sym in ['b','e'], since the value_counts for these symbols equal 2.
import pandas as pd
import numpy as np

df = pd.DataFrame({'sym':['a', 'b', 'b', 'c', 'd', 'd', 'd', 'e', 'e'], 'price':np.random.randn(9)})
price sym
0 -0.0129 a
1 -1.2940 b
2 1.8423 b
3 -0.7160 c
4 -2.3216 d
5 -0.0120 d
6 -0.5914 d
7 0.6280 e
8 0.5361 e
df.sym.value_counts()
Out[237]:
d 3
e 2
b 2
c 1
a 1
You can use groupby on column sym and filter groups with length == 2:
print(df.groupby("sym").filter(lambda x: len(x) == 2))
price sym
1 -1.2940 b
2 1.8423 b
7 0.6280 e
8 0.5361 e
A second solution uses isin with boolean indexing:
s = df.sym.value_counts()
print(s[s == 2].index)
Index(['e', 'b'], dtype='object')
print(df[df.sym.isin(s[s == 2].index)])
price sym
1 -1.2940 b
2 1.8423 b
7 0.6280 e
8 0.5361 e
The fastest solution uses transform with boolean indexing:
print(df[df.groupby("sym")["sym"].transform('size') == 2])
price sym
1 -1.2940 b
2 1.8423 b
7 0.6280 e
8 0.5361 e
You can use map, which should be faster than using groupby and transform:
df[df['sym'].map(df['sym'].value_counts()) == 2]
e.g.
%%timeit
df[df['sym'].map(df['sym'].value_counts()) == 2]
1.83 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df[df.groupby("sym")["sym"].transform('size') == 2]
2.08 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Related
Repeating a pandas Series with the repeat() function:
s = pd.Series(['a', 'b', 'c'])
s.repeat(2)
0 a
0 a
1 b
1 b
2 c
2 c
dtype: object
But I need to get output like this instead:
0 a
1 b
2 c
0 a
1 b
2 c
dtype: object
Use np.tile with Series.loc if performance is important:
a = s.loc[np.tile(s.index, 2)]
print(a)
0 a
1 b
2 c
0 a
1 b
2 c
dtype: object
s = pd.Series(['a', 'b', 'c'])
In [25]: %timeit (s.loc[np.tile(s.index, 2000)])
612 µs ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: %timeit (pd.concat([s] * 2000))
22.2 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
EDIT:
s = pd.Series(['a', 'b', 'c'], index=pd.date_range('2015-01-01', periods=3))
a = s.loc[np.tile(s.index, 2)]
print(a)
2015-01-01 a
2015-01-02 b
2015-01-03 c
2015-01-01 a
2015-01-02 b
2015-01-03 c
dtype: object
You can use pandas' concat function as follows:
pd.concat([s] * 2)
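If you would rather end up with a fresh 0..5 index than the repeated 0 1 2 0 1 2, concat also accepts ignore_index (a small variation, in case that is what you need):
pd.concat([s] * 2, ignore_index=True)  # index becomes 0,1,2,3,4,5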
I have a DataFrame whose first column has 11 rows. I want to create a second column that counts from 1 to 4, resets and counts from 1 to 4 again, and stops counting when it reaches the last row.
For instance, given df['item'], the code should create df['new column']:
df['item'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
df['new column'] = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3]
Use modulo with 4 and add 1:
import pandas as pd
import numpy as np

df = pd.DataFrame({'item': list('abcdefghijk')})
# default-index solution
df['new column'] = df.index % 4 + 1
# general solution, works for any index:
# df['new column'] = np.arange(len(df)) % 4 + 1
print(df)
Output:
item new column
0 a 1
1 b 2
2 c 3
3 d 4
4 e 1
5 f 2
6 g 3
7 h 4
8 i 1
9 j 2
10 k 3
For a large DataFrame, performance differs between the solutions:
df = pd.DataFrame({'a':range(1000000)})
In [307]: %timeit df['new column'] = (len(df)*[1, 2, 3, 4])[:len(df)]
363 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [308]: %timeit df['new column1'] = df.index % 4 + 1
35.1 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [309]: %timeit df['new column2'] = np.arange(len(df)) % 4 + 1
14.4 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can repeat the list [1, 2, 3, 4] n times simply by writing n * [1, 2, 3, 4], and slicing trims the result to the frame's length. Thus your new column is created with:
df['new column'] = (len(df)*[1, 2, 3, 4])[:len(df)]
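As a quick sanity check of the over-build-and-slice idea on the example's 11 rows:
pattern = (11 * [1, 2, 3, 4])[:11]  # 44 elements, trimmed to 11
print(pattern)  # [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3]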
I'm trying to create a new column in my DataFrame that is a list of aggregated column names. Here's a sample DataFrame:
In [1]: df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
In [2]: df
Out[2]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
I'd like to create a new column containing a list of column names where a certain condition is met. Say that I'm interested in columns where value > 3 -- I would want an output that looks like this:
In [3]: df
Out[3]:
A B C D E F Flag
0 1 4 7 1 5 7 ['B', 'C', 'E', 'F']
1 2 5 8 3 3 4 ['B', 'C', 'F']
2 3 6 9 5 6 3 ['B', 'C', 'D', 'E']
Currently, I'm using apply:
df['Flag'] = df.apply(lambda row: [list(df)[i] for i, j in enumerate(row) if j > 3], axis = 1)
This gets the job done, but feels clunky and I'm wondering if there is a more elegant solution.
Thanks!
Use df.dot() here:
df['Flag'] = (df > 3).dot(df.columns).apply(list)
print(df)
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
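One caveat worth adding (my note, not part of the original answer): once Flag exists, rerunning df > 3 fails because the list-valued column cannot be compared to an integer. Restricting the comparison to the value columns avoids that:
value_cols = df.columns.drop('Flag', errors='ignore')  # ignore if Flag does not exist yet
df['Flag'] = (df[value_cols] > 3).dot(value_cols).apply(list)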
I still like a plain list comprehension here:
df['Flag'] = [df.columns[x].tolist() for x in df.gt(3).values]
df
Out[968]:
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
One option is to create a dataframe of booleans by checking which values are above a certain threshold with DataFrame.gt, and take the dot product with the column names. Finally use apply(list) to obtain lists from the resulting strings:
df['Flag'] = df.gt(3).dot(df.columns).apply(list)
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
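To see why the final apply(list) is needed, inspect the intermediate result (on the original frame, before Flag is added): the dot product of each boolean row with the column names concatenates the matching names into a single string:
print(df.gt(3).dot(df.columns))
0    BCEF
1     BCF
2    BCDE
dtype: object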
Another way:
df['Flag'] = df.T.apply(lambda x: list(x[x>3].index))
Edit: timings for all the solutions to this question are added below.
I prefer a solution without apply:
df['Flag'] = (df.reset_index()
                .melt(id_vars='index', value_name='val', var_name='col')
                .query('val > 3')
                .groupby('index')['col'].agg(list))
Or
df['Flag'] = (df.stack().rename('val')
                .reset_index(level=1)
                .query('val > 3')
                .groupby(level=0)['level_1'].agg(list))
Out[2576]:
A B C D E F Flag
0 1 4 7 1 5 7 [B, C, E, F]
1 2 5 8 3 3 4 [B, C, F]
2 3 6 9 5 6 3 [B, C, D, E]
Test data:
a = [
[1, 4, 7, 1, 5, 7],
[2, 5, 8, 3, 3, 4],
[3, 6, 9, 5, 6, 3],
] * 10000
df = pd.DataFrame(a, columns = list('ABCDEF'))
Timing with %timeit:
In [79]: %timeit (df>3).dot(df.columns).apply(list)
40.8 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [80]: %timeit [df.columns[x].tolist() for x in df.gt(3).values]
1.23 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [81]: %timeit df.gt(3).dot(df.columns).apply(list)
37.6 ms ± 644 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [82]: %timeit df.T.apply(lambda x: list(x[x>3].index))
16.4 s ± 99.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [83]: %timeit df.stack().rename('val').reset_index(level=1).query('val > 3').groupby(level=0)['level_1'].agg(list)
4.05 s ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [84]: %timeit df.apply(lambda x: df.columns[np.argwhere(x>3).ravel()].values, axis=1)
c:\program files\python37\lib\site-packages\numpy\core\fromnumeric.py:56: FutureWarning: Series.nonzero() is deprecated and will be removed in a future version. Use Series.to_numpy().nonzero() instead
  return getattr(obj, method)(*args, **kwds)
12 s ± 45.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The fastest solutions are those using .dot.
Using numpy.argwhere and ravel():
df.apply(lambda x: df.columns[np.argwhere(x > 3).ravel()].values, axis=1)
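If you want to avoid the Series.nonzero() deprecation warning seen in the timings above, np.flatnonzero on the underlying array is one possible substitute (a sketch with the same logic):
import numpy as np
df.apply(lambda x: df.columns[np.flatnonzero((x > 3).to_numpy())].tolist(), axis=1)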
We can also use the @ (matrix multiplication) operator:
df['Flag'] = ((df > 3) @ df.columns).map(list)
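Note that splitting the concatenated string into single characters only works because every column name here is one character long. With longer names you could dot against the names plus a separator and split on that instead (a sketch on the original frame, assuming ',' never occurs in a name):
flags = (df.gt(3) @ (df.columns + ',')).str.rstrip(',').str.split(',')  # rows with no match yield ['']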
My DataFrame has around 9K columns, and I want to remove the . from every column name; see example column names below:
`traffic.seas1`
`traffic.seas2`
`traffic.seas3`
These are just three examples; I have 9K columns, some without a . but many with one. How can I remove the dots efficiently? Using rename by hand is too manual.
You can use str.replace. Since . is a regex metacharacter, pass regex=False so the dot is treated literally (otherwise every character would be replaced):
df.columns = df.columns.str.replace('.', '', regex=False)
Or list comprehension with replace:
df.columns = [x.replace('.','') for x in df.columns]
Sample:
df = pd.DataFrame({'traffic.seas1':list('abcdef'),
'traffic.seas2':[4,5,4,5,5,4],
'traffic.seas3':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
D E F traffic.seas1 traffic.seas2 traffic.seas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
df.columns = df.columns.str.replace('.', '', regex=False)
print (df)
D E F trafficseas1 trafficseas2 trafficseas3
0 1 5 a a 4 7
1 3 3 a b 5 8
2 5 6 a c 4 9
3 7 9 b d 5 4
4 1 2 b e 5 2
5 0 4 b f 4 3
Timings:
N = 9000
df = pd.DataFrame(np.random.randint(10, size=(3, N))).add_prefix('traffic.seas')
print (df)
In [161]: %timeit df.columns = df.columns.str.replace('.', '', regex=False)
4.4 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [162]: %timeit df.columns = [x.replace('.','') for x in df.columns]
2.53 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
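For the record, rename itself need not be manual: it accepts a function that is applied to every label, so a third equivalent spelling is (a sketch):
df = df.rename(columns=lambda c: c.replace('.', ''))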
You can use list comprehension on df.columns like this:
df.columns = [c.replace('.', '') for c in df.columns]
For example:
df = pd.DataFrame({'foo': [1], 'bar.z': [2]})
>>> df.columns
Index(['bar.z', 'foo'], dtype='object')
df.columns = [c.replace('.', '') for c in df.columns]
>>> df
barz foo
0 2 1
For each row of a pandas DataFrame, I'm trying to find the names of the columns whose value is greater than the value in another column.
For example, if I have the following dataframe:
A B C D threshold
0 1 3 3 1 2
1 2 3 6 1 5
2 9 5 0 2 4
For each row I would like to return the names of the columns where the values are greater than the threshold, so I would have:
0: B, C
1: C
2: A, B
Any help would be much appreciated!
If you want a large increase in speed, you can use NumPy's vectorized where function.
s = np.where(df.gt(df['threshold'], axis=0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
pd.Series([''.join(x).strip(', ') for x in s])
0 B, C
1 C
2 A, B
dtype: object
This is more than an order of magnitude faster than the solutions from @jezrael and @MaxU on a DataFrame of 100,000 rows. Here I create the test DataFrame first.
n = 100000
df = pd.DataFrame(np.random.randint(0, 10, (n, 5)),
columns=['A', 'B', 'C', 'D', 'threshold'])
Timings
%%timeit
s = np.where(df.gt(df['threshold'], axis=0), ['A, ', 'B, ', 'C, ', 'D, ', ''], '')
pd.Series([''.join(x).strip(', ') for x in s])
280 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df1 = df.drop('threshold', axis=1).gt(df['threshold'], axis=0)
df1 = df1.apply(lambda x: ', '.join(x.index[x]), axis=1)
3.15 s ± 82.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
x = df.drop('threshold', axis=1)
x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
3.28 s ± 145 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
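A side note on the np.where solution (my generalization, not in the original answer): the label array ['A, ', 'B, ', ...] is hardcoded, but it can be built from df.columns so the same trick works for any column set:
labels = [c + ', ' if c != 'threshold' else '' for c in df.columns]  # threshold maps to ''
s = np.where(df.gt(df['threshold'], axis=0), labels, '')
result = pd.Series([''.join(x).strip(', ') for x in s])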
You can use:
df1 = df.drop('threshold', axis=1).gt(df['threshold'], axis=0)
df1 = df1.apply(lambda x: ', '.join(x.index[x]), axis=1)
print (df1)
0 B, C
1 C
2 A, B
dtype: object
Similar solution:
df1 = (df.drop('threshold', axis=1).gt(df['threshold'], axis=0)
         .stack().rename_axis(('a', 'b')).reset_index(name='boolean'))
a = df1[df1['boolean']].groupby('a')['b'].apply(', '.join).reset_index()
print (a)
a b
0 0 B, C
1 1 C
2 2 A, B
You can do it this way:
In [99]: x = df.drop('threshold', axis=1)
In [100]: x
Out[100]:
A B C D
0 1 3 3 1
1 2 3 6 1
2 9 5 0 2
In [102]: x.T.gt(df['threshold']).agg(lambda c: ', '.join(x.columns[c]))
Out[102]:
0 B, C
1 C
2 A, B
dtype: object
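For completeness, the dot-product trick from the Flag question above also works here; a sketch, assuming no column name contains a comma:
cols = df.columns.drop('threshold')
mask = df[cols].gt(df['threshold'], axis=0)  # row-wise comparison against the threshold column
print((mask @ (cols + ', ')).str.rstrip(', '))
0    B, C
1       C
2    A, B
dtype: object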