Count elements in defined groups in pandas dataframe - python

Say I have a dataframe and I want to count how many times each element of a list, e.g. [1,5,2], occurs in a column (or in each column).
I could do something like
elem_list = [1,5,2]
for e in elem_list:
    print((df["col1"] == e).sum())
but isn't there a better way like
elem_list = [1,5,2]
df["col1"].count_elements(elem_list)
#1 5 # 1 occurs 5 times
#5 3 # 5 occurs 3 times
#2 0 # 2 occurs 0 times
Note it should count all the elements in the list, and return "0" if an element in the list is not in the column.

You can use value_counts and reindex:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})
elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)
output:
1 5
5 2
2 0
benchmark (100k values):
# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})
df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
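If the counts are needed for each column rather than a single one, the same value_counts/reindex idea can be applied to the whole frame with DataFrame.apply. A minimal sketch, with a hypothetical second column col2 added purely for illustration:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3],
                   'col2': [2,2,2,9,9,1,5,5,5]})
elem_list = [1,5,2]
df.apply(lambda s: s.value_counts().reindex(elem_list, fill_value=0))
#    col1  col2
# 1     5     1
# 5     2     3
# 2     0     3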

Pass the list as the categories to pd.Categorical, which will return 0 for any missing item:
pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]:
1 5
5 2
2 0
dtype: int64

First filter with Series.isin and DataFrame.loc, then use Series.value_counts; finally, add Series.reindex if order matters or if missing elements should appear with a count of 0:
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)

You could do something like this:
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()
# col1
# 1 17
# 0 10
# dtype: int64

Related

How do I sort the numeric values within a cell in a Pandas series

I have a pandas series/column where the values look like this:
Values
101;1001
130;125
113;99
1001;101
I need to sort the values within each cell, with the expected outcome below. The dataframe is large (more than 5 million values), so any faster way would be appreciated.
Values
101;1001
125;130
99;113
101;1001
Split each value, convert the parts to integers, sort, convert back to strings and join:
df['Values'] = df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
Performance:
#10k rows
df = pd.concat([df] * 10000, ignore_index=True)
#enke solution
In [52]: %timeit df['Values'].str.split(';').explode().sort_values(key=lambda x: x.str.zfill(10)).groupby(level=0).agg(';'.join)
616 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [53]: %timeit df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
70.7 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#1M rows
df = pd.concat([df] * 1000000, ignore_index=True)
#mozway solution
In [60]: %timeit df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
8.03 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit df['Values'] = df['Values'].map(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
7.88 s ± 602 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution for 2 columns:
df1 = df['Values'].str.split(';', expand=True).astype(int)
df1 = pd.DataFrame(np.sort(df1, axis=1), index=df1.index, columns=df1.columns)
print (df1)
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001
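If the result should go back to the original 'a;b' string format rather than two integer columns, the sorted frame can be joined again. A small sketch building on df1 from the snippet above:
#join the sorted integer columns back into ';'-separated strings
df['Values'] = df1.astype(str).apply(';'.join, axis=1)
print (df)
     Values
0  101;1001
1   125;130
2    99;113
3  101;1001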
You can use a list comprehension; this will be faster on small datasets:
df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
output:
Values
0 101;1001
1 125;130
2 99;113
3 101;1001
For two columns:
df2 = pd.DataFrame([sorted(map(int, x.split(';'))) for x in df['Values']])
output:
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001

How to get Index of first Row with non-zero minimum value in Pandas DataFrame?

Assuming I have the following Pandas DataFrame:
U A B
0 2000 10 20
1 3000 40 0
2 2100 20 30
3 2500 0 30
4 2600 30 40
How can I get the index of the first row where both A and B have non-zero values and (A+B)/2 is larger than 15?
In this example, I would like to get 2, since it is the first row with non-zero A and B and an average of 25, which is more than 15.
Note that this DataFrame is huge, so I am looking for the fastest way to get the index value.
Let's try:
df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
I find explicit variables more readable, like:
AB2 = (df['A']+df['B'])/2
filter = (df['A'] != 0) & (df['B'] != 0) & (AB2>15)
your_index = df[filter].index[0]
Performance
For this use case (a ridiculously small dataset):
%%timeit
df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
1.21 ms ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
AB2 = (df['A']+df['B'])/2
filter = (df['A'].ne(0)) & (df['B'].ne(0)) & (AB2>15)
df[filter].index[0]
1.08 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]
2.71 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If the dataframe is large, query might be faster:
df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]
2
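One caveat worth adding here, as a hedged remark rather than part of the original answers: .index[0] raises an IndexError when no row matches, whereas first_valid_index() returns None. If an empty match is possible, a small guard could look like:
mask = df.A.ne(0) & df.B.ne(0) & ((df.A + df.B) / 2).gt(15)
idx = df[mask].index[0] if mask.any() else None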

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. Couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
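A difference worth keeping in mind (a hedged note, not from the original answer): Series.map returns NaN for values that are missing from labels, while replace leaves them unchanged. If unmapped values should be kept as-is, a minimal sketch:
# keep the original value wherever labels has no entry for it
df['idxA'] = df['idxA'].map(labels).fillna(df['idxA'])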

Slice pandas string in a vectorised way [duplicate]

This question already has answers here:
How to slice strings in a column by another column in pandas
(2 answers)
Closed 4 years ago.
I am trying to slice the strings in a vectorized way, but the answer is NaN. It works OK if the slice index is constant (say str[:1]). Any help?
df = pd.DataFrame({'NAME': ['abc','xyz','hello'], 'SEQ': [1,2,1]})
df['SUB'] = df['NAME'].str[:df['SEQ']]
The output is
NAME SEQ SUB
0 abc 1 NaN
1 xyz 2 NaN
2 hello 1 NaN
Unfortunately, a vectorized solution does not exist.
Use apply with a lambda function:
df['SUB'] = df.apply(lambda x: x['NAME'][:x['SEQ']], axis=1)
Or use zip with a list comprehension for better performance:
df['SUB'] = [x[:y] for x, y in zip(df['NAME'], df['SEQ'])]
print (df)
NAME SEQ SUB
0 abc 1 a
1 xyz 2 xy
2 hello 1 h
Timings:
df = pd.DataFrame({'NAME': ['abc','xyz','hello'], 'SEQ': [1,2,1]})
df = pd.concat([df] * 1000, ignore_index=True)
In [270]: %timeit df["SUB"] = df.groupby("SEQ").NAME.transform(lambda g: g.str[: g.name])
4.23 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [271]: %timeit df['SUB'] = df.apply(lambda x: x['NAME'][:x['SEQ']], axis=1)
104 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [272]: %timeit df['SUB'] = [x[:y] for x, y in zip(df['NAME'], df['SEQ'])]
785 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
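A hedged side note, assuming NAME could contain missing values (the question does not say so): the list comprehension raises a TypeError on NaN, because a float cannot be sliced. A simple guard:
df['SUB'] = [x[:y] if isinstance(x, str) else x for x, y in zip(df['NAME'], df['SEQ'])]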
Using groupby:
df["SUB"] = df.groupby("SEQ").NAME.transform(lambda g: g.str[: g.name])
This might make sense if there are few unique values in SEQ, since each group is then sliced with a single vectorized .str call.

How to create a dummy variable in Pandas Dataframe if a column matches certain values?

I have a Pandas DataFrame with a column (ip) with certain values, and a separate Pandas Series (not in this DataFrame) with a collection of these values. I want to create a column in the DataFrame that is 1 if a given row's ip is in my Pandas Series (blacklist).
import pandas as pd
data = {'ip': {0: 103022, 1: 114221, 2: 47902, 3: 23550, 4: 84644}, 'os': {0: 23, 1: 19, 2: 17, 3: 13, 4: 19}}
df = pd.DataFrame(data)
df
ip os
0 103022 23
1 114221 19
2 47902 17
3 23550 13
4 84644 19
blacklist = pd.Series([103022, 23550])
blacklist
0 103022
1 23550
My question is: how can I create a new column in df that shows 1 when the given ip is in the blacklist and 0 otherwise?
Sorry if this is too dumb, I'm still new to programming. Thanks a lot in advance!
Use isin with astype:
df['new'] = df['ip'].isin(blacklist).astype(np.int8)
It is also possible to convert the column to Categorical:
df['new'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
print (df)
ip os new
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
Interestingly, for a large DataFrame converting to Categorical does not save memory here, because the categorical codes are themselves stored as int8, the same size as the plain int8 column:
df = pd.concat([df] * 10000, ignore_index=True)
df['new1'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
df['new2'] = df['ip'].isin(blacklist).astype(np.int8)
df['new3'] = df['ip'].isin(blacklist)
print (df.memory_usage())
Index 80
ip 400000
os 400000
new1 50096
new2 50000
new3 50000
dtype: int64
Timings:
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
print (len(df))
10000
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
print (len(blacklist))
100
In [320]: %timeit df['ip'].isin(blacklist).astype(np.int8)
465 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
915 µs ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [322]: %timeit pd.Categorical(df['ip'], categories = blacklist.unique()).notnull().astype(int)
1.59 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [323]: %timeit df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
81.8 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Slow, but simple and readable method:
Another way to do this would be to create your new column using a list comprehension, assigning a 1 if your ip value is in blacklist and a 0 otherwise:
df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
>>> df
ip os new_column
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
EDIT: Faster method building on Categorical: If you want to maximize speed, the following is quite fast, though not quite as fast as the non-categorical .isin method. It builds on the use of pd.Categorical as suggested by @jezrael, but leverages its capacity for assigning categories: values not in the supplied categories become NaN, so notnull() flags the blacklisted ips:
df['new_column'] = pd.Categorical(df['ip'],
                                  categories=blacklist.unique()).notnull().astype(int)
Timings:
import numpy as np
import pandas as pd
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
%timeit df['ip'].isin(blacklist).astype(np.int8)
# 453 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
# 892 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'], categories = \
blacklist.unique()).notnull().astype(int)
# 565 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
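As a closing side note (a sketch added here, not part of the original answers): the list-comprehension method is slow largely because x in blacklist.values scans the whole array for every row; converting the blacklist to a plain Python set first makes each membership test O(1) on average:
blacklist_set = set(blacklist)   # hash-based lookups instead of array scans
df['new_column'] = [1 if x in blacklist_set else 0 for x in df.ip]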
