I have a dataframe like the following:
label val
a 0
b -1
b 0
b 1
a 1
b 1
My goal is to group by the label column and get two aggregated columns: one with the number of rows in each group (e.g. a: 2, b: 4) and a second with the proportion of rows in each group where val = 1. What is the best way to do this in pandas?
Finding the proportion of a column that satisfies a condition is equivalent to taking the mean of a Boolean Series, which keeps the whole calculation vectorized. Since s and df below share the same index, it's perfectly fine to use one to group the other.
To get multiple aggregations for a column, supply a list of the aggregations you want.
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#        size  mean
# label
# a         2   0.5
# b         4   0.5
When the number of groups becomes large, a "trick" like this can be significantly faster than a lambda, because many of the basic groupby aggregations have cythonized implementations that are extremely performant.
# Create a sample df with 20,000 unique groups
df = pd.concat([df]*10000, ignore_index=True)
df['label'] = df.index//3
%%timeit
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#10.8 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
#7.93 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try:
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
Output:
       size  portion
label
a         2      0.5
b         4      0.5
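As a side note, pandas also supports named aggregation (available since 0.25), which lets you label both outputs in a single call. A minimal sketch along the same lines, on the original six-row df from the question; the column names size and portion are just illustrative choices:

import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'b', 'b', 'a', 'b'],
                   'val':   [0, -1, 0, 1, 1, 1]})

# 'size' uses the fast built-in aggregation; the lambda takes the mean of a
# Boolean Series, i.e. the proportion of rows where val == 1
out = df.groupby('label')['val'].agg(
    size='size',
    portion=lambda s: s.eq(1).mean(),
)
print(out)
#        size  portion
# label
# a         2      0.5
# b         4      0.5

The lambda path is still the slow one when there are very many groups, so the Boolean-mean trick above remains the faster option at scale.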
I need help extracting multiple numbers from a column in a dataframe, removing duplicates, and separating them with a comma.
Col1
Abcde 10 hijk20
wewrw5 gagdhdh5
Mnbjgkh10,20, 30
Expected output:
Col2
10,20
5
10,20,30
Try this:
import re

punctuations = ['!','(',')','-','[',']','{','}',';',':','"','<','>','.','/','?']

for index, row in dataframe.iterrows():
    # read the cell value (iloc[index:index, ...] is an empty slice, not the cell)
    content = dataframe.iloc[index, column_index]
    # str.replace returns a new string, so the result must be reassigned
    for p in punctuations:
        content = content.replace(p, " ")
    # keep digits only, split on whitespace, then drop duplicates and sort
    only_numbers = re.sub("[^0-9]", " ", content)
    numbers_found = only_numbers.split()
    no_duplicates = sorted(set(numbers_found))
    comma_separated = ",".join(no_duplicates)
    dataframe.iloc[index, column_index] = comma_separated
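A vectorized alternative that avoids iterrows altogether is to lean on pandas' string methods. A minimal sketch, assuming the input column is named Col1 and the result should land in a new Col2 as in the expected output:

import pandas as pd

df = pd.DataFrame({'Col1': ['Abcde 10 hijk20', 'wewrw5 gagdhdh5', 'Mnbjgkh10,20, 30']})

# str.findall returns a list of digit runs per row; set() drops duplicates,
# sorted() fixes the order, and join() builds the comma-separated string
df['Col2'] = (df['Col1']
              .str.findall(r'\d+')
              .map(lambda nums: ','.join(sorted(set(nums)))))

Note that sorted() here compares strings lexicographically, which matches the sample output; pass key=int to sorted() if a numeric ordering matters.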
findall() from the re module, used with the regular expression r'\d+', returns a list of all non-overlapping runs of one or more decimal digits in a string. The built-in set() removes any duplicates from that list, and the sorted() built-in returns the elements of the set as a sorted list. We also make use of numpy.vectorize, as it is faster than pandas' apply() for this particular application (at least on my system), though I have shown how to use apply() as well.
Method 1
import pandas as pd
import numpy as np
import re
# compile RE - matches one or more decimal digits
p = re.compile(r'\d+')
# data
d = {'col1': ['Abcde 10 hijk20', 'wewrw5 gagdhdh5', 'Mnbjgkh10,20, 30'],
     'col2': [''] * 3}
# DataFrame
df = pd.DataFrame(d)
# modify col2 based on col1
df['col2'] = np.vectorize(
    lambda y: ','.join(sorted(set(p.findall(y))))
)(df['col1'])
print(df)
Output
               col1      col2
0   Abcde 10 hijk20     10,20
1   wewrw5 gagdhdh5         5
2  Mnbjgkh10,20, 30  10,20,30
If you prefer to stay within pandas rather than call numpy.vectorize directly, you can do
Method 2
# modify col2 based on col1
df['col2'] = df.apply(
    lambda x: ','.join(sorted(set(p.findall(x['col1'])))), axis=1)
or even
Method 3
# modify col2 based on col1
for index, row in df.iterrows():
    # write back with .at; assigning into the row copy does not reliably update df
    df.at[index, 'col2'] = ','.join(sorted(set(p.findall(row['col1']))))
Efficiency
On my system, vectorize (method 1) is fastest, method 3 is second fastest and method 2 is the slowest.
# Method 1
82.9 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Method 2
399 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Method 3
117 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Suppose I have data for 50K shoppers and the products they bought. I want to count the number of times each user purchased product "a". value_counts seems to be the fastest way to calculate these types of numbers for a grouped pandas data frame. However, I was surprised at how much slower it was to calculate the purchase frequency for just one specific product (e.g., "a") using agg or apply. I could select a specific column from a data frame created using value_counts but that could be rather inefficient on very large data sets with lots of products.
Below is a simulated example where each customer purchases 10 times from a set of three products. At this size you can already see the speed difference between apply/agg and value_counts. Is there a better/faster way to extract information like this from a grouped pandas data frame?
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "col1": [f'c{j}' for i in range(10) for j in range(50000)],
    "col2": np.random.choice(["a", "b", "c"], size=500000, replace=True)
})
dfg = df.groupby("col1")
# value_counts is fast
dfg["col2"].value_counts().unstack()
# apply and agg are (much) slower
dfg["col2"].apply(lambda x: (x == "a").sum())
dfg["col2"].agg(lambda x: (x == "a").sum())
# much faster to do
dfg["col2"].value_counts().unstack()["a"]
EDIT:
Two great responses to this question. Given the starting point of an already grouped data frame, it seems there may not be a better/faster way to count the number of occurrences of a single level in a categorical variable than using (1) apply or agg with a lambda function or (2) using value_counts to get the counts for all levels and then selecting the one you need.
The groupby/size approach is an excellent alternative to value_counts. With a minor edit to Cainã Max Couto-Silva's answer, this would give:
dfg = df.groupby(['col1', 'col2'])
dfg.size().unstack(fill_value=0)["a"]
I assume there is a trade-off at some point: if you have many levels, apply/agg or value_counts on an already grouped data frame may be faster than the groupby/size approach, which requires creating a newly grouped data frame. I'll post back when I have some time to look into that.
Thanks for the comments and answers!
This is still faster:
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
Tests:
%%timeit
pd.crosstab(df.col1, df.col2)
# > 712 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 165 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 131 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we expand the dataframe to 5 million rows:
df = pd.concat([df for _ in range(10)])
print(f'df.shape = {df.shape}')
# > df.shape = (5000000, 2)
print(f'{df.shape[0]:,} rows.')
# > 5,000,000 rows.
%%timeit
pd.crosstab(df.col1, df.col2)
# > 1.58 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby("col1")
dfg["col2"].value_counts().unstack()
# > 1.27 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dfg = df.groupby(['col1','col2'])
dfg.size().unstack()
# > 847 ms ± 53.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
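For completeness, newer pandas versions (1.1 and later) also have DataFrame.value_counts, which does the two-column count in a single call. A hedged sketch: the rows come back sorted by count rather than by label, but selecting a single product works the same way.

# count every (col1, col2) pair at once, then pivot col2 into columns
counts = df.value_counts(['col1', 'col2']).unstack(fill_value=0)
counts['a']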
Filter before value_counts
df.loc[df.col2=='a','col1'].value_counts()['c0']
Also I think crosstab is 'faster' than groupby + value_counts
pd.crosstab(df.col1, df.col2)
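One caveat with filtering first: shoppers who never bought "a" disappear from the result entirely. If you need a count for every shopper, reindexing against all shoppers fills the gaps with zeros. A small sketch, assuming the df from the question:

# per-shopper counts of product 'a'; shoppers with no 'a' purchases get 0
a_counts = (df.loc[df.col2 == 'a', 'col1']
              .value_counts()
              .reindex(df['col1'].unique(), fill_value=0))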
How can I find the difference in two rows and divide this result by the sum of two rows?
Here is how to do it in Excel.
Here is the formula I want to replicate, using Python.
=ABS(((B3-B2)/(B3+B2)/2)/((A3-A2)/(A3+A2)/2))
I know the difference can be calculated with df.diff(), but I can't figure out how to do the sum.
import pandas as pd
data = {'Price':[50,46],'Quantity':[3,6]}
df = pd.DataFrame(data)
print(df)
You can use rolling.sum with a window size of 2:
(df.diff()/df.rolling(2).sum()).eval('abs(Quantity/Price)')
0 NaN
1 8.0
dtype: float64
Basically, once you have the diff you already have the two-row sum as well.
Since diff = x[2] - x[1], the sum is x[2] + x[1] = 2*x[2] - (x[2] - x[1]) = 2*x[2] - diff.
In your case the sum can be calculated by
df*2-df.diff()
Out[714]:
   Price  Quantity
0    NaN       NaN
1   96.0       9.0
So the output is
(df.diff()/(df*2-df.diff())).eval('abs(Quantity/Price)')
Out[718]:
0 NaN
1 8.0
dtype: float64
For small dataframes the use of .eval() is not efficient.
The following is faster up to roughly 100,000 rows:
df = (df.diff() / df.rolling(2).sum()).div(2)
df['result'] = abs(df.Quantity / df.Price)
32.9 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs.
39.6 ms ± 931 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
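If eval feels opaque, the same arithmetic can be spelled out with shift, which makes the "previous row" explicit. A minimal sketch against the two-row df from the question (the /2 factors in the Excel formula cancel, so they are dropped):

prev = df.shift()                     # previous row; NaN in the first row
ratio = (df - prev) / (df + prev)     # per-column diff / sum
result = (ratio['Quantity'] / ratio['Price']).abs()
print(result)
# 0    NaN
# 1    8.0
# dtype: float64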
I have an Excel file with 100 columns, each with 1000 entries. Each entry can take only 3 particular values (0.8, 0.0 and 0.37). I want to count the number of mismatches between the entries of every combination of two columns.
For example, the table below shows the mismatches between the columns:
| Column 1 | Column 2 | Column 3 | Mismatch |
|----------|----------|----------|----------|
| 0.37     | 0.8      | 0.0      | 3        |
| 0.0      | 0.0      | 0.8      | 2        |
First we compare column 1 against column 2. Since there is a difference between the first rows we add 1 to the corresponding row of the mismatch column. We repeat this for column 1 vs column 3 and then column 2 vs column 3. So we need to iterate over every unique combination of two columns.
The brute force way of doing this is a nested loop which iterates over two columns at a time. I was wondering if there is a panda-y way of doing this.
This is how I would handle this problem:
from itertools import combinations
L = df.columns.tolist()
pd.concat([df[x[0]] != df[x[1]] for x in combinations(L, 2)], axis=1).sum(1)
0 3
1 2
dtype: int64
Since you sum over pairwise combinations, it's the same as checking the first column against the second through last columns, the second against the third through last, and so on. Checking N-1 equalities (where N is the number of columns) against the DataFrame and summing will be quite a bit faster than checking all N-choose-2 individual column pairings, especially with your large number of columns:
from functools import reduce
reduce(lambda x, y: x + y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1)
                            for i in range(len(df.columns)-1)])
0 3
1 2
dtype: int64
Some timings with your data size
import numpy as np
import pandas as pd
from itertools import combinations
np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000,100)))
%timeit reduce(lambda x, y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1) for i in range(len(df.columns)-1)])
#157 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.concat([df[x[0]]!=df[x[1]] for x in list( combinations(L, 2))],axis=1).sum(1)
#1.55 s ± 9.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can gain slightly by using numpy for the summation, though you lose the index:
%timeit np.sum([df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1).to_numpy() for i in range(len(df.columns)-1)], axis=0)
#139 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
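If the whole frame fits comfortably in memory as one array, a pure NumPy broadcast removes the Python-level loop over columns as well. A hedged sketch: it builds a rows x N x N Boolean intermediate (roughly 10 MB at 1000 x 100), so it only makes sense while that stays small.

a = df.to_numpy()
# compare every column against every other column in one shot; each
# mismatching unordered pair is counted twice ((i, j) and (j, i)), so halve it
mismatch = pd.Series((a[:, :, None] != a[:, None, :]).sum(axis=(1, 2)) // 2,
                     index=df.index)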
Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
   idxA  idxB  var2
0     0     1   2.0
1     0     2   3.0
2     2     4   2.0
3     2     1   1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted #coldspeed's answer. Couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
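Since labels here is indexed 0..N-1, the lookup is purely positional, so plain NumPy indexing is another option worth benchmarking. A minimal sketch on the original integer-coded frame, assuming the labels Series really has a clean RangeIndex with no gaps or reordering:

lab = labels.to_numpy()
for c in ['idxA', 'idxB']:
    # the integer codes index directly into the label array
    df[c] = lab[df[c].to_numpy()]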