Fastest way to perform comparisons between every column using Pandas - python

I have an Excel file with 100 columns, each with 1000 entries. Each entry can only take one of three particular values (0.8, 0.0, or 0.37). I want to count the number of mismatches between the entries of every combination of two columns.
For example, the excel sheet below shows the mismatches between the columns:
| Column 1 | Column 2 | Column 3 | Mismatch |
|----------|----------|----------|----------|
| 0.37     | 0.8      | 0.0      | 3        |
| 0.0      | 0.0      | 0.8      | 2        |
First we compare column 1 against column 2. Since the values differ in the first row, we add 1 to that row's entry in the mismatch column. We repeat this for column 1 vs column 3 and then column 2 vs column 3, so we need to iterate over every unique combination of two columns.
The brute-force way of doing this is a nested loop which iterates over two columns at a time. I was wondering if there is a more panda-y way of doing this.
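For reference, here is a minimal sketch of that brute-force pairing, assuming the sheet has already been loaded into a DataFrame df (e.g. via pd.read_excel):
import pandas as pd

# Brute-force baseline: loop over every unique pair of columns and count per-row mismatches.
mismatch = pd.Series(0, index=df.index)
cols = df.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        mismatch += (df[cols[i]] != df[cols[j]]).astype(int)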

This is how I would handle this problem:
from itertools import combinations
L = df.columns.tolist()
pd.concat([df[x[0]] != df[x[1]] for x in list(combinations(L, 2))], axis=1).sum(1)
0    3
1    2
dtype: int64

Since you sum pairwise combinations, this is the same as checking the first column against the second through the last columns, the second against the third through the last, and so on. Checking N-1 inequalities against slices of the DataFrame (where N is the number of columns) and summing will be quite a bit faster than checking all C(N, 2) individual column pairings, especially with your large number of columns:
from functools import reduce
reduce(lambda x, y: x + y,
       [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1)
        for i in range(len(df.columns) - 1)])
0    3
1    2
dtype: int64
Some timings with your data size
import numpy as np
import pandas as pd
from itertools import combinations
np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000,100)))
%timeit reduce(lambda x, y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1) for i in range(len(df.columns)-1)])
#157 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.concat([df[x[0]] != df[x[1]] for x in list(combinations(L, 2))], axis=1).sum(1)
#1.55 s ± 9.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can gain slightly by using numpy and summing the values, though you lose the index:
%timeit np.sum([df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1).to_numpy() for i in range(len(df.columns)-1)], axis=0)
#139 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
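For completeness, the all-pairs comparison can also be done in one shot with NumPy broadcasting. This is a sketch that is not part of the answers above; note the intermediate boolean array has shape (rows, cols, cols), which is manageable at 1000 x 100 x 100 but grows quickly with more columns:
import numpy as np

# Compare every column against every other column at once via broadcasting.
# Each unordered pair is counted twice in the sum, hence the floor division by 2.
a = df.to_numpy()                              # shape (n_rows, n_cols)
pairwise_ne = a[:, :, None] != a[:, None, :]   # shape (n_rows, n_cols, n_cols)
mismatches = pairwise_ne.sum(axis=(1, 2)) // 2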

Related

Pandas question: create two aggregations, with one being conditionally created

I have a dataframe like the following:
label  val
a        0
b       -1
b        0
b        1
a        1
b        1
My goal here is to group by the label column and get two aggregated columns: one that shows the number of rows in each group (e.g. a: 2, b: 4), and a second with the proportion of rows in each group where val == 1. What is the best way to do this in pandas?
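For reference, a small sketch that recreates the sample frame shown above so the answers below can be run as-is:
import pandas as pd

# Reconstruction of the example data from the question.
df = pd.DataFrame({'label': ['a', 'b', 'b', 'b', 'a', 'b'],
                   'val':   [0, -1, 0, 1, 1, 1]})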
Finding the proportion of a column that satisfies a condition is equivalent to taking the mean of a Boolean Series. This allows for it to be done quickly. Since s and df share the same index, it's perfectly fine to use one to group the other.
To get multiple aggregations for a column, supply a list that specifies what you want to do.
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#        size  mean
# label
# a         2   0.5
# b         4   0.5
When the number of groups becomes large, using "tricks" like this can be significantly faster than using a lambda, because many of the basic groupby aggregations have cythonized versions that are extremely performant.
# Create a sample df with 20,000 unique groups
df = pd.concat([df]*10000, ignore_index=True)
df['label'] = df.index//3
%%timeit
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#10.8 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
#7.93 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try:
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
Output:
       size  portion
label
a         2      0.5
b         4      0.5

How can I find the difference in two rows and divide this result by the sum of two rows?

How can I find the difference between two rows and divide the result by the sum of those two rows? Here is the Excel formula I want to replicate, using Python:
=ABS(((B3-B2)/(B3+B2)/2)/((A3-A2)/(A3+A2)/2))
I know the difference can be calculated with df.diff(), but I can't figure out how to do the sum.
import pandas as pd
data = {'Price':[50,46],'Quantity':[3,6]}
df = pd.DataFrame(data)
print(df)
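For orientation, a hedged literal translation of the Excel formula on these two rows, assuming column A is Price and column B is Quantity (which matches what the answers below compute):
# Literal scalar translation of =ABS(((B3-B2)/(B3+B2)/2)/((A3-A2)/(A3+A2)/2))
q = (df.Quantity.iloc[1] - df.Quantity.iloc[0]) / (df.Quantity.iloc[1] + df.Quantity.iloc[0]) / 2
p = (df.Price.iloc[1] - df.Price.iloc[0]) / (df.Price.iloc[1] + df.Price.iloc[0]) / 2
result = abs(q / p)   # 8.0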
You can use rolling(...).sum() with a window size of 2:
(df.diff()/df.rolling(2).sum()).eval('abs(Quantity/Price)')
0    NaN
1    8.0
dtype: float64
Basically, once you have the diff you already have the two-row sum.
Since diff is x[2] - x[1], the sum is x[2] + x[1] = x[2]*2 - (x[2] - x[1]).
In your case the sum can therefore be calculated by:
df*2-df.diff()
Out[714]:
   Price  Quantity
0    NaN       NaN
1   96.0       9.0
So the output is:
(df.diff()/(df*2-df.diff())).eval('abs(Quantity/Price)')
Out[718]:
0    NaN
1    8.0
dtype: float64
For small dataframes the use of .eval() is not efficient.
The following is faster up to roughly 100,000 rows:
df = (df.diff() / df.rolling(2).sum()).div(2)
df['result'] = abs(df.Quantity / df.Price)
32.9 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs.
39.6 ms ± 931 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
   idxA  idxB  var2
0     0     1   2.0
1     0     2   3.0
2     2     4   2.0
3     2     1   1.0

In [8]: labels
Out[8]:
0    A
1    B
2    C
3    D
4    E
Name: label, dtype: object
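For reference, a sketch that recreates these dummy objects so the snippets below can be run directly:
import pandas as pd

# Reconstruction of the dummy DataFrame and label Series from the question.
df = pd.DataFrame({'idxA': [0, 0, 2, 2],
                   'idxB': [1, 2, 4, 1],
                   'var2': [2.0, 3.0, 2.0, 1.0]})
labels = pd.Series(list('ABCDE'), name='label')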
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed up Python .loc function search

I am pulling out a value from a table, searching for the value based on matches in other columns. Right now, because there are hundreds of thousands of grid cells to go through, each call of the function takes a few seconds, but it adds up to hours. Is there a faster way to do this?
data_1 = data.loc[(data['test1'] == test1) & (data['test2'] == X) & (data['Column'] == col1) & (data['Row']== row1)].Value
Sample data
Column  Row  Value  test2  test1
2       3    5      X      0TO4
2       6    10     Y      100UP
2       10   5.64   Y      10TO14
5       2    9.4    Y      15TO19
9       2    6      X      20TO24
13      11   7.54   X      25TO29
25      2    6.222  X      30TO34
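For reference, a hedged reconstruction of that sample table so the lookup above can be reproduced; the values plugged in for test1, X, col1 and row1 below are simply those of the first sample row:
import pandas as pd

# Reconstruction of the question's sample data.
data = pd.DataFrame({
    'Column': [2, 2, 2, 5, 9, 13, 25],
    'Row':    [3, 6, 10, 2, 2, 11, 2],
    'Value':  [5, 10, 5.64, 9.4, 6, 7.54, 6.222],
    'test2':  ['X', 'Y', 'Y', 'Y', 'X', 'X', 'X'],
    'test1':  ['0TO4', '100UP', '10TO14', '15TO19', '20TO24', '25TO29', '30TO34'],
})

# Example lookup for the first row.
data_1 = data.loc[(data['test1'] == '0TO4') & (data['test2'] == 'X')
                  & (data['Column'] == 2) & (data['Row'] == 3)].Value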
It may be worth a quick read-through on the enhancing performance docs to see what best fits your needs.
One option is to drop down to numpy using .values and slicing. Without seeing your actual data or use case, I created the following synthetic data:
data = pd.DataFrame({'column': [np.random.randint(30) for i in range(100000)],
                     'row':    [np.random.randint(50) for i in range(100000)],
                     'value':  [np.random.randint(100) + np.random.rand() for i in range(100000)],
                     'test1':  [np.random.choice(['X', 'Y']) for i in range(100000)],
                     'test2':  [np.random.choice(['d', 'e', 'f', 'g', 'h', 'i']) for i in range(100000)]})
data.head()
   column  row      value test1 test2
0       4   30  88.367151     X     e
1       7   10  92.482926     Y     d
2       1   17  11.151060     Y     i
3      27   10  78.707897     Y     g
4      19   35  95.204207     Y     h
Then using %timeit I got the following results for .loc indexing, boolean masking, and numpy slicing.
(Note: at this point I realized I missed one of the lookups, so that may affect the total time count, but the ratios should hold true.)
%timeit data_1 = data.loc[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data_1 = data[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, this next part contains some overhead for converting the dataframe to a numpy array. If you're converting it once and then doing multiple lookups against it, this will be faster. But if not, you will likely end up taking longer for a single convert/slice.
Without considering conversion time:
d1=data.values
%timeit d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
8.37 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Approximately 30% improvement
With conversion time:
%timeit d1=data.values;d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
20.6 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Approximately 50% worse
You can index by test1, test2, Column and Row, and then look up by that index.
Indexing:
data.set_index(["test1", "test2", "Column", "Row"], inplace=True)
and then look up by doing this:
data_1 = data.loc[(test1, X, col1, row1)].Value
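A minimal usage sketch of that indexed lookup on a couple of rows of the sample data; the sort_index() call is an extra, assumed step that keeps the MultiIndex sorted and avoids a PerformanceWarning on repeated lookups:
import pandas as pd

# Two rows of the sample data, indexed and sorted for fast repeated lookups.
data = pd.DataFrame({'Column': [2, 9], 'Row': [3, 2], 'Value': [5.0, 6.0],
                     'test2': ['X', 'X'], 'test1': ['0TO4', '20TO24']})
data.set_index(["test1", "test2", "Column", "Row"], inplace=True)
data.sort_index(inplace=True)

value = data.loc[('0TO4', 'X', 2, 3)].Value   # 5.0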

Using two columns from previous row to determine column value in a pandas data frame

I need to calculate a value for each row in a Pandas data frame by comparing two columns to the values of the same columns for the previous row. I was able to do this by using iloc, but it takes a really long time when applying it to over 100K rows.
I tried using lambda, but it seems that it only returns one row or one column at the time, so I can't use it to compare multiple columns and rows at the same time.
In this example, I subtract the value of 'b' for the previous row from the value of 'b' for the current row, but only if the value of 'a' is the same for both rows.
This is the code I've been using:
import pandas as pd
df = pd.DataFrame({'a':['a','a','b','b','b'],'b':[1,2,3,4,5]})
df['increase'] = 0
for row in range(len(df)):
    if row > 0:
        if df.iloc[row]['a'] == df.iloc[row - 1]['a']:
            df.iloc[row, 2] = df.iloc[row]['b'] - df.iloc[row - 1]['b']
Is there a faster way to do the same calculation?
Thanks.
IIUC, you can use groupby + diff:
df.groupby('a').b.diff().fillna(0)
Out[193]:
0    0.0
1    1.0
2    0.0
3    1.0
4    1.0
Name: b, dtype: float64
Then assign it back:
df['increase']=df.groupby('a').b.diff().fillna(0)
df
Out[198]:
   a  b  increase
0  a  1       0.0
1  a  2       1.0
2  b  3       0.0
3  b  4       1.0
4  b  5       1.0
Here is one solution:
df['increase'] = [0] + [(d - c) if a == b else 0 for a, b, c, d in \
zip(df.a, df.a[1:], df.b, df.b[1:])]
Some benchmarking vs @Wen's pandonic solution:
df = pd.DataFrame({'a':['a','a','b','b','b']*20000,'b':[1,2,3,4,5]*20000})
%timeit [0] + [(d - c) if a == b else 0 for a, b, c, d in zip(df.a, df.a[1:], df.b, df.b[1:])]
# 51.6 ms ± 898 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.groupby('a').b.diff().fillna(0)
# 37.8 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
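Not taken from either answer above, but for reference the same "previous row within the same group" logic can also be written with shift(); a hedged sketch:
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'b', 'b'], 'b': [1, 2, 3, 4, 5]})

# Keep the row-to-row difference only where 'a' matches the previous row, else 0.
# This mirrors the row-adjacent comparison in the original loop.
same_group = df['a'].eq(df['a'].shift())
df['increase'] = df['b'].diff().where(same_group, 0)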
