Efficient way to drop a subset of columns using a threshold on the number of non-NA rows - python

I have a 10 million row dataframe like this
>>> df.info(show_counts=True)
 #   Column  Non-Null Count     Dtype
---  ------  --------------     -----
 0   date    10000000 non-null  datetime64[ns]
 1   cust1   6000647 non-null   float64
 2   cust2   6001585 non-null   float64
 3   cust3   6000415 non-null   float64
 4   cust4   9001290 non-null   float64
 5   cust5   9000402 non-null   float64
 6   cust6   9000093 non-null   float64
 7   cust7   8999538 non-null   float64
 8   cust8   9000211 non-null   float64
 9   cust9   9000745 non-null   float64
 10  cust10  9001119 non-null   float64
In the general case, all of the columns may contain NA values. In this example, columns cust1, cust2, cust3 contain around 40% NA values and the rest around 10%. Column date has no missing values here just for the sake of testing; the general problem assumes every column can have any number of NA values.
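For anyone who wants to reproduce the timings below, a dataframe matching this description can be synthesized roughly like this (a sketch; the random seed, date range and exact NaN placement are arbitrary):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000_000
df = pd.DataFrame({'date': pd.date_range('2020-01-01', periods=n, freq='s')})
for i in range(1, 11):
    col = rng.random(n)
    frac = 0.4 if i <= 3 else 0.1           # ~40% NaN in cust1-3, ~10% elsewhere
    col[rng.random(n) < frac] = np.nan
    df[f'cust{i}'] = col
thresh = int(len(df) * 0.7)                  # require at least 7 million non-NA values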
I'm looking for an idiomatic/efficient way to drop those custXX columns that contain fewer than 70% (i.e. 7 million) non-NA values.
I'm treating DataFrame.dropna(axis=1, thresh=thresh) as a baseline, just to see how long it takes Pandas to clear the whole dataframe.
%timeit df.dropna(axis=1, thresh=thresh)
701 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I can't use the subset parameter, because in this case it would affect rows, not columns.
I've tried the following solutions:
Split the dataframe into one containing only the custXX subset of columns and another containing the date column. Drop NA columns in the first DF, then merge it with the other one on the index:
def split_merge(df):
    date_df = df[['date']]
    rest_df = df.drop('date', axis=1)
    cleared = rest_df.dropna(thresh=thresh, axis=1)
    return date_df.merge(cleared, left_index=True, right_index=True)
%timeit split_merge(df)
1.65 s ± 49.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Select the custXX subset of columns, count the number of non-NA values in each column, keep only those columns where the count is at least 70%, then use those columns to select from the original dataframe:
def count_select(df):
    nan_cols = df.filter(like='cust').columns
    non_na_counts = df[nan_cols].notna().sum()
    valid_cols = non_na_counts[non_na_counts >= thresh]
    all_cols = pd.concat([pd.Series(0, index=['date']), valid_cols]).index
    return df[all_cols]
%timeit count_select(df)
1.73 s ± 79.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Similar to the previous one, but instead of counting, we drop NA values and use all of the resulting columns to select from the original dataframe:
def select_dropna_select(df):
    nan_cols = df.filter(like='cust')
    cleared = nan_cols.dropna(axis=1, thresh=thresh).columns
    new_cols = ['date', *cleared.values]
    return df[new_cols]
%timeit select_dropna_select(df)
1.54 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The last one is the fastest, but still more than twice as slow as the baseline (clearing the whole dataframe). Is there an idiomatic way to do this that achieves similar efficiency?

Here is one approach. I would encourage you to perform the timeit analysis on this solution and let me know the results.
m = df.filter(like='cust').isna().mean() > .3 # True if more than 30% nulls
df.drop(m.index[m], axis=1) # drop the columns
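For completeness, the same idea can be wrapped in a small helper that takes the required non-NA fraction directly (just a sketch; the function name, the min_frac parameter and the 'cust' prefix are illustrative, not part of the original answer):
def drop_sparse_cols(df, like='cust', min_frac=0.7):
    # fraction of non-NA values per candidate column
    frac = df.filter(like=like).notna().mean()
    # columns that fall below the required non-NA fraction
    to_drop = frac.index[frac < min_frac]
    return df.drop(columns=to_drop)

result = drop_sparse_cols(df, min_frac=0.7)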

Related

How do I add a column for each repeated element in Python Pandas?

I'm trying to group all costs for each client in a report, separated into columns. The number of columns added depends on how many times the same client adds a cost.
For example:
Client  Costs
A       5
B       10
B       2
B       5
A       4
The result that I want:
Client  Cost_1  Cost_2  Cost_3  Cost_n
A       5       4
B       10      2       5
Keep in mind the original database is huge, so any efficiency gain would help.
You can use GroupBy.cumcount() to get the serial number of column Cost. Then use df.pivot() to transform the data into columns. Use .add_prefix together with the serial number of columns to format the column labels.
df['cost_num'] = df.groupby('Client').cumcount() + 1
(df.pivot('Client', 'cost_num', 'Costs')
   .add_prefix('Cost_')
   .rename_axis(columns=None)
   .reset_index()
)
Result:
  Client  Cost_1  Cost_2  Cost_3
0      A     5.0     4.0     NaN
1      B    10.0     2.0     5.0
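Note that in newer pandas versions (2.0 and later) the arguments to pivot are keyword-only, so the call above may need to be spelled out explicitly (an equivalent sketch):
(df.pivot(index='Client', columns='cost_num', values='Costs')
   .add_prefix('Cost_')
   .rename_axis(columns=None)
   .reset_index()
)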
System Performance
Let's see the system performance for 500,000 rows:
df2 = pd.concat([df] * 100000, ignore_index=True)
%%timeit
df2['cost_num'] = df2.groupby('Client').cumcount() + 1
(df2.pivot('Client', 'cost_num', 'Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
587 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes 0.587 seconds to process a 500,000-row dataframe.
Let's see the system performance for 5,000,000 rows:
df3 = pd.concat([df] * 1000000, ignore_index=True)
%%timeit
df3['cost_num'] = df3.groupby('Client').cumcount() + 1
(df3.pivot('Client', 'cost_num', 'Costs')
    .add_prefix('Cost_')
    .rename_axis(columns=None)
    .reset_index()
)
6.35 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It takes 6.35 seconds to process a 5,000,000-row dataframe.

Pandas question: create two aggregations, with one being conditionally created

I have a dataframe like the following:
label  val
a        0
b       -1
b        0
b        1
a        1
b        1
My goal here is to group by the label column and get two aggregated columns: one that shows the number of rows in each group (e.g. a: 2, b: 4), and a second with the proportion of rows in each group where val = 1. What is the best way to do this in pandas?
Finding the proportion of a column that satisfies a condition is equivalent to taking the mean of a Boolean Series, which allows it to be done quickly. Since s and df share the same index, it's perfectly fine to use one to group the other.
To get multiple aggregations for a column, supply a list that specifies what you want to do.
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#        size  mean
# label
# a         2   0.5
# b         4   0.5
When the number of groups becomes large, using "tricks" like this can be significantly faster than using a lambda, because many of the basic groupby aggregations have cythonized versions that are extremely performant.
# Create a sample df with 20,000 unique groups
df = pd.concat([df]*10000, ignore_index=True)
df['label'] = df.index//3
%%timeit
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#10.8 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
#7.93 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try:
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
Output:
       size  portion
label
a         2      0.5
b         4      0.5
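A variant that combines the Boolean trick from the first answer with named aggregation, avoiding the slow lambda (a sketch; the helper column is_one and the output names size/proportion are arbitrary, not from the original answers):
out = (
    df.assign(is_one=df['val'].eq(1))
      .groupby('label')
      .agg(size=('val', 'size'), proportion=('is_one', 'mean'))
      .reset_index()
)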

How can I find the difference in two rows and divide this result by the sum of two rows?

How can I find the difference in two rows and divide this result by the sum of two rows?
Here is the Excel formula I want to replicate using Python:
=ABS(((B3-B2)/(B3+B2)/2)/((A3-A2)/(A3+A2)/2))
I know the difference can be calculated with df.diff(), but I can't figure out how to do the sum.
import pandas as pd
data = {'Price':[50,46],'Quantity':[3,6]}
df = pd.DataFrame(data)
print(df)
You can use rolling.sum with a window size of 2:
(df.diff()/df.rolling(2).sum()).eval('abs(Quantity/Price)')
0 NaN
1 8.0
dtype: float64
Basically, once you have the diff you already have the two-row sum.
Since diff is x[2] - x[1], the sum is x[2] + x[1] = x[2]*2 - (x[2] - x[1]).
In your case the sum can be calculated by:
df*2-df.diff()
Out[714]:
Price Quantity
0 NaN NaN
1 96.0 9.0
So the output is
(df.diff()/(df*2-df.diff())).eval('abs(Quantity/Price)')
Out[718]:
0 NaN
1 8.0
dtype: float64
For small dataframes the use of .eval() is not efficient.
The following is faster up to some 100,000 rows:
df = (df.diff() / df.rolling(2).sum()).div(2)
df['result'] = abs(df.Quantity / df.Price)
32.9 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs.
39.6 ms ± 931 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastest way to perform comparisons between every column using Pandas

I have an Excel file with 100 columns, each with 1000 entries. Each entry can only take one of 3 particular values (0.8, 0.0 and 0.37). I want to count the number of mismatches between every combination of two columns' entries.
For example, the excel sheet below shows the mismatches between the columns:
| Column 1 | Column 2 | Column 3 | Mismatch |
|----------|----------|----------|----------|
| 0.37     | 0.8      | 0.0      | 3        |
| 0.0      | 0.0      | 0.8      | 2        |
First we compare column 1 against column 2. Since their values differ in the first row, we add 1 to the corresponding row of the mismatch column. We repeat this for column 1 vs column 3 and then column 2 vs column 3, so we need to iterate over every unique combination of two columns.
The brute-force way of doing this is a nested loop which iterates over two columns at a time, as in the sketch below. I was wondering if there is a more pandas-y way of doing this.
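For reference, that brute-force nested loop might look like this (a sketch, assuming exact equality between the three values is acceptable):
from itertools import combinations
import pandas as pd

def mismatch_bruteforce(df):
    # count, per row, how many unordered column pairs disagree
    mismatch = pd.Series(0, index=df.index)
    for a, b in combinations(df.columns, 2):
        mismatch += (df[a] != df[b]).astype(int)
    return mismatch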
This is how I would handle this problem:
from itertools import combinations

L = df.columns.tolist()
pd.concat([df[x[0]] != df[x[1]] for x in combinations(L, 2)], axis=1).sum(1)

0    3
1    2
dtype: int64
Since you sum pairwise combinations, it's the same as checking the first column against the second through the last columns, the second against the third through the last, and so on. Checking N-1 equalities (N being the number of columns) against the DataFrame and summing will be quite a bit faster than checking NC2 individual column pairings, especially with your large number of columns:
from functools import reduce

reduce(lambda x, y: x + y,
       [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1)
        for i in range(len(df.columns)-1)])

0    3
1    2
dtype: int64
Some timings with your data size
import numpy as np
import pandas as pd
from itertools import combinations
np.random.seed(123)
df = pd.DataFrame(np.random.choice([0, 0.8, 0.37], (1000, 100)))
L = df.columns.tolist()  # refresh the column list for the concat-based timing below
%timeit reduce(lambda x, y: x+y, [df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1) for i in range(len(df.columns)-1)])
#157 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.concat([df[x[0]]!=df[x[1]] for x in list( combinations(L, 2))],axis=1).sum(1)
#1.55 s ± 9.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can gain slightly using numpy and summing the values though you lose the index:
%timeit np.sum([df.iloc[:, i+1:].ne(df.iloc[:, i], axis=0).sum(1).to_numpy() for i in range(len(df.columns)-1)], axis=0)
#139 ms ± 715 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
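If memory allows, the pairwise comparison can also be done in one shot with NumPy broadcasting (a sketch, not from the original answers; for 1000 rows x 100 columns this builds a 1000 x 100 x 100 Boolean array, roughly 10 MB):
a = df.to_numpy()
# compare every ordered pair of columns per row, then halve to count unordered pairs
mismatch = (a[:, :, None] != a[:, None, :]).sum(axis=(1, 2)) // 2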

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
   idxA  idxB  var2
0     0     1   2.0
1     0     2   3.0
2     2     4   2.0
3     2     1   1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest:
    df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
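For reference, the dummy df and labels shown earlier can be reconstructed like this (a sketch based on the frames in the question):
import pandas as pd

df = pd.DataFrame({'idxA': [0, 0, 2, 2],
                   'idxB': [1, 2, 4, 1],
                   'var2': [2.0, 3.0, 2.0, 1.0]})
labels = pd.Series(list('ABCDE'), name='label')

for c in ['idxA', 'idxB']:
    df[c] = df[c].map(labels)  # the Series index acts as the lookup key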
