How to sum over some columns based on condition in pandas - python

I have a data frame like this:
import pandas as pd

mydf = {'p1': [0.1, 0.2, 0.3], 'p2': [0.2, 0.1, 0.3], 'p3': [0.1, 0.9, 0.01], 'p4': [0.11, 0.2, 0.4], 'p5': [0.3, 0.1, 0.5],
        'w1': ['cancel', 'hello', 'hi'], 'w2': ['good', 'bad', 'ugly'], 'w3': ['thanks', 'CUSTOM_MASK', 'great'],
        'w4': ['CUSTOM_MASK', 'CUSTOM_UNKNOWN', 'trible'], 'w5': ['CUSTOM_MASK', 'CUSTOM_MASK', 'job']}
df = pd.DataFrame(mydf)
So what I need to do is sum up all the values in columns p1, p2, p3, p4, p5 whose corresponding values in w1, w2, w3, w4, w5 are not CUSTOM_MASK or CUSTOM_UNKNOWN.
The result would be a new column added to the data frame like this (0.1 + 0.2 + 0.1 = 0.4 for the first row):
top_p
0.4
0.3
1.51
So my question is: is there a pandas way to do this?
What I have done so far is to loop through the rows and then the columns, check for the values CUSTOM_MASK and CUSTOM_UNKNOWN, and sum up only the entries where those values are not present.

You can use mask. The idea is to build a boolean mask from the w columns and use it to mask out the corresponding p values before summing:
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Before summing, the output of mask looks like:
p1 p2 p3 p4 p5
0 0.1 0.2 0.10 NaN NaN
1 0.2 0.1 NaN NaN NaN
2 0.3 0.3 0.01 0.4 0.5

Here's a way to do this using np.dot():
pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN']))), axis=1)
We first prepare the two sets of column names p1,...,p5 and w1,...,w5.
Then we use apply() to take the dot product of the values in the pN columns with the filtering criteria based on the wN columns (namely include only contributions from pN column values whose corresponding wN column value is not in the list of excluded strings).
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Alternatively, element-wise multiplication and sum across columns can be used like this:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN'])).sum(axis=1)
Here, we needed to rename the columns of one of the two 5-column DataFrames so that * (DataFrame.multiply()) aligns the columns and performs the element-wise multiplication we want.
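To see why the rename matters, here is a minimal toy sketch (not from the original post) of pandas' label alignment on *: frames whose column labels don't match produce a union of all-NaN columns, while matching labels multiply element-wise.
import pandas as pd

# Toy frames with mismatched labels, purely to illustrate alignment on `*`
p = pd.DataFrame({'p1': [0.1], 'p2': [0.2]})
m = pd.DataFrame({'w1': [True], 'w2': [False]})

print(p * m)                                           # columns p1, p2, w1, w2 -- all NaN
print(p * m.rename(columns={'w1': 'p1', 'w2': 'p2'}))  # element-wise: 0.1 and 0.0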
UPDATE: Here are a few timing comparisons on various possible methods for solving this question:
#1. Pandas mask and sum (see answer by @enke):
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
#2. Pandas apply with Numpy dot solution:
pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
#3. Pandas element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
#4. Numpy element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Timing results:
Timeit results for df with 30000 rows:
method_1 ran in 0.008165133331203833 seconds using 3 iterations
method_2 ran in 13.408894366662329 seconds using 3 iterations
method_3 ran in 0.007688766665523872 seconds using 3 iterations
method_4 ran in 0.006326200003968552 seconds using 3 iterations
Time performance results:
Method #4 (numpy multiply/sum) is about 20% faster than the runners-up.
Methods #1 and #3 (pandas mask/sum vs multiply/sum) are neck-and-neck in second place.
Method #2 (pandas apply/numpy dot) is frightfully slow.
Here's the timeit() test code in case it's of interest:
import pandas as pd
import numpy as np
from timeit import timeit

nListReps = 10000
df = pd.DataFrame({'p1': [0.1, 0.2, 0.3]*nListReps, 'p2': [0.2, 0.1, 0.3]*nListReps, 'p3': [0.1, 0.9, 0.01]*nListReps,
                   'p4': [0.11, 0.2, 0.4]*nListReps, 'p5': [0.3, 0.1, 0.5]*nListReps,
                   'w1': ['cancel', 'hello', 'hi']*nListReps, 'w2': ['good', 'bad', 'ugly']*nListReps,
                   'w3': ['thanks', 'CUSTOM_MASK', 'great']*nListReps,
                   'w4': ['CUSTOM_MASK', 'CUSTOM_UNKNOWN', 'trible']*nListReps, 'w5': ['CUSTOM_MASK', 'CUSTOM_MASK', 'job']*nListReps})

def foo_1(df):
    df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

def foo_2(df):
    pCols, wCols = ['p' + str(i + 1) for i in range(5)], ['w' + str(i + 1) for i in range(5)]
    df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN']))), axis=1)
    return df

def foo_3(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    colMap = {wCols[i]: pCols[i] for i in range(len(pCols))}
    df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN'])).sum(axis=1)
    return df

def foo_4(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK', 'CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df

n = 3
print(f'Timeit results for df with {len(df.index)} rows:')
for foo in ['foo_' + str(i + 1) for i in range(4)]:
    t = timeit(f"{foo}(df.copy())", setup=f"from __main__ import df, {foo}", number=n) / n
    print(f'{foo} ran in {t} seconds using {n} iterations')
Conclusion:
The absolute fastest of these four approaches seems to be Numpy element-wise multiply and sum. However, @enke's Pandas mask and sum is pretty close in performance and is arguably the most aesthetically pleasing of the four candidates.
Perhaps this hybrid of the two (which runs about as fast as #4 above) is worth considering:
df['top_p'] = (df.filter(like='p').to_numpy() * ~df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)

Related

Find event and non-event rate using pandas

I have a dataframe as shown below:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame({'grade': np.random.choice(list('ABCD'), size=(20)),
                   'dash': np.random.choice(list('PQRS'), size=(20)),
                   'dumeel': np.random.choice(list('QWER'), size=(20)),
                   'dumma': np.random.choice((1234), size=(20)),
                   'target': np.random.choice([0, 1], size=(20))
                   })
I would like to do the below
a) event rate - compute the % occurrence of 1s (from the target column) for each unique value in each of the input categorical columns
b) non-event rate - compute the % occurrence of 0s (from the target column) for each unique value in each of the input categorical columns
I tried the below
input_category_columns = df.select_dtypes(include='object')
df_rate_calc = pd.DataFrame()
for ip in input_category_columns:
    feature, target = ip, 'target'
    df_rate_calc['col_name'] = (pd.crosstab(df[feature], df[target], normalize='columns'))
I would like to do this on a million rows, so if there is a more efficient approach it would really be helpful.
I expect my output to be as shown below. I have shown it for only two columns, but I want to produce this output for all categorical columns.
Here is one approach:
Select the categorical columns (cols)
Melt the dataframe with target as id variable and cols as value variables
Group the dataframe and use value_counts to calculate frequency
Unstack to reshape the dataframe
cols = df.select_dtypes('object')
df_out = (
    df.melt('target', cols)
      .groupby(['variable', 'target'])['value']
      .value_counts(normalize=True)
      .unstack(1, fill_value=0)
)
print(df_out)
target 0 1
variable value
dash P 0.4 0.3
Q 0.2 0.3
R 0.2 0.1
S 0.2 0.3
dumeel E 0.2 0.2
Q 0.1 0.0
R 0.4 0.6
W 0.3 0.2
grade A 0.4 0.2
B 0.0 0.2
C 0.4 0.3
D 0.2 0.3
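As a cross-check against the crosstab attempt in the question, the same rates can be assembled one categorical column at a time with pd.crosstab; a sketch assuming the df defined above (the name rates is just illustrative):
import pandas as pd

# One crosstab per categorical column, normalized within each target class,
# concatenated into the same (variable, value) layout as df_out above
rates = pd.concat(
    {col: pd.crosstab(df[col], df['target'], normalize='columns')
     for col in df.select_dtypes('object').columns},
    names=['variable', 'value'])
print(rates)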

Correlation matrix filtering based on high variable correlation, keeping the variable least correlated with the target, at scale using vectors

I have this resulting correlation matrix:
id  row  col  corr  target_corr
0   a    b    0.95  0.2
1   a    c    0.70  0.2
2   a    d    0.20  0.2
3   b    a    0.95  0.7
4   b    c    0.35  0.7
5   b    d    0.65  0.7
6   c    a    0.70  0.6
7   c    b    0.35  0.6
8   c    d    0.02  0.6
9   d    a    0.20  0.3
10  d    b    0.65  0.3
11  d    c    0.02  0.3
After filtering highly correlated variable pairs based on the "corr" column, I try to add a new column that marks, for each pair, "keep" for the variable in "row" that is least correlated with the target and "drop" for the one most correlated with the target (the "target_corr" column). In other words, among the correlated pairs matching cut > 0.5, select the one least correlated with "target_corr":
Expected result:
id  row  col  corr  target_corr  drop/keep
0   a    b    0.95  0.2          keep
1   a    c    0.70  0.2          keep
2   b    a    0.95  0.7          drop
3   b    d    0.65  0.7          drop
4   c    a    0.70  0.6          drop
5   d    b    0.65  0.3          keep
This approach has to handle very large dataframes, so the resulting correlation matrix can be larger than 100k x 100k; it is generated using pyspark:
import time
import numpy as np
import pandas as pd

def corrwith_matrix_no_save(df, data_cols=None, select_targets=None, method='pearson'):
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    start_time = time.time()
    vector_col = "corr_features"
    if data_cols is None and select_targets is None:
        data_cols = df.columns
        select_targets = list(df.columns)
    assembler = VectorAssembler(inputCols=data_cols, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    matrix = Correlation.corr(df_vector, vector_col, method)
    result = matrix.collect()[0]["pearson({})".format(vector_col)].values
    final_df = pd.DataFrame(result.reshape(-1, len(data_cols)), columns=data_cols, index=data_cols)
    final_df = final_df.apply(lambda x: x.abs() if np.issubdtype(x.dtype, np.number) else x)
    corr_df = final_df[select_targets]
    # corr_df.columns = [str(col) + '_corr' for col in corr_df.columns]
    corr_df['column_names'] = corr_df.index
    print('Execution time for correlation_matrix function:', time.time() - start_time)
    return corr_df
I created the dataframe from the upper triangle with numpy.triu and numpy.stack, and added the target column by merging the two resulting dataframes (I can provide that code if required, but it would lengthen the post a lot, so I will add it only if clarification is needed).
def corrX_to_ls(corr_mtx):
    # Get correlation matrix and upper triangle
    df_target = corr_mtx['target']
    corr_df = corr_mtx.drop(columns='target')
    up = corr_df.where(np.triu(np.ones(corr_df.shape), k=1).astype(bool))
    print('This is triu: \n', up)
    df = up.stack().reset_index()
    df.columns = ['row', 'col', 'corr']
    df_lsDF = df.query("row != col")
    df_target_corr = df_target.reset_index()
    df_target_corr.columns = ['target_col', 'target_corr']
    sample_df = df_lsDF.merge(df_target_corr, how='left', left_on='row', right_on='target_col')
    sample_df = sample_df.drop(columns='target_col')
    return sample_df
Now, after filtering the resulting dataframe based on df.corr > cut, where cut > 0.50, I got stuck at marking which variable to keep and which to drop (I want to mark them first and only then collect the variables into lists), so help with solving this will be greatly appreciated and will also benefit the community when working on distributed systems.
Note: I am looking for an example/solution that scales, so that I can distribute the operations over executors; working on lists, or on groups/subsets of the dataframe, in parallel while avoiding loops is what I am after, so numpy.vectorize, threading and/or multiprocessing approaches are the kind of thing I am looking for.
Additional thinking off the top of my head: I am considering grouping by the "row" column so that each group can be processed on an executor, or using lists to distribute the processing in parallel on executors so that each list generates a job for a thread from a ThreadPool (I have done this approach for column vectors, but for very large matrices/dataframes it can become inefficient, so for rows I think it will work).
Given final_df as the sample input, you can try:
# filter
output = final_df.query('corr>target_corr').copy()
# assign drop/keep
output['drop_keep'] = np.where(output['corr'] > 2 * output['target_corr'],
                               'keep', 'drop')
Output:
id row col corr target_corr drop_keep
0 0 a b 0.95 0.2 keep
1 1 a c 0.70 0.2 keep
3 3 b a 0.95 0.7 drop
6 6 c a 0.70 0.6 drop
10 10 d b 0.65 0.3 keep
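If you then want the variable names themselves rather than just the marks, you can collect them from the 'row' column; a small follow-up sketch on the output above (the names keep_vars/drop_vars are just illustrative):
keep_vars = output.loc[output['drop_keep'] == 'keep', 'row'].unique().tolist()  # ['a', 'd']
drop_vars = output.loc[output['drop_keep'] == 'drop', 'row'].unique().tolist()  # ['b', 'c']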

Subset original dataframe based on grouped quantiles

This is my df:
NAME DEPTH A1 A2 A3 AA4 AA5 AI4 AC5 Surface
0 Ron 2800.04 8440.53 1330.99 466.77 70.19 56.79 175.96 77.83 C
1 Ron 2801.04 6084.15 997.13 383.31 64.68 51.09 154.59 73.88 C
2 Ron 2802.04 4496.09 819.93 224.12 62.18 47.61 108.25 63.86 C
3 Ben 2803.04 5766.04 927.69 228.41 65.51 49.94 106.02 62.61 L
4 Ron 2804.04 6782.89 863.88 223.79 63.68 47.69 101.95 61.83 L
... ... ... ... ... ... ... ... ... ... ...
So, my first problem has been answered here:
Find percentile in pandas dataframe based on groups
Using:
df.groupby('Surface')['DEPTH'].quantile([.1, .9])
I can get the percentiles [.1,.9] from DEPTH grouped by Surface, which is what I need:
Surface
C 0.1 2800.24
0.9 2801.84
L 0.1 3799.74
0.9 3960.36
N 0.1 2818.24
0.9 2972.86
P 0.1 3834.94
0.9 4001.16
Q 0.1 3970.64
0.9 3978.62
R 0.1 3946.14
0.9 4115.96
S 0.1 3902.03
0.9 4073.26
T 0.1 3858.14
0.9 4029.96
U 0.1 3583.01
0.9 3843.76
V 0.1 3286.01
0.9 3551.06
Y 0.1 2917.00
0.9 3135.86
X 0.1 3100.01
0.9 3345.76
Z 0.1 4128.56
0.9 4132.56
Name: DEPTH, dtype: float64
Now, I believe that was already the hardest part. What is left is subsetting the original df to include only the values in between those DEPTH percentiles .1 & .9. So for example: DEPTH values in Surface group "Z" have to be greater than 4128.56 and less than 4132.56.
Note that I need df again, not df.groupby("Surface"): the final df would be exactly the same, but the rows whose depths are outside the borders should be dropped.
This seems so easy ... any ideas?
Thanks!
When you need to filter rows within groups it's often simpler and faster to use groupby + transform to broadcast the result to every row within a group and then filter the original DataFrame. In this case we can check if 'DEPTH' is between those two quantiles.
Sample Data
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'DEPTH': np.random.normal(0, 1, 100),
                   'Surface': np.random.choice(list('abcde'), 100)})
Code
gp = df.groupby('Surface')['DEPTH']
df1 = df[df['DEPTH'].between(gp.transform('quantile', 0.1),
                             gp.transform('quantile', 0.9))]
For clarity, here you can see that transform broadcasts the scalar result to every row that belongs to the group, in this case defined by 'Surface':
pd.concat([df['Surface'], gp.transform('quantile', 0.1).rename('q = 0.1')], axis=1)
# Surface q = 0.1
#0 a -1.164557
#1 e -0.967809
#2 a -1.164557
#3 c -1.426986
#4 b -1.544816
#.. ... ...
#95 a -1.164557
#96 e -0.967809
#97 b -1.544816
#98 b -1.544816
#99 b -1.544816
#
#[100 rows x 2 columns]

How does the pandas quantile() function work internally?

In this post:
How does pandas calculate quartiles?
This is the explanation given by @perl of how the quantile() function works:
df = pd.DataFrame([5,7,10,15,19,21,21,22,22,23,23,23,23,23,24,24,24,24,25], columns=['val'])
Let's consider 0.25 (same logic with 0.75, of course): element number should be (len(df)-1)*0.25 = (19 - 1)*0.25 = 4.5, so we're between element 4 (which is 19 -- we start counting from 0) and element 5 (which is 21). So, we have i = 19, j = 21, fraction = 0.5, and i + (j - i) * fraction = 20
I am still not able to figure out how quantile() function works.
All the formulas for quantiles suggest that we should take q * (n+1), where q is the quantile to be calculated. However, in the explanation by @perl, the formula used is q * (n-1). Why (n-1) instead of (n+1)?
Secondly, why is the fraction 0.5 being used by @perl?
Is there any difference in the method of quantile calculation when the total number of data points is even versus odd?
if we take two data frames:
df1 = pd.DataFrame([2,4,6,8,10,12]) # n=6 (even)
df2 = pd.DataFrame([1,3,5,7,9]) # n=5 (odd)
their respective quantiles are as follows:
I am unable to find out how the quantiles are being calculated in the above two cases.
q -> df1 -> df2
0.2 -> 4.0 -> 2.6
0.25 -> 4.5 -> 3.0
0.5 -> 7.0 -> 5.0
0.75 -> 9.5 -> 7.0
0.8 -> 10.0 -> 7.4
Can someone please explain? I would be highly thankful.
Thanks in advance.
Vineet
I am not sure but you can try this.
0 <= q <= 1
df = pd.DataFrame([1,3,5,7,9], columns=['val'])
df.quantile(0.25)
output: val 3.0
Explanation: n = 5, q = 0.25. Since q = 0.25, we can use index = n/4 = 1.25.
Condition for the index:
if the decimal fraction of the index is < 0.50 (as with 0.25 here), then index = floor(index)
if the decimal fraction of the index is > 0.50, then index = ceil(index)
if the decimal fraction of the index is == 0.50, then value = int(index) + 0.5
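For what it's worth, the df1/df2 values in the question's table can be reproduced with the linear-interpolation rule quoted from @perl: position = q * (n - 1), then interpolate between the two neighbouring elements. The sketch below is not pandas' actual source code, just that formula applied by hand and compared against pandas' default interpolation='linear':
import pandas as pd

def linear_quantile(values, q):
    # position q * (n - 1), then interpolate between the two neighbouring elements
    s = sorted(values)
    pos = q * (len(s) - 1)
    i = int(pos)
    fraction = pos - i
    j = min(i + 1, len(s) - 1)
    return s[i] + (s[j] - s[i]) * fraction

df1_vals = [2, 4, 6, 8, 10, 12]   # n = 6 (even)
df2_vals = [1, 3, 5, 7, 9]        # n = 5 (odd)
for q in [0.2, 0.25, 0.5, 0.75, 0.8]:
    print(q,
          linear_quantile(df1_vals, q), float(pd.Series(df1_vals).quantile(q)),
          linear_quantile(df2_vals, q), float(pd.Series(df2_vals).quantile(q)))
The same rule covers both the even and odd cases; the only difference is whether q * (n - 1) happens to land exactly on an integer position.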

Efficiently combine min/max on different columns of a pandas dataframe

I have a pandas dataframe that contains the results of a computation, and I need to:
take the maximum value of a column and for that value find the maximum value of another column
take the minimum value of a column and for that value find the maximum value of another column
Is there a more efficient way to do it?
Setup
from collections import namedtuple
import pandas as pd

metrictuple = namedtuple('metrics', 'prob m1 m2')
l1 = [metrictuple(0.1, 0.4, 0.04), metrictuple(0.2, 0.4, 0.04), metrictuple(0.4, 0.4, 0.1),
      metrictuple(0.7, 0.2, 0.3), metrictuple(1.0, 0.1, 0.5)]
df = pd.DataFrame(l1)
# df
# prob m1 m2
#0 0.1 0.4 0.04
#1 0.2 0.4 0.04
#2 0.4 0.4 0.10
#3 0.7 0.2 0.30
#4 1.0 0.1 0.50
tmp = df.loc[(df.m1.max() == df.m1), ['prob','m1']]
res1 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.4, 0.4)
tmp = df.loc[(df.m2.min() == df.m2), ['prob','m2']]
res2 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.2, 0.04)
Pandas isn't ideal for numerical computations. This is because there is a significant overhead in slicing and selecting data, in this example df.loc.
The good news is that pandas interacts well with numpy, so you can easily drop down to the underlying numpy arrays.
Below I've defined some helper functions which makes the code more readable. Note that numpy slicing is performed via row and column numbers starting from 0.
arr = df.values

def arr_max(x, col):
    return x[x[:, col] == x[:, col].max()]

def arr_min(x, col):
    return x[x[:, col] == x[:, col].min()]
res1 = arr_max(arr_max(arr, 1), 0)[:,:2] # array([[ 0.4, 0.4]])
res2 = arr_max(arr_min(arr, 2), 0)[:,[0,2]] # array([[ 0.2 , 0.04]])
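If you need the same tuple-shaped results as the original to_records() approach, you can unpack the single-row arrays returned above (a small follow-up, not part of the original answer):
res1_tuple = tuple(res1[0])  # (0.4, 0.4)
res2_tuple = tuple(res2[0])  # (0.2, 0.04)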
