I have two lists:
x = [3, 7, 9, ...] and y = [13, 17, 19, ...]
And I have a dataframe like this:
df =
   x   y     z
0  0  10  0.54
1  1  11  0.68
2  2  12  0.75
3  3  13  0.23
4  4  14  0.52
5  5  15  0.14
6  6  16  0.23
.. ..  ..   ...
What I want to do is slice the dataframe by the pairwise combos in an efficient manner, like so:
df_slice = df[ ( (df.x == x[0])  & (df.y == y[0])  ) |
               ( (df.x == x[1])  & (df.y == y[1])  ) |
               ...
               ( (df.x == x[-1]) & (df.y == y[-1]) ) ]
df_slice =
   x   y     z
3  3  13  0.23
7  7  17  0.74
9  9  19  0.24
.. ..  ..   ...
Is there any way to do this programmatically and quickly?
Create a helper DataFrame and use DataFrame.merge with no on parameter, so the merge is on all intersecting columns, here x and y:
import pandas as pd

x = [3, 7, 9]
y = [13, 17, 19]

df1 = pd.DataFrame({'x': x, 'y': y})
df2 = df.merge(df1)
print(df2)
   x   y     z
0  3  13  0.23
1  7  17  0.74
2  9  19  0.24
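Note that merge returns a result with a fresh default index. If the original row labels of df matter, one option (a sketch, assuming the default integer index) is to carry the index through as a column:

df2 = df.reset_index().merge(df1).set_index('index')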
Or build a MultiIndex from the pairs, test membership with Index.isin, and filter by boolean indexing:
mux = pd.MultiIndex.from_arrays([x, y])
df2 = df[df.set_index(['x', 'y']).index.isin(mux)]
print(df2)
   x   y     z
3  3  13  0.23
7  7  17  0.74
9  9  19  0.24
Your solution can be rewritten with a list comprehension over the zipped lists and np.logical_or.reduce:
import numpy as np

mask = np.logical_or.reduce([(df.x == a) & (df.y == b) for a, b in zip(x, y)])
df2 = df[mask]
print(df2)
   x   y     z
3  3  13  0.23
7  7  17  0.74
9  9  19  0.24
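All three approaches select the same rows; the merge variant merely renumbers them. A self-contained sanity check, assuming a made-up df with the same shape as above:

import numpy as np
import pandas as pd

x = [3, 7, 9]
y = [13, 17, 19]
df = pd.DataFrame({'x': range(20), 'y': range(10, 30),
                   'z': np.random.rand(20)})

by_merge = df.merge(pd.DataFrame({'x': x, 'y': y}))
mux = pd.MultiIndex.from_arrays([x, y])
by_isin = df[df.set_index(['x', 'y']).index.isin(mux)]

# same rows, modulo the renumbered index
assert by_merge.equals(by_isin.reset_index(drop=True))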
I am trying to assign values to some new columns in my dataframe based on some conditions, but it is taking very long to execute.
First up, I tried using itertuples:

for n in df1.itertuples():
    if df2[(df2['x'] == n.x) & (df2['y'] == n.y)].empty:
        df1['new_col'][n.Index] = 0.00
    else:
        df1['new_col'][n.Index] = df2[(df2['x'] == n.x) & (df2['y'] == n.y)]['value']
I also tried the same logic with a map function:

def foo(x, y):
    if df2[(df2['x'] == x) & (df2['y'] == y)].empty:
        return 0.00
    else:
        return df2[(df2['x'] == x) & (df2['y'] == y)]['value']

map(foo, df1['x'], df1['y'])
Now, I am sure my code is nowhere near optimized; I tried multiple ways to optimize it, but they keep throwing one error or another.
Any leads on how to optimize this code and reduce its execution time?
Use pd.merge:
df1 = df1.merge(df2[['x', 'y', 'value']].rename(columns={'value': 'new_col'}),
on=['x', 'y'], how='left').fillna({'new_col': 0})
print(df1)
# Output
x y new_col
0 1 11 0.0
1 2 12 22.0
2 3 13 23.0
Setup:
df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [11, 12, 13]})
df2 = pd.DataFrame({'x': [2, 3, 4], 'y': [12, 13, 14], 'value': [22, 23, 24]})
print(df1)
print(df2)
# Output
x y
0 1 11
1 2 12
2 3 13
x y value
0 2 12 22
1 3 13 23
2 4 14 24
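One caveat with this merge: if df2 can contain duplicate (x, y) pairs, the left merge will replicate the matching rows of df1. A sketch of a guard, assuming you only want the first match per pair:

df2 = df2.drop_duplicates(subset=['x', 'y'])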
An example: let's say I have a dataframe with several columns, and I want to select rows that match all 4 conditions; I would write:
condition = (df['A'] < 10) & (df['B'] < 10) & (df['C'] < 10) & (df['D'] < 10)
df.loc[condition]
Conversely, if I want to select rows that match any of the 4 conditions, I would write:
condition = (df['A'] < 10) | (df['B'] < 10) | (df['C'] < 10) | (df['D'] < 10)
df.loc[condition]
Now, what if I want to select rows that match any two of those 4 conditions? That would be rows matching any of the pairs (A and B), (A and C), (A and D), (B and C), (B and D) or (C and D). Obviously I can write a complex condition with all those combinations:
condition = ((df['A'] < 10) & (df['B'] < 10)) |\
            ((df['A'] < 10) & (df['C'] < 10)) |\
            ((df['A'] < 10) & (df['D'] < 10)) |\
            ((df['B'] < 10) & (df['C'] < 10)) |\
            ((df['B'] < 10) & (df['D'] < 10)) |\
            ((df['C'] < 10) & (df['D'] < 10))
df.loc[condition]
But if there are 50 columns and I want to match any 20 of those 50 conditions, it becomes impossible to list every combination. Is there a better way to do this?
Since True == 1 and False == 0, you can find rows that satisfy at least N conditions by checking the sum of the boolean columns. Series have most of the basic comparisons as methods, so you can put a variety of checks in a single condition list and use getattr to keep it tidy.
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 20, (5, 4)), columns=list('ABCD'))

# can check `eq`, `lt`, `le`, `gt`, `ge`, `isin`
cond_list = [('A', 'lt', 10), ('B', 'ge', 10), ('D', 'eq', 4),
             ('C', 'isin', [2, 4, 6])]

df_c = pd.concat([getattr(df[col], attr)(val).astype(int)
                  for col, attr, val in cond_list], axis=1)
# A B D C
#0 0 0 0 1
#1 0 1 0 0
#2 1 1 0 0
#3 1 1 0 0
#4 0 1 0 1
# Each row satisfies this many conditions
df_c.sum(1)
#0 1
#1 1
#2 2
#3 2
#4 2
#dtype: int64
#Select those that satisfy at least 2.
df[df_c.sum(1).ge(2)]
# A B C D
#2 0 17 15 9
#3 0 14 0 15
#4 19 14 4 0
If you need more complicated comparisons that aren't possible with getattr, you can write them out yourself and concat that list of Series:
df_c = pd.concat([df['A'].lt(10), df['B'].ge(10), df['D'].eq(4),
                  df['C'].isin([2, 4, 6])],
                 axis=1).astype(int)
Here's a method using itertools.combinations to build every pair of conditions. A row where any pair condition1 & condition2 is True satisfies at least two of the original conditions, so we check whether at least one pair holds:
# test dataframe
np.random.seed(10)
df = pd.DataFrame(np.random.randint(20, size=(10,5)), columns=list('ABCDE'))
print(df)
A B C D E
0 9 4 15 0 17
1 16 17 8 9 0
2 10 8 4 19 16
3 4 15 11 11 1
4 8 4 14 17 19
5 13 5 13 19 13
6 12 1 4 18 13
7 11 10 9 15 18
8 16 7 11 17 14
9 7 11 1 0 12
from itertools import combinations

conditions = [(df['A'] < 10), (df['B'] < 15), (df['C'] >= 5), (df['D'] <= 9)]

# a row where any pair is True satisfies at least two of the conditions
mask = pd.concat([x & y for x, y in combinations(conditions, 2)], axis=1).sum(axis=1).ge(1)
df[mask]

    A   B   C   D   E
0   9   4  15   0  17
1  16  17   8   9   0
3   4  15  11  11   1
4   8   4  14  17  19
5  13   5  13  19  13
7  11  10   9  15  18
8  16   7  11  17  14
9   7  11   1   0  12
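The pairwise construction is equivalent to counting the satisfied conditions directly, as in the previous answer; a minimal equivalent using the same conditions list:

mask = pd.concat(conditions, axis=1).sum(axis=1).ge(2)
df[mask]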
I have data which looks like this (I've set 'rule_id' as the index):
rule_id a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
After using this code:
coeff = df.T

# compute the coefficients
for name, s in coeff.items():
    top = 100  # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:  # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2  # else halve the previous value
        r.append(top)
    coeff.loc[:, name] = r  # set the whole column in one operation

# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    conditions = [(df[col1] == 0) & (df[col2] == 0),
                  (df[col1] != 0) & (df[col2] == 0),
                  (df[col1] == df[col2]),
                  (df[col1] != 0) & (df[col2] != 0)]
    choices = [np.nan, 100, coeff[col1], df[col2] / df[col1] * coeff[col1] + coeff[col1]]
    df['comp{}'.format(i)] = np.select(conditions, choices)

old = df.columns[0]  # store name of first column

# enumerate all the columns (except the first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col  # keep current column name for next iteration

# special processing for the last comp column
df['comp{}'.format(i + 1)] = np.where(df[col] == 0, np.nan, 100)
my data looks like this:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 100 NaN
52879 0 4 3 2 NaN 87.5 41.66 100
So 'df' here is the dataframe which stores the data mentioned above.
Look at the first row. According to my code, if two columns are compared and the first has a non-zero value (2) while the second has 0, then 100 goes into the new column; that part works. When both compared values are non-zero (look at row 2), the comparison is:
9/12 * 50 + 50 = 87.5
then
6/9 * 25 + 25 = 41.66
which I can also achieve. But the third comparison, between columns 'c' and 'd' with values 6 and 0, should be:
0/6 * 12.5 + 12.5 = 12.5
and that is what I cannot achieve. So instead of 100 in row 2's comp3, the value should be 12.5. The same goes for the last row, where the values are 4, 3 and 2.
This is the result I want:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 12.5 NaN
52879 0 4 3 2 NaN 87.5 41.66 12.5
You say:
the third comparison between column 'c' and 'd' which is between value 6 and 0 should be:
0/6 *12.5 + 12.5 = 12.5
But your code says:
conditions = [(df[col1] == 0) & (df[col2] == 0),
              (df[col1] != 0) & (df[col2] == 0),
              (df[col1] == df[col2]),
              (df[col1] != 0) & (df[col2] != 0)]
choices = [np.nan, 100, coeff[col1], df[col2] / df[col1] * coeff[col1] + coeff[col1]]
Clearly (6, 0) satisfies condition[1] and therefore produces 100. You seem to think it should satisfy condition[3] which is that both are non-zero, but (6, 0) does not satisfy that condition, and even if it did it would not matter because condition[1] is matched first, and np.select() chooses the first match.
Perhaps you want something like this:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] == df[col2])]
choices = [np.nan, coeff[col1]]
default = df[col2] / df[col1] * coeff[col1] + coeff[col1]
df['comp{}'.format(i)] = np.select(conditions, choices, default)
Just to contribute: here is an alternative definition of the coeff matrix, where the computation is performed directly on whole columns.
Initialization:
>>> df = pd.DataFrame([[2, 0, 0, 5], [12, 9, 6, 0], [0, 4, 3, 2]],
... index=[50378, 50402, 52879],
... columns=['a', 'b', 'c', 'd'])
>>> df
a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
Then computing the coefficients:
>>> # taking care of coefficients, using direct computation on columns
>>> coeff2 = pd.DataFrame(index=df.index, columns=df.columns)
>>> top = pd.Series([100]*len(df.index), index=df.index)
>>> for col_name, col in df.items():  # loop over columns (iteritems() in older pandas)
...     eq0 = (col == 0)             # boolean Series: rows where the value is 0
...     top[eq0] = 100               # where `eq0` is True, reset to 100...
...     top[~eq0] = top[~eq0] / 2    # ...and halve the others
...     coeff2[col_name] = top       # assign to output
>>> coeff2
Which gives:
a b c d
50378 50 100 100 50
50402 50 25 12.5 100
52879 100 50 25 12.5
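The loop can also be avoided entirely. A vectorized sketch of the same coefficients (my variant, not part of the original answer), using a cumulative-sum trick to count, per row, the streak of non-zero values since the last zero:

>>> nz = df.ne(0)                                      # True where the value is non-zero
>>> c = nz.cumsum(axis=1)
>>> streak = c - c.where(~nz).ffill(axis=1).fillna(0)  # run length of non-zeros so far
>>> coeff3 = (100 / 2 ** streak).where(nz, 100)        # halve per step, 100 on zeros

This reproduces the coeff2 table above.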
(For the core of your question, John already identified the missing condition in the function, so there is no need for me to repeat it.)
I have a pandas dataframe as such:
df = pandas.DataFrame({
    "Label": ["A", "A", "B", "B", "C", "C"],
    "Value": [1, 9, 1, 1, 9, 9],
    "Weight": [2, 4, 6, 8, 10, 12]})
I would like to group the data by 'Label' and generate 2 fields.
The first field, 'newweight', would sum Weight where Value == 1.
The second field, 'weightvalue', would sum Weight * Value.
So I would be left with the following dataframe:
Label newweight weightvalue
A 2 38
B 14 14
C 0 198
I have looked into the pandas groupby() function but have had trouble generating the 2 fields with it.
Using groupby.apply, you can do:
df.groupby('Label').apply(
lambda g: pd.Series({
"newweight": g.Weight[g.Value == 1].sum(),
"weightvalue": g.Weight.mul(g.Value).sum()
})).fillna(0)
# newweight weightvalue
#Label
#A 2.0 38.0
#B 14.0 14.0
#C 0.0 198.0
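On recent pandas (0.25 or later), the same result can be written without apply, using named aggregation over two helper columns; a sketch, assuming df as defined in the question:

out = (df.assign(nw=df.Weight.where(df.Value.eq(1), 0),
                 wv=df.Value * df.Weight)
         .groupby('Label')
         .agg(newweight=('nw', 'sum'), weightvalue=('wv', 'sum')))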
pd.DataFrame({'Label': df.Label.unique(),
              'newweight': df.groupby('Label').apply(lambda x: sum((x.Value == 1) * x.Weight)).values,
              'weightvalue': df.groupby('Label').apply(lambda x: sum(x.Value * x.Weight)).values})
Out[113]:
  Label  newweight  weightvalue
0     A          2           38
1     B         14           14
2     C          0          198
Fast
Super complicated but very cool approach using NumPy's bincount, and likely very fast.
v = df.Value.values
w = df.Weight.values
p = v * w

f, u = pd.factorize(df.Label.values)

pd.DataFrame(dict(
    newweight=np.bincount(f, p * (v == 1)).astype(int),
    weightvalue=np.bincount(f, p).astype(int)
), pd.Index(u, name='Label'))

       newweight  weightvalue
Label
A              2           38
B             14           14
C              0          198
Creative
Using pd.DataFrame.eval
e = """
newweight = Value * Weight
weightvalue = newweight * (Value == 1)
"""
df.set_index('Label').eval(e).iloc[:, -2:].sum(level=0)
newweight weightvalue
Label
A 38 2
B 14 14
C 198 0
How is it possible to change multiple columns on a subset of rows, selected by some condition, in a pandas dataframe?
For example given the input data:
import pandas as pd
dat = pd.DataFrame({"y": ("441912", "abc", "121", "4455")})
dat['leny'] = dat['y'].str.len()
dat['yfoo'] = None
dat
        y  leny  yfoo
0  441912     6  None
1     abc     3  None
2     121     3  None
3    4455     4  None
Then subset the rows where y starts with 44 and has a length of 4 or 5; for those rows, strip the leading 44 from y, subtract 2 from leny, and set yfoo to False, resulting in the following output:
        y  leny  yfoo
0  441912     6  None
1     abc     3  None
2     121     3  None
3      55     2  False
My attempt at doing this:
# pandas struggle follows
dat[dat.leny.isin((4, 5)) & dat.y.str.match('^44', na=False)]
What do I do next?
Create a mask:
m = dat.leny.isin((4, 5)) & dat.y.str.startswith('44')
Now, use loc and perform your operations.
dat.loc[m, 'y'] = dat.loc[m, 'y'].str[2:]
dat.loc[m, 'leny'] -= 2
dat.loc[m, 'yfoo'] = False
dat
y leny yfoo
0 441912 6 None
1 abc 3 None
2 121 3 None
3 55 2 False
Using a comprehension to gather data.
y = dat.y.values.tolist()

# rows to fix: y starts with '44' and len(y) is 4 or 5 (len // 2 == 2);
# each entry is [stripped y, new leny, yfoo, original row position]
dat2 = np.array([
    [x[2:], len(x) - 2, False, i]
    for i, x in enumerate(y)
    if x.startswith('44') and (len(x) // 2 == 2)
], object)

# write the first three fields back into the matching rows
dat.iloc[dat2[:, -1].astype(int), :] = dat2[:, :-1]
dat
y leny yfoo
0 441912 6 None
1 abc 3 None
2 121 3 None
3 55 2 False
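One caveat (my note, not from the original answer): if no value matches, dat2 is an empty 1-D array and dat2[:, -1] raises an IndexError, so the write-back needs a guard:

if len(dat2):
    dat.iloc[dat2[:, -1].astype(int), :] = dat2[:, :-1]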