Groupby Pandas generate multiple fields with condition - python

I have a pandas dataframe as such:
df = pandas.DataFrame({
    "Label": ["A", "A", "B", "B", "C", "C"],
    "Value": [1, 9, 1, 1, 9, 9],
    "Weight": [2, 4, 6, 8, 10, 12]})
I would like to group the data by 'Label' and generate 2 fields.
The First field, 'newweight' would sum Weight if Value==1
The Second field, 'weightvalue' would sum Weight*Value
So I would be left with the following dataframe:
Label newweight weightvalue
A 2 38
B 14 14
C 0 198
I have looked into the pandas groupby() function but have had trouble generating the 2 fields with it.

Use groupby.apply:
df.groupby('Label').apply(
    lambda g: pd.Series({
        "newweight": g.Weight[g.Value == 1].sum(),
        "weightvalue": g.Weight.mul(g.Value).sum()
    })).fillna(0)
#        newweight  weightvalue
# Label
# A            2.0         38.0
# B           14.0         14.0
# C            0.0        198.0
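The same aggregation can also be sketched without apply, by precomputing two helper columns and summing them per group. This is an alternative formulation rather than the method from the answer above; it assumes the sample frame from the question and uses the column names the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    "Label": ["A", "A", "B", "B", "C", "C"],
    "Value": [1, 9, 1, 1, 9, 9],
    "Weight": [2, 4, 6, 8, 10, 12],
})

# newweight: Weight counts only where Value == 1, otherwise 0.
# weightvalue: Weight * Value for every row.
result = (
    df.assign(
        newweight=df["Weight"].where(df["Value"] == 1, 0),
        weightvalue=df["Weight"] * df["Value"],
    )
    .groupby("Label")[["newweight", "weightvalue"]]
    .sum()
)
print(result)
```

Because the sums run over whole columns, the result keeps integer dtypes, and a group with no Value == 1 rows simply sums to 0.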

pd.DataFrame({
    'Label': df.Label.unique(),
    'newweight': df.groupby('Label').apply(lambda x: sum((x.Value == 1) * x.Weight)).values,
    'weightvalue': df.groupby('Label').apply(lambda x: sum(x.Value * x.Weight)).values})
Out[113]:
  Label  newweight  weightvalue
0     A          2           38
1     B         14           14
2     C          0          198

Fast
A more involved but neat approach using NumPy's bincount, and likely very fast.
v = df.Value.values
w = df.Weight.values
p = v * w
f, u = pd.factorize(df.Label.values)
pd.DataFrame(dict(
    newweight=np.bincount(f, p * (v == 1)).astype(int),
    weightvalue=np.bincount(f, p).astype(int)
), pd.Index(u, name='Label'))
       newweight  weightvalue
Label
A              2           38
B             14           14
C              0          198
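To see why this works, here is a small sketch of the two building blocks on their own. pd.factorize turns each label into an integer code, and np.bincount with a weights argument sums the weights that share a code. The toy arrays below are made up for illustration.

```python
import numpy as np
import pandas as pd

labels = np.array(["A", "A", "B", "B", "C", "C"])
weights = np.array([2, 4, 6, 8, 10, 12])

# codes[i] is the integer group of labels[i];
# uniques lists the labels in order of first appearance
codes, uniques = pd.factorize(labels)

# With weights, bincount returns the per-code sum of the weights (as floats)
sums = np.bincount(codes, weights=weights)

print(codes)
print(uniques)
print(sums)
```

The float result is why the answer casts back with astype(int).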
Creative
Using pd.DataFrame.eval
e = """
newweight = Weight * (Value == 1)
weightvalue = Value * Weight
"""
df.set_index('Label').eval(e).iloc[:, -2:].groupby(level=0).sum()
       newweight  weightvalue
Label
A              2           38
B             14           14
C              0          198

Related

Optimize Assign Value to Cells of a Column based on a couple of conditions in Pandas Data frame

I am trying to assign values to some new columns in my data frame based on some conditions, but it takes very long to execute.
First, I tried using itertuples:
for n in df1.itertuples():
    if df2[(df2['x'] == n.x) & (df2['y'] == n.y)].empty:
        df1['new_col'][n.Index] = 0.00
    else:
        df1['new_col'][n.Index] = df2[(df2['x'] == n.x) & (df2['y'] == n.y)]['value']
I also tried the same logic using the map function:
def foo(x, y):
    if df2[(df2['x'] == x) & (df2['y'] == y)].empty:
        return 0.00
    else:
        return df2[(df2['x'] == x) & (df2['y'] == y)]['value']
map(foo, df1['x'], df1['y'])
I am sure my code is nowhere near optimized. I tried multiple ways to optimize it, but they keep throwing one error or another.
Any leads on how to optimize the code and reduce the execution time?
Use pd.merge:
df1 = df1.merge(df2[['x', 'y', 'value']].rename(columns={'value': 'new_col'}),
                on=['x', 'y'], how='left').fillna({'new_col': 0})
print(df1)
# Output
   x   y  new_col
0  1  11      0.0
1  2  12     22.0
2  3  13     23.0
Setup:
df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [11, 12, 13]})
df2 = pd.DataFrame({'x': [2, 3, 4], 'y': [12, 13, 14], 'value': [22, 23, 24]})
print(df1)
print(df2)
# Output
   x   y
0  1  11
1  2  12
2  3  13
   x   y  value
0  2  12     22
1  3  13     23
2  4  14     24
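To check which rows of df1 actually found a partner in df2, merge accepts an indicator argument. The sketch below reuses the setup frames from the answer; the column name matched is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [11, 12, 13]})
df2 = pd.DataFrame({'x': [2, 3, 4], 'y': [12, 13, 14], 'value': [22, 23, 24]})

# indicator=True adds a "_merge" column holding 'left_only', 'right_only' or 'both'
merged = df1.merge(df2, on=['x', 'y'], how='left', indicator=True)

# 'matched' (hypothetical name) is True where df2 contained the (x, y) pair
merged['matched'] = merged['_merge'] == 'both'
merged['new_col'] = merged['value'].fillna(0)

print(merged[['x', 'y', 'new_col', 'matched']])
```

Note that a left merge repeats a df1 row once per matching df2 row, so this assumes each (x, y) pair appears at most once in df2.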

How to compare the values of two columns and reorder the values based on comparison

I am new to Python and am stuck on a problem. I need to compare the values in two columns and then reorder the rows based on that comparison. For example, assume we have the following dataframe. I need to check, row by row, whether the cumulative sum of A is still smaller than the value in B. At the first row where the cumulative sum of A exceeds the value in B, we stop, and we reorder the rows up to and including that row by the numerical value of column B.
For instance, the cumulative sum of A stays below B until part 4, where it reaches 18 against a B of 16.
Part No   A   B  cum of A
      1   2  13         2
      2   4  17         6
      3   7  15        13
      4   5  16        18
      5  10  19        28
      6   9  16        37
      7   8  12        45
So I need to reorder all rows by the numerical value of B up to and including part 4. The new table should then be:
Part No   A   B  cum of A
      1   2  13         2
      3   7  15         9
      4   5  16        14
      2   4  17        18
      5  10  19        28
      6   9  16        37
      7   8  12        45
And if, among these reordered observations (the first four parts), there is still a value in cum of A that is higher than the value in B, then part 4 is taken out of the list and put in a separate list.
This process should be repeated until all parts are ordered.
Could anyone help me with the code for this?
This should work:
import pandas as pd
data = [
    {"Part No": 1, "A": 2, "B": 13, "cum of A": 2},
    {"Part No": 2, "A": 4, "B": 17, "cum of A": 6},
    {"Part No": 3, "A": 7, "B": 15, "cum of A": 13},
    {"Part No": 4, "A": 5, "B": 16, "cum of A": 18},
    {"Part No": 5, "A": 10, "B": 19, "cum of A": 28},
    {"Part No": 6, "A": 9, "B": 16, "cum of A": 37},
    {"Part No": 7, "A": 8, "B": 12, "cum of A": 45},
]
df = pd.DataFrame(data)
# Count the rows where B is larger than 'cum of A' and add 1
# to get the number of leading rows to sort
smaller_indexes = len(df[df['B'] > df['cum of A']]) + 1
# Sort only that leading slice by B; .to_numpy() writes the rows back
# by position, since pandas may otherwise realign them by index label
df[:smaller_indexes] = df[:smaller_indexes].sort_values(by=['B']).to_numpy()
# Recompute the cumulative sum over the entire dataframe
df['cum of A'] = df['A'].cumsum()
This will output:
   Part No   A   B  cum of A
0        1   2  13         2
1        3   7  15         9
2        4   5  16        14
3        2   4  17        18
4        5  10  19        28
5        6   9  16        37
6        7   8  12        45
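The stopping row can also be located directly from the comparison instead of by counting rows. The sketch below covers only a single pass (find the first row where the cumulative sum of A exceeds B, sort up to and including it by B, recompute the cumulative sum); it does not implement the repeated "take part out into a separate list" step from the question. The function name reorder_once is hypothetical.

```python
import numpy as np
import pandas as pd

def reorder_once(df):
    """One pass: sort rows by B up to the first row where cumsum(A) > B."""
    out = df.copy()
    exceeds = (out["A"].cumsum() > out["B"]).to_numpy()
    if exceeds.any():
        stop = int(np.argmax(exceeds))  # position of the first True
        head = out.iloc[: stop + 1].sort_values("B", kind="stable")
        out = pd.concat([head, out.iloc[stop + 1:]], ignore_index=True)
    # Recompute the cumulative sum for the (possibly reordered) rows
    out["cum of A"] = out["A"].cumsum()
    return out

df = pd.DataFrame({
    "Part No": [1, 2, 3, 4, 5, 6, 7],
    "A": [2, 4, 7, 5, 10, 9, 8],
    "B": [13, 17, 15, 16, 19, 16, 12],
})
result = reorder_once(df)
print(result)
```

On the sample data this reproduces the reordered table from the question, with parts in the order 1, 3, 4, 2, 5, 6, 7.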
I know it is a little rough, but if you want, you can use the code below:
# order the values in the list
def order(a):
    for i in range(len(a)):
        # print(i)
        b = a[i-1] - a[i]
        if b > 0:
            temp = a[i-1]
            a[i-1] = a[i]
            a[i] = temp
            for j in range(i, 0, -1):
                b = a[j-1] - a[j]
                if b > 0:
                    temp = a[j-1]
                    a[j-1] = a[j]
                    a[j] = temp
    return a
b = [13, 15, 16, 17, 19, 16, 12]
a = [2, 7, 5, 4, 10, 9, 8]
# other list to append the last value if cum is greater than value of "b" after ordering b
otherList = []
# isRecursive and indexToTake is for:
# "And if there is still a higher value in cum of A than the values in B
# among these reordered observations (first four parts), then part 4 will be
# taken out from the list and put in a separate list."
def solve(isRecursive, indexToTake):
    c = 0
    for i in range(len(a)):
        c += a[i]  # cum sum
        if c > b[i]:  # check if cum is greater than the value of "b" at that index
            if isRecursive:
                if i > indexToTake:
                    return
                otherList.append(b[indexToTake])
                return
            b[0:i+1] = order(b[0:i+1])
            solve(True, i)  # check if the values at "b" are greater than cum after ordered until the indexToTake
            break
solve(False, 0)  # start solving

How to sort a MultiIndex by one level without changing the order of the other levels

I am struggling to sort a pivot table according to one level of a MultiIndex.
My target is to sort the values in the level according to a list of values which basically works.
But I also want to preserve the original order of the other levels.
import pandas as pd
import numpy as np
import random
group_size = 3
n = 10
df = pd.DataFrame({
    'i_a': list(np.arange(0, group_size))*n,
    'i_b': random.choices(list("ARBMC"), k=n*group_size),
    'value': np.random.randint(0, 100, size=n*group_size),
})
pt = pd.pivot_table(
    df,
    index=['i_a', 'i_b'],
    values=['value'],
    aggfunc='sum'
)
# The pivot table looks like this
         value
i_a i_b
0   A       48
    B       55
    C      161
    M       41
    R      126
1   A       60
    B      236
    C       99
    M       30
    R      202
2   A       22
    B      144
    C       30
    M      146
    R      168
# defined order for i_b
ORDER = {
    "A": 0,
    "R": 1,
    "B": 2,
    "M": 3,
    "C": 4,
}
def order_by_list(value, ascending=True):
    try:
        idx = ORDER[value]
    except KeyError:
        # place items which are not available at the last place
        idx = len(ORDER)
    if not ascending:
        # reverse the order
        idx = -idx
    return idx
def sort_by_ib(df):
    return df.sort_index(level=["i_b"],
                         key=lambda index: index.map(order_by_list),
                         sort_remaining=False
                         )
pt_sorted = pt.pipe(sort_by_ib)
# the i_a level of pt_sorted is rearranged, which I don't want
         value
i_a i_b
0   A       48
1   A       60
2   A       22
0   R      126
1   R      202
2   R      168
0   B       55
1   B      236
2   B      144
0   M       41
1   M       30
2   M      146
0   C      161
1   C       99
2   C       30
# Instead, the sorted pivot table should look like this
         value
i_a i_b
0   A       48
    R      126
    B       55
    M       41
    C      161
1   A       60
    R      202
    B      236
    M       30
    C       99
2   A       22
    R      168
    B      144
    M      146
    C       30
What is the preferred/recommended way to do this?
If you want to change the order, you can create a helper column with the mapping, add it to the index parameter of pivot_table, and finally remove it with droplevel. Because it is added before i_b, the result is sorted by i_a and then by the new level:
df['new'] = df['i_b'].map(ORDER)
pt = pd.pivot_table(
    df,
    index=['i_a', 'new', 'i_b'],
    values=['value'],
    aggfunc='sum'
).droplevel(1)
print(pt)
         value
i_a i_b
0   A      217
    R      135
    M      150
    C       43
1   A       44
    R      266
    B       44
    M       13
    C      128
2   A      167
    R        3
    B       85
    M      159
    C       81
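If the pivot table already exists and cannot easily be rebuilt, a similar effect can be sketched by flattening the index, sorting on a temporary rank column, and restoring the MultiIndex. This is an alternative to the helper-level approach, not part of the answer; the small frame and the column name _rank are made up for illustration.

```python
import pandas as pd

ORDER = {"A": 0, "R": 1, "B": 2, "M": 3, "C": 4}

df = pd.DataFrame({
    "i_a": [0, 0, 0, 1, 1, 1],
    "i_b": ["B", "A", "R", "C", "A", "M"],
    "value": [5, 1, 3, 7, 2, 4],
})
pt = pd.pivot_table(df, index=["i_a", "i_b"], values=["value"], aggfunc="sum")

# "_rank" is a hypothetical temporary column; labels missing from ORDER sort last
flat = pt.reset_index()
flat["_rank"] = flat["i_b"].map(ORDER).fillna(len(ORDER))

pt_sorted = (
    flat.sort_values(["i_a", "_rank"])  # i_a first, then the custom i_b order
        .drop(columns="_rank")
        .set_index(["i_a", "i_b"])
)
print(pt_sorted)
```

Sorting on i_a first keeps each i_a group together, so only the order of i_b within a group changes.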

Pandas Slice by Pairwise Attributes

I have two lists, namely;
x = [3, 7, 9, ...] and y = [13, 17, 19, ...]
And I have a dataframe like this:
df =
x y z
0 0 10 0.54
1 1 11 0.68
2 2 12 0.75
3 3 13 0.23
4 4 14 0.52
5 5 15 0.14
6 6 16 0.23
. . . ..
. . . ..
What I want to do is slice the dataframe by the pairwise combinations efficiently, like so:
df_slice = df[((df.x == x[0]) & (df.y == y[0])) |
              ((df.x == x[1]) & (df.y == y[1])) |
              ....
              ((df.x == x[-1]) & (df.y == y[-1]))]
df_slice =
x y z
3 3 13 0.23
7 7 17 0.74
9 9 19 0.24
. .. .. ....
Is there any way to do this programmatically and quickly?
Create a helper DataFrame and use DataFrame.merge with no on parameter, so that the merge uses all columns the two frames share, here x and y:
x = [3, 7, 9]
y = [13, 17, 19]
df1 = pd.DataFrame({'x':x, 'y':y})
df2 = df.merge(df1)
print (df2)
x y
0 3 13
Or get the intersection of the MultiIndexes with Index.isin and filter by boolean indexing:
mux = pd.MultiIndex.from_arrays([x, y])
df2 = df[df.set_index(['x','y']).index.isin(mux)]
print (df2)
x y
3 3 13
Your own solution can be rewritten with a list comprehension over the zipped lists and np.logical_or.reduce:
mask = np.logical_or.reduce([(df.x == a) & (df.y == b) for a, b in zip(x, y)])
df2 = df[mask]
print (df2)
x y z
3 3 13 0.23
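One practical difference between the three approaches is what happens to the original row index. The sketch below runs all three on a made-up ten-row frame (the z values are arbitrary) and compares the results.

```python
import numpy as np
import pandas as pd

# Made-up frame in the shape of the question's df
df = pd.DataFrame({
    "x": range(10),
    "y": range(10, 20),
    "z": np.round(np.linspace(0.1, 1.0, 10), 2),
})
x = [3, 7, 9]
y = [13, 17, 19]

# 1) merge on the shared columns: the result gets a fresh 0..n-1 index
by_merge = df.merge(pd.DataFrame({"x": x, "y": y}))

# 2) MultiIndex membership: the original index labels are kept
mux = pd.MultiIndex.from_arrays([x, y])
by_isin = df[df.set_index(["x", "y"]).index.isin(mux)]

# 3) OR-reduced pairwise masks: the original index labels are kept
mask = np.logical_or.reduce([(df.x == a) & (df.y == b) for a, b in zip(x, y)])
by_mask = df[mask]

print(by_merge.index.tolist(), by_isin.index.tolist(), by_mask.index.tolist())
```

If the original index labels matter later on, the isin or mask variants are the safer choice; merge selects the same rows but renumbers them.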

Compare multiple columns of a dataframe and store the result in a new column

I have data which looks like this(I've set 'rule_id' as the index):
rule_id a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
After using this code:
coeff = df.T
# compute the coefficients
for name, s in coeff.items():
    top = 100  # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:  # reset to 100 on a 0 value
            top = 100
        else:
            top = top/2  # else half the previous value
        r.append(top)
    coeff.loc[:, name] = r  # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T
# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] != 0) & (df[col2] == 0), (df[col1] == df[col2]),
                  (df[col1] != 0) & (df[col2] != 0)]
    choices = [np.nan, 100, coeff[col1], df[col2]/df[col1]*coeff[col1]+coeff[col1]]
    df['comp{}'.format(i)] = np.select(conditions, choices)
old = df.columns[0]  # store name of first column
# Ok, enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col  # keep current column name for next iteration
# special processing for last comp column
df['comp{}'.format(i+1)] = np.where(df[col] == 0, np.nan, 100)
my data looks like this:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 100 NaN
52879 0 4 3 2 NaN 87.5 41.66 100
So 'df' here is the dataframe which stores my data which I have mentioned above.
Look at the first row. According to my code, if two columns are compared and the first has a non-zero value (2) while the second has 0, then 100 should be written to the new column, which I am able to achieve. If the comparison involves more than one non-zero value (look at row 2), then it goes like this:
9/12 *50 +50 = 87.5
then
6/9 * 25 + 25 = 41.66
which I am able to achieve but the third comparison between column 'c' and 'd' which is between value 6 and 0 should be:
0/6 *12.5 + 12.5 = 12.5
which I am having trouble achieving. So instead of 100 in row 2, comp3, the value should be 12.5. The same goes for the last row, where the values are 4, 3 and 2.
This is the result I want:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 12.5 NaN
52879 0 4 3 2 NaN 87.5 41.66 12.5
You say:
the third comparison between column 'c' and 'd' which is between value 6 and 0 should be:
0/6 *12.5 + 12.5 = 12.5
But your code says:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] != 0) & (df[col2] == 0), (df[col1] == df[col2]),
(df[col1] != 0) & (df[col2] != 0)]
choices = [np.nan , 100 , coeff[col1] , df[col2]/df[col1]*coeff[col1]+coeff[col1]]
Clearly (6, 0) satisfies conditions[1] and therefore produces 100. You seem to think it should satisfy conditions[3], which requires both values to be non-zero, but (6, 0) does not satisfy that condition. Even if it did, it would not matter, because conditions[1] is matched first and np.select() chooses the first match.
Perhaps you want something like this:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] == df[col2])]
choices = [np.nan , coeff[col1]]
default = df[col2]/df[col1]*coeff[col1]+coeff[col1]
df['comp{}'.format(i)] = np.select(conditions , choices, default)
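The first-match behaviour of np.select can be seen in a small sketch. The arrays and the output values 100.0 and 1.0 are made up purely to show which condition wins; they are not the question's formula.

```python
import numpy as np

col1 = np.array([6, 0, 4])
col2 = np.array([0, 0, 2])

zero_second = (col1 != 0) & (col2 == 0)   # True only for the pair (6, 0)
nonzero_first = col1 != 0                 # True for (6, 0) and (4, 2)

# The pair (6, 0) satisfies both conditions; the one listed first decides
a = np.select([zero_second, nonzero_first], [100.0, 1.0], default=np.nan)
b = np.select([nonzero_first, zero_second], [1.0, 100.0], default=np.nan)

print(a)  # element 0 comes from zero_second
print(b)  # element 0 comes from nonzero_first
```

Rows that match no condition, here the pair (0, 0), receive the default value, which is how the suggested rewrite above uses its default argument.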
Just to participate, here is a contribution to your code, for the definition of the coeff matrix, where the computation is performed directly on whole columns.
Initialization:
>>> df = pd.DataFrame([[2, 0, 0, 5], [12, 9, 6, 0], [0, 4, 3, 2]],
... index=[50378, 50402, 52879],
... columns=['a', 'b', 'c', 'd'])
>>> df
a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
Then computing the coefficients:
>>> # taking care of coefficients, using direct computation on columns
>>> coeff2 = pd.DataFrame(index=df.index, columns=df.columns)
>>> top = pd.Series(100.0, index=df.index)  # float, since values get halved
>>> for col_name, col in df.items():  # loop over columns (iteritems() was removed in pandas 2.0)
...     eq0 = (col == 0)  # boolean Series, identifying rows where content is 0
...     top[eq0] = 100  # where `eq0` is `True`, set 100...
...     top[~eq0] = top[~eq0] / 2  # ... and divide others by 2
...     coeff2[col_name] = top  # assign to output
>>> coeff2
Which gives:
a b c d
50378 50 100 100 50
50402 50 25 12.5 100
52879 100 50 25 12.5
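For comparison, the coefficients can also be sketched without any Python-level loop. The idea is that each coefficient equals 100 divided by 2 to the power of the current run of consecutive non-zero cells in the row. This is one possible vectorized variant, not the approach from the answer; the intermediate names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[2, 0, 0, 5], [12, 9, 6, 0], [0, 4, 3, 2]],
                  index=[50378, 50402, 52879],
                  columns=['a', 'b', 'c', 'd'])

is_zero = df.eq(0)
# Running count of non-zero cells along each row
count = (~is_zero).astype(int).cumsum(axis=1)
# Value of that count at the most recent zero cell (0 if no zero seen yet)
at_last_zero = count.where(is_zero).ffill(axis=1).fillna(0)
# Length of the current run of non-zero cells; 0 on a zero cell
run = count - at_last_zero
# Halve once per cell in the run, starting from 100
coeff = 100 / 2 ** run
print(coeff)
```

On the sample frame this gives the same coefficient table as the column loop, with all values as floats.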
(For the core of your question, John identified the lack of condition in the function, so no need for me to participate.)
