Aggregate values by multiple columns - python

My dataframe looks like this.
df = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'] ,
'out_text': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14'] ,
'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N','N', 'N', 'Y', 'Y'],
'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N','N', 'N', 'Y', 'N'],
'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N','N', 'N', 'Y', 'Y']})
ID text out_text Rule_1 Rule_2 Rule_3
0 1 a x1 N Y N
1 1 a x2 N N N
2 1 b x3 N N N
3 1 b x4 Y N N
4 2 c x5 N Y N
5 2 c x6 N N N
6 2 c x7 N N N
7 2 d x8 N N N
8 2 d x9 N N N
9 2 e x10 N N N
10 2 e x11 N N N
11 2 e x12 N N N
12 3 f x13 Y Y Y
13 3 g x14 Y N Y
I need to aggregate Rule_1, Rule_2 and Rule_3 so that if any row for a given combination of ID and text has a 'Y' in any of these columns, the overall result for that combination is 'Y'. In this example, 1-a and 1-b are 'Y' overall, while 2-d and 2-e are 'N'. How do I aggregate across multiple columns?

Let's try using max(axis=1) to combine the rules within each row, then groupby().any() to check whether any row in each group has a 'Y':
(df[['Rule_1','Rule_2','Rule_3']].eq('Y')
.max(axis=1)
.groupby([df['ID'],df['text']])
.any()
)
Output:
ID text
1 a True
b True
2 c True
d False
e False
3 f True
g True
dtype: bool
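Or, if you want Y/N instead of True/False, replace both the row-wise max and the any with max and drop the comparison. This works because 'Y' sorts after 'N' as a string: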
(df[['Rule_1','Rule_2','Rule_3']]
.max(axis=1)
.groupby([df['ID'],df['text']])
.max()
)
Output:
ID text
1 a Y
b Y
2 c Y
d N
e N
3 f Y
g Y
dtype: object
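A variant of the same idea, as a minimal sketch: build one boolean flag per row with any(axis=1), aggregate it per ID/text pair, and map the result back to 'Y'/'N' in a regular dataframe. The names rule_cols, row_has_y and overall are illustrative and not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3],
    'text': ['a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'f', 'g'],
    'Rule_1': ['N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
    'Rule_2': ['Y', 'N', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N'],
    'Rule_3': ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'Y'],
})

rule_cols = ['Rule_1', 'Rule_2', 'Rule_3']  # illustrative name

# True for every row that has a 'Y' in at least one rule column
row_has_y = df[rule_cols].eq('Y').any(axis=1)

# True for every (ID, text) pair with at least one such row, mapped back to Y/N
overall = (row_has_y.groupby([df['ID'], df['text']])
           .any()
           .map({True: 'Y', False: 'N'})
           .rename('overall')
           .reset_index())
print(overall)
```

Calling reset_index() at the end turns the ID/text index levels back into ordinary columns, which may be easier to merge with other data.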

Python Array List getting values with double different mod

I need help pulling values from a list in Python in a repeating pattern.
For example:
We have a list with 20 different values.
lst = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','r','s','t','w']
mod = 5
roundMod= 3
DESIRED OUTPUT
Round 1 :
1 - a,
2 - b,
3 - c,
4 - d,
5 - e,
Round 2 :
1 - a,
2 - b,
3 - c,
4 - d,
5 - e,
Round 3 :
1 - a,
2 - b,
3 - c,
4 - d,
5 - e,
Round 1:
6 - f,
7 - g,
8 - h,
9 - i,
10 - j,
Round 2 :
6 - f,
7 - g,
8 - h,
9 - i,
10 - j,
Round 3 :
6 - f,
7 - g,
8 - h,
9 - i,
10 - j,
I have mod, the maximum number of values shown in each round (5), and roundMod, the number of rounds to repeat before moving on to the next 5 elements.
IIUC, you want to slice the List with stepwise starting/ending points. Use an integer division (//) for this:
List = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','r','s','t','w']
mod = 5
roundMod= 3
for i in range(6): # not sure how the number of "lines" is defined
    d = i//roundMod
    print(f'{i=}, {d=},', List[d*mod:(d+1)*mod])
output:
i=0, d=0, ['a', 'b', 'c', 'd', 'e']
i=1, d=0, ['a', 'b', 'c', 'd', 'e']
i=2, d=0, ['a', 'b', 'c', 'd', 'e']
i=3, d=1, ['f', 'g', 'h', 'i', 'j']
i=4, d=1, ['f', 'g', 'h', 'i', 'j']
i=5, d=1, ['f', 'g', 'h', 'i', 'j']
If you also want to track the round, use divmod:
List = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','r','s','t','w']
mod = 5
roundMod= 3
for i in range(6):
    d,r = divmod(i, roundMod)
    print(f'Round {r+1}: ', List[d*mod:(d+1)*mod])
output:
Round 1: ['a', 'b', 'c', 'd', 'e']
Round 2: ['a', 'b', 'c', 'd', 'e']
Round 3: ['a', 'b', 'c', 'd', 'e']
Round 1: ['f', 'g', 'h', 'i', 'j']
Round 2: ['f', 'g', 'h', 'i', 'j']
Round 3: ['f', 'g', 'h', 'i', 'j']
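The loops above hard-code range(6). As a hedged sketch, assuming every block of mod items should be repeated roundMod times until the list runs out, the number of iterations can be derived from the list length. The names n_blocks, total_steps and steps are illustrative.

```python
import math

lst = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
       'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'w']
mod = 5
roundMod = 3

n_blocks = math.ceil(len(lst) / mod)  # number of slices of up to `mod` items
total_steps = n_blocks * roundMod     # each slice is shown `roundMod` times

steps = []
for i in range(total_steps):
    block, rnd = divmod(i, roundMod)  # block index and zero-based round
    steps.append((rnd + 1, lst[block * mod:(block + 1) * mod]))

for rnd, values in steps:
    print(f'Round {rnd}: {values}')
```

Using math.ceil means a final, shorter block is still included when the list length is not a multiple of mod.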
This seems like a job for a generator:
def pull(lst, mod=5, round_mod=3):
    counter = 0
    while True:
        # index of the current block of `mod` items
        start = counter // round_mod
        if start * mod >= len(lst):
            break
        yield lst[start * mod:(start + 1) * mod]
        counter += 1

puller = pull(lst)  # `lst` is the list from the question
print([x for x in puller])
OUTPUT
[['a', 'b', 'c', 'd', 'e'], ['a', 'b', 'c', 'd', 'e'], ['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j'], ['f', 'g', 'h', 'i', 'j'], ['f', 'g', 'h', 'i', 'j'], ['k', 'l', 'm', 'n', 'o'], ['k', 'l', 'm', 'n', 'o'], ['k', 'l', 'm', 'n', 'o'], ['p', 'r', 's', 't', 'w'], ['p', 'r', 's', 't', 'w'], ['p', 'r', 's', 't', 'w']]
or, to print something closer to your desired output (puller was exhausted by the print above, so create a fresh generator):
for n, x in enumerate(pull(lst)):
    print(f'Round {n + 1}: {", ".join([f"{i + 1} - {v}" for i, v in enumerate(x)])}')
OUTPUT
Round 1: 1 - a, 2 - b, 3 - c, 4 - d, 5 - e
Round 2: 1 - a, 2 - b, 3 - c, 4 - d, 5 - e
Round 3: 1 - a, 2 - b, 3 - c, 4 - d, 5 - e
Round 4: 1 - f, 2 - g, 3 - h, 4 - i, 5 - j
Round 5: 1 - f, 2 - g, 3 - h, 4 - i, 5 - j
Round 6: 1 - f, 2 - g, 3 - h, 4 - i, 5 - j
Round 7: 1 - k, 2 - l, 3 - m, 4 - n, 5 - o
Round 8: 1 - k, 2 - l, 3 - m, 4 - n, 5 - o
Round 9: 1 - k, 2 - l, 3 - m, 4 - n, 5 - o
Round 10: 1 - p, 2 - r, 3 - s, 4 - t, 5 - w
Round 11: 1 - p, 2 - r, 3 - s, 4 - t, 5 - w
Round 12: 1 - p, 2 - r, 3 - s, 4 - t, 5 - w
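To stay closer to the layout in the question, where the round number restarts at 1 for each block and the item numbers keep counting through the list, a sketch along these lines may help. It assumes that this numbering is what is wanted; format_rounds is a hypothetical helper name.

```python
def format_rounds(lst, mod=5, round_mod=3):
    """Yield one text line per round header or item (hypothetical helper)."""
    for start in range(0, len(lst), mod):
        chunk = lst[start:start + mod]
        # repeat the same chunk `round_mod` times before moving on
        for rnd in range(1, round_mod + 1):
            yield f'Round {rnd} :'
            for offset, value in enumerate(chunk):
                # item numbers continue across chunks (1-5, 6-10, ...)
                yield f'{start + offset + 1} - {value},'


lst = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
       'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'w']
lines = list(format_rounds(lst))
print('\n'.join(lines[:12]))
```

Because the function is a generator, the lines can also be consumed lazily instead of being collected into a list.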

Plotly python bar plot stack order

Here is my code for the dataframe
df = pd.DataFrame({'var_1':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
'var_2':['m', 'n', 'o', 'm', 'n', 'o', 'm', 'n', 'o'],
'var_3':[np.random.randint(25, 33) for _ in range(9)]})
Here is the dataframe that I have
var_1 var_2 var_3
0 a m 27
1 a n 28
2 a o 28
3 b m 31
4 b n 30
5 b o 25
6 c m 27
7 c n 32
8 c o 27
Here is the code I used to get the stacked bar plot
fig = px.bar(df, x='var_3', y='var_1', color='var_2', orientation='h', text='var_3')
fig.update_traces(textposition='inside', insidetextanchor='middle')
fig
But I want each bar to stack in descending order of the values, with the largest at the start/bottom and the smallest at the top.
How should I update the layout to get that?
Sort the dataframe by var_1 and var_3, then add one go.Bar trace per row so that the segments stack in the sorted order:
import numpy as np
import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame({'var_1':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'var_2':['m', 'n', 'o', 'm', 'n', 'o', 'm', 'n', 'o'],
                   'var_3':[np.random.randint(25, 33) for _ in range(9)]})
# within each var_1 group, put the largest var_3 first
df.sort_values(['var_1', 'var_3'], ignore_index=True, inplace=True, ascending=False)
# colors
colors = {'o': 'red',
          'm': 'blue',
          'n': 'green'}
# show each var_2 value in the legend only the first time it appears
first_seen = ~df['var_2'].duplicated()
# traces
data = []
# loop across the different rows
for i in range(df.shape[0]):
    data.append(go.Bar(x=[df['var_3'][i]],
                       y=[df['var_1'][i]],
                       orientation='h',
                       text=str(df['var_3'][i]),
                       marker=dict(color=colors[df['var_2'][i]]),
                       name=df['var_2'][i],
                       legendgroup=df['var_2'][i],
                       showlegend=bool(first_seen[i])))
# layout
layout = dict(barmode='stack',
              yaxis={'title': 'var_1'},
              xaxis={'title': 'var_3'})
# figure
fig = go.Figure(data=data, layout=layout)
fig.update_traces(textposition='inside', insidetextanchor='middle')
fig.show()
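As far as I understand, the stacking in the answer follows the order in which the traces are added, which is the sorted row order. A small pandas-only sketch of that ordering, using the values from the question's example table instead of random ones (ordered and stack_pos are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'var_1': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'var_2': ['m', 'n', 'o', 'm', 'n', 'o', 'm', 'n', 'o'],
                   'var_3': [27, 28, 28, 31, 30, 25, 27, 32, 27]})

# same sort as in the answer: var_1 descending, then var_3 descending
ordered = df.sort_values(['var_1', 'var_3'], ascending=False, ignore_index=True)

# position of each segment within its bar; 0 is added first
ordered['stack_pos'] = ordered.groupby('var_1').cumcount()
print(ordered)
```

Inspecting stack_pos before plotting is a cheap way to check that the largest value in each var_1 group really comes first.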

Combination of pair elements within list in a list

I'm trying to obtain the combinations of each element in a list within a list. Given this case:
my_list
[['A', 'B'], ['C', 'D', 'E'], ['F', 'G', 'H', 'I']]
The output would be:
   0  1
0  A  B
1  C  D
2  C  E
3  D  E
4  F  G
5  F  H
6  F  I
7  G  H
8  G  I
9  H  I
Or it could also be a new list instead of a DataFrame:
my_new_list
[['A','B'], ['C','D'], ['C','E'],['D','E'], ['F','G'],['F','H'],['F','I'],['G','H'],['G','I'],['H','I']]
This should do it. You have to flatten the result of combinations.
from itertools import combinations
x = [['A', 'B'], ['C', 'D', 'E'], ['F', 'G', 'H', 'I']]
y = [list(combinations(xx, 2)) for xx in x]
z = [list(item) for subl in y for item in subl]
z
[['A', 'B'],
['C', 'D'],
['C', 'E'],
['D', 'E'],
['F', 'G'],
['F', 'H'],
['F', 'I'],
['G', 'H'],
['G', 'I'],
['H', 'I']]
Create the combinations with itertools.combinations and flatten the values in a list comprehension:
from itertools import combinations
L = [['A', 'B'], ['C', 'D', 'E'], ['F', 'G', 'H', 'I']]
data = [list(j) for i in L for j in combinations(i, 2)]
print (data)
[['A', 'B'], ['C', 'D'], ['C', 'E'],
['D', 'E'], ['F', 'G'], ['F', 'H'],
['F', 'I'], ['G', 'H'], ['G', 'I'],
['H', 'I']]
Then pass the list to the DataFrame constructor:
df = pd.DataFrame(data)
print (df)
0 1
0 A B
1 C D
2 C E
3 D E
4 F G
5 F H
6 F I
7 G H
8 G I
9 H I
def get_pair(arrs):
    result = []
    for arr in arrs:
        for i in range(0, len(arr) - 1):
            for j in range(i + 1, len(arr)):
                result.append([arr[i], arr[j]])
    return result

arrs = [['A', 'B'], ['C', 'D', 'E'], ['F', 'G', 'H', 'I']]
print(get_pair(arrs))
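Another way to flatten, as a small sketch: itertools.chain.from_iterable can join the per-sublist combinations lazily, which avoids building the intermediate list of lists. The variable names are illustrative.

```python
from itertools import chain, combinations

my_list = [['A', 'B'], ['C', 'D', 'E'], ['F', 'G', 'H', 'I']]

# one lazy stream of 2-element tuples, sublist by sublist
pair_stream = chain.from_iterable(combinations(sub, 2) for sub in my_list)

# convert each tuple to a list to match the format asked for in the question
pairs = [list(pair) for pair in pair_stream]
print(pairs)
```

The result can be passed to pd.DataFrame(pairs) in the same way as in the answer above.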

pandas: select certain amount of rows based on column ranking using loop

I have a dataframe which looks like this
pd.DataFrame({'a':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'b':['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
'c':[4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
a b c
0 A N 4
1 B Y 5
2 C Y 9
3 D N 8
4 E Y 1
5 F N 3
6 G Y 7
7 H N 2
8 I N 6
9 J Y 10
Out of the 10 rows I want to select 5 rows based on the following criteria:
column 'c' is my rank column.
select the rows with lowest 2 ranks (rows 4 and 7 selected)
select all rows where column 'b' = 'Y' AND rank <=5 (row 1 selected)
in the event fewer than 5 rows are selected using the above criteria the remaining open positions should be filled by rank order (lowest) with rows where 'b' = 'Y' and which have rank <= 7 (row 6 selected)
in the event fewer than 5 rows pass the first 3 criteria fill remaining positions in rank order (lowest) where 'b' = 'N'
I have tried this (which covers rules 1 and 2), but I am struggling with how to go on from there:
df['selected'] = ''
df.loc[(df.c <= 2), 'selected'] = 'rule_1'
df.loc[((df.c <= 5) & (df.b == 'Y')), 'selected'] = 'rule_2'
my resulting dataframe should look like this
a b c selected
0 A N 4 False
1 B Y 5 rule_2
2 C Y 9 False
3 D N 8 rule_4
4 E Y 1 rule_1
5 F N 3 False
6 G Y 7 rule_3
7 H N 2 rule_1
8 I N 6 False
9 J Y 10 False
Based on one of the solutions provided by Vinod Karantothu below, I went for the following, which seems to work:
def solution(df):
    def sol(df, b='Y'):
        result_df_rule1 = df.sort_values('c')[:2]
        result_df_rule1['action'] = 'rule_1'
        result_df_rule2 = df.sort_values('c')[2:].loc[df['b'] == b].loc[df['c'] <= 5]
        result_df_rule2['action'] = 'rule_2'
        result = pd.concat([result_df_rule1, result_df_rule2]).head(5)
        if len(result) < 5:
            remaining_rows = pd.concat([df, result, result]).drop_duplicates(subset='a', keep=False)
            result_df_rule3 = remaining_rows.loc[df['b'] == b].loc[df['c'] <= 7]
            result_df_rule3['action'] = 'rule_3'
            result = pd.concat([result, result_df_rule3]).head(5)
        return result, pd.concat([remaining_rows, result, result]).drop_duplicates(subset='a', keep=False)

    result, remaining_data = sol(df)
    if len(result) < 5:
        result1, remaining_data = sol(remaining_data, 'N')
        result1['action'] = 'rule_4'
        result = pd.concat([result, result1]).head(5).drop_duplicates(subset='a', keep=False).merge(df, how='outer', on='a')
    return result

if __name__ == '__main__':
    df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                       'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                       'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
    result = solution(df)
    print(result)
import pandas as pd

def solution(df):
    def sol(df, b='Y'):
        result_df_rule1 = df.sort_values('c')[:2]
        result_df_rule2 = df.sort_values('c')[2:].loc[df['b'] == b].loc[df['c'] <= 5]
        result = pd.concat([result_df_rule1, result_df_rule2]).head(5)
        if len(result) < 5:
            remaining_rows = pd.concat([df, result, result]).drop_duplicates(keep=False)
            result_df_rule3 = remaining_rows.loc[df['b'] == b].loc[df['c'] <= 7]
            result = pd.concat([result, result_df_rule3]).head(5)
        return result, pd.concat([remaining_rows, result, result]).drop_duplicates(keep=False)

    result, remaining_data = sol(df)
    if len(result) < 5:
        result1, remaining_data = sol(remaining_data, 'N')
        result = pd.concat([result, result1]).head(5)
    return result

if __name__ == '__main__':
    df = pd.DataFrame({'a':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                       'b':['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                       'c':[4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
    result = solution(df)
    print(result)
Result:
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
For your 4th rule, your expected dataframe selects row index 3, but that row has a rank of 8, which is not the lowest remaining rank. According to the rules you have given, row index 5 (rank 3) should be selected instead:
import pandas as pd
data = pd.DataFrame({'a':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'b':['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
'c':[4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
data1 = data.nsmallest(2, ['c'])
dataX = data.drop(data1.index)
data2 = dataX[((dataX.b == "Y") & (dataX.c<=5))]
dataX = dataX.drop(data2.index)
data3 = dataX[((dataX.b == "Y") & (dataX.c<=7))]
dataX = dataX.drop(data3.index)
data4 = dataX[((dataX.b == "N"))]
data4 = data4.nsmallest(1, ['c'])
resultframes = [data1, data2, data3, data4]
resultfinal = pd.concat(resultframes)
print(resultfinal)
And here is the output:
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
You can create extra columns for the rules, then sort and take the head. IIUC from the comments, rule 3 already covers rule 2, so there is no need to calculate it separately.
df['r1'] = df.c < 3
df['r3'] = (df.c <= 7) & (df.b == 'Y')
print(df.sort_values(['r1', 'r3', 'c'], ascending=[False, False, True])[['a', 'b', 'c']].head(5))
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
Sorting on a boolean column works because True > False.
Note: You might need to tweak the code to your expectations with different datasets. For example your last row 9 J Y 10 is currently not covered by any of the rules. You can take this approach and extend it if needed.
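If the rule label for each selected row is also needed, as in the question's expected output, one possible sketch uses numpy.select to give each row the first rule it matches and then keeps the best five. It assumes that rule priority follows the rule number, that rows within a rule are ordered by rank, and that exactly five rows are wanted. The names conditions, labels, picked and n_select are illustrative. Like the answers above, it picks row 5 (rank 3) under rule 4.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                   'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                   'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
n_select = 5  # assumed number of rows to keep

# conditions are checked in order; the first match wins
conditions = [
    df['c'].isin(df['c'].nsmallest(2)),   # rule 1: two lowest ranks
    (df['b'] == 'Y') & (df['c'] <= 5),    # rule 2
    (df['b'] == 'Y') & (df['c'] <= 7),    # rule 3
    df['b'] == 'N',                       # rule 4
]
labels = ['rule_1', 'rule_2', 'rule_3', 'rule_4']
df['rule'] = np.select(conditions, labels, default='none')

# order candidates by rule priority, then by rank, and keep the first five
picked = (df[df['rule'] != 'none']
          .sort_values(['rule', 'c'])
          .head(n_select))

# keep the rule label for picked rows, an empty string otherwise
df['selected'] = df['rule'].where(df.index.isin(picked.index), '')
print(df[['a', 'b', 'c', 'selected']])
```

Sorting on the label works here only because 'rule_1' to 'rule_4' sort alphabetically in priority order; with ten or more rules a numeric priority column would be safer.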

Counting sequential occurrences in a list and

I have 3 lists as follows:
L1 = ['H', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'T']
L2 = ['H', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'T' , 'T', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'T']
L3 = ['H', 'T', 'H', 'H']
I would like to count sequential occurrences of 'H' in each list and produce the following table showing the frequencies of these 'H' sequences:
Length | L1 | L2 | L3
----------------------
1 0 1 1
2 1 1 1
3 0 1 0
4 1 1 0
5 0 0 0
I know that the following gives me the lengths of the 'H' sequences in a list:
from itertools import groupby
[len(list(g[1])) for g in groupby(L1) if g[0]=='H']
[2, 4]
But I need an elegant way to extend this to the remaining lists while ensuring that a 0 is placed for unobserved lengths.
You can use collections.Counter to create a frequency dict from a generator expression that outputs the lengths of sequences generated by itertools.groupby, and then iterate through a range of possible lengths to output the frequencies from the said dict, with 0 as a default value in absence of a frequency.
Using L1 as an example:
from itertools import groupby
from collections import Counter
counts = Counter(sum(1 for _ in g) for k, g in groupby(L1) if k == 'H')
print([counts[length] for length in range(1, 6)])
This outputs:
[0, 1, 0, 1, 0]
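To extend this to all three lists, a minimal sketch could wrap the Counter in a function and build one column per list. The names run_length_counts, lists, max_length and table are illustrative.

```python
from collections import Counter
from itertools import groupby


def run_length_counts(seq, symbol='H'):
    """Count how often each run length of `symbol` occurs in `seq`."""
    return Counter(sum(1 for _ in group)
                   for key, group in groupby(seq) if key == symbol)


lists = {
    'L1': ['H', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'T'],
    'L2': ['H', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'T', 'T', 'H',
           'T', 'T', 'T', 'H', 'H', 'H', 'T'],
    'L3': ['H', 'T', 'H', 'H'],
}
max_length = 5

# a Counter returns 0 for lengths that were never observed
table = {name: [run_length_counts(seq)[n] for n in range(1, max_length + 1)]
         for name, seq in lists.items()}

print('Length | ' + ' | '.join(table))
for n in range(1, max_length + 1):
    print(n, *(table[name][n - 1] for name in table))
```

Because the missing lengths come from the Counter default, no explicit zero-filling step is needed.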
You can use itertools.groupby with collections.Counter:
import itertools as it, collections as _col
def scores(l):
    return _col.Counter([len(list(b)) for a, b in it.groupby(l, key=lambda x:x == 'H') if a])
L1 = ['H', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'T']
L2 = ['H', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'H', 'T' , 'T', 'H', 'T', 'T', 'T', 'H', 'H', 'H', 'T']
L3 = ['H', 'T', 'H', 'H']
d = {'L1':scores(L1), 'L2':scores(L2), 'L3':scores(L3)}
r = '\n'.join([f'Length | {" | ".join(d.keys())} ', '-'*20]+[f'{i} {" ".join(str(b.get(i, 0)) for b in d.values())}' for i in range(1, 6)])
print(r)
Output:
Length | L1 | L2 | L3
--------------------
1 0 1 1
2 1 1 1
3 0 1 0
4 1 1 0
5 0 0 0
This might work:
from itertools import groupby
a = [len(list(v)) if k=='H' and v else 0 for k,v in groupby(''.join(L1))]
For a sample L4 = ['T', 'T'] where there is no 'H' item in list, it returns [0].
For L1 it returns [2, 0, 4, 0].
For L2 it returns [2, 0, 4, 0, 1, 0, 3, 0].
For L3 it returns [1, 0, 2].
Alternatively, max([len(x) for x in ''.join(y).split('T')]), where y is your list, gives the length of the longest run of 'H' (assuming the list contains only 'H' and 'T').
