I'm not sure the wording of the title is optimal, because the problem I have is a little tricky to explain. In code, I have a df that looks something like this:
import pandas as pd
import numpy as np
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = pd.DataFrame([a, b]).T
df
Yields
0 1
0 A 3
1 A 1
2 A 2
3 B 3
4 B 12
5 B 4
6 C 7
7 C 8
8 D 3
9 E 10
10 E 12
I'm aware of groupby methods to group by values in a column, but that's not exactly what I want. I want to go a step past that: whenever groups in column 0 share any value in column 1, those groups should be merged. My wording is terrible (which is probably why I'm having trouble putting this into code), but here's basically what I want as output:
0 1
0 A-B-D-E 3
1 A-B-D-E 1
2 A-B-D-E 2
3 A-B-D-E 3
4 A-B-D-E 12
5 A-B-D-E 4
6 C 7
7 C 8
8 A-B-D-E 3
9 A-B-D-E 10
10 A-B-D-E 12
Basically, A, B, and D all share the value 3 in column 1, so their labels get grouped together in column 0. Now, because B and E share value 12 in column 1, and B shares the value 3 in column 1 with A and D, E gets grouped in with A, B, and D as well. The only value in column 0 that remained independent is C, because it has no intersections with any other group.
In my head this ends up being a recursive loop, but I can't seem to figure out the exact logic. Any help would be appreciated.
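(For what it's worth, this reduces to a connected-components problem: treat each label and each value as a node, with an edge for every row; labels that share a value land in the same component, and sharing propagates transitively exactly as described above. A minimal union-find sketch of that idea, with hypothetical helper names and only tested against this sample:)
import pandas as pd

a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = pd.DataFrame({0: a, 1: b})

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(y)] = find(x)

# Connect each label node to each of its value nodes ('L'/'V' tags avoid clashes).
for label, value in zip(df[0], df[1]):
    union(('L', label), ('V', value))

# Group labels by their component root and build the joined names.
groups = {}
for label in set(df[0]):
    groups.setdefault(find(('L', label)), []).append(label)
names = {lab: '-'.join(sorted(labs)) for labs in groups.values() for lab in labs}
df[0] = df[0].map(names)
print(df)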
If anyone in the future is experiencing the same thing, this works (it's probably not the best solution in the world, though):
import pandas as pd
import numpy as np
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = ['3', '1', '2', '3', '12', '4', '7', '8', '3', '10', '12']
df = pd.DataFrame([a, b]).T
df.columns = 'a', 'b'
df2 = df.copy()
def flatten(container):
    # Recursively yield the leaves of arbitrarily nested lists/tuples.
    for i in container:
        if isinstance(i, (list, tuple)):
            for j in flatten(i):
                yield j
        else:
            yield i

bad = True
i = 1
while bad:
    print("Round " + str(i))
    i = i + 1
    len_checker = []
    for variant in list(set(df.a)):
        # All values seen for this label, then all labels sharing any of those values.
        eGenes = list(set(df.loc[df.a == variant, 'b']))
        inter_variants = []
        for gene in eGenes:
            inter_variants.append(list(set(df.loc[df.b == gene, 'a'])))
        if type(inter_variants[0]) is not str:
            inter_variants = [x for x in flatten(inter_variants)]
        inter_variants = list(set(inter_variants))
        len_checker.append(inter_variants)
        if len(inter_variants) != 1:
            # Merge the intersecting labels into one joined label.
            df2.loc[df2.a.isin(inter_variants), 'a'] = '-'.join(inter_variants)
    good_checker = max([len(x) for x in len_checker])
    df['a'] = df2.a
    if good_checker == 1:
        # No label intersects any other label any more: we have converged.
        bad = False

# Deduplicate the pieces inside each joined label, then drop duplicate rows.
df.a = df.a.apply(lambda x: '-'.join(list(set(x.split('-')))))
df.drop_duplicates(inplace=True)
The following creates the output you want, without recursion. I have not tested it with other configurations, though (different orderings, more combinations, etc.).
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = list(zip(a, b))
print(df)
class Bucket:
    def __init__(self, keys, values):
        self.keys = set(keys)
        self.values = set(values)

    def contains_key(self, key):
        return key in self.keys

    def add_if_contained(self, key, value):
        if value in self.values:
            self.keys.add(key)
            return True
        elif key in self.keys:
            self.values.add(value)
            return True
        return False

    def merge(self, bucket):
        self.keys.update(bucket.keys)
        self.values.update(bucket.values)

    def __str__(self):
        return f'{self.keys} :: {self.values}>'

    def __repr__(self):
        return str(self)

res = []
for tup in df:
    added = False
    if res:
        selected_bucket = None
        remove_idx = None
        for idx, bucket in enumerate(res):
            if not added:
                added = bucket.add_if_contained(tup[0], tup[1])
                selected_bucket = bucket
            elif bucket.contains_key(tup[0]):
                selected_bucket.merge(bucket)
                remove_idx = idx
        if remove_idx is not None:
            res.pop(remove_idx)
    if not added:
        res.append(Bucket({tup[0]}, {tup[1]}))

print(res)
Generates the following output:
$ python test.py
[('A', 3), ('A', 1), ('A', 2), ('B', 3), ('B', 12), ('B', 4), ('C', 7), ('C', 8), ('D', 3), ('E', 10), ('E', 12)]
[{'B', 'D', 'A', 'E'} :: {1, 2, 3, 4, 10, 12}>, {'C'} :: {8, 7}>]
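Note that the buckets are not yet the dataframe column the question asked for; mapping each key back to its bucket's joined name is one short extra step (a sketch continuing from the res list above):
# Map each original key to its merged bucket name (continuing from res).
label_map = {key: '-'.join(sorted(bucket.keys)) for bucket in res for key in bucket.keys}
merged = [(label_map[k], v) for k, v in df]
print(merged)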
Related
I have a dataframe which looks like this
pd.DataFrame({'a':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'b':['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
'c':[4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
a b c
0 A N 4
1 B Y 5
2 C Y 9
3 D N 8
4 E Y 1
5 F N 3
6 G Y 7
7 H N 2
8 I N 6
9 J Y 10
Out of the 10 rows I want to select 5 rows based on the following criteria:
column 'c' is my rank column.
select the rows with lowest 2 ranks (rows 4 and 7 selected)
select all rows where column 'b' = 'Y' AND rank <=5 (row 1 selected)
in the event fewer than 5 rows are selected using the above criteria, the remaining open positions should be filled in rank order (lowest) with rows where 'b' = 'Y' and rank <= 7 (row 6 selected)
in the event fewer than 5 rows pass the first 3 criteria, fill the remaining positions in rank order (lowest) where 'b' = 'N'
I have tried this (which covers rules 1 & 2) but I'm struggling with how to go on from there:
df['selected'] = ''
df.loc[(df.c <= 2), 'selected'] = 'rule_1'
df.loc[((df.c <= 5) & (df.b == 'Y')), 'selected'] = 'rule_2'
my resulting dataframe should look like this
a b c selected
0 A N 4 False
1 B Y 5 rule_2
2 C Y 9 False
3 D N 8 rule_4
4 E Y 1 rule_1
5 F N 3 False
6 G Y 7 rule_3
7 H N 2 rule_1
8 I N 6 False
9 J Y 10 False
Based on one of the solutions provided by Vinod Karantothu below, I went for the following, which seems to work:
def solution(df):
    def sol(df, b='Y'):
        result_df_rule1 = df.sort_values('c')[:2]
        result_df_rule1['action'] = 'rule_1'
        result_df_rule2 = df.sort_values('c')[2:].loc[df['b'] == b].loc[df['c'] <= 5]
        result_df_rule2['action'] = 'rule_2'
        result = pd.concat([result_df_rule1, result_df_rule2]).head(5)
        if len(result) < 5:
            remaining_rows = pd.concat([df, result, result]).drop_duplicates(subset='a', keep=False)
            result_df_rule3 = remaining_rows.loc[df['b'] == b].loc[df['c'] <= 7]
            result_df_rule3['action'] = 'rule_3'
            result = pd.concat([result, result_df_rule3]).head(5)
        return result, pd.concat([remaining_rows, result, result]).drop_duplicates(subset='a', keep=False)

    result, remaining_data = sol(df)
    if len(result) < 5:
        result1, remaining_data = sol(remaining_data, 'N')
        result1['action'] = 'rule_4'
        result = pd.concat([result, result1]).head(5).drop_duplicates(subset='a', keep=False).merge(df, how='outer', on='a')
    return result

if __name__ == '__main__':
    df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                       'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                       'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
    result = solution(df)
    print(result)
import pandas as pd

def solution(df):
    def sol(df, b='Y'):
        result_df_rule1 = df.sort_values('c')[:2]
        result_df_rule2 = df.sort_values('c')[2:].loc[df['b'] == b].loc[df['c'] <= 5]
        result = pd.concat([result_df_rule1, result_df_rule2]).head(5)
        if len(result) < 5:
            remaining_rows = pd.concat([df, result, result]).drop_duplicates(keep=False)
            result_df_rule3 = remaining_rows.loc[df['b'] == b].loc[df['c'] <= 7]
            result = pd.concat([result, result_df_rule3]).head(5)
        return result, pd.concat([remaining_rows, result, result]).drop_duplicates(keep=False)

    result, remaining_data = sol(df)
    if len(result) < 5:
        result1, remaining_data = sol(remaining_data, 'N')
        result = pd.concat([result, result1]).head(5)
    return result

if __name__ == '__main__':
    df = pd.DataFrame({'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
                       'b': ['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
                       'c': [4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
    result = solution(df)
    print(result)
Result:
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
For your 4th rule: in your expected dataframe you have row index 3 selected, but it has a rank of 8, which is not the lowest; row index 5 should be selected according to the rules you have given:
import pandas as pd
data = pd.DataFrame({'a':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'b':['N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N', 'Y'],
'c':[4, 5, 9, 8, 1, 3, 7, 2, 6, 10]})
data1 = data.nsmallest(2, ['c'])
dataX = data.drop(data1.index)
data2 = dataX[((dataX.b == "Y") & (dataX.c<=5))]
dataX = dataX.drop(data2.index)
data3 = dataX[((dataX.b == "Y") & (dataX.c<=7))]
dataX = dataX.drop(data3.index)
data4 = dataX[((dataX.b == "N"))]
data4 = data4.nsmallest(1, ['c'])
resultframes = [data1, data2, data3, data4]
resultfinal = pd.concat(resultframes)
print(resultfinal)
And here is the output:
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
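One caveat: data4.nsmallest(1, ['c']) hardcodes that exactly one rule-4 row is still needed. A more general final step (a sketch, assuming the target of 5 selected rows) would derive that count:
# Fill only as many rule-4 rows as are still needed to reach 5 in total.
needed = 5 - len(data1) - len(data2) - len(data3)
data4 = data4.nsmallest(needed, ['c'])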
You can create extra columns for the rules, then sort and take the head. IIUC from the comments, rule 3 already covers rule 2, so there is no need to calculate it separately.
df['r1'] = df.c < 3
df['r3'] = (df.c <= 7) & (df.b == 'Y')
print(df.sort_values(['r1', 'r3', 'c'], ascending=[False, False, True])[['a', 'b', 'c']].head(5))
a b c
4 E Y 1
7 H N 2
1 B Y 5
6 G Y 7
5 F N 3
Sorting on a boolean column works because True > False.
Note: you might need to tweak the code to your expectations with different datasets. For example, your last row (9 J Y 10) is currently not covered by any of the rules. You can take this approach and extend it if needed.
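If you also need the 'selected' labels from the question, they can be derived after the sort. A sketch extending this answer's columns (note it collapses rule_2 into rule_3, in line with the simplification above):
top5 = df.sort_values(['r1', 'r3', 'c'], ascending=[False, False, True]).head(5)
df['selected'] = 'False'  # mirrors the question's rendering of unselected rows
df.loc[top5[top5.r1].index, 'selected'] = 'rule_1'
df.loc[top5[~top5.r1 & top5.r3].index, 'selected'] = 'rule_3'
df.loc[top5[~top5.r1 & ~top5.r3].index, 'selected'] = 'rule_4'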
I have a pandas dataframe. The dataframe has 4 columns. The last one is just some random data. The first two columns are the columns I will group by, summing the value column. Within each 'group col 1' value, I would like to keep only the rows of the 'group col 2' subgroup with the largest sum.
My data:
import pandas as pd
df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
['0', 'A', 2, 'b'],
['0', 'A', 1, 'c'],
['0', 'B', 3, 'd'],
['0', 'B', 4, 'e'],
['0', 'B', 4, 'f'],
['1', 'C', 3, 'g'],
['1', 'C', 2, 'h'],
['1', 'C', 1, 'i'],
['1', 'D', 3, 'j'],
['1', 'D', 4, 'k'],
['1', 'D', 4, 'l']
], columns=['group col 1', 'group col 2', 'value', 'random data']
)
Desired output:
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
I have an inefficient way of getting there, but looking for a simpler solution.
My solution:
df1 = df.groupby(['group col 1','group col 2']).agg('sum').reset_index()
biggest_groups= df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
pairs = biggest_groups[['group col 1', 'group col 2']].values.tolist()
pairs = [tuple(i) for i in pairs]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis = 1).isin(pairs)]
IIUC, you will need two groupbys here: one to get the sum, then, based on the 'group col 1' group, select the max again:
s = df.groupby(['group col 1', 'group col 2']).value.transform('sum')
s = df[s.groupby(df['group col 1']).transform('max') == s]
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
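Unpacked, those two lines do the following (the same logic with intermediate names, purely for readability):
# Per-row sum of that row's ('group col 1', 'group col 2') subgroup.
sums = df.groupby(['group col 1', 'group col 2'])['value'].transform('sum')
# Largest subgroup sum within each 'group col 1' group, broadcast to the rows.
best = sums.groupby(df['group col 1']).transform('max')
# Keep only the rows belonging to the winning subgroup.
result = df[sums == best]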
I'm trying to simplify my pandas and Python syntax when executing a basic pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a >= b, doc_type: a
b > a, doc_type: b
I'm struggling with how to handle, in pandas, the case where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b.
I needed to create 2 additional comparisons, which at scale may not be efficient, as I already compare the data earlier. I'm looking for how to improve it.
import pandas as pd
import numpy as np

df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
'a_score': [1, 2, 3, 4, '', 6, 7],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print(df)

# Replace empty strings with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)

m_score = df['a_score'] >= df['b_score']
m_doc = (df['a_id'].isnull() & df['b_id'].isnull())

# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)

# Select type based on higher score
df['doc_type'] = np.where(m_score, 'a', np.where(m_doc, np.nan, 'b'))

# Additional lines looking for improvement:
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].notnull())] = 'b'
df['doc_type'].loc[(df['a_id'].notnull() & df['b_id'].isnull())] = 'a'
print(df)
Use numpy.where, assuming your logic is:
Both exist, the doc_type will be the one with higher score;
One missing, the doc_type will be the one not null;
Both missing, the doc_type will be null;
Added an extra edge case at the last line:
import numpy as np
df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
np.where(df.a_id.isnull(), None, 'a'), 'b')
df
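An equivalent, flatter formulation with np.select, if the nested np.where gets hard to read (a sketch, same logic as above):
conditions = [
    df.a_id.isnull() & df.b_id.isnull(),            # both missing -> null
    df.b_id.isnull() | (df.a_score >= df.b_score),  # a wins, or b is missing
]
df['doc_type'] = np.select(conditions, [None, 'a'], default='b')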
Not sure I fully understand all conditions or if this has any particular edge cases, but I think you can just do an np.argmax on the columns and swap the values for 'a' or 'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
a_id a_score b_id b_score doc_type
0 A 1 a 0.10 a
1 B 2 b 0.20 a
2 C 3 c 3.10 b
3 D 4 d 4.10 b
4 2 e 5.00 b
5 F f 5.99 a
6 G 7 NaN a
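One detail worth noting: np.argmax returns the first index on ties, so equal scores resolve to 'a', which happens to match the a >= b rule from the question:
np.argmax([5, 5])  # -> 0, i.e. a tie maps to 'a'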
Use the apply method in pandas with a custom function, trying it out on your dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
'a_score': [1, 2, 3, 4, '', 6, 7],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)

def func(row):
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'  # only b exists, so b is the label
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need, include any number of comparisons, and add more conditions later if needed.
I have two dataframes as follows:
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
I want to get vectors of the same length out of the category column (i.e. indexed by category) for each person and group. In other words, I want to transform these long dataframes into wide format, where the names of the new columns are the values of the category column.
What is the best way to do this? This is an example of what I need:
id type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
My current script appends both dataframes and then builds a pivot table. My concern is that in this case the types of the id columns are different.
I do this because sometimes not all the categories are in each dataframe (e.g. 'E' is not in df2).
This is what I have:
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1['type'] = 'person'
df2['type'] = 'group'
df1.rename(columns={'person': 'id'}, inplace = True)
df2.rename(columns={'group': 'id'}, inplace = True)
rawpivot = pd.concat([df1, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
pivot = rawpivot.pivot_table(index=['id','type'], columns='category', values='value', aggfunc='sum', fill_value=0)
pivot.reset_index(inplace = True)
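As for the mixed id types, one fix is to normalize the id column to a single dtype before building the pivot, e.g.:
# Cast the mixed str/int ids to one type before pivoting (an assumption about
# the desired behavior; astype(int) would also work if all ids are numeric).
rawpivot['id'] = rawpivot['id'].astype(str)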
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
cols = ['idx', 'type', 'A', 'B', 'C', 'D', 'E', 'F']
df1 = pd.DataFrame(columns=cols)
def add_data(type_, data):
    global df1
    for id_, category, value in zip(data[type_], data['category'], data['value']):
        if id_ not in df1.idx.values:
            row = pd.DataFrame({'idx': id_, 'type': type_}, columns=cols, index=[0])
            df1 = pd.concat([df1, row], ignore_index=True)  # replaces the removed DataFrame.append
        df1.loc[df1['idx'] == id_, category] = value

add_data('group', d2)
add_data('person', d1)
df1 = df1.fillna(0)
df1 now holds the following values
idx type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
I have a Pandas DataFrame that looks something like:
df = pd.DataFrame({'col1': {0: 'a', 1: 'b', 2: 'c'},
'col2': {0: 1, 1: 3, 2: 5},
'col3': {0: 2, 1: 4, 2: 6},
'col4': {0: 3, 1: 6, 2: 2},
'col5': {0: 7, 1: 2, 2: 3},
'col6': {0: 2, 1: 9, 2: 5},
})
df.columns = [list('AAAAAA'), list('BBCCDD'), list('EFGHIJ')]
A
B C D
E F G H I J
0 a 1 2 3 7 2
1 b 3 4 6 2 9
2 c 5 6 2 3 5
I basically just want to melt the data frame so that each column level becomes a new column. In other words, I can achieve what I want pretty simply with pd.melt():
pd.melt(df, value_vars=[('A', 'B', 'E'),
('A', 'B', 'F'),
('A', 'C', 'G'),
('A', 'C', 'H'),
('A', 'D', 'I'),
('A', 'D', 'J')])
However, in my real use-case there are many initial columns (a lot more than 6), and it would be great if I could make this generalizable so I didn't have to precisely specify the tuples in value_vars. Is there a way to do this generically? I'm basically looking for a way to tell pd.melt to set value_vars to a list of tuples, where in each tuple the first element is the first column level, the second element the second level, and the third element the third level.
If you don't specify value_vars, then all columns (that are not specified as id_vars) are used by default:
In [10]: pd.melt(df)
Out[10]:
variable_0 variable_1 variable_2 value
0 A B E a
1 A B E b
2 A B E c
3 A B F 1
4 A B F 3
...
However, if for some reason you do need to generate the list of column-tuples, you could use df.columns.tolist():
In [57]: df.columns.tolist()
Out[57]:
[('A', 'B', 'E'),
('A', 'B', 'F'),
('A', 'C', 'G'),
('A', 'C', 'H'),
('A', 'D', 'I'),
('A', 'D', 'J')]
In [56]: pd.melt(df, value_vars=df.columns.tolist())
Out[56]:
variable_0 variable_1 variable_2 value
0 A B E a
1 A B E b
2 A B E c
3 A B F 1
4 A B F 3
...
I had this same question, but my base dataset was actually just a Series with a 3-level MultiIndex. I found this answer for 'melting' a Series into a DataFrame in this blog post: https://discuss.analyticsvidhya.com/t/how-to-convert-the-multi-index-series-into-a-data-frame-in-python/5119/2
Basically, you just use the DataFrame constructor on the Series, and it does what you want melt to do.
pd.DataFrame(series)
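For illustration, a small hypothetical 3-level Series; note that pd.DataFrame(series) keeps the MultiIndex, so reset_index() is usually the step that actually flattens the levels into columns:
import pandas as pd

# Hypothetical 3-level MultiIndex Series for illustration.
idx = pd.MultiIndex.from_tuples([('A', 'B', 'E'), ('A', 'B', 'F')],
                                names=['lvl0', 'lvl1', 'lvl2'])
s = pd.Series([1, 2], index=idx)
flat = pd.DataFrame(s).reset_index()  # the index levels become ordinary columns
print(flat)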
I tried working with pd.melt(), but wasn't able to get it to run properly. I found it much easier to use df.unstack(), which converts it entirely to long format, and then to convert it back to the required format using df.pivot(). These links might help:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
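For completeness, a stack-based route on the question's dataframe might look like this (a sketch; the resulting columns come out as generic level_* names because the column levels are unnamed):
# Stack all three column levels into the row index, then flatten.
long = df.stack(level=[0, 1, 2]).reset_index()
print(long)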