I have a pandas DataFrame with 4 columns. The last one is just some random data. I group by the first two columns and sum the value column; then, within each value of the first column, I only want to keep the rows belonging to the group with the largest sum.
My data:
import pandas as pd
df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
['0', 'A', 2, 'b'],
['0', 'A', 1, 'c'],
['0', 'B', 3, 'd'],
['0', 'B', 4, 'e'],
['0', 'B', 4, 'f'],
['1', 'C', 3, 'g'],
['1', 'C', 2, 'h'],
['1', 'C', 1, 'i'],
['1', 'D', 3, 'j'],
['1', 'D', 4, 'k'],
['1', 'D', 4, 'l']
], columns=['group col 1', 'group col 2', 'value', 'random data']
)
Desired output:
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
I have an inefficient way of getting there, but I'm looking for a simpler solution.
My solution:
# Sum 'value' within each ('group col 1', 'group col 2') pair
df1 = df.groupby(['group col 1', 'group col 2'])['value'].sum().reset_index()
# Within each 'group col 1', keep the pair with the largest sum
biggest_groups = df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
# Filter the original rows down to those winning pairs
pairs = [tuple(i) for i in biggest_groups[['group col 1', 'group col 2']].values.tolist()]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis=1).isin(pairs)]
IIUC you need two groupbys here: one to get the per-pair sum, then another to select the max of those sums within each 'group col 1':
# Per-pair sum of 'value', broadcast back onto every row
s = df.groupby(['group col 1', 'group col 2']).value.transform('sum')
# Keep rows whose pair sum equals the max pair sum within their 'group col 1'
out = df[s.groupby(df['group col 1']).transform('max') == s]
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
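An alternative sketch (my own, not part of the answer above): sum per pair once, pick each 'group col 1' winner with idxmax, and keep its rows via an inner merge. Note that the merge resets the original row index:
import pandas as pd

# Sum 'value' per ('group col 1', 'group col 2') pair
sums = df.groupby(['group col 1', 'group col 2'])['value'].sum().reset_index()
# The pair with the largest sum within each 'group col 1'
winners = sums.loc[sums.groupby('group col 1')['value'].idxmax(),
                   ['group col 1', 'group col 2']]
# Inner merge keeps only the rows belonging to the winning pairs
result = df.merge(winners, on=['group col 1', 'group col 2'])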
Related
I have a little problem with a wide-to-long transformation on a dataset. I tried melt but didn't get a good result. I hope someone can help me. The dataset is as follows:
df = pd.DataFrame({'id': [0, 1, 2, 3, 4, 5],
'type': ['a', 'b', 'c', 'd', 'e', 'f'],
'rank': ['alpha', 'beta', 'gamma', 'epsilon', 'phi', 'ro'],
'type.1': ['d', 'g', 'z', 'a', 'nan', 'nan'],
'rank.1': ['phi', 'sigma', 'gamma', 'lambda', 'nan', 'nan'],
'type.2': ['nan', 'nan', 'j', 'r', 'nan', 'nan'],
'rank.2': ['nan', 'nan', 'eta', 'theta', 'nan', 'nan']})
And I need the dataset in this way:
pd.DataFrame({'id': [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
'type': ['a', 'd', 'b', 'g', 'c', 'z', 'j', 'd', 'a', 'r', 'e', 'f'],
'rank': ['alpha', 'phi', 'beta', 'sigma', 'gamma', 'gamma', 'eta', 'epsilon', 'lambda', 'theta', 'phi', 'ro']})
Can anyone help me with that? Thanks a lot
Use wide_to_long:
# normalize the `type` and `rank` columns so they have the same format as others
df = df.rename(columns={'type': 'type.0', 'rank': 'rank.0'})
(pd.wide_to_long(df, stubnames=['type', 'rank'], i='id', j='var', sep='.')
[lambda x: (x['type'] != 'nan') | (x['rank'] != 'nan')].reset_index())
id var type rank
0 0 0 a alpha
1 1 0 b beta
2 2 0 c gamma
3 3 0 d epsilon
4 4 0 e phi
5 5 0 f ro
6 0 1 d phi
7 1 1 g sigma
8 2 1 z gamma
9 3 1 a lambda
10 2 2 j eta
11 3 2 r theta
You can drop the var column if not needed.
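For instance, assuming the result above is bound to a name such as out (a hypothetical name, just for illustration):
out = out.drop(columns='var')  # 'var' held the .0/.1/.2 suffix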
One option is pivot_longer from pyjanitor, which abstracts the reshaping process:
# pip install pyjanitor
import janitor
(df
.pivot_longer(
index = 'id',
names_to = ['type', 'rank'],
names_pattern = ['type', 'rank'],
sort_by_appearance = True)
.loc[lambda df: ~df.eq('nan').any(axis=1)]
)
id type rank
0 0 a alpha
1 0 d phi
3 1 b beta
4 1 g sigma
6 2 c gamma
7 2 z gamma
8 2 j eta
9 3 d epsilon
10 3 a lambda
11 3 r theta
12 4 e phi
15 5 f ro
The idea for this particular reshape is that each regex in names_pattern selects the matching columns and pairs them with the corresponding name in names_to.
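For comparison, a plain-pandas sketch of the same pairing idea using pd.lreshape, with the column groups spelled out by hand (this works on the original, un-renamed columns):
import pandas as pd

# Each key collects the wide columns that feed one long column
out = pd.lreshape(df, {'type': ['type', 'type.1', 'type.2'],
                       'rank': ['rank', 'rank.1', 'rank.2']})
# Drop the placeholder 'nan' strings, as in the pyjanitor version
out = out[~out[['type', 'rank']].eq('nan').any(axis=1)]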
Consider these two dataframes:
index = [0, 1, 2, 3]
columns = ['col0', 'col1']
data = [['A', 'D'],
['B', 'E'],
['C', 'F'],
['A', 'D']
]
df1 = pd.DataFrame(data, index, columns)
df2 = pd.DataFrame(data=[10, 20, 30, 40],
                   index=pd.MultiIndex.from_tuples(
                       [('A', 'D'), ('B', 'E'), ('C', 'F'), ('X', 'Z')]),
                   columns=['col2'])
I want to add a column to df1 that tells me the value from looking at df2. The expected result would be like this:
index = [0, 1, 2, 3]
columns = ['col0', 'col1', 'col2']
data = [['A', 'D', 10],
['B', 'E', 20],
['C', 'F', 30],
['A', 'D', 10]
]
df3 = pd.DataFrame(data, index, columns)
What is the best way to achieve this? I am wondering if it should be done with a dictionary and then map or perhaps something simpler. I'm unsure.
Merge normally:
pd.merge(df1, df2, left_on=["col0", "col1"], right_index=True, how="left")
Output:
col0 col1 col2
0 A D 10
1 B E 20
2 C F 30
3 A D 10
Try this:
# Turn df1's rows into (col0, col1) tuples and look them up in df2's MultiIndex
indexes = list(map(tuple, df1.values))
df1["col2"] = df2.loc[indexes].values
Output:
#print(df1)
col0 col1 col2
0 A D 10
1 B E 20
2 C F 30
3 A D 10
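The dictionary-and-map idea from the question also works, since Series.map accepts a Series and matches on its (Multi)index; a minimal sketch:
# Build (col0, col1) tuples and map them against df2's MultiIndexed column
df1['col2'] = df1[['col0', 'col1']].apply(tuple, axis=1).map(df2['col2'])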
I've got a dataframe like this
df = pd.DataFrame(
[
['1', 'x', 'a'],
['1', 'y', 'b'],
['1', 'z', 'c'],
['2', 'x', 'a'],
['2', 'y', 'b'],
['2', 'z', 'c']
], columns = ['one', 'two', 'three']
)
one two three
0 1 x a
1 1 y b
2 1 z c
3 2 x a
4 2 y b
5 2 z c
I'd like to end up with a dataframe like the following,
one two plus three
0 1 x + a\ny + b\nz + c
1 2 x + a\ny + b\nz + c
How can I do this? I've tried df.sum(axis=1), but I can't figure out how to group the df so that each set of 3 records is combined horizontally and joined with \n in between.
Try with groupby and agg + join:
s = (df[['two', 'three']].agg(' + '.join, axis=1)
       .groupby(df.one).agg('\n'.join)
       .to_frame('two + three').reset_index())
  one          two + three
0   1  x + a\ny + b\nz + c
1   2  x + a\ny + b\nz + c
import pandas as pd
df = pd.DataFrame(
[
['1', 'x', 'a'],
['1', 'y', 'b'],
['1', 'z', 'c'],
['2', 'x', 'a'],
['2', 'y', 'b'],
['2', 'z', 'c']
], columns = ['one', 'two', 'three']
)
df['two_plus_three'] = df['two'] + ' + ' + df['three'] + '\n'
df.groupby('one')[['two_plus_three']].sum().reset_index()
one two_plus_three
0 1 x + a\ny + b\nz + c\n
1 2 x + a\ny + b\nz + c\n
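If the trailing \n matters, it can be stripped afterwards, e.g.:
out = df.groupby('one')['two_plus_three'].sum().str.rstrip('\n').reset_index()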
I'm not sure the wording of the title is optimal, because the problem I have is a little tricky to explain. In code, I have a df that looks something like this:
import pandas as pd
import numpy as np
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = pd.DataFrame([a, b]).T
df
Yields
0 1
0 A 3
1 A 1
2 A 2
3 B 3
4 B 12
5 B 4
6 C 7
7 C 8
8 D 3
9 E 10
10 E 12
I'm aware of groupby methods to group by values in a column, but that's not exactly what I want. I want to go a step past that: whenever groups in column 0 share any value in column 1, those groups should be merged together. My wording is terrible (which is probably why I'm having trouble putting this into code), but here's basically what I want as output:
0 1
0 A-B-D-E 3
1 A-B-D-E 1
2 A-B-D-E 2
3 A-B-D-E 3
4 A-B-D-E 12
5 A-B-D-E 4
6 C 7
7 C 8
8 A-B-D-E 3
9 A-B-D-E 10
10 A-B-D-E 12
Basically, A, B, and D all share the value 3 in column 1, so their labels get grouped together in column 0. Now, because B and E share value 12 in column 1, and B shares the value 3 in column 1 with A and D, E gets grouped in with A, B, and D as well. The only value in column 0 that remained independent is C, because it has no intersections with any other group.
In my head this ends up being a recursive loop, but I can't seem to figure out the exact logic. Any help would be appreciated.
If anyone in the future is experiencing the same thing, this works (it's probably not the best solution in the world, though):
import pandas as pd
import numpy as np
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = ['3', '1', '2', '3', '12', '4', '7', '8', '3', '10', '12']
df = pd.DataFrame([a, b]).T
df.columns = 'a', 'b'
df2 = df.copy()
def flatten(container):
    # Recursively yield the scalar items of arbitrarily nested lists/tuples
    for i in container:
        if isinstance(i, (list, tuple)):
            for j in flatten(i):
                yield j
        else:
            yield i

bad = True
i = 1
while bad:
    print("Round " + str(i))
    i = i + 1
    len_checker = []
    for variant in list(set(df.a)):
        # All values of 'b' seen for this label...
        eGenes = list(set(df.loc[df.a == variant, 'b']))
        inter_variants = []
        for gene in eGenes:
            # ...and all labels that share any of those values
            inter_variants.append(list(set(df.loc[df.b == gene, 'a'])))
        if type(inter_variants[0]) is not str:
            inter_variants = [x for x in flatten(inter_variants)]
        inter_variants = list(set(inter_variants))
        len_checker.append(inter_variants)
        if len(inter_variants) != 1:
            # Merge the intersecting labels into one joined label
            df2.loc[df2.a.isin(inter_variants), 'a'] = '-'.join(inter_variants)
    good_checker = max([len(x) for x in len_checker])
    df['a'] = df2.a
    # Stop once no label intersects any other
    if good_checker == 1:
        bad = False
df.a = df.a.apply(lambda x: '-'.join(list(set(x.split('-')))))
df.drop_duplicates(inplace=True)
The following creates the output you want, without recursion. I have not tested it with other configurations, though (different orderings, more combinations, etc.).
a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = list(zip(a, b))
print(df)
class Bucket:
    # A set of keys (labels) together with the set of values they cover
    def __init__(self, keys, values):
        self.keys = set(keys)
        self.values = set(values)

    def contains_key(self, key):
        return key in self.keys

    def add_if_contained(self, key, value):
        # Absorb the pair if either side already belongs to this bucket
        if value in self.values:
            self.keys.add(key)
            return True
        elif key in self.keys:
            self.values.add(value)
            return True
        return False

    def merge(self, bucket):
        self.keys.update(bucket.keys)
        self.values.update(bucket.values)

    def __str__(self):
        return f'{self.keys} :: {self.values}>'

    def __repr__(self):
        return str(self)

res = []
for tup in df:
    added = False
    if res:
        selected_bucket = None
        remove_idx = None
        for idx, bucket in enumerate(res):
            if not added:
                added = bucket.add_if_contained(tup[0], tup[1])
                selected_bucket = bucket
            elif bucket.contains_key(tup[0]):
                # The key also appears in a later bucket: merge the two
                selected_bucket.merge(bucket)
                remove_idx = idx
        if remove_idx is not None:
            res.pop(remove_idx)
    if not added:
        res.append(Bucket({tup[0]}, {tup[1]}))
print(res)
Generates the following output:
$ python test.py
[('A', 3), ('A', 1), ('A', 2), ('B', 3), ('B', 12), ('B', 4), ('C', 7), ('C', 8), ('D', 3), ('E', 10), ('E', 12)]
[{'B', 'D', 'A', 'E'} :: {1, 2, 3, 4, 10, 12}>, {'C'} :: {8, 7}>]
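For reference, this is the classic connected-components problem: labels are nodes and sharing a value creates an edge. A minimal union-find sketch over the same data (my own illustration, not from either answer above):
import pandas as pd

a = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'E', 'E']
b = [3, 1, 2, 3, 12, 4, 7, 8, 3, 10, 12]
df = pd.DataFrame({'label': a, 'value': b})

# One parent pointer per label; find() follows the pointers to the root
parent = {lab: lab for lab in df['label'].unique()}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Any two labels that share a value belong to the same component
for _, grp in df.groupby('value'):
    labs = grp['label'].unique()
    for other in labs[1:]:
        union(labs[0], other)

# Name each component by joining its sorted member labels
roots = df['label'].map(find)
names = {root: '-'.join(sorted(labs))
         for root, labs in df['label'].groupby(roots).unique().items()}
df['label'] = roots.map(names)
print(df)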
I have two dataframes as follows:
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
I want to get vectors of the same length out of the column category (i.e. indexed by category) for each person and group. In other words, I want to transform these long dataframes into wide format, where the names of the new columns are the values of the column category.
What is the best way to do this? This is an example of what I need:
id type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
My current script concatenates both dataframes and then builds a pivot table. My concern is that the id columns have different types.
I do this because sometimes not all the categories appear in each dataframe (e.g. 'E' is not in df2).
This is what I have:
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1['type'] = 'person'
df2['type'] = 'group'
df1.rename(columns={'person': 'id'}, inplace = True)
df2.rename(columns={'group': 'id'}, inplace = True)
rawpivot = pd.concat([df1, df2], ignore_index=True)
pivot = rawpivot.pivot_table(index=['id','type'], columns='category', values='value', aggfunc='sum', fill_value=0)
pivot.reset_index(inplace = True)
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
cols = ['idx', 'type', 'A', 'B', 'C', 'D', 'E', 'F']
df1 = pd.DataFrame(columns=cols)
def add_data(type_, data):
    global df1
    for id_, category, value in zip(data[type_], data['category'], data['value']):
        if id_ not in df1.idx.values:
            # Start a new, empty row for an unseen id
            row = pd.DataFrame({'idx': id_, 'type': type_}, columns=cols, index=[0])
            df1 = pd.concat([df1, row], ignore_index=True)
        df1.loc[df1['idx'] == id_, category] = value
add_data('group', d2)
add_data('person', d1)
df1 = df1.fillna(0)
df1 now holds the following values
idx type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
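As a side note on the pivot_table approach from the question: casting id to one dtype before pivoting addresses the mixed-type concern (a sketch, assuming df1 and df2 as prepared in the question):
rawpivot = pd.concat([df1, df2], ignore_index=True)
rawpivot['id'] = rawpivot['id'].astype(str)  # unify the str person ids and int group ids
pivot = (rawpivot.pivot_table(index=['id', 'type'], columns='category',
                              values='value', aggfunc='sum', fill_value=0)
         .reset_index())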