Pandas groupby and concatenate strings - python

I've got a dataframe like this
pd.DataFrame(
[
['1', 'x', 'a'],
['1', 'y', 'b'],
['1', 'z', 'c'],
['2', 'x', 'a'],
['2', 'y', 'b'],
['2', 'z', 'c']
], columns = ['one', 'two', 'three']
)
one two three
0 1 x a
1 1 y b
2 1 z c
3 2 x a
4 2 y b
5 2 z c
I'd like to end up with a dataframe like the following,
one two plus three
0 1 x + a\ny + b\nz + c
1 2 x + a\ny + b\nz + c
How can I do this? I've tried using df.sum(axis=1), but I can't figure out how to group the rows by one (three records per group), join two and three horizontally, and put \n between the rows of each group.

Try groupby with agg and join:
s = df[['two', 'three']].agg(' + '.join, axis=1).groupby(df['one']).agg('\n'.join).\
to_frame('two plus three').reset_index()
  one       two plus three
0   1  x + a\ny + b\nz + c
1   2  x + a\ny + b\nz + c

import pandas as pd
df = pd.DataFrame(
[
['1', 'x', 'a'],
['1', 'y', 'b'],
['1', 'z', 'c'],
['2', 'x', 'a'],
['2', 'y', 'b'],
['2', 'z', 'c']
], columns = ['one', 'two', 'three']
)
df['two_plus_three'] = df['two'] + ' + ' +df['three'] + '\n'
df.groupby('one')[['two_plus_three']].sum().reset_index()
one two_plus_three
0 1 x + a\ny + b\nz + c\n
1 2 x + a\ny + b\nz + c\n
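Note that the sum-based version above leaves a trailing \n on every result. A minimal sketch of a variant that avoids it, assuming the same df as in the question; the names combined and result are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['1', 'x', 'a'],
        ['1', 'y', 'b'],
        ['1', 'z', 'c'],
        ['2', 'x', 'a'],
        ['2', 'y', 'b'],
        ['2', 'z', 'c'],
    ],
    columns=['one', 'two', 'three'],
)

# Build the "two + three" string for every row first.
combined = df['two'] + ' + ' + df['three']

# Join the row strings of each group with a newline, so no trailing
# separator is left over.
result = (
    combined.groupby(df['one'])
    .agg('\n'.join)
    .rename('two plus three')
    .reset_index()
)
print(result)
```

Joining with '\n'.join places the separator only between rows, which is why no stripping step is needed afterwards.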

Related

Pandas, adding multiple columns of list

I have a dataframe like this one
df = pd.DataFrame({'A' : [['a', 'b', 'c'], ['e', 'f', 'g','g']], 'B' : [['1', '4', 'a'], ['5', 'a']]})
I would like to create another column C that also holds lists, each one being the "union" (concatenation) of the lists in the other columns.
Something like this:
df = pd.DataFrame({'A' : [['a', 'b', 'c'], ['e', 'f', 'g', 'g']], 'B' : [['1', '4', 'a'], ['5', 'a']], 'C' : [['a', 'b', 'c', '1', '4', 'a'], ['e', 'f', 'g', 'g', '5', 'a']]})
But I have hundreds of columns, and C should be the "union" of all of them, so I don't want to spell out every column like this:
df['C'] = df['A'] + df['B']
I also don't want to write a for loop, because the dataframes I am manipulating are too big and I need something fast.
Thank you for helping.
As you have lists, you cannot vectorize the operation.
A list comprehension might be the fastest:
from itertools import chain
df['out'] = [list(chain.from_iterable(x[1:])) for x in df.itertuples()]
Example:
A B C out
0 [a, b, c] [1, 4, a] [x, y] [a, b, c, 1, 4, a, x, y]
1 [e, f, g, g] [5, a] [z] [e, f, g, g, 5, a, z]
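For reference, a self-contained sketch of the list-comprehension approach that reproduces the example output above. The third column C with the values x, y, z is taken from that example; using itertuples(index=False) instead of slicing off the index with x[1:] is a small variation, not part of the original answer:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'A': [['a', 'b', 'c'], ['e', 'f', 'g', 'g']],
    'B': [['1', '4', 'a'], ['5', 'a']],
    'C': [['x', 'y'], ['z']],
})

# Each row is a tuple of lists; chain.from_iterable flattens them in
# column order. index=False keeps the row index out of the tuple.
df['out'] = [list(chain.from_iterable(row)) for row in df.itertuples(index=False)]
print(df)
```

This scales to any number of list columns without naming them, which is the point of the question.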
As an alternative to @mozway's answer, you could try something like this:
df = pd.DataFrame({'A': [['a', 'b', 'c'], ['e', 'f', 'g','g']], 'B' : [['1', '4', 'a'], ['5', 'a']]})
df['C'] = df.sum(axis=1).astype(str)
Use astype as required for the list contents.
You can also use the apply method:
df['C']=df.apply(lambda x: [' '.join(i) for i in list(x[df.columns.to_list()])], axis=1)

Appending columns using loc pandas dataframe

I am working with a dataframe that I have created with the below code:
df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'playerlookup': ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
'score': ['10', '9', '8', '7', '6', '5', '4', '3']})
I want to add a new column called "scorelookup" to this dataframe that for each row, takes the value in the 'playerlookup' column, searches for it in the 'player' column and then returns the score in a new column. For example, the value in the "scorelookup" column in the first row of the dataframe would be '9' because that was the score for player 'B'. In instances where the value in the 'playerlookup' column isn't contained within the 'player' column (for example the last row of the table which has a value of 'I' in the 'playerlookup' column), the value in that column would be blank.
I have tried using code like:
df['playerlookup'].apply(lambda n: df.loc[df['player'] == n, 'score'])
but have been unsuccessful.
Any help massively appreciated!
I hope this is the result you are looking for:
import pandas as pd
df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'playerlookup': ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
'score': ['10', '9', '8', '7', '6', '5', '4', '3']})
d1 = df[["playerlookup"]].copy()
d2 = df[["player","score"]].copy()
d1.rename({'playerlookup':'player'}, axis='columns',inplace=True)
df["scorelookup"] = d1.merge(d2, on='player', how='left')["score"]
The output
player playerlookup score scorelookup
0 A B 10 9
1 B C 9 8
2 C D 8 7
3 D E 7 6
4 E F 6 5
5 F G 5 4
6 G H 4 3
7 H I 3 NaN
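A shorter alternative to the merge, sketched under the assumption that each player appears only once in the 'player' column (otherwise the lookup Series would have a duplicated index and map would raise an error). The name score_by_player is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                   'playerlookup': ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                   'score': ['10', '9', '8', '7', '6', '5', '4', '3']})

# Lookup table: player name -> score.
score_by_player = df.set_index('player')['score']

# map returns NaN where the looked-up player is not in the table.
df['scorelookup'] = df['playerlookup'].map(score_by_player)
print(df)
```

If a blank is preferred over NaN for missing players, a fillna('') on the new column would do that.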

Keep largest value based on sum of two groupbys in pandas

I have a pandas dataframe with 4 columns. The last one is just some random data. The first two columns are the ones I group by, summing the value column. Within each value of group col 1, I would only like to keep the rows of the group with the largest sum.
My data:
import pandas as pd
df = pd.DataFrame(data=[['0', 'A', 3, 'a'],
['0', 'A', 2, 'b'],
['0', 'A', 1, 'c'],
['0', 'B', 3, 'd'],
['0', 'B', 4, 'e'],
['0', 'B', 4, 'f'],
['1', 'C', 3, 'g'],
['1', 'C', 2, 'h'],
['1', 'C', 1, 'i'],
['1', 'D', 3, 'j'],
['1', 'D', 4, 'k'],
['1', 'D', 4, 'l']
], columns=['group col 1', 'group col 2', 'value', 'random data']
)
Desired output:
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
I have an inefficient way of getting there, but looking for a simpler solution.
My solution:
df1 = df.groupby(['group col 1','group col 2']).agg('sum').reset_index()
biggest_groups= df1.sort_values(by=['group col 1', 'value'], ascending=[True, False])
biggest_groups = biggest_groups.groupby('group col 1').head(1)
pairs = biggest_groups[['group col 1', 'group col 2']].values.tolist()
pairs = [tuple(i) for i in pairs]
df = df[df[['group col 1', 'group col 2']].apply(tuple, axis = 1).isin(pairs)]
IIUC, you need two groupby steps here: the first gets the sum for each (group col 1, group col 2) pair, and the second selects, within each group col 1, the rows whose sum equals the maximum:
s=df.groupby(['group col 1', 'group col 2']).value.transform('sum')
s=df[s.groupby(df['group col 1']).transform('max')==s]
group col 1 group col 2 value random data
3 0 B 3 d
4 0 B 4 e
5 0 B 4 f
9 1 D 3 j
10 1 D 4 k
11 1 D 4 l
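The two steps can be written out with named intermediates, which makes the logic easier to check. This is a sketch on the question's data; group_sum, group_max and result are illustrative names. One caveat worth stating: if two groups tie for the largest sum, this keeps the rows of both.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['0', 'A', 3, 'a'], ['0', 'A', 2, 'b'], ['0', 'A', 1, 'c'],
          ['0', 'B', 3, 'd'], ['0', 'B', 4, 'e'], ['0', 'B', 4, 'f'],
          ['1', 'C', 3, 'g'], ['1', 'C', 2, 'h'], ['1', 'C', 1, 'i'],
          ['1', 'D', 3, 'j'], ['1', 'D', 4, 'k'], ['1', 'D', 4, 'l']],
    columns=['group col 1', 'group col 2', 'value', 'random data'],
)

# Step 1: sum of 'value' per (group col 1, group col 2), broadcast to every row.
group_sum = df.groupby(['group col 1', 'group col 2'])['value'].transform('sum')

# Step 2: the largest of those sums within each 'group col 1', per row.
group_max = group_sum.groupby(df['group col 1']).transform('max')

# Keep only the rows that belong to a group whose sum is the maximum.
result = df[group_sum == group_max]
print(result)
```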

Creating a new column based on selecting by multiple conditions between two pandas dataframes

I have two dataframes that contain (some) common columns (A,B,C), but are ordered differently and have different values for C.
I'd like to replace the 'C' values in first dataframe with those from the second.
I can create a toy example like this:
A = [ 1, 1, 1, 2, 2, 2, 3, 3, 3 ]
B = [ 'x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x' ]
C = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i' ]
df1 = pd.DataFrame( { 'A' : A,
'B' : B,
'C' : C } )
A.reverse()
B.reverse()
C = [ c.upper() for c in reversed(C) ]
df2 = pd.DataFrame( { 'A' : A,
'B' : B,
'C' : C } )
I'd like to update df1 so that it looks like this - i.e. it has the 'C' values from df2:
A = [ 1, 1, 1, 2, 2, 2, 3, 3, 3 ]
B = [ 'x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x' ]
C = [ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' ]
I've tried:
df1['C'] = df2[ (df2['A'] == df1['A']) & (df2['B'] == df1['B']) ]['C']
but that doesn't work because, I think, the order of A and B are different.
merge_df = pd.merge(df1, df2, on=['A', 'B'])
df1['C'] = merge_df['C_y']
I think your toy code has a problem in [ c.upper() for c in C.reverse() ]:
C.reverse() returns None.
The task is not trivial because columns A and B contain duplicate pairs, such as (3, x).
So I create a new column D with cumcount, then
merge on A, B and D, and finally remove the unnecessary columns:
df1['D'] = df1.groupby(['A','B']).C.cumcount()
df2['D'] = df2.groupby(['A','B']).C.cumcount(ascending=False)
df3 = pd.merge(df1, df2, on=['A','B','D'], how='right', suffixes=('_',''))
df3 = df3.drop(['C_', 'D'], axis=1)
print (df3)
A B C
0 1 x A
1 1 y B
2 1 z C
3 2 x D
4 2 y E
5 2 y F
6 3 x G
7 3 x H
8 3 x I
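A self-contained sketch of the same cumcount idea that updates df1 in place, as the question asks. It uses a left merge so the result stays in df1's row order; that choice, and the name merged, are assumptions of this sketch rather than part of the answer above:

```python
import pandas as pd

A = [1, 1, 1, 2, 2, 2, 3, 3, 3]
B = ['x', 'y', 'z', 'x', 'y', 'y', 'x', 'x', 'x']
C = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
df1 = pd.DataFrame({'A': A, 'B': B, 'C': C})
df2 = pd.DataFrame({'A': A[::-1], 'B': B[::-1],
                    'C': [c.upper() for c in reversed(C)]})

# Number the duplicates inside each (A, B) pair. df2 is in reverse order,
# so it is counted from the end to line up with df1.
df1['D'] = df1.groupby(['A', 'B']).cumcount()
df2['D'] = df2.groupby(['A', 'B']).cumcount(ascending=False)

# A left merge keeps df1's row order; (A, B, D) is unique on both sides.
merged = df1.drop(columns='C').merge(df2, on=['A', 'B', 'D'], how='left')
df1['C'] = merged['C'].to_numpy()
df1 = df1.drop(columns='D')
print(df1)
```

The helper column D is what makes the key unique; without it the merge on A and B alone would multiply the duplicated rows.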

Python: Create vectors of same length using two DataFrames

I have two dataframes as follows:
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
I want to get vectors of the same length out of the column category (i.e. indexed by category) for each person and group. In other words, I want to transform these long dataframes into wide format, where the names of the new columns are the values of the column category.
What is the best way to do this? This is an example of what I need:
id type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
My current script appends both dataframes and then it gets a pivot table. My concern is that in this case the types of the id columns are different.
I do this because sometimes not all the categories are in each dataframe (e.g. 'E' is not in df2).
This is what I have:
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1['type'] = 'person'
df2['type'] = 'group'
df1.rename(columns={'person': 'id'}, inplace = True)
df2.rename(columns={'group': 'id'}, inplace = True)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames instead
rawpivot = pd.concat([df1, df2])
pivot = rawpivot.pivot_table(index=['id','type'], columns='category', values='value', aggfunc='sum', fill_value=0)
pivot.reset_index(inplace = True)
import pandas as pd
d1 = {'person' : ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
'category' : ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
'value' : [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group' : [100, 100, 100, 200, 200, 300, 300],
'category' : ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
'value' : [10, 8, 8, 6, 7, 8, 5]}
cols = ['idx', 'type', 'A', 'B', 'C', 'D', 'E', 'F']
df1 = pd.DataFrame(columns=cols)
def add_data(type_, data):
    global df1
    for id_, category, value in zip(data[type_], data['category'], data['value']):
        # Add an empty row the first time an id is seen
        if id_ not in df1.idx.values:
            row = pd.DataFrame({'idx': id_, 'type': type_}, columns=cols, index=[0])
            # DataFrame.append was removed in pandas 2.0; use pd.concat instead
            df1 = pd.concat([df1, row], ignore_index=True)
        # Fill in the value for this id and category
        df1.loc[df1['idx'] == id_, category] = value
add_data('group', d2)
add_data('person', d1)
df1 = df1.fillna(0)
df1 now holds the following values
idx type A B C D E F
0 100 group 10 0 0 8 0 8
1 200 group 0 6 7 0 0 0
2 300 group 8 0 0 0 0 5
3 1 person 2 3 1 0 0 0
4 2 person 0 2 0 1 0 0
5 3 person 0 0 0 0 4 2
6 4 person 0 0 0 3 0 1
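As an alternative to filling the frame row by row, the concat-then-pivot idea from the question can be kept while addressing the mixed id types by casting the group ids to strings first. This is a sketch under that assumption (string ids are acceptable); long_df and wide are illustrative names:

```python
import pandas as pd

d1 = {'person': ['1', '1', '1', '2', '2', '3', '3', '4', '4'],
      'category': ['A', 'B', 'C', 'B', 'D', 'E', 'F', 'F', 'D'],
      'value': [2, 3, 1, 2, 1, 4, 2, 1, 3]}
d2 = {'group': [100, 100, 100, 200, 200, 300, 300],
      'category': ['A', 'D', 'F', 'B', 'C', 'A', 'F'],
      'value': [10, 8, 8, 6, 7, 8, 5]}

df1 = pd.DataFrame(d1).rename(columns={'person': 'id'}).assign(type='person')
df2 = pd.DataFrame(d2).rename(columns={'group': 'id'}).assign(type='group')

# Make both id columns the same type (strings) before stacking them.
df2['id'] = df2['id'].astype(str)

long_df = pd.concat([df2, df1], ignore_index=True)

# Categories missing from one source (e.g. 'E' for groups) are filled with 0.
wide = long_df.pivot_table(index=['type', 'id'], columns='category',
                           values='value', aggfunc='sum', fill_value=0)
wide = wide.reset_index()
wide.columns.name = None
print(wide)
```

Putting type first in the index keeps the group rows ahead of the person rows, matching the layout requested in the question.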
