Pandas count over groups - python

I have a pandas dataframe that looks as follows:
ID round player1 player2
1 1 A B
1 2 A C
1 3 B D
2 1 B C
2 2 C D
2 3 C E
3 1 B C
3 2 C D
3 3 C A
The dataframe contains sports match results, where the ID column denotes one tournament, the round column denotes the round within each tournament, and the player1 and player2 columns contain the names of the players that played against each other in the respective round.
I now want to cumulatively count the tournament participations for, say, player A. In pseudocode this means: if the player named A appears in either the player1 or player2 column for a given tournament ID, increment the counter by 1.
The result should look like this (note: in my example player A did participate in tournaments with the IDs 1 and 3):
ID round player1 player2 playerAparticipated
1 1 A B 1
1 2 A C 1
1 3 B D 1
2 1 B C 0
2 2 C D 0
2 3 C E 0
3 1 B C 2
3 2 C D 2
3 3 C A 2
So far, I have added a "helper" column containing the values 1 or 0, denoting whether the respective player participated in the tournament:
ID round player1 player2 helper
1 1 A B 1
1 2 A C 1
1 3 B D 1
2 1 B C 0
2 2 C D 0
2 3 C E 0
3 1 B C 1
3 2 C D 1
3 3 C A 1
I think I just need one final step, e.g. a smart use of cumsum() that counts the helper column in the desired way, but I have not been able to come up with it yet.
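For reference, such a helper column can be built like this (a sketch, assuming the frame is named df and the player of interest is 'A'):
played = df[['player1', 'player2']].eq('A').any(axis=1)   # A appears in this row
df['helper'] = played.groupby(df['ID']).transform('max').astype(int)   # flag the whole tournament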

I think you need to:
drop_duplicates by column ID first and then set_index,
filter out 0 values by boolean indexing, take the cumsum, and finally reindex to add 0 for the missing index values,
create the new column with map:
df1 = df.drop_duplicates('ID').set_index('ID')
s = df1.loc[df1['helper'] != 0, 'helper'].cumsum().reindex(index=df1.index, fill_value=0)
df['playerAparticipated'] = df['ID'].map(s)
print (df)
ID round player1 player2 helper playerAparticipated
0 1 1 A B 1 1
1 1 2 A C 1 1
2 1 3 B D 1 1
3 2 1 B C 0 0
4 2 2 C D 0 0
5 2 3 C E 0 0
6 3 1 B C 1 2
7 3 2 C D 1 2
8 3 3 C A 1 2
Instead of map it is also possible to use join with rename:
df = df.join(s.rename('playerAparticipated'), on='ID')
print (df)
ID round player1 player2 helper playerAparticipated
0 1 1 A B 1 1
1 1 2 A C 1 1
2 1 3 B D 1 1
3 2 1 B C 0 0
4 2 2 C D 0 0
5 2 3 C E 0 0
6 3 1 B C 1 2
7 3 2 C D 1 2
8 3 3 C A 1 2
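A more compact variant of the same idea (a sketch; it assumes helper is already a 0/1 flag that is constant within each ID):
per_id = df.groupby('ID')['helper'].max()   # 1 if player A played in that tournament
counts = per_id.cumsum() * per_id           # running total, zeroed where A sat out
df['playerAparticipated'] = df['ID'].map(counts)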

A similar approach to @jezrael's that I cooked up a little more slowly :).
First, move ID into your index:
df = df.reset_index().set_index(['index','ID'])
# round player1 player2 helper
# index ID
# 0 1 1 A B 1
# 1 1 2 A C 1
# 2 1 3 B D 1
# 3 2 1 B C 0
# 4 2 2 C D 0
# 5 2 3 C E 0
# 6 3 1 B C 1
# 7 3 2 C D 1
# 8 3 3 C A 1
Next, filter out rows where helper is 0 and get a cumulative sum of tournaments by ID, and assign the result to a variable:
tournament_count = df[df['helper'] > 0].groupby(['ID','helper']).first().reset_index(level=1)['helper'].cumsum().rename("playerAparticipated")
# ID
# 1 1
# 3 2
Finally, join the tournament_count Series with df:
df.join(tournament_count, how="left").fillna(0)
# round player1 player2 helper playerAparticipated
# index ID
# 0 1 1 A B 1 1.0
# 1 1 2 A C 1 1.0
# 2 1 3 B D 1 1.0
# 3 2 1 B C 0 0.0
# 4 2 2 C D 0 0.0
# 5 2 3 C E 0 0.0
# 6 3 1 B C 1 2.0
# 7 3 2 C D 1 2.0
# 8 3 3 C A 1 2.0
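Note that the left join introduces NaN for the tournaments player A skipped, which is why the joined column prints as float (1.0, 2.0); if integers are wanted, cast after filling (a small sketch):
df = df.join(tournament_count, how="left").fillna(0)
df['playerAparticipated'] = df['playerAparticipated'].astype(int)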

Related

Use groupby and merge to create new column in pandas

So I have a pandas dataframe that looks something like this.
name is_something
0 a 0
1 b 1
2 c 0
3 c 1
4 a 1
5 b 0
6 a 1
7 c 0
8 a 1
Is there a way to use groupby and merge to create a new column that gives the number of times a name appears with an is_something value of 1 in the whole dataframe? The updated dataframe would look like this:
name is_something no_of_times_is_something_is_1
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
I know you can just loop through the dataframe to do this but I'm looking for a more efficient way because the dataset I'm working with is quite large. Thanks in advance!
If the is_something column contains only 0 and 1 values, simply use sum with GroupBy.transform, which fills the new column with the aggregated values:
df['new'] = df.groupby('name')['is_something'].transform('sum')
print (df)
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
If multiple values are possible, first compare with 1, convert to integer and then use transform with sum:
df['new'] = df['is_something'].eq(1).view('i1').groupby(df['name']).transform('sum')
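Series.view is deprecated in recent pandas releases; astype is an equivalent, more portable spelling (a sketch):
df['new'] = df['is_something'].eq(1).astype('int8').groupby(df['name']).transform('sum')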
Or we can just map it:
df['New']=df.name.map(df.query('is_something ==1').groupby('name')['is_something'].sum())
df
name is_something New
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
You could do:
df['new'] = df.groupby('name')['is_something'].transform(lambda xs: xs.eq(1).sum())
print(df)
Output
name is_something new
0 a 0 3
1 b 1 1
2 c 0 1
3 c 1 1
4 a 1 3
5 b 0 1
6 a 1 3
7 c 0 1
8 a 1 3
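None of the answers actually use merge, which the question asked about; for completeness, here is a merge-based sketch (assuming the column names above):
counts = (df[df['is_something'].eq(1)]
          .groupby('name', as_index=False)['is_something'].sum()
          .rename(columns={'is_something': 'no_of_times_is_something_is_1'}))
df = df.merge(counts, on='name', how='left')
df['no_of_times_is_something_is_1'] = df['no_of_times_is_something_is_1'].fillna(0).astype(int)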

Determine reverse order of data given X/Y coordinates

Imagine an electrical connector. It has pins, and each pin has a corresponding X/Y location in space. I am trying to figure out how to mirror, or 'flip', each pin on the connector given its X/Y coordinates. Note: I am using pandas version 0.23.4. We can assume that x, y, and pin are not unique but connector is. Connectors can be any size, e.g. two rows of 5, three rows of 6, etc.
x y pin connector
1 1 A 1
2 1 B 1
3 1 C 1
1 2 D 1
2 2 E 1
3 2 F 1
1 1 A 2
2 1 B 2
3 1 C 2
1 2 D 2
2 2 E 2
3 2 F 2
The dataframe column 'flip' is the result I am trying to get to. Notice that the pins within the same row are now in reverse order.
x y pin flip connector
1 1 A C 1
2 1 B B 1
3 1 C A 1
1 2 D F 1
2 2 E E 1
3 2 F D 1
1 1 A C 2
2 1 B B 2
3 1 C A 2
1 2 D F 2
2 2 E E 2
3 2 F D 2
IIUC, try using [::-1] to reverse each group, together with groupby and transform:
df['flip'] = df.groupby(['connector','y'])['pin'].transform(lambda x: x[::-1])
Output:
x y pin connector flip
0 1 1 A 1 C
1 2 1 B 1 B
2 3 1 C 1 A
3 1 2 D 1 F
4 2 2 E 1 E
5 3 2 F 1 D
6 1 1 A 2 C
7 2 1 B 2 B
8 3 1 C 2 A
9 1 2 D 2 F
10 2 2 E 2 E
11 3 2 F 2 D
import io
import pandas as pd
data = """
x y pin connector
1 1 A 1
2 1 B 1
3 1 C 1
1 2 D 1
2 2 E 1
3 2 F 1
1 1 A 2
2 1 B 2
3 1 C 2
1 2 D 2
2 2 E 2
3 2 F 2
"""
#strip blank lines at the beginning and end
data = data.strip()
#make it quack like a file
data_file = io.StringIO(data)
#read data from a "wsv" file (whitespace separated values)
df = pd.read_csv(data_file, sep=r'\s+')
Make the new column:
flipped = []
for name, group in df.groupby(['connector','y']):
    flipped.extend(group.loc[::-1,'pin'])
df = df.assign(flip=flipped)
df
Final DataFrame:
x y pin connector flip
0 1 1 A 1 C
1 2 1 B 1 B
2 3 1 C 1 A
3 1 2 D 1 F
4 2 2 E 1 E
5 3 2 F 1 D
6 1 1 A 2 C
7 2 1 B 2 B
8 3 1 C 2 A
9 1 2 D 2 F
10 2 2 E 2 E
11 3 2 F 2 D
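One caveat: the loop appends the groups in sorted key order and assign matches by position, so it assumes df is already sorted by connector and y (true for the sample data). An index-aligned variant (a sketch) removes that assumption:
flipped = pd.concat(
    group.loc[::-1, 'pin'].set_axis(group.index)   # reversed values, original labels
    for _, group in df.groupby(['connector', 'y'])
).sort_index()
df = df.assign(flip=flipped)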
You can create a map between the original coordinates and the coordinates of the 'flipped' component. Then you can select the flipped values.
import numpy as np
midpoint = 2
coordinates_of_flipped = pd.MultiIndex.from_arrays([df['x'].map(lambda x: x - midpoint * np.sign(x - midpoint )), df['y'], df['connector']])
df['flipped'] = df.set_index(['x', 'y', 'connector']).loc[coordinates_of_flipped].reset_index()['pin']
which gives
Out[30]:
x y pin connector flipped
0 1 1 A 1 C
1 2 1 B 1 B
2 3 1 C 1 A
3 1 2 D 1 F
4 2 2 E 1 E
5 3 2 F 1 D
6 1 1 A 2 C
7 2 1 B 2 B
8 3 1 C 2 A
9 1 2 D 2 F
10 2 2 E 2 E
11 3 2 F 2 D
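The hard-coded midpoint = 2 only fits a three-column grid; the midpoint can instead be derived per connector (a sketch, assuming the pins form a regular grid so every reflected x exists, and that df has the default RangeIndex):
import pandas as pd

# midpoint of the x-range, computed per connector
mid = df.groupby('connector')['x'].transform(lambda s: (s.min() + s.max()) / 2)
flipped_x = (2 * mid - df['x']).astype(df['x'].dtype)   # reflect x about the midpoint
idx = pd.MultiIndex.from_arrays([flipped_x, df['y'], df['connector']])
df['flipped'] = df.set_index(['x', 'y', 'connector']).loc[idx].reset_index()['pin']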

Counting preceding entries of a column and creating a new variable of these counts

I have a data frame and I want to count the number of consecutive entries of one column and record the counts in a separate variable. Here is an example:
ID Class
1 A
1 A
2 A
1 B
1 B
1 B
2 B
1 C
1 C
2 A
2 A
2 A
Within each ID group, I want to count the consecutive occurrences of each class, so the output would look like this:
ID Class Counts
1 A 0
1 A 1
2 A 0
1 B 0
1 B 1
1 B 2
2 B 0
1 C 0
1 C 1
2 A 0
2 A 1
2 A 2
I am not looking for the frequency of occurrence of specific entries as here, but rather for the consecutive occurrences of an entry at the ID level.
You can use cumcount over groups formed by taking the cumsum of comparing the concatenated values with their shifted version:
#use a separator which does not appear in the data, like _ or ¥
s = df['ID'].astype(str) + '¥' + df['Class']
df['Counts'] = df.groupby(s.ne(s.shift()).cumsum()).cumcount()
print (df)
ID Class Counts
0 1 A 0
1 1 A 1
2 2 A 0
3 1 B 0
4 1 B 1
5 1 B 2
6 2 B 0
7 1 C 0
8 1 C 1
9 2 A 0
10 2 A 1
11 2 A 2
Another solution with ngroup (pandas 0.20.2+):
s = df.groupby(['ID','Class']).ngroup()
df['Counts'] = df.groupby(s.ne(s.shift()).cumsum()).cumcount()
print (df)
ID Class Counts
0 1 A 0
1 1 A 1
2 2 A 0
3 1 B 0
4 1 B 1
5 1 B 2
6 2 B 0
7 1 C 0
8 1 C 1
9 2 A 0
10 2 A 1
11 2 A 2
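For completeness, the same grouping key can be built without string concatenation or ngroup, by comparing both columns with their shifted values directly (a sketch):
change = df[['ID', 'Class']].ne(df[['ID', 'Class']].shift()).any(axis=1)
df['Counts'] = df.groupby(change.cumsum()).cumcount()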

Return entries with common column values in pandas DataFrame - python

I have a DataFrame in python pandas which contains several different entries (rows) with integer values in the columns, for example:
A B C D E F G H
0 1 2 1 0 1 2 1 2
1 0 1 1 1 1 2 1 2
2 1 2 1 2 1 2 1 3
3 0 1 1 1 1 2 1 2
4 2 2 1 2 1 2 1 3
I would like to return just the rows that share the same values in all columns with another row; the result should be:
A B C D E F G H
1 0 1 1 1 1 2 1 2
3 0 1 1 1 1 2 1 2
Thanks in advance
You can use the boolean mask from duplicated passing param keep=False:
In [3]:
df[df.duplicated(keep=False)]
Out[3]:
A B C D E F G H
1 0 1 1 1 1 2 1 2
3 0 1 1 1 1 2 1 2
Here is the mask showing the rows that are duplicates; passing keep=False flags all duplicate rows, whereas the default would flag every occurrence except the first:
In [4]:
df.duplicated(keep=False)
Out[4]:
0 False
1 True
2 False
3 True
4 False
dtype: bool
You need duplicated with the parameter keep=False to return all duplicates, combined with boolean indexing:
print (df.duplicated(keep=False))
0 False
1 True
2 False
3 True
4 False
dtype: bool
df = df[df.duplicated(keep=False)]
print (df)
A B C D E F G H
1 0 1 1 1 1 2 1 2
3 0 1 1 1 1 2 1 2
Also, if you need to drop the first or the last occurrence of the duplicated rows, use:
df1 = df[df.duplicated()]
#keep='first' is the default parameter, so it can be omitted
#df1 = df[df.duplicated(keep='first')]
print (df1)
A B C D E F G H
3 0 1 1 1 1 2 1 2
df2 = df[df.duplicated(keep='last')]
print (df2)
A B C D E F G H
1 0 1 1 1 1 2 1 2
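If 'common values' should be judged on a subset of columns rather than on whole rows, duplicated also accepts a subset parameter (a sketch; columns A and B are just an example):
df[df.duplicated(subset=['A', 'B'], keep=False)]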

cumulative number of unique elements for pandas dataframe

I have a pandas data frame:
id tag
1 A
1 A
1 B
1 C
1 A
2 B
2 C
2 B
I want to add a column which computes the cumulative number of unique tags at the id level. More specifically, I would like to have:
id tag count
1 A 1
1 A 1
1 B 2
1 C 3
1 A 3
2 B 1
2 C 2
2 B 2
For a given id, count will be non-decreasing. Thanks for your help!
I think this does what you want:
unique_count = df.drop_duplicates().groupby('id').cumcount() + 1
df['count'] = unique_count.reindex(df.index).ffill().astype(int)
The +1 is there because cumcount starts at zero, and astype(int) restores the integer dtype after reindex introduces NaN. This only works if the dataframe is sorted by id. Was that intended? You can always sort beforehand.
You can find some other approaches in R and Python here
df = pd.DataFrame({'id':[1,1,1,1,1,2,2,2],'tag':["A","A", "B","C","A","B","C","B"]})
df['count']=df.groupby('id')['tag'].apply(lambda x: (~pd.Series(x).duplicated()).cumsum())
id tag count
0 1 A 1
1 1 A 1
2 1 B 2
3 1 C 3
4 1 A 3
5 2 B 1
6 2 C 2
7 2 B 2
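A vectorized variant without apply: mark the first occurrence of each (id, tag) pair and take the cumulative sum per id (a sketch):
df['count'] = (~df.duplicated(['id', 'tag'])).astype(int).groupby(df['id']).cumsum()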
How about this:
d['X'] = 1
d.groupby("Col").X.cumsum()
Note that this counts every occurrence per group, so duplicate (id, tag) rows would need to be dropped first to get unique counts.
idt=[1,1,1,1,1,2,2,2]
tag=['A','A','B','C','A','B','C','B']
df=pd.DataFrame(tag,index=idt,columns=['tag'])
df=df.reset_index()
print(df)
index tag
0 1 A
1 1 A
2 1 B
3 1 C
4 1 A
5 2 B
6 2 C
7 2 B
df['uCnt']=df.groupby(['index','tag']).cumcount()+1
print(df)
index tag uCnt
0 1 A 1
1 1 A 2
2 1 B 1
3 1 C 1
4 1 A 3
5 2 B 1
6 2 C 1
7 2 B 2
#integer division by the square maps any count greater than 1 to 0 and keeps 1 as 1
df['uCnt']=df['uCnt']//df['uCnt']**2
print(df)
index tag uCnt
0 1 A 1
1 1 A 0
2 1 B 1
3 1 C 1
4 1 A 0
5 2 B 1
6 2 C 1
7 2 B 0
df['uCnt']=df.groupby(['index'])['uCnt'].cumsum()
print(df)
index tag uCnt
0 1 A 1
1 1 A 1
2 1 B 2
3 1 C 3
4 1 A 3
5 2 B 1
6 2 C 2
7 2 B 2
df=df.set_index('index')
print(df)
tag uCnt
index
1 A 1
1 A 1
1 B 2
1 C 3
1 A 3
2 B 1
2 C 2
2 B 2
