Pandas: Cumulative count from two columns - python

winner loser winner_matches loser_matches
Dave Harry 1 1
Jim Dave 1 2
Dave Steve 3 1
I'm trying to build a running count of how many matches a player has participated in, based on their name's appearance in either the winner or loser column (i.e., Dave above has a running count of 3 since he's been in every match). I'm new to pandas and have tried a few combinations of cumcount and groupby, but I'm not sure if I just need to manually loop over the dataset and store all the names myself.
EDIT: to clarify, I need the running totals in the dataframe as shown above and not just a Series printed out later on! Thanks

First create a MultiIndex Series with DataFrame.stack, then number each name's occurrences with GroupBy.cumcount; to get back a DataFrame, add unstack with add_suffix:
print (df)
winner loser
0 Dave Harry
1 Jim Dave
2 Dave Steve
s = df.stack()
#if multiple columns in original df
#s = df[['winner','loser']].stack()
df1 = s.groupby(s).cumcount().add(1).unstack().add_suffix('_matches')
print (df1)
winner_matches loser_matches
0 1 1
1 1 2
2 3 1
Finally, append the result to the original DataFrame with join:
df = df.join(df1)
print (df)
winner loser winner_matches loser_matches
0 Dave Harry 1 1
1 Jim Dave 1 2
2 Dave Steve 3 1
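As a cross-check on the stack/cumcount approach, the manual loop mentioned in the question can be sketched as follows. This is only an illustrative sketch: the helper name running_match_counts is hypothetical (not part of pandas), and it assumes the columns are named winner and loser as in the example.

```python
from collections import defaultdict

import pandas as pd


def running_match_counts(df):
    """Hypothetical helper: running per-player match counts.

    Assumes the frame has 'winner' and 'loser' columns.
    """
    seen = defaultdict(int)  # player name -> matches played so far
    winner_counts, loser_counts = [], []
    for winner, loser in zip(df["winner"], df["loser"]):
        # Count the winner first, then the loser (same order as stack())
        seen[winner] += 1
        winner_counts.append(seen[winner])
        seen[loser] += 1
        loser_counts.append(seen[loser])
    return df.assign(winner_matches=winner_counts, loser_matches=loser_counts)


df = pd.DataFrame({"winner": ["Dave", "Jim", "Dave"],
                   "loser": ["Harry", "Dave", "Steve"]})
looped = running_match_counts(df)

# Vectorised version from the answer, for comparison
s = df[["winner", "loser"]].stack()
vectorised = df.join(
    s.groupby(s).cumcount().add(1).unstack().add_suffix("_matches")
)
```

Both frames should hold the same counts on this sample; the loop is mainly useful for checking results, while the vectorised version avoids Python-level iteration.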

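If you only need the total number of matches per player (not the running count), flatten the two columns and use value_counts: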
pd.Series(df[['winner','loser']].values.flatten()).value_counts()
[out]
Dave 3
Jim 1
Harry 1
Steve 1

Related

How to count the amount of words said by someone pandas dataframe

I have a dataframe like this and I'm trying to count the words said by a specific author.
Author Text Date
Jake hey hey my names Jake 1.04.1997
Mac hey my names Mac 1.02.2019
Sarah heymy names Sarah 5.07.2001
I've been trying to set it up so that if I were to search for the word "hey" it would produce
Author Count
Jake 2
Mac 1
Use Series.str.count with aggregate sum:
df1 = df['Text'].str.count('hey').groupby(df['Author']).sum().reset_index(name='Count')
print (df1)
Author Count
0 Jake 2
1 Mac 1
2 Sarah 1
If you need to filter out rows with 0 values, add boolean indexing:
s = df['Text'].str.count('hey')
df1 = s[ s.gt(0)].groupby(df['Author']).sum().reset_index(name='Count')
print (df1)
Author Count
0 Jake 2
1 Mac 1
2 Sarah 1
EDIT: to match hey only as a separate word, wrap it in word boundaries (\b...\b) like:
df1 = df['Text'].str.count(r'\bhey\b').groupby(df['Author']).sum().reset_index(name='Count')
print (df1)
Author Count
0 Jake 2
1 Mac 1
2 Sarah 0
s = df['Text'].str.count(r'\bhey\b')
df1 = s[ s.gt(0)].groupby(df['Author']).sum().reset_index(name='Count')
print (df1)
Author Count
0 Jake 2
1 Mac 1
If df is your original dataframe, you can also count per row (note that this does not sum the counts per author):
newDF = pd.DataFrame(columns=['Author','Count'])
newDF['Author'] = df['Author']
newDF['Count'] = df['Text'].str.count("hey")
newDF.drop(newDF[newDF['Count'] == 0].index, inplace=True)
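To make the substring versus whole-word distinction concrete, here is a small self-contained sketch on the question's sample data. The helper count_word is a hypothetical name introduced for illustration, and re.escape is used on the assumption that the search term should be matched literally rather than interpreted as a regular expression.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Author": ["Jake", "Mac", "Sarah"],
    "Text": ["hey hey my names Jake", "hey my names Mac", "heymy names Sarah"],
})


def count_word(df, word, whole_word=True):
    """Hypothetical helper: occurrences of `word` per author, zero rows dropped."""
    pattern = re.escape(word)  # treat the search term literally
    if whole_word:
        pattern = rf"\b{pattern}\b"  # only match the term as a separate word
    counts = df["Text"].str.count(pattern).groupby(df["Author"]).sum()
    return counts[counts.gt(0)].reset_index(name="Count")


whole = count_word(df, "hey")                        # "heymy" is not counted
substring = count_word(df, "hey", whole_word=False)  # "heymy" is counted
```

With whole_word=True, Sarah drops out because "heymy" contains hey only as a substring; with whole_word=False she is kept with a count of 1.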

Pandas - dense rank but keep current group numbers

I'm dealing with pandas dataframe and have a frame like:
data = {
"name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
"id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 0
4 Mary 0
5 Andrew 0
6 Michael 2
I'm trying to write code to group values by "name" column. However, I want to keep the current group numbers.
If the value is 0, it means that there is no assignment.
For the example above, assign a value of 3 for each occurrence of Andrew and a value of 1 for each occurrence of James. For Mary, there is no assignment so assign next/unique number.
The expected output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = ( df['name'].rank(method='dense').astype(int))
The issue with the above is that it ignores records equal to 0, so the numbers are incorrect. I removed that part (values equal to 0), but then the numbering is not preserved.
Can you please help me?
Replace the 0 values with missing values, so that GroupBy.transform with first fills each group with its existing id. Then fill the remaining missing values using Series.rank plus the maximal id, and convert to integers:
import numpy as np
df = df.replace({'id':{0:np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')
s = df.loc[df["id"].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print (df)
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
IIUC you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID, then complete the names without ID to the next available ID on the masked data (you can use factorize or rank as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0]+df['id'].max()+1
# or rank, although factorize is more appropriate for non numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense')+df['id'].max()
# optional, if you want integers
df['id']= df['id'].convert_dtypes()
output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
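The factorize-based approach can be wrapped into a small reusable function, sketched below. The name fill_group_ids is hypothetical, and the sketch assumes that 0 always means "no assignment", that ids are non-negative, and that each name has at most one distinct non-zero id.

```python
import pandas as pd


def fill_group_ids(df):
    """Hypothetical helper: keep existing ids per name, number the rest.

    Assumes 0 means "unassigned" and at most one non-zero id per name.
    """
    # Propagate each name's existing (non-zero) id to all of its rows
    existing = df.groupby("name")["id"].transform("max")
    unassigned = existing.eq(0)
    # Number the unassigned names 0, 1, 2, ... in order of first appearance
    codes, _ = pd.factorize(df.loc[unassigned, "name"])
    ids = existing.copy()
    # Continue numbering after the largest id already in use
    ids.loc[unassigned] = codes + existing.max() + 1
    return df.assign(id=ids)


df = pd.DataFrame({
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2],
})
result = fill_group_ids(df)

# Made-up second example with two different unassigned names
other = fill_group_ids(pd.DataFrame({
    "name": ["Mary", "Zoe", "Mary", "Andrew"],
    "id": [0, 0, 0, 3],
}))
```

Because the ids stay integers throughout, no float conversion or final cast is needed in this variant.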

Python merging data frames and renaming column values

In python, I have a df that looks like this
Name ID
Anna 1
Sarah 2
Max 3
And a df that looks like this
Name ID
Dan 1
Hallie 2
Cam 3
How can I merge the df’s so that the ID column looks like this
Name ID
Anna 1
Sarah 2
Max 3
Dan 4
Hallie 5
Cam 6
This is just a minimal reproducible example. My actual data set has 1000’s of values. I’m basically merging data frames and want the ID’s in numerical order (continuation of previous data frame) instead of repeating from one each time.
Use pd.concat:
out = pd.concat([df1, df2.assign(ID=df2['ID'] + df1['ID'].max())], ignore_index=True)
print(out)
# Output
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
Concatenate the two DataFrames, reset the index, and use the new index to assign the "ID"s:
df_new = pd.concat((df1, df2)).reset_index(drop=True)
df_new['ID'] = df_new.index + 1
Output:
Name ID
0 Anna 1
1 Sarah 2
2 Max 3
3 Dan 4
4 Hallie 5
5 Cam 6
You can concat dataframes with ignore_index=True and then set ID column:
df = pd.concat([df1, df2], ignore_index=True)
df['ID'] = df.index + 1
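The answers above give the same result on the sample data, but they are not equivalent in general. The sketch below runs both on the question's frames and then on a made-up second frame whose IDs contain a gap (df2_gap is hypothetical data added for illustration), where the offset approach preserves the gap and the renumbering approach does not.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Anna", "Sarah", "Max"], "ID": [1, 2, 3]})
df2 = pd.DataFrame({"Name": ["Dan", "Hallie", "Cam"], "ID": [1, 2, 3]})

# Approach 1: shift df2's IDs by the largest ID in df1
offset = pd.concat(
    [df1, df2.assign(ID=df2["ID"] + df1["ID"].max())], ignore_index=True
)

# Approach 2: discard the old IDs and renumber from the new index
renumbered = pd.concat([df1, df2], ignore_index=True)
renumbered["ID"] = renumbered.index + 1

# Made-up frame with a gap in its IDs: the two approaches now differ
df2_gap = pd.DataFrame({"Name": ["Dan", "Cam"], "ID": [1, 3]})
offset_gap = pd.concat(
    [df1, df2_gap.assign(ID=df2_gap["ID"] + df1["ID"].max())], ignore_index=True
)
renumbered_gap = pd.concat([df1, df2_gap], ignore_index=True)
renumbered_gap["ID"] = renumbered_gap.index + 1
```

Which one is appropriate depends on whether the original IDs carry meaning that should be kept relative to each other, or whether a plain 1..n sequence is all that is needed.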

How to create a new column based on information in another column?

I'm trying to create a new column in a pandas dataframe. I have names in one column, and I want to assign numbers to them in a new column. If a name is repeated consecutively, the rows get the same number; if it appears again after different names, it should get a new number.
For example, my df is like
Name/
Stephen
Stephen
Mike
Carla
Carla
Stephen
my new column should be
Numbers/
0
0
1
2
2
3
Sorry, I couldn't paste my dataframe here.
Try:
df['Numbers'] = (df['Name'] != df['Name'].shift()).cumsum() - 1
Output:
Name Numbers
0 Stephen 0
1 Stephen 0
2 Mike 1
3 Carla 2
4 Carla 2
5 Stephen 3
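The one-liner can be unpacked into its intermediate steps, as in the sketch below. The variable names starts_new_run and run_number are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Stephen", "Stephen", "Mike", "Carla", "Carla", "Stephen"]}
)

# True wherever the name differs from the previous row
# (the first row is compared with NaN, so it is always True)
starts_new_run = df["Name"].ne(df["Name"].shift())

# The cumulative sum of the flags numbers the runs 1, 2, 3, ...;
# subtracting 1 makes the numbering start at 0
run_number = starts_new_run.cumsum() - 1

df["Numbers"] = run_number
```

The last Stephen gets a new number because only the immediately preceding row is compared, not earlier occurrences of the same name.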

Python/Pandas - Duplicating an index in a new column in a Pandas DataFrame

I have a DataFrame with indexes 1,2,3.
Name
1 Rob
2 Mark
3 Alex
I want to duplicate that index in a new column so it gets like this:
Name Number
1 Rob 1
2 Mark 2
3 Alex 3
Any ideas?
EDIT
I forgot one important part: the items in the Number column should be converted to strings.
You can try:
df['Number'] = df.index.astype(str)
Name Number
1 Rob 1
2 Mark 2
3 Alex 3
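A quick self-contained check of this, rebuilding the question's frame by hand (the construction of df here is an assumption about how the original frame was created):

```python
import pandas as pd

# Rebuild the example frame with an integer index 1, 2, 3
df = pd.DataFrame({"Name": ["Rob", "Mark", "Alex"]}, index=[1, 2, 3])

# Copy the index into a new column, converting each label to a string
df["Number"] = df.index.astype(str)

# Every value in the new column is a Python str, while the index stays integer
all_strings = all(isinstance(value, str) for value in df["Number"])
```

The printed frame looks the same as with integers, so checking the element types as above is one way to confirm the conversion.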
