Group by and Count distinct words in Pandas DataFrame

Group by and Count distinct words in Pandas DataFrame - python

By year and name, I am hoping to count the occurrence of words in a dataframe from imported from Excel which results will also be exported to Excel.
This is the sample code:
source = pd.DataFrame({'Name' : ['John', 'Mike', 'John','John'],
'Year' : ['1999', '2000', '2000','2000'],
'Message' : [
'I Love You','Will Remember You','Love','I Love You]})
Excepted results are the following in a dataframe. Any ideas?
Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1

I think you can first split column Message, create Serie and add it to original source. Last groupby with size:
#split column Message to new df, create Serie by stack
s = (source.Message.str.split(expand=True).stack())
#remove multiindex
s.index = s.index.droplevel(-1)
s.name= 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
#remove old column Message
source = source.drop(['Message'], axis=1)
#join Serie s to df source
df = (source.join(s))
#aggregate size
print (df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1

Related

Create categorical column in python from string values

I have a pandas dataframe that includes a "Name" column. Strings in the Name column may contain "Joe", "Bob", or "Joe Bob". I want to add a column for the type of person: just Joe, just Bob, or Both.
I was able to do this by creating boolean columns, turning them into strings, combining the strings, and then replacing the values. It just...didn't feel very elegant! I am new to Python...is there a better way to do this?
My original dataframe:
df = pd.DataFrame(data= [['Joe Biden'],['Bobby Kennedy'],['Joe Bob Briggs']], columns = ['Name'])
0
Name
1
Joe Biden
2
Bobby Kennedy
3
Joe Bob Briggs
I added two boolean columns to find names:
df['Joe'] = df.Name.str.contains('Joe')
df['Joe'] = df.Joe.astype('int')
df['Bob'] = df.Name.str.contains('Bob')
df['Bob'] = df.Bob.astype('int')
Now my dataframe looks like this:
df = pd.DataFrame(data= [['Joe Biden',1,0],['Bobby Kennedy',0,1],['Joe Bob Briggs',1,1]], columns = ['Name','Joe', 'Bob'])
0
Name
Joe
Bob
1
Joe Biden
1
0
2
Bobby Kennedy
0
1
3
Joe Bob Briggs
1
1
But what I really want is one "Type" column with categorical values: Joe, Bob, or Both.
To do that, I added a column to combine the booleans, then I replaced the values:
df["Type"] = df["Joe"].astype(str) + df["Bob"].astype(str)
0
Name
Joe
Bob
Type
1
Joe Biden
1
0
10
2
Bobby Kennedy
0
1
1
3
Joe Bob Briggs
1
1
11
df['Type'] = df.Type.astype('str') df['Type'].replace({'11': 'Both', '10': 'Joe','1': 'Bob'}, inplace=True)
0
Name
Joe
Bob
Type
1
Joe Biden
1
0
Joe
2
Bobby Kennedy
0
1
Bob
3
Joe Bob Briggs
1
1
Both
This feels clunky. Anyone have a better way?
Thanks!

You can use np.select to create the column Type.
You need to ordered correctly your condlist from the most precise to the widest.
df['Type'] = np.select([df['Name'].str.contains('Joe') & df['Name'].str.contains('Bob'),
df['Name'].str.contains('Joe'),
df['Name'].str.contains('Bob')],
choicelist=['Both', 'Joe', 'Bob'])
Output:
>>> df
Name Type
0 Joe Biden Joe
1 Bobby Kennedy Bob
2 Joe Bob Briggs Both

I am trying to groupby values in a specific column that holds multiple values?

I have this huge netflix dataset which I am trying to see which actors appeared in the most movies/tv shows specifically in America. First, I created a list of unique actors from the dataset. Then created a nested for loop to loop through each name in list3(containing unique actors which checked each row in df3(filtered dataset with 2000+rows) if the column cast contained the current actors name from list3. I believe using iterrows takes too long
myDict1 = {}
for name in list3:
if name not in myDict1:
myDict1[name] = 0
for index, row in df3.iterrows():
if name in row["cast"]:
myDict1[name] += 1
myDict1
Title
cast
Movie1
Robert De Niro, Al Pacino, Tarantino
Movie2
Tom Hanks, Robert De Niro, Tom Cruise
Movie3
Tom Cruise, Zendaya, Seth Rogen
I want my output to be like this:
Name
Count
Robert De Niro
2
Tom Cruise
2

Use
out = df['cast'].str.split(', ').explode().value_counts()
out = pd.DataFrame({'Name': out.index, 'Count': out.values})
>>> out
Name Count
0 Tom Cruise 2
1 Robert De Niro 2
2 Zendaya 1
3 Seth Rogen 1
4 Tarantino 1
5 Al Pacino 1
6 Tom Hanks 1

l=['Robert De Niro','Tom Cruise']#list
df=df.assign(cast=df['cast'].str.split(',')).apply(pd.Series.explode)#convert cast into list and explode
df[df['cast'].str.contains("|".join(l))].groupby('cast').size().reset_index().rename(columns={'cast':'Name',0:'Count'})#groupby cast, find size and rename columns
Name Count
0 Robert De Niro 2
1 Tom Cruise 2

You could use collections.Counter to get the counts of the actors, after splitting the strings:
from collections import Counter
pd.DataFrame(Counter(df.cast.str.split(", ").sum()).items(),
columns = ['Name', 'Count'])
Name Count
0 Robert De Niro 2
1 Al Pacino 1
2 Tarantino 1
3 Tom Hanks 1
4 Tom Cruise 2
5 Zendaya 1
6 Seth Rogen 1
If you are keen about speed, and you have lots of data, you could dump the entire processing within plain python and rebuild the dataframe:
from itertools import chain
pd.DataFrame(Counter(chain.from_iterable(ent.split(", ")
for ent in df.cast)).items(),
columns = ['Name', 'Count'])

Looking back at previous row in data-frame and selecting specific records

I have a dataframe df which looks like:
name year dept metric
0 Steve Jones 2018 A 0.703300236
1 Steve Jones 2019 A 0.255587222
2 Jane Smith 2018 A 0.502505934
3 Jane Smith 2019 B 0.698808749
4 Barry Evans 2019 B 0.941325241
5 Tony Edwards 2017 B 0.880940126
6 Tony Edwards 2018 B 0.649086123
7 Tony Edwards 2019 A 0.881365905
I would like to create 2 new data-frame which contains the records where someone has moved from dept A to B and and another where someone has moved from dept B to A. Therefore my desired output is:
name year dept metric
0 Jane Smith 2018 A 0.502505934
1 Tony Edwards 2019 B 0.649086123
name year dept metric
0 Jane Smith 2019 B 0.698808749
1 Tony Edwards 2018 B 0.881365905
Where records for the year the last year that someone is in their old dept are captured in one data-frame and the first year in the new dept are captured in another only. The records are sorted by name and year so will be in the correct order.
I've tried :
for row in agg_data.rows:
df['match'] = np.where(df.dept == 'A' and df.dept.shift() =='B','1')
df['match'] = np.where(df.dept == 'B' and df.dept.shift() =='A','2')
and then select out the records into a data-frame but I get it to work.

I believe you need:
df = df[df.groupby('name')['dept'].transform('nunique') > 1]
df = df.drop_duplicates(['name','dept'], keep='last')
df1 = df.drop_duplicates('name')
print (df1)
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2 = df.drop_duplicates('name', keep='last')
print (df2)
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366

You could join the initial dataframe with a shift of itself to have convecutive rows on same line. Then you ask the departments you want requiring the names to be the same and you get the indices of one of the expected rows, the other row just has an adjacent index. It gives:
df = agg_data.join(agg_data.shift(), rsuffix='_old')
df1 = df[(df.name_old==df.name)&(df.dept_old=='A')&(df.dept=='B')]
print(pd.concat([agg_data.loc[df1.index], agg_data.loc[df1.index-1]]
).sort_index())
df2 = df[(df.name_old==df.name)&(df.dept_old=='B')&(df.dept=='A')]
print(pd.concat([agg_data.loc[df2.index], agg_data.loc[df2.index-1]]
).sort_index())
with following output:
name year dept metric
2 Jane Smith 2018 A 0.502506
3 Jane Smith 2019 B 0.698809
name year dept metric
6 Tony Edwards 2018 B 0.649086
7 Tony Edwards 2019 A 0.881366

I come up with a solution using drop_duplicates, groupby and rank. Creating df2 on rank=2 and creating df1 on rank==1 and name exists in df2
df['rk'] = df.sort_values(['name', 'dept', 'year']).drop_duplicates(['name', 'dept'], keep='last').groupby('name').year.rank()
df2 = df[df.rk.eq(2)].drop('rk', 1)
df1 = df[df.rk.eq(1) & df.name.isin(df2.name)].drop('rk', 1)
df1:
name year dept metric
2 Jane Smith 2018 A 0.502506
6 Tony Edwards 2018 B 0.649086
df2:
name year dept metric
3 Jane Smith 2019 B 0.698809
7 Tony Edwards 2019 A 0.881366

pandas: Group by splitting string value in all rows (a column) and aggregation function

If i have dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum that salary column then we will get 316000
I really want to know how much person who named 'alexander, smith, etc' (in distinct) makes in salary if we sum all of the salaries from its splitting name in this dataset (that contains same string value).
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
as we see the sum of sum_salary columns is not the same as the initial dataset. all because the function requires double counting.
I thought it seems familiar like string count, but what makes me confuse is the way we use aggregation function. I've tried creating a new list of distinct value in person_name columns, then stuck comes.
Any help is appreciated, Thank you very much

Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
First idea is use defaultdict for store sumed values in loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
for x in p:
d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution with repeating values by length of lists and aggregate sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000

Another sol:
df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000

Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64

I parsed this as strings of lists, by copying OP's data and using pandas.read_clipboard(). In case this was indeed the case (a series of strings of lists), this solution would work:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')
# Some cleaning up, then a simple groupby
df.value = df.value.str.replace('[', '')
df.value = df.value.str.replace(']', '')
df.value = df.value.str.replace(' ', '')
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000

Another way you can do this is with iterrows(). This will not be as fast jezraels solution. But it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
for name in row['person_name']:
ids.append(row['id'])
names.append(name)
salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000

Counting the occurrences of a substring from one column within another column

I have two dataframes I am working with, one which contains a list of players and another that contains play by play data for the players from the other dataframe. Portions of the rows of interest within these two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!

I think need findall by regex with join all values of Name, then create indicator columns by MultiLabelBinarizer and add all missing columns by reindex:
s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
columns=mlb.classes_,
index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Last if necessary join to df1:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Group by and Count distinct words in Pandas DataFrame - python

Related

Create categorical column in python from string values

I am trying to groupby values in a specific column that holds multiple values?

Looking back at previous row in data-frame and selecting specific records

pandas: Group by splitting string value in all rows (a column) and aggregation function

Counting the occurrences of a substring from one column within another column

Categories

Resources