Counting the occurrences of a substring from one column within another column - python

I have two dataframes I am working with, one which contains a list of players and another that contains play by play data for the players from the other dataframe. Portions of the rows of interest within these two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!

I think need findall by regex with join all values of Name, then create indicator columns by MultiLabelBinarizer and add all missing columns by reindex:
s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
columns=mlb.classes_,
index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Last if necessary join to df1:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0

Related

Create categorical column in python from string values

I have a pandas dataframe that includes a "Name" column. Strings in the Name column may contain "Joe", "Bob", or "Joe Bob". I want to add a column for the type of person: just Joe, just Bob, or Both.
I was able to do this by creating boolean columns, turning them into strings, combining the strings, and then replacing the values. It just...didn't feel very elegant! I am new to Python...is there a better way to do this?
My original dataframe:
df = pd.DataFrame(data= [['Joe Biden'],['Bobby Kennedy'],['Joe Bob Briggs']], columns = ['Name'])
0
Name
1
Joe Biden
2
Bobby Kennedy
3
Joe Bob Briggs
I added two boolean columns to find names:
df['Joe'] = df.Name.str.contains('Joe')
df['Joe'] = df.Joe.astype('int')
df['Bob'] = df.Name.str.contains('Bob')
df['Bob'] = df.Bob.astype('int')
Now my dataframe looks like this:
df = pd.DataFrame(data= [['Joe Biden',1,0],['Bobby Kennedy',0,1],['Joe Bob Briggs',1,1]], columns = ['Name','Joe', 'Bob'])
0
Name
Joe
Bob
1
Joe Biden
1
0
2
Bobby Kennedy
0
1
3
Joe Bob Briggs
1
1
But what I really want is one "Type" column with categorical values: Joe, Bob, or Both.
To do that, I added a column to combine the booleans, then I replaced the values:
df["Type"] = df["Joe"].astype(str) + df["Bob"].astype(str)
0
Name
Joe
Bob
Type
1
Joe Biden
1
0
10
2
Bobby Kennedy
0
1
1
3
Joe Bob Briggs
1
1
11
df['Type'] = df.Type.astype('str') df['Type'].replace({'11': 'Both', '10': 'Joe','1': 'Bob'}, inplace=True)
0
Name
Joe
Bob
Type
1
Joe Biden
1
0
Joe
2
Bobby Kennedy
0
1
Bob
3
Joe Bob Briggs
1
1
Both
This feels clunky. Anyone have a better way?
Thanks!
You can use np.select to create the column Type.
You need to ordered correctly your condlist from the most precise to the widest.
df['Type'] = np.select([df['Name'].str.contains('Joe') & df['Name'].str.contains('Bob'),
df['Name'].str.contains('Joe'),
df['Name'].str.contains('Bob')],
choicelist=['Both', 'Joe', 'Bob'])
Output:
>>> df
Name Type
0 Joe Biden Joe
1 Bobby Kennedy Bob
2 Joe Bob Briggs Both

I am trying to groupby values in a specific column that holds multiple values?

I have this huge netflix dataset which I am trying to see which actors appeared in the most movies/tv shows specifically in America. First, I created a list of unique actors from the dataset. Then created a nested for loop to loop through each name in list3(containing unique actors which checked each row in df3(filtered dataset with 2000+rows) if the column cast contained the current actors name from list3. I believe using iterrows takes too long
myDict1 = {}
for name in list3:
if name not in myDict1:
myDict1[name] = 0
for index, row in df3.iterrows():
if name in row["cast"]:
myDict1[name] += 1
myDict1
Title
cast
Movie1
Robert De Niro, Al Pacino, Tarantino
Movie2
Tom Hanks, Robert De Niro, Tom Cruise
Movie3
Tom Cruise, Zendaya, Seth Rogen
I want my output to be like this:
Name
Count
Robert De Niro
2
Tom Cruise
2
Use
out = df['cast'].str.split(', ').explode().value_counts()
out = pd.DataFrame({'Name': out.index, 'Count': out.values})
>>> out
Name Count
0 Tom Cruise 2
1 Robert De Niro 2
2 Zendaya 1
3 Seth Rogen 1
4 Tarantino 1
5 Al Pacino 1
6 Tom Hanks 1
l=['Robert De Niro','Tom Cruise']#list
df=df.assign(cast=df['cast'].str.split(',')).apply(pd.Series.explode)#convert cast into list and explode
df[df['cast'].str.contains("|".join(l))].groupby('cast').size().reset_index().rename(columns={'cast':'Name',0:'Count'})#groupby cast, find size and rename columns
Name Count
0 Robert De Niro 2
1 Tom Cruise 2
You could use collections.Counter to get the counts of the actors, after splitting the strings:
from collections import Counter
pd.DataFrame(Counter(df.cast.str.split(", ").sum()).items(),
columns = ['Name', 'Count'])
Name Count
0 Robert De Niro 2
1 Al Pacino 1
2 Tarantino 1
3 Tom Hanks 1
4 Tom Cruise 2
5 Zendaya 1
6 Seth Rogen 1
If you are keen about speed, and you have lots of data, you could dump the entire processing within plain python and rebuild the dataframe:
from itertools import chain
pd.DataFrame(Counter(chain.from_iterable(ent.split(", ")
for ent in df.cast)).items(),
columns = ['Name', 'Count'])

pandas: Group by splitting string value in all rows (a column) and aggregation function

If i have dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum that salary column then we will get 316000
I really want to know how much person who named 'alexander, smith, etc' (in distinct) makes in salary if we sum all of the salaries from its splitting name in this dataset (that contains same string value).
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
as we see the sum of sum_salary columns is not the same as the initial dataset. all because the function requires double counting.
I thought it seems familiar like string count, but what makes me confuse is the way we use aggregation function. I've tried creating a new list of distinct value in person_name columns, then stuck comes.
Any help is appreciated, Thank you very much
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
First idea is use defaultdict for store sumed values in loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
for x in p:
d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution with repeating values by length of lists and aggregate sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another sol:
df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64
I parsed this as strings of lists, by copying OP's data and using pandas.read_clipboard(). In case this was indeed the case (a series of strings of lists), this solution would work:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')
# Some cleaning up, then a simple groupby
df.value = df.value.str.replace('[', '')
df.value = df.value.str.replace(']', '')
df.value = df.value.str.replace(' ', '')
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000
Another way you can do this is with iterrows(). This will not be as fast jezraels solution. But it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
for name in row['person_name']:
ids.append(row['id'])
names.append(name)
salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000

Group by and Count distinct words in Pandas DataFrame

By year and name, I am hoping to count the occurrence of words in a dataframe from imported from Excel which results will also be exported to Excel.
This is the sample code:
source = pd.DataFrame({'Name' : ['John', 'Mike', 'John','John'],
'Year' : ['1999', '2000', '2000','2000'],
'Message' : [
'I Love You','Will Remember You','Love','I Love You]})
Excepted results are the following in a dataframe. Any ideas?
Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1
I think you can first split column Message, create Serie and add it to original source. Last groupby with size:
#split column Message to new df, create Serie by stack
s = (source.Message.str.split(expand=True).stack())
#remove multiindex
s.index = s.index.droplevel(-1)
s.name= 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
#remove old column Message
source = source.drop(['Message'], axis=1)
#join Serie s to df source
df = (source.join(s))
#aggregate size
print (df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1

pandas - DataFrame expansion with outer join

First of all I am very new at pandas and am trying to lean so thorough answers will be appreciated.
I want to generate a pandas DataFrame representing a map witter tag subtoken -> poster where tag subtoken means anything in the set {hashtagA} U {i | i in split('_', hashtagA)} from a table matching poster -> tweet
For example:
In [1]: df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
In [2]: df
Out[2]:
0 1
0 jim i was like #yolo_omg to her
1 jack You are so #yes_omg #best_place_ever
2 neil Yo #rofl_so_funny
And from that I want to get something like
0 1
0 jim yolo_omg
1 jim yolo
2 jim omg
3 jack yes_omg
4 jack yes
5 jack omg
6 jack best_place_ever
7 jack best
8 jack place
9 jack ever
10 neil rofl_so_funny
11 neil rofl
12 neil so
13 neil funny
I managed to construct this mostrosity that actually does the job:
In [143]: df[1].str.findall('#([^\s]+)') \
.apply(pd.Series).stack() \
.apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
.apply(pd.Series).stack().to_frame().reset_index(level=0) \
.join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]
Out[143]:
0 0_l
0 0 jim yolo_omg
1 jim yolo
2 jim omg
0 jack yes_omg
1 jack yes
2 jack omg
1 0 jack best_place_ever
1 jack best
2 jack place
3 jack ever
0 0 neil rofl_so_funny
1 neil rofl
2 neil so
3 neil funny
But I have a very strong feeling that there are much better ways of doing this, especially given that the real dataset set is huge.
pandas indeed has a function for doing this natively.
Series.str.findall()
This basically applies a regex and captures the group(s) you specify in it.
So if I had your dataframe:
df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])
What I would do is first to set the names of your columns, like this:
df.columns = ['user', 'tweet']
Or do it on creation of the dataframe:
df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]], columns=['user', 'tweet'])
Then I would simply apply the extract function with a regex:
df['tag'] = df["tweet"].str.findall("(#[^ ]*)")
And I would use the negative character group instead of a positive one, this is more likely to survive special cases.
How about using list comprehensions in python and then reverting back to pandas? Requires a few lines of code but is perhaps more readable.
import re
get the hash tags
tags = [re.findall('#([^\s]+)', t) for t in df[1]]
make lists of the tags with subtokens for each user
st = [[t] + [s.split('_') for s in t] for t in tags]
subtokens = [[i for s in poster for i in s] for poster in st]
put back into DataFrame with poster names
df2 = pd.DataFrame(subtokens, index=df[0]).stack()
In [250]: df2
Out[250]:
jim 0 yolo_omg
1 yolo
2 omg
jack 0 yes_omg
1 best_place_ever
2 yes
3 omg
4 best
5 place
6 ever
neil 0 rofl_so_funny
1 rofl
2 so
3 funny
dtype: object

Categories

Resources