pandas equivalent select count(distinct col1, col2) group by col3 - python

Make DataFrame:
import pandas as pd

people = ['shayna','shayna','shayna','shayna','john']
dates = ['01-01-18','01-01-18','01-01-18','01-02-18','01-02-18']
places = ['hospital', 'hospital', 'inpatient', 'hospital', 'hospital']
d = {'Person':people,'Service_Date':dates, 'Site_Where_Served':places}
df = pd.DataFrame(d)
df
   Person Service_Date Site_Where_Served
0  shayna     01-01-18          hospital
1  shayna     01-01-18          hospital
2  shayna     01-01-18         inpatient
3  shayna     01-02-18          hospital
4    john     01-02-18          hospital
What I would like to do is count the unique pairs of Person and Service_Date, grouped by Site_Where_Served.
Expected Output:
Site_Where_Served Site_Visit_Count
hospital 3
inpatient 1
My attempt:
df[['Person', 'Service_Date']].groupby(df['Site_Where_Served']).nunique().reset_index(name='Site_Visit_Count')
But that fails to reset the index: reset_index only accepts a name argument on a Series, not on the DataFrame that nunique returns here. So I tried leaving it out, and I realized it isn't counting the unique (Person, Service_Date) pairs, because the output looks like this:
                   Person  Service_Date
Site_Where_Served
hospital                2             2
inpatient               1             1

drop_duplicates with groupby + count
(df.drop_duplicates()
   .groupby('Site_Where_Served')
   .Site_Where_Served.count()
   .reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Note one tiny difference between count and size: the former does not count NaN entries.
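A minimal sketch of that difference, using a toy frame with a NaN (not the question's data):
import numpy as np

tmp = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(tmp.groupby('g')['v'].count())  # a -> 1: the NaN is skipped
print(tmp.groupby('g')['v'].size())   # a -> 2: the NaN is included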
Tuplization, groupby and nunique
This really only fixes your current solution, and I would not recommend it, as it is quite long-winded with more steps than necessary. First tuplize your columns, then group by Site_Where_Served and count:
(df[['Person', 'Service_Date']]
   .apply(tuple, axis=1)
   .groupby(df.Site_Where_Served)
   .nunique()
   .reset_index(name='Site_Visit_Count')
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1

In my opinion, a better way is to drop duplicates before using groupby.size:
res = (df.drop_duplicates()
         .groupby('Site_Where_Served').size()
         .reset_index(name='Site_Visit_Count'))
print(res)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1

Maybe value_counts
(df.drop_duplicates()
   .Site_Where_Served
   .value_counts()
   .to_frame('Site_Visit_Count')
   .rename_axis('Site_Where_Served')
   .reset_index()
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1

Counter 1
from collections import Counter

pd.Series(Counter(df.drop_duplicates().Site_Where_Served)) \
  .rename_axis('Site_Where_Served').reset_index(name='Site_Visit_Count')
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
Counter 2
pd.DataFrame(
    list(Counter(t[2] for t in set(map(tuple, df.values))).items()),  # t[2] is the Site_Where_Served column
    columns=['Site_Where_Served', 'Site_Visit_Count']
)
Site_Where_Served Site_Visit_Count
0 hospital 3
1 inpatient 1
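For completeness, pandas 1.1+ also has DataFrame.value_counts, which turns the drop-duplicates idea into a one-liner (a sketch; the only assumption is the pandas version):
(df.drop_duplicates()
   .value_counts('Site_Where_Served')
   .reset_index(name='Site_Visit_Count'))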

Related

(Pandas) creating a column that counts the number of times a value in DFa occurs in DFb

I have two dataframes, one consists of people and scores, the other consists of each time one of the people did a thing.
df_people = pd.DataFrame(
    {
        'Name': ['Angie', 'John', 'Joanne', 'Shivangi'],
        'ID': ['0021', '0022', '0023', '0024'],
        'Code': ['BHU', 'JNU', 'DU', 'BHU'],
    }
)
df_actions = pd.DataFrame(
    {
        'ID': ['0023', '0021', '0022', '0021'],
        'Act': ['Enter', 'Enter', 'Enter', 'Leave'],
    }
)
I would like to create a column in df_people that represents the count of each time they appear in df_actions based on the shared 'ID' column.
It would look like:
       Name    ID Code  Count
0     Angie  0021  BHU      2
1      John  0022  JNU      1
2    Joanne  0023   DU      1
3  Shivangi  0024  BHU      0
I have tried just taking a value count and inserting that as a new column into df_people, but it seems very clunky.
Any advice would be much appreciated.
Another option is Series.map with Series.value_counts:
new_df = df_people.assign(Count=df_people['ID'].map(df_actions['ID'].value_counts())
                                               .fillna(0, downcast='infer'))
print(new_df)
Name ID Code Count
0 Angie 0021 BHU 2
1 John 0022 JNU 1
2 Joanne 0023 DU 1
3 Shivangi 0024 BHU 0
First use GroupBy.agg to compute the counts, then merge:
(df_people
 .merge(df_actions.groupby('ID')['Act'].agg(count='count'),
        left_on='ID', right_index=True, how='left')
 .fillna({'count': 0}, downcast='infer')
)
Output:
Name ID Code count
0 Angie 0021 BHU 2
1 John 0022 JNU 1
2 Joanne 0023 DU 1
3 Shivangi 0024 BHU 0
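A closely related variant (my sketch, not from the original answers) skips the NaN-then-downcast step by reindexing the value counts onto df_people's IDs with a fill_value:
counts = df_actions['ID'].value_counts()
# .to_numpy() avoids index alignment: the reindexed Series is indexed by ID,
# not by df_people's 0..3 RangeIndex
df_people['Count'] = counts.reindex(df_people['ID'], fill_value=0).to_numpy()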

Python DataFrame pivot_table not returning column headers

I have the following df that I am trying to pivot by 'Country':
data = {'Country': ['India', 'India', 'India', 'India', 'India',
                    'USA', 'USA'],
        'Personality': ['Sachin Tendulkar', 'Sachin Tendulkar', 'Sania Mirza',
                        'Sachin Tendulkar', 'Sania Mirza',
                        'Serena Williams', 'Venus Willians']}
# create a dataframe from the data
df = pd.DataFrame(data, columns=['Country', 'Personality'])
My issue is with the following line of code:
df.pivot_table(index=['Country'], columns=['Personality'],
               values='Personality', aggfunc='count', fill_value=0)
I expect the output to look like the following:
Sachin Tendulkar Sania Mirza Serena Williams Venus Williams
Country
India 3 2 0 0
USA 0 0 1 1
However, all I see is the Index column after running the above code.
If you put len in for aggfunc, it works:
df.pivot_table(index='Country', columns='Personality', values='Personality', aggfunc=len, fill_value=0)
Output:
Personality Sachin Tendulkar Sania Mirza Serena Williams Venus Willians
Country
India 3 2 0 0
USA 0 0 1 1
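If all you need is this contingency table, pd.crosstab sidesteps pivot_table entirely; a sketch on the question's df (missing combinations are filled with 0 by default):
pd.crosstab(df['Country'], df['Personality'])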

Replacing columns with different Dataframe

I have two dataframes, namely 'df' and 'df1'.
df
Out[14]:
first country Rating
0 Robert US 100
1 Chris Aus 99
2 Scarlett US 100
df1
Out[17]:
last Role
0 Downey IronMan
1 Hemsworth Thor
2 Johansson BlackWidow
Expected output:
first last Role Rating
0 Robert Downey IronMan 100
1 Chris Hemsworth Thor 99
2 Scarlett Johansson BlackWidow 100
I need to drop the 'country' column and replace it with the columns of another dataframe (i.e. 'df1').
I understand I can join the dataframes and drop the 'country' column, but I need the columns exactly in this order.
IIUC:
new_df = df.merge(df1, left_index=True, right_index=True).drop('country', axis=1)
new_df = new_df[['first', 'last', 'Role', 'Rating']]
Could you give this a shot?
df.join(df1)[["first", "last", "Role", "Rating"]]
Output:
first last Role Rating
0 Robert Downey IronMan 100
1 Chris Hemsworth Thor 99
2 Scarlett Johansson BlackWidow 100
@Moahmed, you can try the below approach:
df2 = pd.concat([df, df1], axis=1)
df2 = df2[['first', 'last', 'Role', 'Rating']]
df2.head()
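One caveat (my addition, not part of the original answer): pd.concat aligns rows by index, so this only works because both frames share the same default 0..2 index. A defensive sketch for when the indexes might differ:
df2 = pd.concat([df.reset_index(drop=True), df1.reset_index(drop=True)], axis=1)
df2 = df2[['first', 'last', 'Role', 'Rating']]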

pandas: Group by splitting string value in all rows (a column) and aggregation function

If i have dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum that salary column then we will get 316000
What I really want to know is how much each distinct person ('alexander', 'smith', etc.) makes in total, if for each name we sum the salaries of all rows whose person_name list contains that name.
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
As you can see, the total of the sum_salary column is not the same as the total of the initial dataset; that is expected, because a name that appears in several rows is counted once per row.
It seemed similar to counting strings, but what confused me was how to apply the aggregation function. I tried creating a list of the distinct values in the person_name column, but then I got stuck.
Any help is appreciated, Thank you very much
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
The first idea is to use defaultdict to store the summed values in a loop:
from collections import defaultdict

d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)

print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group': list(d.keys()),
                    'sum_salary': list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
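A slightly shorter way to build the same frame from the dict (my sketch, equivalent on this data):
df1 = pd.Series(d).rename_axis('group').reset_index(name='sum_salary')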
Another solution repeats each salary by the length of its list and then aggregates with sum:
from itertools import chain

df1 = pd.DataFrame({
    'group': list(chain.from_iterable(df['person_name'].tolist())),
    'sum_salary': df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
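On pandas 0.25+ the repeat-and-flatten step is built in as DataFrame.explode, which shrinks the whole thing to a sketch like this (assuming, as above, that the person_name cells are lists):
res = (df.explode('person_name')
         .groupby('person_name', sort=False)['salary'].sum()
         .reset_index(name='sum_salary'))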
Another solution:
import numpy as np

df_new = pd.DataFrame({'person_name': np.concatenate(df.person_name.values),
                       'salary': df.salary.repeat(df.person_name.str.len())})
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
This can be done concisely with dummies, though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64
I parsed this as strings of lists, by copying the OP's data and using pandas.read_clipboard(). If that is indeed the case (a column of strings that merely look like lists), this solution works:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars='salary').drop(columns='variable')

# Some cleaning up, then a simple groupby
# regex=False so that '[' and ']' are treated literally, not as regex syntax
df.value = df.value.str.replace('[', '', regex=False)
df.value = df.value.str.replace(']', '', regex=False)
df.value = df.value.str.replace(' ', '', regex=False)
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000
Another way you can do this is with iterrows(). It will not be as fast as jezrael's solutions, but it works:
ids = []
names = []
salarys = []

# Iterate over the rows and extract the names from the lists in the person_name column
for ix, row in df.iterrows():
    for name in row['person_name']:
        ids.append(row['id'])
        names.append(name)
        salarys.append(row['salary'])

# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id': ids,
                       'names': names,
                       'salary': salarys})

# Group by names and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000

Counting the occurrences of a substring from one column within another column

I have two dataframes I am working with, one which contains a list of players and another that contains play by play data for the players from the other dataframe. Portions of the rows of interest within these two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!
I think you need findall with a regex built by joining all values of Name, then create indicator columns with MultiLabelBinarizer, and finally add any missing columns with reindex:
s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Finally, if necessary, join the result back to df2:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0
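If you only need the per-player totals rather than per-play indicator columns, a plain comprehension over the names also works; a sketch using the question's batter_game_logs_df and play_by_play_SP_df (regex=False treats each pattern as a literal substring):
batter_game_logs_df['R vs SP'] = [
    play_by_play_SP_df['Play'].str.contains(name + ' scored', regex=False).sum()
    for name in batter_game_logs_df['Name']
]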
