home_team_name home_team_goal_count
0 Bayern München 2
1 Bayern München 2
2 Bayern München 1
3 Köln 2
4 Köln 2
I groupby the data on the variable home_team_name.
df.groupby("home_team_name")
The values of home_team_goal_count can only be 2 or 1. I want to get the minimum number of occurrences
of the values in each group. The result I would want is 1 for Bayern Munchen and 0 for Koln. To illustrate Bayern Munchen has 2 times 2 and 1 times 1, therefore the minimum is 1. Koln has 2 times 2 and 0 time 1 therefore the minimum is 0.
First count values by SeriesGroupBy.value_counts, reshape and add 0 for all combinations 1,2 and last get minimum by min:
s = (df.groupby("home_team_name")['home_team_goal_count']
.value_counts()
.unstack(fill_value=0)
.min(axis=1))
print (s)
home_team_name
Bayern München 1
Köln 0
dtype: int64
Details:
print (df.groupby("home_team_name")['home_team_goal_count']
.value_counts()
.unstack(fill_value=0))
home_team_goal_count 1 2
home_team_name
Bayern München 1 2
Köln 0 2
If possible only 1 or only 2 values in input data is necessary reindex:
s = (df.groupby("home_team_name")['home_team_goal_count']
.value_counts()
.unstack(fill_value=0)
.reindex([1, 2], axis=1, fill_value=0)
.min(axis=1))
Let's try using pd.crosstab:
pd.crosstab(df['home_team_name'], df['home_team_goal_count'])\
.reindex([1, 2], axis=1, fill_value=0).min(1)
Result:
home_team_name
Bayern München 1
Köln 0
dtype: int64
import pandas as pd
import numpy as np
list1=['Bayern Munchen','Bayern Munchen','Bayern Munchen','FC Koln','FC Koln']
list2=[2,2,1,2,2]
d={'Home Team Name':list1,'Home Team Goal Count':list2}
data=pd.DataFrame(d)
data['Name']= data['Home Team Name'] +" "+ data['Home Team Goal Count'].astype(str)
data['Name']
Out[39]:
0 Bayern Munchen 2
1 Bayern Munchen 2
2 Bayern Munchen 1
3 FC Koln 2
4 FC Koln 2
name,count=np.unique(data['Name'].tolist(),return_counts=True)
name=[' '.join(x.split(' ')[:-1]) for x in name]
name
Out[99]: ['Bayern Munchen', 'Bayern Munchen', 'FC Koln']
min_val=pd.DataFrame({"Name":name,"Count":count})
name=[]
min_val_count=[]
for x in min_val.Name.unique():
name.append(min_val[min_val.Name!=x].min()[0])
if min_val[min_val.Name!=x].min()[1]==2:
min_val_count.append(0)
else:
min_val_count.append(min_val[min_val.Name!=x].min()[1])
minimum_val_dict=dict(zip(name,min_val_count))
minimum_val_dict
Out[104]: {'FC Koln': 0, 'Bayern Munchen': 1}
A slightly longer version as compared to the answers above.
Even another way to do this would be to use a cateorical variable, since there's a finite set of states. So:
(
df
.astype({"home_team_goal_count": "category"})
.groupby("home_team_name")["home_team_goal_count"]
.apply(lambda x: x.value_counts().min())
)
If you want to know which value occurred the least, you can call .idxmin() instead of .min().
Related
Hello I have this Pandas code (look below) but turn out it give me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I wanted to change the element values from column 'Age', imagine if I wanted to increase the column elements from 'Age' by 1, how do i do that? (With Number of Children as well)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try:
df['Age'] = df['Age'] - 12
df['Number of Children'] = df['Number of Children'] - 1
My dataframe df is:
Election Year Votes Party Region
0 2000 50 A a
1 2000 100 B a
2 2000 26 A b
3 2000 180 B b
4 2000 300 A c
5 2000 46 C c
6 2005 149 A a
7 2005 46 B a
8 2005 312 A b
9 2005 23 B b
10 2005 16 A c
11 2005 35 C c
I want to get the Party winning maximum region every year. So the desired output is:
Election Year Party
2000 B
2005 A
I tried this code to get the the above output, but it is giving error:
winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
winner = winner.groupby('Election Year').first().reset_index()
winner = winner[['Election Year', 'Party']].to_string(index=False)
winner
how can I get the desired output?
Here is one approach with nested groupby. We first count per-party votes in each year-region pair, then use mode to find the party winning most regions. The mode need not be unique (if two or more parties win the same number of regions).
df.groupby(["Year", "Region"])\
.apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
.unstack().mode(1).rename(columns={0: "Party"})
Party
Year
2000 B
2005 A
To address the comment, you can replace idxmax above with nlargest and diff to find regions where win margin is below a given number.
margin = df.groupby(["Year", "Region"])\
.apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125
print(margin[margin].reset_index()[["Year", "Region"]])
# Year Region
# 0 2000 a
# 1 2005 a
# 2 2005 c
You can use GroupBy.idxmax() to get the index of max Votes for each group of Election Year, then use .loc to locate the rows followed by selection of required columns, as followed:
df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]
Result:
Election Year Party
4 2000 A
8 2005 A
Edit
If we are to get the Party winning most Region, we can use the following codes (without using the slow .apply() with lambda function):
(df.loc[
df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
[['Election Year', 'Party', 'Region']]
.pivot(index='Election Year', columns='Region')
.mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()
Result:
Election Year Party
0 2000 B
1 2005 A
Try this
winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner
Another method: (closed to #hilberts_drinking_problem in fact)
>>> df.groupby(["Election Year", "Region"]) \
.apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
.unstack().mode(axis="columns") \
.rename(columns={0: "Party"}).reset_index()
Election Year Party
0 2000 B
1 2005 A
I believe the one liner df.groupby(["Election Year"]).max().reset_index()['Election Year', 'Party'] solves your problem
This question already has answers here:
Change one value based on another value in pandas
(7 answers)
Closed 2 years ago.
I have a sample dataset here. In real case, it has a train and test dataset. Both of them have around 300 columns and 800 rows. I want to filter out all those rows based on a certain value in one column and then set all values in that row from column 3 e.g. to column 50 to zero. How can I do it?
Sample dataset:
import pandas as pd
data = {'Name':['Jai', 'Princi', 'Gaurav','Princi','Anuj','Nancy'],
'Age':[27, 24, 22, 32,66,43],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Katauj', 'vbinauj'],
'Payment':[15,20,40,50,3,23],
'Qualification':['Msc', 'MA', 'MCA', 'Phd','MA','MS']}
df = pd.DataFrame(data)
df
Here is the output of sample dataset:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 Kanpur 20 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 Kannauj 50 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
As you can see, in the first column, there values =="Princi", So if I find rows that Name column value =="Princi", then I want to set column "Address" and "Payment" in those rows to zero.
Here is the expected output:
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA #this row
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd #this row
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
In my real dataset, I tried:
train.loc[:, 'got':'tod']# get those columns # I could select all those columns
and train.loc[df['column_wanted'] == "that value"] # I got all those rows
But how can I combine them? Thanks for your help!
Use the loc accessor; df.loc[boolean selection, columns]
df.loc[df['Name'].eq('Princi'),'Address':'Payment']=0
Name Age Address Payment Qualification
0 Jai 27 Delhi 15 Msc
1 Princi 24 0 0 MA
2 Gaurav 22 Allahabad 40 MCA
3 Princi 32 0 0 Phd
4 Anuj 66 Katauj 3 MA
5 Nancy 43 vbinauj 23 MS
My data looks like:
Club Count
0 AC Milan 2
1 Ajax 1
2 FC Barcelona 4
3 Bayern Munich 2
4 Chelsea 1
5 Dortmund 1
6 FC Porto 1
7 Inter Milan 1
8 Juventus 1
9 Liverpool 2
10 Man U 2
11 Real Madrid 7
I'm trying to plot an Area plot using Club as the X Axis, when plotting all data, it looks correct but the X axis displayed is the index and not the Clubs.
When specifying the index as Club(index=x), it shows correct, but the scale of the y axis is set from 0 to 0.05, assuming that's why nothing is displayed since the count is from 1 to 7 any suggestions ?
Code used:
data.columns = ['Club', 'Count']
x=data.Club
y=data.Count
print(data)
ax.margins(0, 10)
data.plot.area()
df = pd.DataFrame(y,index=x)
df.plot.area()
results:
Change to
df = pd.Series(y,index=x)
df.plot.area()
If i have dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum that salary column then we will get 316000
I really want to know how much person who named 'alexander, smith, etc' (in distinct) makes in salary if we sum all of the salaries from its splitting name in this dataset (that contains same string value).
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
as we see the sum of sum_salary columns is not the same as the initial dataset. all because the function requires double counting.
I thought it seems familiar like string count, but what makes me confuse is the way we use aggregation function. I've tried creating a new list of distinct value in person_name columns, then stuck comes.
Any help is appreciated, Thank you very much
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
First idea is use defaultdict for store sumed values in loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
for x in p:
d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution with repeating values by length of lists and aggregate sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another sol:
df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64
I parsed this as strings of lists, by copying OP's data and using pandas.read_clipboard(). In case this was indeed the case (a series of strings of lists), this solution would work:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')
# Some cleaning up, then a simple groupby
df.value = df.value.str.replace('[', '')
df.value = df.value.str.replace(']', '')
df.value = df.value.str.replace(' ', '')
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000
Another way you can do this is with iterrows(). This will not be as fast jezraels solution. But it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
for name in row['person_name']:
ids.append(row['id'])
names.append(name)
salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000