I am a beginner in Python and trying to learn it. We have a df called allyears that has years, gender, names in it.
Something like this:
name    sex    number    year
John    M      1         2010
Jane    F      2         2011
I want to get the top 10 names for a given year, with their respective counts. I tried this code, but it is not returning what I am looking for:
males = allyears[(allyears.year>2009)&(allyears.sex=='M')]
maleNameCounts = pd.DataFrame(males.groupby(['year', 'name']).count())
maleNameCounts.sort_values('number', ascending=True)
How should I be approaching this problem?
Hope this helps:
Add a column with the counts:
df["name_count"] = df["name"].map(df["name"].value_counts())
Optionally, remove duplicates:
df = df.drop_duplicates(["name"])
Sort by counts, descending so the top names come first:
df = df.sort_values("name_count", ascending=False)
Note that this can all be tweaked where necessary.
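A usage sketch of this approach on the question's data (assuming the allyears frame from the question; male_top10 is just an illustrative name, and note that this counts names across all of the selected years together, as the snippet above does):
males = allyears[(allyears.year > 2009) & (allyears.sex == 'M')].copy()
# Attach each name's total count, then keep one row per name and take the 10 biggest
males["name_count"] = males["name"].map(males["name"].value_counts())
male_top10 = males.drop_duplicates("name").sort_values("name_count", ascending=False).head(10)
print(male_top10)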
You can try the following:
males = allyears[(allyears.year > 2009) & (allyears.sex == 'M')]
maleNameCounts = males.groupby(['year', 'name']).size().nlargest(10).reset_index().rename(columns={0: 'count'})
maleNameCounts
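Note that nlargest(10) here takes the 10 largest counts overall, across all years. If the goal is the top 10 per year, a per-year variant might look like this (a sketch, reusing the males frame from above):
# Count occurrences of each (year, name) pair
counts = males.groupby(['year', 'name']).size()
# Sort by count, then keep the first 10 rows within each year
top10_per_year = counts.sort_values(ascending=False).groupby(level='year').head(10).reset_index(name='count')
top10_per_year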
I am looking to calculate the unique employee ID count over the last 3 months using pandas. I am able to calculate the unique employee ID count for the current month, but I am not sure how to do it for the last 3 months.
df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique().reset_index().rename(columns={"EmpId":"One Month Unique EMP count"}).sort_values("DateM",ascending=False).reset_index(drop=True)
testdata.xlsx Google Drive link:
https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true
After using the above groupby command, I get output for 1-month groups based on the DateM column, which is correct.
Similarly, I'm looking for another column where the 3-month unique active user count based on EmpId is calculated.
Sample output:
I tried calculating the same using a rolling window, but it doesn't help. I also tried creating a period for the last 3 months, and I searched for this before asking the question. Thanks for your help in advance; otherwise I'll have to calculate it manually.
I don't know if you are looking for 3 consecutive months or something else, because your dates are not continuous between 2022-09 and 2022-10.
I also don't know your exact purpose, so I give a general solution here. (If you only want to count uniques for every 3 consecutive months, it is much easier.) The solution below gives you the list of unique empid values for every 3 consecutive months. Note that this means for 2022-08 I count the 3 consecutive months 2022-08, 2022-09, and 2022-10, and so on.
# Sort data:
df.sort_values(by='datem', inplace=True, ignore_index=True)
# Create `dfu` which is `df` with unique `empid` for each `datem` only:
dfu = df.groupby(['datem', 'empid']).count().reset_index()
dfu.rename(columns={'date':'count'}, inplace=True)
dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
dfu
# Obtain the list of unique periods:
unique_period = dfu['datem'].unique()
# Create an empty dataframe to collect the results:
dfe = pd.DataFrame(columns=['datem', 'empid', 'start_period'])

for p in unique_period:
    # Create a range of 3 consecutive months starting at this period:
    tem_range = pd.period_range(start=p, freq='M', periods=3)
    # Extract the rows of dfu whose period falls in the wanted range:
    tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
    # Some cleaning (note the reassignment, otherwise the dedup is lost):
    tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
    tem_dfu.drop(columns='count', inplace=True)
    tem_dfu['start_period'] = p
    # Concat to obtain the desired output:
    dfe = pd.concat([dfe, tem_dfu])

dfe
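If what is ultimately needed is the count rather than the list of IDs, a short follow-up on the dfe produced above (a sketch; the column label is just illustrative):
# Number of unique employees in each 3-month window, keyed by its starting period
three_month_counts = dfe.groupby('start_period')['empid'].nunique().reset_index(name='3 Month Unique EMP count')
three_month_counts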
Hope this is what you are looking for.
I have a dataframe that shows each audience's ranking for a bunch of movies. I wanted to make a list of movies with the most ratings, for each gender.
Here's what I did:
most_rated_gen = lens.groupby(['sex', 'title']).size().sort_values(ascending=False).to_frame()
I was expecting to see a dataframe that looks something like this:
sex  title
M    A
     B
     C
     D
F    B
     C
     D
     A
Instead, I got this:
I don't know why it shows M F M F M. Any ideas how I could fix this?
You can use nlargest() if your aggregated column has a name. Assuming the column name is ratings_count, you can use this code:
most_rated_gen.groupby(['sex'])['ratings_count'].nlargest()
As you group by sex, the output will contain the sex column.
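For completeness, naming the aggregated column up front makes this work end-to-end (a sketch, assuming the lens frame from the question; ratings_count is the illustrative name used above):
# Aggregate with a named column, then take the largest counts within each sex
most_rated_gen = lens.groupby(['sex', 'title']).size().to_frame('ratings_count')
top_per_sex = most_rated_gen.groupby(['sex'])['ratings_count'].nlargest()
print(top_per_sex)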
You have a shortcut for your operation with value_counts:
df.value_counts(['sex', 'title']).sort_index(level='sex', kind='mergesort', sort_remaining=False)
If you want your data to be sorted by one index level while preserving the order of the values, you have to use sort_index with the stable kind='mergesort' and restrict the sort to that level with sort_remaining=False.
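A tiny reproduction of the intended behaviour, using a small stand-in for the lens frame (an assumption, since the real data isn't shown):
import pandas as pd

lens = pd.DataFrame({'sex': ['M', 'M', 'M', 'F', 'F'],
                     'title': ['A', 'A', 'B', 'B', 'A']})
# value_counts sorts by count; the stable per-level sort then groups rows by sex
out = lens.value_counts(['sex', 'title']).sort_index(level='sex', kind='mergesort', sort_remaining=False)
print(out)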
I'm trying to put 0 or 1 in the 'winner' column depending on whether anybody in the member list won an award in that year.
There is a dictionary with the award winners:
award_winner = {'2010':['Momo','Dahyum'],'2011':['Nayeon','Sana'],'2012':['Moon','Jihyo']}
And this is the data frame:
df = pd.DataFrame({'member':[['Jeong-yeon','Momo'],['Jay-z','Bieber'],['Kim','Moon']],'year' : ['2010','2011','2012']})
From the data frame, I would like to see if there's any award winner in each year (the dataframe's year), based on the dictionary.
For example, let's look at the first row: Momo won in 2010 and Moon won in 2012, so the desired output of the dataframe should be like this:
So this is the code so far:
df['winner'] = 0  # empty column

def winner_classifier():
    for i in range(len(df['member'])):  # searching if there is any award winner in df
        if df['member'][row][i] in award_winner[df['year'][row]]:  # I couldn't make row work here
            return 1
        else:
            continue

df['winner'] = df['member'].apply(winner_classifier)
or
Here, I can't assign row. I want the code to look up whether there's any winner for the year, based on the dictionary. So the code should go row by row and check, but I can't get that to work.
I summarized the problem like this to ask on Stack Overflow, but there are more than 10,000 rows, and I thought it would be possible to solve this with pandas' apply.
I already tried a double for loop without pandas, and that took too long.
I tried to use groupby(), but I was wondering how I should use it...
like..
df['winner'] = df['year'].groupby().apply(winner_classifier)..?
Could you help me with this?
Thank you :)
Create a df from the dictionary so that you can merge it later:
winners = pd.DataFrame({
'year' : list(award_winner.keys()),
'winner': list(award_winner.values())})
print (winners)
year winner
0 2010 [Momo, Dahyum]
1 2011 [Nayeon, Sana]
2 2012 [Moon, Jihyo]
Now merge, and find the intersection of the winners with the members:
result = df.merge(winners, on="year")
result['result'] = result.apply(
lambda x: len(set(x.member).intersection(x.winner)) != 0, axis=1)
result = result.drop(['winner'], axis=1)
print (result)
member year result
0 [Jeong-yeon, Momo] 2010 True
1 [Jay-z, Bieber] 2011 False
2 [Kim, Moon] 2012 True
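Since the question asked for 0/1 rather than True/False, one extra line converts the boolean column (a small addition building on the result frame above):
# Map True/False to 1/0 as requested in the question
result['winner'] = result['result'].astype(int)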
You can make use of Python's set() capability here to easily compare two lists of arbitrary length.
I have written this as a row-wise iterator as I wasn't entirely sure what you wanted the result to look like (i.e. do you just want a True/False, or do you want to record the winner in each row?). With 10k rows it shouldn't be a problem to iterate over the dataframe row by row.
for index, row in df.iterrows():
    members_who_were_winners = set(row.member) & set(award_winner[row.year])
    if len(members_who_were_winners) > 0:
        # You could also write the member name to a new column etc.
        df.at[index, 'winner_this_year'] = True
    else:
        df.at[index, 'winner_this_year'] = False
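If the explicit loop ever becomes unwieldy, a more compact apply-based variant of the same set-intersection test is possible (a sketch, assuming the same df and award_winner as above):
# True where at least one member of the row won in that row's year
df['winner_this_year'] = df.apply(lambda r: len(set(r['member']) & set(award_winner[r['year']])) > 0, axis=1)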
import pandas as pd

df = pd.read_csv('admission_data.csv')
df.head()

female = 0
male = 0
for row in df:
    if df['gender'].any() == 'female':
        female = female + 1
    else:
        male = male + 1
print(female)
print(male)
The CSV file has 5 columns. Here is a picture of it.
I want to find the total number of females and males, and how many of each were admitted.
Thank you. This is the code I have tried (along with some more iterations of it), but none of them seem to work.
Your if logic is wrong.
No need for a loop at all.
print(df['gender'].tolist().count('female'))
print(df['gender'].tolist().count('male'))
Alternatively, you can use value_counts as @Wen suggested:
print(df['gender'].value_counts()['male'])
print(df['gender'].value_counts()['female'])
Rule of thumb: 99% of the time there is no need to use explicit loops when working with pandas. If you find yourself using one, there is most probably a better (and faster) way.
You just need value_counts
df['gender'].value_counts()
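To also break the counts down by admission status, as the question asks, a crosstab is one option (a sketch, assuming the gender and admitted columns of admission_data.csv):
import pandas as pd

df = pd.read_csv('admission_data.csv')
# Rows: gender; columns: admitted (True/False); cells: counts
print(pd.crosstab(df['gender'], df['admitted']))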
I created the below csv file:
student_id,gender,major,admitted
35377,female,chemistry,False
56105,male,physics,True
31441,female,chemistry,False
51765,male,physics,True
31442,female,chemistry,True
Reading the csv file into dataframe:
import pandas as pd
df=pd.read_csv('D:/path/test1.csv', sep=',')
counts = df[df['admitted'] == True].groupby(['gender', 'admitted']).size().reset_index(name='count')
print(counts)
gender admitted count
0 female True 1
1 male True 2
Hope this helps!
I think you can use these, brother:
# This line creates a data frame which only has rows where gender is male
count_male = df[df['gender'] == "male"]
# Then you are basically counting how many values are in the gender column
count_male['gender'].size
(or)
count_male = df['gender'] == "male"
count_male.sum()
Take the values in the column gender, store them in a list, and count the occurrences:
import pandas as pd
df = pd.read_csv('admission_data.csv')
print(list(df['gender']).count('female'))
print(list(df['gender']).count('male'))
I am very new to Python (and to Stack Overflow!) so hopefully this makes sense!
I have a dataframe which contains years and names (amongst other things, but this is all I am interested in working with).
I have done df = df.groupby(['year', 'name']).size() to get the number of times each name appears in each year.
It returns something similar to this:
year name
2001 nameone 2
2001 nametwo 3
2002 nameone 1
2002 nametwo 5
What I'm trying to do is put the size data into a new column called 'count'.
(eventually what I am intending to do with this is plot it on graphs)
Any help would be greatly appreciated!
Here is the raw code (I have condensed it a bit for convenience) :
hso_df = pd.read_csv('HibernationSurveyObservationsCleaned.csv')
hso_df[["startDate", "endDate", "commonName"]]
year_df = hso_df
year_df['startDate'] = pd.to_datetime(hso_df['startDate'] )
year_df['year'] = year_df['startDate'].dt.year
year_df = year_df[["year", "commonName"]].sort_values('year')
year_df = year_df.groupby(['year', 'commonName']).size()
Here is an image of the first 3 rows of the data displayed with .head().
The only columns that are of interest in this data are commonName and year (which I have derived from startDate).
IIUC you want transform to add the result of the groupby with its index aligned to the original df:
df['count'] = df.groupby(['year', 'name']).transform('size')
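A quick illustration of what transform('size') returns, using a small frame shaped like the question's example (an assumption for demonstration):
import pandas as pd

df = pd.DataFrame({'year': [2001, 2001, 2001, 2002],
                   'name': ['nameone', 'nametwo', 'nametwo', 'nameone']})
# Each row receives the size of its (year, name) group, aligned to the original index
df['count'] = df.groupby(['year', 'name'])['name'].transform('size')
print(df)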
EDIT
Looking at your requirements, I suggest calling reset_index on the groupby result and then merging this back to your main df:
year_df= year_df.reset_index()
hso_df.merge(year_df).rename(columns={0:'count'})
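Since the stated end goal is plotting, one possible continuation (a sketch; the pivot orientation and the marker styling are assumptions, not part of the original answer):
import matplotlib.pyplot as plt

plot_df = hso_df.merge(year_df).rename(columns={0: 'count'})
# One line per species: counts per year
plot_df.pivot_table(index='year', columns='commonName', values='count').plot(marker='o')
plt.ylabel('count')
plt.show()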