I need help while counting the cities in a dataframe [duplicate] - python

I have a dataset with several Oscar winners. I have the following columns: Name of winner, award, place of birth, date of birth and year. I want to check how many rows are filled per year. Let's say for 2005 we have the winner of best director and best actor and for 2006 we have the winner for best supporting actor. I want to get something like this as the result:
year_of_award number of rows
2005 2
2006 1
It looks so simple, but I can't get it right. Most posts I found recommend combining groupby with count().
However, when I write the code below, I get the number of rows for every column. So I end up with the year plus four other columns, each filled with the row count.
df.groupby(['year_of_award']).count()
How can I get just the year and the number of rows?

For pandas 0.25+, try named aggregation:
df.groupby(['year_of_award']).agg(number_of_rows=('award', 'count'))
For older versions:
df.groupby(['year_of_award']).agg({'award': 'count'}).rename(columns={'award': 'number_of_rows'})
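As an alternative sketch, size() counts the rows in each group directly, so no column has to be picked for counting. The sample data below is hypothetical; only the column name year_of_award comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "year_of_award": [2005, 2005, 2006],
    "award": ["Best Director", "Best Actor", "Best Supporting Actor"],
})

# size() counts rows per group (rows with missing values included);
# reset_index(name=...) turns the result into a two-column DataFrame.
counts = df.groupby("year_of_award").size().reset_index(name="number_of_rows")
print(counts)
```

Unlike count(), which counts only the non-null values of a column, size() counts every row in the group.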

Related

Find which column has unique values that can help distinguish the rows with Pandas

I have the following dataframe, which contains 2 rows:
index name food color number year hobby music
0 Lorenzo pasta blue 5 1995 art jazz
1 Lorenzo pasta blue 3 1995 art jazz
I want to write code that can tell me which column distinguishes between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply by going over the columns one by one with iloc and looking at the values.
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should run inside a for loop; each iteration checks a newly generated dataframe.
There may be more than 2 rows that I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought the way to check this would be to take one column at a time, get its unique values and check whether they are equal to each other, similar to this:
for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then I thought to compare whether all the values in the temporary dataframe are the same, but here I got stuck because sometimes I have many rows.
My end goal: inside a for loop, identify the column that has a different value in each row of the temporary dataframe and can therefore help to distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']
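A slightly different sketch, in case it helps with the "more than 2 rows" requirement: count the distinct values per column with nunique() and keep the columns that have more than one. The helper name distinguishing_columns is made up for illustration, and the data is rebuilt from the example in the question.

```python
import pandas as pd

# Data rebuilt from the two-row example in the question.
df = pd.DataFrame({
    "name": ["Lorenzo", "Lorenzo"],
    "food": ["pasta", "pasta"],
    "color": ["blue", "blue"],
    "number": [5, 3],
    "year": [1995, 1995],
    "hobby": ["art", "art"],
    "music": ["jazz", "jazz"],
})

def distinguishing_columns(frame):
    """Hypothetical helper: return the columns holding more than one distinct value."""
    varies = frame.nunique(dropna=False) > 1
    return varies[varies].index.tolist()

print(distinguishing_columns(df))  # ['number']
```

Note the difference between the two approaches: the duplicated() version keeps only columns where every value is unique, while the nunique() version keeps any column where at least two rows differ. With exactly two rows they give the same answer.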

How to get the index name after regrouping for certain maximum value of another column

I have a dataframe containing election data for four different years. The column "Votes" contains the total votes a party got in different constituencies in each year. I need to find the winning party (the party with the maximum total votes) for each year. I have grouped the data using "Election Year" and "Party". Now how can I get the Election Year and Party for the above case?
df1 = df.groupby(['Election Year', 'Party']).sum()
print(df1.loc[df1['Votes'].idxmax()])
The above code does not give the expected result.
I have attached the dataframe after using groupby.
How can I get the expected result? Any suggestions are appreciated.
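One possible approach, sketched under assumptions: df1['Votes'].idxmax() returns only the single largest total across all years, so it cannot yield one winner per year. Grouping the totals by year again and calling idxmax() within each year gives one row per year. The data below is invented for illustration; only the column names "Election Year", "Party" and "Votes" come from the question.

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
df = pd.DataFrame({
    "Election Year": [2010, 2010, 2010, 2014, 2014, 2014],
    "Party": ["A", "B", "A", "A", "B", "B"],
    "Votes": [100, 250, 200, 120, 90, 80],
})

# Total votes per (year, party), kept as ordinary columns.
totals = df.groupby(["Election Year", "Party"], as_index=False)["Votes"].sum()

# For each year, find the row label of the largest total and select those rows.
winners = totals.loc[totals.groupby("Election Year")["Votes"].idxmax()]
print(winners)
```

If two parties tie within a year, idxmax() returns only the first of them.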

How can I combine two variables into one in order to obtain an overall frequency pandas?

I have a problem.
How can I combine two variables into one, in order to obtain an overall frequency pandas?
An example
Name Count
Watch 2
Watch 3
Jacob 4
Jacob 3
Ashley 2
Ashley 2
The output I want is
Name Count
Watch 5
Jacob 7
Ashley 4
For my dataset which is around 700 rows, this is what I have been trying with groupby.
df.groupby(["NameOfProduct", "Number_Count"]).size().reset_index(name="Time")
It only gives me the number of times each combination of values appears in the dataset.
Hope you guys can help.
Thank you, and have a good evening :)
I think your issue is that you are also grouping by Count. To get the correct result you should group by Name only. For example:
df.groupby(['Name']).sum()
This will take the sum of Count for every unique name in the DataFrame which should result in your requested output.
If you groupby ['Name', 'Count'] and use size() you will end up with a value of 1 for each group except for when Name = Ashley and Count=2 (in this case the result would be 2). This is because size is going to return the size of each unique group.
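To make the difference concrete, here is a small sketch using the example data from the question (with the simplified column names Name and Count rather than the real dataset's names):

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "Name": ["Watch", "Watch", "Jacob", "Jacob", "Ashley", "Ashley"],
    "Count": [2, 3, 4, 3, 2, 2],
})

# Sum of Count per name; sort=False keeps the order of first appearance.
totals = df.groupby("Name", sort=False)["Count"].sum().reset_index()
print(totals)

# Grouping by both columns and calling size() only counts how often
# each (Name, Count) pair occurs.
pair_sizes = df.groupby(["Name", "Count"]).size()
print(pair_sizes)
```

Here totals holds 5, 7 and 4 for Watch, Jacob and Ashley, while pair_sizes is 1 for every pair except (Ashley, 2), which occurs twice.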

Grouping values based on another column and summing those values together

I'm currently working on a mock analysis of a mock MMORPG's microtransaction data. This is an example of a few lines of the CSV file:
PID Username Age Gender ItemID Item Name Price
0 Jack78 20 Male 108 Spikelord 3.53
1 Aisovyak 40 Male 143 Blood Scimitar 1.56
2 Glue42 24 Male 92 Final Critic 4.88
Here's where things get dicey- I successfully use the groupby function to get a result where purchases are grouped by the gender of their buyers.
test = purchase_data.groupby(['Gender', "Username"])["Price"].mean().reset_index()
gets me the result (truncated for readability)
Gender Username Price
0 Female Adastirin33 $4.48
1 Female Aerithllora36 $4.32
2 Female Aethedru70 $3.54
...
29 Female Heudai45 $3.47
.. ... ... ...
546 Male Yadanu52 $2.38
547 Male Yadaphos40 $2.68
548 Male Yalae81 $3.34
What I'm aiming for currently is to find the average amount of money spent by each gender as a whole. How I imagine this would be done is by creating a method that checks for the male/female/other tag in front of a username, and then adds the average spent by that person to a running total which I can then manipulate later. Unfortunately, I'm very new to Python- I have no clue where to even begin, or if I'm even on the right track.
Addendum: jezrael misunderstood the intent of this question. While he provided me with a method to clean up my output series, he did not provide me a method or even a hint towards my main goal, which is to group together the money spent by gender (Females are shown in all but my first snippet, but there are males further down the csv file and I don't want to clog the page with too much pasta) and put them towards a single variable.
Addendum2: Another solution suggested by jezrael,
purchase_data.groupby(['Gender'])["Price"].sum().reset_index()
creates
Gender Price
0 Female $361.94
1 Male $1,967.64
2 Other / Non-Disclosed $50.19
Sadly, using figures from this new series (which would yield the average price per purchase recorded in this csv) isn't quite what I'm looking for, due to the fact that certain users have purchased multiple items in the file. I'm hunting for a solution that lets me pull from my test frame the average amount of money spent per user, separated and grouped by gender.
It sounds to me like you are thinking in terms of database tables. By default, groupby() does not return one: the group labels become the row index rather than regular columns. You can change that with the as_index argument to groupby():
mean = purchase_data.groupby(['Gender', 'Username'], as_index=False)['Price'].mean()
gender = mean.groupby(['Gender'], as_index=False)['Price'].mean()
Then what you want is probably gender[['Gender','Price']]
Basically, sum up per user, then average (mean) up per gender.
In one line
print(df.groupby(['Gender','Username'])['Price'].sum().reset_index()[['Gender','Price']].groupby('Gender').mean())
Or in some lines
df1 = df.groupby(['Gender','Username'])['Price'].sum().reset_index()
df2 = df1[['Gender','Price']].groupby('Gender').mean()
print(df2)
Some notes,
I read your example from the clipboard
import pandas as pd
df = pd.read_clipboard()
which required a separator or the item names to be without spaces. I put an extra space into space lord for the test. Normally, you should provide an example file good enough to do the test, so you'd need one with at least one female in.
To get the average spent per person, first take the mean per username.
Then, to get the average of those per-user averages for each gender, group by again:
df1 = df.groupby(by=['Gender', 'Username'])[['Price']].mean().groupby(level='Gender').mean()
df1 = df1.reset_index()
df1[['Gender', 'Price']]
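A minimal sketch of the two-step idea on toy data may help show why it differs from a single groupby on Gender. The rows below are invented; only the column names Username, Gender and Price come from the question. This version sums each user's purchases first and then averages those totals per gender; swap sum() for mean() in the first step if the per-user average is wanted instead.

```python
import pandas as pd

# Invented toy data: user u1 bought two items, the others one each.
df = pd.DataFrame({
    "Username": ["u1", "u1", "u2", "u3"],
    "Gender": ["Male", "Male", "Male", "Female"],
    "Price": [1.0, 3.0, 2.0, 5.0],
})

# Step 1: total spent per user (Male u1: 4.0, Male u2: 2.0, Female u3: 5.0).
per_user = df.groupby(["Gender", "Username"], as_index=False)["Price"].sum()

# Step 2: average of those per-user totals within each gender.
per_gender = per_user.groupby("Gender", as_index=False)["Price"].mean()
print(per_gender)

# For comparison: the mean per purchase, which ignores who bought what.
per_purchase = df.groupby("Gender", as_index=False)["Price"].mean()
print(per_purchase)
```

For Male, the per-user figure is 3.0 while the per-purchase figure is 2.0, because u1's two purchases are combined before averaging.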
