Grouping values based on another column and summing those values together - python

I'm currently working on a mock analysis of a mock MMORPG's microtransaction data. This is an example of a few lines of the CSV file:
PID  Username   Age  Gender  ItemID  Item Name       Price
0    Jack78     20   Male    108     Spikelord       3.53
1    Aisovyak   40   Male    143     Blood Scimitar  1.56
2    Glue42     24   Male    92      Final Critic    4.88
Here's where things get dicey- I successfully use the groupby function to get a result where purchases are grouped by the gender of their buyers.
test = purchase_data.groupby(['Gender', "Username"])["Price"].mean().reset_index()
gets me the result (truncated for readability)
Gender Username Price
0 Female Adastirin33 $4.48
1 Female Aerithllora36 $4.32
2 Female Aethedru70 $3.54
...
29 Female Heudai45 $3.47
.. ... ... ...
546 Male Yadanu52 $2.38
547 Male Yadaphos40 $2.68
548 Male Yalae81 $3.34
What I'm aiming for currently is to find the average amount of money spent by each gender as a whole. How I imagine this would be done is by creating a method that checks for the male/female/other tag in front of a username, and then adds the average spent by that person to a running total which I can then manipulate later. Unfortunately, I'm very new to Python- I have no clue where to even begin, or if I'm even on the right track.
Addendum: jezrael misunderstood the intent of this question. While he provided me with a method to clean up my output series, he did not provide me a method or even a hint towards my main goal, which is to group together the money spent by gender (Females are shown in all but my first snippet, but there are males further down the csv file and I don't want to clog the page with too much pasta) and put them towards a single variable.
Addendum2: Another solution suggested by jezrael,
purchase_data.groupby(['Gender'])["Price"].sum().reset_index()
creates
Gender Price
0 Female $361.94
1 Male $1,967.64
2 Other / Non-Disclosed $50.19
Sadly, using figures from this new series (which would yield the average price per purchase recorded in this csv) isn't quite what I'm looking for, due to the fact that certain users have purchased multiple items in the file. I'm hunting for a solution that lets me pull from my test frame the average amount of money spent per user, separated and grouped by gender.

It sounds to me like you think in terms of database tables. By default, groupby() does not return one -- the group label(s) are presented as row indices rather than as a column. But you can make it behave that way instead (note the as_index argument to groupby()):
mean = purchase_data.groupby(['Gender', 'Username'], as_index=False)['Price'].mean()
gender = mean.groupby(['Gender'], as_index=False)['Price'].mean()
Then what you want is probably gender[['Gender','Price']]

Basically, sum up per user, then average (mean) up per gender.
In one line
print(df.groupby(['Gender','Username'])['Price'].sum().reset_index()[['Gender','Price']].groupby('Gender').mean())
Or in some lines
df1 = df.groupby(['Gender','Username'])['Price'].sum().reset_index()
df2 = df1[['Gender','Price']].groupby('Gender').mean()
print(df2)
Some notes,
I read your example from the clipboard
import pandas as pd
df = pd.read_clipboard()
which requires either an explicit separator or item names without spaces
(I adjusted the multi-word item names for the test). Normally, you
should provide an example file good enough to test with, so you'd
need one with at least one female in it.

To get the average spent per person, first take the mean per username.
Then, to get the average of those per-user averages per gender, group by again:
df1 = df.groupby(['Gender', 'Username'])['Price'].mean().groupby(level='Gender').mean().to_frame()
df1['Gender'] = df1.index
df1.reset_index(drop=True, inplace=True)
df1[['Gender', 'Price']]
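Putting the sum-then-mean approach together as a self-contained sketch (the sample rows here are made up to match the shape of the question's CSV):

```python
import pandas as pd

# Hypothetical purchase data in the same shape as the question's CSV
purchase_data = pd.DataFrame({
    "Username": ["Jack78", "Jack78", "Aisovyak", "Adastirin33"],
    "Gender":   ["Male",   "Male",   "Male",     "Female"],
    "Price":    [3.00,     5.00,     2.00,       4.00],
})

# Step 1: total spent per user (each user keeps their Gender label)
per_user = purchase_data.groupby(["Gender", "Username"], as_index=False)["Price"].sum()

# Step 2: average of those per-user totals, per gender
per_gender = per_user.groupby("Gender", as_index=False)["Price"].mean()
print(per_gender)
```

Because Jack78 bought twice, his total (8.00) counts once in the male average, which is exactly the "per user, not per purchase" behaviour the question asks for.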

Related

Merging two payroll reports in Python/Pandas and then producing columns comparing variances from month to month

Short version: I am merging two Excel payroll files to compare this month to the previous month. I want to add a third column that outputs the variance. I'd also like a separate report that excludes numbers that haven't changed from last month. Can anyone help with either?
Longer version: I am looking at ways to speed up payroll checks for a business. We get monthly reports back from an external company, with a staff member per row and about 50 column headers for different details and deductions. Imagine 50 columns of this instead of 5 columns (for hundreds of staff):
Employee Code  Surname  Salary  Pension  Sick Pay
30             Jones    36,000  1,800    0
31             Smith    46,000  2,100    150
I am using the code below to combine the current month's payroll report and the previous month's report, and put the relevant columns side by side (_x is the current month value, _y is the prior month value). I can then look at any changes from month to month to check they are correct.
Employee Code  Surname  Salary_x  Salary_y  Pension_x  Pension_y
30             Jones    36,000    34,500    1,800      1,800
31             Smith    46,000    46,000    2,100      2,000
The code is:
dfnew = pd.read_csv("sep22.csv") # this month's payroll report
dfold = pd.read_csv("Aug22.csv") # last month's
mergedfiles = pd.merge(dfnew, dfold,
                       on='Employee Code',
                       how='left')
#sort by column header
mergedfiles = mergedfiles.reindex(sorted(mergedfiles.columns), axis=1)
#set the index as employee code
mergedfiles = mergedfiles.set_index('Employee Code')
# Produce an Excel sheet
mergedfiles.to_excel("new_v_old.xlsx")
This works as far as it goes. But then I have to insert dozens of columns manually to produce a differences column. What I would really like is for it to output a differences column like this:
Employee Code  Surname  Salary_x  Salary_y  Salary_diff  Pens_x  Pens_y  Pens_diff
30             Jones    36,000    34,500    1,500        1,800   1,800   0
31             Smith    46,000    46,000    0            2,100   2,000   100
Ideally, the difference column would be an excel formula - but failing that a hard typed number would be fine. Can anyone advise on what I could add to the code?
My second question is whether it's possible to have a report that outputs the numbers where only a difference exists? So in the above example, it would show salary for Jones but not Smith - and it would show Pension for Smith and not Jones? Something like this, either in Jupyter itself or in Excel:
Salary:
Jones 36,000; 34,500; 1,500
Pension:
Smith 2,100; 2,000; 100
Please note that column headers are rather inconsistent from month to month, both in terms of their place within the sheet and whether they are there at all.
For example, if one staff member got a recruitment bonus in August and in September nobody got it, the column would disappear from the September report and all the columns to the right of it would shift left one column compared to August's. (I wish they didn't do it this way!)
Any advice and help appreciated.
This is a good use case for MultiIndex.
For your first question, use pd.concat. pd.concat([df1, df2, df3, ...], axis=1) is similar to merge, but it aligns only on the index. You can optionally specify keys, which turns the result into a dataframe with multi-indexed columns.
For your second question, use style to hide the cells where delta is 0.
# FIRST QUESTION
cols = ["Salary", "Pension"]
compare = pd.concat([
    # the Info columns: Surname, Given Name, Title, etc.
    df_new.set_index("Employee Code")[["Surname"]],
    # the New columns: Salary & Pension
    df_new.set_index("Employee Code")[cols],
    # the Old columns: Salary & Pension
    df_old.set_index("Employee Code")[cols],
], keys=["Info", "New", "Old"], axis=1)
compare[[("Diff", col) for col in cols]] = compare["New"] - compare["Old"]

# Hide cells where value is 0. Color them green if > 0 and red if < 0. I assume
# you are doing this in a notebook (Jupyter, VS Code, PyCharm, etc.). The
# `styler` function returns a CSS string telling the notebook how to render the
# cell.
def styler(value: float):
    return (
        "visibility: hidden" if value == 0
        else "color: green" if value > 0
        else "color: red"
    )

# Flatten the MultiIndex in the column names
compare.columns = [c[1] if c[0] == "Info" else " ".join(c) for c in compare.columns]

# The column order: Surname, New, Old, Diff
display_cols = ["Surname"] + [f"{j} {i}" for i in cols for j in ["New", "Old", "Diff"]]

# Display
compare[display_cols].style.applymap(styler, subset=["Diff Salary", "Diff Pension"])
Result (in VS Code):
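On the point about headers shifting or disappearing from month to month: one defensive option is to compare only the numeric columns that appear in both reports, rather than a hard-coded list. A minimal sketch, with made-up two-row frames where a "Bonus" column exists only in the old month:

```python
import pandas as pd

# Hypothetical reports: "Bonus" appears only in the old month
df_old = pd.DataFrame({"Employee Code": [30, 31], "Surname": ["Jones", "Smith"],
                       "Salary": [34500, 46000], "Pension": [1800, 2000],
                       "Bonus": [500, 0]})
df_new = pd.DataFrame({"Employee Code": [30, 31], "Surname": ["Jones", "Smith"],
                       "Salary": [36000, 46000], "Pension": [1800, 2100]})

# Compare only columns present in BOTH reports, so a column that vanishes
# one month cannot silently misalign the comparison
shared = (df_new.columns.intersection(df_old.columns)
                 .drop(["Employee Code", "Surname"]))

new = df_new.set_index("Employee Code")[shared]
old = df_old.set_index("Employee Code")[shared]
diff = (new - old).add_suffix("_diff")
print(diff)
```

Because alignment is done by column name and "Employee Code" index rather than by position, reordered columns in the source files are harmless too.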

I need help While counting the cities in a dataframe [duplicate]

I have a dataset with several Oscar winners. I have the following columns: Name of winner, award, place of birth, date of birth and year. I want to check how many rows are filled per year. Let's say for 2005 we have the winner of best director and best actor and for 2006 we have the winner for best supporting actor. I want to get something like this as the result:
year_of_award number of rows
2005 2
2006 1
It looks something so simple, but I can't get it right. Most posts I found would recommend the combination of group by with count().
However, when I write the code below, I get the number of rows for all columns. So I have the year and other 4 columns filled with the number of rows.
df.groupby(['year_of_award']).count()
How can I get just the year and the number of rows?
Try for pandas 0.25+:
df.groupby(['year_of_award']).agg(number_of_rows=('award', 'count'))
else:
df.groupby(['year_of_award']).agg({'award': 'count'}).rename(columns={'award': 'number_of_rows'})
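An alternative worth knowing: groupby().size() counts rows per group directly, without having to pick a representative column. A small sketch with made-up Oscar rows:

```python
import pandas as pd

# Hypothetical data matching the question's description
df = pd.DataFrame({
    "year_of_award": [2005, 2005, 2006],
    "award": ["Best Director", "Best Actor", "Best Supporting Actor"],
})

# size() counts every row in the group; reset_index(name=...) names the column
counts = df.groupby("year_of_award").size().reset_index(name="number_of_rows")
print(counts)
```

Note that count() skips NaN values in the chosen column, while size() counts all rows, so they can differ if the data has gaps.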

How to choose only the "male" attribute from a newly compiled dataframe?

I am working with the following dataframe which I created from a much larger csv file with additional information in columns not needed:
df_avg_tot_purch = df_purchase_data.groupby(["SN", "Gender"])["Price"].agg(lambda x: x.unique().mean())
df_avg_tot_purch.head()
This code results in the following:
SN Gender
Adairialis76 Male 2.28
Adastirin33 Female 4.48
Aeda94 Male 4.91
Aela59 Male 4.32
Aelaria33 Male 1.79
Name: Price, dtype: float64
I now need to have this dataframe only show the male gender. The point of the project here is to find all the individuals (which may repeat in the rows), determine the average of each of their purchases. I did it this way because I also need to run another for females and "others" in the column.
After groupby, the keys you grouped on become index levels, so you have to either reset the index to turn them back into normal columns, or use the index explicitly while subsetting:
df_avg_tot_purch[df_avg_tot_purch.index.isin(['Male'], level='Gender')]
or
df_avg_tot_purch = df_avg_tot_purch.reset_index()
df_avg_tot_purch[df_avg_tot_purch['Gender'] == 'Male']
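Another index-based option is xs(), which selects one value of a single index level. A minimal sketch, using a hand-built Series shaped like df_avg_tot_purch:

```python
import pandas as pd

# Hypothetical Series with the same MultiIndex shape (SN, Gender)
s = pd.Series(
    [2.28, 4.48, 4.91],
    index=pd.MultiIndex.from_tuples(
        [("Adairialis76", "Male"), ("Adastirin33", "Female"), ("Aeda94", "Male")],
        names=["SN", "Gender"]),
    name="Price")

# xs() slices one level; drop_level=False keeps Gender visible in the result
males = s.xs("Male", level="Gender", drop_level=False)
print(males)
```

The same call with "Female" or "Other / Non-Disclosed" covers the other runs the question mentions.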

Ranking values based on two columns

I'm trying to devise a way to rank accounts from best to worst based on their telephone duration and margin.
The data looks like this;
ID TIME_ON_PHONE MARGIN
1 1235 1256
2 12 124
3 1635 0
4 124 652
5 0 4566
Any suggestions on how to rank them from best to worst?
ID 5 = best as we have spent no time on the phone but their margin is the most.
ID 3 = worst as we've spend ages on the phone but no orders.
I've put it into excel to try and devise a solution but I can't get the ranking correct.
I would suggest creating a new metric like
New Metric = Margin / Time on phone
to compare each row.
To create a column with this metric just use:
dataframe["new_metric"] = dataframe["MARGIN"]/dataframe["TIME_ON_PHONE"]
Having 0 values in the TIME_ON_PHONE column will produce inf results (pandas does not raise on division by zero), so I recommend replacing those zeros with a very small value, like 0.001 or something.
After that you can simply use this line of code to sort your rows:
dataframe = dataframe.sort_values("new_metric", ascending = False)
That way you would end up with the first ID being the best one, the second ID the second best one... etc.
Hope it helps.
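The steps above can be sketched end-to-end on the question's sample data (new_metric and the epsilon value are the assumptions from this answer, not part of the original data):

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                   "TIME_ON_PHONE": [1235, 12, 1635, 124, 0],
                   "MARGIN": [1256, 124, 0, 652, 4566]})

# Replace 0 minutes with a small epsilon so the division yields a large
# (but finite) score instead of inf
time_on_phone = df["TIME_ON_PHONE"].replace(0, 0.001)
df["new_metric"] = df["MARGIN"] / time_on_phone

# Best account first
ranked = df.sort_values("new_metric", ascending=False)
print(ranked[["ID", "new_metric"]])
```

This reproduces the intuition in the question: ID 5 (all margin, no phone time) ranks first and ID 3 (all phone time, no margin) ranks last.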

Obtaining a Total of Only Part of a Column

The database I'm using shows the total number of debtors for every town for every quarter.
Since there's 43 towns listed, there's 43 'total debtors' per quarter (30-Sep-17, etc).
My goal is to find the total number of debtors for every quarter (so theoretically, finding the sum of every 43 'total debtors' listed) but I'm not quite sure how.
I've tried using the sum() function, but I'm not sure how to make it add the totals quarter by quarter only.
Here's what the database looks like and my attempt (I printed the first 50 rows just to provide an idea of what it looks like)
https://i.imgur.com/h1y43j8.png
Sorry in advance if the explanation was a bit unclear.
You should use groupby. It's a nice pandas function to do exactly what you are trying to do. It groups the df according to whatever column you pick.
total_debtors_pq = df.groupby('Quarter end date')['Total number of debtors'].sum()
You can then extract the total for each quarter from total_debtors_pq.
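A minimal runnable version of this, with made-up rows (the column names "Quarter end date" and "Total number of debtors" are assumptions read off the screenshot):

```python
import pandas as pd

# Hypothetical data: two towns, two quarters
df = pd.DataFrame({
    "Quarter end date": ["30-Sep-17", "30-Sep-17", "31-Dec-17", "31-Dec-17"],
    "Total number of debtors": [10, 20, 5, 15],
})

# One summed total per quarter, indexed by the quarter end date
total_debtors_pq = df.groupby("Quarter end date")["Total number of debtors"].sum()
print(total_debtors_pq)

# A single quarter's total can be read off with .loc
print(total_debtors_pq.loc["30-Sep-17"])
```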
