Short version: I am merging two Excel payroll files to compare this month to the previous month. I want to add a third column that outputs the variance. I'd also like a separate report that excludes numbers that haven't changed from last month. Can anyone help with either?
Longer version: I am looking at ways to speed up payroll checks for a business. We get monthly reports back from an external company, with a staff member per row and about 50 column headers for different details and deductions. Imagine 50 columns of this instead of 5 columns (for hundreds of staff):
Employee Code  Surname  Salary  Pension  Sick Pay
30             Jones    36,000  1,800    0
31             Smith    46,000  2,100    150
I am using the code below to combine the current month's payroll report and the previous month's report, and put the relevant columns side by side (_x is the current month value, _y is the prior month value). I can then look at any changes from month to month to check they are correct.
Employee Code  Surname  Salary_x  Salary_y  Pension_x  Pension_y
30             Jones    36,000    34,500    1,800      1,800
31             Smith    46,000    46,000    2,100      2,000
The code is:
import pandas as pd

dfnew = pd.read_csv("sep22.csv")  # this month's payroll report
dfold = pd.read_csv("Aug22.csv")  # last month's
mergedfiles = pd.merge(dfnew, dfold,
                       on='Employee Code',
                       how='left')
# sort by column header
mergedfiles = mergedfiles.reindex(sorted(mergedfiles.columns), axis=1)
# set the index as employee code
mergedfiles = mergedfiles.set_index('Employee Code')
# Produce an Excel sheet
mergedfiles.to_excel("new_v_old.xlsx")
This works as far as it goes. But then I have to insert dozens of columns manually to produce a differences column. What I would really like is for it to output a differences column like this:
Employee Code  Surname  Salary_x  Salary_y  Salary_diff  Pens_x  Pens_y  diff
30             Jones    36,000    34,500    1,500        1,800   1,800   0
31             Smith    46,000    46,000    0            2,100   2,000   100
Ideally, the difference column would be an Excel formula, but failing that a hard-typed number would be fine. Can anyone advise on what I could add to the code?
My second question is whether it's possible to have a report that outputs only the numbers where a difference exists. So in the above example, it would show Salary for Jones but not Smith, and it would show Pension for Smith but not Jones. Something like this, either in Jupyter itself or in Excel:
Salary:
Jones 36,000; 34,500; 1,500
Pension:
Smith 2,100; 2,000; 100
Please note that column headers are rather inconsistent from month to month, both in terms of their place within the sheet and whether they are there at all.
For example, if one staff member got a recruitment bonus in August and in September nobody got it, the column would disappear from the September report and all the columns to the right of it would shift left one column compared to August's. (I wish they didn't do it this way!)
Any advice and help appreciated.
This is a good use case for MultiIndex.
For your first question, use pd.concat. pd.concat([df1, df2, df3, ...], axis=1) is similar to merge, except that it only aligns on the index. You can optionally specify keys, which turns the columns of the resulting dataframe into a MultiIndex.
For your second question, use style to hide the cells where the delta is 0.
# FIRST QUESTION
# df_new / df_old are this month's and last month's reports
# (dfnew / dfold in the question)
cols = ["Salary", "Pension"]
compare = pd.concat([
    # the Info columns: Surname, Given Name, Title, etc.
    df_new.set_index("Employee Code")[["Surname"]],
    # the New columns: Salary & Pension
    df_new.set_index("Employee Code")[cols],
    # the Old columns: Salary & Pension
    df_old.set_index("Employee Code")[cols],
], keys=["Info", "New", "Old"], axis=1)
compare[[("Diff", col) for col in cols]] = compare["New"] - compare["Old"]

# SECOND QUESTION
# Hide cells where value is 0. Color them green if > 0 and red if < 0. I assume
# you are doing this in a notebook (Jupyter, VS Code, PyCharm, etc.). The
# `styler` function returns a CSS string telling the notebook how to render the
# cell.
def styler(value: float):
    return (
        "visibility: hidden" if value == 0
        else "color: green" if value > 0
        else "color: red"
    )

# Flatten the MultiIndex in the column names
compare.columns = [c[1] if c[0] == "Info" else " ".join(c) for c in compare.columns]
# The column order: Surname, then New, Old, Diff for each column in cols
display_cols = ["Surname"] + [f"{j} {i}" for i in cols for j in ["New", "Old", "Diff"]]
# Display (pandas 2.1+ deprecates Styler.applymap in favor of Styler.map)
compare[display_cols].style.applymap(styler, subset=["Diff Salary", "Diff Pension"])
Result (in VS Code): [screenshot of the styled comparison table not included]
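If you would rather have the "changes only" report as a plain table instead of a styled one, here is a rough sketch of one way to do it. It is not part of the answer above: the sample data is made up to mirror the tables in the question, the "Recruitment Bonus" column and the helper name to_long are hypothetical, and it compares columns by name so that a column missing from one month is skipped rather than misaligned.

```python
import pandas as pd

# Made-up sample data mirroring the tables in the question.
df_new = pd.DataFrame({
    "Employee Code": [30, 31],
    "Surname": ["Jones", "Smith"],
    "Salary": [36000, 46000],
    "Pension": [1800, 2100],
})
df_old = pd.DataFrame({
    "Employee Code": [30, 31],
    "Surname": ["Jones", "Smith"],
    "Salary": [34500, 46000],
    "Pension": [1800, 2000],
    "Recruitment Bonus": [0, 500],  # hypothetical column present in one month only
})

# Compare only the value columns that exist in both months (matched by name).
id_cols = ["Employee Code", "Surname"]
value_cols = [c for c in df_new.columns if c in df_old.columns and c not in id_cols]


def to_long(df, value_name):
    """Reshape to one row per (employee, item). Hypothetical helper."""
    return df.melt(id_vars="Employee Code", value_vars=value_cols,
                   var_name="Item", value_name=value_name)


# A left join keeps everyone in the new report; staff missing from the old
# report get NaN for Old and Diff, and therefore also show up as "changed".
long = to_long(df_new, "New").merge(
    to_long(df_old, "Old"), on=["Employee Code", "Item"], how="left"
)
long["Diff"] = long["New"] - long["Old"]

# Keep only the rows where something changed, then attach the surname.
changed = long[long["Diff"] != 0]
changed = changed.merge(df_new[id_cols], on="Employee Code")
changed = changed.sort_values(["Item", "Employee Code"])[
    ["Item", "Surname", "New", "Old", "Diff"]
]
print(changed.to_string(index=False))
```

With this sample data the report has one Pension row for Smith and one Salary row for Jones. Calling changed.to_excel("changes.xlsx", index=False) should write the same table to a workbook (the file name is just an example).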
I have a dataset with several Oscar winners. I have the following columns: Name of winner, award, place of birth, date of birth and year. I want to check how many rows are filled per year. Let's say for 2005 we have the winner of best director and best actor and for 2006 we have the winner for best supporting actor. I want to get something like this as the result:
year_of_award number of rows
2005 2
2006 1
It looks so simple, but I can't get it right. Most posts I found recommend combining groupby with count().
However, when I write the code below, I get the number of rows for all columns. So I have the year and other 4 columns filled with the number of rows.
df.groupby(['year_of_award']).count()
How can I get just the year and the number of rows?
For pandas 0.25+, try named aggregation:
df.groupby(['year_of_award']).agg(number_of_rows=('award', 'count'))
For older versions:
df.groupby(['year_of_award']).agg({'award': 'count'}).rename(columns={'award': 'number_of_rows'})
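Another option, if you only need the number of rows per year, is size(), which counts rows per group without looking at any particular column. A small sketch with made-up rows (the winner names are placeholders, and the column names other than year_of_award and award are assumed):

```python
import pandas as pd

# Made-up rows matching the example in the question.
df = pd.DataFrame({
    "name": ["Winner A", "Winner B", "Winner C"],
    "award": ["Best Director", "Best Actor", "Best Supporting Actor"],
    "year_of_award": [2005, 2005, 2006],
})

# size() returns one number per group (rows with missing values included);
# reset_index(name=...) turns the result into a two-column DataFrame.
counts = df.groupby("year_of_award").size().reset_index(name="number_of_rows")
print(counts)
```

Note that count() skips missing values in the chosen column while size() does not, so the two can differ if some awards are blank.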
I am working with the following dataframe which I created from a much larger csv file with additional information in columns not needed:
df_avg_tot_purch = df_purchase_data.groupby(["SN", "Gender"])["Price"].agg(lambda x: x.unique().mean())
df_avg_tot_purch.head()
This code results in the following:
SN Gender
Adairialis76 Male 2.28
Adastirin33 Female 4.48
Aeda94 Male 4.91
Aela59 Male 4.32
Aelaria33 Male 1.79
Name: Price, dtype: float64
I now need this result to show only the male gender. The point of the project is to find every individual (who may appear in several rows) and determine the average of each person's purchases. I did it this way because I also need to run the same calculation for females and for "others" in the Gender column.
After groupby, the keys you grouped on become index levels, so you either have to reset the index to turn them back into normal columns, or explicitly use the index when subsetting:
df_avg_tot_purch[df_avg_tot_purch.index.isin(['Male'], level='Gender')]
or
df_avg_tot_purch = df_avg_tot_purch.reset_index()
df_avg_tot_purch[df_avg_tot_purch['Gender'] == 'Male']
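A third way to do the same thing is xs, which selects one value of an index level directly. Here is a minimal sketch with made-up purchase rows (the prices are invented, only the column names come from the question):

```python
import pandas as pd

# Made-up purchases; Aeda94 appears twice on purpose.
df_purchase_data = pd.DataFrame({
    "SN": ["Adairialis76", "Adastirin33", "Aeda94", "Aeda94"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Price": [2.28, 4.48, 4.00, 5.82],
})
df_avg_tot_purch = (
    df_purchase_data.groupby(["SN", "Gender"])["Price"]
    .agg(lambda x: x.unique().mean())
)

# xs picks the rows whose 'Gender' level equals 'Male';
# drop_level=False keeps 'Gender' in the index of the result.
males = df_avg_tot_purch.xs("Male", level="Gender", drop_level=False)

# Equivalent selection using a boolean mask on the level values.
males_mask = df_avg_tot_purch[
    df_avg_tot_purch.index.get_level_values("Gender") == "Male"
]
print(males)
```

Swapping "Male" for "Female" or any other value in the Gender level gives the other reports without regrouping.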
I'm trying to devise a way to rank accounts from best to worst based on their telephone duration and margin.
The data looks like this:
ID TIME_ON_PHONE MARGIN
1 1235 1256
2 12 124
3 1635 0
4 124 652
5 0 4566
Any suggestions on how to rank them from best to worst?
ID 5 = best as we have spent no time on the phone but their margin is the most.
ID 3 = worst as we've spent ages on the phone but no orders.
I've put it into excel to try and devise a solution but I can't get the ranking correct.
I would suggest creating a new metric like
New Metric = Margin / Time on phone
to compare each row.
To create a column with this metric just use:
dataframe["new_metric"] = dataframe["MARGIN"]/dataframe["TIME_ON_PHONE"]
Dividing by 0 values in the TIME_ON_PHONE column will not raise an error in pandas, but it produces inf (or NaN when MARGIN is also 0), so I recommend replacing those values with a very small one, like 0.001 or something.
After that you can simply use this line of code to sort your rows:
dataframe = dataframe.sort_values("new_metric", ascending = False)
That way you would end up with the first ID being the best one, the second ID the second best one... etc.
Hope it helps.
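Putting those steps together on the sample data from the question, roughly (the rank column and the safe_time name are my own additions for illustration):

```python
import pandas as pd

# Sample data from the question.
dataframe = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "TIME_ON_PHONE": [1235, 12, 1635, 124, 0],
    "MARGIN": [1256, 124, 0, 652, 4566],
})

# Replace zero durations with a very small value before dividing.
safe_time = dataframe["TIME_ON_PHONE"].astype(float).replace(0, 0.001)
dataframe["new_metric"] = dataframe["MARGIN"] / safe_time

# Rank 1 = best (highest margin per unit of phone time).
dataframe["rank"] = (
    dataframe["new_metric"].rank(ascending=False, method="min").astype(int)
)
dataframe = dataframe.sort_values("new_metric", ascending=False)
print(dataframe[["ID", "new_metric", "rank"]])
```

On this data the order comes out as ID 5, 2, 4, 1, 3, which matches the best and worst accounts named in the question.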
The database I'm using shows the total number of debtors for every town for every quarter.
Since there's 43 towns listed, there's 43 'total debtors' per quarter (30-Sep-17, etc).
My goal is to find the total number of debtors for every quarter (so theoretically, finding the sum of every 43 'total debtors' listed) but I'm not quite sure how.
I've tried using the sum() function, but I'm not sure how to make it add up the totals quarter by quarter only.
Here's what the database looks like and my attempt (I printed the first 50 rows just to provide an idea of what it looks like)
https://i.imgur.com/h1y43j8.png
Sorry in advance if the explanation was a bit unclear.
You should use groupby. It's a nice pandas function to do exactly what you are trying to do. It groups the df according to whatever column you pick.
total_debtors_pq = df.groupby('Quarter end date')['Total number of debtors'].sum()
You can then extract the total for each quarter from total_debtors_pq.
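Since the real data is only visible in the screenshot, here is a tiny sketch with invented numbers to show what the result looks like. The "Town" column name and all the values are assumptions; only the other two column names come from the line above.

```python
import pandas as pd

# Invented data: two towns, two quarters.
df = pd.DataFrame({
    "Town": ["Town A", "Town B", "Town A", "Town B"],
    "Quarter end date": ["30-Sep-17", "30-Sep-17", "31-Dec-17", "31-Dec-17"],
    "Total number of debtors": [10, 20, 15, 25],
})

# One total per quarter; the result is a Series indexed by quarter end date.
total_debtors_pq = df.groupby("Quarter end date")["Total number of debtors"].sum()

# Look up a single quarter by its label.
print(total_debtors_pq.loc["30-Sep-17"])

# Or turn the result back into a regular two-column DataFrame.
totals = total_debtors_pq.reset_index()
print(totals)
```

If the quarter dates are stored as text, the groups are sorted alphabetically rather than chronologically, so converting the column with pd.to_datetime first may be worth it.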