I have a dataframe listing revenue by company and year. See below:
Company | Acc_Name      | Date | Value
A2M     | Sales/Revenue | 2016 | 167770000.0
A2M     | Sales/Revenue | 2017 | 360842000.0
A2M     | Sales/Revenue | 2018 | 68087000.0
A2M     | Sales/Revenue | 2019 | 963000000.0
A2M     | Sales/Revenue | 2020 | 143346000.0
In Python I want to create a new column showing the year-on-year difference, so the 2017 row will show the variance between 2017 and 2016.
I want to run this on a large dataframe with multiple companies.
Here is my solution, which creates a new column with the previous year's value and then simply takes the difference between them:
df["prev_val"] = df["Value"].shift(1) # creates new column with previous year data
df["Difference"] = df["Value"] - df["prev_val"]
Since you want to run this for several companies, make sure you first filter out the other companies:
this_company_df = df[df["Company"] == "A2M"]
and sort the data in ascending order by year:
this_company_df = this_company_df.sort_values(by=["Date"], ascending=True)
So, the final code should look something like this:
this_company_df = df[df["Company"] == "A2M"]
this_company_df = this_company_df.sort_values(by=["Date"], ascending=True)
this_company_df["prev_val"] = this_company_df["Value"].shift(1)
this_company_df["Difference"] = this_company_df["Value"] - this_company_df["prev_val"]
So, the result is stored in the "Difference" column. One more thing you could improve is to handle the initial year, whose difference is NaN, by setting it to 0.
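Putting the suggestions above together, a minimal end-to-end sketch on the question's sample data (using `fillna(0)` for the first year) might look like this:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Company": ["A2M"] * 5,
    "Acc_Name": ["Sales/Revenue"] * 5,
    "Date": [2016, 2017, 2018, 2019, 2020],
    "Value": [167770000.0, 360842000.0, 68087000.0, 963000000.0, 143346000.0],
})

# Filter one company and sort chronologically
this_company_df = df[df["Company"] == "A2M"].sort_values(by=["Date"])

# Previous year's value, then the year-on-year difference
this_company_df["prev_val"] = this_company_df["Value"].shift(1)
this_company_df["Difference"] = (this_company_df["Value"] - this_company_df["prev_val"]).fillna(0)
print(this_company_df[["Date", "Value", "Difference"]])
```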
revenues['Revenue_Change'] = revenues['Value'].diff(periods=1)
is the simplest way to do it. However, since your dataframe contains data for multiple companies, you can use this:
revenues['Revenue_Change'] = revenues.groupby('Company', sort=False)['Value'].diff(periods=1)
This sets the first entry for each company in the set to NaN.
If, by any chance, the dataframe is not in order, you can use
revenues = revenues.sort_values('Company')
groupby will correctly calculate the YoY revenue change even if a company's entries are separated from one another, as long as the revenues are in chronological order within each company.
EDIT:
If everything is out of order, then sort by the year, groupby and then sort by company name:
revenues = revenues.sort_values('Date')
revenues['Revenue_Change'] = revenues.groupby('Company',sort=False)['Value'].diff()
revenues = revenues.sort_values('Company')
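As a runnable sketch of the groupby approach (the second company "XYZ" and its values are made up for illustration):

```python
import pandas as pd

# Two companies, deliberately out of order
revenues = pd.DataFrame({
    "Company": ["A2M", "XYZ", "A2M", "XYZ"],  # "XYZ" is a made-up second company
    "Date": [2017, 2016, 2016, 2017],
    "Value": [360842000.0, 50.0, 167770000.0, 60.0],
})

# Sort by year, diff within each company, then sort by company name
revenues = revenues.sort_values('Date')
revenues['Revenue_Change'] = revenues.groupby('Company', sort=False)['Value'].diff()
revenues = revenues.sort_values('Company')
print(revenues)
```

Each company's first year comes out as NaN, which you could then fill with 0 if desired.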
Short version: I am merging two Excel payroll files to compare this month to the previous month. I want to add a third column that outputs the variance. I'd also like a separate report that excludes numbers that haven't changed from last month. Can anyone help with either?
Longer version: I am looking at ways to speed up payroll checks for a business. We get monthly reports back from an external company, with a staff member per row and about 50 column headers for different details and deductions. Imagine 50 columns of this instead of 5 columns (for hundreds of staff):
Employee Code | Surname | Salary | Pension | Sick Pay
30            | Jones   | 36,000 | 1,800   | 0
31            | Smith   | 46,000 | 2,100   | 150
I am using the code below to combine the current month's payroll report and the previous month's report, and put the relevant columns side by side (_x is the current month value, _y is the prior month value). I can then look at any changes from month to month to check they are correct.
Employee Code | Surname | Salary_x | Salary_y | Pension_x | Pension_y
30            | Jones   | 36,000   | 34,500   | 1,800     | 1,800
31            | Smith   | 46,000   | 46,000   | 2,100     | 2,000
The code is:
import pandas as pd

dfnew = pd.read_csv("sep22.csv")  # this month's payroll report
dfold = pd.read_csv("Aug22.csv")  # last month's
mergedfiles = pd.merge(dfnew, dfold,
                       on='Employee Code',
                       how='left')
# sort by column header
mergedfiles = mergedfiles.reindex(sorted(mergedfiles.columns), axis=1)
# set the index as employee code
mergedfiles = mergedfiles.set_index('Employee Code')
# produce an Excel sheet
mergedfiles.to_excel("new_v_old.xlsx")
This works as far as it goes. But then I have to insert dozens of columns manually to produce a differences column. What I would really like is for it to output a differences column like this:
Employee Code | Surname | Salary_x | Salary_y | Salary_diff | Pens_x | Pens_y | diff
30            | Jones   | 36,000   | 34,500   | 1,500       | 1,800  | 1,800  | 0
31            | Smith   | 46,000   | 46,000   | 0           | 2,100  | 2,000  | 100
Ideally, the difference column would be an Excel formula, but failing that, a hard-typed number would be fine. Can anyone advise on what I could add to the code?
My second question is whether it's possible to have a report that outputs only the numbers where a difference exists. So in the above example, it would show Salary for Jones but not Smith, and Pension for Smith but not Jones. Something like this, either in Jupyter itself or in Excel:
Salary:
Jones 36,000; 34,500; 1,500
Pension:
Smith 2,100; 2,000; 100
Please note that column headers are rather inconsistent from month to month, both in terms of their place within the sheet and whether they are there at all.
For example, if one staff member got a recruitment bonus in August and in September nobody got it, the column would disappear from the September report and all the columns to the right of it would shift left one column compared to August's. (I wish they didn't do it this way!)
Any advice and help appreciated.
This is a good use case for a MultiIndex.
For your first question, use pd.concat. pd.concat([df1, df2, df3, ...], axis=1) is similar to merge, but it aligns only on the index. You can optionally specify keys, which gives the resulting dataframe multi-indexed columns.
For your second question, use style to hide the cells where the delta is 0.
# FIRST QUESTION
cols = ["Salary", "Pension"]
compare = pd.concat([
    # the Info columns: Surname, Given Name, Title, etc.
    df_new.set_index("Employee Code")[["Surname"]],
    # the New columns: Salary & Pension
    df_new.set_index("Employee Code")[cols],
    # the Old columns: Salary & Pension
    df_old.set_index("Employee Code")[cols],
], keys=["Info", "New", "Old"], axis=1)
compare[[("Diff", col) for col in cols]] = compare["New"] - compare["Old"]

# SECOND QUESTION
# Hide cells where the value is 0. Color them green if > 0 and red if < 0. I
# assume you are doing this in a notebook (Jupyter, VS Code, PyCharm, etc.).
# The `styler` function returns a CSS string telling the notebook how to
# render the cell.
def styler(value: float):
    return (
        "visibility: hidden" if value == 0
        else "color: green" if value > 0
        else "color: red"
    )

# Flatten the MultiIndex in the column names
compare.columns = [c[1] if c[0] == "Info" else " ".join(c) for c in compare.columns]
# The column order: Surname, New, Old, Diff
display_cols = ["Surname"] + [f"{j} {i}" for i in cols for j in ["New", "Old", "Diff"]]
# Display
compare[display_cols].style.applymap(styler, subset=["Diff Salary", "Diff Pension"])
Result (in VS Code):
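If you also want a plain report listing only the values that changed (rather than styling them), one sketch is to stack the diff columns into long format and keep the non-zero rows. The `df_new`/`df_old` frames below are made-up stand-ins built from the question's sample table:

```python
import pandas as pd

# Made-up sample data standing in for the two monthly reports
df_new = pd.DataFrame({"Employee Code": [30, 31], "Surname": ["Jones", "Smith"],
                       "Salary": [36000, 46000], "Pension": [1800, 2100]})
df_old = pd.DataFrame({"Employee Code": [30, 31], "Surname": ["Jones", "Smith"],
                       "Salary": [34500, 46000], "Pension": [1800, 2000]})

cols = ["Salary", "Pension"]
new = df_new.set_index("Employee Code")
old = df_old.set_index("Employee Code")
diff = new[cols] - old[cols]

# Long format: one row per (employee, column), keeping only real changes
changes = diff.stack()
changes = changes[changes != 0].rename("Change").reset_index()
changes.columns = ["Employee Code", "Item", "Change"]
changes = changes.merge(new[["Surname"]].reset_index(), on="Employee Code")
print(changes)
```

This prints one line per changed value (Jones's salary, Smith's pension), which you could write to Excel with `to_excel` like the merged file.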
Looking to clean multiple data sets in a more automated way. The current format has one row per year, with a column per month holding the number values.
Below is an example of the current format, the original data has multiple years/months.
Current Format:
Year | Jan | Feb
2022 | 300 | 200
Below is an example of how I would like the new format to look. It combines month and year into one column and transposes the numbers into another column.
How would I go about doing this in Excel or Python? I have files with many years and multiple months.
New Format:
Date    | Number
2022-01 | 300
2022-02 | 200
Check the solution below. You will need to extend month_df to cover all twelve months; it currently only covers the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})

melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')
merged = pd.merge(melted_df, month_df, on='Char_Month')
merged.assign(Date=merged['Year'].astype(str) + '-' + merged['Int_Month'])[['Date', 'Number']]
Output:
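An alternative sketch that avoids maintaining a month-lookup table altogether is to parse the three-letter month names directly with pd.to_datetime (this assumes English month abbreviations like "Jan"/"Feb"):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})

# Melt all month columns at once, then parse "2022-Jan" with %b
long = df.melt(id_vars='Year', var_name='Month', value_name='Number')
long['Date'] = pd.to_datetime(
    long['Year'].astype(str) + '-' + long['Month'], format='%Y-%b'
).dt.strftime('%Y-%m')
result = long[['Date', 'Number']]
print(result)
```

Because `melt` picks up every non-id column by default, this scales to files with many years and all twelve months without listing them.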
I have two dataframes that I need to relate.
The first one, HOLIDAYS, gives me local holiday dates and the codes of the stores in which they're celebrated:
Holiday date | Store code
01/02        | 18005
01/02        | 18032
...          | ...
31/03        | 18043
The second one, BALANCE, shows store balances on certain dates, with date and store code as the index:
Date  | Store | balance
01/02 | 18001 | $35,00
01/02 | 18002 | $38,00
...   | ...   | ...
31/03 | 18099 | $20,45
What I need to do is create a column in BALANCE named Holiday with a boolean value showing whether a given row's balance was obtained during a holiday.
I tried creating the 'Holiday' column with an initial value of False and then setting it to True for every (date, store) pair of HOLIDAYS present in BALANCE's index, but I'm getting a ValueError (possibly because a dataframe cannot be passed as an index of another). I tried converting HOLIDAYS to a MultiIndex, but again it's not working.
BALANCE['Holiday'] = False
H = pd.MultiIndex.from_frame(HOLIDAY)
BALANCE.loc[H, 'Holiday'] = True
I'm pretty sure this should not be difficult, but I'm out of ideas now. Is there any way I could use the first dataframe as a MultiIndex into the second?
Your example doesn't have any rows which match, but this should work:
HOLIDAYS['is_holiday'] = True
res = pd.merge(BALANCE,
               HOLIDAYS,
               how='left',
               left_index=True,
               right_on=['Holiday_date', 'Store_code'])
res['is_holiday'] = res['is_holiday'].fillna(False)
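Alternatively, since BALANCE is already indexed by (date, store), `Index.isin` accepts a MultiIndex directly, which is close to your original attempt. A self-contained sketch with made-up values (column names `Holiday_date`/`Store_code` assumed):

```python
import pandas as pd

# Made-up sample data in the shape described in the question
HOLIDAYS = pd.DataFrame({"Holiday_date": ["01/02", "01/02"],
                         "Store_code": [18005, 18032]})
BALANCE = pd.DataFrame(
    {"balance": ["$35,00", "$38,00"]},
    index=pd.MultiIndex.from_tuples([("01/02", 18001), ("01/02", 18005)],
                                    names=["Date", "Store"]),
)

# Build a MultiIndex of holiday (date, store) pairs and test membership
holiday_idx = pd.MultiIndex.from_frame(HOLIDAYS)
BALANCE["Holiday"] = BALANCE.index.isin(holiday_idx)
print(BALANCE)
```

Unlike `.loc` with a MultiIndex, `isin` does not raise when a holiday pair is absent from BALANCE; it simply marks the matching rows True.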
I'm trying to show the sales numbers of the last 5 years of a dataframe in additional columns, so I can see the items sold per year for each of the last 5 years.
Currently my code looks like this:
import pandas as pd
data = [
[1,'Apples','2017-02-23',10,0.4],
[2,'Oranges','2017-03-06',20,0.7],
[1,'Apples','2017-09-23',8,0.5],
[1,'Apples','2018-05-14',14,0.5],
[1,'Apples','2019-04-27',7,0.6],
[2,'Apples','2018-09-10',14,0.4],
[1,'Oranges','2018-07-12',9,0.7],
[1,'Oranges','2018-12-07',4,0.7]]
df = pd.DataFrame(data, columns = ['CustomerID','Product','Invoice Date','Amount','Price'])
df['Invoice Date'] = pd.to_datetime(df['Invoice Date']).dt.strftime('%Y')
grpyear = df.groupby(['CustomerID','Product','Invoice Date','Price'])
grpyear[['Amount']].sum()
How can I get the years to show in columns looking like this:
Customer ID | Product | Amount in 2017 | Amount in 2018 | etc.
I think you did not mean to group by Price. Please correct me if I'm wrong, though.
In order to get the dataset you asked for:
# Removed `Price` from the group keys
grpyear = df.groupby(['CustomerID', 'Product', 'Invoice Date'])
# Sum amounts by group
grpyear = grpyear[['Amount']].sum()
# Pivot the result and fill NAs with 0
grpyear.reset_index().pivot(index=['CustomerID', 'Product'],
                            columns=['Invoice Date'],
                            values=['Amount']).fillna(0)
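The group-sum-then-pivot steps can also be collapsed into a single `pivot_table` call. A runnable sketch on the question's sample data:

```python
import pandas as pd

data = [
    [1, 'Apples', '2017-02-23', 10, 0.4],
    [2, 'Oranges', '2017-03-06', 20, 0.7],
    [1, 'Apples', '2017-09-23', 8, 0.5],
    [1, 'Apples', '2018-05-14', 14, 0.5],
    [1, 'Apples', '2019-04-27', 7, 0.6],
    [2, 'Apples', '2018-09-10', 14, 0.4],
    [1, 'Oranges', '2018-07-12', 9, 0.7],
    [1, 'Oranges', '2018-12-07', 4, 0.7],
]
df = pd.DataFrame(data, columns=['CustomerID', 'Product', 'Invoice Date', 'Amount', 'Price'])
df['Year'] = pd.to_datetime(df['Invoice Date']).dt.year

# One column per year, summing amounts; missing years become 0
wide = df.pivot_table(index=['CustomerID', 'Product'], columns='Year',
                      values='Amount', aggfunc='sum', fill_value=0)
print(wide)
```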
I am working on imputing NaNs of rows based on certain columns. So my dataframe looks something like this:
Product
Store Name
January Sales
February Sales
March Sales
For example, January Sales might be NaN for a given combination of Product and Store Name, and I am imputing it based on the averages of the other months. Other columns in the same row, such as February Sales, might also be NaN.
The code that I used was:
indexes = df.index[df['January Sales'].isna()].to_list()
fillCols = df.iloc[:, 3:]
df.loc[indexes, 'January Sales'].fillna(fillCols.mean(axis=0), inplace=True)
But the above code doesn't seem to work: it won't impute the data, even though the individual pieces work when run separately. How can I solve this?
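The chained call is the likely culprit: `df.loc[indexes, 'January Sales']` returns a copy, so `fillna(..., inplace=True)` modifies that copy and the result is discarded. A hedged sketch, assuming the intent is to fill each missing January value with the mean of that row's other month columns (the sample frame below is made up):

```python
import pandas as pd
import numpy as np

# Made-up sample data in the shape described in the question
df = pd.DataFrame({
    "Product": ["A", "B"],
    "Store Name": ["S1", "S2"],
    "January Sales": [np.nan, 100.0],
    "February Sales": [40.0, np.nan],
    "March Sales": [60.0, 80.0],
})

# Mean of the other month columns, computed per row (axis=1 skips NaNs)
row_means = df.iloc[:, 3:].mean(axis=1)

# Assign back to the original frame instead of calling fillna on a slice
df["January Sales"] = df["January Sales"].fillna(row_means)
print(df)
```

Note that the original code's `fillCols.mean(axis=0)` computes column means; `axis=1` gives per-row means, which matches "imputing based on other months' averages" for each Product/Store row.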