Calculate tax liabilities based on a marginal tax rate schedule - python

The question income tax calculation python asks how to calculate taxes given a marginal tax rate schedule, and its answer provides a function that works (below).
However, it works only for a single income value. How would I adapt it to work for a list, NumPy array, or pandas Series of income values? That is, how do I vectorize this code?
from bisect import bisect

rates = [0, 10, 20, 30]  # 10%  20%  30%
brackets = [10000,       # first 10,000
            30000,       # next 20,000
            70000]       # next 40,000
base_tax = [0,           # 10,000 * 0%
            2000,        # 20,000 * 10%
            10000]       # 40,000 * 20% + 2,000

def tax(income):
    i = bisect(brackets, income)
    if not i:
        return 0
    rate = rates[i]
    bracket = brackets[i - 1]
    income_in_bracket = income - bracket
    tax_in_bracket = income_in_bracket * rate / 100
    total_tax = base_tax[i - 1] + tax_in_bracket
    return total_tax
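As a quick sanity check of the scalar function against the schedule above (my own worked examples, not part of the original answer):

print(tax(5000))    # 0       -- below the first bracket
print(tax(25000))   # 1500.0  -- 0 + (25000 - 10000) * 10%
print(tax(100000))  # 19000.0 -- 10000 + (100000 - 70000) * 30%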

The method below implements the vectorized marginal tax calculation using only NumPy.
import numpy as np

def tax(incomes, bands, rates):
    # Broadcast incomes so that we can compute an amount per income, per band
    incomes_ = np.broadcast_to(incomes, (bands.shape[0] - 1, incomes.shape[0]))
    # Find the amount that falls within each band for each income
    amounts_in_bands = np.clip(incomes_.transpose(),
                               bands[:-1], bands[1:]) - bands[:-1]
    # Calculate tax per band
    taxes = rates * amounts_in_bands
    # Sum the tax over bands for each income
    return taxes.sum(axis=1)
For usage, bands should include the upper limit of the top band (np.inf here); in my view this makes it more explicit.
import pandas as pd

incomes = np.array([0, 7000, 14000, 28000, 56000, 77000, 210000])
bands = np.array([0, 12500, 50000, 150000, np.inf])
rates = np.array([0, 0.2, 0.4, 0.45])

df = pd.DataFrame()
df['pre_tax'] = incomes
df['post_tax'] = incomes - tax(incomes, bands, rates)
print(df)
Output:
   pre_tax  post_tax
0        0       0.0
1     7000    7000.0
2    14000   13700.0
3    28000   24900.0
4    56000   46100.0
5    77000   58700.0
6   210000  135500.0
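As a hand check on one row: for an income of 77,000 the tax is 0 * 12,500 + 0.2 * 37,500 + 0.4 * 27,000 = 18,300, so post_tax = 77,000 - 18,300 = 58,700, matching the table.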

Two data frames are created: one for the tax parameters and one for the incomes.
For each income, we get the index of the corresponding row in the tax table, using the searchsorted method.
With those indexes we select rows from the tax table (df_tax.loc[rows]) and concatenate them with the income table,
then calculate the taxes and drop the unnecessary columns.
import numpy as np, pandas as pd

# Test data:
df = pd.DataFrame({"name": ["Bob", "Julie", "Mary", "John", "Bill", "George", "Andie"],
                   "income": [0, 9_000, 10_000, 11_000, 30_000, 69_999, 200_000]})
OUT:
     name  income
0     Bob       0
1   Julie    9000
2    Mary   10000
3    John   11000
4    Bill   30000
5  George   69999
6   Andie  200000
df_tax=pd.DataFrame({"brackets": [0, 10_000, 30_000, 70_000 ], # lower limits
"rates": [0, .10, .20, .30 ],
"base_tax": [0, 0, 2_000, 10_000 ]} )
rows= df_tax["brackets"].searchsorted(df["income"], side="right") - 1 # aka bisect()
OUT:
[0 0 1 1 2 2 3]
df = pd.concat([df, df_tax.loc[rows].reset_index(drop=True)], axis=1)
df["total_tax"] = df["income"].sub(df["brackets"]).mul(df["rates"]).add(df["base_tax"])
OUT:
     name  income  brackets  rates  base_tax  total_tax
0     Bob       0         0    0.0         0        0.0
1   Julie    9000         0    0.0         0        0.0
2    Mary   10000     10000    0.1         0        0.0
3    John   11000     10000    0.1         0      100.0
4    Bill   30000     30000    0.2      2000     2000.0
5  George   69999     30000    0.2      2000     9999.8
6   Andie  200000     70000    0.3     10000    49000.0
df = df.reindex(columns=["name", "income", "total_tax"])
OUT:
     name  income  total_tax
0     Bob       0        0.0
1   Julie    9000        0.0
2    Mary   10000        0.0
3    John   11000      100.0
4    Bill   30000     2000.0
5  George   69999     9999.8
6   Andie  200000    49000.0
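As a hand check on one row: George's income of 69,999 falls in the 30,000 bracket (rate 0.2, base_tax 2,000), so total_tax = (69,999 - 30,000) * 0.2 + 2,000 = 9,999.8, matching the table.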
Edit:
At the beginning, you can calculate the base_tax column, too, instead of hard-coding it:
df_tax["base_tax"] = (df_tax.brackets
                      .sub(df_tax.brackets.shift(fill_value=0))
                      .mul(df_tax.rates.shift(fill_value=0))
                      .cumsum())
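As a quick check (my own verification snippet), this reproduces the hand-coded base_tax column from the sample schedule:

df_tax = pd.DataFrame({"brackets": [0, 10_000, 30_000, 70_000],
                       "rates": [0, .10, .20, .30]})
df_tax["base_tax"] = (df_tax.brackets
                      .sub(df_tax.brackets.shift(fill_value=0))
                      .mul(df_tax.rates.shift(fill_value=0))
                      .cumsum())
print(df_tax.base_tax.tolist())  # [0.0, 0.0, 2000.0, 10000.0]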

One (probably inefficient) way is to use a list comprehension:
def tax_multiple(incomes):
    return [tax(income) for income in incomes]
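np.vectorize offers similar convenience with array output, though under the hood it is also a Python-level loop, not true vectorization. A sketch (otypes pins the result dtype to float, since tax() returns the integer 0 below the first bracket):

tax_vec = np.vectorize(tax, otypes=[float])
tax_vec([9_000, 25_000, 100_000])  # array([    0.,  1500., 19000.])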

Adapting kantal's answer to run as a function:
def income_tax(income, brackets, rates):
    df_tax = pd.DataFrame({'brackets': brackets, 'rates': rates})
    df_tax['base_tax'] = (df_tax.brackets
                          .sub(df_tax.brackets.shift(fill_value=0))
                          .mul(df_tax.rates.shift(fill_value=0))
                          .cumsum())
    rows = df_tax.brackets.searchsorted(income, side='right') - 1
    income_bracket_df = df_tax.loc[rows].reset_index(drop=True)
    return (pd.Series(income)
            .sub(income_bracket_df.brackets)
            .mul(income_bracket_df.rates)
            .add(income_bracket_df.base_tax))
e.g.:
income = [0, 9_000, 10_000, 11_000, 30_000, 69_999, 200_000]
brackets = [0, 10_000, 30_000, 70_000] # Lower limits.
rates = [0, .10, .20, .30]
income_tax(income, brackets, rates).tolist()
# [0.0, 0.0, 0.0, 100.0, 2000.0, 9999.8, 49000.0]

Related

How to do math operations on a dataframe with an undefined number of columns?

I have a data frame in which there is an indefinite number of columns, to be defined later. Like this:

index   GDP   2004  2005  ...
brasil  1000  0.10  0.10  ...
china   1000  0.15  0.10  ...
india   1000  0.05  0.10  ...

df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})
The GDP column is the initial GDP, and the columns from 2004 onwards are floats representing percentages: the GDP growth in each year.
Using the percentages to get the absolute GDP in each year, based on the initial GDP, I need a dataframe like this:
index   GDP   2004  2005
brasil  1000  1100  1210
china   1000  1150  1265
india   1000  1050  1155
I tried to use itertuples, df.columns, and for loops, but I'm probably missing something.
Remember that there is an indefinite number of columns.
Thank you very much in advance!
My answer is a combination of Wardy's and user19*'s.
Starting with...
df = pd.DataFrame(data={'GDP': [1000, 1000, 1000],
                        '2004': [0.10, 0.15, 0.5],
                        '2005': [0.10, 0.10, 0.10],
                        'index': ['brasil', 'china', 'india']})
Find the percentage columns and make sure they are in the right order.
columns_of_interest = sorted(c for c in df.columns if c not in ['GDP', 'index'])
Now we calculate...
running_GDP = df['GDP'].copy()  # starting value (copy, so the GDP column itself is not modified in place)
for column in columns_of_interest:
    running_GDP *= 1.0 + df[column]
    df[column] = running_GDP
This results in
    GDP    2004    2005   index
0  1000  1100.0  1210.0  brasil
1  1000  1150.0  1265.0   china
2  1000  1500.0  1650.0   india
A simple way is to count the columns and loop over them by position (this assumes the growth columns are contiguous, in order, and start at column index 2):
num = df.shape[1]
start = 2
for idx in range(start, num):
    df.iloc[:, idx] = df.iloc[:, idx - 1] * (1 + df.iloc[:, idx])
print(df)
which gives
    index   GDP    2004    2005
0  brasil  1000  1100.0  1210.0
1   china  1000  1150.0  1265.0
2   india  1000  1050.0  1155.0
You can use df.columns to access a list of the dataframe's columns.
Then you can loop over all of these column names. Here is an example with your data frame where I multiplied every value by 2. If you want to do different operations on different columns, you can add conditions inside the loop.
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.5],
                   '2005': [0.10, 0.10, 0.10]})

for colName in df.columns:
    df[colName] *= 2
print(df)
this returns...
          index   GDP  2004  2005
0  brasilbrasil  2000   0.2   0.2
1    chinachina  2000   0.3   0.2
2    indiaindia  2000   1.0   0.2
Hope this helps!
Add one to the percentages and calculate the cumulative product:
q = (df.iloc[:, 2:] + 1).cumprod(axis=1)
Then multiply by the beginning GDP:
q = q.mul(df['GDP'], axis='index')
If you are trying to change the original DataFrame, assign the result:
df.iloc[:, 2:] = q
If you want to make a new DataFrame, concatenate the result with the first columns of the original:
new = pd.concat([df.iloc[:, :2], q], axis=1)
You can put the first two lines together if you want:
q = (df.iloc[:, 2:] + 1).cumprod(axis=1).mul(df.GDP, axis='index')
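Putting those pieces together on the question's sample data (a self-contained sketch of the steps above):

import pandas as pd

df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})

# cumulative growth factors, scaled by the starting GDP
q = (df.iloc[:, 2:] + 1).cumprod(axis=1).mul(df['GDP'], axis='index')
new = pd.concat([df.iloc[:, :2], q], axis=1)
print(new)
#     index   GDP    2004    2005
# 0  brasil  1000  1100.0  1210.0
# 1   china  1000  1150.0  1265.0
# 2   india  1000  1050.0  1155.0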

Discretization: converting continuous values into a certain number of categories

1. Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized (equal-width) categories. The names of the categories should be Low, Medium, and High.
2. Group by Usage_Per_Year and print the group sizes as well as the range of each group.
3. Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.
4. Group by Usage_Per_Year and print the group sizes as well as the range of each group.
My code is below:
df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
The output is below:
                     Usage_Per_Year     0
Low         (-1925.883, 663476.235]  6018
Medium   (663476.235, 1326888.118]      0
High      (1326888.118, 1990300.0]      1

                     Usage_Per_Year     0
Low         (-1925.883, 663476.235]  6018
Medium   (663476.235, 1326888.118]      0
High      (1326888.118, 1990300.0]      1
but -1925.883 is wrong...
The right answer should show the actual range of the data in each group.
How can I do that?
Maybe a typo on line 1: df["Usage_Per_Year "]? There is a space at the end of the column name.
pd.cut bins values into equal-width intervals; that's why all of your bins span the same range. It seems that you should compute the min and max of each group after binning.
Also, to bin values with equal frequency, you should use pd.qcut.
Example input:
import numpy as np
import pandas as pd

rng = np.random.default_rng(20210514)
df = pd.DataFrame({
    'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})
# 1
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                              bins=3, labels=group_label)
# 2
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
# 3
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                               q=3, labels=group_label)
# 4
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
Example output:
               Miles_Driven_Per_Year
                               count    min    max
Usage_Per_Year
Low                              878     31  20905
Medium                           107  20955  41196
High                              15  41991  62668
               Miles_Driven_Per_Year
                               count    min    max
Usage_Per_Year
Low                              334     31   4378
Medium                           333   4449  11424
High                             333  11442  62668

Binning a column in a DataFrame into 10 percentiles

I am looking to qcut or cut my "Amount" column into bins of 10 percentiles: basically the describe() feature, but with 0-10%, 11-20%, 21-30%, 31-40%, 41-50%, 51-60%, 61-70%, 71-80%, 81-90%, 91-100% instead.
After the binning I'd like to create a column that shows 1-10, indicating the bin that a particular amount is part of.
I've tried using the code below; however, I do not believe it's achieving what I want.
groups = df.groupby(pd.cut(df['Amount'], 10)).size()
Here is my DataFrame!
df.shape
Out[5]: (1385, 2)
df.head(10)
Out[6]:
   Amount New or Repeat Customer
0   23044                    New
1   15509                    New
2    6184                    New
3    6184                    New
4    5828                    New
5    5461                    New
6    5143                    New
7    5027                    New
8    4992                    New
9    4698                 Repeat
Use pd.qcut:
import numpy as np
import pandas as pd

# Sample data
size = 100
df = pd.DataFrame({
    'Amount': np.random.randint(5000, 20000, size),
    'CustomerType': np.random.choice(['New', 'Repeat'], size)
})

# Binning
labels = ['0% to 10%'] + [f'{i+1}% to {i+10}%' for i in range(10, 100, 10)]
df['Bin'] = pd.qcut(df['Amount'], 10, labels=labels)
Result:
   Amount CustomerType          Bin
0   15597       Repeat   61% to 70%
1   14498          New   51% to 60%
2    6373       Repeat    0% to 10%
3    9901       Repeat   21% to 30%
4   18450       Repeat  91% to 100%
5    9337       Repeat   21% to 30%
6   19310       Repeat  91% to 100%
7   11198          New   31% to 40%
8   12485          New   41% to 50%
9   11130          New   31% to 40%
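The asker also wanted a column showing 1-10 for the bin. One minimal way to add it (my addition; pd.qcut's documented labels=False option returns integer bin codes 0-9, and BinNumber is a hypothetical column name):

df['BinNumber'] = pd.qcut(df['Amount'], 10, labels=False) + 1  # 1 = lowest decile, 10 = highest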

Calculating the proportional weighted value in a specific segment: Make it more Pythonic

I have to make the following calculation (or something similar) many times in my code, and it takes a long time to run. I was wondering whether it is possible to make the code more Pythonic (and reduce the running time).
I am calculating the weighting of each loan_size proportional to all other loans that have the same origination month:
loan_plans['weighting'] = loan_plans.loan_size / loan_plans.apply(
    lambda S: loan_plans.loc[loan_plans.origination_month == S.origination_month,
                             'loan_size'].sum(), axis=1)
The following is a set of example data with the desired result:
loan_size  origination_month  weighting
1000       01-2018            0.25
2000       02-2018            0.2
3000       01-2018            0.75
8000       02-2018            0.8
Update (per OP update):
There's nothing wrong with your approach; you might use groupby instead to get origination_month sums, and then do the weighting:
loan_plans = loan_plans.reset_index().merge(
    loan_plans.groupby("origination_month").loan_size.sum().reset_index(),
    on="origination_month")
loan_plans["weighting"] = loan_plans.loan_size_x / loan_plans.loan_size_y
loan_plans.sort_values("index").set_index("index")
       loan_size_x origination_month  loan_size_y  weighting
index
0             1000           01-2018         4000       0.25
1             2000           02-2018        10000       0.20
2             3000           01-2018         4000       0.75
3             8000           02-2018        10000       0.80
Cosmetics:
(loan_plans
 .sort_values("index")
 .set_index("index")
 .rename(columns={"loan_size_x": "loan_size"})
 .drop(columns="loan_size_y"))
       loan_size origination_month  weighting
index
0           1000           01-2018       0.25
1           2000           02-2018       0.20
2           3000           01-2018       0.75
3           8000           02-2018       0.80
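For what it's worth, groupby().transform('sum') can express the same weighting in one step, without the merge and the column cleanup (a compact sketch of the same idea, not the answer's original code):

loan_plans['weighting'] = (loan_plans.loan_size
                           / loan_plans.groupby('origination_month').loan_size.transform('sum'))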
Earlier answer
You can use div and sum, no need for apply:
loan_plans.loan_size.div(
    loan_plans.loc[loan_plans.loan_number.eq(1), "loan_size"].sum()
)
Output:
0 0.024714
1 0.053143
2 0.012143
3 0.010929
4 0.039643
...
Data:
N = 100
data = {"loan_size": np.random.randint(100, 1000, size=N),
        "loan_number": np.random.binomial(n=1, p=.3, size=N)}
loan_plans = pd.DataFrame(data)

how to subtract within pandas dataframe

I have a question on arithmetic within a dataframe. Please note that each of the columns below is based on the others, except for 'holdings'.
Here is a shortened version of my dataframe:

holdings     cash    total
     0.0  10000.0  10000.0
     0.0  10000.0  10000.0
    1000   9000.0  10000.0
    1500  10000.0  11500.0
    2000  10000.0  12000.0
initial_cap = 10000.0
But here is my problem... the first time I have holdings, the cash is calculated correctly: cash of 10000.0 - holdings of 1000.0 = 9000.0.
I need cash to remain at 9000.0 until my holdings go back to 0.0 again.
Here is my calculation:
cash = initial_cap - holdings
In other words, how would you calculate cash so that it remains at 9000.0 until holdings goes back to 0.0?
Here is how I want it to look:

holdings     cash    total
     0.0  10000.0  10000.0
     0.0  10000.0  10000.0
    1000   9000.0  10000.0
    1500   9000.0  10500.0
    2000   9000.0  11000.0
So let me try to rephrase: you start with initial capital 10 and a given sequence of holdings {0, 0, 1, 1.5, 2}, and you want to create a cash variable that is 10 whenever holdings is 0. As soon as holdings increases in an initial period by x, you want cash to be 10 - x until holdings equals 0 again.
If this is correct, this is what I would do (the logic of total and all of this is still unclear to me, but this is what you added at the end, so I focus on that).
PS: Providing code to create your sample data is considered nice.
df = pd.DataFrame([0, 1, 2, 2, 0, 2, 3, 3], columns=['holdings'])
x = 10
# triggers are the rows where cash is supposed to be zero
triggers = df['holdings'] == 0
# inits are the rows where holdings change for the first time after a trigger
inits = df.index[triggers].values + 1

df['cash'] = 0
for i in inits:
    df.loc[i:, 'cash'] = x - df['holdings'][i]
df.loc[triggers, 'cash'] = 0
df
Out[339]:
   holdings  cash
0         0     0
1         1     9
2         2     9
3         2     9
4         0     0
5         2     8
6         3     8
7         3     8
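For reference, the same cash column can be built without the explicit loop. This is my own sketch of the idea, not the answerer's code; after_zero is a hypothetical helper name:

x = 10
# pick out the holdings value immediately after each holdings == 0 row, then forward-fill it
after_zero = df['holdings'].where(df['holdings'].shift(fill_value=0) == 0)
df['cash'] = (x - after_zero.ffill()).where(df['holdings'] != 0, 0)  # zero on the trigger rows
# gives the same cash column: [0, 9, 9, 9, 0, 8, 8, 8]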
