I am trying to model different scenarios for groups of assets in future years. This is something I have accomplished very tediously in Excel, but want to leverage the large database I have built with Pandas.
Example:
annual_group_cost = 0.02
df1:
year group x_count y_count value
2018 a 2 5 109000
2019 a 0 4 nan
2020 a 3 0 nan
2018 b 0 0 55000
2019 b 1 0 nan
2020 b 1 0 nan
2018 c 5 1 500000
2019 c 3 0 nan
2020 c 2 5 nan
df2:
group x_benefit y_cost individual_avg starting_value
a 0.2 0.72 1000 109000
b 0.15 0.75 20000 55000
c 0.15 0.70 20000 500000
I would like to update the values in df1 by taking the previous year's value (or the starting value), adding the x benefit, and subtracting the y cost and the annual group cost. I am assuming this will take a function to accomplish, but I don't know of an efficient way to handle it.
The final output I would like to have is:
df1:
year group x_count y_count value
2018 a 2 5 103620
2019 a 0 4 98667.3
2020 a 3 0 97294.248
2018 b 0 0 53900
2019 b 1 0 56822
2020 b 1 0 59685.56
2018 c 5 1 495000
2019 c 3 0 497100
2020 c 2 5 420158
I achieved this by using:
starting_value-(starting_value*annual_group_cost)+(x_count*(individual_avg*x_benefit))-(y_count*(individual_avg*y_cost))
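For the first row (group a, 2018) that works out to 109000 - (109000*0.02) + 2*(1000*0.2) - 5*(1000*0.72) = 106820 + 400 - 3600 = 103620, which matches the expected output above.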
Since each new value depends on the previously calculated value, this will need to involve a for loop (even if it happens behind the scenes, e.g. via apply):
for i in range(1, len(df1)):
    if np.isnan(df1.loc[i, 'value']):
        df1.loc[i, 'value'] = df1.loc[i-1, 'value']  # your logic here
You should merge the two tables together and then apply the calculation to the resulting Series:
hold = df1.merge(df2, on=['group']).fillna(0)
x = (hold.x_count * (hold.individual_avg * hold.x_benefit))
y = (hold.y_count * (hold.individual_avg * hold.y_cost))
for year in hold.year.unique():
    start = hold.loc[hold.year == year, 'starting_value']
    hold.loc[hold.year == year, 'value'] = (start - (start * annual_group_cost) + x - y)
    if year != hold.year.max():
        # carry this year's result forward as the next year's starting value
        hold.loc[hold.year == year + 1, 'starting_value'] = hold.loc[hold.year == year, 'value'].values
hold.drop(['x_benefit', 'y_cost', 'individual_avg', 'starting_value'], axis=1)
Will give you
year group x_count y_count value
0 2018 a 2 5 103620.0
1 2019 a 0 4 98667.6
2 2020 a 3 0 97294.25
3 2018 b 0 0 53900.0
4 2019 b 1 0 55822.0
5 2020 b 1 0 57705.56
6 2018 c 5 1 491000.0
7 2019 c 3 0 490180.0
8 2020 c 2 5 416376.4
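If you would rather keep the year-over-year recurrence explicit per group instead of looping over years, a small helper applied with groupby works too. This is only a sketch under the same assumptions (rows sorted by year within each group, one row per group per year); the roll_forward name and the merged variable are made up for illustration:
import pandas as pd

annual_group_cost = 0.02

def roll_forward(g):
    # walk one group's rows in year order, carrying the computed value forward
    vals = []
    prev = g['starting_value'].iloc[0]
    for _, row in g.iterrows():
        prev = (prev - prev * annual_group_cost
                + row['x_count'] * row['individual_avg'] * row['x_benefit']
                - row['y_count'] * row['individual_avg'] * row['y_cost'])
        vals.append(prev)
    return pd.Series(vals, index=g.index)

merged = df1.merge(df2, on='group').sort_values(['group', 'year'])
merged['value'] = merged.groupby('group', group_keys=False).apply(roll_forward)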
I am using Python 3.9 in PyCharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have data available for the whole period. In other words, I would like to filter the rows so that I only keep ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter based on set equality between the years you want and the years available for each particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and transform to count the year entries for each id and compare that count against the number of distinct years in the year column.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
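The intermediate wide table (values taken from the sample data) makes it clear why dropna keeps only A and B; C has no 2019 entry:
>>> df.pivot(index='id', columns='year', values='gdp')
year  2019  2020  2021
id
A      3.0   0.0   5.0
B      4.0   2.0   1.0
C      NaN   5.0   4.0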
I have a dataframe as shown below
ID,Region,Supplier,year,output
1,Test,Test1,2021,1
2,dummy,tUMMY,2022,1
3,dasho,MASHO,2022,1
4,dahp,ZYZE,2021,0
5,delphi,POQE,2021,1
6,kilby,Daasan,2021,1
7,sarby,abbas,2021,1
df = pd.read_clipboard(sep=',')
My objective is to compare the two column values (Region and Supplier) and assign a similarity score.
So, I tried the below:
import difflib
[(len(difflib.get_close_matches(x, df['Region'], cutoff=0.6))>1)*1
for x in df['Supplier']]
However, this gives '0' for every row, meaning everything falls below the cut-off value of 0.6.
Instead, I expect my output to look like the one shown below.
Converting each column to lower case and making the comparison >= rather than > (since there is at most one match in this example) fetches the desired output:
from difflib import SequenceMatcher, get_close_matches

regions = df['Region'].str.lower().tolist()
# best close match per supplier, or '' when nothing clears difflib's default cutoff of 0.6
df['best_match'] = [next(iter(get_close_matches(s, regions)), '') for s in df['Supplier'].str.lower()]
df['similarity_score'] = df.apply(lambda x: SequenceMatcher(None, x['Supplier'].lower(), x['best_match']).ratio(), axis=1)
df = df.assign(similarity_flag=df['similarity_score'].gt(0.6).astype(int)).drop(columns=['best_match'])
Output:
ID Region Supplier year output similarity_score similarity_flag
0 1 Test Test1 2021 1 0.888889 1
1 2 dummy tUMMY 2022 1 0.800000 1
2 3 dasho MASHO 2022 1 0.800000 1
3 4 dahp ZYZE 2021 0 0.000000 0
4 5 delphi POQE 2021 1 0.000000 0
5 6 kilby Daasan 2021 1 0.000000 0
6 7 sarby abbas 2021 1 0.000000 0
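As a quick sanity check on the first row, SequenceMatcher finds 4 matching characters out of the 9 total characters in 'test' and 'test1', hence 2*4/9 = 0.888889:
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, 'test', 'test1').ratio()
0.8888888888888888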
Updated answer with similarity flag and score (using difflib.SequenceMatcher)
cutoff = 0.6
df['similarity_score'] = (
    df[['Region', 'Supplier']]
    .apply(lambda x: difflib.SequenceMatcher(None, x['Region'].lower(), x['Supplier'].lower()).ratio(), axis=1)
)
df['similarity_flag'] = (df['similarity_score'] >= cutoff).astype(int)
Output:
ID Region Supplier year output similarity_score similarity_flag
0 1 Test Test1 2021 1 0.888889 1
1 2 dummy tUMMY 2022 1 0.800000 1
2 3 dasho MASHO 2022 1 0.800000 1
3 4 dahp ZYZE 2021 0 0.000000 0
4 5 delphi POQE 2021 1 0.200000 0
5 6 kilby Daasan 2021 1 0.000000 0
6 7 sarby abbas 2021 1 0.200000 0
Try using apply with lambda and axis=1:
df['similarity_flag'] = (
    df[['Region', 'Supplier']]
    .apply(lambda x: len(difflib.get_close_matches(x['Region'].lower(), [x['Supplier'].lower()])), axis=1)
)
Output:
ID Region Supplier year output similarity_flag
0 1 Test Test1 2021 1 1
1 2 dummy tUMMY 2022 1 1
2 3 dasho MASHO 2022 1 1
3 4 dahp ZYZE 2021 0 0
4 5 delphi POQE 2021 1 0
5 6 kilby Daasan 2021 1 0
6 7 sarby abbas 2021 1 0
I have a pandas dataframe like so:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2020 x 10.0
1 2020 y NA
Now, I want to groupby id and date, and for each group add 3 more variables a, b, c with random values such that a+b+c=1.0 and a>b>c.
So my final dataframe will be something like this:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2019 a 0.49
1 2019 b 0.315
1 2019 c 0.195
1 2020 x 10.0
1 2020 y NA
1 2020 a 0.55
1 2020 b 0.40
1 2020 c 0.05
Update
It's possible without a loop and without appending dataframes.
d = df.groupby(['date', 'id', 'variable'])['value'].mean().unstack('variable').reset_index()
x = np.random.random((len(d), 3))
x /= x.sum(1)[:, None]      # normalise each row so the three values sum to 1
x[:, ::-1].sort()           # sorts each row descending in place (by sorting the reversed view)
d[['a', 'b', 'c']] = pd.DataFrame(x)
pd.melt(d, id_vars=['date', 'id']).sort_values(['date', 'id']).reset_index(drop=True)
Output
date id variable value
0 2019 1 x 100.000000
1 2019 1 y 50.500000
2 2019 1 a 0.367699
3 2019 1 b 0.320325
4 2019 1 c 0.311976
5 2020 1 x 10.000000
6 2020 1 y NaN
7 2020 1 a 0.556441
8 2020 1 b 0.336748
9 2020 1 c 0.106812
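As a side note, numpy can draw the three proportions in a single call with np.random.dirichlet, which also guarantees each row sums to 1 (the distribution differs slightly from normalising uniform draws); a sketch of just the sampling step:
x = np.random.dirichlet(np.ones(3), size=len(d))   # each row sums to 1
x[:, ::-1].sort()                                   # sort each row descending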
Solution with loop
Not elegant, but works.
gr = df.groupby(['id', 'date'])
l = []
for i, g in gr:
    d = np.random.random(3)
    d /= d.sum()
    d[::-1].sort()
    ndf = pd.DataFrame({
        'variable': list('abc'),
        'value': d
    })
    ndf['id'] = g['id'].iloc[0]
    ndf['date'] = g['date'].iloc[0]
    l.append(pd.concat([g, ndf], sort=False).reset_index(drop=True))
pd.concat(l).reset_index(drop=True)
Output
id date variable value
0 1 2019 x 100.000000
1 1 2019 y 50.500000
2 1 2019 a 0.378764
3 1 2019 b 0.366415
4 1 2019 c 0.254821
5 1 2020 x 10.000000
6 1 2020 y NaN
7 1 2020 a 0.427007
8 1 2020 b 0.317555
9 1 2020 c 0.255439
I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column ends up full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep it in integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
year B B_previous_year
2 2017 17 NaN
1 2018 5 17.0
0 2019 10 5.0
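If you need the original row order back afterwards, sorting on the index restores it:
df = df.sort_index()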
For the DataFrame below, I need to create a new column 'unit_count' which is 'unit'/'count' for each year and month. However, because each year and month is not unique, for each entry, I only want to use the count for a given month from the B option.
key UID count month option unit year
0 1 100 1 A 10 2015
1 1 200 1 B 20 2015
2 1 300 2 A 30 2015
3 1 400 2 B 40 2015
Essentially, I need a function that does the following:
unit_count = df.unit / df.count
for each value of unit, but using only the 'count' value of option 'B' in that given 'month'.
So the end result would look like the table below, where unit_count divides the number of units by the count of option 'B' for a given month.
key UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.05
1 1 200 1 B 20 2015 0.10
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.10
Here is the code I used to create the original DataFrame:
df = pd.DataFrame({'UID':[1,1,1,1],
'year':[2015,2015,2015,2015],
'month':[1,1,2,2],
'option':['A','B','A','B'],
'unit':[10,20,30,40],
'count':[100,200,300,400]
})
You can first create NaN wherever the option is not B and then divide by the back-filled values.
Note: the DataFrame has to be sorted by year, month and option first, so that the B row is the last one within each (year, month) group and bfill can propagate its count to the rows above.
#if necessary in real data
#df.sort_values(['year','month', 'option'], inplace=True)
df['unit_count'] = df.loc[df.option=='B', 'count']
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 NaN
1 1 200 1 B 20 2015 200.0
2 1 300 2 A 30 2015 NaN
3 1 400 2 B 40 2015 400.0
df['unit_count'] = df.unit.div(df['unit_count'].bfill())
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.050
1 1 200 1 B 20 2015 0.100
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.100
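An alternative that does not depend on row order is to broadcast the B count to its (year, month) group with a groupby transform; a minimal sketch, assuming at most one B row per month:
b_count = df['count'].where(df['option'] == 'B')
df['unit_count'] = df['unit'] / b_count.groupby([df['year'], df['month']]).transform('max')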