Pandas groupby and add new rows with random data - python

I have a pandas dataframe like so:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2020 x 10.0
1 2020 y NA
Now, I want to groupby id and date, and for each group add 3 more variables a, b, c with random values such that a+b+c=1.0 and a>b>c.
So my final dataframe will be something like this:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2019 a 0.49
1 2019 b 0.315
1 2019 c 0.195
1 2020 x 10.0
1 2020 y NA
1 2020 a 0.55
1 2020 b 0.40
1 2020 c 0.05

Update
It's possible without a loop and without appending dataframes one by one.
import numpy as np
import pandas as pd

# pivot to one row per (date, id), one column per existing variable
d = df.groupby(['date','id','variable'])['value'].mean().unstack('variable').reset_index()
x = np.random.random((len(d), 3))
x /= x.sum(1)[:, None]       # normalize each row so a + b + c == 1
x[:, ::-1].sort()            # sort each row in descending order, so a > b > c
d[['a','b','c']] = pd.DataFrame(x)
pd.melt(d, id_vars=['date','id']).sort_values(['date','id']).reset_index(drop=True)
Output
date id variable value
0 2019 1 x 100.000000
1 2019 1 y 50.500000
2 2019 1 a 0.367699
3 2019 1 b 0.320325
4 2019 1 c 0.311976
5 2020 1 x 10.000000
6 2020 1 y NaN
7 2020 1 a 0.556441
8 2020 1 b 0.336748
9 2020 1 c 0.106812
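As a quick sanity check (a minimal sketch of the same normalize-then-sort trick used above), you can verify that every generated row sums to 1 and is in descending order:
import numpy as np
x = np.random.random((5, 3))
x /= x.sum(1)[:, None]                   # each row now sums to 1
x[:, ::-1].sort()                        # sorting the reversed view sorts each row in descending order
assert np.allclose(x.sum(1), 1.0)        # a + b + c == 1
assert (np.diff(x, axis=1) <= 0).all()   # a >= b >= c (exact ties are practically impossible with floats)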
Solution with loop
Not elegant, but works.
import numpy as np
import pandas as pd

gr = df.groupby(['id','date'])
l = []
for i, g in gr:
    d = np.random.random(3)
    d /= d.sum()             # make the three values sum to 1
    d[::-1].sort()           # sort descending so a > b > c
    ndf = pd.DataFrame({
        'variable': list('abc'),
        'value': d
    })
    ndf['id'] = g['id'].iloc[0]
    ndf['date'] = g['date'].iloc[0]
    l.append(pd.concat([g, ndf], sort=False).reset_index(drop=True))
pd.concat(l).reset_index(drop=True)
Output
id date variable value
0 1 2019 x 100.000000
1 1 2019 y 50.500000
2 1 2019 a 0.378764
3 1 2019 b 0.366415
4 1 2019 c 0.254821
5 1 2020 x 10.000000
6 1 2020 y NaN
7 1 2020 a 0.427007
8 1 2020 b 0.317555
9 1 2020 c 0.255439
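As an aside (not part of the question or the answers above), numpy.random.dirichlet is another way to draw positive values that sum to 1, which could replace the normalize step in either solution:
import numpy as np
d = np.random.dirichlet(np.ones(3))   # three positive values that sum to 1
d[::-1].sort()                        # descending, so a > b > c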

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows so that I only keep ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter based on set equality between the years you want and the years available for each particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Another option is a boolean mask. Use pandas.DataFrame.groupby and pandas.DataFrame.transform to count the years available for each id and compare that count with the number of distinct years in the year column.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
mask = df.groupby('id')['year'].transform('count').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
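If an id could appear more than once in the same year, a plain row count would overcount; a small variant of the mask (an assumption beyond the original answer) compares distinct years per id instead:
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]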
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

Finding similarity score between two columns using pandas

I have a dataframe like as shown below
ID,Region,Supplier,year,output
1,Test,Test1,2021,1
2,dummy,tUMMY,2022,1
3,dasho,MASHO,2022,1
4,dahp,ZYZE,2021,0
5,delphi,POQE,2021,1
6,kilby,Daasan,2021,1
7,sarby,abbas,2021,1
df = pd.read_clipboard(sep=',')
My objective is:
a) To compare the two column values and assign a similarity score.
So, I tried the below:
import difflib
[(len(difflib.get_close_matches(x, df['Region'], cutoff=0.6)) > 1) * 1
 for x in df['Supplier']]
However, this gives '0' for every row, meaning everything falls below the cutoff value of 0.6.
I expect my output to be as shown below.
Converting each column to lower case and comparing the number of matches with >= 1 rather than > 1 (since there is at most one match in this example) fetches the desired output:
from difflib import SequenceMatcher, get_close_matches
df['best_match'] = [m for s in df['Supplier'].str.lower()
                    for m in get_close_matches(s, df['Region'].str.lower()) or ['']]
df['similarity_score'] = df.apply(lambda x: SequenceMatcher(None, x['Supplier'].lower(), x['best_match']).ratio(), axis=1)
df = df.assign(similarity_flag=df['similarity_score'].gt(0.6).astype(int)).drop(columns=['best_match'])
Output:
ID Region Supplier year output similarity_score similarity_flag
0 1 Test Test1 2021 1 0.888889 1
1 2 dummy tUMMY 2022 1 0.800000 1
2 3 dasho MASHO 2022 1 0.800000 1
3 4 dahp ZYZE 2021 0 0.000000 0
4 5 delphi POQE 2021 1 0.000000 0
5 6 kilby Daasan 2021 1 0.000000 0
6 7 sarby abbas 2021 1 0.000000 0
Updated answer with similarity flag and score (using difflib.SequenceMatcher)
import difflib

cutoff = 0.6
df['similarity_score'] = (
    df[['Region', 'Supplier']]
    .apply(lambda x: difflib.SequenceMatcher(None, x['Region'].lower(), x['Supplier'].lower()).ratio(), axis=1)
)
df['similarity_flag'] = (df['similarity_score'] >= cutoff).astype(int)
Output:
ID Region Supplier year output similarity_score similarity_flag
0 1 Test Test1 2021 1 0.888889 1
1 2 dummy tUMMY 2022 1 0.800000 1
2 3 dasho MASHO 2022 1 0.800000 1
3 4 dahp ZYZE 2021 0 0.000000 0
4 5 delphi POQE 2021 1 0.200000 0
5 6 kilby Daasan 2021 1 0.000000 0
6 7 sarby abbas 2021 1 0.200000 0
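For reference, SequenceMatcher.ratio() returns 2*M/T, where M is the number of matching characters and T is the combined length of both strings; that is where the scores above come from. A minimal sketch:
from difflib import SequenceMatcher
print(SequenceMatcher(None, 'test', 'test1').ratio())   # 2*4/9 ≈ 0.888889
print(SequenceMatcher(None, 'dummy', 'tummy').ratio())  # 2*4/10 = 0.8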
Try using apply with lambda and axis=1:
import difflib

df['similarity_flag'] = (
    df[['Region', 'Supplier']]
    .apply(lambda x: len(difflib.get_close_matches(x['Region'].lower(), [x['Supplier'].lower()])), axis=1)
)
Output:
ID Region Supplier year output similarity_flag
0 1 Test Test1 2021 1 1
1 2 dummy tUMMY 2022 1 1
2 3 dasho MASHO 2022 1 1
3 4 dahp ZYZE 2021 0 0
4 5 delphi POQE 2021 1 0
5 6 kilby Daasan 2021 1 0
6 7 sarby abbas 2021 1 0

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column ends up full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep the integer format (using pandas' nullable integers):
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
year B B_previous_year
2 2017 17 NaN
1 2018 5 17.0
0 2019 10 5.0
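A more explicit alternative (a sketch, not taken from the answers above): build a year-to-B lookup and map each row's previous year onto it, which also copes with gaps between years:
prev = df.set_index('year')['B']                    # lookup: year -> B
df['B_previous_year'] = (df['year'] - 1).map(prev)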

Transpose dataset and append different columns

I have a dataset like this:
date | a | diff_a | b | diff_b | c | diff_c
2020 0 NaN 10 NaN 5 NaN
2021 1 1 20 10 7 2
2022 3 2 30 10 13 6
2023 4 1 40 10 20 7
And I want to transpose this dataset and merge different columns below, like this:
date | Cat | value | diff
2020 a 0 NaN
2021 a 1 1
2022 a 3 2
2023 a 4 1
2020 b 10 ...
2021 b 20
2022 b 30
2023 b 40
2020 c 5
2021 c 7
2022 c 13
2023 c 20
The diff columns are not important, since once I can stack the other columns below one another I can just filter and then concat the dataframes. But how do I turn these columns into rows?
Kind Regards
My approach with DataFrame.melt
m = df.columns.str.contains('diff')
new_df = (df.melt(df.columns[~m], df.columns[m], var_name='Cat', value_name='diff')
            .assign(Cat=lambda x: x['Cat'].str.split('_').str[-1],
                    a=lambda x: x.lookup(x.index, x.Cat)))  # note: DataFrame.lookup was removed in pandas 2.0
new_df = new_df.drop(columns=list(filter(lambda x: x != 'a', new_df.Cat.unique())))
print(new_df)
date a Cat diff
0 2020 0 a NaN
1 2021 1 a 1
2 2022 3 a 2
3 2023 4 a 1
4 2020 10 b NaN
5 2021 20 b 10
6 2022 30 b 10
7 2023 40 b 10
8 2020 5 c NaN
9 2021 7 c 2
10 2022 13 c 6
11 2023 20 c 7
EDIT
If we do not want the diff column we can drop it; also, if the value column does not have to be called a, we can do:
m = df.columns.str.contains('diff')
new_df = (df.melt(df.columns[~m], df.columns[m], var_name='Cat', value_name='diff')
            #.drop(columns='diff')  # if you want to drop diff
            .assign(Cat=lambda x: x['Cat'].str.split('_').str[-1],
                    other=lambda x: x.lookup(x.index, x.Cat)))
new_df = new_df.drop(columns=new_df['Cat'].unique())
print(new_df)
date Cat diff other
0 2020 a NaN 0
1 2021 a 1 1
2 2022 a 2 3
3 2023 a 1 4
4 2020 b NaN 10
5 2021 b 10 20
6 2022 b 10 30
7 2023 b 10 40
8 2020 c NaN 5
9 2021 c 2 7
10 2022 c 6 13
11 2023 c 7 20
I like @ansev's answer; it is definitely elegant.
My attempt is below. Note that I drop the diff columns first, since they are no longer needed, and then:
df2 = (df.drop(columns=['diff_a', 'diff_b', 'diff_c'])   # drop the diff columns first, as noted above
         .set_index('date').stack().reset_index(level=0, drop=False)
         .rename_axis('Cat', axis=0).reset_index().sort_values(by='date'))
df2 = df2.rename(columns={0: 'value'})
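Note that DataFrame.lookup, used in the melt-based answer above, was removed in pandas 2.0. A minimal alternative sketch with pd.wide_to_long (assuming the column names a/diff_a, b/diff_b, c/diff_c from the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': [2020, 2021, 2022, 2023],
                   'a': [0, 1, 3, 4], 'diff_a': [np.nan, 1, 2, 1],
                   'b': [10, 20, 30, 40], 'diff_b': [np.nan, 10, 10, 10],
                   'c': [5, 7, 13, 20], 'diff_c': [np.nan, 2, 6, 7]})
# Rename the plain columns to value_a/value_b/value_c so every column is <stub>_<Cat>.
renamed = df.rename(columns={c: f'value_{c}' for c in ('a', 'b', 'c')})
long = (pd.wide_to_long(renamed, stubnames=['value', 'diff'],
                        i='date', j='Cat', sep='_', suffix=r'\w+')
          .reset_index()
          .sort_values(['Cat', 'date'])
          .reset_index(drop=True))
print(long)   # columns: date, Cat, value, diff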

Pandas/Python Modeling Time-Series, Groups with Different Inputs

I am trying to model different scenarios for groups of assets in future years. This is something I have accomplished very tediously in Excel, but want to leverage the large database I have built with Pandas.
Example:
annual_group_cost = 0.02
df1:
year group x_count y_count value
2018 a 2 5 109000
2019 a 0 4 nan
2020 a 3 0 nan
2018 b 0 0 55000
2019 b 1 0 nan
2020 b 1 0 nan
2018 c 5 1 500000
2019 c 3 0 nan
2020 c 2 5 nan
df2:
group x_benefit y_cost individual_avg starting_value
a 0.2 0.72 1000 109000
b 0.15 0.75 20000 55000
c 0.15 0.70 20000 500000
I would like to update the values in df1 by taking the previous year's value (or the starting value), subtracting the annual cost and the y cost, and adding the x benefit. I am assuming this will take a function to accomplish, but I don't know of an efficient way to handle it.
The final output I would like to have is:
df1:
year group x_count y_count value
2018 a 2 5 103620
2019 a 0 4 98667.3
2020 a 3 0 97294.248
2018 b 0 0 53900
2019 b 1 0 56822
2020 b 1 0 59685.56
2018 c 5 1 495000
2019 c 3 0 497100
2020 c 2 5 420158
I achieved this by using:
starting_value - (starting_value * annual_group_cost) + (x_count * (individual_avg * x_benefit)) - (y_count * (individual_avg * y_cost))
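For reference, that update rule written as a plain Python function (a minimal sketch; the names mirror the columns above and next_value is just an illustrative helper name):
def next_value(prev_value, x_count, y_count, x_benefit, y_cost,
               individual_avg, annual_group_cost=0.02):
    # previous value minus the annual group cost, plus the x benefit, minus the y cost
    return (prev_value
            - prev_value * annual_group_cost
            + x_count * individual_avg * x_benefit
            - y_count * individual_avg * y_cost)

# group a, 2018: 109000 - 2180 + 2*1000*0.2 - 5*1000*0.72 = 103620
print(next_value(109000, 2, 5, 0.2, 0.72, 1000))   # 103620.0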
Since each new value depends on the previously calculated value, this will need to involve a for loop (even if it happens behind the scenes, e.g. via apply):
for i in range(1, len(df1)):
    if np.isnan(df1.loc[i, 'value']):
        df1.loc[i, 'value'] = df1.loc[i - 1, 'value']  # your logic here
You should merge the two tables together and then apply the formula to the resulting Series:
hold = df1.merge(df2, on=['group']).fillna(0)
x = hold.x_count * (hold.individual_avg * hold.x_benefit)
y = hold.y_count * (hold.individual_avg * hold.y_cost)
for year in hold.year.unique():
    start = hold.loc[hold.year == year, 'starting_value']
    hold.loc[hold.year == year, 'value'] = start - (start * annual_group_cost) + x - y
    if year != hold.year.max():
        hold.loc[hold.year == year + 1, 'starting_value'] = hold.loc[hold.year == year, 'value'].values
hold = hold.drop(['x_benefit', 'y_cost', 'individual_avg', 'starting_value'], axis=1)
Will give you
year group x_count y_count value
0 2018 a 2 5 103620.0
1 2019 a 0 4 98667.6
2 2020 a 3 0 97294.25
3 2018 b 0 0 53900.0
4 2019 b 1 0 55822.0
5 2020 b 1 0 57705.56
6 2018 c 5 1 491000.0
7 2019 c 3 0 490180.0
8 2020 c 2 5 416376.4
