Finding similarity score between two columns using pandas - python

I have a dataframe like the one shown below:
ID,Region,Supplier,year,output
1,Test,Test1,2021,1
2,dummy,tUMMY,2022,1
3,dasho,MASHO,2022,1
4,dahp,ZYZE,2021,0
5,delphi,POQE,2021,1
6,kilby,Daasan,2021,1
7,sarby,abbas,2021,1
df = pd.read_clipboard(sep=',')
My objective is
a) To compare two column values and assign a similarity score.
So, I tried the below
import difflib
[(len(difflib.get_close_matches(x, df['Region'], cutoff=0.6))>1)*1
for x in df['Supplier']]
However, this gives '0' for every row, meaning every comparison falls below the cut-off value of 0.6.
Instead, I expect my output to look like the one shown below

Converting each column to lower case and making the comparison >= rather than > (since there is at most one match in this example) produces the desired output:
from difflib import SequenceMatcher, get_close_matches
df['best_match'] = [m for s in df['Supplier'].str.lower()
                    for m in get_close_matches(s, df['Region'].str.lower(), n=1) or ['']]  # keep at most one (best) match per row
df['similarity_score'] = df.apply(lambda x: SequenceMatcher(None, x['Supplier'].lower(), x['best_match']).ratio(), axis=1)
df = df.assign(similarity_flag = df['similarity_score'].gt(0.6).astype(int)).drop(columns=['best_match'])
Output:
ID Region Supplier year output similarity_score similarity_flag
0 1 Test Test1 2021 1 0.888889 1
1 2 dummy tUMMY 2022 1 0.800000 1
2 3 dasho MASHO 2022 1 0.800000 1
3 4 dahp ZYZE 2021 0 0.000000 0
4 5 delphi POQE 2021 1 0.000000 0
5 6 kilby Daasan 2021 1 0.000000 0
6 7 sarby abbas 2021 1 0.000000 0

Updated answer with similarity flag and score (using difflib.SequenceMatcher)
cutoff = 0.6
df['similarity_score'] = (
    df[['Region', 'Supplier']]
    .apply(lambda x: difflib.SequenceMatcher(None, x['Region'].lower(), x['Supplier'].lower()).ratio(), axis=1)
)
df['similarity_flag'] = (df['similarity_score'] >= cutoff).astype(int)
Output:
ID Region Supplier year output similarity_score similarity_flag
0 1 Test Test1 2021 1 0.888889 1
1 2 dummy tUMMY 2022 1 0.800000 1
2 3 dasho MASHO 2022 1 0.800000 1
3 4 dahp ZYZE 2021 0 0.000000 0
4 5 delphi POQE 2021 1 0.200000 0
5 6 kilby Daasan 2021 1 0.000000 0
6 7 sarby abbas 2021 1 0.200000 0
Try using apply with lambda and axis=1:
df['similarity_flag'] = (
    df[['Region', 'Supplier']]
    .apply(lambda x: len(difflib.get_close_matches(x['Region'].lower(), [x['Supplier'].lower()])), axis=1)
)
Output:
ID Region Supplier year output similarity_flag
0 1 Test Test1 2021 1 1
1 2 dummy tUMMY 2022 1 1
2 3 dasho MASHO 2022 1 1
3 4 dahp ZYZE 2021 0 0
4 5 delphi POQE 2021 1 0
5 6 kilby Daasan 2021 1 0
6 7 sarby abbas 2021 1 0
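For completeness, here is a minimal row-wise sketch (an addition, not one of the answers above): since each Supplier is only ever compared to the Region in the same row, a plain zip over the two columns avoids the apply overhead; the 0.6 cut-off is assumed from the question.
from difflib import SequenceMatcher

# compare each Supplier to the Region in the same row, case-insensitively
df['similarity_score'] = [SequenceMatcher(None, r.lower(), s.lower()).ratio()
                          for r, s in zip(df['Region'], df['Supplier'])]
df['similarity_flag'] = (df['similarity_score'] >= 0.6).astype(int)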

Related

A complex computation of the average value in pandas

This is my first question on this forum.
I am conducting experiments in which I measure the current-voltage curve of a device under different experimental conditions.
The different experimental conditions are encoded into a parameter K.
I am performing measurements of the current I using back & forth voltage sweeps, with V varying from 0V to 2V, then from 2V to -2V, and then back to 0V.
Measurements are conducted several times for each value of K to get an average of the current at each voltage point (backward and forward values). Each measurement is ascribed to a parameter named iter (varying from 0 to 3, for instance).
I have collected all the data into a pandas dataframe df, and below is code that produces the typical df I have (the real one is way too large):
import numpy as np
import pandas as pd

K_col = []
iter_col = []
V_col = []
I_col = []
niter = 3
V_val = [0, 1, 2, 1, 0, -1, -2, -1, 0]
K_val = [1, 2]
for K in K_val:
    for it in range(niter):
        for V in V_val:
            K_col.append(K)
            iter_col.append(it + 1)
            V_col.append(V)
            I_col.append((2*K + np.random.random())*V)
d = {'K': K_col, 'iter': iter_col, 'V': V_col, 'I': I_col}
df = pd.DataFrame(d)
I would like to compute the average value of I at each voltage and compare the impact of the experimental condition K.
For example let's look at 2 measurements conducted for K=1:
df[(df.K==1)&(df.iter.isin([1,2]))]
output:
K iter V I
0 1 1 0 0.000000
1 1 1 1 2.513330
2 1 1 2 4.778719
3 1 1 1 2.430393
4 1 1 0 0.000000
5 1 1 -1 -2.705487
6 1 1 -2 -4.235055
7 1 1 -1 -2.278295
8 1 1 0 0.000000
9 1 2 0 0.000000
10 1 2 1 2.535058
11 1 2 2 4.529292
12 1 2 1 2.426209
13 1 2 0 0.000000
14 1 2 -1 -2.878359
15 1 2 -2 -4.061515
16 1 2 -1 -2.294630
17 1 2 0 0.000000
We can see that for experiment 1 (iter=1), V goes to 0 multiple times (indexes 0, 4 and 8). I do not want to lose these distinct data points.
The first data point for I_avg should be (I[0]+I[9])/2, which would correspond to the first measurement at 0V. The second data point should be (I[1]+I[10])/2, which would correspond to the avg_I measured at 1V with increasing values of V, etc., up to (I[8]+I[17])/2, which would be my last data point at 0V.
My first thought was to use the groupby() method with K and V as keys, but this wouldn't work because V varies back & forth, so there are duplicate values of V within each measurement and the groupby would collapse them onto the unique values of V.
The final dataframe I would like to have should look like this:
K V avg_I
0 1 0 0.000000
1 1 1 2.513330
2 1 2 4.778719
3 1 1 2.430393
4 1 0 0.000000
5 1 -1 -2.705487
6 1 -2 -4.235055
7 1 1 -2.278295
8 1 0 0.000000
9 1 0 0.000000
10 2 1 2.513330
11 2 2 4.778719
12 2 1 2.430393
13 2 0 0.000000
14 2 -1 -2.705487
15 2 -2 -4.235055
16 2 1 -2.278295
17 2 0 0.000000
Would anyone have an idea on how to do this?
In order to compute the mean while also taking into account the position of each observation within an iteration, you could add an extra column containing this information, like this:
len_iter = 9
num_iter = len(df['iter'].unique())
num_K = len(df['K'].unique())
df['index'] = np.tile(np.arange(len_iter), num_iter*num_K)
And then compute the group by and mean to get the desired result:
df.groupby(['K', 'V', 'index'])['I'].mean().reset_index().drop(['index'], axis=1)
K V I
0 1 -2 -5.070126
1 1 -1 -2.598104
2 1 -1 -2.576927
3 1 0 0.000000
4 1 0 0.000000
5 1 0 0.000000
6 1 1 2.232128
7 1 1 2.359398
8 1 2 4.824657
9 2 -2 -9.031487
10 2 -1 -4.125880
11 2 -1 -4.350776
12 2 0 0.000000
13 2 0 0.000000
14 2 0 0.000000
15 2 1 4.535478
16 2 1 4.492122
17 2 2 8.569701
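As an alternative sketch (an addition, not part of the answer above, and assuming every (K, iter) pair follows the same sweep order): the position within each sweep can be derived with groupby().cumcount() instead of hard-coding len_iter with np.tile.
# position of each row within its (K, iter) sweep
df['pos'] = df.groupby(['K', 'iter']).cumcount()
# average I across iterations at each sweep position, keeping the V of that position
avg_df = (df.groupby(['K', 'pos'], as_index=False)
            .agg(V=('V', 'first'), avg_I=('I', 'mean'))
            .drop(columns='pos'))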
If I understand this correctly, you want a new data point that represents the average I for each V category. We can achieve this by getting the average value of I for each V and then mapping it onto the full dataframe.
avg_I = df.groupby(['V'], as_index=False).mean()[['V', 'I']]
df['avg_I'] = df.apply(lambda x: float(avg_I['I'][avg_I['V'] == x['V']]), axis=1)
df.head()
output:
K iter V I avg_I
0 1 1 0 0.00 0.00
1 1 1 1 2.34 3.55
2 1 1 2 4.54 6.89
3 1 1 1 2.02 3.55
4 1 1 0 0.00 0.00
df.plot()

Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back + 1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
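As a quick usage sketch (an addition, assuming the merged df built above): the before/during comparison the question asks for then becomes a filter plus a groupby, since pre-holiday days are flagged in StateHolidayNew but not in StateHoliday.
# average sales shortly before each holiday type vs. on the holidays themselves
before = df[(df["StateHolidayNew"] != "0") & (df["StateHoliday"] == "0")]
during = df[df["StateHoliday"] != "0"]
print(before.groupby("StateHolidayNew")["Sales"].mean())
print(during.groupby("StateHoliday")["Sales"].mean())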
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows that fall between the different letters which represent the holidays, and then use groupby to find the sales for each group. An improvement to this would be to backfill the holiday letters onto the numeric groups before them, e.g. groups=0.0 would become b_0, which would make it easier to see which holiday each group precedes, but I am not sure how to do that (a sketch of one possibility follows the final DataFrame below).
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
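Here is a hedged sketch of the backfill idea mentioned above (an addition, assuming the groups column built in this answer and that each numeric group sits directly before its holiday row in index order):
import numpy as np

# letter of the next holiday row, propagated backwards onto the preceding numeric group
is_holiday_row = df['StateHoliday'].astype(str).str.isalpha()
holiday_letter = df['StateHoliday'].astype(str).where(is_holiday_row).bfill()
# numeric groups become e.g. 'b_0.0'; groups after the last holiday stay NaN here
df['group_label'] = np.where(df['groups'].astype(str).str.isalpha(),
                             df['groups'],
                             holiday_letter + '_' + df['groups'].astype(str))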

Pandas groupby and add new rows with random data

I have a pandas dataframe like so:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2020 x 10.0
1 2020 y NA
Now, I want to groupby id and date, and for each group add 3 more variables a, b, c with random values such that a+b+c=1.0 and a>b>c.
So my final dataframe will be something like this:
id date variable value
1 2019 x 100
1 2019 y 50.5
1 2019 a 0.49
1 2019 b 0.315
1 2019 c 0.195
1 2020 x 10.0
1 2020 y NA
1 2020 a 0.55
1 2020 b 0.40
1 2020 c 0.05
Update
It's possible to do this without a loop and without appending dataframes.
d = df.groupby(['date','id','variable'])['value'].mean().unstack('variable').reset_index()
x = np.random.random((len(d),3))
x /= x.sum(1)[:,None]
x[:,::-1].sort()
d[['a','b','c']] = pd.DataFrame(x)
pd.melt(d, id_vars=['date','id']).sort_values(['date','id']).reset_index(drop=True)
Output
date id variable value
0 2019 1 x 100.000000
1 2019 1 y 50.500000
2 2019 1 a 0.367699
3 2019 1 b 0.320325
4 2019 1 c 0.311976
5 2020 1 x 10.000000
6 2020 1 y NaN
7 2020 1 a 0.556441
8 2020 1 b 0.336748
9 2020 1 c 0.106812
Solution with loop
Not elegant, but works.
gr = df.groupby(['id', 'date'])
l = []
for i, g in gr:
    d = np.random.random(3)
    d /= d.sum()
    d[::-1].sort()  # sort in place, descending, so a >= b >= c
    ndf = pd.DataFrame({
        'variable': list('abc'),
        'value': d
    })
    ndf['id'] = g['id'].iloc[0]
    ndf['date'] = g['date'].iloc[0]
    l.append(pd.concat([g, ndf], sort=False).reset_index(drop=True))
pd.concat(l).reset_index(drop=True)
Output
id date variable value
0 1 2019 x 100.000000
1 1 2019 y 50.500000
2 1 2019 a 0.378764
3 1 2019 b 0.366415
4 1 2019 c 0.254821
5 1 2020 x 10.000000
6 1 2020 y NaN
7 1 2020 a 0.427007
8 1 2020 b 0.317555
9 1 2020 c 0.255439
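A small sanity check (an addition; it assumes the melted result from the update above has been assigned to a variable, here called out) can confirm that each (id, date) group satisfies a+b+c=1 and a >= b >= c:
import numpy as np

abc = (out[out['variable'].isin(list('abc'))]
       .pivot_table(index=['id', 'date'], columns='variable', values='value'))
assert np.allclose(abc.sum(axis=1), 1.0)
assert ((abc['a'] >= abc['b']) & (abc['b'] >= abc['c'])).all()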

Use Pandas dataframe to add lag feature from MultiIndex Series

I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This, however, is very slow and memory-hungry, and I was wondering what better ways there are to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the median (target) saved_groupby column 0 when there is no match, use fillna to change NaNs to 0:
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20, 5)),
                  columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0
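For reference, a hedged alternative sketch (an addition, not part of the answer): because the lookup keys are just the dataframe's own columns with Week shifted by 2, the original MultiIndex Series can also be reindexed directly, which avoids the merge.
# saved_groupby here is the original MultiIndex Series (before reset_index)
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
keys = pd.MultiIndex.from_arrays([df['Week'] - 2, df['ID_1'], df['ID_2']],
                                 names=['Week', 'ID_1', 'ID_2'])
df['Median'] = saved_groupby.reindex(keys).fillna(0).to_numpy()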

Pandas/Python Modeling Time-Series, Groups with Different Inputs

I am trying to model different scenarios for groups of assets in future years. This is something I have accomplished very tediously in Excel, but I want to leverage the large database I have built with Pandas.
Example:
annual_group_cost = 0.02
df1:
year group x_count y_count value
2018 a 2 5 109000
2019 a 0 4 nan
2020 a 3 0 nan
2018 b 0 0 55000
2019 b 1 0 nan
2020 b 1 0 nan
2018 c 5 1 500000
2019 c 3 0 nan
2020 c 2 5 nan
df2:
group x_benefit y_cost individual_avg starting_value
a 0.2 0.72 1000 109000
b 0.15 0.75 20000 55000
c 0.15 0.70 20000 500000
I would like to update the values in df1 by taking the previous year's value (or the starting value), adding the x benefit, and subtracting the y cost and the annual cost. I am assuming this will take a function to accomplish, but I don't know of an efficient way to handle it.
The final output I would like to have is:
df1:
year group x_count y_count value
2018 a 2 5 103620
2019 a 0 4 98667.3
2020 a 3 0 97294.248
2018 b 0 0 53900
2019 b 1 0 56822
2020 b 1 0 59685.56
2018 c 5 1 495000
2019 c 3 0 497100
2020 c 2 5 420158
I achieved this by using:
starting_value-(starting_value*annual_group_cost)+(x_count*(individual_avg*x_benefit))-(y_count*(individual_avg*y_cost))
Since subsequent new values are dependent upon previously calculated new values, this will need to involve (even if behind the scenes using e.g. apply) a for loop:
for i in range(1, len(df1)):
    if np.isnan(df1.loc[i, 'value']):
        df1.loc[i, 'value'] = df1.loc[i-1, 'value']  # your logic here
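As a fuller hedged sketch of that loop (an addition that fills in the placeholder, assuming df1 is ordered by group and year and using the question's formula, where each year starts from the previous year's computed value):
annual_group_cost = 0.02
merged = df1.merge(df2[['group', 'x_benefit', 'y_cost', 'individual_avg', 'starting_value']],
                   on='group', how='left')  # preserves df1's row order

values = []
prev_group = None
for _, row in merged.iterrows():
    # first year of a group starts from starting_value, later years from the prior computed value
    prev = row['starting_value'] if row['group'] != prev_group else values[-1]
    values.append(prev - prev*annual_group_cost
                  + row['x_count']*row['individual_avg']*row['x_benefit']
                  - row['y_count']*row['individual_avg']*row['y_cost'])
    prev_group = row['group']
df1['value'] = values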
You should merge the two tables together and then just apply the calculations to the data Series:
hold = df1.merge(df2, on=['group']).fillna(0)
x = (hold.x_count*(hold.individual_avg*hold.x_benefit))
y = (hold.y_count*(hold.individual_avg*hold.y_cost))
for year in hold.year.unique():
    start = hold.loc[hold.year == year, 'starting_value']
    hold.loc[hold.year == year, 'value'] = (start - (start*annual_group_cost) + x - y)
    if year != hold.year.max():
        hold.loc[hold.year == year + 1, 'starting_value'] = hold.loc[hold.year == year, 'value'].values
hold.drop(['x_benefit', 'y_cost', 'individual_avg', 'starting_value'], axis=1)
Will give you
year group x_count y_count value
0 2018 a 2 5 103620.0
1 2019 a 0 4 98667.6
2 2020 a 3 0 97294.25
3 2018 b 0 0 53900.0
4 2019 b 1 0 55822.0
5 2020 b 1 0 57705.56
6 2018 c 5 1 491000.0
7 2019 c 3 0 490180.0
8 2020 c 2 5 416376.4
