Say I have two dataframes, df1 and df2 as shown here:
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
Timestamp_A
0 0.6
1 1.1
2 1.6
3 2.1
4 2.6
5 3.1
6 3.6
7 4.1
8 4.6
9 5.1
10 5.6
11 6.1
12 6.6
13 7.1
Timestamp_B
0 2.2
1 2.7
2 3.2
3 3.7
4 5.2
5 5.7
Each dataframe is the output of a different sensor, and both transmit at the same frequency. I would like to align the two dataframes so that each timestamp in B lines up with the timestamp in A closest to its value. For all values in Timestamp_A that have no match in Timestamp_B, the new column should hold np.nan. Does anyone have advice on the best way to do this? Here is the desired output:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 NaN
7 4.1 NaN
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 NaN
12 6.6 NaN
13 7.1 NaN
You probably want some application of merge_asof, like so:
import pandas as pd
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
df3 = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
tolerance=0.5, direction='nearest')
print(df3)
Output as follows:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
The tolerance will define what "not having a match" means numerically, so that is up to you to determine.
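Note that with tolerance=0.5 and direction='nearest' a single value from df2 can match several rows of df1 (3.7 and 5.7 each appear twice above), which is why this output differs from the desired one. A sketch, assuming a tighter cutoff suits your data:
# with tolerance=0.2, 4.1 and 6.1 (each 0.4 away) no longer match 3.7 and 5.7
df4 = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                    tolerance=0.2, direction='nearest')
print(df4)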
When you only have two columns and a single value to assign, I feel reindex is more suitable:
df2.index = df2.Timestamp_B
df1['New'] = df2.reindex(df1.Timestamp_A, method='nearest', tolerance=0.5).values
df1
Out[109]:
Timestamp_A New
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
For more columns:
s = pd.DataFrame(df2.reindex(df1.Timestamp_A, method='nearest', tolerance=0.5).values,
                 index=df1.index, columns=df2.columns)
df1 = pd.concat([df1, s], axis=1)
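Note that reindex with method='nearest' requires a monotonic index, so sort df2 by Timestamp_B first if it is not already sorted (here it is).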
I have the following dataset:
Time = ['00:01', '00:02','00:03','00:01','00:02','00:03','00:01','00:02','00:03']
ID = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Value = [3.5, 3.5, 3.5, 4.1, 4.1, 4.1, 2.3, 2.3, 2.3]
df = pd.DataFrame({'Time':Time, 'ID':ID, 'Value':Value})
The value is the same for every row of a given ID. I want to create a new column that sums the Value column cumulatively, but only adds a value when the ID changes.
So instead of getting
3.5 7 10.5 14.6 18.7 22.8 25.1 27.3 29.5
I want
3.5 3.5 3.5 7.6 7.6 7.6 9.9 9.9 9.9
Use .loc to assign the value, shift to test where the ID changes, and then cumsum with ffill:
df.loc[:, "Val"] = df[df["ID"].ne(df["ID"].shift())][
"Value"
].cumsum()
df['Val'] = df['Val'].ffill()
print(df)
Time ID Value Val
0 00:01 1 3.5 3.5
1 00:02 1 3.5 3.5
2 00:03 1 3.5 3.5
3 00:01 2 4.1 7.6
4 00:02 2 4.1 7.6
5 00:03 2 4.1 7.6
6 00:01 3 2.3 9.9
7 00:02 3 2.3 9.9
8 00:03 3 2.3 9.9
Or, more simply, as suggested by Ch3steR:
df['Value'].where(df['Value'].ne(df['Value'].shift(1))).cumsum().ffill()
0 3.5
1 3.5
2 3.5
3 7.6
4 7.6
5 7.6
6 9.9
7 9.9
8 9.9
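The one-liner keys on Value, so it assumes consecutive IDs never share the same value. A sketch of the same idea keyed on ID instead, which drops that assumption:
# take Value only on rows where the ID changes, cumsum, then fill down
df['Val'] = df['Value'].where(df['ID'].ne(df['ID'].shift())).cumsum().ffill()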
I have two dataframes df1, df2 described below
df1
prod age
0 Winalto_eu 28
1 Winalto_uc 25
2 CEM_eu 30
df2
age qx
0 25 2.7
1 26 2.8
2 27 2.8
3 28 2.9
4 29 3.0
5 30 3.2
6 31 3.4
7 32 3.7
8 33 4.1
9 34 4.6
10 35 5.1
11 36 5.6
12 37 6.1
13 38 6.7
14 39 7.5
15 40 8.2
I would like to add new columns to df1 with a for loop. The names of the new columns should be qx1, qx2, ..., qx10:
for i in range(0, 10):
    df1['qx' + str(i)]
The values should be filled in by the loop, doing a kind of VLOOKUP on the age: for instance, on the first row, for the prod 'Winalto_eu', qx1 should be the value of df2['qx'] at age 28+1, qx2 the same at age 28+2, and so on.
The target dataframe should look like this :
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Do you have any ideas? Thanks.
I think this gives what you want. I used the shift function to first generate the additional columns in df2, then merged with df1.
import pandas as pd
df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'], 'age' : [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25,41)), 'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7, 4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})
for i in range(1, 11):
    df2['qx' + str(i)] = df2.qx.shift(-i)
df3 = pd.merge(df1, df2, how='left', on=['age'])
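This leaves the original qx column (the value at the person's own age) in df3, and shift(-i) produces NaN for the last few ages of df2. If you only want qx1 through qx10, drop the base column afterwards:
df3 = df3.drop(columns='qx')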
You could also start with df1.set_index('prod', inplace=True) and then work with df2's qx column transposed.
Here's a way using .loc to filter the data:
top_n = 10
values = [df2.loc[df2['age'].gt(x), 'qx'].iloc[:top_n].tolist() for x in df1['age']]
coln = ['qx' + str(x) for x in range(1, 11)]
df1[coln] = pd.DataFrame(values)
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
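One caveat: rows with fewer than top_n matches are padded with NaN, and if no row reaches top_n matches the df1[coln] assignment fails because pd.DataFrame(values) comes back with fewer than 10 columns.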
Ridiculously overengineered solution (ser1 is df2 indexed by age, so ser1.T has the ages as columns):
ser1 = df2.set_index('age')
pd.concat([df1, pd.DataFrame(columns=['qx' + str(i) for i in range(11)],
                             data=[ser1.T.loc[:, i:i + 10].values.flatten().tolist()
                                   for i in df1['age']])],
          axis=1)
prod age qx0 qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.7 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Try:
df = (df1.assign(key=0)
         .merge(df2.assign(key=0), on="key", suffixes=["", "_y"])
         .query("age < age_y")
         .drop(["key"], axis=1))
df["q"] = df.groupby("prod")["age_y"].rank()
# keep only the first 10 positions for each prod
df = df.loc[df["q"] <= 10]
df = df.pivot_table(index=["prod", "age"], columns="q", values="qx")
df.columns = [f"qx{col:0.0f}" for col in df.columns]
df = df.reset_index()
Output:
prod age qx1 qx2 qx3 ... qx6 qx7 qx8 qx9 qx10
0 CEM_eu 30 3.4 3.7 4.1 ... 5.6 6.1 6.7 7.5 8.2
1 Winalto_eu 28 3.0 3.2 3.4 ... 4.6 5.1 5.6 6.1 6.7
2 Winalto_uc 25 2.8 2.8 2.9 ... 3.4 3.7 4.1 4.6 5.1
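As a side note, pandas 1.2+ supports cross joins directly, so the assign(key=0) trick can be replaced with (a sketch):
df = df1.merge(df2, how="cross", suffixes=["", "_y"]).query("age < age_y")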
I have two .csv files, "train_id.csv" and "train_ub.csv", which I load as pandas dataframes. Their dimensions are different, but they have one column in common, say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
They share the first column, but each dataframe is missing some IDs. Is there a way in pandas to merge them column-wise to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Note that this is an oversimplified example; the real dataframes have shapes (144233, 41) for id and (590540, 394) for ub.
You could accomplish this using an outer join. Here is the code for it:
train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
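Note that an outer merge appends the right-only rows at the end rather than interleaving them by ID; to get the sorted output shown above, sort afterwards:
train_merged = train_merged.sort_values("ID").reset_index(drop=True)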
I have appended two dataframes having the same column names. Is there an easy way to get a new column with the mean of the two dataframes, row by row?
Maybe code explains it better.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[10,20,30,40]})
df2 = pd.DataFrame({'a':[1.2,2.2,3.2,4.2],'b':[10.2,20.2,30.2,40.2]})
df = pd.concat([df1, df2])  # df1.append(df2) in older pandas; append was removed in 2.0
print(df)
df['a_mean'] = ???
a b
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
0 1.2 10.2
1 2.2 20.2
2 3.2 30.2
3 4.2 40.2
How do I efficiently create a new column a_mean with the values [1.1, 2.1, 3.1, 4.1, 1.1, 2.1, 3.1, 4.1]?
Using melt():
df = df.assign(a_mean=df1.add(df2).div(2).melt().value)
Or, using only df, you can do:
df = df.assign(a_mean=df.groupby(df.index)['a'].mean())
a b a_mean
0 1.0 10.0 1.1
1 2.0 20.0 2.1
2 3.0 30.0 3.1
3 4.0 40.0 4.1
0 1.2 10.2 1.1
1 2.2 20.2 2.1
2 3.2 30.2 3.1
3 4.2 40.2 4.1
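Note that the groupby trick works because pd.concat keeps the original row labels, so each index label 0 to 3 appears twice; with ignore_index=True that pairing (and the melt alignment above) would be lost.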
Try this:
df['a_mean'] = np.tile((df1.a.to_numpy() + df2.a.to_numpy()) / 2, 2)
As per the comments, there is already a great answer by Anky, but to extend this method you can do this:
df['a_mean2'] = np.tile((df.iloc[:len(df) // 2].a.to_numpy() + df.iloc[len(df) // 2:].a.to_numpy()) / 2, 2)
Update:
df['a_mean3'] = np.tile(df.a.to_numpy().reshape(2,-1).mean(0), 2)
Output:
print(df)
a b a_mean2 a_mean a_mean3
0 1.0 10.0 1.1 1.1 1.1
1 2.0 20.0 2.1 2.1 2.1
2 3.0 30.0 3.1 3.1 3.1
3 4.0 40.0 4.1 4.1 4.1
0 1.2 10.2 1.1 1.1 1.1
1 2.2 20.2 2.1 2.1 2.1
2 3.2 30.2 3.1 3.1 3.1
3 4.2 40.2 4.1 4.1 4.1
I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
1 2 3 4 5 6 7
John 1.3 2.8 3.0 4.4 2.6 3.1 4.8
Terry 1.1 2.3 4.1 5.5 3.7 2.1 3.8
Henry 0.3 1.0 2.0 3.0 2.7 1.1 2.8
How do I set a score ceiling such that once a score exceeds 2.5, all scores from that day onwards are FIXED at that value, no matter what the later scores are?
The output should be:
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
I tried the following, but it didn't work. I first build a boolean mask of all numbers > 2.5, then apply a mask based on the cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
You can find the first value above the threshold in each row using where with bfill, selecting the first column with iloc:
m = (df > 2.5).cumsum(axis=1) > 0
s = df.where(m).bfill(axis=1).iloc[:, 0]
print (s)
John 2.8
Terry 4.1
Henry 3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)
Or shift the mask and forward-fill the NaNs with the last values:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1).fillna(False)).ffill(axis=1)
print (df)
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
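A variant of the same idea without shift, as a sketch: mask everything strictly after the first breach (where the cumulative count of values > 2.5 exceeds one) and forward-fill:
m2 = (df > 2.5).cumsum(axis=1) > 1  # True only after the first value > 2.5
df = df.mask(m2).ffill(axis=1)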