I have two DataFrames:
First one (sp_df):
X Y density keep mass size
10 20 33 False 23 23
3 2 52 True 5 5
1.2 3 35 False 25 52
Second one (ep_df):
X Y density keep mass size
2.1 1.1 55 True 4.0 4.4
1.1 2.9 60 False 24.8 54.8
9.0 25.0 33 False 22.0 10.0
Now I need to merge them on their X/Y positions into something like this:
X-SP Y-SP density-SP ........ X-EP Y-EP density-EP......
1.5 2.0 30 1.0 2.4 28.7
So with the data shown above you would get something like this:
X-SP Y-SP density-SP keep-SP mass-SP size-SP X-EP Y-EP density-EP keep-EP mass-EP size-EP
3 2 52 True 5 5 2.1 1.1 55 True 4.0 4.4
1.2 3 35 False 25 52 1.1 2.9 60 False 24.8 54.8
10 20 33 False 23 23 9.0 25.0 33 False 22.0 10.0
My problem is that those values are rarely exactly alike, so I need some kind of comparison to work out which rows in the two DataFrames most likely belong together. Unfortunately, I have no idea how to get this done.
Any tips or advice? Thanks in advance.
You can merge the two DataFrames as a cartesian product. This produces a DataFrame in which each row of the first frame is joined with every row of the second frame. Then remove the rows where the difference between the X values of the two frames is too large. Hope the following code helps:
import pandas as pd

# cartesian product via a constant join key
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']

# take the difference and drop the rows where it is 1 or more
df['diff'] = df['X_sp'] - df['X_ep']
df = df[df['diff'] < 1]
df
Edited code:
# cartesian product via a constant join key
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']

# take the difference and keep only the rows where 0 < diff < 1.01
df['diff'] = df['X_sp'] - df['X_ep']
df = df[(df['diff'] > 0) & (df['diff'] < 1.01)]
df
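As a side note, newer pandas versions (1.2+) can do the cross join directly, and matching on both coordinates at once may be more robust than thresholding X alone. A minimal sketch, assuming sp_df and ep_df are the original frames, each X/Y pair identifies a row, and you want the nearest ep row for every sp row by Euclidean distance:

import pandas as pd

# cross join without the helper key column (pandas >= 1.2)
pairs = pd.merge(sp_df, ep_df, how='cross', suffixes=['_sp', '_ep'])

# Euclidean distance between the two positions
pairs['dist'] = ((pairs['X_sp'] - pairs['X_ep']) ** 2
                 + (pairs['Y_sp'] - pairs['Y_ep']) ** 2) ** 0.5

# for every sp row, keep the ep row with the smallest distance
best = pairs.loc[pairs.groupby(['X_sp', 'Y_sp'])['dist'].idxmin()]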
I am working with one data set that contains values with different decimal places. The data and code are shown below:
import pandas as pd

data = {
    'value': [9.1, 10.5, 11.8,
              20.1, 21.2, 22.8,
              9.5, 10.3, 11.9]
}
df = pd.DataFrame(data, columns=['value'])
Which gives the following dataframe:
value
0 9.1
1 10.5
2 11.8
3 20.1
4 21.2
5 22.8
6 9.5
7 10.3
8 11.9
Now I want to add a new column with the title 'adjusted'. This column I want to calculate with the numpy.isclose function with a tolerance of 2 (plus or minus 2). At the end I expect to have the results shown in the next table:
value adjusted
0 9.1 10
1 10.5 10
2 11.8 10
3 20.1 21
4 21.2 21
5 22.8 21
6 9.5 10
7 10.3 10
8 11.9 10
I tried the line below, but I only get True/False results, and it only works for one value (10), not for all values.
np.isclose(df['value'], 10, atol=2)
So can anybody help me solve this problem and compute the tolerance for the values 10 and 21 in one line?
The exact logic and how this would generalize are not fully clear. Below are two options.
Assuming you want to test your values against a list of defined references, you can use the underlying numpy array and broadcasting:
vals = np.array([10, 21])
a = df['value'].to_numpy()
m = np.isclose(a[:, None], vals, atol=2)
df['adjusted'] = np.where(m.any(1), vals[m.argmax(1)], np.nan)
Assuming you want to group successive values, you can get the diff and start a new group when the difference is above threshold. Then round and get the median per group with groupby.transform:
group = df['value'].diff().abs().gt(2).cumsum()
df['adjusted'] = df['value'].round().groupby(group).transform('median')
Output:
value adjusted
0 9.1 10.0
1 10.5 10.0
2 11.8 10.0
3 20.1 21.0
4 21.2 21.0
5 22.8 21.0
6 9.5 10.0
7 10.3 10.0
8 11.9 10.0
Somewhat similar to an earlier question I had here: Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID
However, instead of just taking the sum of the data points, I want to have the weighted average in an extra column. I'll repeat and rephrase the question:
I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surfaces and U-values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and the surface-weighted average U-value per apartment. There are three conditions for the original dataframe:
1. The dataframe can contain empty cells.
2. When the Surface or U-value is identical for all rows within an ID (all the same values for the same ID, as for 'ID 4'), the data (surface, volumes) is not summed; instead, a single value/row is passed to the new summary column. (This could be a mistake in the original dataframe, where the government employee inserted the total surface/volume for all the rooms.)
3. The average U-value should be the surface-weighted average U-value.
Initial dataframe 'data':
print(data)
ID Surface U-value
0 2 10.0 1.0
1 2 12.0 1.0
2 2 24.0 0.5
3 2 8.0 1.0
4 4 84.0 0.8
5 4 84.0 0.8
6 4 84.0 0.8
7 52 NaN 0.2
8 52 96.0 1.0
9 95 8.0 2.0
10 95 6.0 2.0
11 95 12.0 2.0
12 95 30.0 1.0
13 95 12.0 1.5
Desired output from 'df':
print(df)
ID Surface U-value # U-value = surface-weighted U-value; Surface = sum of all surfaces, except when all surfaces per ID are the same (example: 'ID 4')
0 2 54.0 0.777
1 4 84.0 0.8 # values are the same for each row of this ID in the original data, so one row is passed instead of the sum (second condition)
2 52 96.0 1.0 # one of the two surfaces is empty, so the corresponding U-value is ignored; the output is the weighted average of the rows that have both a 'Surface' and a 'U-value' (here 1.0)
3 95 68.0 1.47
jezrael's code from the referenced answer already works brilliantly for the sum(), but how do I add a weighted-average 'U-value' column to it? I really have no idea. A plain average could be computed with mean() instead of sum(), but the weighted average..?
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})
data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5],
})
print(data)

cols = ['Surface']
# True where every value within an ID is identical...
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# ...and True for every duplicate (value, ID) pair after the first
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
# mask the repeats of constant-per-ID values so they are not double-counted
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
This should do the trick:
data.groupby('ID').apply(lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
To add this to the original dataframe, don't reset the index first:
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum()
df['U-value'] = data.groupby('ID').apply(
    lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
df.reset_index(inplace=True)
The result:
ID Surface U-value
0 2 54.0 0.777778
1 4 84.0 0.800000
2 52 96.0 1.000000
3 95 68.0 1.470588
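As a side note, numpy.average takes a weights argument, so a sketch of an equivalent spelling (dropping the rows with a missing Surface first, which reproduces the 'ID 52' behaviour of ignoring the 0.2 U-value) would be:

import numpy as np

# rows without a Surface are removed so every weight is valid
wavg = (data.dropna(subset=['Surface'])
            .groupby('ID')
            .apply(lambda g: np.average(g['U-value'], weights=g['Surface'])))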
I am trying to merge two DataFrames based on the intersection of min-max values. Does anyone have a nice way to do it with Pandas?
First DataFrame (df1):
   min  max   x1
0    1   20  0.5
1   20   30  1.5
Second DataFrame (df2):
   min  max   x2
0    1   12  1.2
1   12   30  2.2
Desired output:
   min  max   x1   x2
0    1   12  0.5  1.2
1   12   20  0.5  2.2
2   20   30  1.5  2.2
Thx!
This gives you what you're looking for based on your data set above, but I have the feeling it may not work in more complex situations.
Code:
# Simple concatenation - since it looks like you want it ordered, you can
# sort by 'max' here and then reset the index.
# (df1.append(df2) also works on older pandas; append was removed in 2.0.)
df = (pd.concat([df1, df2])
        .sort_values(by='max')[['min', 'max', 'x1', 'x2']]
        .reset_index(drop=True))
# Here, set 'min' for all but the first row to the 'max' of the previous row
df.loc[1:, 'min'] = df['max'].shift()
# Back-fill the NaNs in the x-columns
df = df.bfill()
# Filter out rows where min == max
df = df.loc[df['min'] != df['max']]
Output:
min max x1 x2
0 1.0 12 0.5 1.2
1 12.0 20 0.5 2.2
2 20.0 30 1.5 2.2
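For what it's worth, a variant that does not rely on the pairwise ordering is to take the union of all breakpoints and look each sub-interval up in both frames through an IntervalIndex. A sketch, assuming each frame's intervals cover the whole range without gaps:

import numpy as np
import pandas as pd

# union of every breakpoint from both frames
edges = np.union1d(df1[['min', 'max']].to_numpy().ravel(),
                   df2[['min', 'max']].to_numpy().ravel())
out = pd.DataFrame({'min': edges[:-1], 'max': edges[1:]})

# locate each sub-interval (via its midpoint) in the original frames
mid = (out['min'] + out['max']) / 2
iv1 = pd.IntervalIndex.from_arrays(df1['min'], df1['max'])
iv2 = pd.IntervalIndex.from_arrays(df2['min'], df2['max'])
out['x1'] = df1['x1'].to_numpy()[iv1.get_indexer(mid)]
out['x2'] = df2['x2'].to_numpy()[iv2.get_indexer(mid)]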
I want to create a plot from a large Pandas dataframe. The data is in the following format
Type Number ...unimportant additional columns
Foo 13 ...
Foo 25 ...
Foo 56 ...
Foo 56 ...
Bar 10 ...
Bar 10 ...
Bar 11 ...
Bar 23 ...
I need to count the number of elements from column 'Number' in a sliding window from x to x+i to determine the number of values falling in each sliding window bucket.
For example, with a window size of i=10, starting at x=0 and incrementing x by 1 each step, a correct result for the example above would be:
Foo Bar
0 0 2 #(0-10)
1 0 3 #(1-11)
2 0 3 #(2-12)
3 1 3 #(3-13)
4 1 3 #(4-14)
.
.
.
13 1 1 #(13-23)
14 0 1 #(14-24)
15 1 1 #(15-25)
.
.
.
The answer would have df['Number'].max() - [window length] rows and one column per distinct Type.
Toy code to generate a similar dataframe might be the following:
import pandas as pd
import numpy as np

str_arr = ['Foo', 'Bar', 'Python', 'PleaseHelp']
# np.matrix is deprecated; a plain reshaped array does the same job here
data1 = np.random.choice(str_arr, 100, p=[0.5, 0.1, 0.1, 0.3]).reshape(-1, 1)
data2 = np.random.randint(100, size=(100, 1))
# note: concatenating strings with ints promotes everything to strings,
# which is why 'Number' is cast back to int in the answer below
merge = np.concatenate((data1, data2), axis=1)
df = pd.DataFrame(merge, index=range(100), columns=['Type', 'Number'])
df.sort_values(['Type', 'Number'], ascending=[True, True], inplace=True)
df = df.reset_index(drop=True)
How can I generate such a list efficiently?
Edit note: Thanks to FLab, who answered my original question before I clarified it.
Here is my proposed solution.
For convenience, let's force the 'Number' column to be an int.
df['Number'] = df['Number'].astype(int)
Define all possible ranges:
len_wdw = 10
all_ranges = [(i, i+len_wdw) for i in range(df['Number'].max()-len_wdw)]
And now check how many observations there are for "Number" in each of these ranges:
def get_mask(df, rg):
    # rg is a range tuple, e.g. (10, 20); both ends are inclusive
    return (df['Number'] >= rg[0]) & (df['Number'] <= rg[1])

# one column per window start, one row per Type, then transpose
result = pd.concat({rg[0]: df[get_mask(df, rg)].groupby('Type').count()['Number']
                    for rg in all_ranges},
                   axis=1).fillna(0).T
For the randomly generated numbers, this gives:
Bar Foo PleaseHelp Python
0 1.0 4.0 3.0 1.0
1 1.0 5.0 2.0 1.0
2 1.0 5.0 3.0 1.0
3 1.0 4.0 3.0 0.0
4 1.0 3.0 3.0 1.0
.....
85 2.0 3.0 4.0 1.0
86 1.0 3.0 3.0 1.0
87 1.0 4.0 3.0 1.0
88 1.0 4.0 4.0 1.0
89 1.0 3.0 5.0 1.0
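As an aside, the per-range concat can become slow when the value range is large; a vectorized sketch of the same idea, assuming the integer 'Number' column from above, could use a crosstab plus a rolling sum:

len_wdw = 10

# counts of each exact Number value per Type, on a dense 0..max index
counts = pd.crosstab(df['Number'], df['Type'])
full = counts.reindex(range(df['Number'].max() + 1), fill_value=0)

# row i of the rolling sum covers the inclusive window [i - len_wdw, i];
# slicing off the first len_wdw rows and relabelling indexes each row
# by its window start x
windowed = full.rolling(len_wdw + 1).sum().iloc[len_wdw:]
windowed.index = range(len(windowed))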
I have a dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(5, 2)), columns=list('AB'))
A B
0 92 65
1 61 97
2 17 39
3 70 47
4 56 6
Here are 5% quantiles:
down_quantiles = df.quantile(0.05)
A 24.8
B 12.6
And here is the mask for values that are lower than quantiles:
outliers_low = (df < down_quantiles)
A B
0 False False
1 False False
2 True False
3 False False
4 False True
I want to set the values in df that are lower than the quantile to their column's quantile. I can do it like this:
df[outliers_low] = np.nan
df.fillna(down_quantiles, inplace=True)
A B
0 92.0 65.0
1 61.0 97.0
2 24.8 39.0
3 70.0 47.0
4 56.0 12.6
But certainly there should be a more elegant way. How can I do this without fillna?
Thanks.
You can use the DataFrame.mask() method. Wherever the mask holds a True, the value gets replaced by the matching entry of the other series, aligned on column names by providing axis=1.
df.mask(outliers_low, down_quantiles, axis=1)
Another variant would be to use the DataFrame.where() method after inverting your boolean mask with the tilde (~) operator.
df.where(~outliers_low, down_quantiles, axis=1)
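For this particular case (flooring each column at its 5% quantile), DataFrame.clip with a Series lower bound is arguably the most direct spelling:

# align the quantile Series with the columns and floor each value at it
df.clip(lower=down_quantiles, axis=1)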