Calculating tolerance - Python

I am working with a data set that contains values with different decimal places. The data and code are shown below:
import pandas as pd

data = {
    'value': [9.1, 10.5, 11.8,
              20.1, 21.2, 22.8,
              9.5, 10.3, 11.9,
              ]
}
df = pd.DataFrame(data, columns=['value'])
Which gives the following dataframe:
value
0 9.1
1 10.5
2 11.8
3 20.1
4 21.2
5 22.8
6 9.5
7 10.3
8 11.9
Now I want to add a new column with the title adjusted. I want to calculate this column with the numpy.isclose function, with a tolerance of 2 (plus or minus 1). In the end I expect to have the results shown in the next table:
value adjusted
0 9.1 10
1 10.5 10
2 11.8 10
3 20.1 21
4 21.2 21
5 22.8 21
6 9.5 10
7 10.3 10
8 11.9 10
I tried this line, but I only get True/False results, and it only works for one value (10), not for all values:
np.isclose(df['value'], 10, atol=2)
So can anybody help me solve this problem and calculate the tolerance for the values 10 and 21 in one line?

The exact logic and how this would generalize are not fully clear. Below are two options.
Assuming you want to test your values against a list of defined references, you can use the underlying numpy array and broadcasting:
import numpy as np

vals = np.array([10, 21])   # reference values to test against
a = df['value'].to_numpy()
m = np.isclose(a[:, None], vals, atol=2)
df['adjusted'] = np.where(m.any(1), vals[m.argmax(1)], np.nan)
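For intuition: a[:, None] has shape (9, 1) and vals has shape (2,), so broadcasting produces a 9x2 boolean matrix m, where m[i, j] says whether value i is within 2 of vals[j]. A quick check with the sample data above:
print(m.shape)   # (9, 2)
print(m[:3])     # the first three values match 10 only:
                 # [[ True False]
                 #  [ True False]
                 #  [ True False]]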
Assuming you want to group successive values, you can get the diff and start a new group when the difference is above threshold. Then round and get the median per group with groupby.transform:
group = df['value'].diff().abs().gt(2).cumsum()
df['adjusted'] = df['value'].round().groupby(group).transform('median')
Output:
value adjusted
0 9.1 10.0
1 10.5 10.0
2 11.8 10.0
3 20.1 21.0
4 21.2 21.0
5 22.8 21.0
6 9.5 10.0
7 10.3 10.0
8 11.9 10.0

Related

Estimate the average value grouped by a specific column using Python

I have an ASCII file containing 2 columns as follows:
id value
1 15.1
1 12.1
1 13.5
2 12.4
2 12.5
3 10.1
3 10.2
3 10.5
4 15.1
4 11.2
4 11.5
4 11.7
5 12.5
5 12.2
I want to estimate the average value of column "value" for each id (i.e. group by id).
Is it possible to do that in Python using numpy or pandas?
If you don't know how to read the file, there are several methods you could use, e.g. pd.read_csv().
Once you have read the file, you can group and average with pd.DataFrame.groupby() and pd.Series.mean():
df.groupby('id').mean()
#if df['id'] is the index, try this:
#df.reset_index().groupby('id').mean()
Output:
value
id
1 13.566667
2 12.450000
3 10.266667
4 12.375000
5 12.350000
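For completeness, here is a minimal end-to-end sketch of this approach, assuming the file is whitespace-delimited and named data.txt (the filename is an assumption):
import pandas as pd

# sep=r"\s+" handles columns separated by one or more spaces
df = pd.read_csv("data.txt", sep=r"\s+")
print(df.groupby("id")["value"].mean())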
Alternatively, if the columns are fixed-width, you can read the file directly with pd.read_fwf() and then group:
import pandas as pd

filename = "data.txt"
df = pd.read_fwf(filename)
df.groupby(['id']).mean()
Output
value
id
1 13.566667
2 12.450000
3 10.266667
4 12.375000
5 12.350000

pandas: how to select the max of two columns if a column is greater than x, otherwise the mean?

I have a df that looks like this, and I want to add an adj_mean column that selects the max if one of the two columns (Avg or rolling_mean) is 0, and otherwise takes the average of the two columns.
ID Avg rolling_mean adj_mean (goal to have this column)
0 5 0 5
1 6 6.3 6.15
2 5 8 6.5
3 4 0 4
I was able to get the max value of the columns using this code
df["adj_mean"]=df[["Avg", "rolling_mean"]].max(axis=1)
but I am not sure how to take the average when both values are greater than zero.
Many thanks!
One approach is to treat 0 as NaN and then simply calculate the mean:
import numpy as np

df['adj_mean'] = df.replace({0: np.nan})[["Avg", "rolling_mean"]].mean(axis=1)
Out[1]:
rolling_mean Avg adj_mean
0 0.0 5 5.00
1 6.3 6 6.15
2 8.0 5 6.50
3 0.0 4 4.00
By default, df.mean() skips null values. Per the docs:
skipna : bool, default True
Exclude NA/null values when computing the result.
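If you prefer to make the condition explicit, here is a sketch using np.where, assuming the column names Avg and rolling_mean from the question:
import numpy as np

# if either column is 0, take the max of the two; otherwise take their mean
either_zero = (df["Avg"] == 0) | (df["rolling_mean"] == 0)
df["adj_mean"] = np.where(either_zero,
                          df[["Avg", "rolling_mean"]].max(axis=1),
                          df[["Avg", "rolling_mean"]].mean(axis=1))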

Pandas create multiple dataframes by applying different weightings

I've got a dataframe with 3 columns and I want to add them together and test different weights.
I've written this code so far but I feel this might not be the best way:
weights = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
for i in weights:
    for j in weights:
        for k in weights:
            outname = 'outname' + str(i) + 'TV' + str(j) + 'BB' + str(k) + 'TP'
            df_media[outname] = (df_media['TP'].multiply(i)
                                 + df_media['TV'].multiply(j)
                                 + df_media['BB'].multiply(k))
Below is the input dataframe and the first output iteration of the loops. So all of the columns have been multiplied by 0.5.
df_media:
TV BB TP
1 2 6
11 4 5
4 4 3
Output DataFrame:
'Outname0.5TV0.5BB0.5TP'
4.5
10
5.5
Dictionary
If you need a dataframe for each loop, you can use a dictionary. With this solution you also don't need to store your factor in your column name, since the weight can be your key. Here's one way via a dictionary comprehension:
weights = [0.5,0.6,0.7,0.8,0.9,1.0]
col_name = '-'.join(df_media.columns)
dfs = {w: (df_media.sum(1) * w).to_frame(col_name) for w in weights}
print(dfs[0.5])
TV-BB-TP
0 4.5
1 10.0
2 5.5
Single dataframe
Much more efficient is to store your result in a single dataframe. This removes the need for a Python-level loop.
import numpy as np

res = pd.DataFrame(df_media.sum(1).values[:, None] * np.array(weights),
                   columns=weights)
print(res)
0.5 0.6 0.7 0.8 0.9 1.0
0 4.5 5.4 6.3 7.2 8.1 9.0
1 10.0 12.0 14.0 16.0 18.0 20.0
2 5.5 6.6 7.7 8.8 9.9 11.0
Then, for example, access the first weight as a series via res[0.5].
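If the goal is really a separate weight per column, as in the original triple loop, a minimal sketch using itertools.product (assuming the same df_media and weights as above) avoids the nesting:
from itertools import product

# one weighted-sum Series per (TV, BB, TP) weight combination
combos = {
    f"{i}TV_{j}BB_{k}TP": df_media["TV"] * i + df_media["BB"] * j + df_media["TP"] * k
    for i, j, k in product(weights, repeat=3)
}
res_all = pd.DataFrame(combos)   # 216 columns, one per weight combination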

Merge Pandas DataFrame where Values are not exactly alike

I have two DataFrames:
First one (sp_df)
X Y density keep mass size
10 20 33 False 23 23
3 2 52 True 5 5
1.2 3 35 False 25 52
Second one (ep_df)
X Y density keep mass size
2.1 1.1 55 True 4.0 4.4
1.1 2.9 60 False 24.8 54.8
9.0 25.0 33 False 22.0 10.0
Now I need to merge them on their X/Y position into something like this:
X-SP Y-SP density-SP ........ X-EP Y-EP density-EP......
1.5 2.0 30 1.0 2.4 28.7
So with the Data shown above you would get something like this:
X-SP Y-SP density-SP keep-SP mass-SP size-SP X-EP Y-EP density-EP keep-EP mass-EP size-EP
3 2 52 True 5 5 2.1 1.1 55 True 4.0 4.4
1.2 3 35 False 25 52 1.1 2.9 60 False 24.8 54.8
10 20 33 False 23 23 9.0 25.0 33 False 22.0 10.0
My problem is that those values are not exactly alike, so I need some kind of comparison to determine which rows in the two dataframes are most likely the same. Unfortunately, I have no idea how to get this done.
Any tips or advice? Thanks in advance.
You can merge the two dataframes as a cartesian product. This makes a dataframe in which each row of the first dataframe is joined with every row of the second dataframe. Then remove the rows where the difference between the X values of the two dataframes is too large. Hope the following code helps:
import pandas as pd
#cartesian_product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']
## take the X difference and drop rows
## where it is 1 or more
df['diff'] = df['X_sp'] - df['X_ep']
drop = df.index[df["diff"] >= 1].tolist()
df = df.drop(df.index[drop])
df
Edited code:
#cartesian_product
sp_df['key'] = 1
ep_df['key'] = 1
df = pd.merge(sp_df, ep_df, on='key', suffixes=['_sp', '_ep'])
del df['key']
## take the X difference and drop rows where the
## difference is at least 1.01 or not positive
df['diff'] = df['X_sp'] - df['X_ep']
drop = df.index[df["diff"] >= 1.01].tolist()
drop_negative = df.index[df["diff"] <= 0].tolist()
dropped_values = drop + drop_negative
df = df.drop(df.index[dropped_values])
df
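On recent pandas (>= 1.2) you can get the same cartesian product with how='cross' instead of the dummy key, and keeping the closest match per row avoids a hand-tuned cutoff. A sketch, assuming the original sp_df/ep_df:
import numpy as np
import pandas as pd

# cartesian product without a dummy key (pandas >= 1.2)
sp = sp_df.reset_index().rename(columns={'index': 'sp_id'})
cross = sp.merge(ep_df, how='cross', suffixes=('_sp', '_ep'))

# Euclidean X/Y distance between every sp/ep pair
cross['dist'] = np.hypot(cross['X_sp'] - cross['X_ep'], cross['Y_sp'] - cross['Y_ep'])

# keep, for each sp row, the ep row with the smallest distance
matched = cross.loc[cross.groupby('sp_id')['dist'].idxmin()].drop(columns=['sp_id', 'dist'])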

Pandas np.where with matching range of values on a row

Test data:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'AAA': [4, 5, 6, 7, 9, 10],
     'BBB': [10, 20, 30, 40, 11, 10],
     'CCC': [100, 50, 25, 10, 10, 11]})
In [2]:df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 25
3 7 40 10
4 9 11 10
5 10 10 11
In [3]: thresh = 2
df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)
The np.where statement above provides the max of df.AAA and df.BBB when df.AAA and df.BBB are exactly aligned. I would like to have the max when the columns are within thresh of each other, and also consider all columns. It does not have to be via np.where. Can you please show me ways of approaching this?
So for row 5 it should be 11.0 in df.aligned as this is the max value and within thresh of df.AAA and df.BBB.
Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.
Current Output with my code:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 10.0
Desired Output:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 11.0
5 10 10 11 11.0
The desired output shows rows 4 and 5 with values in df.aligned, as these rows have values within thresh of each other (values 10 and 11 are within the range specified by the thresh variable).
"Within thresh distance" to me means that the difference between the max
and the min of a row should be less than thresh. We can use DataFrame.apply with parameter axis=1 so that we apply the lambda function on each row.
In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min())<thresh, axis=1)
100 loops, best of 3: 1.89 ms per loop
Alternatively, there's a faster solution, as pointed out by @root:
filt_thresh = np.ptp(df.values, axis=1) < thresh
10000 loops, best of 3: 48.9 µs per loop
Or, staying with pandas:
filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
We can now use boolean indexing and calculate the max of each row that matches (hence the axis=1 parameter in max() again):
In [2]: df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)
In [3]: df
Out[3]:
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 11.0
Update:
If you wanted to calculate the minimum distance between elements for each row, that'd be equivalent to sorting the array of values (np.sort()), calculating the difference between consecutive numbers (np.diff), and taking the min of the resulting array. Finally, compare that to thresh.
Here's the apply way that has the advantage of being a bit clearer to understand.
filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)
1000 loops, best of 3: 713 µs per loop
And here's the vectorized equivalent:
filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh
The slowest run took 4.31 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
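Putting the update together, a short sketch that reproduces the desired output; it restricts the computation to the three value columns so the aligned column itself is not included (column names as in the question):
cols = ['AAA', 'BBB', 'CCC']

# True where the smallest pairwise gap within a row is below thresh
filt_thresh = np.diff(np.sort(df[cols])).min(axis=1) < thresh
df['aligned'] = np.where(filt_thresh, df[cols].max(axis=1), np.nan)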
