Python 3.9.5
The first big DataFrame contains points and the second big DataFrame contains square areas. A square area is bounded by four straight lines parallel to the coordinate axes and is completely defined by a set of constraints: y_min, y_max, x_min, x_max. For example:
points = pd.DataFrame({'y':[0.5, 0.5, 1.5, 1.5], 'x':[0.5, 1.5, 1.5, 0.5]})
points
     y    x
0  0.5  0.5
1  0.5  1.5
2  1.5  1.5
3  1.5  0.5
square_areas = pd.DataFrame({'y_min':[0,1], 'y_max':[1,2], 'x_min':[0,1], 'x_max':[1,2]})
square_areas
   y_min  y_max  x_min  x_max
0      0      1      0      1
1      1      2      1      2
How can I get all points that don't belong to any square area, without sequentially enumerating the areas in a loop?
Desired output:
     y    x
0  0.5  1.5
1  1.5  0.5
I'm not sure how to do this with a merge, but you can iterate over the square_areas DataFrame and evaluate its conditions against the points DataFrame.
I'm assuming you'll have more than two areas, so this iterative approach should work. Each iteration only looks at points that have not already been matched by a prior square_areas row.
import numpy as np
import pandas as pd

points = pd.DataFrame({'y': [0.5, 0.5, 1.5, 1.5], 'x': [0.5, 1.5, 1.5, 0.5]})
print(points)

# assume everything is outside until it evaluates inside
points['outside'] = 'Y'

square_areas = pd.DataFrame({'y_min': [0, 1], 'y_max': [1, 2], 'x_min': [0, 1], 'x_max': [1, 2]})
print(square_areas)

for i in range(square_areas.shape[0]):
    ymin = square_areas.iloc[i]['y_min']
    ymax = square_areas.iloc[i]['y_max']
    xmin = square_areas.iloc[i]['x_min']
    xmax = square_areas.iloc[i]['x_max']
    # only re-evaluate points that are still marked outside
    still_out = points['outside'] == 'Y'
    inside_this_area = (points.loc[still_out, 'x'].between(xmin, xmax)
                        & points.loc[still_out, 'y'].between(ymin, ymax))
    points.loc[still_out, 'outside'] = np.where(inside_this_area, 'N', 'Y')

points.loc[points['outside'] == 'Y']
Output
         y        x outside
1  0.50000  1.50000       Y
3  1.50000  0.50000       Y
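For the loop-free version the question asks about, here is a minimal sketch using NumPy broadcasting to test every point against every area at once; it builds a boolean matrix of shape n_points x n_areas, so memory grows with the product of both frame sizes:
import numpy as np
import pandas as pd

points = pd.DataFrame({'y': [0.5, 0.5, 1.5, 1.5], 'x': [0.5, 1.5, 1.5, 0.5]})
square_areas = pd.DataFrame({'y_min': [0, 1], 'y_max': [1, 2], 'x_min': [0, 1], 'x_max': [1, 2]})

# inside[i, j] is True if point i falls inside area j (boundaries inclusive)
inside = (
    (points['y'].values[:, None] >= square_areas['y_min'].values)
    & (points['y'].values[:, None] <= square_areas['y_max'].values)
    & (points['x'].values[:, None] >= square_areas['x_min'].values)
    & (points['x'].values[:, None] <= square_areas['x_max'].values)
)

# keep the points that fall inside no area at all
print(points[~inside.any(axis=1)])
This prints the same two rows (indices 1 and 3) as the iterative output above.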
For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([['A', 1, 1.2, 1.4, 2, 2],
                            ['B', 1.5, 1, 1.4, 1.3, 1.2],
                            ['C', 1, 1.2, 1.6, 2, 1.4],
                            ['D', 1.7, 1.5, 1.5, 1.5, 1.4],
                            ['E', 1.6, 1.9, 1.8, 3, 2.5],
                            ['F', 2, 2.2, 1.9, 2, 2]]),
                  columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])
This creates the following table:
Gene  Healthy 1  Healthy 2  Healthy 3  Unhealthy 1  Unhealthy 2
A     1.0        1.2        1.4        2.0          2.0
B     1.5        1.0        1.4        1.3          1.2
C     1.0        1.2        1.6        2.0          1.4
D     1.7        1.5        1.5        1.5          1.4
E     1.6        1.9        1.8        3.0          2.5
F     2.0        2.2        1.9        2.0          2.0
The X and Y coordinates of each sample are then calculated by adding up the genes' contributions, each contribution being its parameter/weight multiplied by the measured value. The first 4 genes contribute towards the Y value, whilst genes 5 and 6 determine the X value. wA - wF are the parameters/weights associated with their gene A-F counterparts.
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
n = 0
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = wE*x1 + wF*x2
    result = (TrueX, TrueY)
    n += 1
    label = f"({TrueX},{TrueY})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')
We thus calculate all the coordinates and plot them:
[scatter plot of the five sample coordinates]
What I would now like to do is find out how I can optimize the wA-wF parameters/weights such that the healthy samples are pushed towards the origin of the plot, let's say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, let's say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed and would appreciate any help available.
Here's an example using scipy.optimize combined with your code. (Since your code contains some syntax and type errors, I had to make small corrections.)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
[1.5, 1, 1.4, 1.3, 1.2],
[1, 1.2, 1.6, 2, 1.4],
[1.7, 1.5, 1.5, 1.5, 1.4],
[1.6, 1.9, 1.8, 3, 2.5],
[2, 2.2, 1.9, 2, 2]]),
columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
index=['A', 'B', 'C', 'D', 'E', 'F'])
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
from scipy.optimize import minimize
# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])
# the objective function to be minimized
# - it computes the (squared) distances of the healthy samples to (0,0)
#   and of the unhealthy samples to (1,1)
def fun(w):
    weighted = df.values * w[:, None]  # multiply all sample values by their weight
    y = sum(weighted[:4])    # the 5 "TrueY" coordinates (genes A-D)
    x = sum(weighted[4:])    # the 5 "TrueX" coordinates (genes E-F)
    y[3:] -= 1               # shift the "Unhealthy" samples towards the target y = 1
    x[3:] -= 1               # shift the "Unhealthy" samples towards the target x = 1
    return sum(x**2 + y**2)  # return the sum of squared distances to the targets
res = minimize(fun, w0)
print(res)
# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x
# this is mostly your unchanged code
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = wE*x1 + wF*x2
    result = (TrueX, TrueY)
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')
plt.savefig("mygraph.png")
This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1).
You may want to experiment with other optimization methods - see scipy.optimize.minimize.
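For instance, here is a minimal variation (same fun and w0 as above) that selects the derivative-free Nelder-Mead method instead of the default:
# same objective and starting point; only the solver changes
res = minimize(fun, w0, method='Nelder-Mead')
print(res.x)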
I have a real-valued numpy array of size (1000,). All values lie between 0 and 1, and I want to convert this to a categorical array. All values less than 0.25 should be assigned to category 0, values between 0.25 and 0.5 to category 1, 0.5 to 0.75 to category 2, and 0.75 to 1 to category 3. Logical indexing doesn't seem to work:
Y[Y < 0.25] = 0
Y[np.logical_and(Y >= 0.25, Y < 0.5)] = 1
Y[np.logical_and(Y >= 0.5, Y < 0.75)] = 2
Y[Y >= 0.75] = 3
Result:
for i in range(4):
    print(f"Y == {i}: {sum(Y == i)}")
Y == 0: 206
Y == 1: 0
Y == 2: 0
Y == 3: 794
What needs to be done instead?
The error is in your conversion logic, not in your indexing. The final statement:
Y[Y >= 0.75] = 3
converts not only the values in the range 0.75-1.00, but also the prior assignments to classes 1 and 2, since those labels are themselves >= 0.75.
You can reverse the assignment order, starting with class 3, so that earlier labels never fall into a later mask's range.
You can put an upper limit on the final class, although you still have a boundary problem with the value 1.00 vs the class label 1.
Perhaps best would be to harness the regularity of your divisions, such as:
Y = (4*Y).astype(int)  # but you still have the Y == 1.0 boundary problem
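As a minimal sketch of these fixes (Y here is a random stand-in for your (1000,) array): writing into a separate output array sidesteps the re-assignment problem entirely, and np.clip guards the Y == 1.0 boundary case of the arithmetic shortcut:
import numpy as np

Y = np.random.rand(1000)  # stand-in for the original data

# write labels into a separate array, so earlier assignments
# can never be re-matched by a later condition
labels = np.empty_like(Y, dtype=int)
labels[Y < 0.25] = 0
labels[(Y >= 0.25) & (Y < 0.5)] = 1
labels[(Y >= 0.5) & (Y < 0.75)] = 2
labels[Y >= 0.75] = 3

# arithmetic shortcut: np.clip maps a possible Y == 1.0 into class 3
labels2 = np.clip((4 * Y).astype(int), 0, 3)
assert (labels == labels2).all()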
My question is quite straightforward. I want to bin polar coordinates, which means that the domain is limited by 0 and 360, where 0 = 360. Here my problems start, due to this circular behaviour of the data: I want 1-degree bins with edges running from 0.5 degrees up to 359.5 degrees (unfortunately, due to the nature of the project, I cannot simply bin from (0, 1] up to (359, 360]), so I have to make sure that there is a bin that wraps around from (359.5, 0.5], which is obviously not what happens by default.
I made up this script to better illustrate what I am looking for:
import numpy as np
import pandas as pd

bins_direction = np.linspace(0.5, 360.5, 360, endpoint=False)
points = np.random.rand(10000) * 360
df = pd.DataFrame({'Points': points})
df['Bins'] = pd.cut(x=df['Points'], bins=bins_direction)
You will see that if a point lies between 359.5 and 0.5 degrees, its bin will be NaN. I want to find a solution in which it would instead be (359.5, 0.5].
So, my result (depending on which seed you set, of course), will look something like this:
Points Bins
0 17.102993 (16.5, 17.5]
1 97.665600 (97.5, 98.5]
2 46.697548 (46.5, 47.5]
3 9.832000 (9.5, 10.5]
4 21.260980 (20.5, 21.5]
5 47.433179 (46.5, 47.5]
6 359.813283 nan
7 355.654251 (355.5, 356.5]
8 0.23740105 nan
And I would like it to be:
Points Bins
0 17.102993 (16.5, 17.5]
1 97.665600 (97.5, 98.5]
2 46.697548 (46.5, 47.5]
3 9.832000 (9.5, 10.5]
4 21.260980 (20.5, 21.5]
5 47.433179 (46.5, 47.5]
6 359.813283 (359.5, 0.5]
7 355.654251 (355.5, 356.5]
8 0.23740105 (359.5, 0.5]
Since you cannot have a pandas Interval of the form (359.5, 0.5], you can only have the wrap-around bin as a string:
import numpy as np
import pandas as pd

bins = [0] + list(np.linspace(0.5, 359.5, 360)) + [360]
df = pd.DataFrame({'Points': [0, 1, 350, 356, 357, 359]})
(pd.cut(df['Points'], bins=bins, include_lowest=True)
   .astype(str)
   .replace({'(-0.001, 0.5]': '(359.5, 0.5]', '(359.5, 360.0]': '(359.5, 0.5]'})
)
Output:
0      (359.5, 0.5]
1        (0.5, 1.5]
2    (349.5, 350.5]
3    (355.5, 356.5]
4    (356.5, 357.5]
5    (358.5, 359.5]
Name: Points, dtype: object
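An alternative sketch, if string labels are acceptable: rotate the data by -0.5 degrees (mod 360) so the wrap-around bin becomes an ordinary one, then rebuild the labels in the original coordinates. Note that np.floor gives left-closed bins, a slightly different boundary convention from pd.cut's right-closed intervals:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Points': [17.102993, 355.654251, 359.813283, 0.23740105]})

# Rotate by -0.5 deg (mod 360): the wrap-around bin (359.5, 0.5]
# maps onto the ordinary first bin in shifted coordinates
shifted = (df['Points'] - 0.5) % 360
idx = np.floor(shifted).astype(int)  # integer bin index 0..359

# rebuild string labels in the original coordinates;
# index 359 becomes the wrap-around bin (359.5, 0.5]
df['Bins'] = [f"({(k + 0.5) % 360}, {(k + 1.5) % 360}]" for k in idx]
print(df)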
I'm fiddling around with something like this.
bins = [0, .25, .5, .75, 1, 1.25, 1.5, 1.75, 2]
labels = ['0', '.25', '.5', '.75', '1', '1.25', '1.5', '1.75', '2']
dataset['RatingScore'] = pd.cut(dataset['Rating'], bins, labels)
What I am actually getting is a range, like this: (0.75, 1.0]
I would like to get results like this: .75 or 1 or 1.25
Is it possible to get a specific number and NOT a range? Thanks.
Andy, your code runs, and it gives me actual numbers, rather than ranges, but I'm seeing a lot of gaps too.
You passed labels as the 3rd positional argument of pd.cut, but the third parameter of pd.cut is right=..., which accepts True/False. A non-empty list is truthy, so it is treated as True, and pd.cut executes as if no labels were given. You need to use a keyword argument to correctly pass the list as labels for pd.cut.
Another thing: the number of bin edges must be one more than the number of labels. You need to add np.inf to the right end of the bins list:
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.6, 0.1, 0.9, 2])
bins = [0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, np.inf]
labels = ['0', '.25', '.5', '.75', '1', '1.25', '1.5', '1.75', '2']
s_cat = pd.cut(s, bins=bins, labels=labels)
s_cat
Output:
0 0
1 .5
2 0
3 .75
4 1.75
dtype: category
Categories (9, object): [0 < .25 < .5 < .75 ... 1.25 < 1.5 < 1.75 < 2]
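To make the positional-argument pitfall concrete, here is a sketch using the same s, bins and labels; at least in the pandas versions this question concerns, the list merely lands in the truthy right= slot, so no error is raised:
# third positional argument is `right`, not `labels`:
pd.cut(s, bins, labels)              # list is only truthy; interval labels kept
pd.cut(s, bins=bins, labels=labels)  # keyword form: labels actually applied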
If you don't add infinity to the bins, each output value is either a float (np.nan) or an Interval. Let's say you want to take the right edge of each interval; you could try the following:
import pandas as pd
import numpy as np

def fun(x):
    # NaN comes through as a plain float; intervals expose their right edge
    if isinstance(x, float):
        return np.nan
    else:
        return x.right

df = pd.DataFrame({"Rating": [.1 * i for i in range(10)]})
bins = [0, .25, .5, .75, 1, 1.25, 1.5, 1.75, 2]
df["RatingScore"] = pd.cut(df['Rating'], bins)
df["RatingScore"].apply(fun)
0 NaN
1 0.25
2 0.25
3 0.50
4 0.50
5 0.50
6 0.75
7 0.75
8 1.00
9 1.00
I have a pd.Series of floats and I would like to bin it into n bins, where the bin size for each bin is set so that max/min is a preset value (e.g. 1.20).
The requirement means that the size of the bins is not constant. For example:
data = pd.Series(np.arange(1, 11.0))
print(data)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
dtype: float64
I would like the bin sizes to be:
1.00 <= bin 1 < 1.20
1.20 <= bin 2 < 1.20 x 1.20 = 1.44
1.44 <= bin 3 < 1.44 x 1.20 = 1.73
...and so on.
Thanks
Here's one with pd.cut, where the bins can be computed by taking the np.cumprod of an array filled with 1.2:
import numpy as np
import pandas as pd

data = pd.Series(list(range(11)))

n = 20  # set accordingly
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]
# array([ 0. , 1.2 , 1.44 , 1.728 ...
pd.cut(data, bins)
0 NaN
1 (0.0, 1.2]
2 (1.728, 2.074]
3 (2.986, 3.583]
4 (3.583, 4.3]
5 (4.3, 5.16]
6 (5.16, 6.192]
7 (6.192, 7.43]
8 (7.43, 8.916]
9 (8.916, 10.699]
10 (8.916, 10.699]
dtype: category
The bins in this case go up to:
np.r_[0,np.cumprod(np.full(20, 1.2))]
array([ 0. , 1.2 , 1.44 , 1.728 , 2.0736 ,
2.48832 , 2.985984 , 3.5831808 , 4.29981696, 5.15978035,
6.19173642, 7.43008371, 8.91610045, 10.69932054, 12.83918465,
15.40702157, 18.48842589, 22.18611107, 26.62333328, 31.94799994,
38.33759992])
So you'll have to set n according to the range of values of the actual data.
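If you'd rather not hand-tune n, one option (a sketch assuming strictly positive data with edges starting at the fixed 1.2) is to derive the smallest n whose last edge reaches data.max(); the accepted answer below generalises this to an arbitrary starting edge:
# smallest n such that 1.2**n >= data.max(); assumes data.max() >= 1.2
n = int(np.ceil(np.log(data.max()) / np.log(1.2)))
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]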
This is, I believe, the best way to do it, because you are considering the max and min values from your array. Therefore you won't need to worry about what values you are using, only about the multiplier or step_size for your bins (of course you'd need to add a column name or some additional information if you will be working with a DataFrame):
import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))
bins = []
i = min(data)
while i < max(data):
    bins.append(i)
    i = i * 1.2
bins.append(i)
bins = list(set(bins))
bins.sort()
df = pd.cut(data, bins, include_lowest=True)
print(df)
Output:
0 (0.999, 1.2]
1 (1.728, 2.074]
2 (2.986, 3.583]
3 (3.583, 4.3]
4 (4.3, 5.16]
5 (5.16, 6.192]
6 (6.192, 7.43]
7 (7.43, 8.916]
8 (8.916, 10.699]
9 (8.916, 10.699]
Bins output:
Categories (13, interval[float64]): [(0.999, 1.2] < (1.2, 1.44] < (1.44, 1.728] < (1.728, 2.074] < ... <
(5.16, 6.192] < (6.192, 7.43] < (7.43, 8.916] <
(8.916, 10.699]]
Thanks everyone for all the suggestions. None does quite what I was after (probably because my original question wasn't clear enough), but they really helped me figure out what to do, so I have decided to post my own answer. (I hope this is what I am supposed to do, as I am relatively new at being an active member of Stack Overflow...)
I liked @yatu's vectorised suggestion best because it will scale better with large data sets, but I am after the means to not only automatically calculate the bins, but also to figure out the minimum number of bins needed to cover the data set.
This is my proposed algorithm:
The bin size is defined so that bin_max[i] / bin_min[i] is constant:
bin_max[i] / bin_min[i] = bin_ratio
Figure out the number of bins for the required bin size (bin_ratio):
data_ratio = data_max / data_min
n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
Set the lower boundary of the smallest bin so that the smallest data point fits in it:
bin_min[0] = data_min
Create n non-overlapping bins meeting the conditions:
bin_min[i+1] = bin_max[i]
bin_max[i+1] = bin_min[i+1] * bin_ratio
Stop creating further bins once the whole dataset can be split between the bins already created. In other words, stop once:
bin_max[last] > data_max
Here is a code snippet:
import math
import numpy as np
import pandas as pd

bin_ratio = 1.20
data = pd.Series(np.arange(2, 12))

data_ratio = max(data) / min(data)
n_bins = math.ceil(math.log(data_ratio) / math.log(bin_ratio))
n_bins = n_bins + 1                # bin ranges are defined as [min, max)

bin_min_0 = min(data)              # lower limit of the 1st bin (step 3 above)
bins = np.full(n_bins, bin_ratio)  # initialise the ratios for the bin limits
bins[0] = bin_min_0                # initialise the lower limit of the 1st bin
bins = np.cumprod(bins)            # generate the bin edges
print(bins)
[ 2. 2.4 2.88 3.456 4.1472 4.97664
5.971968 7.1663616 8.59963392 10.3195607 12.38347284]
I am now set to build a histogram of the data:
data.hist(bins=bins)