Place x,y coordinates into bins - python

I have a Pandas dataframe with two of the columns containing x,y coordinates that I plot as below:
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.scatter(df.x, df.y, s=1, marker = ".")
plt.xlim(-1.5, 1.5)
plt.ylim(0, 2)
plt.xticks(np.arange(-1.5, 1.6, 0.1))
plt.yticks(np.arange(0, 2.1, 0.1))
plt.grid(True)
plt.show()
I want to split the x and y axes every 0.1 units to give 600 bins (30x20). Then I want to know how many of my points are in each bin and the indices of these points so I can look them up in my dataframe. I basically want to create 600 new dataframes, one for each bin.
This is what I've tried so far:
df[(df.x >= -0.1) & (df.x < 0) & (df.y >= 0.7) & (df.y < 0.8)]
This will give me part of the dataframe contained within the square (-0.1 ≤ x < 0) & (0.7 ≤ y < 0.8). I want a way to create 600 of these.

I would use the cut function to create the bins and then group by them and count:
import numpy as np
import pandas as pd

# create fake data with bounds for x and y
df = pd.DataFrame({'x': np.random.rand(1000) * 3 - 1.5,
                   'y': np.random.rand(1000) * 2})
# bin the data into equally spaced groups
x_cut = pd.cut(df.x, np.linspace(-1.5, 1.5, 31), right=False)
y_cut = pd.cut(df.y, np.linspace(0, 2, 21), right=False)
# group and count
df.groupby([x_cut, y_cut]).count()
Output
x y
x y
[-1.5, -1.4) [0, 0.1) 3.0 3.0
[0.1, 0.2) 1.0 1.0
[0.2, 0.3) 3.0 3.0
[0.3, 0.4) NaN NaN
[0.4, 0.5) 1.0 1.0
[0.5, 0.6) 3.0 3.0
[0.6, 0.7) 1.0 1.0
[0.7, 0.8) 2.0 2.0
[0.8, 0.9) 2.0 2.0
[0.9, 1) 1.0 1.0
[1, 1.1) 2.0 2.0
[1.1, 1.2) 1.0 1.0
[1.2, 1.3) 2.0 2.0
[1.3, 1.4) 3.0 3.0
[1.4, 1.5) 2.0 2.0
[1.5, 1.6) 3.0 3.0
[1.6, 1.7) 3.0 3.0
[1.7, 1.8) 1.0 1.0
[1.8, 1.9) 1.0 1.0
[1.9, 2) 1.0 1.0
[-1.4, -1.3) [0, 0.1) NaN NaN
[0.1, 0.2) NaN NaN
[0.2, 0.3) 2.0 2.0
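The question also asks for the indices of the points in each bin; the same groupby exposes these directly (a sketch using the x_cut and y_cut from above):
# maps each (x_bin, y_bin) pair to the dataframe index labels of its points
groups = df.groupby([x_cut, y_cut]).groups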
And to completely answer your question, you can add the bin categories to the original dataframe as columns and then do your searching from there like this.
# add new columns
df['x_cut'] = x_cut
df['y_cut'] = y_cut
print(df.head(15))
x y x_cut y_cut
0 1.239743 1.348838 [1.2, 1.3) [1.3, 1.4)
1 -0.539468 0.349576 [-0.6, -0.5) [0.3, 0.4)
2 0.406346 1.922738 [0.4, 0.5) [1.9, 2)
3 -0.779597 0.104891 [-0.8, -0.7) [0.1, 0.2)
4 1.379920 0.317418 [1.3, 1.4) [0.3, 0.4)
5 0.075020 0.748397 [0, 0.1) [0.7, 0.8)
6 -1.227913 0.735301 [-1.3, -1.2) [0.7, 0.8)
7 -0.866753 0.386308 [-0.9, -0.8) [0.3, 0.4)
8 -1.004893 1.120654 [-1.1, -1) [1.1, 1.2)
9 0.007665 0.865248 [0, 0.1) [0.8, 0.9)
10 -1.072368 0.155731 [-1.1, -1) [0.1, 0.2)
11 0.819917 1.528905 [0.8, 0.9) [1.5, 1.6)
12 0.628310 1.022167 [0.6, 0.7) [1, 1.1)
13 1.002999 0.122493 [1, 1.1) [0.1, 0.2)
14 0.032624 0.426623 [0, 0.1) [0.4, 0.5)
And then, to get the combination you described above, df[(df.x >= -0.1) & (df.x < 0) & (df.y >= 0.7) & (df.y < 0.8)], you can set the index to x_cut and y_cut and do some hierarchical index selection.
df = df.set_index(['x_cut', 'y_cut'])
df.loc[[('[-0.1, 0)', '[0.7, 0.8)')]]
Output
x y
x_cut y_cut
[-0.1, 0) [0.7, 0.8) -0.043397 0.702029
[0.7, 0.8) -0.032508 0.799284
[0.7, 0.8) -0.036608 0.709394
[0.7, 0.8) -0.025254 0.741085
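Note that looking up intervals by their string form, as above, can depend on your pandas version; building the pd.Interval keys explicitly is a sketch of a more robust way to do the same selection:
df.loc[[(pd.Interval(-0.1, 0, closed='left'),
         pd.Interval(0.7, 0.8, closed='left'))]]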

One of many ways to do it:
# floor each coordinate down to its 0.1 grid line, giving an (x_bin, y_bin) label per row
bins = (df // .1 * .1).round(1).stack().groupby(level=0).apply(tuple)
# one sub-dataframe per non-empty bin, keyed by its (x_bin, y_bin) tuple
dict_of_df = {name: group for name, group in df.groupby(bins)}
You can get the dataframe of counts with
df.groupby(bins).size().unstack()
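A specific bin's points can then be pulled straight from the dictionary; a sketch, noting that a key only exists when its bin is non-empty:
subset = dict_of_df[(-0.1, 0.7)]  # points with -0.1 <= x < 0 and 0.7 <= y < 0.8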

You could transform your x and y values into their respective bin indices (0 - 29 and 0 - 19) and increment a matrix of zeros:
import numpy as np

shape = [30, 20]  # 30 x-bins, 20 y-bins
bins = np.zeros(shape, dtype=int)

xmin = df.x.min()
xmax = df.x.max()
xwidth = xmax - xmin
# scale x into [0, 30), then truncate to integer bin indices
# (clip so that x == xmax lands in the last bin instead of overflowing)
xind = (((df.x - xmin) / xwidth) * shape[0]).clip(0, shape[0] - 1).astype(int)

ymin = df.y.min()
ymax = df.y.max()
ywidth = ymax - ymin
yind = (((df.y - ymin) / ywidth) * shape[1]).clip(0, shape[1] - 1).astype(int)

for ind in zip(xind, yind):
    bins[ind] += 1
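For what it's worth, NumPy can also do this counting in a single call; a sketch using np.histogram2d with the question's fixed bounds instead of the data's min/max:
counts, xedges, yedges = np.histogram2d(df.x, df.y, bins=[30, 20],
                                        range=[[-1.5, 1.5], [0, 2]])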

Related

Pandas efficiently add new column true/false if between two other columns

Using Pandas, how can I efficiently add a new column that is true/false if the value in one column (x) is between the values in two other columns (low and high)?
The np.select approach from here works perfectly, but I "feel" like there should be a one-liner way to do this.
Using Python 3.7
import numpy as np
import pandas as pd

fid = [0, 1, 2, 3, 4]
x = [0.18, 0.07, 0.11, 0.3, 0.33]
low = [0.1, 0.1, 0.1, 0.1, 0.1]
high = [0.2, 0.2, 0.2, 0.2, 0.2]
test = pd.DataFrame(data=zip(fid, x, low, high), columns=["fid", "x", "low", "high"])
conditions = [(test["x"] >= test["low"]) & (test["x"] <= test["high"])]
labels = ["True"]
test["between"] = np.select(conditions, labels, default="False")
display(test)
As mentioned by @Brebdan, you can use this builtin:
test["between"] = test["x"].between(test["low"], test["high"])
output:
fid x low high between
0 0 0.18 0.1 0.2 True
1 1 0.07 0.1 0.2 False
2 2 0.11 0.1 0.2 True
3 3 0.30 0.1 0.2 False
4 4 0.33 0.1 0.2 False
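Note that between is inclusive on both ends by default; recent pandas versions let you control this via the inclusive parameter (assuming pandas >= 1.3, where it accepts 'both', 'neither', 'left', or 'right'):
# strict inequalities on both ends, i.e. low < x < high
test["between"] = test["x"].between(test["low"], test["high"], inclusive="neither")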

Bin the pandas data frame horizontally with bins=0.2 (fraction), how should I go about it?

I want to bin the data horizontally in a color-magnitude plane of stars. Here is what my data (red giant stars) looks like:
[image: RGB stars in my sample]
Now, I want to bin these stars in small bins (bins = 0.2 or 0.3) horizontally, i.e. parallel to the X-axis. The number I am using for bins, as you can see, is not an integer.
Here's what I tried so far:
f814w = RGB_stars['col42'] # These are the stars I want to bin
f814w_cut = pd.cut(f814w, bins=0.2) # using pd.cut with bins=0.2
This gives me an error:
"ValueError: bins should be a positive integer."
Another method I tried was df.sample from pandas, but I don't think it works correctly for my data: its output is in random order, so I could not verify whether the splitting was actually done in small bins (bins = 0.2 or 0.3).
What should I do to bin the entire column in number of bins < 1? Is there a workaround? Thanks in advance.
bins specifies the number of bins that you want to use, so it has to be an integer. Looking at your data, bins of size 0.2 would give about 15 bins. You can specify this in 2 ways:
I’m starting with random values that somewhat look like your f814w series:
>>> import numpy as np
>>> f814w
0 4.5
1 2.4
2 3.6
3 2.1
4 2.6
5 2.5
6 3.1
7 2.7
8 4.9
9 4.0
Name: col42, dtype: float64
Either compute the number of bins:
>>> bins = np.ceil((f814w.max() - f814w.min()) / .2)
>>> pd.cut(f814w, bins=int(bins))
0 (4.3, 4.5]
1 (2.3, 2.5]
2 (3.5, 3.7]
3 (2.097, 2.3]
4 (2.5, 2.7]
5 (2.3, 2.5]
6 (2.9, 3.1]
7 (2.5, 2.7]
8 (4.7, 4.9]
9 (3.9, 4.1]
Name: col42, dtype: category
Categories (14, interval[float64]): [(2.097, 2.3] < (2.3, 2.5] < (2.5, 2.7] < (2.7, 2.9] < ... <
(4.1, 4.3] < (4.3, 4.5] < (4.5, 4.7] < (4.7, 4.9]]
Or specify the bin edges (which I find easier):
>>> pd.cut(f814w, bins=np.arange(f814w.min(), f814w.max() + .2, .2))
0 (4.3, 4.5]
1 (2.3, 2.5]
2 (3.5, 3.7]
3 NaN
4 (2.5, 2.7]
5 (2.3, 2.5]
6 (2.9, 3.1]
7 (2.5, 2.7]
8 (4.7, 4.9]
9 (3.9, 4.1]
Name: col42, dtype: category
Categories (15, interval[float64]): [(2.1, 2.3] < (2.3, 2.5] < (2.5, 2.7] < (2.7, 2.9] < ... <
(4.3, 4.5] < (4.5, 4.7] < (4.7, 4.9] < (4.9, 5.1]]
This gives nearly the same result; the only difference is row 3, which becomes NaN because the minimum value sits exactly on the first bin edge and the intervals are open on the left. Note the + .2 on the high bound of np.arange(), as it excludes the high bound when generating numbers, just like range(). If you can replace f814w.min() and f814w.max() with more theoretical, or at least estimated, bounds, then the second option (specifying bin edges) works better:
>>> pd.cut(f814w, bins=np.arange(2, 5.2, .2))
0 (4.4, 4.6]
1 (2.2, 2.4]
2 (3.4, 3.6]
3 (2.0, 2.2]
4 (2.4, 2.6]
5 (2.4, 2.6]
6 (3.0, 3.2]
7 (2.6, 2.8]
8 (4.8, 5.0]
9 (3.8, 4.0]
Name: col42, dtype: category
Categories (15, interval[float64]): [(2.0, 2.2] < (2.2, 2.4] < (2.4, 2.6] < (2.6, 2.8] < ... <
(4.2, 4.4] < (4.4, 4.6] < (4.6, 4.8] < (4.8, 5.0]]
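Once the column is cut, grouping by the result gives you the stars in each horizontal bin. A sketch, assuming the RGB_stars frame and col42 column from the question:
f814w_cut = pd.cut(f814w, bins=np.arange(2, 5.2, .2))
for interval, stars in RGB_stars.groupby(f814w_cut):
    print(interval, len(stars))  # each `stars` is the sub-frame of stars in that bin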

Optimization of equation parameter values such that largest distance between groups is created

For a particular gene scoring system I would like to set up a rudimentary plot such that new sample values that are entered immediately gravitate, based on multiple gene measurements, towards either a healthy or unhealthy group within the plot. Let's presume we have 5 people, each having 6 genes measured.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([['A', 1, 1.2, 1.4, 2, 2], ['B', 1.5, 1, 1.4, 1.3, 1.2], ['C', 1, 1.2, 1.6, 2, 1.4], ['D', 1.7, 1.5, 1.5, 1.5, 1.4], ['E', 1.6, 1.9, 1.8, 3, 2.5], ['F', 2, 2.2, 1.9, 2, 2]]), columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])
This creates the following table:
Gene  Healthy 1  Healthy 2  Healthy 3  Unhealthy 1  Unhealthy 2
A     1.0        1.2        1.4        2.0          2.0
B     1.5        1.0        1.4        1.3          1.2
C     1.0        1.2        1.6        2.0          1.4
D     1.7        1.5        1.5        1.5          1.4
E     1.6        1.9        1.8        3.0          2.5
F     2.0        2.2        1.9        2.0          2.0
The X and Y coordinates of each sample are then calculated by adding the contributions of the genes together after multiplying each gene's parameter/weight by its measured value. The first 4 genes contribute towards the Y value, whilst genes 5 and 6 determine the X value. wA - wF are the parameters/weights associated with their gene A-F counterparts.
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
n = 0
for n in range(5):
    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]
    TrueY = wA*y1+wB*y2+wC*y3+wD*y4
    x1 = df.iat[4,n]
    x2 = df.iat[5,n]
    TrueX = (wE*x1+wF*x2)
    result = (TrueX, TrueY)
    n += 1
    label = f"({TrueX},{TrueY})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')
We thus calculate all the coordinates and plot them:
[plot of the resulting sample coordinates]
What I would now like to do is find out how I can optimize the wA - wF parameters/weights such that the healthy samples are pushed towards the origin of the plot, let's say (0,0), whilst the unhealthy samples are pushed towards a reasonably opposite point, let's say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed and would appreciate any help available.
Here's an example using scipy.optimize combined with your code. (Since your code contains some syntax and type errors, I had to make small corrections.)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                            [1.5, 1, 1.4, 1.3, 1.2],
                            [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4],
                            [1.6, 1.9, 1.8, 3, 2.5],
                            [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=[['A', 'B', 'C', 'D', 'E', 'F']])

wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

from scipy.optimize import minimize

# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])

# the objective function to be minimized
# - it computes the (square of) the samples' distances to (0,0) resp. (1,1)
def fun(w):
    weighted = df.values * w[:, None]  # multiply all sample values by their weight
    y = sum(weighted[:4])              # compute all 5 "TrueY" coordinates
    x = sum(weighted[4:])              # compute all 5 "TrueX" coordinates
    y[3:] -= 1                         # adjust the "Unhealthy" y to the target (x,1)
    x[3:] -= 1                         # adjust the "Unhealthy" x to the target (1,y)
    return sum(x**2 + y**2)            # return the sum of (squared) distances

res = minimize(fun, w0)
print(res)

# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x

# this is mostly your unchanged code
for n in range(5):
    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]
    TrueY = wA*y1+wB*y2+wC*y3+wD*y4
    x1 = df.iat[4,n]
    x2 = df.iat[5,n]
    TrueX = (wE*x1+wF*x2)
    result = (TrueX, TrueY)
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')
plt.savefig("mygraph.png")
This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1):
[plot of the optimized sample coordinates]
You may want to experiment with other optimization methods - see scipy.optimize.minimize.
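For instance, switching to a derivative-free method is a one-line change (a sketch; whether it helps depends on your data):
res = minimize(fun, w0, method='Nelder-Mead')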

Can pd.cut use interval range and labels together?

I'm fiddling around with something like this.
bins = [0, .25, .5, .75, 1, 1.25, 1.5, 1.75, 2]
labels = ['0', '.25', '.5', '.75', '1', '1.25', '1.5', '1.75', '2']
dataset['RatingScore'] = pd.cut(dataset['Rating'], bins, labels)
What I am actually getting is a range, like this: (0.75, 1.0]
I would like to get results like this: .75 or 1 or 1.25
Is it possible to get a specific number and NOT a range? Thanks.
Andy, your code runs, and it gives me actual numbers, rather than ranges, but I'm seeing a lot of gaps too.
You are passing labels as the third positional argument of pd.cut, but the third parameter is right=..., which accepts True/False. A non-empty list is truthy, so it is treated as right=True and pd.cut executes as if no labels were given. You need to use the keyword argument labels= to correctly pass your list of labels to pd.cut.
Another thing: the number of bin edges must be one more than the number of labels, so you need to append np.inf to the bins list.
s = pd.Series([0.2, 0.6, 0.1, 0.9, 2])
bins = [0, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, np.inf]
labels = ['0', '.25', '.5', '.75', '1', '1.25', '1.5', '1.75', '2']
s_cat = pd.cut(s, bins=bins, labels=labels)
Out[1165]:
0 0
1 .5
2 0
3 .75
4 1.75
dtype: category
Categories (9, object): [0 < .25 < .5 < .75 ... 1.25 < 1.5 < 1.75 < 2]
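Applied to the original code, the corrected call would look like this (a sketch, assuming the dataset frame from the question):
bins = [0, .25, .5, .75, 1, 1.25, 1.5, 1.75, 2, np.inf]
labels = ['0', '.25', '.5', '.75', '1', '1.25', '1.5', '1.75', '2']
dataset['RatingScore'] = pd.cut(dataset['Rating'], bins=bins, labels=labels)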
If you don't add infinity to the bins, the possible outputs are floats (np.nan) or intervals. Let's say you want to take the right edge of each interval; you could try the following:
import pandas as pd
import numpy as np

def fun(x):
    if isinstance(x, float):  # NaN (a float) marks values outside the bins
        return np.nan
    else:
        return x.right
df = pd.DataFrame({"Rating":[.1* i for i in range(10)]})
bins = [0, .25, .5, .75, 1, 1.25, 1.5, 1.75, 2]
df["RatingScore"] = pd.cut(df['Rating'], bins)
df["RatingScore"].apply(fun)
0 NaN
1 0.25
2 0.25
3 0.50
4 0.50
5 0.50
6 0.75
7 0.75
8 1.00
9 1.00

Differences between dataframe spearman correlation using pandas and scipy

I have a fairly big matrix (4780, 5460) and computed the Spearman correlation between rows using both pandas.DataFrame.corr and scipy.stats.spearmanr. Each function returns very different correlation coefficients, and now I am not sure which is "correct", or if my dataset is more suitable to a different implementation.
Some context: the vectors (rows) I want to test for correlation do not necessarily share all the same points; there are NaNs in some columns and not in others.
df.T.corr(method='spearman')
(r, p) = spearmanr(df.T)
df2 = pd.DataFrame(index=df.index, columns=df.columns, data=r)
In[47]: df['320840_93602.563']
Out[47]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.565812
13752_42938.1206 0.877192
319002_93602.870 0.225530
328_642.148.peg.330 0.658269
...
12566_42938.19 0.818395
321125_93602.2882 0.535577
319185_93602.1135 0.678397
29724_39.3584 0.770453
321030_93602.1962 0.738722
Name: 320840_93602.563, dtype: float64
In[32]: df2['320840_93602.563']
Out[32]:
320840_93602.563 1.000000
3254_642.148.peg.3256 0.444675
13752_42938.1206 0.286933
319002_93602.870 0.225530
328_642.148.peg.330 0.606619
...
12566_42938.19 0.212265
321125_93602.2882 0.587409
319185_93602.1135 0.696172
29724_39.3584 0.097753
321030_93602.1962 0.163417
Name: 320840_93602.563, dtype: float64
scipy.stats.spearmanr is not designed to handle nan, and its behavior with nan values is undefined. [Update: scipy.stats.spearmanr now has the argument nan_policy.]
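A sketch of that argument; note that pandas' corr drops NaNs pairwise for each pair of rows, which may still differ from scipy's omission strategy:
from scipy.stats import spearmanr
# 'omit' tells scipy to ignore nan values instead of propagating them
rho, p = spearmanr(df.T, nan_policy='omit')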
For data without nans, the functions appear to agree:
In [92]: np.random.seed(123)
In [93]: df = pd.DataFrame(np.random.randn(5, 5))
In [94]: df.T.corr(method='spearman')
Out[94]:
0 1 2 3 4
0 1.0 -0.8 0.8 0.7 0.1
1 -0.8 1.0 -0.7 -0.7 -0.1
2 0.8 -0.7 1.0 0.8 -0.1
3 0.7 -0.7 0.8 1.0 0.5
4 0.1 -0.1 -0.1 0.5 1.0
In [95]: rho, p = spearmanr(df.values.T)
In [96]: rho
Out[96]:
array([[ 1. , -0.8, 0.8, 0.7, 0.1],
[-0.8, 1. , -0.7, -0.7, -0.1],
[ 0.8, -0.7, 1. , 0.8, -0.1],
[ 0.7, -0.7, 0.8, 1. , 0.5],
[ 0.1, -0.1, -0.1, 0.5, 1. ]])
