Cluster groups by direction and magnitude - Python

I'm hoping to cluster vectors based on direction and magnitude using Python. I've found limited examples using R, but none for Python. Not to be confused with standard k-means on scatter points: I'm actually trying to cluster whole vectors.
The following takes two sets of xy points to generate vectors. I'm then hoping to cluster these vectors based on length and direction.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'
fig,ax = plt.subplots()
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)
A = df['A']
B = df['B']
C = df['C']
D = df['D']
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
X_1 = np.array(df[['A','B','C','D']])
model = KMeans(n_clusters = 20)
model.fit(X_1)
cluster_labels = model.predict(X_1)
df['n_cluster'] = cluster_labels
centroids_1 = pd.DataFrame(data = model.cluster_centers_, columns = ['start_x', 'start_y', 'end_x', 'end_y'])
cc = model.cluster_centers_
a = cc[:, 0]
b = cc[:, 1]
c = cc[:, 2]
d = cc[:, 3]
lc1 = ax.quiver(a, b, (c-a), (d-b), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.8)
The following figure displays an example

What about this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import hdbscan
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'
A = df['A'] # X start
B = df['B'] # Y start
C = df['C'] # X end
D = df['D'] # Y end
clusterer = hdbscan.HDBSCAN()
df['LENGTH'] = np.sqrt(np.square(df.C-df.A) + np.square(df.D-df.B))
df['DIRECTION'] = np.degrees(np.arctan2(df.D-df.B, df.C-df.A))
coords = df[['LENGTH', 'DIRECTION']].values
clusterer.fit_predict(coords)
cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
    [(coords[cluster_labels==n], len(coords[cluster_labels==n])) for n in range(num_clusters)],
    columns=["points", "weight"]
)
colors = {0:"green", 1:"blue", 2:"red", 3:"yellow", 4:"pink"}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
    df_this_cluster = pd.DataFrame(cluster, columns=['LENGTH', 'DIRECTION'])
    df_this_cluster['TEMP'] = x
    df = df.merge(df_this_cluster, on=['LENGTH', 'DIRECTION'], how='left')
    ix = df[df.TEMP.notnull()].index
    df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
    df.drop("TEMP", axis=1, inplace=True)
df['COLOR'] = df['CLUSTER'].map(colors).fillna('black')
fig,ax = plt.subplots()
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)
ax.quiver(df.A, df.B, (df.C-df.A), (df.D-df.B), angles='xy', scale_units='xy', scale=1, alpha=0.5, color=df.COLOR)
This clusters on length and direction (with direction converted to degrees; on my first try, the small numeric range of radians didn't play well with the model).
I don't think this is a very "Cartesian" solution, as the two values being fed to the model are not in the same units... But the visual results are not so bad...
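For what it's worth, here is a sketch of one way to bring the two features onto comparable scales before clustering (StandardScaler is my choice here, not part of the attempt above):
from sklearn.preprocessing import StandardScaler
# standardize LENGTH and DIRECTION so neither dominates the distance metric
coords = StandardScaler().fit_transform(df[['LENGTH', 'DIRECTION']].values)
clusterer.fit_predict(coords)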
I did try another run based on the 4 raw coordinates, which is more rigorous. But, quite expectedly, it clusters the vectors by subareas of the space (when there are any):
coords = df[['A', 'B', 'C', 'D']].values
clusterer.fit_predict(coords)
cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
    [(coords[cluster_labels==n], len(coords[cluster_labels==n])) for n in range(num_clusters)],
    columns=["points", "weight"]
)
colors = {0:"green", 1:"blue", 2:"red", 3:"yellow", 4:"pink"}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
    df_this_cluster = pd.DataFrame(cluster, columns=['A', 'B', 'C', 'D'])
    df_this_cluster['TEMP'] = x
    df = df.merge(df_this_cluster, on=['A', 'B', 'C', 'D'], how='left')
    ix = df[df.TEMP.notnull()].index
    df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
    df.drop("TEMP", axis=1, inplace=True)
df['COLOR'] = df['CLUSTER'].map(colors).fillna('black')
EDIT
I gave it another try, based on the (very good) suggestion that angles are not a good variable, given the discontinuity around 0/2π; so I chose to use both sines and cosines instead. I also scaled the length (to have matching scales for the 3 variables).
So the result would be:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import robust_scale
import hdbscan
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'
A = df['A'] # X start
B = df['B'] # Y start
C = df['C'] # X end
D = df['D'] # Y end
clusterer = hdbscan.HDBSCAN()
df['LENGTH'] = robust_scale(np.sqrt(np.square(df.C-df.A) + np.square(df.D-df.B)))
df['DIRECTION'] = np.arctan2(df.D-df.B, df.C-df.A)
df['COS'] = np.cos(df['DIRECTION'])
df['SIN'] = np.sin(df['DIRECTION'])
columns = ['LENGTH', 'COS', 'SIN']
clusterer = hdbscan.HDBSCAN()
values = df[columns].values
clusterer.fit_predict(values)
cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
    [(values[cluster_labels==n], len(values[cluster_labels==n])) for n in range(num_clusters)],
    columns=["points", "weight"]
)
def get_cmap(n, name='hsv'):
    '''
    Returns a function that maps each index in 0, 1, ..., n-1 to a distinct
    RGB color; the keyword argument name must be a standard mpl colormap name.
    Credits to @Ali
    https://stackoverflow.com/questions/14720331/how-to-generate-random-colors-in-matplotlib#answer-25628397
    '''
    return plt.cm.get_cmap(name, n)
cmap = get_cmap(num_clusters+1)
colors = {x:cmap(x) for x in range(num_clusters)}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
    df_this_cluster = pd.DataFrame(cluster, columns=columns)
    df_this_cluster['TEMP'] = x
    df = df.merge(df_this_cluster, on=columns, how='left')
    df.reset_index(drop=True, inplace=True)
    ix = df[df.TEMP.notnull()].index
    df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
    df.drop("TEMP", axis=1, inplace=True)
df['CLUSTER'] = df['CLUSTER'].fillna(num_clusters-1)
df['COLOR'] = df['CLUSTER'].map(colors)
print("Number of clusters : ", num_clusters-1)
nrows = num_clusters//2 if num_clusters%2==0 else num_clusters//2 + 1
fig,axes = plt.subplots(nrows=nrows, ncols=2)
axes = [y for row in axes for y in row]
for k, ax in enumerate(axes):
    ax.set_xlim(-5, 25)
    ax.set_ylim(-5, 25)
    ax.set_aspect('equal', adjustable='box')
    if k+1 < num_clusters:
        ax.set_title(f"CLUSTER #{k+1}", fontsize=10)
    this_df = df[df.CLUSTER==k]
    ax.quiver(
        this_df.A,              # X
        this_df.B,              # Y
        (this_df.C-this_df.A),  # X component of vector
        (this_df.D-this_df.B),  # Y component of vector
        angles='xy',
        scale_units='xy',
        scale=1,
        color=this_df.COLOR
    )
The results are much better (though they depend a lot on the input dataset); the last subplot shows the vectors that were not assigned to any cluster:
Edit #2
If by "direction" you mean angle in the [0..pi[ interval (ie undirected vectors), you will want to include the following code before computing the cosinuses/sinuses :
ix = df[df.DIRECTION<0].index
df.loc[ix, "DIRECTION"] += np.pi

Maybe you can also cluster the angles (besides the vector norms) via the projections of a normalized vector onto the two unit vectors (1,0) and (0,1), using the function below. Handling the projections directly (which essentially encode the angle), we won't run into trouble caused by the periodicity of the cosine function.
def get_norm_and_angle(e1):
    e1_norm = np.linalg.norm(e1, axis=1)
    e1 = e1 / e1_norm[:, None]
    e2 = np.array([1, 0])
    e3 = np.array([0, 1])
    return np.stack((e1_norm, e1 @ e2, e1 @ e3), axis=1)
Based on this function, here is one possible solution where there is no constraint on how many clusters we want to find. In the script below, five features are used for clustering:
the vector norm (1 feature);
the vector projections on the x and y axes (2 features);
the vector starting points (2 features).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans
def get_norm_and_angle(e1):
    e1_norm = np.linalg.norm(e1, axis=1)
    e1 = e1 / e1_norm[:, None]
    e2 = np.array([1, 0])
    e3 = np.array([0, 1])
    return np.stack((e1_norm, e1 @ e2, e1 @ e3), axis=1)
data = np.cumsum(np.random.randint(0,10,size=(50, 4)),axis=0)
df = pd.DataFrame(data, columns=list('ABCD'))
A = df['A'];B = df['B']
C = df['C'];D = df['D']
starting_points = np.stack((A,B),axis=1)
vectors = np.stack((C,D),axis=1) - np.stack((A,B),axis=1)
different_view = get_norm_and_angle(vectors)
different_view = np.hstack((different_view,starting_points))
num_clusters = 8
model = KMeans(n_clusters=num_clusters)
model.fit(different_view)
cluster_labels = model.predict(different_view)
df['n_cluster'] = cluster_labels
cluster_centers = model.cluster_centers_
cluster_offsets = cluster_centers[:,0][:,None] * cluster_centers[:,1:3]
cluster_starts = np.vstack([np.mean(starting_points[cluster_labels==ind],axis=0) for ind in range(num_clusters)])
main_streams = np.hstack((cluster_starts,cluster_starts+cluster_offsets))
a,b,c,d = main_streams.T
fig,ax = plt.subplots(figsize=(8,8))
ax.set_xlim(-np.max(data)*.1,np.max(data)*1.1)
ax.set_ylim(-np.max(data)*.1,np.max(data)*1.1)
colors = sns.color_palette(n_colors=num_clusters)
lc1 = ax.quiver(a, b, (c-a), (d-b), angles = 'xy', scale_units = 'xy', color = colors, scale = 1, alpha = 0.8, zorder=100)
lc2 = ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = .6, alpha = 0.2)
start_colors = [colors[ind] for ind in cluster_labels]
ax.scatter(starting_points[:,0],starting_points[:,1],c=start_colors)
plt.show()
A sample output is shown below. As you can see in the figure, vectors with close starting points are clustered into the same group.
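If the grouping by position is stronger than desired, one possible tweak (my assumption, not part of the answer above) is to down-weight the two starting-point columns before fitting, so that norm and direction dominate the distances:
# Hypothetical down-weighting of the positional features; the 0.25 factor is arbitrary.
position_weight = 0.25
different_view = np.hstack((get_norm_and_angle(vectors), position_weight * starting_points))
model = KMeans(n_clusters=num_clusters).fit(different_view)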

Related

Struggling to graph a Beta Distribution using Python

Given some measures, I am trying to create a beta distribution. Given a max, min, mean, and also an alpha and beta, how do I call beta.ppf or beta.pdf to generate a proper data set?
Working Sample
https://www.kaggle.com/iancoetzer/betaworking
Broken Sample
https://www.kaggle.com/iancoetzer/betaproblem
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
#
# Set the shape parameters
#
a = 2.8754
b = 3.0300
minv = 82.292
maxv = 129.871
mean = 105.46
#
# Generate the value between
#
x = np.linspace(beta.ppf(minv, a, b), beta.ppf(maxv, a, b), 100)  # NB: beta.ppf expects quantiles in [0, 1], so passing raw values like 82.292 is the broken part
#
# Plot the beta distribution
#
plt.figure(figsize=(7,7))
plt.xlim(0.7, 1)
plt.plot(x, beta.pdf(x, a, b), 'r-')
plt.title('Beta Distribution', fontsize='15')
plt.xlabel('Values of Random Variable X (0, 1)', fontsize='15')
plt.ylabel('Probability', fontsize='15')
plt.show()
We managed to code a simple solution to compute and plot the beta distribution, shown as the red beta curve in the figure.
Now we are trying to plot a Weibull distribution as well...
#import libraries
import pandas as pd, numpy as np, gc, time, os, uuid, math, datetime
from joblib import Parallel, delayed
from numpy.random import default_rng
from scipy.stats import beta
from scipy import special
from scipy.stats import exponweib
import matplotlib.pyplot as plt
#sample parameters
low, high, mean, a, b, trials = 82.292, 129.871, 105.46, 2.8754, 3.0300, 10000
scale = (high-low)/6
#normal
normal_arr = np.random.normal(loc=mean, scale=scale, size=trials)
#triangular
triangular_arr = np.random.triangular(left=low, mode=mean, right=high, size=trials)
#log normal
mu = math.log(math.pow(mean,2) / math.sqrt(math.pow(scale,2) + math.pow(mean,2)))
sigma = math.sqrt(math.log(math.pow(scale,2)/(math.pow(mean,2)) + 1))
lognorm_arr = np.random.lognormal(mean=mu, sigma=sigma, size=trials)
#beta
beta_x = np.linspace(beta.ppf(0.0, a, b),beta.ppf(1, a, b), trials)
#by = beta.pdf(bx, a, b)
beta_arr = beta.ppf(beta_x, a, b, loc=low, scale=high - low)
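# weibull -- the original snippet references wei_arr below but never defines it;
# this is a minimal sketch assuming a two-parameter Weibull via exponweib (already
# imported). The shape k and the scale are hypothetical placeholders, not fitted values.
k = 2.0
wei_arr = exponweib.rvs(a=1, c=k, loc=low, scale=(high - low) / 2, size=trials)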
# define binning(arr) method:
def binning(arr):
    df = pd.DataFrame(arr)
    df["Trial"] = range(1, len(df) + 1)
    df[0] = df[0].astype(float)
    df.rename(columns={0: "Result"}, inplace=True)
    minval = df["Result"].min()
    maxval = df["Result"].max()
    binCount = 100
    bins = np.linspace(minval, maxval, binCount + 1)
    labels = np.arange(1, binCount + 1)
    df["bins"] = pd.cut(df["Result"], bins=bins, labels=labels, include_lowest=True)
    dfBin = df.groupby(["bins"])["Result"].mean()
    dfCount = df.groupby(["bins"])["Result"].count()
    dfBin.replace(np.nan, 0.0, inplace=True)
    dfCount.replace(np.nan, 0, inplace=True)
    dfCount = pd.DataFrame(dfCount)
    dfBin = pd.DataFrame(dfBin)
    dfBin["bin"] = range(1, len(dfBin) + 1)
    dfBin["Result"] = dfBin["Result"].astype(float)
    df = pd.merge(dfBin, dfCount, left_index=True, right_index=True)
    # Rename the resulting columns
    df.rename(columns={'Result_x': 'Mean'}, inplace=True)
    df.rename(columns={'Result_y': 'Trials'}, inplace=True)
    return df
dfNormal = binning(normal_arr)
dfLog = binning(lognorm_arr)
dfTriangular = binning(triangular_arr)
dfBeta = binning(beta_arr)
dfWeibull = binning(wei_arr)
dfNormal.drop(dfNormal[dfNormal["Mean"] == 0].index, inplace=True)
dfLog.drop(dfLog[dfLog["Mean"] == 0].index, inplace=True)
dfTriangular.drop(dfTriangular[dfTriangular["Mean"] == 0].index, inplace=True)
dfBeta.drop(dfBeta[dfBeta["Mean"] == 0].index, inplace=True)
dfWeibull.drop(dfWeibull[dfWeibull["Mean"] == 0].index, inplace=True)
plt.plot(dfNormal["Mean"], dfNormal["Trials"], label="Normal")
plt.plot(dfLog["Mean"], dfLog["Trials"], label="Lognormal")
plt.plot(dfTriangular["Mean"], dfTriangular["Trials"], label="Triangular")
plt.plot(dfBeta["Mean"], dfBeta["Trials"], label="Beta")
plt.plot(dfWeibull["Mean"], dfWeibull["Trials"], label="Weibull")
plt.legend(loc='upper right')
plt.xlabel("R amount")
plt.ylabel("# Trials")
#plt.xlim(low, high)
plt.show()

Having some problem to understand the x_bin in regplot of Seaborn

I used seaborn.regplot to plot data, but I don't quite understand how the error bars in regplot are calculated. I have compared the results with the mean and standard deviation derived from manual calculation. Here is my testing script.
import numpy as np
import pandas as pd
import seaborn as sns  # imported as sns, since the script below calls sns.regplot
def get_data_XYE(p):
    x_list = []
    lower_list = []
    upper_list = []
    for line in p.lines:
        x_list.append(line.get_xdata()[0])
        lower_list.append(line.get_ydata()[0])
        upper_list.append(line.get_ydata()[1])
    y = 0.5 * (np.asarray(lower_list) + np.asarray(upper_list))
    y_error = np.asarray(upper_list) - y
    x = np.asarray(x_list)
    return x, y, y_error
x = [37.3448,36.6026,42.7795,34.7072,75.4027,226.2615,192.7984,140.8045,242.9952,458.451,640.6542,726.1024,231.7347,107.5605,200.2254,190.0006,314.1349,146.8131,152.4497,175.9096,284.9926,116.9681,118.2953,312.3787,815.8389,458.0146,409.5797,595.5373,188.9955,15.7716,36.1839,244.8689,57.4579,94.8717,112.2237,87.0687,72.79,22.3457,24.1728,29.505,80.8765,252.7454,280.6002,252.9573,348.246,112.705,98.7545,317.0541,300.9573,402.8411,406.6884,56.1286,30.1385,32.9909,497.556,19.3606,20.8409,95.2324,108.6074,15.7753,54.5511,45.5623,64.564,101.1934,81.8459,88.286,58.2642,56.1225,51.2943,38.0649,63.5882,63.6847,120.495,102.4097,49.3255,111.3309,171.6028,58.9526,28.7698,144.6884,180.0661,116.6028,146.2594,199.8702,128.9378,423.2363,119.8537,124.6508,518.8625,306.3023,79.5213,121.0309,116.9346,170.8863,930.361,48.9983,55.039,47.1092,72.0548,75.4045,103.521,83.4134,142.3253,146.6215,121.4467,101.4252,68.4812,291.4275,143.9475,142.647,78.9826,47.094,204.2196,89.0208,82.792,27.1346,142.4764,83.7874,67.3216,112.9531,138.2549,133.3446,86.2659,45.3464,56.1604,43.5882,54.3623,86.296,115.7272,96.5498,111.8081,36.1756,40.2947,34.2532,89.1452,53.9062,36.458,113.9297,176.9962,77.3125,77.8891,64.807,64.1515,127.7242,119.6876,976.2324,322.8454,434.2883,168.6923,250.0284,234.7329,131.0793,152.335,118.8838,243.1772,24.1776,168.6327,170.7541,167.8444,75.9315,110.1045,113.4417,60.5464,66.8956,79.7606,71.6659,72.5251,77.513,207.8019,21.8592,35.2787,169.7698,146.5012,412.9934,248.0708,318.5489,104.1278,184.7592,108.0581,175.2646,169.7698,340.3732,570.3396,23.9853,69.0405,66.7391,67.9435,294.6085,68.0537,77.6344,433.2713,104.3178,229.4615,187.8587,78.1399,121.4737,122.5451,384.5935,38.5232,117.6835,50.3308,318.2513,103.6695,20.7181,321.9601,510.3248,13.4754,16.1188,44.8082,37.7291,733.4587,446.6241,21.1822,287.9603,327.2367,274.1109,195.4713,158.2114,64.4537,26.9857,172.8503]
y = [37,40,30,29,24,23,27,12,21,20,29,28,27,32,23,29,28,22,28,23,24,29,32,18,22,12,12,14,29,31,34,31,22,40,25,36,27,27,29,35,33,25,25,27,27,19,35,26,18,24,25,37,52,47,34,39,40,48,41,44,35,36,53,46,38,44,23,26,26,28,27,21,25,21,20,27,35,24,46,34,22,30,30,30,31,26,25,28,21,31,24,27,33,21,31,33,29,33,32,21,25,22,39,31,34,26,23,18,20,18,34,25,20,12,23,25,21,21,25,31,17,27,28,29,25,24,25,21,24,27,23,22,23,22,22,26,22,19,26,35,33,35,29,26,26,30,22,32,33,33,28,32,26,29,36,37,37,28,24,30,25,20,29,24,33,35,30,32,31,33,40,35,37,24,34,29,27,24,36,26,26,26,27,27,20,17,28,34,18,20,20,18,19,23,20,22,25,32,44,41,39,41,40,44,36,42,31,32,26,29,23,29,29,28,31,22,29,24,28,28,25]
xbreaks = [13.4754, 27.1346, 43.5882, 58.9526, 72.79, 89.1452, 110.1045, 131.0793, 158.2114, 180.0661, 207.8019, 234.7329, 252.9573, 300.9573, 327.2367, 348.246, 412.9934, 434.2883, 458.451, 518.8625, 595.5373, 640.6542, 733.4587, 815.8389, 930.361, 976.2324]
df = pd.DataFrame([x,y]).T
df.columns = ['x','y']
# Check the bin average and std using agg
bins = pd.cut(df.x,xbreaks,right=False)
t = df[['x','y']].groupby(bins).agg({"x": "mean", "y": ["mean","std"]})
t.reset_index(inplace=True)
t.columns = ['range_cut','x_avg_cut','y_avg_cut','y_std_cut']
t.index.name ='id'
# Get the bin average from regplot
g = sns.regplot(x='x', y='y', data=df, fit_reg=False, x_bins=xbreaks, seed=0)  # seed fixed to 0; the original referenced an undefined variable
xye = pd.DataFrame(get_data_XYE(g)).T
xye.columns = ['x_regplot','y_regplot','e_regplot']
xye.index.name = 'id'
t2 = xye.merge(t,on='id',how='left')
t2
You can see that the y and e values from the two approaches are different. I understand that the default x_ci or x_estimator may affect the result of regplot, but I still cannot reproduce these values in Excel by removing some of the lowest and/or highest values in each bin.
In seaborn.regplot, the x_bins values are the centers of the bins, and each original x value is assigned to the nearest bin center, whereas in pandas.cut the breaks define the bin edges.
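A minimal sketch of the difference (the numbers here are made up):
import numpy as np
import pandas as pd
x = np.array([1.0, 2.4, 3.6, 9.0])
centers = np.array([2.0, 4.0, 8.0])  # regplot-style x_bins: treated as bin centers
nearest = centers[np.abs(x[:, None] - centers).argmin(axis=1)]
# each x is mapped to the closest center -> [2., 2., 4., 8.]
edges = [0, 2, 4, 8, 10]  # pd.cut-style breaks: treated as bin edges
binned = pd.cut(x, edges)
# each x falls in the interval that encloses it -> (0, 2], (2, 4], (2, 4], (8, 10]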

plotly -- overlay significance stars / annotation over boxplots

I'm trying to generate a faceted boxplot for linear model results, with treatments on the x axis. A conventional way to show significance is to append asterisks.
I'm finding this surprisingly difficult to do in plotly.
Example code:
import numpy as np
import pandas as pd
# Data
n = 10
conditiona = ['left', 'right']
conditionb = ['top', 'middle', 'bottom']
N = n * len(conditiona) * len(conditionb)
trt = np.repeat(['a', 'b','c'], N)
eff = np.repeat([3, 2, 1], N)
noise = np.random.normal(size = 3* N, loc = 0, scale = 1)
pval = np.repeat(['**', '', ''], N)
col = np.tile( np.repeat( conditiona, n * len(conditionb)), 3)
row = np.tile( np.repeat( conditionb, n) , len(conditiona) *3)
df = pd.DataFrame({'y': noise + eff, 'trt': trt, 'p': pval, 'column': col, 'row': row})
## Plot
import plotly.graph_objects as go
from plotly.subplots import make_subplots
rows = df.row.unique().tolist()
cols = df.column.unique().tolist()
groups = df.trt.unique().tolist()
labs = [i + ' ' + j for j in rows for i in cols]
colors = ['red', 'green', 'blue']
fig = make_subplots(rows=len(rows), cols=len(cols),
                    shared_xaxes=True, subplot_titles=labs)
for group, dx in df.groupby(['row','column','trt']):
    r = rows.index(group[0]) + 1  # 1-based numbering
    c = cols.index(group[1]) + 1
    name = str(group[2])
    id = groups.index(group[2])
    tr = go.Box(y=dx['y'], boxpoints='all', name=name, marker_color=colors[id], text=dx['p'])
    # tr2 = go.Scatter(x = 'x0', <- how do I get relative x coordinates of tr to put in here?
    #                  y = dx['y'].median(), text = dx['p'].unique())
    fig.add_trace(tr, row=r, col=c)
fig.show()
[Desired] Output:
Is there an easy way to 'extract' the x coordinates of a box trace so I can overlay a marker?
Seems like this shouldn't be hard.
Figured it out eventually. You just have to know how plotly sets things up beforehand, apparently.
You can use annotations with xref and yref referencing the subplots. The pattern of assignment is confusing (to me) and poorly documented.
The y refs increase sequentially from the bottom left, reading left to right. Thus in this figure the bottom-left panel is 'y1', the bottom-right is 'y2', the middle-left is 'y3', the middle-right is 'y4', and so on.
ncol = len(cols)
fig = make_subplots(rows=len(rows), cols=len(cols),
                    shared_xaxes=True, subplot_titles=labs)
for group, dx in df.groupby(['row','column','trt']):
    r = rows.index(group[0]) + 1  # 1-based numbering
    c = cols.index(group[1]) + 1
    name = str(group[2])
    id = groups.index(group[2])
    tr = go.Box(y=dx['y'], boxpoints='all', name=name, marker_color=colors[id], text=dx['p'])
    fig.add_trace(tr, row=r, col=c)
    xref = 'x' + str(c)
    yref = 'y' + str((r-1)*ncol + c)  # yrefs assigned in the pattern described above
    fig.add_annotation(x=name,
                       y=dx.y.median(),
                       text=dx.p.unique()[0],
                       ax=0, ay=0, showarrow=False,
                       xref=xref, yref=yref,
                       font=dict(size=24))
fig.show()

dictionary or sub df from df

I am totally new to programming in general, so please explain.
The general aim: I am dealing with x, y, z data. I want to reduce the number of points in each cell (cells could have variable sizes depending on the project) to, say, 50 without affecting the mean value.
The problem: I have a df with x, y, z, binnumber, and I want to produce either a dictionary (e.g. binnumber: [x,y,z], [x,y,z], ... for the points inside that bin) or, somehow, sub-datasets that I can still work with as DataFrames.
What I did:
# import the data
import pandas as pd
import numpy as np
from scipy.stats import binned_statistic_2d
inputpath = input("write the file path:")
# file name; index_col=False means no index column; names are the column names
Data = pd.read_csv(inputpath, index_col=False, header=None,
                   names=['X', 'Y', 'Z'], skip_blank_lines=True)
Data = pd.DataFrame(Data)
# creating the grid cells
min_x = int(min(Data['X']))
max_x = int(max(Data['X']) + 1)
min_y = int(min(Data['Y']))
max_y = int(max(Data['Y']) + 1)
bin_size = float(input('write the cell size:'))
bx = int(((max_x - min_x) // bin_size) + 1)
by = int(((max_y - min_y) // bin_size) + 1)
xedges = np.linspace(min_x, max_x, bx, dtype=int)
yedges = np.linspace(min_y, max_y, by, dtype=int)
# assign the data to the cells
count, x_edge, y_edge, binnumber = binned_statistic_2d(Data['X'], Data['Y'], Data['Z'],
                                                       bins=(xedges, yedges))
Data['binnumber'] = binnumber
# sub sets
subsets = dict(Data.groupby('binnumber'))
print(subsets)
This did not work...
Another solution was to deal with the cells themselves, but it did not work either.
cells = {}
for i in xedges:
    for j in yedges:
        cells[str(i), str(j)] = []
print(cells.keys())
for x in Data.X:
    for y in Data.Y:
        for z in Data.Z:
            for k, v in cells.keys():
                if x >= int(k[0]) and x < int(k[0]) + 1 and y >= int(k[1]) and y < int(k[1]) + 1:
                    k = (x, y, z)
                else:
                    cells = ('0')
print(cells)
Thanks for any attempt to help.
# import the data
import pandas as pd
import numpy as np
from collections import defaultdict
from scipy.stats import binned_statistic_2d
inputpath = input("write the file path:")
# file name; index_col=False means no index column; names are the column names
Data = pd.read_csv(inputpath, index_col=False, header=None,
                   names=['X', 'Y', 'Z'], skip_blank_lines=True)
Data = pd.DataFrame(Data)
# creating the grid cells
min_x = int(min(Data['X']))
max_x = int(max(Data['X']) + 1)
min_y = int(min(Data['Y']))
max_y = int(max(Data['Y']) + 1)
bin_size = float(input('write the cell size:'))
bx = int(((max_x - min_x) // bin_size) + 1)
by = int(((max_y - min_y) // bin_size) + 1)
xedges = np.linspace(min_x, max_x, bx, dtype=int)
yedges = np.linspace(min_y, max_y, by, dtype=int)
# assign the data to the cells
count, x_edge, y_edge, binnumber = binned_statistic_2d(Data['X'], Data['Y'], Data['Z'],
                                                       bins=(xedges, yedges))
Data['binnumber'] = binnumber
# making a dictionary with >>> binnumber: all associated points
Data['value'] = list(zip(Data['X'], Data['Y'], Data['Z']))
d = defaultdict(list)
for idx, row in Data.iterrows():
    d[row['binnumber']].append(row['value'])
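As a side note (my addition, not part of the answer above), groupby can also hand back the per-bin sub-DataFrames directly, which may be easier to work with than lists of tuples:
# dict mapping binnumber -> sub-DataFrame of the points that fall in that bin
subsets = {binnr: grp[['X', 'Y', 'Z']] for binnr, grp in Data.groupby('binnumber')}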

calculating slope for a series trendline in Pandas

Is there an idiomatic way of getting the slope of a linear trend line fitted to the values in a DataFrame column? The data is indexed with a DateTime index.
This should do it:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 5), pd.date_range('2012-01-01', periods=100))
def trend(df):
    # least-squares fit of each column on [1, julian_date] via the normal equations
    df = df.copy().sort_index()
    dates = df.index.to_julian_date().values[:, None]
    x = np.concatenate([np.ones_like(dates), dates], axis=1)
    y = df.values
    return pd.DataFrame(np.linalg.pinv(x.T.dot(x)).dot(x.T).dot(y).T,
                        df.columns, ['Constant', 'Trend'])
trend(df)
Using the same df above for its index:
df_sample = pd.DataFrame((df.index.to_julian_date() * 10 + 2) + np.random.rand(100) * 1e3,
df.index)
coef = trend(df_sample)
df_sample['trend'] = (coef.iloc[0, 1] * df_sample.index.to_julian_date() + coef.iloc[0, 0])
df_sample.plot(style=['.', '-'])
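As a quick sanity check (my addition; note np.polyfit returns the highest-degree coefficient first), the fitted slope can be reproduced with a plain degree-1 polynomial fit:
# slope should match coef.iloc[0, 1] and intercept coef.iloc[0, 0]
slope, intercept = np.polyfit(df_sample.index.to_julian_date(), df_sample[0].values, 1)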
