I have a dataframe that holds the Word Mover's Distance between each pair of documents. I am running k-medoids on this to generate clusters.
      1     2     3     4     5
1  0.00  0.05  0.07  0.04  0.05
2  0.05  0.00  0.06  0.04  0.05
3  0.07  0.06  0.00  0.06  0.06
4  0.04  0.04  0.06  0.00  0.04
5  0.05  0.05  0.06  0.04  0.00
kmed = KMedoids(n_clusters=3, random_state=123, method='pam').fit(distance)
After running on this initial matrix and generating clusters, I want to add new points to be clustered. After adding a new document to the distance matrix I end up with:
      1     2     3     4     5     6
1  0.00  0.05  0.07  0.04  0.05  0.12
2  0.05  0.00  0.06  0.04  0.05  0.21
3  0.07  0.06  0.00  0.06  0.06  0.01
4  0.04  0.04  0.06  0.00  0.04  0.05
5  0.05  0.05  0.06  0.04  0.00  0.12
6  0.12  0.21  0.01  0.05  0.12  0.00
I have tried using kmed.predict on the new row.
kmed.predict(new_distance.loc[-1: ])
However, this gives me an error of incompatible dimensions X.shape[1] == 6 while Y.shape[1] == 5.
How can I use this distance of the new document to determine which cluster it should be a part of? Is this even possible, or do I have to recompute clusters every time? Thanks!
The source code for k-medoids says the following:
def transform(self, X):
    """Transforms X to cluster-distance space.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_query, n_features), \
        or (n_query, n_indexed) if metric == 'precomputed'
        Data to transform.
    """
I assume that you use the precomputed metric (because you compute the distances outside the classifier), so in your case n_query is the number of new documents and n_indexed is the number of documents the fit method was called on.
In your particular case, when you fit the model on 5 documents and then want to classify the 6th one, the X passed to predict should have shape (1, 5), which can be obtained as
kmed.predict(new_distance.iloc[-1:, :-1])
Here is my attempt; we must recompute the distances between the new point and the old ones each time.
import pandas as pd
import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import pairwise_distances

# dummy data for the trial
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})

# calculate the pairwise distance matrix
distance = pairwise_distances(df.values, df.values)

# fit the model
kmed = KMedoids(n_clusters=2, random_state=123, method='pam').fit(distance)

# calculate the distances between the new point and the initial dataset
new_point = [2, 3]
new_distance = pairwise_distances(np.array(new_point).reshape(1, -1), df.values)
print(new_distance)

# the row already holds one distance per fitted point (shape (1, 2)), so pass it to predict directly
print(kmed.predict(new_distance))
I'm working with a dataframe containing environmental values (Sentinel-2 satellite: NDVI) like:
Date ID_151894 ID_109386 ID_111656 ID_110006 ID_112281 ID_132408
0 2015-07-06 0.82 0.61 0.85 0.86 0.76 nan
1 2015-07-16 0.83 0.81 0.77 0.83 0.84 0.82
2 2015-08-02 0.88 0.89 0.89 0.89 0.86 0.84
3 2015-08-05 nan nan 0.85 nan 0.83 0.77
4 2015-08-12 0.82 0.77 nan 0.65 nan 0.42
5 2015-08-22 0.85 0.85 0.88 0.87 0.83 0.83
The columns correspond to different places and the nan values are due to cloudy conditions (which happen often in Belgium). There are obviously a lot more values. To remove outliers, I use the method described in the TIMESAT manual (Jönsson & Eklundh, 2015): a value is an outlier if
it deviates more than a maximum deviation (here called the cutoff) from the median,
it is lower than the mean value of its immediate neighbors minus the cutoff,
or it is larger than the highest value of its immediate neighbors plus the cutoff.
So I wrote the code below to do this:
import pandas as pd

NDVI = pd.read_excel("C:/Python_files/Cartofor/NDVI_frene_5ha.xlsx")
date = NDVI["Date"]
MED = NDVI.median(axis=0, skipna=True, numeric_only=True)  # per-column median (Date column excluded)
SD = NDVI.std(axis=0, skipna=True, numeric_only=True)      # per-column standard deviation
cutoff = 1.5 * SD
NDVIF = NDVI.copy()
for j in range(1, 21):       # data columns (column 0 is the date)
    for i in range(1, 480):  # rows
        # MED/SD/cutoff exclude the Date column, hence the j-1 below
        # lower than the mean of its immediate neighbors minus the cutoff
        if NDVIF.iloc[i, j] < (NDVIF.iloc[i-1, j] + NDVIF.iloc[i+1, j]) / 2 - cutoff.iloc[j-1]:
            NDVIF.iloc[i, j] = float('NaN')
        # larger than the highest immediate neighbor plus the cutoff
        elif NDVIF.iloc[i, j] > max(NDVIF.iloc[i-1, j], NDVIF.iloc[i+1, j]) + cutoff.iloc[j-1]:
            NDVIF.iloc[i, j] = float('NaN')
        # keep values within the cutoff of the median
        elif MED.iloc[j-1] - cutoff.iloc[j-1] <= NDVIF.iloc[i, j] <= MED.iloc[j-1] + cutoff.iloc[j-1]:
            pass
        else:
            NDVIF.iloc[i, j] = float('NaN')
The problem is that I need to omit the NaN values from the calculations. The goal is to have a dataframe like the one above, but without the outliers.
Once this is done, I have to interpolate the values onto a new, regular time index (e.g. one value per day, or one value every five days, from 2016 to 2020) and write each interpolated column to a txt file to feed into the TIMESAT software.
I hope my English is not too bad; thank you for your answers! :)
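For reference, a vectorized sketch of one possible way to do the masking and the reinterpolation, assuming the same file and Date column as above (the daily frequency and the output file names are illustrative). Comparisons against NaN evaluate to False, so cloudy dates simply stay NaN and are filled by the interpolation step:

import numpy as np
import pandas as pd

NDVI = pd.read_excel("C:/Python_files/Cartofor/NDVI_frene_5ha.xlsx")
values = NDVI.drop(columns="Date")

med = values.median(skipna=True)        # per-column median
cutoff = 1.5 * values.std(skipna=True)  # per-column cutoff

prev_, next_ = values.shift(1), values.shift(-1)
too_low    = values < (prev_ + next_) / 2 - cutoff       # below the neighbor mean minus the cutoff
too_high   = values > np.maximum(prev_, next_) + cutoff  # above the highest neighbor plus the cutoff
off_median = (values - med).abs() > cutoff               # deviates more than the cutoff from the median

clean = values.mask(too_low | too_high | off_median)     # outliers become NaN
clean.index = pd.to_datetime(NDVI["Date"])

# interpolate onto a regular time index (here daily) and write one file per column
daily = clean.resample("D").mean().interpolate(method="time")
for col in daily.columns:
    daily[col].to_csv(f"{col}.txt", sep="\t", header=True)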
I want to pass the parameters of my def to the inputs of pandas loc, but I am not sure how to do so, as loc requires defined labels as inputs. Or is there any other way to perform an Excel INDEX MATCH equivalent in Python without using loc? Many thanks!
Below please find my code:
def get_correl_diff_tenor(p1, p2):
    correl = IRCorrMatrix.loc['p1', 'p2']
    return correl
p1 and p2 in loc['p1', 'p2'] refer to the tenor pair used to look up the corresponding correlation value in the matrix below.
IRCorrMatrix is shown below, which is a correlation matrix defined by tenor pairs.
2w 1m 3m 6m 1y
Tenor
2w 1.00 0.73 0.64 0.57 0.44
1m 0.73 1.00 0.78 0.67 0.50
3m 0.64 0.78 1.00 0.85 0.66
6m 0.57 0.67 0.85 1.00 0.81
1y 0.44 0.50 0.66 0.81 1.00
IIUC, remove the quotes from 'p1' and 'p2' so that the function's variables are passed rather than literal strings:
IRCorrMatrix.loc[p1, p2]
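A minimal sketch of the corrected function, using the matrix shown above (the 0.44 result is its 2w/1y entry):

def get_correl_diff_tenor(p1, p2):
    # p1 and p2 are tenor labels such as '2w' or '1y'
    return IRCorrMatrix.loc[p1, p2]

print(get_correl_diff_tenor('2w', '1y'))  # 0.44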
I have a Pandas data frame with the following structure:
alpha beta gamma mse
0 0.00 0.00 0.00 0.000000
1 0.05 0.05 0.90 0.025411
2 0.05 0.10 0.85 0.025794
3 0.05 0.15 0.80 0.026289
4 0.05 0.20 0.75 0.025320
.. ... ... ... ...
148 0.75 0.05 0.20 0.026816
149 0.75 0.10 0.15 0.025817
150 0.75 0.15 0.10 0.025702
151 0.80 0.05 0.15 0.027104
152 0.80 0.10 0.10 0.025936
I would like to visualise the data frame with a heatmap where alpha is represented on the x-axis, beta is represented on the y-axis, and for each square of the lattice, the mean MSE over all gammas is computed. Is there an easy way to do this by using Seaborn?
Thanks in advance.
For the data you showed, yes, you can do it with:
sns.heatmap(df.pivot_table(index='beta', columns='alpha', values='mse'))
pivot_table aggregates duplicate (beta, alpha) pairs with the mean by default, so each cell shows the mean mse over the gamma values.
All the calculations should be done in your DataFrame.
Once you have the data, you can use a pivoted DataFrame to build the heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# assuming the df variable holds your data, pivot it so that beta indexes
# the rows and alpha the columns, averaging mse over the gamma values
pivoted = df.pivot_table(index='beta', columns='alpha', values='mse', aggfunc='mean')
# plot the heatmap
sns.heatmap(pivoted, annot=True)
plt.show()
More information in the official documentation: https://seaborn.pydata.org/generated/seaborn.heatmap.html
I have the following dataset (replication):
ordinal_var fraction error_on_fraction
1 1.2 0.1
2 0.87 0.23
4 1.12 0.11
5 0.75 0.06
5 0.66 0.15
6 0.98 0.08
7 1.34 0.05
7 2.86 0.12
Now I want to do a linear regression analysis (preferably in R, but Python is also fine) where I pass the error in y for each point within the formula. So in R this would be something like (for better understanding of the question):
lm(fraction +- error_on_fraction ~ ordinal_var, data = dataset)
Of course I tried to find out how to do it myself first, but I can't find an answer.
For a previous analysis with errors on both x and y I just used the scipy.odr library, but I can't find how to do it with an error in only the y (response) variable.
Any help would be much appreciated!
We can use a simple weighted least squares model.
Sample data
Let's read in your sample data.
df <- read.table(text =
"ordinal_var fraction error_on_fraction
1 1.2 0.1
2 0.87 0.23
4 1.12 0.11
5 0.75 0.06
5 0.66 0.15
6 0.98 0.08
7 1.34 0.05
7 2.86 0.12", header = T)
Weighted least squares model
We fit a weighted linear model of the form fraction ~ ordered(ordinal_var), where the weights are given by 1 / error_on_fraction.
fit <- lm(
fraction ~ ordered(ordinal_var),
weights = 1 / error_on_fraction,
data = df)
summary(fit)
#
#Call:
#lm(formula = fraction ~ ordered(ordinal_var), data = df, weights = 1/error_on_fraction)
#
#Weighted Residuals:
# 1 2 3 4 5 6 7
# 2.220e-16 -1.851e-16 -1.753e-17 1.050e-01 -1.660e-01 1.810e-17 -1.999e+00
# 8
# 3.097e+00
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.1136 0.3365 3.309 0.0804 .
#ordered(ordinal_var).L 0.3430 0.7847 0.437 0.7047
#ordered(ordinal_var).Q 0.6228 0.7057 0.883 0.4706
#ordered(ordinal_var).C 0.2794 0.8920 0.313 0.7838
#ordered(ordinal_var)^4 0.2127 0.9278 0.229 0.8400
#ordered(ordinal_var)^5 -0.2469 0.7916 -0.312 0.7846
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 2.61 on 2 degrees of freedom
#Multiple R-squared: 0.5427, Adjusted R-squared: -0.6004
#F-statistic: 0.4748 on 5 and 2 DF, p-value: 0.783
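If you would rather stay in Python, a roughly equivalent weighted fit can be sketched with statsmodels. Here ordinal_var is used directly as a numeric regressor (not as an ordered factor), and the weights mirror the 1 / error_on_fraction used above; 1 / error_on_fraction**2 would weight by inverse variance instead:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "ordinal_var": [1, 2, 4, 5, 5, 6, 7, 7],
    "fraction": [1.2, 0.87, 1.12, 0.75, 0.66, 0.98, 1.34, 2.86],
    "error_on_fraction": [0.1, 0.23, 0.11, 0.06, 0.15, 0.08, 0.05, 0.12],
})

X = sm.add_constant(df["ordinal_var"])  # intercept + slope
wls = sm.WLS(df["fraction"], X, weights=1 / df["error_on_fraction"])
print(wls.fit().summary())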
Based on the following answer, I tried to code hierarchical class clustering from a confusion matrix. A confusion matrix is used to evaluate the results of a classification problem and isn't symmetric. Each row represents the instances in an actual class. Here is an example of a confusion matrix where you can read that 25% of the samples of the 'zero' class are predicted as class 'six'.
I tried to modify the code with:
conf_mat = 1 - conf_mat # 1.0 means dissimilarity
sch.linkage(conf_mat, method='ward')
But I got wrong results.
How should I organize my data in order to apply the clustering?
The following should give me the rearranged order of the clusters, right?
ind = sch.fcluster(Y, 0, 'distance')
I'm not sure I understand WHY you are doing this, but, based on the comment you posted above, it seems that you'd like to cluster 10 objects ('zero', 'one', ..., 'nine') by comparing their values in your confusion matrix, generated by some other algorithm.
I would like the clusters to maximize the classification results: if one class is mainly recognized as another one then both classes should be fused.
... So, looking at your data, object 'eight' and object 'nine' might be in the same cluster because they both have mostly low values and one relatively high value in the 'eight' column.
To do this, you can treat each of the 10 objects as having 10 arbitrary properties; then this is a standard setup. Perhaps Euclidean distance is appropriate to determine the distance between objects; you would know best.
It sounds like you'd like to do some hierarchical clustering, which you can do with scipy.cluster.hierarchy; example below.
Example
I didn't want to type up your data by hand, so I just randomly generated a matrix. To avoid confusion I'm calling the objects 'zero' ... 'nine' (spelled out) and I'm using the numerals '0' through '9' as the objects' properties (the column labels).
0 1 2 3 4 5 6 7 8 9
zero 0.37 0.27 0.23 0.92 0.86 0.62 0.08 0.95 0.35 0.69
one 0.24 0.23 0.70 0.39 0.52 0.03 0.14 0.00 0.53 0.10
two 0.78 0.12 0.85 0.79 0.32 0.90 0.78 0.07 0.07 0.62
...
nine 0.15 0.39 0.27 0.93 0.12 0.14 0.34 0.11 0.72 0.52
So this is my "confusion matrix".
Hierarchical clustering with SciPy. I'm using Euclidean distance, and the single-link agglomerative method.
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

Y = pdist(data.to_numpy(), metric='euclidean')
Z = hierarchy.linkage(Y, method='single')
ax = hierarchy.dendrogram(Z, show_contracted=True, labels=data.index.tolist())
[I put my matrix in a dataframe so I could add labels to the columns and index. That's why I'm using the pandas calls data.to_numpy() to get the raw data and data.index.tolist() to set the labels.]
This gives:
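To go from the dendrogram back to flat cluster assignments (the fcluster question above), a short sketch; the distance threshold is illustrative, and note that fcluster takes the linkage matrix Z, not the condensed distances Y:

# dendrogram() returns a dict; 'ivl' holds the leaf labels in drawing order
print(ax['ivl'])

# cut the tree at an illustrative distance threshold to get flat cluster labels
labels = hierarchy.fcluster(Z, t=1.0, criterion='distance')
print(dict(zip(data.index.tolist(), labels)))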