How to visualise means with Seaborn?

I have a Pandas data frame with the following structure:
alpha beta gamma mse
0 0.00 0.00 0.00 0.000000
1 0.05 0.05 0.90 0.025411
2 0.05 0.10 0.85 0.025794
3 0.05 0.15 0.80 0.026289
4 0.05 0.20 0.75 0.025320
.. ... ... ... ...
148 0.75 0.05 0.20 0.026816
149 0.75 0.10 0.15 0.025817
150 0.75 0.15 0.10 0.025702
151 0.80 0.05 0.15 0.027104
152 0.80 0.10 0.10 0.025936
I would like to visualise the data frame with a heatmap where alpha is represented on the x-axis, beta is represented on the y-axis, and for each square of the lattice, the mean MSE over all gammas is computed. Is there an easy way to do this by using Seaborn?
Thanks in advance.

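For the data you showed, yes. pivot_table aggregates repeated (beta, alpha) pairs with the mean by default, so each cell is the mean MSE over all gammas:
sns.heatmap(df.pivot_table(index='beta', columns='alpha', values='mse'))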
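To see the averaging step in isolation, here is a minimal sketch with made-up numbers (the values below are assumptions for illustration, not the asker's data). Each (alpha, beta) pair appears with two gamma values, and pivot_table collapses them to their mean.

```python
import pandas as pd

# Made-up sample: two gamma values for each (alpha, beta) pair
df = pd.DataFrame({
    "alpha": [0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.10, 0.10],
    "beta":  [0.05, 0.05, 0.10, 0.10, 0.05, 0.05, 0.10, 0.10],
    "gamma": [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2],
    "mse":   [0.02, 0.04, 0.03, 0.05, 0.01, 0.03, 0.06, 0.08],
})

# Rows are beta (y-axis), columns are alpha (x-axis);
# aggfunc defaults to "mean", so gamma is averaged out
grid = df.pivot_table(index="beta", columns="alpha", values="mse")
print(grid)
```

The resulting grid is what gets passed to sns.heatmap.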

All the calculation should be done in your DataFrame.
Once you have the data, you can pivot the DataFrame and use the result to build the heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Assuming that the df variable holds your data
# pivot the data: beta as rows (y-axis), alpha as columns (x-axis);
# pivot_table averages mse over all gammas for each cell
pivoted = df.pivot_table(index='beta', columns='alpha', values='mse', aggfunc='mean')
# plot the heatmap, printing the mean in each cell
sns.heatmap(pivoted, annot=True)
plt.show()
More information in the official documentation: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Related

How to assign new observations to cluster using distance matrix and kmedoids?

I have a dataframe that holds the Word Mover's Distance between each pair of documents in my dataset. I am running k-medoids on this to generate clusters.
     1    2    3    4    5
1 0.00 0.05 0.07 0.04 0.05
2 0.05 0.00 0.06 0.04 0.05
3 0.07 0.06 0.00 0.06 0.06
4 0.04 0.04 0.06 0.00 0.04
5 0.05 0.05 0.06 0.04 0.00
kmed = KMedoids(n_clusters= 3, random_state=123, method ='pam').fit(distance)
After running on this initial matrix and generating clusters, I want to add new points to be clustered. After adding a new document to the distance matrix I end up with:
     1    2    3    4    5    6
1 0.00 0.05 0.07 0.04 0.05 0.12
2 0.05 0.00 0.06 0.04 0.05 0.21
3 0.07 0.06 0.00 0.06 0.06 0.01
4 0.04 0.04 0.06 0.00 0.04 0.05
5 0.05 0.05 0.06 0.04 0.00 0.12
6 0.12 0.21 0.01 0.05 0.12 0.00
I have tried using kmed.predict on the new row.
kmed.predict(new_distance.loc[-1: ])
However, this gives me an error of incompatible dimensions X.shape[1] == 6 while Y.shape[1] == 5.
How can I use this distance of the new document to determine which cluster it should be a part of? Is this even possible, or do I have to recompute clusters every time? Thanks!

The source code for k-medoids says the following:
def transform(self, X):
    """Transforms X to cluster-distance space.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_query, n_features), \
            or (n_query, n_indexed) if metric == 'precomputed'
        Data to transform.
    """
I assume that you use the precomputed metric (because you compute the distances outside the classifier), so in your case n_query is the number of new documents, and n_indexed is the number of the documents for which the fit method was called.
In your case, where you fit the model on 5 documents and then want to classify the sixth one, the X passed to predict should have shape (1, 5): the last row of the new distance matrix without its last column (the new document's distance to itself). With positional indexing, that is
kmed.predict(new_distance.iloc[-1:, :-1])
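For a model fitted on precomputed distances, assigning a new document conceptually comes down to choosing the nearest medoid. The NumPy sketch below illustrates that rule on the question's numbers. The medoid indices are hypothetical (they are not taken from an actual fit), and this is an illustration of the idea rather than the library's own code.

```python
import numpy as np

# Distances from the new (sixth) document to the 5 fitted documents:
# the last row of the question's 6x6 matrix without the final
# self-distance entry
new_row = np.array([[0.12, 0.21, 0.01, 0.05, 0.12]])  # shape (1, 5)

# Hypothetical 0-based positions of the medoids among the 5 fitted documents
medoid_indices = np.array([0, 2, 3])

# Keep only the distances to the medoids and pick the closest one
labels = np.argmin(new_row[:, medoid_indices], axis=1)
print(labels)
```

The label is the position of the nearest medoid in medoid_indices, so no refit is needed to place a new document; the clusters themselves stay as they were.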

This is my attempt. Each time a new point arrives, we must compute the distances between it and the original points, then pass that row to predict.
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances
from sklearn_extra.cluster import KMedoids
# dummy data for the trial: two points, (0, 1) and (1, 2)
df = pd.DataFrame({0: [0, 1], 1: [1, 2]})
# calculate the pairwise distances between the original points
distance = pairwise_distances(df.values, df.values)
# fit the model; metric='precomputed' tells KMedoids that the input is a distance matrix
kmed = KMedoids(n_clusters=2, metric='precomputed', random_state=123, method='pam').fit(distance)
new_point = [2, 3]
# calculate the distance between the new point and the initial dataset; shape (1, 2)
new_distance = pairwise_distances(np.array(new_point).reshape(1, -1), df.values)
print(new_distance)
# the row holds only distances to the fitted points (no self-distance), so it can be passed as is
print(kmed.predict(new_distance))

Python: how can I pass parameters in def to inputs in pandas loc?

I want to pass the parameters in my def to inputs in pandas loc but I am not sure how to do so, as loc requires defined labels as inputs. Or is there any other way I can perform Excel INDEX MATCH equivalent in Python but not using loc? Many thanks!
Below please find my code:
def get_correl_diff_tenor(p1, p2):
    correl = IRCorrMatrix.loc['p1', 'p2']
    return correl
p1 and p2 in loc['p1', 'p2'] refer to the tenor pairs for calling the corresponding correlation value in the matrix below.
IRCorrMatrix is shown below, which is a correlation matrix defined by tenor pairs.
2w 1m 3m 6m 1y
Tenor
2w 1.00 0.73 0.64 0.57 0.44
1m 0.73 1.00 0.78 0.67 0.50
3m 0.64 0.78 1.00 0.85 0.66
6m 0.57 0.67 0.85 1.00 0.81
1y 0.44 0.50 0.66 0.81 1.00

If I understand correctly, remove the quotes around 'p1' and 'p2' so that the function's arguments are passed to loc as variables:
IRCorrMatrix.loc[p1, p2]
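Put together, the function might look like the sketch below. The matrix is rebuilt from the values in the question; the way it is constructed here is an assumption, since the question does not show how IRCorrMatrix was created.

```python
import pandas as pd

tenors = ["2w", "1m", "3m", "6m", "1y"]

# Correlation matrix from the question, rebuilt by hand for this example
IRCorrMatrix = pd.DataFrame(
    [
        [1.00, 0.73, 0.64, 0.57, 0.44],
        [0.73, 1.00, 0.78, 0.67, 0.50],
        [0.64, 0.78, 1.00, 0.85, 0.66],
        [0.57, 0.67, 0.85, 1.00, 0.81],
        [0.44, 0.50, 0.66, 0.81, 1.00],
    ],
    index=pd.Index(tenors, name="Tenor"),
    columns=tenors,
)


def get_correl_diff_tenor(p1, p2):
    # p1 and p2 are variables, so loc looks up the labels they hold
    return IRCorrMatrix.loc[p1, p2]


print(get_correl_diff_tenor("3m", "1y"))
```

With the quotes in place, loc would search for a row literally named 'p1', which does not exist in the matrix and raises a KeyError.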

Getting meaningful results from pandas.describe()

I called describe on one column of a dataframe and ended up with the following output,
count 1.048575e+06
mean 8.232821e+01
std 2.859016e+02
min 0.000000e+00
25% 3.000000e+00
50% 1.400000e+01
75% 6.000000e+01
max 8.599700e+04
What parameter do I pass to get meaningful integer values? What I mean is that when I check the count in SQL, it's about 43 million. All the other values are also different. Can someone help me understand what this conversion means, and how do I get floats rounded to 2 decimal places? I'm new to Pandas.

You can call round() directly on the result of describe() and pass the number of decimals you want as the argument:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# setting the seed to create the dataframe
np.random.seed(25)
# Creating a 5 * 4 dataframe
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
# rounding describe
df.describe().round(2)
A B C D
count 5.00 5.00 5.00 5.00
mean 0.52 0.47 0.38 0.42
std 0.21 0.23 0.19 0.29
min 0.33 0.12 0.16 0.11
25% 0.41 0.37 0.28 0.19
50% 0.45 0.58 0.37 0.44
75% 0.56 0.59 0.40 0.52
max 0.87 0.70 0.68 0.84

There are two ways to control how pandas displays floats: either set a global display option, or format the result of describe() with apply.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df['X'].describe().apply("{0:.5f}".format)
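As a small illustration of the second approach, the sketch below formats a describe() result as plain strings. The Series values are made up for the example. It also shows that the scientific notation in the question is only a display format: 1.048575e+06 is the number 1,048,575.

```python
import pandas as pd

# Made-up column for illustration
s = pd.Series([0, 3, 14, 60, 85997], name="X")

# describe() returns floats; format each one with 2 decimals and
# thousands separators (the result is a Series of strings)
summary = s.describe()
formatted = summary.apply("{:,.2f}".format)
print(formatted)

# Scientific notation is just a way of printing the same value
count = 1.048575e+06
print(f"{count:,.0f}")
```

Because the formatted result holds strings, it is meant for display only; keep the original describe() output for any further arithmetic.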

Matplotlib like graphs with plotly express

Below is my Pandas dataframe. It's very easy to create a line plot for all the items with matplotlib; I just write
df.plot()
and it creates a separate line for each item. I want to create the same line plots with plotly express, but I am not able to, maybe because I have date columns.
df;
dataDate 2019-10-01 2019-10-02 2019-10-01 2019-10-01 2019-10-02
name
item1 0.24 0.12 0.19 0.20 0.12
item2 0.26 0.25 0.17 0.17 0.13
item3 0.22 0.24 0.18 0.17 0.16
item4 0.72 0.22 0.19 0.20 0.15
item5 0.55 0.23 0.19 0.18 0.14
Please suggest how I can create line plots for all the items across time with plotly express. Thanks

They have great examples in their documentation (https://plot.ly/python/plotly-express/#scatter-and-line-plots).
By design it works best with tidy data, so you would have one column for the date, one column for the item, and one column for the value.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
import plotly.express as px
base = datetime.today()
# 10 consecutive days, repeated once for each of the 3 categories
dates = [base - timedelta(days=x) for x in range(10)] * 3
cats = ['A'] * 10 + ['B'] * 10 + ['C'] * 10
vals = np.arange(30)
# tidy layout: one row per (date, category) observation
df = pd.DataFrame({'Date': dates, 'Category': cats, 'Value': vals})
fig = px.line(df, x='Date', y='Value', color='Category')
fig.show()
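To get from a wide frame like the one in the question to the tidy layout, something along these lines should work. This is a sketch under assumptions: the frame below uses three distinct dates and three items with made-up values, whereas the question's header repeats some dates, which would need to be resolved first.

```python
import pandas as pd

# Wide layout: one row per item, one column per date (made-up values)
wide = pd.DataFrame(
    {
        "2019-10-01": [0.24, 0.26, 0.22],
        "2019-10-02": [0.12, 0.25, 0.24],
        "2019-10-03": [0.19, 0.17, 0.18],
    },
    index=pd.Index(["item1", "item2", "item3"], name="name"),
)

# Tidy layout: one row per (item, date) observation
tidy = wide.reset_index().melt(
    id_vars="name", var_name="dataDate", value_name="value"
)
tidy["dataDate"] = pd.to_datetime(tidy["dataDate"])
print(tidy)

# The tidy frame can then be plotted with one line per item, e.g.
# px.line(tidy, x="dataDate", y="value", color="name")
```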

Create new DF with values representing difference between two dataframes

I am working with two numeric data.frames, both with 13,803 observations and 13,803 variables. Their column and row names are identical, but their entries are different. What I want to do is create a new data.frame in which the df2 values are subtracted from the df1 values.
The "formula" would be: df1 (entry values) - df2 (entry values) = df3 (difference). In other words, the purpose is to find the difference between all corresponding entries.
My problem illustrated here.
DF1
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.71 0.98 0.32
[GENE128] 0.23 0.61 0.90
[GENE271] 0.87 0.95 0.63
DF2
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.70 0.94 0.30
[GENE128] 0.25 0.51 0.80
[GENE271] 0.82 0.92 0.60
NEW DF3
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.01 0.04 0.02
[GENE128] -0.02 0.10 0.10
[GENE271] 0.05 0.03 0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
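The question's wording suggests R data.frames, but since this page is filed under Python, here is how the same element-wise difference could be sketched in pandas, assuming both tables are loaded as DataFrames with identical row and column labels. The small frames below copy the example values from the question.

```python
import pandas as pd

rows = ["GENE231", "GENE128", "GENE271"]
cols = ["GENE128", "GENE271", "GENE2983"]

df1 = pd.DataFrame(
    [[0.71, 0.98, 0.32], [0.23, 0.61, 0.90], [0.87, 0.95, 0.63]],
    index=rows, columns=cols,
)
df2 = pd.DataFrame(
    [[0.70, 0.94, 0.30], [0.25, 0.51, 0.80], [0.82, 0.92, 0.60]],
    index=rows, columns=cols,
)

# Subtraction aligns on row and column labels and works entry by entry;
# rounding only tidies floating-point noise for display
df3 = (df1 - df2).round(2)
print(df3)
```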
