I have a heatmap in Seaborn via sns.heatmap. I now want to white out the bottom row and right column but keep the values.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
np.random.seed(2021)
df = pd.DataFrame(np.random.normal(0, 1, (6, 4)))
df = df.rename(columns = {0:"a", 1:"b", 2:"c", 3:"d"})
df.index = [value for key, value in {0:"a", 1:"b", 2:"c", 3:"d", 4:"e", 5:"f"}.items()]
sns.heatmap(df, annot = True)
plt.show()
I thought I had to include a mask argument in my sns.heatmap call, but I have not managed to build a proper mask, and the mask also removes the annotation. I also need to preserve the text indices of my data frame df. How can I get those cells whited out while preserving the text indices?
Here is an approach:
use the original data for annotation (annot=data)
create a "norm" using the original data, to be used for coloring
create a copy of the colormap and assign an "over" color as "white"
create a copy of the data, and fill the right column and lower row with a value higher than the maximum of the original data (np.inf can't be used, because then no annotation will be placed); use this copy for the coloring; seaborn will magically use the appropriate color for the annotation
to use the dataframe's column and index names in the heatmap, just use sns.heatmap(..., xticklabels=df.columns, yticklabels=df.index)
if your seaborn version is too old to support as_cmap=True, consider using one of matplotlib's standard colormaps, or create one via matplotlib.colors.ListedColormap(), for example cmap = ListedColormap(sns.color_palette('rocket', 256))
In example code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from copy import copy
np.random.seed(2021)
df = pd.DataFrame(np.random.normal(0, 1, (6, 4)), columns=[*"abcd"], index=[*"abcdef"])
data = df.to_numpy()
data_for_colors = data.copy()
data_for_colors[:, -1] = data.max() + 10
data_for_colors[-1, :] = data.max() + 10
norm = plt.Normalize(data[:-1, :-1].min(), data[:-1, :-1].max())
# cmap = sns.color_palette('rocket', as_cmap=True).copy()
cmap = copy(plt.get_cmap('RdYlGn'))
cmap.set_over('white')
sns.set_style('white')
sns.heatmap(data=data_for_colors, xticklabels=df.columns, yticklabels=df.index,
            annot=data, cmap=cmap, norm=norm)
plt.show()
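To see why the "over" color does the work here, a minimal sketch may help. The vmin/vmax limits and the two probe values below are made up for illustration and are not taken from the data above. Any value that the norm maps above 1 gets the colormap's "over" color, while in-range values keep their regular color.

```python
from copy import copy

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is needed here
import matplotlib.pyplot as plt

cmap = copy(plt.get_cmap("RdYlGn"))
cmap.set_over("white")
norm = plt.Normalize(vmin=-1.0, vmax=1.0)  # illustrative limits only

in_range = cmap(norm(0.5))       # RGBA taken from the regular colormap
out_of_range = cmap(norm(11.0))  # above vmax, so the "over" color is returned

print(in_range)
print(out_of_range)
```

This is the reason the filler value only has to be larger than the upper limit of the norm; its exact size does not matter.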
I need help, I am unable to display the seaborn plot well.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('sales.csv', header=0, sep=',',
                      usecols=[1, 2, 3, 4])
#remove NaN
dataset.dropna(inplace = True)
df = pd.DataFrame(data=dataset)
sns.regplot(data=df, x='TV', y='sales')
plt.show()
Example sales.csv:
id,TV,radio,newspaper,sales
1,230.10000000,37.8,69.2,22.1
2,1e12,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9
5,180.8,10.8,58.4,12.9
6,8.7,48.9,75,7.2
7,57.5,32.8,23.5,11.8
8,120.2,19.6,11.6,13.2
9,8.6,2.1,1,4.8
10,199.8,2.6,21.2,10.6
11,66.1,5.8,24.2,8.6
12,214.7,24,4,17.4
13,23.8,35.1,65.9,9.2
14,97.5,7.6,7.2,9.7
15,1,32.9,46,19
16,195.4,47.7,52.9,22.4
17,67.8,36.6,114,12.5
18,281.4,39.6,55.8,24.4
19,69.2,20.5,18.3,11.3
20,147.3,23.9,19.1,14.6
21,218.4,27.7,53.4,18
22,237.4,5.1,23.5,12.5
23,13.2,15.9,49.6,5.6
24,228.3,16.9,26.2,15.5
25,62.3,12.6,18.3,9.7
26,262.9,3.5,19.5,12
27,142.9,29.3,12.6,15
28,240.1,16.7,22.9,15.9
29,248.8,27.1,22.9,18.9
30,70.6,16,40.8,10.5
31,292.9,28.3,43.2,21.4
32,112.9,17.4,38.6,11.9
33,97.2,1.5,30,9.6
34,1e12,20,0.3,17.4
The main problem is that the dataset contains values of 1e12 used to represent NA. These values should be replaced or dropped. The easiest way to convert '1e12' to NA is via the na_values='1e12' parameter to pd.read_csv().
Alternatively, dataset.replace(1e12, np.nan, inplace=True) (with numpy imported as np) can be used to convert them after loading; np.nan keeps the column's float dtype.
Note that dataset already is a dataframe, so the call df = pd.DataFrame(data=dataset) is unnecessary.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dataset = pd.read_csv('sales.csv', header=0, sep=',', na_values='1e12',
                      usecols=[1, 2, 3, 4])
# remove NaN
dataset.dropna(inplace=True)
sns.regplot(data=dataset, x='TV', y='sales')
plt.show()
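The two ways of handling the 1e12 placeholder can be compared without the file on disk. The sketch below is only an illustration: it reads a three-row excerpt of the sample data from a string, and the names csv_text, at_read and after_read are made up for this example.

```python
import io

import numpy as np
import pandas as pd

# Three rows copied from the sample file; one of them uses the 1e12 placeholder.
csv_text = """id,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,1e12,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Option 1: treat the placeholder as missing while reading.
at_read = pd.read_csv(io.StringIO(csv_text), na_values="1e12", usecols=[1, 2, 3, 4])

# Option 2: read everything as numbers, then replace the placeholder.
after_read = pd.read_csv(io.StringIO(csv_text), usecols=[1, 2, 3, 4]).replace(1e12, np.nan)

print(at_read.dropna())
print(after_read.dropna())
```

Both variants should leave the same two rows after dropna().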
I have a dataframe with a list of items and associated values. Which metric and method is best for performing the clustering?
I want to create a seaborn clustermap (dendrogram plus heatmap) from this data, clustering on rows only. The plot itself works (see the code below), but how can I get the list of items in each cluster, i.e. each protein together with its cluster assignment? (This is similar to "Extract rows of clusters in hierarchical clustering using seaborn clustermap", but based only on rows, not columns.)
How do I determine which "method" and "metric" is best for my data?
data.csv example:
item,v1,v2,v3,v4,v5
A1,1,2,3,4,5
B1,2,4,6,8,10
C1,3,6,9,12,15
A1,2,3,4,5,6
B2,3,5,7,9,11
C2,4,7,10,13,16
My code for creating the clustermap:
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as sch
df = pd.read_csv('data.csv', index_col=0)
sns.clustermap(df, col_cluster=False, cmap="coolwarm", method='ward', metric='euclidean', figsize=(40,40))
plt.savefig('plot.pdf', dpi=300)
I just hacked this together. Is this what you want?
import pandas as pd
import numpy as np
import seaborn as sns
cars = {'item': ['A1','B1','C1','A1','B1','C1'],
        'v1': [1.0,2.0,3.0,2.0,3.0,4.0],
        'v2': [2.0,4.0,6.0,3.0,5.0,7.0],
        'v3': [3.0,6.0,9.0,4.0,7.0,10.0],
        'v4': [4.0,8.0,12.0,5.0,9.0,13.0],
        'v5': [5.0,10.0,15.0,6.0,11.0,16.0]
        }
df = pd.DataFrame(cars)
df
heatmap_data = pd.pivot_table(df, values=['v1','v2','v3','v4','v5'],
                              index=['item'])
heatmap_data.head()
sns.clustermap(heatmap_data)
df = df.drop(['item'], axis=1)
g = sns.clustermap(df)
Also, check out links below for more info on this topic.
https://seaborn.pydata.org/generated/seaborn.clustermap.html
https://kite.com/python/docs/seaborn.clustermap
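To get the cluster membership of each row, one option is to compute the row linkage with scipy and cut it with fcluster. The sketch below is not the method used in the answer above. It rebuilds the rows of data.csv by hand, uses the same method and metric as the clustermap call in the question, and assumes that three clusters are wanted. The last lines show one rough way to compare linkage methods, the cophenetic correlation; the three methods compared are an arbitrary choice.

```python
import pandas as pd
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

# Rows of data.csv from the question (row labels copied as shown there).
df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15],
     [2, 3, 4, 5, 6], [3, 5, 7, 9, 11], [4, 7, 10, 13, 16]],
    index=["A1", "B1", "C1", "A1", "B2", "C2"],
    columns=["v1", "v2", "v3", "v4", "v5"],
)
values = df.to_numpy(dtype=float)

# Same method and metric as in the clustermap call.
Z = linkage(values, method="ward", metric="euclidean")

# Assumption: three clusters are wanted; fcluster returns one label per row.
labels = fcluster(Z, t=3, criterion="maxclust")
clusters = pd.DataFrame({"item": df.index, "cluster": labels})
print(clusters)

# Cophenetic correlation per method: values closer to 1 mean the dendrogram
# preserves the original pairwise distances more faithfully.
distances = pdist(values, metric="euclidean")
scores = {m: cophenet(linkage(distances, method=m), distances)[0]
          for m in ("average", "complete", "single")}
print(scores)
```

When the figure comes from sns.clustermap, the returned grid exposes its row linkage as g.dendrogram_row.linkage, which should be usable with fcluster in the same way. Which method and metric is "best" depends on the data; the cophenetic correlation is only one heuristic.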
I want to plot a dataframe where y values are stored as ndarrays within a column
i.e.:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(index=np.arange(0,4), columns=('sample','class','values'))
for iloc in [0,2]:
    df.loc[iloc] = {'sample':iloc,
                    'class':'raw',
                    'values':np.random.random(5)}
    df.loc[iloc+1] = {'sample':iloc,
                      'class':'predict',
                      'values':np.random.random(5)}
grid = sns.FacetGrid(df, col="class", row="sample")
grid.map(plt.plot, np.arange(0,5), "value")
TypeError: unhashable type: 'numpy.ndarray'
Do I need to break out the ndarrays into separate rows? Is there a simple way to do this?
Thanks
This is quite an unusual way of storing data in a dataframe. Two options (I'd recommend option B):
A. Custom mapping in seaborn
Indeed, seaborn does not support this format natively. You can, however, write your own function to plot onto the grid.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(index=np.arange(0,4), columns=('sample','class','values'))
for iloc in [0,2]:
    df.loc[iloc] = {'sample':iloc,
                    'class':'raw',
                    'values':np.random.random(5)}
    df.loc[iloc+1] = {'sample':iloc,
                      'class':'predict',
                      'values':np.random.random(5)}
grid = sns.FacetGrid(df, col="class", row="sample")
def plot(*args, **kwargs):
    # args[0] is the "values" column of one facet; its single cell holds the array
    plt.plot(args[0].iloc[0], **kwargs)
grid.map(plot, "values")
B. Unnesting
However, I would advise "unnesting" the dataframe first to get rid of the numpy arrays inside the cells.
The question "pandas: When cell contents are lists, create a row for each element in the list" shows a way to do that.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(index=np.arange(0,4), columns=('sample','class','values'))
for iloc in [0,2]:
    df.loc[iloc] = {'sample':iloc,
                    'class':'raw',
                    'values':np.random.random(5)}
    df.loc[iloc+1] = {'sample':iloc,
                      'class':'predict',
                      'values':np.random.random(5)}
res = df.set_index(["sample", "class"])["values"].apply(pd.Series).stack().reset_index()
res.columns = ["sample", "class", "original_index", "values"]
Then use the FacetGrid in the usual way.
grid = sns.FacetGrid(res, col="class", row="sample")
grid.map(plt.plot, "original_index", "values")
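In newer pandas versions, DataFrame.explode offers another way to unnest. The sketch below is an alternative to the apply(pd.Series).stack() approach, not part of the original answer. The dataframe is built directly from lists for brevity, and the column name "position" is made up here; it plays the role of "original_index" above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sample": [0, 0, 2, 2],
    "class": ["raw", "predict", "raw", "predict"],
    "values": [rng.random(5) for _ in range(4)],  # one array per cell
})

# One row per array element; the original row index is repeated.
res = df.explode("values")

# Position of each element within its original array (0..4).
res["position"] = res.groupby(level=0).cumcount().to_numpy()

# explode leaves an object column, so convert back to float.
res["values"] = res["values"].astype(float)
res = res.reset_index(drop=True)
print(res.head())
```

The result can be passed to sns.FacetGrid in the same way, mapping plt.plot over "position" and "values".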
I am plotting a multi-index columns DataFrame.
What is the syntax to specify the column(s) to be plotted on secondary_y using the .plot method of pandas DataFrame?
Setup
import numpy as np
import pandas as pd
mt_idx = pd.MultiIndex.from_product([['A', 'B'], ['first', 'second']])
df = pd.DataFrame(np.random.randint(0, 10, size=(20, len(mt_idx))), columns=mt_idx)
My Attempts
df.plot(secondary_y=('B', 'second'))
df.plot(secondary_y='(B, second)')
None of the above worked, as all the lines were plotted on the principal y-axis.
One possible solution is to plot each top-level column group separately and pass secondary_y=True for the group that belongs on the secondary axis. Doing it this way requires you to specify the axes onto which they are plotted:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
mt_idx = pd.MultiIndex.from_product([['A', 'B'], ['first', 'second']])
df = pd.DataFrame(np.random.randint(0, 10, size=(20, len(mt_idx))), columns=mt_idx)
df.A.plot(ax=ax)
df.B.plot(ax=ax, secondary_y=True)
plt.show()
Alternatively, you might flatten the two column index levels into single-level names. If you don't want to modify the original dataframe, this can be done on a copy of it.
df2 = df.copy()
df2.columns = df2.columns.map('_'.join)
df2.plot(secondary_y='B_second')
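Another variant that keeps the MultiIndex intact is to wrap the column key in a list. This is a sketch based on my understanding of how pandas matches secondary_y against the column labels, so treat it as something to verify on your pandas version. A bare tuple seems to be read as a sequence of two column names, "B" and "second", neither of which exists, whereas a list containing the tuple makes the tuple itself the key.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import numpy as np
import pandas as pd

mt_idx = pd.MultiIndex.from_product([["A", "B"], ["first", "second"]])
df = pd.DataFrame(np.random.randint(0, 10, size=(20, len(mt_idx))), columns=mt_idx)

# The list holds one column key, the tuple ("B", "second").
ax = df.plot(secondary_y=[("B", "second")])

# The secondary axis, if one was created, is attached to the returned axes.
print(hasattr(ax, "right_ax"))
```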