Avoid plotting missing values in Seaborn - python

Problem: I have time series data covering several days, and I use Seaborn's sns.FacetGrid to plot it in facets. In several cases, I found that Seaborn draws a continuous line across consecutive missing values (NaN) between two readings, whereas Matplotlib shows missing values as a gap, which makes more sense. A demo example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
This results in the following plot:
The Matplotlib plot above shows the missing values as a gap.
Now let us prepare the same data for Seaborn:
DF['col'] = np.ones(DF.shape[0])# dummy column but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
g = sns.FacetGrid(DF,col='col',col_wrap=1,height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
g.map_dataframe(plt.plot,'timestamp','data_val')
See how the Seaborn plot draws a line across the missing data. How can I force Seaborn not to connect NaN values with such a line?
Note: This is a dummy example; I need a facet grid in any case to plot my data.

FacetGrid by default removes NaN values from the data. The reason is that some functions inside seaborn would not work properly with NaNs (especially some of the statistical functions, I'd say). Once the NaN rows are dropped, plt.plot only sees the valid points on either side of the gap and connects them with a line.
To keep the NaN values in the data, pass the dropna=False argument to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)
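As a rough end-to-end check, the sketch below rebuilds the three-day example in a more compact form and passes dropna=False. The variable names, the random seed, and the Agg backend line are assumptions made for illustration, and height is the name recent seaborn releases use for what the question's code calls size.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Three days of hourly readings; the second day is entirely NaN.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2018-01-01", periods=72, freq=pd.Timedelta(hours=1))
values = rng.standard_normal(72)
values[24:48] = np.nan

DF = pd.DataFrame({"timestamp": timestamps, "data_val": values, "col": 1.0})

# dropna=False keeps the NaN rows, so plt.plot can break the line at the gap.
g = sns.FacetGrid(DF, col="col", col_wrap=1, height=2.5, dropna=False)
g.map_dataframe(plt.plot, "timestamp", "data_val")

# Inspect what was actually drawn: the NaN readings are still part of the line data.
line = g.axes.flat[0].lines[0]
n_missing = int(np.isnan(np.asarray(line.get_ydata(), dtype=float)).sum())
print(n_missing)
```

If the NaN rows had been dropped, the drawn line would contain only the 48 valid points and the two days would be joined by a straight segment.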

Related

Rearranging the columns of my heatmap using python's seaborn

I'm trying to visualize the following .csv data:
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20
4,4,2,2,4,2,3,5,3,4,2,5,2,1,4,4,2,1,5,2
2,2,4,4,4,2,2,2,4,4,2,4,2,2,3,2,2,4,5,2
4,5,4,1,4,2,2,4,4,3,2,2,2,1,2,4,4,2,5,4
3,4,2,4,4,2,2,2,4,3,2,4,4,3,3,4,2,4,5,1
4,4,3,2,4,3,4,5,4,3,1,5,3,2,4,2,2,3,4,2
4,5,2,3,5,1,3,4,3,3,1,2,4,4,5,4,1,4,5,4
5,5,5,2,4,3,2,4,4,2,2,4,4,2,4,2,2,4,4,5
4,4,3,1,5,3,2,4,2,2,1,4,4,2,4,1,2,5,5,3
1,3,5,2,4,4,3,1,4,4,2,3,1,4,3,4,3,3,4,1
3,3,5,2,4,2,4,4,3,4,1,5,4,2,1,2,2,4,5,2
Here's my code:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
map = sns.clustermap(df, annot=True, linewidths=2, linecolor='yellow', metric="correlation", method="single")
plt.show()
Which returns:
I want to rearrange my heatmap and order its columns by the frequency of the most common response in each column. For example, column Q5 has the value 4 repeated 8 times (more than any value in any other column), so it should be the first column. Columns Q17 and Q19 each have a value that is repeated 7 times, so they should come second and third (their exact order doesn't matter). How can I do this?
You can compute the order and reindex before using the data in clustermap:
order = (df.apply(pd.Series.value_counts)
           .max()
           .sort_values(ascending=False)
           .index
        )
import seaborn as sns
cm = sns.clustermap(df[order], col_cluster=False, annot=True, linewidths=2, linecolor='yellow', metric="correlation", method="single")
Output:
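To look at the ordering step on its own, here is a small sketch that runs it on the CSV from the question, without the clustermap. Reading the data through io.StringIO and the name top_counts are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

csv_text = """Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q15,Q16,Q17,Q18,Q19,Q20
4,4,2,2,4,2,3,5,3,4,2,5,2,1,4,4,2,1,5,2
2,2,4,4,4,2,2,2,4,4,2,4,2,2,3,2,2,4,5,2
4,5,4,1,4,2,2,4,4,3,2,2,2,1,2,4,4,2,5,4
3,4,2,4,4,2,2,2,4,3,2,4,4,3,3,4,2,4,5,1
4,4,3,2,4,3,4,5,4,3,1,5,3,2,4,2,2,3,4,2
4,5,2,3,5,1,3,4,3,3,1,2,4,4,5,4,1,4,5,4
5,5,5,2,4,3,2,4,4,2,2,4,4,2,4,2,2,4,4,5
4,4,3,1,5,3,2,4,2,2,1,4,4,2,4,1,2,5,5,3
1,3,5,2,4,4,3,1,4,4,2,3,1,4,3,4,3,3,4,1
3,3,5,2,4,2,4,4,3,4,1,5,4,2,1,2,2,4,5,2
"""
df = pd.read_csv(io.StringIO(csv_text))

# For every column, count how often each response occurs and keep the largest count.
top_counts = df.apply(pd.Series.value_counts).max()

# Column labels sorted by that count, most repetitive column first.
order = top_counts.sort_values(ascending=False).index

print(top_counts[order[:3]])
```

Q5 comes first with a count of 8, followed by Q17 and Q19 with 7 each; which of those two is listed first depends on how the sort breaks ties.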

Multi Index Seaborn Line Plot

I have a dataframe with MultiIndex columns, the two levels being Sample and Lithology:
Sample 20EC-P 20EC-8 20EC-10-1 ... 20EC-43 20EC-45 20EC-54
Lithology Pd Di-Grd Gb ... Hbl Plag Pd Di-Grd Gb
Rb 7.401575 39.055118 6.456693 ... 0.629921 56.535433 11.653543
Ba 24.610102 43.067678 10.716841 ... 1.073115 58.520532 56.946630
Th 3.176471 19.647059 3.647059 ... 0.823529 29.647059 5.294118
I am trying to put it into a seaborn lineplot as such.
spider = sns.lineplot(data = data, hue = data.columns.get_level_values("Lithology"),
                      style = data.columns.get_level_values("Sample"),
                      dashes = False, palette = "deep")
The lineplot comes out as
I have two issues. First, I want to format hues by lithology and style by sample. Outside of the lineplot function, I can successfully access sample and lithology using data.columns.get_level_values, but in the lineplot they don't seem to do anything and I haven't figured out another way to access these values. Also, the lineplot reorganizes the x-axis by alphabetical order. I want to force it to keep the same order as the dataframe, but I don't see any way to do this in the documentation.
To use hue= and style=, seaborn prefers its dataframes in long form. pd.melt() stacks all columns into a single value column and adds new columns holding the old column names (here one per level, Sample and Lithology). The index also needs to be converted to a regular column (with .reset_index()).
Most seaborn functions use order= to set an order on the x-values, but with lineplot the only way is to make the column categorical applying a fixed order.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# (Sample, Lithology) pairs; stray whitespace removed so it does not end up in the legend labels
column_tuples = [('20EC-P', 'Pd'), ('20EC-8', 'Di-Grd'), ('20EC-10-1', 'Gb'),
                 ('20EC-43', 'Hbl Plag Pd'), ('20EC-45', 'Di-Grd'), ('20EC-54', 'Gb')]
col_index = pd.MultiIndex.from_tuples(column_tuples, names=["Sample", "Lithology"])
data = pd.DataFrame(np.random.uniform(0, 50, size=(3, len(col_index))), columns=col_index, index=['Rb', 'Ba', 'Th'])
data_long = data.melt(ignore_index=False).reset_index()
data_long['index'] = pd.Categorical(data_long['index'], data.index) # make categorical, use order of the original dataframe
ax = sns.lineplot(data=data_long, x='index', y='value',
                  hue="Lithology", style="Sample", dashes=False, markers=True, palette="deep")
ax.set_xlabel('')
ax.legend(loc='upper left', bbox_to_anchor=(1.01, 1.02))
plt.tight_layout() # fit legend and labels into the figure
plt.show()
The long dataframe looks like:
index Sample Lithology value
0 Rb 20EC-P Pd 6.135005
1 Ba 20EC-P Pd 6.924961
2 Th 20EC-P Pd 44.270570
...

seaborn violin plot for single column splitting by a categorical column

I have a dataframe that looks like this:
num_column is_train
30.75 1
12.05 1
.. ..
43.79 0
15.35 0
I want to see the distribution of num_column using a violin plot, with each side (or split) of the violin showing the data for one of the two categories in the is_train column.
From the examples in the documentation, here's what I could come up with:
import seaborn as sns
sns.violinplot(x=merged_data.loc[:,'num_column'], hue=merged_data.loc[:,'is_train'], split=True)
From the result, I could see that the hue and split arguments had no effect at all: the sides of the violin weren't split and there was no legend.
I am trying to compare the distributions of a column from my train and test data.
The split= argument is to be used with hue-nesting, which can only be used if you already have an x= argument. Therefore you need to provide columns for both x (should be the same value for both datasets) and hue (coded depending on the dataset):
merged_data['dummy'] = 0
sns.violinplot(data=merged_data, y='num_column', split=True, hue='is_train', x='dummy')
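A self-contained sketch of this dummy-column approach follows. The synthetic merged_data (two normal samples with made-up parameters), the seed, and the Agg backend line are assumptions standing in for the asker's real data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the asker's data: 500 train rows and 500 test rows.
rng = np.random.default_rng(0)
merged_data = pd.DataFrame({
    "num_column": np.concatenate([rng.normal(20, 5, 500), rng.normal(25, 8, 500)]),
    "is_train": np.repeat([1, 0], 500),
})

# One constant x value for every row, so both hue levels share a single violin.
merged_data["dummy"] = 0

ax = sns.violinplot(data=merged_data, x="dummy", y="num_column",
                    hue="is_train", split=True)

# The dummy axis carries no information, so hide its label and tick.
ax.set_xlabel("")
ax.set_xticks([])
```

Each half of the single violin then shows one value of is_train, and the legend identifies which half is which.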
You can use the x= parameter to create multiple violins. The hue and split parameters are used when a differentiation via a third column is needed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
merged_data = pd.DataFrame({'num_column': 20 + np.random.randn(1000).cumsum(),
                            'is_train': np.repeat([0, 1], 500)})
sns.violinplot(data=merged_data, x='is_train', y='num_column')
plt.show()

How can I visualise categorical feature vs date column

In my dataset I have a categorical column named 'Type' (e.g. INVOICE, IPC, IP) and a 'Date' column containing dates (e.g. 2014-02-01).
How can I plot these two?
On the x-axis I want the date.
On the y-axis I want a line for each type (e.g. INVOICE) showing its trend.
I'm not sure what you mean by plotting and showing a trend. One way is to count the occurrences, as @QuangHoang suggested, and plot the counts as a heatmap, as below. If you want something different, please expand on your question.
import pandas as pd
import numpy as np
import seaborn as sns
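# draw 20 random dates from 5 quarterly timestamps, and 20 random types
dates = pd.date_range(start='1/1/2018', periods=5, freq='3M')[np.random.randint(0,5,20)]
types = np.random.choice(['INVOICE','IPC','IP'],20)  # renamed so the builtin type() is not shadowed
df = pd.DataFrame({'dates':dates ,'type':types})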
tab = pd.crosstab(df['type'],df['dates'].dt.strftime('%d-%m-%Y'))
n = np.unique(tab.values)
cmap = sns.color_palette("BuGn_r",len(n))
sns.heatmap(tab,cmap=cmap)

Seaborn HUE in Plotly

I have a dataframe of gas prices in Brazil.
I take only the gasoline rows from this dataframe and want to plot the average price (PREÇO MEDIO) over time (years, ANO) for each region (REGIAO).
I used Seaborn with hue and got this:
But when I try to plot the same thing with Plotly, the result is:
How can I get the same plot with Plotly?
I searched and found this: Seaborn Hue on Plotly
But this didn't work for me.
The answer:
You will achieve the same thing using plotly express and the color attribute:
fig = px.line(dfm, x="dates", y="value", color='variable')
The details:
You haven't described the structure of your data in detail, but assigning hue like this is normally meant to be applied to a data structure such as...
Date Variable Value
01.01.2020 A 100
01.01.2020 B 90
01.02.2020 A 110
01.02.2020 B 120
... where a unique hue or color is assigned to different variable names that are associated with a timestamp column where each timestamp occurs as many times as there are variables.
And that seems to be the case for seaborn too:
hue : name of variables in data or vector data, optional
Grouping variable that will produce points with different colors. Can
be either categorical or numeric, although color mapping will behave
differently in latter case.
You can achieve the same thing with plotly by adding one go.Scatter() trace per group, each with its own color, but plotly.express does this for you through its color argument. Until you've provided a proper data sample, I'll show you how to do it with some sample data in a dataframe built using numpy and pandas.
Plot:
Code:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
# sample time series data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(-10,12,size=(50, 4)), columns=list('ABCD'))
datelist = pd.date_range('2020-01-01', periods=50).tolist()  # pd.datetime was removed from pandas
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
df.iloc[0]=0
df=df.cumsum().reset_index()
# melt data to provide the data structure mentioned earlier
dfm=pd.melt(df, id_vars=['dates'], value_vars=df.columns[1:])
# 'dates' stays a regular column because px.line refers to it by name
print(dfm.head())
# plotly
fig = px.line(dfm, x="dates", y="value", color='variable')
fig.show()
