Plot duplication in Pandas Plot() - python

There is an issue with the plot() function in Pandas when column names are repeated:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
ax = df.plot()
ax.legend(ncol=1, bbox_to_anchor=(1., 1, 0., 0), loc=2 , prop={'size':6})
This produces a plot with too many lines, although half of them lie on top of each other. It seems to have something to do with the axes, because the issue goes away when I do not use the returned axes object:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
df.plot()
UPDATE
While not ideal for my use case, the issue can be fixed by using a MultiIndex:
columns = pd.MultiIndex.from_arrays([np.hstack([ ['left']*2, ['right']*2]), ['A', 'B']*2], names=['High', 'Low'])
df = pd.DataFrame(np.random.randn(8, 4), columns=columns)
ax = df.plot()
ax.legend(ncol=1, bbox_to_anchor=(1., 1, 0., 0), loc=2 , prop={'size':16})

It has to do with your duplication of column names, not ax at all (if you call plt.legend after your second example you see the same extra lines). Having multiple columns with the same name is confusing the call to DataFrame.plot_frame.
If you change your columns to ['A', 'B', 'C', 'D'] instead, it's fine.
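If the repeated names have to stay recognisable, one option is to detect the duplicates and give each column a unique label before plotting. The following is only a minimal sketch: the "_1", "_2" suffix scheme is an arbitrary choice made for illustration, not something pandas does for you.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])

# True marks a label that has already appeared further to the left
is_repeat = df.columns.duplicated()
print(is_repeat)

# Give every column a unique label by appending a per-name counter
# (the suffix scheme is an assumption made for this sketch)
seen = {}
unique_labels = []
for name in df.columns:
    seen[name] = seen.get(name, 0) + 1
    unique_labels.append(f"{name}_{seen[name]}")

df.columns = unique_labels
print(list(df.columns))  # ['A_1', 'B_1', 'A_2', 'B_2']
```

After this, df.plot() should draw one line and one legend entry per column.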

Related

Hide non observed categories in a seaborn boxplot

I am currently working on a data analysis, and want to show some data distributions through seaborn boxplots.
I have a categorical variable, 'seg1', which can take 3 values in my dataset ('Z1', 'Z3', 'Z4'). However, the data in group 'Z4' is too exotic for me to report, and I would like to produce boxplots showing only categories 'Z1' and 'Z3'.
Filtering the data source of the plot did not work, as category 'Z4' is still shown, with no data points.
Is there any other solution than having to create a new CategoricalDtype with only ('Z1', 'Z3') and cast/project my data back on this new category?
I would simply like to hide 'Z4' category.
I am using seaborn 0.10.1 and matplotlib 3.3.1.
Thanks in advance for your answers.
My tries are below, and some data to reproduce.
Dummy data
dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2]})
df.col1 = df.col1.astype(dummy_cat)
sns.boxplot(data=df, x='col1', y='col2')
Apply no filter
fig, axs = plt.subplots(figsize=(8, 25), nrows=len(indicators2), squeeze=False)
for j, indicator in enumerate(indicators2):
    sns.boxplot(data=orders, y=indicator, x='seg1', hue='origin2', ax=axs[j, 0], showfliers=False)
Which produces:
Filter data source
mask_filter = orders.seg1.isin(['Z1', 'Z3'])
fig, axs = plt.subplots(figsize=(8, 25), nrows=len(indicators2), squeeze=False)
for j, indicator in enumerate(indicators2):
    sns.boxplot(data=orders.loc[mask_filter], y=indicator, x='seg1', hue='origin2', ax=axs[j, 0], showfliers=False)
Which produces:
To cut off the last (or first) x-value, set_xlim() can be used, e.g. ax.set_xlim(-0.5, 1.5).
Another option is to work with seaborn's order= parameter and only add the desired values in that list. Optionally that can be created programmatically:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2]})
df.col1 = df.col1.astype(dummy_cat)
# keep only the categories that actually occur (exact match rather than a substring search)
order = [cat for cat in dummy_cat.categories if (df['col1'] == cat).any()]
sns.boxplot(data=df, x='col1', y='col2', order=order)
plt.show()
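Another possibility, shown here only as a sketch, is to let pandas drop the categories that no longer occur after filtering, using Series.cat.remove_unused_categories(). This avoids building a new CategoricalDtype by hand. The variable name trimmed is made up for this example, and the dummy data already lacks 'c', so no filtering step is needed here.

```python
import pandas as pd

dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2.]})
df['col1'] = df['col1'].astype(dummy_cat)

# 'c' is declared as a category even though no row uses it
print(list(df['col1'].cat.categories))  # ['a', 'b', 'c']

# Work on a copy and drop every category without observations
trimmed = df.copy()
trimmed['col1'] = trimmed['col1'].cat.remove_unused_categories()
print(list(trimmed['col1'].cat.categories))  # ['a', 'b']
```

Passing trimmed instead of df to sns.boxplot should then leave no empty slot for the unused category.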

sns plot warning with IndexError

import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'col': np.append(np.random.choice(np.array(['a', 'b', 'c']), 10), ['d']),
    'x': np.random.normal(size = 11),
    'y': np.random.normal(size = 11),
})
sns.lmplot(x = 'x', y = 'y', col = 'col', data = df)
I got the following warning:
IndexError: invalid index to scalar variable.
I appreciate suggestions! Thanks!
I think the issue comes from the fact that 'd' is appended exactly once (and a randomly drawn letter may also occur only once), so lmplot ends up with a single data point for that facet and cannot fit a regression to it.
Could you try and replace the line
'col': np.append(np.random.choice(np.array(['a', 'b', 'c']), 10), ['d']),
by
'col': np.random.choice(np.array(['a', 'b', 'c']), 11),
Your code should work then. Though ideally you would want to input a fixed list of values into 'col' as generating it randomly might use a letter only once and then you'd get the same issue you are having now.
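To guard against this regardless of how 'col' is generated, the groups that are too small can be dropped before plotting. This is a sketch only: the threshold of two rows per group and the names counts, keep and filtered are assumptions made for illustration, and a fixed seed is used so the example is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'col': np.append(rng.choice(['a', 'b', 'c'], 10), ['d']),
    'x': rng.normal(size=11),
    'y': rng.normal(size=11),
})

# Number of rows available for each facet
counts = df['col'].value_counts()
print(counts)

# Keep only groups with at least two rows (assumed minimum for a fitted line)
keep = counts[counts >= 2].index
filtered = df[df['col'].isin(keep)]
print(sorted(filtered['col'].unique()))
```

The filtered frame can then be passed to sns.lmplot(x='x', y='y', col='col', data=filtered).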

seaborn pairplot after converting a integer column to string

I am having trouble with seaborn.pairplot() in the code below.
I have a dataframe, and in one case I have to convert one of its columns to string. After that conversion, pairplot() no longer works properly.
How can I fix this?
The code is below:
import numpy as np
from pandas import DataFrame
import seaborn as sns
%matplotlib inline
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
df_temp = DataFrame(abs(np.random.randn(5, 4)), index=Index, columns=Cols)
print(df_temp)
sns.pairplot(df_temp) # This works
# convert one of the column to String datatype
df_temp['A'] = df_temp['A'].astype(str)
sns.pairplot(df_temp) # Gives error
Complete error log - Error log
On the diagonal of a pairplot there are histograms. It is not possible to draw histograms from strings. Since I'm not sure what you would want to show on the diagonal instead in that case, let's leave it out and simply plot a pair grid from the dataframe which contains strings in one column:
import matplotlib.pyplot as plt
import numpy as np
from pandas import DataFrame
import seaborn as sns
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
df = DataFrame(abs(np.random.randn(5, 4)), index=Index, columns=Cols)
df['A'] = list("VWXYZ")
g = sns.PairGrid(df, vars=df.columns, height=2)
g.map_offdiag(sns.scatterplot)
plt.show()
If instead the aim is to just use numeric columns, you can filter the dataframe by dtype.
import matplotlib.pyplot as plt
import numpy as np
from pandas import DataFrame
import seaborn as sns
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
df = DataFrame(abs(np.random.randn(5, 4)), index=Index, columns=Cols)
# convert one of the column to String datatype
df['A'] = df['A'].astype(str)
sns.pairplot(df.select_dtypes(include=[np.number]))
plt.show()
import numpy as np
from pandas import DataFrame
import seaborn as sns
%matplotlib inline
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
df_temp = DataFrame(abs(np.random.randn(5, 4)), index=Index, columns=Cols)
print(df_temp)
# convert one of the column to String datatype
df_temp['A'] = df_temp['A'].astype(str)
You can find all the columns of type float and plot only those.
float_cols = df_temp.dtypes[df_temp.dtypes == 'float64'].index  # columns that are still numeric, not strings
sns.pairplot(df_temp[float_cols])

How to save multiple Seaborn plots into single pdf file

I'm trying to save multiple plots that I create in a for loop into a single PDF file. I've searched around on SO and pieced together some code that seems to work, except that it doesn't save the figures: it creates a PDF with nothing in it.
Here's the code to reproduce it:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
dftest = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'])
from matplotlib.backends.backend_pdf import PdfPages
with PdfPages('count.pdf') as pdf_pages:
    df1 = dftest.select_dtypes([np.int, np.float, np.object])
    for i, col in enumerate(df1.columns):
        plt.figure(i)
        countplot = sns.countplot(x=col, data=df1)
        pdf_pages.savefig(countplot.fig)
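sns.countplot returns a matplotlib Axes, which has no .fig attribute, so save the Figure returned by plt.figure instead. This works for me:
with PdfPages('count.pdf') as pdf_pages:
    # np.int, np.float and np.object were removed in NumPy 1.24;
    # np.number and object select the same columns here
    df1 = dftest.select_dtypes([np.number, object])
    for i, col in enumerate(df1.columns):
        figu = plt.figure(i)
        countplot = sns.countplot(x=col, data=df1)
        pdf_pages.savefig(figu)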
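The same pattern can be written with an explicit Figure and Axes per page, closing each figure once it has been saved so that figures do not pile up in memory. The sketch below deliberately swaps sns.countplot for a plain pandas bar chart of value counts so that it only needs pandas and matplotlib. The output path in the temporary directory and the file name count_sketch.pdf are assumptions made for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

dftest = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                      columns=['a', 'b', 'c', 'd', 'e'])

# Hypothetical output location for this sketch
out_path = os.path.join(tempfile.gettempdir(), 'count_sketch.pdf')

with PdfPages(out_path) as pdf_pages:
    for col in dftest.columns:
        fig, ax = plt.subplots()
        # Stand-in for sns.countplot: bar chart of how often each value occurs
        dftest[col].value_counts().sort_index().plot.bar(ax=ax)
        ax.set_title(col)
        pdf_pages.savefig(fig)  # one page per column
        plt.close(fig)          # release the figure after saving
    n_pages = pdf_pages.get_pagecount()

print(n_pages)
```

With seaborn, the bar chart line could be replaced by sns.countplot(x=col, data=dftest, ax=ax), since the figure is already held in fig.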

Color a heatmap in Python/Matplotlib according to requirement

I'm trying to make a heatmap with a specific coloring requirement. I want to set an interval for the data, treat values inside it as ok and color them green, and color all other values red. Does anyone know how to do this?
I have attached a simple example using pandas and matplotlib for better understanding.
import numpy as np
from pandas import *
import matplotlib.pyplot as plt
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
data= abs(np.random.randn(5, 4))
df = DataFrame(data, index=Index, columns=Cols)
plt.pcolor(df)
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.show()
There's more than one way to do this.
The easiest way is to just pass in a boolean array to pcolor and then choose a colormap where green is high and red is low.
For example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
data= np.random.random((5, 4))
df = pd.DataFrame(data, index=Index, columns=Cols)
plt.pcolor(df > 0.5, cmap='RdYlGn')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.show()
Alternately, as @Cyber mentioned, you could make a two-color colormap based on your values and use it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
Index= ['aaa', 'bbb', 'ccc', 'ddd', 'eee']
Cols = ['A', 'B', 'C', 'D']
data= np.random.random((5, 4))
df = pd.DataFrame(data, index=Index, columns=Cols)
# Values from 0-0.5 will be red and 0.5-1 will be green
cmap, norm = mcolors.from_levels_and_colors([0, 0.5, 1], ['red', 'green'])
plt.pcolor(df, cmap=cmap, norm=norm)
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.show()
(The color difference is just because the "RdYlGn" colormap uses darker greens and reds as its endpoints.)
On a side note, it's also considerably faster to use pcolormesh for this, rather than pcolor. For small arrays, it won't make a significant difference, but for large arrays pcolor is excessively slow. imshow is even faster yet, if you don't mind raster output. Use imshow(data, interpolation='nearest', aspect='auto', origin='lower') to match the defaults of pcolor and pcolormesh.
You can make a 2 color colormap.
Then you can set the cutoff value between red and green.
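As a rough sketch of that idea for an interval rather than a single cutoff, a ListedColormap can be combined with a BoundaryNorm so that values inside the interval map to green and values on either side map to red. The bounds low and high are hypothetical values chosen for illustration, and the data is assumed to lie between 0 and 1.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import BoundaryNorm, ListedColormap

data = np.random.random((5, 4))
low, high = 0.3, 0.7  # hypothetical "ok" interval

# Three bins: below the interval, inside it, above it
cmap = ListedColormap(['red', 'green', 'red'])
norm = BoundaryNorm([0.0, low, high, 1.0], ncolors=cmap.N)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(data, cmap=cmap, norm=norm)

# The norm maps a value to the index of its bin
print(norm(0.1), norm(0.5), norm(0.9))  # 0 1 2
```

Call plt.show() afterwards to display the figure. Moving low and high changes which cells are colored green.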
