sns plot warning with IndexError - python

import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'col': np.append(np.random.choice(np.array(['a', 'b', 'c']), 10), ['d']),
    'x': np.random.normal(size=11),
    'y': np.random.normal(size=11),
})
sns.lmplot(x='x', y='y', col='col', data=df)
I got the following error:
IndexError: invalid index to scalar variable.
I appreciate suggestions! Thanks!

I think the issue is that 'd' is appended to 'col' exactly once, so lmplot ends up with a facet that holds a single observation and can't produce the plot from only one value. The random draw can cause the same problem for 'a', 'b' or 'c' if one of them happens to be picked only once.
Could you try and replace the line
'col': np.append(np.random.choice(np.array(['a', 'b', 'c']), 10), ['d']),
by
'col': np.random.choice(np.array(['a', 'b', 'c']), 11)
Your code should work then. Ideally, though, you would pass a fixed list of values for 'col', since generating it randomly might still use a letter only once, and you would then hit the same error.
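If you want to guard against this before plotting, here is a minimal sketch. It assumes that dropping under-populated facets is acceptable for your plot. The threshold name min_points and the fixed 'col' list are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Fixed labels: 'd' deliberately appears only once.
    'col': ['a'] * 4 + ['b'] * 3 + ['c'] * 3 + ['d'],
    'x': rng.normal(size=11),
    'y': rng.normal(size=11),
})

min_points = 2  # hypothetical threshold: a facet needs at least this many rows

# Count rows per 'col' value and keep only the sufficiently populated ones.
counts = df['col'].value_counts()
keep = counts[counts >= min_points].index
filtered = df[df['col'].isin(keep)]

print(counts.to_dict())
print(sorted(filtered['col'].unique()))
# filtered can then be passed on, e.g.
# sns.lmplot(x='x', y='y', col='col', data=filtered)
```

Printing the counts first also tells you which facet would have triggered the error.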

Related

ValueError due to a missing element in color map

I need to build a network where nodes (from df1) are coloured according to labels from a different dataset (df2). Not all nodes in df1 have a label assigned in df2 (for example, because they have not been labelled yet, so their value is currently NaN).
The code below should give a good example of what I mean:
import networkx as nx
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt, colors as mcolor
# Sample DataFrames
df1 = pd.DataFrame({
    'Node': ['A', 'A', 'B', 'B', 'B', 'Z'],
    'Edge': ['B', 'D', 'N', 'A', 'X', 'C']
})
df2 = pd.DataFrame({
    'Nodes': ['A', 'B', 'C', 'D', 'N', 'S', 'X'],
    'Attribute': [-1, 0, -1.5, 1, 1, 9, 0]
})
# Simplified construction of `colour_map`
uni_val = df2['Attribute'].unique()
colors = plt.cm.jet(np.linspace(0, 1, len(uni_val)))
# Map colours to hex, then zip with the unique attribute values
mapper = dict(zip(uni_val, map(mcolor.to_hex, colors)))
color_map = df2.set_index('Nodes')['Attribute'].map(mapper).fillna('black')
G = nx.from_pandas_edgelist(df1, source='Node', target='Edge')
# Add Attribute to each node
nx.set_node_attributes(G, color_map, name="colour")
# Then draw with colours based on attribute values:
nx.draw(G,
        node_color=nx.get_node_attributes(G, 'colour').values(),
        with_labels=True)
plt.show()
Z is not in df2 because df2 was created considering only non-NA values.
I would like to assign the color black to unlabelled nodes, i.e., for those nodes that are not in df2.
Trying to run the code above, I am getting this error:
ValueError: 'c' argument has 7 elements, which is inconsistent with 'x' and 'y' with size 8.
It is clear that the error is caused by the missing nodes, which are not included in color_map and so never receive the colour black.
What is not clear to me is how to fix the issue. I would appreciate some help figuring it out.
Since Z is one of the nodes but is not in df2, we should not create the properties exclusively from df2. Instead, reindex color_map on the graph's nodes with a fill_value:
# Create graph before color map:
G = nx.from_pandas_edgelist(df1, source='Node', target='Edge')
# Create Colour map. Ensure all nodes have a value via reindex using nodes
color_map = (
    df2.set_index('Nodes')['Attribute'].map(mapper)
    .reindex(G.nodes(), fill_value='black')
)
color_map without reindex
df2.set_index('Nodes')['Attribute'].map(mapper)
Nodes
A #000080
B #0080ff
C #7dff7a
D #ff9400
N #ff9400
S #800000
X #0080ff
Name: Attribute, dtype: object
nodes (using nodes here since this will be all nodes in the Graph, rather than just those in df2)
G.nodes()
['A', 'B', 'D', 'N', 'X', 'Z', 'C']
reindex to ensure all nodes are present in mapping:
df2.set_index('Nodes')['Attribute'].map(mapper).reindex(G.nodes(), fill_value='black')
Nodes
A #000080
B #0080ff
D #ff9400
N #ff9400
X #0080ff
Z black # <- Missing Nodes are added with specified value
C #7dff7a
Name: Attribute, dtype: object
Complete Code:
import networkx as nx
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt, colors as mcolor
# Sample DataFrames
df1 = pd.DataFrame({
    'Node': ['A', 'A', 'B', 'B', 'B', 'Z'],
    'Edge': ['B', 'D', 'N', 'A', 'X', 'C']
})
df2 = pd.DataFrame({
    'Nodes': ['A', 'B', 'C', 'D', 'N', 'S', 'X'],
    'Attribute': [-1, 0, -1.5, 1, 1, 9, 0]
})
# Simplified construction of `colour_map`
uni_val = df2['Attribute'].unique()
colors = plt.cm.jet(np.linspace(0, 1, len(uni_val)))
# Map colours to hex, then zip with the unique attribute values
mapper = dict(zip(uni_val, map(mcolor.to_hex, colors)))
G = nx.from_pandas_edgelist(df1, source='Node', target='Edge')
# Create Colour map. Ensure all nodes have a value via reindex
color_map = (
    df2.set_index('Nodes')['Attribute'].map(mapper)
    .reindex(G.nodes(), fill_value='black')
)
# Add Attribute to each node
nx.set_node_attributes(G, color_map, name="colour")
# Then draw with colours based on attribute values:
nx.draw(G,
        node_color=nx.get_node_attributes(G, 'colour').values(),
        with_labels=True)
plt.show()
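As an alternative to reindexing, you could also build the colour list directly in node order with a dictionary lookup that falls back to black. This is only a sketch: the names colour_by_node and node_colors are made up for illustration, the colour values are copied from the output shown above, and the nodes list stands in for list(G.nodes()) so the snippet runs without networkx.

```python
# Node -> colour pairs, copied from the mapped Series shown above.
colour_by_node = {
    'A': '#000080', 'B': '#0080ff', 'C': '#7dff7a', 'D': '#ff9400',
    'N': '#ff9400', 'S': '#800000', 'X': '#0080ff',
}

# Stand-in for list(G.nodes()); 'Z' has no entry in colour_by_node.
nodes = ['A', 'B', 'D', 'N', 'X', 'Z', 'C']

# One colour per node, in node order; unlabelled nodes fall back to black.
node_colors = [colour_by_node.get(node, 'black') for node in nodes]

print(dict(zip(nodes, node_colors)))
# With networkx available, the list can be passed straight to the draw call:
# nx.draw(G, node_color=node_colors, with_labels=True)
```

Because the list is built from the graph's own node order, its length always matches the number of nodes, which is what the ValueError was complaining about.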

Hide non observed categories in a seaborn boxplot

I am currently working on a data analysis, and want to show some data distributions through seaborn boxplots.
I have a categorical variable, 'seg1', which can take 3 values in my dataset ('Z1', 'Z3', 'Z4'). However, the data in group 'Z4' is too exotic for me to report, and I would like to produce boxplots showing only categories 'Z1' and 'Z3'.
Filtering the data source of the plot did not work, as category 'Z4' is still shown, with no data points.
Is there any other solution than having to create a new CategoricalDtype with only ('Z1', 'Z3') and cast/project my data back on this new category?
I would simply like to hide 'Z4' category.
I am using seaborn 0.10.1 and matplotlib 3.3.1.
Thanks in advance for your answers.
My tries are below, and some data to reproduce.
Dummy data
dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2]})
df.col1 = df.col1.astype(dummy_cat)
sns.boxplot(data=df, x='col1', y='col2')
Apply no filter
fig, axs = plt.subplots(figsize=(8, 25), nrows=len(indicators2), squeeze=False)
for j, indicator in enumerate(indicators2):
    sns.boxplot(data=orders, y=indicator, x='seg1', hue='origin2', ax=axs[j, 0], showfliers=False)
Which produces:
Filter data source
mask_filter = orders.seg1.isin(['Z1', 'Z3'])
fig, axs = plt.subplots(figsize=(8, 25), nrows=len(indicators2), squeeze=False)
for j, indicator in enumerate(indicators2):
    sns.boxplot(data=orders.loc[mask_filter], y=indicator, x='seg1', hue='origin2', ax=axs[j, 0], showfliers=False)
Which produces:
To cut off the last (or first) x-value, set_xlim() can be used, e.g. ax.set_xlim(-0.5, 1.5).
Another option is to use seaborn's order= parameter and pass only the desired values in that list. Optionally, the list can be created programmatically:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2]})
df.col1 = df.col1.astype(dummy_cat)
# Keep only categories that actually occur (exact match rather than substring match)
order = [cat for cat in dummy_cat.categories if (df['col1'] == cat).any()]
sns.boxplot(data=df, x='col1', y='col2', order=order)
plt.show()
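If you would rather not loop over the categories yourself, pandas can drop the unobserved ones for you. A small sketch, assuming the same dummy data as above; it uses the Series.cat.remove_unused_categories accessor and leaves the original column untouched:

```python
import pandas as pd

dummy_cat = pd.CategoricalDtype(['a', 'b', 'c'])
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b'], 'col2': [12., 5., 3., 2.]})
df['col1'] = df['col1'].astype(dummy_cat)

# Returns a new Series whose dtype only lists categories that occur in the data.
observed = df['col1'].cat.remove_unused_categories()
order = list(observed.cat.categories)

print(order)
# The list can then be passed to seaborn:
# sns.boxplot(data=df, x='col1', y='col2', order=order)
```

The same call works on a filtered frame, so after masking out 'Z4' the order list would contain only the segments that remain.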

How do I make a grouped bar chart with a consistent grid across columns?

I am trying to make a grouped bar chart in Altair where the columns are not so obvious (probably by removing the space between them).
The solution proposed in the issue here relies on several deprecated methods. Additionally, the desired visual grouping described there (which is what I'm looking for) was closed as a vega-lite issue. This has since been resolved.
Is there an updated way to create a cleanly grouped bar chart?
Here's what I have so far:
import pandas as pd
import numpy as np
import altair as alt
vals = np.concatenate(((np.arange(10) ** 2) / 9, np.arange(10)))
df = pd.DataFrame({
    'cat': np.repeat(['a', 'b'], 10),
    'x': np.tile(np.arange(10), 2),
    'y': vals
})
alt.Chart(df).mark_bar(width=20).encode(
    x='cat',
    y='y',
    color='cat',
    column='x'
).configure_view(strokeWidth=0)
Is it possible to keep the horizontal grid lines while also maintaining the space between each group?
You can combine facet spacing with adjusting the scale domain to do something like this:
import pandas as pd
import numpy as np
import altair as alt
vals = np.concatenate(((np.arange(10) ** 2) / 9, np.arange(10)))
df = pd.DataFrame({
    'cat': np.repeat(['a', 'b'], 10),
    'x': np.tile(np.arange(10), 2),
    'y': vals
})
alt.Chart(df).mark_bar(width=20).encode(
    x=alt.X('cat', scale=alt.Scale(domain=['', 'a', 'b'])),
    y='y',
    color='cat',
).facet(
    'x', spacing=0
).configure_view(
    strokeWidth=0
)

How do you plot by Groupby?

I want to know how to plot a bar graph grouped by 'new_build', where the x-axis shows the towns and the y-axis shows the percentage values from the calculation below:
df_House.groupby(['new_build', 'town'])['price_paid'].count()/df_House.groupby(['new_build', 'town'])['price_paid'].count().sum()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = [['a', 'x', 100], ['b', 'y', 200], ['c', 'z', 300], ['a', 'y', 400], ['b', 'z', 600], ['c', 'x', 100]]
df = pd.DataFrame(data, columns=['new_build', 'town', 'price_paid'])
# Total price_paid per (new_build, town) pair
df_town = df.groupby(['new_build', 'town']).agg({'price_paid': 'sum'})
# Denominator: total price_paid per new_build group (assumed definition of `state`)
state = df.groupby(['new_build']).agg({'price_paid': 'sum'})
# Share of each town within its new_build group, in percent
new_df = df_town.div(state, level='new_build') * 100
new_df.reset_index(inplace=True)
sns.set()
new_df.set_index(['new_build', 'town']).price_paid.plot(kind='bar', stacked=True)
plt.ylabel('% of price_paid')
I'm not sure whether you need a stacked graph to better represent the data, but to keep things simple I have created a plain bar graph that does what you asked.
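If what you want is literally the formula from the question (each group's count divided by the total count), a sketch along these lines may be closer. It reuses the toy data from above; the names counts, share and table are illustrative only.

```python
import pandas as pd

data = [['a', 'x', 100], ['b', 'y', 200], ['c', 'z', 300],
        ['a', 'y', 400], ['b', 'z', 600], ['c', 'x', 100]]
df = pd.DataFrame(data, columns=['new_build', 'town', 'price_paid'])

# Number of sales per (new_build, town) pair, as a share of all sales in percent.
counts = df.groupby(['new_build', 'town'])['price_paid'].count()
share = counts / counts.sum() * 100

# Towns on the index (x-axis), one column per new_build value (one bar per group).
table = share.unstack('new_build')

print(table)
# With matplotlib available, grouped bars can be drawn with:
# table.plot(kind='bar'); plt.ylabel('% of sales')
```

Unstacking on 'new_build' puts the towns on the x-axis and gives one bar per new_build value within each town. Combinations that never occur show up as NaN and are simply not drawn.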

Plot duplication in Pandas Plot()

There is an issue with the plot() function in Pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
ax = df.plot()
ax.legend(ncol=1, bbox_to_anchor=(1., 1, 0., 0), loc=2 , prop={'size':6})
This will make a plot with too many lines. Note however that half will be on top of each other. It seems to have something to do with the axis because when I do not use them the issue goes away.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])
df.plot()
UPDATE
While not ideal for my use case, the issue can be fixed by using a MultiIndex:
columns = pd.MultiIndex.from_arrays([np.hstack([ ['left']*2, ['right']*2]), ['A', 'B']*2], names=['High', 'Low'])
df = pd.DataFrame(np.random.randn(8, 4), columns=columns)
ax = df.plot()
ax.legend(ncol=1, bbox_to_anchor=(1., 1, 0., 0), loc=2 , prop={'size':16})
It has to do with your duplicated column names, not ax at all (if you call plt.legend after your second example, you see the same extra lines). Having multiple columns with the same name confuses the call to DataFrame.plot_frame.
If you change your columns to ['A', 'B', 'C', 'D'] instead, it's fine.
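If you can't rename the columns by hand, here is a rough sketch of how you might detect the duplicates and make the names unique before plotting. The suffix scheme (A, A_1, ...) is just one possible convention I picked for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'A', 'B'])

# Index.duplicated() flags every repeat of a name after its first occurrence.
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)

# Append a running suffix to repeated names so every column label is unique.
seen = {}
new_cols = []
for name in df.columns:
    n = seen.get(name, 0)
    new_cols.append(name if n == 0 else f"{name}_{n}")
    seen[name] = n + 1
df.columns = new_cols

print(list(df.columns))
# df.plot() now sees four distinct columns.
```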
