How to show some selected rows with FacetGrid - python

I have a dataframe with a column called "my_row" that has many values. I only want FacetGrid to show the data that belongs to specific values of "my_row" on the rows. I tried making a subset of my dataframe and plotting that, but seaborn somehow still "knows" that my original dataframe had more values in the "my_row" column and shows empty plots for the rows that I don't want.
So the following code still gives me a figure with the 2 rows of data that I want, followed by many empty plots.
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
g = sns.FacetGrid(X, row='my_row', col='column')
How can I tell Python to plot just those 2 rows?
I get plots like this with many empty plots:

I cannot reproduce this. The code from the question seems to work fine. Here we have a dataframe with four different values in the my_row column. Then filtering out two of them creates a FacetGrid with only two rows.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({"my_row": np.random.choice(list("1234"), size=40),
                   "column": np.random.choice(list("AB"), size=40),
                   "x": np.random.rand(40),
                   "y": np.random.rand(40)})
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
g = sns.FacetGrid(X, row='my_row', col='column')
g.map(plt.scatter, "x", "y")
plt.show()

For anyone encountering this problem: the issue is that my_row is a categorical type, so the filtered dataframe still carries the categories that were filtered out. To solve it, convert the column to str, i.e.
X = df[(df['my_row']=='1') | (df['my_row']=='2')].copy()
X['my_row']=X['my_row'].astype(str)
g = sns.FacetGrid(X, row='my_row', col='column')
This should now work! :)
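To see why the conversion helps, here is a minimal sketch on made-up data (the frame below is hypothetical, not the asker's). It shows that filtering a categorical column keeps the unused categories in the dtype. As far as I can tell, that is where seaborn picks up the extra facet rows. The pandas method `cat.remove_unused_categories()` is an alternative to `astype(str)` that keeps the column categorical.

```python
import pandas as pd

# Hypothetical data: "my_row" stored as a categorical with four categories.
df = pd.DataFrame({
    "my_row": pd.Categorical(list("12341234")),
    "column": list("ABABABAB"),
})

# Keep only the rows for "1" and "2".
X = df[df["my_row"].isin(["1", "2"])].copy()

# The filtered frame only contains two distinct values...
print(X["my_row"].unique().tolist())

# ...but the dtype still lists every original category.
before = X["my_row"].cat.categories.tolist()
print(before)

# Dropping the unused categories leaves only the values that are present.
X["my_row"] = X["my_row"].cat.remove_unused_categories()
after = X["my_row"].cat.categories.tolist()
print(after)
```

Passing `row_order=['1', '2']` to `sns.FacetGrid` may also work, since that parameter sets the facet levels explicitly, but I have not checked it against the asker's data.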

I got inspired by this link:
Plot lower triangle in a seaborn Pairgrid
and changed my code to this:
g = sns.FacetGrid(df, row='my_row', col='column')
for i in range(2, 48):
    for j in range(0, 12):
        g.axes[i, j].set_visible(False)
So I had to iterate over each plot individually and make it invisible. But I think there should be an easier way to do this. And in the end I still don't understand how FacetGrid knows anything about the size of my original dataframe df when I use X as its input.
This is an answer that works, but I think there must be better solutions. One problem is that when I save the figure, the saved plot contains a large white space (where the axes that I hid would be) that I do not see in the Jupyter notebook when I run the code. If FacetGrid plotted only the dataframe that it is given as input (in this case X), there would be no problem. There should be a way to do that.

Related

How do I create a count plot with multiple columns without the axes being stored in a numpy.ndarray?

I'm new to coding and this is my first post. Sorry if it could be worded better!
I'm taking a free online course, and for one of the projects I have to make a count plot with 2 subplot columns.
I've managed to make a count plot with multiple subplots using the code below, and all of the values are correct.
fig = sns.catplot(x = 'variable', hue = 'value', order = ['active', 'alco', 'cholesterol', 'gluc', 'overweight', 'smoke'], col='cardio', data = df_cat, kind = 'count')
But because of the way I've done it, the fig.axes is stored in a 2 dimensional array. The only difference between both rows of the array is the title (cardio = 0 or cardio = 1). I'm assuming this is because of the col='cardio'. Does the col argument always cause the fig.axes to be stored in a 2D array? Is there a way around this or do I have to completely change how I'm making my graph?
I'm sure it's not usually a problem, but because of this, when I run my program through the test module, it fails since some of the functions in the test module don't work on numpy.ndarrays.
I pass the test if I change the reference from fig.axes[0] to fig.axes[0,0], but obviously I can't just change the test module to pass.
I found something. This is just an implementation detail, so it would be nuts to rely on it. If you set col_wrap, then you get an axes ndarray of a different shape.
Reproduced like this:
import seaborn as sns
# I don't have your data but I have this example
tips = sns.load_dataset("tips")
fig = sns.catplot(x='day', hue='sex', col='time', data=tips, kind='count', col_wrap=2)
fig.axes.shape
And it has shape (2,), i.e. it's 1D. Tested with seaborn==0.11.2.
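A possibly less fragile route is to hand the test the underlying matplotlib Figure instead of the seaborn grid, because a Figure's `axes` attribute is a plain list rather than an ndarray. The sketch below uses `plt.subplots` as a stand-in for the grid, so it only illustrates the matplotlib behaviour. I am assuming, not confirming here, that the grid returned by `catplot` exposes its figure as `.fig` (or `.figure` in newer seaborn releases).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Stand-in for the catplot grid: one row, two columns, axes kept as a 2D array.
fig, axes_2d = plt.subplots(nrows=1, ncols=2, squeeze=False)

print(axes_2d.shape)   # shape of the 2D ndarray of axes
print(type(fig.axes))  # the Figure keeps its axes in a flat list

# Both expressions refer to the same Axes object.
first_from_figure = fig.axes[0]
first_from_array = axes_2d[0, 0]
print(first_from_figure is first_from_array)
```

If that assumption holds, something like `fig = sns.catplot(...).fig` would let `fig.axes[0]` work in the test module without touching it.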

How to align bars with tick labels in plt or pandas histogram (when plotting multiple columns)

I have started using python for lots of data problems at work and the datasets are always slightly different. I'm trying to explore more efficient ways of plotting data using the inbuilt pandas function rather than individually writing out the code for each column and editing the formatting to get a nice result.
Background: I'm using Jupyter notebook and looking at histograms where the values are all unique integers.
Problem: I want the xtick labels to align with the centers of the histogram bars when plotting multiple columns of data with the one function e.g. df.hist() to get histograms of all columns at once.
Does anyone know if this is possible?
Or is it recommended to do each graph on its own vs. using the inbuilt function applied to all columns?
I can modify them individually following this post: Matplotlib xticks not lining up with histogram
which gives me what I would like but only for one graph and with some manual processing of the values.
Desired outcome example for one graph:
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create list of datapoints
data = [[170,30,210],
[170,50,200],
[180,50,210],
[165,35,180],
[170,30,190],
[170,70,190],
[170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
# print dataframe.
df
Code that displays the graphs in the problem statement
df.hist(figsize=(5,5))
plt.show()
Code that displays the graph for weight how I would like it to be for all
df.hist(column='weight',bins=[175,185,195,205,215])
plt.xticks([180,190,200,210])
plt.yticks([0,1,2,3,4,5])
plt.xlim([170, 220])
plt.show()
Any tips or help would be much appreciated!
Thanks
I hope this helps. Take the column and count the frequency of each value (value_counts), then call sort_index so the bars are ordered by value rather than by frequency, and finally draw a bar plot.
import pandas as pd
import matplotlib.pyplot as plt
data = [[170,30,210],
        [170,50,200],
        [180,50,210],
        [165,35,180],
        [170,30,190],
        [170,70,190],
        [170,50,190]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['height', 'width','weight'])
df.weight.value_counts().sort_index().plot(kind = 'bar')
plt.show()
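The answer above handles a single column. A rough sketch of the same idea applied to every column at once, which is what the question asks for, could loop over the columns and draw each count on its own subplot. The subplot layout and figure size here are my own choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = [[170, 30, 210],
        [170, 50, 200],
        [180, 50, 210],
        [165, 35, 180],
        [170, 30, 190],
        [170, 70, 190],
        [170, 50, 190]]
df = pd.DataFrame(data, columns=['height', 'width', 'weight'])

# One subplot per column; each bar sits directly on its own tick label.
fig, axes = plt.subplots(1, len(df.columns), figsize=(9, 3))
for ax, col in zip(axes, df.columns):
    counts = df[col].value_counts().sort_index()
    counts.plot(kind='bar', ax=ax, title=col)
fig.tight_layout()
```

Note that a bar plot spaces the values evenly as categories, so gaps between non-adjacent values (for example 165 and 170 versus 170 and 180) are not drawn to scale the way they would be in a true histogram.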

Subplot counts of multiple categorical variables on a single bar chart

I'm trying to create a single barplot from multiple dataframe columns each of which is a categorical variable (all based on the same levels). I want it to show a count of the levels occurring in each column.
The below code achieves what I want, but on 4 different bar plots. I'd like it all to be on one plot, so the bars are side by side (labels/legend would be rad). I'm trying to a get clean, simple solution using matplotlib but so far I can't figure it out. Help?
Thanks!
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({"A":['cow','pig','horse','goat','cow'], "B":['cow','pig','horse','cow','goat'], "C":['pig','horse','goat','pig','cow'], "D":['cow','pig','horse','horse','goat'], "E":['pig','horse','goat','cow','goat']})
levels = np.sort(df['A'].unique())
df.A.value_counts()[levels].plot(kind='bar')
df.B.value_counts()[levels].plot(kind='bar')
df.C.value_counts()[levels].plot(kind='bar')
df.D.value_counts()[levels].plot(kind='bar')
You should apply pd.Series.value_counts to every column and plot the result as a bar graph, either grouped or stacked.
If you want each column as its own bar, side by side:
df.apply(pd.Series.value_counts).plot(kind='bar')
If you want them stacked:
df.apply(pd.Series.value_counts).plot(kind='bar', stacked=True)
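It may help to look at the intermediate table before plotting. The sketch below rebuilds the question's dataframe and prints the counts. The `fillna(0)` step is my addition, to cover the case where a level is missing from some column. It is not needed for this particular data.

```python
import pandas as pd

df = pd.DataFrame({"A": ['cow', 'pig', 'horse', 'goat', 'cow'],
                   "B": ['cow', 'pig', 'horse', 'cow', 'goat'],
                   "C": ['pig', 'horse', 'goat', 'pig', 'cow'],
                   "D": ['cow', 'pig', 'horse', 'horse', 'goat'],
                   "E": ['pig', 'horse', 'goat', 'cow', 'goat']})

# Rows are the levels, columns are the original columns, cells are counts.
counts = (df.apply(pd.Series.value_counts)
            .fillna(0)      # a level absent from a column would otherwise be NaN
            .astype(int)
            .sort_index())
print(counts)
```

Calling `.plot(kind='bar')` on this table draws one group of bars per level with one bar per column, and the legend is built from the column names.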

modifying scipy stats.probplot plotting function with matplotlib

I am not an expert with matplotlib, so I am having a hard time trying to set the plot parameters of scipy's stats.probplot.
My code takes a pandas df, iterates over its columns, and attempts to plot the values of each column using the stats.probplot function. This is my code:
plt.figure(figsize=(10,5))
for col in model_predictions.columns:
    res = stats.probplot(df[col], plot=plt)
    plt.legend = col
plt.show()
This generates the charts I want, but they are difficult to read (no legends, same colors). Aside from plotting them on top of each other, I would like to plot each line in a different color, as well as add a legend entry for each line equal to the str in col. Is there any way to do this?
I can always take the tuple output of the function, run it by another new def, and add the outputs to a new pandas df (to later plot with more control); but I was wondering if there is a quicker way.
Thanks
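You can plot them manually by taking the output of stats.probplot and passing it to plt.plot, which assigns a new color to each line and accepts a label, i.e.:
import matplotlib.pyplot as plt
from scipy import stats
for col in model_predictions.columns:
    # element [0] of the result is the (theoretical quantiles, ordered values) pair
    plt.plot(*stats.probplot(df[col])[0], label=col)
plt.legend(loc='best')
plt.show()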
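For reference, here is a small sketch of what `stats.probplot` returns when it is called without `plot=`. The sample is synthetic (drawn from a normal distribution with parameters I picked arbitrarily), and the fitted line is computed only to show how the second element of the result could be used.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for one dataframe column.
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=0.5, size=50)

# First element: theoretical quantiles and the sorted sample values.
# Second element: least-squares fit through those points and its r value.
(osm, osr), (slope, intercept, r) = stats.probplot(sample)

# Points on the fitted line, one per theoretical quantile.
fit_line = slope * osm + intercept
print(len(osm), len(osr), fit_line.shape)
```

Plotting `osm` against `osr` gives the points of the probability plot, and plotting `osm` against `fit_line` adds the reference line that `probplot` draws by default.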

Adjusting data in a dictionary and then plotting it

I have
x = collections.Counter(df.f.values.tolist())
if 'nan' in x:
    del x['nan']
plt.bar(range(len(x)), x.values(), align='center')
plt.xticks(range(len(x)), list(x.keys()))
plt.show()
My question is, how can I remove the nan's from the dictionary that is created, and how can I change the order of the bar plot to go from 1-5? The first 3 nan's are empty spots in the data (intentional since its from a poll), and the last one is the title of the column. I tried manually changing the range part of plt.bar to be 1-5 but it does not seem to work.
You can use .value_counts on a pandas.Series to simply get how many times each value occurs. This makes it simple to then make a barplot.
By default, value_counts will ignore the NaN values, so that takes care of that, and by using .sort_index() we can guarantee the values are plotted in order. It seems we need to use .to_frame() so that it only plots one color for the column (it chooses one color per row for a Series).
Sample Data
import pandas as pd
import numpy as np
# Get your plot settings
import seaborn as sns
sns.set()
np.random.seed(123)
df = pd.DataFrame({'f': np.random.randint(1,6,100)})
# DataFrame.append was removed in pandas 2.0 and np.NaN in NumPy 2.0
df = pd.concat([df, pd.DataFrame({'f': np.repeat(np.nan, 1000)})])
Code
df.f.value_counts().to_frame().sort_index().plot(kind='bar', legend=False)
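A tiny sketch on a hand-written Series (hypothetical values, not the poll data) shows the two behaviours the answer relies on: `value_counts` skips NaN unless told otherwise, and `sort_index` orders the result by value instead of by frequency.

```python
import numpy as np
import pandas as pd

# Hypothetical poll answers with two missing entries.
s = pd.Series([3, 1, 2, np.nan, 5, 4, 1, np.nan, 2, 2])

# NaN is dropped by default; sort_index orders by answer value.
counts = s.value_counts().sort_index()
print(counts)

# dropna=False keeps the missing entries as their own row.
with_nan = s.value_counts(dropna=False)
print(with_nan)
```

This only covers real missing values. If the column holds the literal string 'nan', as the Counter in the question suggests it might, those entries would presumably need to be replaced or filtered out first.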
