Visualize NaN-Values in Features of a Class via Pandas GroupBy - python

Thanks to this kind and helpful community I solved the first problem I had in my work, which you can see here: Basic Problem (necessary for understanding what follows).
After that, I wanted to visualize the distribution of the classes and the NaN values in the features, so I plotted them in a bar chart. With a few classes this is pretty handy.
The problem is that I have about 120 different classes and 50000 data objects in total, and the plots are not readable with this amount of data.
Therefore I want to split the visualization.
For each class there should be a subplot showing the number of NaN values in each feature.
Data:
CLASS  FEATURE1  FEATURE2  FEATURE3
X      1         1         2
B      0         0         0
C      2         3         1
Actual Plot:
Expected Plots:
None of my approaches has worked so far.
I tried df.groupby('Class').plot(kind="barh", subplots=True), but that completely destroyed the layout and plotted per feature, not per class.
I also tried this approach: if I store my grouped DataFrame in the variable 'grouped', I can print it in a perfect format with all the information, but I cannot access it the way the solution does. I always get the error: 'string indices must be integers'
my approach:
grouped = df.groupby('Class')
for name, group in grouped:
    group.plot.bar()
EDIT - Further Information
The data I use is completely categorical, with no numerical values. I want to display the number of NaN values in the different features for each class (label) of my dataset.
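The per-class NaN counts described above can be computed directly from the categorical data before any plotting. The following is only a minimal sketch: the cell values are made up, and the column names CLASS and FEATURE1 to FEATURE3 are taken from the sample table in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data; only the column names come from the question.
df = pd.DataFrame({
    "CLASS":    ["X", "X", "B", "C", "C", "C"],
    "FEATURE1": ["a", np.nan, "b", np.nan, np.nan, "c"],
    "FEATURE2": [np.nan, "q", "r", np.nan, np.nan, np.nan],
    "FEATURE3": [np.nan, np.nan, "s", "t", np.nan, "u"],
})

# isna() marks missing cells; grouping that boolean frame by class and
# summing gives one row per class and one column per feature.
features = df.columns.drop("CLASS")
nan_counts = df[features].isna().groupby(df["CLASS"]).sum()
print(nan_counts)
```

The result has the same shape as the table under "Data:" (one row per class, one column per feature), so it can be handed to any of the plotting approaches discussed here.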

A solution using seaborn
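import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reshape to long form: one row per (class, feature) pair
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
# x and y are passed as keywords, as recent seaborn versions require
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()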

Grouping is the way to go, just set the labels:
for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)

With the solution provided by @meW I was able to achieve a result that is close to my goal.
I had to take two steps to actually use his solution.
Convert the GroupBy object to a DataFrame via grouped = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
The groupby turned the Class column into the index, so I had to restore it as a column via grouped['class'] = grouped.index
Two questions arise from this solution. First question: is it possible to fit the ticks to the different numbers of NaN values? There are classes with only 5-10 NaN values in the features and classes with over 1000 NaN values (see pictures).
Second question: the feature names are only shown in the last plot. Is there a way to add them to the x-axis of every plot?
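Both follow-up questions come down to whether the subplots share their axes. As a hedged sketch (not the code that produced the pictures above), building the subplots with matplotlib and turning sharing off gives each class its own y-range and keeps the feature names under every plot. The counts for class "D" are invented to stand in for a class with many NaN values.

```python
import matplotlib.pyplot as plt
import pandas as pd

# NaN counts per class, shaped like the table in the question
# (index = class, columns = features). Class "D" is made up.
nan_counts = pd.DataFrame(
    {"FEATURE1": [1, 0, 2, 1200],
     "FEATURE2": [1, 0, 3, 900],
     "FEATURE3": [2, 0, 1, 1500]},
    index=pd.Index(["X", "B", "C", "D"], name="CLASS"),
)

# One subplot per class. sharey=False lets every class pick its own
# y-range; sharex=False keeps the tick labels on every x-axis.
fig, axes = plt.subplots(
    nrows=len(nan_counts), sharex=False, sharey=False,
    figsize=(6, 2.5 * len(nan_counts)), squeeze=False,
)
for ax, (name, row) in zip(axes[:, 0], nan_counts.iterrows()):
    row.plot.bar(ax=ax, rot=0)
    ax.set_title(f"Class {name}")
    ax.set_ylabel("NaN count")
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```

With about 120 classes a single figure would become very tall, so it is probably more practical to loop over slices of nan_counts and create one figure per slice.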

Related

Question about plotting mean across segments in Matplotlib and Seaborn

I'm trying to plot some DataFrames and I have some problems.
What I want to find out is in which segments the best performers perform best and in which they perform worst, by seeing it in a chart (I would have used a line plot, but a bar chart would work as well).
The values are in percent.
I first created a DataFrame (to keep only the ones over 80%):
best_performers_numbers = best_performers_numbers[best_performers_numbers['MEAN'] > 80]
Then I created a pivot:
best_performers_pivot = pd.pivot_table(best_performers_numbers, values=metrics_no_target, index=['MEAN'],
                                       aggfunc={np.mean})
best_performers_pivot.sort_values(by='MEAN', ascending=False, inplace=True)
Now my df looks like this:
I wanted to plot it to see in which segments (e.g. COM) the best performers perform best and in which they perform worst.
I started with matplotlib until I gave up on it and then switched to seaborn, but I'm quite lost now because I get a ValueError saying the lengths do not match:
sns.lineplot(data=best_performers_pivot, x=best_performers_pivot.index, y=best_performers_pivot.columns[0:])
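The length mismatch most likely comes from passing the whole list of column names as y, where a single column is expected. One way around it, sketched here under assumptions, is to reshape the pivot to long form first. The frame below is a made-up stand-in for best_performers_pivot: only the segment name "COM" and the index name "MEAN" come from the question, everything else is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for best_performers_pivot: one row per performer
# (indexed by MEAN), one column per segment. "TECH" and "OPS" are invented.
pivot = pd.DataFrame(
    {"COM": [95.0, 88.0, 91.0],
     "TECH": [80.0, 85.0, 78.0],
     "OPS": [90.0, 83.0, 86.0]},
    index=pd.Index([88.3, 85.3, 85.0], name="MEAN"),
)

# Long form: one row per (performer, segment) pair, so that x and y
# each refer to exactly one column.
long_df = pivot.reset_index().melt(
    id_vars="MEAN", var_name="segment", value_name="score"
)

# Average score per segment, sorted so best and worst sit at the ends.
segment_means = (
    long_df.groupby("segment")["score"].mean().sort_values(ascending=False)
)
print(segment_means)

ax = segment_means.plot.bar(rot=0)
ax.set_ylabel("mean score (%)")
```

The same long_df could also be given to seaborn with x="segment" and y="score", since each then names a single column.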

Clustering plot in one-dimension data in Python

I have a dataset with three columns and thousands of rows as shown below.
The number of classes (clusters) is 4, as shown in column three (R, I, C, F).
row id VALUE CLASS
1 284 R
2 254 I
3 184 C
4 177 F
..........
I am trying to get the cluster plot from the above data based on the 4 classes. The expected output is shown in the picture below.
What I tried:
Scatter plot in seaborn
from pandas import read_csv
import seaborn as sns
df2 = read_csv(r'C:\Users\jo\Downloads\Clusters.csv')
sns.scatterplot(data=df2, x="VALUE", y= "rowid",hue="CLASS")
Well, I have to say that the clustering algorithm is almost certainly doing exactly what it is supposed to do. Clustering is unsupervised, of course, so there is no training/testing and you don't know in advance what the outcome will be. You can feed in different features and see what the outcome is. Also, you don't really share any code, so it's impossible to say for sure what is going on here. I would suggest taking a look at the links below and doing some more Googling on this subject.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20-%20Historical%20Stock%20Prices.ipynb
https://www.askpython.com/python/examples/plot-k-means-clusters-python
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489
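Since the CLASS column in the question already holds the cluster labels, the plot itself may not need a clustering algorithm at all. The sketch below is an assumption about what the expected picture looks like (it is not shown here): every VALUE is drawn on a single horizontal line and coloured by class. The first four rows are the ones from the question, the rest are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Four rows from the question plus four invented ones.
df2 = pd.DataFrame({
    "rowid": [1, 2, 3, 4, 5, 6, 7, 8],
    "VALUE": [284, 254, 184, 177, 290, 250, 190, 170],
    "CLASS": ["R", "I", "C", "F", "R", "I", "C", "F"],
})

fig, ax = plt.subplots(figsize=(8, 2.5))
# One scatter call per class, so each class gets its own colour
# and its own legend entry.
for name, grp in df2.groupby("CLASS"):
    # All points share y = 0 because the data has only one dimension.
    ax.scatter(grp["VALUE"], [0] * len(grp), label=name, alpha=0.7)
ax.set_yticks([])
ax.set_xlabel("VALUE")
ax.legend(title="CLASS")
# plt.show()  # uncomment to display the figure
```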

Pandas : using both log and stack on a bar plot

I have some data from Amazon that I'd like to work on. One of the plots I'd like to include is a distribution of ratings for each brand. I thought the best way of doing this would be a stacked bar plot.
However, some brands are reviewed much more than others, so I have to use a log scale; otherwise the plot would be three peaks and the other brands would be impossible to see properly.
There are about 300'000 entries that look like this:
reviewID brand overall
0 Logitech 5.0
1 Garmin 4.0
2 Logitech 4.0
3 Logitech 5.0
I've used this code
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.plot(kind='bar', stacked=True, log=True, figsize=(8,6))
And this is the result
Now, if you aren't familiar with the data this might look acceptable, but it really isn't. The 1.0 rating stacks look way too big compared to the others, because the logarithm isn't in "full effect" in that range but crunches the better scores.
Is there any way to represent the ratings distribution linearly on a logarithmic plot?
By that I mean: if 60% of the ratings are 5.0, then 60% of the bar should be pink, instead of what I have right now.
To have the total bar height on a logarithmic scale while keeping the proportions of the categories within each bar linear, one can recalculate the stacked data so that it appears linear on the logarithmic scale.
As a showcase example, let's choose 6 datasets with very different totals ([5,10,50,100,500,1000]), such that on a linear scale the lower bars would be much too small. Let's divide each into pieces of, in this case, 30%, 50% and 20% (for simplicity, all datasets are divided in the same proportions).
We can then calculate, for each data point that should later appear in a stacked bar, how large it needs to be so that the ratio of 30%, 50% and 20% is preserved in the logarithmically scaled plot, and finally plot the newly created data.
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,10,50,100,500,1000])
p = [0.3,0.5,0.2]
c = np.c_[p[0]*a,p[1]*a, p[2]*a]
d = np.zeros(c.shape)
for j, row in enumerate(c):
    g = np.zeros(len(row)+1)
    G = np.sum(row)
    g[1:] = np.cumsum(row)
    f = 10**(g/G*np.log10(G))
    f[0] = 0
    d[j, :] = np.diff( f )
collabels = ["{:3d}%".format(int(100*i)) for i in p]
dfo = pd.DataFrame(c, columns=collabels)
df2 = pd.DataFrame(d, columns=collabels)
fig, axes = plt.subplots(ncols=2)
axes[0].set_title("linear stack bar")
dfo.plot.bar(stacked=True, log=False, ax=axes[0])
axes[0].set_xticklabels(a)
axes[1].set_title("log total barheight\nlinear stack distribution")
df2.plot.bar(stacked=True, log=True, ax=axes[1])
axes[1].set_xticklabels(a)
axes[1].set_ylim([1, 1100])
plt.show()
A final remark: I think one should be careful with such a plot. It may be useful for inspection, but I wouldn't recommend showing such a plot to other people unless one can make absolutely sure they understand what is plotted and how to read it. Otherwise this may cause a lot of confusion, because the stacked categories' height does not match with the scale which is simply false. And showing false data can cause a lot of trouble!
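The transformation in the loop above can be checked on a single bar. The sketch below wraps the same arithmetic in a hypothetical helper, log_stack_heights (the name is not from the answer), and verifies that the tops of the stacked segments land at 30%, 80% and 100% of the bar on a log axis. It assumes the total is greater than 1 and the axis starts at 1.

```python
import numpy as np

def log_stack_heights(row):
    """Return segment heights whose stacked tops are linear on a log axis.

    Assumes the total of `row` is greater than 1 and the axis starts at 1.
    """
    row = np.asarray(row, dtype=float)
    total = row.sum()
    # Cumulative share of each category: 0, p1, p1+p2, ..., 1
    cum_share = np.concatenate(([0.0], np.cumsum(row))) / total
    # Top of each stacked segment: total ** cum_share, so that
    # log10(top) = cum_share * log10(total).
    tops = total ** cum_share
    tops[0] = 0.0  # the first segment starts at the bottom, as in the code above
    return np.diff(tops)

heights = log_stack_heights([30, 50, 20])  # total = 100
tops = np.cumsum(heights)
# Position of each segment top as a fraction of the bar on the log axis
print(np.log10(tops) / np.log10(100))
```

The heights themselves no longer match the raw counts, which is exactly the caveat raised in the final remark.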
To avoid the problem with the log scale, you can choose not to stack the bars. That way every bar is measured against the same scale, but you will need a much longer figure (about 5 times longer). Simply set stacked=False. An example with sample data:
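A minimal sketch of the unstacked variant, using a handful of made-up reviews in the shape of the question's data (the frame name swissDF and the column names come from the question, the values do not):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up reviews; only the column names come from the question.
swissDF = pd.DataFrame({
    "brand":   ["Logitech"] * 5 + ["Garmin"] * 4,
    "overall": [5.0, 5.0, 5.0, 4.0, 1.0, 5.0, 4.0, 4.0, 1.0],
})

# Rows = brand, columns = rating, cells = number of reviews.
brandScore = (
    swissDF.groupby(["brand", "overall"]).size().unstack("overall", fill_value=0)
)

# Side-by-side bars: every bar starts at the axis, so the log scale
# affects each rating in the same way.
ax = brandScore.plot(kind="bar", stacked=False, log=True, figsize=(10, 4), rot=0)
ax.set_ylabel("number of reviews (log scale)")
# plt.show()  # uncomment to display the figure
```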
Two suggestions without the data (providing sample data is better)
option 1
use value_counts(normalize=True)
# Group by brand only; value_counts(normalize=True) then gives each rating's share within the brand
brandScore = swissDF.groupby('brand')['overall']
brandScore = brandScore.value_counts(normalize=True).unstack('overall')
brandScore.plot(kind='bar', stacked=True, figsize=(8,6))
option 2
divide by row sums
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.div(brandScore.sum(1), 0).plot(kind='bar', stacked=True, figsize=(8,6))
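Both options aim at per-brand shares that add up to 1. As a cross-check, and as an alternative that is not part of the answer above, pd.crosstab with normalize="index" produces the same kind of table in one step. The data here is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up reviews; only the column names come from the question.
swissDF = pd.DataFrame({
    "brand":   ["Logitech"] * 5 + ["Garmin"] * 4,
    "overall": [5.0, 5.0, 5.0, 4.0, 1.0, 5.0, 4.0, 4.0, 1.0],
})

# normalize="index" divides every row by its own total, so each brand's
# shares add up to 1 regardless of how many reviews it has.
shares = pd.crosstab(swissDF["brand"], swissDF["overall"], normalize="index")
print(shares.round(2))

ax = shares.plot(kind="bar", stacked=True, figsize=(8, 6), rot=0)
ax.set_ylabel("share of reviews")
# plt.show()  # uncomment to display the figure
```

Note that a normalized plot drops the information about how many reviews each brand has, which the log-scaled version tried to keep.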

Seaborn pairplot: how to change legend label text

I'm making a simple pairplot with Seaborn in Python that shows different levels of a categorical variable by the color of plot elements across variables in a Pandas DataFrame. Although the plot comes out exactly as I want it, the categorical variable is binary, which makes the legend quite meaningless to an audience not familiar with the data (categories are naturally labeled as 0 & 1).
An example of my code:
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
Is there a way to change legend label text with pairplot? Or should I use PairGrid, and if so how would I approach this?
Found it! It was answered here: Edit seaborn legend
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
g._legend.set_title(new_title)
Since you don't provide a full code example or mock data, I will use my own code to answer.
First solution
The easiest way is probably to keep your binary labels for analysis and to create a column with proper names for plotting. Here is some sample code of mine; you should get the idea:
def transconum(morph):
    if (morph == 'S'):
        return 1.0
    else:
        return 0.0
CompactGroups['MorphNum'] = CompactGroups['MorphGal'].apply(transconum)
Second solution
Another way would be to overwrite the labels on the fly. Here is some sample code of mine which works perfectly:
grid = sns.jointplot(x="MorphNum", y="PropS", data=CompactGroups, kind="reg")
grid.set_axis_labels("Central type", "Spiral proportion among satellites")
grid.ax_joint.set_xticks([0, 1, 1])
plt.xticks(range(2), ('$Red$', '$S$'))
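Applied to the pairplot from the question, the first solution's idea (a separate, readable column for plotting) could look roughly like this. The frame, the numeric columns and the display names "control" and "treated" are all hypothetical; only categorical_var comes from the question.

```python
import pandas as pd

# Made-up frame with a binary hue column, as described in the question.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 1.0, 4.0, 3.0],
    "categorical_var": [0, 1, 0, 1],
})

# Hypothetical display names for the two categories.
label_names = {0: "control", 1: "treated"}

# Keep the 0/1 column for analysis and add a readable copy for plotting.
df["category_label"] = df["categorical_var"].map(label_names)
print(df["category_label"].tolist())

# The plot call from the question would then use the new column, e.g.:
# g = sns.pairplot(df.drop(columns="categorical_var"),
#                  hue="category_label", palette="Set3")
```

Because the legend entries are taken from the values in the hue column, they then read "control" and "treated" instead of 0 and 1.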

Plot several continuous variables according to the nominal values of two variables using Python

I would like to create a figure that shows how much money people earned in a game (continuous variable) as a function of the categorical values of three other variables. The first variable is whether people were included or excluded prior to the Money game, the second is whether people knew their decision-making partner, and the last is the round of the game (participants played 5 rounds with a known co-player and 5 rounds with an unknown co-player). I know how to draw plots as a function of the values of two categorical variables using FacetGrid (see below), but I did not manage to add another layer to it.
# 'size' was renamed to 'height' in seaborn 0.9
g = sns.FacetGrid(df_long, col='XP_Social_Condition', height=5, aspect=1)
g.map(sns.boxplot, 'DM partner', 'Money', palette=col_talk)
I have created two DataFrame versions: my initial one and a melted one (see image below). I have also tried to create two plots together using f, (ax_l, ax_r) = but this does not seem to accept FacetGrid plots as plots within the plot... See the links below for the data and the kind of plot I would like to use as a subplot, one showing a known player and one showing the unknown player. I am happy to share the data if it would help.
I have now tried the solution proposed
grid = sns.FacetGrid(melted_df, hue='DM partner', col='XP_Social_Condition')
grid.map(sns.violinplot, 'Round', 'Money')
But it still does not work. This produces the plot shown below, in which the third (hue) variable does not show the different conditions well.
here is the new figure I get - almost there
data - original and melted
Thank you very much for your help.
OK, so you want to create one plot of continuous data depending on three different categorical variables?
I think what you're looking for is:
grid = sns.FacetGrid(melted_df, col='XP_Social_Condition')
grid.map(sns.violinplot, 'Round', 'Money', 'DM partner').add_legend()
The col argument results in two plots, one for each value of XP_Social_Condition. The three values passed to grid.map split the data so that 'Round' becomes the x-axis, 'Money' the y-axis and 'DM partner' the color. You can play around and swap the values 'DM partner', 'XP_Social_Condition' and 'Round'.
The result should now look something like this or this ('Round' and 'DM Partner' swapped).
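When a hue level seems to be missing from a facet, it can help to check beforehand that every combination of the three categorical variables actually has data. This is only a sanity-check sketch with invented values: the column names are the ones used in the question, and the category values follow its description (included/excluded, known/unknown).

```python
import pandas as pd

# Made-up long-form data using the column names from the question.
melted_df = pd.DataFrame({
    "XP_Social_Condition": ["included"] * 4 + ["excluded"] * 4,
    "DM partner": ["known", "known", "unknown", "unknown"] * 2,
    "Round": [1, 2, 1, 2] * 2,
    "Money": [5.0, 6.0, 4.0, 3.0, 2.0, 3.0, 1.0, 2.0],
})

# Mean earnings for every combination of the three categorical variables:
# rows = condition and round, columns = partner type.
summary = (
    melted_df
    .groupby(["XP_Social_Condition", "Round", "DM partner"])["Money"]
    .mean()
    .unstack("DM partner")
)
print(summary)
```

A NaN in this table would point to a combination with no observations, which would show up as a missing violin in the corresponding facet.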
