Pandas : using both log and stack on a bar plot - python

I have some data from Amazon that I'd like to work on. One of the plots I'd like to include is the distribution of ratings for each brand, and I thought the best way of doing this would be a stacked bar plot.
However, some brands are reviewed far more than others, so I have to use a log scale; otherwise the plot would show three peaks and the other brands would be impossible to make out.
There are about 300'000 entries that look like this
reviewID brand overall
0 Logitech 5.0
1 Garmin 4.0
2 Logitech 4.0
3 Logitech 5.0
I've used this code
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.plot(kind='bar', stacked=True, log=True, figsize=(8,6))
And this is the result
Now, if you aren't familiar with the data this might look acceptable, but it really isn't. The 1.0 rating stacks look far too big compared to the others, because the logarithm isn't in "full effect" in that range and instead crunches the higher scores.
Is there any way to represent the ratings distribution linearly on a logarithmic plot ?
By that I mean if 60% of the ratings are 5.0 then 60% of the bar should be pink, instead of what I have right now

In order to have the total bar height live on a logarithmic scale, while keeping the proportions of the categories within the bar linear, one can recalculate the stacked data so that it appears linear on the logarithmic scale.
As a showcase example, let's choose 6 datasets with very different totals ([5,10,50,100,500,1000]), such that on a linear scale the lower bars would be much too small. Let's divide each into pieces of, in this case, 30%, 50% and 20% (for simplicity, all datasets are split in the same proportions).
We can then calculate, for each datapoint that should later appear in a stacked bar, how large it would need to be so that the 30%/50%/20% ratio is preserved in the logarithmically scaled plot, and finally plot the newly created data.
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

a = np.array([5, 10, 50, 100, 500, 1000])
p = [0.3, 0.5, 0.2]
c = np.c_[p[0]*a, p[1]*a, p[2]*a]

d = np.zeros(c.shape)
for j, row in enumerate(c):
    g = np.zeros(len(row) + 1)
    G = np.sum(row)
    g[1:] = np.cumsum(row)
    # Place each cumulative proportion g/G at 10**(g/G * log10(G)), so the
    # segment boundaries are spaced linearly on a log axis; the segment
    # heights are then the differences between consecutive boundaries.
    f = 10**(g/G*np.log10(G))
    f[0] = 0
    d[j, :] = np.diff(f)

collabels = ["{:3d}%".format(int(100*i)) for i in p]
dfo = pd.DataFrame(c, columns=collabels)
df2 = pd.DataFrame(d, columns=collabels)

fig, axes = plt.subplots(ncols=2)
axes[0].set_title("linear stack bar")
dfo.plot.bar(stacked=True, log=False, ax=axes[0])
axes[0].set_xticklabels(a)
axes[1].set_title("log total barheight\nlinear stack distribution")
df2.plot.bar(stacked=True, log=True, ax=axes[1])
axes[1].set_xticklabels(a)
axes[1].set_ylim([1, 1100])
plt.show()
A final remark: I think one should be careful with such a plot. It may be useful for inspection, but I wouldn't recommend showing it to other people unless you can make absolutely sure they understand what is plotted and how to read it. Otherwise it may cause a lot of confusion, because the heights of the stacked categories no longer match the scale, which is, strictly speaking, false. And showing false data can cause a lot of trouble!

To avoid the problem with the log scale, you can simply not stack the bars in the plot. That way you can compare every bar on the same scale, although you will need a much longer figure (about 5 times more). Simply pass stacked=False. An example with sample data:
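A minimal sketch of that approach, reusing the count pivot from the question (swissDF and the exact figsize are assumptions and would need tuning to the number of brands):

import matplotlib.pyplot as plt

# The same count pivot as in the question: one row per brand, one column per rating.
brandScore = swissDF.groupby(['brand', 'overall'])['brand'].count().unstack('overall')

# Unstacked bars: five bars per brand, each readable on the same log scale,
# at the cost of a roughly five times longer figure.
brandScore.plot(kind='bar', stacked=False, log=True, figsize=(40, 6))
plt.show()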

Two suggestions without having the data (providing sample data is better)
option 1
use value_counts(normalize=True) on the ratings within each brand
brandScore = swissDF.groupby('brand')['overall']
brandScore = brandScore.value_counts(normalize=True).unstack('overall')
brandScore.plot(kind='bar', stacked=True, figsize=(8,6))
option 2
divide by row sums
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.div(brandScore.sum(axis=1), axis=0).plot(kind='bar', stacked=True, figsize=(8,6))

Related

How could I plot the relative frequency of data split into categories?

I want to get the relative frequency of people's weights based on a category label, and then graph that as a bar chart that would look something like this:
The data frame looks like this:
Weight Category
83.8 A
87.3 A
75.1 B
70.9 A
69.8 C
75.5 B
... ...
I thought of extracting the weights of each category into their own data frame, and I managed to get the relative frequency of one category, but I am unsure how to plot them all together.
import pandas as pd
import matplotlib.pyplot as plt

# This holds the total number of people in each category, with categories in alphabetical order
counts = df.groupby("Category")["Weight"].count()

catA = df.loc[df["Category"] == "A"]["Weight"].reset_index().drop(columns="index")
catA["bucket"] = pd.cut(catA["Weight"], 10)
newA = catA[["bucket", "Weight"]].groupby("bucket").count()
newA["relative"] = newA["Weight"] / counts["A"]

ax = newA["relative"].plot(kind="bar", title="Relative Frequency of Weight for Category A")
ax.set(xlabel="Weight Bucket", ylabel="Relative Frequency (%)")
ax.tick_params(axis="x", labelrotation=45)
plt.show()
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. (https://seaborn.pydata.org/)
You will not have the same flexibility as with raw matplotlib, but maybe it just works for you and gives you powerful defaults.
Using a histplot with hue and multiple="dodge" seems to do what you are looking for; from the official docs at https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot
sns.histplot(data=tips, x="day", hue="sex", multiple="dodge", shrink=.8)
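Adapted to the weight data from the question, a hedged sketch (the column names 'Weight' and 'Category' come from the sample frame; stat="probability" with common_norm=False normalizes within each category rather than over the whole dataset):

import seaborn as sns
import matplotlib.pyplot as plt

# One dodged bar per (bucket, category); heights are within-category relative frequencies.
sns.histplot(data=df, x="Weight", hue="Category", bins=10,
             stat="probability", common_norm=False, multiple="dodge", shrink=.8)
plt.show()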
Use pd.cut to bucket your weights and pd.crosstab to count:
import numpy as np
import pandas as pd

(pd.crosstab(pd.cut(df['Weight'], bins=np.linspace(0, 100, 10)),
             df['Category'])
   .plot.bar()
)
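Since the question asks for relative frequency rather than raw counts, pd.crosstab's normalize argument can do the division for you; a small sketch of that variant:

(pd.crosstab(pd.cut(df['Weight'], bins=np.linspace(0, 100, 10)),
             df['Category'],
             normalize='columns')   # each column sums to 1: per-category relative frequency
   .plot.bar()
)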

Visualize NaN-Values in Features of a Class via Pandas GroupBy

Thanks to the kind and helpful community, I solved the first problem I had in my work, which you can see here: Basic Problem (necessary for understanding what follows).
After I used this, I wanted to visualize the distribution of the classes and of the NaN values in the features, so I plotted it as a bar chart. With a few classes this is pretty handy.
The problem is, I have about 120 different classes and 50,000 data objects in total; the plots are not readable with this amount of data.
Therefore I wanted to split the visualization:
for each class there should be a subplot showing the sum of the NaN values of each feature.
Data:
CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0
C 2 3 1
Actual Plot:
Expected Plots:
None of my approaches have worked so far.
I tried to solve it with df.groupby('Class').plot(kind="barh", subplots=True), which completely destroyed the layout and plotted per feature, not per class.
I tried this approach, but if I store my grouped DataFrame in the variable 'grouped' I can print it in a perfect format with all the information, yet I cannot access it the way it is done in the solution; I always get the error 'string indices must be integers'.
My approach:
grouped = df.groupby('Class')
for name, group in grouped:
    group.plot.bar()
EDIT - Further Information
The data I use is completely categorical (no numerical values). I want to display the number of NaN values in the different features of the classes (labels) of my dataset.
A solution using seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()
Grouping is the way to go, just set the labels
for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)
With the solution provided by @meW I was able to achieve a result that is close to my goal.
I had to take two steps to actually use his solution:
Cast the GroupBy object to a DataFrame via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
Through the groupby query, the Class column was turned into the index, so I had to transform it back via grouped['class'] = grouped.index
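For reference, both steps combined in one minimal sketch (df and the CLASS column from the sample data are assumed):

import pandas as pd

counts = df.groupby('CLASS').count()                     # non-NaN entries per feature
nans = counts.rsub(df.groupby('CLASS').size(), axis=0)   # NaNs = group size - non-NaN count
grouped = pd.DataFrame(nans)                             # already a DataFrame; cast mirrors the step above
grouped['CLASS'] = grouped.index                         # turn the index back into a column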
Two questions arise from this solution: is it possible to fit the ticks to the different numbers of NaN values? There are classes with only 5-10 NaN values in the features and classes with over 1000 NaN values (see pictures).
Second question: the feature names are only shown in the last plot. Is there a way to add them to the x-axis of all plots?

Plotting in ggplot with non-discrete x and y

I want to create a ggplot where the x-axis is a distance (currently the distances are continuous values that range between 0 and 45 feet) that can be binned and the y-axis is whether or not the basket was made (0 is missed, 1 is made). Here is a subset of the dataframe, which is a pandas dataframe. EDIT: Not sure this is helpful, but I have also added a column that represents the bucket/bin for each attempt's shot distance.
distance(ft) outcome category
----------- --------- --------
9.5 1 (9,18]
23.3 1 (18,27]
18.7 0 (18,27]
10.8 0 (9,18]
43.6 1 (36,45]
I could just make a scatterplot where x-axis is distance and the y-axis is miss/made. However, I don't want to visualize every shot attempt as a point. Let's say I want the x axis to be bins (where each bin is every 9 ft: 0-9 ft, 9-18 ft, 18-27 ft, 27-36 ft, 36-45 ft), and the y to be the proportion of shots that was made in that bin.
What is the best way to achieve this in ggplot? How much preprocessing do I have to do before leveraging ggplot capabilities? I can imagine doing all the necessary computation myself to find the proportion of shots made per bin and then plotting those values easily, but I feel there should be some built-in capabilities to help me with this (although I am new to ggplot and unsure at this point). Any guidance would be greatly appreciated. Thanks!
You are likely using a Pandas DataFrame (or Series) to hold your data, right?
If so, you can bin your values using pandas' built-in functionality, specifically the pandas.cut function.
e.g.
bins = 9  # can be an int or a sequence of scalars
out, bins = pd.cut(df['distance(ft)'], bins, retbins=True, include_lowest=True)
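From there, a hedged sketch of the remaining preprocessing (column names 'distance(ft)' and 'outcome' are taken from the sample frame): bin the distances into 9-ft buckets, then take the mean of the 0/1 outcome per bucket, which is exactly the proportion of shots made.

import numpy as np
import pandas as pd

bins = np.arange(0, 46, 9)                           # 0-9, 9-18, 18-27, 27-36, 36-45 ft
df['category'] = pd.cut(df['distance(ft)'], bins)
made = df.groupby('category')['outcome'].mean()      # proportion of shots made per bin
plot_df = made.reset_index(name='proportion')        # tidy frame, ready for ggplot + geom_col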

Multivalued Histogram as combined scatter and histogram plot

I have some theoretical calculations for something in my research. I want to represent the accuracy of this data by taking the theoretical values and subtracting them from the experimental values. This leaves some difference that I would like to plot to display this data.
I have made a mock representation of the type of plot I'm looking for. The red line is the zero of the plot, meaning no difference between the theoretical and experimental values. The x-axis has V1, V2, ..., VN which are different things to be calculated. The problem is that each V has between two or three values, represented by the "X" in the mock figure I made.
I'm a bit lost on how to do this. I tried looking at Multivalued histograms with Gnuplot, though it turned up empty. Can anyone give any insight on this, or have a working example Gnuplot script? I'm open to using other ideas too if you know a way to do this in Python or some other way. The problem is I know nothing about Python.
Using gnuplot there are several ways to achieve this. Here is one option, which I find quite reasonable:
Store the values belonging to one v-value in one data block; two data blocks are separated from each other by two blank lines. So an example data file might be:
# v1 values
-0.5
1.1
0.4
-0.2


# v2 values
-0.1
0.1
-0.7


# v3 values
0.9
0.5
0.2
The labels are stored in a string, separated by space characters. (With this you can only use labels which don't contain spaces themselves, quoting doesn't work).
labels = "v1 v2 v3"
As the numerical value for the x-axis you can take the number of the data block, which you get with the special column -2, i.e. by using (column(-2)). This number can also be used to access the respective label from the labels string.
Here is an example script:
set xzeroaxis lc rgb 'red' lt 1 lw 2
set offset 0.2,0.2,0,0
set xtics 1
unset key
set linetype 1 pointtype 2 lc rgb 'black' lw 2
labels = "v1 v2 v3"
plot 'data.dat' using (column(-2)):1:xtic(word(labels, column(-2)+1))
The result with gnuplot 4.6.5 is:
Of course you have a lot of options to modify or extend this script, depending on your actual needs.
You don't seem to be counting anything, so your plot isn't a histogram. It's a bunch of vertical 1D scatter plots arranged horizontally.
The following uses matplotlib to get pretty close to your mock-up (out of habit, I renamed "Differences" to the fairly conventional term "Residuals"):
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
# Demo data consists of a list of names of the "variables",
# and a list of the residuals (in numpy arrays) for each variable.
names = ['V1', 'V2', 'V3', 'V4']
r1 = np.random.randn(3)
r2 = np.random.randn(2)
r3 = np.random.randn(3)
r4 = np.random.randn(3)
residuals = [r1, r2, r3, r4]
# Make the plot
for k, (name, res) in enumerate(zip(names, residuals)):
    plt.plot(np.zeros_like(res) + k, res, 'kx',
             markersize=7.0, markeredgewidth=2)
plt.ylabel("Residuals", fontsize=14)
plt.xlim(-1, len(names))
ax = plt.gca()
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
plt.axhline(0, color='r')
plt.show()

How to best utilize the hist() to show a cumulative and normed histogram?

I have a problem while dealing with a data set whose values range from 0 to tens of thousands. There is no problem showing a histogram of the whole data set using hist(). However, if I only want to show a detailed cumulative, normed histogram over, say, x = [0, 120], I have to use 600000 bins to preserve the detail.
The tricky part is that if I just use the range (0, 120) to show a normed, cumulative hist, it will end at 1. But actually it is far less than the real '1', since it is normed only within this small range of the data. Does anyone have ideas how to use hist() in matplotlib to tackle this problem? I thought this should not be so complicated that I have to write another function to draw the hist I need.
You can set bins to a list, not an integer, e.g., bins=[1,2,3,..,120,30000,60000].
To answer your comment below, here is an excerpt from the documentation:
bins:
Either an integer number of bins or a sequence giving the bins. If bins is an integer, bins + 1 bin edges will be returned, consistent with numpy.histogram() for numpy version >= 1.3, and with the new = True argument in earlier versions. Unequally spaced bins are supported if bins is a sequence.
And here is an example with cumulative normalized histogram. Notice the effect of bins = [100,125,150,160,170,180,190,200,210,220,230,240,250,275,300] on this bar plot, how the first two bars are wider than the middle bars.
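For illustration, a minimal sketch of that suggestion with made-up data (the bin list and the xlim are assumptions): the fine bins cover the detailed region and the coarse tail bins catch the rest, so the cumulative curve is still normalized over all of the data.

import numpy as np
import matplotlib.pyplot as plt

data = np.abs(np.random.normal(0, 1000, 100000))
bins = list(range(0, 121)) + [30000, 60000]   # fine bins up to 120, coarse bins for the tail
plt.hist(data, bins=bins, cumulative=True, density=True, histtype='step')
plt.xlim(0, 120)   # zoom into the detailed region; the normalization used all the data
plt.show()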
Hmmm, I guess this is related to your previous question (Memory error when dealing with huge data). My suggestion there doesn't seem to work for a cumulative histogram.
I can't get plt.hist() to play nice with cyborg's suggestion, so I did the cumsum and normalisation by hand:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal

inp = np.abs(normal(0, 100000, 100000))
bins = range(0, 120)
a, b = np.histogram(inp, bins=bins)

bar_edges = b[:-1]
bar_width = b[1] - b[0]
# Cumulative counts, including any values below the first bin,
# normalized by the total number of samples (not just those in range).
bar_height = (np.cumsum(a) + sum(inp < min(bins))) / len(inp)

plt.figure(1)
plt.bar(bar_edges, bar_height, width=bar_width)
plt.show()
