Multivalued Histogram as combined scatter and histogram plot - python

I have some theoretical calculations for something in my research. I want to represent the accuracy of this data by taking the theoretical values and subtracting them from the experimental values. This leaves some difference that I would like to plot to display this data.
I have made a mock representation of the type of plot I'm looking for. The red line is the zero of the plot, meaning no difference between the theoretical and experimental values. The x-axis has V1, V2, ..., VN, which are the different quantities being calculated. The problem is that each V has two or three values, represented by the "X" marks in the mock figure I made.
I'm a bit lost on how to do this. I tried looking at Multivalued histograms with Gnuplot, though it turned up empty. Can anyone give any insight on this, or have a working example Gnuplot script? I'm open to using other ideas too if you know a way to do this in Python or some other way. The problem is I know nothing about Python.

Using gnuplot there are several ways to achieve this. Here is one option, which I find quite reasonable:
Store the values belonging to one v-value in one data block. Two data blocks are separated from each other by two blank lines. So an example data file might be:
# v1 values
-0.5
1.1
0.4
-0.2


# v2 values
-0.1
0.1
-0.7


# v3 values
0.9
0.5
0.2
The labels are stored in a string, separated by space characters. (With this you can only use labels which don't contain spaces themselves, quoting doesn't work).
labels = "v1 v2 v3"
As numerical value for the x-axis you can take the number of the data block, which you get with the special column -2, i.e. with using (column(-2)). This number can also be used to access the respective label from the labels string.
Here is an example script:
set xzeroaxis lc rgb 'red' lt 1 lw 2
set offset 0.2,0.2,0,0
set xtics 1
unset key
set linetype 1 linetype 2 lc rgb 'black' lw 2
labels = "v1 v2 v3"
plot 'data.dat' using (column(-2)):1:xtic(word(labels, column(-2)+1))
The result with 4.6.5 is:
Of course you have a lot of options to modify or extend this script, depending on your actual needs.
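Since the block layout of the data file matters here, the following is a minimal Python sketch for generating that layout from values held in memory. The function name `format_gnuplot_blocks` and the dictionary layout are assumptions made for illustration; only the "two blank lines between blocks" convention and the space-separated label string come from the answer above.

```python
def format_gnuplot_blocks(blocks):
    """Return (file_text, labels) for a dict mapping label -> list of values.

    Each label gets its own data block; blocks are separated by two
    blank lines so that gnuplot's column(-2) counts them 0, 1, 2, ...
    """
    chunks = []
    for label, values in blocks.items():
        lines = ["# {} values".format(label)] + [str(v) for v in values]
        chunks.append("\n".join(lines))
    # "\n\n\n" ends the last data line and adds two blank lines.
    text = "\n\n\n".join(chunks) + "\n"
    # Space-separated labels, usable with word(labels, n) in gnuplot.
    labels = " ".join(blocks)
    return text, labels


blocks = {
    "v1": [-0.5, 1.1, 0.4, -0.2],
    "v2": [-0.1, 0.1, -0.7],
    "v3": [0.9, 0.5, 0.2],
}
text, labels = format_gnuplot_blocks(blocks)
print(text)
print(labels)
```

Saving `text` as data.dat and copying `labels` into the script should reproduce the input expected by the plot command. Labels containing spaces would still not work, for the reason given above.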

You don't seem to be counting anything, so your plot isn't a histogram. It's a bunch of vertical 1D scatter plots arranged horizontally.
The following uses matplotlib to get pretty close to your mock up (out of habit, I renamed "Differences" to the fairly conventional term "Residuals"):
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
# Demo data consists of a list of names of the "variables",
# and a list of the residuals (in numpy arrays) for each variable.
names = ['V1', 'V2', 'V3', 'V4']
r1 = np.random.randn(3)
r2 = np.random.randn(2)
r3 = np.random.randn(3)
r4 = np.random.randn(3)
residuals = [r1, r2, r3, r4]
# Make the plot
for k, (name, res) in enumerate(zip(names, residuals)):
    plt.plot(np.zeros_like(res) + k, res, 'kx',
             markersize=7.0, markeredgewidth=2)
plt.ylabel("Residuals", fontsize=14)
plt.xlim(-1, len(names))
ax = plt.gca()
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
plt.axhline(0, color='r')
plt.show()
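The demo above uses random numbers. With real data, the `names` and `residuals` lists could be built along these lines. This is only a sketch: the dictionaries `theory` and `experiment` and all numbers in them are hypothetical, and the sign convention (experimental minus theoretical) follows the question.

```python
import numpy as np

# Hypothetical values, for illustration only.
theory = {"V1": 1.00, "V2": 2.50, "V3": 0.75}
experiment = {
    "V1": [1.10, 0.95, 1.02],
    "V2": [2.40, 2.65],
    "V3": [0.70, 0.80, 0.78],
}

names = list(theory)
# One array of residuals per variable: experimental minus theoretical.
residuals = [np.asarray(experiment[name]) - theory[name] for name in names]

for name, res in zip(names, residuals):
    print(name, res)
```

The resulting `names` and `residuals` have the same shape as the demo data, so the plotting loop can be reused unchanged.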

Related

Display all the bins on sns distplot [duplicate]

To simplify my problem (it's not exactly like that but I prefer simple answers to simple questions):
I have several 2D maps that portray rectangular regions. I'd like to add axes and ticks to the maps to show distances (with matplotlib, since the old code uses it), but the problem is that the regions have different sizes. I'd like to put nice, clear ticks on the axes, but the widths and heights of the maps can be anything...
To explain what I mean: let's say I have a map of a region whose size is 4.37 km * 6.42 km. I want ticks on the x-axis at 0, 1, 2, 3, and 4 km and on the y-axis at 0, 1, 2, 3, 4, 5, and 6 km. However, the image and the axes reach a bit further than 4 km and 6 km, since the region is larger than 4 km * 6 km.
The space between the ticks can be constant, 1 km. However, the sizes of the maps vary quite a lot (let's say, between 5 and 15 km), and they are float values. My current script knows the size of the region and can scale the image to the right height/width ratio, but how do I tell it where to put the ticks?
There may already be a solution for this problem, but since I couldn't find suitable search words for it, I had to ask here...
Just set the tick locator to use matplotlib.ticker.MultipleLocator(x) where x is the spacing that you want (e.g. 1.0 in your example above).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
x = np.arange(20)
y = x * 0.1
fig, ax = plt.subplots()
ax.plot(x, y)
ax.xaxis.set_major_locator(MultipleLocator(1.0))
ax.yaxis.set_major_locator(MultipleLocator(1.0))
# Forcing the plot to be labeled with "plain" integers instead of scientific notation
ax.xaxis.set_major_formatter(FormatStrFormatter('%i'))
plt.show()
The advantage to this is that no matter how we zoom or interact with the plot, it will always be labeled with ticks 1 unit apart.
This should give you ticks at all integer values within your current axis limits on the x axis:
from matplotlib import pylab as plt
import math
# get values for the axis limits (unless you already have them)
xmin,xmax = plt.xlim()
# get the outermost integer values using floor and ceiling
# (I need to convert them to int to avoid a DeprecationWarning),
# then get all the integer values between them using range
new_xticks = range(int(math.ceil(xmin)),int(math.floor(xmax)+1))
plt.xticks(new_xticks,new_xticks)
# passing the same argument twice here because the first gives the tick locations
# and the second gives the tick labels, which should just be the numbers
Repeat for the y axis.
Out of curiosity: what kind of ticks do you get by default?
Okay, I tried your versions, but unfortunately I couldn't make them work, since there was some scaling and PDF locating stuff that made me (and your code suggestions) badly confused. But by testing them, I learned again a lot of python, thanks!
I managed finally to find a solution that isn't very exact but satisfies my needs. Here is how I did it.
In my version, one km is divided by a suitable integer constant named STEP_PART. The bigger STEP_PART is, the more accurate the axis values are (and if it is too big, the axis becomes messy to read). For example, if STEP_PART is 5, the accuracy is 1 km / 5 = 200 m, and ticks are placed every 200 m.
STEP_PART = 5 # In the start of the program.
height = 6.42 # These are actually given elsewhere,
width = 4.37 # but just as example...
# list(...) so that entries can be overwritten below (a bare range is read-only on Python 3)
vHeight = list(range(0, int(STEP_PART*height), 1)) # Make tick vectors, now in format
                                                   # 0, 1, 2... instead of 0, 0.2...
vWidth = list(range(0, int(STEP_PART*width), 1))   # Should be divided by STEP_PART
                                                   # later to get right values.
To avoid making too many axis labels (0, 1, 2... are enough, 0, 0.2, 0.4... is far too much), we replace non-integer km values with string "". Simultaneously, we divide integer km values by STEP_PART to get right values.
for j in range(len(vHeight)):
    if (j % STEP_PART != 0):
        vHeight[j] = ""
    else:
        vHeight[j] = int(vHeight[j]/STEP_PART)
for i in range(len(vWidth)):
    if (i % STEP_PART != 0):
        vWidth[i] = ""
    else:
        vWidth[i] = int(vWidth[i]/STEP_PART)
Later, after creating the graph and axes, ticks are put in that way (x axis as an example). There, x is the actual width of the picture, got with shape() command (I don't exactly understand how... there is quite a lot scaling and stuff in the code I'm modifying).
xt = np.linspace(0,x-1,len(vWidth)+1) # For locating those ticks on the same distances.
locs, labels = mpl.xticks(xt, vWidth, fontsize=9)
Repeat for the y axis. The result is a graph with ticks every 200 m but labels only at the integer km values. The accuracy of those axes is 200 m; that is not exact, but it was enough for me. The script would be even better if I found out how to make the integer ticks larger...
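The same idea (a tick every 1/STEP_PART km, a label only at whole kilometres) can be condensed into one helper. This is a sketch under the assumptions of the answer above; the function name `km_ticks` is made up for illustration, and the positions are in km rather than in image pixels, so they would still need the scaling that the original script applies.

```python
def km_ticks(length_km, step_part):
    """Return tick positions (in km) and labels for one axis.

    Ticks are placed every 1/step_part km; only whole-km ticks get a
    label, all others get an empty string.
    """
    n = int(length_km * step_part)  # index of the last tick that still fits
    positions = [i / step_part for i in range(n + 1)]
    labels = [str(i // step_part) if i % step_part == 0 else ""
              for i in range(n + 1)]
    return positions, labels


positions, labels = km_ticks(4.37, 5)
print(positions)
print(labels)
```

Building positions and labels from the same index range keeps the two lists the same length, which is what matplotlib's `xticks(locations, labels)` expects.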

Bokeh line color based on the True/False condition

I am trying to plot anomaly regions in Bokeh. The idea is to have a line that will use red color to show that those samples are anomalous ones.
Here is a sample reproducible code.
import numpy as np
import pandas as pd
n=300
dat = pd.DataFrame()
dat['X_axis'] = np.linspace(start=0.0, stop=1000, num = n)
mean = 4
std = 1
dat['Y_axis']=np.random.normal(loc=mean, scale=std, size = n)
dat['anom'] = np.random.choice([False, True ], size = (n,), p= [0.90, 0.10])
I was able to implement the Box Annotation, and I am trying to do the same thing but this time, the same region will just have a red color for that portion of the line.
EDIT:
Following a comment/suggestion, I plotted those two lines separately. However, Bokeh interpolates between values instead of giving a clean transition. Is there a way to drop the interpolation, or at least limit it to two adjacent values?
EDIT 2:
I was able to break it into individual segments. However, now there are gaps between data samples that need to be eliminated. Any suggestion on how to do that?
You will have to split your data up and use either multiple calls to line or a single call to multi_line. It is not possible to specify different colors along different parts of a single line.
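One way to do the splitting is to cut the series into consecutive runs in which the flag is constant and let each run overlap the next one by a single sample, so that no gaps appear between pieces. The sketch below only prepares the data with NumPy; the helper name `split_runs` is hypothetical, and the choice to color the connecting piece like the earlier run is an assumption, not something the answer prescribes.

```python
import numpy as np


def split_runs(x, y, flag):
    """Split (x, y) into consecutive runs where `flag` is constant.

    Each run is extended by the first sample of the following run, so
    neighbouring pieces share an endpoint and the drawn line has no gaps.
    Returns a list of (xs, ys, is_flagged) tuples.
    """
    x, y = np.asarray(x), np.asarray(y)
    flag = np.asarray(flag, dtype=bool)
    if len(x) == 0:
        return []
    # Indices at which the flag differs from the previous sample.
    breaks = np.flatnonzero(flag[1:] != flag[:-1]) + 1
    starts = np.concatenate(([0], breaks))
    stops = np.concatenate((breaks, [len(x)]))
    runs = []
    for start, stop in zip(starts, stops):
        end = min(stop + 1, len(x))  # overlap by one sample
        runs.append((x[start:end], y[start:end], bool(flag[start])))
    return runs


runs = split_runs([0, 1, 2, 3, 4], [5, 6, 7, 8, 9],
                  [False, False, True, True, False])
for xs, ys, is_anom in runs:
    print(xs, ys, is_anom)
```

Each run could then be drawn with its own call to `line` on a Bokeh figure (for example red when `is_anom` is true), or the pieces could be collected into lists of lists for a single `multi_line` call.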

Pandas : using both log and stack on a bar plot

I have some data from Amazon that I'd like to work on. One of the plots I'd like to include is the distribution of ratings for each brand; I thought the best way of doing this would be a stacked bar plot.
However, some brands are much more reviewed than others, so I have to use the log scale or else the plot would be 3 peaks and the other brands would be impossible to decently see.
There are about 300'000 entries that look like this
reviewID brand overall
0 Logitech 5.0
1 Garmin 4.0
2 Logitech 4.0
3 Logitech 5.0
I've used this code
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.plot(kind='bar', stacked=True, log=True, figsize=(8,6))
And this is the result
Now, if you aren't familiar with the data this might look acceptable, but it really isn't. The 1.0 rating stacks look way too big compared to the others, because the logarithm isn't in "full effect" in that range but crunches the better scores.
Is there any way to represent the ratings distribution linearly on a logarithmic plot ?
By that I mean if 60% of the ratings are 5.0 then 60% of the bar should be pink, instead of what I have right now
In order to have the total bar height living on a logarithmic scale, but the proportions of the categories within the bar being linear, one could recalculate the stacked data such that it appears linear on the logarithmic scale.
As a showcase example, let's choose 6 datasets with very different totals ([5,10,50,100,500,1000]) such that on a linear scale the lower bars would be much too small. Let's divide each into pieces of, in this case, 30%, 50% and 20% (for simplicity all datasets are divided in the same proportions).
We can then calculate for each datapoint which should later on appear on a stacked bar how large it would need to be, such that the ratio of 30%, 50% and 20% is preserved in the logarithmically scaled plot and finally plot those newly created data.
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,10,50,100,500,1000])
p = [0.3,0.5,0.2]
c = np.c_[p[0]*a,p[1]*a, p[2]*a]
d = np.zeros(c.shape)
for j, row in enumerate(c):
    g = np.zeros(len(row)+1)
    G = np.sum(row)
    g[1:] = np.cumsum(row)
    f = 10**(g/G*np.log10(G))
    f[0] = 0
    d[j, :] = np.diff( f )
collabels = ["{:3d}%".format(int(100*i)) for i in p]
dfo = pd.DataFrame(c, columns=collabels)
df2 = pd.DataFrame(d, columns=collabels)
fig, axes = plt.subplots(ncols=2)
axes[0].set_title("linear stack bar")
dfo.plot.bar(stacked=True, log=False, ax=axes[0])
axes[0].set_xticklabels(a)
axes[1].set_title("log total barheight\nlinear stack distribution")
df2.plot.bar(stacked=True, log=True, ax=axes[1])
axes[1].set_xticklabels(a)
axes[1].set_ylim([1, 1100])
plt.show()
A final remark: I think one should be careful with such a plot. It may be useful for inspection, but I wouldn't recommend showing such a plot to other people unless one can make absolutely sure they understand what is plotted and how to read it. Otherwise this may cause a lot of confusion, because the stacked categories' height does not match with the scale which is simply false. And showing false data can cause a lot of trouble!
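To make the rescaling easier to check, the loop body above can be isolated into a function and verified numerically. This is a sketch, not a replacement for the code above: the name `log_stack_heights` is made up, and it assumes a total greater than 1 so that the bar lies above the lower axis limit of 1.

```python
import numpy as np


def log_stack_heights(row):
    """Rescale one row of stacked values for a log-scaled y axis.

    The segment boundaries are placed at total**(cumulative fraction),
    which is the same as 10**(g / G * log10(G)) in the code above.
    """
    row = np.asarray(row, dtype=float)
    total = row.sum()
    cum_frac = np.concatenate(([0.0], np.cumsum(row) / total))
    edges = total ** cum_frac
    edges[0] = 0.0  # bars are drawn starting from zero
    return np.diff(edges)


heights = log_stack_heights([30, 50, 20])
# Position of each segment's top edge as a fraction of the bar's
# height in log space (measured from 1 up to the total of 100).
fractions = np.log10(np.cumsum(heights)) / np.log10(100)
print(fractions)
```

The printed fractions are the cumulative shares of the input (30%, 80%, 100%), which is what "linear proportions on a logarithmic bar" means here.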
To avoid the problem with the log scale, you can choose not to stack the bars in the plot. That way every bar can be compared against the same scale, but you will need a much wider figure (5 times wider). Simply set stacked=False. An example with sample data:
Two suggestions without the data (providing sample data is better)
option 1
use value_counts(normalize=True)
brandScore = swissDF.groupby('brand')['overall']
brandScore = brandScore.value_counts(normalize=True).unstack('overall')
brandScore.plot(kind='bar', stacked=True, figsize=(8,6))
option 2
divide by row sums
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.div(brandScore.sum(1), 0).plot(kind='bar', stacked=True, figsize=(8,6))
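A small self-contained check of the row-normalization idea, using the four sample rows from the question. The frame here is only a stand-in for the real `swissDF`, and the variable names `counts` and `shares` are chosen for illustration.

```python
import pandas as pd

# Stand-in for the real data: the four sample rows from the question.
swissDF = pd.DataFrame({
    "brand": ["Logitech", "Garmin", "Logitech", "Logitech"],
    "overall": [5.0, 4.0, 4.0, 5.0],
})

# Count reviews per (brand, rating), one column per rating.
counts = (swissDF.groupby(["brand", "overall"])
                 .size()
                 .unstack("overall", fill_value=0))
# Divide each row by its total so every brand sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Plotting `shares` with `kind='bar', stacked=True` then gives bars of equal height whose segments are the rating proportions; the information about how many reviews each brand has is lost in this view.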

Histogram has only one bar

My data--a 196,585-record numpy array extracted from a pandas dataframe--are being placed into a single bin by matplotlib's hist. The data were originally integers, so I tried converting them to float as well, as shown below, but they are still not being distributed among 10 bins.
Interestingly, a small sub-sample (using df.sample(0.00x)) of the integer data are successfully distributed.
Any suggestions on where I may be erring in data preparation or use of matplotlib's histogram function would be appreciated.
x = df[(df['UNIT']=='X')].OPP_VALUE.values
num_bins = 10
n, bins, patches = plt.hist((x[(x>0)]).astype(float), num_bins, density=False, facecolor='0.5', alpha=0.8)
plt.show()
Most likely what is happening is that the number of data points with x > 0.5 is very small, but you do have some outliers that force the hist function to pick the scale it does. Try removing all values > 0.5 (or 1 if you do not want to convert to float) and then plot again.
You should modify the number of bins and limit their range, for example:
number_of_bins = 200
bin_cutoffs = np.linspace(np.percentile(x,0), np.percentile(x,99),number_of_bins)
plt.hist(x, bins=bin_cutoffs)
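The effect described in both answers can be reproduced with `np.histogram` on made-up data: a single extreme outlier stretches the automatic bin range so far that everything else lands in the first bin, while bin edges capped at the 99th percentile spread the bulk of the data out again. The data below are synthetic and chosen only to make the effect obvious.

```python
import numpy as np

# Synthetic data: 1000 ordinary values plus one extreme outlier.
data = np.concatenate([np.arange(1, 1001), [10_000_000]])

# Automatic binning over the full range: the outlier dominates the scale.
counts, edges = np.histogram(data, bins=10)
print(counts)

# Bin edges capped at the 99th percentile; values beyond the last
# edge are simply left out of the histogram.
cutoff = np.percentile(data, 99)
capped_edges = np.linspace(data.min(), cutoff, 11)
capped_counts, _ = np.histogram(data, bins=capped_edges)
print(capped_counts)
```

`plt.hist` bins data the same way, so passing `bins=capped_edges` to it should give the same distribution across bars.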

Opacity misleading when plotting two histograms at the same time with matplotlib

Let's say I have two histograms and I set the opacity using the parameter of hist: 'alpha=0.5'
I have plotted two histograms yet I get three colors! I understand this makes sense from an opacity point of view.
But! It makes it very confusing to show someone a graph of two things with three colors. Can I just somehow set the smallest bar for each bin to be in front with no opacity?
Example graph
The usual way this issue is handled is to have the plots with some small separation. This is done by default when plt.hist is given multiple sets of data:
import pylab as plt
x = 200 + 25*plt.randn(1000)
y = 150 + 25*plt.randn(1000)
n, bins, patches = plt.hist([x, y])
You instead wish to stack them (this could be done above using the argument histtype='barstacked'), but notice that the ordering is incorrect.
This can be fixed by individually checking each pair of points to see which is larger and then using zorder to set which one comes first. For simplicity I am using the output of the code above (i.e. n is two stacked arrays of the number of points in each bin for x and y):
n_x = n[0]
n_y = n[1]
for i in range(len(n[0])):
    if n_x[i] > n_y[i]:
        zorder=1
    else:
        zorder=0
    plt.bar(bins[:-1][i], n_x[i], width=10)
    plt.bar(bins[:-1][i], n_y[i], width=10, color="g", zorder=zorder)
Here is the resulting image:
Changing the ordering like this makes the image look very weird indeed, which is probably why it is not implemented and needs a hack. I would stick with the small separation method; anyone used to these plots assumes the bars share the same x-value.
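The per-bin comparison in the loop above can also be done in one vectorized step. The sketch below only computes the counts on shared bin edges and the drawing order; it does not plot anything. The zorder values 1 and 2 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 200 + 25 * rng.standard_normal(1000)
y = 150 + 25 * rng.standard_normal(1000)

# Shared bin edges so that both histograms line up bin for bin.
edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=10)
n_x, _ = np.histogram(x, bins=edges)
n_y, _ = np.histogram(y, bins=edges)

# In each bin the series with the smaller count should be drawn in front.
y_in_front = n_y < n_x
zorder_y = np.where(y_in_front, 2, 1)
zorder_x = np.where(y_in_front, 1, 2)
print(zorder_x, zorder_y)
```

Each bar could then be drawn with `plt.bar`, passing the matching entry of `zorder_x` or `zorder_y`, and with the bar width taken from `np.diff(edges)` instead of a fixed number.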
