I want to create a ggplot where the x-axis is a distance (currently the distances are continuous values that range between 0 and 45 feet) that can be binned and the y-axis is whether or not the basket was made (0 is missed, 1 is made). Here is a subset of the dataframe, which is a pandas dataframe. EDIT: Not sure this is helpful, but I have also added a column that represents the bucket/bin for each attempt's shot distance.
distance(ft) outcome category
----------- --------- --------
9.5 1 (9,18]
23.3 1 (18,27]
18.7 0 (18,27]
10.8 0 (9,18]
43.6 1 (36,45]
I could just make a scatterplot where the x-axis is distance and the y-axis is miss/made. However, I don't want to visualize every shot attempt as a point. Let's say I want the x-axis to be bins (where each bin is 9 ft wide: 0-9 ft, 9-18 ft, 18-27 ft, 27-36 ft, 36-45 ft), and the y-axis to be the proportion of shots made in that bin.
What is the best way to achieve this in ggplot? How much preprocessing do I have to do before leveraging ggplot capabilities? I can imagine doing all the necessary computation myself to find the proportion of shots made per bin and then plotting those values easily, but I feel there should be some built-in capabilities to help me with this (although I am new to ggplot and unsure at this point). Any guidance would be greatly appreciated. Thanks!
You are likely using a Pandas DataFrame (or Series) to hold your data, right?
If so, you can bin your values using Pandas in-built functionality, specifically the pandas.cut function.
e.g.
bins = 9  # can be an int (number of bins) or a sequence of bin edges
out, bin_edges = pd.cut(df['distance(ft)'], bins, retbins=True, include_lowest=True)
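For the shot data specifically, here is a minimal sketch of the preprocessing and plot (assuming the column names from the sample above, and that plotnine is the Python ggplot implementation in use; if you hand the data to R's ggplot2 instead, only the aggregation part applies):

import pandas as pd
from plotnine import ggplot, aes, geom_col, labs  # plotnine assumed as the ggplot port

# Bin the shot distances into fixed 9 ft intervals.
edges = [0, 9, 18, 27, 36, 45]
df['bin'] = pd.cut(df['distance(ft)'], bins=edges, include_lowest=True)

# The mean of a 0/1 outcome per bin is exactly the proportion of made shots.
made = df.groupby('bin', observed=True)['outcome'].mean().reset_index()
made['bin'] = made['bin'].astype(str)  # interval labels plot more cleanly as strings

(ggplot(made, aes(x='bin', y='outcome'))
 + geom_col()
 + labs(x='shot distance (ft)', y='proportion made'))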
I am a beginner in Python and I am making separate histograms of travel distance per departure hour. My data has about 2,500 rows; Distance is float64 and Departuretime is str. However, for making further calculations I'd like to have the value of each bin in each histogram.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color='red',
                    edgecolor='black', figsize=(15, 15), sharex=True, density=True)
In my case this creates a figure with 21 small histograms.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With a single histogram, I'd paste counts, bins, bars = in front of the line and the variable counts would contain the data I am looking for; however, in this case that does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!
First of all, note that the bins used in the different histograms that you are generating don't have the same edges (you can see this since you are using sharex=True and the resulting bars don't have the same width), in all cases you are getting 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could provide a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you calculate a new column that describes which bin each row belongs to; this way we also unify the bin calculation.
You can do this with the cut function, which gives you the same freedom as hist to choose either the number of bins or the specific bin edges.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table with the counts for each combination of DistanceBin and Departuretime as rows and columns respectively as you asked.
df.pivot_table(index='DistanceBin', columns='Departuretime', aggfunc='count')
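Putting both steps together, a minimal sketch (assuming the Distance and Departuretime column names from the question; the bin edges are invented here) might look like:

import numpy as np
import pandas as pd

# Fixed bin edges so every departure hour is binned identically (hypothetical 0-50 range).
edges = np.linspace(0, 50, 11)
df['DistanceBin'] = pd.cut(df['Distance'], bins=edges)

# One row per distance bin, one column per departure hour, cell = number of trips.
table = df.pivot_table(index='DistanceBin', columns='Departuretime',
                       values='Distance', aggfunc='count', fill_value=0)

# To mimic density=True from hist, normalise each column by its total and the bin width.
widths = np.diff(edges)
density = table.div(table.sum(axis=0), axis=1).div(widths, axis=0)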
I want to compare 2 histograms coming from an evaluation board, which already bins the counted events into a histogram. I am taking data from 2 channels with different numbers of events (in fact, one is background only, one is background + signal, a pretty usual experimental setting), and with different numbers of bins, different bin widths and different bin center positions.
The datafile looks like this:
HSlice [CH1]
...
44.660 46.255 6
46.255 47.850 10
47.850 49.445 18
49.445 51.040 8
51.040 52.635 28
52.635 54.230 4
54.230 55.825 18
55.825 57.421 183
57.421 59.016 582
59.016 60.611 1786
...
HSlice [CH2]
...
52.022 53.880 0
53.880 55.738 9
55.738 57.596 213
57.596 59.454 728
59.454 61.312 2944
61.312 63.170 9564
...
The first two columns give the boundaries of the respective bin (that is time) and the last column represents the number of events within this timeframe.
Now I want to make a kind of background reduction, that is, subtract the background histogram from the "background + signal" histogram to obtain the time trace of the actual signal. I cannot do this line by line since the histograms are quite different. Is there a simple function in Python, or an elegant way to make the data comparable (for example by interpolating between two datapoints in one histogram to fit the position of a bin of the other histogram), without messing up the time resolution given by the experiment (neither making it worse than it is, nor pretending a better time resolution)?
Thank you,
lepakk
Channel 2 has a bigger bin size than channel 1 (1.858 vs 1.595), so I would transfer the values from the smaller bins into the bigger bins. That will lead to a loss of resolution, but I think that's more honest than transferring from bigger bins into smaller ones and thereby artificially increasing the resolution.
My approach would be to take all the values from the bins in channel 1 and assign them to the center of their time bin. You don't really know exactly where within the bin they were originally measured, so this is the point where you cheat a little bit.
Now fill the values of channel 1 into the bins of channel 2 according to their new time value.
That would be my first approach.
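A minimal sketch of that re-binning (assuming each channel has already been parsed into arrays of lower edges, upper edges and counts; all variable names here are invented):

import numpy as np

def rebin(lo1, hi1, counts1, lo2, hi2):
    """Move channel-1 counts into channel-2 bins: each channel-1 bin's counts are
    assigned to its center time, then histogrammed onto the channel-2 edges."""
    centers1 = 0.5 * (lo1 + hi1)
    edges2 = np.append(lo2, hi2[-1])  # channel-2 bin edges
    rebinned, _ = np.histogram(centers1, bins=edges2, weights=counts1)
    return rebinned

# Background subtraction on the common (channel-2) binning:
# ch1 = background only, ch2 = background + signal, as in the question.
# signal = counts2 - rebin(lo1, hi1, counts1, lo2, hi2)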
Is it possible to set the max and min values of an axis in matplotlib, but then autoscale when the values are much smaller than these limits?
For example, I want a graph of percentage change to be limited between -100 and 100, but many of my plots will be between, say, -5 and 5. When I use ax.set_ylim(-100, 100), this graph is very unclear.
I suppose I could use something like ax.set_ylim(max((-100, data-n)), min((100, data+n))), but is there a more built-in way to achieve this?
If you want to drop extreme outliers you could use the numpy quantile function to find, say, the 0.0001 and 0.9999 quantiles of the data (cutting the bottom and top 0.01 %).
near_min = np.quantile(in_data, 0.0001)
near_max = np.quantile(in_data, 0.9999)
ax.set_ylim(near_min, near_max)
You will need to adjust the quantile depending on the volume of data you drop. You might want to include some test of whether the difference between near_min and true min is significant?
As ImportanceOfBeingErnest pointed out, there is no support for this feature. In the end I just used my original idea, but scaled it by the value of the max and min to give the impression of autoscale.
ax.set_ylim(max((-100, data_min+data_min*0.1)), min((100, data_max+data_max*0.1)))
Where for my case it is true that
data_min <= 0, data_max >= 0
Why not just set axis limits based on range of the data each time plot is updated?
ax.set_ylim(min(data), max(data))
Or check if range of data is below some threshold, and then set axis limits.
if max(abs(data)) < thresh:
    ax.set_ylim(min(data), max(data))
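Combining the two ideas above, a minimal sketch of a clamped autoscale helper (the function name, padding factor and hard limits are assumptions, not a built-in matplotlib feature):

import numpy as np

def clamped_autoscale(ax, data, hard_min=-100, hard_max=100, pad=0.1):
    """Autoscale the y-axis to the data, padded by a fraction of its range,
    but never beyond the hard limits."""
    data = np.asarray(data)
    lo, hi = data.min(), data.max()
    margin = pad * (hi - lo)
    ax.set_ylim(max(hard_min, lo - margin), min(hard_max, hi + margin))

# e.g. clamped_autoscale(ax, percent_change)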
In Python, I am currently trying to create a per-note frequency visualisation of a specific guitar riff I like.
In order to do this and have the points plotted by matplotlib.pyplot, I am doing something like the following for each note, but will ultimately be summing y-values at specific points for 2 specific frequencies:
import numpy
import matplotlib.pyplot as plt
t_per_beat = 60 / 110.0  # tempo is 110 bpm, so this is the number of seconds per beat

# create a range of x values for 8 beats of music, in this case 2 bars
x0 = numpy.linspace(0, t_per_beat * 8, 100)
a = []

# generate y-axis values
for i in x0:
    a.append(numpy.sin(<note_freq> * i))
I want the y-axis values to be contiguous like the x-axis, so an array of plotted points is best, but I also want to be able to index specific intervals in the array, down to a granularity of a 'sixteenth note' (t_per_beat/4)
Because the frequency of my note may increase (so I will need to increase the number of points in my numpy.linspace array), I cannot be assured that the spacing of index numbers across the array will stay consistent.
Of course splitting it into a container of separate arrays (i.e. a 2-dimensional list) would be preferable, but since the modelling of the waves means that two waves coalesce over beat boundaries, this is not really ideal.
In essence my question is (in absence of a better solution that I haven't thought of), is there logic to store a reference to a piece of data in an array such that when called I can always find the index of said data in the array?
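One way to keep the time-to-index mapping stable is to derive the number of points from a fixed sample rate instead of a hard-coded 100, so that an index is always time * fs; a minimal sketch under that assumption (the sample rate and the 440 Hz note are invented values):

import numpy as np

fs = 1000                    # assumed samples per second; raise it for higher note frequencies
t_per_beat = 60 / 110.0      # seconds per beat at 110 bpm
total_time = t_per_beat * 8  # 8 beats = 2 bars

# One sample every 1/fs seconds, so the index of any time point never depends on the note.
t = np.arange(0, total_time, 1 / fs)
y = np.sin(2 * np.pi * 440.0 * t)  # hypothetical note frequency

def index_at(time_s, fs=fs):
    """Array index of the sample closest to a given time in seconds."""
    return int(round(time_s * fs))

# e.g. the slice covering the third sixteenth note:
sixteenth = t_per_beat / 4
segment = y[index_at(2 * sixteenth):index_at(3 * sixteenth)]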
I have some data that comes from Amazon that I'd like to work on. One of the plots I'd like to include is the distribution of ratings for each brand; I thought the best way of doing this would be a stacked bar plot.
However, some brands are reviewed much more than others, so I have to use a log scale, or else the plot would show 3 peaks and the other brands would be impossible to see properly.
There are about 300'000 entries that look like this:
reviewID brand overall
0 Logitech 5.0
1 Garmin 4.0
2 Logitech 4.0
3 Logitech 5.0
I've used this code
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.plot(kind='bar', stacked=True, log=True, figsize=(8,6))
And this is the result
Now, if you aren't familiar with the data this might look acceptable, but it really isn't. The 1.0 rating stacks look way too big compared to the others, because the logarithm isn't in "full effect" in that range but crunches the better scores.
Is there any way to represent the ratings distribution linearly on a logarithmic plot ?
By that I mean if 60% of the ratings are 5.0 then 60% of the bar should be pink, instead of what I have right now
In order to have the total bar height living on a logarithmic scale, but the proportions of the categories within the bar being linear, one could recalculate the stacked data such that it appears linear on the logarithmic scale.
As a showcase example let's choose 6 datasets with very different totals ([5,10,50,100,500,1000]), such that on a linear scale the lower bars would be much too small. Let's divide each into pieces of, in this case, 30%, 50% and 20% (for simplicity all datasets are divided into the same proportions).
We can then calculate, for each datapoint that should later appear in a stacked bar, how large it would need to be so that the 30%/50%/20% ratio is preserved in the logarithmically scaled plot, and finally plot those newly created data.
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
a = np.array([5,10,50,100,500,1000])
p = [0.3,0.5,0.2]
c = np.c_[p[0]*a,p[1]*a, p[2]*a]
d = np.zeros(c.shape)
for j, row in enumerate(c):
    g = np.zeros(len(row) + 1)
    G = np.sum(row)
    g[1:] = np.cumsum(row)
    f = 10**(g / G * np.log10(G))
    f[0] = 0
    d[j, :] = np.diff(f)
collabels = ["{:3d}%".format(int(100*i)) for i in p]
dfo = pd.DataFrame(c, columns=collabels)
df2 = pd.DataFrame(d, columns=collabels)
fig, axes = plt.subplots(ncols=2)
axes[0].set_title("linear stack bar")
dfo.plot.bar(stacked=True, log=False, ax=axes[0])
axes[0].set_xticklabels(a)
axes[1].set_title("log total barheight\nlinear stack distribution")
df2.plot.bar(stacked=True, log=True, ax=axes[1])
axes[1].set_xticklabels(a)
axes[1].set_ylim([1, 1100])
plt.show()
A final remark: I think one should be careful with such a plot. It may be useful for inspection, but I wouldn't recommend showing such a plot to other people unless one can make absolutely sure they understand what is plotted and how to read it. Otherwise this may cause a lot of confusion, because the stacked categories' heights do not match the scale, which is simply false. And showing false data can cause a lot of trouble!
To avoid the problem with the log scale, you can simply not stack the bars. That way every bar is compared on the same scale, but you will need a much longer figure (roughly 5 times longer). Simply pass stacked=False.
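A minimal sketch of the unstacked version, reusing the brandScore counts table from the question (the wider figsize is just a guess to fit the side-by-side bars):

brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')

# Five side-by-side bars per brand, all read against the same log scale.
brandScore.plot(kind='bar', stacked=False, log=True, figsize=(40, 6))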
Two suggestions without the data (providing sample data is better)
option 1
use value_counts(normalize=True)
brandScore = swissDF.groupby('brand')['overall']
brandScore = brandScore.value_counts(normalize=True).unstack('overall')
brandScore.plot(kind='bar', stacked=True, figsize=(8,6))
option 2
divide by row sums
brandScore = swissDF.groupby(['brand', 'overall'])['brand']
brandScore = brandScore.count().unstack('overall')
brandScore.div(brandScore.sum(axis=1), axis=0).plot(kind='bar', stacked=True, figsize=(8, 6))